Serverless Needs a Bolder, Stateful Vision
At a grid computing event, GlobusWorld in 2005, then-Network World editor-in-chief John Dix moderated a conversation among spokespersons from Cisco, HP, IBM, Intel, Nortel, and SAP about Grid and the Future of the Networked Machine. His opening remarks:
“By many accounts, average system utilization across organizations is 15 percent to 20 percent today, while obviously, the ideal would be around 80 percent. What’s more, some 20 percent of IT budgets go to operations, marginally less than the 25 percent earmarked for capital investments. We’ve created large, underutilized, complex environments that are costly to maintain. So there is a huge need to do this better, and the prevailing thinking at this point seems to be that grid is the answer.”
The rest of the conversation is an interesting time capsule of predictions from 15 years ago about the impact that distributed computing and the cloud would have on the enterprise stack, and a lot of it is accurate.
Fifteen years ago the concern was resource utilization: companies wanted to move away from huge capital expenditures on servers and data centers toward pay-as-you-go models. That was what the market was demanding, and that is what shaped cloud computing as we know it today.
Now, the new consensus cry for the next wave of computing is speed: getting developers what they need faster, getting value from data faster, getting IT out of the way, and shipping software faster, all while maintaining reliability and predictability guarantees.
We’re at another juncture in enterprise computing where there is a large push behind a big vision of the future, the push towards serverless architectures — a world where less human oversight and participation is required in operations.
But While We’re Imagining This Future, How Did the Serverless Vision Get Limited to Functions?
I strongly believe in the serverless movement. Over the last year there has been a lot of interesting work (for example, around the Knative project) expanding the serverless UX to cover the whole software lifecycle, from build (source to image), to CI/CD pipelines, to deployment, to runtime management (autoscaling, scale to 0, automatic failover, etc.). However, the programming model is still mainly limited to stateless functions, the so-called Function-as-a-Service (FaaS) model, which limits the range of use-cases supported.
At the moment I’m seeing a lot of the conversation confusing FaaS with serverless. Similar to blockchain being used (mistakenly) interchangeably with Bitcoin (Bitcoin is an implementation of blockchain and not equivalent), FaaS is an implementation of serverless. While I think that FaaS is a great piece of technology, it is selling the promise of serverless short. Serverless is all about the UX, a UX that can address many implementations, with FaaS being the first.
But why all the fuss about functions in the first place? Functions are extremely useful as low-level building blocks for software. Functional programming is all about programming with functions: working with functions as first-class values that can be sent around, composed, and reused. It’s a great abstraction to use. I would define functions as essentially simple Lego blocks that have well-defined input and well-defined output: they take data in, process it, and emit new data as output. A pure function is stateless, which makes it predictable: you can trust that given a certain input it will always produce the same output.
The output of one function can become the input of the other, making it possible to string them together just like Lego blocks, composing larger pieces of functionality from small reusable parts. Individual functions are by themselves not that useful because they (should) have a limited scope, providing a single well-defined piece of functionality. But by composing them into larger functions you can build rich and powerful components or services. Scala is one example of a mainstream language with a powerful functional side (the other side being OO).
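To make the Lego-block idea concrete, here is a minimal sketch in Python (chosen only for brevity; the same thing can be written in Scala or any language with first-class functions). The helper `compose` and the three small functions are hypothetical names invented for this illustration.

```python
from functools import reduce
from typing import Callable, List


def compose(*steps: Callable) -> Callable:
    """Chain functions left to right: each output becomes the next input."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


# Three small pure functions, each with a single well-defined job.
def parse(line: str) -> List[str]:
    return line.split(",")


def to_ints(fields: List[str]) -> List[int]:
    return [int(field) for field in fields]


def total(numbers: List[int]) -> int:
    return sum(numbers)


# A larger piece of functionality built purely by composition.
sum_csv_line = compose(parse, to_ints, total)

result = sum_csv_line("1,2,3")
```

Because every step is pure, `sum_csv_line` is pure as well: calling it twice with the same line gives the same answer, and any step can be reused in other compositions.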
FaaS is building on the idea of composing these small well-defined pieces of functionality into larger workflows, all driven by the production and consumption of events (of data). This data-shipping architecture is great for data processing oriented use-cases where functions are composed into workflows processing data downstream and eventually producing a result to be emitted as an event to be consumed by the user or some other service or system.
But it is not a general platform for building modern real-time data-centric applications and systems. As with all targeted solutions designed to solve a narrow and specific problem well, it suffers from painful constraints and limitations when used beyond its intended scope.
One such limitation is that FaaS functions are ephemeral, stateless, and short-lived (for example, AWS Lambda caps their lifespan at 15 minutes). This makes it problematic to build general-purpose data-centric cloud native applications, since it is simply too costly, in terms of performance, latency, and throughput, to lose the computational context (locality of reference) and be forced to load and store the state from the backend storage over and over again. Another limitation is that functions have no direct addressability, which means that they can’t communicate directly with each other using point-to-point communication but always need to resort to publish-subscribe, passing all data over some slow and expensive storage medium. This model can work well for event-driven use-cases but yields too high latency for general-purpose distributed computing problems. For a detailed discussion of this and other limitations and problems with FaaS, read the paper “Serverless Computing: One Step Forward, Two Steps Back” by Joe Hellerstein, et al.
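The cost of repeatedly loading and storing state can be illustrated with a small sketch. Everything here is hypothetical: `BackendStore` is an in-memory stand-in that merely counts round trips rather than doing real network I/O, and the two counters are toy models, not any platform's actual API.

```python
class BackendStore:
    """Stand-in for remote storage; counts round trips instead of doing I/O."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def load(self, key):
        self.round_trips += 1
        return self.data.get(key, 0)

    def save(self, key, value):
        self.round_trips += 1
        self.data[key] = value


def stateless_increment(store, key):
    """FaaS-style handler: nothing survives between calls, so each call loads and saves."""
    count = store.load(key) + 1
    store.save(key, count)
    return count


class StatefulCounter:
    """Long-lived component: keeps state in memory, writes back only on flush."""

    def __init__(self, store, key):
        self.store = store
        self.key = key
        self.count = store.load(key)  # single load when the component starts

    def increment(self):
        self.count += 1
        return self.count

    def flush(self):
        self.store.save(self.key, self.count)


faas_store = BackendStore()
for _ in range(1000):
    stateless_increment(faas_store, "clicks")

stateful_store = BackendStore()
counter = StatefulCounter(stateful_store, "clicks")
for _ in range(1000):
    counter.increment()
counter.flush()
```

Both variants end with the same stored value, but the stateless one pays two storage round trips per invocation while the stateful one pays two in total. The sketch deliberately ignores durability between flushes; a real stateful runtime would need something like journaling or replication to keep in-memory state safe.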
Functions are a great tool that has its place in the cloud computing toolkit, but for serverless to reach the grand vision that the industry is demanding of an ops-less world while allowing us to build modern data-centric real-time applications, we can’t continue to ignore the hardest problem in distributed systems: managing state, that is, your data.
State Is the Hardest Part and the Most Interesting Opportunity for Serverless
In the cloud native world of application development, I’m still seeing a strong reliance on stateless, and often synchronous, protocols and design. People embrace containers but too often hold on to old architecture, design, habits, patterns, practices, and tools — made for a world of monolithic single node systems running on top of the almighty SQL database.
The serverless movement today is very focused on the automation of the underlying infrastructure, but it has to some extent ignored the equally complicated requirements at the application layer, where the move towards fast data and event-driven stateful architectures creates all sorts of new challenges for operating systems in production.
It might sound like a good idea to ignore the hardest part (the state) and push its responsibility out of the application layer — and sometimes it is. However, as applications today are becoming increasingly data-centric and data-driven, taking ownership of your data by having an efficient, performant, and reliable way of managing, processing, transforming, and enriching data close to the application itself, is more important than ever.
Many applications can’t afford the round-trip to the database for each data access or storage, as we have been used to doing in traditional three-tier architectures, but need to continuously process data in close to real-time, mining knowledge from never-ending streams of data as it “flies by”. This data also often needs to be processed in a distributed way, for scalability, low latency, and throughput, before it is ready to be stored.
This shift from “data at rest” to “data in motion” has forced many companies to adopt fast data architectures with distributed stream processing and event-driven microservices, putting state and data management at the center of application design.
This is just the starting point for the many concerns of managing state in distributed applications, and a domain that has to be conquered for the serverless movement to keep making interesting progress against its objectives to raise the abstraction level and reduce the human interaction with operations of systems in production.
Serverless without State Is Just Not That Interesting
I tweeted that almost three years ago, and I think it’s still largely true (even though a lot of progress has been made improving and expanding on the overall serverless UX).
While the 1.0 version of serverless was all about stateless functions, the 2.0 version will focus largely on state, allowing us to build general-purpose distributed applications while enjoying the UX of serverless.
If serverless is conceptually about how to remove humans from the equation and solve developers’ hardest problems with reasoning about systems in production, then developers need declarative APIs and high-level abstractions with rich and easily understood semantics (beyond low-level primitives like functions) for working with never-ending streams of data, managing complex distributed data workflows, and managing distributed state in a reliable, resilient, scalable, and performant way.
Examples include support for stateful long-lived virtual addressable components, tools for managing distributed state reliably at scale with options for consistency ranging from strong to eventual and causal consistency, and being able to reason about streaming pipelines and the properties and guarantees they have as a whole.
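As a rough illustration of what a stateful, long-lived, virtual, addressable component could look like, here is a toy single-process sketch in the spirit of the Actor Model. `CartEntity`, `Runtime`, and the `"cart/alice"` address scheme are all hypothetical names made up for this example; durability, resilience, and distribution are left out entirely.

```python
from collections import deque


class CartEntity:
    """Hypothetical stateful component: state stays in memory between messages."""

    def __init__(self, address):
        self.address = address
        self.items = []

    def receive(self, message):
        kind, payload = message
        if kind == "add":
            self.items.append(payload)
        elif kind == "remove" and payload in self.items:
            self.items.remove(payload)
        return list(self.items)


class Runtime:
    """Toy runtime; a real one would place, move, and persist entities across nodes."""

    def __init__(self):
        self._entities = {}      # stable address -> live entity
        self._mailbox = deque()  # messages are processed one at a time

    def send(self, address, message):
        # Addressable: callers hold only a stable address, never an object reference.
        self._mailbox.append((address, message))

    def run(self):
        replies = []
        while self._mailbox:
            address, message = self._mailbox.popleft()
            # Virtual: the entity is activated on first use; callers never create it.
            if address not in self._entities:
                self._entities[address] = CartEntity(address)
            replies.append((address, self._entities[address].receive(message)))
        return replies

    def destroy(self, address):
        # Long-lived: state remains available until explicitly destroyed.
        self._entities.pop(address, None)


runtime = Runtime()
runtime.send("cart/alice", ("add", "book"))
runtime.send("cart/bob", ("add", "pen"))
runtime.send("cart/alice", ("add", "lamp"))
replies = runtime.run()
```

After `run()`, the last reply for `"cart/alice"` contains both items, because the entity kept its state across messages without a storage round trip. Sending to the same address after `destroy("cart/alice")` would activate a fresh, empty cart.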
Conclusion: The Requirements of Serverless Are Evolving from the Infrastructure, up to the Application Logic
The enterprise is broadly replatforming data centers on containers, Kubernetes, and cloud-native frameworks in its orbit, and the adoption and momentum have been remarkable.
One question that I frequently hear is: “Now that my application is containerized, do I still need to worry about all that hard distributed systems stuff? Won’t Kubernetes solve all my problems around cloud resilience, scalability, stability, and safety?” Unfortunately, the answer is “No, definitely no” — it is just not that simple.
While I believe that Kubernetes is the best way to manage and orchestrate containers in the cloud, it’s not a cure-all for programming challenges at the application level, such as:
- The underlying business logic and operational semantics of the application.
- Managing distributed application data consistency and integrity.
- Managing distributed and local workflow and communication.
- Integration with other systems.
In the now classic paper “End-to-End Arguments in System Design” from 1984, Saltzer, Reed, and Clark discuss the problem that many functions in the underlying infrastructure (the paper talks about communication systems) can only be completely and correctly implemented with the help of the application at the endpoints.
This is not an argument against the use of low-level infrastructure tools like Kubernetes and Istio — they clearly bring a ton of value — but a call for closer collaboration between the infrastructure and application layers in maintaining holistic correctness and safety guarantees.
End-to-end correctness, consistency, and safety mean different things for different services. It’s totally dependent on the use-case, and can’t be outsourced completely to the infrastructure. To quote Pat Helland: “The management of uncertainty must be implemented in the business logic.”
In other words, a holistically stable system is still the responsibility of the application, not the infrastructure tooling used — and the next generation serverless implementations need to provide programming models and a holistic UX working in concert with the underlying infrastructure maintaining these properties, without continuing to ignore the hardest, and most important problem: how to manage your data in the cloud — reliably at scale.
Note on “stateful long-lived virtual addressable components”: all of these terms are important, so let me clarify them. Stateful: in-memory yet durable and resilient state; Long-lived: life-cycle is not bound to a specific session, context available until explicitly destroyed; Virtual: location transparent and mobile, not bound to a physical location; Addressable: referenced through a stable address. One example of a component with these traits would be Actors (the Actor Model).
Note on the properties and guarantees of streaming pipelines: properties such as backpressure, windowing, completeness vs. correctness, etc.