The Best Way to Think about Resilience Is Not to
If you’ve used word processing tools for a long time, you remember reflexively hitting the “save” keyboard shortcut, driven by the fear of losing your work and then cursing out loud over the amazing work that had just disappeared.
With modern tools (think Google Docs), this worry doesn’t even come up. In the middle of a word and the power goes out? No problem. Everything is saved in the state you left it and you can move on.
Samar Abbas and his team at workflow orchestration engine Temporal want to bring this concept to your enterprise workflow. You provide the business logic and they handle all the parts that require specialized expertise like persistence and resilience.
Temporal was founded in 2019 by Abbas and his colleague Maxim Fateev while they were at Uber. They had created a development platform for the car-hailing app company dubbed “Cadence.” It’s an evolution of the AWS Simple Workflow Service platform that the duo helped develop when they were colleagues at Amazon in the mid-2000s. Dozens of Uber services and applications adopted Cadence.
Abbas and Fateev left to co-found Temporal and build the fault-tolerant workflow engine successor project to Cadence. In the three years since, the company has enjoyed solid success, with companies like Netflix, Instacart and others using Temporal’s open source software code. Earlier this year, the company secured a $103 million Series B round that put its valuation at $1.5 billion.
I recently had a conversation with Abbas (you can watch the whole thing here) about how he and Fateev built their wildly successful venture. (His words below are edited for clarity and length.)
From Uber to Temporal
In 2015, Uber opened an office in Seattle, and I joined their engineering team. Max [Temporal co-founder Maxim Fateev] and I ended up at Uber within a month of each other. The key project we worked on together was Cadence.
At Uber, engineers spent lots of time stitching together low-level queues, databases and durable timers to build resilience into their applications.
This is what we were trying to solve with a system like Cadence. We provided a high-level abstraction that was still code-based, but we took certain classes of failures off engineers’ plates and solved them underneath the platform. That allowed engineers to focus on the business logic of the application rather than on building resiliency.
That was successful within Uber and, as it was open source, we started seeing a lot of external adoption for the technology. So in 2019, Max and I decided to take the leap and start Temporal, because we really wanted to focus on the external adoption of the technology.
The Temporal Developer Experience
People sometimes describe Temporal as a workflow engine, or describe its features, but the key value proposition for us is developer productivity: how fast developers can build applications and get them running in production without spending weeks or months testing all sorts of failure situations that can happen in a cloud native environment.
So the way we think about developer experience is not limited to the core of what the technology has to offer; we cover the entire software development life cycle, starting with how developers build their applications. For instance, a lot of workflow engines typically go the domain-specific language (DSL) route. We are all code-based. We know developers like writing code, and we want them to write code, but just take away a certain class of concerns, like how to make that code resilient if some underlying infrastructure goes down, or how to make code resilient when a network blip happens.
When and How Does Temporal Make a Difference?
Money transfers are one of the key use cases where Temporal is used quite frequently. If you are moving money from one account to another, the user’s view is simple: debit account A, then credit account B. But a majority of the software development time is spent on system failures that can happen between those two calls. This is where engineers end up spending a huge amount of time.
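To make the "failure between those two calls" problem concrete, here is a minimal Python sketch of the hand-written approach. It is an illustration only: the `transfer` function, the `TransferError` class and the `credit_fails` flag are hypothetical names invented for this example, and the in-memory dictionary stands in for real account services.

```python
class TransferError(Exception):
    """Raised when the credit step fails after the debit has succeeded."""


def transfer(accounts, source, target, amount, credit_fails=False):
    """Move `amount` from source to target, refunding the debit if the credit fails.

    `credit_fails` is a hypothetical switch that simulates an outage
    between the two calls.
    """
    if accounts[source] < amount:
        raise ValueError("insufficient funds")

    accounts[source] -= amount  # step 1: debit account A
    try:
        if credit_fails:
            raise ConnectionError("credit service unreachable")
        accounts[target] += amount  # step 2: credit account B
    except ConnectionError as exc:
        accounts[source] += amount  # compensate: undo the debit
        raise TransferError("transfer rolled back") from exc
    return accounts
```

Even this toy version spends more lines on the failure path than on the two business steps, and it still does not cover the process itself dying between the debit and the compensation.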
This is an example of when a system like Temporal can help big time — it even feels magical. We hear this question a lot: What happens if my application fails at this point?
Our response to that question is: Workflows never fail. (At Temporal, we call the primitive that we are building a “workflow.”) Then it’s one of those moments when a light switch goes on. We’ve started to call this “durable execution,” where at a high level what we provide is this: Your executions are completely durable. They never fail.
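One way to picture durable execution is a journal of completed steps that survives a crash, so a restarted run skips what already happened. The Python sketch below is a toy under that assumption; it is not Temporal's implementation or SDK, and `Journal`, `run_step`, `transfer_workflow` and `demo` are hypothetical names.

```python
import json
import os
import tempfile


class Journal:
    """Record of completed steps, persisted to a JSON file."""

    def __init__(self, path):
        self.path = path
        self.results = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def run_step(self, name, func):
        """Run `func` once; on a later replay, return the recorded result."""
        if name in self.results:
            return self.results[name]
        result = func()
        self.results[name] = result
        with open(self.path, "w") as f:
            json.dump(self.results, f)
        return result


def transfer_workflow(journal, calls, crash_after_debit=False):
    """Two-step workflow; `calls` records which side effects really ran."""

    def debit():
        calls.append("debit")
        return "debited"

    def credit():
        calls.append("credit")
        return "credited"

    journal.run_step("debit", debit)
    if crash_after_debit:
        raise RuntimeError("simulated process crash")
    journal.run_step("credit", credit)
    return "done"


def demo():
    """Crash after the debit, then restart with the same journal file."""
    calls = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.json")
        try:
            transfer_workflow(Journal(path), calls, crash_after_debit=True)
        except RuntimeError:
            pass  # the first process died between the two steps
        transfer_workflow(Journal(path), calls)  # restart: debit is skipped
    return calls
```

Calling `demo()` shows the debit running only once even though the workflow function is invoked twice. A real system also has to handle a crash between running a step and recording it, which this sketch ignores.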
The Business Impact of a ‘Fault Oblivious’ Stateful Workflow
Back in the ’90s when I was in school, we used to type all of our assignments in Microsoft Word. You got in the habit of saving your document every time you made a few edits. Yet there was a certain class of failures, like the hard disk going down, where you lost all your work.
Now, with Google Docs, kids cannot even relate to this. There isn’t even a “save” button anymore. We believe that there’s a class of stateful applications that are still in this 1990s era, where more than 80% of the code is about handling infrastructure failure to build resiliency for stateful applications. Every time an event happens, you load that state, apply that event, do a bunch of actions and then store that state back. This is where a majority of engineering effort goes: how to make that loop reliable, fast and performant, and how to protect it against all sorts of failures and corruptions.
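The load, apply and store loop described above might look like the following when written by hand. This is a rough Python sketch under stated assumptions: the state lives in a local JSON file, events carry a hypothetical `id` and `amount`, and `handle_event` and `demo` are names made up for the example.

```python
import json
import os
import tempfile


def handle_event(path, event):
    """Load state, apply one event, and store the result back."""
    # Load: start from an empty state if nothing has been stored yet.
    state = {"balance": 0, "applied": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)

    # Guard against the same event being delivered twice.
    if event["id"] in state["applied"]:
        return state

    # Apply the event to the in-memory state.
    state["balance"] += event["amount"]
    state["applied"].append(event["id"])

    # Store: write to a temporary file and swap it in, so a crash
    # mid-write does not leave a half-written state file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)
    return state


def demo():
    """Apply two events, with the first one delivered twice."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")
        handle_event(path, {"id": "e1", "amount": 50})
        handle_event(path, {"id": "e1", "amount": 50})  # redelivery, ignored
        return handle_event(path, {"id": "e2", "amount": -20})
```

Only two lines apply the event; the rest is duplicate detection and crash-safe storage, which is the kind of infrastructure code the passage refers to.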
Developers shouldn’t have to even think they can ever lose their state. There are just these durable executions that never fail. And I think it’s completely going to change how engineers think about cloud native systems.
Why Managed Apache Cassandra?
My co-founder Max and I come from a background of building messaging systems and middleware. Running storage systems is not our strength. So when we started the company as only two people, a key goal for us was to capitalize on our strengths of providing the best developer experience for Temporal users. Temporal has a server component and client SDKs, which most developers out there use for building applications. But how can people run those servers with minimum operational overhead? This is where the majority of the overhead for running Temporal is.
We have a pluggable persistence model; we support Apache Cassandra, MySQL and Postgres as pluggable adapters. Cassandra is one of the adapters that has very nice scalability characteristics. A key value proposition for our users is the fact that they are running mission-critical applications, and reliability is the key thing that they are looking for. So we do not take it lightly when we bring a new dependency into the Temporal fold. We ran over a month of evaluation for all sorts of persistence options. It was DataStax Astra DB hands down.
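A pluggable persistence model generally means the server codes against one storage interface and each database gets its own adapter. The Python sketch below illustrates that pattern only; it does not reflect Temporal's actual persistence interface, and `WorkflowStore`, `InMemoryStore`, `ADAPTERS` and `open_store` are hypothetical names. The in-memory adapter stands in for a real Cassandra, MySQL or Postgres adapter.

```python
from abc import ABC, abstractmethod


class WorkflowStore(ABC):
    """Storage interface the rest of the system would depend on."""

    @abstractmethod
    def save(self, workflow_id, state):
        """Persist the latest state for a workflow."""

    @abstractmethod
    def load(self, workflow_id):
        """Return the stored state, or None if nothing is stored."""


class InMemoryStore(WorkflowStore):
    """Stand-in adapter; a real one would talk to a database."""

    def __init__(self):
        self._rows = {}

    def save(self, workflow_id, state):
        self._rows[workflow_id] = dict(state)

    def load(self, workflow_id):
        return self._rows.get(workflow_id)


# Registry mapping a configuration value to an adapter class.
ADAPTERS = {"memory": InMemoryStore}


def open_store(kind):
    """Pick an adapter by name, as a configuration file might."""
    try:
        return ADAPTERS[kind]()
    except KeyError:
        raise ValueError(f"unknown persistence adapter: {kind}") from None
```

With this shape, swapping the backing database is a configuration change rather than a change to workflow code.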
Some databases win on some features, others win on other features. But it wasn’t even about the technology in this case; it’s about the people. We believe bugs and failures are a part of life. It’s all about how you respond when an outage is happening. And this is where we believe Astra DB wins. There are so many similarities with the way DataStax treats its customers and the kinds of relationships that they build when it comes to operationalizing their databases. And that gave us confidence that this is a dependency that we want to invest in for a core part of the system.
I don’t think we would be in a place where we are today if a technology like Astra was not there for us to capitalize on and build on top of. Just operationalizing Cassandra and getting stuff “done” would alone be at least a year-long project, and that is not even part of our core strength. For a company like us, where the key value proposition is reliability, if we cannot figure out a way to run and operationalize your storage in a reliable fashion, we don’t have a business.