
Temporal Tackles Microservice Reliability Headaches

3 Nov 2020 3:00am

With microservices, developers spend too much time writing code to ensure the reliability of their applications, rather than on creating business value for their companies, according to the startup Temporal.

Founded by the creators of Uber’s fault-tolerant stateful platform Cadence, the Bellevue, Washington-based venture wants to change that with a fork of the Cadence open source project.

Using code, it aims to hide the complexity of building with microservices across distributed systems. It employs durable virtual memory that is not tied to a specific process and that preserves the application state through a whole range of possible failures.

“Today, developers are forced to anticipate all failure scenarios and corner cases, and they spend a lot of time writing ‘glue code’ from scratch in order to handle these innumerable failure modes,” said Bogomil Balkansky, partner at Sequoia Capital, in a blog post. “That’s where Temporal comes in: The company has created a groundbreaking technology that enables any software application to handle failures gracefully and in a user-friendly way.”

Sequoia recently led an $18.75 million Series A round for the company, founded by Maxim Fateev and Samar Abbas.

In essence, Temporal wants developers to focus on their business logic, while it handles durability, availability and scalability of the application.

Handling Reliability

Monolithic applications used to run on one or two machines, but with microservices, data can be on multiple machines. If one fails, another part of an application might not be updated, “and the usual solution is actually a patchwork of different technologies that developers use: queues, they use databases, they use Redis caches, they use a timer service,” explained Fateev. Not only can managing all that become extremely complicated, it also takes a lot of developer time.

“It’s also very error-prone. And most of those solutions are not usable across different applications. So practically every time developers have to write a new application, most of the time they spend not on the business project, but on making their applications reliable.”

He likened that scenario to the old days with Microsoft Word when you had to hit “Save” so often in order to not lose work, rather than Google Docs, which saves automatically.

“This is what most programs these days do. They load the state, they update the state, then save it back on every request. And most of the code is not about the actual request, but is about saving and updating that. This is a lot of effort and developers are not happy about it,” Fateev said.

Central Brain for State

Temporal consists of a programming framework (or SDK) and a managed service (or backend).

The core abstraction in Temporal is a fault-oblivious stateful Workflow with business logic expressed as code. The state of the Workflow code, including local variables and threads it creates, is immune to process and Temporal service failures.

Temporal supports the programming languages Java and Go, but has SDKs in the works for Ruby, Python, Node.js, C#/.NET, Swift, Haskell, Rust, C++ and PHP.

In the event of a failure while running a Workflow, state is fully restored to the line in the code where the failure occurred and the process continues without developer intervention.

One of the restrictions on Workflow code, however, is that it must produce exactly the same result each time it is executed, which rules out calling external APIs directly from the Workflow. Those calls must be handled through what it calls Activities, which the Workflow orchestrates. An Activity is a function or an object method in one of the supported languages. Requests to run it are held in task queues until an available worker invokes its implementation function. When the function returns, the worker reports its result to the Temporal service, which then notifies the Workflow of completion.
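The determinism requirement is easier to see in a toy model. The sketch below is not the Temporal SDK: `WorkflowContext`, `execute_activity`, `run_workflow` and the order example are all hypothetical names, and the "durable" history is just an in-memory list. It assumes recovery works by re-running the Workflow function from the top and answering already-completed Activity calls from a recorded history, which only works if the Workflow requests the same Activities in the same order on every run.

```python
"""Toy model of replay-based recovery. Not the Temporal SDK; all names are hypothetical."""


class WorkerCrash(Exception):
    """Simulates the worker process dying in the middle of a workflow."""


class WorkflowContext:
    def __init__(self, history):
        self.history = history  # stand-in for a durable list of Activity results
        self.position = 0       # index of the next Activity call in this run

    def execute_activity(self, fn, *args):
        if self.position < len(self.history):
            # Replay: this call completed in an earlier run, so return the
            # recorded result instead of invoking fn again.
            result = self.history[self.position]
        else:
            # First execution: perform the side effect and record its result.
            result = fn(*args)
            self.history.append(result)
        self.position += 1
        return result


calls = {"charge": 0, "ship": 0}
crash_once = {"armed": True}


def charge(amount):
    calls["charge"] += 1
    return f"charged {amount}"


def ship(order_id):
    calls["ship"] += 1
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise WorkerCrash("process died before shipping finished")
    return f"shipped {order_id}"


def order_workflow(ctx, order_id, amount):
    # Deterministic: the same inputs always produce the same Activity sequence.
    receipt = ctx.execute_activity(charge, amount)
    tracking = ctx.execute_activity(ship, order_id)
    return [receipt, tracking]


def run_workflow(workflow, history, *args):
    """Re-run the workflow from the top after each crash, replaying the history."""
    while True:
        try:
            return workflow(WorkflowContext(history), *args)
        except WorkerCrash:
            continue  # a fresh worker picks the workflow up again


history = []
result = run_workflow(order_workflow, history, "order-1", 25)
print(result)  # ['charged 25', 'shipped order-1']
print(calls)   # {'charge': 1, 'ship': 2}
```

The customer is charged once even though the Workflow function ran twice. If `order_workflow` branched on something like the current time or a random number, a replayed run could ask for Activities in a different order than the one recorded, and the history would no longer line up with the code.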

“Those [external] services can fail because we don’t control them,” Fateev said. “And for those, you have very broadly defined retry policies. And one big difference from other solutions, there are no limits how long they can retry, you can specify retry policy for a week, for a month, for a year.”
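As a rough illustration of what such a retry policy expresses, here is a minimal sketch of exponential backoff with a long overall time budget. The `RetryPolicy` fields, `call_with_retry` and `FakeClock` are assumptions made for this example, not Temporal's API, and a fake clock replaces real sleeping so the code runs instantly.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RetryPolicy:
    # Hypothetical field names chosen for illustration.
    initial_interval: timedelta = timedelta(seconds=1)
    backoff_coefficient: float = 2.0
    maximum_interval: timedelta = timedelta(minutes=10)
    expiration: timedelta = timedelta(days=7)  # total time budget for retries


class FakeClock:
    """Stands in for real time so the example does not actually wait."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


def call_with_retry(fn, policy, clock):
    interval = policy.initial_interval.total_seconds()
    cap = policy.maximum_interval.total_seconds()
    deadline = clock.now + policy.expiration.total_seconds()
    while True:
        try:
            return fn()
        except ConnectionError:
            if clock.now + interval > deadline:
                raise  # budget exhausted: surface the last failure
            clock.sleep(interval)
            # Grow the wait between attempts, up to the configured cap.
            interval = min(interval * policy.backoff_coefficient, cap)


attempts = {"n": 0}


def flaky_service():
    # Fails four times, then succeeds on the fifth attempt.
    attempts["n"] += 1
    if attempts["n"] < 5:
        raise ConnectionError("service unavailable")
    return "ok"


clock = FakeClock()
outcome = call_with_retry(flaky_service, RetryPolicy(), clock)
print(outcome, attempts["n"], clock.sleeps)  # ok 5 [1.0, 2.0, 4.0, 8.0]
```

In this sketch the retry state lives in the calling process, so a week-long budget would be lost if that process died. The point Fateev makes is that the platform keeps that state durably, which is what makes very long retry windows practical.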

The backend service is stateless and relies on a persistent store. So far it supports Cassandra and MySQL stores, although an adapter can be written for any other database that provides multirow single-shard transactions.

The company touts Temporal as an ideal way to scan big data sets in a scalable and resilient way, using a single Activity or multiple Activities for partitioned data sets. It can route tasks to a specific process and reroute retries to a different host, if necessary.

For distributed transaction processing, it employs native Saga Pattern support, which involves compensating transactions in case of failure in one service to undo the impact of the preceding transactions.
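The compensating-transaction idea can be sketched in a few lines. This is a generic illustration of the Saga pattern, not Temporal's built-in support: `run_saga` and the travel-booking functions are hypothetical, and each step is simply paired with the function that undoes it.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    compensations = []
    try:
        for action, compensation in steps:
            action()
            # Only steps that finished successfully need to be undone later.
            compensations.append(compensation)
    except Exception:
        for compensation in reversed(compensations):
            compensation()
        return False
    return True


log = []


def reserve_car():
    log.append("reserve car")


def cancel_car():
    log.append("cancel car")


def book_hotel():
    log.append("book hotel")


def cancel_hotel():
    log.append("cancel hotel")


def book_flight():
    raise RuntimeError("no seats left")


def cancel_flight():
    log.append("cancel flight")


succeeded = run_saga([
    (reserve_car, cancel_car),
    (book_hotel, cancel_hotel),
    (book_flight, cancel_flight),
])
print(succeeded)  # False
print(log)        # ['reserve car', 'book hotel', 'cancel hotel', 'cancel car']
```

The failed flight booking is never compensated because it never completed, while the hotel and car reservations are undone in the reverse of the order in which they were made.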

It recently released its version 1, with backward compatibility, improvements in the way shard IDs are hashed and experimental features: archival; cross-data center replication; batch operations; dynamic config; and the addition, removal and creation of searchable attributes with Elasticsearch. It’s working on a deprecation policy.

Temporal has proven popular with developers at background-check technology vendor Checkr, according to a case study.

“Modeling things as Workflows and Activities makes inter-team sharing possible, meaning that code is continuously reused and not continuously reinvented,” it states, noting also improved visibility. “Being able to see step by step what is happening, what path a Workflow took, is very valuable.”

File-sharing site Box had built a custom orchestration system to handle updates on big folders that often contain millions of files, each with its own permissions and metadata. Each worker and queue required custom logic and state, though, quickly making that system unmanageable.

“In the back of my mind, it’s always been, ‘We need to start looking at a technology that can solve the general workflow pattern.’ We needed a central brain where we can store state,” said senior staff software engineer Steven Cipolla in a case study.

Temporal became that central brain. Another benefit was the ability to map its software architecture to code ownership boundaries within the organization, which helped identify roadblocks to achieving the velocity it sought.

Temporal plans to launch a hosted version within the next year, a much-requested feature, according to Fateev. It will soon add support for more databases, including Postgres, as well as more security capabilities.
