In 2012, when Amazon Web Services launched DynamoDB as a commercial service, it arrived with an extra assurance for reliability. The replication and fault tolerance mechanisms for the distributed database system went the usual gauntlet of design and code reviews, extensive testing and was examined in a series of informal correctness proofs.
But in this case, an additional check was taken. The codebase for these functions was also modeled through a formal specification language built on discrete math, called TLA+. TLA+ found a previously missed bug that could have led to data loss when a particular sequence of failures and recovery steps was interleaved with other processing. This was an extremely subtle bug; the shortest error trace exhibiting the bug contained 35 high-level steps.
Through TLA+, AWS engineers also found other previously undetectable subtle bugs, and, better yet, were able to” make aggressive performance optimizations without sacrificing correctness,” noted a subsequent paper from the cloud giant “How Amazon Web Services Uses Formal Methods.”
“Some of the more subtle, dangerous bugs turn out to be errors in design. We have found that testing the code is inadequate as a method for finding these errors, as the number of reachable states of the code is astronomical,” the engineers wrote.
Formally verifying the correctness of a program has long been assumed to be a costly task, hence its use is restricted primarily for safety-critical applications. But AWS software engineers have asserted it can be used more broadly. In particular, it could be used to solve thorny concurrency problems in distributed cloud native systems.
Microsoft has benefitted from TLA+ as well. TLA+ caught a critical error in the memory design of the Xbox, a bug that would have shut down the console after four hours of gameplay.
Amazon Web Services’ distributed systems engineer Murat Demirbas tested TLA+ in a class he taught at the University of Buffalo and found the language useful. Even though the learning curve was steep, “The students liked TLA+ since it gave them a way to experiment and supported them in reasoning about the algorithms,” he wrote. Citing a paper, industry observer Adrian Coyler suggested TLA+ could be used to check smart contracts.
Still finding bugs despite rigorous testing? It might be time for some formal specification and model checking to help solve difficult design problems in critical systems.
How It Works
Software consultant Hillel Wayne first got to know TLA+ working on an educational system, which was, in his words, an “accidentally-distributed system,” where a lot of reoccurring “weird bugs” kept cropping up. They were often collisions between third-party software that the ops team had little control over.
“I saw how Amazon used it, and thought that maybe it would be something worth trying out. I ended up learning it in a couple of weeks,” Wayne said. Applying at work, he caught a bunch of critical bugs, as well as shaved some time off of the delivery schedules.
Wayne wondered why few others seemed to be using it, aside from Amazon. It turned out there was a serious lack of educational material on using the specification language, aside a few books that were more focused on the formal mathematical side of things. To fill this gap of information around TLA+, Wayne wrote a series of tutorials, a book and now professionally teaches TLA+.
A mathematical language, TLA+ uses temporal logic to reason about concurrent systems. The name is an acronym for Temporal Logic of Actions, as it was defined by its creator Leslie Lamport. A pioneer in the science of distributed systems, Lamport has always advocated the importance of developers modeling their systems before actually coding them.
For any system, it specifies both what a system should do — its desired “correctness properties” — and how that system should work, its design. An accompanying explicit state model checker checks for desired correctness across all possible execution states.
TLA+ is a bit like architectural design. Just as home builders just don’t start erecting a house at random, so too should large-scale software development follow a high-level design. Coders won’t be able how different components may interact in a higher level.
A architectural diagram by itself provides no way of being tested. But TLA+ also provides a place to include properties of each component. With properties, TLA+ can model check the design, looking possible combinations and edge conditions that could cause bad states. One easy example is a queue that could change the order in which it processes messages, which would potentially confuse a transaction between two components.
There are a number of ways to work with TLA+. One can sketch out a high-level overview of the program they want to create, then map to TLA+ principles, then derive the code from that. Or they can go in the other direction and translate an existing code base to TLA+ to get a high-level overview.
Many TLA+ users rely on TLC, a model checker, part of Lamport’s TLA+ Toolbox, that can run through all the possible permutations of a system. Models can also be checked by hand, but this truly is time-consuming.
A typical problem that TLA+ could tackle is that of guaranteeing idempotency, where a message needs to be sent only once. This requirement can be derived from the various vendor guarantees, such as simply never sending a message more than once, and sending it at least once.
“It’s being able to essentially test your architectural diagrams,” Wayne said.
Although there was a bit of buzz around TLA+ about a decade ago, the hype seems to have waned, despite its immediate applicability to distributed cloud native systems. When it surfaces on discussion boards such as Hacker News, most curious parties admit it could help them with their work, but they don’t have time to learn or use TLA+.
There is a bit of a learning curve. In the workshop Wayne teaches, the students are usually able to find actual bugs in complex code in about day three. Most of his students are developers and software engineers, with a few IT architects and project managers as well. On their own, they may reach that level of proficiency after a few weeks, Wayne estimates. Often a tool called PlusCal can be used, which offers a more developer-centric (rather than mathematician-centric) pseudocode-like interface to the specs.
A bigger barrier to adoption is the time it demands from the project engineers. The output is a design, not code, and so the engineer still must make the translation from design to code or vice-versa (Though there is an upside — even the manual work of mapping a program, or proposed program, can make obvious to engineers errors that they may not have thought about before).
Plus, an enterprise’s full adoption of TLA+ goes beyond the coder, necessitating change to how the project work itself is structured. Managers of a time-sensitive project (and what project isn’t?) may not want to factor in an extra step for verification. To get the full benefit of TLA+, a project team should factor in the additional step to incorporate TLA+.
Amazon Web Services is a sponsor of The New Stack.
Image by Andrzej Rembowski from Pixabay.