Distributed Systems Are Hard

This article is an extract from the author's forthcoming book “The Cloud Native Attitude,” which will be available from Container Solutions in October 2017.

Nowadays, I spend much of my time singing the praises of a cloud native (containerized and microservice-ish) architecture. However, most companies still run monoliths. Why? It’s not because we’re just wildly unfashionable; it’s because distributed systems are really hard. Nonetheless, they remain the only way to get hyper-scale, truly resilient and fast-responding systems, so we’re going to have to get our heads round them.
In this post, we’ll look at some of the ways distributed systems can trip you up and some of the ways that folks are handling those obstacles.
Forget Conway’s Law, distributed systems at scale follow Murphy’s Law: “Anything that can go wrong, will go wrong.”
At scale, statistics are not your friend. The more instances of anything you have, the higher the likelihood one or more of them will break. Probably at the same time.
Services will fall over before they’ve received your message, while they’re processing your message or after they’ve processed it but before they’ve told you they have. The network will lose packets, disks will fail, virtual machines will unexpectedly terminate.
There are things a monolithic architecture guarantees that are no longer true when we’ve distributed our system. Components (now services) no longer start and stop together in a predictable order. Services may unexpectedly restart, changing their state or their version. The result is that no service can make assumptions about another — the system cannot rely on 1-to-1 communication.
A lot of the traditional mechanisms for recovering from failure may make things worse in a distributed environment. Brute force retries may flood your network, and restores from backups are not straightforward. There are design patterns for addressing all of these issues, but they require thought and testing.
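To make that concrete, here is one such pattern: retrying with exponential backoff and jitter, so that a burst of failures doesn’t become a burst of simultaneous retries that floods the network. A minimal Python sketch (the service call is a stand-in; adapt the exception type to whatever your client library actually raises):

```python
import random
import time

def call_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky call, waiting longer (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))

# Usage with a hypothetical service call:
# profile = call_with_backoff(lambda: fetch_user_profile("user-42"))
```

The jitter matters as much as the backoff: if every client waits exactly the same amount of time, they all retry at once and flood the network again.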
If there were no errors, distributed systems would be pretty easy. That can lull optimists into a false sense of security. Distributed systems must be designed to be resilient by accepting that all possible errors are just business as usual.
What We’ve Got Here Is Failure to Communicate
There are traditionally two high-level approaches to application message passing in unreliable (i.e. distributed) systems:
- Reliable but slow: Keep a saved copy of every message until you’ve had confirmation that the next process in the chain has taken full responsibility for it.
- Unreliable but fast: Send multiple copies of messages to potentially multiple recipients and tolerate message loss and duplication.
The reliable and unreliable application-level communications we’re talking about here are not the same as network reliability (e.g. TCP vs UDP). Imagine two stateless services that send messages to one another directly over TCP. Even though TCP is a reliable network protocol, this isn’t reliable for application-level communications. Either service could fall over and lose the message it was processing, because stateless services don’t durably save the data they are handling.
We could make this setup application-level-reliable by putting stateful queues between the services to save each message until it has been completely processed. The downside is that it would be slower, but we may be happy to live with that if it makes life simpler, particularly if we use a managed stateful queue service so we don’t have to worry about the scale and resilience of the queue itself.
The reliable approach is predictable but involves delay (latency) and complexity: lots of confirmation messages and resiliently saving data (statefulness) until you’ve had sign-off from the next service in the chain that it has taken responsibility for the message.
A reliable approach does not guarantee rapid delivery, but it does guarantee that all messages will be delivered eventually, at least once. In an environment where every message is critical and no loss can be tolerated (credit card transactions, for example) this is a good approach. Amazon Simple Queue Service (SQS), Amazon’s managed queue service, is one example of a stateful service that can be used in a reliable way.
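As a concrete illustration of the reliable pattern, here is a minimal SQS consumer sketch using boto3. The queue URL and the process() function are placeholders; the important detail is that the message is only deleted after processing succeeds, which is what gives you at-least-once delivery (and why the same message may occasionally arrive twice):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders"  # placeholder

def process(body: str) -> None:
    ...  # hypothetical business logic; should tolerate seeing a message twice

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling keeps the loop cheap
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after successful processing: if we crash before this,
        # SQS makes the message visible again and redelivers it.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```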
The second, unreliable, approach is faster end to end, but it means services usually have to expect duplicates and out-of-order messages, and some messages will go missing. Unreliable communication might be used when messages are time-sensitive (i.e. if they are not acted on quickly, it is not worth acting on them at all) or when later data simply overwrites earlier data. For very large-scale distributed systems, unreliable messaging may be used because it is so much faster and has less overhead. However, the microservices must then be designed to cope with message loss and duplication.
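A common defensive pattern on the unreliable side (my example, not something the approach prescribes) is to make consumers idempotent: deduplicate on a message ID and let newer data win over older data. A rough sketch, assuming each message carries a unique id and a monotonically increasing seq:

```python
seen_ids = set()   # in production this would be a bounded cache or a datastore
latest_seq = 0     # highest sequence number applied so far

def apply_update(payload) -> None:
    ...            # hypothetical business logic

def handle(message: dict) -> None:
    """Apply each message at most once and ignore stale, out-of-order updates."""
    global latest_seq
    if message["id"] in seen_ids:
        return                     # duplicate delivery: drop it
    seen_ids.add(message["id"])
    if message["seq"] <= latest_seq:
        return                     # arrived late: newer data has already won
    latest_seq = message["seq"]
    apply_update(message["payload"])
```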
Within each approach, there are a lot of variants (guaranteed and non-guaranteed order, for example), all of which have different tradeoffs in terms of speed, complexity and failure rate.
Some systems may use multiple approaches depending on the type of message being transmitted or even the current load on the system. This stuff is hard to get right if you have a lot of services all behaving differently. The behavior of a service needs to be explicitly defined in its API. It often makes sense to define constraints or recommended communication behaviors for the services in your system to get some degree of consistency.
What Time Is It?
There’s no such thing as common time, a global clock, in a distributed system. For example, in a group chat, there’s usually no guaranteed order in which my comments and those sent by my friends in Australia, Colombia and Japan will appear. There’s not even any guarantee we’re all seeing the same timeline — although one ordering will generally win out if we sit around long enough without saying anything new.
Fundamentally, in a distributed system every machine has its own clock and the system as a whole does not have one correct time. Machine clocks may be synchronized frequently, but even then the transmission times for the sync messages will vary and physical clocks run at different rates, so everything drifts out of sync again almost immediately.
On a single machine, one clock can provide a common time for all threads and processes. In a distributed system this is just not physically possible.
In our new world then, clock time no longer provides an incontrovertible definition of order. The monolithic concept of “what time is it?” does not exist in a microservice world and designs should not rely on it for inter-service messages.
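What distributed designs often rely on instead is a logical clock, which orders events by counting them rather than by reading a wall clock. It isn’t the only option, but the classic example is a Lamport clock; a minimal sketch:

```python
class LamportClock:
    """Logical clock: counts events instead of reading wall-clock time."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Call for every local event, including just before sending a message."""
        self.time += 1
        return self.time

    def update(self, received_time: int) -> int:
        """Call on receiving a message stamped with the sender's clock."""
        self.time = max(self.time, received_time) + 1
        return self.time

# Sender stamps outgoing messages:
#   msg = {"payload": ..., "ts": clock.tick()}
# Receiver merges the stamp on arrival:
#   clock.update(msg["ts"])
```

This only gives a partial ordering (it can’t tell you which of two unrelated events happened “first”), but that is usually all a distributed design can honestly promise anyway.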
The Truth Is Out There?
In a distributed system there is no global shared memory and therefore no single version of the truth. Data will be scattered across physical machines. In addition, any given piece of data is more likely to be in the relatively slow and inaccessible transit between machines than would be the case in a monolith. Decisions, therefore, need to be based on current, local information.
This means that answers will not always be consistent in different parts of the system. In theory, they should eventually become consistent as information disseminates across the system but if the data is constantly changing we may never reach a completely consistent state short of turning off all the new inputs and waiting. Services, therefore, have to handle the fact that they may get “old” or just inconsistent information in response to their questions.
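One simple way for a service to cope with potentially stale answers (a common convention, not a requirement) is to attach a version counter to each record and keep whichever copy is newer, a last-writer-wins merge:

```python
def merge(local: dict, remote: dict) -> dict:
    """Last-writer-wins merge, keyed on a per-record 'version' counter
    that the writer is assumed to increment on every update."""
    return remote if remote["version"] > local["version"] else local

# A replica receives a possibly-stale copy of a user record:
local = {"email": "ada@example.com", "version": 7}
remote = {"email": "ada@oldmail.com", "version": 5}
assert merge(local, remote)["email"] == "ada@example.com"  # the stale copy loses
```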
Talk Fast!
In a monolithic application, most important communications happen within a single process between one component and another. Communications inside processes are very quick so lots of internal messages being passed around is not a problem. However, once you split your monolithic components out into separate services, often running on different machines, then things get trickier.
To give you some context:
- In the best case, it takes about 100 times longer to send a message from one machine to another than it does to just pass a message internally from one component to another.
- Many services use text-based RESTful messages to communicate. RESTful messages are cross-platform and easy to use, read and debug, but slow to transmit and receive. In contrast, Remote Procedure Call (RPC) messages paired with binary protocols are not human-readable, and are therefore harder to debug and use, but are much faster to transmit and receive. Sending a message via an RPC method (gRPC is a popular example) is around 20 times faster than sending a RESTful message.
The upshot of this in a distributed environment is:
- You should send fewer messages. You might choose to send fewer, larger messages between distributed microservices than you would between components in a monolith, because every message introduces delay (aka latency). See the batching sketch after this list.
- Consider sending messages more efficiently. For what you do send, you can help your system run faster by using RPC rather than REST for transmitting messages, or even by going to UDP and handling the unreliability yourself.
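Here is the batching sketch promised above: collect many small messages and send them as one larger call, so the network round trip is paid once per batch rather than once per message. The send_batch callable and the endpoint are placeholders:

```python
class Batcher:
    """Collects small messages and flushes them as one larger network call."""

    def __init__(self, send_batch, max_size=100):
        self._send_batch = send_batch  # e.g. one HTTP or RPC call per batch (hypothetical)
        self._max_size = max_size
        self._buffer = []

    def add(self, message) -> None:
        self._buffer.append(message)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)  # one round trip instead of max_size round trips

# Usage with a hypothetical endpoint:
# import requests
# batcher = Batcher(send_batch=lambda batch: requests.post(EVENTS_URL, json=batch))
# for event in events:
#     batcher.add(event)
# batcher.flush()  # send whatever is left in the buffer
```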
Status Report?
If your system can change at sub-second speeds, which is the aim of a dynamically managed, distributed architecture, then you need to be aware of issues at that speed. Many traditional logging tools are not designed to track that responsively. You need to make sure you use one that is.
Testing to Destruction
The only way to know whether your distributed system works and will recover from unpredictable errors is to continually engineer those errors and continually repair your system. Netflix uses a Chaos Monkey to randomly pull cables and crash instances. This kind of testing needs to check your system for resilience and integrity and, just as importantly, exercise your logging to make sure that if an error occurs you can diagnose and fix it retrospectively, i.e. after you bring your system back online.
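Chaos Monkey itself is open source, but the core idea can be prototyped very crudely. The sketch below (purely illustrative; the container names are placeholders for a test environment you are allowed to break) kills one container at random so you can check both that the system recovers and that the failure is clearly visible in your logs afterwards:

```python
import random
import subprocess

# Containers we are allowed to kill in the test environment (placeholder names).
CANDIDATES = ["orders-service", "payments-service", "catalog-service"]

def kill_random_container() -> str:
    """Pick a victim at random and kill it via the Docker CLI."""
    victim = random.choice(CANDIDATES)
    subprocess.run(["docker", "kill", victim], check=True)
    return victim

if __name__ == "__main__":
    victim = kill_random_container()
    print(f"Killed {victim}; now check that the system recovers and the logs explain what happened.")
```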
All This Sounds Difficult. Do I Have to?
Creating a distributed, scalable, resilient system is tough, particularly for stateful services. Now is the time to decide whether you need it at all, and whether you need it immediately. Can your customers live with slower responses or lower scale for a while? That would make your life easier, because you could design a smaller, slower, simpler system first and only add complexity as you build expertise.
The cloud providers like AWS, Google, and Azure are also all developing and launching offerings that could do increasingly large parts of this hard stuff for you, particularly resilient statefulness (managed queues and databases). These services can seem costly, but building and maintaining complex distributed services is expensive too.
Any framework that constrains you but handles some of this complexity for you (like Linkerd, Istio, or Azure’s Service Fabric) is well worth considering.
The key takeaway is don’t underestimate how hard building a properly resilient and highly scalable service is. Decide if you really need it all yet, educate everyone thoroughly, introduce useful constraints, do everything gradually and expect setbacks and successes.