How to Diminish the Inevitable Failure of Distributed Systems
“In a galaxy far far away, all business apps talked to one database and usually one person understood how to operate it on a company’s own hardware.”
That’s how Pivotal Cloud Foundry Software Engineer Denise Yu kicked off her London DevopsDays talk.
Today’s big data storage and retrieval needs have since become a lot more complicated all because of the immense amount of data we’ve gained and the natural language processing, fuzzy searching, machine learning and artificial intelligence we are using to query and interpret it all.
To meet these evolving needs, Yu says the first thing companies did was to scale vertically, but eventually it became financially and/or physically impossible to open up more CPUs on more servers. In came cloud computing, which allowed horizontal scaling and distribution across many machines. There are three key drivers for horizontally distributing workloads:
- Scalability: One machine cannot handle request size or data size, so database fragmentation and sharding onto many machines helped solve this.
- Availability: If one machine goes down, others can keep working.
- Latency: You can go faster and get better request times when data is geographically closer to end users.
Yu said this eventually led to shared-nothing architecture (SN) — the most popular form of network computing, which Yu says all public clouds run on — where each node is independent and self-sufficient, and there is no single point of contention across the system. Since machines don’t share access to any resources, data can be retrieved quickly.
But this comes with a lot of its own problems. Yu’s talk discussed the philosophy behind distributed systems and how to overcome it, all with her original illustrations. Today, The New Stack shares her insights.
Understanding CAP Theorem and Network Partitions
Yu, a philosophy major turned software engineer, says the myth that the network is reliable all comes down to an epistemic philosophical problem, most commonly explained with the CAP theorem, which she, in part, disputes.
Dr. Eric Brewer’s CAP theorem from the year 2000 — which stands for consistency, availability and partitioning — maintains that, because of network reliabilities and other limitations of distributed systems, there has to be a trade-off where one of the three doesn’t work optimally, so the other two can. CAP states distributed systems trade-offs can be one of these three:
- Consistent and Partition Tolerant (CP) — a system sacrifices availability in order for consistent responses and maintaining network partitions.
- Consistent and Available (CA) — a system lowers the network partition so it is always available and the responses are consistent
- Available and Partition Tolerant (AP) — a system’s responses aren’t consistent but are always available and the partitions are in place.
This is typically represented by a circle or cycle where two of the three aspects depend on each other. Yu presented Coda Hale’s reframing of CAP as an analytical tool, where partition tolerance is unavoidable, but consistency and availability can be traded off against each other. She offers her own seesaw representation illustrated below.
She starts by clarifying the C with an L for linearizability, which requires the most up-to-date data to be presented to all clients on read and write.
“The C in CAP means a really narrow definition of consistency. It means that all other clients must see consistency too. This is really really hard. It demands distance and replication,” Yu explained. “Database engineers try to reduce that replication lag down as close as possible to zero. Eventual consistency doesn’t count as part of the CAP formulation.”
She says one of the main problems with this is that an asynchronous system cannot work in the world of CAP. CAP relies on each type of node having an identical state. Most systems in our much more complicated data sphere can’t handle instant and universal replication which leads to lag times.
Asynchronous in this situation “is known as strong consistency which creates more demand on your system and has a greater frequency of blocking because it must be synchronous,” Yu explained, referencing Jepsen’s Consistency Model.
In modern systems, it all comes down to strong, not absolute consistency.
“We have to be really clear what we mean when we say a system is designed with consistency in mind,” she continued.
Next she attacked the A for Availability because she says availability cannot be examined without latency considerations, which the CAP theorem doesn’t factor in.
“Don’t run a distributed system until you absolutely have to.” — Denise Yu
“We tend to think of it as a binary state but it’s not because of network latency. How do we know if it’s really unresponsive or really just responding slowly?”
Yu says we need to set up a timeout limit.
“The first time you might as well roll some dice. You need monitoring and observation to understand what a reasonable timeout is for you,” she continued, noting that some database management systems have fuzzy timeouts built into them, like Cassandra.
Yu disputed the CAP theorem on the P for Partition Tolerance, that interruption that occurs between a network partition, fault or split across two data centers or clouds. She says everything hinges on the partition because with distributed systems it’s completely unavoidable.
Calling it a whole other “genre of failure” Yu said that “during a partition event, your nodes may as well be on opposite sides of a wormhole. There is no way to know the state on the other side.”
The standard proof of the CAP theorem considers two ways of responding to network partitions:
- Let clients keep reading and writing on both sides of split, which leads to a loss of linearizability.
- Or you can pause one side until the partition stops but that is a loss of availability.
Either way, partitioning is inevitable. You just have to decide how much your systems can handle.
She quoted Google’s Jeff Dean in saying: “In the first year of a Google cluster’s life, it will experience five rack failures, three router failures, and eight network maintenances.”
Yu noted that “Clearer failed states help increase reliability,” using 500-something error codes to clarify for users.
So now that we understand more clearly how things in distributed systems fail, it’s time to discuss why they fail and what can be done to limit that inevitable failure.
What Do We Say to the God of Total Downtime? Not Today!
Why do distributed systems fail? Hardware, routers and network cables will give out eventually and nefarious outside forces will eventually destroy your connectivity.
Yu added to this list that “Software will behave weirdly. Resource isolation is never going to be perfect and static — and you don’t want it to be totally static. VMs [virtual machines] will do what we call ‘bursting’ — briefly spike in CPU usage — will slow your process down for seconds or even minutes.”
She continued that there can be stop-the-world-garbage collection when a container stops for seconds or even minutes depending on the machine or network glitches just randomly happen. In 2009, someone evidently crawled into a California manhole and just started chopping — no theorem is going to help that!
It comes down to preparing for the likelihood of failure. First and foremost we mitigate this by observing and monitoring our systems, and learning which of the following options work best for your systems.
One way Yu offered is what’s called Leader-Follower Failover Pattern, which is more explicitly explained in her infographic below.
She describes the leader-follower pattern as replicating data across nodes in a cluster.
“If a follower fails, it’s no big deal — usually — because other nodes can continue to serve read requests,” Yu said.
If a “leader node” fails than a node cluster automatically initiates a failover to elect a new leader, following these steps:
- Detect that a node is offline. Most database management systems set this to a timeout at 30 seconds by default.
- Elect a new leader from remaining nodes, usually with the most consistent and current data. Yu says some databases are tiny dictatorships where one controller node chooses.
- Traffic is automatically rerouted to the new leader for all future write requests.
Yu then pointed out that sometimes all the embattled timeout nodes may return from war at the same time, ending up with multiple leaders.
“Things get really awkward and data will usually get lost,” she said.
This is when the split-brain strategy can be applied, purposely partitioning leader nodes.
RabbitMQ does this when it detects Mnesia nodes who haven’t communicated in about 60 seconds, which are then flagged as unreadable. These are potential coping mechanisms further described here:
- Pause Minority
In the Kafka Partitioning Strategy, if the leader node drops off, an election is triggered for best node.
Apache Zookeeper is a “thick API” pattern that uses in-sync replica nodes to go back to the most accurate data, sacrificing the most up-to-date consistency.
Finally, both Paxos and Raft are consensus algorithms you can use to elect a new leader.
No matter how you choose to address them, network partitions are distributed systems’ version of death and taxes.
“Network partitions are unavoidable if you want to run distributed systems. You have to decide what makes the most sense operationally for you — you have to choose a data store that chooses availability and consistency options that you’re comfortable with,” Yu said.
Finally, she warned: “And don’t run a distributed system until you absolutely have to.”
Pivotal is a sponsor of The New Stack.
All illustrations courtesy of Denise Yu’s Creative Commons, Feature photo of Yu courtesy of Richard Davies.