The Strange Behaviors of Facebook’s Metastable Failures
If the cloud truly has become the computer, as HP recently suggested, then the types of engineering problems that plagued the first and second generations of computers may rear their ugly heads, or head their ugly rears, within the first and second generations of cloud data centers. The behavioral dynamics of the devices we hold in our hands would scale out, translating into the device that spans the planet.
Metastable failure is a term that Facebook engineer Nathan Bronson has just reintroduced into the vocabulary of data center engineers, particularly those who work on Internet-scale systems. It’s a state of unpredictable outcomes caused by something whose state is actually in transition — in a binary system, such as a switch, that would mean the middle ground between one state and the other.
With many types of electronic devices, for the results of a logical operation to follow a predicted order, they must take place with synchronicity. A clock governs the rhythm of events, and states that may be subject to change are kept on hold until the next “clock edge,” like knowing when the conductor of an orchestra is about to drop his hands on the “down” beat. This creates a challenge for a real-time sensor, which cannot with absolute assuredness signal every event at the moment it happens; it has to wait for the clock edge, so that the setup is in order.
Some of the first computers, especially in the 1950s and early ’60s, had difficulty achieving this perfect synchronicity. Engineers would see anomalous results and, beyond that, chains of subsequent results and continued patterns of bizarre behavior that required a complete shutdown and restart. The cause was eventually attributed to what engineers dubbed metastability: the result of a set of operations triggered by a circuit whose logic state was in between 0 and 1, where it’s not supposed to be.
You can imagine the genuine fear that engineers had when accounting for possible metastability for the guidance computers in the Apollo missions.
In microprocessor design, because metastability can happen, engineers have built arbiters to detect such conditions and forestall their possible consequences. This kind of arbitration has been going on successfully since the 1970s, for the entire duration of the x86 era, so we’ve never had to discuss metastability openly as a casual topic of interest.
Until now. Cloud data centers and Internet-scale systems are the electronic devices of the modern day. Instead of logic circuits, we have servers. But the engineers of these systems have to deal with timing issues, with ensuring that certain events happen with synchronicity. And they’re doing so using a wire transport protocol that is not specially adapted to the task: TCP/IP is a best-effort system for forwarding packets downstream and letting the recipient sort out the mess at the receiving end. It’s no more intelligent about how or when a message will be received than a copper wire is about the fate of any of the electrons it carries.
Thus it’s not Intel or ARM which is facing the new wave of metastability head-on, but Facebook and Google and Amazon. They’re the ones facing the new behavioral issues for the first time. And very few of them, understandably, recognize a certain peculiar, historical similarity.
A Simple Mental Model
Except for Nathan Bronson. Stanford-educated and a former intern at the Institute for Defense Analyses in the early ’90s, Bronson describes his state of mind on his LinkedIn profile:
I believe that to tackle big problems one must factor complexity into pieces that can each fit in someone’s brain, and that the key to such factoring is to create abstractions that hide complexity behind a simple mental model.
The model Bronson adapted to the task at hand was metastability. In a Facebook company blog post last month, he described the behavior Facebook witnessed when its systems tried to head off traffic bottlenecks. Like many large-scale networks, it resorts to link aggregation: enabling broader switch-to-switch connections using multiple links in parallel, like opening up extra lanes. For parallel operations to happen reliably, as you can imagine, they require a modicum of synchronicity. But they can’t always get it, because TCP/IP, which is asynchronous by design, has no analog for the clock circuit, the part that acts as the conductor in a small device.
For over two years, reports Bronson, aggregated links at Facebook were tending to become overloaded, when hashes would direct packets to be aggregated into a single link. Inevitably, packets would be dropped. Bronson recognized this behavior as a form of metastability, producing the same patterns of anomalous results that you’d find in a simpler, synchronous logic device.
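To make the hashing step concrete, here is a minimal sketch of how a flow might be mapped onto one member of an aggregated link. It is an illustration under stated assumptions, not Facebook’s or any switch vendor’s actual algorithm: the CRC32 hash, the function names `pick_link` and `link_load`, and the addresses and ports are all hypothetical.

```python
import zlib
from collections import Counter

def pick_link(src_ip, src_port, dst_ip, dst_port, n_links):
    """Map a flow to one member link by hashing its addresses and ports.

    CRC32 is an assumption made for this sketch; real switches use their
    own hash functions.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_links

def link_load(flows, n_links):
    """Count how many flows land on each member link."""
    return Counter(pick_link(*flow, n_links) for flow in flows)

# Hypothetical flows from one cache server to one database server.
flows = [("10.0.0.1", 40000 + i, "10.0.1.1", 3306) for i in range(16)]
load = link_load(flows, 4)
for link in range(4):
    print(f"link {link}: {load[link]} flows")
```

Every packet of a given flow follows the same member link, so nothing in the hash itself guarantees an even spread. If several busy flows happen to hash to the same link, that link can saturate while its neighbors sit idle.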
Specifically, these results were observed when traffic was exchanged between Facebook’s MySQL databases and its social graph cache servers, which Facebook calls TAO. By any chance, does this look familiar to you?
Months upon months of remaining stymied brought Bronson and the other Facebook engineers to the point of pretending the anomaly was the product of a skillfully crafted malicious exploit. If this behavior were not by coincidence but by design, they reasoned, for what purpose would it have been crafted? A real malicious exploit generates behaviors that are informative about the systems it attacks, and the core of the exploit responds to what it learns in kind.
What the TAO “exploit” produced, as Bronson tells the story, were latencies whose patterns serve as signals. A malicious exploit looking for such latency patterns could study their regularities and determine not only that the links producing them were congested, but also to what degree. An exploit intentionally causing the results the engineers were witnessing could conceivably conclude that any link not generating the same signals was not congested, and therefore shut it down to increase the burden on the congested link.
As the engineers then realized, their own programming was doing exactly that. With the best of intentions, it was designed to open up as many links as there are database queries over the previous 10 seconds. If a user’s personal graph data is not in the cache, it may take some 100 queries to the database to refresh that cache.
Bronson writes: “This starts a race among those queries. The queries that go over a congested link will lose the race reliably, even if only by a few milliseconds. That loss makes them the most recently used when they are put back in the pool. The effect is that during a query burst we stack the deck against ourselves, putting all of the congested connections at the top of the deck.”
The solution was simple and effective: The team switched from a most recently used (MRU) connection pool to a least recently used (LRU) one. Attempting to re-create the anomalous behavior with the LRU scheme was unsuccessful.
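The deck-stacking effect Bronson describes can be reproduced in a toy simulation. What follows is a minimal sketch, not Facebook’s code: the `Pool` class, the `simulate` function, the pool size of eight, the two congested connections, and the latency figures are all assumptions chosen for illustration. A burst uses every connection at once and returns them in finishing order; a run of single queries then draws from the pool one at a time.

```python
from collections import deque

CONGESTED_MS = 5  # hypothetical latency over a congested link
NORMAL_MS = 1     # hypothetical latency over a healthy link

class Pool:
    """Idle-connection pool; take() always draws from the front."""

    def __init__(self, conn_ids, policy):
        assert policy in ("MRU", "LRU")
        self.policy = policy
        self.idle = deque(conn_ids)

    def take(self):
        return self.idle.popleft()

    def give_back(self, conn):
        if self.policy == "MRU":
            self.idle.appendleft(conn)  # last returned is reused first
        else:
            self.idle.append(conn)      # last returned waits its turn

def simulate(policy, n_conns=8, congested=frozenset({0, 1}),
             burst=8, singles=20):
    """Return the fraction of post-burst queries sent over congested links."""
    pool = Pool(range(n_conns), policy)

    def latency(conn):
        return CONGESTED_MS if conn in congested else NORMAL_MS

    # Burst: queries race, and connections return in finishing order,
    # so the congested ones come back last.
    in_flight = [pool.take() for _ in range(burst)]
    for conn in sorted(in_flight, key=lambda c: (latency(c), c)):
        pool.give_back(conn)

    # Steady trickle of single queries after the burst.
    hits = 0
    for _ in range(singles):
        conn = pool.take()
        hits += conn in congested
        pool.give_back(conn)
    return hits / singles

for policy in ("MRU", "LRU"):
    print(policy, simulate(policy))
```

Under these assumptions, the MRU pool sends every post-burst query over a congested connection, because the slowest finisher sits at the top of the deck and is handed out again each time. The LRU pool rotates through all eight connections, so the congested pair sees only its occasional turn.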
This is the type of re-engineering work that developers at Internet scale will find themselves doing continually from this point forward, as the communications networks of today continue to exhibit the strange behavioral characteristics of the electronic devices of yesterday. It’s the type of work we’d like to talk more about with Facebook directly. Facebook declined our request for an interview to go into more depth on this topic… exhibiting a pattern that also reminds us of the engineers of the past who, upon realizing that being first to market with a genuine solution gave them a competitive advantage, became more silent about their work. It was a silence which would, in time, give rise to the era of personal computing, whose engineers were more free to talk openly.