An Amazon Anomaly That Metastasized into a Server-Eating Monster
When an application scales linearly across several virtual machines, any peculiarities in their behavior can scale exponentially. Such was the case for Clever, a San Francisco-based startup that offers a mobile login ecosystem for educators and students. And like many startups, Clever runs on Amazon EC2.
Colin Schimmelfing is a 2010 graduate of Swarthmore College in Pennsylvania, who in an effort to win a mere hackathon a few years ago, ended up designing Quantcast’s real-time quality control measurement system. Now a Clever engineer, Schimmelfing wrote on the Clever blog this week about a tiny behavior anomaly that disappeared whenever anyone tried to look for it. When multiple instances of that behavior were spun up in Amazon Web Services (AWS), that anomaly metastasized into a beast that claimed the health of one-third of their worker instances.
“Right when we needed our infrastructure to ‘just work,’” writes Schimmelfing, “we started getting strange errors and boxes started coming up good as dead.” He called this phenomenon a “Heisenbug,” programmer slang for a bug that changes or vanishes when someone tries to observe it. The name is a nod to physicist Werner Heisenberg, whose uncertainty principle is popularly, if loosely, associated with the idea that the act of measuring a quantum system disturbs it.
Clever mainly serves educators, but that customer base is not all that different from most organizations. Like retail service providers, its traffic patterns tend to be seasonal. Clever’s traffic triples in early fall, around the start of the school year, says Schimmelfing.
The database of choice for Clever is MongoDB, which uses a sharded replica set. In a common MongoDB setup, there can be at least nine nodes. Of those, two are data nodes to which the replica set is assigned. Clever spins up as many large-scale EC2 instances as it needs to serve as what it calls “workers,” which communicate with the replica set nodes, which MongoDB calls “mongods.”
It was last fall when Clever needed to spin up a boatload of new instances to meet peaks in demand from its educator clients. The one-third of workers that failed their health checks could not make contact with any mongods. As Schimmelfing described, even an ordinary netcat command (which abbreviates to nc) would fail to find a connection to a mongod over MongoDB’s default port 27017.
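The check Schimmelfing describes is easy to reproduce. As a minimal sketch, here is roughly what a netcat-style probe of port 27017 does, written in Python; the function name and the timeout value are illustrative assumptions, not anything taken from Clever’s tooling.

```python
import socket


def can_connect(host: str, port: int = 27017, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    27017 is MongoDB's default port; the 3-second timeout is an arbitrary choice.
    """
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

On one of the unhealthy workers, a call such as can_connect("10.0.0.12"), with the address standing in for a real mongod, would presumably have returned False, even as the same call from a healthy worker returned True.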
It wasn’t an Amazon availability zone problem, because other nodes could access the same mongods just fine. And it wasn’t an issue with the port, because other nodes in the same security group were using the port successfully.
Enter Amazon’s paid technical support personnel, for whom Schimmelfing has nothing but praise. As a check of the output from tcpdump revealed, when a mongod failed to send its MAC address back to a worker, it wasn’t because the mongod wasn’t receiving. Rather, the worker wasn’t receiving the mongod’s acknowledgment. What’s more, whenever anyone ran traceroute on the ill-behaving worker, the strange behavior immediately vanished. It might not stay gone, and newly spun up instances might exhibit the same behavior again.
A mystery like that might have made a ravenously curious fellow like Heisenberg put down Schrödinger’s cat and take up cloud service administration.
Amazon’s and Clever’s investigation eventually turned up the culprit. It dates back to 1982, long before a concept of security was ever envisioned for an Internet Protocol network.
It has to do with the obvious fact that IP addresses do not directly resolve to MAC addresses.
And when sending packets over Ethernet, both the source and the destination must be spelled out as Ethernet hardware addresses, better known as MAC addresses. So in early TCP/IP, the Address Resolution Protocol (ARP) was created to let a host find the MAC address of whichever machine on its local network currently holds a given IP address.
It’s a broadcast message: a host sends it, every machine on the local network segment hears it, and the one that owns the requested IP address replies with its MAC address. As you can imagine, address resolution is the sort of thing that takes place on an IP network… um, somewhat frequently.
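To make the broadcast concrete, here is a minimal sketch that assembles the bytes of an ARP request for IPv4 over Ethernet, following the field layout of the 1982 specification (RFC 826). It only builds the frame and does not send it; the function names and the sample addresses are illustrative.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6  # Ethernet broadcast: every host on the segment receives it
UNKNOWN_MAC = b"\x00" * 6    # target hardware address is what the request asks for


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP 'who has target_ip?' request."""
    sha = mac_to_bytes(sender_mac)
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = struct.pack("!6s6sH", BROADCAST_MAC, sha, 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length in bytes
        4,                            # protocol address length in bytes
        1,                            # opcode 1 = request (2 = reply)
        sha,                          # sender hardware address
        socket.inet_aton(sender_ip),  # sender protocol address
        UNKNOWN_MAC,                  # target hardware address (not yet known)
        socket.inet_aton(target_ip),  # target protocol address
    )
    return ethernet + arp


frame = build_arp_request("0a:00:00:00:00:01", "10.0.0.5", "10.0.0.12")
```

The answer to that question is the sender of the reply: only the host that holds the target IP address fills in its own MAC address and responds.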
At the rate the Internet was growing, by 1990, it could have evolved into a cacophony of ARP broadcasts, like a Philip Glass composition played at 10x speed.
So the ARP cache was created, a little table in memory that stores recently resolved IP-to-MAC address mappings. A host only sends an ARP request onto the network if the IP address it’s looking for isn’t already in the cache. At first, 20 minutes seemed like a reasonable time to keep entries cached. Surely a host isn’t going to change its IP address every 20 minutes. Then it became 10, and in recent years, 5.
As it turned out, Clever’s ARP caches continued to contain MAC address data for dead instances, for just enough minutes to make the whole process time out. Whenever the mongod tried to respond back to the worker, it thought it was responding to the correct address… which probably was correct five minutes ago.
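A toy model shows how this plays out. The sketch below treats the ARP cache as a plain table with a fixed time-to-live, and assumes (as the account implies) that a new worker is handed the IP address of a recently terminated one. A real Linux neighbor table is more elaborate than this, so the class, the addresses, and the 300-second lifetime are all simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    mac: str
    learned_at: float  # seconds, on whatever clock the caller uses


class ArpCache:
    """Simplified ARP cache: entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self.entries: Dict[str, Entry] = {}

    def learn(self, ip: str, mac: str, now: float) -> None:
        """Record the MAC address seen in an ARP reply for `ip`."""
        self.entries[ip] = Entry(mac, now)

    def lookup(self, ip: str, now: float) -> Optional[str]:
        """Return the cached MAC, or None if absent or expired.

        None means the host would have to broadcast a fresh ARP request.
        """
        entry = self.entries.get(ip)
        if entry is None or now - entry.learned_at >= self.ttl:
            self.entries.pop(ip, None)
            return None
        return entry.mac

    def flush(self) -> None:
        """Drop every entry, forcing re-resolution on the next lookup."""
        self.entries.clear()


OLD_WORKER_MAC = "0a:00:00:00:00:01"  # instance that has since been terminated
NEW_WORKER_MAC = "0a:00:00:00:00:02"  # replacement instance, same IP address

mongod_cache = ArpCache(ttl_seconds=300)
mongod_cache.learn("10.0.0.5", OLD_WORKER_MAC, now=0)

# At t=60 the old worker dies and a new one comes up at 10.0.0.5.
# At t=90 the mongod replies to the new worker, but its cache still
# points at the dead instance's MAC address, so the reply never arrives.
stale = mongod_cache.lookup("10.0.0.5", now=90)

# Only once the entry expires does the mongod ask the network again.
expired = mongod_cache.lookup("10.0.0.5", now=300)
```

In this model, anything that forces the mongod to resolve the address again before the entry expires, such as calling flush(), closes the window in which replies go to the wrong machine.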
ARP is a relic of 30-year-old systems architecture, devised before the Internet became a human right. And we might not have stumbled onto it unless and until a very large scale service from the 21st century began utilizing it in a way it was never intended: as an up-to-the-second directory of MAC addresses.
For now, Schimmelfing says, Amazon suggested that Clever use a cron job (a simple, scheduled instruction) to flush its mongods’ ARP caches every five minutes or so. And for now, that solution seems to work. But what about next year, when Clever’s business will probably triple once again? Does Colin make it two minutes next time? One minute? Or is it long past time for a Linux engineer to make a trip downstairs, into the dusty basement of old Unix commands, for some heavy-duty fine tuning?
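The account doesn’t give the exact command Clever scheduled, so what follows is only a sketch of what such a job might look like. It assumes a Linux host with the iproute2 ip tool, whose “ip neigh flush all” subcommand clears the neighbor (ARP) table; the script path in the usage line is hypothetical.

```python
import subprocess
from typing import List

# Assumption: iproute2 is installed and the script runs with root privileges.
FLUSH_COMMAND = ["ip", "neigh", "flush", "all"]


def flush_arp_cache(dry_run: bool = True) -> List[str]:
    """Flush the kernel's neighbor table; with dry_run, only report the command."""
    if not dry_run:
        subprocess.run(FLUSH_COMMAND, check=True)
    return list(FLUSH_COMMAND)


def crontab_line(script_path: str, every_minutes: int = 5) -> str:
    """Build a user-crontab entry that runs `script_path` every N minutes."""
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    return f"*/{every_minutes} * * * * {script_path}"


# Hypothetical install location for a script that calls flush_arp_cache(dry_run=False).
line = crontab_line("/usr/local/bin/flush_arp.py", every_minutes=5)
```

Dropping the interval to two minutes or one is a one-argument change, which is exactly why the question of how far to push it is worth asking.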