
A Site Reliability Engineer’s Advice on What Breaks Our Systems

4 Oct 2020 6:00am, by

Black_Swan_2_-_Pitt_Town_Lagoon - Creative Commons via Wikipedia by JJ Harrison at jjharison dot com dot au

Is there a pattern to our problems? One site reliability engineer compiled a list of “What Breaks Our Systems: A Taxonomy of Black Swans.”

Laura Nolan, who has been a software engineer in the industry for over 15 years (most recently as one of Google’s staff site reliability engineers in Ireland) shared her list in a memorable talk at the LISA conference of the USENIX computing systems association. Nolan defined “black swan” events as those rare unforeseen catastrophic issues “that take our systems down, hard, and keep them down for a long time.” But more importantly, she also offered suggestions on “how we can harden our systems against these categories of events.”

The “taxonomy” breaks catastrophic events into six categories.

  • Hitting Limits
  • Spreading Slowness
  • Thundering Herds
  • Automation Interactions
  • Cyberattacks
  • Dependency Problems

“Limits problems can strike in many ways,” Nolan stated, citing system resources like RAM and logical resources like buffer sizes and transaction IDs. Nolan suggests load and capacity testing everything, including your cloud services.

“I once spent a year of my life doing very little else but load testing,” she said.

Use a good replica of your production environment (including write loads, backups, startups, and resharding) with a data set that grows beyond your current load sizes, “so you know if you’re going to hit one of these kinds of limits.” And don’t forget to also test ancillary data stores like ZooKeeper.

She also advised monitoring, noting that “The best documentation of known limits is a monitoring alert.” (“Document it,” says Nolan, “explain the nature of the limit and what the proposed solution is.”) As Nolan put it, “And then if you’ve gone away, and it’s five years down the line, the people who get that alert, later on, will have warning and will be able to do something about it.”

Nolan added one more useful tip. “Lines on your monitoring graphs that show limits are really useful.”
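One way to read this advice is to store each known limit next to its explanation, so that the alert itself carries the documentation. The Python sketch below is an illustration only: the `KnownLimit` type, the `check_limit` function, the 80 percent warning threshold, and the sample note are all assumptions made for this example, not something shown in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownLimit:
    """A documented hard limit (hypothetical structure for illustration)."""
    name: str
    limit: float          # the hard ceiling, in the metric's own unit
    warn_fraction: float  # alert once usage reaches this share of the limit
    note: str             # nature of the limit and the proposed remedy


def check_limit(known: KnownLimit, current: float) -> Optional[str]:
    """Return an alert message if usage is close to the limit, else None."""
    if current < known.limit * known.warn_fraction:
        return None
    percent = 100.0 * current / known.limit
    return f"{known.name} at {percent:.0f}% of limit. {known.note}"


TB = 1024 ** 4

# Example values are assumptions: a 2TB filesystem ceiling, warn at 80%.
db_volume = KnownLimit(
    name="db-volume-bytes",
    limit=2 * TB,
    warn_fraction=0.8,
    note="Filesystem caps file size at 2TB; plan a migration before then.",
)

alert = check_limit(db_volume, 1.8 * TB)
```

The same `limit` value could also be drawn as a horizontal line on the matching monitoring graph.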

Drawing on industry post-mortems, Nolan illustrated her advice with real-world examples. For hitting limits, one example is a 2017 event at Instapaper. The company was running a production database on Amazon RDS for MySQL. “Unbeknownst to them, they were backed onto an ext3 filesystem which had a 2TB limit. They hit that, all of a sudden their database stops accepting any writes, and it took them more than a day to get back up…”

And in 2010 Foursquare experienced a total site outage for 11 hours after a MongoDB shard “outgrew its RAM,” seriously degrading their performance. Or, as Nolan puts it, “Resharding while at full capacity is hard.”

For catastrophe type #2, “Spreading Slowness,” Nolan drew on a 2018 post-mortem from HostedGraphite. “AWS had problems, and HostedGraphite went down. But they weren’t running in AWS,” she said. Instead, the slower connections coming from AWS piled up and eventually saturated the load balancers at HostedGraphite.

And a 2017 incident at Square was caused by a problem with its authentication system. “I think anyone knows, who’s run a site, that if your [authentication] system is bad then everything is bad,” she said. The culprit was “a snippet of code that was retrying a Redis transaction up to 500 times with zero backoff.” Nolan likens this to doing a denial of service attack against your own Redis cluster.

Fortunately, she suggested an appropriate defense: “Fail fast.”

“It’s hard to defend against this completely, but guidelines are that failing fast is better than failing slow,” Nolan explained. “If you have load on your service that you know that you cannot possibly ever serve, don’t just have queues that can grow indefinitely. Cut off. Have a way that your service can say, ‘I’m overloaded,’ stop.”
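The "cut off" idea can be sketched as a bounded intake queue that rejects new work once it is full, rather than letting the backlog grow. This is a minimal Python illustration under stated assumptions: the `BoundedIntake` class, the `Overloaded` exception, and the queue size are hypothetical names and values chosen for the example.

```python
import queue


class Overloaded(Exception):
    """Raised when the service refuses work it cannot serve."""


class BoundedIntake:
    """Accepts requests up to a fixed backlog, then fails fast."""

    def __init__(self, max_pending: int):
        # A bounded queue: it can never grow past max_pending items.
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        """Enqueue a request, or reject it immediately if the queue is full."""
        try:
            self._pending.put_nowait(request)
        except queue.Full:
            raise Overloaded("I'm overloaded: request rejected") from None

    def next_request(self):
        """Take the oldest pending request (raises queue.Empty if none)."""
        return self._pending.get_nowait()


intake = BoundedIntake(max_pending=2)
intake.submit("a")
intake.submit("b")
try:
    intake.submit("c")   # third request exceeds the backlog limit
    rejected = False
except Overloaded:
    rejected = True
```

A caller that receives `Overloaded` learns right away that it should back off, instead of waiting on a request that will never be served in time.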

Nolan also recommended deadlines for all requests, both into and out of a service. “Don’t just let things sit there and tie up resources if they’re not progressing.” Limit retries — as a rule, no more than three, increasing the “backoff” time between retries each time.
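A rough Python sketch of that guidance follows: at most three retries, a doubling backoff between them, and an overall deadline. The function name `call_with_retries`, the default timings, and the `DeadlineExceeded` exception are assumptions for illustration. Note that this sketch only checks the deadline between attempts; it does not interrupt an attempt that is already running, and real code would normally retry only errors known to be transient.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the next retry would run past the overall deadline."""


def call_with_retries(operation, *, deadline_s=2.0, max_retries=3,
                      base_delay_s=0.1, sleep=time.sleep,
                      clock=time.monotonic):
    """Run operation(), retrying failures with exponential backoff."""
    give_up_at = clock() + deadline_s
    delay = base_delay_s
    for attempt in range(max_retries + 1):   # first try plus the retries
        try:
            return operation()
        except Exception as exc:  # assumption: every error is retryable
            if attempt == max_retries:
                raise                         # out of retries: surface it
            if clock() + delay >= give_up_at:
                raise DeadlineExceeded("no time left to retry") from exc
            sleep(delay)
            delay *= 2                        # back off further each time
```

The `sleep` and `clock` parameters exist only so the behavior can be exercised without real waiting.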

Another solution: implement a circuit-breaker pattern. “Instead of dealing with retries on a per-request basis, you put a little widget of software in front that sees all the requests going out from a particular client, which might be serving many many user requests, to a backend service. And if it sees across all these requests that the back end service is not healthy, it’ll fail those requests fast, before they even go to the unhealthy service.”
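The "little widget of software" Nolan describes might look something like the following Python sketch. It is a simplified, single-threaded illustration, not a production implementation: the `CircuitBreaker` class, the failure threshold, and the cool-down period are assumed values, and the single trial request allowed after the cool-down is one common variant of the pattern rather than anything specified in the talk.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is rejected without contacting the backend."""


class CircuitBreaker:
    """Tracks failures across all calls to one backend (not thread-safe)."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Backend looks unhealthy: fail fast, do not send the call.
                raise CircuitOpen("backend marked unhealthy")
            # Cool-down elapsed: let this one call through as a trial.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open (or re-open)
            raise
        # Success: close the circuit and forget earlier failures.
        self._failures = 0
        self._opened_at = None
        return result
```

Because one breaker instance sees every outgoing call to a backend, it can stop sending traffic after a handful of failures instead of letting each user request discover the problem on its own.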

Also, use dashboards (for utilization, saturation, and errors). “At the root, most of these problems are caused by something getting saturated somewhere. So the key to quick debugging and remediation is to be able to find what is saturated,” she said.

“There’s nothing worse than sitting there looking at your array of microservices that compose your site, your site’s at a crawl, and you don’t even know which service is actually causing it. These sorts of grey, slow dragging errors can be really hard to find the root cause of. So this will really help you.”

“Thundering Herds” refers to those sudden spikes in demand. “A classic old-school way of doing this would be having all the cron jobs in your organization kick off at midnight,” Nolan explained. But other examples include mobile clients all updating at a specific time, and large batch jobs.

“I worked at Google, so we had to worry about people starting up 10,000 worker MapReduces, especially the interns,” she said.

For example, Nolan referred to an incident at Slack in 2014. Slack uses a WebSockets-based API, “so clients have long-running sessions with the backend. They had to restart one of their devices, a server. It caused something like 13% of their user base to be disconnected — and they all reconnected right away. And in Slack’s architecture, on reconnection the client tries to query a whole bunch of stuff, like what channels are in the Slack team, what users are in it, what recent messages. So all this coordinated demand saturated their database.”

The defenses suggested are testing — and planning. “The real defense is to not fall into the trap of thinking, ‘Oh well, how would I get a thundering herd to my service?’ Because as we’ve seen, there are all sorts of different ways this can happen,” she said.

“So think about degraded modes. If your service is too busy to serve, can you serve something static? Can you serve a smaller dataset size? Are there some requests that you can drop…? Maybe you can queue input into some lightweight cheap queue, and then do the heavy processing later, as opposed to having to do the heavy processing synchronously…

“And test it.”
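The degraded modes Nolan lists (defer heavy work to a cheap queue, drop some requests) can be sketched as a small decision function. The Python below is an illustration under assumptions: the `DegradableService` class, the load threshold, and the in-memory `deque` standing in for a "lightweight cheap queue" are all hypothetical choices for this example.

```python
from collections import deque


class DegradableService:
    """Processes work now, defers it when busy, drops it when swamped."""

    def __init__(self, heavy, busy_load=0.8, max_deferred=1000):
        self._heavy = heavy              # the expensive processing step
        self._busy_load = busy_load      # load (0.0 to 1.0) considered busy
        self._max_deferred = max_deferred
        self._deferred = deque()         # stand-in for a cheap queue

    def handle(self, item, load):
        """Return (status, result) for one incoming item."""
        if load < self._busy_load:
            return ("done", self._heavy(item))     # normal mode
        if len(self._deferred) < self._max_deferred:
            self._deferred.append(item)            # degraded: do it later
            return ("queued", None)
        return ("dropped", None)                   # last resort: shed it

    def drain(self):
        """Run the deferred heavy processing once the spike has passed."""
        results = []
        while self._deferred:
            results.append(self._heavy(self._deferred.popleft()))
        return results
```

A load test that drives `load` above the threshold is one way to follow the "test it" advice and confirm the degraded paths actually work.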

Another cause of black swans is automation interactions. While automation is great, Nolan warned, “it’s not entirely safe, let’s just say.” She then described the time in 2016 when Google accidentally “erased” its content delivery network.

An engineer had innocently tried to send one rack of machines to the disk-erase process, but “Due to an unfortunate regex incident, it matched everything in the world,” Nolan explained. This resulted in slower queries and network congestion for two days until the system was restored.

“It was quite a lot of machines,” Nolan remembers, but “Lucky enough, Google had basically planned that these machines were reducing latency rather than being something that absolutely had to be there for the site to work. So things were a little bit slower, and there was a bit more congestion for a few days until it was all rebuilt. But everything still worked. So this was sort of a near miss.”

Besides placing some constraints on automation operations (and having a way to turn them off), the real solution here is control, including a log for visibility into automation. “A Slack channel works, your logging system — whatever makes sense.”
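One possible constraint of this kind is a guard that refuses to act when a pattern matches more of the fleet than expected, and that logs what it is about to do. The Python sketch below is hypothetical: the `select_targets` function, the 5 percent default cap, the host naming scheme, and the logger name are assumptions made for illustration, not a description of Google's tooling.

```python
import logging
import re
from typing import List, Sequence

log = logging.getLogger("automation")


class TooManyTargets(Exception):
    """Raised when a selection exceeds the allowed share of the fleet."""


def select_targets(pattern: str, fleet: Sequence[str],
                   max_fraction: float = 0.05) -> List[str]:
    """Return hosts fully matching pattern, refusing oversized matches."""
    regex = re.compile(pattern)
    matched = [host for host in fleet if regex.fullmatch(host)]
    if len(matched) > max_fraction * len(fleet):
        # Constraint: a single operation may not touch most of the fleet.
        raise TooManyTargets(
            f"{pattern!r} matched {len(matched)} of {len(fleet)} hosts"
        )
    # Visibility: record what the automation selected.
    log.info("pattern %r selected %d hosts", pattern, len(matched))
    return matched


# Hypothetical fleet: 10 racks of 10 hosts each.
fleet = [f"rack{r}-host{h}" for r in range(10) for h in range(10)]
one_rack = select_targets(r"rack3-host\d", fleet, max_fraction=0.1)
```

With this guard in place, a pattern such as `.*` raises `TooManyTargets` instead of selecting every machine.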

When it comes to cyberattacks, “The defense here is to minimize the blast radius,” Nolan told her audience. She advised separating production from non-production workloads as much as possible. “Break production systems into multiple zones. Limit and control communication between them,” she said.

“Google does a really cool thing,” Nolan remembered. “All the requests into production, all the traffic into production, like SSH’s and everything, goes through proxies. So there’s defense-in-depth here. Things are locked down at the network level, and then at the application level only what those proxies allow can get in.”

For example, Nolan pointed to a 2017 incident in which the global shipping company Maersk was infected by the NotPetya malware. One of its office machines ran vulnerable accounting software. The company couldn’t unload ships or take bookings for days, and it lost 20 percent of its business to the attack.

Nolan’s recommendation? “Validate and control what runs in production.”

The list’s final source for catastrophic failures is dependency problems.

Nolan issued a challenge: Can you start up your entire service from scratch, with none of your infrastructure running? She warned that “simultaneous reboots happen. This is a bad time to notice that your storage infra depends on your monitoring to start, which depends on your storage is up…”

Nolan also warned that “You can easily get into these kinds of situations in modern microservices architectures.” Her example? A two-hour outage at GitHub in January of 2018. The company “had a power disruption, just a minor blip in their power supply. 25% of the boxes in their main data center rebooted.” But when it turned out the Redis cluster remained “unhealthy,” the company’s engineers discovered it had also “unintentionally turned into a hard dependency, so they had to go and actually rebuild that cluster before they could get back up.”

She also summarized a Trello outage in 2017. “AWS S3 outage brought down their frontend web app. Trello API should have been fine but wasn’t. It was checking for the web client being up, even though it didn’t otherwise depend on it.” Or, as Nolan put it, “their API services were checking if their frontend was up, for just no good reason at all.”

The solution involves layering your infrastructure. “Decide what layer each service is in, and only let it depend on things in lower layers,” Nolan explained. “Test the process of starting your infrastructure up. How long does that take with a full dataset?”
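The layering rule lends itself to an automated check. The Python sketch below assigns each service a layer number and reports any dependency that points at the same or a higher layer. The service names, layer numbers, and the `find_layer_violations` function are invented for this illustration, loosely echoing the storage/monitoring and API/frontend examples above.

```python
from typing import Dict, List, Tuple


def find_layer_violations(layers: Dict[str, int],
                          depends_on: Dict[str, List[str]]
                          ) -> List[Tuple[str, str]]:
    """Return (service, dependency) pairs that break the layering rule.

    A service may only depend on services in strictly lower layers.
    """
    violations = []
    for service, deps in depends_on.items():
        for dep in deps:
            if layers[dep] >= layers[service]:
                violations.append((service, dep))
    return sorted(violations)


# Hypothetical layout: lower numbers start first.
layers = {"storage": 0, "monitoring": 1, "api": 2, "frontend": 3}
depends_on = {
    "storage": ["monitoring"],        # violation: depends on a higher layer
    "monitoring": ["storage"],
    "api": ["storage", "frontend"],   # violation: API checks the frontend
    "frontend": ["api"],
}

violations = find_layer_violations(layers, depends_on)
```

Run as part of a build or review step, a check like this could flag a soft dependency on a higher layer before it hardens into a startup cycle.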

And of course, beware of soft dependencies that over time become hard dependencies…

And the talk concluded with a few kind thoughts under the heading “Psychology.”

“If you’re managing a team and somebody has just been through a horrible day-long or multi-day incident, try and relieve them and give the on-call shift to someone else, and give them a bit of time off, because they’re not going to be useful for a while.”

