What a Broken Wheel Taught Google Site Reliability Engineers

Google Cloud Reliability Advocate Steve McGhee once shared an essential truth that the company’s site reliability engineering (SRE) teams have learned:
“At Google scale, million-to-one chances happen all the time.”
Sooner or later that perfect storm of oddball conditions triggers “complex, emergent modes of failure that aren’t seen elsewhere,” McGhee wrote in a post on the Google Cloud blog. “Thus, SREs within Google have become adept at developing systems to track failures deep into the many layers of our infrastructure….” And that sometimes leads them to surprising places.
But it’s also part of the larger phenomenon of treating problems as learning opportunities — to be investigated, analyzed, and eventually shared with a global community that’s striving to be better. It’s a favorite geek pastime, making learning as simple as sharing stories about the trickiest puzzles ever solved, along with a dollop of advice, and some fun lore and legends about days gone by.
And along the way, you’ll hear some very fascinating stories.
The Bottom of the Stack
Google engineers had experienced a problem with servers caching frequently-accessed content on Google’s low-latency edge network. After swapping in a new server, they discovered the old one had been experiencing routing problems, with kernel messages warning of CPU throttling. Google’s hardware team ultimately identified the source of the problem — and it was surprisingly low-tech.
“The casters on the rear wheels have failed,” they wrote, “and the machines are overheating as a consequence of being tilted.” The tilting affected the flow of coolant, which meant the broken wheels on the rack ultimately “result[ed] in some CPUs heating up to the point of being throttled.”
The blog post’s title? “Finding a problem at the bottom of the Google stack.” Developer Dilip Kumar later joked on Hacker News that “I can’t imagine a better way the phrase ‘bottom of the Google stack’ could have been used.”
But like any good story, there’s a lesson to be learned. “Another phrase we commonly use here on SRE teams is ‘All incidents should be novel’,” McGhee wrote — meaning “they should never occur more than once. In this case, the SREs and hardware operation teams worked together to ensure that this class of failure would never happen again.”
The hardware team then proposed future solutions — including wheel repair kits and better installation and rack-moving procedures (to avoid damage). And more importantly, they knew what to look for in other racks that might be on the verge of their own wheel problems, which “resulted in a systematic replacement of all racks with the same issue,” McGhee wrote, “while avoiding any customer impact.”
But in a broader sense, it shows how issues can also be “teachable moments” — and those lessons can be surprisingly far-reaching. McGhee recently co-authored Enterprise Roadmap to SRE (published earlier this year by O’Reilly) with Google Cloud solutions architect James Brookbank.
And at one point the 62-page report argues that SRE is happening now specifically because “The complexity of internet-based services has clearly risen recently, and most notable is the rise of cloud computing,” a world of “architectural choices that expect failure,” since only a subset of components need to be available for the system to function.
This requires a new way of thinking.
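To make that concrete, here’s a minimal Python sketch of an architecture that “expects failure”: the replica names, failure simulation, and fallback logic are all invented for illustration, not drawn from the report, but they show how a request can still succeed as long as some subset of components stays healthy.

```python
# A toy illustration (not from the report): a read that tolerates failed
# components by falling back to other replicas, so no single failure is
# user-visible. Replica names and the simulated failure rate are made up.
import random

REPLICAS = ["cache-a", "cache-b", "cache-c"]  # hypothetical replicas


def fetch_from(replica: str, key: str) -> str:
    """Pretend backend call that fails roughly a third of the time."""
    if random.random() < 0.33:
        raise ConnectionError(f"{replica} unavailable")
    return f"value-of-{key} (served by {replica})"


def resilient_get(key: str) -> str:
    """Try each replica in turn; the system works if any subset is healthy."""
    errors = []
    for replica in REPLICAS:
        try:
            return fetch_from(replica, key)
        except ConnectionError as exc:
            errors.append(str(exc))  # an individual failure is routine, not a page
    raise RuntimeError(f"all replicas failed: {errors}")


if __name__ == "__main__":
    print(resilient_get("homepage"))
```

The point isn’t the code itself but the mindset: an individual failure is routine, and only the failure of every replica becomes a user-visible incident.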
The Lessons Continue
McGhee’s story also drew more stories about “novel” issues when the post first turned up in a discussion on Hacker News. One commenter remembered a colocation facility “had replaced some missing block-out panels in a rack and it caused the top-of-rack switches to recycle hot air…
“The system temps of both switches were north of 100°C and to their credit (Dell/Force10 s4820Ts) they ran flawlessly and didn’t degrade any traffic, sending appropriate notices to the reporting email. Something as benign as that can take out an entire infrastructure if unchecked.”
They went on to say they heard stories of even worse problems with infrastructure. “One data center manager recounted a story of a rack falling through the raised flooring…” (They added that even after the disaster, “it kept running until noticed by a tech on a walkthrough.”) And this led to another comment that was more philosophical. “One of the by-products of the public cloud era is a loss of having to consider the physical side of things when considering operational art.”
“Do people not visit their data centers often enough to notice a tilted rack?”
In a community that strives to always be learning more, soon the commenters were pondering why Google seemed to be monitoring less for causes of issues and more for “user-visible” problems. One user ultimately tracked down Google’s complete toolset for monitoring distributed systems, including dashboards, alerts, and a handy glossary of commonly-used terms. Sure enough, it included both “black box monitoring” (defined as “testing externally visible behavior as a user would see it”) and “white box monitoring” — that is, “metrics exposed by the internals of the system, including logs…” And later the document explained that black box monitoring can alert you that “The system isn’t working correctly, right now,” eventually delving into how this fits into a foundational philosophy that every page should be actionable.
“it’s better to spend much more effort on catching symptoms than causes; when it comes to causes, only worry about very definite, very imminent causes.”
And Google site reliability engineer Rob Ewaschuk once shared his own philosophy, writing that cause-based alerts “are bad (but sometimes necessary),” while arguing that symptom-based notifications allow SREs to concentrate on fixing “the alerts that matter.”
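As a rough sketch of that distinction (not Google’s actual tooling, and with metric names and thresholds invented for illustration), here’s how a symptom-based “black box” check might differ from a cause-based “white box” check in a few lines of Python: the first pages only on what users can actually see, while the second quietly flags an overheating CPU long before anyone notices.

```python
# A minimal sketch contrasting the two alerting styles described above.
# Not Google's tooling: the Snapshot fields and thresholds are assumptions.
from dataclasses import dataclass


@dataclass
class Snapshot:
    error_rate: float      # fraction of user requests failing (externally visible)
    p99_latency_ms: float  # user-visible tail latency
    cpu_temp_c: float      # internal hardware metric (a possible cause)


def symptom_alerts(s: Snapshot) -> list[str]:
    """Black-box style: page only when users are hurting right now."""
    alerts = []
    if s.error_rate > 0.01:
        alerts.append(f"PAGE: error rate {s.error_rate:.1%} exceeds 1%")
    if s.p99_latency_ms > 500:
        alerts.append(f"PAGE: p99 latency {s.p99_latency_ms:.0f}ms exceeds 500ms")
    return alerts


def cause_alerts(s: Snapshot) -> list[str]:
    """White-box style: useful context for debugging, noisy if every cause pages."""
    alerts = []
    if s.cpu_temp_c > 90:
        alerts.append(f"TICKET: CPU at {s.cpu_temp_c:.0f}C, throttling likely")
    return alerts


if __name__ == "__main__":
    # A tilted rack might trip the cause-based check long before users notice.
    snap = Snapshot(error_rate=0.001, p99_latency_ms=120, cpu_temp_c=97)
    print(symptom_alerts(snap) or "no user-visible symptoms")
    print(cause_alerts(snap) or "no internal warnings")
```

In that framing, the first list is what should wake someone up; the second is material for the investigation that follows.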
The Bugs of Legend
Maybe it’s all proof that discussions about corner cases lead in surprisingly productive directions.
System administrator Evan Anderson once fixed an intermittent WiFi connection that only went out during “a particular recurring lunchtime meeting,” he remembered on Hacker News — a meeting “scheduled right when a steady stream of workers were going into the break room across the hall and heating food in a microwave oven…”
And then there’s a legendary ticket complaining that “OpenOffice does not print on Tuesdays.” This was due to a utility on the virtual machine that detected file types using headers within the files, which often contained the day of the week, and would mis-identify a PostScript file’s format — but only on Tuesdays.
Software architect Andreas Zwinkau has collected dozens of similar stories on his “software folklore” site — and IT professionals are always ready to join in the conversation with more stories of their own.
But it proves more than just that you never know where an investigation will lead. In a business that’s filled with data — with all of its alerts, notifications, pages, and dashboards — there’s still only so much that can be automated. So there’s always going to be a role for the very human faculties of both inquiry and intuition, for competence paired with curiosity.
And then for some all-important celebratory story-telling afterward to share what you’ve learned.
WebReduce
- What Steve Jobs and Tim Cook learned from U2’s Bono.
- Harvard Business Review’s “Women at Work” podcast interviews author Amy Gallo about how to get along with difficult co-workers.
- Gartner predicts a push for happier IT departments, including more in-house development platforms and a virtual workplace metaverse.
- A Communications of the ACM editor calls for liability laws to “incentivize” better cybersecurity.