No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
“It’s really about communication, humility and trust,” said Google engineer Liz Fong-Jones of the emerging practice of site reliability engineering, at New Relic’s FutureStack New York 2017 last month.
It’s about building credibility for the long term. That’s the fundamental element that makes the SRE role work, she told New Relic’s Matthew Flaming, in a talk about what others may learn from her SRE experience at Google. “You really won’t get anywhere with a product development software engineering team if you walk in there and say, ‘You’re doing A, B, and C wrong, like you must fix this,’” she said.
Google literally wrote the book on the SRE role. During her ten years at Google, Fong-Jones has worked with eight teams spanning the stack, from low-level storage and Bigtable to customer-facing products like Google Play Books. She is now a member of Google’s team of SREs.
Flaming called the SRE role “the purest distillation of DevOps principles into a particular role.” New Relic is working to define and empower SREs both internally and for its customers, he said, and Google is pushing the envelope for industry best practices.
So what can you learn from Google’s SRE practice?
Technical skills are teachable, so Fong-Jones looks for engineers who can empathize and build trust with other people, alongside technical skill and curiosity. Curiosity is hard to teach, she said, but it’s critical to have engineers who are curious about how systems break and who really try to understand what happened.
It’s always important to hire the right people, Fong-Jones said, but it’s not only the SREs who matter. Hiring production-minded product development software engineers is also key. “If you are in a small organization and you want to get off on the right foot, you cannot have people that are working against your reliability objectives by throwing half-finished stuff over the wall, or not really wanting to write metrics into the software that they’re writing,” she said.
The difference between a bunch of disconnected islands of teams or lone individuals and a community of SREs that works together to make things systemically better across the platform or the company, Fong-Jones explained, is having people who deeply care about reliability and reinforce best practices with one another.
SLO from the Get-Go
It’s important to define Service Level Objectives (SLOs) — the reliability targets that underpin service level agreements — from the very beginning. “Because if you’re not having that reasoned conversation about what’s an appropriate reliability level,” she said, “it’s harder and harder to do as you go along.”
If you don’t have explicit SLOs published, Fong-Jones said, then your SLO is whatever your customers are used to seeing. That implicit standard leads to bad assumptions about your architecture, or it may fail spectacularly.
Her team starts with a helpful attitude, she said: “Hey, you already have risks, let’s enumerate them for you.” Focusing on quantitative data has been helpful in conversations with internal customers. For those reluctant to do the work of defining new and accurate SLOs, Fong-Jones will sometimes deliberately run their services exactly to their SLO. The resulting failure usually brings them around.
Start with a risk matrix, she suggested. Go straight to the team’s engineers and ask them to enumerate the risks. They may not like talking about it, but they know what they are worried about. “Everyone knows where the skeletons are buried,” she said.
Once the risks are defined and MTTD (Mean Time to Detect), MTTR (Mean Time to Recovery) and MTBF (Mean Time Between Failures) are set, they can talk about the real business of the SRE. “Is this an acceptable risk or not? What’s the cost of mitigating these risks? What do we think an appropriate level of reliability is?” she asked.
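As a back-of-the-envelope illustration (not from the talk), MTBF and MTTR together determine a service’s steady-state availability. A minimal sketch, with hypothetical numbers:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical service: fails about once every 30 days (720 h),
# and takes 2 hours to detect and recover from each failure.
print(f"{availability(720, 2):.4%}")  # 99.7230%
```

Shrinking MTTR (faster detection and recovery) improves this ratio just as directly as making failures rarer, which is why the risk conversation covers all three metrics.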
Teams also need to define a service level indicator (SLI), a key performance metric that represents some facet of the business, she said. For example: the fraction of user queries that complete successfully within 200 milliseconds.
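That kind of SLI is straightforward to compute from request logs. A minimal sketch, assuming hypothetical log data and the 200 ms threshold from the example:

```python
# Hypothetical request log: (latency in milliseconds, error flag).
requests = [
    (120, False), (95, False), (250, False),  # 250 ms: too slow
    (80, True),                               # errored
    (150, False), (110, False),
]

def sli_good_fraction(requests, threshold_ms=200):
    """Fraction of requests that succeeded within the latency threshold."""
    good = sum(1 for latency, errored in requests
               if not errored and latency <= threshold_ms)
    return good / len(requests)

print(f"{sli_good_fraction(requests):.1%}")  # 66.7% (4 of 6 requests are good)
```

In production this ratio would be measured continuously and compared against the SLO target rather than computed over a tiny list, but the arithmetic is the same.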
No Grumpy Humans
Getting the balance between having enough visibility into your system and alert fatigue is tricky, Fong-Jones acknowledged. “It’s not just in terms of reliability, but what’s the effect on the humans? Are the humans going to be grumpy because they’ve been paged five times overnight? Because you can’t run a service off of really grumpy humans.”
To strike that balance, she recommends turning off as many alerts as you can and focusing on the pain users are experiencing. Sometimes, she said, the situation is so bad that a team is constantly failing its SLOs. When that’s the case, re-evaluating your SLOs may be in order.
“You need to either decide, ‘Okay, this is going to be a short-term issue, we know what we need to do, it’ll be fixed in a month, let’s ignore anything except for catastrophic failures,’” she said. Evaluate what is acceptable. “If users are happy and your service is 99 percent available instead of 99.9 percent available, maybe that’s where you should set your SLO. Maybe your business’s requirements were not accurately measured at the start.”
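The gap between those two targets is easy to quantify as an error budget. A quick sketch (a 30-day window is my assumption, not from the talk):

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed per window at a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999):
    print(f"{slo:.1%} -> {downtime_budget_minutes(slo):.1f} min per 30 days")
# 99.0% allows 432.0 minutes of downtime; 99.9% allows only 43.2.
```

If users are genuinely happy inside the larger budget, the looser SLO buys the team roughly a tenfold reduction in how urgently each incident must be handled.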
Standardization Is Key
Well, duh. But a large part of this is getting rid of the problem of shadow IT, where an engineer decides to use a shiny new feature that’s outside the approved software. To combat this problem, Google uses a bottom-up approach.
When they notice, for example, six different APIs doing similar things, she talks to the engineers: “Okay, talk among yourselves and figure out how to merge. If you’re developing two competing things, let’s just fold them into one project,” she said.
Google encourages its engineers to “look left, look right, see what other people are doing,” she said. It goes back to simplicity. The company rewards people for shutting down projects. At Google, that’s not something that gets you penalized; it’s something that will get you promoted.
Fong-Jones thinks it’s really important to reward people for doing reliability work: reward them for thinking about systems as a whole, make sure your job ladder rewards thinking about what makes the product excellent from a reliability perspective, and build a community of practice.
“Are you rewarding people for building a whole bunch of complex stuff that no one can maintain?” she asked. “Or are you rewarding people for doing the simplest thing, even if it means not writing any new software, just integrating something that’s existing? And that’s something that is valuable in software engineers, but even more valuable in site reliability engineers.”
When done right, it can be very rewarding, she said. Like when a Home Depot vice president texted Google’s director of customer reliability engineering on Thanksgiving Day. “The message wasn’t, ‘Oh my god, everything’s on fire.’ What it said was, ‘Thank you, we’ve had a quiet Thanksgiving for the first time in forever.’”
Google is a sponsor of The New Stack.