Experts Weigh in on the State of Site Reliability Engineering
In today’s world, software and systems availability, performance and capacity are essential for delivering customer experience and generating revenue. Studies show that just one bad customer experience can result in churn. As a result, organizations are moving quickly to hire site reliability engineers (SREs) to support operations and boost SRE processes.
Anyone can read the canonical SRE book from Google to understand it in theory, but what does a high-performing SRE team look like in practice? Effective site reliability engineering requires a commitment to two aspects, people and the practice, explained Matt Schallert, senior software engineer at Chronosphere and former SRE at Uber and Tumblr, during a recent Techstrong DevOps.com roundtable titled “The State of SRE.”
“You need people that think about systems holistically, such as how to maintain uptime and measure service-level objectives (SLOs), especially in increasingly complex environments,” he said. “You also need the company to buy in.” This means admitting that the organization can do a better job of managing risk and uptime goals, even being willing to halt rolling out new features. A tall order.
Schallert joined Techstrong Group moderator Michael Vizard for a lively discussion on how site reliability engineering has changed from a buzzword to a coveted position, and the role observability plays in the day-to-day life of an SRE. Other participants in the discussion included Uma Mukkara, head of chaos engineering at Harness; John Turner, manager of customer engineering at strongDM; and Nung Bedell, SRE/customer reliability engineer at Fairwinds.
The Evolution of SRE
Ben Treynor Sloss, a senior vice president at Google, is widely credited with establishing the first SRE team in 2003 to address a few challenges:
- Web-scale reliability needs with little to no downtime or latency
- Massive, distributed infrastructure (before public cloud popularity)
- Lack of DevOps for code-driven reliability management
Since then, complexity has accelerated. Cloud computing, microservices, continuous delivery pipelines and automation are central to maintaining competitive business advantage.
“There’s a huge need for maintaining uptime,” Mukkara said. “Systems became complex, so organizations of all sizes are setting up SRE practices and getting what is needed for SREs to be thriving.”
Mukkara has seen a lot of developers try to become SREs, which he sees as a good sign that the SRE function is innovating. Turner noted that the interest is wider still.
“I’ve seen strong developers that have an interest in infrastructure and also the opposite, including from the IT and operations side, with an interest in how to deploy, how to understand uptime and measure SLAs and SLOs,” he said.
He doesn’t believe SREs necessarily have to be strong developers to start because so many tools exist to help. Rather, he noted that anyone interested in becoming an SRE must have an interest in how things work and how things connect together, as well as a general interest in technology.
Because the journey to the job is somewhat unique, none of the roundtable participants see a need for a specific SRE certification. However, they did acknowledge that adding interview questions about debugging and the handling of contrived failures could improve hiring success.
Schallert said, “It’s definitely a chicken-and-egg problem — you get the experience by seeing it, but you have to get your foot in the door and see it in the first place.”
The experts hoped that more people, including women and underrepresented minorities, with transferable skills would apply.
“Diversity will continue to expand our thinking,” Turner said. Schallert added that “it’s our responsibility as an industry and as a role to take that chance on people and help level them up. It’s the right thing to do and it benefits everyone more so than excluding people.”
Establishing the SRE Function
The Google SRE book definition is popular, but most organizations don’t have Google-sized problems, so they don’t need to copy the Google SRE function.
“That’s actually a good thing,” said Schallert. “Organizations are realizing how to apply SRE practices to their own business and organizational needs to come up with what it means to them.”
The panel offered insights on whether there is a best model for the SRE function, how to keep engineers engaged, and how automation and tools fit in.
“There are different models — embedded SREs versus a dedicated, separate function,” said Schallert. “When I was an SRE at Uber, I saw a mix of both. In all cases, it’s really important to push SRE as a partnership between developer teams and SRE teams or embedded SREs.”
Without that partnership and shared accountability, it can seem as though developers are throwing software over the wall and making another team own deployment, reliability and everything else.
“The way to have your cake and eat it too is to have those teams be partners and agree on their operating model,” said Schallert, who advised that SREs should be helping teams along, as opposed to taking ownership at some point.
“Have both teams involved in the whole end-to-end life cycle of managing your software,” he said.
Mukkara agrees. “You start out as a separate function [engineering teams and SREs] but in the end, the success will really depend on if the accountability is spread across all the teams or not,” he said.
Turner said he sees an overlap when “we have DevOps and then we have SRE, and they perform two different functions, but there’s a lot of crossover. I think they’re going to slowly start to separate, but then figure out where that middle piece is, and who owns what.”
As with agile development implementations, few organizations will have the same SRE practice or processes. Teams have to figure out what works best for them, according to the experts. Agreeing on, and then documenting or codifying, the SRE vision is a good first step, and that includes defining service-level indicators (SLIs) and SLOs.
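To make "codifying" SLIs and SLOs concrete, here is a minimal sketch in Python. It is not taken from the roundtable; the class name, the 99.9% target and the request counts are illustrative assumptions. It treats the SLI as the share of successful requests and reports how much of the error budget implied by the SLO is left.

```python
from dataclasses import dataclass


@dataclass
class AvailabilitySLO:
    """Hypothetical SLO: the fraction of requests that must succeed in a window."""

    target: float  # e.g. 0.999 means 99.9% of requests should succeed

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget to reason about.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def sli(self, good_requests: int, total_requests: int) -> float:
        """The SLI: observed ratio of good requests to all requests."""
        if total_requests == 0:
            return 1.0  # assumption: no traffic counts as meeting the objective
        return good_requests / total_requests

    def error_budget_remaining(self, good_requests: int, total_requests: int) -> float:
        """Fraction of allowed failures still unspent (negative if overspent)."""
        if total_requests == 0:
            return 1.0
        allowed_failures = (1.0 - self.target) * total_requests
        actual_failures = total_requests - good_requests
        return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    slo = AvailabilitySLO(target=0.999)
    print(slo.sli(999_500, 1_000_000))
    print(slo.error_budget_remaining(999_500, 1_000_000))
```

A team that writes its objective down in this form has a shared, checkable number to point to when deciding whether to keep shipping features or pause for reliability work.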
What about Tools?
Tools support team accountability, the roundtable participants concurred.
“Some of the ways that you can build shared culture are by teams using the same tools,” said Schallert.
For example, your SRE team can develop the deployment software that both the developers and the SREs use. Or they can work together to develop the alerts and build experience with their monitoring systems. With everyone using the same tools and interacting with applications in a similar way, the tool set also helps to build that culture of shared responsibility.
“There’s a lot of internal tooling that’s been created by development teams and SRE teams,” said Bedell. “There’re also a great deal of open source tools released by these companies that benefit the entire tech community.”
The experts mentioned observability and open telemetry tools as useful in supporting goals and kickstarting SRE practices toward success.
“Observability platforms have improved a lot,” said Mukkara. “There’s a lot of focus. There’s [been] a lot of investment in the last few years to get observability to the next level.”
When it comes to incident response, automation is key.
“The function of the SRE is to make sure that you are doing less manual work and eventually doing a lot of automated work,” said Mukkara.
He encourages teams to practice chaos engineering, which involves intentionally causing failures to verify that distributed systems can maintain high levels of availability.
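The idea behind a chaos experiment can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not how Harness or any other chaos engineering product works: `FlakyDependency`, `call_with_retries` and `run_experiment` are hypothetical names, and the "failure" is a raised exception rather than a real outage. The experiment injects failures at a chosen rate and measures the availability that callers still see when a simple retry policy is in place.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service that fails at an injected rate."""

    def __init__(self, failure_rate: float, rng: random.Random) -> None:
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self) -> str:
        # Fault injection: fail with probability `failure_rate`.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "ok"


def call_with_retries(dep: FlakyDependency, attempts: int = 3) -> bool:
    """The resilience mechanism under test: retry before giving up."""
    for _ in range(attempts):
        try:
            dep.call()
            return True
        except ConnectionError:
            continue
    return False


def run_experiment(failure_rate: float, requests: int = 10_000, seed: int = 42) -> float:
    """Inject failures and return the availability observed by callers."""
    dep = FlakyDependency(failure_rate, random.Random(seed))
    successes = sum(call_with_retries(dep) for _ in range(requests))
    return successes / requests


if __name__ == "__main__":
    for rate in (0.0, 0.2, 0.5):
        print(f"failure rate {rate:.0%}: availability {run_experiment(rate):.4f}")
```

In a real system the hypothesis ("three retries keep us above our availability target when a dependency fails 20% of the time") would be tested against actual infrastructure, but the structure is the same: state the expectation, inject the failure, compare the result.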
SREs on Call
At many companies, taking your turn in the on-call rotation is one of the defining characteristics of being an SRE. That doesn’t mean that SREs should be responding to every single page, warns Schallert.
“If SREs are responding to every page, they become professional firefighters,” he said, again advocating for shared responsibility. “That’s not good for them or the organization, and it’s why in the past few years there’s been a shift in developers being on call for their own services as well.”
For example, he said, “If someone is deploying a new release or something they worked on a feature for and that feature is broken when it rolls out, the people that worked on it are better suited to debug it and be involved in that incident than throwing it over a wall and having someone else figure it out.”
Turner added, “In that shared responsibility model, when you have that understanding, that communication, that cooperation between the teams, you also see lower time to incident resolution.”
How to Avoid SRE Burnout
SRE teams can be responsible for solving big and small problems, troubleshooting major incidents and performing routine operational tasks. Because reliability is such a high-stakes issue for many companies, and there’s a tendency for SREs to take on the bulk of on-call work, SRE is one of the positions that is most susceptible to burnout in an engineering organization. The keys to avoiding SRE burnout, according to the roundtable participants, include:
- Effectively understanding and evaluating risk based on business impact.
- Establishing clear goals and buy-in around SLAs.
- Prioritizing issues and deciding which need to be resolved immediately.
- Empowering SREs to push back on timelines and priorities.
- Having a well-staffed team — not a single individual with sole responsibility.
- Collaborating on issues.
Schallert believes it’s important for organizations to begin quantifying the impact on the business because it “makes it really easy, then, to get other teams on board with taking the time to fix those issues.”
“You have to resist the urge to classify everything as a hot, high priority. Burnout happens quickly when every issue is your issue,” said Bedell.
Observability platforms can also help reduce SRE stress and burnout. The experts concurred that logging, metrics, alerts and messaging are important when things are failing or not coming out of the pipeline correctly.
Echoing that point, Schallert cited a real-world example: “One model that we found really useful at Chronosphere is having those same signals throughout the pipeline. As a commit is making its way to production, it goes through environments where we have the same alerts and SLO monitoring that we have in production set up on these environments. It doesn’t page a human if something goes wrong, but it tells the release tooling to stop letting that build make it through the pipeline until the fix is added.”
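The gating pattern Schallert describes might look something like the following Python sketch. It is an illustration only and does not represent Chronosphere's actual tooling: `AlertStatus` and `may_promote` are hypothetical names, and how alert states are fetched from a monitoring system is left out entirely. The point is that a firing alert in a pre-production environment blocks promotion instead of paging anyone.

```python
from dataclasses import dataclass


@dataclass
class AlertStatus:
    """Hypothetical snapshot of one alert in a pre-production environment."""

    name: str
    firing: bool


def may_promote(alerts: list[AlertStatus]) -> tuple[bool, list[str]]:
    """Decide whether a build may move to the next environment.

    Returns (allowed, names_of_firing_alerts). Nobody is paged here; the
    release tooling simply holds the build while any alert is firing.
    """
    firing = [alert.name for alert in alerts if alert.firing]
    return (not firing, firing)


if __name__ == "__main__":
    staging_alerts = [
        AlertStatus("latency-slo-burn", firing=False),
        AlertStatus("error-rate-slo-burn", firing=True),
    ]
    allowed, blockers = may_promote(staging_alerts)
    if allowed:
        print("promoting build")
    else:
        print("holding build; firing alerts:", ", ".join(blockers))
```

Reusing the production alert definitions in earlier environments is what gives the gate its value: the build is judged by the same signals that would otherwise wake someone up later.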
Tooling can empower SREs to better know, triage and understand issues.
“The difference between what monitoring versus observability is,” said Schallert, “is the difference between helping you get more and more narrowed down to the root cause of an issue and showing you what components were involved, as opposed to just knowing something’s wrong.”