
Google SRE: Site Reliability Engineering at a Global Scale

18 Oct 2021 7:00am
Google SRE in a nutshell: SRE is a specialist organization with a principled approach to balancing reliability and velocity, with maintainability and efficiency in mind.

When the term DevOps was coined around 2009, its purpose was to break down silos between development and IT operations. DevOps has since become a game of tug-of-war between the reliability needs of the operations team and the velocity goals on the developer side. Site reliability engineering became the balancer.

As Benjamin Treynor Sloss, designer of Google’s SRE program, puts it: “SRE is what happens when you ask a software engineer to design and run operations.”

The SRE team has emerged as the answer to how you can build systems at scale, striking that balance between velocity, maintainability and efficiency.

It was only logical that this year’s DevOps Enterprise Summit would invite Google SRE leadership to break down how it works at Google. After all, with more than 2 billion lines of code, Google’s production environment is one of the most complex integrated systems… ever. Its interconnectivity and uptime set the standard for DORA metrics, but also create challenges at planetary scale. Google literally wrote the books on site reliability engineering.

Of course, most people outside of Google will probably never work on anything at this scale, but because increasingly distributed systems are constantly integrating with others, the challenges of scaling with complexity are universal. As are the ways to tackle them.

One Team to Rule Them All, But Not to Rule

Google’s site reliability engineering team is treated as one central organism, spanning across internal networking and developer tools, as well as customer-facing ones. Each service lifecycle stage has different needs and the types of SRE engagement vary.

“What they all have in common is they are scoped around SRE’s mission: reliability, velocity, maintainability and efficiency. And a shared set of principles,” said Dr. Christof Leng, Google’s SRE Engagements engineering lead. He heads three horizontal SRE teams in Munich and is responsible for maintaining Google’s SRE Engagement Model, the collection of policies and principles around SRE and developer collaboration.

Google has more than 3,000 site reliability engineers, grouped into product areas of between 50 and 300 SREs each. This makes it one of the largest SRE teams in the world, if not the largest, yet it is still disproportionately smaller than the developer team. Leng says that keeps SREs focused on their core mission. And it limits the amount of work developers can offload onto SREs.

Also, SRE support is neither automatic nor available to all dev teams at Google. SRE remains an intentionally scarce resource. SRE teams are funded by the development teams, a decision made at director or VP level, and each is made up of at least six SREs. Both the dev and SRE teams must agree to start an engagement, and either side can end it. But it’s usually intended to be longer term.

“Production excellence is a multi-year investment so engagements are not considered in isolation, but at the SRE-product area level,” explained Dr. Jennifer Petoff, Google’s director of SRE education and co-author of the original SRE O’Reilly guide.

“It takes time to build up that deep understanding of the services that team is responsible for.”

The Specific Scope of Google SRE-Developer Relationships

While managing a service is a shared endeavor with shared goals, service level objectives and error budgets, Petoff noted that even though the day-to-day production responsibility rests with the SRE team, ultimately the uptime and availability buck stops with the dev team.

“Responsibility for having a reliable service is not off-loaded onto the SRE or thrown over the fence. SRE’s job is to help the dev team meet their reliability and velocity goals and to meet the needs of our users first and foremost,” she said.
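The shared service level objectives and error budgets mentioned above come down to simple arithmetic. The Python sketch below is illustrative only: the function names, the request-based way of counting failures and the example numbers are assumptions made here, not details from the talk or from Google's tooling.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Requests that may fail in the window without breaching the SLO.

    slo is a fraction, e.g. 0.999 for "three nines".
    """
    # round() absorbs floating-point noise in (1 - slo).
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("this SLO leaves no error budget for the window")
    return 1 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO, one million requests in the window.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

Under these assumptions, a 99.9% objective over a million requests leaves room for 1,000 failures. A shared budget like this gives the dev and SRE teams one number to look at when deciding whether to ship faster or to slow down and invest in reliability.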

In fact, it’s quite clear what the SRE team can and cannot engage with. The Google SRE team is only able to work on certain projects, and the existence of an on-call team is not seen as a justification. They can only work on what they can do significantly more efficiently than anyone else. If devs can do it, that should remain a dev headcount.

The Google SRE Engagement Model concerns production only, which includes:

  • System architecture and inter-service dependencies
  • Instrumentation, metrics and monitoring
  • Emergency response
  • Capacity planning
  • Change management
  • Performance — availability, latency and efficiency

By design the work of the SREs must also be “interesting, impactful and challenging for the SRE team,” Petoff said. This is not handing off pager duty. “SRE is not to be the ops team. Our mission is not to handle operations, but to improve inherent reliability of systems through engineering.”

The SRE team aims to reduce the ops workload by answering what broke, how to fix it, and then how to make sure it’s fixed for good.

For SREs, there’s always more work to do than there is time, so all their work should have a clear scope and a clear connection to championing users. The benefits may not always be visible to users, as with infrastructure updates such as converging toward standard platforms in order to increase feature velocity. Standardization also benefits the SRE team by reducing cognitive load.

Finally, an important role of an SRE is that of a teacher, passing on production knowledge. Petoff says this is the only way to keep the SRE team from becoming a human abstraction layer from production. “You can’t build a wall and then complain about a ‘throw it over the wall’ mentality.”

How Google SRE Functions in Practice

Yes, an SRE team can join a development team at any stage in the app or service lifecycle. But they are most effective from the start, bringing reliability along as you shift left.

Leng says at the design stage, “You make many decisions that are incredibly hard or practically impossible to change later — architecture, technology, failover capabilities.” He continued that “When a production expert has a voice at the table, you can fix problems before they actually happen.”

However, this isn’t often the case. For example, SLOs aren’t usually discussed until the implementation is done, but the architecture that was already chosen should scale to the expected number of nines. Otherwise, either the whole system has to be redesigned, lest you disappoint your users, or you’ve gone the other way and invested in architecture that’s far more complicated than is needed to satisfy them.
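A rough back-of-the-envelope calculation shows why the architecture has to match the target number of nines. The Python sketch below is illustrative only: the function names, the 30-day window, and the simplification that a request fails whenever any one of its hard dependencies fails (with failures independent of each other) are assumptions made here, not something described in the talk.

```python
import math


def availability_from_nines(nines: int) -> float:
    """Convert a number of nines into an availability fraction (3 -> 0.999)."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of full outage the target tolerates over the window."""
    return (1 - availability) * days * 24 * 60


def serial_availability(dependencies: list[float]) -> float:
    """Availability of a request path that needs every dependency to be up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    return math.prod(dependencies)


if __name__ == "__main__":
    target = availability_from_nines(3)
    print(f"downtime allowed per 30 days: {allowed_downtime_minutes(target):.1f} min")
    # Hypothetical design: three hard dependencies, each at three nines.
    chain = serial_availability([0.999, 0.999, 0.999])
    print(f"chained availability: {chain:.6f}, meets target: {chain >= target}")
```

With these assumptions, three nines leaves roughly 43 minutes of downtime per 30 days, and chaining three hard dependencies that each offer three nines lands at about 99.7%, below the target. That is the kind of mismatch that is cheap to catch at the design stage and expensive to fix after implementation.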

Not every SRE engagement will be the same either. Leng groups them into three as-needed buckets, which cover both headcount budget and project time commitment:

  • Baseline support — tactical and reactive ad-hoc support like office hours or consulting projects, where the developers execute based on advice received, or as part of the incident response team in larger-scale outages
  • Assisted engagement — SRE provides strategic, proactive, product-focused consultancy, with a dedicated SRE point of contact and a shared production roadmap; this can be an SRE temporarily embedded on a dev team for a critical product where an SRE can be a force multiplier
  • Full support — SRE is the effective owner of production and does on-call rotations to solve less obvious and complex production problems; the goal is for the SRE to automate themselves out of a job in 18 months

“Higher is not always better. It comes at a higher cost. Especially for the earlier lifecycle phases, with a high rate of change, a lower-tier engagement can be more effective.” Leng said engagements can be scaled up or down over time, but that isn’t needed for all services.

Everything is situational. If an SRE team is focused on core infrastructure, they may be offering full support to a few different engagements, but if they are working on earlier-stage, experimental projects, they could be working on several baseline engagements.

When Things Go Wrong

Just because it’s Google doesn’t mean it’s perfect… by any means. But the Google site reliability engagement model hopes for the best and prepares for the worst. The worst could be anything from operations overload to a disagreement on direction to the developers just not doing their share anymore.

That’s when they apply the best practices for incident management at a strategic level. Start by looking for the root cause. Then look for buy-in from both dev partners and critical dependencies. If an agreement can’t be reached, escalate it up both the dev and SRE management chains. Then declare a “Code Yellow,” meaning that the work required to fix the problem trumps all other project work.

When all else fails, don’t be a hero, don’t be a constant firefighter — recognize when it may be time to turn in your pager. But that’s OK because Leng says mobility among SREs across Google is typically very high.

“This is not what typically happens. Everyone understands the SREs need to be kept happy as well, you can’t throw them under the bus. And the developers understand the value that they get out of it,” Leng said.

In the end, as he addressed fellow SREs, “Whatever you do, remember that heroics are not sustainable. You can’t firefight production forever. Neither can you work day and night, it’s not sustainable. Solve the problem through smart engineering, not brute force.”

United you stand. Divided you fall.