Building SRE Teams with Specialization
As organizations progress in their reliability journey, they start building site reliability engineering (SRE) teams. These teams can be structured in two ways: a distributed model, where one or a group of SREs support the various projects of other teams; and a centralized model, where a single SRE team sits at the helm of process and infrastructure for the entire organization. Many organizations even do a combination of these ideas: Some of their SREs are assigned to projects ad hoc while other projects are owned end-to-end by the SRE team.
Organizations also vary in the way they define an SRE’s job. One perspective is that SREs should be generalists, capable of performing all duties. Arguably, this approach is robust: If every SRE can do any given job, one person’s absence shouldn’t be an issue. On the other hand, you could run into a “jack of all trades” situation, where no one develops deep expertise in any one area. This gives rise to another view: specialization.
In this blog post, we’ll look at:
- The advantages of an SRE team where each member is a specialist.
- Some SRE specialist roles and how they help.
Why Specialize in SRE?
The SRE role is extremely diverse. An SRE may be tasked with contributing to the code base of the service, writing policies and procedures for development practices, spreading cultural values and everything in between. Even if tasks aren’t in the official SRE job description, SREs are often the ones who pick up glue work. This is work that isn’t technically anyone’s job, but is necessary for work to proceed.
It’s virtually impossible to find someone who’s both an expert in and enthusiastic about every aspect of SRE. In fact, certain aspects of SRE seem like polar opposite initiatives. On the one hand, an SRE is the reliability guardian, reining in teams to make sure they don’t breach service-level objectives (SLOs). On the other hand, an SRE is the champion of failure, encouraging teams to take risks as long as they’re ready to learn from them.
Fortunately, specialization clears up those muddy lines and organizations can pursue reliability without losing the big-picture perspective. Specialists contextualize everything going on in development and operations to benefit the business overall. So even as they focus on one aspect of SRE, they’re aligned with the entire organization.
Specializing in SRE means focusing on your strong suits. If you’re excellent at writing code, work on infrastructure and in-house tools. If you’re great at developing policy, but can’t grapple with the depths of your codebase, be a full-time educator and policy writer.
Think about specialization when you’re hiring, even as far back as when you’re writing the job description. Identify a specific need in your SRE function that a specialist can fill. Jake Englund, senior site reliability engineer at Blameless, says that SRE is a discipline that is “constantly redefining its capabilities as a whole” and that specialization doesn’t mean your expertise is kept to yourself. Englund loves learning from specialists on his team and says that learning always happens on the job, something he calls “learning by osmosis.” Specialists educate naturally, and as the team continues to learn and share, you end up with stronger engineers who are more satisfied with their jobs.
Specialist Roles in SRE Teams
Let’s look at some of the most common ways you can specialize. Of course, it’s not so black and white. A person can wear many hats, so don’t look for an exact 1:1 fit. Instead, you might consider opportunities to grow in a particular role.
The Educator
Who they are: SRE is all about building policies, processes, cultural values and infrastructure that promote service reliability for the end user. The educator is someone who teaches and encourages teams to use SRE practices.
What they do: Educators lead info sessions to teach their ideas and get people up to speed. They also track adoption and gather information on why tools or processes are underutilized. If required, they might provide hands-on coaching to specific teams.
Skills they need: Educators need to be able to convince people to adopt new practices. They need to be experts on the tangible benefits of adoption and able to cite specific figures where relevant. At the same time, they need to be personable and empathetic. They need to understand the pains that can come with switching to new practices and convey that understanding in the way they connect with teams.
The SLO Guard
Who they are: A key focus of SRE is the service-level objective (SLO). The SLO identifies a level of unreliability that, once crossed, indicates a negative customer experience. The SLO guard manages SLO adherence across the organization and helps teams prepare for potential SLO breaches.
What they do: The SLO guard helps to avoid SLO breaches by implementing preventative policies. This isn’t the full story, though. They also need to ensure that the SLO is appropriate and measures the right data. They run SLO review meetings, evaluate monitoring tools and research user expectations.
Skills they need: While discussing different SRE roles, Englund mentioned the value of someone who will say no. When everyone is enthusiastic about some new feature push, no one wants to be the dissenting voice. Telling someone that development needs to be delayed to preserve the SLO is a skill in itself, one that requires an unwavering commitment to reliability, the expertise to back up the decision and buy-in across teams to support the plan.
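To make the “say no” decision more concrete, here is a minimal sketch of how an SLO guard might check a service against its SLO before a feature push. It treats the SLO as a success-rate target over a window of requests and computes how much of the allowed unreliability (often called the error budget) is left. The function names, the request-based measurement and the 25% freeze threshold are all assumptions made for illustration, not something prescribed by the roles described here.

```python
from dataclasses import dataclass


@dataclass
class SloStatus:
    """Snapshot of how a service is doing against its SLO (hypothetical type)."""
    allowed_failures: float   # failures the SLO tolerates in this window
    budget_remaining: float   # fraction of that allowance still unused
    breached: bool            # True once failures exceed the allowance


def check_slo(slo_target: float, total_requests: int, failed_requests: int) -> SloStatus:
    """Compare observed failures with what the SLO target allows."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    allowed = (1 - slo_target) * total_requests
    # With no traffic there is nothing to consume, so treat the budget as full.
    remaining = 1 - failed_requests / allowed if allowed else 1.0
    return SloStatus(allowed, remaining, failed_requests > allowed)


def should_delay_release(status: SloStatus, freeze_below: float = 0.25) -> bool:
    """Assumed policy: pause feature pushes when little allowance is left."""
    return status.breached or status.budget_remaining < freeze_below


# Example: a 99.9% target over one million requests, with 800 failures so far.
status = check_slo(slo_target=0.999, total_requests=1_000_000, failed_requests=800)
print(status, should_delay_release(status))
```

A check like this doesn’t replace the judgment described above. It simply gives the SLO guard a shared, numeric starting point for the conversation about whether a release should wait.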
The Infrastructure Architect
Who they are: This role is focused on building SRE infrastructure across the entire organization. It spans many different types of projects, each with their own subspecializations: internal tools for monitoring or resolving incidents, documentation and runbooks for procedures, processes for completing projects or even cultural values to guide people’s decisions.
What they do: Infrastructure architects are in constant communication with other teams to see what’s needed most. Educators can serve as a conduit for these relationships, compiling what they hear into a big-picture report. Once the priorities are clear and aligned among teams, the architect works at building. Of course, these infrastructure meta-projects are developed along the same workflow and processes as any other project. Therefore, the architect is a sort of SRE-developer and needs to work closely with development teams.
Skills they need: The skills you need depend greatly on the type of infrastructure being developed. In some cases, this is one of the most development-focused SRE roles, so deep knowledge of the organization’s codebase is a must. If focused more on policy and procedure, the architect may not need coding skills, but will still need to understand how their processes will work at the level of development. Either way, this is primarily a technical role, focused on engineering solutions to specific needs. In our discussion, Englund emphasized the idea of SREs existing on a spectrum of sociability: If educators are on the social end, architects can be on the other extreme.
Incident Response Leader
Who they are: Having processes in place to respond effectively and thoroughly to incidents is a major part of SRE. The incident response leader takes responsibility for making your organization as incident-ready as possible.
What they do: The incident response leader plays a role before, during and after incidents. Before incidents, they lead in setting up runbooks, on-call schedules and other tools to help responders. Of course, all of this is done in collaboration with the teams that will be responding.
During incidents, they serve as a procedure expert who ensures teams are working effectively. If there’s disagreement over roles and responsibilities, or over when to escalate, the incident response leader can serve as a point of authority to keep things moving. After the incident, the incident response leader can drive a retrospective. This document gathers the lessons of the incident and serves as a hub for follow-up tasks. The leader makes sure this document is created, reviewed and acted on.
Skills they need: Incident response leaders need strong people skills to understand how people behave while panicking and to empathize with what they’re able to do in that state. They also need infrastructural skills to know how the tools they build will interact with the system, and a strong ability to prioritize based on the bigger picture. In their world, everything is on fire, and they need to distinguish quickly between a big fire and a little fire. This means having a perspective that’s zoomed out to the entire organization while still being able to see the little issues that each incident can bring.
Having a team of specialists can be a challenge, but it leads to opportunities. Of course, you may have SREs who can embody several of these specializations; they aren’t mutually exclusive. It’s often just a tradeoff in where people invest their time. People also have their personal interests, something we can appreciate and lean into. By allowing people to flourish in their skills without losing the robustness of shared knowledge, you’ll build the strongest possible team.