5 Ways to Build out an SRE Function and Why It Matters

Consumers of digital services are more demanding than ever. Applications must work seamlessly on demand, with a high degree of reliability. Increasingly, it is reliability that helps organizations differentiate themselves in a crowded market. But how can teams deliver it when the focus is so often on rushing out new features? This is where site reliability engineering (SRE) comes in. Where DevOps teams focus on developing solutions to meet business requirements (“what” needs to be done), SRE teams tackle operational challenges with an end-user perspective (“how” it can be done).
By 2027, 75% of enterprises will use SRE practices to better meet customer expectations through optimized design and operations, according to Gartner. That’s up from just 10% in 2022. Organizations considering this shift must carefully weigh which SRE model is the right fit and which skills will be required to drive success.
Understanding SRE
When Google first introduced the idea of SRE nearly 20 years ago, it was part of an effort to codify practices that would make it easier to deliver the much-touted promise of DevOps. That’s why the tech giant’s core principles related to DevOps and SRE are virtually identical. SREs reduce silos by sharing ownership across developer and production teams. They use incidents as learning opportunities rather than blaming individuals. They work via small, iterative changes, and leverage automation to eliminate manual, repetitive tasks. And they measure everything, focusing on toil and reliability so both customers and software teams are happy.
SREs typically split their time between developing systems and on-call duties with a laser focus on business continuity and delivering services to the end customer. In short, it’s all about proactive engineering rather than reactive development. It’s also a more flexible approach that has become increasingly popular in a hybrid, post-pandemic workplace.
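As a rough illustration of what measuring toil and reliability can look like in practice, the minimal Python sketch below computes two simple figures from a week of data. The `WeeklyReport` fields and the sample numbers are hypothetical and chosen only for illustration; real teams would pull these values from their own monitoring and time-tracking tools.

```python
from dataclasses import dataclass


@dataclass
class WeeklyReport:
    """Hypothetical weekly figures for one service team (illustrative only)."""
    total_requests: int
    failed_requests: int
    engineering_hours: float  # time spent on development and automation
    toil_hours: float         # time spent on manual, repetitive tasks


def availability(report: WeeklyReport) -> float:
    """Fraction of requests that succeeded during the week."""
    if report.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return 1 - report.failed_requests / report.total_requests


def toil_ratio(report: WeeklyReport) -> float:
    """Fraction of all working hours that went to toil."""
    total_hours = report.engineering_hours + report.toil_hours
    return report.toil_hours / total_hours if total_hours else 0.0


# Sample numbers are made up for the example.
report = WeeklyReport(
    total_requests=1_000_000,
    failed_requests=500,
    engineering_hours=120,
    toil_hours=40,
)
print(f"availability: {availability(report):.2%}")
print(f"toil ratio: {toil_ratio(report):.0%}")
```

Tracking both numbers side by side is one way to see whether automation work is actually reducing manual effort without hurting the customer experience.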
Five Ways to Approach SRE
To deliver maximum value for the organization, it’s important to consider first which SRE model(s) to adopt. There are five “flavors” of SRE team:
- Stream-aligned: These are the “you build it, you run it” teams. That means no hand-offs to other teams for any purpose.
- Enabling: These help stream-aligned teams overcome obstacles or spot any missing capabilities.
- Complicated subsystem: These teams provide support when significant mathematics or technical expertise is needed.
- Platform: A group of other team types that provides an internal product/process to accelerate delivery by stream-aligned teams.
- Operation centers: These transform L1 and L2 support teams into an SRE function to help with process optimization and automation.
Enabling or platform teams are the most common ways to build out an SRE function, although at times a complicated subsystem team may also be needed — for example, to manage a cloud transformation project. However, whatever the model, it’s vital to ensure the SRE team has the right skill set.
Team members should have a background or at least an interest in computer science or app development. They must be able to debug, fix and optimize code and troubleshoot issues across applications, networking (TCP/IP) and systems. Other must-have skills include an interest in thinking about large-scale problems with plenty of moving parts, and in diagnosing/fixing problems. Automation, deployment, configuration management, monitoring, analytics and metrics are also critical coverage areas. And any SRE team member must be comfortable with being on call and able to stay calm under pressure.
Driving Best Practice
To truly optimize SRE in any organization, it will be necessary to adopt a service ownership model built around having engineers take responsibility for their own code. This isn’t always easy as it often requires a cultural shift to one where individuals throughout the organization are empowered to make decisions without fear of blame or reprisals. Start out small in low-risk settings to build confidence, perhaps using agile methodologies to stay the course on the road to long-term goals.
Other guiding principles that may help include knowledge sharing to ensure SRE operational expertise is distributed widely, and well-documented operational standards for all teams to follow. Also consider partnering with engineering teams to define a supportable service architecture, and create self-service capabilities to automatically deliver repeatable services on demand. Opinionated defaults and guardrails can propagate service reliability. And a commitment to continuously improve the customer experience for internal development teams across the full life cycle can also drive success.
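To make the idea of opinionated defaults and guardrails more concrete, here is a minimal Python sketch of a self-service request that ships with sensible defaults and is checked against a few limits before anything is provisioned. Every name and threshold in it (`ServiceRequest`, `validate`, the replica limits, the `/healthz` path) is an assumption made for illustration, not part of any specific platform.

```python
from dataclasses import dataclass

# Hypothetical guardrail limits, chosen only for this example.
MIN_REPLICAS = 2
MAX_REPLICAS = 20


@dataclass(frozen=True)
class ServiceRequest:
    """A self-service request; the defaults encode the platform's opinions."""
    name: str
    replicas: int = 3
    health_check_path: str = "/healthz"
    alerting_enabled: bool = True


def validate(request: ServiceRequest) -> list[str]:
    """Return a list of guardrail violations (empty means the request is OK)."""
    problems = []
    if not MIN_REPLICAS <= request.replicas <= MAX_REPLICAS:
        problems.append(
            f"replicas must be between {MIN_REPLICAS} and {MAX_REPLICAS}"
        )
    if not request.alerting_enabled:
        problems.append("alerting must stay enabled")
    if not request.health_check_path.startswith("/"):
        problems.append("health_check_path must start with '/'")
    return problems


# A team that accepts the defaults passes without extra configuration.
print(validate(ServiceRequest(name="checkout")))

# A request that overrides the defaults unsafely is rejected with reasons.
print(validate(ServiceRequest(name="search", replicas=1, alerting_enabled=False)))
```

The design choice here is that the easy path is also the reliable one: teams only hit a guardrail when they deliberately move away from the defaults.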
A Center of Excellence
In many ways, SRE teams create a centralized resource focused not just on the day-to-day but also the bigger picture, something akin to a center of excellence. However, finding the right talent to populate these teams can be a challenge. Hopefully, as training initiatives are ramped up, this will become less of an issue going forward.
Yet as digital transformation efforts continue apace, the demand for SRE teams will only grow. From setting up automation to building strategies for app production support, they sit right at the heart of any long-term IT modernization effort. In helping to optimize the reliability of digital experiences, the value of SRE to the business cannot be overstated.