In this episode of The New Stack Makers podcast, we talk with two DigitalOcean alumni and co-chairs of the SREcon 2020 Americas conference who took two very different paths into one of the most sought-after roles in tech: site reliability engineer (SRE). As the name suggests, an SRE is someone focused on the reliability of an organization’s most important systems.
Google coined the term “site reliability engineer” in 2003, but the work itself has existed for decades longer in different forms (disaster recovery and production testing, for example), as engineers have always tried to keep essential services like healthcare and finance online. Demand for SREs grew as organizations went cloud native and needed these engineers to work in production and on operations, with a heavy focus on automation and observability. As systems become increasingly distributed, the role has evolved from simply shoring up uptime for a monolith to acting as a relationship broker who has views into organization-wide systems, a knack for problem-solving, and a love of metrics.
Emil Stolarsky is a front-end turned infrastructure engineer who has built scriptable load balancers for Shopify and an internal Kubernetes platform for DigitalOcean. He is now writing a book on how the enterprise SRE role can be adapted to smaller orgs. Tammy Bütow began with disaster recovery testing in banking over a decade ago, then moved to DigitalOcean to work in incident response, before joining Dropbox in an official SRE role. Finally, in 2017, she joined the chaos-as-a-service company Gremlin as its principal SRE.
For Bütow, an SRE is focused on the reliability and durability of your systems and their data. The role concentrates on the most important parts of those systems, the ones where, when they break, everyone feels the pain: incident management, business management, devs burning out, customer support and the actual customers. Stolarsky added that an SRE treats reliability as a first-class feature that needs special attention, tooling, practices and targets.
Stolarsky pointed to Google’s Service Reliability Hierarchy as a good overview of this role, and as a good visualization of what the guests said is most important: “SREs are people who can work across the company.” That makes for a different culture fit than most engineering roles: an SRE is someone who is good at communication, at prioritization, and at communicating those priorities. But you still need the tech to back up those relationships.
Perhaps you could call site reliability engineering an offshoot of the DevOps movement. It’s definitely an alternative to the usual sysadmin approach to service management that sees development and operations as two distinct teams. SREs straddle both sides of what is a hopefully disappearing barrier, as engineers who spend half their time in operations.
Our guests said the difference is that an SRE is focused on the external value the company can reliably offer customers, while DevOps is more about increasing velocity internally. However, both roles share principles like continuous learning and embracing failure, reducing silos for more transparency and shared responsibility, and automating to accelerate innovation. Both DevOps and SRE are closely tied to business-level objectives.
SRE has in some ways been around since the start of this century, but demand for it is clearly growing. The practice also seems to be democratizing, as more and more people recognize that they are already doing this work. Stolarsky says that’s because an organization of any size can benefit from following SRE best practices and service-level objectives.
Certainly more orgs need more SREs! Listen to this podcast to learn more about SRE best practices and what’s needed to become an SRE yourself.
In this edition:
- 1:58: The difference between SRE at small companies and at large companies.
- 6:58: How to define SRE, and why companies need it.
- 10:14: The differences between SRE and DevOps.
- 14:14: Basic SRE roles that any company needs.
- 17:27: Recommended tooling.
- 23:54: Diversity in SRE, and discussing SREcon Americas 2020.
Tooling Mentioned in this Episode:
- Gremlin chaos engineering
- FireHydrant incident management and postmortems
- DigitalOcean cloud computing at scale
- Datadog monitoring and analytics
- Sentry error tracking
- CircleCI automated CI/CD
- StatusPage customer incident communication
- Circonus monitoring and analytics
- Honeycomb for observability
- LaunchDarkly feature flag and toggle management