
Today’s Site Reliability Engineers Are More Healers Than Fire Fighters

2 Mar 2021 9:02am

The image of the site reliability engineer (SRE) running around with a pager, always on call to put out proverbial data center fires in the middle of the night, is becoming a thing of the past. SREs’ roles vary with an organization’s size and what it does, and the super-charged, always-on-call image is not only inaccurate but also misleading about how SREs can best support DevOps teams.

Instead, the SRE’s role can and should be compared to that of a healer or medic, and not to a firefighter or soldier, said noted systems engineer Alice Goldfuss during The New Stack’s recent “StackPulse Goes Live: How to Build an SRE Superhero” video streaming podcast.

“Especially in the ops tradition and the sysadmin tradition that then became the SRE tradition, there is a lot of that military hyper-masculine self-sacrificing that not only leads to burnout but also just crowds out a lot of people in the space,” she said.

“Whereas, I do believe the SRE should be seen as a very highly skilled support role like a medic, or like the healer on the team, so to speak,” she said.

The type of role that best describes the emerging SRE was one of a number of assumptions examined during this lively discussion. The talk described common misperceptions about what SREs really do and how they can best support the organization.

To a great extent, the size of the organization will determine the SRE’s range of responsibilities. The term SRE thus means “a lot of things outside of the official Google context, because the Google SRE only really works for Google-sized companies or perhaps just Google,” Goldfuss said.

“Whereas, sorry, in other companies of various sizes and technical shapes the SRE can mean someone who keeps the databases online, someone who builds reliability tooling for engineers and someone who manages the incidents,” Goldfuss said. “And yes, it can be a role of many hats, or what have you been working on under the SRE title.”

The Shift to a Blameless Culture

People don’t fail, teams do. In the past, some might have contended that because systems were designed not to fail, it was the SRE’s responsibility to ensure that systems never crashed. Today, when things do go wrong, the SRE is seen less as a lightning rod for blame and more, as mentioned above, as the medic who helps teams fix things.

In what should be a “blameless culture,” teams know that “just everything that can fail will eventually fail,” whereas the SRE was previously expected to make sure everything “just stayed up,” said Or Elimelech, StackPulse site reliability engineer lead, during this discussion.

“In the previous generation of system admins, if something crashed, everyone was screaming — it was like a war,” Elimelech said. “You weren’t prepared for these scenarios before, because you’d expect that everything just stayed up.”

For SRE culture today, “we’re saying, ‘if we fail, let’s learn from it and make sure the mistakes don’t happen again,’” Elimelech said. This instilled culture helps to foster an environment in which “people learn,” instead of wrongly just blaming the SRE, Elimelech said.

Building a Foundation for DevOps

The SRE’s support of the DevOps team can be compared to building a foundation, Elimelech said. Like the developers, Elimelech said he spends much of his time writing code, but his mission as the SRE is to write and deploy code that will help make the developer’s work life easier.

Instead of writing the business logic, Elimelech develops libraries and selects frameworks to help developers write more resilient code with “stricter APIs” so that the “developer experience is first,” Elimelech said. Comparing himself to a therapist in addition to a medic, he said the idea is to remove “everything off the developers’ desk to let them focus on the business logic.”

As an example, a developer should be able to just “type whatever they need in order to achieve something in the infrastructure in a sane way,” Elimelech said. Working in the background, the SRE supports this need by developing, for example, an automated process that configures a database “in a sanitized way, instead of giving a developer access to the production DB,” Elimelech said.
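To make the idea concrete, here is a minimal sketch of what such a guardrail might look like. It is not taken from the discussion or from any StackPulse tooling: the names `DatabaseRequest` and `build_database_config`, the allowed engines and the storage limit are all hypothetical, chosen only for illustration. The developer describes what is needed, and the automation validates the request and produces a configuration, without the developer ever touching the production database.

```python
import re
from dataclasses import dataclass

# Hypothetical guardrails; a real platform team would pick its own values.
ALLOWED_ENGINES = {"mysql", "postgres"}
MAX_STORAGE_GB = 100
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]{2,30}")


@dataclass(frozen=True)
class DatabaseRequest:
    """What a developer asks for; no production credentials are involved."""
    service: str
    engine: str
    storage_gb: int


def build_database_config(request: DatabaseRequest) -> dict:
    """Validate a request and return a config for automation to apply."""
    if not NAME_PATTERN.fullmatch(request.service):
        raise ValueError(f"invalid service name: {request.service!r}")
    if request.engine not in ALLOWED_ENGINES:
        raise ValueError(f"unsupported engine: {request.engine!r}")
    if not 1 <= request.storage_gb <= MAX_STORAGE_GB:
        raise ValueError(f"storage must be between 1 and {MAX_STORAGE_GB} GB")
    return {
        "database": f"{request.service}_db",
        "engine": request.engine,
        "storage_gb": request.storage_gb,
        # Hypothetical naming scheme for a per-service, non-admin role.
        "owner_role": f"{request.service}_app",
    }


if __name__ == "__main__":
    print(build_database_config(DatabaseRequest("orders", "mysql", 20)))
```

In a setup like this, a rejected request fails early with a readable error, which is one way the SRE can keep the developer experience first while still protecting production.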

Empathy First

It is easy to become enamored with a new technology without thinking through what its adoption might mean for the end user or, in the case of the SRE, how a new tool or platform might affect the organization’s DevOps team members. At the same time, Goldfuss noted that it is easy to understand how, when “creating a logging system that only uses TCP packets, you could very well get back pressure on the system that takes something else down.”

“But I also think empathy and how systems connections work are also needed for humans, and how humans interact with each other,” Goldfuss said. “And, in my experience, the best SREs are the ones who live in a world where they understand the consequences of their actions on others, and this is the person who when you’re shipping something they’re going to be saying: ‘not only how do these systems interact, but also who are the stakeholders, who do we need to talk to?’”

New technologies can be highly beneficial, of course, but empathy needs to be the first consideration. “There are so many new technologies out there, and you’ll learn them and get there yourself but it really starts with empathy,” Goldfuss said.

Elimelech noted, for example, how his role as an SRE involves automating as much as possible to allow the software engineers to remain “focused on the product itself.” “You seek to remove obstacles that the developers might have,” he said, while extending that support to help a developer resolve an issue with a service, such as improving the infrastructure to better enable connections to a MySQL database.

“There’s no pressure on the developer to solve it on their own…so you can escalate stuff to the SRE team,” Elimelech added. “Empathy, as Alice just mentioned, is everything and is a key point here.”

As a case in point, the question was asked during the livestream whether service mesh has become an essential SRE tool. While increasingly seen by many organizations as an essential way to manage Kubernetes environments, service mesh is not necessarily applicable to an SRE’s needs, nor to the needs of every organization.

Indeed, a service mesh can offer better performance, load balancing, testing “and stuff like that, and if that is something that you need and you can’t get it anywhere else from a more mature tool that has been out for years, then yes, do a service mesh,” Elimelech said. “Just make sure that you’re evaluating it based on what the tool is actually giving you and that it’s actually adding to the stability and usability of your infrastructure, rather than the fact that it’s shiny.”

The full recorded livestream, hosted by The New Stack Publisher and founder Alex Williams, can be enjoyed here:
