The Changing Role of an SRE in an Event-Driven World
PagerDuty sponsored this podcast.
The role of the site reliability engineer (SRE) has emerged as a key component in new stack development. One part developer, another part operations admin, the SRE is a key player in DevOps teams, tasked with creating applications, following the process through the entire production cycle and then debugging and troubleshooting when things go awry.
Abdullah Siddiqui, an SRE at cloud accounting software provider Xero, described how the dynamics of what he does continue to evolve in this latest episode of The New Stack Makers podcast, hosted by Alex Williams, founder and editor-in-chief of The New Stack, and recorded during the PagerDuty Summit in San Francisco last week.
At Xero, the SRE team exists largely to ensure product development teams have the support they need.
“Our purpose at Xero as SREs is to help empower our product teams to be able to own their own products and production. So, we build the tools, create the practices, create the processes that help enable our product teams to do that and really own their products and production,” Siddiqui said. “[We] ensure that we can provide the best tooling possible for our product teams to ensure that they can focus on always be releasing and always be adding value to our customers. So, if they have the ability to release constantly and have a safe production environment as a result of the tools and processes that we’ve created, then we can help ensure that there’s always value going out.”
Before Siddiqui joined Xero’s SRE team, Xero relied more on different operations teams, including separate performance-monitoring teams, to achieve the results it gets with its SREs today. “We kind of came together to grow into an SRE team out of those various teams and with a new vision in mind, of course,” Siddiqui said. “So, as a result of that, we did start off having a different global organization or different global teams and then from there on we just moved on with that model.”
On a day-to-day basis, Siddiqui spends much of his time developing things like chat bots that manage incidents and using a discovery framework of microservices for better visibility into product development. “[I mostly do] development work really. It’s really rare for me to kind of have to take my time out to do things like operational work,” Siddiqui said. “I often go on call, but most of the time, I’m developing internal products for our product teams to use.”
Siddiqui joined Xero just over a year ago, following a stint as a software consultant at IBM. He said his SRE team has already seen many positive changes in that short time.
“We’ve matured a lot in terms of the way we do things like manage incidents, the way we do alerting, the way we do monitoring and the way we consult product teams [and determine] what best practices that they can follow. So, in terms of scaling, we’ve been able to reduce the number of incidents, the length of incidents, the time of resolutions for incidents and how quickly we react to problems in our production environment,” Siddiqui said. “That’s kind of the SRE role where my role comes in.”
PagerDuty tools have been a big help, Siddiqui said. Xero’s SRE team, for example, has used PagerDuty tools to develop a caller’s code management systems, incident measurement chat bots and a microservice to analyze alerts. “All of these things have helped to empower our DevOps process as we go, as we try to help our product teams,” Siddiqui said.
In this Edition:
3:44: The role of an SRE is a new one. How do you define your role?
6:49: What are you measuring then? What are the data points you’re looking for?
10:17: When you think about DevOps, how are you separating the noise, and what are some of the challenges you’ve started to see come from that?
11:32: How Xero is using PagerDuty in a global environment
14:30: Levels of complexity and technical debt
17:38: With the best practices you’ve been learning, is that changing your view on how you think about using distributed architectures?