The Role of Site Reliability Engineering, Today and Tomorrow

Oracle’s Vision of the Role of an SRE Today and Tomorrow
The role of site reliability engineering (SRE) in DevOps has become more crucial as software development covers more ground than in the past. Since becoming part of DevOps, SREs have been required to play an active part throughout the production pipeline as well as during the post-deployment stage of software and application rollouts. The SRE’s responsibility becomes even more critical when cloud native platforms and the porting of legacy systems to the cloud are added to the mix, or when applications are deployed to on-premises and cloud environments simultaneously.
Three SRE and engineering thought leaders from software giant and services provider Oracle had a lot to say about the SRE’s role and the DevOps tools at SREs’ disposal during this edition of The New Stack Makers podcast, hosted by Libby Clark, editorial director of The New Stack, at the KubeCon + CloudNativeCon conference held in Barcelona at the end of May. The podcast’s guests from Oracle were:
- Dr. Jonathan Reeve, Oracle senior director, product management;
- Mickey Boxell, Oracle product manager, SRE;
- Timothy J. Fontaine, Oracle software engineer and consulting member of technical staff.
While the days are gone when a single “operations guy” or team managed a data center, an SRE can still get that dreaded call or page at 3:00 AM when an application crashes. At the same time, many of those operations-like tasks should become automated in the near future, so SREs can spend more time on what they usually like to do best: working directly with code development.
Taking a step back, Boxell offered some context about the observability and other tools that are helpful for an SRE role, such as tools that offer insight into logs, trace data, and basic metrics. He said his work involves actively determining best practices and a “sort of best of breed open source solutions” that his team builds on top of Oracle’s Kubernetes platforms.
“I’ve basically been looking through the [Cloud Native Computing Foundation] portfolio, seeing which projects are the ones that are most frequently used by people, getting feedback from our users about which projects they’re using and then just going through the process of implementing them,” Boxell said. “We also make sure everything checks out and works properly, and then we try to come up with the scenarios in which people might actually use them in production.”
As an engineer, Fontaine said what he looks for in observability of a system is “understanding some kind of balance between how much information I want.”
“I always want as much data as possible whenever something’s going wrong. But that’s not always feasible from the environment because there is a cost associated with storing all that information,” Fontaine said. “So, finding the right balance for the tools and being able to figure out, how much can I store in which area [is key]. And then there are other platform services that I can leverage to make it more affordable, but I still have the visibility that I want. So, usually, when I’m building services for myself, or advising other teams on what they’re trying to build, I look for solutions that allow you to do cardinality and lower retention that you can run in situ, on your cluster in your environment, and then leveraging platform services for longer retentions, like maybe I’ll roll out my logs or my telemetry and object storage or something along those lines.”
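The split Fontaine describes, short retention inside the cluster and longer retention on a platform service, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `LogRecord` type, the seven-day `LOCAL_RETENTION` window, and the `split_by_retention` helper are hypothetical and do not come from the podcast or from any Oracle tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical window: how long records stay in the in-cluster store.
LOCAL_RETENTION = timedelta(days=7)


@dataclass
class LogRecord:
    timestamp: datetime
    message: str


def split_by_retention(records, now, local_retention=LOCAL_RETENTION):
    """Return (keep_local, archive) lists based on each record's age.

    Records no older than local_retention stay in the cluster; older
    ones are candidates for cheaper long-term storage such as an
    object store.
    """
    keep_local, archive = [], []
    for record in records:
        if now - record.timestamp <= local_retention:
            keep_local.append(record)
        else:
            archive.append(record)
    return keep_local, archive
```

In practice the window would be tuned against the storage cost Fontaine mentions: a shorter local window lowers the in-cluster footprint, at the price of slower access to older data.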
Automation will also continue to play an ever-increasing role by, among other things, taking more mundane tasks off an SRE’s plate in the production pipeline as well as in operations. Reeve described how more toolsets are now available for, as one example, potentially automating the remediation of issues.
There is more discussion now about “automating the pipelines and the testing, to try and prevent problems up front. I think it is interesting with things like Istio and service mesh tracing we have a kind of an interesting source of instrumentation now. So again, it’s all about the application, right?” Reeve said. “And we can understand which components of the application are talking to one another, we understand which containers are actually part of this application. So, I think we’ve got an interesting set of data now and relationships to have another level of abstraction where we can potentially drive more automation.”
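Reeve’s point that trace data reveals which components talk to one another can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the span dictionaries and their `span_id`, `parent_id`, and `service` keys are a simplified, hypothetical shape, not the format emitted by Istio or any particular tracing backend.

```python
from collections import defaultdict


def service_dependencies(spans):
    """Map each calling service to the set of services it calls.

    Each span is a dict with hypothetical keys: "span_id", "parent_id"
    (None for a root span), and "service".
    """
    service_of = {span["span_id"]: span["service"] for span in spans}
    graph = defaultdict(set)
    for span in spans:
        parent = span.get("parent_id")
        if parent is None or parent not in service_of:
            continue  # root span, or parent missing from this batch
        caller, callee = service_of[parent], span["service"]
        if caller != callee:  # ignore calls within a single service
            graph[caller].add(callee)
    return dict(graph)


spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "cart"},
    {"span_id": "c", "parent_id": "b", "service": "db"},
    {"span_id": "d", "parent_id": "b", "service": "cart"},
]
deps = service_dependencies(spans)
```

A relationship map of this kind is one example of the “level of abstraction” on which automated checks or remediation could potentially be built.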
In this Edition:
1:01: The SRE Workbench.
6:39: The current tool stack.
11:37: Istio for debugging, an experiment.
14:05: Automation.
18:28: The SRE of the future.
23:34: Customer response.