Our 2023 Site Reliability Engineering Wish List
Is it us, or were we not **just** ringing in the New Year? And now we’ve already wrapped up January 2023? Wuuuuutttt? Where did the time go?
At the end of 2022, we (i.e., Adriana and Ana) made an SRE wishlist for 2023. We were inspired to create this wishlist after our episode on Stephen Townshend’s Slight Reliability podcast, “The Future of SRE,” and our appearance on DevOpsTV’s webinar, “What 2022 Taught Us About SRE’s Future.” As you may have noticed, the future of site reliability engineering (SRE) was top of mind!
These episodes were only the tip of the iceberg. We wanted to explore some of the items that we touched upon in the podcast and the webinar, and add some that we didn’t have a chance to talk about because, honestly, we could talk about this stuff for hours! So without further ado… presenting our 2023 SRE Wishlist!
The Ubiquity of Observability in SRE
Being on call can suck way less when you’re not sitting at your computer all groggy and in a daze at 4 a.m. scratching your head, wondering why your system is acting all weird. Observability can help with that.
Observability is the ability for a system to emit enough information to allow you to follow the breadcrumbs so you can ultimately answer the question: “Why is this happening?” It has the ability to give SREs some serious superpowers, because, when done right, it can make it much easier for an SRE to troubleshoot gnarly production issues.
But we’re not quite there yet. We believe that observability is still seen as a separate part of SRE, when in fact, it should be the foundation of every SRE practice. We want to live in a world where, when we say “SRE,” observability is implied. By making observability a foundation of SRE, we establish an understanding of our system before we add any chaos. As we do this, we lead by example. Ideally this becomes an incentive for building more instrumentation – the data that is sent to observability backends – into libraries and tooling.
Observability as a Team Sport
While observability is a great addition to the SRE’s toolbelt, it should also be viewed as a useful tool for developers and QAs alike. How?
Developers instrument code (with OpenTelemetry) so it emits enough information to a given observability backend, which in turn renders that information in a meaningful way to help them troubleshoot their code in prod. You know, for when they’re on call for their own code.
QAs can then use the observability backend to help them troubleshoot issues with code while they’re testing, enabling them to file more detailed bug reports. They can also leverage instrumentation (more specifically, traces) to create trace-based tests (TBT) using tools like Tracetest, Malabi and Helios. For more, check out Adriana’s blog post on this topic.
SREs leverage application telemetry set up by developers, along with telemetry collected from various pieces of infrastructure, to help them understand a system from the outside and how that system talks internally. This gives them a big picture of the systems for which they are responsible. The observability provided by these signals allows them to troubleshoot production issues, gauge system performance and examine overall system health.
Increased Adoption of OpenTelemetry
Observability is only as good as the telemetry behind it. This, of course, means making sure that your systems are emitting and collecting telemetry. However, it also helps to have a standard around said telemetry. This is where OpenTelemetry (OTel) comes into play. OTel is an open source, vendor-neutral framework for instrumenting, generating, collecting and exporting telemetry data.
Started in 2019, OpenTelemetry has come a long way since Adriana first started playing around with it. In 2021, it became a Cloud Native Computing Foundation (CNCF) incubating project. OTel features support for many popular languages. The specifications for traces and metrics are now stable, and a number of organizations are using it in production.
With the backing of all the major observability vendors, it is the de facto standard for instrumenting applications. What does this mean? As more and more organizations begin to realize the importance of observability, expect to see them use OpenTelemetry to instrument their systems to up their observability (and subsequently SRE) game.
Increased Adoption of Trace-Based Testing (TBT)
The idea behind trace-based testing is simple. Since you’re already creating distributed traces in your application code, why not use trace data to create test assertions to validate your end-to-end system flow? This idea was first presented by Ted Young at KubeCon North America 2018, and now it’s a reality, thanks to trace standardization à la OpenTelemetry and trace-based testing tools like Tracetest, Malabi and Helios.
As more and more companies start to see the value of observability, it means that we’ll see more systems emitting traces, so why not use those already-existing traces and make test cases out of them?
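To make the idea concrete, here is a minimal sketch in Python using only the standard library. The `Span` class, the span names and the attribute values are all hypothetical stand-ins for trace data exported by an instrumented system. Tools like Tracetest have their own test definition formats, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified stand-in for one span in a distributed trace."""
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def find_span(trace: list, name: str) -> Span:
    """Return the first span with the given name, or fail the test."""
    for span in trace:
        if span.name == name:
            return span
    raise AssertionError(f"no span named {name!r} in trace")


def check_checkout_trace(trace: list) -> None:
    """Assert on the end-to-end flow recorded in the trace."""
    # The request must have reached the payment service and succeeded.
    payment = find_span(trace, "payment.charge")
    assert payment.attributes.get("http.status_code") == 200
    # The database write must have happened and taken under 100 ms.
    db_write = find_span(trace, "orders.insert")
    assert db_write.duration_ms < 100


# Made-up trace data, as if captured from a single checkout request.
sample_trace = [
    Span("POST /checkout", 180.0, {"http.status_code": 200}),
    Span("payment.charge", 120.0, {"http.status_code": 200}),
    Span("orders.insert", 35.0),
]
check_checkout_trace(sample_trace)
```

The point is that the assertions target what happened inside the system (which services were called, and how long each step took), not just the response the caller received.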
Developers Taking More Ownership of Code
When Adriana first got into DevOps work, she was excited by its promise to melt the barriers between dev and ops, making this whole “throw your code over the wall” mentality a thing of the past. Instead, many organizations have misused the DevOps name, inserting a whole new layer between dev and ops and calling it DevOps. That’s not really what DevOps is all about.
Sadly, in many cases, developers still throw their code over the wall (a couple of walls now), and SREs (rebranded ops, in many cases) often get stuck troubleshooting production code that they didn’t write. As Adriana stated in a previous article:
“Software engineers need to take ownership of their own code. It’s no longer OK to just throw your code over the wall and be done with it. Taking ownership of one’s code encourages accountability and encourages developers to up the quality game with their code.
“When you know that you’re on the hook for your own code, you’re much more likely to put that little extra bit of care into your code, because guess what? You probably don’t want to be woken up in the middle of the night to deal with a production issue if it can be avoided.”
Choosing Your Own SRE Adventure
One of the recurring themes we saw in “On-Call Me Maybe” (OCMM) Season 1 was that the Google SRE book should be treated as a guide, and not as The Way to do SRE. (See OCMM episode 1 and episode 9.) If there is anything that you should take from the Google SRE book, it’s this:
- Honor DevOps principles by Codifying All The Things (who has time to do things manually, anyway?)
- Improve system reliability through service-level objectives (SLOs — more on that later)
- Reduce toil – the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical – as much as possible
How you get there is up to you. Doing something just because “that’s the way Google does it” is a surefire recipe for disaster and should be avoided at all costs. The best way to have a well-functioning SRE team is to have a team that’s aligned in its goals and ways of working.
SLO, SLO, SLO
Service-level objectives (SLOs) are an SRE’s best friend. SLOs let you answer the question, “What is the reliability goal of this service?” Examples of SLOs include targets for the ratio of successful requests to total requests, for latency (such as under 1,000 milliseconds) and for availability.
It’s not easy to come up with SLOs, and your initial set might not even be that great. But one thing we can say for sure is that the more you SLO, the better you and your team will be at it.
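As a rough illustration, the sketch below measures two of these (request success ratio and latency under 1,000 milliseconds) and compares each against a target. The targets (99.9% and 95%) and all of the sample numbers are made up for the example, and treating "no traffic" as fully healthy is an assumption you may not want to share.

```python
def success_ratio(successful: int, total: int) -> float:
    """Number of successful requests over total requests."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully healthy
    return successful / total


def fraction_under(latencies_ms: list, threshold_ms: float) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not latencies_ms:
        return 1.0  # same assumption as above
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return fast / len(latencies_ms)


def slo_met(measured: float, target: float) -> bool:
    """An SLO is met when the measured value reaches the target."""
    return measured >= target


# Hypothetical measurements for one service over some time window.
requests_ok, requests_total = 999_600, 1_000_000
latencies = [120, 250, 980, 1500, 300, 90, 640, 410, 2200, 75]

availability_ok = slo_met(success_ratio(requests_ok, requests_total), 0.999)
latency_ok = slo_met(fraction_under(latencies, 1000), 0.95)
print(f"success-ratio SLO met: {availability_ok}")
print(f"latency SLO met: {latency_ok}")
```

With these sample numbers the success-ratio objective is met, while the latency objective is not, because only 8 of the 10 sampled requests finished in under 1,000 milliseconds.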
SREs should use SLOs to drive system alerts. This ensures that your alerts are meaningful, and it means that your on-call SREs don’t jump at every little system hiccup that occurs.
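One simple way to picture this is an alert that looks at the success ratio over a sliding window of recent requests, rather than firing on every individual error. The sketch below is a toy version of that idea: the `SloAlert` class is hypothetical, and the 90% target and 10-request window are deliberately tiny so the behavior is easy to follow. Real alerting systems are considerably more sophisticated.

```python
from collections import deque


class SloAlert:
    """Fire only when the success ratio over a sliding window drops below target."""

    def __init__(self, target: float, window: int):
        self.target = target
        self.outcomes = deque(maxlen=window)  # True means the request succeeded

    def record(self, success: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.outcomes.append(success)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data to judge yet
        ratio = sum(self.outcomes) / len(self.outcomes)
        return ratio < self.target


alert = SloAlert(target=0.9, window=10)

# A single hiccup among ten requests keeps the ratio at the target: no page.
hiccup = [alert.record(ok) for ok in [True] * 9 + [False]]

# A second failure right after pushes the window below target: page someone.
sustained = alert.record(False)
print(f"paged on single hiccup: {any(hiccup)}, paged on repeat failure: {sustained}")
```

The single failed request does not wake anyone up. It is only when failures eat into the objective that the alert fires.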
And in keeping with the SRE principle of “codifying all the things,” there is a movement to codify SLOs, thanks to projects like OpenSLO. We’d love to see greater adoption of OpenSLO as a standard for defining and codifying SLOs, and for SRE teams to integrate that into their workflows.
Greater Adoption of Policy-as-Code
“Codification of all the things” is an important SRE principle. It ensures consistency, maintainability, reproducibility and reliability. We already codify our build and deployment pipelines, our infrastructure and even observability. So naturally, it makes sense to codify our policies.
Given how much of our lives has shifted into the digital realm, there are more and more ways for nefarious characters to wreak havoc. Wouldn’t it give you peace of mind knowing that policies are standardized (through code!), have a source of truth (version control!) and aren’t managed through a shared doc on the InfoSec team’s Google Drive?
Fortunately, it is possible to codify policies, thanks to the likes of Open Policy Agent (OPA). We’re hoping that in 2023 we’ll see more policy-as-code in action and, to take it even further, see these policies implemented through self-serve tooling (more on that later).
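To show the general shape of policy-as-code, here is a small Python sketch. It is not OPA, which has its own policy language (Rego). It only illustrates the underlying idea: rules live in a file under version control and are evaluated automatically against a request. The rules, field names and request values are all hypothetical.

```python
# Hypothetical policy rules, kept as code so they can live in version control.
# Each entry pairs a human-readable description with a check on the request.
POLICIES = [
    (
        "namespace must have an owner label",
        lambda req: bool(req.get("labels", {}).get("owner")),
    ),
    (
        "CPU request must not exceed 8 cores",
        lambda req: req.get("cpu_cores", 0) <= 8,
    ),
]


def evaluate(request: dict) -> list:
    """Return the description of every policy that the request violates."""
    return [description for description, rule in POLICIES if not rule(request)]


# Made-up provisioning requests.
good_request = {"name": "team-a-dev", "labels": {"owner": "team-a"}, "cpu_cores": 4}
bad_request = {"name": "scratch", "labels": {}, "cpu_cores": 32}

print(evaluate(good_request))
print(evaluate(bad_request))
```

Because the rules are plain code, a change to a policy goes through the same review and history as any other change, and the same check can run automatically whenever a request comes in.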
Sharing Common Tasks and Flows
One of the best ways for modern developers to share code is through shared libraries and frameworks. It’s great because you don’t have to reinvent the wheel, since someone else has already solved that problem for you. Why can’t we do more of that in the SRE realm?
Although SRE roles across companies have their own unique flavor, many organizations do have a number of SRE tasks and flows in common. These may include: Kubernetes cluster admin and maintenance tasks, GitOps flows with Argo CD or Flux CD or <insert_your_favorite_GitOps_tool>, infrastructure automation, defining SLOs and SLO-triggered workflows.
Wouldn’t it be nice if there were a way to share common codified tasks and/or workflows across teams? Or, better yet, if we could share common scripts and workflows across organizations?
There are already some tools on the market that tackle different aspects of this problem, including RunWhen, Kratix (packaging/sharing APIs-as-a-service), Temporal.io and Rundeck. We’d love to see tools like these become more widely used by organizations.
The Rise of Self-Serve Provisioning
One thing we absolutely hate is having to rely on a team to provision things for us. Which is why we’re super happy to see a movement toward self-serve provisioning coming from platform teams. Again, this falls in with our SRE theme of codifying all the things, and really makes perfect sense. Think about it: The SRE team has automated provisioning tasks, but they still have to manually trigger these provisioning tasks to fulfill user requests. Ugh. This is honestly a waste of an SRE team’s time. They could be doing other, cooler things with their time, like improving system reliability.
As an added bonus, implementing self-serve provisioning alongside policy-as-code helps to ensure that teams don’t do the things that they’re not supposed to do.
Some of the technologies that are helping to make this possible include Kratix (packaging/sharing APIs-as-a-service), Loft Labs’ self-service for Kubernetes Namespaces and RunWhen (workflow sharing and automation).
Embracing Failure and Learning From It
Things fail, whether we like it or not. It’s how we recover from failure that really matters. As developers, our collective hearts sink when we see the same nasty error popping up over and over and over again, in spite of trying all sorts of different things to get the code to work.
Although it’s frustrating, it’s important to force yourself to take a step back and view this whole thing as an experiment. Yes, trying X failed, but look at all the other things that were learned as a result! For example, we now know that X is not a part of the solution.
At the end of the day, we end up learning more about the thing that we’re trying to do, and in the process, we also learn how to troubleshoot better. This, in turn, makes us better developers.
The same thing can be said for system failures. Yes, they suck. A lot. But failures give us a wealth of information. They can tell us that there are problems with our processes, designs, code and even hardware. All of these are opportunities for improvement. But we can’t allow ourselves to get hung up on the fact that something failed. Instead, we must focus on why it failed, so that we can make our systems better and more resilient.
Shifting our perception of failure from something negative to something positive helps to improve the psychological safety of those affected by the failure. We’re sure that many folks would be a lot more willing to have open and honest conversations about failure, and how to recover from it, if they knew that they could talk about failure without any repercussions.
We shouldn’t run away from failure. We should seek failure, embrace it and practice it. Only then will we get good at dealing with failure.
Let’s normalize positive conversations around failure. For more on this, check out this great conversation that we had with Nora Jones, CEO of Jeli.io.
SRE has come a long way since its early days, and there are some exciting things ahead in 2023!
We’re excited to see continued growth in observability as more organizations standardize on instrumentation with OpenTelemetry. Instrumented data is fed into observability backends, which helps SREs troubleshoot those gnarly system problems. And as more organizations start using OpenTelemetry, trace-based testing will be a big one to look out for in 2023, as it moves to improve the quality and reliability of our systems by bridging the gaps between dev, QA and SRE.
We’re also excited to see SRE really honor its core principle of codifying all the things, through policy-as-code, codified SLOs, self-serve provisioning, and workflow sharing and automation.
And finally, we’re excited to see our industry continue to make progress on prioritizing mental health and wellness, to ensure that our SREs can be at their best.
As we settle into 2023, may SREs continue to drive innovation in tech through collaboration, thoughtfulness and empathy.