Demystifying Service-Level Objectives for You and Me
You might know what service-level objectives (SLOs) are at a high level. You might know that they’re important, especially if you’re a site reliability engineer (SRE). But what is the big deal about them? Why are they so important for SRE practitioners? What are some good SLO practices?
Well, my friend, you’ve come to the right place! Today, all of your burning questions about SLOs will be answered.
Why Do We Need SLOs?
When Adriana graduated from university in 2001, having internet access on your mobile phone was barely a thing (shoutout to flip phones — hey-o). The Google search engine had been launched only three years earlier, and we were just starting to go from 3.5-inch floppy disks to USB flash drives for portable personal storage. Monolithic apps were a thing. Java was the hot language. Cloud? What cloud?
As the technology industry continues to evolve, the way we build applications has become more complex. Our applications have many moving parts that are all choreographed to work beautifully in unison, and for this to happen, they must be reliable.
SLOs are all about reliability. A service is said to be reliable if it is doing what its users need it to do. For example, suppose that we have a shopping cart service. When a user adds an item to the shopping cart, they expect that the item is added on top of any other items already in the shopping cart. If, instead, the shopping cart service adds the new item and removes prior cart items, then the service is not considered to be reliable. Sure, the service is running. It may even be performing well, but it’s not doing what users expect it to do.
SLOs help to keep us honest and ensure that everyone is on the same page. By keeping reliability in mind and, by extension, prioritizing user impact/experience, SLOs help to ensure that our systems work in a way that meets user expectations. And if an SLO isn’t met, it sets off alerts to notify engineers that it’s time to dig deep into what’s going on. More on that later.
What the Heck Is an SLO?
First, let’s get back to basics and define SLOs. But before we can talk about SLOs, we need to define a few terms: SLAs, error budgets and SLIs. I promise you’ll see why very shortly!
An SLA, or service-level agreement, is a contract between a service provider and a service user (customer). It specifies which products or services are to be delivered. It also sets customer support expectations and monetary compensation in case the SLA is breached.
An error budget is your wiggle room for failure. It answers the question, “What is your failure tolerance threshold for this service or group of services on the critical path?” It is calculated as 100% minus your SLO, which is itself expressed as a percentage over a period of time, as we’ll see later. When you exhaust your error budget, it means that you should focus your attention on improving system reliability, rather than deploying new features. Error budgets also create room for innovation and experimentation. We will talk about that later when we discuss chaos engineering.
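To make the arithmetic concrete, here is a minimal Python sketch of the “100% minus your SLO” calculation. The function names, and the choice to count the budget in failed events, are assumptions made for illustration; they do not come from any particular tool.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Return the error budget: 100% minus the SLO target."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    return 100.0 - slo_percent


def allowed_bad_events(slo_percent: float, total_events: int) -> float:
    """Number of events that may fail in the period without breaching the SLO."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return total_events * error_budget_percent(slo_percent) / 100.0


def budget_remaining(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    return 1.0 - bad_events / allowed_bad_events(slo_percent, total_events)
```

With a 95% SLO and 1,000,000 events in the period, the budget is 5%, or 50,000 failed events. After 25,000 failures, half of the budget is left.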
A service-level indicator (SLI) is nothing more than a metric. That is, it’s a thing that you measure. More specifically, it’s a two-dimensional metric that answers the question, “What should I be measuring and observing?”
Examples of SLIs include:
- Number of good events / total number of events (success rate).
- Number of requests completed successfully in 100 milliseconds / total number of requests.
And this is important because we need SLIs for SLOs.
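As a rough illustration, both example SLIs can be computed from a list of request records. The `Request` type, its field names, and the reading of “in 100 milliseconds” as “at most 100 ms” are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    succeeded: bool      # did the request complete successfully?
    latency_ms: float    # how long it took, in milliseconds


def success_rate_sli(requests: Sequence[Request]) -> float:
    """Good events / total events."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.succeeded)
    return good / len(requests)


def fast_success_sli(requests: Sequence[Request], threshold_ms: float = 100.0) -> float:
    """Requests that succeeded within threshold_ms / total requests."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.succeeded and r.latency_ms <= threshold_ms)
    return good / len(requests)
```

Note that a slow success counts as “good” for the first SLI but not for the second, which is why the same traffic can produce two different numbers.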
SLIs are the building blocks of SLOs. SLOs help us answer the question, “What is the reliability goal of this service?” SLOs are made up of a target and a window. The target is usually a percentage, based on your SLI. The window represents the time period that the target applies to.
With that information in hand, let’s go back to our SLIs from before and see if we can use them to create SLOs.
Our first SLI was the number of good events/total number of events (success rate).
This means that our SLO could be something like: 95% success rate over a rolling 28-day period.
You might be wondering where a 28-day period comes from. More on that later.
Our second SLI was the number of requests completed successfully in 100ms/total number of requests.
This means that our SLO could be something like: 98% of our requests completed successfully in 100ms out of our total requests over a rolling 28-day period.
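Here is a small Python sketch of an SLO as a target plus the time period it applies to, evaluated over a rolling period. The `SLO` and `Event` types and the `now` parameter are hypothetical names chosen for this example; real tooling would read events from your telemetry backend.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class SLO:
    name: str
    target_percent: float   # e.g. 95.0
    window: timedelta       # e.g. timedelta(days=28)


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    good: bool


def sli_over_window(events: Sequence[Event], slo: SLO, now: datetime) -> float:
    """Percentage of good events among those inside the rolling window ending at now."""
    start = now - slo.window
    in_window = [e for e in events if start < e.timestamp <= now]
    if not in_window:
        raise ValueError("no events inside the window")
    good = sum(1 for e in in_window if e.good)
    return 100.0 * good / len(in_window)


def is_met(events: Sequence[Event], slo: SLO, now: datetime) -> bool:
    """True when the measured SLI is at or above the target."""
    return sli_over_window(events, slo, now) >= slo.target_percent
```

Events older than the window simply drop out of the calculation, which is what makes the period “rolling.”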
Cool. That wasn’t too bad, right? Now that we know what SLOs are and why they’re important, what are some guidelines for defining them?
1. Don’t set your SLOs in stone.
When defining SLOs, the most important thing to keep in mind is that if you don’t get SLOs right the first time, you keep iterating on them until you get them right. And then you have to iterate on them some more, because every time you make changes to your systems, you need to re-evaluate your SLOs to ensure that they still make sense.
Remember: They are living, breathing things that require iteration.
SLOs should always be revisited after an outage. This allows you to see if the SLO caught the incident (for instance, was the SLO breached?) or if an SLO is missing for your application’s critical path. If the SLO was not breached, then it should be adjusted so that it does catch the incident next time. As you adjust your SLO, you’ll want to think about how the SLO change affects your team. For example, would your SRE team have gotten paged more under the previous SLO, compared to the revised version, and would those extra pages have been false alarms?
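One hedged way to explore that paging question is to replay historical SLI values against both the old and the revised target and count how often each would have been breached. The helper name and the daily values in the usage note are invented for illustration.

```python
from typing import Sequence


def count_breaches(sli_history_percent: Sequence[float], target_percent: float) -> int:
    """Count the measurements that fall below the target."""
    return sum(1 for value in sli_history_percent if value < target_percent)
```

For a made-up history of `[99.5, 97.0, 94.0, 99.9, 96.5]`, a 95% target would have been breached once and a 98% target three times. Whether those extra two pages were false alarms is a judgment call that the numbers alone cannot make.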
2. Speak the same (time) language.
Remember our 28-day SLO period from earlier? What’s the deal with that? Why not use 14 days or 30 days? Using 28 days standardizes the SLO over a four-week period that you can compare month to month. This allows you to see if you’re drifting into failure as you release your features into production. In addition, this sets a good example as a proper SLO practice and standard to follow within your SRE organization.
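One way to see what a four-week period buys you, offered here as an interpretation rather than something the article spells out: a 28-day window always contains each weekday exactly four times, while a 30-day window does not. The sketch below checks this with Python’s standard library.

```python
from collections import Counter
from datetime import date, timedelta


def weekday_counts(start: date, days: int) -> Counter:
    """Count how often each weekday (0 = Monday .. 6 = Sunday) falls in the window."""
    return Counter((start + timedelta(days=i)).weekday() for i in range(days))
```

Any 28-day window yields a count of four for all seven weekdays. A 30-day window gives two of the weekdays a fifth occurrence, so a window that happens to include an extra busy day is not directly comparable to one that does not.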
3. Make SLOs actionable.
SLOs are no good to us if we don’t do something with them. SLOs need to be actionable. Suppose that you have the following SLO: 95% of service requests will respond in less than 4 seconds.
Suddenly, you notice that only 90% of service requests are responding in less than 4 seconds. That means the SLO has been breached and you should do something about it. Normally, a breach triggers an alert to an on-call team so that it can look into why the SLO was breached.
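The following is a minimal sketch of that check in Python. The function name and the `notify` callback are hypothetical stand-ins for whatever paging or alerting integration you actually use.

```python
from typing import Callable, Sequence


def check_latency_slo(
    latencies_s: Sequence[float],
    notify: Callable[[str], None],
    target_percent: float = 95.0,
    threshold_s: float = 4.0,
) -> bool:
    """Return True if the SLO is met; otherwise call notify and return False."""
    if not latencies_s:
        raise ValueError("no requests to evaluate")
    # Count requests that responded in less than the threshold.
    fast = sum(1 for s in latencies_s if s < threshold_s)
    observed = 100.0 * fast / len(latencies_s)
    if observed < target_percent:
        notify(
            f"SLO breached: {observed:.1f}% of requests under {threshold_s}s "
            f"(target {target_percent}%)"
        )
        return False
    return True
```

Passing `alerts.append` as `notify` with nine 1-second responses and one 6-second response records a single alert reporting 90.0%, matching the scenario above.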
4. Ditch the wall o’ dashboards in favor of SLOs.
Traditional monitoring is driven by two things: querying and dashboarding. When something breaks, engineers run queries against these dashboards to figure out why The Thing is breaking. Yuck.
What if, instead of using metrics from dashboards, you used SLO-based alerting? This means that if your SLO is breached, your system fires off an alert to your on-call engineers. By using this approach, it’s like your SLOs are telling you, “Hey, you! There’s something wrong, customers are impacted, and here’s where you should start looking.” Oh, and if you tie your SLOs to your observability data (telemetry signals such as traces, metrics and logs), and you should, the data serves as breadcrumbs for your engineers to follow to answer the question, “Why is this happening?”
Now you’ve gone from digging for a needle in a haystack to doing a more direct search. You don’t need a whole lot of domain knowledge to figure out what’s up. Which means that you don’t always have to rely on your super senior folks who hold all the domain knowledge to troubleshoot. Yes, you can lean more on the junior folk.
5. Make your SLOs customer-facing.
SLOs should be customer-facing (close to customer impact). This means that rather than tie your SLOs directly to an API, you should tie them to the API’s client instead. Why? Because if the API is down, you have nothing to measure and therefore you have no SLOs.
Another way to look at this is to look back at our definition of reliability: “A service is said to be reliable if it is doing what its users need it to do.”
Since users are at the heart of reliability, defining your SLOs as close to your users as possible is the logical choice.
6. SLOs should be independent of root cause.
When writing SLOs, you should never care how or why your SLO was breached. The SLO is just a warning. A canary in the coal mine, if you will. An SLO tells you that your reliability threshold isn’t met, but not why or how. The why and how are explained by the observability data emitted by your application and infrastructure.
7. Treat SLOs as a team sport.
It might surprise you to hear that building SLOs and learning about them isn’t the hard part. Figuring out what you want to measure is the hard part, and that’s where collaboration comes in. Having SRE teams work with developers and other stakeholders across the company (and yes, that means the business stakeholders, too) to understand what’s important to the folks using their systems helps drive the creation of meaningful SLOs. But how do you do that?
This is where observability data (telemetry) comes into play. SREs can use observability data to understand how users interact with their systems, and in doing so they can define those meaningful SLOs.
Also, I hope that you’ve noticed a recurring theme here, of SLOs and observability going hand in hand. Just sayin’ … 😎
8. Codify your SLOs.
In keeping with the SRE principle of “codifying all the things,” there is a movement to codify SLOs, thanks to projects like OpenSLO. Some observability vendors even allow for the codification of SLOs through Terraform providers. My hope is that we see greater adoption of OpenSLO as a standard for defining and codifying SLOs, and for SRE teams to integrate that into their workflows.
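To give a feel for what “SLO as code” can mean, here is a small Python sketch that keeps an SLO definition as structured data that can be serialized, reviewed and version-controlled. To be clear, this is not the OpenSLO schema or any vendor’s Terraform resource; the `SloDefinition` type and its field names are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SloDefinition:
    # Hypothetical fields; consult the spec of the tool you use for real ones.
    service: str
    sli: str
    target_percent: float
    window_days: int


def to_json(slo: SloDefinition) -> str:
    """Serialize the definition so it can live in a repository."""
    return json.dumps(asdict(slo), indent=2, sort_keys=True)


def from_json(text: str) -> SloDefinition:
    """Load a definition back from its serialized form."""
    return SloDefinition(**json.loads(text))
```

Once a definition lives in a file, changes to a target or window go through the same review process as any other code change.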
9. Be proactive about SLO creation.
We can and should be proactive about failure and creating SLOs from chaos engineering and game days.
Chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. We want to inject failure into our systems to see how they would react if this failure were to happen on its own. This allows us to learn from failure, document it and prepare for failures like it. We can start practicing these types of experiments with game days.
A game day is a time when your team or organization comes together to do chaos engineering. This can look different for each organization, depending on the organization’s maturity and architecture. Game days can take the form of tabletop exercises, or use open source/internal tooling such as Chaos Monkey or LitmusChaos, or vendors like Harness Chaos Engineering or Gremlin. No matter how you go about starting this practice, you can get comfortable with failure and continue to build a culture of embracing failure at your organization. This also allows you to continue checking on those SLOs we just set up.
Remember that error budget we talked about earlier? It can be seen as your room for experimentation and innovation, which is perfect for running some chaos engineering experiments. Having this budget set aside helps reduce the risk of failure for your organization by ensuring that these experiments are accounted for. Remember that budgets are meant to be spent! Check out Jason Yee’s SLOConf talk for more on this topic.
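As a rough sketch of “accounting for” an experiment, the check below asks whether the failures you expect a chaos experiment to cause still fit inside the error budget for the period. The function, its parameters and the policy itself are assumptions for illustration; estimating the expected failures is the hard part and is not shown.

```python
def experiment_fits_budget(
    slo_percent: float,
    total_events: int,
    bad_events_so_far: int,
    expected_experiment_failures: int,
) -> bool:
    """True if the experiment's expected failures fit in the remaining error budget."""
    # Error budget expressed as a number of events allowed to fail.
    allowed = total_events * (100.0 - slo_percent) / 100.0
    return bad_events_so_far + expected_experiment_failures <= allowed
```

With a 95% SLO over 1,000,000 events, an experiment expected to cause 10,000 failures fits when 20,000 failures have already occurred, but not when 45,000 have.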
How to SLO?
So, how do you create your SLOs? Well, you’ve got options:
- Observability tools with SLO capabilities, such as Lightstep, Honeycomb, New Relic, Datadog, Dynatrace, to name a few.
- Third-party SLO tools, such as Nobl9, which integrates with existing observability tooling.
I hope that this has helped demystify SLOs.
Keep in mind that this is barely scratching the surface. SLOs are an extensive topic, and there’s tons to learn. I suggest you check out the following to further your SLO journey:
- “Implementing Service Level Objectives” by Alex Hidalgo
- The Google SRE Book
- The Google SRE Workbook on SLOs
And a few takeaways to keep in mind:
- SLOs are not just a checkbox item in your work/SRE process. Make sure to think and strategize around them.
- SLOs are not worth anything if they’re not telling you something.
- You don’t just set up SLOs and walk away. You iterate on them constantly because things change, and your SLOs need to reflect those changes.
- Defining SLOs is an iterative process. You might not get them right the first time, so you need to keep tweaking them.