Cloud Native Computing Foundation sponsored this post, in anticipation of KubeCon + CloudNativeCon North America 2020 – Virtual, Nov. 17-20.
Service Level Objectives (SLOs) are an increasingly common tool for software reliability. Popularized by Google, SLOs are usually characterized as a tool for service owners to balance the risks versus rewards for making changes to a given application. Should we ship this new product feature, given that we just had an outage? How do we quantify that risk and have a conversation about it with all stakeholders?
Less well-known is that SLOs can also be a powerful tool for platform owners. For a Kubernetes operator, SLOs provide a way of characterizing the health of the services running on their clusters that can be interpreted without any knowledge of the underlying application or its operational history. This means platform owners can use SLOs to sort through a huge set of applications and rapidly determine whether anything needs immediate attention — especially critical as the number of applications grows.
SLOs in a Nutshell
At its most basic level, an SLO is simply a metric, a goal for that metric, and a time period. For instance: “the success rate for service A must be at least 99.7% over the past 30 days.” The metric is known as the “service level indicator” (SLI) and the goal is the “objective.”
The output of an SLO is the error budget: a measure of how much of the allowed failure you have consumed relative to the goal over that time period. For example, if your SLO requires a 99% success rate over a 30-day period, you can afford up to 1% of requests failing. If the actual success rate over that period is 99.75%, then only 0.25% of requests failed, a quarter of the allowance, so your remaining error budget is 75%.
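The arithmetic above can be sketched in a few lines of Python; this is a minimal illustration of the calculation, not any particular tool's implementation:

```python
def error_budget_remaining(objective: float, observed: float) -> float:
    """Fraction of the error budget left, given an availability objective
    (e.g. 0.99) and the observed success rate over the SLO window."""
    allowed_failure = 1.0 - objective  # e.g. up to 1% of requests may fail
    actual_failure = 1.0 - observed    # e.g. 0.25% of requests actually failed
    return 1.0 - actual_failure / allowed_failure

# The example from the text: a 99% objective with a 99.75% observed
# success rate leaves 75% of the error budget.
print(round(error_budget_remaining(0.99, 0.9975), 4))  # → 0.75
```

A result at or below zero means the objective has been violated for that window, which is exactly the signal a platform owner can act on without knowing anything else about the service.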
The error budget is a measure of how much leeway is remaining before the objective is violated. For a service owner, the error budget represents a way to quantify the amount of risk they can incur — an indicator of whether you should hold off on new deployments until things cool off, for example.
But for a platform owner, the error budget acts as something else: a kind of context-free judgment of the health of the service. If the error budget for an SLO is 100% and steady, then we know things are going well for that service. If it’s close to 0 (or below 0!) and dropping, then we know things are going poorly. It doesn’t matter what the underlying metric is, what the application does, or how it performed last month — the error budget is a universal number.
This universal, context-free nature of error budgets is the key to the value SLOs provide on a Kubernetes platform.
SLOs for Kubernetes Platform Owners
The Kubernetes platform owner may be responsible for hundreds or thousands of applications running across tens or hundreds of Kubernetes clusters. And they may understand none of them. (Arguably, this lack of understanding is the mark of a healthy platform!)
In this context, the utility of raw metrics starts to break down. If a given service currently has a 97% success rate, is that good or bad? If it drops to 95%, is that cause for concern? If its success rate is 100%, but the 99th percentile of latency is slowly rising to 1200ms, should anyone be paged? Without context about how this service is supposed to be behaving, there’s no way for the platform owner to know.
SLOs provide a way out of this situation. In contrast to metrics, the universality of error budgets actually does give platform owners a way to make value-based judgments about the health of those services. In other words, by wrapping these metrics in SLOs, the platform owner gains a universal way of assessing service health, observing trends, and identifying which services need immediate attention.
The Challenges of Using SLOs
Despite their many benefits, implementing SLOs for a Kubernetes platform can be difficult. As a first challenge, consistent SLOs require consistent metrics — what are the success rates, latencies, etc., of your Kubernetes workloads at any point in time? Next, you must formulate the SLOs with appropriate SLIs, objectives, and time periods — what is the “right” parameterization of the SLOs you want to track? Finally, you must actually compute the error budgets. While the math is simple, selecting the correct metric data points from the correct workloads during the correct time periods can be non-trivial, especially when services and workloads change over time.
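To make the last challenge concrete, here is a sketch of the "select the correct data points" step: aggregating a success-rate SLI from per-interval counters, keeping only the samples that fall inside the SLO's window. The `(timestamp, successes, total)` sample format is a hypothetical stand-in for whatever your metrics store returns:

```python
from datetime import datetime, timedelta, timezone

def success_rate(samples, window, now):
    """Compute a success-rate SLI over the SLO's time window.
    `samples` is a hypothetical format: (timestamp, successes, total)."""
    cutoff = now - window
    ok = total = 0
    for ts, successes, requests in samples:
        if ts >= cutoff:  # discard samples outside the window
            ok += successes
            total += requests
    return ok / total if total else 1.0

now = datetime(2020, 11, 17, tzinfo=timezone.utc)
samples = [
    (now - timedelta(days=40), 90, 100),    # outside 30-day window: ignored
    (now - timedelta(days=10), 995, 1000),
    (now - timedelta(days=1), 1000, 1000),
]
print(success_rate(samples, timedelta(days=30), now))  # 1995/2000 = 0.9975
```

Even this toy version hints at the real difficulty: a production system must also handle counter resets, missing scrapes, and workloads that are renamed or redeployed mid-window.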
For the metrics challenge, at least, there are some simple options. A service mesh like the open source CNCF project Linkerd can provide a consistent and uniform layer of metrics for all HTTP and gRPC services on your Kubernetes clusters, without requiring any configuration.
Formulating the SLOs on top of these metrics is the next step. Here, there is a spectrum of options — ranging from “get all stakeholders in a meeting and hammer it out from first principles” to “just use the current metric value as the objective and see what happens.” Tooling here can help immensely, especially with the latter approach, by providing suggestions based on historical data.
Finally, computing the error budget. The Kubernetes ecosystem provides good options here in the form of open source tools like Prometheus and Grafana — with Linkerd metrics in place, for example, SLOs can be expressed as Prometheus queries and error budgets plotted as Grafana dashboards. Alternatively, hosted tools like Dive can make use of these same Linkerd metrics and allow you to set up and track SLOs with the click of a button, across arbitrary numbers of clusters and workloads.
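As one concrete illustration of that approach, a success-rate SLI can be expressed as a PromQL query over Linkerd's `response_total` counters, which classify each response as a success or failure. This is a sketch, not a definitive recipe: the deployment name and window below are hypothetical placeholders, and the exact label names should be checked against your Linkerd version:

```python
# Hypothetical workload and SLO window; adjust to match your setup.
DEPLOYMENT = "webapp"
WINDOW = "30d"

# Success rate over the window: successful responses / all responses.
SLI_QUERY = (
    f'sum(increase(response_total{{deployment="{DEPLOYMENT}", '
    f'classification="success"}}[{WINDOW}])) '
    f'/ sum(increase(response_total{{deployment="{DEPLOYMENT}"}}[{WINDOW}]))'
)
print(SLI_QUERY)
```

The resulting expression can be pasted into the Prometheus expression browser or a Grafana panel; the error budget is then the fraction of the allowed failure rate that remains unspent.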
No matter which approach you take, adopting SLOs can play a vital role in helping platform owners understand the state of their applications in a way that’s both uniform and context-free, which means they can prioritize their efforts and ensure that both the applications — and the platform on which they run — remain reliable.
To learn more about Kubernetes and other cloud native technologies, consider coming to KubeCon + CloudNativeCon North America 2020, Nov. 17-20, virtually.
The Cloud Native Computing Foundation is a sponsor of The New Stack.