SLOs in Kubernetes, 1 Year Later
Roughly one year ago, The New Stack published an article titled “Service Level Objectives in Kubernetes” by William Morgan. We thought it would be an excellent time to explore the progress in SLOs over the past 12 months. Looking back on this article, we can see many areas that have evolved, mainly in the industry’s understanding of how SLOs work and why they are helpful.
Let’s start with the ubiquity of SLOs today. We see customers and community members using SLOs in various ways, from reducing pager fatigue to defining expectations across infrastructure, platforms and Kubernetes environments. SLOs can also improve developer culture, enable chaos engineering, act as an early warning for technical debt and even help with setting reminders for SSL certificates.
Calculating SLOs Correctly
The mechanics of SLOs and error budgets are better understood now too. In my opinion, last year’s article wrongly defined error budgets, so I want to clarify this: An error budget is the gap between 100% and the SLO target over a given time period. In the article’s example, the goal was 99% over 30 days, implying a 1% error budget, not 0.75%, as the article asserts. If the service hit 99.75%, as the article says, only 25% of the error budget was consumed (100 minus 99.75 is 0.25 percentage points, which is 25% of the 1% budget).
Assuming there were no customer complaints or external indicators that the service goal needed adjustment, the team that owns this service could safely ignore reliability and work on other things.
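The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and the conversion to minutes assumes a time-based availability SLO, which the original example does not specify.

```python
def error_budget_consumed(slo_target_pct: float, observed_pct: float) -> float:
    """Return the fraction of the error budget used (0.0 = none, 1.0 = all)."""
    budget = 100.0 - slo_target_pct      # 100 - 99 = 1 percentage point
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    shortfall = 100.0 - observed_pct     # 100 - 99.75 = 0.25 percentage points
    return shortfall / budget


def budget_minutes(slo_target_pct: float, window_days: int) -> float:
    """Assumption: a time-based SLO, so the budget can be expressed in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_target_pct) / 100.0


consumed = error_budget_consumed(slo_target_pct=99.0, observed_pct=99.75)
print(f"{consumed:.0%} of the error budget consumed")
print(f"{budget_minutes(99.0, 30):.0f} minutes of budget in a 30-day window")
```

Under that time-based assumption, a 99% goal over 30 days allows 432 minutes of unreliability, and a service at 99.75% has used 108 of them.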
State of the Community
Over the last year, the community of practitioners around SLOs has ballooned. SLOconf had more than 50 speakers and offered seven hours of recorded content about the whys and how-tos of SLOs. Over 2,200 people registered, and over 450 filled out the post-event survey. OpenSLO, an open source project we started with GitLab, Dynatrace and others, has continued to attract SLO practitioners and platform providers to create a vendor-agnostic declarative SLO-as-code specification and a corresponding validation tool, Oslo. The conceptual framework and the tooling for SLOs have gained significant traction in the last 12 months.
Since Nobl9 offers an SLO platform, we have a unique window into the challenges of software teams today. They want to stay productive, by shipping more features and staying ahead of their end-customer needs. They also want to be efficient, making sure that the way they build and deliver is not wasteful. Meanwhile, they need to ensure an excellent customer experience, even as they constantly shift the bits and adjust their deployments.
SLOs are a critical enabler of engineering velocity. SLOs highlight business risks that require attention, which means better alignment with business stakeholders around priorities and investment requirements. SLOs also encourage dialog and a feedback loop around the granularity of goals for the service. For example, you could distinguish your “checkout” experience on Black Friday from your “casual browsing” experience and set individual goals for each. This better matches engineering decisions, like change management and capacity planning, to business reality.
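As a rough sketch of what separate goals per experience might look like, the snippet below models two SLOs and checks the same traffic against each. The class, the target values and the request counts are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target_pct: float  # hypothetical target, e.g. 99.9 means 99.9% good requests

    def is_met(self, good: int, total: int) -> bool:
        """True when the observed share of good requests reaches the target."""
        if total <= 0:
            raise ValueError("total must be positive")
        return 100 * good / total >= self.target_pct


# Assumed targets: checkout is held to a stricter goal than browsing.
checkout = Slo(name="checkout", target_pct=99.9)
browsing = Slo(name="casual-browsing", target_pct=99.0)

# The same observed success rate (99.5%) is acceptable for one and not the other.
for slo in (checkout, browsing):
    status = "met" if slo.is_met(good=9_950, total=10_000) else "missed"
    print(f"{slo.name}: {status}")
```

With these assumed numbers, a 99.5% success rate satisfies the browsing goal but misses the checkout goal, which is the kind of distinction that can steer change management and capacity planning.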
Tooling to the Rescue
One of the surprises of the last year has been the rise of AIOps as a proposed solution to managing operations. It seems that many providers are skipping right over the step of defining goals and instead delegating this responsibility to intelligent digital agents.
At Nobl9, we are investing in sophisticated AI-based tools to propose better SLOs. We are not entirely on board with handing the keys to our infrastructure to the robots yet, but adopting SLOs is step zero on the journey to achieving the vision of AIOps.
As organizations move from reactive to proactive, from manual to automated, and through the many technological shifts in their infrastructure, they need some grounding in user expectations. SLOs become a yardstick to measure the effectiveness of these investments and changes. We strongly agree that, as Morgan’s article states, “Tooling here can help immensely, especially with the latter approach, by providing suggestions based on historical data.” Recommending SLOs based on existing data is what we have built at Nobl9.
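To make the idea of suggesting a goal from historical data concrete, here is a deliberately naive sketch. It is not how Nobl9 or any other product computes recommendations; the function name, the heuristic (ignore the worst few days, then take the lowest remaining value) and the sample data are all assumptions made for illustration.

```python
def suggest_target(daily_success_pct: list[float], skip_percent: int = 10) -> float:
    """Suggest an SLO target from past daily success percentages.

    Naive heuristic (an assumption, not an established method): drop the
    worst `skip_percent` of days as outliers, then propose the lowest
    remaining value as a target the service has historically sustained.
    """
    if not daily_success_pct:
        raise ValueError("need at least one day of history")
    if not 0 <= skip_percent < 100:
        raise ValueError("skip_percent must be in [0, 100)")
    ordered = sorted(daily_success_pct)
    index = len(ordered) * skip_percent // 100  # number of worst days to skip
    return ordered[index]


# Hypothetical ten days of history with one bad day (99.72).
history = [99.95, 99.91, 99.99, 99.72, 99.97, 99.93, 99.98, 99.96, 99.94, 99.90]
print(f"suggested target: {suggest_target(history)}%")
```

With this sample, the single bad day is skipped and the suggestion is 99.9. A suggestion like this is only a starting point for the conversation about user expectations, not a substitute for it.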
The Platform Owner Point of View
Kubernetes adoption continues to grow “up and to the right,” as they like to say at Google. And platform teams within organizations are becoming more sophisticated in offering common compute services to their application teams. If you run an internal platform, you need to have some sort of agreement or understanding with your app teams to ensure proper use of the platform and its expectations. Instead of treating SLOs as anonymous contracts, use them to create a feedback loop and a dialog about the right level of service for each workload.
Better understanding is critical when you are part of the same organization. In today’s world, seeing changes in real time and adjusting through both automation and deeper collaboration is crucial to balancing the competing demands of customer excellence, engineering productivity and business efficiency. Finer-grained goals described in a common language mean clarity across the layers of the stack and for the stakeholders of the service.
As a platform provider, you know not all workloads are the same. Differentiation of priorities and making clear choices are part of your growth strategy. Everyone needs clear rules and goals for the service to drive their engineering decisions. Better decision-making is how you scale your digital services to the demands of our modern world.
Software systems are not uniform, and your platform is not an anonymous pile of computational power. Now more than ever, context is critical to providing excellent, efficient service. And SLOs provide precisely that context.