When 99% Service Level Objectives Are Overrated (and Too Expensive)
The collective wisdom goes that 99% site reliability might be the standard, but Alex Hidalgo, a principal reliability advocate at Nobl9, says such high standards are often unnecessary. Sometimes 80% will work just fine!
When is it worth pulling out of the race for site reliability of even 99.9%? How important is it to really understand a user’s sweet spot so resources can be put elsewhere? These are the questions Alex Hidalgo addressed in his 2022 P99 Conf talk, “Throw Away Your Nines.”
It is a myth that service level objectives (SLOs), a vital number in site reliability engineering (SRE), require 99+% site reliability. An SLO specifies the level of reliability a service is targeting for its users. There are many instances where 99% isn’t necessary, and offering services with such high reliability will quickly burn through the budget.
Consider a company that is looking for “95% of all API requests to return with a non-error state” overall. Add in the SLO target and the request morphs into something more like “99% of requests to be good every 30 days.” Hidalgo explained that, more often than not, percentiles are attached because it’s those percentiles that inform SLOs. The original request transforms once again and is now “99th percentile of all requests to complete within 500 milliseconds 95% of the time, every 30 days.” Here is where Hidalgo says the growing problem exists: “We’re really starting to stack things on top of each other and more and more nines are being involved.”
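To make the stacked objective concrete, here is a minimal Python sketch of how “99th percentile of all requests to complete within 500 milliseconds 95% of the time” could be evaluated. It assumes requests are grouped into fixed time windows (the talk recap does not say how large a window is), and the function names and sample data are hypothetical. It relies on the fact that, under a nearest-rank definition, a window's P99 is within the threshold exactly when at least 99% of its requests are.

```python
def window_is_good(latencies_ms, percent=99, threshold_ms=500):
    """True when at least `percent`% of the window's requests finished in time.

    Under a nearest-rank percentile this is the same as saying the window's
    P99 latency is at or below the threshold.
    """
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    # Integer comparison avoids floating-point surprises at the boundary.
    return fast * 100 >= percent * len(latencies_ms)


def fraction_of_good_windows(windows, percent=99, threshold_ms=500):
    """Share of windows whose P99 latency met the threshold."""
    good = sum(1 for w in windows if window_is_good(w, percent, threshold_ms))
    return good / len(windows)


# Hypothetical data: 20 windows of 100 requests each.
healthy_window = [100] * 99 + [900]       # 1 slow request: P99 is still fast
degraded_window = [100] * 95 + [900] * 5  # 5 slow requests: P99 is slow
windows = [healthy_window] * 19 + [degraded_window]

observed = fraction_of_good_windows(windows)
slo_met = observed >= 0.95  # the "95% of the time" part of the objective
print(f"good windows: {observed:.0%}, SLO met: {slo_met}")
```

Note that three separate numbers (99, 500 and 95) all have to be chosen here, which is the stacking Hidalgo describes.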
What Does Latency Look Like?
In a perfect world, the latency graph looks like the drawing below. Divide the graph into 100 pieces because the past tells us that a 1% failure rate is baseline acceptability, and this is where the 99th percentile (P99) falls.
Rarely does perfect happen. More common is the long tail distribution: a small ramp-up made of quick request completions, followed by the averages, and then the long tail, a gradual ledge on the right-hand side made of slower requests, thanks to all sorts of issues that arise while computer APIs and networking services talk across the internet. P99 is still the standard.
But what happens in the event of any of these latency events?
A standard formula doesn’t work every time. Latency graphs don’t always look the same. Not all P99s are created equal.
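As an illustration of why not all P99s are created equal, the sketch below compares two made-up latency samples that share the same P99 but look nothing alike elsewhere. The nearest-rank percentile definition and all of the numbers are assumptions chosen for the example, not data from the talk.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds, 100 requests each.
tight_tail = [50] * 98 + [400, 450]
long_tail = [50] * 40 + [390] * 58 + [400, 30_000]

for name, samples in (("tight tail", tight_tail), ("long tail", long_tail)):
    print(
        f"{name}: P50={percentile(samples, 50)} ms, "
        f"P99={percentile(samples, 99)} ms, max={max(samples)} ms"
    )
```

Both samples report a P99 of 400 ms, yet the second has a far higher median and a 30-second worst case, so the same P99 target would describe very different user experiences.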
Throw Away Your Nines
Stacking these numbers up side by side is pretty wild. If an SLO promises five nines of reliability, 99.999%, that leaves roughly 0.9 seconds per day, or a total of five minutes and 15 seconds every year, of unreliability time. It is perfection on paper but, in reality, incredibly difficult to achieve. No matter how robust, resilient or redundant a site is, think about the human logistics alone as they relate to five minutes and 15 seconds of unreliability time a year.
Even if only one event happens in a year, if the incident happens to occur at 3 a.m., it will take longer than five minutes and 15 seconds for the engineer on call to log into their computer and check logs. It’s a brilliant way to burn through a budget. Dropping some nines to 99.9% will give a site 8 hours and 45 minutes of unreliable time per year, which looks a whole lot better in reality, but is even that necessary? And what would it mean to step away from needing the almighty all-nines approach? Where would this leave SLOs?
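The downtime figures above follow from simple arithmetic: the allowed unreliability is the fraction of time not covered by the target, multiplied by the length of the period. A small Python sketch, assuming a 365-day year and using hypothetical helper names, reproduces them:

```python
def allowed_downtime_seconds(slo_percent, period_days):
    """Seconds of unreliability an availability target permits over a period."""
    return (1 - slo_percent / 100) * period_days * 24 * 60 * 60


def format_duration(seconds):
    """Render a duration as hours, minutes and seconds, rounded to the second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


for target in (99.999, 99.9, 99.0, 80.0):
    per_day = allowed_downtime_seconds(target, 1)
    per_year = allowed_downtime_seconds(target, 365)
    print(f"{target}%: {per_day:.3f} s/day, {format_duration(per_year)} per year")
```

Five nines works out to 0.864 seconds per day and 5 minutes 15 seconds per year, while three nines allows 8 hours 45 minutes 36 seconds per year.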
Set Intentional Objectives Based on Users’ Needs and Realistic Goals
What to do when third-party dependencies keep reliability down? Hidalgo discussed two clients of his who got creative.
One client, Company A, was a web-facing API where every call to it relied on a call to the database behind it. Unfortunately for Company A, the database was constantly returning errors to the tune of a 20% failure rate. Company B relied on “just about every messaging vendor imaginable,” and each vendor had varying error rates.
Both companies wanted that 99.9%, but it wasn’t necessary in either case. Company A took the approach of keeping the 80% SLO target but instituting much better retry logic so users didn’t notice there was a problem. They will have to fix the database issue down the road, but in the interim the users’ needs were identified and met, and isn’t that the point? Company B also put better retry logic in place to even out the varying latencies while it did a deeper dive into what worked best for its users. Company B landed on an SLO target of 97.2%. Anything over that and the users didn’t notice, but under that and they sure did.
The common tie between Companies C and D is that they couldn’t hit that 99.9% due to “downtime.” Company C performed long-running batch jobs that took hours and hours and failed one in every five times. Company C set its SLO at 80% because one failure in five was what they came to see as a success.
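A rough Python sketch shows why retries can hide a 20% failure rate from users. It assumes failures are independent between attempts, which a real database outage may well violate, and it omits the backoff and timeouts a production retry loop would need. The names `flaky_query`, `call_with_retries` and `effective_success_rate` are hypothetical, not taken from either company's code.

```python
import random


def call_with_retries(operation, max_attempts=3):
    """Call `operation`, retrying on ConnectionError up to `max_attempts` times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
    raise last_error


def effective_success_rate(per_call_success, max_attempts):
    """Chance that at least one attempt succeeds, assuming independent failures."""
    return 1 - (1 - per_call_success) ** max_attempts


rng = random.Random(42)  # fixed seed so the simulation is repeatable


def flaky_query():
    """Stand-in for a dependency that fails 20% of the time."""
    if rng.random() < 0.2:
        raise ConnectionError("simulated database error")
    return "row"


trials = 10_000
successes = 0
for _ in range(trials):
    try:
        call_with_retries(flaky_query, max_attempts=3)
        successes += 1
    except ConnectionError:
        pass

print(f"predicted: {effective_success_rate(0.8, 3):.3f}")
print(f"simulated: {successes / trials:.3f}")
```

Under the independence assumption, three attempts against an 80% dependency succeed about 99.2% of the time from the user's point of view, even though the dependency's own SLO stays at 80%.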
Company D, though in the process of migration, had a primary code repository system that went down for an hour daily to complete the backup process. Company D built its SLO from the time outside of its backup process. Both Companies C and D were far from 99% reliability but still had what their users considered successful metrics.
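Company D's approach of measuring only outside the backup window can be sketched as follows. The 02:00 to 03:00 window, the per-minute health checks and the function names are all assumptions made for illustration; the recap does not say when the backup ran or how availability was measured.

```python
from datetime import datetime, time, timedelta

# Hypothetical daily backup window.
BACKUP_START = time(2, 0)
BACKUP_END = time(3, 0)


def in_backup_window(timestamp):
    """True when the timestamp falls inside the daily backup window."""
    return BACKUP_START <= timestamp.time() < BACKUP_END


def availability(checks, exclude_backup):
    """Fraction of successful health checks, optionally ignoring the backup window."""
    considered = [
        ok for ts, ok in checks
        if not (exclude_backup and in_backup_window(ts))
    ]
    return sum(considered) / len(considered)


# One made-up day of per-minute checks: down for the whole backup hour,
# plus three unplanned failed minutes in the afternoon.
day_start = datetime(2022, 10, 19)
unplanned = {day_start + timedelta(hours=14, minutes=m) for m in (5, 6, 7)}
checks = []
for minute in range(24 * 60):
    ts = day_start + timedelta(minutes=minute)
    ok = not in_backup_window(ts) and ts not in unplanned
    checks.append((ts, ok))

print(f"including backup: {availability(checks, exclude_backup=False):.2%}")
print(f"excluding backup: {availability(checks, exclude_backup=True):.2%}")
```

Measured around the clock, the system can never do better than roughly 95.8% because of the planned hour of downtime; measured only when users expect it to be up, the same day shows the three unplanned minutes and nothing else.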
In conclusion, just be intentional. Not all nines are bad, but when creating SLO targets, “There are nine numbers besides nine that you can in fact use.” There’s nothing wrong with the 99th percentile or trying to hit 99.99999% reliability when it’s needed, as long as these metrics aren’t just being used because it’s what everyone else is doing. “What works for one site doesn’t need to work for everything else,” Hidalgo said.