Make Sure Your Application Comes Correct with Correctness SLOs
Sometimes 100% is a failing Service Level Objective (SLO).
If I see a coin toss return heads nine times in a row, I’m going to feel unjustifiably confident that on the tenth toss I’m going to see tails. But really, the odds are the same: 50/50. Now, if a coin-toss software program returns heads 100 trillion times in a row, why didn’t anyone or anything alert a site reliability engineer?
Correctness SLOs measure application actions to ensure the application is behaving as expected. This article provides a blueprint for creating Correctness SLOs for any application.
Introducing Correctness SLOs
Correctness SLOs are service level objectives that focus on what an application is doing, whereas traditional SLOs focus on how long it takes for the application to do it. In the coin-flipping program example, a healthy result is 50% heads (and 50% tails) rather than the usual 100% that everyone constantly strives for.
Correctness SLO Quick Facts
- Correctness SLOs aren’t measured by clear pass/fail conditions. A single heads is a valid, passing result, but a dataset that is 100% heads is not. Values matter in relation to the dataset rather than on their own.
- 100% is no longer the crown jewel. In fact, 100% can be very bad.
- Deviation from the ideal value is negative, regardless of direction. If 50% is ideal, 48% and 52% are equally far off target.
Of course, no one is logging on to a coin-flip program, which is probably a front for a phishing site anyway. But Correctness SLOs are useful in the real world, too.
Online Gaming – Correctness SLOs provide metrics to ensure one character doesn’t have a significant advantage over another. Game engineers might initially design Fire Warrior and Water Warrior with equal power. But if Water Warrior always beats Fire Warrior, regardless of the player, Correctness SLOs can reveal this issue.
Recommendation systems – Correctness SLOs are another way to monitor the click-through rate (CTR). If the normally steady CTR drops, it may indicate less relevant content suggestions. If the CTR surges, it may indicate a too-good-to-be-true offer that could harm profitability.
Constructing Correctness SLOs
A coin-toss program is a perfect example of the 50/50 goal. The two SLIs are:
- Number of coin flips that land on heads
- Total number of coin flips
Building from the SLIs, the product owner can specify:
- The ideal success rate for heads in a coin flip is 50%.
- Deviations beyond +/- 2% breach the SLO.
The error budget is consumed over a predetermined time period (e.g., 30 days) if:
- Heads appear < 48% of the time
- Heads appear > 52% of the time
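As a minimal sketch of this check, assuming the two SLIs arrive as plain counters for the measurement window, the +/- 2% band around 50% could be evaluated as follows. The function and parameter names are illustrative and do not come from the original article.

```python
def heads_ratio(heads: int, total: int) -> float:
    """Share of flips in the window that landed on heads."""
    if total <= 0:
        raise ValueError("total must be positive")
    return heads / total


def outside_band(heads: int, total: int,
                 lower: float = 0.48, upper: float = 0.52) -> bool:
    """True when the heads share is below `lower` or above `upper`."""
    ratio = heads_ratio(heads, total)
    return ratio < lower or ratio > upper


print(outside_band(5_050, 10_000))   # 50.5% heads -> False
print(outside_band(10_000, 10_000))  # 100% heads  -> True
```

The second call shows why 100% can be a failing result: every individual flip is valid, yet the dataset as a whole is far outside the band.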
The SLO breach formula will need slight modifications:
- Introduce IDEALTARGET: The optimal target percentage (usually 100% for standard SLOs, 50% in this example)
- Define SLODIFF: the acceptable deviation from IDEALTARGET before SLO breach (2% in this example)
- Use an absolute value function to handle deviation, positive or negative, from IDEALTARGET
Here is an example of a standard SLO formula for unavailability over a trailing 30-day period (using the Prometheus Query Language [PromQL] syntax):
Here is an example of a potential correctness SLO measuring deviation from an ideal target over a trailing 30-day period (PromQL syntax):
Consider these scenarios with our coin flip example:
IDEALTARGET 50%, SLODIFF 2%, heads flip rate 47%: Abs(0.50 – 0.47) > 0.02 = True, SLO BREACHED
IDEALTARGET 50%, SLODIFF 2%, heads flip rate 54%: Abs(0.50 – 0.54) > 0.02 = True, SLO BREACHED
IDEALTARGET 50%, SLODIFF 2%, heads flip rate 51%: Abs(0.50 – 0.51) > 0.02 = False, SLO NOT BREACHED
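The same comparison can be sketched in Python rather than PromQL. This is only an illustration of the absolute-value check; `slo_breached` and its parameter names are hypothetical stand-ins for IDEALTARGET, SLODIFF, and the measured heads rate.

```python
def slo_breached(ideal_target: float, slo_diff: float, observed: float) -> bool:
    """Breach when `observed` deviates from `ideal_target` by more than
    `slo_diff`, in either direction."""
    return abs(ideal_target - observed) > slo_diff


# The three coin-flip scenarios: IDEALTARGET 50%, SLODIFF 2%.
for rate in (0.47, 0.54, 0.51):
    status = "SLO BREACHED" if slo_breached(0.50, 0.02, rate) else "SLO NOT BREACHED"
    print(f"heads flip rate {rate:.0%}: {status}")
```

With an ideal target of 1.0, the same function reduces to a standard check that the failure ratio exceeds the allowed margin. Rates that sit exactly on a boundary, such as 48%, can fall on either side of the comparison because of floating-point rounding, so a real implementation may prefer to compare counts or add a small tolerance.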
Adapting Burn Rate Alerting
Burn rate alerting can mirror the SLO calculation above: the same deviation-from-IDEALTARGET expression is incorporated into the burn rate and multiwindow formulas.
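One way this might look, as a sketch under stated assumptions rather than the formula from the original article: treat the burn rate as the observed deviation divided by the allowed deviation (SLODIFF), and alert only when both a long and a short window exceed a threshold. The function names, the window inputs, and the default threshold of 2.0 are all hypothetical.

```python
def burn_rate(ideal_target: float, slo_diff: float, observed: float) -> float:
    """Observed deviation as a multiple of the allowed deviation (assumed definition)."""
    return abs(ideal_target - observed) / slo_diff


def multiwindow_alert(long_window_rate: float, short_window_rate: float,
                      ideal_target: float = 0.50, slo_diff: float = 0.02,
                      threshold: float = 2.0) -> bool:
    """Fire only if both windows burn faster than `threshold` (hypothetical value)."""
    return (burn_rate(ideal_target, slo_diff, long_window_rate) > threshold
            and burn_rate(ideal_target, slo_diff, short_window_rate) > threshold)


# Heads rate is 60% over the long window and 62% over the short window.
print(multiwindow_alert(0.60, 0.62))  # True: both windows are well outside the band
# The long window is still skewed, but the short window has recovered to 51%.
print(multiwindow_alert(0.60, 0.51))  # False: the short window suppresses the alert
```

Requiring both windows to agree is the usual motivation for multiwindow alerts: the long window confirms the deviation is significant, and the short window confirms it is still happening.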
Newman’s original article strongly advocates for the creation and use of Correctness SLOs. He believes they align with the objective of SREs, which is to improve system reliability, and he provides a blueprint to ease the learning curve and introduce others to Correctness SLOs. Correctness SLOs only inform SREs that an issue is occurring. Troubleshooting and debugging are outside the scope of the Correctness SLO.