
Make Sure Your Application Comes Correct with Correctness SLOs

Correctness SLOs measure application actions to ensure the application is behaving as expected. USAA's Adam Newman explains how to create Correctness SLOs for any application.
Oct 23rd, 2023 3:00am by
Feature Image by Caity from Pixabay.

Sometimes 100% is a failing Service Level Objective (SLO).

If I see a coin toss return heads nine times in a row, I'm going to feel unjustifiably confident that the tenth toss will come up tails. But really, the odds are the same: 50/50. Now, if a coin-toss software program returns heads 100 trillion times in a row, why didn't anyone or anything alert a site reliability engineer?

This is where Adam Newman, one of the founding members of the Site Reliability Engineering organization at USAA, makes the case for Correctness SLOs in his recent Usenix article.

Correctness SLOs measure application actions to ensure the application is behaving as expected. This article provides a blueprint for creating Correctness SLOs for any application.

Introducing Correctness SLOs

Correctness SLOs are service-level objectives that focus on what an application is doing, whereas traditional SLOs focus on whether and how quickly the application does it. In the coin-flipping example, the healthy ratio between the two service level indicators (SLIs) is 50%, rather than the usual 100% that everyone constantly strives for.

Correctness SLO Quick Facts

  • Correctness SLOs aren’t measured by clear pass/fail conditions on individual events. A single heads is a valid result, but a dataset that is 100% heads is not. Values matter in relation to the dataset rather than on their own.
  • 100% is no longer the crown jewel. In fact, 100% can be very bad.
  • Deviation from the ideal value is negative, regardless of direction. If 50% is ideal, 48% and 52% are equally far off.

But obviously, no one is logging on to a coin-flip program that’s probably masking a phishing site. Correctness SLOs are also useful in the real world.

Online Gaming – Correctness SLOs provide metrics to ensure one character doesn’t have a significant advantage over another. Game engineers might initially design Fire Warrior and Water Warrior with equal power. But if Water Warrior always beats Fire Warrior, regardless of the player, Correctness SLOs can reveal this issue.

Recommendation systems – Correctness SLOs are another way to monitor the click-through rate (CTR). If the normally steady CTR drops, it’s another indicator of less relevant content suggestions. If the CTR surges, it’s a potential indication of a too-good-to-be-true offer and potential harm to profitability.

Constructing Correctness SLOs

A coin-toss program is a perfect example of the 50/50 goal. The two SLIs are:

  1. Number of coin flips that land on heads
  2. Total number of coin flips

Building from the SLIs, the product owner can specify:

  • The ideal success rate for heads in a coin flip is 50%.
  • Deviations beyond +/- 2% breach the SLO

This consumes the error budget over a predetermined time period (e.g., 30 days) if:

  • Heads appear < 48% of the time
  • Heads appear > 52% of the time
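As a minimal sketch of how the two SLIs combine into the rate that is checked against the band, assume the counts for the window are already available as plain integers. The names `CoinFlipSLIs`, `heads_rate` and `outside_band` are hypothetical and chosen for illustration; the 48% and 52% bounds come from the 50% ideal and the +/- 2% deviation above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoinFlipSLIs:
    """Hypothetical container for the two SLIs over one time window."""
    heads: int  # SLI 1: number of coin flips that landed on heads
    total: int  # SLI 2: total number of coin flips


def heads_rate(slis: CoinFlipSLIs) -> Optional[float]:
    """Return the fraction of flips that were heads, or None if no flips happened."""
    if slis.total == 0:
        return None  # no data in the window: nothing to evaluate
    return slis.heads / slis.total


def outside_band(rate: float, lower: float = 0.48, upper: float = 0.52) -> bool:
    """True when the rate falls below the lower bound or above the upper bound."""
    return rate < lower or rate > upper


window = CoinFlipSLIs(heads=470, total=1000)
rate = heads_rate(window)
if rate is not None and outside_band(rate):
    print(f"heads rate {rate:.2%} is outside the 48%-52% band")
```

Each individual flip in this window is a valid outcome; only the aggregate rate says whether the program is behaving correctly.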

The SLO breach formula will need slight modifications:

  • Introduce IDEALTARGET: The optimal target percentage (usually 100% for standard SLOs, 50% in this example)
  • Define SLODIFF: the acceptable deviation from IDEALTARGET before SLO breach (2% in this example)
  • Use an absolute value function to handle deviation, positive or negative, from IDEALTARGET
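The three modifications above can be sketched in Python as a language-neutral illustration (this is not the article's PromQL). The function name `correctness_slo_breached` and the lowercase parameter names are assumptions made for this sketch; `ideal_target` and `slo_diff` stand in for IDEALTARGET and SLODIFF.

```python
def correctness_slo_breached(observed: float, ideal_target: float, slo_diff: float) -> bool:
    """Return True when `observed` deviates from `ideal_target` by more than `slo_diff`.

    The absolute value makes the check symmetric: too high and too low
    are treated the same way.
    """
    return abs(ideal_target - observed) > slo_diff


# Coin-toss Correctness SLO: ideal 50%, allowed deviation 2%.
print(correctness_slo_breached(observed=0.47, ideal_target=0.50, slo_diff=0.02))  # too few heads
print(correctness_slo_breached(observed=0.54, ideal_target=0.50, slo_diff=0.02))  # too many heads

# A standard SLO fits the same shape with an ideal of 100%.
# Assumption for illustration: a 99.9% target, i.e. an allowed deviation of 0.1%.
print(correctness_slo_breached(observed=0.9995, ideal_target=1.0, slo_diff=0.001))
```

One design caveat: with floating-point rates, values sitting exactly on the boundary (48% or 52% here) are sensitive to rounding in this form. Comparing integer counts, or allowing a small tolerance, avoids that.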

Here is an example of a standard SLO formula for unavailability over a trailing 30-day period (using the Prometheus Query Language [PromQL] syntax):

Here is an example of a potential correctness SLO measuring deviation from an ideal target over a trailing 30-day period (PromQL syntax):

Coin Toss

Consider these scenarios with our coin flip example:

IDEALTARGET 50%, SLODIFF 2%, heads flip rate 47%: Abs(0.50 – 0.47) > 0.02 = True, SLO BREACHED

IDEALTARGET 50%, SLODIFF 2%, heads flip rate 54%: Abs(0.50 – 0.54) > 0.02 = True, SLO BREACHED

IDEALTARGET 50%, SLODIFF 2%, heads flip rate 51%: Abs(0.50 – 0.51) > 0.02 = False, SLO NOT BREACHED

Adapting Burn Rate Alerting

Burn Rate Alerting formulas can effectively mirror our SLO calculations by incorporating the Burn Rate and Multiwindow Formulas.
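One possible reading of that adaptation, offered as a sketch rather than as Newman's formulas: treat the deviation from IDEALTARGET, divided by SLODIFF, as the burn rate, and alert only when both a long and a short window exceed a threshold. The function names, the choice of windows, and the threshold are all assumptions made for illustration.

```python
def deviation_burn_rate(observed: float, ideal_target: float, slo_diff: float) -> float:
    """Assumed burn rate: how many times the allowed deviation is being used.

    A value of 1.0 means the observed rate sits exactly at the edge of the
    allowed band; higher values mean the deviation is larger than allowed.
    """
    return abs(ideal_target - observed) / slo_diff


def multiwindow_alert(long_window_rate: float, short_window_rate: float,
                      ideal_target: float, slo_diff: float, threshold: float) -> bool:
    """Alert only if BOTH windows burn faster than `threshold`.

    The long window shows the problem is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    long_burn = deviation_burn_rate(long_window_rate, ideal_target, slo_diff)
    short_burn = deviation_burn_rate(short_window_rate, ideal_target, slo_diff)
    return long_burn > threshold and short_burn > threshold


# Heads rate of 56% over the long window and 57% over the short window.
print(multiwindow_alert(0.56, 0.57, ideal_target=0.50, slo_diff=0.02, threshold=2.0))
# Same long window, but the short window has recovered to 51%.
print(multiwindow_alert(0.56, 0.51, ideal_target=0.50, slo_diff=0.02, threshold=2.0))
```

The threshold of 2.0 is arbitrary. Thresholds used for availability SLOs may not carry over directly, since the maximum possible deviation from a 50% target is bounded at 50 percentage points.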

Now What?

Newman’s original article strongly advocates for the creation and use of Correctness SLOs. He believes they align with the core objective of SREs, which is to improve system reliability, and he provides the blueprint to soften the learning curve and introduce others to the approach. Correctness SLOs only inform SREs that an issue is occurring; troubleshooting and debugging are outside their scope.
