
Automate User Satisfaction with This GitOps-Friendly Spec for Service Level Objectives

18 May 2021 11:14am

Organizations looking to tighten up their ops with some site reliability engineering (SRE) should take a look at the recently released OpenSLO specification, a GitOps-friendly template for establishing Service Level Objectives (SLOs) that specify, and even enforce, the range of reliability required of (and afforded to) a system.

Software reliability platform company Nobl9 kickstarted the project, releasing it this week under the Apache License 2.0, just in time for the company’s virtual SLOConf, which is also taking place this week.

The practice of site reliability engineering (SRE) is built on metrics, a numerical refinement of traditional service level agreements (SLAs). SLAs have historically measured the minimally acceptable performance for systems. Routinely failing to meet an SLA is the point at which refunds are offered or even lawyers are invoked.

SLOs measure the opposite: the point at which the customer (either external or internal) is satisfied with the service, explained Kit Merker, Nobl9’s chief operating officer, in an interview with The New Stack. An SLO establishes the delta between user satisfaction and complete system infallibility, the cost of which would be prohibitive for any service provider. Understanding the reliability level that the user would be satisfied with gives the service provider a known overhead, or “error budget,” to try new features and to preserve the profit margin in general.
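To make the “error budget” idea concrete, here is a minimal arithmetic sketch in Python. The 99.9% target, the 30-day window and the function names are illustrative assumptions, not figures from Nobl9 or from the OpenSLO spec.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unreliability in the window for a given SLO target.

    slo_target is a fraction, e.g. 0.999 for "99.9% of minutes are good".
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, window_days: int, bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - bad_minutes / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target over 30 days.
    print(round(error_budget_minutes(0.999, 30), 1))    # about 43.2 minutes
    print(round(budget_remaining(0.999, 30, 10.8), 2))  # about 0.75 of the budget left
```

Under these assumed numbers, the provider can absorb roughly 43 minutes of unreliability a month; whatever is left unspent is the room available for riskier releases.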

With this spec, the idea is to make SLOs “a first-class citizen in the modern DevOps lifecycle,” said Nobl9 CEO Marcin Kurc, in a statement.

In the OpenSLO box is a standard YAML specification format for SLOs, along with a parser to check the completed YAML files against the syntax. By using a declarative YAML format, OpenSLO provides a way to embed SLOs into operational processes, setting the stage for not only observation but even automation. The SLO file can be checked into Git or another code repository along with the actual code and the files with the Infrastructure-as-Code settings, “so you validate that as part of the release pipeline,” Merker said. System designers can then automate behaviors such as autoscaling, alert messaging and resource provisioning, all based on the SLO numbers coming in.
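What such a pipeline step might look like can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the field names (`name`, `service`, `target`, `window_days`) are not the OpenSLO schema, the checks are not the project's parser, and the 75% alert threshold is an arbitrary assumption.

```python
from __future__ import annotations

# Hypothetical required fields; NOT the actual OpenSLO schema.
REQUIRED_FIELDS = ("name", "service", "target", "window_days")


def validate_slo(doc: dict) -> list[str]:
    """Return a list of problems found in a declarative SLO definition."""
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in doc]
    target = doc.get("target")
    if target is not None and not (isinstance(target, (int, float)) and 0 < target < 1):
        errors.append("target must be a fraction between 0 and 1")
    return errors


def choose_action(doc: dict, good_events: int, total_events: int) -> str:
    """Map observed reliability to 'proceed', 'alert' or 'freeze'."""
    if total_events == 0:
        return "proceed"  # no traffic, so no budget has been spent
    allowed_bad = (1 - doc["target"]) * total_events
    actual_bad = total_events - good_events
    burned = actual_bad / allowed_bad  # share of the error budget consumed
    if burned >= 1.0:
        return "freeze"  # budget exhausted: block the release
    if burned >= 0.75:   # assumed threshold for an early warning
        return "alert"
    return "proceed"


if __name__ == "__main__":
    slo = {"name": "checkout-availability", "service": "checkout",
           "target": 0.99, "window_days": 28}
    print(validate_slo(slo))
    print(choose_action(slo, good_events=9920, total_events=10000))
```

In a real pipeline the definition would be loaded from the versioned YAML file and the event counts would come from a monitoring backend; the point of the sketch is only that a declarative target makes both the validation and the follow-up action mechanical.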

The metrics can be drawn directly from application performance monitoring software. Nobl9 itself has partnered with observability providers such as Splunk, New Relic, Datadog, Dynatrace and Lightstep. Writing to an open standard will help end users switch between these toolsets with minimal disruption, Merker argued.
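The portability argument can be illustrated with a small Python sketch in which the SLO calculation depends only on a narrow interface, so the backend behind it can be swapped. The `MetricSource` interface and the `InMemorySource` stand-in are invented for this example; they do not correspond to any vendor's API or to anything defined by OpenSLO.

```python
from __future__ import annotations

from typing import Protocol


class MetricSource(Protocol):
    """Hypothetical interface any monitoring backend adapter would implement."""

    def good_and_total(self, service: str) -> tuple[int, int]:
        """Return (good_events, total_events) for the service."""
        ...


class InMemorySource:
    """Stand-in backend for the example; a real adapter would query a vendor."""

    def __init__(self, counts: dict[str, tuple[int, int]]) -> None:
        self._counts = counts

    def good_and_total(self, service: str) -> tuple[int, int]:
        return self._counts[service]


def attainment(source: MetricSource, service: str) -> float:
    """Fraction of good events, computed the same way for every backend."""
    good, total = source.good_and_total(service)
    return good / total if total else 1.0


if __name__ == "__main__":
    source = InMemorySource({"checkout": (995, 1000)})
    print(attainment(source, "checkout"))
```

Swapping monitoring tools would then mean writing a new adapter that satisfies the same interface, while the SLO definitions and the logic built on them stay untouched.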

Nobl9 first created the specification for its own platform, but opened it up for others. A common schema across different organizations could help ensure community support, pointed out Niall Richard Murphy, a former SRE at Google and Microsoft who contributed to OpenSLO, in a statement.

Other contributors to the project include GitLab’s Andrew Newdigate, Dynatrace’s Juergen Etzlstorfer, and Alex Nauda and Ian Bartholomew, both from Nobl9.

For GitLab, formulating SLO error budgets engaged different departments in the reliability engineering process, Newdigate noted.

The effort is looking for additional input, particularly from cloud vendors, application lifecycle management vendors, consultancies and other open source projects. 

Background feature image by Hannah Reding on Unsplash.
