
USENIX: The 3 Measures of Successful Site Reliability Engineering

13 Jan 2021 1:47pm

Citing an economic insight from the 1970s, AppDynamics Technology Evangelist Marco Coulter warned attendees of SRECon20 not to get too hung up on specific metrics, because they may not offer complete guidance as to the overall success of the system being measured.

“Whenever a measure becomes a target, it ceases to be a good measure,” he said during his presentation at the USENIX virtual event last month, paraphrasing British economist Charles Goodhart, who was writing about managing U.K. monetary policy.

“Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”

— Charles Goodhart.

Instead, the SRE must take the entire system into account, particularly in terms of customer satisfaction. “As technicians, we focus on the measure as the target, the goal,” Coulter said. A better approach is for the SRE to work with the end user to define what overall success looks like.

In his presentation, Coulter tells a story about working for a hospital service provider, specifically managing a system that would insert new lab results into the patient record, which was managed by a mainframe system. Hospital nurses complained of the time it took to update the patient records, and a quick analysis found that messages were getting caught in a queue.

To address this concern, the dev team formulated a service level agreement (SLA) with the hospital: if the queue grew to more than 100 messages, the hospital would get a refund. Messages also had to be processed within 10 seconds. Coulter coded a script that would set off an alert if the queue grew close to 100, so admins could take action, and capacity planning was rejiggered so that queue processing would have all the server power it needed.
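An alerting check of this kind might look something like the following minimal sketch. The 100-message limit comes from the SLA in the story; the 80% warning ratio and the function name are assumptions made for illustration, since the talk did not describe the script's internals.

```python
SLA_QUEUE_LIMIT = 100  # from the SLA described in the story
WARN_RATIO = 0.8       # assumption: "close to 100" means 80% of the limit


def queue_alert_level(queue_depth: int,
                      limit: int = SLA_QUEUE_LIMIT,
                      warn_ratio: float = WARN_RATIO) -> str:
    """Classify a queue depth reading as 'ok', 'warn' or 'breach'."""
    if queue_depth < 0:
        raise ValueError("queue depth cannot be negative")
    if queue_depth > limit:
        return "breach"  # SLA violated: more than `limit` messages queued
    if queue_depth >= limit * warn_ratio:
        return "warn"    # close to the limit: admins should take action
    return "ok"
```

A check like this only sees the queue. An empty queue always reads as "ok", whatever is happening to requests before they reach it.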

The trouble was that the system still lagged, angering the busy nurses who relied on it, even though the message queues were empty. “The transactions were timing out even before they hit the message queue,” he said. The message queue wasn’t necessarily the bottleneck that led to the dissatisfaction.

The dev team was managing the application to the metric, not the outcome.

Site Reliability Engineering in 3D

The trick of SRE is to balance the need to please the customer against the unnecessary expense of over-provisioning operations, or stifling innovation. Three key dimensions can cover this, according to Coulter.

“You need to consider all three dimensions for success,” Coulter said. Roughly, they are:

  • Service Level Indicators (SLIs): These are the numbers that describe the state of the running system. SLIs are defined at system boundaries or team boundaries. SLIs should measure system slowdowns, not outages, which happen less often these days. The numbers could be captured by an application performance monitoring (APM) platform such as AppDynamics, Datadog or New Relic, or by one of a number of newer observability tools such as Honeycomb.io or IBM’s Instana.
  • Service Level Objectives (SLOs): These are the benchmarks that the SLI numbers need to hit, as agreed upon between the service provider and the end user. They can be expressed in terms of performance curves.
  • Service Level Agreements (SLAs): These are the agreed-upon actions that the provider must adhere to should the SLOs go unmet. It could be a refund, or perhaps the development cycle gets suspended for 28 days to address the ongoing issues.
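To make the relationship between the three concrete, here is a minimal sketch in Python. The class and function names, the latency threshold, the target fraction and the refund action are all hypothetical choices for illustration, not figures or code from Coulter's talk.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Slo:
    threshold_s: float  # a request counts as "good" if it finishes within this time
    target: float       # required fraction of good requests, e.g. 0.95


def sli_good_fraction(latencies_s: Sequence[float], threshold_s: float) -> float:
    """SLI: the measured fraction of requests served within the threshold."""
    if not latencies_s:
        raise ValueError("no measurements to evaluate")
    good = sum(1 for t in latencies_s if t <= threshold_s)
    return good / len(latencies_s)


def slo_met(latencies_s: Sequence[float], slo: Slo) -> bool:
    """SLO: the benchmark the SLI has to hit."""
    return sli_good_fraction(latencies_s, slo.threshold_s) >= slo.target


def sla_action(latencies_s: Sequence[float], slo: Slo) -> str:
    """SLA: the agreed consequence when the SLO goes unmet (hypothetical)."""
    return "none" if slo_met(latencies_s, slo) else "refund"
```

The layering mirrors the definitions above: the SLI is a measurement, the SLO is a threshold on that measurement, and the SLA is what happens when the threshold is missed.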

“In a perfect world, [the SLA] is defined by the business or the customer and then you build the SLOs and SLIs underneath it,” he said.

In the case of the hospital, the cause of the slowdown was malformed packets (messages that did not meet the HL7 standard for hospital data) that were emitted by a proprietary application. The dev team had no control over this application, beyond filing a bug report with the vendor, but they did have control over how success was defined by the SLO, and over the expectations of the end user.

In many cases, the engineering team doesn’t need to set SLOs to the highest possible performance level. In fact, such a level could be unduly expensive for the service provider to maintain. Rather, they should be set to customer expectation. (One exception to this rule is financial institutions, where the speed of a transaction is a fiercely competitive differentiator.)

The most difficult part of the measure is understanding the end user. In the case of the hospital, this involved “observing behavior in the wards and talking to nurses,” Coulter said. They found that the nurses had an “instinctive expectation” of when the lab results would come back (in about five minutes or so), though some nurses would hit the submit button repeatedly, particularly when the system was slow, which made the average response time even worse.

With this knowledge, the service provider would be able to set an SLA that centered on returning the full results within five minutes, rather than on the 10-second processing time.
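An indicator built around that expectation would measure the end-to-end time from a nurse's first submit to the result landing in the patient record. The sketch below is one way to express it; the event shapes and function names are hypothetical, and it assumes that repeated clicks for the same lab order share a request id, which the talk did not specify.

```python
from typing import Dict, Iterable, Tuple

FIVE_MINUTES_S = 5 * 60  # the nurses' expectation reported in the article


def first_submit_times(submits: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Keep only the earliest submit timestamp per request id.

    Repeated clicks on the submit button collapse into a single request.
    """
    first: Dict[str, float] = {}
    for request_id, ts in submits:
        if request_id not in first or ts < first[request_id]:
            first[request_id] = ts
    return first


def fraction_within(submits: Iterable[Tuple[str, float]],
                    results: Dict[str, float],
                    limit_s: float = FIVE_MINUTES_S) -> float:
    """Fraction of requests whose result arrived within limit_s of the first submit.

    Requests with no recorded result count as misses.
    """
    first = first_submit_times(submits)
    if not first:
        raise ValueError("no requests to evaluate")
    on_time = 0
    for request_id, submitted_at in first.items():
        done_at = results.get(request_id)
        if done_at is not None and done_at - submitted_at <= limit_s:
            on_time += 1
    return on_time / len(first)
```

Because the clock starts at the first submit, requests that time out before reaching the message queue still count against the indicator, unlike a check based on queue depth alone.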

“The SLAs are not there to beat each other up. They are there to capture the mutual understanding. You reach that mutual understanding through negotiation,” Coulter said. “Negotiating is a key skill for any SRE person.”


