A New Definition of Reliability
It’s no surprise that organizations with software products are prioritizing reliability as feature #1.
In the software space, when we talk about “reliability,” we’re usually referring to site reliability engineering (SRE). Google originated this practice as a way to implement DevOps by applying software engineering approaches to operations work. In truth, reliability is much more far-reaching.
Even within engineering, if you ask 10 engineers to define reliability, you’d likely get 10 different answers. Despite this variety in opinion, we at Blameless believe that getting aligned on what reliability means is the first big step to achieving it. We’re excited to provide what we think is the most helpful perspective on reliability, one that we’ve built up through our experiences with clients of all sizes.
We’ve found that as organizations build up their own reliability practices, this definition is the one that leads them to impactful priorities and consistent results.
Is Reliability Just System Health?
The first thing you might assume is that reliability is synonymous with availability. After all, if a service is up 99% of the time, that means a user can rely on it 99% of the time, right? Obviously, this isn’t the whole story, but it’s worth exploring why.
For starters, these simple system health metrics aren’t really so “simple.” Starting with just the Four Golden Signals, you’ll end up with the latency, traffic, error rate, and resource saturation of all your different services. For a complex product, this adds up to a whole lot of numbers. How do you combine and weigh all these metrics? Which are the important ones to watch and prioritize?
Judging things like errors and availability can be difficult too. Gray failure, where a service isn’t working completely but hasn’t totally failed either, can be hard to capture with quantitative metrics. How do you decide when a service is “available enough”?
What about a situation where your service performs exactly as intended, but doesn’t align with your customers’ expectations? How do you capture these in your picture of system health?
Clearly, there needs to be another layer to this definition of reliability!
Reliability as Users’ Subjective Experience with the Service
The answer to all these questions is “it depends on user happiness.” Taking into account user expectations and perspectives allows you to prioritize your many service metrics. Here’s a classic straightforward example. Imagine you have two incidents:
- A service that 99% of your users depend on experiences a slowdown
- A service that 1% of your users access occasionally experiences a total outage
Which one should receive your attention first? Even though incident two is the more severe failure in terms of system health, incident one affects far more of your customers and should be your higher priority.
This new form of reliability, espoused by Google in its SRE book, motivated the creation of SRE tools like SLIs, SLOs and error budgets. These tools allow you to quantify customer happiness from system health metrics. You weigh each metric based on how much it matters to your customers’ experiences.
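To make the relationship between these tools concrete, here is a minimal Python sketch of how an SLI, an SLO, and an error budget fit together. The function names, the 99.9% target, and the request counts are all hypothetical values chosen for illustration; your own SLIs would be defined around whatever your users actually care about.

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that met the user's expectation."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO was missed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures that actually happened
    return 1.0 - actual_bad / allowed_bad


# Hypothetical numbers: 100,000 requests, 99,950 of them fast and successful,
# measured against an assumed 99.9% SLO.
current_sli = sli(99_950, 100_000)
budget_left = error_budget_remaining(0.999, 99_950, 100_000)
print(f"SLI: {current_sli:.4%}, error budget remaining: {budget_left:.0%}")
```

With these made-up numbers, the service has used 50 of the 100 failures its SLO allows, so about half of the error budget remains.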
Getting to this picture of your reliability isn’t easy. You need to really understand your users:
- What different groups of users do you have? How do they use each service, and what do they depend on most?
- How big is each group of users? How much do they contribute to your bottom line?
- How satisfied are these groups of users with the services as-is? What sort of degradation of system health would make them dissatisfied?
- What expectations do they have for the service to evolve in the future? How is their confidence in this expectation affected by incidents?
This deep, multi-faceted understanding of users takes a holistic effort to achieve. It isn’t entirely within the scope of your DevOps or SRE teams to have answers to these questions. The whole organization, including customer success, marketing, product, and sales teams will have to collaborate to build this big picture.
What we’ve come to understand, though, is that this picture still isn’t quite big enough.
A New Socio-Focused Definition of Reliability
The Google SRE book brought a revolution in reliability, but it made certain assumptions that don’t apply to every org. Not everyone is a Google — in fact, only Google is! Unlike them, you probably have limited resources to spend on improving your reliability. We think this limitation should be fundamental to your understanding of what reliability is.
Let’s try another simplified example of two services:
- A service used by 99% of users with 70% availability
- A service used by 50% of users with 80% availability
Imagine that you have a team that can spend 10 hours improving the reliability of one of these services. Which one do you tell them to prioritize? Based just on customer happiness, you might think service one is the easy pick. But there’s another factor to consider:
- Service one has a codified playbook that walks on-call engineers through how to diagnose, experiment with, and resolve common incidents.
- Service two is new, and on-call engineers are unfamiliar with it.
Your team isn’t just a bunch of robots that can flawlessly execute any plan with perfect prioritization. They won’t always have the resources they need, and you won’t always have the time to create those resources. Their resilience and adaptability in dealing with new problems will vary significantly depending on all sorts of other circumstances.
Thinking about the potential impact on customer happiness in light of your socio-technical resilience, you’d likely want to have your team prioritize service two. This understanding also teaches you what it really means to “improve reliability”: which processes, tools, and resources will cover gaps and make your team confident in handling incidents?
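One way to make this trade-off explicit is a toy scoring function, sketched below in Python. This is an illustration, not a formula from the SRE literature: the `preparedness` values are made-up placeholders standing in for your own judgment of how ready the on-call team is, and multiplying the factors together is just one simple assumption about how they combine.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    user_share: float    # fraction of users who depend on the service
    availability: float  # measured availability, 0.0 to 1.0
    preparedness: float  # ASSUMED 0.0-1.0 estimate of on-call readiness


def priority_score(service: Service) -> float:
    """Toy score: user impact of failures, scaled by how unprepared the team is."""
    user_impact = service.user_share * (1.0 - service.availability)
    readiness_gap = 1.0 - service.preparedness
    return user_impact * readiness_gap


# User share and availability come from the example above; the preparedness
# values are hypothetical (playbook exists vs. brand-new service).
services = [
    Service("service one", user_share=0.99, availability=0.70, preparedness=0.9),
    Service("service two", user_share=0.50, availability=0.80, preparedness=0.2),
]
ranked = sorted(services, key=priority_score, reverse=True)
print([s.name for s in ranked])
```

On user impact alone, service one scores higher. Once the assumed readiness gap is factored in, service two moves to the top, which matches the reasoning above.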
Finding Gaps in Socio-Technical Reliability
To answer this question, let’s break it down into other questions. Picture a service that you feel is very reliable: users are happy with it, it performs up to their standards often enough, they tolerate occasional failures in it, and so on. To judge its socio-technical reliability, ask yourself these questions about it:
- Are engineers equipped with a plan to deal with it failing? Have they practiced it?
- Do engineers know when and where to escalate if the plan isn’t working to resolve the incident? Do they know how to judge the severity of the incident?
- Are engineers aware of how the feature is marketed to customers/prospects/stakeholders and how that influences user expectations?
- Are engineers aware of product plans to evolve the service, how this plan is conveyed to prospects, etc., and how this plan could change the health of the service?
- What other failures might accompany this service failing? What is a holistic picture of how engineers would respond to these coinciding failures?
- How burnt out is the team that would be on-call for this service? How balanced are their workloads? How much failure are they prepared to handle?
- How tolerant is this team to disruption? Could they still handle incidents in the service if a given engineer was suddenly absent?
- Do your on-call teams understand the boundaries of service ownership? Do they know what types of incidents they’re responsible for?
By answering these questions, you can identify where additional resources are required and where they’d be most impactful. But how do you judge things like “which teams are most ready?”
Quantifying Socio-Technical Reliability Challenges
Your best guide to finding objective answers to these questions is to learn from experience. Rather than trying to quantify things like “preparedness” or “tolerance to disruption” directly, look at how incidents are playing out. Track how effectively incidents were resolved, looking at MTTx metrics and the time to complete each step. Then compare the stats from before and after each policy change or newly developed resource to see its impact.
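As a rough sketch of that before-and-after comparison, the Python below computes the median time to resolve for incidents on either side of a change date. The incident records, field names, and the change date are all hypothetical; in practice this data would come from your incident management tooling.

```python
from datetime import datetime
from statistics import median


def median_minutes_to_resolve(incidents):
    """Median detection-to-resolution time in minutes, or None if there is no data."""
    if not incidents:
        return None
    durations = [
        (i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents
    ]
    return median(durations)


def compare_around(incidents, change_date):
    """Split incidents at change_date and return (median before, median after)."""
    before = [i for i in incidents if i["detected"] < change_date]
    after = [i for i in incidents if i["detected"] >= change_date]
    return median_minutes_to_resolve(before), median_minutes_to_resolve(after)


# Hypothetical incident log; a new playbook is assumed to ship on 2024-02-01.
incidents = [
    {"detected": datetime(2024, 1, 5, 10, 0), "resolved": datetime(2024, 1, 5, 11, 30)},
    {"detected": datetime(2024, 1, 20, 9, 0), "resolved": datetime(2024, 1, 20, 10, 0)},
    {"detected": datetime(2024, 2, 10, 14, 0), "resolved": datetime(2024, 2, 10, 14, 40)},
    {"detected": datetime(2024, 2, 25, 8, 0), "resolved": datetime(2024, 2, 25, 8, 20)},
]
before, after = compare_around(incidents, datetime(2024, 2, 1))
print(f"Median time to resolve: {before} min before, {after} min after")
```

The median is used here instead of the mean so that a single unusually long incident does not dominate the comparison. The same split could be applied to the time taken by each individual step.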
You can observe how often specific team members get pulled to on-call duty every month. Look at how often a specific service experienced Sev0 incidents in the past quarter. You can even keep track of how often teams have bandwidth to complete follow-up actions from incidents. This information gives you the insight to decide when it might be time to slow down and address tech debt. Or perhaps you can use it as a warning sign that your team is experiencing burnout — or on the verge of it.
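A minimal Python sketch of this kind of tracking is shown below. The page records, engineer names, service names, and the monthly threshold are all invented for illustration; the right threshold for your team is a judgment call, not something this code can decide.

```python
from collections import Counter

# Hypothetical pages from one month of on-call duty.
pages = [
    {"engineer": "alice", "service": "checkout", "severity": "Sev0"},
    {"engineer": "alice", "service": "checkout", "severity": "Sev2"},
    {"engineer": "alice", "service": "search", "severity": "Sev1"},
    {"engineer": "bob", "service": "search", "severity": "Sev0"},
    {"engineer": "alice", "service": "checkout", "severity": "Sev0"},
    {"engineer": "carol", "service": "billing", "severity": "Sev2"},
]

# How often each engineer was pulled in, and how often each service hit Sev0.
pages_per_engineer = Counter(p["engineer"] for p in pages)
sev0_per_service = Counter(p["service"] for p in pages if p["severity"] == "Sev0")

# ASSUMED threshold: more pages than this in a month is treated as a warning sign.
MAX_PAGES_PER_MONTH = 3
overloaded = sorted(
    name for name, count in pages_per_engineer.items() if count > MAX_PAGES_PER_MONTH
)

print("Pages per engineer:", dict(pages_per_engineer))
print("Sev0 incidents per service:", dict(sev0_per_service))
print("Possibly overloaded:", overloaded)
```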
Of course, numbers won’t tell the entire story. Every incident will have mitigating circumstances that will make comparing apples to apples difficult. Instead, you can look at general trends, controlling for similar incidents, to see where improvement is happening and where it’s still needed. Tracking these statistics will also highlight outliers, which you can investigate further in retrospective reviews.
Welcome to Comprehensive Reliability
In essence, this new reliability is:
- The health of your system
- Weighted based on customer expectations and happiness
- Prioritized based on your current capabilities
You’re not just looking at what’s in the service, or the users in front of the service, but also the humans behind the service. This is a definition that guides you to grow more reliable in the most meaningful and impactful way with your limited time and resources.
SRE principles and incident management practices are further enhanced by this more complete picture. For example, with incident retrospectives, you’re now encouraged to dig into not just the systemic causes of the incident, but also the process of incident management itself. Error budget policies can better account for your team’s capabilities in different service areas, implementing strategically targeted code freezes and bug bashes.
This definition of reliability is one that can be shared by all teams, as a more universal language. Teams can align on goals while also understanding capabilities and where resources are lacking.
We hope that this foray into a new world of reliability inspires you to think more deeply about the concept in your own organizations! Getting on the same page about the definition, and developing practices to measure and optimize it, is the best way to improve on your #1 feature: reliability.