Key Metrics for DevOps Teams: DORA and MTTx
In a complicated, modern environment, how do you make sure that your application and business are performing the way they should? How do you know your customers are having a positive experience? How do you know when it’s time to acquire more infrastructure or maybe even refactor your app? One of the easiest approaches to understand and implement is to use metrics.
Metrics provide a way to track the health of your application and infrastructure over time. They let you determine whether your software development practice is healthy and point to where it can improve. Using metrics also lets you quantify the cost of outages, including the stress on the team members responsible for troubleshooting and fixing errors, and the damage to your customers when issues do happen.
In this article, I’ll discuss two of the most common sets of metrics in DevOps, DORA and MTTx.
DORA stands for DevOps Research and Assessment, a team that conducted research over several years and then created four key metrics that indicate the performance of a software development team. These metrics and their meanings are as follows:
Deployment Frequency
How often do you release to production? This one is pretty simple: you count how many production releases you have in a given time period and track that number over time. Successful DevOps teams practice “continuous deployment,” where there are many deployments a day, sometimes even several an hour. This is the “gold standard” for DevOps teams, but even if you aren’t there now, tracking deployment frequency is your first step.
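As a rough illustration of the counting described above, here is a minimal Python sketch that groups deployment timestamps by ISO week. The timestamps and the function name `deployments_per_week` are made up for the example; in practice the data would come from whatever CI/CD or release tooling you use.

```python
from collections import Counter
from datetime import datetime


def deployments_per_week(deploy_times):
    """Count production deployments per ISO (year, week)."""
    counts = Counter()
    for ts in deploy_times:
        iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


# Hypothetical deployment timestamps, for illustration only.
deploys = [
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 3, 14, 0),
    datetime(2024, 1, 4, 16, 45),
    datetime(2024, 1, 9, 11, 15),
]

weekly = deployments_per_week(deploys)
for (year, week), count in weekly.items():
    print(f"{year}-W{week:02d}: {count} deployments")
```

Weekly buckets are only one choice; a team deploying many times a day would probably bucket by day or hour instead.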
Lead Time (for Changes)
What’s the time lag between a commit to your codebase and that commit going live in production? This is related to deployment frequency, but it is not quite the same metric. Many organizations develop in feature branches, so even if you are deploying quite frequently, new features can linger in branches while other changes (bug fixes, perhaps?) land in the main branch. Determining how long the average commit spends in limbo before it’s live and in the hands of users tells you the overall velocity of your software development practice: how quickly can you get new features in front of your users?
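A small sketch of the calculation, assuming you can pair each commit timestamp with the time it reached production. The data and the names `lead_times_hours`, `mean_lead` and `median_lead` are hypothetical. It reports both the mean and the median, since a single change lingering in a branch can pull the mean up considerably.

```python
from datetime import datetime
from statistics import mean, median


def lead_times_hours(changes):
    """Return commit-to-production lead time, in hours, for each change."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]


# Hypothetical (committed, deployed) pairs, for illustration only.
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0)),   # 6 hours
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 4, 10, 0)),  # 48 hours
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 20, 0)),   # 12 hours
]

hours = lead_times_hours(changes)
mean_lead = mean(hours)
median_lead = median(hours)
print(f"mean: {mean_lead:.1f} h, median: {median_lead:.1f} h")
```

Here the one slow change drags the mean to 22 hours while the median stays at 12, which is the "lingering in limbo" effect described above.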
Change Failure Rate
What percentage of deployments caused a failure in production? Another easy one. How many of your deployments did you eventually have to roll back, patch or otherwise manipulate as a result of that deployment causing a production issue? Obviously, the goal for this is zero, but strangely enough, a zero percent failure rate may mean you’re being a little too conservative in your development practice. There is always a delicate dance among DevOps teams to balance stability with innovation.
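The percentage itself is simple arithmetic. The sketch below assumes each deployment record carries a flag saying whether it later needed a rollback or patch; the `caused_failure` field and the sample records are invented for the example, and deciding what counts as a failure is a judgment call your team has to make.

```python
def change_failure_rate(deployments):
    """Percentage of deployments flagged as having caused a production failure."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["caused_failure"])
    return 100.0 * failed / len(deployments)


# Hypothetical deployment records, for illustration only.
deployments = [
    {"id": "d1", "caused_failure": False},
    {"id": "d2", "caused_failure": True},
    {"id": "d3", "caused_failure": False},
    {"id": "d4", "caused_failure": False},
]

rate = change_failure_rate(deployments)
print(f"change failure rate: {rate:.1f}%")
```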
Time to Restore Service
How long, on average, does it take to recover from a failure in production? Whether it’s a release going awry, a backhoe cutting the fiber to your data center or even us-east-1 having a bad day, how long until your users are no longer affected? It’s critically important to minimize this number. Problems are inevitable in any architecture, but the robustness of your infrastructure can reduce their impact and ensure that your application continues to deliver business value.
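One way to compute this average, sketched in Python under the assumption that each incident is recorded as a start time and a time when users were no longer affected. The incident data and the name `mean_time_to_restore` are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average time from the start of user impact to restoration."""
    if not incidents:
        return None  # no incidents recorded, so there is nothing to average
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical (started, restored) pairs, for illustration only.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),   # 30 minutes
    (datetime(2024, 1, 12, 22, 0), datetime(2024, 1, 13, 0, 0)),   # 120 minutes
    (datetime(2024, 1, 20, 3, 0), datetime(2024, 1, 20, 3, 30)),   # 30 minutes
]

restore_time = mean_time_to_restore(incidents)
print(f"mean time to restore: {restore_time}")
```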
In addition to DORA metrics, which are tilted toward software development uses, there are more operational metrics that are called MTTx — mean time to (something). Here are some of the most common ones:
MTTR
This one’s easy: it’s the last DORA metric. On average, how long is the system degraded when an error occurs? With the acronym standing for “mean time to resolve,” what it measures is pretty self-explanatory. How long, on average, is it from the very start of a problem to when its user impact is fully mitigated?
MTTD
The last letter in this one stands for “detect,” and it is a measure of how long it takes you to know that something is wrong: from the start of a problem, how long does it take before you are alerted or otherwise become aware of it? This can be a few seconds if the whole app falls over and is throwing out 503s, or it could be a few weeks if the problem affects only one user who doesn’t bother to complain until they can’t deal with it anymore. In general, minimizing this metric improves your other MTTx metrics, because triage and resolution cannot start until someone knows there is a problem.
MTTA
A is for “acknowledge,” and this metric has a little nuance. On its face, you’re just measuring how long it takes from a problem being detected to the problem being acknowledged by a network operations center operator, site reliability engineer or someone else responsible for starting the triage and resolution process. However, there are a few different opinions on when you can consider a problem acknowledged: is it when anybody sees the alert, or is it when the person who ultimately fixes it sees the alert? Which one you use is up to you, but in my opinion the latter is a far more accurate and useful measure.
MTTC
C is for cookie, but not today. In our context, this C is for “clue.” Once the incident is acknowledged, how long does it take before the person who acknowledged it actually knows what’s wrong and how to fix it? High MTTC times suggest your environment or application is too complicated, you don’t have enough visibility into the deployment or your engineers are responsible for too many things and can’t quickly narrow down problems when they’re paged.
MTTI
I stands for “innocence” in this case, and it is maybe a little tongue in cheek, but it’s a very important MTTx metric in modern environments, where many different teams are responsible for delivering and operating software. MTTI is a measure of how long it takes, on average, for you to recognize that a particular problem is not your team’s fault. Anybody who’s deployed a modern application knows that a popular whipping boy for issues is whoever provides the network infrastructure. Immediately going to the network operations team and waiting for them to respond when the issue isn’t related to the network just delays the overall troubleshooting process. A fast MTTI for infrastructure teams reduces overall MTTR and also helps the members of those teams spend less time firefighting.
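To make the relationship between these phases concrete, here is a minimal sketch that derives MTTD, MTTA, MTTC and MTTR from one incident timeline. It assumes each incident records five timestamps; the `Incident` class, its field names and the sample data are all hypothetical. MTTI is left out because it needs per-team data (when a team was pulled in and when it was ruled out) that this simple timeline does not capture.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime       # problem begins
    detected: datetime      # alert fires or someone notices
    acknowledged: datetime  # a responder picks it up
    diagnosed: datetime     # the responder knows what is wrong
    resolved: datetime      # user impact is fully mitigated


def mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttx(incidents):
    """Compute the MTTx metrics for a non-empty list of incidents."""
    return {
        "MTTD": mean_delta([i.detected - i.started for i in incidents]),
        "MTTA": mean_delta([i.acknowledged - i.detected for i in incidents]),
        "MTTC": mean_delta([i.diagnosed - i.acknowledged for i in incidents]),
        "MTTR": mean_delta([i.resolved - i.started for i in incidents]),
    }


# Hypothetical incidents, for illustration only.
day = datetime(2024, 1, 15)
incidents = [
    Incident(day.replace(hour=10, minute=0), day.replace(hour=10, minute=2),
             day.replace(hour=10, minute=7), day.replace(hour=10, minute=27),
             day.replace(hour=10, minute=40)),
    Incident(day.replace(hour=14, minute=0), day.replace(hour=14, minute=10),
             day.replace(hour=14, minute=13), day.replace(hour=14, minute=23),
             day.replace(hour=15, minute=0)),
]

metrics = mttx(incidents)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Note that MTTD, MTTA and MTTC measure consecutive phases, while MTTR here spans the whole incident from start to resolution, so shaving time off any of the earlier phases shortens MTTR as well.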
How Do I Generate All These and Use Them?
Observability and monitoring tools support the calculation of many of these metrics, either out of the box or through plugins. They can also be calculated manually, though that defeats a lot of the point of having them.
Tracking these metrics over time gives you important insight into how your team, infrastructure and applications are performing and can help you make the case for additional resources or time.
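If your tooling does not chart trends for you, a rough version is easy to sketch. The example below assumes you have dated samples of some metric (here, restore time in minutes) and averages them per month; the sample values and the name `monthly_mean` are invented for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean(samples):
    """Group (date, value) samples by (year, month) and average each group."""
    buckets = defaultdict(list)
    for day, value in samples:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Hypothetical restore times in minutes, for illustration only.
samples = [
    (date(2024, 1, 5), 90), (date(2024, 1, 20), 70),
    (date(2024, 2, 3), 60), (date(2024, 2, 17), 40),
    (date(2024, 3, 9), 30),
]

trend = monthly_mean(samples)
for (year, month), value in trend.items():
    print(f"{year}-{month:02d}: {value} min")
```

A falling series like this one is the kind of evidence that helps when you are arguing that an investment in tooling or staffing paid off.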
Finally, focusing on both MTTx and DORA metrics epitomizes a true DevOps viewpoint. By tracking the success and performance of both the Dev and Ops sides of the house, you establish a common language and shared goals, which improves teamwork and overall software delivery quality.