Re-Evaluating MTTR as a Key Metric for Operational Performance
While the technology for monitoring systems and applications has changed dramatically over the years, the way we measure performance and availability hasn’t changed much at all. But it might be time to think differently about the metrics we use when it comes to managing our IT systems.
Most IT organizations use fairly standard metrics to assess operational performance: application performance and availability, service-level agreement (SLA) fulfillment, incident count and severity, and mean time to repair (MTTR).
When these numbers look good, we know that our systems are generally stable, our teams and their workflows are well-balanced, we are managing issues competently, and we are recovering quickly when there are problems.
With these numbers in hand, IT can effectively demonstrate its value to the business, the business can better plan its workload and deliverables, and both can look for ways to make changes and improvements backed by data.
Within IT teams, these numbers are frequently used to set benchmarks and reward those who surpass them, because if we are making continual improvements in how quickly we respond and problem-solve, we are surely improving the customer’s experience and their impression of our business.
But with the growing scope and use of artificial intelligence for IT operations, or AIOps, at least one of these metrics may soon be seen differently.
AIOps Enables Effective Use of Data
You may not yet have adopted it in any obvious way in your own organization, but in its most elemental form, AIOps was developed to help manage today’s astounding volumes and varieties of data.
What’s the problem with so much data? As with anything, too much of a good thing isn’t actually a good thing. Too much data means more time to comb through the data to find any sort of actionable insights.
If you have an outage and 100 alerts are triggered, how much time are you wasting investigating 99 false alarms before you get to the one that can tell you what really went wrong?
Enter AIOps. Combining big data and machine learning to automate the kinds of IT operations processes that have up to now required massive time and effort, AIOps creates efficiencies at scale, enables visibility across your infrastructure, and helps your team derive the insights needed to make powerful, data-driven business decisions more easily.
When event correlation, anomaly detection and root cause determination are essentially taken off your team’s work docket, thanks to the analytical capabilities of AIOps, IT teams will find themselves with more time to dedicate to more interesting, and more productive, projects.
Oh, the Irony
But there’s a catch. Think about the powerful problem-solving abilities you gain with AIOps. With the greater efficiency, visibility and insight provided by the machine learning capabilities of AIOps, your MTTR numbers may actually go … up.
So, if you’ve been evaluating your team’s performance based on incremental reductions in the time it takes to restore services, you may soon want a new measure. Here’s why:
As AIOps-enabled solutions automate routine testing and proactively find, suggest fixes for, and potentially even remediate issues, all without human intervention or oversight, many routine disruptions will simply cease to exist: your AIOps solution will have stopped the outage before it even happened. Those quick fixes are exactly the incidents that used to pull your average repair time down.
But what’s left? The bigger, more complex service and operations issues that can’t be automated. The ones that may indeed require the talent of your operations staff, and possibly a lot more time.
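To make the arithmetic concrete, here is a minimal sketch using invented repair times (the numbers are hypothetical and not drawn from any real system). It shows how removing the quick, routine incidents can raise the average even as total repair time and incident count both fall.

```python
from statistics import mean

# Hypothetical repair times in minutes, for illustration only.
routine_incidents = [10, 15, 12, 8, 20, 15, 10, 10]  # quick, repetitive fixes
complex_incidents = [180, 240]                        # issues that need engineers


def mttr(repair_minutes):
    """Mean time to repair: total repair time divided by incident count."""
    return mean(repair_minutes)


# Before AIOps: the team handles every incident by hand.
before = routine_incidents + complex_incidents
# After AIOps (assumed): routine incidents are prevented or auto-remediated,
# so only the complex ones still reach the team.
after = complex_incidents

mttr_before = mttr(before)
mttr_after = mttr(after)
total_before = sum(before)
total_after = sum(after)

print(f"Before: {len(before)} incidents, {total_before} min total, MTTR {mttr_before} min")
print(f"After:  {len(after)} incidents, {total_after} min total, MTTR {mttr_after} min")
```

In this made-up scenario MTTR climbs from 52 to 210 minutes, yet the team spends 100 fewer minutes on repairs and handles two incidents instead of ten.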
All isn’t lost though. While these remaining types of challenges may be gnarlier, they’re also the kinds of problems that engineering minds love, that you actually want to pay those competitive wages for, and that ultimately lead to innovation.
Metrics Moving Forward
If MTTR isn’t going to accurately portray the success of an operations team, then what is the metric to watch in an AIOps-enabled future? Size of the problem resolved? Complexity index? A clever ratio between problem severity and time to fix? Or have we truly entered an era where the “if you can’t measure it, you can’t manage it” axiom no longer fits?
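As one illustration of what a severity-to-time ratio might look like, here is a rough sketch. The `Incident` type, the 1-to-5 severity scale, and the `severity_per_hour` function are all hypothetical names invented for this example; it is not an established industry metric.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int        # hypothetical scale: 1 (minor) to 5 (critical)
    repair_hours: float  # time spent restoring service


def severity_per_hour(incidents):
    """Total severity resolved divided by total repair hours.

    Returns 0.0 when no repair time was logged, to avoid dividing by zero.
    """
    total_severity = sum(i.severity for i in incidents)
    total_hours = sum(i.repair_hours for i in incidents)
    if total_hours == 0:
        return 0.0
    return total_severity / total_hours


# Two hard incidents that took a while to fix (invented values).
quarter = [Incident(severity=5, repair_hours=4.0),
           Incident(severity=4, repair_hours=2.0)]
print(severity_per_hour(quarter))
```

A ratio like this rewards a team for resolving serious problems rather than penalizing it for the time those problems take, though how severity gets scored would need careful thought before anyone relied on it.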
Maybe this is your team’s next puzzle to solve. No matter how you frame the parameters of progress in this next era, the beauty is that we all win: fewer small and mundane problems, more interesting big ones and greater overall efficiency. Those are the numbers that count.