Fast, Focused Incident Response: Reduce System Noise by 98%
Today’s organizations are stuck in a bind. Overwhelmingly, they want to embrace digital transformation to work more efficiently and deliver the experiences customers and employees crave. But the IT complexity this kind of project ushers in can stretch technical teams to the limit, leaving them exhausted and despondent.
This is where AIOps and automation can generate some big wins. However, it’s not always easy to know where and how value can be delivered, or which tools should be deployed.
Noise reduction is one area where digital Ops teams can start gaining some quick wins. Applying machine learning capabilities effectively to correlate alerts can help to suppress noise and dramatically enhance the ability of responders to get the job done quickly and efficiently.
A Noisy World
The basic goal of AIOps is to help developers and engineers easily discover and quickly resolve issues to minimize IT downtime. But they can’t do so when overwhelmed by a flood of alerts. The bottom line is that incident responders are drowning in information. Research shows that 69% of DevOps and ITOps teams are struggling with alert noise on a daily basis.
To help, organizations can turn to several tools and techniques. Fairly well understood today is deduplication (“dedup”), which works on services with API integrations. It allows users to easily group multiple events triggered by the same issue, using a dedup key.
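As a rough illustration of the dedup idea, the sketch below folds incoming events that share a key into one open alert. The class names, fields and sample keys are assumptions made for this example and do not reflect any particular vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    dedup_key: str
    summary: str
    event_count: int = 1


@dataclass
class Deduplicator:
    """Folds events that share a dedup key into a single open alert."""
    open_alerts: dict = field(default_factory=dict)

    def ingest(self, dedup_key: str, summary: str) -> Alert:
        alert = self.open_alerts.get(dedup_key)
        if alert is None:
            # First event for this key: open a new alert.
            alert = Alert(dedup_key=dedup_key, summary=summary)
            self.open_alerts[dedup_key] = alert
        else:
            # Repeat event for a known key: count it, but open nothing new.
            alert.event_count += 1
        return alert


dedup = Deduplicator()
for key, text in [
    ("db-01/disk", "Disk usage above 90% on db-01"),
    ("db-01/disk", "Disk usage above 92% on db-01"),
    ("web-03/cpu", "CPU saturation on web-03"),
]:
    dedup.ingest(key, text)
```

Three events arrive here, but only two alerts are opened, because the two disk events carry the same key.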
Then there’s suppression: rules applied at the front of the pipeline that suppress any nonactionable events. Service routing is another useful tool, ensuring that incoming events are actionable and mapped to services, each of which represents a specific area or application.
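Suppression and routing can be pictured as two small steps at the front of the pipeline. The following is a minimal sketch only: the rule predicates, the event fields (severity, environment, source) and the service names are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical suppression rules: each returns True for a nonactionable event.
SUPPRESSION_RULES: list[Callable[[dict], bool]] = [
    lambda e: e.get("severity") == "info",
    lambda e: e.get("environment") == "staging",
]

# Hypothetical routing table: event source prefix -> owning service.
ROUTES = {
    "checkout-": "Checkout Service",
    "login-": "Login Service",
}

DEFAULT_SERVICE = "Unrouted Events"


def route(event: dict) -> Optional[str]:
    """Return the owning service, or None if the event is suppressed."""
    if any(rule(event) for rule in SUPPRESSION_RULES):
        return None
    source = event.get("source", "")
    for prefix, service in ROUTES.items():
        if source.startswith(prefix):
            return service
    # Actionable, but no route matched: send to a catch-all service.
    return DEFAULT_SERVICE
```

Keeping the rules as plain predicates makes it easy to add or retire one without touching the routing logic.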
Perhaps the least understood area of noise suppression is the use of machine learning and heuristics to group multiple alerts into one incident. Taken together, these capabilities could reduce system noise by as much as 98%. Let’s take a closer look at how it works.
Detecting and Pausing Transient Alerts
Transient alerts are frustrating. Responders are often forced to switch what they’re doing to undertake a review, only to find the alert soon auto-resolves via an integration. They may even have woken up in the middle of the night to take a look. Yet historical data can be a good predictor of transient alerts.
In line with this assumption, we designed a prediction model for transient alerts. It began with definitions and discovery — deciding what transient alerts are, and then creating a labeled data set with historical data to train and validate the model. Any alerts resolved via integration were assumed not to have required human action.
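The labeling step can be sketched as follows, assuming each historical alert records how it was resolved. The field names and the per-service rate are illustrative assumptions. The actual features and model are not described here, so this shows only how a labeled data set might be derived from the stated assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalAlert:
    service: str
    summary: str
    resolved_by: str  # assumed values: "integration" or "human"


def is_transient(alert: HistoricalAlert) -> bool:
    # Labeling assumption from the text: alerts resolved via an
    # integration did not require human action.
    return alert.resolved_by == "integration"


def transient_rate_by_service(history: list) -> dict:
    """Share of each service's past alerts that were labeled transient."""
    totals = defaultdict(int)
    transient = defaultdict(int)
    for alert in history:
        totals[alert.service] += 1
        if is_transient(alert):
            transient[alert.service] += 1
    return {service: transient[service] / totals[service] for service in totals}
```

A per-service rate like this is only a baseline. It does show why historical data is a useful signal: a service whose alerts almost always auto-resolve is a strong candidate for pausing.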
Next came phase two: testing the prediction model offline and online. This led to the development of two candidates, a prediction model and a real-time rolling-count algorithm, which were run in A/B tests during the early-access program.
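For readers unfamiliar with the term, a real-time rolling count can be approximated as below. This is a generic sketch, not the algorithm used in the tests: the window length, the threshold and the idea of keying on an alert identifier are all assumptions.

```python
from collections import defaultdict, deque


class RollingCounter:
    """Flags an alert key as likely transient once it has auto-resolved
    at least `threshold` times within the last `window_seconds`."""

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._auto_resolves = defaultdict(deque)

    def _evict(self, key: str, now: float) -> None:
        # Drop timestamps that have fallen out of the rolling window.
        times = self._auto_resolves[key]
        while times and now - times[0] > self.window_seconds:
            times.popleft()

    def record_auto_resolve(self, key: str, timestamp: float) -> None:
        self._auto_resolves[key].append(timestamp)
        self._evict(key, timestamp)

    def should_pause(self, key: str, now: float) -> bool:
        self._evict(key, now)
        return len(self._auto_resolves[key]) >= self.threshold
```

Unlike a trained model, a counter like this reacts only to very recent behavior, which may be one reason a model built on longer history can be more accurate.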
Based on performance and accuracy, we chose a winner: the prediction model. It significantly outperformed the real-time rolling counts, recording a higher accuracy for 66% of services. This solution can help users to automatically eliminate unnecessary noise from flapping alerts. But it’s not the only way machine learning can help under-pressure incident responders.
Intelligently Grouping Alerts
Noise from duplicate or very similar alerts is arguably even more common than the issue of transient alerts. It means responders are pinged over and over for what is essentially the same issue. But it can be mitigated with capabilities that use machine learning to look for text similarities in incoming alert summaries.
Alerts with similar summaries are then clustered into the same incident. Additionally, user feedback on grouping errors can be ingested and learned from to improve grouping in the future.
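To make text-based grouping concrete, here is a deliberately simple stand-in that uses word overlap (Jaccard similarity) rather than a trained model. The 0.5 threshold and the greedy strategy of comparing against the first summary in each group are assumptions chosen for illustration.

```python
def tokens(summary: str) -> set:
    return set(summary.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two summaries, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def group_alerts(summaries: list, threshold: float = 0.5) -> list:
    """Greedily place each summary in the first group whose first
    member is similar enough, otherwise start a new group."""
    groups = []
    for summary in summaries:
        for group in groups:
            if jaccard(summary, group[0]) >= threshold:
                group.append(summary)
                break
        else:
            groups.append([summary])
    return groups


groups = group_alerts([
    "High latency on checkout api",
    "High latency on checkout api us-east",
    "Disk full on login db",
])
```

The two latency summaries share five of six distinct words and land in one group, while the disk alert starts its own.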
Duplicate alert noise can also be reduced by analyzing when alerts arrive. Machine learning is used to assess the optimal cutoff point, after which no more alerts can be added to a particular group. Again, this is based on historical data, which shows how far apart in time alerts tend to arrive for particular services.
Of course, such settings can also be applied manually, and in some cases, responders will have good insight into what works best. But the power of intelligent algorithms is to spot the data patterns that human eyes usually miss, helping to optimize things like alert compression.
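One simple way to derive such a cutoff from history is to take a high percentile of how long after an incident's first alert its later alerts arrived. The sketch below assumes that data is available as a list of offsets in seconds. The 90th percentile and the nearest-rank method are illustrative choices, not the production approach.

```python
import math


def learn_cutoff(historical_offsets: list, percentile: int = 90) -> float:
    """Nearest-rank percentile of seconds between an incident's first
    alert and each of its later alerts."""
    if not historical_offsets:
        raise ValueError("need at least one historical offset")
    ordered = sorted(historical_offsets)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def group_by_time(timestamps: list, cutoff: float) -> list:
    """Close a group once an alert arrives more than `cutoff` seconds
    after the first alert in that group."""
    groups = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][0] <= cutoff:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups
```

Choosing a percentile rather than the maximum keeps a single outlier, such as an alert that arrived 15 minutes late, from stretching the window for every group.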
This is how PagerDuty’s Intelligent Alert Grouping solution works. But organizations can supercharge their use of such tools further, with a few simple steps.
Because such tools work partly by analyzing text similarity, organizations should try to be as consistent as possible when naming service resources and entities. For example, a resource named “login database” in one alert and “login db” in another may not immediately be recognized as the same thing, which will decrease long-term accuracy. Human-readable names for service resources and entities can also help improve grouping accuracy.
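Where naming cannot be fixed at the source, one possible stopgap is to normalize known aliases before alerts are compared. The alias table below is entirely hypothetical, and this is a sketch of the general idea rather than a feature of any specific product.

```python
# Hypothetical alias table mapping shorthand to one canonical name.
ALIASES = {
    "db": "database",
    "svc": "service",
    "prod": "production",
}


def normalize(summary: str) -> str:
    """Lowercase a summary and replace known aliases word by word."""
    words = summary.lower().split()
    return " ".join(ALIASES.get(word, word) for word in words)
```

With this in place, "Login DB unreachable" and "login database unreachable" normalize to the same string, so a text-similarity comparison treats them as identical.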
These suggestions are not an exhaustive list of AIOps capabilities in alert noise reduction, but they do hopefully illustrate the kinds of wins incident response teams can generate. It ultimately boils down to more productive, effective responders, fewer distractions and a better customer experience.
By letting engineers and developers spend less time firefighting and more on innovating, AIOps can empower them to drive bigger strategic gains for their organizations.
In a digital era characterized by fierce competition, that is a compelling reason to take a look.