
Tale of 2 Responders: How Automation Can Save Time and Toil

Platforms with intelligent, automated and centralized event orchestration and noise suppression can greatly reduce time and effort in incident response.
Feb 11th, 2022 6:55am by
Sean Scott
As chief product officer of PagerDuty, Sean is responsible for its multiproduct digital operations management platform. He has more than 20 years of experience in the technology industry, with the majority of that time at Amazon. Sean holds a bachelor’s degree in computer science and an M.B.A. from the Red McCombs School of Business, both from the University of Texas at Austin.

Running an effective digital operations team has never been more critical to long-term business success. Some 96% of customers say they’ll leave a brand after a bad experience.

Yet it’s also becoming harder than ever to stay on top of spiraling incidents and provide the service that customers expect.

It’s not just brand reputation and the bottom line that are at risk. When incident response is characterized by too many manual processes, interruptions and escalations, team morale suffers, and burnout rates can surge.

This is where AI-powered automation can provide tremendous value to digital operations teams and first responders, intelligently reducing noise and driving more efficient event routing.

The Pressure Is On

Our research shows that nearly three-quarters (72%) of large enterprises are doing more with digital today. But a similar number (78%) are facing extra pressure due to mounting incidents. Nine out of 10 senior IT and development leaders who responded to our survey admit that current ITOps approaches just aren’t cutting it anymore. Teams are spending nearly half of their time each week dealing with incidents rather than innovating for future growth, amounting to a financial hit of over $3 million per company per year. A quarter have lost customers to rival services as a result, and many admit losing money because of incidents.

Mean time to resolve (MTTR) and mean time between incidents (MTBI) have never been more important. Yet siloed incident response systems and an overreliance on manual processes are making life much harder than it should be for digital ops teams. One 2020 study revealed that for 50% of development teams, workloads have increased as a result of disparate event data coming from multiple monitoring tools. Alert noise is another constant in many organizations, distracting and disrupting responders who could be spending their time more productively.

The result in many cases is an increased likelihood of employee burnout. Research indicates that the average team spends 17 hours per week dealing with incidents alone. That can add up to weeks of extra work in the average year.

In the Eye of the Storm

Incident responders are at ground zero when events come in. Let’s imagine two responders logging on at 7 a.m. to start their day. They are about to find out that a core dependency failure just started affecting the entire business, triggering incidents and alerts across multiple tools and siloed systems. Confusion is rife among their colleagues, but the clock is ticking. With a global customer base, the company knows that every second lost could have a significant financial and reputational impact.

In short, it’s time to get moving to find out what’s going on and fix it. Here’s how that journey might pan out with and without the right tool sets.

Filtering Out the Noise

Production systems generate a huge number of events. Not all of these are flagged as alerts that indicate that something has gone wrong. But when a major incident such as a core dependency failure hits, there could be hundreds or thousands of alerts triggered by various monitoring and/or event processing systems. Without noise-suppression capabilities, Responder A is bombarded with signals, many of which may be irrelevant or duplicate alerts for the same event.

Now consider Responder B, who works at an identical company and has also logged on at 7 a.m. to find a major dependency failure has occurred. The difference is that they have a range of tools in place to filter out anything irrelevant, nonessential or duplicated. This could include a function to automatically add incoming alerts to relevant open services and group them according to a specific time window. Responder B’s organization may have gone further still with machine learning-powered algorithms capable of looking at patterns in alerts and grouping them accordingly.
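To make the time-window idea concrete, here is a minimal Python sketch of grouping alerts by service within a fixed window and dropping duplicates. It is an illustration only, not PagerDuty's implementation: the `Alert` and `Incident` classes, the `group_alerts` function, the five-minute window and the service names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    service: str
    summary: str
    received_at: datetime


@dataclass
class Incident:
    service: str
    opened_at: datetime
    alerts: list = field(default_factory=list)


def group_alerts(alerts, window=timedelta(minutes=5)):
    """Fold each alert into the latest incident for its service if it arrives
    within `window` of that incident opening; otherwise open a new incident."""
    incidents = []
    latest = {}  # service -> most recent incident opened for that service
    for alert in sorted(alerts, key=lambda a: a.received_at):
        incident = latest.get(alert.service)
        if incident is None or alert.received_at - incident.opened_at > window:
            incident = Incident(alert.service, alert.received_at)
            incidents.append(incident)
            latest[alert.service] = incident
        # Skip duplicates: an alert with the same summary is already attached.
        if all(a.summary != alert.summary for a in incident.alerts):
            incident.alerts.append(alert)
    return incidents


# Hypothetical alert stream from the 7 a.m. dependency failure.
start = datetime(2022, 2, 11, 7, 0)
alerts = [
    Alert("checkout", "HTTP 500 rate high", start),
    Alert("checkout", "HTTP 500 rate high", start + timedelta(seconds=30)),
    Alert("checkout", "latency p99 high", start + timedelta(minutes=2)),
    Alert("search", "latency p99 high", start + timedelta(minutes=3)),
    Alert("checkout", "HTTP 500 rate high", start + timedelta(minutes=20)),
]
incidents = group_alerts(alerts)
```

In this sketch, five raw alerts collapse into three incidents, so Responder B would be interrupted three times rather than five. A machine learning-based grouper would replace the fixed window and exact-match check with learned similarity, which is beyond what this example attempts.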

Or they may have the ability to manually “pause” flapping incident notifications for a predefined amount of time while they work on the problem. Such capabilities can also be automated by intelligent algorithms, helping to overcome the challenge responders face of interruptions for non-urgent incidents.
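The pause behavior can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `NotificationPauser` class, the `is_flapping` helper, the ten-minute pause and the threshold of three triggers in 15 minutes are hypothetical choices for the example, not a description of any vendor's feature.

```python
from datetime import datetime, timedelta


class NotificationPauser:
    """Suppresses notifications for an alert key until its pause expires."""

    def __init__(self):
        self._paused_until = {}  # alert key -> time at which the pause ends

    def pause(self, key, now, duration=timedelta(minutes=10)):
        self._paused_until[key] = now + duration

    def should_notify(self, key, now):
        until = self._paused_until.get(key)
        if until is not None and now < until:
            return False  # still paused, so hold the notification
        self._paused_until.pop(key, None)  # pause expired or was never set
        return True


def is_flapping(trigger_times, now, window=timedelta(minutes=15), threshold=3):
    """True if the alert fired at least `threshold` times within `window`."""
    recent = [t for t in trigger_times if timedelta(0) <= now - t <= window]
    return len(recent) >= threshold


# Hypothetical usage: auto-pause an alert that keeps firing and resolving.
now = datetime(2022, 2, 11, 7, 30)
history = [now - timedelta(minutes=m) for m in (12, 7, 1)]
pauser = NotificationPauser()
if is_flapping(history, now):
    pauser.pause("disk-latency", now)
```

The manual case is a direct call to `pause`; the automated case wraps that call in a detection rule, here a simple trigger count standing in for the "intelligent algorithms" the article mentions.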

The bottom line is that noise-suppression technology enables Responder B to work quickly and efficiently, removing the extraneous to focus on what’s important. While they’ve found and remediated the failure and returned to a high-value development project, Responder A is still toiling away under the weight of alerts several hours later.

Event Orchestration from a Single Location

Responder A is also forced to use a variety of manual monitoring tools. The management overhead for these is high. Rule configurations must be maintained, increasing the effort needed to process event data, and that data also needs to be aggregated and orchestrated across multiple siloed systems. During the morning of our critical system failure, they’re forced to waste valuable time manually running health checks and inspecting CPUs, memory caches and other possible root causes. Then they’ll need to take further remedial action to fix the failure or escalate to someone who can.

However, Responder B has a unified platform to handle all event data and optimize how events are processed. By having already added business logic and contextual rules to process all incoming events, they can trigger automated routing of events to the right teams based on event conditions at scale. And they can automatically trigger diagnostic and remediation actions, such as a server restart or clearing memory caches via runbook automation. These teams can handle most of the commodity, repetitive incidents that occur, only getting developer or engineer subject matter experts (SMEs) involved when escalations are absolutely necessary.
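As a rough illustration of condition-based routing with automated actions, consider the Python sketch below. Everything in it is an assumption made for the example: the `Rule` structure, the event fields (`status`, `metric`, `value`, `host`, `service`), the team names and the placeholder runbook functions, which only return a message instead of touching real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # condition evaluated on the event payload
    route_to: str                    # team that should own matching events
    action: Optional[Callable[[dict], str]] = None  # optional automated step


def restart_server(event):
    # Placeholder: a real runbook step would call out to automation tooling.
    return f"restart requested for {event['host']}"


def clear_memory_cache(event):
    # Placeholder for a cache-clearing runbook step.
    return f"cache clear requested for {event['host']}"


RULES = [
    Rule("host down", lambda e: e.get("status") == "down",
         "infrastructure", restart_server),
    Rule("memory pressure",
         lambda e: e.get("metric") == "memory" and e.get("value", 0) > 90,
         "platform", clear_memory_cache),
    Rule("payment errors", lambda e: e.get("service") == "payments",
         "payments-oncall"),
]


def orchestrate(event, rules=RULES, default_team="triage"):
    """Return (team, action_result) for the first rule matching the event."""
    for rule in rules:
        if rule.matches(event):
            result = rule.action(event) if rule.action else None
            return rule.route_to, result
    return default_team, None


# Hypothetical event from a monitoring tool.
team, outcome = orchestrate({"status": "down", "host": "web-1"})
```

First-match-wins ordering keeps the behavior easy to reason about: specific rules go first, and anything unmatched falls through to a default triage team so that no event is silently dropped.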

Fast forward several months and a demotivated, burnt-out Responder A has left the company, or worse, the industry, while their employer continues to bleed customers following lengthy service outages. However, with AIOps solutions in the form of an intelligent, automated and centralized event orchestration and noise-suppression platform, Responder B is able to focus more of their time on the projects they care about. Platforms such as PagerDuty can help organizations achieve 44% fewer incidents through noise reduction and event orchestration, freeing up much-needed time and enabling teams to focus more on innovative new products to drive competitive advantage for the business.
