
How to Build Past Basic Automated Incident Response

Rather than flagging incidents to whoever’s on call, AIR solutions should link to the right people at the right time.
Apr 4th, 2022 11:09am by Michael Cucchi
Featured image via Pixabay.

Over the past 18 months, two important trends have emerged in digital operations. First, as organizations doubled down on digital during the pandemic, the volume of incidents teams have to resolve has exploded sevenfold.

Michael Cucchi
Michael is the vice president of product at PagerDuty. He has over 20 years of engineering, product management and marketing experience in the high-tech and software industries. Michael creates and drives PagerDuty's overall product and ecosystem positioning, product strategy, community advocacy and competitive intelligence.

Second, customer expectations have kept rising over the same period, driven by demand for near-instant response and a new dependency on digital ways of doing business.

This double whammy has made manual incident response processes unmanageable for digital operations teams. A new lens on the technologies used to meet this challenge has been dubbed “automated incident response” (AIR) by industry analysts.

Unfortunately, when it comes to AIR, current thinking is too narrow.

Automating the human processes that define incident response is one thing. But to drive much greater value, solutions need to connect these processes with machine learning and automation in real time to deliver operational maturity, continuous improvement and superb customer experiences.

The Push for Automation

Cloud migration and the adoption of containers and microservices deliver the agility, scale and speed that development teams crave to drive business strategies. But they also mean more change and more complex service dependencies, which in turn cause an exponential increase in the volume of alerts and incidents and in the difficulty of resolving them.

Legacy processes are a block on this kind of innovation. In fact, most (91%) organizations agree that traditional IT operations functions were not built for the digital era.

Gartner articulates the incident response challenges this presents in its July 2021 “Hype Cycle for Monitoring, Observability & Cloud Operations” report. First, responders often spend too long trying to identify and contact subject matter experts (SMEs) because operations teams use different methods of managing the on-call roster. Beyond this, contact information is often inaccurate. The distributed nature of teams, complex on-call schedules and different notification preferences make rapid triaging even more challenging. Often, a single source of incident data is also lacking. That’s a recipe for long incident response times, poor outcomes and angry customers.

In fact, in the 451 Research report “Practitioners Weigh In: Tips for Modernizing Incident Response,” 75% of organizations agree they spend too much time on IT operations and maintenance. Process and task automation is increasingly viewed as the answer to many of these problems. That same study finds that a third of organizations believe their IT is “mostly automated” and another fifth want to achieve full automation in the near future.

What the Analysts Say

So what should automation in incident response look like? Gartner, in its aforementioned Hype Cycle report, rightly points to manual processes and poor collaboration between teams as the main roadblocks to improvement. AIR addresses this by automating most of these incident response steps. In Gartner’s words:

“AIR solutions automate incident response processes by enabling centralized alert or incident routing. Using a policy or rule-based engine, on-call scheduler, or streamlined collaboration, this can improve operational efficiencies with action-oriented insights.”

This is certainly an important element of AIR, but it must go further than just managing the human-to-human process. Machine learning-powered capabilities exist today that offer much more by first reducing noise and false alarms, and then automatically notifying not just responders on call, but also the specific SME who is best placed to remediate. Combined with task and process automation, escalations can be completely avoided and hours cut out of resolution. Today, event orchestration can even make decisions automatically in real time to accelerate or automate the whole remediation process without needing a human.
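To make the idea concrete, here is a minimal sketch of the first two steps: cutting noise, then notifying the best-placed expert rather than defaulting to whoever is on call. It is an illustration under stated assumptions, not any vendor’s implementation. A simple deduplication rule stands in for the machine learning described above, and the `Alert` type, the `SERVICE_EXPERTS` ownership map and the `ON_CALL` fallback are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    service: str   # service that raised the alert
    check: str     # name of the failing check
    message: str   # free-text detail


# Hypothetical ownership map: which SME knows each service best.
SERVICE_EXPERTS: Dict[str, str] = {
    "payments-db": "dana",
    "checkout-api": "lee",
}
# Hypothetical fallback when no expert is registered for a service.
ON_CALL = "on-call-primary"


def dedupe(alerts: List[Alert]) -> List[Tuple[Alert, int]]:
    """Collapse alerts sharing (service, check) into one entry.

    Keeps the first alert of each group and counts how many were folded in.
    This rule is a stand-in for ML-based noise reduction.
    """
    grouped: Dict[Tuple[str, str], Tuple[Alert, int]] = {}
    for alert in alerts:
        key = (alert.service, alert.check)
        if key in grouped:
            first, count = grouped[key]
            grouped[key] = (first, count + 1)
        else:
            grouped[key] = (alert, 1)
    return list(grouped.values())


def route(alert: Alert) -> str:
    """Pick the registered expert for the service, else the on-call responder."""
    return SERVICE_EXPERTS.get(alert.service, ON_CALL)


def triage(alerts: List[Alert]) -> List[Tuple[str, Alert, int]]:
    """Return (recipient, representative alert, repeat count) per incident."""
    return [(route(alert), alert, count) for alert, count in dedupe(alerts)]
```

Calling `triage` on a burst of repeated alerts from one service yields a single notification addressed to that service’s expert, while alerts from services with no registered expert still reach the on-call responder.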

Digital Operations under Pressure

Why is real time important? It comes back to those two overarching trends: pressure on digital services and heightened customer expectations. Research shows that time spent, and wasted, on inefficient incident response can have a serious impact. In 2021, 40% of organizations said they had lost revenue because of incidents, 25% had lost customers to rivals and, on average, they spent $3.4 million in staff time firefighting.

It’s not just the immediate impact of slow incident response that can put customers off. If teams are tied up resolving incidents, they have less time for innovation that could differentiate the brand in an increasingly competitive environment. Some 89% of millennials expect brands to use technology to shape their customer experiences, no matter what kind of business it is. And 60% of American consumers believe online experiences will become more important than in-person ones. One could argue they already have.

The only way to innovate at pace is to solve incidents rapidly and efficiently or avoid them altogether. That means optimizing automation with real-time operations that solve problems automatically, and if needed, mobilize response teams in seconds, drive collaboration and give deep context on digital incidents. Two-thirds of IT and development decision-makers agree that only with real-time digital operations can they reduce the cost of ITOps and accelerate innovation.

Right Expert, Right Time

However, it’s not just about speed. It’s also about joining up human processes with machine automation and adding the intelligence to proactively drive optimal outcomes. As 451 Research explains, when you do need a human, “automatically identifying the correct responders, attaching the appropriate automation … and sending status updates can streamline the incident response process and drive major time savings.”

Rather than automatically flagging incidents to whoever’s on call, as Gartner suggests, AIR solutions should be linking their incident monitoring service with the right people at the right time. Where necessary, they will automatically and proactively connect relevant stakeholders together via an operations hub or cloud. By the time a human is interrupted, automated workflows will already have initiated diagnostic and remediation steps at the first-responder level, so in many cases SMEs don’t even need to get involved.
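The “automation first, humans second” flow described above can be sketched roughly as follows. This is a simplified illustration rather than a real orchestration product: the incident dictionary, the `runbooks` mapping and the `page` callback are hypothetical, and each runbook step is assumed to return `True` when the service is healthy again.

```python
from typing import Callable, Dict, List

Incident = Dict[str, str]
# A runbook step takes the incident and reports whether the service is healthy afterward.
Step = Callable[[Incident], bool]
# The pager receives the incident plus a log of what automation already tried.
Pager = Callable[[Incident, List[str]], None]


def handle_incident(
    incident: Incident,
    runbooks: Dict[str, List[Step]],
    page: Pager,
) -> List[str]:
    """Run automated steps for the incident type; page a human only if none resolves it."""
    log: List[str] = []
    for step in runbooks.get(incident["type"], []):
        healthy = step(incident)
        outcome = "resolved" if healthy else "not resolved"
        log.append(f"{step.__name__}: {outcome}")
        if healthy:
            # Automation fixed it, so nobody is interrupted.
            return log
    # Nothing worked (or no runbook exists): escalate with full context attached.
    page(incident, log)
    return log
```

Passing the log to the pager is the design choice that matters here: when a person is finally interrupted, they can see which diagnostic and remediation steps have already run instead of starting from zero.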

This isn’t just about delivering an exceptional customer experience and minimizing operational overheads, as important as these outcomes are. It’s about freeing up the time of in-house experts to work on innovation projects crucial to future growth. In so doing, organizations will create a working environment in which the brightest and best want to stay and do exceptional work for them. In a new era of intense competition for coding expertise, that in itself will be a major win.
