
AI-Powered Automation Is Critical to IT Resilience and Adaptability

Organizations able to harness AI, ML and automation will unleash the talent on their incident response teams while improving IT resilience and adaptiveness.
Mar 28th, 2022 7:45am by

Heath Newburn
Heath is the senior solutions specialist for AIOps at PagerDuty. He has a long background in monitoring, event management and operations in many organizations and is focused on enabling the personal success of individuals and teams across IT. Heath lives in Georgetown, Tex., and is passionate about cooking and finding great Texas barbecue.

The modern world runs on code, and with every company now a software company, it’s become more important than ever to move quickly when things go wrong. That’s why incident response has become such a critical endeavor for organizations.

Unfortunately, traditional manual approaches are riddled with inefficiency. This leads to excessive mean time to repair (MTTR), which damages not only customer loyalty and the bottom line, but also employee morale.

Fortunately, leaning on automation and machine learning (ML) capabilities can help organizations plot a better path. Teams are looking to reduce repetitive work and human error, optimize responder productivity and drive all-around better outcomes as they adopt automated incident response.

In order to take advantage of this trend and build a culture of resilience, teams must look for opportunities to improve and upgrade manual operational processes with technology that can remove toil, save human cycles and give them an edge.

How Manual Processes Affect Resilience

Many organizations have accelerated their digital transformation plans, in some cases by several years. But we’ve learned that running fast can break things, and it’s not uncommon for greater velocity to also introduce more exposure to operational risk.

The infrastructure supporting new digital services could contain hundreds of millions of lines of code and billions of dependencies, so digital incidents are inevitable. Research shows that there was a 19% rise in critical incidents from 2019 to 2020.

To keep up with the pace of innovation required to maintain high availability and deliver on customer experience, organizations need to invest in best practices and develop robust processes that streamline incident response, so that issues are addressed proactively and resolved quickly when they arise.

Infrastructure and operations teams won’t magically attain the adaptive resilience Gartner talks about while relying on today’s manual, reactive incident response.

Looking for Opportunities to Harness Automation in Incident Response

In many organizations, knowledge of the tools, scripts and manual commands that responders use to get to the bottom of incidents exists only in the heads of a few subject matter experts (SMEs), and running them often requires manual intervention. This does not make for rapid or effective incident response. All too often, organizations waste precious resources by swarming the problem with dozens of responders, which won’t fix the underlying issue.

Manual processes can also lead to copy-and-paste errors, unnecessary repetition of steps, limited collaboration between technical and customer support teams, and use of incorrect documentation. The result is slower MTTR, angry customers and frustrated employees.

Instead, organizations should automate as much of their incident response as possible. Doing so drives resilience and enhances their ability to learn from events and improve continuously.

Machine learning-powered runbook automation is a great example. At a very basic level, incident response is all about completing repetitive tasks, such as restarting servers, copying artifacts, running scripts and manipulating files. Once these processes are captured and documented in runbooks, they can be executed automatically by responders other than SMEs.
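To make the idea concrete, here is a minimal sketch of a runbook expressed as code, so that any responder can run the same steps in the same order. The Runbook class and the two step functions are hypothetical and only return strings; in practice each step would wrap a real command, such as a service restart.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Runbook:
    """An ordered list of named steps that any responder can execute."""
    name: str
    steps: List[Tuple[str, Callable[[], str]]] = field(default_factory=list)

    def add_step(self, description: str, action: Callable[[], str]) -> None:
        self.steps.append((description, action))

    def run(self) -> List[Tuple[str, bool, str]]:
        """Run steps in order; stop at the first failure and record it."""
        results = []
        for description, action in self.steps:
            try:
                results.append((description, True, action()))
            except Exception as exc:
                results.append((description, False, str(exc)))
                break
        return results


# Hypothetical steps: placeholders for real restart or copy commands.
def check_service() -> str:
    return "service responded"


def restart_worker() -> str:
    return "worker restarted"


runbook = Runbook("restart-stuck-worker")
runbook.add_step("Check service health", check_service)
runbook.add_step("Restart worker", restart_worker)

for description, ok, detail in runbook.run():
    print(f"{'OK' if ok else 'FAILED'}: {description} ({detail})")
```

Stopping at the first failed step is a design choice in this sketch: it leaves the responder with a clear record of what ran and where the sequence broke, which is useful context if the incident has to be escalated.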

Democratizing incident response in this way could have a major impact on MTTR. First responders spend an average of 15 minutes triaging an alert when it first comes in before escalating to an SME, who spends another 15 minutes running diagnostics. But by running automated workflows from the outset, first responders could collect that information straight away and potentially fix recurring problems using automated remediation. If that doesn’t resolve the problem, they can escalate to the SME with the information needed to start working on a fix immediately.

In the most mature organizations, automation and artificial intelligence (AI) can be used to remediate commonly occurring incidents before responders are even paged. In this scenario, escalations to SMEs and developers only occur for unusual and complex problems.

Step by Step

This is not an overnight journey. Yes, the right tools will go a long way to achieving these goals, but organizations might also have to overcome cultural barriers, which can take longer. The key is to start small with achievable goals, learning as you go. Organizations need to walk before they can run.

That could mean starting with simple, low-risk automated diagnostics that have no impact on service performance or availability, and which require little processing. With automation that runs commands, gathers log information and tackles other common troubleshooting steps, teams can reduce MTTR and potentially avoid mobilizing some responders if nothing out of the ordinary is discovered.
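As an illustration, a low-risk diagnostic of this kind might look like the sketch below. It only reads state (host name, disk usage and the tail of a log file) and changes nothing. The function name collect_diagnostics and the log path app.log are assumptions made for the example, not part of any particular product.

```python
import platform
import shutil
from collections import deque
from pathlib import Path


def collect_diagnostics(log_path: str, disk_path: str = ".", tail_lines: int = 20) -> dict:
    """Gather read-only facts a responder would otherwise collect by hand."""
    usage = shutil.disk_usage(disk_path)

    tail = []
    log_file = Path(log_path)
    if log_file.is_file():
        with log_file.open("r", errors="replace") as handle:
            # deque with maxlen keeps only the last tail_lines lines.
            tail = [line.rstrip("\n") for line in deque(handle, maxlen=tail_lines)]

    return {
        "host": platform.node(),
        "disk_used_percent": round(100 * usage.used / usage.total, 1),
        "log_tail": tail,
    }


# Hypothetical log path; a missing file simply yields an empty log_tail.
report = collect_diagnostics("app.log")
print(report["host"], report["disk_used_percent"], len(report["log_tail"]))
```

The resulting dictionary could be attached to the incident, so that whoever picks it up starts with the basic facts already in hand.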

From there, organizations could move to reflex actions for the most common problems (for example, removing temp files to clear up disk space). Once those simpler problem signatures are codified, they can move to automating multistep sequences for remediating common problems. Complex actions with a potentially major impact on performance or availability should be automated only after successfully working through those earlier stages.
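The temp-file example could be sketched roughly as follows. The function names, the 90% threshold and the 24-hour age limit are all assumptions chosen for illustration, and the sketch defaults to a dry run that only reports what it would delete.

```python
import shutil
import time
from pathlib import Path


def disk_used_percent(path: str) -> float:
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def remove_stale_files(temp_dir: str, max_age_seconds: float, dry_run: bool = True) -> list:
    """Delete (or, in a dry run, just list) regular files older than the cutoff."""
    cutoff = time.time() - max_age_seconds
    removed = []
    for entry in list(Path(temp_dir).iterdir()):
        try:
            stale = (
                entry.is_file()
                and not entry.is_symlink()
                and entry.stat().st_mtime < cutoff
            )
            if stale and not dry_run:
                entry.unlink()
        except OSError:
            continue  # file vanished or is not accessible; skip it
        if stale:
            removed.append(entry.name)
    return sorted(removed)


def free_disk_reflex(temp_dir: str, threshold_percent: float = 90.0,
                     max_age_seconds: float = 24 * 3600, dry_run: bool = True) -> list:
    """Reflex action: clean up only when disk usage crosses the threshold."""
    if disk_used_percent(temp_dir) < threshold_percent:
        return []
    return remove_stale_files(temp_dir, max_age_seconds, dry_run)
```

Starting with dry_run=True fits the walk-before-you-run approach: a team can review what the reflex would have removed over a few incidents before allowing it to delete anything.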

The bottom line is that machines are faster than humans at some tasks, and they don’t mind work that is boring and repetitive. Organizations able to use this to their advantage through AI, ML and automation will unleash the talent on their incident response teams while improving IT resilience and adaptiveness. That’s the way not only to happier customers and a burnished brand reputation, but also to more motivated staff with more time to spend on innovation. And in the post-pandemic digital world, innovation will be key to survival.
