How to Fight Toil with AI-Led, Automated Incident Response
Toil — the repetitive, manual work that bogs down teams and hampers productivity — is a major obstacle in organizations’ quest to optimize costs and streamline operations.
It’s particularly pervasive in incident response, reducing an organization’s ability to effectively respond to and resolve major incidents. This slowdown can degrade brand reputation and customer trust. If your mission is to grow and protect revenue for your business, this is a huge obstacle.
And the situation is getting worse. The volume of change and data hitting teams has grown to a point where teams can no longer manually process it all. There is an enormous number of signals coming in from across your infrastructure. Add to that all the manual steps involved in coordinating response across distributed teams. The result? The processes teams follow to handle unplanned work have become incredibly complex.
It’s no wonder that teams are turning to solutions like AIOps and automation for help. With the advent of no-code/low-code workflows, AI-powered orchestration and automated scripts, the goal is to scale operations with the machine as the first line of defense so your people are only required when they’re absolutely needed. It’s time to start automating and simplifying processes to reduce toil and give time back to your teams for value-add work.
The Toil Trap
Toil doesn’t just diminish productivity; it also adds unnecessary stress and cognitive load. When a major incident strikes, teams are overwhelmed by alert storms and bombarded by endless noise with no clear signal in sight. Even when they can figure out where to start looking, teams are saddled with manual tasks. Seemingly mundane work, such as creating incident channels, notifying responders, setting up communication bridges and updating stakeholders, takes time away from troubleshooting. All of this can drive up mean time to resolution (MTTR), since context switching and stakeholder interruptions distract teams from resolving the issue at hand.
4 Tips to Combat Toil with AI-Led, Automated Incident Response
The most sustainable way to target toil in incident response is to lower the barrier to adopting automation. Empower teams with a flexible toolkit that they can test and use to start automating their own processes. Making it as easy as possible for teams to get started opens the door to experimenting with ways to shortcut manual processes. Every little step adds up, meaning more time and focus for faster resolution. Here are four tips for automating the steps of your response process:
Tip No. 1: Start Small and Specific
Getting started is always the hardest part. Examine your organization’s documentation for different types of incidents. Maybe the steps are already codified in offline runbooks. These are perfect starting points for your first workflows. Audit all the steps that are executed and look for patterns to identify where you can start.
One way to get started is designing incident workflows that can automatically orchestrate specific, dynamic sequences of actions during incident response. Do you always open an incident-specific Slack channel? Do you create a Zoom link and share it with the entire responder group first? Start with something lower priority and test it out. Then move up in severity from there. Use conditional triggers to define workflows for various priority levels, and be sure to give your teams the creative freedom and flexibility to build different types of workflows to address different issues.
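To make the idea of conditional triggers concrete, here is a minimal Python sketch. It is not tied to any particular incident management product: the priority labels ("P1" to "P4"), the `Workflow` class and the three action functions are all hypothetical stand-ins, and each action only records what a real chat or video integration would do.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Incident:
    title: str
    priority: str  # hypothetical labels: "P1" (most severe) to "P4"
    log: list[str] = field(default_factory=list)


# Stand-in actions: each records the step instead of calling a real service.
def open_chat_channel(incident: Incident) -> None:
    incident.log.append("opened chat channel")


def create_video_bridge(incident: Incident) -> None:
    incident.log.append("created video bridge")


def notify_stakeholders(incident: Incident) -> None:
    incident.log.append("notified stakeholders")


@dataclass
class Workflow:
    name: str
    condition: Callable[[Incident], bool]      # the conditional trigger
    actions: list[Callable[[Incident], None]]  # steps run in order


# Assumed setup: a light workflow for low priorities, a fuller one for majors.
WORKFLOWS = [
    Workflow(
        "low-priority triage",
        lambda i: i.priority in {"P3", "P4"},
        [open_chat_channel],
    ),
    Workflow(
        "major incident",
        lambda i: i.priority in {"P1", "P2"},
        [open_chat_channel, create_video_bridge, notify_stakeholders],
    ),
]


def run_workflows(incident: Incident) -> list[str]:
    """Run every workflow whose condition matches; return the names that ran."""
    triggered = []
    for workflow in WORKFLOWS:
        if workflow.condition(incident):
            for action in workflow.actions:
                action(incident)
            triggered.append(workflow.name)
    return triggered
```

Calling `run_workflows(Incident("checkout latency", "P1"))` would run only the major-incident workflow. Starting with the low-priority workflow and adding actions as confidence grows mirrors the "start small" advice above.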
Another example is automating diagnostics. Does your responder always conduct the same set of tests before starting to triage? How much of this can you delegate to a lower-level responder (or even the machine) before you escalate to a subject-matter expert? You can set up scripts and automations to chain together specific actions for the machine to run so that the results are ready by the time your responder even gets to their desk.
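As a rough illustration of machine-run diagnostics, the sketch below runs a set of checks in sequence and collects the results into a report a responder could read on arrival. The check names, the 1% free-space threshold and the `check_recent_deploy` placeholder are assumptions made for the example; real checks would reflect whatever your runbooks already prescribe.

```python
import shutil
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str
    seconds: float


def check_disk_space(path: str = ".", min_free_ratio: float = 0.01) -> str:
    """Raise if free space on `path` falls below an assumed threshold."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    if free_ratio < min_free_ratio:
        raise RuntimeError(f"only {free_ratio:.1%} free on {path}")
    return f"{free_ratio:.1%} free on {path}"


def check_recent_deploy() -> str:
    # Hypothetical placeholder: a real check would query your deploy tooling.
    return "no deploy lookup configured"


def run_diagnostics(checks: dict[str, Callable[[], str]]) -> list[CheckResult]:
    """Run each check in order; a failing check is recorded, not fatal."""
    results = []
    for name, check in checks.items():
        started = time.monotonic()
        try:
            detail, ok = check(), True
        except Exception as exc:
            detail, ok = f"{type(exc).__name__}: {exc}", False
        results.append(CheckResult(name, ok, detail, time.monotonic() - started))
    return results


def format_report(results: list[CheckResult]) -> str:
    """Render results as plain text suitable for posting to an incident channel."""
    return "\n".join(
        f"[{'PASS' if r.ok else 'FAIL'}] {r.name}: {r.detail}" for r in results
    )
```

A call such as `format_report(run_diagnostics({"disk": check_disk_space, "deploys": check_recent_deploy}))` produces one line per check. Catching exceptions per check is a deliberate choice here, so one broken diagnostic does not hide the results of the others.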
BONUS TIP: A great way to explore the potential for automation is with a hack week. If you have one coming up at your organization, consider pulling together a team or a group of teams to brainstorm ways to start using workflows as a part of incident response.
Tip No. 2: Continuously Iterate and Learn
Automation is not meant to be “set it and forget it.” As processes evolve and personnel change, you’ll want to have a regular review cadence to ensure that workflows stay up to date and effective. For instance, new steps may be introduced with learnings from a recent major incident. This iteration can be applied to the specific sequence of steps handled in incident workflows. It can also be used to refine and improve event rules further upstream in the incident life cycle.
The best incidents are those that don’t happen at all. Are there ways to deflect an alert or incident completely from your human responders? There are machine learning (ML)-based solutions that can target transient noise and silence it to avoid unnecessary interruptions. The better an incident and its sequence of events are understood, the more precisely you can target them with automation and take machine-led actions before you bring in a responder.
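The ML-based solutions mentioned above are beyond a short example, but the underlying idea of deflecting transient noise can be sketched with a much simpler, rule-based stand-in: hold back alerts that resolve on their own within a grace period. The `Alert` fields and the 120-second default are assumptions for illustration, not settings from any specific tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    key: str
    triggered_at: float                  # seconds since some reference point
    resolved_at: Optional[float] = None  # None means the alert is still open


def is_transient(alert: Alert, grace_seconds: float = 120.0) -> bool:
    """An alert counts as transient if it self-resolved within the grace period."""
    return (
        alert.resolved_at is not None
        and alert.resolved_at - alert.triggered_at <= grace_seconds
    )


def alerts_to_page(alerts: list[Alert], grace_seconds: float = 120.0) -> list[Alert]:
    """Keep only the alerts that should still reach a human responder."""
    return [a for a in alerts if not is_transient(a, grace_seconds)]
```

With this rule, an alert that fires and clears 45 seconds later is silenced, while an unresolved alert or one that stayed open for several minutes still pages someone. A simple heuristic like this is also a reasonable baseline to compare against before adopting a learned model.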
Retrospectives and postmortems are great places to revisit whether steps were missed or not required. Assign actions to go back to the workflow or automation and tweak it to improve it for next time.
Tip No. 3: Experiment with Chaining Automation Together
To push things to the next level, explore opportunities to expand and chain different types of automation together. Would it make sense to have event-driven automation tied together with incident workflows to trigger automation at ingest? Is there an opportunity to chain together processes for handling stakeholder communications to get it all done in one fell swoop? Whatever your tech stack, consider ways to link automation together and think about the overall process. Pair with other teams for fresh eyes and look for opportunities to consolidate tools. When evaluating new solutions, make sure they integrate with other pieces of your automation toolkit. Or better yet, see what they interoperate with across the vendor’s own portfolio for even more seamless handling.
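One way to picture chained automation is as a sequence of steps that pass a shared context along, stopping early if an earlier step already resolved the problem. The sketch below assumes that shape; the step functions, the event fields (`source`, `type`) and the `INC-` identifier format are all hypothetical.

```python
from typing import Callable

Step = Callable[[dict], dict]


def enrich_event(ctx: dict) -> dict:
    # Event-driven step at ingest: derive the service from the event source.
    ctx["service"] = ctx["source"].split(".")[0]
    return ctx


def try_auto_remediation(ctx: dict) -> dict:
    # Hypothetical rule: a full temp disk is assumed safe to clean up unattended.
    if ctx["type"] == "tmp_disk_full":
        ctx["resolved"] = True
    return ctx


def open_incident(ctx: dict) -> dict:
    # Incident workflow step: stand-in for creating an incident record.
    ctx["incident_id"] = f"INC-{ctx['service']}"
    return ctx


def draft_stakeholder_update(ctx: dict) -> dict:
    # Stakeholder communication step, chained off the incident just opened.
    ctx["update"] = (
        f"Investigating {ctx['type']} on {ctx['service']} ({ctx['incident_id']})"
    )
    return ctx


def run_chain(event: dict, steps: list[Step]) -> dict:
    """Run steps in order, recording a trail; stop once the event is resolved."""
    ctx = dict(event, trail=[])
    for step in steps:
        ctx = step(ctx)
        ctx["trail"].append(step.__name__)
        if ctx.get("resolved"):
            break
    return ctx
```

Run with all four steps, a `tmp_disk_full` event stops after remediation and never opens an incident, while an `error_rate_high` event flows through to a drafted stakeholder update. The recorded trail also gives retrospectives a simple record of which automated steps actually ran.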
Tip No. 4: Consider Generative AI
As the industry continues to develop more AI-powered technology, we’re seeing more use cases for generative AI emerge. Keep an eye out for pragmatic applications that can help you and your team further drive efficiency in your processes and workflows. The best technology augments and assists your most valuable resource: your humans. Automation and AI should scale your subject-matter experts and free them from toilsome work so that they can focus on more critical tasks at hand. Could that status report be drafted for you? Could you autogenerate a postmortem summary? Removing friction and making shortcuts available to your team where it makes sense ultimately returns time and focus.
Reaping the Rewards of Efficiency
Automation is a great way to codify well-understood processes and return time to your teams. The efficiency gains and the time diverted back to value-add work are critical to driving your business toward growth in a tough economy. Whether it’s AIOps early in the event stream, no-code/low-code workflows or simply automated team mobilization, introducing and experimenting with AI and automation today can help your team unlock new productivity going forward.