
5 Ways to Supercharge Incident Remediation with Automation

Automate common incident remediation tasks to enable faster response, fewer errors, and greater productivity across your organization.
Nov 21st, 2023 9:30am by
Featured image by Israel Palacio on Unsplash.

For today’s digital-first organizations, software problems often become business problems. As companies’ revenue and customer experience increasingly move online, incidents and disruptions — and their associated downtime — will have a bigger impact on revenue, customer satisfaction and employee productivity.

Many IT disruptions are reasonably well understood, both in how to triage them and in how to remediate them, even when the fix is only temporary. Diagnosing alerts from noisy services usually begins with the same steps. “Fix it for right now” remediation steps are also often the same, involving simple service reboots and failovers.

These repetitive actions are good candidates for applying automation to allow faster response, avoid interrupting subject matter experts (SMEs), decrease errors and increase productivity.

The Downsides of Not Automating Incident Response

IT operators must resolve severe outages as quickly as possible, which is why they track metrics such as mean time to resolution (MTTR) and error budgets. In these cases, service restoration is the highest priority, regardless of whose work is disrupted.

Once you meet service level objectives (SLOs), driving IT support efficiency becomes a concern. All the less-severe incidents, IT events and monitoring alerts can drive up support costs and interrupt senior engineers from their primary work, reducing the velocity of new features. Unfortunately, the situation in many organizations is far from ideal. Research reveals that a fifth of organizations suffer a “high impact” (equaling a 25% or greater loss of productivity) from being interrupted by unplanned work stemming from IT incidents and outages. For 47% of organizations, the impact is “significant,” meaning a 10%–25% productivity loss.

Much of this toil can be traced back to operators who lack the knowledge or access to fix problems on their own and must escalate to senior engineers for resolution. Many first responders in operations centers are unfamiliar with the many systems an enterprise runs, and they likely lack the skills to diagnose and remediate an issue unless clear instructions are available, such as in a runbook. They also may not have the requisite access privileges to run tests or make changes to production, whether because of lower skill levels or because companies need to keep their environments locked down for compliance reasons.

Often, these responders are left drowning in signals and alerts, unable to filter out the noise from huge data volumes and unable to do anything other than escalate for help. As a result, senior engineers are called to help even with basic triage tasks simply because they have access privileges to the impacted systems. These interruptions can consume hours each week, distracting engineers from development projects. Incidents end up involving far too many engineers, doing basic things like running tests to show their code is not causing the problem.

Automating Incident Response with Artificial Intelligence Operations (AIOps)

Automating predictable, repeating steps in incidents can reduce needless escalations to experts, empower first responders to take more actions and (ideally) eliminate calling any humans at all. Consider a typical incident response workflow:

[Figure: Typical incident response workflow]

Employing AIOps to detect problems from alerts and label incidents is a major way to increase speed and efficiency. You won’t need responders staring at glass to find problems; AIOps can filter through a lot of repeated noise and false alerts to find real problems that need action. With AIOps in charge of triggering your incident workflows, you can automate tasks through resolution, closure and even the final fix by developers.

Getting Started with AIOps

The diagram above shows there are many opportunities to improve incident response with automation. But where should you start?

It’s a balance between your confidence in the automation, the value or cost of the incident and how frequently the task occurs. Common incidents with proven automated steps for diagnosis and remediation are good opportunities to trigger with AIOps. From there, follow a similar process to prioritize the rest of your incident response.
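One way to make that balance concrete is to score each candidate task. The sketch below is only an illustration: the `Candidate` fields, the multiplicative formula and all the sample numbers are assumptions made for this example, not figures from any real environment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A task that could be automated (all fields are hypothetical estimates)."""
    name: str
    confidence: float           # 0.0-1.0: how proven the automation is
    cost_per_incident: float    # estimated cost of one occurrence
    incidents_per_month: float  # how often the task comes up


def priority_score(c: Candidate) -> float:
    """Rough expected monthly value: confidence x cost x frequency."""
    return c.confidence * c.cost_per_incident * c.incidents_per_month


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=priority_score, reverse=True)


# Made-up sample data for illustration only.
candidates = [
    Candidate("restart-web-service", 0.9, 500.0, 20),
    Candidate("database-failover", 0.4, 20000.0, 1),
    Candidate("clear-disk-space", 0.8, 200.0, 10),
]

for c in rank(candidates):
    print(c.name, round(priority_score(c)))
```

With these sample numbers, a frequent and well-proven service restart outranks a rare, expensive failover whose automation is less trusted. A different weighting could reasonably reverse that order.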

Automate diagnosis and remediation steps for serious outages to speed resolution. Then focus on increasing efficiency by automating recurring diagnostics and remediation actions that occur across many kinds of incidents. You can safely automate and trigger lower-risk actions such as read-only diagnostic pulls with AIOps, giving downstream personnel the information they need, even when they are paged.

You can automate common remediation actions and make them available to responders to use. This automation can utilize secrets management tools such as Vault to enable privileged actions in production environments without sharing credentials, making it safer to delegate to responders. When the likely cause of an incident is obvious, and the remediation automation is proven, you can have AIOps trigger the remediation to enable self-healing without needing to call any responders.
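The escalation path described above can be sketched as a small decision rule. The `Risk` and `Decision` names and the `decide` function are hypothetical, invented for this illustration. Credential handling through a secrets manager is deliberately left out.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"                    # e.g. diagnostic pulls
    CHANGES_PRODUCTION = "changes_production"  # e.g. restarts, failovers


class Decision(Enum):
    AUTO_RUN = "auto_run"                      # triggered without a human
    OFFER_TO_RESPONDER = "offer_to_responder"  # a responder chooses to run it


def decide(risk: Risk, cause_is_clear: bool, automation_is_proven: bool) -> Decision:
    """Pick how an automated action should be triggered (illustrative rule)."""
    # Lower-risk, read-only diagnostics are safe to trigger automatically.
    if risk is Risk.READ_ONLY:
        return Decision.AUTO_RUN
    # Self-healing: only when the cause is obvious and the fix is proven.
    if cause_is_clear and automation_is_proven:
        return Decision.AUTO_RUN
    # Otherwise, make the action available for a responder to invoke.
    return Decision.OFFER_TO_RESPONDER


print(decide(Risk.READ_ONLY, False, False).value)
print(decide(Risk.CHANGES_PRODUCTION, True, False).value)
```

Keeping this rule in one visible place makes it easy to review which actions can run without a human in the loop.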

What you choose to automate first comes with an opportunity cost. So finding the tasks that can generate the biggest financial impact is your path to success.

5 Design Principles to Automate Incident Remediation

Here are five key design principles that will help organizations automate incident remediation to dramatically reduce worker toil, free talent for innovation and optimize how they resolve incidents.

  1. Start simple: Don’t create overly complex bespoke automation for each kind of incident. Build primitive tasks and actions, and reuse them as building blocks for more sophisticated flows. Start with lower-risk actions such as ones that don’t change production or can be performed with lower access privileges. Keep execution times short. Avoid hopping across too many technology domains.
  2. Build with guardrails: Whenever an automated task starts or stops, have it send notifications — through emails, your incident response control panel, Slack messages or another method. Avoid loops, at least within component automation. Perhaps your incident management workflow runs several retries, but you want these rules to be visible, not buried in a different automation. The same is true for conditionals like if/then/else statements. Leave these to visible business rules in your incident response workflow. Always report the status of the execution: success, failure, error.
  3. Deliver meaningful results: Automation must make sense to end users so that they can derive maximum and immediate value from outputs. Consider who your users are and what skills and contextual knowledge they have. Remember, many first responders may not have the deep technical systems knowledge that SMEs possess. You might need to simplify diagnostic information to improve and accelerate decision-making for routing support tickets, for example. It’s about taking raw data and converting it into the information your end user actually needs.
  4. Promote consistency: Building on the previous point, the key to ensuring a wide range of people can support more operations is to deliver a consistent and simplified end-user experience. This could include standardizing code styles, input parameters and catalog presentation. Your organization may be using Ansible for its Azure environment, or CloudFormation and Terraform for AWS, and even a mix of Bash and Python scripts. The key is to make the end-user experience consistent across all environments or tooling, so you do not need individuals to specialize in various tools. How many apps can your responders support?
  5. Document the process: Always document automation at the point of invocation. Some processes probably won’t run often, so you shouldn’t expect a responder to remember training from months ago. Similarly, some organizations see high turnover in first responders, so documentation helps shore up where training may be incomplete. This documentation should guide the user in a standardized way to boost understanding and maintainability.
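
Several of these principles can be combined in a thin wrapper around each primitive task. The following is a minimal sketch under stated assumptions: `run_task`, `Status` and `Result` are hypothetical names, and `notify` stands in for whatever channel you use (email, Slack, an incident control panel). The wrapper notifies on start and stop, always reports success, failure or error, and contains no retries or branching of its own, leaving those rules to the visible incident workflow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"


@dataclass
class Result:
    task: str
    status: Status
    summary: str  # plain-language outcome for first responders


def run_task(name: str, action: Callable[[], bool],
             notify: Callable[[str], None]) -> Result:
    """Run one primitive task once, with notifications and a reported status.

    `action` returns True if the check or fix succeeded and False if it did not.
    Any exception is reported as ERROR instead of being retried here.
    """
    notify(f"{name}: started")
    try:
        ok = action()
        status = Status.SUCCESS if ok else Status.FAILURE
        summary = "completed as expected" if ok else "ran but did not succeed"
    except Exception as exc:
        status = Status.ERROR
        summary = f"could not run: {exc}"
    notify(f"{name}: finished with status {status.value}")
    return Result(name, status, summary)


# Usage with stand-in actions; a list collects the notifications.
messages: List[str] = []


def disk_has_space() -> bool:
    return True


def host_check() -> bool:
    raise RuntimeError("host unreachable")


r1 = run_task("check-disk", disk_has_space, messages.append)
r2 = run_task("check-host", host_check, messages.append)
print(r1.status.value, r2.status.value)
```

Because every task returns the same `Result` shape regardless of the tool behind it, responders see one consistent interface, and the docstring documents the task at the point of invocation.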

Where Humans Step In

Automation isn’t a panacea for incident response. The idea is to let machines take on manual and repetitive tasks where possible. When incidents are complex or novel, humans need to get involved. Even in cases where SMEs are required to step in, automated processes can speed up their work by proactively gathering the detailed diagnostic data they need to determine root causes and the right remediation steps.

In a digital-first world, automation should be at the top of the to-do list for every IT function.
