PagerDuty sponsored this post.
The COVID-19 pandemic has put IT incident response teams under a level of pressure they have never seen before, illustrated most pointedly by the recent surges in incidents. As shelter-in-place orders were given, some industries like retail, online learning and collaboration experienced intense strain on their digital operations, seeing incidents more than double compared to pre-pandemic levels. And while these teams are being asked to handle more time-critical work than ever, IT budgets are under severe scrutiny as some companies attempt to cope with contracting revenue streams.
One obvious response to this situation is to invest in automation. Automating repetitive and manual tasks gets them done faster, reduces labor costs, and in many cases provides a higher level of accuracy. However, when it comes to remediation — taking action to mitigate or fix an incident — the idea of automation has traditionally been met with considerable resistance.
Risk vs. Reward
IT organizations are notoriously risk-averse, and with good reason. One hour of downtime can cost an organization hundreds of thousands of dollars, and COVID-19 has raised the stakes even higher. Right now, about a third of Americans are working at home because of the pandemic. According to one study, 54% want to keep working at home even when the health crisis has passed. This means that if a collaboration system is down, employees can’t communicate by walking to a cubicle on the other side of the office. Their work simply grinds to a halt.
There’s more pressure on customer-facing applications as well. US e-commerce sales jumped 49% in April and many are speculating that the habit of online shopping is here to stay — even after COVID-19. For many businesses, this means that broken shopping carts or pages that fail to load are a disaster. And disaster means revenue lost. Costco’s website was down for a few hours during Thanksgiving and experts estimated the retailer lost $11 million in sales.
The Path to Automation
With so much at stake, is it safe to automate remediation? What if it works perfectly for 95% of the incidents, but makes things catastrophically worse for the other 5%? This is a fair question, but it’s based on a false premise. It assumes that automation is an all-or-nothing proposition — either fully manual or fully automated, where a machine does everything. In fact, automation often can and should be implemented in stages.
The safe path to automation has four phases. Each phase builds on the last, although some may be omitted in some circumstances.
- Phase 1: Identify candidates for automation. The first step is determining the types of incidents that teams encounter on a repetitive basis. These are the incidents where automated remediation makes the most sense and delivers the highest rewards. Not everything will make the cut: stick to manual processes for situations where the investment doesn’t justify the outcome.
- Phase 2: Human-initiated automation. The next step is automating runbooks and making those automated scripts available to all the individuals who might be notified about an issue. Then they can remediate the issue by simply pressing a button. Scripts can be refined as necessary. The goals here are to resolve repetitive incidents in exactly the same way every time, regardless of who may receive the alert or when it may occur, and verify that the scripts actually work.
- Phase 3: Human and machine co-existence. In this step, a human is paged and the automation starts at the same time. The human’s role is to verify that the automated remediation worked. For projects where the remaining human steps are particularly complex, this is often the final phase.
- Phase 4: Machine-initiated with human fallback. In this step, remediation starts automatically when an incident is detected, and a human is paged only if the automation fails to resolve the problem. Even when your automation is succeeding, it’s important to review its results regularly so that you can course-correct if necessary. If the same failure happens frequently, you may also want to invest in your infrastructure to correct its root cause, because even a short period of downtime may be unacceptable to your business.
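The four phases can be read as a per-incident-type rollout setting. The following is a minimal sketch in Python, not a description of any particular product: the names `Mode`, `rank_candidates` and `Responder`, the `min_count` threshold, and the in-memory `pages` list standing in for a paging system are all hypothetical, and each runbook is assumed to be a callable that returns `True` when it resolves the incident.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Mode(Enum):
    MANUAL = "manual"                    # not a candidate: humans handle it
    HUMAN_INITIATED = "human_initiated"  # Phase 2: a paged human triggers the script
    CO_EXISTENCE = "co_existence"        # Phase 3: page a human and run the script
    HUMAN_FALLBACK = "human_fallback"    # Phase 4: run the script, page on failure


def rank_candidates(incident_types: List[str], min_count: int = 3) -> List[Tuple[str, int]]:
    """Phase 1: incident types seen at least min_count times, most frequent first."""
    counts = Counter(incident_types)
    return [(kind, n) for kind, n in counts.most_common() if n >= min_count]


@dataclass
class Responder:
    modes: Dict[str, Mode]                   # incident type -> rollout phase
    runbooks: Dict[str, Callable[[], bool]]  # incident type -> automated runbook
    pages: List[Tuple[str, str]] = field(default_factory=list)  # hypothetical pager

    def page(self, kind: str, note: str) -> None:
        self.pages.append((kind, note))

    def run(self, kind: str) -> bool:
        # A runbook that raises is treated the same as one that reports failure.
        try:
            return bool(self.runbooks[kind]())
        except Exception:
            return False

    def handle(self, kind: str) -> bool:
        """Return True only if automation ran and reported success."""
        mode = self.modes.get(kind, Mode.MANUAL)
        if mode is Mode.MANUAL or kind not in self.runbooks:
            self.page(kind, "manual remediation")
            return False
        if mode is Mode.HUMAN_INITIATED:
            # The human decides when to press the button; nothing runs here.
            self.page(kind, "runbook available, run when ready")
            return False
        if mode is Mode.CO_EXISTENCE:
            self.page(kind, "automation started, please verify")
            return self.run(kind)
        # Mode.HUMAN_FALLBACK: only involve a human if the automation fails.
        resolved = self.run(kind)
        if not resolved:
            self.page(kind, "automation failed, please take over")
        return resolved
```

In this sketch, moving an incident type to the next phase is a one-line change to its entry in `modes`, so a type that misbehaves can be moved back to an earlier phase just as easily. Unknown incident types default to manual handling.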
This approach to automation has several very important benefits. First, there are gains at every step in terms of time saved. It is absolutely not necessary to complete all the steps to obtain the benefits of automation or establish cost justification. Second, this is a very low-risk approach. The efficacy of the automated remediation gets tested again and again under production conditions, and human supervision is never entirely eliminated. Finally, this approach to automation can be implemented at a gradual pace, with no disruption to existing processes.
Making Automated Remediation Work Best For You
For all its benefits, some individuals resist the idea of automation, sometimes because it’s simply a new way of doing things and sometimes because they don’t trust it. For these reasons alone, making automated remediation successful always starts with people. Organizations need a plan developed by humans, along with human involvement throughout the journey. Automation can make it easier for remediation teams to cope, and it can be implemented without a lot of risk if companies take a step-by-step approach rather than trying to jump immediately from manual processes to full automation. Automation is also going to become increasingly important over time, both for keeping the lights on and for avoiding employee burnout.