Fighting Incidents with End-to-End Event-Driven Automation
The volume of incidents today’s technical teams face is unprecedented, as is the pressure to perform. Companies want to protect revenue and customer experience, and customers across industries have high expectations for digital experiences. They want services that are fast, flawless and highly available, and they have little tolerance for gaps in service. According to PwC, one in three customers would stop doing business with a brand they loved after one bad experience.
The teams tasked with keeping these services available are inundated by alert noise. Responders are often unsure what information they need to resolve an incident and where that information is located. Finding this information, and then completing the same manual, repetitive tasks for each incident, wastes too much of their time.
To reduce mean time to resolution (MTTR) and keep customers and response teams happy, organizations need to leverage automation. But this isn’t a one-and-done effort, or something that can be accomplished and scaled within a single sprint. It’s a commitment to better incident-response practices, with challenges to overcome and distinct stages along the way.
Challenges We Hear from Our Customers about Automation
In our work helping customers, from small startups to Fortune 100 companies, improve their incident-response practices, we’ve heard the same challenges with adopting automation come up again and again.
Here are the top three:
Too busy firefighting: When incidents are coming in fast, all teams can feel like they’re being pulled into crisis mode. They can’t get ahead of the issues fast enough to complete their assigned work, much less tackle initiatives to improve incident response.
No buy-in: Leaders across industries are looking at how to be the most competitive on the market and how to do so at as little cost as possible. Long initiatives like crafting automation can be seen as a distraction if they don’t have tangible benefits to an organization’s bottom line.
Can’t scale: Some organizations are working toward deploying automation but are reaching a stumbling block. They can’t scale. Some teams have detailed auto-remediations built for their services. Others are still stuck doing manual work. There’s no standardization.
When these challenges are at play within an organization, it may be time to employ a crawl, walk, run approach to creating and deploying automation.
How to Employ a Crawl, Walk, Run Approach to Automation
The first step is to determine who is part of the team and at what level you plan to execute. One of the best ways to get an organization to buy in to automation is to start with a small pilot team automating some low-hanging fruit that improves the day to day for a specific team, group or service. Share that automation with other teams and see adoption spread. This will drive interest in building more automation, helping a grassroots initiative succeed. And with proven results, such as better MTTR and less customer impact, you’re more likely to get executive buy-in as well.
If the event stream is too overwhelming for your team, start at the source and stem the flow. Crawling toward better incident response automation starts with two things: suppression and pausing transient alerts. Compared to other forms of automation, these are relatively easy to execute. Plus, they immediately help responders gain back time and reduce alert fatigue.
Suppression is used to stop an incident from sending a notification to a responder for an event that’s known to have little to no value. According to AIOps customer data, 50% of noise compression comes from suppression. Suppression can reduce incident volumes via broad rules targeting those events that the team never needs to know about.
For example, a developer team at PagerDuty suppresses events until a certain number of them have arrived, at which point they turn suppression off and allow Event Orchestration to start creating incidents.
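The threshold pattern described above can be sketched in a few lines. This is a minimal illustration only, not PagerDuty’s implementation: the class name, the "alert_key" field and the threshold of 3 are all assumptions made for the example.

```python
from collections import Counter


class ThresholdSuppressor:
    """Suppress events per key until `threshold` of them have arrived."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def should_create_incident(self, event: dict) -> bool:
        # "alert_key" is a hypothetical field that groups related events.
        key = event.get("alert_key", "default")
        self.counts[key] += 1
        # Events below the threshold are suppressed; the threshold-th event
        # and every later event for that key are allowed through.
        return self.counts[key] >= self.threshold


suppressor = ThresholdSuppressor(threshold=3)
decisions = [
    suppressor.should_create_incident({"alert_key": "disk"}) for _ in range(4)
]
print(decisions)  # [False, False, True, True]
```

A real rule would likely also reset the count after a quiet period, so that a handful of unrelated events spread over several days never adds up to an incident.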
Pausing notifications allows users to suspend the creation of an incident for a predefined period. If the condition clears on its own during that window, no incident is needed; once the period lapses, the incident is created normally. This automation is best used for alerts with clearly defined conditions. For example, a company might pause certain high CPU usage alerts for 5 minutes, creating an incident only if the high CPU usage persists.
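To make the pause logic concrete, here is a small sketch of how a pause window might be evaluated. It assumes the alert can be marked as cleared by a later event; the class, its fields and the 300-second default are hypothetical and chosen only to mirror the 5-minute CPU example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PausedAlert:
    """An alert whose incident creation is held back for a pause window."""

    triggered_at: float            # seconds, on any monotonic clock
    pause_seconds: float = 300.0   # 5 minutes, as in the CPU example
    cleared_at: Optional[float] = None

    def clear(self, at: float) -> None:
        # Record that the underlying condition went away at time `at`.
        self.cleared_at = at

    def should_create_incident(self, now: float) -> bool:
        deadline = self.triggered_at + self.pause_seconds
        if now < deadline:
            return False  # still inside the pause window
        # The window has lapsed: create an incident unless the condition
        # cleared before the deadline (a transient alert).
        return self.cleared_at is None or self.cleared_at >= deadline


transient = PausedAlert(triggered_at=0.0)
transient.clear(at=120.0)          # CPU recovered after 2 minutes
durable = PausedAlert(triggered_at=0.0)

print(transient.should_create_incident(now=300.0))  # False
print(durable.should_create_incident(now=300.0))    # True
```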
Once you’ve decreased the noise in your environment and your teams are getting fewer incidents, it’s time to make those incidents easier to resolve with the proper data. You can do this by enriching events, alerts and incidents.
Event enrichment allows you to speed up triage by ensuring responders have incidents populated with relevant contextual information. Teams can normalize event data so incidents look the same across an organization. This is especially helpful for network operations centers (NOCs) or other L1 response teams who want consistency across the events that come in and don’t have the time to learn the nuances of the hundreds of teams that they support.
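As an illustration of normalization, the sketch below maps events from different tools onto one common shape. The field names and aliases are assumptions invented for the example, not a schema from any particular product.

```python
# Hypothetical aliases: each normalized field lists the raw field names
# that different monitoring tools might use for the same information.
FIELD_ALIASES = {
    "summary": ("summary", "title", "message"),
    "source": ("source", "host", "hostname"),
    "severity": ("severity", "level"),
}


def normalize_event(raw: dict) -> dict:
    """Return an event with a consistent set of fields."""
    normalized = {}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias present in the raw event, else "unknown".
        normalized[field] = next(
            (raw[name] for name in aliases if name in raw), "unknown"
        )
    # Use one casing convention for severity across all sources.
    normalized["severity"] = str(normalized["severity"]).lower()
    return normalized


print(normalize_event({"title": "Disk full", "host": "db-1", "level": "CRITICAL"}))
# {'summary': 'Disk full', 'source': 'db-1', 'severity': 'critical'}
```

With every event in the same shape, an L1 team can triage by the same three fields no matter which team or tool produced the event.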
Alert enrichment goes a layer deeper. Once the event officially becomes an alert, responders can define the severity with which an alert should be created. This ensures that notifications are routed to the correct escalation policy, saving time during response.
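A rough sketch of severity-based routing follows. The rule (production error rate of 5% or more is critical), the "environment" and "error_rate" fields, and the escalation policy names are all hypothetical; a real setup would take these from the team’s own conventions.

```python
# Hypothetical mapping from severity to an escalation policy name.
ESCALATION_POLICIES = {
    "critical": "page-primary-on-call",
    "warning": "notify-team-channel",
    "info": "log-only",
}


def assign_severity(event: dict) -> str:
    """Derive a severity from illustrative event fields."""
    if event.get("environment") != "production":
        return "info"
    if event.get("error_rate", 0.0) >= 0.05:
        return "critical"
    return "warning"


def route_alert(event: dict) -> tuple:
    """Return the (severity, escalation policy) pair for an event."""
    severity = assign_severity(event)
    return severity, ESCALATION_POLICIES[severity]


print(route_alert({"environment": "production", "error_rate": 0.12}))
# ('critical', 'page-primary-on-call')
```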
For the alerts that are grouped into an incident, incident enrichment allows users to define the priority and notes that an incident has when it is initially created. This means that you’re more certain when an incident is a P1, and all hands need to be on deck, versus a P4, which you don’t need to interrupt your dinner for. It’s a quality-of-life improvement for anyone on call. Notes are also useful for populating knowledge-base articles and internal wikis, or for providing information on how a responder should proceed.
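One way to picture incident enrichment is a function that derives a priority and a starting note from the grouped alerts. The severity ranking, the severity-to-priority mapping and the runbook URL below are assumptions for illustration, and the sketch expects at least one alert in the group.

```python
# Hypothetical ordering and mapping; real priorities are set per organization.
SEVERITY_RANK = ["info", "warning", "error", "critical"]
SEVERITY_TO_PRIORITY = {
    "critical": "P1",
    "error": "P2",
    "warning": "P3",
    "info": "P4",
}


def enrich_incident(alerts: list, runbook_url: str) -> dict:
    """Set priority from the worst grouped alert and attach a note."""
    # Assumes `alerts` is non-empty and each alert has a known severity.
    worst = max((a["severity"] for a in alerts), key=SEVERITY_RANK.index)
    note = (
        f"{len(alerts)} grouped alert(s); worst severity: {worst}. "
        f"Runbook: {runbook_url}"
    )
    return {"priority": SEVERITY_TO_PRIORITY[worst], "note": note}


incident = enrich_incident(
    [{"severity": "warning"}, {"severity": "critical"}],
    runbook_url="https://wiki.example.com/runbooks/checkout",
)
print(incident["priority"])  # P1
```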
The last step of this journey is auto-remediation. Incidents resolve themselves with automation as the L0 responder. No humans are required to respond. One way to achieve this is with webhooks that can be triggered on incident creation. Or you can call in other forms of automation, whether that’s through PagerDuty or another vendor. While some organizations can arrive at this level of sophistication on their own, this automation is difficult to build, and scaling it across an organization can pose many challenges. In fact, this is one of the top reasons why people turn to PagerDuty. Partnership during this phase can help take some of the strain off individual teams that are responsible for developing their own automation or site reliability engineering teams that are responsible for creating it organizationwide.
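The webhook-triggered approach can be sketched as a small dispatcher: a handler receives the incident payload, looks up a registered remediation and either runs it or hands the incident to a human. The payload shape, the "disk_full" incident type and the remediation itself are hypothetical and do not reflect any vendor’s webhook format.

```python
from typing import Callable, Dict

# Registry of automated fixes, keyed by a hypothetical incident type.
REMEDIATIONS: Dict[str, Callable[[dict], str]] = {}


def remediation(incident_type: str):
    """Decorator that registers a function as the fix for an incident type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        REMEDIATIONS[incident_type] = func
        return func
    return register


@remediation("disk_full")
def clear_temp_files(incident: dict) -> str:
    # Placeholder: a real remediation would trigger a cleanup job here.
    return f"cleared temp files on {incident['host']}"


def handle_incident_webhook(payload: dict) -> dict:
    """Run the registered remediation, or escalate if none exists."""
    incident = payload["incident"]
    action = REMEDIATIONS.get(incident.get("type"))
    if action is None:
        return {"status": "escalated", "detail": "no automated remediation registered"}
    return {"status": "remediated", "detail": action(incident)}


print(handle_incident_webhook({"incident": {"type": "disk_full", "host": "db-1"}}))
# {'status': 'remediated', 'detail': 'cleared temp files on db-1'}
```

Keeping the escalation path explicit matters here: automation acts as the L0 responder only for incident types with a known fix, and everything else still reaches a person.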
Looking to Automate on a Global Scale across Your Technical Ecosystem?
Whether you’re just starting the crawl stage of your automation journey or are already running with auto-remediation, PagerDuty AIOps can help you achieve fewer incidents with faster resolution. And our new feature, Global Event Orchestration, can help you create and scale automation across even the most complex technical ecosystems. For more information, you can take our product tour or register for our webinar.