AIOps Done Right: Automating Remediation and Resiliency
In two previous articles, I broke down how DevOps engineers and site reliability engineers (SREs) can put AIOps to good use and get the full value out of these solutions. As IT environments become more complex, more dynamic, and more difficult to manage manually, AIOps will take on even greater importance in driving business value and empowering teams to create better, more secure software faster. But many organizations are still stuck in the “Gen 1” phase of AIOps solutions, which cannot keep up with the speed of today’s environments and production deployments. Getting the most value out of AIOps means moving past these traditional solutions and adopting new use cases and applications.
I’ve previously highlighted a couple of these use cases, including shifting AIOps left to create more test-driven operations, and scaling and improving delivery automation to put higher-quality code into production and increase the delivery pipeline’s throughput. In this final segment, I’d like to cover one more key use case for DevOps and SRE teams looking to do AIOps the right way: using AIOps solutions to build operational resiliency and automate remediation.
Building Resiliency Through Automated Operations
How does your organization’s IT system react to changes in user behavior? How about during load and stress tests? What happens if a component breaks after an upgrade, or a dependent system suddenly becomes unavailable? Resiliency and adaptiveness to changes like these are key hallmarks of production quality. Ensuring resilience in IT systems is a focus for SREs, and AIOps solutions can automate manual operational tasks, which in turn supports continuous resiliency, availability and system health.
Integrating your AIOps solution with your delivery automation feeds critical contextual information about configuration and deployment changes directly into the solution. That added context empowers the AIOps solution to:
- Pinpoint the root causes of an abnormal behavioral change in your system more quickly and precisely.
- Alert the relevant teams if an ongoing load test in production begins to affect the overall health of the system.
- Alert application teams if the rollout for a new version of a critical backend service is inadvertently creating a high failure rate for that service.
- Provide a detailed rundown of both the root cause and its ultimate impact on users.
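To make the value of deployment context concrete, here is a minimal sketch of how change events from a delivery pipeline could narrow down the likely cause of an anomaly. It is a naive time-window heuristic written for illustration, not how any particular AIOps product performs root-cause analysis. The `ChangeEvent` and `Anomaly` types, the `suspect_changes` function, the dependency map and the 30-minute window are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ChangeEvent:
    """A deployment or configuration change reported by delivery automation (hypothetical shape)."""
    service: str
    version: str
    deployed_at: datetime


@dataclass(frozen=True)
class Anomaly:
    """An abnormal behavioral change detected on a service (hypothetical shape)."""
    service: str
    detected_at: datetime
    failure_rate: float


def suspect_changes(
    anomaly: Anomaly,
    changes: Sequence[ChangeEvent],
    dependencies: Dict[str, List[str]],
    window: timedelta = timedelta(minutes=30),  # assumed look-back period
) -> List[ChangeEvent]:
    """Return changes made shortly before the anomaly, on the affected
    service or one of its direct dependencies, most recent first."""
    related = {anomaly.service} | set(dependencies.get(anomaly.service, ()))
    earliest = anomaly.detected_at - window
    candidates = [
        change
        for change in changes
        if change.service in related
        and earliest <= change.deployed_at <= anomaly.detected_at
    ]
    return sorted(candidates, key=lambda change: change.deployed_at, reverse=True)


if __name__ == "__main__":
    noon = datetime(2024, 1, 1, 12, 0)
    history = [
        ChangeEvent("checkout", "v1", noon - timedelta(hours=3)),
        ChangeEvent("checkout", "v2", noon - timedelta(minutes=10)),
        ChangeEvent("payments", "v5", noon - timedelta(minutes=5)),
        ChangeEvent("search", "v9", noon - timedelta(minutes=2)),
    ]
    spike = Anomaly("checkout", noon, failure_rate=0.4)
    for change in suspect_changes(spike, history, {"checkout": ["payments"]}):
        print(change.service, change.version)
```

Without the change feed, the same anomaly would have to be investigated across every service. With it, the candidate list shrinks to the recent changes on the affected service and its dependencies.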
Using AIOps Postmortems to Find the ‘Critical Path’ of Resiliency
The ultimate goal with this application of AIOps is to automate as many of these notification and remediation steps as possible. The detailed information on root causes and impacts that AIOps solutions provide helps to do just that. These postmortems enable SREs to study where the “critical path” of resiliency is in their systems. By identifying the application, process or behavioral changes that caused a system to become unstable, as well as which actions ultimately resolved the problem, architects and engineers can use those insights to create an altogether more resilient system. Increased resiliency leads to fewer behavioral abnormalities, more consistent and reliable performance, and overall better digital experiences.
That feedback isn’t just good for creating more resiliency; it’s also useful for automating runbooks, which ensures that if a similar behavioral problem occurs in the future, the remediation steps for addressing and resolving it will be done automatically. Auto-remediation and a faster time to repair mean shorter system downtimes. The shorter the downtime, the less disruption to the user experience.
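One way to picture an automated runbook is as a lookup from a diagnosed root cause to the action that resolved it last time. The sketch below assumes this simple mapping. The root-cause labels, the `runbook` decorator and the remediation functions are hypothetical placeholders, and a real runbook would call your deployment or infrastructure tooling instead of printing.

```python
from typing import Callable, Dict, Optional

# A runbook takes the affected service name and reports whether it succeeded.
Runbook = Callable[[str], bool]

# Registry of automated runbooks, keyed by root-cause label (labels are assumptions).
RUNBOOKS: Dict[str, Runbook] = {}


def runbook(root_cause: str) -> Callable[[Runbook], Runbook]:
    """Register a function as the automated response to a known root cause."""
    def register(func: Runbook) -> Runbook:
        RUNBOOKS[root_cause] = func
        return func
    return register


@runbook("bad_deployment")
def roll_back(service: str) -> bool:
    # Placeholder: a real runbook would trigger the delivery pipeline here.
    print(f"Rolling back {service} to the last known good version")
    return True


@runbook("dependency_unavailable")
def fail_over(service: str) -> bool:
    # Placeholder: a real runbook would reroute traffic to a standby.
    print(f"Failing {service} over to its standby dependency")
    return True


def remediate(root_cause: str, service: str) -> Optional[bool]:
    """Run the matching runbook, or return None when no automation
    exists yet and the incident needs a human."""
    handler = RUNBOOKS.get(root_cause)
    if handler is None:
        return None
    return handler(service)


if __name__ == "__main__":
    remediate("bad_deployment", "checkout")
    if remediate("disk_full", "search") is None:
        print("No runbook for disk_full; escalate to the on-call engineer")
```

Each postmortem that ends with a repeatable fix becomes a candidate for a new entry in the registry, so the share of incidents that need manual work shrinks over time.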
Here’s an example of what that kind of workflow might look like:
Here we can see that issues are escalated to the relevant teams right away, so they can respond appropriately (notifying customers of a potential issue, fielding incoming complaints, etc.). Meanwhile, the auto-remediation script identifies the root cause, resolves the problem and alerts the teams to the outcome, so they can let customers know that everything is back to normal.
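The sequence described above (escalate first, then diagnose and remediate, then report the outcome) can be sketched in a few lines. This is an illustrative outline under stated assumptions, not a definitive implementation: `handle_incident` and the `notify`, `diagnose` and `remediate` callables are hypothetical stand-ins for whatever alerting and automation tooling a team actually uses.

```python
from typing import Callable, List


def handle_incident(
    service: str,
    notify: Callable[[str], None],
    diagnose: Callable[[str], str],
    remediate: Callable[[str, str], bool],
) -> bool:
    """Escalate, auto-remediate, then report the outcome. Returns True if resolved."""
    # 1. Escalate right away so teams can start talking to customers.
    notify(f"OPEN: {service} is degraded; automated remediation has started")

    # 2. Identify the root cause and attempt the automated fix.
    root_cause = diagnose(service)
    resolved = remediate(root_cause, service)

    # 3. Tell the same teams how it ended.
    if resolved:
        notify(f"RESOLVED: {service} recovered (root cause: {root_cause})")
    else:
        notify(f"ACTION NEEDED: {service} still degraded (root cause: {root_cause})")
    return resolved


if __name__ == "__main__":
    sent: List[str] = []
    handle_incident(
        "checkout",
        notify=sent.append,  # stand-in for a chat or paging integration
        diagnose=lambda service: "bad_deployment",
        remediate=lambda cause, service: True,
    )
    print("\n".join(sent))
```

Sending the opening notification before diagnosis begins is the important design choice: the teams facing customers do not wait on the automation, and the automation does not wait on them.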
While this is just one example, it hopefully gives SREs a useful guidepost for leveraging AIOps solutions to their benefit. AIOps can draw on system feedback to build more automated and resilient operations: improving mean time to repair, reducing downtime, and alerting the relevant groups both when an issue occurs and when it has been resolved, all in real time.
When done right, AIOps makes problem-solving faster and life easier for SREs, who can now spend more time and effort on innovative, value-adding projects, rather than chasing down every single alert that is flagged their way.