Artificial intelligence for IT operations (AIOps) is transforming how IT ops teams manage alerts and remediate incidents, and it’s also on a path to reimagine the DevOps pipeline through continuous alert and incident management. AIOps not only uses data science and computational techniques to automate common and routine operational tasks, but also ingests metrics and uses inference models to extract actionable insights from data. IT operations teams get a contextual view of service health, and automation makes end-to-end service uptime, from monitoring to alerting to remediation, much easier to maintain. This automation is almost custom-made for a DevOps-enabled organization, where continuous deployment and operations management are standard operating procedure.
It Starts with Easier Alerting
A typical alert management workflow includes the following:
- Correlation of ingested alerts to a common cause,
- Triage and prioritization of alerts for resolution,
- Integration with ITSM tools for improved problem management, including the creation of incident tickets for issues that need further root cause analysis.
It seems pretty straightforward, but the problem is that each step in this workflow consumes human time. The average time spent across task flows, per ingested alert, increases disproportionately with the number of incoming alerts. AIOps delivers intelligent alerting and automation of these tasks, which means alerts don’t have to be handled by a human. Those time savings translate into cost savings, speed, efficiency, and all the other values typically associated with DevOps.
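To make the three workflow steps concrete, here is a minimal sketch of correlation, triage, and ticket creation. Everything in it is an assumption for illustration: the `Alert` fields, the use of a shared `service` as a stand-in for "common cause", the 1 to 5 severity scale, and the ticket payload are hypothetical and do not describe any particular AIOps or ITSM product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    resource: str   # host or device that raised the alert
    service: str    # hypothetical grouping key standing in for "common cause"
    severity: int   # assumed scale: 1 = critical ... 5 = informational
    message: str


def correlate(alerts):
    """Group alerts that share a service (a simple proxy for a common cause)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.service].append(alert)
    return dict(groups)


def triage(groups):
    """Order groups so the one containing the most severe alert comes first."""
    return sorted(groups.items(), key=lambda kv: min(a.severity for a in kv[1]))


def to_tickets(ranked, max_severity=2):
    """Build ticket payloads only for groups severe enough to need follow-up."""
    tickets = []
    for service, members in ranked:
        worst = min(a.severity for a in members)
        if worst <= max_severity:
            tickets.append(
                {"service": service, "severity": worst, "alert_count": len(members)}
            )
    return tickets


alerts = [
    Alert("web-01", "checkout", 1, "HTTP 5xx rate high"),
    Alert("web-02", "checkout", 2, "latency above threshold"),
    Alert("db-01", "reporting", 4, "disk 70% full"),
]
tickets = to_tickets(triage(correlate(alerts)))
print(tickets)
```

In this sketch the two checkout alerts collapse into a single ticket, while the low-severity reporting alert produces none. That is the kind of per-alert human effort the workflow is meant to remove.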
AIOps Workflow: From Alerts to Inferences, Inferences to Models
As shown in Figure 1 below, AIOps continuously learns patterns and applies learned models against incoming alert streams to make sense of cascading and parallel impacts. It groups related alerts into inferences based on the learning models.
Inference models drive these inferences within the system. They let users set filter criteria and apply an analytical model to a particular type of IT resource across applications and infrastructure stacks. IT and DevOps teams can then manage these inferences instead of addressing individual alerts, reducing the “noise” that users need to sift through in everyday operations. And they can build these inferences to operate continuously and contextually, supporting a CI/CD pipeline.
Some common inference models include:
- Topology. Topology inference models use the relationships between IT services and the underlying infrastructure to build inferences. They identify the root cause alerts for an incident with the right situational context and impact analysis.
- Clustering. Cluster-based inference models use attributes to drive insights by analyzing similarities and correlating different alerts into one inference alert.
- Co-occurrence. These models use alert sequence patterns for existing alerts to correlate alerts and identify the root cause(s) for an incident.
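As an illustration of the topology approach from the list above, the sketch below groups alerting resources under the highest alerting resource they depend on. The dependency map, the resource names, and the rule "the topmost alerting ancestor is the root cause" are assumptions made for this example. A real product would discover topology automatically and apply richer logic. The map is also assumed to be acyclic.

```python
from collections import defaultdict

# Hypothetical dependency map: resource -> the resource it depends on.
DEPENDS_ON = {
    "checkout-app": "web-01",
    "web-01": "switch-a",
    "db-01": "switch-a",
    "switch-a": None,
    "batch-01": None,
}


def root_cause(resource, alerting):
    """Walk up the dependency chain and return the highest alerting ancestor.

    If no ancestor is alerting, the resource is its own root cause.
    """
    root = resource
    current = DEPENDS_ON.get(resource)
    while current is not None:
        if current in alerting:
            root = current
        current = DEPENDS_ON.get(current)
    return root


def build_inferences(alerting_resources):
    """Group alerting resources into one inference per root-cause resource."""
    alerting = set(alerting_resources)
    inferences = defaultdict(list)
    for resource in sorted(alerting):
        inferences[root_cause(resource, alerting)].append(resource)
    return dict(inferences)


inferences = build_inferences(
    ["checkout-app", "web-01", "db-01", "switch-a", "batch-01"]
)
for root, members in inferences.items():
    print(root, "<-", members)
```

Five alerts become two inferences here: one rooted at the switch, covering everything downstream of it, and one for the unrelated batch host.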
Policy-Driven Escalation and Remediation
Every alert represents a problem condition that must be fixed. AIOps addresses incidents based on a well-defined sequence of actions. For example, if a server becomes unavailable when a key application process (e.g. Apache) stops running, restarting that process can be a well-defined, automatable action to get the server working again. AIOps invokes scripts on alert triggers and executes remediation actions, totally unsupervised.
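A policy of this kind can be pictured as a lookup from an alert condition to a remediation command. The sketch below is illustrative only: the policy table, the alert dictionary keys, and the `process_down:apache2` condition string are hypothetical, and `systemctl restart apache2` assumes a systemd host where the Apache unit is named `apache2` (on some distributions it is `httpd`). The demonstration passes in a fake runner so that nothing is actually restarted.

```python
import subprocess

# Hypothetical policy table: (resource type, alert condition) -> command.
REMEDIATION_POLICIES = {
    ("server", "process_down:apache2"): ["systemctl", "restart", "apache2"],
}


def remediate(alert, run=subprocess.run):
    """Run the remediation mapped to this alert.

    Returns True if a policy matched and its command exited with status 0,
    False if no policy matched or the command failed.
    """
    key = (alert["resource_type"], alert["condition"])
    command = REMEDIATION_POLICIES.get(key)
    if command is None:
        return False  # no policy: leave the alert for escalation
    result = run(command, check=False)
    return result.returncode == 0


# Dry run with a stand-in for subprocess.run, so the example has no side effects.
class FakeResult:
    returncode = 0


executed = []


def fake_run(command, check=False):
    executed.append(command)
    return FakeResult()


alert = {"resource_type": "server", "condition": "process_down:apache2"}
print(remediate(alert, run=fake_run), executed)
```

Injecting the runner keeps the policy logic testable without touching a real host. In production the default `subprocess.run` (or a remote execution agent) would take its place.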
AIOps can also automate workflows for alerts that require escalation, human attention and/or investigation. For example, alerts on devices supporting business-critical IT services require notification of Level 1 support staff within five minutes of alert receipt. If the alert is from a server and for a specific application, an IT or DevOps user will need to create an incident and route it to the relevant application team. AIOps takes care of this immediately with alert escalation workflows that help program first-response actions for notification and incident creation. Again, this can occur completely unsupervised – no human interaction required – once these policies are established.
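The two escalation rules in the example above can be sketched as a small function that turns an alert into first-response actions. The field names (`business_critical`, `resource_type`, `application`), the action dictionaries, and the `<application>-team` naming are assumptions for illustration. Only the five-minute Level 1 window comes from the text.

```python
from datetime import datetime, timedelta

NOTIFY_WITHIN = timedelta(minutes=5)  # Level 1 notification window from the text


def escalate(alert, received_at):
    """Return the first-response actions an escalation policy would schedule."""
    actions = []
    if alert.get("business_critical"):
        actions.append(
            {
                "action": "notify",
                "target": "level-1-support",
                "deadline": received_at + NOTIFY_WITHIN,
            }
        )
    if alert.get("resource_type") == "server" and alert.get("application"):
        actions.append(
            {
                "action": "create_incident",
                # Hypothetical routing convention: one team per application.
                "assign_to": f"{alert['application']}-team",
            }
        )
    return actions


received = datetime(2024, 1, 1, 12, 0)
alert = {
    "business_critical": True,
    "resource_type": "server",
    "application": "billing",
}
for action in escalate(alert, received):
    print(action)
```

Once rules like these are encoded, the notification and incident creation need no operator involvement. An alert that matches neither rule simply yields an empty action list.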
What’s more, policy-driven AIOps correlates dependencies based on downstream resources or establishes an algorithm-based correlation to address groups of alerts continuously. This frees up much of the time typically spent sifting through alert floods, figuring out what to do with them, and then doing it. Advanced AIOps tools use native instrumentation to determine how frequently specific alert sequences occur. Plus, alert escalation policies can auto-assign incidents using prior alert, incident, and notification data.
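One simple way to picture "how frequently specific alert sequences occur" is to count ordered pairs of alert types that fire close together in time. The sketch below is an assumption about how such counting could work, not a description of any vendor's instrumentation. The event tuples, alert type names, and 60-second window are all hypothetical.

```python
from collections import Counter


def sequence_counts(events, window_seconds=60):
    """Count ordered pairs (a, b) where alert b fires within the window after a.

    `events` is an iterable of (timestamp_seconds, alert_type) tuples.
    """
    ordered = sorted(events)
    counts = Counter()
    for i, (t_a, a) in enumerate(ordered):
        for t_b, b in ordered[i + 1:]:
            if t_b - t_a > window_seconds:
                break  # later events are even further away
            if a != b:
                counts[(a, b)] += 1
    return counts


events = [
    (0, "switch_down"), (10, "host_unreachable"), (20, "app_timeout"),
    (300, "switch_down"), (305, "host_unreachable"),
    (900, "disk_full"),
]
counts = sequence_counts(events)
print(counts.most_common(1))
```

Pairs that recur often, such as a switch failure followed by unreachable hosts in this sample data, are natural candidates for a correlation rule or an auto-assignment policy.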
The Overall Impact Is Enormous
AIOps proactively monitors system health, reduces alert storms, remediates issues quickly, and escalates automatically. It’s the latest and greatest technology for IT Operations teams. But DevOps teams can also use AIOps to analyze event streams in real time, extract meaningful insights from events for continuous improvement, drive faster deployments and better collaboration, and reduce downtime with proactive detection.