The Path to Smarter AIOps: 3 Things ITOps Teams Should Know
IT teams have played a critical role this past year by enabling rapid digital innovation and empowering organizations to adapt quickly to changing customer and employee needs. This transformation is visible across every vertical, from healthcare’s rapid shift to telehealth and the service sector’s adoption of eCommerce to the widespread need to support remote work. Demand for digital transformation continues, and it will drive long-term success for enterprises that embrace it.
The digital services behind these innovations reside in dynamic, web-scale, open source environments characterized by multicloud complexity. As digital experiences become the norm, consumers now have options and are increasingly prepared to walk away if they encounter problems. As a result, ITOps teams are now under tremendous pressure to support efforts to drive faster innovation while “keeping the lights on,” which is becoming increasingly difficult amidst the growing complexity of their cloud environments.
AIOps has been touted far and wide as the answer to many of these challenges, helping to reduce alert noise and accelerate incident management. However, AIOps doesn’t have to be limited to these traditional use cases and can drive far greater value in accelerating digital innovation for organizations that adopt it with a more ambitious mindset.
Today’s multicloud, open source environments are increasingly defined by their complexity — a tangled web of interconnected cloud platforms, microservices, containers, serverless architecture, and orchestration platforms. The digital services they power often contain hundreds of millions of lines of code and billions of dependencies.
It’s simply beyond human capacity to manage these effectively in real time when it takes just a small issue in a single line of code to trigger a storm of alerts. How can ITOps teams manually find the root cause of a problem that affects multiple services and raises thousands of events per second? Legacy monitoring systems only add to the confusion with false positives that can distract teams from focusing on the issues that matter.
AIOps provides a different approach to IT operations because it is designed to intelligently identify issues as they appear, highlight their impact on the business and automate their resolution. When done right, AIOps can cut through the noise caused by voluminous alerts to locate the root cause of problems and take appropriate action before users even realize there’s been an issue. However, not all AIOps are created equal.
Identifying the Type of AIOps That’s Right for You
There are two distinct types of AIOps. Traditional AIOps follows a statistical approach that correlates metrics, events and alerts from multiple infrastructure monitoring, application performance management and other tools to build a multidimensional model of the system being analyzed. This machine learning-style approach produces a set of correlated alerts, which ITOps teams still need to work through manually to identify the root cause of events and resolve them.
This approach can be slow because it needs to collect a sizable amount of data to train the algorithm. It also takes time to “learn” the rules and understand the actions that need to be taken in response to events. It can take weeks, if not months, for a system to be trained to the point where it can be trusted with monitoring business-critical applications in production. That may work in static on-premises environments where the rules are always the same, but this approach struggles to keep up with dynamic multicloud environments where change is the only constant.
If the dataset is in constant flux, machine learning AIOps systems can easily misread the signs and make the wrong call. Plus, as ITOps teams still need to get involved to make final decisions and act on insights, many of the efficiency benefits that AIOps is supposed to deliver, such as the ability to work automatically in the background while teams focus on higher value tasks, are lost.
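To make the correlation step concrete, here is a minimal Python sketch. It is deliberately simplified: real statistical AIOps tools rely on learned models, whereas this example only groups alerts that arrive close together in time. The Alert type, the correlate_by_time function and the 60-second window are all assumptions made for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical field)
    entity: str       # affected component (hypothetical field)
    timestamp: float  # seconds since an arbitrary epoch


def correlate_by_time(alerts, window_seconds=60.0):
    """Group alerts that occur within window_seconds of the previous alert.

    This is a stand-in for statistical correlation: it only says which
    alerts happened together, not which one caused the others.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)   # close enough to the last alert: same group
        else:
            groups.append([alert])     # gap too large: start a new group
    return groups
```

Note that the output is still a list of grouped alerts. Someone on the ITOps team has to inspect each group to decide what actually went wrong, which is the manual step described above.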
A Deterministic Approach
Machine learning AIOps is most useful for helping ITOps teams manage multiple events from different monitoring solutions, within a single UI, and focus on the most critical alerts. However, it falls short of the real-time root-cause analysis and self-healing capabilities that fully automated AIOps aims to achieve.
This is the promise of the second key type of AIOps, which is built around deterministic AI. In this approach, the AIOps platform ingests raw observability data from across an entire multicloud environment, performs a step-by-step fault tree analysis and provides precise, actionable insights in real time.
Here are three key factors that contribute to successful deterministic AIOps:
1. Fault Tree Analysis
Deterministic AIOps should use the kind of fault tree analysis commonly found in safety engineering. If, for example, an application is responding to search requests too slowly, the deviating metric (in this case, response time) will trigger the fault tree analysis. The monitored entity (in this case, the application) will be the starting node in the tree. The system will then check all of this entity’s dependencies, such as third-party calls or backend requests, for anomalies.
Any dependency that has been cleared will form a leaf node, and those showing anomalies will be investigated further. This process continues, with vertical (service to process to host) and horizontal (service to service, or process to process) dependency analysis, until a root cause has been found. Pinpointing the root cause in this way is more likely to yield the most appropriate remediation action, allowing ITOps teams to automate the response and establish a self-healing process.
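The traversal described above can be sketched in a few lines of Python. This is an illustrative simplification rather than any vendor's implementation: the find_root_causes function, the dictionary-based dependency map and the is_anomalous callback are all hypothetical, and the sketch assumes the starting entity has already been flagged as anomalous.

```python
def find_root_causes(start, dependencies, is_anomalous):
    """Walk the dependency graph from an anomalous starting entity.

    dependencies maps each entity to the entities it depends on, covering
    both vertical (service -> process -> host) and horizontal
    (service -> service) relationships. Healthy dependencies are treated
    as cleared leaf nodes and are not explored further. An anomalous entity
    with no anomalous dependencies is reported as a root cause.
    """
    root_causes = []
    visited = set()
    stack = [start]
    while stack:
        entity = stack.pop()
        if entity in visited:
            continue  # guard against cycles in the dependency graph
        visited.add(entity)
        anomalous_deps = [d for d in dependencies.get(entity, []) if is_anomalous(d)]
        if anomalous_deps:
            stack.extend(anomalous_deps)  # keep investigating deeper
        else:
            root_causes.append(entity)    # nothing below explains the anomaly
    return root_causes


# Hypothetical example: a slow search application.
dependencies = {
    "search-app": ["search-service", "auth-service"],
    "search-service": ["search-process", "index-db"],
    "search-process": ["host-1"],
    "index-db": ["host-2"],
}
anomalous = {"search-app", "search-service", "index-db", "host-2"}

print(find_root_causes("search-app", dependencies, lambda e: e in anomalous))
```

In this made-up example, the walk clears auth-service and search-process as healthy leaves and follows the anomalous path through search-service and index-db down to host-2.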
2. Topology Mapping
Deterministic AIOps requires a topology model of the organization’s cloud infrastructure and application deployment to ensure it can make accurate data-driven decisions. While machine learning AIOps can gradually create a topology model from ingested data and metadata, deterministic AIOps needs a real-time map from the start. As a result, the AIOps solution must have the ability to automatically and continuously discover the cloud environment as change occurs and ingest any raw data — including metrics, logs, events and traces — from every entity within that topology model. This provides the real-time observability that ITOps teams need to drive successful cloud automation through AIOps.
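As a rough illustration of what a continuously updated topology model involves, the Python sketch below keeps an in-memory map current by applying discovery events as they arrive. The Topology class and the event schema (entity_added, entity_removed, runs_on, calls) are assumptions invented for this example; real platforms discover and represent topology in their own ways.

```python
from collections import defaultdict


class Topology:
    """In-memory topology model kept current by discovery events (hypothetical schema)."""

    def __init__(self):
        self.entities = {}             # entity id -> kind ("service", "process", "host", ...)
        self.runs_on = {}              # vertical edges: entity id -> id of the entity hosting it
        self.calls = defaultdict(set)  # horizontal edges: entity id -> ids of entities it calls

    def apply(self, event):
        """Update the model from one discovery event."""
        event_type = event["type"]
        if event_type == "entity_added":
            self.entities[event["id"]] = event["kind"]
        elif event_type == "runs_on":
            self.runs_on[event["id"]] = event["target"]
        elif event_type == "calls":
            self.calls[event["id"]].add(event["target"])
        elif event_type == "entity_removed":
            self._remove(event["id"])
        else:
            raise ValueError(f"unknown event type: {event_type}")

    def _remove(self, entity_id):
        """Drop an entity and every edge that points to or from it."""
        self.entities.pop(entity_id, None)
        self.calls.pop(entity_id, None)
        for targets in self.calls.values():
            targets.discard(entity_id)
        self.runs_on = {
            source: target
            for source, target in self.runs_on.items()
            if entity_id not in (source, target)
        }

    def dependencies_of(self, entity_id):
        """Return the entity's vertical and horizontal dependencies, sorted by id."""
        deps = set(self.calls.get(entity_id, ()))
        if entity_id in self.runs_on:
            deps.add(self.runs_on[entity_id])
        return sorted(deps)
```

Because every change is applied as it is discovered, a call such as dependencies_of reflects the environment as it currently stands, which is what a fault tree analysis needs when it walks from one entity to the next.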
3. Knowing the Root Cause
It’s important to remember that there are two types of root cause: technical and foundational. A technical root cause describes the specific technology incident, like a CPU spike in a running process, while a foundational root cause explains what led to that result, like a new deployment of the application code. To uncover the foundational root cause effectively, the AIOps solution must be able to browse through the history or change log of the entity identified as the technical root cause.
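One simple way to picture the change-log lookup is shown in the Python sketch below. Given the entity identified as the technical root cause and the time of the incident, it returns the most recent recorded change to that entity within a lookback window. The ChangeEvent type, the latest_change_before function and the one-hour default window are assumptions for illustration. A change that immediately precedes an incident is a candidate for the foundational root cause, not proof of it.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    entity: str       # entity the change was applied to
    timestamp: float  # seconds since an arbitrary epoch
    description: str  # e.g. "deployed build 42" (free-form, hypothetical)


def latest_change_before(
    change_log: Iterable[ChangeEvent],
    entity: str,
    incident_time: float,
    lookback_seconds: float = 3600.0,
) -> Optional[ChangeEvent]:
    """Return the most recent change to entity within the lookback window.

    Returns None when the change log holds no change for that entity
    between incident_time - lookback_seconds and incident_time.
    """
    earliest = incident_time - lookback_seconds
    candidates = [
        change
        for change in change_log
        if change.entity == entity and earliest <= change.timestamp <= incident_time
    ]
    return max(candidates, key=lambda change: change.timestamp, default=None)
```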
AIOps clearly has significant potential for solving many of the biggest challenges that ITOps teams face in their efforts to keep pace with the needs of modern organizations and embrace a more DevOps-based culture. However, it’s important they not be blinded by the hype surrounding the technology and that they realize that not all AIOps are created equal. If they want to achieve fully automated AIOps with real-time root cause analysis and application self-healing capabilities, deterministic AI-based approaches are critical. With this kind of AI driving their AIOps, ITOps teams will finally be able to harness the full power of automation to drive faster innovation and digital business success.