Machine Learning for Operations
Managing infrastructure is a complex problem with a massive amount of signals and many actions that can be taken in response; that’s the classic definition of a situation where machine learning can help. Adoption of MLOps or AIOps (as Gartner has christened this trend) has been slow, perhaps because making the most of it requires automation to apply its recommendations, and at least in part because IT is naturally conservative, given the need to ensure availability. Silos between IT teams, like separating service management from performance management, also make it hard to gather all the necessary data for effective machine learning. But the potential is significant and interest is growing.
It’s not just that AIOps can help with availability and performance monitoring, event correlation and analysis, IT service management, help desk and customer support, and infrastructure automation. It’s also part of the general ‘shift left’ DevOps trend where operations become an integrated part of app development and delivery. That means becoming increasingly responsive and proactive, but it also means improving the communications and coordination between teams, and connecting the data silos. With more applications to operationalize, monitor and support, and more of these using microservices, containers and cloud services that multiply the amount of infrastructure that needs attention, machine learning is becoming a key tool in keeping up.
IT and operations is a natural home for machine learning and data science. According to Vivek Bhalla, until recently a Gartner research director covering AIOps and now director of product management at Moogsoft, if there isn’t a data science team in your organization the IT team will often become the “center of excellence”.
By 2022, Gartner predicts, 40 percent of all large enterprises will use machine learning to support or even partly replace monitoring, service desk and automation processes. Today, only a small minority have started.
In a recent Gartner survey, the most popular uses of AI in IT and operations are analyzing big data (18 percent) and chatbots for IT service management: 15 percent are already using chatbots and a further 30 percent plan to do so by the end of 2019. About 8 percent are using predictive analytics to prevent failure (with 34 percent planning to), 6 percent are already using AI for application performance management (and 23 percent plan to) and 5 percent use it for network monitoring and diagnostics (31 percent plan to do that). About 5 percent already use AI to optimize placing workloads in public cloud; with only 19 percent planning to do that in future, that’s a less popular direction. Only 4 percent use it for improving root cause analysis already, though that’s the second most popular planned use at 40 percent (after the 42 percent planning less specific big data analysis). A third also plan to use AI for general IT and operations optimization and intelligent automation.
“Look at the repetitive, low-level tasks that are ripe for automation to free up the time of operations staff, lowering their stress levels and letting them use that extra bandwidth to work smarter,” Bhalla said at a recent Moogsoft event.
Increasingly, that’s being done with off-the-shelf solutions rather than “homegrown” implementations using tools like Logstash and Elastic X-Pack for the ELK stack, as those off-the-shelf products mature from backward-looking analysis and visualization using averaged-out data, to more real-time approaches using streaming and wire data as well as stored logs. Log analysis tools like Splunk and Sumo Logic have been adding machine learning options for extracting patterns and anomalies from historical data, alongside metrics, visualizations of real-time service health with alerts when anomalies are detected, and automation options. Micro Focus’ Operations Bridge adds anomaly detection and clustering of related alerts to IT monitoring, and Sematext Cloud does anomaly detection with machine learning from performance metrics and logs. Similarly, tools like Moogsoft AIOps and OpsRamp OpsQ started with automated real-time pattern discovery and are extending that to stored historical data.
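The pattern-extraction step these log tools perform can be sketched very simply: mask the variable fields in each log line so lines collapse into templates, then surface lines whose template is rare. This is a minimal illustration of the idea, not any vendor’s actual algorithm; the regexes and threshold are assumptions for the example.

```python
import re
from collections import Counter

def template(line: str) -> str:
    """Reduce a log line to a pattern by masking variable fields."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)  # hex values
    line = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", line)     # numbers, IP octets
    return line

def rare_patterns(lines, max_count=1):
    """Return lines whose masked template occurs at most max_count times."""
    counts = Counter(template(l) for l in lines)
    return [l for l in lines if counts[template(l)] <= max_count]

logs = [
    "connected to 10.0.0.1 port 443",
    "connected to 10.0.0.2 port 443",
    "connected to 10.0.0.3 port 443",
    "disk failure on /dev/sda1 code 0x7f",
]
print(rare_patterns(logs))  # -> ['disk failure on /dev/sda1 code 0x7f']
```

The three “connected” lines collapse into one frequent template, so only the one-off disk failure survives as an anomaly candidate.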
Machine learning is also arriving in existing tools like network monitoring and management tools; for example, Juniper’s AppFormix analytics and optimization platform uses network telemetry to detect anomalies like higher latency or lower bandwidth than expected on a link. As well as sending alerts, it can make a REST call to the network controller to take the link down for maintenance and reroute traffic elsewhere. Morpheus uses machine learning to place your workloads in VMs, containers or in different public clouds to save money; that includes pausing and shutting down workloads automatically via ServiceNow requests.
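The detect-then-act pattern described above can be sketched in a few lines: compare recent link latency against a historical baseline and, if it deviates far enough, trigger a remediation call. The sigma threshold and the controller endpoint shown in the comment are hypothetical; a real SDN controller’s API will differ.

```python
import statistics

def link_degraded(baseline_ms, recent_ms, threshold_sigma=3.0):
    """Flag a link when its recent latency sits far above the historical baseline."""
    mean = statistics.mean(baseline_ms)
    stdev = statistics.stdev(baseline_ms)
    return recent_ms > mean + threshold_sigma * stdev

baseline = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]  # historical latency samples (ms)
if link_degraded(baseline, 25.0):
    # Hypothetical remediation call -- the real endpoint depends on the controller:
    # requests.post("https://controller.example/api/links/42/drain")
    print("drain link and reroute traffic")
```

In production this decision would feed an alerting pipeline first, with the automated REST call gated behind policy.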
From Correlation to Causation
Visualization and statistical analysis of historical data are what Gartner views as a reactive approach; you can look back and understand what has happened using machine learning, either for general performance understanding or for root cause analysis. As you move to the combination of historical and live data with machine learning and causal analytics, operations teams can become more proactive with predictive warning systems. If AI-powered systems are going to predict problems and even automate fixes, they need to do more than spot patterns; they need to understand them.
Simply detecting which alerts and errors come from the same event can be very valuable, reducing the flood of noise to something useful. “IT systems generate vast quantities of self-describing data but the data streams generated tend to be highly redundant,” Moogsoft Chief Technology Officer Will Cappelli told the New Stack. “Stripping out that redundancy turns something that’s voluminous but information poor into something thinner, information rich.” That can reduce up to 95 percent of the data volume. Moogsoft’s new Observe tool (which you can deploy without the rest of the Moogsoft AIOps platform) takes time series data and metrics data and then throws away everything that isn’t an anomaly or its context.
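The “throw away everything that isn’t an anomaly or its context” idea can be illustrated with a toy filter over a metric series: keep only the points that cross a threshold, plus a few neighbouring points for context. This is a simplified sketch of the concept, not Moogsoft Observe’s implementation; the threshold and context width are assumptions.

```python
def anomalies_with_context(series, threshold, context=2):
    """Keep only anomalous points, plus `context` neighbours on each side."""
    keep = set()
    for i, value in enumerate(series):
        if value > threshold:
            keep.update(range(max(0, i - context), min(len(series), i + context + 1)))
    return [(i, series[i]) for i in sorted(keep)]

series = [1, 1, 1, 1, 9, 1, 1, 1, 1, 1]
print(anomalies_with_context(series, threshold=5))
# -> [(2, 1), (3, 1), (4, 9), (5, 1), (6, 1)]
```

Here a ten-point series shrinks to five points: the spike and its surroundings, which is exactly the kind of volume reduction Cappelli describes.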
Correlating events created in the same time period (taking latency into account), using the physical and application topology of the IT system, and comparing the text streams for related text (with the option for customers to write their own rules) aggregates the alerts so they’re more manageable. (OpsQ and BigPanda’s L0 do similar kinds of correlation and aggregation for visibility and noise reduction.)
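Two of those correlation signals, time proximity and text similarity, can be combined in a simple greedy grouping: an alert joins an existing group if it arrives within a window of the group’s first alert and its message resembles that group’s first message. The window and similarity cutoff here are illustrative assumptions, not any product’s tuned values.

```python
import difflib

def correlate(alerts, window=60, similarity=0.5):
    """Greedily group (timestamp, message) alerts by time window and text similarity."""
    groups = []
    for ts, msg in sorted(alerts):
        for group in groups:
            t0, m0 = group[0]
            close = ts - t0 <= window
            alike = difflib.SequenceMatcher(None, msg, m0).ratio() >= similarity
            if close and alike:
                group.append((ts, msg))
                break
        else:
            groups.append([(ts, msg)])
    return groups

alerts = [
    (0,   "db-1 connection timeout"),
    (5,   "db-2 connection timeout"),
    (300, "disk full on web-1"),
]
print(correlate(alerts))  # two groups: the timeouts together, the disk alert alone
```

A production system would add the topology signal (are the sources near each other in the infrastructure graph?) as a third test before merging.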
The next level is causal analysis, Cappelli explained. “You wind up with an envelope of correlated data items that you have some reason to think are related, and then we introduce causal analytics. Which of these data elements are pointing to events that are the causes of other events pointed to by other data items in this package?” Some of that is done by probabilistic root cause analysis using statistical machine learning, which is common in AIOps tools from IBM, BigPanda, Elastic and Splunk. “We look at packets of correlated data and we can structure this package causally based on neural networks.”
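One crude stand-in for probabilistic root cause analysis: score each event type in the current incident by how often that type was the earliest event in past incidents. Real tools use far richer statistical models (and, per Cappelli, neural networks), so treat this purely as an illustration of ranking candidate causes from history; the event names are made up.

```python
from collections import Counter

def rank_root_causes(incident_events, history):
    """Rank current event types by how often each was the first event in past incidents."""
    first_seen = Counter(min(past, key=past.get) for past in history)
    total = sum(first_seen.values())
    return sorted(
        ((e, first_seen[e] / total) for e in incident_events),
        key=lambda pair: -pair[1],
    )

# Each historical incident maps event type -> seconds after incident start.
history = [
    {"db_latency": 0, "api_errors": 12, "queue_backlog": 30},
    {"db_latency": 0, "api_errors": 8},
    {"api_errors": 0, "queue_backlog": 15},
]
print(rank_root_causes({"db_latency", "api_errors"}, history))
# db_latency ranks first: it led two of the three past incidents
```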
Moogsoft also uses a second causality system; a recently developed vertex entropy algorithm using graphs. “There are ways to look at topological information and figure out which nodes in the topologies are most likely to be where important events take place. By looking at the local connectedness of a node with a couple of other nodes you can get a sense of how critical a node is in the graph. Once you’ve figured out which nodes are the important ones, that’s a big clue that says data from that node presents a root cause or is playing a causal role.”
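The idea of scoring a node’s importance from the local connectedness of its neighbours can be approximated with a Shannon entropy over the neighbours’ degree distribution. Moogsoft’s actual vertex entropy algorithm is proprietary; this is just one simple proxy for the concept on a toy graph.

```python
import math

def vertex_entropy(adjacency, node):
    """Shannon entropy of the degree distribution over a node's neighbours --
    a rough proxy for how structurally critical the node's locale is."""
    degrees = [len(adjacency[n]) for n in adjacency[node]]
    total = sum(degrees)
    probs = [d / total for d in degrees]
    return -sum(p * math.log2(p) for p in probs)

# A hub ("core") connecting four leaf nodes.
graph = {
    "core": {"a", "b", "c", "d"},
    "a": {"core"}, "b": {"core"}, "c": {"core"}, "d": {"core"},
}
scores = {n: vertex_entropy(graph, n) for n in graph}
print(max(scores, key=scores.get))  # -> core
```

The hub scores highest (entropy 2.0 over four equally likely neighbours) while each leaf scores zero, matching the intuition that data from the hub is the better root cause candidate.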
Going from understanding what caused an incident to fixing it is still a big leap. For now, Moogsoft bundles up the causally related data into a “situation” (rather like a ServiceNow super ticket) and puts it into a collaborative workspace called a situation room that suggests who has the right skills to work on the problem and tries to guide them to an effective solution.
“A situation isn’t just a notification of an event, it’s an analysis of what the event means,” Cappelli said. “Our algorithms identify different situations and look at how this situation is similar to another situation so that things that were done to fix that situation can be applied. That gives you the ability to preserve institutional knowledge about how problems were dealt with and to learn from things that have taken place in the past. If there’s a new team in a new situation going down a path that’s been proven to be fruitless, we’ll guide them in a different direction.”
The situation rooms don’t include runbook automation; instead, they connect to tools like Puppet, Chef and Ansible. But that means AIOps can potentially take you all the way from a flood of raw events to the service management setting where you can solve the problems.