This article is a post in a series on bringing continuous integration and deployment (CI/CD) practices to machine learning. Check back to The New Stack for future installments.
With orchestration and monitoring playing such key roles in DevOps, the emerging trend of using artificial intelligence (AI) to support and even automate operations roles by delivering real-time insights about what’s happening in your infrastructure seems an obvious fit.
DevOps is about improving agility and flexibility; AIOps should be able to help by automating the path from development to production, predicting the effect of deployment on production and automatically responding to changes in how the production environment is performing. That’s especially true as trends like microservices, hybrid cloud, edge computing and IoT increase the complexity of app infrastructures — and the number of logs that you might have to look at to find the root cause of an issue, and the number of people who need to be in a conference call or chat room tracking down what’s gone wrong and how to fix it.
AIOps depends on aggregating data from multiple systems and DevOps relies on integrating previously siloed systems. AIOps requires the same kind of culture change as DevOps because it means looking at the entire system rather than specific technologies or infrastructure layers, and being comfortable with a high level of automation.
The promise of AIOps is that it can detect anomalies, predict performance problems and deviations from the baseline, suggest optimizations, correlate signals across multiple platforms for troubleshooting, do root cause analysis and even automate fixes if you’re comfortable with that.
“Machine learning and AI technologies will help provide guidance on where teams should focus their energy on optimizing the workflow as well as provide insights into fluctuations in performance and demand,” Josh Atwell, a senior technology advocate at Splunk, told The New Stack. “Combined with earlier issue detection, AI and machine learning (ML) will allow teams to optimize resources, increase deployment speed, and improve site reliability.”
But while there’s a rash of AIOps tools aimed at traditional operations teams, and significant interest in them, it’s harder to find options specifically designed for DevOps. A high proportion of your DevOps pipeline might be automated when it comes to building, testing and pushing code, but decision-making is still mostly down to humans looking at error codes, logs or visualization tools like New Relic’s Kubernetes Cluster Explorer.
Some DevOps tools are starting to add machine learning-powered analysis. If you’re monitoring web applications with the Azure Application Insights service, the Smart Detection feature can email you when its machine learning detects unusual numbers of failed requests or performance anomalies in response or page load time.
A number of the AIOps tools designed for traditional operations can also cover some key DevOps systems: ScienceLogic S1 monitors both on-premises and cloud systems, and BMC TrueSight and OpsRamp can monitor infrastructure across multiple clouds. For OpsRamp, that includes Kubernetes services like AKS, EKS and GKE, and the next release will support more Cloud Native Computing Foundation projects for container management and other Kubernetes services.
Nastel’s AutoPilot application performance monitoring uses ML to correlate events and data from multiple systems across hybrid cloud, on-premises and mobile systems, monitoring user experience as well as transactions and performance, and it can connect to GitHub repos. Cisco is working on an AppDynamics Serverless Agent to collect metrics and events from Java microservices running in AWS Lambda, but it’s still in a private beta.
“There are few tools in the DevOps world that exploit anything identifiable as AIOps functionality,” Moogsoft Chief Technology Officer Will Cappelli told The New Stack. “The DevOps team is getting telemetry and applying at best some visualization software; their analyses are largely being made with their own eyeballs.”
Partly that’s because the traditional ops world is an easier target. “The IT operations community has a greater sense of urgency; they’re more aware of the issues because they are dealing with large complex messy infrastructures all the time, they see the complexity of the underlying infrastructure directly — and it has become unbearable. DevOps teams tend to be very focused on the upper layers of the stack and they tend to only look at those elements of the infrastructure that are directly implicated in the applications they’re delivering.”
That naivety about the way the infrastructure can impact the performance of applications — or the way applications can affect the performance of another part of the infrastructure — won’t last, Cappelli believes. “As DevOps becomes more pervasive, the ability to ignore the overall environment they’re delivering functionality will decrease and the urgency of deploying AIOps will increase.”
DevOps Isn’t Painful Enough
There is definitely potential for AIOps in DevOps, says Atwell, starting with the same consolidation of alerts and notifications it delivers for traditional ops. “The biggest trend for AI in DevOps is focused around reducing noise in the digital exhaust of software development processes. This is enabling development, platform, and SRE teams to focus energy on issue prevention and environment optimization instead of manually trying to make sense of all of the incoming data.” The real-time insights AIOps promises also fit well with frequent deployments.
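To make the idea of alert consolidation concrete, here is a minimal sketch in Python. It is not how any vendor named in this article does it: the `Alert` fields, the `consolidate` function and the five-minute window are all assumptions chosen for illustration, and a simple time-window rule stands in for the machine learning a real AIOps tool would apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; the field names are illustrative only."""
    timestamp: float  # seconds since some reference point
    service: str
    check: str
    message: str


def consolidate(alerts, window_seconds=300.0):
    """Group alerts that share (service, check) into one incident when each
    arrives within window_seconds of the previous alert in that incident."""
    incidents = []      # list of lists of Alert, in order of first arrival
    open_incident = {}  # (service, check) -> index of its latest incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        idx = open_incident.get(key)
        if idx is not None and alert.timestamp - incidents[idx][-1].timestamp <= window_seconds:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            open_incident[key] = len(incidents) - 1
    return incidents
```

With this rule, a burst of identical latency alerts from one service collapses into a single incident, while the same alert recurring an hour later opens a new one.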
The next step will be more proactive suggestions. “They will be able to provide intelligent guidance on changes that should be made in the code or in the environment. This will be based on assessing data from the environment and the testing tools. The system will review historical data to develop baselines and evaluate the system against those baselines regularly.”
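The baseline comparison Atwell describes can be sketched very simply. The example below assumes a single numeric metric (say, response time in milliseconds) and uses a plain z-score against historical samples. The function name and the threshold of three standard deviations are illustrative assumptions, and production tools would likely use more sophisticated models that account for seasonality and trends.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, threshold=3.0):
    """Return True when `current` is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold
```

For a history of `[100, 102, 98, 101, 99]`, a new reading of 103 stays within the baseline, while a reading of 120 is flagged.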
For testing, machine learning could eliminate false positives when static code analysis runs at check-in, and it could check the components used in the code base for vulnerabilities and replace them with updated versions that fix the problem, Bhanu Singh, OpsRamp vice president of product development and cloud operations, told us.
It could also reduce the amount of code testing required for each code push. “If you have a complex system with hundreds of thousands of test cases your entire test suite might take two hours to run, but machine learning could decide which test to run and not to run so you can get your change to production faster. Based on the change coming, it knows exactly which modules and microservices are getting impacted, it has picked up five use cases commonly used by customers, it runs through the relevant tests and if it doesn’t find anything it can push the build to the next stage.”
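As a rough illustration of change-based test selection, the sketch below picks only the tests that touch the modules in a change. It is a deterministic stand-in, not the machine learning approach described above: the `coverage_map` (which tests exercise which modules) is a hypothetical, hand-written input here, whereas a learning system would infer that mapping, and the priority of each test, from past runs and production usage.

```python
def select_tests(changed_modules, coverage_map):
    """Return, in sorted order, the names of tests whose covered modules
    overlap with the modules touched by the incoming change."""
    changed = set(changed_modules)
    return sorted(
        test for test, modules in coverage_map.items()
        if changed & set(modules)
    )


# Hypothetical mapping from test name to the modules it exercises.
coverage_map = {
    "test_checkout": ["cart", "payments"],
    "test_login": ["auth"],
    "test_search": ["catalog"],
}

tests_to_run = select_tests(["payments"], coverage_map)
```

Here a change to the `payments` module selects only `test_checkout`, so the rest of the suite can be skipped for that push.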
AIOps could also monitor the results of that code push. “Maybe the test cases passed but in production it causes a latency issue between two microservices or there was a timeout or a warning message in the log. The system could learn from these anomalies and next time it would reject the build.”
AIOps could also improve runbook automation, suggests Cappelli. “There’s a massive amount of ultimately deterministic automation in the DevOps process and because it is complex and rigid it is prone to error. The environment is changing constantly and runbooks get out of date.” Having AIOps tools analyze incoming telemetry and modify runbooks to make them context-aware could remove a deployment bottleneck as well as a source of errors.
The Right Data to Learn From
Some of these are similar to existing AIOps features and just need data that’s already available. But taking things further, to identifying critical anomalies and understanding causality, will require much more knowledge about developer behavior and telemetry that’s often not being gathered from the production environment, Cappelli said.
Application logging done today is focused on telling developers what’s going on, Atwell pointed out. “Going forward, logs have to be generated that better inform operations and support. The data that will be most valuable early in AIOps will be system performance metrics and logging. This information provides baselines as well as increases the ability to predict potential outages with increasing confidence. Logging value will correlate with logging quality.”
With that data, Atwell predicts that AIOps will support DevOps with “value stream optimization, automation, workload management, and quicker security and bug identification.”
The list covers the full spread of DevOps because what AIOps is doing, as Cappelli says, is “going beyond automation to automating insight.” And as DevOps becomes more mainstream, there will be more demand for tools that deliver that.