Data centers store a tremendous (some would say ridiculous) volume of logs. There is too much data for even the smartest administrator to make immediate sense of. For analytics to remain helpful, it needs to be smarter than it is now. If logging at this granular a scale is to be of use to anyone before, during, or after a network event, the logic behind it needs to be capable of ascertaining causes and effects as they happen.
Machine learning may be one way to provide that intelligence.
Two years ago, market analysts thought it a little weird that a communications service provider such as Alcatel-Lucent would buy out, rather than just license, a customer experience management tool called Motive. Did the parent company of Bell Labs suddenly want to enter the CRM software space along with Salesforce? When A-L was asked directly, it responded directly: It wanted to put Motive’s analytics tool to work reasoning out the root causes of network failures on a global scale.
It was one of those science fiction ideas that so many folks treated as science fiction that they didn’t realize it had already become science fact: Machine learning could be put to use in diagnosing the causes of data center failures and performance degradation, and furthermore, could become so familiar with the patterns of traffic and their underlying sources that it could predict when failures may occur in the future.
Last year, a company called SIOS Technology already made some headway in this department, with the introduction of an analytics tool called SIOS iQ (small “i,” big “Q”). It measures the behavior patterns of vSphere environments, to pin down the causes of performance issues with virtual machines. Remediation is a part of this business that SIOS iQ is easing its way into. There’s still work to be done tying together the diagnosis with the remedy — especially when the automation capable of applying remedies is wrapped up with CI/CD.
Making this connection happen may be a matter of bringing more DevOps professionals, with broader skill sets, into the mix. On Tuesday, SIOS announced an expansion of iQ designed to detect issues with SQL Server, including the newly released SQL Server 2016, when running in VMware environments. For now, the goal of this expansion is to notify database and IT admins as to when performance issues can and will occur.
“Our goal is to be the first stop where the IT admin goes, to try to understand what the status of the environment is, where the problems are, where to look first, and to instantaneously identify what to do about them,” explained SIOS CEO Jerry Melnick, “rather than running around the room, collecting all the people who know something about it, collecting all the information, and trying to sort through a problem. We wrap it all up, analyze it, figure out what’s wrong, and tell you what to do about it.”
SIOS envisions iQ, at least for now, primarily as a visualization tool. Many of the product enhancements made over the last year have concentrated on making diagnoses more graphical and more digestible to a human observer. Arguably, it takes a knowledgeable human being to be able to appreciate an artificially intelligent system diagnosis. What’s more, it’s the dashboard that sells an analytics package to the people in the enterprise who are typically responsible for purchasing, as recent surveys by software development analysts attest. While AI functionality is considered “cool,” operations managers want to see results first. So SIOS has concentrated thus far on results.
But the next step, Melnick admits, is down the road a bit. While the next round of SIOS iQ enhancements has been made generally available, it won’t be until the second half of the year that SIOS begins dealing with the issue of automation. Currently, SIOS iQ can render remedial suggestions, once again visually. But the opportunity exists here for AI and machine learning to become integrated with the CI/CD frameworks and deployment pipelines that run data centers at scale today.
The SIOS CEO (who was promoted to the post last October from COO) showed us an example of a situation where iQ’s current dashboard highlights an SQL Server behavior that is clearly outside the norm. Under a box marked “Symptoms,” it shows the problem automatically with pink half-tone and projects how much latency the problem has added to the current workload, over and above the norm.
As with any machine learning platform, “the norm” is something that the system has to be trained to learn. Historically, the problem with machine learning systems as network performance analysts is that they’re typically installed when “the norm” is already a problematic situation. As a result, when the behavior of some systems improves to a more desirable state, an ML algorithm may in some cases treat the improvement as an anomaly and send out an alert.
Over the last year, we learned, SIOS has worked to remedy this situation. As CEO Melnick told The New Stack, its current iQ algorithms are now seeded with anonymized behavior patterns from real-world customers — so iQ has at least a better idea of how “normal” should appear, than if it opens up its view of the data center with a completely blank slate.
“We monitor across the application stack, as well as the infrastructure,” said Melnick. “Even without application-specific awareness, we are looking into the virtual machine and monitoring its activities, from the data that we have — storage, network, compute, CPU, and memory utilization. Those are all important metrics, so the patterns today correlate all of those things.” Conceivably, he said, if an application performance monitor is capable of rendering even deeper metrics, there are ways to train iQ to accept them as well.
Given the inclusion of all these factors, the learning cycle for SIOS iQ, the CEO said, takes about seven days, which is much shorter than we’d heard last year.
“We actually import what we call ‘standard operating patterns,’” he said. “Over time, these are replaced with what we’ve learned in your environment. So you will get more ‘issues’ — I don’t want to call them ‘events’ or ‘alerts’ — they’re issues that we’ve identified in our patterns outside of what should be a standard system, for the first seven days. By the end of Day 7, we’re going to settle down into patterns of behavior that we’ve recorded and learned.”
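The hand-off Melnick describes, from imported patterns to locally learned ones over seven days, can be illustrated with a minimal Python sketch. This is not SIOS's algorithm: the linear blending, the three-standard-deviation threshold, and every name in it (`Baseline`, `blend`, `is_issue`) are assumptions made for illustration. Only the seven-day window comes from the article.

```python
from dataclasses import dataclass

WARMUP_DAYS = 7  # from the article; everything else here is an assumption


@dataclass
class Baseline:
    """Hypothetical summary of 'normal' for one metric, e.g. latency in ms."""
    mean: float
    std: float


def blend(seed: Baseline, learned: Baseline, day: int) -> Baseline:
    """Linearly shift weight from the seeded pattern to the learned one."""
    w = min(max(day, 0), WARMUP_DAYS) / WARMUP_DAYS  # 0.0 on day 0, 1.0 from day 7
    return Baseline(
        mean=(1 - w) * seed.mean + w * learned.mean,
        std=(1 - w) * seed.std + w * learned.std,
    )


def is_issue(value: float, baseline: Baseline, threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations from the mean."""
    if baseline.std == 0:
        return value != baseline.mean
    return abs(value - baseline.mean) / baseline.std > threshold


# Illustrative numbers only: a seeded pattern versus what this site looks like.
seed = Baseline(mean=5.0, std=1.0)
learned = Baseline(mean=12.0, std=2.0)

print(is_issue(11.0, blend(seed, learned, day=0)))  # True: judged against the seed
print(is_issue(11.0, blend(seed, learned, day=7)))  # False: judged against local norms
```

The same reading counts as an issue on day 0 and as normal on day 7, which matches the behavior Melnick describes: more issues surface during the first week, before the learned patterns take over.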
At the end of each month, iQ runs a rigorous retraining cycle that refreshes its knowledge base and intensifies its training concerning normal and abnormal behaviors. The system operator or DevOps professional may take part personally in this process, making the learning semi-supervised.
Melnick knows that developers have a special expectation for events, as opposed to “issues.” An event, just like in object-oriented programming, is a trigger for a procedure. The mechanism that will eventually deliver a means of automating that procedure does exist, Melnick told us. But the activation of that mechanism into what some could call a “virtual robot” — a software-based rendition of an automated systems analyst and operator — is part of what he describes as the product roadmap, the starting point for which falls sometime during the second half of 2016.
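To make the developer's sense of "event" concrete, here is a minimal Python sketch of an event acting as a trigger for a procedure. It does not reflect anything in SIOS iQ. The `EventBus` class, the `latency_issue` event type, and the `rebalance` handler are all hypothetical names invented for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

# A handler is the "procedure"; it takes the event payload and reports what it did.
Handler = Callable[[Dict[str, object]], str]


class EventBus:
    """Hypothetical registry that maps event types to procedures."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, event_type: str, handler: Handler) -> None:
        """Register a procedure to run whenever `event_type` is emitted."""
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: Dict[str, object]) -> List[str]:
        """Run every registered procedure; unknown event types run nothing."""
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


def rebalance(payload: Dict[str, object]) -> str:
    # Placeholder remedy; a real one would call into deployment tooling.
    return f"rebalance {payload['vm']}"


bus = EventBus()
bus.on("latency_issue", rebalance)

print(bus.emit("latency_issue", {"vm": "sql-01"}))  # ['rebalance sql-01']
print(bus.emit("disk_issue", {"vm": "sql-01"}))     # []
```

An "issue" in Melnick's sense is something shown to a person. It becomes an "event" in the developer's sense only once a procedure like `rebalance` is registered to fire on it.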
It’s not just a “sometime” thing that’s some date out of the blue, he assures us. SIOS conducts product refreshes in six-week sprints now, he said, so the July 29 version 3.8 release will include improved forecasting that goes way beyond linear regression. At that time, the product will be capable of rendering projected general states of behavior in four categories: performance, efficiency, reliability, and capacity.
“You’ll see, each day, the level of criticality of those issues,” said Melnick. “By using a machine learning technology, our projections are based on actually learned patterns of behavior, of your environment, over time, across all the tiers of computing, and is projecting how they’re changing over time, and how on a particular day that environment will be impacted.”
Conceivably, having such a forecast not just on hand but recorded as data could change the way remedial automation addresses these events if and when they occur. For example, forecasted data could serve as parameters, informing automated routines as to how they should respond in the absence of real-time data, on account of the very issues being experienced.
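As a rough illustration of that idea, the Python sketch below shows an automated routine falling back on a recorded forecast when live telemetry is unavailable. It is speculative, as the paragraph above is. The function name, the 20 ms limit, and the "migrate" remedy are assumptions for illustration, not features of SIOS iQ.

```python
from typing import Optional


def choose_action(
    live_latency_ms: Optional[float],
    forecast_latency_ms: float,
    limit_ms: float = 20.0,  # hypothetical service-level limit
) -> str:
    """Pick a remedy from live data, or from the forecast if live data is missing."""
    if live_latency_ms is None:
        latency, source = forecast_latency_ms, "forecast"
    else:
        latency, source = live_latency_ms, "live"

    if latency > limit_ms:
        return f"migrate (based on {source} data)"
    return f"no action (based on {source} data)"


# Telemetry is down, possibly because of the very issue being experienced.
print(choose_action(None, forecast_latency_ms=35.0))  # migrate (based on forecast data)

# Telemetry is healthy, so the forecast is ignored.
print(choose_action(8.0, forecast_latency_ms=35.0))   # no action (based on live data)
```

Recording which source drove the decision matters here, since an action taken on forecast data alone is one an operator may want to review afterward.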
For now, SIOS iQ’s ML capability has been developed for VMware environments. Containerized environments are a consideration for the company, but nothing definite is in the works. It’s fair to assume that whether SIOS goes forward in the future with an iQ for containerized platforms, or for orchestrators like Mesos or Kubernetes, depends on how well its experiments fare in the realm of the first generation of virtualization.
AI, contrary to what many purveyors of science fact say, takes a long time to gestate. But the more the community of developers participates in that process, the more that gestation may be helped along.
Feature Image: A printed magazine ad, circa 1967, for a “Talking Learning Machine” by Mattel, licensed under Creative Commons.