How Machine Learning Can Save You from Observability Overload

With the increased adoption of highly distributed, complex environments, including Kubernetes and microservices, DevOps teams need proper observability metrics more than ever to help them gauge and improve the health of their systems.
Observability metrics can be used, among other things, to find the “unknown unknowns” and improve the performance of systems deployed across multicloud environments, where multiple microservices and API connections run interdependently.
These metrics are not only essential for understanding how application performance and user experience can be improved; they can also save operations resources, for example by helping teams avoid overprovisioning.
And yet …
At the same time, DevOps teams are drowning in metrics, and it has become harder for teams to interpret and act on them. Organizations often lack the personnel they need to properly analyze the data and apply the information they’re gathering.
“Everyone’s got multiple data centers, multiple on-premises and traditional silos and they have multiple public cloud providers, edge locations and all these different things. At the same time, there’s a skills shortage everywhere within IT,” said Scott Sinclair, an analyst for ESG Global. “It’s becoming more and more difficult to figure out what your apps need.”
The Promise of Machine Learning
One promising solution to this DevOps challenge: using machine learning (ML) to interpret the often vast amounts of observability data.
ML can be used, for example, to identify concrete actions that can improve the performance of running applications. This can relieve a tremendous source of strain, freeing up staff who would otherwise have to manually parse the observability data.
Such an ML system, integrated with the organization’s preferred observability tool, would communicate the best courses of action to the operations team. An operations engineer would only have to validate one of a few ML-proposed options to improve the app’s performance and the environment it runs in.
The ML-based system would also process the observability data and provide specific ways to optimize the configurations of distributed Kubernetes and other environments. The automated optimization process should provide the information needed to make Kubernetes clusters, for example, run more efficiently and at a lower cost.
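As a rough illustration of the kind of analysis such a system automates, the minimal sketch below pulls a container’s CPU usage from Prometheus and derives a right-sized CPU request from a high percentile of observed usage. The endpoint, namespace and container names here are hypothetical, and a real product weighs far more signals than this; `container_cpu_usage_seconds_total` is a standard cAdvisor metric exposed in Kubernetes clusters.

```python
import time

import requests

PROM_URL = "http://prometheus.example.com:9090"  # hypothetical endpoint

def recommend_cpu_request(namespace: str, container: str,
                          headroom: float = 1.15) -> float:
    """Suggest a CPU request (in cores): p95 of 24h usage, plus headroom."""
    end = time.time()
    start = end - 24 * 3600
    query = (
        'rate(container_cpu_usage_seconds_total{'
        f'namespace="{namespace}",container="{container}"'
        '}[5m])'
    )
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": "60s"},
        timeout=10,
    )
    resp.raise_for_status()
    # Prometheus returns a matrix of [timestamp, "value"] pairs per series.
    samples = sorted(
        float(value)
        for series in resp.json()["data"]["result"]
        for _, value in series["values"]
    )
    if not samples:
        raise ValueError("no usage data returned for this container")
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return round(p95 * headroom, 3)
```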
“As applications become more diversified and distributed, especially with the rise of Kubernetes and microservices, it’s getting more and more difficult to figure out exactly what [the] infrastructure needs,” Sinclair said. “Most people assume they know and don’t really know, especially as environments can be more distributed.
“No one has extra time to go through and analyze your apps and figure out what they need, which is what an ML system does that can analyze your apps, figure out what they need and give you actionable results.”
The actionable results ML can provide include specific recommendations for tweaking systems. It can also analyze different scenarios to show the tradeoffs and contingencies when certain configuration settings are changed.
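A toy sketch of that scenario analysis, with entirely fabricated candidate configurations and predicted numbers, might rank options by a weighted latency-versus-cost score so an engineer can compare the tradeoffs at a glance:

```python
# Fabricated candidates an ML system might surface, each with a
# predicted p95 latency and monthly cost for the same workload.
candidates = [
    {"name": "baseline", "cpu": "1000m", "mem": "2Gi",
     "p95_ms": 120, "cost": 340},
    {"name": "cost-saver", "cpu": "500m", "mem": "1Gi",
     "p95_ms": 180, "cost": 170},
    {"name": "performance", "cpu": "2000m", "mem": "4Gi",
     "p95_ms": 90, "cost": 680},
]

def score(c, latency_weight=0.5, cost_weight=0.5):
    # Normalize each dimension against the worst candidate so the two
    # weights express a pure latency-vs-cost preference.
    worst_ms = max(x["p95_ms"] for x in candidates)
    worst_cost = max(x["cost"] for x in candidates)
    return (latency_weight * c["p95_ms"] / worst_ms
            + cost_weight * c["cost"] / worst_cost)

for c in sorted(candidates, key=score):  # best tradeoff first
    print(f"{c['name']}: {c['cpu']}/{c['mem']} -> "
          f"p95 {c['p95_ms']} ms, ${c['cost']}/mo")
```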
When ML processes observability data, Sinclair noted, “one of the more interesting outcomes is when infrastructure becomes more intelligent. It allows for you to analyze a certain app in order to optimize its performance or to do different things.”
More Than a Smart Algorithm
The concept of an ML system that can analyze observability metrics and generate actionable results for multicloud architectures is fairly straightforward. However, the algorithms running under the hood of such a computationally intensive ML system involve a high degree of complexity.
While many, if not most, organizations lack the resources to create such an ML system in-house, a proper system from a vendor should be able to easily demonstrate whether it works in a simple test-bed environment. Ops teams will know right away whether the tool is producing the useful, relevant and actionable results they need.
The ML system should also be easy to use. “Companies often don’t necessarily have time or the expertise to dive deep into how Kubernetes is allocating resources — they need something abstracting that all away from them, since they don’t have the time or the skills to manage that,” said John Platt, vice president of machine learning for StormForge.
“The ML should ‘watch’ the observability data, in some ways abstracting away the finer details involved in the balance of power and abstraction, so you can get stuff done with flexibility while making it work in any environment.”
In StormForge’s case, ML is applied to automating Kubernetes resource efficiency at scale. Its offering includes testing tools for continuous integration/continuous delivery (CI/CD), while its StormForge Optimize Live tool, released Wednesday, optimizes Kubernetes production environments through integration with performance testing and observability tools such as Datadog and Prometheus.
The tool allows Kubernetes infrastructure to be optimized through ML-generated instructions, such as tweaking CPU or memory configurations so that applications run more efficiently, or fixing performance issues.
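To make that concrete: once an engineer validates a recommendation, applying it can be as simple as patching a Deployment’s resource settings. The sketch below uses the official Kubernetes Python client; the deployment name, namespace and resource values are hypothetical, and this is a generic illustration rather than StormForge’s actual mechanism.

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

# Strategic-merge patch: containers are matched by name, so only the
# resource settings of the named container are changed.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "web",  # hypothetical container name
                    "resources": {
                        "requests": {"cpu": "750m", "memory": "1Gi"},
                        "limits": {"cpu": "1500m", "memory": "2Gi"},
                    },
                }]
            }
        }
    }
}

client.AppsV1Api().patch_namespaced_deployment(
    name="web", namespace="production", body=patch
)
```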
“By touching on more of the operations phase of the lifecycle where an application is actually in production with StormForge Optimize Live, we’re making recommendations to update the application configuration to make it run more efficiently,” said Rich Bentley, vice president for product marketing at StormForge.
“As a standard DevOps lifecycle process, there isn’t much of a base that exists today like this. But it’s our mission to make optimization a systematic, continuous process with ML that’s just part of the regular DevOps process.”
Reining in Costs
As mentioned previously, the essential task is for the ML system to process the immense amount of observability data that Kubernetes environments typically generate and to “dumb it down” so that the operations team can simply take action as needed, according to Charley Dublin, vice president of product management for Acquia, a platform provider for Drupal applications and an early StormForge Optimize Live adopter.
“With Kubernetes, there are so many variables involved and areas to optimize on a pod or cluster level,” Dublin said. “With tons and tons of data, the first thing that is required is for the ML to filter out the most relevant data that will affect either your performance or cost variable.
“It then must run through iterations of different kinds of tests to let you see what the configuration of resources would be when different options are selected, as the ML variables will indicate what affects performance and what affects costs.”
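As a toy illustration of that first filtering step, the sketch below ranks a handful of metric series by how strongly they correlate with a latency series and keeps only the top signals; all names and values are fabricated for the example, and it requires Python 3.10+ for `statistics.correlation`.

```python
# Python 3.10+ for statistics.correlation (Pearson's r).
from statistics import correlation

latency = [110, 130, 95, 160, 140, 100]  # fabricated p95 latency samples
metrics = {
    "cpu_throttling": [0.2, 0.5, 0.1, 0.8, 0.6, 0.15],
    "gc_pause_ms": [5, 6, 5, 7, 6, 5],
    "queue_depth": [3, 9, 2, 14, 11, 3],
}

# Rank metrics by the strength of their correlation with latency and
# keep only the top signals for the engineer (or optimizer) to act on.
ranked = sorted(
    metrics.items(),
    key=lambda kv: abs(correlation(kv[1], latency)),
    reverse=True,
)
for name, series in ranked[:2]:
    print(name, round(correlation(series, latency), 2))
```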
In other words, ML helps organizations reduce costs while optimizing resources, something that is often sorely lacking, even as they continue to pay surprisingly high cloud provider bills to run Kubernetes.
“Reducing costs, surprisingly, is often pushed to the wayside due to the pressure to optimize for speed and quality,” Sinclair said. “Typically the story you hear is you get a massive Amazon bill, and you say, ‘Oh my gosh, what are we spending on?’
“By being able to leverage ML tools like what StormForge offers, you are able to ensure your apps get what they need without overly burdening your people. But you’re also able to say, ‘We’re reducing the amount of budget taken up by our existing production apps.’”