How AI Can Supercharge Observability

Observability has become much more complex of late, and certainly far more complicated than in the early days of IT monitoring, when everything ran on a mainframe and the flow of logs and other monitoring data could be easily collected and visualized.
Even more recently, as apps moved to the center of most organizations, things were still a lot simpler than they are today. However, in our current world of Kubernetes, microservices and serverless, things look very different. Imagine taking a hammer to that easily observable flow of yesteryear and watching it shatter into hundreds of pieces; yet all these little pieces must remain tightly interconnected and continuously communicate.
This is essentially what happened with the initial introduction of abstraction and virtualization. Then Kubernetes came along, with its ephemeral, fast-moving, and distributed nature adding numerous layers of complexity. Here, everything became even more complicated to manage, and far more difficult to monitor and troubleshoot; many people were left wondering what exactly they’d gotten themselves into. We might ask ourselves — does everything actually need to be this complex?
It’s understandable to long for old times, but here we are, and as a result of this sea change, observability is never going to be the same.
Tracing the Boundaries of ‘Modern’ Observability
First, let’s back up and cover some basic principles, starting with a definition. Observability, within the context of our cloud infrastructure and applications, is the art of inspecting software and making data-driven decisions to monitor and fix production systems. It’s critical to note that these decisions should be focused on specific outcomes and service-level objectives (SLOs), not just on ongoing monitoring, alerting and troubleshooting.
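To make the idea of an outcome-focused decision concrete, here is a minimal sketch of one common way to reason about an SLO: computing how much of the error budget is left. The function name, the availability target, and the request counts are illustrative assumptions, not something taken from a specific platform.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget.

    slo_target is the required success ratio (for example 0.999).
    A negative result means the budget has been overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        # No traffic yet, so nothing has been spent.
        return 1.0
    # Number of failures the SLO tolerates for this amount of traffic.
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


# Hypothetical numbers: 99.9% target, one million requests, 400 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"error budget remaining: {remaining:.0%}")
```

With these assumed numbers the SLO tolerates about 1,000 failures, so roughly 60% of the budget remains. A single figure like this is easier to act on than a raw stream of alerts.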
Then, let’s consider the art of designing a reliable observability system in today’s world, where coding or infrastructure problems have morphed into big data problems. This now also entails finding ways to drive efficiency in the compute, networking, and storage demands of the observability systems themselves. More data doesn’t mean greater insight.
As it turns out, abstraction, virtualization, and microservices were only the tip of the iceberg. With the advent and ongoing adoption of AI tools that generate code, such as Copilot, CodeWhisperer, and more, it is becoming nearly impossible for humans to process, analyze, and correlate billions of different events to understand whether the code they’re writing is performing as it should. Again, observability is a looming big data conundrum.
Even if engineers have the skills to understand observability signals and analyze telemetry data, and that talent is hard to come by, the sheer volume of data to sort through is staggering. And the fact is, the majority of that mountain of data is not particularly useful in generating key insights into the performance of mission-critical systems.
More doesn’t mean better. Meanwhile, most popular observability solutions suggest that the big data problem should be solved by attacking the massive data pipeline and its complexity with a long list of complex capabilities and additional tools, all of which come with a sizable price tag that grows with the volume of data. But there is hope.
Enter the AI Observability Era
Observability in the modern era of microservices and AI-generated code doesn’t have to be prohibitively complicated or expensive, and yes — the growing use of AI provides significant promise. The same large language models (LLMs) driving AI-powered code offer a new approach to observability.
How does this work? LLMs are becoming adept at processing, learning from, and recognizing patterns in large volumes of repetitive textual data, which is precisely the nature of log data and other telemetry in highly distributed and dynamic systems. LLMs can answer basic questions and make useful deductions, assumptions, and predictions.
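The repetitiveness of log data can be illustrated without any model at all. The sketch below is an assumption about how one might prepare logs before handing them to an LLM, not a description of any product's pipeline: it masks the variable parts of each line (IP addresses, long hex IDs, numbers) so that similar lines collapse into one template, then counts the templates. The sample lines and the masking rules are hypothetical.

```python
import re
from collections import Counter

# Masks are applied in order, most specific first (illustrative rules only).
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b[0-9a-f]{8,}\b", re.IGNORECASE), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)?(?:ms|s)?\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable tokens so repeated messages share one template."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def summarize(lines, top=5):
    """Return the most frequent templates with their counts."""
    counts = Counter(to_template(l.strip()) for l in lines if l.strip())
    return counts.most_common(top)


# Hypothetical log lines, invented for this example.
SAMPLE = [
    "GET /cart 200 in 12ms from 10.0.0.7",
    "GET /cart 200 in 48ms from 10.0.0.9",
    "payment failed for order 8841 after 3 retries",
    "payment failed for order 9120 after 2 retries",
    "GET /cart 200 in 7ms from 10.0.0.7",
]

for template, count in summarize(SAMPLE):
    print(count, template)
```

Five raw lines reduce to two templates here. A compact summary of this kind is far cheaper to pass to a model as context than the raw stream, which is one plausible way to keep the data volume in check.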
This approach is not perfect: LLMs are not yet built for real-time use, and they are not accurate enough to be fully relied on to determine the full context needed to troubleshoot every observability quandary. However, it’s far easier to start with an LLM, get a baseline of what is occurring, and derive helpful advice than it is to expect humans to understand and contextualize the volume of machine-generated data in a reasonable timeframe.
As such, LLMs are very relevant to solving observability issues. They’re designed to work with text, and to analyze it and provide insight. Integrated into an observability platform, they can apply those same strengths to telemetry data and provide meaningful recommendations.
We believe that one of the biggest values of LLMs in this sense is to better enable human practitioners who may not have the highest degree of technical savvy, empowering them to work with the huge amounts of complex data that need to be addressed. Most production issues that require recovery leave enough time for an LLM to chip in and help, based on historical contextual data. In this way, LLMs are capable of making observability simpler and more cost-efficient.
Meanwhile, as powerful as AI in observability is becoming today, there are more interesting and transformative opportunities on the horizon. What’s coming next are LLMs that will help you write queries and investigate issues in natural language instead of cryptic query languages. That is another huge boon to users of all levels, and even more so to those with less hands-on experience, including line-of-business stakeholders.
Instead of relying on seasoned users who understand all the related information, people will be able to write queries in terms of common parameters and, importantly, in the natural language of business unit executives. This unlocks observability for a wide range of new processes and stakeholders, beyond production engineers.
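A rough sketch of what natural-language querying might look like under the hood follows. Everything in it is an assumption made for illustration: the field names, the simplified `field:value AND field:value` syntax (not any real product's query language), and the `complete` callable that stands in for whichever LLM client is used. The point it tries to show is that the model's answer is validated against a known schema before it is run, since LLM output cannot be fully trusted.

```python
import re
from typing import Callable

# Hypothetical schema of searchable fields.
KNOWN_FIELDS = {"service", "level", "status", "region"}

_TERM = re.compile(r"^(\w+):(\S+)$")


def build_prompt(question: str) -> str:
    """Describe the allowed query shape and fields to the model."""
    fields = ", ".join(sorted(KNOWN_FIELDS))
    return (
        "Translate the question into a query made of field:value terms "
        "joined by AND.\n"
        f"Allowed fields: {fields}\n"
        f"Question: {question}\n"
        "Query:"
    )


def validate(query: str) -> str:
    """Reject any term that is malformed or uses an unknown field."""
    terms = [t.strip() for t in query.strip().split(" AND ")]
    for term in terms:
        match = _TERM.match(term)
        if match is None or match.group(1) not in KNOWN_FIELDS:
            raise ValueError(f"unsupported term: {term!r}")
    return " AND ".join(terms)


def question_to_query(question: str, complete: Callable[[str], str]) -> str:
    """Ask the model for a query, then validate it before use."""
    return validate(complete(build_prompt(question)))


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return "service:checkout AND level:ERROR"


print(question_to_query("Which checkout errors happened today?", fake_model))
```

Passing the model in as a plain callable keeps the sketch independent of any vendor SDK, and the validation step means a hallucinated field fails loudly instead of silently returning nothing.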
At Logz.io, we’ve begun the work of integrating with LLMs and we’re now working diligently on exciting capabilities across our platform designed to tap into this emerging set of AI capabilities. We believe that this is the next wave of critical innovation that will provide essential observability for organizations looking to meet their mounting data challenges. While pressing issues of cost and complexity persist in the market, we believe that this gives everyone a lot of reasons to be optimistic.