KubeCon+CloudNativeCon sponsored this podcast.
Debate continues in the industry about what observability is and, more specifically, what it should offer DevOps teams, especially those working in operations, who are often responsible for detecting the “unknown unknowns” that degrade system performance. In this episode of The New Stack Makers podcast, we discuss how observability can be made easier to use and more cost-effective.
Our guests are Bartek Plotka, a principal engineer at Red Hat, as well as the SIG observability tech lead for the Thanos project and a Prometheus maintainer; and Richard Hartmann, community director at Grafana, a Prometheus maintainer, OpenMetrics founder and a SIG observability chair member for the Cloud Native Computing Foundation.
Alex Williams, founder and publisher of The New Stack, hosted this episode.
Defining observability in a DevOps context, and what it offers, remains difficult because the topic is so broad. In many ways, observability has become a buzzword, which means “some of its meaning gets lost,” said Hartmann. Even so, it remains very useful in today’s often highly distributed environments.
Historically, the first definitions came from control theory, defined as “being able to make deductions about the internal state, or to deduce the complete internal state of a system just by looking at inputs and outputs,” said Hartmann.
The focus of observability is more “about making humans understand or to enable humans to understand what is actually happening, what might be broken and what is running well, and to actually work with the data and extract new knowledge about the system from the data,” said Hartmann.
“Instead of the old-style monitoring system with one graph, some threshold and that’s it, as it matures, I think there is more and more of an understanding of what it actually is: about making humans understand what is happening,” said Hartmann. “But you also have a ton of interfaces to machines… and as such, you can also make machines understand this data, and there are other buzzwords… But the point is treating system state and insight about system state as a little pipeline for the new source of information.”
For Plotka, observability has a more direct definition. Observability is “something more practical: the ability to essentially debug your application when it’s on fire, when you really don’t know what’s happening, why it stopped, why it’s so slow and why it doesn’t handle the amount of requests you would expect it to handle.”
What observability is required to do also changes depending on the organization’s IT infrastructure. Organizations with a microservices-centric infrastructure will typically require more tracing capabilities than organizations with legacy systems, Hartmann explained. Most organizations also rely on logs and metrics for their observability needs, while infrastructures operating at large scale will often require more metrics data.
Given the magnitude of data to manage at many organizations, it becomes very difficult to “have all of the details,” Hartmann said. “So, either you sample and you drop stuff, or you just introduce counters and other numeric data, which gives you a simplified view of what is going on,” said Hartmann. “The beauty of this is that you can already optimize a lot in favor of human understanding as well.”
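The trade-off Hartmann describes, dropping the raw events and keeping only counters and other numeric aggregates, is the core idea behind Prometheus-style metrics. The following is a minimal, dependency-free sketch of that idea; all names here are illustrative, not from the conversation:

```python
# Sketch: instead of storing every request in full detail ("all of the
# details"), keep simple numeric aggregates that stay cheap at any scale.
from collections import Counter


class RequestMetrics:
    """A simplified, Prometheus-style numeric view of request traffic."""

    def __init__(self):
        self.requests_total = Counter()  # per-status request counts
        self.latency_sum = 0.0           # total observed latency in seconds
        self.latency_count = 0           # number of latency observations

    def observe(self, status: str, duration_s: float) -> None:
        # The raw event (full request/response payload) is dropped here;
        # only the simplified numeric state survives.
        self.requests_total[status] += 1
        self.latency_sum += duration_s
        self.latency_count += 1

    def mean_latency(self) -> float:
        return self.latency_sum / max(self.latency_count, 1)


metrics = RequestMetrics()
for status, duration in [("200", 0.05), ("200", 0.15), ("500", 0.30)]:
    metrics.observe(status, duration)

print(metrics.requests_total["200"])      # 2
print(round(metrics.mean_latency(), 2))   # 0.17
```

The counters give a simplified but durable view of what is going on: individual requests are gone, yet error rates and average latency, the questions a human usually asks first, remain answerable at any traffic volume.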
Red Hat is a sponsor of The New Stack.