Observability Is the New Kubernetes
“Observability is as popular as Kubernetes these days.”
That’s how moderator Raghavan “Rags” Srinivas of DataStax kicked off the observability state-of-the-union panel at this year’s KubeCon+CloudNativeCon North America. Indeed, OpenTelemetry is either the second or third biggest project at the Cloud Native Computing Foundation, after Kubernetes. And with good reason. Kubernetes makes it easier to build complex, more distributed systems. That in turn demands more observability, in order to understand the “unknown-unknown” behavior of those systems.
The widespread adoption of Kubernetes has also led to a level of standardization that promotes a common language and common data around service orchestration. Service mesh has allowed for measurement of requests flowing between services. Together, these make it easier to embrace the practice of observability. Now the struggle is how to handle all of the data that comes with this increasingly expensive telemetry, along with its often immense cost. And how to get started with all of it.
Srinivas was joined by Honeycomb’s Liz Fong-Jones, Red Hat’s Bartek Plotka, Josh Suereth of Google’s OpenTelemetry team, and Polar Signals’ Frederic Branczyk. Fong-Jones is also on the OpenTelemetry governance committee, while Plotka and Branczyk are Prometheus maintainers. This piece is an amalgamation of points discussed on the panel and in the conversation that continued in the vibrant CNCF observability Slack channel, all on how to do observability smartly and what still needs to happen.
How Much Is Too Much Knowledge?
The question that led to many others, in the live chat and in Slack afterward, came from Srinivas: “Is there too much data? Observability can get very expensive. When is too much too much? When trying to sort signal from noise, how do you ensure that you aren’t throwing away hidden meaning? What steps forward have we made on automating that?”
The tension between high-quality, granular observability and the cost of the associated compute and storage is real. And, really, do you need to know everything that’s happening in your application?
Fong-Jones argued that sampling can be a really effective way to ensure you keep every error but only one out of every hundred or thousand successes.
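A minimal Python sketch of that sampling rule, assuming each event is a dict with an `error` field. The `should_keep` helper and its parameters are hypothetical names for illustration, not something named on the panel. It keeps every error and roughly one in N successes, and returns the sample rate so that kept successes can be weighted back up when counting.

```python
import random


def should_keep(event, success_rate=100, rng=random.random):
    """Decide whether to keep an event; return (keep, sample_rate).

    Errors are always kept (sample rate 1). Successes are kept with
    probability 1/success_rate. `rng` is injectable so the decision
    can be made deterministic in tests.
    """
    if event.get("error"):
        return True, 1
    keep = rng() < 1.0 / success_rate
    return keep, success_rate


def estimated_total(kept_events):
    """Estimate the original event count from (event, sample_rate) pairs."""
    return sum(rate for _event, rate in kept_events)
```

Recording the sample rate alongside each kept event is what lets aggregate counts stay approximately correct after most successes are dropped.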
She also pointed to the duplication that often goes on when teams emit the same data to logging, tracing and metrics platforms. “Better to have fewer higher quality signals than to try to get one of everything.”
She also recommended using tracing or structured logs, not unstructured logs, to limit tag spew into metrics protocols. And rather than centrally indexing logs, keep them locally on machines and use sampled traces and metrics as your centrally indexed types.
The synergy of observability and feature flagging, Fong-Jones continued, is really powerful. This allows you to attach feature flag values to attributes, and you can also flag on and off more verbose trace events.
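As a rough illustration of that idea in Python, the sketch below uses a hypothetical `Span` stand-in rather than any real tracing library. The `feature_flag.` attribute prefix and the `verbose_tracing` flag name are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Stand-in for a trace span; not a real tracing API."""
    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def handle_request(flags):
    span = Span("handle_request")
    # Attach every flag value so traces can later be grouped by flag state.
    for name, value in flags.items():
        span.attributes[f"feature_flag.{name}"] = value
    # A flag can also switch more verbose trace events on and off.
    if flags.get("verbose_tracing", False):
        span.events.append("cache lookup details")
    return span
```

With the flag values on the span, a regression that only appears when a flag is on shows up as a difference between two groups of traces.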
As this particular conversation continued heavily into the CNCF Slack chat, Suereth said, “There’s a cost to observability signals, and to some extent it’s based on how much compression/sampling you can reasonably do. Metrics (cheaper than) [sampled] Traces (cheaper than) Logs. You could just use structured logs and infer all the other observability signals from it, but you’d be paying a high ingestion/processing cost (possibly).”
Branczyk echoed that: “I think it’s important not to obsess over one type of data, or collection mechanism, protocol or even query language. Focus on the questions that you are asking and then choose the right tool. The reality is each type of data tends to be good at a certain aspect, so choose the right tool for the job!”
Plotka argued that devs shouldn’t even be the ones deciding; they should be negotiating with sysadmins and standard tooling. “Ideally everything is dynamic. I love the idea to bring the logging, metrics and tracing signals [together]; reduce duplication, so do just one of them.
“Use, signals, web components — at the end you have consistent statistics to rely on, but it doesn’t mean you need to know about every signal function from the past,” Plotka said.
Fong-Jones added that if it’s truly high volume, you can use metrics and export counters in a separate thread or continuous profiling.
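One way to picture counters exported from a separate thread is sketched below in Python. The `Counter` class, `start_exporter` function and `export` callback are hypothetical names for illustration. The hot path only increments an in-memory value, and a background thread periodically drains it and hands the total to the exporter.

```python
import threading


class Counter:
    """Thread-safe in-memory counter that can be drained to zero."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def add(self, n=1):
        with self._lock:
            self._value += n

    def drain(self):
        """Return the current value and reset the counter to zero."""
        with self._lock:
            value, self._value = self._value, 0
        return value


def start_exporter(counter, export, interval_s, stop_event):
    """Call export(count) every interval_s seconds until stop_event is set."""

    def run():
        # Event.wait returns False on timeout and True once the event is set.
        while not stop_event.wait(interval_s):
            export(counter.drain())
        export(counter.drain())  # final flush on shutdown

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

The request path pays only for a lock and an addition; the cost of sending data anywhere is moved onto the exporter thread.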
Branczyk defined continuous profiling as sampling “what it is exactly that your code is executing. Lucky for us instrumentation techniques like eBPF have made the collection part quite lightweight, so it’s ‘just’ storage that needs to be solved.”
Then, How Do You Get Started with Observability?
More data, more learning curve. Even if Prometheus and other tools allow you to get started with observability almost out of the box, and even if there were tooling that allowed you to turn things on and off, you wouldn’t necessarily have historical data. And the average observability newcomer wouldn’t understand the data.
Srinivas asked the KubeCon panelists to offer ways to get people new to observability walking, not even running yet. It all starts with your service-level objective, or SLO. Suereth said to set up your logs and metrics, or logs and traces (not all three), to monitor SLOs, so you can start understanding where things are going slow and then dive into root causes.
You have to crawl before you can run. He defined crawling as ops-based and reactive (Can I cope when a system goes down?) versus running (Can I evaluate whether a feature improved my business?).
Suereth’s aha moment with observability came when his team was demoing to a VP at Google and everything was running awkwardly slowly. Why was that device specifically slow? They used observability to pinpoint the responsible subcomponent in a massive system of subsystems, by identifying one particular behavior of a specific type of user, hidden among the statistics.
It is essential that teams do what they can to reach that aha moment early on.
For Fong-Jones, it was “Not having to write new code and to have insights into things that happened in the past. How we answer questions that we didn’t anticipate when we originally wrote the code.”
She said that any real success with observability happens in production, so a first step has to be shortening release cycles.
Later, as the conversation moved to Slack, she suggested starting with coverage of just one or two microservices; you don’t need complete coverage yet.
“We advise starting as close to users/ingress as possible, and adding new trace spans as needed to diagnose problems deeper in [the] stack.”
An attendee also asked in Slack whether there’s a way to automatically narrow the scope of data in order to detect a problem, such as with machine learning. That does sound ideal, but it would automate away much of the learning process.
Branczyk responded, in the continued Slack conversation, that “The reality is that understanding your data is a lot more effective than any type of machine learning or something like that. Start simple, start with an SLO on errors and latency on our load balancer. For a lot of organizations this is enough to alert on. Start with this and then expand it from a load balancer to application level.”
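A simple SLO check on load balancer errors and latency, along the lines Branczyk describes, might look like the Python sketch below. The record shape (`status`, `latency_ms`), the `slo_report` name and the default thresholds are all assumptions chosen for illustration, not values from the panel.

```python
import math


def slo_report(requests, max_error_rate=0.01,
               latency_target_ms=300, latency_quantile=0.99):
    """Summarize a window of load balancer requests against a simple SLO.

    Each request is a dict with an HTTP `status` and a `latency_ms`.
    A 5xx status counts as an error. Latency uses the nearest-rank
    percentile at `latency_quantile`.
    """
    total = len(requests)
    if total == 0:
        return {"error_rate": 0.0, "latency_ms": 0.0, "alert": False}

    errors = sum(1 for r in requests if r["status"] >= 500)
    error_rate = errors / total

    latencies = sorted(r["latency_ms"] for r in requests)
    rank = max(1, math.ceil(latency_quantile * total))
    latency = latencies[rank - 1]

    alert = error_rate > max_error_rate or latency > latency_target_ms
    return {"error_rate": error_rate, "latency_ms": latency, "alert": alert}
```

The same function could later be pointed at application-level request records, which matches the suggested path of starting at the load balancer and expanding from there.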
Shift Left: The Next Steps for Observability
So where will observability head in the next two to five years?
Fong-Jones said the next step is to support developers in adding instrumentation to code, which means striking a balance between easy, out-of-the-box defaults and annotations and customizations per use case.
Suereth said that the OpenTelemetry project is heading in the next five years toward being useful to app developers, where instrumentation can be particularly expensive.
“Target devs to provide observability for operations instead of the opposite. That’s done through stability and protocols.” He said that observability right now, as with Prometheus, is much more focused on operations than on developer languages. “I think we’re going to start to see applications providing observability as part of their own profile.”
Suereth continued that the OpenTelemetry open source project has an objective to have an API with all the traces, logs and metrics with a single pull, but it’s still to be determined how much data should be attached to it. “Should everything as possible be able to abstract from the start? But then you don’t wanna turn it on. Moving forward a baseline [is needed] that’s good enough for each of the standard questions.”
Fong-Jones pointed out that this is the next step in the world of shifting left for shorter feedback cycles, although she is worried that, like DevOps, the concept of observability will become diluted.
Branczyk said standardized protocols are incredibly powerful for edge cases, and the next step is for open standards and wire protocols to be standardized.
He sees the biggest challenge as cultural. “Observability is definitely still growing. At a lot of companies, even when I talk to people in the Kubernetes space, a lot of them aren’t even doing monitoring. We are in a bubble and there’s a lot of education of the market that needs to be done.”
Branczyk continued that, for observability old hands, while logs, metrics and tracing are really useful signals, there’s still so much more data out there.
“We are talking about clusters, but sometimes they are so small and running in Internet of Things devices around the globe,” he said. “We need tools for those cases too.”
Much of the potential of observability is still unknown, but that’s kind of the point, isn’t it?