How OpenTelemetry Can Serve as Observability’s Missing On-Ramp
The use of observability tools and techniques to gain insight into application performance has become more critical as infrastructure grows ever more complex. As organizations increasingly deploy their applications on a patchwork of multicloud and on-premises infrastructures, observability offers visibility not only into how applications are running, but into how they are changing and how they are interacting. Observability can also help detect potential problem areas and drill down into how issues affect each user’s individual experience.
But as with any technology many organizations are keen to adopt, impediments to observability’s adoption exist. Tracing, metrics and logs, for example, are often confused with observability itself, rather than understood as the data sources that supply the raw material from which observability’s insights are extracted.
Organizations will often mix and match different observability tools to meet the very specific needs of their applications and infrastructure. All told, the Cloud Native Computing Foundation’s “End User Technology Radar on Observability,” released in September 2020, reported that half of the companies surveyed were using at least five observability tools, and a third of respondents used more than 10.
Still, none of these tools works well without high-quality telemetry: performance data captured at the source of whatever is being measured. Yet high-quality telemetry has often come at the cost of developer time spent instrumenting manually, or of vendor lock-in through proprietary agents. To meet this problem head-on, the CNCF OpenTelemetry project offers vendor-neutral integration points that help organizations obtain the raw material (the “telemetry”) that fuels modern observability tools, with minimal effort at integration time.
“In this environment, the interoperability provided by OpenTelemetry is crucial because it helps breed innovation as new tools are built for previously unsolved problems,” Cheryl Hung, vice president of ecosystem at Cloud Native Computing Foundation (CNCF), said. “Interoperability also helps companies adopt new tools as they can be reassured that the formats and standards will be cross-compatible and avoid lock-in to a single tool or vendor.”
Consisting of loosely coupled APIs, protocol specifications, SDKs and infrastructure components, OpenTelemetry also allows backend access to telemetry data without installing additional software or instrumenting new code when adopting a tool from a new vendor. The OpenTelemetry Collector is a lightweight piece of infrastructure that offers an interchange and multiplexing point for telemetry data, operated either as an agent or as a horizontally scalable pool.
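As a sketch of that interchange role, a minimal Collector configuration might wire one receiver into multiple backends. This example is illustrative only: the endpoint is a placeholder, and exact configuration keys vary by Collector version.

```yaml
# Minimal OpenTelemetry Collector pipeline sketch (illustrative only).
receivers:
  otlp:                # accept telemetry over the OTLP protocol
    protocols:
      grpc:

processors:
  batch:               # batch data before export to reduce overhead

exporters:
  logging:             # print telemetry to stdout, handy for debugging
  otlphttp:
    endpoint: "https://telemetry.example.com"  # placeholder backend URL

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlphttp]
```

Because the pipeline, not the application, decides where data goes, swapping vendors becomes a configuration change rather than a re-instrumentation effort.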
OpenTelemetry’s standardization of telemetry libraries, which enables auto-instrumentation of application code that meets OpenTelemetry specifications, is especially important, Torsten Volk, an analyst for analyst firm Enterprise Management Associates (EMA), said. “Auto-instrumentation addresses the significant risk of developers not fully instrumenting their code, and therefore, creating monitoring blind spots that can bring significant operational risk,” Volk said.
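The blind-spot problem Volk describes can be illustrated with a toy sketch. This is illustrative plumbing built only on the Python standard library, not the actual OpenTelemetry API: a wrapper records a “span” around every decorated call, so coverage does not depend on a developer remembering to add instrumentation inside each function.

```python
import functools
import time

SPANS = []  # toy in-memory span store; a real SDK would export to a collector


def auto_instrument(fn):
    """Toy decorator: records a span (name + duration) around every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            SPANS.append({
                "name": fn.__name__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


@auto_instrument
def handle_request(path):
    # Hypothetical request handler; the author never wrote tracing code here,
    # yet every call is still recorded.
    return f"200 OK {path}"


handle_request("/users")
print(SPANS[0]["name"])  # handle_request
```

Auto-instrumentation agents apply this kind of wrapping automatically to known frameworks and libraries, which is what closes the monitoring blind spots Volk warns about.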
OpenTelemetry should also further simplify and, ultimately, automate code instrumentation. DevOps teams can then benefit from analyzing a joint stream of release and operations data drawn from logs, metrics and distributed traces, Volk said.
“Poor code instrumentation is the root cause for IT operations struggling to understand interdependencies between services, applications, databases and the underlying storage, network, and compute infrastructure,” Volk said. “The standardization and, ultimately, automation of code instrumentation will lead to more complete streams of telemetry data for today’s fancy AIOps platforms to triangulate and use for better quality predictions, recommendations, and for faster root-cause analysis.”
The Confusion Barrier
As mentioned above, confusion about what observability is and how insights are gained from telemetry data still abounds. As Ben Sigelman, CEO and co-founder of observability software provider Lightstep and co-creator of the OpenTelemetry project, explained, a stack typically consists of (1) a telemetry layer, (2) potentially multiple storage layers, depending on the particular assemblage of observability providers, and (3) a value layer where observability addresses actual business problems, whether they relate to core monitoring, performance analysis, incident resolution, CI/CD or other concerns.
“Since our industry is still wrapping its head around observability as a concept, there has been a tendency to confuse ‘the telemetry’ with ‘the observability,’” Sigelman said. “That said, the observability can never be better than the telemetry, and that’s why OpenTelemetry is so important: it provides a path to high-quality telemetry that avoids the opportunity cost of developer time spent instrumenting manually, and also avoids the vendor lock-in of proprietary agents.”
While noting that “the use of observability to gain insights into application and operations performance has become that much more critical to adopt as infrastructure becomes more complex,” Logan Franey, a senior product marketing manager at application performance monitoring company Dynatrace, said organizations seeking to implement observability face challenges on several fronts.
“But like any technology many, if not most, organizations are keen to adopt — whether it is stateful storage or identity and access management for cloud native infrastructure — impediments exist for observability’s adoption,” Franey said. “Tracing, metrics and logs, for example, are often confused with observability, instead of perceiving such capabilities as means to an end to help organizations achieve observability as the conduit for helpful data points and insights on which organizations can rely.”
The Use Case
As an example of a vendor seeking to meet the OpenTelemetry project’s compatibility specifications, LogDNA offers a dashboard for Kubernetes metrics and cardinality data analysis.
“To enable that experience, we’ve chosen to create our own [reporting functionality] to be able to collect the information we felt was most important for developers,” Michael Shi, a LogDNA product manager, said. “As OpenTelemetry continues to mature in their beta, we’re certainly open to supporting an OpenTelemetry exporter to further enrich logs.”
The tracing aspects of OpenTelemetry “are really great — they’re definitely best-in-class,” Tom Wilkie, vice president of product at observability software provider Grafana Labs, as well as a Prometheus maintainer, said. “They have great and really wide support and a really rich set of primitives.”
Wilkie noted how Grafana is already an OpenTelemetry user and uses the OpenTelemetry tracing software internally for Grafana Tempo. “I think Tempo is the first OpenTelemetry native tracing system that really prefers unity and instrumentation,” Wilkie said.
For metrics, more work needs to be done, he said. “If you look at metrics, OpenTelemetry effectively started from scratch a year ago or so,” Wilkie said. “If you compare it to something like the OpenMetrics project, which comes out of Prometheus and which we’ve been working on for decades, if you include the work that was done at Google, then you can see why it’s just less mature and still going to take time.”
As part of Wilkie’s and Grafana’s contribution to Prometheus, Wilkie said, “we’re engaging with the OpenTelemetry team, and we’re trying to make sure OpenTelemetry has Prometheus compatibility.” After the past decade spent building Prometheus, its associated instrumentation libraries and its best practices, “we look forward to working with the [OpenTelemetry team] during the coming weeks and months.”
For OpenTelemetry logging, “I just don’t think it’s really there yet,” Wilkie said. “Our advice to our users and our customers is to use OpenTelemetry for tracing, but to use Prometheus, OpenMetrics and the Prometheus client for metrics. Of all the systems out there, we feel Prometheus is really the most advanced and most mature and really the one that has learned all the really hard lessons.”
“We’re not super opinionated about the logging libraries to use in your applications — we have a slight preference for logfmt- and JSON-structured logging, but in general Loki can handle logs in any format,” Wilkie said. “Ditto you can ship us logs using pretty much any agent — we have plugins for all of them — and if you want a single agent that does Prometheus metrics, OpenTelemetry traces and Loki logs I’d encourage you to check out the Grafana Cloud Agent.”
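For illustration, here is the difference between the two structured formats Wilkie mentions, rendered with only the Python standard library. The field names are arbitrary, and this naive logfmt helper skips the quoting a real encoder would apply to values containing spaces.

```python
import json


def logfmt(fields):
    """Render a dict as a naive logfmt line: space-separated key=value pairs."""
    return " ".join(f"{k}={v}" for k, v in fields.items())


fields = {"level": "info", "msg": "request_done", "status": 200}

print(logfmt(fields))      # level=info msg=request_done status=200
print(json.dumps(fields))  # {"level": "info", "msg": "request_done", "status": 200}
```

Both lines carry the same fields; the point of either format is that a backend like Loki can parse keys and values out of each log line instead of treating it as opaque text.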
However, a common misconception is that OpenTelemetry is “one big thing,” and “that leads to many misunderstandings about project maturity,” Sigelman said.
“OpenTelemetry is a broad project, and most notably an intentionally decoupled project. Many of the critical pieces are stable and ready for use in production — for instance, tracing support in major languages, and the OTel Collector, which is already broadly used in production at name-brand enterprises today. Other more far-flung pieces — or support in far-flung languages — are less mature. And this is all by design,” Sigelman said. “Because of its breadth and decoupled nature, OTel maturity should always be assessed per component.”
Cloud Native Computing Foundation (CNCF), Dynatrace, Lightstep and LogDNA are sponsors of The New Stack.
Feature image via Pixabay.