
Splunk: OpenTelemetry and the Future of Observability

11 Jun 2021 3:00am, by

Spiros Xanthos
Spiros Xanthos is the VP of Product Management for Observability and IT Operations at Splunk, overseeing Splunk’s Observability and IT product portfolios. Previously he was the CEO and Founder of Omnition, an Observability platform for Cloud Native Applications that pioneered no-sample tracing and co-created OpenTelemetry. Omnition was acquired by Splunk in September 2019. Before Omnition, Spiros started and ran Pattern Insight, which built Log Insight (a Log Analytics Platform), until selling it to VMware in 2012, and ezhome, which he ran until September 2017.

We’ve all heard of the term “observability,” the capture and analysis of data that allows you to understand your complex applications and environments. Observability is a hot topic, and OpenTelemetry is quickly becoming the shining star within it. OpenTelemetry, a Cloud Native Computing Foundation (CNCF) project launched in 2019, is rapidly growing to become the most popular and supported way to capture performance telemetry from applications and infrastructure. This in turn allows developers and site reliability engineers (SREs) to gain observability into a system’s structure, status and behavior.

OpenTelemetry provides a single set of SDKs, agents and protocol definitions that capture metadata, metrics, distributed traces, logs (currently in an experimental phase but quickly developing) and eventually other types of data from every layer of the application and infrastructure stack. Its significant benefits are clear: lower total cost of ownership, the end of proprietary data lock-in, and the innovative drive and rapid expansion of capabilities guided by the open source community. Let’s take a look at why OpenTelemetry is the future of observability and why it is quickly taking over the open source world.

The Need for Observability

While the shift to cloud services and Kubernetes has allowed millions of developers and organizations to launch high-scale web services, the resulting applications are complex, making them hard to understand, monitor, debug and develop for. This, along with a shift of on-call responsibilities from operations teams to development and SRE roles, has made observability a critical input to the development velocity and reliability of software — and the businesses that rely on it.

One’s ability to observe the inner workings of a system is only as good as the telemetry that comes out of it, meaning that telemetry needs to be captured and propagated through each platform, operating system, language, storage and RPC client, web framework and other libraries that developers want to gain insight into. The signals generated from each of these sources also need to be correlated, so that an action taken on a service can be linked to the underlying infrastructure and to actions taken on other services that were part of the same chain of requests.
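One common way to link the actions in a chain of requests is to pass a shared trace identifier along with every outgoing call. The sketch below builds and parses a header laid out like the W3C Trace Context `traceparent` field (version, trace ID, parent span ID, flags) using only the Python standard library. The function names are hypothetical, and this illustrates the idea rather than reproducing OpenTelemetry's own propagation code.

```python
import secrets

def new_traceparent() -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # version 00, sampled flag set

def parse_traceparent(header: str) -> dict:
    """Split a traceparent-style header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

def child_traceparent(incoming: str) -> str:
    """Continue the caller's trace: keep the trace ID, mint a new span ID."""
    ctx = parse_traceparent(incoming)
    return f"{ctx['version']}-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"

# Service A starts a trace and calls service B, which continues it.
header_a = new_traceparent()
header_b = child_traceparent(header_a)
```

Because both headers carry the same trace ID, whatever each service records can later be joined into one end-to-end view of the request.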

This is a major challenge, and for years developers have struggled with a patchwork of tools that captured just a subset of this data for analysis. The results usually could not be correlated across signal types or layers of the application stack, and the entire setup required constant maintenance. Even with well-staffed teams, vendors in the space struggled to capture data from the platforms, languages and frameworks that their customers demanded.

Conquering Complexity with OpenTelemetry

This problem of extracting high-quality signals from infrastructure and applications is exactly what brought the OpenCensus and OpenTracing projects together to form OpenTelemetry, and is why OpenTelemetry has gained so much popularity — in fact, it’s the second most active project in the CNCF behind only Kubernetes. This is a testament to the value that it brings developers and SREs.

Typically, developers will use OpenTelemetry in an application by:

  • Using an automatic instrumentation language package or linking an SDK and appropriate instrumentation to their codebase
  • Deploying a Collector to the same host or Kubernetes pod
  • Exporting the SDK / automatic instrumentation package traces and metrics to the Collector
  • Exporting these and system metrics from the Collector to their destinations of choice for processing
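The four steps above can be pictured as a small pipeline. The sketch below is a plain-Python model of that flow, not the OpenTelemetry SDK or Collector API: `ToyCollector`, `traced` and the list standing in for a backend are all hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager

class ToyCollector:
    """Stand-in for a Collector: buffers spans, then forwards them as a batch."""
    def __init__(self, destinations):
        self.destinations = destinations  # callables that accept a list of spans
        self.buffer = []

    def receive(self, span):
        self.buffer.append(span)

    def flush(self):
        batch, self.buffer = self.buffer, []
        for send in self.destinations:
            send(batch)

@contextmanager
def traced(name, collector):
    """Stand-in for instrumentation: time a block and hand the span to the collector."""
    start = time.perf_counter()
    try:
        yield
    finally:
        collector.receive({"name": name,
                           "duration_s": time.perf_counter() - start})

backend = []  # stand-in for a destination of choice
collector = ToyCollector([backend.extend])

with traced("GET /checkout", collector):
    sum(range(1000))  # the application work being measured

collector.flush()
```

The application code only knows about the collector; where the data ends up is decided by the list of destinations the collector was given.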

The long-term goal of OpenTelemetry is for telemetry sources (the libraries, frameworks and other code that developers want to capture data from) to generate telemetry natively with the language-specific OpenTelemetry APIs. There are already examples of this: the latest version of the .NET framework and ASP.NET call OpenTelemetry-compatible APIs directly, as do several database clients from Google. In the short term, the OpenTelemetry project provides instrumentation that captures telemetry natively from these sources and translates it to the OpenTelemetry APIs. This allows the APIs, SDKs, Collector and language auto-instrumentation packages to provide immediate value to everyone.

Going All-In on OpenTelemetry

As a founding member and one of the top contributors to the OpenTelemetry project, we at Splunk believe that OpenTelemetry is more important now than ever. It can accelerate the implementation of robust observability and deliver strong results for cloud-native applications at a time when digital experiences through mobile and web applications are critical.

OpenTelemetry is also easy to get started with and, once deployed, provides numerous benefits. With built-in support across many frameworks and client libraries, and a large registry of instrumentation, OpenTelemetry provides out-of-the-box support for most web frameworks, RPC systems, storage clients, databases, web servers, operating systems and Kubernetes.

Additionally, its native wire protocol and exporter interfaces allow OpenTelemetry users to completely avoid vendor lock-in. If signals are being sent to a Collector prior to export, changing export targets is as simple as editing a YAML file. OpenTelemetry can send data to multiple destinations at the same time, meaning that different teams in an organization can evaluate, use and smoothly migrate between different tools and still gain unobstructed observability across the entire stack.
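To make the point about configuration-driven export concrete, here is a minimal Python sketch in which the destinations are chosen by a configuration value rather than by application code. The exporter names (`vendor_a`, `vendor_b`, `console`) and the `export` function are hypothetical, and the dictionary only mimics the role a Collector's YAML file plays; it is not the Collector's actual configuration schema.

```python
# Hypothetical registry of exporters; each one just records what it was sent.
received = {"vendor_a": [], "vendor_b": [], "console": []}
EXPORTERS = {name: sink.append for name, sink in received.items()}

def export(signal: dict, config: dict) -> None:
    """Send one signal to every exporter named in the pipeline configuration."""
    for name in config["exporters"]:
        EXPORTERS[name](signal)

span = {"name": "GET /cart", "duration_ms": 12}

# Two teams evaluate different tools at the same time.
export(span, {"exporters": ["vendor_a", "vendor_b"]})

# Migrating away from vendor_a is a configuration change, not a code change.
export(span, {"exporters": ["vendor_b"]})
```

The instrumented application never references a vendor by name, which is what keeps a later migration cheap.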

Because it uses a single set of SDKs and semantic conventions, OpenTelemetry also provides correlations and attaches consistent metadata to every signal. Custom application metrics are associated with both services and hosts, distributed traces are correlated with logs and metric exemplars, etc.
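A small sketch of what consistent metadata buys you, assuming every signal carries the same resource attributes and that spans and logs share a trace ID. The attribute keys `service.name` and `host.name` follow the style of OpenTelemetry's semantic conventions; the values, the `emit` helper and the record shapes are invented for illustration and are not the SDK's data model.

```python
# Resource attributes shared by every signal from one process.
RESOURCE = {"service.name": "checkout", "host.name": "web-01"}

def emit(kind: str, body: dict) -> dict:
    """Attach the same resource metadata to any signal type."""
    return {"kind": kind, "resource": RESOURCE, **body}

span = emit("span", {"trace_id": "abc123", "name": "POST /pay", "duration_ms": 840})
log = emit("log", {"trace_id": "abc123", "message": "card declined"})
other_log = emit("log", {"trace_id": "def456", "message": "cache miss"})
metric = emit("metric", {"name": "payments.failed", "value": 1})

def logs_for(span: dict, logs: list) -> list:
    """Find the log records written while handling the same request as the span."""
    return [record for record in logs if record["trace_id"] == span["trace_id"]]

related = logs_for(span, [log, other_log])
```

Here `related` contains only the "card declined" record, and the metric can be grouped by the same service and host as the span because all three signals share one set of resource attributes.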

Lastly, OpenTelemetry is supported by a massive, healthy, open source community and every part of it is under active development. The OpenTelemetry project’s focus on telemetry collection — versus storage, analysis, etc. — has kept the incentives of vendors, cloud platforms and end-user contributors aligned, and the project welcomes new members and contributions daily.

Making an Impact

OpenTelemetry is already seeing broad adoption, which is quite a change from the past, when most products in the Application Performance Management space relied on proprietary agents that used some combination of bytecode injection and monkey patching to capture distributed traces and metrics from most language runtimes. These agents were expensive to develop, worked only with specific languages, supported a small number of sources and generally consumed a large percentage of CPU cycles. Their reliance on bytecode signature detection also meant that they had to be updated any time a telemetry source received minor updates. OpenTelemetry’s capabilities and supported sources, reliance on APIs and standards rather than bytecode patterns, flexibility and minimal performance impact have made it a must-have for vendors and end-users alike.
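For readers unfamiliar with monkey patching, the following Python sketch shows the general technique: a function in someone else's library is replaced at runtime with a wrapper that records timing. The `patch` helper and the `captured` list are hypothetical, and the standard-library `json` module merely stands in for a third-party telemetry source; real agents were considerably more involved.

```python
import functools
import json  # stands in for a third-party library an agent wants to observe
import time

captured = []

def patch(module, attr):
    """Replace module.attr with a timing wrapper and return the original."""
    original = getattr(module, attr)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            captured.append({"name": f"{module.__name__}.{attr}",
                             "duration_s": time.perf_counter() - start})

    setattr(module, attr, wrapper)
    return original

original_dumps = patch(json, "dumps")
result = json.dumps({"a": 1})  # the call is now observed
json.dumps = original_dumps    # undo the patch
```

The patch depends on the library keeping the name and shape of `dumps`; if the library renames or restructures that function, the patch has to be rewritten, which is the maintenance burden an API-based approach avoids.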

What’s Next for OpenTelemetry?

We’re convinced that OpenTelemetry will become an industry standard for observability, benefiting from unprecedented growth over the next few years. We’re already seeing this with OpenTelemetry’s tracing capabilities reaching general availability in February 2021, and most OpenTelemetry components have already issued 1.0 releases with production-ready tracing features. From there, OpenTelemetry’s metric capabilities will reach 1.0 later this year, though many end-users and vendors are already relying on the Collector’s ability to capture system and pre-packaged application metrics.

Logging is still in an experimental stage; however, development of native logging capabilities within the Collector is already underway thanks to the donation of the Stanza logging agent from observIQ. Additionally, more enhancements and signal types will arrive over time. For example, Splunk is in the process of adding eBPF capture to the Collector, which will allow metrics and other signal types to be gathered directly from the OS kernel.

Overall, the benefits and impact of OpenTelemetry speak for themselves. The hyper-growth within the open source community is unprecedented, so much so that we predict that OpenTelemetry will be the most commonly used data instrumentation and collection technology in the world by the end of 2022. To other companies, we say it’s time to commit to standardizing on OpenTelemetry so everyone can reap the benefits of the open source project dominating the DevOps community.

Feature image via Pixabay.
