How OpenTelemetry Works with Kubernetes
Open source observability has become a critical part of the DevOps toolchain, providing developers, operations engineers and other stakeholders with real-time visibility into the health and performance of their applications and systems.
Projects like Jaeger for distributed tracing, Fluentd for log aggregation and Istio for service mesh observability have expanded the tool set. As open source observability continues to mature, we can expect to see further advances in areas such as automation, machine learning and analytics, enabling even more sophisticated monitoring and troubleshooting capabilities.
OpenTelemetry (OTEL) is a set of tools, APIs and software development kits (SDKs), and is currently an incubating project at the Cloud Native Computing Foundation. It’s primarily focused on data collection, not on storage or query capabilities.
The main goal of OpenTelemetry is to provide a standard way for developers and end-users to create, collect, and export telemetry data from their applications and systems, and to facilitate interoperability between different observability tools and platforms.
OTEL supports several programming languages, including Java, Python, Go, Ruby, and many more, making it a versatile solution for collecting telemetry data from different types of applications and systems.
Once the telemetry data is collected by OpenTelemetry components, it can be exported to various backends, such as software-as-a-service (SaaS) solutions, platforms or storage systems, that provide storage and query capabilities. OpenTelemetry provides integrations with various backends, including Prometheus, Jaeger, Zipkin and many more, making it easier to export telemetry data to different systems.
Using OTEL with Kubernetes doesn’t need to be difficult. In fact, installing an OTEL operator for Kubernetes is a straightforward process, and in this article, you’ll learn how to do it.
With this operator, you can easily manage the OpenTelemetry components in your Kubernetes cluster, and configure them to export telemetry data to the backend of your choice. This simplifies the process of monitoring your Kubernetes cluster and enables you to make informed decisions about the health and performance of your applications.
Essential Components in OpenTelemetry
OpenTelemetry consists of six essential components: the specification, the API, the SDKs, the data model (OTLP), auto-instrumentation and the Collector. The first four are used by instrumentation developers and observability companies to create observability products.
Specification
The specification provides a standardized way of defining the behavior and functionality of these components, which ensures consistency and compatibility across different OpenTelemetry implementations. For example, the specification defines the format and semantics of trace and metric data, ensuring that they can be correctly interpreted by other components in the system.
API
The OpenTelemetry API provides a standard way for developers to instrument their applications with tracing, metrics and other telemetry data. The API is language-agnostic and allows for consistent instrumentation across different programming languages and frameworks.
SDKs
The OpenTelemetry SDKs provide language-specific implementations of the OpenTelemetry API. The SDKs make it easier for developers to instrument their applications with tracing and metrics by providing libraries and utilities for collecting and exporting telemetry data.
Data Model – OTLP
The OpenTelemetry Data Model provides a standardized format for telemetry data, called OTLP (OpenTelemetry Protocol). OTLP is a vendor-neutral format that makes it easier to export telemetry data to different backends and analysis tools.
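As a heavily trimmed sketch, a single span in OTLP’s JSON encoding looks like the following (the field names follow the protocol’s JSON mapping; the IDs, timestamps and service name are made-up example values):

```json
{
  "resourceSpans": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "frontend" } }
      ]
    },
    "scopeSpans": [{
      "spans": [{
        "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
        "spanId": "00f067aa0ba902b7",
        "name": "GET /checkout",
        "startTimeUnixNano": "1681300000000000000",
        "endTimeUnixNano": "1681300000120000000"
      }]
    }]
  }]
}
```

Because every SDK and the Collector speak this same format, a backend only needs to understand OTLP once to ingest data from any instrumented service.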
The last two components, the OpenTelemetry Auto-Instrumentation and Collector, are for developers who want to collect and export telemetry data from their applications to different backends, without having to write their own instrumentation code.
Auto-Instrumentation
OpenTelemetry includes an automatic instrumentation agent that can inject applications with tracing and metrics without requiring any manual instrumentation code. This makes it easy to add observability to existing applications without requiring significant code changes.
The Auto-Instrumentation component can be downloaded and installed as a library or agent, depending on the programming language or framework being used. The Auto-Instrumentation library automatically injects the application code with OpenTelemetry API calls, to capture and export telemetry data.
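For example, the Java auto-instrumentation agent is attached at JVM startup and configured through environment variables; no application code changes are needed. In this sketch, the jar path, service name and Collector endpoint are placeholders:

```shell
# Attach the OpenTelemetry Java agent at startup; OTEL_* variables
# tell it where to send telemetry (placeholder values shown).
export OTEL_SERVICE_NAME=my-app
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
java -javaagent:/path/to/opentelemetry-javaagent.jar -jar my-app.jar
```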
Collector
The Collector component is responsible for collecting telemetry data from different sources, such as applications, servers and infrastructure components, and exporting it to various backends.
The Collector can be downloaded and configured to collect data from different sources and can perform aggregation, sampling, and other operations on the telemetry data before exporting it to different backends, depending on the specific use case.
How Telemetry Data Is Created
Let’s consider an example where we have an e-commerce application with three workloads — frontend, driver and customer — that communicate with each other over HTTP. We want to collect telemetry data to monitor the performance and health of these applications.
To do this, we instrument each of these applications with OpenTelemetry APIs, for example tracer.span().start(). These APIs allow us to create telemetry signals, such as logs, metrics and traces.
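The tracer.span().start() call above is shorthand; the real APIs follow the same create-record-end shape (in Python, for instance, tracer.start_as_current_span). The sketch below illustrates that pattern with simplified stand-in classes — these are illustrations, not the actual opentelemetry-api:

```python
import time
import uuid
from contextlib import contextmanager

class Span:
    """Simplified stand-in for an OpenTelemetry span (not the real SDK)."""
    def __init__(self, name):
        self.name = name
        self.span_id = uuid.uuid4().hex[:16]  # real span IDs are 8 random bytes
        self.attributes = {}
        self.start_ns = time.time_ns()
        self.end_ns = None

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def end(self):
        self.end_ns = time.time_ns()

class Tracer:
    """Stand-in tracer: opens a span, ends it on exit, hands it off."""
    def __init__(self):
        self.finished = []  # stands in for the SDK's processor/exporter pipeline

    @contextmanager
    def start_as_current_span(self, name):
        span = Span(name)
        try:
            yield span
        finally:
            span.end()
            self.finished.append(span)

tracer = Tracer()
with tracer.start_as_current_span("GET /checkout") as span:
    span.set_attribute("http.method", "GET")

print(tracer.finished[0].name)  # prints: GET /checkout
```

In a real application, the finished span would be handed to an exporter rather than kept in a list, which is exactly the hand-off the Collector receives.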
Once these signals are created, they are sent or collected by the OpenTelemetry Collector, which acts as a centralized data hub.
The Collector is responsible for processing these signals, performing tasks such as batching, relabeling, PII filtering, dropping unneeded data and aggregation to ensure the data is accurate and meaningful. Once the Collector is satisfied with the data, it sends the telemetry signals to a platform for storage and analysis.
The Collector can be configured to send these processed signals to a variety of platforms, such as open source solutions like Prometheus, Loki, Jaeger or vendors like Dynatrace, New Relic and so on.
For example, the Collector may send logs to a log aggregation platform like Loki, metrics to a monitoring platform like Prometheus, and traces to a distributed tracing platform like Jaeger. The telemetry data stored in the platform can then be used to gain insights into the system’s behavior and performance and to identify any issues that need to be addressed.
Defining the Kubernetes Operator’s Behavior
You can deploy the OpenTelemetry Operator to your Kubernetes cluster, and have it automatically instrument and collect telemetry data for your applications.
The OpenTelemetry Kubernetes Operator provides two Custom Resource Definitions (CRDs), which are used to define the operator’s behavior. Together, these two CRDs allow you to define the complete behavior of the OpenTelemetry Operator for your application.
The two CRDs are:
- The otelinst. This CRD is used to define the instrumentation for an application. It specifies which components of the OpenTelemetry API to use, which data to collect, and how to export that data to a backend.
With the otelinst CRD, you can specify the name of the application to be instrumented, the language and runtime environment, the sampling rate for traces, and the type of exporter to be used.
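A sketch of such a resource follows, assuming the operator’s v1alpha1 API; the resource name, Collector endpoint and sampling ratio are placeholders:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation          # placeholder name
spec:
  exporter:
    endpoint: http://otel-collector:4317   # placeholder Collector endpoint
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"                # sample 25% of traces
```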
- The otelcol. This CRD is used to define the OpenTelemetry Collector’s behavior. It specifies the configuration for the Collector, including the receivers (sources of telemetry data), processors (for filtering and transforming the data), and exporters (for sending the data to a backend).
With the otelcol CRD, you can specify which protocol to use for communication, such as gRPC or HTTP, which receivers and exporters to use, and any additional configuration options.
Installing the OpenTelemetry Kubernetes Operator
The OpenTelemetry Kubernetes operator can be installed using various methods, including:
- Operator Lifecycle Manager (OLM). This is the recommended method for installing the operator as it provides easy installation, upgrades and management of the operator.
- Helm charts. Helm is a package manager for Kubernetes that provides a simple way to deploy and manage applications on Kubernetes. Helm charts for the OpenTelemetry operator are available and can be used to deploy the operator.
- Kubernetes manifests. The operator can also be installed using Kubernetes manifests, which provide a declarative way to manage Kubernetes resources. The operator manifests can be customized to fit specific requirements.
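As a sketch, the Helm route looks like the following; the repository URL and chart name come from the official opentelemetry-helm-charts project, and note that the operator expects cert-manager to already be present in the cluster:

```shell
# Add the official OpenTelemetry Helm repository and install the operator.
# cert-manager must be installed in the cluster beforehand.
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm install opentelemetry-operator open-telemetry/opentelemetry-operator
```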
Approaches to Instrumentation
To collect telemetry data, we need to instrument our applications with code that creates the telemetry signals. There are different approaches to instrumenting an application for telemetry data.
Explicit/Manual Instrumentation
In this approach, developers explicitly add instrumentation code to their applications to create telemetry signals such as logs, metrics, and traces. This approach provides developers more control over the telemetry data, but it can be time-consuming and error-prone.
Direct Integration in Runtimes
Some runtimes and frameworks, such as Quarkus and WildFly, have direct integrations with OpenTelemetry. This means that developers don’t need to add instrumentation code to their applications — the runtime automatically generates telemetry data for them. This approach can be easier to use and requires less maintenance, but it may be less flexible than the explicit/manual approach.
The main disadvantage of the direct integration in runtimes approach is that the instrumentation is limited to the supported frameworks. If the application uses a framework that is not supported, then the telemetry data may not be captured effectively or may require additional custom instrumentation.
This approach can also lead to vendor lock-in if the chosen runtime or framework is only compatible with specific observability vendors.
Therefore, this approach may not be suitable for all applications or organizations, especially if they require flexibility in choosing the observability stack or need to instrument a wide range of frameworks and libraries.
Auto-Instrumentation/Agent
In this approach, an OpenTelemetry agent or auto-instrumentation library is added to the application runtime. The agent/library automatically instruments the application code and generates telemetry data without the need for developers to add instrumentation code.
This approach can be the easiest to use and requires minimal maintenance, but it may be less flexible and may not capture all relevant telemetry data.
While the auto-instrumentation/agent approach has many advantages, one of the main disadvantages is that it can consume more memory and CPU cycles, as it supports a wide range of frameworks and instruments almost all the APIs in the application. This additional overhead can impact the performance of the application, especially if the application is already resource-intensive.
Additionally, this approach may not capture all the necessary telemetry data or may result in false positives or negatives. For example, it may not capture certain edge cases or may capture too much data, making it difficult to find the relevant information.
However, despite these disadvantages, the auto-instrumentation/agent approach is still highly recommended for organizations starting out with observability, as it provides a quick and easy way to get started with collecting telemetry data.
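With the operator, opting a workload into this approach can be as simple as an annotation on the pod template. In this trimmed Deployment sketch, the names and image are placeholders; the annotation key is the one the OpenTelemetry Operator watches for Java workloads (other languages have analogous keys):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                      # placeholder name
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Ask the operator to inject the Java auto-instrumentation agent.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: app
          image: my-registry/my-app:latest   # placeholder image
```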
How Telemetry Data Is Collected and Exported
The Collector is responsible for receiving telemetry data from the instrumentation code, processing and exporting it to a platform for storage and analysis. The Collector can be configured with various components, such as receivers, processors and exporters, to meet specific needs.
Receivers are responsible for accepting data from various sources, such as agents, exporters or the network, while processors can transform, filter, or enhance the data. Finally, exporters send the data to a storage or analysis platform, such as Prometheus or Jaeger.
There are two distributions of the Collector: Core and Contrib.
Core is the official distribution, which contains stable and well-tested components, while Contrib is a community-driven distribution, which contains additional experimental components.
You can also build your own Collector by selecting the components you need and configuring them according to your requirements. The Collector is written in Go, which makes it easy to deploy and maintain. The documentation on the OpenTelemetry website provides detailed guidance on how to set up, configure and use the Collector.
OpenTelemetry can be used as an alternative to Prometheus in some cases, especially if you have limited resources on edge devices. Prometheus has a strong focus on monitoring and alerting, while OpenTelemetry is designed for observability and provides features beyond just metrics, including tracing and logging.
Additionally, OpenTelemetry can be used to export data to various backends, including Prometheus, so you can still use Prometheus for monitoring and alerting if you prefer. The flexibility and extensibility of OpenTelemetry allow you to tailor your observability solution to your specific needs and resource constraints.
The OpenTelemetry Operator is responsible for deploying and managing the OpenTelemetry Collector, which is a central component for collecting, processing and exporting telemetry data. It does not deploy other sidecars such as Envoy, but can work alongside them to collect additional telemetry data.
The OpenTelemetry Collector can be deployed in different modes such as a sidecar, daemonset, deployment or statefulset, depending on the specific use case and requirements.
However, if the goal is to collect logs from the nodes in the cluster, deploying the collector as a daemonset can be a good option as it ensures that a collector instance runs on every node, thus allowing for efficient and reliable log collection.
OTEL Collector Configuration
Here’s an example Kubernetes manifest file for deploying the OpenTelemetry Collector using the otelcol Custom Resource Definition:
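The sketch below assumes the operator’s v1alpha1 API; the resource name, ports and endpoint are placeholders. Note that the queued_retry processor discussed in this article has since been deprecated in the Collector, so this sketch uses the memory_limiter and batch processors instead:

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  mode: deployment        # other options: daemonset, sidecar, statefulset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
      batch:
    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"   # scrape target for a Prometheus server
      logging:
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
```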
In this example, we define a collector named otel-collector that uses the OTLP receiver to receive trace data, the Prometheus exporter to export metrics to a Prometheus server, and two processors, including queued_retry, to process the data. The config field specifies the configuration for the collector, which is written in YAML format.
Collecting traces, metrics, and logs using OpenTelemetry is important for several reasons:
- Increased observability. By collecting and correlating traces, metrics and logs, you gain a better understanding of how your applications and systems are performing. This increased observability allows you to quickly identify and resolve issues before they impact your users.
- Improved troubleshooting. OpenTelemetry provides a standardized way of collecting telemetry data, which makes it easier to troubleshoot issues across your entire stack. By having access to all the relevant telemetry data in a single place, you can quickly find the root cause of issues.
- Better performance optimization. With access to detailed telemetry data, you can make informed decisions about how to optimize your applications and systems for better performance and reliability. For example, by analyzing metrics, you can identify areas of your system that are underutilized or overutilized, and adjust resource allocation accordingly.
- Cross-platform compatibility. OpenTelemetry is designed to work across multiple programming languages, frameworks and platforms, which makes it easier to collect telemetry data from different parts of your stack. This interoperability is important for organizations that use multiple technologies and need to standardize their observability practices across their entire stack.
OpenTelemetry logs provide a standardized way to collect, process and analyze logs in a distributed system. By using OpenTelemetry to collect logs, developers can avoid the problem of having logs spread across multiple systems and different formats, making it difficult to analyze and troubleshoot issues.
With OpenTelemetry logs, developers can collect logs from multiple sources, including legacy logging libraries, and then process and analyze them using a common format and API. This allows for better integration with the rest of the observability stack, such as metrics and traces, and provides a more complete view of the system’s behavior.
Additionally, OpenTelemetry logs provide a way to enrich logs with additional contextual information, such as metadata about the request, user or environment, which can be used to make log analysis more meaningful and effective.
What’s Next for OpenTelemetry?
Auto-Instrumentation for Web Servers
The OTEL webserver module consists of both Apache and Nginx instrumentation. The Apache module is responsible for tracing incoming requests to the server by injecting instrumentation into the Apache server at runtime. It captures the response time of the many modules involved in an incoming request, including mod_proxy, allowing the hierarchical time consumption of each module to be recorded.
Similarly, the Nginx web server module also enables tracing of incoming requests to the server by injecting instrumentation into the Nginx server at runtime. It captures the response time of the individual modules involved in the request processing.
Profiling
The OpenTelemetry community has published a document outlining the long-term vision for profiling support in the project. The plan is the result of discussions and collaboration among members of the community, representing a diverse range of industries and expertise.
The document is intended to guide the development of profiling support in OpenTelemetry, but is not a checklist of requirements. The vision is expected to evolve over time and be refined based on learnings and feedback.
Open Agent Management Protocol
Open Agent Management Protocol (OpAMP) is a network protocol that enables remote management of large fleets of data collection agents. It allows agents to report their status to a server and to receive configurations and agent installation package updates from it.
OpAMP is vendor-agnostic, so the server can remotely monitor and manage a fleet of different agents that implement OpAMP, including a fleet of mixed agents from different vendors.
It supports remote configuration of agents, status reporting, an agent’s own telemetry reporting, management of downloadable agent-specific packages, secure auto-updating capabilities and connection credentials management. This functionality allows for a single pane of glass management view of a large fleet of mixed agents.