How OpenSearch Visualizes Jaeger’s Distributed Tracing
We all know how important observability is. Open source tooling is a popular way to get it, but selecting the right tools is a challenge. Most organizations end up running several best-of-breed tools that span many different projects and databases.
As organizations continue to implement microservices-based architectures and cloud native technologies, operational data is becoming increasingly large and complex. Because of the distributed nature of the data, the old approach of sorting through logs is not scalable.
As a result, organizations are continuing to adopt distributed tracing as a way of gaining insight into their systems. Distributed tracing helps determine where to start investigating issues and ultimately reduces the time spent on root cause analysis. It serves as an observability signal that captures the entire lifecycle of a particular request as it traverses distributed services. Traces can have multiple service hops, called spans, that comprise the entire operation.
One of the most popular open source solutions for distributed tracing is Jaeger. Jaeger is an open source, end-to-end solution hosted by the Cloud Native Computing Foundation (CNCF). Jaeger leverages data from instrumentation SDKs that are OpenTelemetry (OTel) based and supports multiple open source data stores, such as Cassandra, OpenSearch, and Elasticsearch, for trace storage.
While Jaeger does provide a UI solution for visualizing and analyzing traces along with monitoring data from Prometheus, OpenSearch now provides the option to visualize traces in OpenSearch Dashboards, the native OpenSearch visualization tool.
OpenSearch provides extensive support for log analytics and observability use cases. Starting with version 1.3, OpenSearch added support for distributed trace data analysis with the Observability feature. Using Observability, you can analyze the crucial rate, errors, and duration (RED) metrics in trace data. Additionally, you can evaluate various components of your system for latency and errors and pinpoint services that need attention.
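To make the RED metrics concrete, here is a minimal Python sketch of how rate, errors, and duration could be derived from a batch of spans. The Span class, its field names, and the red_metrics helper are hypothetical and exist only for illustration; they are not OpenSearch or Jaeger APIs, and the actual Observability feature computes these values from indexed trace data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span record (not the Jaeger or OTel schema).
    trace_id: str
    service: str
    operation: str
    duration_ms: float
    error: bool = False


def red_metrics(spans, window_seconds):
    """Aggregate rate, error ratio, and mean duration per service."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span.service].append(span)

    metrics = {}
    for service, group in grouped.items():
        count = len(group)
        errors = sum(1 for s in group if s.error)
        metrics[service] = {
            "rate_per_s": count / window_seconds,   # Rate
            "error_ratio": errors / count,          # Errors
            "avg_duration_ms": sum(s.duration_ms for s in group) / count,  # Duration
        }
    return metrics


# Two traces, each with a frontend span and a payment span.
spans = [
    Span("t1", "frontend", "GET /checkout", 120.0),
    Span("t1", "payment", "charge", 80.0, error=True),
    Span("t2", "frontend", "GET /checkout", 100.0),
    Span("t2", "payment", "charge", 40.0),
]

metrics = red_metrics(spans, window_seconds=2)

# Isolating the spans with errors is a simple filter over the same data.
error_spans = [s for s in spans if s.error]
```

In this sample, the payment service stands out because half of its spans carry an error, which is the kind of signal that points you to the service that needs attention.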
The OpenSearch Project launched the trace analytics feature with support for OTel-compliant trace data provided by Data Prepper, the OpenSearch server-side data collector. To incorporate the popular Jaeger trace data format, in version 2.5 OpenSearch added support for Jaeger trace data to the trace analytics feature in Observability.
With Observability, you can now filter traces to isolate the spans with errors in order to quickly identify the relevant logs. You can use the same feature-rich analysis capabilities that are available for Data Prepper trace data, including RED metrics and contextual linking of traces and spans to their related logs. The following image shows how you can view traces with Observability.
Keep in mind that the OTel and Jaeger formats have several differences, as outlined in OpenTelemetry to Jaeger Transformation in the OpenTelemetry documentation.
Try It Out
To try out this new feature, see the Analyzing Jaeger trace data documentation. The documentation includes a Docker Compose file that shows you how to add sample data using a demo and then visualize it using trace analytics. To enable this feature, you need to set the --es.tags-as-fields.all flag to true, as described in the related GitHub issue. This is necessary because of an OpenSearch Dashboards limitation.
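As a rough illustration of what storing tags as fields means, the following Python sketch flattens a list of key/type/value tag entries into a single mapping of field names to values. It is a conceptual sketch only, not Jaeger's implementation. The tags_as_fields helper is hypothetical, the document layout is simplified, and the "@" dot replacement is an assumption based on the default of the related --es.tags-as-fields.dot-replacement option.

```python
def tags_as_fields(tags, dot_replacement="@"):
    """Turn a list of {key, type, value} tag entries into a flat mapping.

    Hypothetical helper: it mimics the effect of storing tags as object
    fields, under the assumption that dots in tag keys are replaced.
    """
    flat = {}
    for tag in tags:
        # Dots would otherwise be read as nested object paths.
        key = tag["key"].replace(".", dot_replacement)
        flat[key] = tag["value"]
    return flat


# Simplified example of tags stored as a list of entries.
nested_tags = [
    {"key": "http.status_code", "type": "int64", "value": "500"},
    {"key": "error", "type": "bool", "value": "true"},
]

flat = tags_as_fields(nested_tags)
print(flat)
```

With the tags flattened this way, each tag becomes an ordinary field on the span document, which is easier to filter and aggregate on than a list of entries.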
In Dashboards, you can see the top service and operation combinations with the highest latency and the greatest number of errors. Selecting any service or operation will automatically direct you to the Traces page with the appropriate filters applied, as shown in the following image. You can also investigate any trace or service on your own by applying various filters.