
Demystifying Distributed Traces in OpenTelemetry

5 Oct 2020 12:00pm, by Austin Parker

This is part of a series on Distributed Tracing. For a list of other articles in this series, check out the introductory post.

Lightstep sponsored this post.

Austin Parker
Austin Parker is the Principal Developer Advocate at Lightstep and maintainer on the OpenTracing and OpenTelemetry projects. In addition to his professional work, he's taught college classes, spoken about all things DevOps and Distributed Tracing, and even found time to start a podcast. Austin is also the co-author of Distributed Tracing in Practice, published by O'Reilly Media.

Hello again! So far, this series has talked a lot about distributed systems — what they are, where they came from, why they’re popular, and the problems they can cause. Today, I want to get more into the weeds about one particular technique you can use to understand them: distributed tracing. Specifically, we’re going to focus on distributed tracing as it’s implemented by the OpenTelemetry project, a vendor-neutral open source system for generating telemetry data.

So, let’s start at the beginning. What is a trace? A trace is a representation of the work being done by all of the services involved in a request. Traces have some unique properties: you can think of them as a tree data structure, without loops. So, for each trace, there is only one starting point (the root span) and potentially many leaf spans. However, when we visualize a trace, we most often think of it as an icicle graph (or an inverted flame graph, if that’s how you want to think about it).

A flame graph is a useful way to visualize hierarchical data. Flame graphs are commonly used for visualizing profiling data, such as the output of perf or dtrace, which are tools for gathering performance data on a process, such as how long each function call takes.

What makes up these visualizations? You can think of a trace as an action of some sort involving your application — like registering a new user account, clicking a “done” checkbox on a to-do list, or fetching a table of values. Each of these actions may involve one or more components of your application performing some work, and that work takes place over time. We call these smaller parts spans. In a trace visualization, each row is a single span — with its length corresponding to the amount of time (relative to the root span) that it took to complete. Each span also has a name, which represents the task that it’s performing.

What about an example that doesn’t have to do with computers? We can express a lot of things as traces — any ordered operation that doesn’t loop back on itself can be drawn out as a tree. Take the following mathematical equation:

((12+5)*8/2)+1 = x

We can express this as a trace by thinking about our order of operations:

The same trace expressed as a tree (left) and an icicle graph (right)

In this case, our root span would be the action we’re trying to accomplish (solving for x), and each of its children would represent a step in solving it. The visual relationship between each span is also worth paying attention to — each step explicitly depends on the step before it (after all, I can’t solve the outermost part of the equation without first solving each prior step). However, traces can also model work that can take place independently — what if, for example, our equation was ((1+1)*(2+2))?
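To make the tree shape concrete, here is a minimal sketch in Python. The `Step` class and `render` function are hypothetical names invented for this illustration, and the nesting is one reasonable reading of the order of operations rather than a reproduction of the original figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One node in the tree: a named step plus the steps it waits on."""
    name: str
    children: List["Step"] = field(default_factory=list)


# ((12+5)*8/2)+1 = x, where each step depends on the step nested beneath it.
trace = Step("solve for x", [
    Step("(...) + 1", [
        Step("(...) / 2", [
            Step("(...) * 8", [
                Step("12 + 5"),
            ]),
        ]),
    ]),
])


def render(step: Step, depth: int = 0) -> List[str]:
    """Return one indented line per step, root first, like rows of an icicle graph."""
    lines = ["  " * depth + step.name]
    for child in step.children:
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(trace)))
```

Each extra level of indentation stands in for one level further down the icicle graph, with the root at the top and the innermost sum at the bottom.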

Notice the difference between how parallel work is illustrated in these visualizations of solving our equation.

Since the results of the innermost sums don’t depend on each other, they can be represented as occurring at the same level of the trace. These relationships, however, are generally inferred from the underlying trace data, with some help from metadata that can be attached to each span. This is where the real power of tracing comes into play: the ability to create rich, detail-filled spans that can be correlated and analyzed is what makes distributed tracing such a useful and necessary tool.
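As a rough sketch of how those relationships might be inferred, suppose each span is stored as a flat record carrying the identifier of its parent. The record layout, the one-letter IDs, and the `depths` helper below are all assumptions made for this example. Walking the parent links is enough to recover which spans sit at the same level.

```python
# Flat span records for ((1+1)*(2+2)); the IDs and field names are hypothetical.
spans = [
    {"span_id": "a", "parent_id": None, "name": "solve"},
    {"span_id": "b", "parent_id": "a", "name": "(1+1) * (2+2)"},
    {"span_id": "c", "parent_id": "b", "name": "1 + 1"},
    {"span_id": "d", "parent_id": "b", "name": "2 + 2"},
]


def depths(records):
    """Map each span name to its depth by following parent_id links to the root."""
    by_id = {record["span_id"]: record for record in records}

    def depth(record):
        level = 0
        while record["parent_id"] is not None:
            record = by_id[record["parent_id"]]
            level += 1
        return level

    return {record["name"]: depth(record) for record in records}


levels = depths(spans)
for name, level in levels.items():
    print("  " * level + name)
```

The two inner sums share a parent, so they land at the same depth, which is how a visualization can tell that neither one had to wait on the other.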

So, what’s in a real trace? Well, it depends! Traces can be simple, or amazingly complex. They can show a handful of operations, or thousands. The thing to remember is that a trace is only as complex as the work that it’s representing.

Each trace is built from spans, and each span should represent a distinct unit of work that takes place inside one service and only one service. At a minimum, a span contains several required properties: its name, a span context that uniquely identifies the span, and a start and end timestamp. Spans may also contain attributes, which are user-configurable pieces of metadata that help you categorize, sort, filter and search for spans later. A span can contain events, which are time-stamped messages about things that happened during the span’s lifetime. Spans also contain some span-specific metadata, like the identifier of their parent span or a value to indicate whether the span represents work being done by a client or a server.
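One way to picture those properties is as a small data structure. The sketch below is a simplified model written for this post, not the OpenTelemetry API itself: the class and field names are assumptions, and a real span context carries more than the two identifiers shown.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SpanContext:
    """Identifies a span (simplified to two IDs for this sketch)."""
    trace_id: str
    span_id: str


@dataclass
class Event:
    """A time-stamped message about something that happened during the span."""
    timestamp: float
    message: str


@dataclass
class Span:
    # Required properties.
    name: str
    context: SpanContext
    start_time: float
    end_time: float
    # Span-specific metadata.
    parent_span_id: Optional[str] = None  # None marks a root span
    kind: str = "INTERNAL"                # e.g. "CLIENT" or "SERVER"
    # Optional, user-supplied data.
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)


# Hypothetical values, chosen only to show the shape.
root = Span(
    name="GET /index.html",
    context=SpanContext(trace_id="trace-1", span_id="span-1"),
    start_time=0.0,
    end_time=0.25,
    kind="SERVER",
)
print(root)
```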

OpenTelemetry, Spans, and You

We’re going to use OpenTelemetry as the basis for describing the more technical concepts in this post, and throughout the rest of the series. Keep in mind, though, that while some of this is specific to that project, the core concepts and ideas here are generally transferable to other span-based distributed tracing systems.

Let’s talk about this in slightly more concrete terms by imagining what a span would look like from a web server. We can conceptualize the work of a web server, at a high level, as a program that listens for requests on a given network port, and responds to them with either the requested data or an error message. What would a span look like for a web server? The simplest one would be a single span that represented all of the work being done from the moment that a request was received by the server, to the moment that it completed sending a response to the client. These moments would form the starting and ending timestamps on a span. We would also need to give the span a name. This name should be something that we can use as a “grouping key” later on — it needs to be readable by a human being, and should allow us to group a particular class of spans together. If our web server is looking up files and returning them to the client, then it’d be reasonable to suggest that the requested path might be a good name. With this in mind, what does our basic span look like if we wrote it out in object notation?
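As a rough sketch, written here as a Python dictionary rather than taken from any real tracing library, a basic span for a request to a hypothetical path `/index.html` might look like the following. The field names, identifiers, and timestamps are all made up for illustration.

```python
import json

# A minimal span: a name, a context that identifies it, and two timestamps.
# Every value here is a made-up placeholder.
basic_span = {
    "name": "/index.html",
    "context": {
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
    },
    "start_time": "2020-10-05T12:00:00.000+00:00",
    "end_time": "2020-10-05T12:00:00.120+00:00",
}

print(json.dumps(basic_span, indent=2))
```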

Now, this is only half of the puzzle, so to speak. Having a span is great, but the really interesting question is: “what can I learn from this?” Right now, unfortunately, you wouldn’t learn that much — but more than you might think. Want to know what the most popular URLs are on that server? Well, you could collect this data for a week and then do a frequency analysis on the name. Would you also like to know how long each request took? That’s easy enough to get as well — you can use the start and end timestamps to calculate the duration. What we really need, though, is to add attributes and events in order to be able to ask more interesting questions.
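Both of those questions can be answered with a few lines of code. The sketch below assumes spans shaped like plain dictionaries with a name and ISO 8601 timestamps; the sample data and the `duration_ms` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical collected spans: just a name and start/end timestamps.
spans = [
    {"name": "/index.html",
     "start_time": "2020-10-05T12:00:00.000+00:00",
     "end_time": "2020-10-05T12:00:00.120+00:00"},
    {"name": "/about.html",
     "start_time": "2020-10-05T12:00:01.000+00:00",
     "end_time": "2020-10-05T12:00:01.045+00:00"},
    {"name": "/index.html",
     "start_time": "2020-10-05T12:00:02.000+00:00",
     "end_time": "2020-10-05T12:00:02.080+00:00"},
]


def duration_ms(span):
    """Duration of a span in milliseconds, from its two timestamps."""
    start = datetime.fromisoformat(span["start_time"])
    end = datetime.fromisoformat(span["end_time"])
    return (end - start) / timedelta(milliseconds=1)


# Frequency analysis on the span name: which URLs are requested most often?
popularity = Counter(span["name"] for span in spans)
print(popularity.most_common(1))

# How long did each request take?
for span in spans:
    print(span["name"], duration_ms(span))
```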

An attribute is what it sounds like: a piece of data that characterizes the span it’s a part of. If you think about our example here, there’s a lot of data we might want to know: what was the status code of our response, what was the HTTP method that was requested, how many bytes were transferred in the response, were there any parameters in the query string of the request… and so on, and so forth. Attributes are where this data lives.

Events can be thought of as discrete… well, events, that occur while a span is executing. If our web server happened to encounter an unexpected error, then that would be a good candidate for an event. What if we tried to access a file to return it to a user, but couldn’t due to incorrect permissions? That would also be a good event. Events should be diagnostic data for humans: error messages, stack traces, or simply informative messages about what’s happening in our software.

Let’s add some attributes to our span, and see what it looks like.

A span where everything worked out great:
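The sketch below is one way such a span might be written, again as a Python dictionary. The attribute keys, identifiers, path, and values are illustrative assumptions rather than output from a real server.

```python
# A successful request: the attributes describe what was asked for and what
# was sent back. All keys and values are hypothetical.
ok_span = {
    "name": "/index.html",
    "context": {
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "span_id": "00f067aa0ba902b7",
    },
    "start_time": "2020-10-05T12:00:00.000+00:00",
    "end_time": "2020-10-05T12:00:00.120+00:00",
    "attributes": {
        "http.method": "GET",
        "http.status_code": 200,
        "http.response_bytes": 5120,
        "http.query": "",
    },
    "events": [],  # nothing noteworthy happened
}

print(ok_span["attributes"]["http.status_code"])
```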

A span where things didn’t go great (we tried to access a file that we didn’t have permission to):
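As a comparable sketch for the failure case, the span below records the error both as an attribute and as an event. The 500 status code, the file path, the message text, and every key name are assumptions chosen for illustration.

```python
# A failed request: the status code changes, and an event records why.
# All keys and values are hypothetical.
failed_span = {
    "name": "/reports/q3.pdf",
    "context": {
        "trace_id": "7c1d9a2e5b8f4c03a6e4d1f09b2a7e55",
        "span_id": "3a9f5c1e7b2d4f60",
    },
    "start_time": "2020-10-05T12:00:03.000+00:00",
    "end_time": "2020-10-05T12:00:03.015+00:00",
    "attributes": {
        "http.method": "GET",
        "http.status_code": 500,
        "http.response_bytes": 0,
    },
    "events": [
        {
            "timestamp": "2020-10-05T12:00:03.010+00:00",
            "message": "permission denied while opening /srv/files/reports/q3.pdf",
        },
    ],
}

for event in failed_span["events"]:
    print(event["timestamp"], event["message"])
```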

Now that we’ve added these attributes, our spans become more useful. We could ask questions like, “which successful POST requests were in the slowest five percent?” or “how many 500 errors occurred when someone tried to access this one specific URL route?” As we add more attributes, the complexity of our questions can increase. Attributes become adverbs in the questions that we ask: important modifiers that allow us to surface insights from our trace data.
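Here is a small sketch of what asking those questions could look like over a list of spans held in memory. The data is synthetic, and the attribute keys, the routes, and the `slowest_percent` helper are hypothetical; a real tracing backend would offer its own query tools.

```python
# Synthetic spans: 40 successful POSTs with durations of 1..40 ms,
# plus a few other requests. All names and values are made up.
spans = [
    {"name": "/users", "duration_ms": d,
     "attributes": {"http.method": "POST", "http.status_code": 200}}
    for d in range(1, 41)
]
spans += [
    {"name": "/reports", "duration_ms": 900,
     "attributes": {"http.method": "GET", "http.status_code": 500}},
    {"name": "/reports", "duration_ms": 15,
     "attributes": {"http.method": "GET", "http.status_code": 200}},
    {"name": "/users", "duration_ms": 700,
     "attributes": {"http.method": "POST", "http.status_code": 500}},
]


def slowest_percent(records, percent):
    """Return the slowest `percent` of records (rounded up), slowest first."""
    ranked = sorted(records, key=lambda r: r["duration_ms"], reverse=True)
    keep = max(1, (len(ranked) * percent + 99) // 100)  # integer ceiling
    return ranked[:keep]


# "Which successful POST requests were in the slowest five percent?"
successful_posts = [
    s for s in spans
    if s["attributes"]["http.method"] == "POST"
    and 200 <= s["attributes"]["http.status_code"] < 300
]
slowest = slowest_percent(successful_posts, 5)
print([s["duration_ms"] for s in slowest])

# "How many 500 errors occurred on this one specific route?"
errors_on_reports = sum(
    1 for s in spans
    if s["name"] == "/reports" and s["attributes"]["http.status_code"] == 500
)
print(errors_on_reports)
```

Each attribute used in a filter narrows the question a little further, which is the sense in which attributes act as modifiers.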

Spans, by themselves, aren’t very interesting. Really, a single span can be thought of as simply a structured log statement — it describes a single request in a repeatable, known way. Distributed tracing comes into its own, however, as you create more and more spans, and connect them together by propagating their span context. How does this work, and by what mechanism? How do you add distributed tracing to an existing, or a new, piece of software? We’ll go through these questions in the next part of this series, so stay tuned!

Feature image via Pixabay.
