
OpenTelemetry in Go: It’s All About Context

29 Sep 2020 12:00pm, by Ted Young

LightStep sponsored this post.

Ted is Director of Developer Education at Lightstep and an OpenTelemetry co-founder. He is a member of the OpenTelemetry Governance Committee.

With OpenTelemetry nearing the end of its public beta, it feels like a great time to dive into some of the underlying principles that make OpenTelemetry such a useful tool.

Go has been my main programming language for the last six or so years, and I still really like it. So let’s use Go as our example. As you’ll see, it really is a great language for explaining the fundamental feature of OpenTelemetry: context propagation.

I’ll start by explaining some of the theory, but if you’re a code-first kind of person, feel free to jump down to the tutorial and check that part out first.

What Problem Are We Solving?

OpenTelemetry is a set of observability tools — which we call signals — glued together with context propagation.

How is this new and different? Most traditional observability tools — logs and metrics come to mind — provide ways to record and count individual events. But they don’t provide a great way to contextualize those events; at least, not without a lot of work.

Let me give you an example. My team was working on a distributed scheduler, which was great fun to design, but less fun to debug — especially in production. The problem was logs. We cared deeply about our performance and correctness guarantees, so we had detailed log coverage — even in production. We even tested our logs(!) to ensure that the data we were reporting did not accidentally become decoupled from the events it described. Anyway, you get the point — lots of caring about observability was going on.

Running locally, the logs were manageable. But the really interesting problems only happened at scale under heavy load, and were often rare and hard to reproduce. This is what made them “interesting.”

A Needle in a Stack of Needles

When you have gigabytes of logs and only care about a tiny, specific subset of them, it can be a bit like looking for a needle in a stack of needles. A common pattern is to start with an error message, and from there find some identifiers (request IDs, user IDs, etc.) to use as filters. But in a distributed system, there is rarely a single identifier present at the site of the error that is also present on every log in the transaction as it hops from service to service. You might have a request ID, which would give you the logs from one service. But the hop to the next service would carry a different request ID.

So you could eventually collect all of the logs in the transaction, but it was an ad hoc, iterative process that could be really slow, especially if you were still root-causing the issue and wanted to investigate a number of different transactions as you explored different theories. If you’ve ever debugged a distributed system in this manner, you know how frustrating it can be.

It was clear that having a unified ID stapled to every log in the transaction would make it much easier to filter the logs. But carrying that ID around and passing it from service to service turned out to be a lot of work.

Enter Context Propagation

In order to pass this transaction ID around, you need two pieces. The first piece allows the ID to follow the execution of code within the program. If you’re a Go programmer, you know exactly what that piece is: it’s a Context object!

After all, if you are going to pass one ID around, why not pass around a bag of values? Besides the transaction ID, you could pass around other useful things, such as a deadline, or additional indexes for your logs — project ID, account ID, etc.

The second piece is to propagate the transaction ID from one service to the next, by adding it as metadata in the request.

That approach is the labor-intensive way. You have to know about every value you want to send and hardwire it into every request you make. No fun. OpenTelemetry makes this much easier.

Baggage

The most basic tool in OpenTelemetry is called Baggage. By itself, baggage is not even an observability tool; it simply allows you to propagate your context from service to service. You can think of it like the Go Context, only it flows through every service in your system. We’ll talk about why this is useful when we get to metrics.

All of OpenTelemetry’s observability tools are built on top of this basic principle of context propagation.

Tutorial

Installing OpenTelemetry

OK, so let’s get into some actual observing. Starting from the top, let’s set up OpenTelemetry and discuss what it does. OpenTelemetry is a large framework with a lot of options. To get started quickly, we wrote a handy wrapper called otel-launcher.
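Setup with the launcher looks roughly like this (a sketch; the service name and access token are placeholders, and the available options may differ between launcher versions):

```go
package main

import (
	"github.com/lightstep/otel-launcher-go/launcher"
)

func main() {
	// ConfigureOpentelemetry wires up the tracing SDK, exporter,
	// and propagators with sensible defaults.
	otel := launcher.ConfigureOpentelemetry(
		launcher.WithServiceName("my-service"),
		launcher.WithAccessToken("your-access-token"),
	)
	// Flush any buffered telemetry before the process exits.
	defer otel.Shutdown()
}
```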

Add Instrumentation Libraries for Critical Packages

OpenTelemetry comes with instrumentation for a variety of libraries and frameworks. By adding instrumentation to your HTTP server or framework (ingress), HTTP and other clients (egress), and flowing Context through your application, you can get enough detail to get started with tracing in production. (In some languages these instrumentation libraries are installed automatically, but in Go they must be installed by hand.)

Currently available instrumentation can be found here. Don’t see your favorite framework or library? Consider writing an instrumentation plugin and contributing back!

Describing Your Service with Resources

The first critical piece of context is the service itself. It’s important to know where spans are coming from. In OpenTelemetry, Resources are used for describing your services. Every event which happens within the service will automatically be contextualized by these properties.

Describing Your Transactions with Tracing

Tracing is the meat and potatoes of OpenTelemetry. This is where context propagation really starts to shine. You can think of it like a big stack trace that includes every event in your transaction.

Every event occurs in a span, which represents an operation. Spans in turn are linked together to form a trace. The trace provides a TraceID to index every event in the transaction. Spans have attributes, which provide more indexes. In addition, spans automatically measure the length of time the operation took to complete.

So, that was a span. Let’s look at recording some actual events:

Conclusion

That’s all folks! Hope you enjoyed this intro. If you want to dig deeper into OTel, check out the new in-depth getting started guides I’ve been working on.

Feature image via Pixabay.
