The Takeaways from Our OpenTelemetry Implementation Journey
It’s easy to get excited about OpenTelemetry, the open observability framework that engineers use to collect logs, metrics and traces from virtually any type of application.
But things get tricky when you start to actually implement it. That’s not because OpenTelemetry is particularly hard to work with — in many ways, it’s a more powerful toolset than what we’ve had before — but because it’s easy to overlook important details when adopting it for your team.
Some of those details are in the technical realm; they have to do with how OpenTelemetry is configured and operates. Others are what you might call cultural. They involve the way your engineers engage with and think about the tools.
Our experience implementing OpenTelemetry at Equinix Metal offers some valuable insights in both of these categories. Below, I’ll share some of the big takeaways from our journey and tips for making your transition to OpenTelemetry as successful as possible.
Tip 1: Decide Why You Need OpenTelemetry
For starters, make sure you establish clear goals for why you are using OpenTelemetry in the first place.
This may not seem hard to do. It’s an easy framework to love. It’s free. It’s open source. It works with the major observability tools. And it doesn’t lock you into any one vendor’s code, which is especially important for integrating with open source tools and frameworks like Tinkerbell.
Still, it’s important to make sure your OpenTelemetry journey is purposeful. Don’t just decide to set up the tool because others are doing it. Specify what you want to get out of it.
We began implementing OpenTelemetry at Equinix Metal even before the framework had reached 1.0 status. We had two reasons for embracing it. The first and most obvious was that we didn’t have an existing solution in place for tracing, and OpenTelemetry was a natural way to fill that gap. (Currently, we’re using OpenTelemetry only for tracing; we use different solutions for logging and metrics collection.)
But the second, deeper goal was that we wanted to become early adopters of the tool so we could help it grow. We believe in the mission of the Cloud Native Computing Foundation (CNCF), which incubates OpenTelemetry. We also believe that an open standards-based approach to telemetry offers great promise for our customers, and we wanted to get in on the ground floor of helping to drive broader adoption of the tool.
Tip 2: Don’t Expect Overnight Success
I’d love to say that within hours of implementing OpenTelemetry, engineers from across Metal were shouting from the rooftops about their amazing new observability insights.
That, however, was not the case. Our OpenTelemetry implementation story began with failure. When we first added the framework to the Metal API, which we use to manage our infrastructure, we began seeing weird bugs. That led to pushback against the tool from some of our engineers, who worried not only that OpenTelemetry was not yielding the grand observability insights it was supposed to deliver, but that it was actually creating new reliability challenges.
In fact, the rollout was so rough that after about a year we decided to disable OpenTelemetry. We then went on vacation (this was right around our end-of-2021 holiday shutdown) and came back reenergized and with fresh minds.
The reset allowed us to identify the root cause of our OpenTelemetry woes. It turned out to be poorly designed parallel threading code that had been added to the Metal API long ago. OpenTelemetry wasn’t actually the source of the weird issues we were experiencing with the API. They had been happening for a while, but OpenTelemetry made them worse because it increased the number of parallel threads being run.
Once we realized that, the solution was obvious. We ripped out the buggy code and were then able to run OpenTelemetry with aplomb.
The lesson here is that if your rollout doesn’t lead to instant success (and it may well not), persevere. It takes time to get buy-in for new initiatives like OpenTelemetry, and you may run into hiccups, or worse. But if you give up at the first roadblock, you’ll miss out on a lot of long-term value.
Tip 3: Use OpenTelemetry Auto-Instrumentation
One of OpenTelemetry’s coolest features is its set of auto-instrumentation libraries. Instead of writing all your own code to generate traces, you can load these libraries, which reach into the codebase and wrap the instrumentation code around your code in a mostly automated way.
Taking advantage of auto-instrumentation is an obvious way to speed OpenTelemetry adoption. But if you haven’t worked with OpenTelemetry before, you may not think to take advantage of this feature. Or, you may believe that it won’t work well enough to save you from a lot of manual coding. (Hint: It will!)
We explain in more detail how we used auto-instrumentation in this blog post, but the main point is to make sure you don’t waste lots of time writing your own instrumentation code when you don’t have to.
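To make the idea of “wrapping instrumentation code around your code” concrete, here is a toy sketch in plain Python. It is not OpenTelemetry’s implementation or API, and it is not what we run at Metal; `traced`, `auto_instrument`, `FINISHED_SPANS` and `DeviceAPI` are hypothetical names invented for illustration. It only shows the general pattern: a library patches existing functions so that each call records a span-like entry, without the application code changing.

```python
import functools
import time

# Hypothetical stand-in for a trace backend: finished "spans" land in this list.
FINISHED_SPANS = []


def traced(func):
    """Wrap func so every call records a span-like dict (name, duration, error)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Remember the failure type, then let the exception propagate.
            error = type(exc).__name__
            raise
        finally:
            FINISHED_SPANS.append({
                "name": func.__qualname__,
                "duration_s": time.perf_counter() - start,
                "error": error,
            })
    return wrapper


def auto_instrument(target, method_names):
    """Patch the named callables on target (a class or module) in place."""
    for name in method_names:
        setattr(target, name, traced(getattr(target, name)))


class DeviceAPI:
    """Hypothetical application code that knows nothing about tracing."""

    def list_devices(self):
        return ["device-a", "device-b"]


# The "library" reaches in and wraps the method; DeviceAPI itself is unchanged.
auto_instrument(DeviceAPI, ["list_devices"])
devices = DeviceAPI().list_devices()
```

After the call, `FINISHED_SPANS` holds one entry for `DeviceAPI.list_devices`. Real auto-instrumentation libraries do far more (context propagation, framework hooks, exporters), but the core trick of wrapping calls you did not write tracing code for is similar in spirit.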
Tip 4: Create Custom Instrumentations
The tip above doesn’t mean you’ll never need to tweak your instrumentation code.
On the contrary, it’s important to take full advantage of OpenTelemetry’s custom instrumentation functionality. Custom instrumentation means you can configure your traces to pull out specific contextual information that matters for your business.
At Metal, we configured custom instrumentations to collect details like hardware specs and operating system data associated with the requests we trace. This context makes the traces more meaningful and actionable.
Your contextual needs will be different because your business is different. And that’s the point: Use custom instrumentation to collect whichever details you need to make observability data as meaningful as possible for your team and organization.
Tip 5: Use Tracing to Upgrade Your Mental Models
It’s one thing to collect traces (or logs, or metrics) with OpenTelemetry. It’s another to collect, link and share that data in a way that ensures engineers can actually benefit from it.
That’s why we embraced the concept of “mental models” when working with the observability data that OpenTelemetry gives us. Mental models are structured ways of thinking about how the data actually relates to the system being observed.
When you have good mental models, your engineers can answer questions like: How does a trace from one service relate to a trace from another? (Configuring trace propagation helps here, too.) Which data am I seeing that I might otherwise not even think to collect? How can observability data help me be proactive by enhancing reliability instead of simply reacting to the next incident?
A good way of confirming that your mental models are sound is asking engineers to explain traces back to you. In other words, ask them to interpret a trace and explain what it tells them about the system or service it’s from.
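Trace propagation is what lets a trace from one service line up with a trace from another. One common mechanism is the W3C Trace Context `traceparent` HTTP header, which OpenTelemetry supports. The sketch below hand-rolls that header with the standard library purely to show what travels between services; the function names are hypothetical, and in practice you would rely on OpenTelemetry’s own propagators rather than code like this.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex parent span ID,
# 2-hex trace flags, joined by dashes.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_trace_id():
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: read the caller's trace context, or None if absent/invalid."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context spec.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Service A starts a trace and calls service B.
trace_id, span_a = new_trace_id(), new_span_id()
outgoing_headers = inject({}, trace_id, span_a)

# Service B continues the same trace with a new span whose parent is span A.
context = extract(outgoing_headers)
span_b = {
    "trace_id": context["trace_id"],
    "span_id": new_span_id(),
    "parent_span_id": context["parent_span_id"],
}
```

Because both spans share one trace ID and service B’s span names service A’s span as its parent, a tracing backend can stitch them into a single picture. That shared picture is what engineers are reading when you ask them to explain a trace back to you.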
Tip 6: Observe Your Observability
Finally, it’s critical to ensure that you have a process in place for validating your OpenTelemetry and observability investments. Put another way, you should observe your observability.
We do this at Metal by performing retrospectives on every incident we encounter. This approach allows us to ask why we didn’t catch the problem before it turned into an incident. Which data did we fail to collect, link or share appropriately, or otherwise fail to leverage in the way that we could have? What can we do to make sure it doesn’t happen again? And what went well, and why?
Getting people together to learn from incidents is the most powerful tool site reliability engineers have to change our organizations. Great observability is essential to helping engineers understand the incidents and learn from them.
We love OpenTelemetry — it’s already helped us drive some meaningful user experience improvements — and we’re confident you’ll love it, too. But don’t expect to drag-and-drop auto-instrumentation libraries into your APIs and call it a day. Instead, step back and make a plan for how to get the most out of OpenTelemetry based on your team’s unique needs. Think about your custom instrumentation requirements, for example, and how your engineers will actually put OpenTelemetry insights to use.
We want you to do this not just because we care about you, but also because we really believe that OpenTelemetry creates opportunities for effective, proactive reliability operations. We want to see more teams leveraging OpenTelemetry to its full effect — even if it’s not smooth sailing at every juncture.