Distributed Tracing Is a Hassle, Here’s Why
Why is distributed tracing such a hassle? The question came up on Twitter a few weeks ago. And since then, the conversation has continued.
Distributed tracing is not super accepted in this heterogeneous software world. It can take down production. Oh shiiiiiiit.
But it seems like everyone is having a go at it. Distributed tracing is hot!
There are those who wish the future would get here faster. In other words, a world in which distributed tracing is just part of the workflow, used with ease. But we know it’s not so simple — yet.
So when I talked to the team at Rookout this week, I asked about the developer experience with distributed tracing. They reminded me that they have a bit of news on the topic, which seems pertinent considering my interest in what the hell developer experience means, anyway, up and down the stack.
I understand distributed tracing is a mess because there was OpenCensus, OpenTracing, etc., and now we’re all moving towards OpenTelemetry.
but please, have a page that says “this tag makes that UI bit appear” because I love the UI bits and I want them all. thanks
— fasterthanlime 🌌 (@fasterthanlime) June 23, 2021
Rookout frames distributed tracing as a problem rooted in how cloud native microservices are debugged today. The cloud native approach often leads developers back to more traditional log-based approaches. But it gets complicated pretty fast. Step-by-step debugging is just not made for distributed, at-scale architectures, argues Rookout product manager Oded Keret. Logging costs increase and performance suffers. The Rookout team argues that log-based debugging means adding a line of logging code, waiting for a new release, and wasting precious development time while issue resolution is delayed.
One line to rule them all? Hardly. A student of Kubernetes will look at that option and scratch their head. It’s a dynamic world. There are containers and pods, and one debugging tool cannot rule us all. Serverless, too, the student wonders. It’s like a flying circus out there with the distributed performers, here one moment, gone the next.
What comes from debugging is the need to see more. And thus over the years, we have seen the rise of application performance monitoring (APM). But a funny thing happened on the way to the distributed circus. The tools became just windows into more complexity. What the monitoring provided could often make things more confusing. The developer watching the flying circus might not be able to monitor everything. What happened to the invisible flying elephants, a developer might ask. It doesn’t help that complete and total monitoring is no more a reality than elephants that fly and cavort in the ephemeral space. Better to imagine them and bring them back to reality, all those pink darlings.
But, wait — that’s when the student might say — distributed tracing can help me better track those free spirits running through the machines.
One time I added distributed tracing to the front end (“surely this won’t impact anything, it’s reporting not a feature!”), and it killed all requests because the tracing header wasn’t allowlisted for CORS https://t.co/9WT8ea0W8G
— Drew Petersen (@KirbySaysHi) June 18, 2021
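The failure in that tweet is easy to reproduce on paper. Before sending a cross-origin request that carries a custom header, the browser asks the server which headers it allows, and it blocks the request if any are missing from the answer. What follows is a minimal sketch of that check, not the code behind the tweet: the function name, the allowlist contents, and the choice of `traceparent` (the header name defined by W3C Trace Context) are all assumptions for illustration.

```python
# Hypothetical allowlist, as a backend might return it in the
# Access-Control-Allow-Headers response header before tracing was added.
ALLOWED_BEFORE = "Content-Type, Authorization"


def blocked_headers(requested, allow_header):
    """Return the requested headers that the allowlist does not cover.

    Header names are compared case-insensitively. This sketch ignores the
    "*" wildcard and the browser's built-in safelisted headers.
    """
    allowed = {name.strip().lower() for name in allow_header.split(",")}
    return [name for name in requested if name.strip().lower() not in allowed]


# The front end now attaches a trace header to every request (assumed name).
requested = ["Content-Type", "traceparent"]

# Before the allowlist is updated, the trace header is what gets reported.
print(blocked_headers(requested, ALLOWED_BEFORE))

# After adding the header to the allowlist, nothing is reported.
print(blocked_headers(requested, ALLOWED_BEFORE + ", traceparent"))
```

A non-empty result is the situation the tweet describes: the "reporting only" header is the one thing the server never agreed to accept, so every request fails.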
Keret and the Rookout team argue that tracing is a complement to logging and monitoring. The real need is to view the internal state of the application in real time. Keret’s perspective brings the conversation into the world of observability, arguably one of the most important developments in at-scale computing of the past few years. Pioneered by the good people at Honeycomb, observability is now all the talk. Rookout, New Relic, Grafana, and young companies like Akita are all in the game, as are many others such as Lightstep, CA and Thundra.
(Have I forgotten anyone? That would be a yes. Write a post for us and tell us you are out there. Make it readable and relevant, please. The editorial team will kick my ass if we get a bunch of promotional, me-too stories. Challenge our readers, get them engaged, and explain, please.)
Open source projects help move the needle for that developer out there who is just trying to keep track of the elephants. The OpenTelemetry project is one of the most popular projects across the Cloud Native Computing Foundation, second only to Kubernetes, Keret said.
Rookout’s take on observability is in many ways about measuring time, a theme that surfaces as well in a post by Akita’s founder, Jean Yang. Rookout has brought distributed tracing into debugging sessions with its new tracing timeline. It offers OpenTelemetry users a way to see the internal relations between different cloud-based microservices. It’s a way to troubleshoot, meant for developers who may be using tracing tools such as Jaeger, Zipkin or Lightstep.
The Rookout team argues that the next dimension to this story is understandability: the ability to combine tracing information with code-level, context-specific debug data to give developers insight into application behavior. How did the elephants end up in Topeka, anyways? Maybe they went there for the food? The food looks nice and spicy at Monsoon Indian Grill.
Still, how and why the elephants flew into Kansas is a question to ponder in the land of Oz.
Tracing has two distinct functions:
1. Keeping track of transaction information globally — something you want in every large (even monolithic) system.
2. Powerful yet complex API, wire format, and visualization.
The second is often an overkill, and agility would serve you better!
— Liran Haimovitch (@Liran_Last) June 24, 2021
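The first function in that tweet, keeping track of transaction information globally, does not need a full tracing stack to get started. As a rough sketch under stated assumptions, one ID per request can be carried in a context variable and stamped onto every log record using only the Python standard library. The names `current_transaction`, `TransactionFilter` and `handle_request` are hypothetical and do not come from Rookout, OpenTelemetry or any other tool mentioned here.

```python
import contextvars
import logging
import uuid

# Hypothetical name: holds the ID of the transaction currently being handled.
current_transaction = contextvars.ContextVar("current_transaction", default="-")


class TransactionFilter(logging.Filter):
    """Stamp every log record with the current transaction ID."""

    def filter(self, record):
        record.transaction_id = current_transaction.get()
        return True


def handle_request(logger, incoming_id=None):
    """Reuse the caller's ID if one arrived with the request, else mint one."""
    token = current_transaction.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        return current_transaction.get()
    finally:
        # Restore the previous value so IDs never leak between requests.
        current_transaction.reset(token)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(TransactionFilter())
    handler.setFormatter(logging.Formatter("%(transaction_id)s %(message)s"))
    log = logging.getLogger("demo")
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    handle_request(log, incoming_id="abc123")
```

Passing the same ID along to downstream services is where the wire format, the API and the visualization enter the picture, which is the second and heavier function the tweet describes.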
Akita Software founder Jean Yang writes on the Akita blog that time spent writing software has turned into time spent operating software. Much of that time is now spent “divining monitoring graphs.”
What’s dangerous is what comes of the demand for the quick fix, that pursuit of the silver bullet that Yang discusses in her post.
In the end, it’s about how the tools fit into the flow, or into the slipstream, as my friend Tyler Jewell would say. That’s what matters. And then it’s just about what you are trying to do.
Are you looking for tools that allow for abstractions? Or are you looking for complexity-revealing tools?
In the end, only time will tell how distributed tracing fits into the developer’s slipstream. Until then, the Monsoon Grill is now definitely a restaurant for my bucket list.