This is part of a series on Distributed Tracing. For a list of other articles in this series, check out the introductory post.
Lightstep sponsored this post.
This basic loop encompasses a lot of what we do as developers trying to understand our systems. You do something, see what happens, gather more information, and try to do it again after tweaking some things. Most systems will behave in a similar way. If I’m calling some external service and it doesn’t respond, my service should try again after a few moments. If I’m trying to write a value to a database, I shouldn’t give up if it fails — I should try again, depending on why it failed. That concept of why, though, is important. I need to have context for what happened, in order to determine how to try again. These feedback loops are vital for everything we do in life! If I’m trying to get a large piece of furniture through a door, it’s not going to do a lot of good if I simply keep pushing when it gets stuck. I need to evaluate why it’s stuck, then change what I’m doing. Blindly pushing away just gets me a ruined couch, a broken door, and back pain.
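To make the "try again, depending on why it failed" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `TransientError`, `PermanentError`, and `call_with_retry` are hypothetical names, not part of any particular library, and the doubling delay is just one common way to wait "a few moments" between attempts.

```python
import time


class TransientError(Exception):
    """Hypothetical: the failure may clear up on its own (e.g. no response)."""


class PermanentError(Exception):
    """Hypothetical: retrying cannot help (e.g. the request itself is invalid)."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying only when the reason for failure suggests it might help."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except PermanentError:
            # Pushing harder on a stuck couch: retrying will not change the outcome.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
            # Wait a little longer after each failed attempt before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

The important part is not the loop itself but the two `except` branches: without knowing why the call failed, the code has no way to choose between them.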
In the previous installment of this series, I talked about the rise of distributed systems, and some of the reasons they’ve become so popular. Now, let’s think about them in the context of what we just discussed: the notion of feedback loops. When I’m developing a small application, my feedback loop is very short. I can make a change to a line of code, recompile and re-run the software, and immediately see what changed. This is pretty useful, to say the least, when it comes to not only understanding my software but also improving it. The connection between a change and the result of that change is extremely obvious and easy to quantify. Imagine, though, a distributed system where I can only change small parts of it at once. My changes may be a relative drop in the ocean, but even a small drop can ripple outwards and eventually become a mighty wave.
Consider the following scenario: I have a service that receives some data (it doesn’t really matter what kind). Maybe I’m taking a value and re-encoding or reformatting it as part of an integration with a new source of customer data. The service gets deployed, and all’s right with the world. But one day, a ticket comes in. The data format is being changed, so I need to change how I convert it. No problem: a couple of lines of code, a couple of test cases and let’s even assume we gracefully handle both the old and new data formats. Heck, it’s a Friday, let’s go ahead and deploy it — what’s the worst that can happen, right? I push my code, the PR merges, and I knock off for a relaxing weekend of wistfully looking out the window and watching YouTube, remembering the times before Covid-19.
I did everything right in this scenario, didn’t I? In isolation, of course I did. I wrote test cases, I defensively programmed, I made sure everything worked in staging, I had someone review my code, I double-checked the specifications and the documentation… everything should be fine. Except, what if it’s not fine? What if the new conversion logic is slower, even just slightly, than the old code? What if my defensive programming, the checks that make sure I can convert data in both the old and the new format, has added latency to the critical path of my application? And, depending on where my service is being called from, what happens when those other services get backed up? The extra milliseconds don’t seem noticeable at first, but maybe they do matter… and let’s say they do. Suddenly, an older service, four or five hops away from mine, starts timing out because of the additional latency I added. Those timeouts cause rippling failures, as other services begin to time out as well, or even begin to fail with errors and crash. These timeouts and crashes eventually lead to unexpectedly high load on the primary database for the application, which starts to fall over, and suddenly my small change has caused a complete outage. Whoops!
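The arithmetic behind this kind of ripple is easy to sketch. The example below is purely illustrative: the service names, processing times, and timeouts are invented numbers, and `callers_that_time_out` is a hypothetical helper that assumes each service simply waits for the one below it and then does its own work.

```python
def callers_that_time_out(chain, extra_ms=0):
    """Return the services whose downstream call takes longer than their timeout.

    chain is ordered from the outermost caller to the innermost service. Each
    entry is (name, own_ms, timeout_ms), where timeout_ms is how long that
    service waits for the next one in the chain (None for the innermost one).
    extra_ms is latency added to the innermost service only.
    """
    timed_out = []
    downstream_ms = 0  # how long the service below the current one takes to respond
    for i, (name, own_ms, timeout_ms) in enumerate(reversed(chain)):
        if i == 0:
            own_ms += extra_ms  # my "harmless" change lands here
        elif downstream_ms > timeout_ms:
            timed_out.append(name)
        downstream_ms += own_ms
    return timed_out


# Invented numbers: my converter sits at the bottom of a four-service chain.
CHAIN = [
    ("legacy-frontend", 20, 100),
    ("orders", 15, 150),
    ("enrichment", 30, 200),
    ("converter", 40, None),
]

print(callers_that_time_out(CHAIN))               # []
print(callers_that_time_out(CHAIN, extra_ms=20))  # ['legacy-frontend']
```

With these made-up numbers, 20 extra milliseconds in the converter leave its direct caller well within its timeout. The first service to fail is the legacy frontend three hops away, because it has the tightest timeout and sees the sum of everything beneath it.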
Alright, let’s pause. How do we fix this? Well, that’s actually a tricky question to answer! These sorts of systemic failures can be solved in many different ways, and that’s one of the things that makes them so challenging to tackle. Do we address the direct cause: our service deployment? Perhaps, and you could even say that it’s the root cause of the outage. But really, is it? A lot of other things had to go wrong, after all, to cause the entire system to fall over. A bunch of other services, for example, started to time out and fail; that’s a cause. Legacy services started to crash because they couldn’t handle the timeouts; that’s also a cause. When those services restarted, it caused additional load on the database, which caused the database to fail: also a cause. We can pull the camera out a bit too, though. The reason we deployed a new version of our service in the first place was a request for a change in data formats, which caused this whole rigmarole to begin with. We could zoom out even further, if we wanted to. Why did the data format change? Was that avoidable? It’s important to understand that nothing is just a technical problem, just a computer problem. Everything starts, and ends, with people.
Diagnosing and solving these problems can be challenging! As you can hopefully see, it’s not enough to just have data about what’s going on. You need the context of why things are happening. You need the ability to work in reverse — from effect, to cause — and the ability to understand how services in your system are connected together, and how changes in one part of the system can affect other services, even those that aren’t immediately upstream or downstream of it. Generating this data can be hard — you need something that is capable of being integrated into a variety of services, each of which could be deployed in a distinct way, running in data centers around the world. You’d then need a way to collect this information and display it in a human-readable form, and build tools to help you interpret it by allowing you to search and query it.
Now, thankfully, there are a lot of great solutions to these problems. As I mentioned in the last part of this series, quite a few tools have been developed over the years to monitor services, collect diagnostic information about them, and allow you to try and puzzle out what’s wrong. One common problem people run into, though, is that they all work a little bit differently across different languages, runtimes, and deployment strategies. Most critical, though, is the lack of context. It’s this specific issue that distributed tracing addresses. If you think back to the beginning of this piece, figuring out why something happened was the critical part of our problem-solving feedback loop. Distributed tracing gives you the why, by giving you request-level visibility into what’s happening in your system. It’s not a panacea, and it doesn’t solve all your problems on its own, but it’s the glue that binds together the diagnostic data you receive from your services. How does it work, though, and what is a distributed trace, really? In the next part of this series, we’ll dive into the technical details of OpenTelemetry and explain exactly what a trace is.