Using Distributed Tracing to Get Better Coverage with Chaos Engineering
Everyone knows that chaos engineering is a great way to improve the resilience of software applications. By offering an opportunity to practice handling failures in a safe way, chaos helps developers build more robust code and gain confidence in their ability to respond to failures. Distributed tracing can support this by helping to find root causes and other contributing factors: traces encode causality within a distributed system, so you can use them to follow failures as they propagate across services. But tracing can do more than just help you respond to incidents — both real and staged. It is also a tool that can help you better understand your application before incidents occur and therefore plan more effective chaos experiments.
Modern distributed tracing has its roots in work done at Google (including by one of my co-founders). Google’s internal Dapper tool is used to debug applications in production, plan optimizations, and hold teams accountable for performance. Distributed tracing forms the backbone of observability by supporting the ability to navigate from effect back to cause. You may hear that observability is about “being able to ask any question” — and to an extent, that’s true, because you should certainly be thinking about how to look at telemetry from many different angles. But in my mind, there’s only one question that matters: what caused that change?
When debugging failures during game day exercises or automated chaos attacks, distributed tracing can show you what other changes in the software are correlated with those failures. For example, perhaps other services were slower to respond (possibly missing their SLOs) or traffic was routed to a secondary data center (increasing network costs). It is important not just to look at individual requests but also to consider statistically significant sets of requests. It’s also important to analyze not only requests that were the target of an attack, but also requests that were not part of the attack. That way, you can understand what’s normal and what’s not. In any distributed system, many things are going wrong all the time — although most of them don’t matter or are being handled gracefully. So being able to understand how application behavior is actually changing during an attack is critical for finding contributing factors quickly.
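As a rough sketch of that kind of comparison (the trace records and the attack flag here are made up; a real tracing backend would supply this data), you might partition traces by whether they touched an attacked service and compare latency percentiles between the two groups:

```python
from statistics import quantiles

# Hypothetical trace records: each has a total duration (ms) and a flag
# indicating whether any span in the trace was a fault-injection target.
traces = [
    {"duration_ms": 120, "hit_by_attack": False},
    {"duration_ms": 135, "hit_by_attack": False},
    {"duration_ms": 110, "hit_by_attack": False},
    {"duration_ms": 480, "hit_by_attack": True},
    {"duration_ms": 510, "hit_by_attack": True},
    {"duration_ms": 125, "hit_by_attack": False},
]

def p95(durations):
    # quantiles with n=20 yields 19 cut points; index 18 is the 95th percentile.
    return quantiles(durations, n=20)[18] if len(durations) > 1 else durations[0]

attacked = [t["duration_ms"] for t in traces if t["hit_by_attack"]]
baseline = [t["duration_ms"] for t in traces if not t["hit_by_attack"]]

print(f"baseline p95: {p95(baseline):.0f} ms, attacked p95: {p95(attacked):.0f} ms")
```

Comparing distributions like this, rather than eyeballing single traces, is what makes it possible to tell a real regression apart from the background noise of things that are always going wrong.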
But how do you know that your chaos attacks are the right attacks? How do you prioritize? And how do you know when you’ve done enough? As with other kinds of testing, you can think about test coverage as a way to measure progress in your chaos exercises. For unit testing, we’d measure how many lines of code are exercised by at least one test. But what’s the right way of measuring coverage in chaos engineering?
Just as in monitoring, looking at your infrastructure can help you form a baseline for chaos exercises. Your first take at monitoring probably answered questions like these: Are your hosts up? Are any processes crashing? So coverage meant making sure that every host had a monitoring agent and that every process was reporting crashes. As you start with chaos engineering, you’ll probably want to think about which hosts and which services are being targeted by attacks.
But just like in monitoring, you really need to think about how the application is behaving from a user’s perspective. So just like you are measuring performance of, say, logging into the app separately from completing a purchase, you should be thinking about how to inject faults in each of those interactions. To an extent, you’ll get some coverage by just ensuring that you’re injecting faults in every service. But as your organization starts adopting practices like incremental rollouts (for example, canaries or blue-green deployments) or feature experiments, using services as a way of defining coverage will lead to, well, some gaps.
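One way to make that concrete (the operation names and attack log below are purely illustrative) is to treat every user-facing operation seen in your tracing data as something to cover, and compute the fraction that at least one chaos attack has exercised:

```python
# Hypothetical user-facing operations observed in tracing data,
# keyed by (service, operation).
observed_operations = {
    ("frontend", "POST /login"),
    ("frontend", "POST /checkout"),
    ("frontend", "GET /catalog"),
    ("frontend", "GET /recommendations"),
}

# Operations that at least one chaos attack has exercised so far.
attacked_operations = {
    ("frontend", "POST /login"),
    ("frontend", "GET /catalog"),
}

coverage = len(observed_operations & attacked_operations) / len(observed_operations)
uncovered = observed_operations - attacked_operations

print(f"chaos coverage: {coverage:.0%}")
for service, op in sorted(uncovered):
    print(f"not yet covered: {service} {op}")
```

The point of the set difference is that it hands you a to-do list: each uncovered interaction is a candidate for your next attack, which is a much more user-centric target than "inject a fault into every service."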
For example, say you’ve rolled out a canary for one service and are also injecting faults into another, upstream service. If the point of a canary is to understand how new code will behave in production, then certainly you’ll want to know how that new code will behave in the face of upstream failures. But if the canary is handling only a small percentage of requests, chances are that it won’t be exposed to that failure. Obviously, you could add instances to the canary or increase the number of faults that you are injecting. But how can you get good coverage while both leveraging deployment best practices (like incremental rollouts and feature experiments) and also limiting the size of chaos attacks? Distributed tracing can tell you whether or not you got lucky in subjecting canary instances to any upstream attacks, but what I’m really excited about is using distributed tracing along with application-level fault injection (ALFI) to more precisely — and more effectively — target attacks.
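A quick back-of-the-envelope calculation (with made-up numbers, and assuming routing and fault injection are independent and uniformly random) shows why luck is a poor strategy here:

```python
# How often does a small canary actually meet a small attack?
total_requests = 10_000   # requests during the attack window
canary_fraction = 0.05    # 5% of traffic is routed to the canary
fault_fraction = 0.01     # faults injected into 1% of upstream requests

# Expected number of requests that both hit the canary and see a fault.
expected_canary_faults = total_requests * canary_fraction * fault_fraction
print(f"expected canary requests that see a fault: {expected_canary_faults:.0f}")
```

Out of ten thousand requests, only about five exercise the combination you actually care about — likely too few to draw any conclusion about how the new code handles upstream failures.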
ALFI lets you use information about how the application is processing requests to target extremely fine-grained attacks. Rather than black-holing traffic for an entire process, ALFI lets you drop requests to specific endpoints, or even only certain types of requests to those endpoints.
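A minimal sketch of the idea, assuming a hypothetical request-wrapping layer (the endpoint names, rule shape, and `FaultInjector` class are all inventions for illustration, not any particular vendor's API):

```python
import random

# Application-level fault injection: instead of killing a whole process,
# wrap a single handler and fail only the requests that match a rule.
class FaultInjector:
    def __init__(self):
        self.rules = {}  # endpoint -> (predicate, failure_rate)

    def attack(self, endpoint, predicate, failure_rate):
        # Register (or replace) a fault-injection rule for one endpoint.
        self.rules[endpoint] = (predicate, failure_rate)

    def wrap(self, endpoint, handler):
        # Rules are looked up per call, so attacks can be started and
        # stopped without re-wrapping the handler.
        def wrapped(request):
            rule = self.rules.get(endpoint)
            if rule:
                predicate, rate = rule
                if predicate(request) and random.random() < rate:
                    raise TimeoutError(f"injected fault on {endpoint}")
            return handler(request)
        return wrapped

injector = FaultInjector()
# Fail 50% of checkout requests, but only for a hypothetical "beta" cohort.
injector.attack("POST /checkout",
                predicate=lambda req: req.get("cohort") == "beta",
                failure_rate=0.5)

checkout = injector.wrap("POST /checkout", lambda req: "ok")
```

The predicate is what makes this "application-level": the blast radius is defined by request attributes the application understands, not by which host or process happens to be running.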
Put this together with distributed tracing, and now you can target attacks based on what’s happening across the application. Using tracing, for example, you can inject faults in upstream services based on whether or not the request came from (or even passed through) a canary instance. That way you can ensure that you are getting good coverage — and not just hope that attacks are targeting both new and old versions — without increasing the number of faults.
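Sketched in code, under the assumption that some propagated trace context (for example, a baggage-style entry) records whether the request passed through a canary — the header name and helper below are hypothetical, not a real tracing API:

```python
import random

# Inject faults upstream only for requests whose propagated trace context
# says they passed through a canary instance.
def should_inject(headers, failure_rate=1.0):
    passed_through_canary = headers.get("baggage-canary") == "true"
    return passed_through_canary and random.random() < failure_rate

def handle_upstream(request):
    if should_inject(request["headers"]):
        raise ConnectionError("injected fault for canary-tagged request")
    return {"status": 200}
```

Because the targeting decision rides along with the trace context, every canary-tagged request can be subjected to the attack while ordinary traffic is left entirely alone — coverage goes up and the blast radius goes down at the same time.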
Both distributed tracing and chaos engineering have a lot to offer on their own, but when used in tandem they offer an especially powerful tool for building resilient systems. Together distributed tracing and ALFI can increase confidence in the efficacy of your chaos exercises, while at the same time reducing the blast radius of attacks. Tracing can show you how good your chaos coverage currently is, as well as help you create attacks to increase that coverage without increasing the risk of those failures affecting end users.