Trace-Based Testing: The Next Step in Observability
Observability is one of the key constituents of cloud native technology for software engineers. In fact, the Cloud Native Computing Foundation Charter states that cloud native technologies “enable loosely coupled systems that are resilient, manageable and observable.” But why stop at simply observing system behavior?
Modern microservices and distributed architectures are deeply complex. Hundreds of services can be involved in a single flow. Applications can be written in multiple languages and might be linked to a number of backend data stores. Add to this multiple teams working on different applications, often across continents. In these inherently difficult architectures, uncovering the root of an issue is particularly challenging.
That’s why observability tools that enable distributed tracing have become so incredibly important in recent years — enabling engineers to understand the flow of services and build a picture of system behavior and performance. This information is ideal for integration testing. If the test you are writing has access to a trace telling you exactly what happened and when, you can take a huge chunk of work out of building integration tests by leveraging this data.
Trace-based testing is the next step for observability. It’s a method that lets you specify exactly which transaction you want to test and what the results should be, based on observed system behavior. In essence, you can validate the dependent relationships between components that you’d usually only see once code is pushed to production — and proactively test for potential issues, rather than scrambling to fix failures.
How Did Trace-Based Testing Get Started?
More than a decade ago, in 2010, Google published “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure.” Modern tracing systems, such as OpenTelemetry Tracing, can trace (pun intended) their origins to Dapper. We recently wrote about the history of OpenTelemetry tracing, if you want further background. Google used the project internally for two years before publishing the paper and reported that “Dapper’s foremost measure of success has been its usefulness to developer and operations teams.”
Aside from being incredibly useful, the Google team also realized that the data captured by Dapper in a trace could be used in testing — in fact, the first recorded use of trace-based testing was on Google’s Ads Review service.
“New code release goes through a Dapper trace QA process, which verifies correct system behavior and performance. A number of issues were discovered using this process, both in the Ads Review code itself and in supporting libraries,” the paper reported.
Why Does Trace-Based Testing Matter?
As the team at Google reported, understanding system behavior is invaluable. It lets you be proactive, rather than constantly reacting to issues that could have been predicted.
Back in 2018, Lightstep’s Ted Young presented the talk, “Trace Driven Development” at KubeCon + CloudNativeCon North America. This is one of the best explanations of the concepts behind trace-based testing and why it’s important. In this brilliant presentation, which I highly recommend watching, Young explained how inefficient testing methods are leading us to some pretty poor-quality code:
“I would argue it’s really important that we start merging our development practices and our monitoring practices. Right now, we develop code, and we have all this tooling that we use to develop that code, and then we throw it over into production. Then we want to see what it’s doing in production, and we use a whole different toolchain to do that, and we don’t test that toolchain in development. We don’t utilize that toolchain in development, and I have found over the years that that means that toolchain has a high variance in quality.”
He posed a straightforward, yet important, question: Why are our development and testing practices divorced from our monitoring practices? If we’re investing time and effort in writing and testing our code, that work should be leveraged in the monitoring process to close the feedback loop and ensure 360-degree observability.
Data-Driven Software Development
Really, what trace-based testing boils down to is data-driven development: using the data inherently included in a trace to create tests and define assertions against it, thereby verifying proper system operation with repeatable tests.
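To make that concrete, here is a minimal sketch of assertions written against trace data rather than return values. The spans, names, attributes and thresholds below are entirely hypothetical — a trace is treated as plain data, as Young suggests:

```python
# A trace is just data: a list of spans, each with a name, a parent,
# a duration and attributes. These example spans are hypothetical.
trace = [
    {"name": "POST /orders", "parent": None, "duration_ms": 240,
     "attributes": {"http.status_code": 201}},
    {"name": "charge-payment", "parent": "POST /orders", "duration_ms": 180,
     "attributes": {"payment.provider": "stripe"}},
    {"name": "INSERT orders", "parent": "charge-payment", "duration_ms": 12,
     "attributes": {"db.system": "postgresql"}},
]

def spans_named(trace, name):
    """Select spans by name, much like a selector in a trace-based test."""
    return [s for s in trace if s["name"] == name]

# 1. The payment service was actually called downstream of the HTTP request.
payment = spans_named(trace, "charge-payment")
assert payment and payment[0]["parent"] == "POST /orders"

# 2. The whole transaction met a performance budget.
assert sum(s["duration_ms"] for s in trace) < 1000

# 3. The request completed with the expected status code.
assert spans_named(trace, "POST /orders")[0]["attributes"]["http.status_code"] == 201
```

The point is that these are ordinary assertions — the same style of test you would write against a function’s output, applied to the behavior the trace recorded.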
“We can start applying data science to observing our software in a way that is not just making pretty graphs, but actually writing assertions and creating hypotheses and testing them,” Young said in his presentation.
“We’ve been talking about trace data like trace data is something special, but really trace data is just data. Trace-driven development is really just data-driven development. There’s no reason why you can’t take the style of testing that we usually use to test for a logical execution in our code and write the same kind of test against the aggregate data that we collect in production.”
How to Get Started with Trace-Based Testing
As far back as 2010, Google engineers saw the benefit of basing tests on the data provided by a complete, well-instrumented distributed trace. Until now, however, that value has been hard to tap into.
That’s why we’re working on Tracetest, an open source tool to create deep integration tests using OpenTelemetry traces.
Adding an assertion based on the trace
Tracetest unlocks this value, allowing you to use OpenTelemetry trace data to drive both integration tests and complex end-to-end tests. By relying on the tracing data from your existing tracing system, we make integration tests much easier to write. And by building tests on top of your traces, we make instrumentation a core part of the development cycle rather than an afterthought — creating a feedback loop that results in both better testing and better observability.
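The overall shape of such a test looks like this. Everything in this sketch is hypothetical: `trigger_checkout` and `fetch_trace` stand in for your HTTP client and your tracing backend’s query API, and the span and service names are made up for illustration:

```python
def trigger_checkout():
    """Stand-in for an HTTP call that kicks off the transaction under test.
    Returns the trace ID propagated through the system (hypothetical)."""
    return "trace-123"

def fetch_trace(trace_id):
    """Stand-in for querying a tracing backend for every span in one trace.
    The spans returned here are hard-coded for the sake of the sketch."""
    return [
        {"trace_id": trace_id, "name": "checkout", "service": "frontend"},
        {"trace_id": trace_id, "name": "reserve-stock", "service": "inventory"},
        {"trace_id": trace_id, "name": "send-receipt", "service": "email"},
    ]

def test_checkout_touches_inventory():
    # Act: run the real transaction end to end.
    trace_id = trigger_checkout()
    spans = fetch_trace(trace_id)

    # Assert on system behavior: the cross-service dependencies you would
    # normally only observe in production are verified before release.
    services = {s["service"] for s in spans}
    assert "inventory" in services, "checkout never reached the inventory service"
    assert "email" in services, "no receipt was sent"

test_checkout_touches_inventory()
```

The test exercises the system once and then interrogates the trace it produced, so a broken downstream dependency fails the test instead of surfacing in production.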