Trace-Based Testing for a Distributed World
Testing has not changed significantly in the last decade, despite radical shifts in the system architectures that modern development teams use. Distributed architectures have made both system testing and integration testing more complex. Mocking attempts to recreate reality while providing isolation, but many teams still find it lacking and unreliable.
Trace-based testing addresses this gap by using a modern observability technique, distributed tracing, to build tests that verify the entire system flow against the observed flow: what really occurred.
To explain trace-based testing, we will deconstruct it into its parts, explain each one and then use that knowledge to understand the whole.
Breakdown of a Trace-Based Test
Trace-based tests have several distinguishing characteristics, some of which are common to other forms of testing and some of which are unique to trace-based testing. Let’s look in detail at the makeup of a trace-based test.
For this example, we will look at the YAML-based definition of a test as defined by Tracetest, but the principles would hold against any implementation of a trace-based test. Let’s use this short test:
name: Pokeshop - Import
description: Import a Pokemon
- key: Content-Type
- selector: span[tracetest.span.type="http" http.method="POST"]
- attr:http.status_code = 200
- selector: span[tracetest.span.type="http" name="HTTP GET pokeapi.pokemon" http.method="GET"]
- attr:http.response.body contains meowth
- selector: span[tracetest.span.type="database" name="create pokeshop.pokemon"]
- attr:db.result | json_path '.name' = meowth
- attr:tracetest.span.duration < 100ms
Trigger Your Distributed Test
The test trigger defines how to begin the execution of the test. Typically, this involves hitting an API endpoint, but tests can also be triggered via mechanisms such as putting a message on a queue or making a gRPC call.
The trigger starts the execution of the system under test, invoking the execution path under test. Triggers are a common concept across testing tools and frameworks, as they are used to create a response from the system that is then asserted against.
# trigger test by doing POST against /pokemon/import endpoint
- key: Content-Type
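To make the trigger concrete outside of any particular tool, here is a minimal Python sketch that builds the same kind of request: a POST against the /pokemon/import endpoint with a Content-Type header. The base URL, the header value and the JSON body are assumptions made for illustration; they are not taken from the test definition above. The request is only constructed, not sent, so the sketch runs without a live system.

```python
import json
import urllib.request

# Hypothetical address of the system under test (an assumption).
BASE_URL = "http://localhost:8081"


def build_import_trigger(pokemon_id: int) -> urllib.request.Request:
    """Build, but do not send, a POST request to /pokemon/import."""
    # Assumed body shape: a JSON object carrying the Pokemon id.
    body = json.dumps({"id": pokemon_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/pokemon/import",
        data=body,
        # Assumed header value; the article only names the header key.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_import_trigger(52)

# Once the service is reachable, firing the trigger is one call:
# with urllib.request.urlopen(request) as response:
#     status, payload = response.status, response.read()
```

Everything after this point in a trace-based test depends on the trace that this one request produces.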
Response from REST API Call
As stated above, the response is the recorded set of values returned from the system based on the execution of the trigger. For a REST-based trigger, the response data typically includes information such as the HTTP status code, the response body and other attributes.
Most testing tools follow an established pattern:
- Fire a trigger.
- Assert against the response.
While this approach was very successful in the days of monoliths and simple systems, it does not test in depth the flows involved in a modern distributed system. For that reason, trace-based testing bases its assertions not just on data from the response, but also on data from a tool built for modern architectures: distributed tracing.
A distributed trace creates a record of the execution path through the various parts of a microservice-based system, recording key information captured at each step. Distributed tracing evolved in the late 2000s from an internal project at Google named Dapper. It has only been heavily used in the past 10 years. Its adoption is a response to the complexity in troubleshooting across systems introduced by modern distributed architectures. Engineers were no longer able to rely on a single log to view execution information.
A trace is made up of spans, where each span represents one operation. Each span has recorded attributes that capture key information about the particular operation.
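As a rough mental model, a trace can be pictured as a list of span records linked by parent IDs. The sketch below is illustrative only: the `Span` class, its field names and the parent/child layout are assumptions, not the data model of OpenTelemetry or Tracetest. The span names and attribute keys echo the example test where possible.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One operation in a trace (illustrative model, not a real SDK type)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    attributes: dict = field(default_factory=dict)


# A trace is the set of spans recorded for a single request.
trace = [
    Span("1", None, "POST /pokemon/import",
         {"http.method": "POST", "http.status_code": 200}),
    Span("2", "1", "HTTP GET pokeapi.pokemon", {"http.method": "GET"}),
    Span("3", "1", "create pokeshop.pokemon",
         {"db.result": '{"name": "meowth"}'}),
]


def children_of(spans, parent_id):
    """Return the spans whose parent is the given span."""
    return [s for s in spans if s.parent_id == parent_id]


child_names = [s.name for s in children_of(trace, "1")]
```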
Trace-based testing allows assertions to be made against information contained in the trace. This allows fuller checks of the entire process and is not limited to just checking the response.
Test specifications are made up of two parts:
- Selectors
- Assertions
Each of these is covered in detail in the following sections.
Selectors are an important part of trace-based testing. They are used to specify which span or spans are inspected by a particular test specification. They can be thought of as a filter that returns a list of spans.
Most API test tools do not have a concept of a selector, as they are limited to only verifying against the top-level response. The concept of a selector will be very familiar to anyone with experience with front-end testing tools such as Selenium or Cypress.
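The idea of a selector as a filter can be sketched in a few lines of Python. This is not how Tracetest parses its selector syntax; `select` is a hypothetical helper, and spans are modeled as plain dictionaries for brevity.

```python
def select(spans, required):
    """Return every span whose attributes match all required key/value pairs."""
    return [
        span for span in spans
        if all(span.get(key) == value for key, value in required.items())
    ]


# Illustrative spans; the first span's name is an assumption.
spans = [
    {"tracetest.span.type": "http", "name": "POST /pokemon/import",
     "http.method": "POST"},
    {"tracetest.span.type": "http", "name": "HTTP GET pokeapi.pokemon",
     "http.method": "GET"},
    {"tracetest.span.type": "database", "name": "create pokeshop.pokemon"},
]

# A broad selector returns several spans, a narrow one returns a single span,
# and a selector that matches nothing returns an empty list.
http_spans = select(spans, {"tracetest.span.type": "http"})
post_spans = select(spans, {"tracetest.span.type": "http", "http.method": "POST"})
no_match = select(spans, {"tracetest.span.type": "messaging"})
```

Because a selector always yields a list, the same test specification works whether it targets one span or every span of a given type.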
Assertions are a series of checks that should be applied against each span specified by a particular selector. The assertions typically specify one of the attributes contained in a span and apply a logical check against it.
For example, you might set an assertion against a gRPC span to verify that the status code returned by the call is zero. These are similar to the assertions found in other API testing tools.
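The gRPC example can be sketched as a check applied to every span a selector returned. The `check` helper and the span names are hypothetical; `rpc.grpc.status_code` follows the OpenTelemetry semantic conventions, though your instrumentation may record the status under a different key.

```python
import operator


def check(spans, attribute, compare, expected):
    """Apply one assertion to each span and collect failure messages."""
    failures = []
    for span in spans:
        actual = span.get(attribute)
        # A missing attribute counts as a failure rather than a silent pass.
        if actual is None or not compare(actual, expected):
            failures.append(
                f"{span['name']}: {attribute} was {actual!r}, "
                f"expected {compare.__name__} {expected!r}"
            )
    return failures


# Illustrative spans, as if returned by a selector for gRPC calls.
grpc_spans = [
    {"name": "grpc call A", "rpc.grpc.status_code": 0},
    {"name": "grpc call B", "rpc.grpc.status_code": 14},
]

failures = check(grpc_spans, "rpc.grpc.status_code", operator.eq, 0)
```

Collecting failures per span, instead of stopping at the first one, reports every span that broke the rule in a single run.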
Benefits of Trace-Based Testing
Now that we have explained the parts involved in creating a trace-based test, let’s discuss some of the benefits of moving to a modern testing method built for distributed systems.
Trace-Based Tests Are Easy to Create
The “test rig” portion of building a traditional system or integration test is the most painful and verbose part. Getting access to each system under test, having the right authentication information for each of the data stores you want to interrogate, and writing all the code to tie this together make up the bulk of the code and effort.
As you can see from the example above, a trace-based test is short and to the point. Trace-based testing eliminates this work because it relies on the observability data you have already enabled in your distributed trace. You have already done this work, so use it!
Test against Reality
Unlike testing against a mock, the test specifications in a trace-based test run against what the distributed traces recorded as actually happening in your system. There are no false assumptions about how the mocked systems would react, because the actual system is tested.
Visual Nature of Building a Test
Most trace-based testing solutions allow you to visually build your tests while viewing the response and trace from a triggering transaction. Having the full trace graphically shown when building the test enables you to visualize the flow through your distributed app, aiding you in understanding the underlying services and knowing what should be asserted against.
Test the Entire Flow
Trace-based tests allow you not just to assert on the response of a trigger, but also to verify operations deeper in the system. Here are a few examples of assertions that can be made:
- If the API call pushes the message to a queue, I expect three services to pull the message off the queue.
- A child process should finish within 30 milliseconds of a parent span starting.
You can also have wider assertions that look at all spans of a certain type, with assertions such as:
- All gRPC spans should return a status code of zero.
- Any database queries should execute in less than 100 milliseconds.
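The four example assertions above can be sketched against a small hand-written trace. The span fields (`type`, `service`, `start_ms`, `end_ms`, `grpc_status`) are assumptions chosen for readability rather than a real trace schema.

```python
# Illustrative trace: an API call publishes a message that three services consume.
trace = [
    {"id": "1", "parent": None, "type": "http", "service": "api",
     "start_ms": 0, "end_ms": 12},
    {"id": "2", "parent": "1", "type": "messaging", "service": "billing",
     "start_ms": 5, "end_ms": 20},
    {"id": "3", "parent": "1", "type": "messaging", "service": "email",
     "start_ms": 6, "end_ms": 25},
    {"id": "4", "parent": "1", "type": "messaging", "service": "audit",
     "start_ms": 6, "end_ms": 29},
    {"id": "5", "parent": "2", "type": "database", "service": "billing",
     "start_ms": 8, "end_ms": 18},
    {"id": "6", "parent": "3", "type": "rpc", "service": "email",
     "start_ms": 10, "end_ms": 22, "grpc_status": 0},
]
by_id = {span["id"]: span for span in trace}


def of_type(span_type):
    """Select all spans of one type."""
    return [s for s in trace if s["type"] == span_type]


def finished_within(child, limit_ms):
    """True if the child ended within limit_ms of its parent starting."""
    parent = by_id[child["parent"]]
    return child["end_ms"] - parent["start_ms"] <= limit_ms


# Three distinct services pulled the message off the queue.
three_consumers = len({s["service"] for s in of_type("messaging")}) == 3

# Each consumer finished within 30 ms of the parent span starting.
consumers_on_time = all(finished_within(s, 30) for s in of_type("messaging"))

# Wider rules applied to every span of a type.
grpc_ok = all(s["grpc_status"] == 0 for s in of_type("rpc"))
db_fast = all(s["end_ms"] - s["start_ms"] < 100 for s in of_type("database"))
```

One caveat with "all spans of type X" rules: `all()` over an empty list is true, so a wide assertion passes when the selector matches nothing. Pairing it with a count check, as the consumer example does, guards against that.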
Failed Tests Always Have a Trace Attached
Since trace-based testing always runs against both the response and the distributed trace, a failed test provides much richer information than a traditional test does. Not only can you see the response data from the test and the assertions that failed, you also get the full system trace detailing what occurred during that particular run. Developers love having all the information needed to troubleshoot the failure.
Implementing Trace-Based Testing in Your World
Want to build a trace-based test against your system? All you need is a system that supports distributed tracing, preferably using the standards-based OpenTelemetry approach, and Tracetest, an open source trace-based testing solution that you download and install.
Installing Tracetest in your Docker or Kubernetes environment takes only a couple of minutes. Once it is installed, you tell Tracetest how to access traces from your distributed system, and you are ready to begin building your first test. Tracetest works with popular distributed tracing solutions such as Jaeger, OpenSearch, Elastic, Grafana Tempo, New Relic, Lightstep and more.
Need assistance or have questions? You can reach the Tracetest team on our Discord channel, and you can add issues on GitHub to help direct the course of the project. If you like Tracetest, please give us a star on our GitHub repo.