Observability — Freeing Your Load Tests from the Black Box
The shift to cloud native architecture has fundamentally altered the tools needed to monitor, troubleshoot and optimize the performance of distributed applications and infrastructure. We have moved from having a monolith with simple logging, to using aggregated logs across several services, to now relying on the deep visibility provided by distributed tracing tools so we can visualize and understand our complex microservices. Troubleshooting tools have evolved!
But what about performance testing? There are newer, better tools, but are we still applying the same black box methods from the late 2000s, checking only the response? Is this adequate for today’s decomposed services? Let’s examine this space.
Cloud Native Architecture Has Changed the World
The world is now undeniably cloud native and distributed. We once had large, centrally located teams writing monolithic applications with a single data store, application layer and frontend application. Now we build microservice-based applications developed by numerous teams working in different locations. The applications depend on message buses to communicate asynchronously with multiple services written in a variety of languages and using different data stores.
There are numerous well-documented benefits of moving to a microservice-based application:
- Teams can work and commit independently in small groups.
- Velocity is not constrained by having to coordinate on one massive codebase.
- Services can be agnostic to programming language and technology decisions.
There are, however, challenges to testing these distributed applications. Nowhere are the challenges as readily seen as they are with load-testing.
Why Black Box Testing Is Not Adequate for Load-Testing Distributed Apps
With monoliths, most API calls against a backend were synchronous. You call the API, it processes the request, writes to or retrieves data from the centralized database and, if successful, returns a 200 status code. To load test, you hit the API surface with several thousand calls and see when it breaks. You could run profiling tools against your codebase to judge where the issues were, or analyze the database’s slow query log to determine whether SQL queries were the bottleneck.
Contrast this with a distributed system. When you place an order on Amazon.com, you get an immediate response confirming the order has been submitted, even though a slew of other processes occur as part of the order being placed. When you tell GitHub to create a build, it returns a response immediately, and the work begins asynchronously. What happens if we try to load test these asynchronous processes?
We can use our traditional load-testing tools, queue up a few hundred calls and fire them at the API. If the distributed system is using a message queue, the API surface is likely to just put a message on the message queue and return the 200 status code. The load test will verify that the API request handler can absorb the load and perhaps verify that the message queue can enqueue requests at the given rate, but it will not verify or check any of the downstream processes. If a microservice pulling requests off the queue cannot scale, has a bottleneck or experiences errors under load, this issue will remain unidentified and unresolved. A bad test that gives you false confidence is worse than no test at all.
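To make that failure mode concrete, here is a small, self-contained Python sketch; the queue, handler and worker are hypothetical stand-ins, not anything from the article. It models an API that enqueues work and returns 200 immediately, so a black box assertion on status codes passes even when the downstream worker does no work at all:

```python
import queue
import threading

# Toy model of an async pipeline: the API handler enqueues work and
# returns 200 immediately; a downstream worker drains the queue.
jobs = queue.Queue()
processed = []

def api_handler(order):
    jobs.put(order)   # enqueue and return right away
    return 200        # the black box load test only ever sees this

def worker(fail=False):
    # Simulated queue consumer; fail=True models a broken microservice.
    while True:
        order = jobs.get()
        if order is None:
            break
        if not fail:
            processed.append(order)  # the downstream work actually happens
        jobs.task_done()

# "Load test": fire 1,000 requests and assert on status codes only,
# while the downstream worker is broken.
t = threading.Thread(target=worker, kwargs={"fail": True})
t.start()
statuses = [api_handler(i) for i in range(1000)]
jobs.put(None)  # signal the worker to stop
t.join()

print(all(s == 200 for s in statuses))  # True — the black box test "passes"
print(len(processed))                   # 0 — downstream processing silently failed
```

Every response is a 200, so the black box test reports success, yet nothing downstream was processed — exactly the false confidence described above.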
There are other issues with testing these distributed systems with black box load-testing techniques. Even when a load test highlights an issue with the synchronous portion of the call, how do you know which of the handful, or possibly dozens, of underlying microservices involved in the flow caused the failure or slowed down the entire process?
As a good friend used to say, “You get one point for identifying that there is a problem, and you get 10 points for identifying the solution.” With current tools, you do not know which microservice is at fault without manual investigation. You can “throw it over the fence” to the dev team, but without a clear idea of which microservice is involved, they will not know who, or even which team, should focus on the problem. This is frustrating, slows remediation and causes friction among staff and teams as the blame game passes the ball back and forth.
To summarize, load-testing microservice-based architectures with current black box load-testing methods results in:
- Shallow tests that only verify the synchronous processing parts of a transaction, not the complete flow.
- Tests that, when they fail, do not indicate why or who should work on solving the problem.
- And, worst of all, false positives, where load tests appear to pass while the underlying system actually collapsed or degraded under the load.
What Is the Solution for Load-Testing Distributed Systems?
Great! We have identified the problem. Let’s give ourselves one point! Now let’s find a solution so we can get an additional 10!
What is the root issue holding us back when load-testing distributed, microservice-based applications? Visibility. Or, to be more exact, the lack of it. The testing system cannot see the underlying processes, much less assert against the results of those individual processes.
Treating the system under test as a black box, hitting it with numerous requests and verifying the output was a very rational and effective technique in the past. It now fails to look at the various important subprocesses occurring in the complete life cycle of the transaction that was triggered by the test. Wouldn’t it be awesome if the visibility enabled by distributed tracing could also be used to provide the visibility needed by load-testing tools? And if you could build assertions based on this trace data and test the entire distributed system? Many companies have already instrumented their systems with distributed tracing to enable visibility when experiencing operational issues. Why not use it for testing as well?
Observability with Distributed Tracing Provides the Answer
Well, now you can, thanks to a testing methodology known as trace-based testing and an integration between two open source tools that brings trace-based tests into load-testing scenarios. The two tools are Tracetest and k6.
Trace-based testing leverages observability to run tests against distributed systems. It collects both the response and the trace generated by the test run. With the data in the response and the trace providing full visibility into the complete process, you can create assertions that verify its critical parts. You can verify that each of the three processes that are supposed to pull the message off a message queue actually does so, and assert that each completes within a certain number of milliseconds. You can verify that a particular database is written to and that the duration of the write stays within particular bounds. By creating multiple test specifications, a single test can cover information from anywhere in the entire process.
Tracetest is an open source, Cloud Native Computing Foundation project that enables trace-based testing. It lets you create tests by defining how you want to trigger the system under test, then apply assertions to any part of the full result, whether that is the actual response or a specific process in the trace. A test specification has two parts: the selector, which uses a language similar to CSS selectors to target the service or services you want to test against, and the assertion, which is one or more checks to run against each service the selector matches. Tests can be created graphically, using the captured trace to easily build the test specifications, or written directly in YAML. A CLI lets the tests run as part of your automated testing processes.
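As a sketch of what such a test looks like, here is a hypothetical YAML definition in the style of Tracetest’s test format. The endpoint, span names and thresholds are illustrative assumptions, not taken from the article; check the Tracetest docs for the current schema:

```yaml
type: Test
spec:
  name: Order flow assertions            # hypothetical test name
  trigger:
    type: http
    httpRequest:
      method: POST
      url: http://orders.example.com/api/orders   # placeholder endpoint
  specs:
    # Selector targets the queue consumer's messaging span;
    # the assertion bounds how long dequeue-and-process may take.
    - selector: span[tracetest.span.type="messaging" name="orders.process"]
      assertions:
        - attr:tracetest.span.duration < 200ms
    # Every database span in the trace must stay within bounds.
    - selector: span[tracetest.span.type="database"]
      assertions:
        - attr:tracetest.span.duration < 50ms
```

Each `specs` entry is one selector/assertion pair: the selector picks spans out of the captured trace, and the assertions run against every span it matches, which is how a single test can reach the asynchronous parts of the flow.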
In November 2022, the k6 and Tracetest teams started discussing the possibility of applying trace-based testing techniques to load tests generated by k6. The team at k6 had recently enabled trace IDs to be returned in a load test via the xk6-distributed-tracing extension. With this capability of associating each test run as part of a load test with the corresponding distributed trace, the k6 team added the ability to visualize failed tests and examine where the issue lies.
Tracetest has only a couple of interface points with systems under test:
- It needs to trigger a test against the underlying system. As part of this, Tracetest generates the parent trace ID so it knows which trace is associated with the test.
- It needs to be able to retrieve the trace from an underlying tracing system such as Grafana Tempo or Jaeger.
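Both interface points hinge on trace ID propagation. As a rough illustration (not Tracetest’s actual code), the parent trace ID can be minted as a W3C Trace Context `traceparent` header; any OpenTelemetry-instrumented service will continue that trace, so the test runner can later fetch the full trace from Tempo or Jaeger by its ID:

```python
import secrets

def make_traceparent():
    # W3C Trace Context format: version-traceid-parentid-flags
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)   # 8 random bytes  -> 16 hex chars
    return trace_id, f"00-{trace_id}-{parent_id}-01"

trace_id, header = make_traceparent()
# The trigger request would carry the header, e.g.:
#   requests.post(url, headers={"traceparent": header})
# and trace_id is what gets looked up in Tempo/Jaeger afterward.
print(header)
```

The `00` version prefix and `01` sampled flag follow the Trace Context spec; everything between them identifies the trace and the triggering span.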
In looking to interface with k6, the Tracetest team realized that k6 should trigger the test and that any processing of all the resultant traces should happen asynchronously after the load test finishes. We did not want Tracetest to become part of the “system under load test” and affect the results.
Tracetest has always had the concept of multiple trigger types, as you can define tests with REST calls, import a Postman collection, use a gRPC call, etc. To enable an outside solution, such as k6, to trigger the process and for Tracetest to apply trace-based tests against the results, we came up with the concept of a “traceID-based test” where the trace ID is handed to Tracetest, and Tracetest collects the trace and processes the tests against it.
With all the elements in place, the last major step was building an extension to make the two tools work together: the xk6-tracetest extension. It allows you to build a k6 load test as usual, adding just a few lines of code to include the extension, specify the path to the Tracetest server, point to the Tracetest test to run and, optionally, customize how failed Tracetest tests are reported as part of the run. You can see a full example of a k6 test with a Tracetest test included in our example.
We also have a short video that highlights the benefits of running a trace-based test against each request in a load test. In it, software engineer Oscar Reyes shows how an entire microservice can be stopped while the black box load test reports no failures. Once he adds Tracetest to the test, the issues deeper in the process are surfaced and the test fails.
We also had an opportunity to spend an hour with the k6 team to show the integration. You can watch it here:
Start Fully Load-Testing Your Distributed App
This new integration between Tracetest and k6 gives load-testing the deep visibility needed to actually catch load and performance issues across your distributed app. Ready to try k6 and Tracetest in your environment?
You can get started with k6 by following the getting started docs.
For Tracetest, start by checking out our easy download. Then configure Tracetest to connect to your existing trace data store and create a test or two.
Once you have played around with k6 and Tracetest separately, go to our instructions detailing how to write a test combining the two.
Any issues can be raised in GitHub, and you can communicate directly with the team in Discord. Open source projects such as Tracetest depend on the input and support of the community and welcome your input and ideas. If you like our direction and what you are seeing from Tracetest, give us a star on GitHub!