5 Ways Trace-Based Testing Matters to SREs
While the development and QA staff have primary responsibility for testing, site reliability engineers (SREs) are typically responsible for the overall system and its availability.
Unit tests are typically created and run in silos within individual teams, but integration, end-to-end and system tests must be conducted in a fuller environment that combines the work of multiple teams and technologies under more realistic deployment conditions. This fuller environment is the domain of the SRE.
Tasked with creating and implementing tools that increase site reliability and performance, the SRE team has a role to play in enabling testing against the full distributed system.
Tools that help SREs manage a distributed application at scale have undergone a complete revolution in the last decade, mainly out of necessity. Log files are great in a monolithic environment with one user, but they quickly become untenable when you have a distributed system with many users.
This revolution changed the way that logs and metrics are handled, but the greatest difference has been the advent of distributed tracing. The ability to fully track a single transaction as it flows through the complete system, seeing the timing and key information at each step along the path, is a game changer. It has become a central tool on which the SRE community depends to understand and debug complex transactions in its environments.
What Is Trace-Based Testing?
Trace-based testing uses data contained in the spans and attributes of a distributed trace to assert expected system behavior. Traditional API tools assert against the response of a system call, while trace-based testing asserts against the response and the trace results.
You can think of it as Postman with X-ray vision, seeing and allowing testing against not only the obvious result from a call, but also the entire flow, ensuring that each microservice involved in fulfilling the request behaves in the expected manner. It is a natural solution for testing in the distributed world, as it is built on distributed observability tooling, namely distributed tracing.
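To make the contrast concrete, here is a minimal Python sketch of the idea. It does not use any real tool's API: the `Span` shape, the service and span names, the `payment.amount` attribute and the 200 ms threshold are all assumptions made up for illustration. The first check is the traditional response assertion; the rest assert against the trace behind that response.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step of a distributed trace (hypothetical, simplified shape)."""
    service: str
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def find_spans(trace, service, name=None):
    """Return the spans emitted by `service`, optionally filtered by span name."""
    return [
        span for span in trace
        if span.service == service and (name is None or span.name == name)
    ]


def check_order_flow(response, trace):
    """Collect failures from both the API response and the trace behind it."""
    failures = []

    # Traditional API assertion: only the response is inspected.
    if response["status"] != 201:
        failures.append(f"expected status 201, got {response['status']}")

    # Trace-based assertions: inspect what each service did along the way.
    charges = find_spans(trace, "payment", "charge-card")
    if len(charges) != 1:
        failures.append(f"expected exactly one charge-card span, found {len(charges)}")
    elif charges[0].attributes.get("payment.amount") != response["total"]:
        failures.append("charged amount does not match the order total")

    for span in find_spans(trace, "inventory"):
        if span.duration_ms > 200:  # assumed latency budget
            failures.append(f"inventory span '{span.name}' took {span.duration_ms} ms")

    return failures


# Illustrative data only; a real trace would come from your tracing backend.
response = {"status": 201, "total": 42.50}
trace = [
    Span("api-gateway", "POST /orders", 180.0),
    Span("payment", "charge-card", 95.0, {"payment.amount": 42.50}),
    Span("inventory", "reserve-stock", 40.0, {"sku": "toy-123"}),
]

print(check_order_flow(response, trace))
```

If the payment service charged the card twice, the response would still be a 201, so a response-only test would pass. The span count check is what would flag the problem.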
Great, but how does trace-based testing help SREs?
No. 1: It Brings More Value to the Table Across Roles
Implementing distributed tracing has been described as a high-promise, high-effort, low-value story (see Ian Smith’s presentation “Distributed Tracing — The Struggle Is Real” at KubeCon in Detroit), but why is that? Power users, typically the SREs, are the only ones in the organization who use tracing tools. However, others have to spend time implementing tracing in their code and maintaining it. Meanwhile, adoption lags and never reaches critical mass across multiple stakeholders, so tracing is seen primarily as a cost.
Basing tests on your traces, however, reverses this trend. Both developers and QA engineers actively create and maintain tests, and building them for distributed systems is notoriously difficult. Trace-based testing is the “easy button,” as it relies on work your organization has already done by implementing distributed tracing. Being able to leverage your investment across the organization to enable testing adds more value, and that value is realized by multiple stakeholders.
No. 2: Trace-Based Testing Results in Better Observability
This one is for you, the SRE team. Trace-based testing improves observability, helping you diagnose issues more quickly and reducing mean time to repair (MTTR). Since the dev and QA teams rely on the business data captured in the trace's attributes to form their assertions, you will also “observe” that critical business data when viewing the trace. The information needed for a good test is typically the same information needed for debugging, so basing tests on trace data ensures the data in traces is rich.
No. 3: Traditional Testing of Distributed Systems Is Broken
You want tests to catch issues as code moves through the pipeline, but the traditional methods used to write end-to-end tests are inherently broken. They are hard to build and expensive to maintain, so many organizations just give up. You, the SRE, end up catching the issues, in prod, with your observability tools, at 3 a.m.
Trace-based testing uses modern observability techniques to enable modern testing. With it, developers and QA engineers are given a trace as the basis of the tests they create. The trace inherently shows the path a request took and the systems it touched. We all navigate better with a map, and the trace is the map you rely on when creating a trace-based test.
No. 4: Trace-Based Tests Can Be Run in Prod
The same set of tests used in your CI/CD processes can be run against production. Want to verify that an order in production is being processed properly? Use your test from the deployment environment by feeding in the trace from production. The same set of rules and verifications can be applied to the production trace, with the test results highlighting issues throughout the flow. You can also run a subset of tests directly in production, as long as they are harmless and do not end up shipping a toy to a random customer.
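As a rough illustration of reusing one rule set in both places, the Python sketch below defines the checks once and applies them to a trace from a CI run and to a trace pulled from production. Everything in it is hypothetical: the `Rule` type, the span names, the attribute keys and the dictionary shape of the traces are assumptions, not the format of any specific tool.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    """A named check applied to every span with a given name (hypothetical)."""
    description: str
    span_name: str
    check: Callable[[dict], bool]


# Defined once, then reused for CI traces and production traces alike.
RULES = [
    Rule("payment is approved", "charge-card",
         lambda attrs: attrs.get("payment.status") == "approved"),
    Rule("order is persisted", "save-order",
         lambda attrs: attrs.get("db.rows_affected") == 1),
]


def evaluate(rules, trace):
    """Map each rule description to True/False for the given trace.

    A rule fails if no span matches its name or if any matching span
    does not satisfy the check.
    """
    results = {}
    for rule in rules:
        matching = [span for span in trace if span["name"] == rule.span_name]
        results[rule.description] = bool(matching) and all(
            rule.check(span["attributes"]) for span in matching
        )
    return results


# Illustrative traces; real ones would be fetched from your tracing backend.
ci_trace = [
    {"name": "charge-card", "attributes": {"payment.status": "approved"}},
    {"name": "save-order", "attributes": {"db.rows_affected": 1}},
]
prod_trace = [
    {"name": "charge-card", "attributes": {"payment.status": "declined"}},
    {"name": "save-order", "attributes": {"db.rows_affected": 1}},
]

for label, trace in (("ci", ci_trace), ("prod", prod_trace)):
    print(label, evaluate(RULES, trace))
```

Because the rules only read an existing trace, applying them to production data triggers nothing new in the system, which is what makes this mode safe for real orders.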
No. 5: It Makes the SRE Team Heroes
The Google SRE Workbook defines the role of an SRE team as:
“Simply put, SRE principles aim to maximize the engineering velocity of developer teams while keeping products reliable. This two-fold goal is good for the product users and good for the company.”
Implementing trace-based testing fulfills the tenets of the SRE mission by providing the developer team with a means to more quickly write extensive tests, increasing both test coverage and developer velocity. Better tests lead to better reliability as code is being rolled into your production environment, resulting in a more reliable product for your customers.
Ready to get started? Tracetest is an open source solution that enables you to build and execute trace-based tests. You do not need to change your application or any of your code, and you do not need to switch your observability vendor. The tool has two integration points: It needs to be able to trigger your system, and it needs to be able to access the resultant trace. That’s it! Tracetest can be installed in Docker or Kubernetes; it runs in your environment and begins providing value immediately.
The Tracetest team is available to help you. You can reach us on our Discord channel, and you can open issues on GitHub to help direct the course of the project. If you like Tracetest, please give us a star on our GitHub repo!