3 Ways Observability Matters in Cloud Native Testing

Observability has advanced rapidly over the past 15 years, largely in response to the challenges posed by the distributed nature of cloud native architecture.
The benefits of these innovations have mainly been realized by site reliability engineers (SREs) and DevOps engineers troubleshooting production. The visibility these observability technologies provide is no less valuable in CI/CD pipelines when testing deployments, but that value is only beginning to be recognized.
We will look at three ways distributed tracing adds value when testing before moving code to production. First, a little background.
Why Is Distributed Tracing Important?
Distributed tracing is used in production by SREs to quickly identify issues, determine their root cause and remediate them. A distributed trace captures information throughout the journey of a single transaction: the path it takes through the system, the services involved and details about each step in the process, including errors, timing and key attributes such as call and response payloads. SREs and DevOps engineers can use this visibility to quickly pinpoint when and where in a series of steps a process failed.
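To make that concrete, here is a minimal sketch using the OpenTelemetry Python SDK of the kind of detail a single span contributes to a trace: timing, key attributes and any error that occurred. The service, span and attribute names are illustrative, and the console exporter stands in for a real collector.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import Status, StatusCode

# Export spans to the console for this sketch; real setups export to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # illustrative service name

def charge_card(order_id: str, amount: float) -> None:
    """Placeholder for the real payment call."""

with tracer.start_as_current_span("charge-card") as span:
    # Key attributes captured alongside timing for this step of the transaction.
    span.set_attribute("order.id", "ord-1234")
    span.set_attribute("payment.amount", 42.50)
    try:
        charge_card("ord-1234", 42.50)
    except Exception as exc:
        # Errors are recorded on the span so the failing step shows up in the trace.
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))
        raise
```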
But how can distributed tracing help when testing these systems before production?
Understanding the System — A Map Would Help!
To create a new test for a system, you must understand its existing flow of operations. Today’s applications are decomposed into multiple microservices communicating via asynchronous message buses and backed by a wide range of storage and cache technologies. Multiple teams often work on the individual services involved in these flows, and those services may be written in several different languages.
How does a developer or automation engineer gain enough understanding of how a transaction progresses through these systems to even begin writing a test? Documentation and system architectural drawings are often out of date, and obtaining knowledge from all involved parties is difficult in today’s dispersed working environments.
Distributed tracing can help.
By capturing a trace before writing the test, an engineer can see a visualization of the microservices involved in the process. The trace acts as a map: it shows the journey of the particular call you want to test, which helps the engineer understand the entire system and build tests against it. We all reach our destination more easily and quickly with a map, and the same is true when creating a test. The distributed trace is that map.
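As a rough illustration of using a trace as a map, the sketch below pulls a previously captured trace from a Jaeger query endpoint and lists which service performed which operation. The endpoint address and trace ID are placeholders, and the JSON shape follows Jaeger's trace API; adjust it for your own tracing backend.

```python
import requests

JAEGER_QUERY = "http://localhost:16686"  # hypothetical address of your Jaeger query service
TRACE_ID = "0123456789abcdef"            # a trace ID captured from a real request

response = requests.get(f"{JAEGER_QUERY}/api/traces/{TRACE_ID}", timeout=10)
response.raise_for_status()
trace_data = response.json()["data"][0]

# Jaeger keys each span to a process entry that names the emitting service.
processes = trace_data["processes"]
for span in trace_data["spans"]:
    service = processes[span["processID"]]["serviceName"]
    duration_ms = span["duration"] / 1000  # Jaeger reports durations in microseconds
    print(f"{service:<24} {span['operationName']:<32} {duration_ms:8.1f} ms")
```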
Async Processes Have Broken Black Box Testing
Yesterday’s systems were mainly synchronous and easier to test. The system under test typically had an API layer that received requests and called the business logic, which usually relied on an underlying database as the data store. Black box testing was not concerned with the system’s internal workings; it judged results purely on the response to the initial call. Since most issues resided in the business logic, this was adequate.
Today’s systems are often asynchronous in nature. The system is called, and the call results in one or more messages being enqueued on a message bus. The call returns immediately, signaling that it received the initial request successfully, but the real processing is carried out by microservices consuming messages from a queue. With so many interconnected services, the most frequent errors occur at the boundaries between services, and black box testing does not catch them.
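A stripped-down sketch of that pattern, with hypothetical names: the API layer enqueues the work and answers immediately, so the response a black box test sees says nothing about whether the downstream steps succeed.

```python
import json
import queue

order_queue: "queue.Queue[str]" = queue.Queue()

def reserve_inventory(order: dict) -> None: ...  # placeholder downstream step
def charge_payment(order: dict) -> None: ...     # placeholder downstream step

def handle_create_order(request_body: dict) -> tuple[int, dict]:
    """API layer: accept the request and hand it off for asynchronous processing."""
    order_queue.put(json.dumps(request_body))
    return 202, {"status": "accepted"}  # the only thing a black box test can assert on

def order_consumer() -> None:
    """Worker: the real business logic runs here, after the API has already responded."""
    while True:
        order = json.loads(order_queue.get())
        reserve_inventory(order)  # a failure here never surfaces in the 202 above
        charge_payment(order)
```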
Trace-based testing comes into its own in these environments. Building on the visibility that distributed tracing provides, it allows assertions to be written across the entire cloud native application. It is white box testing, relying on the instrumentation work you have already done so that the full system can be checked and verified. Does a particular call to your application need to result in three microservices writing to their individual data stores? A trace-based test can validate that this happens.
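Continuing that example, here is an illustrative trace-based assertion over the Jaeger-style trace JSON shown earlier. The three service names and the reliance on the db.system span attribute are assumptions about how the application is instrumented, not a fixed convention of any particular tool.

```python
EXPECTED_WRITERS = {"orders-service", "inventory-service", "shipping-service"}  # hypothetical

def services_that_wrote_to_a_store(trace_data: dict) -> set:
    """Return the services whose spans carry a db.system attribute, i.e., a database call."""
    processes = trace_data["processes"]
    return {
        processes[span["processID"]]["serviceName"]
        for span in trace_data["spans"]
        if any(tag["key"] == "db.system" for tag in span.get("tags", []))
    }

def test_order_flow_persists_in_all_three_services(trace_data: dict) -> None:
    missing = EXPECTED_WRITERS - services_that_wrote_to_a_store(trace_data)
    assert not missing, f"no database write span found for: {sorted(missing)}"
```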
Speeding up Mean Time to Repair (MTTR)
Mean time to repair (MTTR) is a measurement usually reserved for production, but how quickly an engineer can determine why a test failed, correct the code and push an update directly affects release velocity. That time can be reduced by giving the engineer more useful information. Most black box API tests provide very little, restricted to the response from the call to the system. Engineers typically have to spend time and energy reproducing the issue locally before they can identify the needed correction.
Using modern observability can help. Attaching the distributed trace captured by the failed test run gives the entire team much more information. They can see the full details of the execution and how each of the microservices responded. With this additional information, it is easier to determine which microservice was the culprit and route the issue to the proper engineer. The knowledge contained in the distributed trace then helps that engineer determine the root cause and resolve the issue quickly.
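One lightweight way to put this into practice is to include a link to the captured trace in every failing assertion, as in the sketch below. The trace UI URL format is an assumption (Jaeger-style) and will differ for other backends.

```python
TRACING_UI = "http://localhost:16686/trace"  # hypothetical Jaeger UI base URL

def assert_with_trace(condition: bool, message: str, trace_id: str) -> None:
    """Fail with a message that points the owning team straight at the distributed trace."""
    assert condition, f"{message}\nSee the full distributed trace: {TRACING_UI}/{trace_id}"
```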
Using Observability in Testing with Tracetest
Tracetest is a trace-based testing tool built on modern observability. It allows you to create tests by triggering the application via an HTTP or gRPC call or by placing a message on a Kafka topic. The automation engineer can then see both the response and a visualization of the distributed trace. This serves as a map, helping the engineer build assertions to validate the most important areas in the flow.
These are white box tests, allowing both functional and non-functional tests to be created across your services. For example, you can ensure the proper amount of inventory is decremented from your inventory microservice when a shipment order is processed, and you can also verify that it happens within 1,500 milliseconds.
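Tracetest expresses these checks as selectors and assertions in its own test definitions; as a language-agnostic illustration of the same idea, the sketch below asserts both the functional condition (an inventory decrement span exists) and the non-functional one (it completes within 1,500 milliseconds) against the Jaeger-style trace JSON used earlier. The span name is hypothetical.

```python
def test_inventory_decrement_within_budget(trace_data: dict) -> None:
    inventory_spans = [
        span for span in trace_data["spans"]
        if span["operationName"] == "inventory.decrement"  # hypothetical span name
    ]
    # Functional check: the inventory service actually recorded the decrement.
    assert inventory_spans, "expected an inventory.decrement span in the trace"
    # Non-functional check: each decrement stayed within the 1,500 ms budget.
    for span in inventory_spans:
        duration_ms = span["duration"] / 1000  # Jaeger durations are in microseconds
        assert duration_ms < 1500, f"inventory decrement took {duration_ms:.0f} ms"
```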
These tests can run in your CI/CD environment each time you push code, verifying the system continues to operate as expected. The distributed traces generated by your application are captured as part of the trace-based tests and included in the results sent to engineers when a test fails. With this visualization of the full process and the captured details, the issue can be corrected much more quickly and a fixed release shipped.
With Tracetest, you can leverage your existing investment in observability to increase your confidence to deploy. Go to Tracetest.io to get started. Need assistance or have questions? You can reach the Tracetest team on our Discord channel, or reach out to me with any questions via LinkedIn.