Challenges of Testing Distributed Cloud Applications in 2021
Software development has come a long way in the last twenty years. The advancement of hardware and software has been outpaced only by consumer and enterprise demand. As a result, the complexity of software systems has skyrocketed.
While it is still possible to develop and test some types of applications in a local environment, this is no longer the norm. Complex systems made of many moving parts can be difficult or even impossible to run locally. This complicates both local development, which has to happen without the whole system available, and testing, which has to wait until the changes reach a fully built environment to get the necessary coverage.
In this article, we’ll explore the challenges of testing distributed architectures and discuss some approaches to address them.
Challenges in Testing Distributed Architectures
Systems that grow to handle heavy load can scale up (add resources to a single machine) or scale out (add more machines). The latter provides more flexibility to scale beyond the constraints of a single machine, but makes the system inherently more complex. Achieving the right level of performance while maintaining a healthy level of availability and fault tolerance remains a challenge, especially in view of the limitations imposed by the CAP theorem.
The resulting architectures are often made up of many different components, which might include:
- Batch processing services
- Load balancers
- Message queues
- Different types of storage
Setting all of this up in a local environment is a complex undertaking, if not impossible. For example, it might require more memory than is available on a single local machine, or cloud services that aren’t available locally.
Whether or not a distributed architecture can be set up locally, its very nature makes it more challenging to test than a monolithic system. It is much harder to identify what is going on when a single request can traverse several different components.
Failures caused by inter-component communication can also manifest in many different ways, including duplicated or lost messages, split brains, and intermittently failing requests caused by a subset of problematic services sitting behind a load balancer.
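One of these failure modes, message duplication, can be reproduced in a test without any infrastructure. The following is a minimal sketch in Python, assuming a queue with at-least-once delivery. The `InventoryConsumer` class, the `Message` fields, and the `deliver_at_least_once` helper are hypothetical names made up for illustration, not part of any particular queueing product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str
    sku: str
    quantity: int


@dataclass
class InventoryConsumer:
    """Hypothetical consumer that skips messages it has already processed."""
    stock: dict[str, int] = field(default_factory=dict)
    seen_ids: set[str] = field(default_factory=set)

    def handle(self, message: Message) -> bool:
        # Return False when the message is a duplicate and was skipped.
        if message.message_id in self.seen_ids:
            return False
        self.seen_ids.add(message.message_id)
        self.stock[message.sku] = self.stock.get(message.sku, 0) + message.quantity
        return True


def deliver_at_least_once(consumer, messages, duplicate_every=2):
    """Simulate a queue that redelivers every Nth message."""
    for index, message in enumerate(messages, start=1):
        consumer.handle(message)
        if index % duplicate_every == 0:
            consumer.handle(message)  # redelivery of the same message


consumer = InventoryConsumer()
deliver_at_least_once(consumer, [
    Message("m-1", "widget", 5),
    Message("m-2", "widget", 3),
    Message("m-3", "gadget", 1),
])
print(consumer.stock)
```

Removing the `seen_ids` check makes the redelivered message count twice, which is the kind of defect that only appears once components talk to each other over an unreliable channel.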
xUnit Tests: A Limited Solution
Unit testing has been around for a while and can be an effective tool for testing logic. Its main advantage is automation — meaning unit tests can be run on-demand or as part of a CI/CD pipeline, as often as necessary. This is particularly useful for regression testing.
But in cloud development, or in the development of other distributed systems, the benefits of unit tests are much less pronounced. While logic still needs to be correct, such systems place far more emphasis on performance and reliability. These depend on the way components interact with one another, which by definition is outside the scope of the typical unit test. In fact, unit tests usually replace cloud resources and third-party services (such as payment gateways) with mocks, so the real interactions are never exercised.
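To make the mocking point concrete, here is a small sketch using Python's standard `unittest.mock`. The `CheckoutService` class and the gateway's `charge` method are assumptions invented for this example; they do not correspond to any real payment provider's API.

```python
from unittest.mock import Mock


class CheckoutService:
    """Hypothetical service that delegates payment to an external gateway."""

    def __init__(self, payment_gateway):
        self.payment_gateway = payment_gateway

    def checkout(self, order_id: str, amount_cents: int) -> str:
        response = self.payment_gateway.charge(
            order_id=order_id, amount_cents=amount_cents
        )
        return "paid" if response["status"] == "approved" else "declined"


# The mock stands in for the real gateway, so no network call is made.
gateway = Mock()
gateway.charge.return_value = {"status": "approved"}

service = CheckoutService(gateway)
outcome = service.checkout("order-42", 1999)

# Verify the logic: the service called the gateway once with the right data.
gateway.charge.assert_called_once_with(order_id="order-42", amount_cents=1999)
print(outcome)
```

The test confirms the checkout logic, but it says nothing about how the real gateway behaves under latency, timeouts, or partial failure, which is exactly the gap described above.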
The loose coupling between microservices, along with the option of writing them in different programming languages, means that different teams can use their choice of tools without depending on each other.
However, this means that tooling is heterogeneous across the organization, which makes it difficult to move developers between teams. And since unit tests are essentially tied to the code they test, testing an entire system could require knowledge of many different programming languages.
Using APM to Troubleshoot Distributed Applications
Testing complex distributed systems is essential, even though it is more difficult. It helps ensure that they work reliably, even under heavy load, and that bugs are caught as early as possible in order to minimize their impact and cost. Since it is difficult to test such systems locally and unit testing is inadequate on its own, a more pervasive style of end-to-end testing is necessary.
This creates a need for an observability solution. It is very hard to debug logic that spans several different services, and even harder when it cannot be done in a local environment. As a result, many companies turn to application performance monitoring (APM) solutions to gain visibility into the health of different services and the interactions between them.
APM tools can be effective for visualizing how different components in a system are connected. They can also provide insights on individual services — such as whether they’re up and running, the number of errors returned within a time period, and the load they are experiencing at any given time.
On the other hand, APM tooling requires a significant investment. Aside from any licensing and server costs for running central APM services, APM solutions themselves tend to be very complex. It is not uncommon for larger organizations to have dedicated personnel working on such systems.
Approaches for Testing Microservices
The complexity of developing, testing, and maintaining distributed systems is somewhat alleviated when the components involved are built with loose coupling between them. This is at the core of hexagonal and microservices architectures, and has a number of advantages:
- Single-responsibility principle (SRP): Services can focus on doing one thing well, and delegate other concerns to other services.
- Maintainability: It is much easier to maintain and evolve a service independently if it is not entangled with other unrelated logic.
- Reusability: Designing around pluggable components means that they can easily be reused in new or growing services where they are needed.
- Replaceability: Like reusability, replaceability means components with different implementations but the same interface can easily be swapped out as needed. In the case of microservices, entire services can be rewritten without breaking the architecture.
- Testability: As with classes in object-oriented programming, services with well-defined inputs and outputs are easy to test, both in isolation and as part of a greater whole.
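The replaceability and testability points can be sketched with a small interface in Python. The `Storage` protocol, `InMemoryStorage`, and `ReportService` below are hypothetical names chosen for illustration; in a real system the in-memory class would sit alongside an implementation backed by a cloud storage service with the same two methods.

```python
from typing import Protocol


class Storage(Protocol):
    """The interface the service depends on, not a concrete backend."""

    def put(self, key: str, value: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStorage:
    """Test-friendly implementation; a cloud-backed one could replace it."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: str) -> bytes:
        return self._items[key]


class ReportService:
    """Depends only on the Storage interface, so backends are swappable."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def save_report(self, name: str, body: str) -> None:
        self._storage.put(f"reports/{name}", body.encode("utf-8"))

    def load_report(self, name: str) -> str:
        return self._storage.get(f"reports/{name}").decode("utf-8")


service = ReportService(InMemoryStorage())
service.save_report("q1", "revenue up")
print(service.load_report("q1"))
```

Because `ReportService` only knows about `put` and `get`, it can be tested in isolation with the in-memory backend and later wired to a different backend without changing its code.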
Designing software in a way that it is testable and maintainable is important, but it’s not enough on its own. Running end-to-end tests on distributed architectures that can’t be hosted locally requires environments to be set up to support them.
One way to do this is to create additional environments and gradually promote software changes across them as they get tested. Hosting and maintaining these environments adds significant cost and effort.
Some of this can be alleviated by automating deployments via CI/CD and by using smaller and/or scheduled resources for lower environments to minimize costs. However, downsizing the lower environments means they are no longer identical to production, so tests run there do not provide full confidence, especially when it comes to performance.
Another approach is to test directly in production. While this is riskier than having full isolation between test and production environments, this risk can be mitigated by creating tenants for testing and directing the flow of test requests differently from real ones. This means tests are still controlled and run in some degree of isolation.
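As a rough sketch of how test traffic might be separated from real traffic, the Python snippet below routes requests by tenant. The tenant IDs, the backend names, and the `choose_backend` function are all assumptions made up for this example; an actual setup would typically implement the same decision in a gateway or load balancer.

```python
from dataclasses import dataclass

# Hypothetical tenant IDs reserved for tests running in production.
TEST_TENANTS = frozenset({"tenant-test-1", "tenant-test-2"})


@dataclass(frozen=True)
class Request:
    tenant_id: str
    path: str


def choose_backend(request: Request) -> str:
    """Send test tenants to the new build, everyone else to the stable one."""
    if request.tenant_id in TEST_TENANTS:
        return "orders-canary"  # hypothetical name of the build under test
    return "orders-stable"      # hypothetical name of the current release


print(choose_backend(Request("tenant-test-1", "/orders")))
print(choose_backend(Request("tenant-acme", "/orders")))
```

Keeping the routing rule this explicit makes it easy to audit that no real customer request can reach the build under test.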
Being able to rapidly deploy changes to the real production environment and get immediate feedback on them is a boost in agility that, for some teams, can outweigh the risks on its own, not to mention the cost and effort saved by not creating all those extra environments.
Cloud Debugging with Thundra Sidekick
No matter how your environments are set up, debugging a distributed application remains a difficult and time-consuming process. It requires distributed tracing (that is, understanding how a request passes through different services and what happens along the way) and is further complicated by the fact that some issues may not be easily reproduced in non-production environments.
All of this is compounded by the risk and tediousness of making changes to the production environment in an effort to gather more information about the root cause.
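The core idea behind distributed tracing, carrying one identifier along with a request as it moves between services, can be sketched in a few lines of Python. This is a simplified illustration in which services are plain functions; the `X-Trace-Id` header name and the three service names are assumptions for the example (real systems commonly use a standard such as the W3C Trace Context `traceparent` header).

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name for this sketch

# Collected spans as (trace_id, service_name), in the order they occurred.
spans: list[tuple[str, str]] = []


def record_span(service: str, headers: dict[str, str]) -> dict[str, str]:
    """Reuse the incoming trace ID, or start a new trace if there is none."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    spans.append((trace_id, service))
    # Return headers for the outgoing call so the ID keeps propagating.
    return {**headers, TRACE_HEADER: trace_id}


def inventory_service(headers: dict[str, str]) -> None:
    record_span("inventory", headers)


def order_service(headers: dict[str, str]) -> None:
    outgoing = record_span("orders", headers)
    inventory_service(outgoing)


def api_gateway(headers: dict[str, str]) -> None:
    outgoing = record_span("gateway", headers)
    order_service(outgoing)


api_gateway({})  # an external request arrives without a trace ID
for trace_id, service in spans:
    print(trace_id, service)
```

All three spans share the same trace ID, so the path of a single request can be reconstructed afterwards even though each service only recorded its own part.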
Fortunately, Thundra Sidekick makes this process a whole lot easier. By setting non-breaking tracepoints, you can effectively debug a process in production without blocking the service and impacting customers or downstream services. The tracepoints capture the context (including variables) when they are hit, giving insights into the application state that can help identify what caused the issue and how to resolve it.
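To illustrate the general idea of a non-breaking tracepoint, the sketch below uses Python's standard `sys.settrace` hook to snapshot a function's local variables when it returns, without pausing execution. This is not Thundra Sidekick's API or implementation; `make_tracepoint` and `apply_discount` are hypothetical names, and the example only shows the concept of capturing state while the program keeps running.

```python
import sys
from typing import Any

# Snapshots of local variables captured by the tracepoint.
captured: list[dict[str, Any]] = []


def make_tracepoint(function_name: str):
    """Build a trace function that snapshots locals when the function returns."""

    def local_tracer(frame, event, arg):
        if event == "return":
            # Copy the locals; execution continues without blocking.
            captured.append(dict(frame.f_locals))
        return local_tracer

    def global_tracer(frame, event, arg):
        # Only trace frames belonging to the function of interest.
        if event == "call" and frame.f_code.co_name == function_name:
            return local_tracer
        return None

    return global_tracer


def apply_discount(price: float, percent: float) -> float:
    discount = price * percent / 100
    return price - discount


sys.settrace(make_tracepoint("apply_discount"))
try:
    result = apply_discount(200.0, 10.0)
finally:
    sys.settrace(None)  # always remove the hook afterwards

print(result, captured)
```

The function returns its normal result to the caller, and the captured snapshot can be inspected separately, which mirrors the difference between a tracepoint and a breakpoint that halts the process.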
Thundra Sidekick integrates with your IDE via the IntelliJ IDEA Plugin, allowing you to do this production debugging from your familiar programming environment. On top of that, you can use it to deploy a hotfix directly to a cloud environment, quickly ensuring that the fix really works before you roll it out via your normal CI/CD process.