Embracing Testing in Production
At Octopus, we recently embarked on a new initiative to deliver tools that build opinionated GitHub Actions workflows and Jenkins Pipelines to help customers implement their continuous integration and continuous delivery (CI/CD) workflows.
The team was small (me), the deadlines were tight, and the tools were designed to build custom scripts and templates that themselves built yet more scripts and templates to be executed in platforms that were proving impractical to emulate locally.
I found writing code that writes code to be a uniquely frustrating experience. Linters and compilers are only useful on the final output, and offer little insight into my original templates, which are full of markup syntax that often makes them invalid examples of their final output. I also did not have a clear picture of what the final result should look like, and was instead iterating with many small changes tested in rapid succession.
Compounding the problem, the target platform executing the templates, GitHub Actions, did not have a robust offline option. Some interesting and active open source projects have sprung up to fill this gap, but I preferred to validate my templates in GitHub directly.
The final hurdle was the microservice architecture with which these tools were developed. Sitting in front of the template generator were convenient and well-tested web-based interfaces and services that pushed files to GitHub on the end users’ behalf. I could, of course, automate the process of typing git push in an isolated testing loop, but I did wonder if there was a better way to reuse the services that had already implemented this process.
Unsurprisingly, a majority of the advice on the internet implored me to isolate, record, replay, mock, substitute and automate my testing and development efforts. A recent Twitter post by Mitchell Hashimoto sums this up nicely:
I often recommend to more junior engineers: build feature X without ever running the software. Open a PR with confidence to say “this works” without having to SEE it work. This helps build understanding, confidence, and forces writing testable code.
I totally agree with this statement. At the same time, what I really wanted to do was leverage the existing microservice stack deployed to a shared environment while locally running the one microservice I was tweaking and debugging. This process would remove the need to reimplement live integrations for the sake of isolated local development, which was appealing because these live integrations would be the first things to be replaced with test doubles in any automated testing anyway. It would also create the tight feedback loop between the code I was working on and the external platforms that validated the output, which was necessary for the kind of “Oops, I used the wrong quotes, let me fix that” workflow I found myself in.
Looking for Inspiration
My Googling led me to “Why We Leverage Multi-tenancy in Uber’s Microservice Architecture,” which provides a fascinating insight into how Uber has evolved its microservice testing strategies.
The post describes parallel testing, which involves creating a complete test environment isolated from the production environment. I suspect most development teams are familiar with test environments. However, the post goes on to highlight the limitations of a test environment, including additional hardware costs, synchronization issues, unreliable testing and inaccurate capacity testing.
The alternative is testing in production. The post identifies the requirements to support this kind of testing:
There are two basic requirements that emerge from testing in production, which also form the basis of multitenant architecture:
- Traffic Routing: Being able to route traffic based on the kind of traffic flowing through the stack.
- Isolation: Being able to reliably isolate resources between testing and production, thereby causing no side effects in business-critical microservices.
The ability to route test traffic to a specific and isolated microservice was exactly what I was looking for. It removed the need to recreate the entire microservice stack and supporting platforms locally for testing while leaving any production traffic unaffected.
The only question was how to implement this with AWS Lambdas, which were hosting our microservices.
Looking for an Existing Solution
Unfortunately, while Kubernetes platforms can take this kind of routing for granted with advanced tooling like service meshes, there was no such ecosystem for Lambdas. Lambda extensions come tantalizingly close, but are focused on collecting metrics or modifying the execution environment rather than intercepting and modifying network traffic like Kubernetes does with sidecars:
You can deploy multifunction Lambda Layers to manage large binaries, or (now in preview) Lambda Extensions to plug in third-party agents that I’ve been told should definitely not be thought of as “sidecars for Lambda.”
The AWS App Mesh FAQ makes no mention of Lambdas, and while there are many questions about the dynamic routing of Lambda traffic on sites like StackOverflow, the response is always “you’re on your own.”
Defining the Problem
The problem I was trying to solve was traditionally the domain of a reverse proxy. However, unlike traditional reverse proxies, which have rich, static, server-side rules, what I needed was a rather dumb reverse proxy that implemented routing rules embedded in test requests.
This dumb reverse proxy (DRP, or even better, “derp”) needed to be deployed as a Lambda, and required the ability to forward traffic to upstream HTTP servers, Lambdas and even Amazon Simple Queue Service (SQS) queues. Looking further ahead, it would also be nice if the DRP could integrate with other platforms, like Azure or Google Cloud.
Go was the perfect choice to build the DRP. It already has an HTTP reverse proxy included in the standard library, compiles to native binaries with a short cold boot time and is popular enough to have first-class SDKs for major cloud providers.
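As a rough illustration of the standard library support mentioned above, the sketch below builds a single-target reverse proxy with net/http/httputil. It is not the DRP's actual code: the names newProxy and forwardedURL, the target URL and the stub transport are all assumptions made so the example runs without opening a network connection.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"strings"
)

// roundTripFunc adapts a function to http.RoundTripper so the sketch can
// observe proxied requests without opening a network connection.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) {
	return f(r)
}

// newProxy returns a reverse proxy that forwards every request to target.
// A nil transport makes the proxy fall back to http.DefaultTransport.
func newProxy(target string, transport http.RoundTripper) (*httputil.ReverseProxy, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(u)
	proxy.Transport = transport
	return proxy, nil
}

// forwardedURL sends one GET request through the proxy and reports the URL
// the upstream service would have been called with.
func forwardedURL(target, path string) (string, error) {
	var seen string
	stub := roundTripFunc(func(r *http.Request) (*http.Response, error) {
		seen = r.URL.String()
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("ok")),
			Request:    r,
		}, nil
	})
	proxy, err := newProxy(target, stub)
	if err != nil {
		return "", err
	}
	proxy.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, path, nil))
	return seen, nil
}

func main() {
	// The target below is a hypothetical local upstream.
	got, err := forwardedURL("http://localhost:8080", "/api/templates")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

In a deployed proxy the stub transport would be dropped, and a different transport or handler would be needed for destinations that are not plain HTTP servers, such as Lambdas or SQS queues.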
Deciding on Routing Rules
Given the routing rules are included with each test request, it made sense to include them in an HTTP header. It is certainly possible to send complex objects, like JSON blobs, in HTTP headers, but a better solution was to allow the routing rules to be defined as a simple string.
The rules take the form route
- /path/to/resource is an HTTP path, optionally supporting Ant path syntax
- METHOD is the HTTP method
- destination is the upstream service to redirect the traffic to
- destinationname identifies the upstream service, be it a Lambda name, SQS queue, or HTTP server
Multiple such rules are concatenated with a semicolon.
This string is passed to each microservice in the Routing header, and each microservice is expected to pass this header along with each outgoing call. In the absence of any routing rules, the DRP routes traffic to a default upstream service.
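To make the rule handling concrete, here is a minimal sketch of parsing a Routing header and falling back to a default upstream. The exact rule syntax is not reproduced in this text, so the sketch assumes a hypothetical form, route[/path:METHOD]=destination[name], with lambda and http as assumed destination keywords. It also supports only a trailing /** wildcard rather than full Ant path syntax. The names rule, parseRule, pathMatches and resolve are illustrative, not taken from the DRP source.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is one parsed routing rule.
type rule struct {
	Path        string // HTTP path pattern, e.g. /api/templates/**
	Method      string // HTTP method, e.g. POST
	Destination string // kind of upstream (assumed keywords: lambda, http)
	Name        string // Lambda name, queue, or URL of the upstream
}

// parseRule parses the assumed form route[/path:METHOD]=destination[name].
func parseRule(s string) (rule, error) {
	s = strings.TrimSpace(s)
	eq := strings.Index(s, "=")
	if eq < 0 {
		return rule{}, fmt.Errorf("missing '=' in rule %q", s)
	}
	lhs, rhs := s[:eq], s[eq+1:]
	if !strings.HasPrefix(lhs, "route[") || !strings.HasSuffix(lhs, "]") {
		return rule{}, fmt.Errorf("malformed match in rule %q", s)
	}
	match := lhs[len("route[") : len(lhs)-1]
	colon := strings.LastIndex(match, ":")
	open := strings.Index(rhs, "[")
	if colon < 0 || open <= 0 || !strings.HasSuffix(rhs, "]") {
		return rule{}, fmt.Errorf("malformed rule %q", s)
	}
	return rule{
		Path:        match[:colon],
		Method:      strings.ToUpper(match[colon+1:]),
		Destination: rhs[:open],
		Name:        rhs[open+1 : len(rhs)-1],
	}, nil
}

// pathMatches supports exact paths and a trailing /** wildcard only.
func pathMatches(pattern, path string) bool {
	if strings.HasSuffix(pattern, "/**") {
		prefix := strings.TrimSuffix(pattern, "/**")
		return path == prefix || strings.HasPrefix(path, prefix+"/")
	}
	return pattern == path
}

// resolve returns the first rule in the semicolon-separated header that
// matches the request, or fallback when nothing matches.
func resolve(header, method, path string, fallback rule) rule {
	for _, raw := range strings.Split(header, ";") {
		r, err := parseRule(raw)
		if err != nil {
			continue // skip empty or malformed rules
		}
		if r.Method == strings.ToUpper(method) && pathMatches(r.Path, path) {
			return r
		}
	}
	return fallback
}

func main() {
	// The default upstream name is hypothetical.
	fallback := rule{Destination: "lambda", Name: "TemplateGenerator"}
	header := "route[/api/templates/**:POST]=lambda[TemplateGenerator-MyFeatureBranch]"
	fmt.Println(resolve(header, "POST", "/api/templates/github", fallback).Name)
	fmt.Println(resolve("", "POST", "/api/templates/github", fallback).Name)
}
```

Skipping malformed rules instead of failing the request is a design choice in this sketch; a stricter proxy could reject such requests outright.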
Enabling Advanced Deployment and Testing Patterns
These routing rules provide for some interesting deployment and testing scenarios.
Feature branching is supported by deploying a feature branch Lambda with a unique name, like TemplateGenerator-MyFeatureBranch, and routing test requests to the feature branch Lambda.
Blue/green deployments are achieved by deploying the new green microservices parallel to the existing blue microservices, testing the green stack by routing test requests via the DRP, and once the tests pass, reconfiguring the DRP to set the default upstream services to those in the green stack.
Perhaps most exciting of all is the ability to route test traffic from the cloud network back to your local PC. Using services like ngrok to expose a local port via a public hostname or a standard client VPN into an AWS VPC, it is possible to route traffic for a single microservice back to your desktop environment. Much like the Kubernetes offerings Telepresence or Bridge to Kubernetes, this effectively allows a locally run microservice to participate in requests passed around a remote microservice stack.
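For a locally run microservice to keep participating in a routed request, it has to forward the Routing header on its own outgoing calls, as described earlier. The sketch below shows one way that forwarding could look. The helper names copyRouting and newOutgoingRequest, the example URL and the placeholder rule string are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

const routingHeader = "Routing"

// copyRouting copies the Routing header from src to dst when it is present.
func copyRouting(dst, src http.Header) {
	if rules := src.Get(routingHeader); rules != "" {
		dst.Set(routingHeader, rules)
	}
}

// newOutgoingRequest builds a request to another microservice that carries
// the routing rules and the context of the incoming request.
func newOutgoingRequest(incoming *http.Request, method, url string) (*http.Request, error) {
	out, err := http.NewRequestWithContext(incoming.Context(), method, url, nil)
	if err != nil {
		return nil, err
	}
	copyRouting(out.Header, incoming.Header)
	return out, nil
}

func main() {
	incoming := httptest.NewRequest(http.MethodPost, "/api/templates", nil)
	incoming.Header.Set(routingHeader, "example-rules") // placeholder rule string

	// The downstream URL is hypothetical.
	out, err := newOutgoingRequest(incoming, http.MethodGet, "https://example.com/api/audit")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out.Header.Get(routingHeader))
}
```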
Thinking about Security
There are obvious issues with allowing anyone to route traffic anywhere based on a well-known header, so the DRP is configured to only inspect the Routing header on requests from trusted callers; on any other request, the Routing header is ignored. This ensures that only trusted team members can route production traffic.
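One way to picture ignoring the Routing header for untrusted requests is a small piece of middleware that strips the header before any routing logic runs. This is a sketch only: how a caller is recognized as trusted is not covered here, so isTrusted is a placeholder predicate, and guardRouting and rulesSeen are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

const routingHeader = "Routing"

// guardRouting removes the Routing header from any request that isTrusted
// rejects, so later handlers only ever see rules from trusted callers.
func guardRouting(isTrusted func(*http.Request) bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !isTrusted(r) {
			r.Header.Del(routingHeader)
		}
		next.ServeHTTP(w, r)
	})
}

// rulesSeen reports which routing rules reach the next handler for a
// request that arrives carrying the given rules.
func rulesSeen(rules string, trusted bool) string {
	var seen string
	next := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		seen = r.Header.Get(routingHeader)
	})
	req := httptest.NewRequest(http.MethodGet, "/api/templates", nil)
	req.Header.Set(routingHeader, rules)

	// The predicate here is a stand-in for a real trust check.
	h := guardRouting(func(*http.Request) bool { return trusted }, next)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return seen
}

func main() {
	fmt.Printf("trusted: %q\n", rulesSeen("example-rules", true))
	fmt.Printf("untrusted: %q\n", rulesSeen("example-rules", false))
}
```

With the header stripped, an untrusted request simply falls through to the default upstream service.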
The DevX Impact of Testing in Production
This local development experience was incredibly valuable. It allowed me to use the stable production environment, removing the need to recreate the complete microservice stack and supporting platforms locally, while also allowing me to iterate locally on a single microservice that received test traffic otherwise identical to production traffic. And all of this was done safe in the knowledge that no production traffic was affected.
An additional benefit was that all logic implemented in the API Gateway was respected. API Gateway is a complex platform offering almost unlimited options for manipulating traffic before it reaches upstream services. It is possible to run a local test API Gateway, but setting this up is now unnecessary.
However, this approach does require that each microservice expose both a Lambda event handler to respond to traffic from the API Gateway and an HTTP server to expose the service while debugging locally. This wasn’t a burden though, as all modern frameworks make it easy to spin up an HTTP server. In practice, this means those structuring their code along domain-driven design (DDD) layers will have an application layer exposing both an HTTP and Lambda interface, with the lower layers being ignorant of how the traffic was received.
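The layering described above might look something like the following sketch, where one transport-agnostic function sits behind two thin adapters. To stay self-contained it uses hypothetical gatewayEvent and gatewayResponse structs instead of real API Gateway payload types; an actual Lambda would more likely use the types and entry point from the aws-lambda-go library. The generate function and its canned output are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// generate is the application logic. It has no knowledge of how the
// request arrived. The returned templates are placeholders.
func generate(platform string) (string, error) {
	switch platform {
	case "github":
		return "# GitHub Actions workflow", nil
	case "jenkins":
		return "// Jenkinsfile", nil
	}
	return "", fmt.Errorf("unsupported platform %q", platform)
}

// gatewayEvent and gatewayResponse are hypothetical, trimmed-down stand-ins
// for the payloads exchanged with API Gateway.
type gatewayEvent struct {
	Body string
}

type gatewayResponse struct {
	StatusCode int
	Body       string
}

// handleEvent is the Lambda-facing adapter.
func handleEvent(e gatewayEvent) gatewayResponse {
	out, err := generate(strings.TrimSpace(e.Body))
	if err != nil {
		return gatewayResponse{StatusCode: http.StatusBadRequest, Body: err.Error()}
	}
	return gatewayResponse{StatusCode: http.StatusOK, Body: out}
}

// handleHTTP is the HTTP-facing adapter used when running locally.
func handleHTTP(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	resp := handleEvent(gatewayEvent{Body: string(body)})
	w.WriteHeader(resp.StatusCode)
	fmt.Fprint(w, resp.Body)
}

func main() {
	// Exercise both adapters against the same application logic.
	fmt.Println(handleEvent(gatewayEvent{Body: "jenkins"}).Body)

	rec := httptest.NewRecorder()
	handleHTTP(rec, httptest.NewRequest(http.MethodPost, "/", strings.NewReader("github")))
	fmt.Println(rec.Code, rec.Body.String())
}
```

A real entry point would pick one adapter at startup, registering the event handler when running in Lambda and starting an HTTP server when running on a developer machine.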
I fully expect all Lambdas we deploy in the future will be hosted behind a DRP. The productivity gains unlocked by the ability to develop and debug individual microservices locally against a stable production microservice stack are undeniable. While testing in production is no substitute for comprehensive unit and integration tests, it does allow you to quickly reproduce and observe quirky behavior, and experiment with new ideas and solutions.
The DRP source code is available on GitHub. Let us know if you find it useful!