How We Managed to Test the ‘Untestable’
Deploying an untested change to production will always cause some amount of stress.
The stress level rises with the importance of the system you are changing. If your traffic patterns are seasonal, stress also increases during a “high season.” That may be a daily peak, a high-traffic period of the year, the week your company goes public or anything similar. Every past incident you had while deploying a change also affects your stress level. The more recent the incidents, the more stress they add.
In my experience, deploying a tested change reduces stress to almost zero. Even with testing in place, incidents can and will happen, but you have a way to catch them sooner and not repeat them.
Reduced stress is reason enough to put effort into testing changes, even the ones that seem hard or even impossible to test before they reach production.
HAProxy, the Testing Nemesis
At Infobip, we deploy more than 1,000 different changes to production daily. Most of those changes are code, which can be, and is, tested. But there are also changes in network configurations, virtualization or storage layers, which are much harder to test.
Until recently, one of the things we considered impossible to test was HAProxy configurations. But we just weren’t thinking enough about it.
All our inbound HTTP traffic passes through an L7 balancing layer, where we use HAProxy. Over the years, we have grown to more than 40 data centers, many of which have specific configurations, mostly due to special client requirements, various migrations and different product stacks available in specific data centers. It’s impossible to have a staging environment addressing all the discrepancies.
Where We Fell Short Before
We tested a change on a single staging environment we had, manipulating the configuration to a state similar to the one we wanted to change in production. This process depended entirely on the engineers making the change.
There was no hard procedure in place that required us to test the change before production. We considered it common sense and preached it, but when someone created a pull request for the change, we never checked whether it was tested.
The only requirement for a change to get deployed to production was an approved pull request by one of the engineers. A few incidents later, we increased this to two engineers for the most important systems. We considered this a temporary measure until we thought of something better. Of course, this was a pure “hopium”-based strategy.
Here’s an example of one HAProxy-related incident:
This was a change that resulted in serious degradation of our platform. After this was applied, some of the requests on the api.infobip.com endpoint were routed to the wrong backend. As seen in the picture, it was a simple change.
Two engineers approved it, both thinking that HAProxy behaves differently than it does. One of them was a new hire. Imagine the stress they experienced and all the flashbacks they had when the next change was needed. We decided to improve this process significantly.
A New Approach: Scratch Only Where It Itches
At first, we thought that the difficulty of testing HAProxy configurations before production was inherent to the type of configuration and the way we use it.
“But every HAProxy configuration is unique to a specific data center!”
“We can’t fake responses of all backends!”
“There is a lot of client-specific context in our HAProxy configurations that you must understand to do it right!”
A few costly incidents over the years made us think harder about a solution, and it wasn’t that difficult to find. During the process, we learned that while these statements are still true, they don’t make the configuration untestable.
Our HAProxy configuration consists of many access lists routing traffic to specific backends depending on regexes. When we took a long, hard look at the incidents, we noticed about 90% were due to a misconfigured regex or wrong order of access lists.
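To make the "wrong order of access lists" failure mode concrete, here is a minimal sketch in Python rather than HAProxy syntax. It relies on one fact about HAProxy: use_backend rules are evaluated in declaration order and the first match wins. The rule patterns and backend names below are hypothetical and are not taken from the configuration described in this article.

```python
import re

# Hypothetical rules: (regex applied to the request path, backend name).
# As with HAProxy's use_backend rules, the first matching rule wins.
RULES_BROKEN = [
    (r"^/mobile", "web_backend"),          # too broad, and listed first
    (r"^/mobile/2fa/", "twofa_backend"),   # can never be reached
]

RULES_FIXED = [
    (r"^/mobile/2fa/", "twofa_backend"),   # most specific rule first
    (r"^/mobile/", "web_backend"),         # trailing slash narrows the match
]


def route(rules, path, default="default_backend"):
    """Return the backend of the first rule whose regex matches the path."""
    for pattern, backend in rules:
        if re.search(pattern, path):
            return backend
    return default


print(route(RULES_BROKEN, "/mobile/2fa/pin"))  # web_backend (misrouted)
print(route(RULES_FIXED, "/mobile/2fa/pin"))   # twofa_backend
```

Both mistakes look harmless in a diff, which is why a reviewer can approve them without noticing.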
That’s when we realized that we didn’t need to test every line of the HAProxy configuration or make the testing process more complicated than it should be. We should just test the things we are having problems with! Scratch only where it itches!
And this itching part can be tested if we make it so. So, we did.
Every change we make to the HAProxy configuration must now pass a series of tests, and someone must still approve it. On some HAProxies, we run more than 3,000 tests on every change. Some of them were auto-generated, and some were written manually.
Every push to a git repository triggers a Jenkins build, which creates an ad-hoc Docker environment, prepares the HAProxy configuration for testing and runs all the tests. HAProxy directives that are not subjected to tests are replaced with generic values. If any test fails, we consider the build failed, and the change cannot be merged to the production branch.
We have tried a couple of tools for evaluating responses to HTTP requests, and we chose Hurl due to its good assertion engine, simple file format and test execution speed.
Hurl is a command-line tool that runs HTTP requests defined in a simple plain-text format. It can perform requests, capture values and evaluate queries on response headers and bodies. It is versatile: it can be used both for fetching data and for testing HTTP sessions, and it can assert on JSON responses and HTTP response headers. You can also test whole HTML pages or raw byte content.
The other piece of the test setup is the http responder, a simple web server that just responds (thus the “brilliant” name) with HTTP 200 OK, reflecting everything it received. Every HTTP request header that the http responder receives is returned to the client in a JSON structure.
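A reflecting server of this kind fits in a few lines. The following is a minimal sketch using only Python's standard library, not the actual http responder used at Infobip. The JSON field names ("method", "path", "headers") and the port are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ReflectingHandler(BaseHTTPRequestHandler):
    """Answer every request with 200 OK and echo what was received as JSON."""

    def _reflect(self):
        # Drain any request body so the connection is left in a clean state.
        length = int(self.headers.get("Content-Length", 0))
        if length:
            self.rfile.read(length)
        payload = json.dumps({
            "method": self.command,
            "path": self.path,
            "headers": dict(self.headers.items()),
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    # Treat the common methods identically: always reflect.
    do_GET = do_POST = do_PUT = do_DELETE = do_PATCH = _reflect

    def log_message(self, format, *args):
        pass  # keep test output quiet


def make_server(port=8080):
    """Build a responder bound to localhost; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), ReflectingHandler)

# To run it standalone: make_server(8080).serve_forever()
```

Because the server never interprets the request, anything the proxy adds, removes or rewrites on the way in becomes visible in the response body.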
During configuration preparation, all backend servers in the HAProxy configuration are pointed to the http responder. Also, the backend name is injected as a header into every request. This enables us to check whether our request is routed to the expected backend, whether the reverse proxy injected a required header, and whether it blocks a request with malicious content in its HTTP headers, as we configured it to.
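The preparation step can be sketched as a small text transformation. This is an illustration under stated assumptions, not the script used in the pipeline. The responder address, the X-Backend-Name header and the sample configuration are all hypothetical, and dropping the original server options (such as ssl or check) is a simplification chosen for this sketch.

```python
import re

RESPONDER = "127.0.0.1:8080"       # hypothetical address of the http responder
BACKEND_HEADER = "X-Backend-Name"  # hypothetical header carrying the backend name

SECTION = re.compile(r"^(global|defaults|frontend|backend|listen)\b\s*(\S*)")
SERVER = re.compile(r"^(\s*server\s+\S+\s+)\S+")


def prepare_for_testing(config: str) -> str:
    """Point every backend server at the responder and tag requests per backend."""
    out = []
    in_backend = False
    for line in config.splitlines():
        section = SECTION.match(line)
        if section:
            in_backend = section.group(1) == "backend"
            out.append(line)
            if in_backend:
                # Inject the backend name so a test can see where a request landed.
                out.append(
                    f"    http-request set-header {BACKEND_HEADER} {section.group(2)}"
                )
            continue
        server = SERVER.match(line) if in_backend else None
        if server:
            # Keep the server name, swap the address, drop the original options.
            out.append(f"{server.group(1)}{RESPONDER}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


SAMPLE = """\
frontend fe_http
    bind :80
    use_backend mobile_example if { path_beg /mobile/ }

backend mobile_example
    server app1 10.0.0.11:8443 ssl verify none check
    server app2 10.0.0.12:8443 ssl verify none check
"""

print(prepare_for_testing(SAMPLE))
```

The frontend rules, which are the part under test, pass through untouched. Only the parts that would need real infrastructure are replaced.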
Things that we test are:
- Access lists and request routing
- HTTP response status codes
- HTTP request header injections
- HTTP response header injections
- HTTP header or URL path rewrites
Writing the Tests
To test the behavior of a reverse proxy, you just need to describe the desired result of a specific request.
For example, we have the following directive in our configuration:
To translate: all requests to the test.api.infobip.com endpoint with a path beginning with /mobile/ should be routed to the backend named in the directive.
Now we just write this sentence as a test.
This test is executed on every change. So, if we introduce a faulty regex that would make the traffic on /mobile/ end up on another backend, this test will fail.
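The real tests are written in Hurl. As a rough illustration of the same idea, here is a Python stand-in that uses only the standard library. It assumes the proxy under test forwards to a responder that returns the received headers under a "headers" JSON key, and that the backend name arrives in an x-backend-name header. Both names, as well as the backend name and proxy address in the usage note, are hypothetical.

```python
import json
import urllib.request


def reflected_headers(proxy_url, host, path):
    """GET the path through the proxy; return (status, reflected request headers)."""
    request = urllib.request.Request(
        proxy_url.rstrip("/") + path,
        headers={"Host": host},  # select the virtual host we want to test
    )
    # Ignore any system-wide HTTP proxy settings so we talk to the target directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        body = json.loads(response.read().decode("utf-8"))
        headers = {name.lower(): value for name, value in body["headers"].items()}
        return response.status, headers


def check_routing(proxy_url, host, path, expected_backend, header="x-backend-name"):
    """Fail with an AssertionError if the request reached an unexpected backend."""
    status, headers = reflected_headers(proxy_url, host, path)
    assert status == 200, f"unexpected status {status}"
    actual = headers.get(header)
    assert actual == expected_backend, (
        f"{host}{path} was routed to {actual!r}, expected {expected_backend!r}"
    )
```

A call such as check_routing("http://127.0.0.1:8000", "test.api.infobip.com", "/mobile/1/example", "mobile_example") then expresses the sentence above as an executable check.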
Zero Incidents Achievement: Unlocked
Introducing a testing pipeline to our configuration change process has improved the quality of our platform, and the number of related incidents dropped to zero. My stress level is also quite close to that number now.
I encourage you to check your change deployment processes and try to find a spot for automatic testing. Remember: You don’t have to test everything. Test what matters. I guarantee that it will pay off.