Trivago Cracks the Case of the Vanishing Performance Improvements
In its staging environment, online travel service Trivago found the Fastify web framework responded to HTTP requests 107% (more than two times) faster than Express.js on average, as carried out on k6 load testing. Great news. But these performance improvements vanished in production. Why?
This conundrum was the subject of a recent blog post, written by Abdelrahman Abdelhafez, a Trivago backend software engineer. The post documents a technical challenge and investigation that sprouted up after migrating a monolithic Node.js GraphQL server from Express to Fastify.
To simplify the deployment of the tests, Trivago’s staging environment simulates the production environment to a high degree of parity. To further ensure accurate metrics, both the Express and Fastify staging deployments were:
- identical. The only point of divergence were the web frameworks running under the hood.
- run concurrently to simulate running conditions because the data sources that GraphQL fetches data from automatically scale.
With issues within the test environments ruled out and Abdelhafez and team, still hunting for the “Eureka!” moment, investigated the Express-to-Fastify migration pull request diff. They hoped to find something that pinpointed what lead to the bug. The answers they were seeking were not found so the debugging continued.
Abdelhafez explains this next part as, “[bringing] out the big guns: live profiling.”
For this part, Trivago rolled out a single Kubernetes pod equipped with a node-heapdump package which allowed them to take a snapshot of the V8 heap in a live deployment. Afterward, they directed a tiny portion of their incoming traffic to this special pod.
And for the first time, they laid eyes on the bug: The log reaction logic was consuming “obnoxious” amounts of memory. Why?
This is the process of overwriting sensitive information (passwords, access tokens, emails, etc) before they are written to the application logs. Log redaction simplifies Trivago’s compliance with GDPR and makes their logs less confidential thus making them viewable by more devs for monitoring and debugging purposes. It definitely sounds like something they need to keep in the application.
Log Redaction in the Fastify Days: Fastify uses pino internally and recommended it so that didn’t change. The Fastify docs say that pin has built-in low-overhead redaction support so pino was explored without deepdash.
The image below shows the redaction feature configured with pino’s redact option:
The code above logs the following to the console:
In theory, this is perfect. No object traversal of any kind. One just needs to specify the paths needed to redact and “voilá” pino redacts them. But, real-life use cases don’t usually work like this. And in the case of Trivago, they need to stringify the GraphQL variables object after redaction but before logging to prevent log post-processing tools (e.g. Elasticsearch/Kibana) from indexing individual GraphQL variable properties.
And this limitation caused Trivago to abandon pino’s baked-in redaction and try something else — fast-redact. fast-redact is a library that pino uses and promises “very fast object redaction” with it being only ≈ 1% slower than JSON.stringify. The code was implemented.
But fast-redact did not redact fast. Zooming in on the log redaction showed that the code written using fast-redact was the exact code slowing down the entire Node.js server. Why?
Finally an answer!
The deep dive investigation into fast-redact revealed a little implementation error.
When the image above is compared with the previous image of the fast-redact function, it’s right there in plain sight. Rather than generating the redact function once, it was regenerating the redact function on every single request.
Here’s what the correct implementation looks like per the ReadME.md.
The code was updated and it turns out fast-redact is fast. here is an isolated stat to prove it.
Abdelhafez explains this in a way that brings all the drama and I’m here for it: If the distance between the Earth and the Sun is 1 AU (≈ 8.3 light minutes) then 238,756 AU (≈ 3.8 light years) is roughly the distance between Earth and the second nearest star to Earth after the Sun, Proxima Centauri.
The Node.js server’s memory usage dropped from an average of 4GBs to 2GBs. This made their Kubernetes cluster in the EU region automatically scale down from 210 to 160 pods. Now hold the pods are able to handle the same amount of traffic in the busiest region.
How was this missed in the testing phase? Trivago manually disabled logging in the Express and Fastify deployments in the staging during load tests. The engineers didn’t consider that disabling logging also disabled the logic surrounding logging (ie redaction) and rendered that disk I/O “irrelevant” in the context of load tests. Abdelhafez said, “🤦♂️.”
To summarize: Fastify is superior to Express, performance-wise, and don’t disable any production flags during testing.