Wix Multithreaded Node.js to Cut Kubernetes Pod Costs
By adding worker threads to its Node.js servers, Wix decreased its Kubernetes pod usage by ~70%, giving the website building service the stats to prove Node.js is indeed suited for CPU-intensive high-throughput work.
But a single thread can only do so much, as Wix software developer Guy Treger explains in a recent blog post. Once traffic reached a total of 1 million RPM (requests per minute), serving it properly required a “more than accepted” number of production Kubernetes pods.
Suddenly the setup was becoming unmanageable: too many pods, and a high volume of CPU-intensive tasks running on the single thread built into Wix’s architecture. The most logical solution? Add more threads to split the workload.
Since the goal is always more traffic, not less, the company’s engineers needed a solution that would scale. Companies rarely get the chance to rebuild an entire architecture, and this was not one of those times. Because the single thread was causing the most pain, the aim was to offload work to additional threads that can run in parallel on hardware with multiple CPU cores.
Node.js’s built-in multiprocessing capabilities were “overkill” for what Wix’s engineering team needed. The team wanted a solution that would require fewer resources and less maintenance and orchestration.
Per the Node.js docs:
Unlike child_process or cluster, worker_threads can share memory.
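To make the shared-memory point concrete, here is a minimal sketch using only the built-in worker_threads module. It is an illustration, not Wix's code: the names workerSource and incrementInWorker are hypothetical, and the worker source is kept inline (eval: true) only so the example fits in one file.

```javascript
const { Worker } = require('worker_threads');

// Worker source kept inline (eval: true) so the sketch is a single file.
const workerSource = `
  const { workerData, parentPort } = require('worker_threads');
  // A view over the SAME memory the main thread allocated; nothing is copied.
  const counter = new Int32Array(workerData.shared);
  for (let i = 0; i < 1000; i++) Atomics.add(counter, 0, 1);
  parentPort.postMessage('done');
`;

// Hypothetical helper: lets a worker increment a counter that lives in shared memory.
function incrementInWorker() {
  const shared = new SharedArrayBuffer(4); // room for one 32-bit integer
  const counter = new Int32Array(shared);
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { shared } });
    // The main thread reads the value the worker wrote, straight from shared memory.
    worker.once('message', () => resolve(Atomics.load(counter, 0)));
    worker.once('error', reject);
  });
}

const sharedCounterResult = incrementInWorker();
sharedCounterResult.then((value) => console.log('counter after worker ran:', value));
```

The worker only sends a 'done' signal. The counter value itself never travels through a message, which is the difference from process-based approaches.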
Worker threads became Stable in v14 (LTS), released in October 2020, so while Node.js does offer native support, the feature is fairly new and still lacks maturity. To bring it into production-level code, the company’s engineers had to add open source packages. They originally searched for a single package to tie everything together but found that combining individual packages best suited their needs.
Hurdles and Open Source Solutions
Two major hurdles awaited the Wix team when adopting Node.js multithreading: task pool capabilities and inter-thread communication.
Task Pool Capabilities
Out-of-the-box: Worker threads must be spawned manually and their lifecycle maintained manually.
Hurdle: It’s a large undertaking to constantly make sure there are enough workers, re-create worker threads when they die, implement different timeouts, and take on all the other responsibilities of manual maintenance.
Open Source Solution: generic-pool (npmjs). Adding this popular pooling API provided the thread-pool capabilities.
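To show the kind of bookkeeping a pooling library takes over, here is a rough, hand-rolled pool built only on worker_threads. It is a stand-in for illustration and deliberately does not use generic-pool or mirror its API. TinyWorkerPool, workerSource, and the summing task are all hypothetical, and timeouts are left out to keep the sketch short.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical CPU-bound task: sum the integers 1..n inside the worker.
const workerSource = `
  const { parentPort } = require('worker_threads');
  parentPort.on('message', (n) => {
    let sum = 0;
    for (let i = 1; i <= n; i++) sum += i;
    parentPort.postMessage(sum);
  });
`;

class TinyWorkerPool {
  constructor(size) {
    this.workers = new Set(); // every live worker, busy or idle
    this.idle = [];           // workers ready to take a task
    this.waiting = [];        // callers waiting for a free worker
    this.closed = false;
    for (let i = 0; i < size; i++) this.spawn();
  }

  spawn() {
    const worker = new Worker(workerSource, { eval: true });
    this.workers.add(worker);
    // Re-create a worker that dies so the pool keeps its size.
    worker.once('exit', () => {
      this.workers.delete(worker);
      this.idle = this.idle.filter((w) => w !== worker);
      if (!this.closed) this.spawn();
    });
    this.release(worker);
  }

  acquire() {
    const worker = this.idle.pop();
    if (worker) return Promise.resolve(worker);
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  release(worker) {
    const next = this.waiting.shift();
    if (next) next(worker);
    else this.idle.push(worker);
  }

  async run(n) {
    const worker = await this.acquire();
    return new Promise((resolve, reject) => {
      const onMessage = (result) => {
        worker.off('error', onError);
        this.release(worker); // hand the worker to the next caller
        resolve(result);
      };
      // A failed worker is not released; its 'exit' handler spawns a replacement.
      const onError = (err) => {
        worker.off('message', onMessage);
        reject(err);
      };
      worker.once('message', onMessage);
      worker.once('error', onError);
      worker.postMessage(n);
    });
  }

  close() {
    this.closed = true;
    return Promise.all([...this.workers].map((w) => w.terminate()));
  }
}

// Three tasks on two workers: the third waits until a worker is released.
const pool = new TinyWorkerPool(2);
const poolResults = Promise.all([10, 100, 1000].map((n) => pool.run(n)))
  .finally(() => pool.close());
poolResults.then((sums) => console.log(sums)); // [ 55, 5050, 500500 ]
```

Even this small version has to track idle workers, queued callers, and replacements for dead threads, which is the maintenance burden described above.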
RPC-like Inter-Thread Communication
Out-of-the-box: Threads can communicate with one another (e.g., the main thread and its spawned workers) using an async messaging technique.
Hurdle: Dealing with the messaging directly would make the code harder to read and maintain. The engineers were looking for a package that allowed threads to “call a method” on another thread and receive results asynchronously.
Open Source Solution: comlink (npmjs). Adding this package made inter-thread communication in the code more concise and elegant. The package is known for its RPC communication in the browser with long-existing JS web workers. Compatibility with Node.js workers was added recently.
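The “call a method on another thread” idea can be sketched with the built-in messaging alone. The example below is a hand-rolled illustration of the pattern, not comlink's API or implementation. The wrapWorker helper, the render method, and the message shape ({ id, method, args }) are assumptions made for this sketch.

```javascript
const { Worker } = require('worker_threads');

// Worker side: a plain object of methods, dispatched by name.
const workerSource = `
  const { parentPort } = require('worker_threads');
  const api = {
    // Hypothetical stand-in for a CPU-heavy rendering step.
    render: (name) => '<h1>Hello, ' + name + '</h1>',
  };
  parentPort.on('message', async ({ id, method, args }) => {
    try {
      parentPort.postMessage({ id, result: await api[method](...args) });
    } catch (err) {
      parentPort.postMessage({ id, error: String(err) });
    }
  });
`;

// Main side: hide postMessage behind a Proxy so callers write `await remote.render(...)`.
function wrapWorker(worker) {
  let nextId = 0;
  const pending = new Map(); // call id -> { resolve, reject }
  worker.on('message', ({ id, result, error }) => {
    const { resolve, reject } = pending.get(id);
    pending.delete(id);
    if (error !== undefined) reject(new Error(error));
    else resolve(result);
  });
  return new Proxy({}, {
    get: (_target, method) => {
      // Ignore symbols and `then` so the proxy is never mistaken for a promise.
      if (typeof method !== 'string' || method === 'then') return undefined;
      return (...args) => new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, method, args });
      });
    },
  });
}

const rpcWorker = new Worker(workerSource, { eval: true });
const remote = wrapWorker(rpcWorker);
const rpcResult = remote.render('Wix').finally(() => rpcWorker.terminate());
rpcResult.then((html) => console.log(html)); // <h1>Hello, Wix</h1>
```

The calling code never touches message ids or listeners. That separation is what makes an RPC-style layer easier to read than raw messaging.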
The code with all packages looks similar to the image below.
Usage at the web-server level looks like the image below.
Results and Takeaways
The results show that Node.js is indeed suitable for CPU-bound, high-throughput services, and infrastructure management overhead was reduced. The team’s modest goal was overshadowed by substantial gains. SSRE pod count dropped by ~70%, with RPM per pod improving by 153%. The application became more stable and its SLA improved, with response time p50 dropping by ~11% and response time p95 dropping by ~20%. The error rate decreased by a factor of 10. Direct SSRE compute cost went down by about 21%.
Now that there is a working system in place, the Wix team is working on optimization. Areas being explored include refactoring applications so that workers do pure CPU work, researching memory sharing to avoid cloning large objects between threads, and finding the optimal number of CPU cores per machine, possibly allowing for thread pools of non-constant size. The team is also exploring whether this solution can be applied to other major Node.js-based applications at Wix.
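As a small illustration of the cloning concern, the sketch below uses the transfer list of postMessage, a built-in mechanism that moves an ArrayBuffer to a worker instead of copying it. This is a related technique rather than shared memory, and nothing here reflects what Wix ultimately chose. The names transferToWorker and workerSource are hypothetical.

```javascript
const { Worker } = require('worker_threads');

const workerSource = `
  const { parentPort } = require('worker_threads');
  parentPort.once('message', (buffer) => {
    // The worker now owns the buffer; report the size it received.
    parentPort.postMessage(buffer.byteLength);
  });
`;

// Hypothetical helper: hands a buffer to a worker without copying it.
function transferToWorker(byteLength) {
  const buffer = new ArrayBuffer(byteLength);
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true });
    worker.once('error', reject);
    worker.once('message', (seenByWorker) => {
      worker.terminate();
      // After a transfer the sender's buffer is detached, so its length reads 0.
      resolve({ seenByWorker, leftInMain: buffer.byteLength });
    });
    // The second argument is the transfer list: ownership moves, no clone is made.
    worker.postMessage(buffer, [buffer]);
  });
}

const transferResult = transferToWorker(1024);
transferResult.then((r) => console.log(r)); // { seenByWorker: 1024, leftInMain: 0 }
```

Transfer only helps with binary buffers. Ordinary JavaScript objects are still cloned when posted, which is why the article lists memory sharing as an open research area.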