Etsy’s Tool for Squeezing Latency From TensorFlow Transforms
At Etsy, slow TensorFlow transforms turned out to be the leading cause of machine learning model performance issues. And increasing the size of search batches doesn’t necessarily raise server-side latencies. These are but two of the lessons Etsy has learned deploying its new ML testing tool.
The original blog post written by Sallie Walecka, a machine learning engineer at Etsy, and Kyle Gallatin, a senior software engineer for machine learning at the company, goes into great detail on the topic.
As the global online marketplace developed deep learning models for its search ranking platform, its engineering team had to create automated testing practices to lower latencies.
For this, they developed an internal automated testing tool, Caliper. With Caliper, testing happens much earlier in the development process, and that earlier testing revealed the leading causes of higher latencies. Detailed distributed tracing then helped eliminate unnecessary client-side timeouts and exposed a bottleneck from a problem engineers had already considered solved.
Why Is Low Latency for Search Rankings a Challenge?
Latency is never easy, but it’s specifically hard for search ranking machine learning because of “the number of features, bursty nature of the requests, and strict latency cutoffs,” according to the blog post authors.
While Etsy didn’t give a specific latency target, it’s under 250 milliseconds, since all of the following work takes place within that window (the majority within 75 milliseconds).
For each search request, 1,000 candidate listings are fetched; then 300 features are retrieved from the feature store. The data is then sent to Etsy’s machine-learning services for scoring and ranking before getting sent back to the user. This process is described by the blog post authors as “bursty cpu-bound workloads that are very costly.”
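The scale makes the cost concrete. A back-of-the-envelope calculation using the numbers from the post (the fan-out arithmetic, not Etsy’s actual code):

```python
candidates = 1000  # candidate listings fetched per search (from the post)
features = 300     # features retrieved per listing from the feature store

# Every single search request means this many feature values
# to fetch, transmit and score before the response goes back.
values_per_request = candidates * features
print(values_per_request)  # 300000
```

Three hundred thousand feature values per request, under bursty traffic, is what makes these workloads so costly.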
These expensive CPU-bound workloads were hard to troubleshoot and adjust because, due to high development overhead, model testing wasn’t performed until right before launch. It should come as no surprise that this led to “unexpected surprises and headaches,” according to Walecka and Gallatin, which isn’t a sustainable long-term development strategy.
Caliper’s Automated Early Testing
Etsy’s Caliper can automatically test early in the development lifecycle.
As soon as a model artifact is trained, it can automatically be load tested with training data and profiled with TensorBoard. Caliper replaced the legacy manual process, completing a test in only five minutes. It displays latency distributions, model errors and a profile of slow TensorFlow operations via a web UI. Data selection, RPS, batch size and model selection all remain customizable.
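Caliper’s internals aren’t public, but the core idea of an automated load test that produces a latency distribution can be sketched with the standard library. The `predict` stub below is a hypothetical stand-in for a served model call:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def predict(batch):
    """Hypothetical stand-in for a call to a served model."""
    time.sleep(0.002)  # simulate ~2 ms of model inference
    return [0.5] * len(batch)


def load_test(batches, concurrency=4):
    """Send batches concurrently and collect per-request latencies in ms."""
    latencies = []

    def timed_call(batch):
        start = time.perf_counter()
        predict(batch)
        latencies.append((time.perf_counter() - start) * 1000.0)

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(timed_call, batches))
    return latencies


latencies = load_test([[0.0] * 25 for _ in range(100)])
print(f"p50={statistics.median(latencies):.1f} ms "
      f"p99={statistics.quantiles(latencies, n=100)[98]:.1f} ms")
```

A real tool like Caliper would point this at actual training data and a deployed model endpoint, and surface the resulting distribution in a UI rather than on stdout.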
Caliper revealed two important factors leading to higher latencies. The first was that most model performance issues were caused by slow TensorFlow transforms.
The second came from Caliper’s ability to test different batch sizes. Larger batches didn’t lead to substantially higher latencies on the server side but greatly decreased the overhead of requests prepared by the orchestration layer, so the batch size was increased from five to 25.
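The win here is amortization: with a roughly fixed per-request preparation cost in the orchestration layer, scoring 1,000 candidates in batches of 25 instead of five means one-fifth as many requests. A toy calculation (the per-request overhead figure is made up for illustration):

```python
CANDIDATES = 1000   # listings scored per search (from the post)
OVERHEAD_MS = 2.0   # hypothetical per-request prep cost in the orchestration layer


def total_overhead_ms(batch_size):
    # Number of scoring requests needed to cover all candidates.
    requests = CANDIDATES // batch_size
    return requests * OVERHEAD_MS


print(total_overhead_ms(5))   # 200 requests of prep work -> 400.0
print(total_overhead_ms(25))  # 40 requests of prep work -> 80.0
```

As long as server-side latency stays flat as batches grow, which is what Caliper showed, the total per-search overhead falls in proportion.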
Distributed Tracing Means Even Lower Latency
But the war was not won yet, as there were still client-side timeouts at 250 milliseconds. Model predictions took only 50 milliseconds, leaving roughly 200 problematic milliseconds unaccounted for somewhere. The Prometheus histogram metrics for the client, proxy and TensorFlow Serving container only went so far, so Etsy turned to distributed tracing through its Envoy service proxy for increased observability.
Distributed tracing offered granular samples of latency across components at an individual request level, rather than only the p99 (99th percentile) latency range (which was 100 milliseconds to 250 milliseconds in this case). Envoy access logs provided the breakdown of where latency was occurring and revealed the problematic 200 milliseconds was spent transmitting features from the client to the proxy, before the request even reached the model.
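The value of that per-request breakdown can be illustrated with a small sketch. The records below are hypothetical, in the spirit of Envoy access-log timings; the field names and numbers are illustrative, not Etsy’s actual schema:

```python
import statistics

# Hypothetical per-request timing records, one dict per traced request.
requests = [
    {"client_to_proxy_ms": 180, "proxy_to_model_ms": 10, "model_ms": 50},
    {"client_to_proxy_ms": 210, "proxy_to_model_ms": 12, "model_ms": 48},
    {"client_to_proxy_ms": 195, "proxy_to_model_ms": 11, "model_ms": 52},
]

# Per-component medians show where the time actually goes: here, almost
# all of it sits between the client and the proxy, not in the model.
for component in ("client_to_proxy_ms", "proxy_to_model_ms", "model_ms"):
    med = statistics.median(r[component] for r in requests)
    print(f"{component}: median {med} ms")
```

An aggregate p99 over the whole request would hide exactly this kind of split, which is why the per-component view was the key to finding the bottleneck.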
Etsy determined that the issue was workload specific: while “features transmission” time wasn’t linear with payload size at a request level, search rankings suffered from the issue while ads rankings didn’t. The search model was slower overall, with more features and a payload roughly 20 times as large.
The payload size being the bottleneck came as a surprise to Etsy, because the protobuf payload was already 1 megabyte, down from the 4-megabyte JSON payload of the earlier decision-tree model.
Search orchestration tried compression with gRPC, which yielded payloads roughly 25% of the original size, and the results bore out the fix: the search model’s error rates decreased by 68% and p99 latency dropped by around 50 milliseconds.
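The mechanism is ordinary lossless compression of the serialized message before it crosses the wire, which gRPC supports natively. A stdlib sketch of the effect on a hypothetical feature payload (the byte string is a made-up stand-in for the ~1 MB protobuf; being highly repetitive, it compresses far better than real feature data, so real-world ratios like Etsy’s ~25% are more modest):

```python
import gzip

# Hypothetical repetitive feature payload, roughly 1.1 MB uncompressed.
payload = b"feature_id:123;value:0.4567;" * 40_000

compressed = gzip.compress(payload)
print(f"{len(payload)} -> {len(compressed)} bytes "
      f"({len(compressed) / len(payload):.0%} of original)")
```

The trade is CPU time for bytes on the wire, which pays off when, as here, transmission rather than computation dominates the latency budget.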
The number of ML features is only going to increase, and models will become more complex. To stay ahead of this, Etsy is working on ways to further decrease payload size. Company engineers are also continuing development on Caliper and the automated infrastructure.
TensorFlow was developed by Google and is an end-to-end open-source platform for machine learning and AI. There was no reference to any future open-sourcing of Caliper.
Etsy published the original article last October; it focused on the move from gradient-boosted trees to deep learning and covers much of the how and why of that move.