IoT Edge Computing

How Pinterest Tuned Memcached for Big Performance Gains

16 May 2022 12:47pm, by

The Pinterest social media site uses a memcached-based cache-as-a-service platform to slash application latency across the board, minimizing the overall cloud cost footprint, and meet the strict site-wide availability targets.

Pinterest’s memcached-based caching system is massive — over 5,000 EC2 dedicated instances from Amazon Web Services, that span a variety of instance types optimized for compute, memory, and storage dimensions. As a whole, the fleet serves up to approximately 180 million requests per second and approximately 220 GB/s of network throughput over a ~460TB active in-memory and on-disk dataset, partitioned among approximately 70 distinct clusters.

With that large of a fleet, any optimization, no matter how small has a magnified effect. Because each copy of memcached runs as the exclusive workload on its respective virtual machine, Pinterest engineers were able to fine-tune the VM Linux kernels to prioritize the CPU time for caching. One configuration change reduced latency by up to 40% and smoothed over performance overall.

Also, by observing data patterns and breaking them into workload classes, Pinterest engineers to dedicate EC2 instances for specific workloads, driving down costs and improving performance.

Kevin Lin, Pinterest software engineer for storage and caching, in a blog entry posted last week, discussed some of what the company has learned running memcached at scale in production. Here are a few of the highlights.

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

High-level description of the available surface area for performance optimization for memcached running on virtual machines in public cloud environments

The Scientific Theory at Work

In order to identify possible areas for optimization, Pinterest set up a controlled environment to create structured, reproducible workloads for testing and evaluating. The following variables allowed for high-level testing with minimal impact to critical-path production traffic over the years.

  • Server-side: This includes metrics for request throughput, network throughput, resource utilization, and hardware-level parameters (NIC statistics like per-queue packet throughout and EC2 allowance exhausting, disk response times and in-flight I/O requests, etc.).
  • Client-side metrics for cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as top-level application performance indicators like RPC P99 response time.
  • Synthetic Load Generation: This practice is historically known for detecting performance improvements or regressions while under maximum load. Pinterest used memtier_benchmark, an open source tool that generates their load against a memcached cluster.
  • Production Shadow Traffic: This is the process of mimicking real production traffic with the purpose of evaluating the system performance at scale. Pinterest uses Facebook’s mcrouter, an open source memecached-protocol routing proxy deployed as a client-side sidecar in the Pinterest fleet.

The Results Are in

Cloud Hardware

Pinterest divided its diverse collection of workloads on a high level into the workload classes below. Each class was designated to a fixed pool of optimized EC2 instance types to allow for vertical scaling for greater cost efficiency. Since horizontal scaling is available anytime to alleviate bottlenecks as they arise but is not the most cost-effective method. Pinterest was more interested in Vertical scalability.

Workload Classes:

  • Throughput (compute)
  • Data Volume (memory and/or disk capacity)
  • Data Bandwidth (network and compute)
  • Latency Requirement (compute)

The following workload profiles were created from the classes above:

  • Moderate Throughput, Moderate Data Volume | r5
  • High Throughput, Low Data Volume | c5
  • High Data Volume, Relaxed Latency Requirement | r5d
  • Massive Data Volume, Relaxed Latency Requirement | i3, i3en
EC2 instance type distribution for the Pinterest memcached fleet

EC2 instance type distribution for the Pinterest memcached fleet

When selecting the EC2 instances, and determining which cluster will go to which instance it mainly boiled down to these criteria: CPU, memory size, and disk speed.

Breaking down data patterns into workload classes and dedicating specific EC2 instances for each workload class drives down costs while improving I/O performance. Adding memcached add-on extstore improves storage efficiency while proportionately reducing the cloud cluster footprint.

The instances are configured with Linux software RAID at level RAID0 to combine multiple hardware clock devices into a single, logical disk for user space consumption. Pinterest stripes reads and writes evenly across two disks thus RAID0 doubles the maximum theoretical I/O throughput with a best-case two-fold reduction in effective disk response time with a troubled MTTF.

Pinterest makes it clear that this increased hardware performance for extstore at an increased theoretical failure rate is a highly worthwhile tradeoff.  The infrastructure is on a public cloud so it is self-healing and mcrouter is able to handle server changes immediately.

Computing

The memcached docs define “compute efficiency” as the additional rate of requests that can be serviced by a single instance for each percentage point increase in instance CPU usage, without increasing request latency. By this definition, optimizing compute efficiency is measurable in terms of allowing memcached to serve a higher request rate at lower CPU usage without changing the latency characteristics.

Half of all caching workloads at Pinterest are compute-bound (purely request throughput-bound). Pinterest’s goal was to downsize clusters without compromising serving capacity.

With the majority of Pinterest’s workloads running on dedicated EC2 virtual machines, this opens up a unique opportunity to optimize at the hardware-software boundary while in the past optimizations centered around hardware modifications.

Memcached is somewhat unique among stateful data systems at Pinterest in that it is the exclusive primary workload, with a static set of long-lived worker threads, on every EC2 instance on which it is deployed. Because of this, Pinterest tuned the process scheduling to request the kernel prioritize the CPU time for memcached at the expense of deliberately withholding CPU time from other processes on the host, like monitoring daemons.

This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with a high priority — instructing the kernel to, effectively, allow memcached to monopolize the CPU by preempting (essentially starving) all non-real-time processes whenever a memcached thread becomes runnable. Below is an example invocation of memcached under a SCHED_FIFO real-time scheduling policy

$ sudo chrt — — fifo <priority> memcached …

This one line change drove client-side P99 latency down by 10% to 40% after rollout to all compute-bound clusters and eliminated spurious spikes in P99 and P999 latency across the board. The steady-state operation CPU usage ceiling was raised 20% without introducing latency regressions and 10% of memcached’s total fleet-wide cost footprint was shaved thanks to this optimization.

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster

Week-over-week comparison of client-side P99 cache latency for one service while real-time scheduling was rolled out to its corresponding dedicated memcached cluster.

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem)

Ratio of time spent by memcached waiting for execution by the kernel versus wall clock time, before and after real-time scheduling was enabled (data is collected from schedstat in the /proc filesystem).

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series)

Stabilization of spurious latency spikes after real-time scheduling was enabled on a canary host (red-colored series).