Cachegrand, a Fast, Scalable Keystore for Data-Oriented Development

It’s easy to see that Cachegrand, a new open source project created by Microsoft Senior Software Engineer Daniel Salvatore Albano, is a labor of love.
“I love performances. Really, in my mind, there are no good reasons to let hardware go underutilized,” Albano said. It’s an appreciation not only for caching but also for the applications that depend on it. “We spend money or resources so we should use them,” Albano said, noting that we should minimize the waste that comes with using poorly performing tools.
Albano describes Cachegrand as a “modern, blazing fast open source caching platform designed for performance [and] built for speed” in his talk “cachegrand: A take on High-Performance Caching” at ScyllaDB’s P99 Conf 2022. The software, written in C, scales vertically (almost linearly) and aims for compatibility with most well-known caching solutions.
He was also upfront that, as general-purpose software, it can be slower for certain specialized use cases, though he didn’t list them specifically. Redis is its closest competitor, and there are some side-by-side comparisons later in this article. Some open source competition is welcome to keep the large tech companies on their toes.
When benchmarked on a single AMD EPYC 7502 (32 cores, 64 threads) with 256GB of RAM @ 3200MHz, running Ubuntu 22.04 and memtier with 64-byte payloads, cachegrand hit the following metrics:
- Up to 5.1 Mops GET / 4.5 Mops SET, an estimated 40x faster than Redis with 64x more load
- Up to 60 Mops GET / 26 Mops SET with batching
System Design
The build is modular, with a layer of high-level functionality built on top of the low-level functionality; the upper layer is made up of the modules. Albano is working on updates that will add support for Memcached, while Kafka, GraphQL, and other technologies are also under consideration.
Cachegrand in Action
The following charts show requests per second with no batching:
The first chart is a direct comparison with Redis. Cachegrand handled more requests than Redis side by side, while also handling 6,400 more clients (all requests ran at the same time). Cachegrand’s throughput nearly doubles as threads are added, up to about 32 threads, at which point more resources need to be allocated to the operating system and the network stack.
Here are some latencies:
On the left, cachegrand is running with 64 threads, meaning the machine is saturated with 6,400 clients, yet it still shows very good P99 and P99.7 latencies.

Requests per Second with Batching.
All requests in the image above were sent by memtier. The chart on the left was generated with batches of 256 GET commands and the one on the right with batches of 256 SET commands.
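Batching here means pipelining: queuing many commands on the client side and flushing them in a single network round trip. As a hedged illustration (assuming cachegrand’s Redis protocol compatibility, which the memtier benchmarks rely on; host, port, and key names are placeholders), here is a minimal sketch of that pattern using the hiredis client:

```c
#include <stdio.h>
#include <hiredis/hiredis.h>  /* a standard Redis-protocol client */

int main(void) {
    /* Host and port are placeholders for a local cachegrand instance. */
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) {
        fprintf(stderr, "connection failed\n");
        return 1;
    }

    /* Queue 256 SET commands locally instead of sending them one by one. */
    for (int i = 0; i < 256; i++)
        redisAppendCommand(c, "SET key:%d value:%d", i, i);

    /* The flush happens here: all 256 commands go out, and the 256
     * replies come back, over a single network round trip. */
    for (int i = 0; i < 256; i++) {
        redisReply *reply;
        if (redisGetReply(c, (void **)&reply) != REDIS_OK)
            break;
        freeReplyObject(reply);
    }

    redisFree(c);
    return 0;
}
```

Amortizing one round trip across 256 commands is what lifts the numbers from roughly 5 Mops to the 26-60 Mops range in the batching charts.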
Cachegrand was optimized for speed. Its infrastructure includes a custom memory allocator, fibers rather than thread pools, and data structures that enhance performance.
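To illustrate the fiber model (a generic sketch using POSIX ucontext, not cachegrand’s own fiber implementation), a fiber runs until it explicitly yields control, with no kernel thread pool and no preemption:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, fiber_ctx;

static void fiber_fn(void) {
    printf("fiber: start\n");
    swapcontext(&fiber_ctx, &main_ctx);  /* yield back to main */
    printf("fiber: resumed\n");
}

int main(void) {
    static char stack[64 * 1024];        /* the fiber's private stack */

    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = sizeof(stack);
    fiber_ctx.uc_link = &main_ctx;       /* where to go when fiber_fn returns */
    makecontext(&fiber_ctx, fiber_fn, 0);

    swapcontext(&main_ctx, &fiber_ctx);  /* run the fiber to its first yield */
    printf("main: fiber yielded\n");
    swapcontext(&main_ctx, &fiber_ctx);  /* resume the fiber to completion */
    printf("main: fiber finished\n");
    return 0;
}
```

Because switches happen only at explicit yield points, fibers avoid the scheduling and synchronization overhead that a thread pool pays the kernel for.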
The Data Structures
Cachegrand was built with data-oriented development, a methodology that optimizes and organizes data so the CPU can better leverage its cache lines.
This is evident in the ability to do a linear search over a hashtable. The hashtable in cachegrand uses two separate arrays: one holding only the hashes, laid out sequentially, and another holding the keys and values. This is a departure from a standard hashtable, which uses a single array of keys and values. With the hashes packed together, the CPU has to read far less data to determine whether an entry is the one it’s looking for, which speeds up the search.

Linear search optimization.
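As a rough sketch of the idea (the names and layout here are hypothetical, not cachegrand’s actual code), splitting the hashes from the entries means a lookup scans a small, densely packed hash array and only touches the key/value array on a candidate match:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 14  /* cachegrand's chunks hold 14 elements */

/* Keys and values live apart from the hashes. */
typedef struct {
    const char *key;
    void *value;
} kv_entry_t;

/* Hot data (hashes) is packed sequentially so a linear scan touches
 * only a cache line or two; cold data (entries) is read only on a match. */
typedef struct {
    uint32_t hashes[CHUNK_SIZE];
    kv_entry_t entries[CHUNK_SIZE];
} chunk_t;

void *chunk_lookup(const chunk_t *chunk, uint32_t hash, const char *key) {
    for (size_t i = 0; i < CHUNK_SIZE; i++) {
        /* Compare the cheap hash first; fall back to the full key
         * comparison only when the hash matches. */
        if (chunk->hashes[i] == hash &&
            chunk->entries[i].key != NULL &&
            strcmp(chunk->entries[i].key, key) == 0) {
            return chunk->entries[i].value;
        }
    }
    return NULL;
}
```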
The next optimization, Single Instruction Multiple Data (SIMD), lets a program apply a single operation to multiple pieces of data at once (e.g., AVX2 operates on registers up to 256 bits wide). SIMD covers a wide range of use cases, from complex math calculations to something simpler such as a linear search.

SIMD optimization.
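Here is how that linear search might look vectorized with AVX2 (a hedged sketch assuming GCC/Clang builtins, a -mavx2 build, and a hash array padded to a multiple of eight entries); one instruction compares eight 32-bit hashes at a time:

```c
#include <immintrin.h>  /* AVX2 intrinsics; compile with -mavx2 */
#include <stdint.h>

/* Compare eight 32-bit hashes per instruction. Returns the index of
 * the first matching hash, or -1. Assumes `hashes` is padded to a
 * multiple of eight entries. */
int simd_find_hash(const uint32_t *hashes, int count, uint32_t needle) {
    const __m256i target = _mm256_set1_epi32((int)needle);
    for (int i = 0; i < count; i += 8) {
        __m256i block = _mm256_loadu_si256((const __m256i *)(hashes + i));
        __m256i eq = _mm256_cmpeq_epi32(block, target);
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
        if (mask != 0)
            return i + __builtin_ctz(mask);  /* lane of the first match */
    }
    return -1;
}
```

This only works because the hashes sit in their own sequential array, as described above; with interleaved keys and values, the loads could not be vectorized this way.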
The hashtable in cachegrand is essentially one giant array that is broken down into buckets, or chunks, of 14 elements, with each chunk having its own spinlock. A hashtable with 10,000 keys will have a little fewer than 1,000 spinlocks.
As long as the same key isn’t being written to by two operations at once, contention is never hit. This greatly improves performance, as it spreads the locking load across the entire table. When there is contention, the spinlocks help keep latency down.

Localized spinlocks optimization.
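A minimal sketch of the per-chunk spinlock idea using C11 atomics (again hypothetical, not cachegrand’s implementation):

```c
#include <stdatomic.h>
#include <stdint.h>

#define CHUNK_SIZE 14

/* One spinlock per 14-element chunk: writers touching different
 * chunks never contend; only writers hitting the same chunk spin. */
typedef struct {
    atomic_flag lock;  /* initialize with ATOMIC_FLAG_INIT */
    uint32_t hashes[CHUNK_SIZE];
    /* ... keys and values ... */
} locked_chunk_t;

static inline void chunk_lock(locked_chunk_t *c) {
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
        ;  /* spin: contention is expected to be rare and brief */
}

static inline void chunk_unlock(locked_chunk_t *c) {
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
}
```

Spinning is cheap here precisely because each lock guards only 14 elements, so a writer rarely holds it for more than a few instructions.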
Albano said more updates are on the horizon as the project continues to grow, though he didn’t commit to specifics or release dates. Cachegrand will be a very interesting project to follow, and it was fascinating to learn about.