Optimizing Mastodon Performance with Sidekiq and Redis Enterprise

What do you do when you need to speed up Mastodon? These benchmark tests explore the practicalities of using Redis Enterprise Cloud to power the queues for Sidekiq.
May 18th, 2023 10:30am

In the last six months, the open source Mastodon platform has attracted millions of new users and made organizations contemplate creating their own servers (called instances, in Mastodon parlance). It’s not hard to set up a Mastodon instance to support a handful of users. However, it is hard to set up a Mastodon server that can handle a lot of traffic because the default configuration leaves much to be desired.

In my previous article, “How to Boost Mastodon Server Performance with Redis,” I noted that one chokepoint in a Mastodon server is its Sidekiq queues, which in turn depend on Redis.

The Mastodon tech stack is built using Redis open source (Redis OSS), and it works great for the purpose. The usual way to configure Mastodon is to run it and Redis OSS on the same machine, and to scale that setup with Redis Sentinel if needed.

“Free” is wonderful, as is the support of the open source community. Redis OSS makes sense for ordinary workloads (whatever “ordinary” means to you) on a basic Mastodon instance. But you might want to consider additional options.

Using Redis OSS with Mastodon is free if you only look at licensing costs. It may not be optimal in the larger context of application performance, or even in the context of total cost of ownership. Where should you put your resources — technical, financial and human?

In this article, we explore the practicalities of using Redis Enterprise Cloud to power the queues for Sidekiq. Redis Enterprise Cloud is a fully managed database-as-a-service and offers enterprise capabilities such as Active-Active clustering topology that scales linearly. Since both of us are benchmark writers, and Filipe is the performance guru at Redis, you’re about to see lots of numbers and graphs. Fear not: We explain everything as we go.

Since we learned from other Mastodon administrators’ experience that the job queues often are the bottleneck, exhibiting 100% CPU load in the Redis process during high-traffic periods, we theorized that we could improve the results by removing Redis from the Mastodon server. We realized that we needed to connect Mastodon and Sidekiq to an external Redis instance, most conveniently to a Redis Enterprise Cloud instance.

We discovered along the way that Mastodon wasn’t designed for that plan, although we found a pull request in its GitHub repository to fix the problem. We also discovered that conventional HTTP load testing wouldn’t help much with Mastodon, but we could adapt a Sidekiq benchmark to compare the performance of Redis OSS and Redis Enterprise Cloud.

Connecting Mastodon and Sidekiq to Redis Enterprise Cloud

Our first step was to demonstrate that we could connect Mastodon and its Sidekiq job queue to an external Redis Enterprise Cloud instance. And, of course, we needed to have such an instance to test.

Using Terraform, Filipe created a four-shard Redis Enterprise Cloud cluster in AWS: a 10 GB deployment rated for 100,000 operations per second (ops/sec).

The database cluster uses Redis on Flash in multiple availability zones, with an Active-Active configuration. The cluster has two c5.2xlarge instances, one m5.large instance and a 127 GB EBS volume.

At the time we performed these tests, Mastodon didn’t support connecting the Sidekiq queues to anything but a local Redis database. There was a pull request to enable this in the Mastodon GitHub repository, which we applied to our Mastodon instance.

The patch adds Ruby code after line 500 of mastodon.rake:

We needed to verify that the Sidekiq queues were running against our Redis Enterprise Cloud instance. To do so, we monitored the database.

The following console log was captured on the Redis Enterprise Cloud cluster. The “queue” entries show that the Mastodon/Sidekiq job queues reach the correct database instance rather than running locally on the Mastodon server.

The four-shard cluster seemed like overkill, so we tried again with a smaller, single-shard Redis Enterprise Cloud database (rated for 25,000 ops/sec and 5 GB deployment), which showed 15 to 30 ops/sec load from the Sidekiq queues coming from Mastodon.

The chart at the upper left shows the database load in operations/second; the chart at the upper right shows the database latency. Low latency is good. The bottom four charts break out the load into reads, writes and other operations.

Modifying the Sidekiq Load Testing Tool

Sidekiq has two benchmarking tools in its repository. We chose the simpler one, which resides at bin/sidekiqload.

The Sidekiq load test tool creates 100,000 no-op jobs and drains them as fast as possible. As the code is written, it also uses toxiproxy to simulate network latency against a local instance of Redis. Since we were testing against a remote Redis Enterprise Cloud cluster, we didn’t need toxiproxy; we commented out that code.
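The shape of that benchmark (fill a queue with no-op jobs, drain it as fast as possible, and divide the elapsed time by the job count) can be illustrated without Sidekiq or Redis at all. The following is only a toy sketch in plain Ruby; the method name `drain_noop_jobs` and the `worker_count` parameter are hypothetical, and it measures an in-process queue, not bin/sidekiqload itself.

```ruby
require "benchmark"

# Toy stand-in for the load test: enqueue job_count no-op jobs in an
# in-memory queue, then time how long worker threads take to drain it.
# This does NOT talk to Redis or Sidekiq; it only mirrors the method.
def drain_noop_jobs(job_count, worker_count: 4)
  queue = Queue.new
  job_count.times { |i| queue << i }
  queue.close # popping from a closed, empty queue returns nil

  processed = 0
  lock = Mutex.new

  elapsed = Benchmark.realtime do
    workers = Array.new(worker_count) do
      Thread.new do
        count = 0
        # Each "job" is a no-op: popping it from the queue is the whole job.
        count += 1 until queue.pop.nil?
        lock.synchronize { processed += count }
      end
    end
    workers.each(&:join)
  end

  {
    processed: processed,
    seconds: elapsed,
    usec_per_job: elapsed / job_count * 1_000_000
  }
end

result = drain_noop_jobs(100_000)
puts format("%d jobs in %.3f s (%.2f us/job)",
            result[:processed], result[:seconds], result[:usec_per_job])
```

The numbers this prints say nothing about Redis; the point is only that "time per job" in the results below is elapsed wall-clock time divided by the number of jobs drained.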

Then we added the following Ruby code to read the Redis password, port and host from the environment, and we used it to configure the Redis connection for the benchmark.
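The original listing is not reproduced here. As a rough sketch of that kind of change, the snippet below builds a Redis connection hash from the environment and hands it to Sidekiq. The environment variable names (`REDIS_HOST`, `REDIS_PORT`, `REDIS_PASSWORD`) and the helper name `redis_options_from_env` are assumptions made for illustration, not the names used in the actual benchmark.

```ruby
# Hypothetical helper: read Redis connection settings from the environment.
# REDIS_HOST, REDIS_PORT and REDIS_PASSWORD are assumed names, chosen for
# this sketch only.
def redis_options_from_env(env = ENV)
  options = {
    host: env.fetch("REDIS_HOST", "127.0.0.1"),
    port: Integer(env.fetch("REDIS_PORT", "6379"))
  }
  password = env["REDIS_PASSWORD"]
  # Only send a password when one is configured.
  options[:password] = password unless password.nil? || password.empty?
  options
end

# Apply the settings only when Sidekiq is loaded, so the helper can be
# exercised on its own.
if defined?(Sidekiq)
  Sidekiq.configure_client { |config| config.redis = redis_options_from_env }
  Sidekiq.configure_server { |config| config.redis = redis_options_from_env }
end
```

With something like this in place, the benchmark could be pointed at a remote database by exporting the three variables before running the load test, with no connection details hard-coded in the script.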

Performing Sidekiq Benchmarks against a Single Shard on Redis Enterprise Cloud

Running that (modified) Sidekiq load test showed about 13,000 ops/sec and a latency of 0.06 milliseconds.

The chart at the upper left shows the database load from the Sidekiq load test in operations per second; the chart at the upper right shows the database latency. Low latency is good. The bottom four charts break out the load into reads, writes and other operations. The benchmark tool reported running 100,000 jobs in 7.8 seconds, meaning that each job took 78 microseconds to complete. That isn’t shabby at all.

The single-shard Redis cluster (5 GB, rated for 25,000 ops/sec) used two m5.xlarge instances, one m5.large instance, and a 119 GB EBS volume.

In our experiments, we increased the number of jobs from 100,000 to 5 million. As the screenshot illustrates, the throughput is about the same (about 13,000 ops/sec) and the latency is about the same (about 0.06 ms), although the Redis memory usage increased to about 1.3 GB. The increased Redis memory usage from the larger queue is not a surprise.

These small charts show the detailed Redis performance during the load test. The most significant results are highlighted: steady load that is well below the database capacity and very low latency.

The load test tool reports that processing 5 million jobs took 400 seconds, so each job took 80 microseconds to complete, very slightly longer than with the smaller queue.

Clearly, the bottleneck for the Sidekiq queue is not the Redis Cloud shard, which never reached its 25,000 ops/sec capacity.

More Sidekiq Benchmarks Using Redis OSS and Redis Enterprise Cloud

In the previous tests, we load-tested Sidekiq against a single Redis Enterprise Cloud shard. What happens when we test against Redis OSS?

We set up a single Redis OSS database in an m5.large AWS instance. That should be roughly comparable to the single-shard Redis Enterprise Cloud even though it lacks the Redis on Flash and Active-Active features. We re-ran the Sidekiq load test with 5 million jobs. This time, the test was completed in 427 seconds, meaning that the average time to complete a job was 85 microseconds.

We also set up another four-shard Redis Enterprise Cloud database to the same specifications as the 10 GB, 100,000 ops/sec cluster configuration we first showed you. This configuration completed the 5 million-job Sidekiq load test in 387 seconds, giving us an average time to complete a job of 77 microseconds. It also showed lower latency.

To summarize: Redis OSS is a little slower and has higher latency than a similarly sized single-shard Redis Enterprise Cloud instance, while a four-shard Redis Enterprise Cloud instance is a little faster than the single shard, has higher capacity and has lower latency.
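The per-job figures quoted above are simply elapsed time divided by job count. A short Ruby sketch that recomputes them from the reported run times follows; the job counts and durations come from the text, while the labels, constant and method names are ours.

```ruby
# Runs reported in this article: [label, jobs, elapsed seconds].
RUNS = [
  ["Redis Enterprise Cloud, single shard", 100_000,   7.8],
  ["Redis Enterprise Cloud, single shard", 5_000_000, 400.0],
  ["Redis OSS on m5.large",                5_000_000, 427.0],
  ["Redis Enterprise Cloud, four shards",  5_000_000, 387.0]
].freeze

# Average time per job in microseconds, rounded to one decimal place.
def usec_per_job(jobs, seconds)
  (seconds / jobs * 1_000_000).round(1)
end

# Average job throughput, rounded to the nearest whole job per second.
def jobs_per_second(jobs, seconds)
  (jobs / seconds).round
end

RUNS.each do |label, jobs, seconds|
  puts format("%-38s %9d jobs  %6.1f us/job  %6d jobs/sec",
              label, jobs, usec_per_job(jobs, seconds),
              jobs_per_second(jobs, seconds))
end
```

The computed averages (78.0, 80.0, 85.4 and 77.4 microseconds per job) match the rounded figures in the text. On the 5 million-job runs, the spread between the slowest and fastest configuration is about 10%.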

Future Explorations

What all that tells us: Removing the Redis database from the virtual machine that runs Mastodon and Sidekiq should make Mastodon handle high loads more gracefully, with fewer stalls and posting failures.

To prove that conclusively, we plan to set up a production Mastodon node with an external Redis Enterprise Cloud cluster to handle the job queues and perhaps the PostgreSQL cache as well, and we will monitor how it scales with lots of users.

If you’d like to get ready to try all this yourself, you should start by exploring Redis Enterprise Cloud. A free tier instance might not be enough to use for your own high-capacity Mastodon server, but it certainly allows you to become familiar with setting up and using the database, and it will only cost you a little time.
