Today’s Internet is full of networks that allow users to create feeds from user-generated content. Twitter, Instagram, Facebook and others all allow users to create custom feeds from various sources. The nature of these social graphs presents a unique problem of mapping these relationships at Internet scale.
For the following example, we will outline how a news feed application can be built with Redis, a key-value store and Cassandra, a fault-tolerant, replicable database. The goal is to show how a developer can build their own news feed as a feature in the developer’s app. Cloud services such as Amazon Web Services will provide infrastructure easily but the management overhead of using a database cluster is still there. Even using a traditional content management system (CMS) approach, the developer must create multiple virtual machines to run a cluster of databases.
The approach detailed in this article will reduce the infrastructure required to build such a service. What we describe can be deployed on any cloud-based infrastructure provider. In short, this article should be taken as a blueprint to make a service similar to what Twitter or Facebook offers.
The Task at Hand
Before we start, let’s jot down overall features of a feed-based application. Let’s assume it is an application where a user can post updates and interested people can subscribe to the feed. Every user would have a dedicated feed that will show the latest content from the user’s individual subscriptions, sorted in chronological order. Here is the list of features that we plan to build:
- A publishing capability
- Feed creation on user basis
Traditionally such applications would be built like content management systems. A user creates a post, which goes into the database. With a feed update, a database query will fetch the list of all subscriptions and show the latest content sorted by publication date.
This approach has two problems:
- As the list of users grow, the provider will have to fortify the infrastructure to ensure fast execution of database queries. It may also require the adoption of database clusters to ensure lower throughput times.
- If you have to sort the feed using an algorithm not supported by the database, you will end up writing another layer of logic on top of the database to perform your custom search. This would further increase the throughput time.
The Message Box
Let’s try to step away from a traditional approach of fetching results from the database for every content feed request and try to look at alternatives. On closer examination, one would notice that such applications are very close to message broadcasting. When a user publishes content, it should be broadcasted to all subscribed users. A simple solution will be to maintain a message box for all users. When a subscribed publisher creates a post, it goes and sits in the user message box. For a user content feed generation, we simply fetch the posts from the message box and show them to the user. This reduces the database queries required to generate the content feed. It also reduces the load on the database server and increases throughput time, as we now need only one or two database queries to fetch all the posts for the user content feed. This approach requires less hardware as the message box can be implemented easily using a key-value store.
Redis is an advanced key-value store that provides in-memory datasets. Redis supports data structures including lists, which is of interest for our application. We will alter our traditional model and add Redis to the stack, so that each user has a dedicated individual message box. When a new post is created, we check for the subscribers and push the post_id into each subscriber’s message box. A user’s message box is a simple Redis list, with a unique user_id being the identifying key. For every new subscribed post we simply insert post_id into the list representing the user’s message box.
Our updated stack now looks like this:
We can also maintain multiple message boxes per user. For example, Facebook allows you to view either “most recent” or “top stories.” This can be achieved by creating two different message boxes per user and updating the salient message box based on a custom algorithm.
For generating a user’s feed, we fetch their message box and query the database for full posts, comments and all other attributes.
The new content generation cycle will look like this:
We will prevent the user message box from growing beyond a certain size by adding a limit to the message box. For instance, Twitter only shows tweets for the last seven days.
Scale it Further with Cassandra
While Redis is a good option to start with, it has its own limitation. Redis’ advantage is its in-memory datasets, which is its disadvantage, too. With in-memory datasets, Redis works best when all data fits in a single machine’s memory. With cheap RAM available, one can grow to a few terabytes but you never know when you will run out of memory. This is where Cassandra can step in. Cassandra can handle data across machines and data that will not fit into memory. Cassandra supports clusters and can handle machine failures, rebuilding machines easily.
But Redis is still highly applicable as it does suffice for most users. Alternatives like Cassandra can simply bolster the service when Redis runs out of steam.
More Key Value
While we looked at two key value stores –Redis and Cassandra– in this article, others are not ruled out. These are the two widely used key value stores for message boxes, so we focused on these two only. But you can experiment, mix and match key value stores to come up with a stack that suits your needs. And if you build an application using the blueprint described here, please drop us a line :)