Real-Time Recommendations with Graph and Event Streaming
Real-time data is becoming increasingly important to enterprise success. Successfully collecting incoming data while reacting both quickly and strategically is paramount.
However, the data collection process is often far from trivial. Write contention is a common bottleneck with many large-scale architectures. As data storage infrastructure evolves, developers are further abstracted away from the critical areas of the write path. This can make troubleshooting any issues that crop up with write durability and performance difficult.
The question becomes: How can we ensure that all data is being stored while not overloading the storage layer? Here, we’ll explore using DataStax Astra Streaming to help with some of the pitfalls of ensuring real-time data delivery.
The Tech Stack
Let’s say that we’re supporting a video service. Once a user finishes watching their favorite show or movie, they’re prompted to give it a star rating between one star and five stars. From that, we take a data-driven approach to infer a few recommendations that we “think” they might like. This one small rating action helps improve the future accuracy of this inference for everyone.
How can we architect a system to accomplish this? We will need some simple components, most notably a database and a service layer to query it.
For our data storage layer, we will use DataStax Enterprise Graph, a Gremlin/TinkerPop property graph database built on top of Apache Cassandra. Graph databases are perfect for use cases where the relationships between the data are just as important as the data itself.
In our case, we’re focusing on the relationship between our users and the movies they like. This way, we can help to point them toward additional movies they might enjoy.
Using DSE Graph, we can track data about users and movies, storing them as “vertices” in the database (figure 1). Whenever a user rates a movie, we can add a “rated” edge (with the rating value as a property) from the user to the movie.
When we want to get a recommendation, we can use a particular movie as the entry point and “walk” the graph out to other users, then further out to movies that they’ve rated similarly.
The Service Layer
To interact with the database, we’ll build a simple, restful service using Java Spring Boot. Inside the controller, we’ll build out two services:
- addUserRating – Takes a new rating for a movie from a user and adds an edge to the graph.
- findRealtimeRecommendationsByMovieId – Takes a movie (id) and returns a list of similarly rated movies.
The services are largely self-explanatory. One writes user ratings to the database (stored as an edge in graph). The other returns recommendations based on the movie provided, using item-based collaborative filtering to match like-rated movies. (For more on item-based collaborative filtering, check out chapter 10 of “The Practitioner’s Guide to Graph Data.”)
Astra Streaming Topic
Astra Streaming is a distributed streaming-as-a-service built atop Apache Pulsar. We’ll use a streaming topic to handle the incoming write traffic. Calls to the addUserRating service will send a user’s new rating for a movie to the topic (as shown in figure 2). We will then have a process “subscribing” to the topic, which will consume the data and write it as an edge into the graph.
Using Astra Streaming in this approach gives us a few advantages:
- Topics can provide message delivery guarantees – In case of a failure event, this helps to ensure that our “in flight” data is persisted.
- Protection from write back pressure – If the application has periods of unpredictable or “bursty” write activity, an Astra Streaming topic can help to throttle down the write throughput, protecting the underlying storage infrastructure from overutilization.
Streaming Topic Consumer
Once new ratings are sent to the Astra Streaming topic, a consumer process will take over. This process will “subscribe” to the topic and await any messages posted to it. Upon the arrival of a rating message on the topic, the consumer will acknowledge it and write it as an edge into the graph database using the following (Fluent) Gremlin code (figure 3).
As the consumer process is running continuously, it will continue to monitor the topic and apply new “rated” edges to the graph as they come in. The great thing about this is that additional ratings messages will queue up and be applied at a consistent level of throughput.
Traversal and Results
The read path will be built using a simple graph traversal. Using the original movie as an entry point, we’ll move along similar ratings edges out to the users [who submitted them], and then continue to the adjoining movie nodes. Running this traversal with the movie “Back to the Future” as the entry point for our sample data set produces the following results:
|Total # of High Ratings
|Indiana Jones and the Raiders of the Lost Ark (1981)
|Shawshank Redemption (1994)
|Star Wars: Episode IV – A New Hope (1977)
|Forrest Gump (1994)
|Matrix The (1999)
Table 1 – Traversal results for movies that are similarly rated to “Back to the Future.”
We discussed steps to improve our write path into a real-time recommendation system. We’ve implemented our main storage model in a graph database, which offers methods and algorithms that can take data discovery to a whole new level. Likewise, we’ve improved our data persistence guarantees while simultaneously protecting the storage layer from becoming overwhelmed in the event of a spike in user traffic.
While the use case of a movie recommendation system was the example here, the concepts discussed can be applied to many types of real-time systems. It’s common to find event processing and graph databases in use cases for many areas, such as supply chain, cybersecurity, and product-data management. Employing the methods discussed above can help ensure real-time data persistence.
The code for the Java Spring Boot service layer can be found in this repository, and the code for the Java consumer can be found