Apache Cassandra: The Data Foundation for Real-Time AI
The groundswell of attention and media coverage around machine learning and artificial intelligence is creating urgency around how businesses can best apply AI to make an impact. One of the most powerful shifts we see is the move to applications powered by AI using real-time data to take advantage of events in the moment, from consumers actively engaged with a business to supply chain operations needing to constantly adapt to changing variables.
Extracting intelligence from data in real time, feeding applications, informing decisions and driving actions as your customers actively engage with your services brings new experiences and heightened context to every customer interaction. But it also brings with it new challenges and places massive demands on the underlying infrastructure to support this intelligent, real-time model.
When it comes to real-time AI, we can take inspiration from the consistent patterns, blueprints and best practices of pioneering organizations that invested time and resources to build their own real-time, AI-powered solutions. Without exception, those leading organizations — including the likes of Netflix, Apple, Uber and FedEx — that take advantage of real-time AI today have chosen to build their solutions on Apache Cassandra.
There’s a range of reasons to build real-time AI on top of Cassandra, from world-class latency and speed to scalability, availability and improved accuracy of predictions and actions. Here, we’ll look at how Cassandra provides a foundation for two of the most important data management categories — features and events — for real-time AI, enabling the delivery of highly accurate insights based on the right data at the right time, to make the biggest impact on your business.
Fresh Features
In January, ChatGPT reached 100 million users faster than any other service ever (DataStax Astra DB included, much to my chagrin). Since then, there’s been an explosion of AI literacy, and not just among technologists. One of the more prominent bits of AI jargon to go mainstream is the feature.
As tempting as it is, I’m not going to ask ChatGPT to write a paragraph explaining features. Instead, let’s just use the dictionary:
feature, noun: a distinctive attribute or aspect of something
In the context of machine learning, we have data that represent specific values for those attributes or aspects. We call those features, and they’re used to train machine-learning models to recognize the patterns discovered from the data. Features are also used at inference time to provide the current context on which the model should base its inferences.
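As a concrete (and entirely hypothetical) illustration, the features for a rideshare user might look like this; the entity, feature names and values below are invented for the example:

```python
# A hypothetical feature vector for one entity: each key is a "feature"
# (a distinctive attribute of the entity), each value is its current state.
user_features = {
    "user_id": "u-1234",
    "trips_last_7d": 12,            # behavioral aggregate
    "avg_trip_rating": 4.8,         # quality signal
    "minutes_since_last_open": 3,   # recency signal
}

# At training time, many historical rows like this teach the model patterns;
# at inference time, the *current* row provides the context for a prediction.
model_input = [
    user_features["trips_last_7d"],
    user_features["avg_trip_rating"],
    user_features["minutes_since_last_open"],
]
```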
When our goal is real-time AI, where we want to make inferences based on the most up-to-date and relevant information possible, we need to be concerned with how we keep every entity’s features “fresh” as events (more on those below) continuously flow through the system.
To achieve that, a stream processor like Apache Flink or Spark Streaming generally processes events continuously to keep features fresh. We’re also excited to soon release into open source the innovative stream processing technology from Kaskada, a recent DataStax acquisition. The Kaskada technology provides capabilities similar to Flink and Spark, along with a rich set of tools for feature engineering.
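Stripped of any particular framework, the core idea is a fold over the event stream: each arriving event updates its entity’s features in place so the latest values are always queryable. Here’s a minimal in-process sketch (not Flink, Spark or Kaskada; the event fields are invented):

```python
from collections import defaultdict

# Minimal sketch of keeping features fresh: fold a stream of events into
# per-entity feature rows as they arrive.
features = defaultdict(lambda: {"event_count": 0, "last_seen": None, "total_amount": 0.0})

def process_event(event):
    """Update the entity's features in place for each incoming event."""
    f = features[event["entity_id"]]
    f["event_count"] += 1
    f["last_seen"] = event["ts"]
    f["total_amount"] += event["amount"]

stream = [
    {"entity_id": "u-1", "ts": 1000, "amount": 9.99},
    {"entity_id": "u-2", "ts": 1001, "amount": 4.50},
    {"entity_id": "u-1", "ts": 1005, "amount": 12.00},
]
for e in stream:
    process_event(e)

# features["u-1"] now reflects its freshest state: 2 events, last seen at 1005.
```

A real stream processor adds the hard parts this sketch omits: windowing, fault tolerance, exactly-once semantics and writing the results out to the feature store.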
The events themselves may or may not have a reason to land in the database, but the features derived from them do: they must reside in a database that can support very high rates of inserts and updates while also serving low-latency queries, even as writes peak.
To make an inference for an entity, we need to query the features for that entity and complete the inference, and for many real-time use cases the whole round trip must finish in under 200 milliseconds. Cassandra can retrieve several features at a time in parallel; recent testing we performed on Astra DB for our integration with Feast, an open source feature store, achieved a p99 of 23ms for a three-table query, leaving the majority of the 200ms for other processing.
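The fan-out pattern behind that number can be sketched without a live cluster. The per-table lookups below are stubs (with a real cluster you’d issue asynchronous queries through the driver instead), and the table names are invented; the point is querying several feature tables concurrently under an overall latency budget:

```python
import concurrent.futures
import time

# Hypothetical feature tables holding different slices of an entity's features.
FEATURE_TABLES = ["user_profile", "user_activity", "user_payments"]

def fetch_features(table, entity_id):
    """Stub for a single-table feature query (~5 ms stand-in latency)."""
    time.sleep(0.005)
    return {f"{table}_score": 1.0}

def get_feature_vector(entity_id, budget_s=0.2):
    """Query all feature tables in parallel, bounded by the latency budget."""
    start = time.monotonic()
    vector = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(fetch_features, t, entity_id) for t in FEATURE_TABLES]
        for fut in concurrent.futures.as_completed(futures, timeout=budget_s):
            vector.update(fut.result())
    return vector, time.monotonic() - start

vector, elapsed = get_feature_vector("u-1")
# Three tables fetched concurrently, leaving most of the 200 ms budget unused.
```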
The Role of Events
Understanding how to best architect for real-time AI first requires clarity on the role of events. If your technology career didn’t intersect with “complex event processing,” you might not be aware that an “event” is a data record with a specific structure.
Events usually capture a unique identifier for an “entity” — an email address or randomly assigned alphanumeric identifier in the vast majority of cases — as well as a timestamp and a set of key data values associated with that time. So, an event captures a specific state about an entity at a specific time.
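That structure is compact enough to write down directly. One possible shape (the field names here are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """A specific state about an entity at a specific time."""
    entity_id: str   # e.g. an email address or an assigned identifier
    ts: int          # epoch milliseconds; travels with the record itself
    values: dict = field(default_factory=dict)  # key data values at that time

e = Event(
    entity_id="user@example.com",
    ts=1_700_000_000_000,
    values={"page": "checkout", "cart_total": 42.50},
)
```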
This is important because there’s a practical reality to processing data on a continuous basis: Time keeps on ticking. So any real-time calculation we make on the data will always be in the context of a particular time window.
To place data into the appropriate time window, a timestamp has to be present in the record. And because these records move between so many systems, the timestamp has to travel as part of the data itself; you can’t rely on timestamps assigned later by, say, the database.
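Window assignment from an embedded timestamp is just integer arithmetic. A sketch with tumbling one-minute windows (the window size is an arbitrary choice for the example):

```python
WINDOW_MS = 60_000  # one-minute tumbling windows (illustrative)

def window_start(ts_ms):
    """Map an event's embedded timestamp to the start of its time window."""
    return ts_ms - (ts_ms % WINDOW_MS)

# Events 10 seconds apart share a window; 70 seconds apart, they don't.
same = window_start(100_000) == window_start(110_000)
different = window_start(100_000) != window_start(170_000)
```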
Now, it isn’t always the case that we need to store events in a database. Sometimes it’s fine to simply run events through a stream processor, which is often used to calculate summary data in real time and then store records in object storage. Cassandra works particularly well with Apache Kafka and Apache Pulsar, both as a sink and as a source for records.
Sinks handle the storage of stream records in a database, and, via change data capture (CDC), Cassandra can also act as a stream source, populating the stream from database records. The flexibility of combining a database with a streaming platform is a powerful tool for handling data in a real-time context, where any excess architectural complexity surfaces later as latency.
While most event data can be safely processed in a stream and then archived in files, there are cases where having events online in a database is required by the application. Events are often not just the audit trail of what’s happened in an app, but they’re also used to represent the current state of the user’s interactions.
Cassandra makes for a particularly efficient solution here because its partitions map to the event data model very cleanly: the partition key stores the entity key of an event, and the timestamp (or a “timeuuid”) is stored as a clustering column.
The result is that Cassandra stores events for each entity in sorted order in the partition, and static columns provide a space-efficient mechanism to store any data common across the partition. Later on, temporally contiguous records for each entity can be retrieved efficiently.
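A plausible schema for this layout (illustrative, not taken from any particular deployment) looks like the CQL below, followed by a tiny in-memory model of the same idea: events kept sorted within each entity’s partition so contiguous time ranges can be sliced out efficiently.

```python
import bisect
from collections import defaultdict

# Illustrative CQL for the pattern described above: entity key as the
# partition key, timeuuid as the clustering column, a static column for
# data shared across the whole partition.
EVENTS_DDL = """
CREATE TABLE events_by_entity (
    entity_id  text,      -- partition key: one entity's events live together
    event_time timeuuid,  -- clustering column: sorted within the partition
    payload    text,
    plan       text STATIC,  -- stored once per partition, not per row
    PRIMARY KEY ((entity_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
"""

# In-memory model of the same layout: per-entity lists kept in timestamp order.
partitions = defaultdict(list)

def insert_event(entity_id, ts, payload):
    bisect.insort(partitions[entity_id], (ts, payload))

insert_event("u-1", 300, "c")
insert_event("u-1", 100, "a")
insert_event("u-1", 200, "b")
# partitions["u-1"] is ordered by timestamp regardless of insertion order.
```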
As a point of hygiene, if you’re going to build machine learning models from your event data later on, make sure each record’s timestamp reflects when the captured values were actually true.
Make updates or corrections to those records very cautiously. If the data stored for an event actually reflects values that were only true after the timestamp on the record, the model may perform poorly in production because data “from the future” leaked into the model.
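Guarding against that leakage amounts to a point-in-time filter: when building a training example for a given prediction time, use only events whose timestamps are at or before it. A minimal sketch (the event fields are invented):

```python
events = [
    {"ts": 100, "clicks": 1},
    {"ts": 200, "clicks": 3},
    {"ts": 400, "clicks": 7},  # after the prediction time: must be excluded
]

def clicks_as_of(events, prediction_ts):
    """Aggregate a feature using only data known at prediction time,
    so nothing 'from the future' leaks into the training example."""
    return sum(e["clicks"] for e in events if e["ts"] <= prediction_ts)

feature = clicks_as_of(events, prediction_ts=250)  # sees only the first two events
```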
A Data Architecture for Speed and Scale
Machine learning works best with a lot of high-signal data. Building real-time infrastructure on Cassandra provides the freedom to capture signals in user activities at very high rates and query fresh features with high throughput and low latency.
This enables you to learn which signals produce the best models and disable the others to optimize storage costs. Cassandra’s linear scaling properties ensure that machine-learning engineers can easily support the ingestion of as many events as needed and serve continuously fresh features rapidly enough to support real-time interactions.
There’s a reason that Netflix and Uber turned to Cassandra as the foundation of the data architecture that powers their AI systems.
Building real-time, AI-powered apps on the right platform gives you unprecedented power to shape your business, the way you operate and the relationship you have with your markets, from delivering highly personalized viewing recommendations in real time, to instantly adjusting routes to get a driver to a destination most efficiently, to anticipating and eliminating manufacturing and supply chain interruptions.
All of these opportunities go far beyond simply making predictions. Unlike older approaches, which rely on batch processing and costly, time-consuming transformations to bring data to ML, these real-time AI systems drive near-instant actions. The only way to achieve this is with a foundational data architecture built for speed and scale. Cassandra is the perfect choice for delivering this.