A Look at DataStax’s AI and Push Cache for Data Access at Scale
DataStax has been aggressively piecing together tools and platforms over the past year or so to offer organizations a comprehensive AI data solution that encompasses distributed database management at scale. Known for its Astra DB cloud database, which is built on Apache Cassandra, the company is now positioning itself as a “real-time AI company.”
DataStax’s objective is to empower organizations with faster and deeper insights, enabling better decision-making. It aims to achieve this by leveraging the speed and accuracy of its AI and machine learning capabilities, combined with tools for streaming and processing updates in distributed databases that are often spread across diverse and heterogeneous environments. In recent weeks, the company has introduced new platforms and services to facilitate this vision.
By integrating vector search with Cassandra (for details, see CEP-30) to utilize AI models, organizations are in a better position to maximize the value extracted from their data by streamlining data management, while reducing costs, DataStax says.
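The core idea behind vector search is ranking stored embeddings by their similarity to a query embedding. A minimal illustration in plain Python follows — the document names and vectors here are invented for demonstration, and real Cassandra vector search (per CEP-30) uses approximate nearest-neighbor indexes over a `vector` column type rather than a brute-force scan:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embedding store; in Astra DB this would be a table with a vector column.
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "baggage rules": [0.2, 0.8, 0.1],
    "loyalty points": [0.1, 0.2, 0.9],
}

def nearest(query_vector, k=1):
    # Rank all documents by similarity to the query and keep the top k.
    ranked = sorted(documents.items(),
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

print(nearest([0.85, 0.15, 0.05]))  # → ['refund policy']
```

This is what lets a query match concepts rather than exact keywords: closeness in embedding space, not string equality, decides the result.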
A key capability of DataStax’s platform is its “push cache.” Without one, it is hard to get a consolidated view of data from various data stores in a way that guarantees the data is fresh and up to date; that is the core problem a push cache solves. DataStax’s intention is to remove many of the pain points associated with managing data from various databases and systems, such as relational databases, data warehouses and Cassandra.
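The mechanics can be sketched in a few lines. A conventional pull-through cache re-reads the source when an entry expires; a push cache instead subscribes to change events from the source systems, so reads always see the latest committed value. This is a minimal sketch of that pattern, not DataStax’s actual implementation — the class and method names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PushCache:
    """Cache kept fresh by change events pushed from source systems,
    rather than by expiring entries and re-reading on demand."""
    _store: dict = field(default_factory=dict)

    def on_change_event(self, key: str, value: Any) -> None:
        # Invoked by a change-data-capture / streaming pipeline whenever
        # the source database commits an update.
        self._store[key] = value

    def get(self, key: str) -> Any:
        # Reads never hit the source system; the event stream has
        # already delivered the latest value.
        return self._store.get(key)

cache = PushCache()
cache.on_change_event("reservation:42", {"flight": "BA117", "status": "confirmed"})
cache.on_change_event("reservation:42", {"flight": "BA117", "status": "delayed"})
print(cache.get("reservation:42")["status"])  # → delayed
```

The design trade-off is that freshness work moves to write time (every source change produces an event) in exchange for reads that are both fast and current.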
While applicable to any organization that must manage and access distributed databases at scale in near real time, online travel agents serve as a good example of such requirements. An online agent might need to provide information about a customer’s last reservation. To do this, it would need to gather operational details about the flight or hotel and feed that information into the language model. Retrieving this data can be time-consuming, however, leading to slow responses and a poor user experience if there is any lag in data access.
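The “feed that information into the language model” step amounts to inlining the freshest operational record into the model’s prompt as context. A hypothetical sketch of that assembly step, with invented field names and no actual model call:

```python
def build_prompt(question: str, reservation: dict) -> str:
    # Retrieval-augmented prompt: the current operational record is
    # inlined as context so the model answers from live data rather
    # than stale training knowledge.
    context = ", ".join(f"{k}={v}" for k, v in sorted(reservation.items()))
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

# Illustrative reservation record, as it might be served by a push cache.
reservation = {"hotel": "Grand Plaza", "check_in": "2023-07-01", "nights": 3}
prompt = build_prompt("When does my stay begin?", reservation)
```

If the record behind `reservation` is stale, the model confidently answers from stale data — which is why low-lag, always-fresh retrieval matters in this setting.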
DataStax’s platform relies largely on Cassandra’s scalability to handle high volumes of change events and retrieve data rapidly. “Our cloud-based solution eliminates operational overhead, making it easy to implement. Additionally, for more complex events like Clickstream data or triggering updates based on specific user activities, stream processing becomes crucial,” Chris Latimer, general manager of streaming for DataStax, said. “By combining eventing, fast retrieval, and stream processing, we offer a sophisticated solution that surpasses traditional approaches.”
The process can help developers avoid wasting “tons of time” creating their own caching solutions to enable application users to constantly interact with the latest data, Torsten Volk, an analyst at Enterprise Management Associates (EMA), said. “Wouldn’t it be much simpler to be able to request this data directly from an API and let the database do the dirty work of data retrieval and continuous synching,” Volk said. “There are many more aspects where traditional databases meet their limits, but real-time databases shine: vector searches to match concepts instead of keywords, mixing SQL and NoSQL architectures, continuously synching database tables across locations with different latencies and temporarily unpacking data for fastest possible decision making are a few examples. Having one API that takes care of all that and is available to everyone as a service can make a significant difference for many product teams.”
One of DataStax’s latest releases is the GPT-based schema translator integrated into its Astra Streaming cloud service. Astra Streaming, based on Apache Pulsar, excels at scaling data streaming for high-speed processing and orchestration. The new Astra Streaming GPT Schema Translator uses generative AI to automate the creation of “schema mappings,” eliminating the tedious process of manually constructing and maintaining event streaming pipelines, DataStax says. Under the hood, Astra Streaming uses Apache Pulsar as its multiprotocol core, with support for Kafka, and DataStax plans to add RabbitMQ support as well.
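To make “schema mapping” concrete: it is the declarative glue that renames and reshapes a source record into the target event schema. The sketch below shows the kind of mapping the GPT translator is meant to generate automatically instead of an engineer writing it by hand — the field names and mapping format here are invented for illustration, not Astra Streaming’s actual representation:

```python
# A hand-written mapping of the kind a schema translator would generate:
# source field name -> target event field name.
SCHEMA_MAPPING = {
    "customer_id": "customerId",
    "order_ts":    "orderTimestamp",
    "amt":         "amountUsd",
}

def translate(record: dict, mapping: dict) -> dict:
    # Rename source fields to the target event schema, dropping any
    # field the mapping does not cover.
    return {target: record[source]
            for source, target in mapping.items()
            if source in record}

event = translate(
    {"customer_id": 7, "order_ts": "2023-06-01T12:00:00Z", "amt": 19.99},
    SCHEMA_MAPPING,
)
```

Writing and maintaining these mappings across dozens of pipelines is exactly the repetitive work the translator is pitched as automating.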
“We’ve noticed that many customers are looking to modernize their messaging and eventing infrastructure, so we’ve made it effortless for them to transition from legacy protocols to a scalable solution that provides comprehensive event streaming capabilities,” Latimer said.
The streaming GPT schema translator is a critical capability for many organizations, since the demand for data engineers skilled in building streaming data pipelines surpasses the available supply, Latimer said. “We aim to simplify the process by eliminating mundane and repetitive tasks, ensuring that highly skilled resources are not consumed by trivial operations,” Latimer said.
In addition to the schema translator, DataStax recently launched Luna ML, a support service for the open source Kaskada platform. Kaskada is an event-processing engine designed for real-time machine learning. DataStax acquired Kaskada in January 2023 and subsequently open-sourced it in March.
“When it comes to stream processing technologies, they typically push users in one of two directions: either toward adopting a database-like approach using SQL or relying on programming language libraries for stream processing,” Latimer said. “However, Kaskada aims to provide its own language for defining stream processing behaviors, similar to how SQL is used for databases. We believe that this space requires its own dedicated language, and that’s where Kaskada’s structure, called FENL, comes into play for defining stream processing use cases.”
Regarding the learning curve, Kaskada is designed to be user-friendly, striking a balance between low-code simplicity and the flexibility required for more advanced scenarios. It caters to individuals with various backgrounds, falling somewhere between those familiar with NoSQL and those experienced with low-code solutions. “We aim to make the learning process straightforward and accessible,” Latimer said.
Kaskada’s event processing capabilities enable stateful stream processing using a declarative query language specifically tailored for analyzing events in bulk and in real time. The platform’s query language combines what DataStax says are the “best” features of SQL to provide a concise and composable way to compute over events. Leveraging Kaskada helps business users gain more dynamic insights as events and results change in real time, the company says.
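What “stateful stream processing” means in practice: per-key state is carried across events, and an up-to-date result is emitted as each event arrives, rather than recomputing over the full history. This tiny Python sketch captures that semantic — it is an illustration of the concept, not FENL, and the feature names are invented:

```python
from collections import defaultdict

def running_features(events):
    """Maintain a running count and sum of purchase amounts per user,
    emitting the refreshed feature after every event — the continuous,
    incremental style of computation Kaskada-like engines provide."""
    state = defaultdict(lambda: {"count": 0, "total": 0.0})
    for user, amount in events:
        s = state[user]          # per-key state survives across events
        s["count"] += 1
        s["total"] += amount
        yield (user, s["count"], s["total"])

stream = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
features = list(running_features(stream))
# After the third event, alice's feature reflects both purchases.
```

A declarative language like FENL expresses the same intent without the hand-managed state dictionary, which is the productivity argument for a purpose-built stream language.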
Access to straightforward SQL queries can be a welcome capability for developers. “Letting app developers use simple SQL queries for real-time event processing could significantly widen the accessibility of these features and therefore open up use cases that would have been previously too expensive or difficult to capture,” Volk said.
So, how does Luna ML assist users in utilizing Kaskada for machine learning?
“Luna ML helps users implement cutting-edge, open source event processing for machine learning. The support offering empowers customers to use Kaskada with expert assistance from DataStax. Enterprises utilizing Luna ML receive mission-critical support with response times as short as 15 minutes,” explained Davor Bonaci, CTO and EVP, DataStax, in an email response. “Furthermore, customers can escalate concerns to the core Kaskada engineering team, ensuring professional attention to any issues encountered during Kaskada deployment. This support is particularly crucial as AI adoption continues to increase, leading to the emergence of new software applications containing ML models and services.”