Apache Kafka + Spark + Database = Real-Time Trinity
As technology fits into our lives and onto our wrists, demands increase for intelligent and real-time mobile applications. These applications need to deliver information and services that are relevant and immediate. To keep up with the flow of information coming in, applications must stream data with a real-time infrastructure to capture, process, analyze and serve massive amounts of data to millions and sometimes billions of users.
Today’s leading deployments combine three distributed systems to create a real-time trinity:
- A messaging system to capture and publish feeds.
- A transformation tier to distill information, enrich data and deliver the right formats.
- An operational database for persistence, easy application development and analytics.
Together, these systems create a comprehensive, real-time data pipeline and operational analytics loop. Let’s explore in more detail.
Real-time pipelines often begin with data capture and use a messaging system to ensure every record makes it to the right place. Data might come from logging information, sensor data, financial market streams, mobile applications or other sources, and is ultimately written to file systems, object stores and databases.
Apache Kafka is one example of such a messaging system. According to the Apache Kafka website: “Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.” It acts as a broker between producers and consumers of message streams, as outlined in the following diagram:
Because it is a distributed system, Kafka can scale the number of producers and consumers by adding servers or instances to the cluster. Kafka’s effective use of memory, combined with the commit log to disk, provides great performance for real-time pipelines plus durability in the event of server failure.
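The commit-log idea at the heart of Kafka can be illustrated with a toy, in-process model (this is a conceptual sketch, not the actual Kafka API): producers append records to an immutable log, and each consumer tracks its own read offset, so many consumers can read the same stream independently.

```python
# A minimal, illustrative sketch of Kafka's core idea: an append-only
# commit log brokering between producers and consumers. Toy model only,
# not the real Kafka client API.

class CommitLog:
    def __init__(self):
        self._log = []          # append-only sequence of records
        self._offsets = {}      # consumer name -> next offset to read

    def produce(self, record):
        """A producer appends a record; its position (offset) is returned."""
        self._log.append(record)
        return len(self._log) - 1

    def consume(self, consumer, max_records=10):
        """A consumer reads from its own offset; the log is never mutated."""
        start = self._offsets.get(consumer, 0)
        records = self._log[start:start + max_records]
        self._offsets[consumer] = start + len(records)
        return records

log = CommitLog()
log.produce({"sensor": "a", "value": 1})
log.produce({"sensor": "b", "value": 2})

# Two independent consumers each see the full stream at their own pace.
print(log.consume("analytics"))                 # both records
print(log.consume("archiver", max_records=1))   # first record only
```

Because the log is append-only and consumers only advance offsets, adding more consumers never disturbs producers, which is the property that lets Kafka scale both sides independently.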
With message ingest and propagation in place, it is time to move to the next tier of transformation.
The transformation tier allows data to be manipulated, enriched and analyzed before it is persisted for use by an application. Today, Apache Spark is one of the most popular transformation tiers.
Spark is also a distributed, memory-optimized system, and therefore a perfect complement to Kafka. Spark includes a streaming library, and a rich set of programming interfaces to make data processing and transformation easier.
With Spark you can ingest data from Kafka, filter that stream down to a smaller data set, run enrichment operations to augment data, and then push that refined data set to a persistent data store.
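That ingest, filter, enrich and persist flow can be sketched schematically; the snippet below uses plain Python in place of Spark's streaming API, and the record fields, the `enrich` helper and the latency threshold are all illustrative assumptions rather than anything from a real deployment.

```python
# Schematic ingest -> filter -> enrich -> persist flow, with plain
# Python standing in for Spark transformations. All names are made up.

raw_stream = [
    {"user": "u1", "event": "click", "ms": 120},
    {"user": "u2", "event": "view",  "ms": 340},
    {"user": "u3", "event": "click", "ms": 95},
]

# Filter the stream down to a smaller data set (clicks only).
clicks = [r for r in raw_stream if r["event"] == "click"]

# Enrich each record, e.g. by tagging slow interactions.
def enrich(record):
    return {**record, "slow": record["ms"] > 100}

enriched = [enrich(r) for r in clicks]

# "Persist" the refined set; in a real pipeline this write would go
# to an operational database rather than an in-memory list.
store = list(enriched)
print(store)
```

In actual Spark code the same shape appears as chained `filter` and `map` operations over a stream sourced from Kafka, with the final stage writing to the datastore.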
Spark’s support for a wide range of operators to facilitate data transformation allows broad functionality within a single system, and makes it an ideal part of real-time pipelines. However, Spark does not include a storage engine and therefore works well when paired with a persistent datastore, such as an operational database.
Real-time data streams provide the most value when analysis spans both real-time and historical data. To do that, data must be persisted beyond the messaging and transformation tiers into a permanent datastore. While that could be an unstructured system like the Hadoop Distributed File System or Amazon S3, neither of those solutions provides instant analytics.
An in-memory operational database, however, provides persistence for real-time and historical data as well as the ability to query both together.
As one example, MemSQL, an in-memory database for transactions and analytics that our company developed, can ingest and persist data from Spark. This allows applications to be built on an operational database using the freshest data. Now with familiar SQL, the lingua franca for data programming, enterprises can quickly and easily build real-time applications and run analytics.
Easily Remove Duplicates With an Operational Database
One of the challenges with streaming pipelines is the appearance of duplicates. In an effort to guarantee no data loss, messages and data points will sometimes be propagated multiple times. This can adversely affect capacity and complicate analytics.
With an operational database and a mutable dataset, an upsert operation such as:
INSERT ... ON DUPLICATE KEY UPDATE
will add a new record, or update an existing one, avoiding the duplicates challenge.
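The upsert pattern is easy to demonstrate end to end. The article's syntax is MySQL-style; the runnable sketch below uses SQLite's equivalent `ON CONFLICT ... DO UPDATE` clause, with a hypothetical `events` table, to show a replayed message updating a row instead of duplicating it.

```python
import sqlite3

# Demonstrates deduplication via upsert. The article shows MySQL-style
# INSERT ... ON DUPLICATE KEY UPDATE; SQLite's ON CONFLICT clause is the
# equivalent used here so the example runs anywhere.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, value INTEGER)")

upsert = """
    INSERT INTO events (id, value) VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET value = excluded.value
"""

# The same message delivered twice, e.g. replayed by the pipeline
# under an at-least-once delivery guarantee.
conn.execute(upsert, ("msg-1", 10))
conn.execute(upsert, ("msg-1", 42))

rows = conn.execute("SELECT id, value FROM events").fetchall()
print(rows)  # one row with the latest value: [('msg-1', 42)]
```

Because the primary key absorbs the replay, downstream analytics never see the duplicate.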
Completing the Real-Time Trinity
Companies such as Pinterest have seen the power of this software infrastructure combination, showcasing the results at this year’s Strata + Hadoop World.
The real-time data pipeline also enables advanced analytics from application data. Data from the operational database can go to Spark for advanced analytics, and be persisted back to the operational database for application use.
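That read-analyze-write-back loop can be sketched in miniature. In the snippet below, SQLite stands in for the operational database and a simple per-user aggregation stands in for a Spark analytics job; the table and column names are illustrative assumptions.

```python
import sqlite3

# Sketch of the operational analytics loop: read application data from
# the database, compute an aggregate (standing in for a Spark job), and
# persist the result back for application use.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, ms INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("u1", 120), ("u1", 80), ("u2", 300)])

# "Advanced analytics" step: average latency per user.
rows = conn.execute("SELECT user, ms FROM events").fetchall()
totals = {}
for user, ms in rows:
    count, total = totals.get(user, (0, 0))
    totals[user] = (count + 1, total + ms)

# Persist results back to the operational database for the application.
conn.execute("CREATE TABLE user_stats (user TEXT PRIMARY KEY, avg_ms REAL)")
conn.executemany("INSERT INTO user_stats VALUES (?, ?)",
                 [(u, t / c) for u, (c, t) in totals.items()])

print(conn.execute("SELECT * FROM user_stats ORDER BY user").fetchall())
# [('u1', 100.0), ('u2', 300.0)]
```

In production the aggregation would run in Spark against far larger data, but the loop, operational database out and back in, is the same.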
Economics make this workflow accessible. Because each piece is a distributed system, users can deploy on commodity hardware or in the cloud, using the most cost-effective infrastructure, and performance scales with the number of nodes. Entirely new deployment methods are emerging too, such as Mesosphere, which simplifies deploying multiple distributed systems.
It is hard to see our appetites for immediate information abating. Fortunately, cost-effective architectures like the real-time trinity support this demand today.
Eric Frenkiel is the founder and CEO of MemSQL, a distributed in-memory database.