How MemSQL Enables Exactly-Once Semantics with Apache Kafka
MemSQL is a sponsor of The New Stack.
Apache Kafka usage is becoming more and more widespread. As the amount of data that companies deal with explodes, and as demands on data continue to grow, Kafka serves a valuable purpose as a standardized messaging bus, thanks to several key attributes.
One of the most important attributes of Kafka is its ability to support exactly-once semantics. With exactly-once semantics, you avoid losing data in transit, but you also avoid receiving the same data multiple times. This avoids problems such as a resend of an old database update overwriting a newer update that was processed successfully the first time.
However, because Kafka is used for messaging, it can’t keep the exactly-once promise on its own. Other components in the data stream have to cooperate – if a data store, for example, were to make the same update multiple times, it would violate the exactly-once promise of the Kafka stream as a whole.
How MemSQL Works with Kafka
At MemSQL, we make fast, scalable, relational database software, with SQL support. MemSQL works in containers, virtual machines, and in multiple clouds — anywhere you can run Linux.
If you aren’t familiar, this relatively novel combination of attributes — the scalability formerly available only with NoSQL, along with the power, compatibility, and usability of a relational, SQL database — makes MemSQL a leading light in the NewSQL movement, along with Amazon Aurora, Google Spanner, and others. The ability to combine scalable performance, ACID guarantees, and SQL access to data is relevant anywhere that people want to store, update, and analyze data, from a venerable on-premise transactional database to ephemeral workloads running in a microservices architecture.
Of course, we think NewSQL is important. NewSQL allows database users to combine the main benefit of NoSQL — scalability across industry-standard servers — and the many benefits of traditional relational databases, which can be summarized as schema (structure) and SQL support.
As NewSQL stalwarts, we count Apache Kafka among our favorite things. One of the main reasons is that Kafka, like MemSQL, supports exactly-once semantics. In fact, Kafka is somewhat famous for this, as shown in my favorite headline from The New Stack: Apache Kafka 1.0 Released Exactly Once.
What Is Exactly-Once?
Briefly, exactly-once is one of three alternatives for processing a stream event or a database update:
- At-most-once. This is the “fire and forget” of event processing. The initiator puts an event on the wire, or sends an update to a database, and doesn’t check whether it’s received or not. Some lower-value Internet of Things streams work this way, because updates are so voluminous, or may be of a type that won’t be missed much. (Though you’ll want an alert if updates stop completely.)
- At-least-once. This is checking whether an event landed, but not making sure that it hasn’t landed multiple times. The initiator sends an event, waits for an acknowledgment, and resends if none is received, repeating until it gets an acknowledgment. However, the initiator doesn’t bother to check whether one or more of the non-acknowledged event(s) got processed, along with the final, acknowledged one that terminated the send attempts. (Think of adding the same record to a database multiple times; in some cases, this will cause problems, and in others, it won’t.)
- Exactly-once. This is checking whether an event landed, and freezing and rolling back the system if it didn’t. If an event doesn’t make it, all the operators on the stream stop and roll back to a “known good” state. Then, processing is restarted and the event is resent. This cycle is repeated until the errant event is accepted and acknowledged, so that its effect is applied one time only.
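The difference between the three modes is easiest to see when an acknowledgment goes missing. The following is a minimal Python sketch, not Kafka or MemSQL code: `Receiver`, `attempt`, and the outcome labels are hypothetical names invented for the illustration. Note that the third function gets the exactly-once effect by having the sink ignore resends of an event id it has already seen, which is a simpler stand-in for the rollback-and-replay cycle described above.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical sink that records every event it applies."""
    applied: list = field(default_factory=list)
    seen_ids: set = field(default_factory=set)

    def apply(self, event_id, payload, dedupe):
        if dedupe and event_id in self.seen_ids:
            return  # resend of an event that already landed: ignore it
        self.seen_ids.add(event_id)
        self.applied.append(payload)


def attempt(receiver, event_id, payload, outcome, dedupe=False):
    """Make one delivery attempt; return True if the sender sees an ack.

    outcome is "lost" (event never arrives), "ack_lost" (event arrives
    but the acknowledgment does not), or "ok".
    """
    if outcome == "lost":
        return False
    receiver.apply(event_id, payload, dedupe)
    return outcome == "ok"


def at_most_once(receiver, event_id, payload, outcomes):
    attempt(receiver, event_id, payload, outcomes[0])  # fire and forget


def at_least_once(receiver, event_id, payload, outcomes):
    for outcome in outcomes:  # resend until acknowledged
        if attempt(receiver, event_id, payload, outcome):
            return


def effectively_once(receiver, event_id, payload, outcomes):
    for outcome in outcomes:  # same retries, but the sink de-duplicates
        if attempt(receiver, event_id, payload, outcome, dedupe=True):
            return


def run(strategy, outcomes):
    """Send one event with the given strategy and return what was applied."""
    receiver = Receiver()
    strategy(receiver, 1, "update", outcomes)
    return receiver.applied
```

With the outcome script `["ack_lost", "ok"]`, `run(at_least_once, ...)` applies the update twice, while `run(effectively_once, ...)` applies it once. With `["lost", "ok"]`, `run(at_most_once, ...)` applies nothing, because the first attempt is never retried.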
How MemSQL Joins In with Pipelines
The availability of exactly-once semantics in Kafka gives an opportunity to other participants in the processing of streaming data, such as database makers. MemSQL saw this early. The MemSQL Pipelines capability was first launched in the fall of 2016, as part of MemSQL 5.5; you can see a video here. (There’s much more about the Pipelines feature in our documentation — original and updated. We also have specific documentation on connecting a Pipeline to Kafka.)
The Pipelines feature basically hotwires the well-known ETL (Extract, Transform, and Load) process by connecting to a data source, handling some limited changes to data as it streams in, and loading it into the MemSQL database.
From the beginning, Pipelines have supported exactly-once semantics. When you connect a message broker with exactly-once semantics, such as Kafka, to MemSQL Pipelines, we support exactly-once semantics on database operations.
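One common way for a database sink to keep this promise is to commit the loaded rows and the stream position in the same transaction, so a failure leaves neither behind. The sketch below shows that general idea using Python's built-in `sqlite3` as a stand-in. It is not MemSQL code and makes no claim about how Pipelines is implemented internally; the table names, `make_db`, and `load_batch` are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a data table and an offset table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (offset_id INTEGER, payload TEXT)")
    db.execute("CREATE TABLE progress (next_offset INTEGER)")
    db.execute("INSERT INTO progress VALUES (0)")
    db.commit()
    return db


def load_batch(db, log, batch_size, fail_after=None):
    """Load one batch from `log`, starting at the committed offset.

    Rows and the new offset are written in a single transaction. If the
    load fails part-way (simulated with `fail_after`), both are rolled
    back and the next call re-reads the same batch.
    """
    start = db.execute("SELECT next_offset FROM progress").fetchone()[0]
    batch = log[start:start + batch_size]
    try:
        with db:  # commits on success, rolls back on exception
            for i, payload in enumerate(batch):
                if fail_after is not None and i == fail_after:
                    raise RuntimeError("simulated crash mid-batch")
                db.execute(
                    "INSERT INTO events VALUES (?, ?)", (start + i, payload)
                )
            db.execute(
                "UPDATE progress SET next_offset = ?", (start + len(batch),)
            )
    except RuntimeError:
        return False
    return True
```

Calling `load_batch(db, ["a", "b", "c", "d"], 2, fail_after=1)` returns `False` and leaves both tables as they were. Two further calls without `fail_after` then load all four rows, each exactly once.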
The key feature of Pipelines is that it’s fast. That’s vital to exactly-once semantics, which amount to a promise to back up and try again whenever an operation fails.
Like most things worth having in life, exactly-once semantics place certain demands on those who wish to benefit from them. For the exactly-once promise to be practical, two things are required:
- Having few operations fail.
- Running each operation so fast that retries, when needed, are not too extensive or time-consuming.
If these two conditions are both met, you get the benefits of exactly-once semantics without a lot of performance overhead, even when crashes occur. If either of these conditions is not met, the costs can start to outweigh the benefits.
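A back-of-the-envelope model makes the trade-off concrete. The sketch below assumes that failures are independent and that a failed attempt costs as much as a successful one; both are simplifying assumptions for illustration, not measurements of MemSQL or Kafka, and the function names are hypothetical.

```python
def expected_seconds(op_seconds, failure_rate):
    """Expected wall-clock time for one operation, counting retries.

    Assumes independent failures and that a failed attempt takes as long
    as a successful one before it is rolled back.
    """
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    # Mean number of attempts until the first success (geometric distribution).
    expected_attempts = 1 / (1 - failure_rate)
    return op_seconds * expected_attempts


def retry_overhead(op_seconds, failure_rate):
    """Average extra seconds spent on failed attempts."""
    return expected_seconds(op_seconds, failure_rate) - op_seconds
```

Under this model, a 0.1-second operation that fails 1% of the time costs about 0.001 seconds extra on average, while a 10-second operation that fails half the time costs 10 seconds extra. Both the failure rate and the operation time have to stay small for retries to be cheap.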
MemSQL 5.5 met these challenges, and the Pipelines capability is popular with our customers. However, note the word “limited” above: the Pipeline “handles some limited changes to data.” For Pipelines to really replace the whole ETL process, we needed to, well, widen the pipe.
So, in the recent MemSQL 6.5 release, we announced Pipelines to stored procedures. This feature does what it says on the tin: you can write SQL code and attach it to MemSQL Pipelines. Adding custom code greatly extends the transformation capability of Pipelines.
Stored procedures can both query MemSQL tables and insert into them, which means the feature is quite powerful. However, in order to meet the desiderata for exactly-once semantics, there are limitations on it. Stored procedures are MemSQL-specific; third-party libraries are not supported, and developers have to be thoughtful as to overall system throughput when using stored procedures.
Because MemSQL is SQL-compliant, stored procedures are written in standard ANSI SQL. And because MemSQL is very fast, developers can fit a lot of functionality into them, without disrupting exactly-once semantics.
Fast and Flexible
The Pipelines capability is not only fast; it’s also flexible, both on its own and when used with other tools. That’s because more and more data processing components can support exactly-once semantics.
For instance, here are two ways to enrich a stream with outside data. The first is to create a stored procedure to do the work in MemSQL.
The following stored procedure uses an existing MemSQL table to join an incoming IP address batch with existing geospatial data about its location:
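CREATE PROCEDURE proc(batch query(ip varchar, ...)) AS
BEGIN
  -- Enrich each incoming row with the geopoint stored for its IP prefix.
  INSERT INTO t
  SELECT batch.*, ip_to_point_table.geopoint
  FROM batch JOIN ip_to_point_table
    ON ip_prefix(ip) = ip_to_point_table.ip;
END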
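To see the shape of that join outside MemSQL, here is a small runnable sketch in Python, again with `sqlite3` as a stand-in. Several things here are assumptions made for the illustration: that `ip_prefix` keeps the first three octets of an IPv4 address, that `geopoint` can be held as text, and the sample lookup rows, which are made up and use documentation-only address ranges.

```python
import sqlite3


def ip_prefix(ip):
    """Assumed behaviour: keep the first three octets of an IPv4 address."""
    return ".".join(ip.split(".")[:3])


def enrich(batch):
    """Join a batch of IP addresses against a prefix-to-location table."""
    db = sqlite3.connect(":memory:")
    # Expose the Python helper to SQL under the name used in the article.
    db.create_function("ip_prefix", 1, ip_prefix)
    db.execute("CREATE TABLE ip_to_point_table (ip TEXT, geopoint TEXT)")
    db.execute("CREATE TABLE batch (ip TEXT)")
    db.execute("CREATE TABLE t (ip TEXT, geopoint TEXT)")
    # Made-up sample data for the lookup table.
    db.executemany(
        "INSERT INTO ip_to_point_table VALUES (?, ?)",
        [("192.0.2", "POINT(-122.4 37.8)"), ("203.0.113", "POINT(2.35 48.85)")],
    )
    db.executemany("INSERT INTO batch VALUES (?)", [(ip,) for ip in batch])
    with db:  # one transaction for the whole batch
        db.execute(
            """
            INSERT INTO t
            SELECT batch.*, ip_to_point_table.geopoint
            FROM batch JOIN ip_to_point_table
              ON ip_prefix(batch.ip) = ip_to_point_table.ip
            """
        )
    return db.execute("SELECT ip, geopoint FROM t ORDER BY ip").fetchall()
```

Because this is an inner join, addresses with no matching prefix are dropped rather than inserted with an empty location. Whether that is the right behavior depends on the use case.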
(For a lot more on what you can do with stored procedures, see — wait for it — our documentation, which also describes how to add SSL and Kerberos to a Kafka pipeline.)
You can also handle the transformation with Apache Spark, and you can do it in such a way as to support exactly-once semantics, as described in this article. (As the author, Ji Zhang, puts it very well: “But surely knowing how to achieve exactly-once is a good chance of learning, and it’s a great fun.”)
Once Apache Spark has done its work, stream the results right on into MemSQL via Pipelines. (Which were not available when we first described using Kafka, Spark, and MemSQL to power a model city.)
Try it Yourself
You can try all of this yourself, quickly and easily. MemSQL software is now available for free, with community support, up to a fairly powerful cluster. This allows you to develop, experiment, test, and even deploy for free. When you need more power, or when you want dedicated support — or if you want to discuss a specific use case — you can contact MemSQL.