WebAssembly Brings Inline Data Transformations to RedPanda Kafka Streaming Platform
RedPanda, the Kafka-compatible event streaming platform from Vectorized.io, has always focused on improving both performance and the developer experience. Although it’s fully API-compatible, it’s written in C++ rather than the Java used by Apache Kafka itself. Not only does that produce a single binary that’s easier for developers to deploy, but the language choice also let the RedPanda developers write low-level code that bypasses the Linux kernel and talks directly to NVMe SSDs for performance, while running the V8 JavaScript engine (the same engine that powers Node.js) on each core to take advantage of the increasing core counts of modern CPUs.
“That allows us to invert the relationship between data and compute,” Vectorized CEO and founder Alexander Gallego explained to the New Stack. “Normally what people do is send data to Kafka, and then have a separate process; that could be Spark Streaming, Apache Flink, Kafka Streams, and so on. Say you want to provide a GDPR filter for all the data coming into your network. What you do is save your data into Kafka and then you have Spark Streaming consume the data from Kafka, filter the data and push it back into Kafka or into other downstream systems like Elasticsearch. The problem with that is that your data is really ping-ponging around your network all the time, even for simple things.”
Vectorized may be ahead of the curve, but expect to see Wasm used more for this kind of added inline functionality. Although it’s still a nascent technology, Jason McGee, Chief Technology Officer of IBM’s cloud platform, told us at KubeCon 2021 that WASM is a good fit for these kinds of serverless and edge scenarios where footprint and low-latency startup are important. “When how long it takes to get things up and running matters, being able to use things like WASM to inject code into already running containers is an interesting experiment.”
Adding functionality to an existing system is exactly the kind of place where WebAssembly is going to be useful, the Cloud Native Computing Foundation‘s Chief Technology Officer Chris Aniszczyk agreed. “I think we’ll see any project that has an extension-type mechanism will probably take advantage of WASM to go about doing that.”
Think of this as a modern approach to a very traditional database technique, stored procedures, Gallego suggested. “You can write it in Go or Rust, you compile it to WebAssembly on your machine, and then you ship the code to RedPanda and as data comes in, it’s filtered inline. No longer using the network to ping pong your data provides this huge performance benefit.” Bringing the compute to the data addresses issues of data gravity, and appeals to developers who are getting used to serverless approaches.
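To make the stored-procedure analogy concrete, here is a minimal sketch of the kind of one-shot filter such a transform performs. This is illustrative Python, not RedPanda's API: a real transform would be written in Go or Rust and compiled to Wasm, and the `PII_FIELDS` list is an invented example of a GDPR policy.

```python
import json

# Hypothetical list of fields a GDPR policy says must never reach a topic.
PII_FIELDS = {"email", "ssn", "ip_address"}

def redact_record(payload: bytes) -> bytes:
    """Strip PII fields from a JSON record inline, before it is stored."""
    record = json.loads(payload)
    for field in PII_FIELDS:
        record.pop(field, None)
    return json.dumps(record).encode()

raw = b'{"user": 42, "email": "a@b.com", "amount": 9.99}'
print(redact_record(raw))  # b'{"user": 42, "amount": 9.99}'
```

The point of the architecture is that this function runs inside the broker as data arrives, so the redacted record is the only version that ever crosses the network.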
This isn’t suitable for all streaming scenarios, he noted, particularly not if they’re stateful. “It is designed to solve one-shot transformations, where you have your payload and then you want to do something to it and then you want to forward it along to a different topic or a different partition, or maybe take an object and enrich it with IP information. These are simple things that are actually really hard to do in the enterprise today.”
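The IP-enrichment case Gallego mentions can be sketched the same way: a stateless function that takes one record in and hands one enriched record back. The lookup table below is a stand-in assumption; real code might consult a GeoIP database.

```python
import json

# Hypothetical IP-to-country table standing in for a real GeoIP lookup.
GEO_TABLE = {"203.0.113.7": "NL", "198.51.100.2": "US"}

def enrich(payload: bytes) -> bytes:
    """One-shot enrichment: attach country info to a record by its IP."""
    record = json.loads(payload)
    record["country"] = GEO_TABLE.get(record.get("ip"), "unknown")
    return json.dumps(record).encode()

print(enrich(b'{"ip": "203.0.113.7", "event": "login"}'))
```

Because the function holds no state between records, it fits the one-shot model: each payload can be processed on whichever core it lands on.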
Inline transformation is particularly useful for handling machine learning training data (RedPanda’s customers in financial services use this for functionality like credit scoring). Format unification is a common machine learning problem; data might be in CSV, JSON and other formats and the WASM code can transform that inline to a uniform format for the machine learning workflow.
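A format-unification transform of this sort might look like the following sketch, which normalizes CSV or JSON payloads into one record shape before they reach a training pipeline. The function name and field names are illustrative, not part of any RedPanda API.

```python
import csv
import io
import json

def unify(payload: bytes, fmt: str) -> list[dict]:
    """Normalize a CSV or JSON payload into a uniform list of records."""
    if fmt == "json":
        record = json.loads(payload)
        return record if isinstance(record, list) else [record]
    if fmt == "csv":
        # First CSV row is treated as the header.
        return list(csv.DictReader(io.StringIO(payload.decode())))
    raise ValueError(f"unsupported format: {fmt}")

print(unify(b"id,score\n1,0.9", "csv"))       # [{'id': '1', 'score': '0.9'}]
print(unify(b'{"id": 2, "score": 0.4}', "json"))
```

Running this inline means the machine learning workflow downstream only ever sees one format, whatever the producers sent.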
The speed of doing inline data transformation with WASM makes it suitable for near-real-time tasks like fraud detection for e-commerce. “There’s a very high SLA that says the point of sale has to return whether the credit card should be accepted or denied within two seconds,” Gallego said. “Often that gets pushed to a system like Kafka and then Spark Streaming and then back into Kafka. With WebAssembly, you can ship the logic that says ‘yes, this is fraudulent or no, this is not fraudulent’. It allows you to make some real-time experiences actually feel real-time.”
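The fraud decision itself is often a cheap rule evaluation; the latency in the Kafka-to-Spark-to-Kafka round trip is the expensive part. A toy version of such a decision function, with entirely invented rules and thresholds, might look like this:

```python
def is_fraudulent(txn: dict) -> bool:
    """Toy inline fraud rule; real scoring logic would be far richer."""
    if txn["amount"] > 5000:          # hypothetical amount limit
        return True
    if txn["country"] != txn["card_country"]:  # hypothetical mismatch rule
        return True
    return False

txn = {"amount": 120.0, "country": "US", "card_country": "US"}
print("deny" if is_fraudulent(txn) else "accept")  # accept
```

Shipping logic like this to the storage engine lets the accept/deny answer come back inside the two-second point-of-sale SLA without the data leaving the broker.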
One customer is using Kafka to stream data from sensors in crude oil pipelines. If there’s too much jitter, the system needs to shut the pipeline down before any damage occurs; WebAssembly code could publish a message to a Kafka topic that triggers an alert.
Cheaper, Simpler, More Secure
Putting V8 on each core reduces the number of brokers required to do data transformation; in one case from 100 computers to 37, an obvious cost saving. Wasm might reduce that even further because it uses a cheaper, faster calling mechanism for data.
Developers can still take advantage of the rich Kafka ecosystem, but they can now use it for more use cases, Gallego suggested. “As you start to ship these capabilities to the storage engine itself, you raise the level of abstraction for the consumers of your data, and you can start to give computational guarantees about the output of your data streams.”
A chief security officer could be certain that all their data streams are GDPR compliant because they shipped the certified Wasm script to the storage engine. Financial customers also appreciate the inherent security of Wasm.
Wasm was designed as a secure sandbox for untrusted code: its bounded memory model (it can only address 4GB of memory) avoids memory overflow issues, and its capability model offers Vectorized a lot of control. “It allows us to expose specific functions to interact with the outside world. We vet which specific functions are allowed to execute in the Wasm context; those are encryption functions and compression functions and things that we know are sound and safe.”
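The vetting Gallego describes amounts to an allow-list on the host side: a transform can only call functions the platform has approved. Here is a minimal sketch of that idea, with invented function names; it does not reflect Vectorized's actual host interface.

```python
import hashlib
import zlib

# Hypothetical vetted set of host functions a transform may call.
ALLOWED_HOST_FUNCS = {
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
    "compress": lambda data: zlib.compress(data),
}

def call_host(name: str, data: bytes):
    """Dispatch a host call, refusing anything outside the vetted set."""
    if name not in ALLOWED_HOST_FUNCS:
        raise PermissionError(f"host function not permitted: {name}")
    return ALLOWED_HOST_FUNCS[name](data)

print(call_host("sha256", b"payload")[:8])
```

An allow-list like this is what makes the security conversation tractable: the audit question becomes "which names are in the set", rather than "what could arbitrary code do".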
As always, there’s a tradeoff between developer productivity and security, but using Wasm allows RedPanda users to set policy about what can be done with data. “This kind of control is very powerful. In a security audit with the bank, we can explain exactly which Wasm functions their developers can execute on data. That’s a mental model we can give to the security officer and the security team as a way to reason about what is possible, what are the actual failures that can happen or how much data can potentially leak. They can make the decisions themselves on what are the guarantees that they want to give the consumers of their data.”