Cloudflow: A Framework for Streaming Data Pipelines on Kubernetes

Portworx sponsored The New Stack’s coverage of KubeCon + CloudNativeCon in San Diego.
Lightbend, the company behind the Scala programming language and Akka middleware, has launched Cloudflow, an open source framework that makes it easier to develop and deploy streaming data pipelines on Kubernetes.
The release is part of a flurry of Kubernetes news coming out of KubeCon + CloudNativeCon North America 2019 this week.
Cloudflow provides a streaming data-specific framework that simplifies installation and integration of streaming technologies, provides a programming model for creating streaming data pipelines in microservices architectures, and is optimized for deployment on the Kubernetes stack.
“The most interesting new opportunities around AI and training machine learning models are putting a lot of pressure on developers to get streaming data into new applications, and to have predictability on the uninterrupted flow of that data once in production,” said Mark Brewer, CEO of Lightbend.
The complexity comes when users employ multiple streaming engines, like Spark, Akka Streams and Apache Flink, he said.
“These technologies, they’re, you know, they’re widely adopted, they’re pretty easy for a developer to get started with. But once you start plugging them together, it becomes much more complicated,” he said.
“Cloudflow is all about reducing that, that challenge of setting these things up, getting them all interoperating with each other and dealing with a constant flow of streams of data.”
He pointed to example customers using technologies like Spark and Akka: Capital One, using data streams to determine risk for offering auto loans; Norwegian Cruise Line, providing personalization on its onboard passenger app; and Disney, making recommendations for its new streaming service.
“The hardest part about all this is not just setting it up and get it running, but to keep it running on a cluster that could be variable in size. … Anytime there’s new content, you’re going to see a spike in demand. Well, the system needs to scale up appropriately. And then, of course, scale down. That’s the part of Cloudflow that winds up saving a ton of time for operators and developers,” he said.
Schema first
Cloudflow helps developers choose the right streaming engine, such as Akka, Spark or Flink, for each processing phase; focus solely on core business logic; and eliminate the burden of writing boilerplate. Cloudflow handles data durability, serialization, and connections between stages.
Chris Merz, principal technologist at NetApp, previously pointed out that managing state in cloud native applications has more recently fallen to developers, rather than the ops team.
In deployment, Cloudflow enables operators to manage data flow with a simple blueprint file and deploy multistage pipelines with one command. It configures all connections between stages for you and can automatically surface HTTP service endpoints.
Using Cloudflow, you can easily break down your streaming application into small composable components and wire them together with schema-based contracts.
The small stream processing units are called Streamlets, each representing a self-contained stage of the application logic. The data streams are partitioned to allow for parallel processing. Streamlets can be combined into larger systems using blueprints, which specify how Streamlets form a topology.
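To make the relationship between Streamlets and blueprints concrete, here is a minimal sketch in Python. It is a conceptual model only, not Cloudflow's API or blueprint file format: the `Streamlet` and `Blueprint` classes, the `connect` and `deployment_order` methods, and the stage names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Streamlet:
    """Hypothetical model of a self-contained stage with named inlets and outlets."""
    name: str
    inlets: tuple = ()
    outlets: tuple = ()


@dataclass
class Blueprint:
    """Hypothetical model of a blueprint: the streamlets and how they are wired."""
    streamlets: dict = field(default_factory=dict)
    connections: list = field(default_factory=list)  # (src, outlet, dst, inlet)

    def add(self, streamlet):
        self.streamlets[streamlet.name] = streamlet
        return self

    def connect(self, src, outlet, dst, inlet):
        # Reject connections that reference ports the streamlets do not declare.
        if outlet not in self.streamlets[src].outlets:
            raise ValueError(f"{src} has no outlet {outlet!r}")
        if inlet not in self.streamlets[dst].inlets:
            raise ValueError(f"{dst} has no inlet {inlet!r}")
        self.connections.append((src, outlet, dst, inlet))
        return self

    def deployment_order(self):
        """Order streamlet names so every producer comes before its consumers."""
        graph = {name: set() for name in self.streamlets}
        for src, _, dst, _ in self.connections:
            graph[dst].add(src)  # dst depends on src
        return list(TopologicalSorter(graph).static_order())


blueprint = (
    Blueprint()
    .add(Streamlet("ingress", outlets=("out",)))
    .add(Streamlet("validate", inlets=("in",), outlets=("valid",)))
    .add(Streamlet("egress", inlets=("in",)))
    .connect("ingress", "out", "validate", "in")
    .connect("validate", "valid", "egress", "in")
)
order = blueprint.deployment_order()
print(order)
```

The point of the sketch is that the topology lives in one declarative object, separate from the logic inside each stage, so wiring mistakes can be caught before anything is deployed.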
Cloudflow takes a schema-first approach for building streaming data pipelines, with an Avro schema as the starting point.
You can choose which programming language your schemas are generated into by defining settings in the sbt project.
Streamlets expose one or more inlets and outlets, all driven by schema to ensure data flows are always consistent and that connections are compatible. The data sent between Streamlets is persisted in the underlying pub-sub system, allowing for independent lifecycle management of the different components.
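As a rough illustration of what a schema-driven contract between an outlet and an inlet can look like, the Python sketch below defines an Avro-style record schema and compares schemas by fingerprint before allowing a connection. Everything here is an assumption made for illustration: the `SensorReading` schema, the `fingerprint` and `can_connect` helpers, and the simplified canonical form (sorted-key JSON rather than Avro's official Parsing Canonical Form). It does not describe how Cloudflow itself verifies connections.

```python
import copy
import hashlib
import json

# Hypothetical Avro-style record schema shared by a producer and a consumer.
SENSOR_READING = {
    "type": "record",
    "name": "SensorReading",
    "namespace": "example.sensors",
    "fields": [
        {"name": "deviceId", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "temperature", "type": "double"},
    ],
}


def fingerprint(schema):
    """Hash a simplified canonical rendering of the schema.

    Assumption: sorted-key compact JSON stands in for Avro's Parsing
    Canonical Form, which has additional normalization rules.
    """
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_connect(outlet_schema, inlet_schema):
    """Allow a connection only when both ends agree on the exact schema."""
    return fingerprint(outlet_schema) == fingerprint(inlet_schema)


# A consumer built against a drifted schema: temperature narrowed to float.
drifted = copy.deepcopy(SENSOR_READING)
drifted["fields"][2]["type"] = "float"

same_ok = can_connect(SENSOR_READING, copy.deepcopy(SENSOR_READING))
drift_ok = can_connect(SENSOR_READING, drifted)
print(same_ok, drift_ok)
```

An exact-match check is the strictest possible policy. A real system may instead apply schema evolution rules that accept compatible changes, which this sketch does not attempt.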
Applications are deployed as a whole. Cloudflow takes care of deploying the individual Streamlets and ensuring that data flows correctly across the connections between them. Streamlets can be scaled up and down to meet the load requirements of the application.
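One common way partitioned streams support scaling, sketched here in Python, is to map each record key to a fixed partition and then spread the partitions across however many replicas of a stage are running. This is a generic illustration under stated assumptions, not a description of Cloudflow's internals: the `partition_for` and `assign_partitions` helpers, the CRC32 hash, and the round-robin assignment are all hypothetical choices, and the real behavior depends on the underlying pub-sub system and streaming runtime.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, num_replicas: int) -> dict:
    """Spread partitions round-robin over the replicas of one stage."""
    assignment = {replica: [] for replica in range(num_replicas)}
    for partition in range(num_partitions):
        assignment[partition % num_replicas].append(partition)
    return assignment


# Scaling from two replicas to three changes who owns each partition,
# but not which partition a given key lands in.
two_replicas = assign_partitions(6, 2)
three_replicas = assign_partitions(6, 3)
print(two_replicas)
print(three_replicas)
```

Because the key-to-partition mapping stays fixed, records for the same key keep arriving in order at a single replica even as the replica count changes.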
The Cloudflow application development toolkit provides:
- An API definition for Streamlets.
- An extensible set of runtime implementations for popular streaming runtimes, like Spark’s Structured Streaming, Flink, and Akka.
- A Streamlet composition model driven by a blueprint definition.
- A sandbox execution mode for testing applications.
- A set of sbt plugins for packaging your application into a deployable container.
- The Cloudflow operator, a Kubernetes operator that manages the application lifecycle on Kubernetes.
- A command-line interface, in the form of a kubectl plugin, for manual and scripted management of the application.
The New Stack and Lightbend earlier this year teamed up on a survey to look at the drivers for real-time data, and the barriers for developing and managing applications on streaming data infrastructure. You’ll find the full 2019 Streaming Data survey results here. The New Stack analyst Lawrence Hecht discusses the results in an episode of The New Stack Context podcast.
The Future of Data Is in Streaming
KubeCon + CloudNativeCon North America 2019 and NetApp are sponsors of The New Stack.
Image by Johannes Plenio from Pixabay.