Stream Processing 101: What’s Right for You?
Over the last decade, the growing adoption of Apache Kafka has allowed data streaming — the continuous transmission of streams of data — to go mainstream.
To run operational and analytics use cases in real time, you don't want to work with pockets of data that sit and go stale. You want continuous streams of data that you can act on as they're generated and ingested. That's why so many companies have turned to data streaming, but the reality is that data streaming alone is not enough to maximize the value of real-time data. For that, you need stream processing.
What Is Stream Processing and How Does It Work?
Stream processing means performing operations on data as soon as it’s received. Processing data in flight allows you to extract its value as soon as it arrives rather than waiting for data collection and then batch processing.
Most systems are designed with high latency by default: batch jobs are strung together to periodically move data from one place to another, like a Rube Goldberg machine. But it doesn't have to be that way. Organizations gain an advantage when they architect for faster processing, especially for use cases meant to improve responsiveness.
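The difference between the two approaches can be sketched in a few lines of plain Python. This is a conceptual illustration only, not any framework's API: the batch function waits for a complete collection of events, while the streaming function emits a result for each event the moment it arrives.

```python
# Illustrative sketch only: plain Python, no streaming framework.
def handle(event):
    """React to a single event the moment it arrives."""
    return {"user": event["user"], "seen_at": event["ts"]}

# Batch style: collect everything first, process later.
def batch_process(events):
    return [handle(e) for e in events]

# Streaming style: process each record in flight, as a generator,
# so downstream consumers see results immediately.
def stream_process(event_source):
    for event in event_source:
        yield handle(event)

events = [{"user": "a", "ts": 1}, {"user": "b", "ts": 2}]
first_result = next(stream_process(iter(events)))
```

With the generator, the first result is available as soon as the first event lands; the batch version produces nothing until the entire collection has been gathered.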
The TV streaming apps many of us use are a great example of how stream processing can improve both frontend experiences and backend processes. Every button pressed on a remote control provides information about viewing behavior that can inform the categorization of content to improve the user experience.
At the same time, the app can be designed to ensure viewing quality by monitoring streams of data on rebuffering events and regional outages. Compare that to a system or app that can only provide data on interruptions in predetermined intervals, minutes, hours or even days apart. That’s the difference between using batch-based versus streaming data pipelines to capture the data that runs a business. And once an organization makes the jump to data streaming, incorporating stream processing into the new pipelines they build is the only thing that makes sense.
Organizations that adopt data streaming without taking advantage of stream processing are left dealing with more latency and higher costs than they have to. Why bother to capture data in real time if you’re not going to process and transform it in real time too?
Although not every application you build requires processing data in flight, many of the most valuable use cases, such as fraud detection, cybersecurity and location tracking, need real-time processing to work effectively.
When streaming data isn’t processed in real time, it has to be stored in a traditional file system or a cloud data warehouse until an application or service requests that data. That means executing queries from scratch every time you want the data to be joined, aggregated or enriched so it’s ready for downstream systems and applications.
In contrast, stream processing allows you to “look” at the data once rather than having to apply the same operations to it over and over. That reduces storage and compute costs, especially as your data-streaming use cases scale over time.
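The "look at the data once" idea can be made concrete with a small, hypothetical sketch (plain Python, no real engine): the re-query style rescans stored events every time someone asks for a total, while the stream-processing style maintains the aggregate incrementally as each event arrives.

```python
from collections import defaultdict

# Re-query style: every request scans the stored events from scratch.
def total_per_product(stored_events):
    totals = defaultdict(int)
    for e in stored_events:
        totals[e["product"]] += e["qty"]
    return dict(totals)

# Stream-processing style: maintain the aggregate once, as events arrive.
class RunningTotals:
    def __init__(self):
        self.totals = defaultdict(int)

    def on_event(self, event):
        # Called exactly once per incoming event; no rescans later.
        self.totals[event["product"]] += event["qty"]

events = [{"product": "mug", "qty": 2}, {"product": "mug", "qty": 3}]
agg = RunningTotals()
for e in events:
    agg.on_event(e)
```

Both approaches arrive at the same answer, but the incremental one touches each event only once, which is where the storage and compute savings come from as volumes grow.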
Stream Processing in the Real World
Once you have stream processing pipelines built, you can connect them to all the places your data lives — from on-premises relational databases to the increasingly popular cloud data warehouses and data lakes. Or you can use these pipelines to connect directly to a live application.
A great example of the benefits of stream processing is real-time e-commerce. Stream processing allows an e-commerce platform to update downstream systems as soon as there’s new information available. When it comes to data points like product pricing and inventory, there can be multiple operational and customer-facing use cases that need that information.
If these platforms have to process data in batches, this leads to greater lag time between the information customers want — new sales and promotions, shipping updates or refunds — and the notifications they actually receive. That’s a poor customer experience that businesses need to avoid if they want to be competitive, and something that’s applicable across every industry.
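A minimal publish/subscribe sketch shows how a single pricing or inventory event can fan out to multiple downstream consumers the moment it happens. All names here are hypothetical, invented for illustration; real platforms would use a broker such as Kafka rather than an in-process list of handlers.

```python
# Hypothetical in-process fan-out; a real system would use a message broker.
notifications = []
inventory_view = {}

def notify_customer(event):
    # Customer-facing consumer: push a notification on price drops.
    if event["type"] == "price_drop":
        notifications.append(f"{event['sku']} now ${event['new_price']}")

def update_inventory_view(event):
    # Operational consumer: keep a live inventory view current.
    if event["type"] == "stock_change":
        inventory_view[event["sku"]] = event["level"]

subscribers = [notify_customer, update_inventory_view]

def publish(event):
    # Each consumer reacts the moment the event lands, not on a batch schedule.
    for handler in subscribers:
        handler(event)

publish({"type": "price_drop", "sku": "A1", "new_price": 19})
publish({"type": "stock_change", "sku": "A1", "level": 4})
```

The key point is that both the customer notification and the operational view update from the same event stream, with no polling interval in between.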
But before companies and their developers can get started, they need to choose the right data-stream-processing technology. And that choice isn’t necessarily a straightforward one.
Common Stream Processing Technologies
Over the last seven or eight years, a few open source technologies have dominated the world of stream processing. This small handful of technologies is trying to solve the same problem — putting data to work faster without compromising data quality or consistency — even though the technical, architectural and operational details underneath differ.
Let’s look at three commonly used stream processors.
- Apache Flink is a data-processing framework designed to process large-scale data streams. Flink supports both event-driven processing and batch processing, as well as interactive analytics.
- Kafka Streams, part of the Apache Kafka ecosystem, is a client-side Java library that allows developers to build real-time stream-processing applications and microservices with scalable, high-throughput pipelines.
- Apache Spark is a distributed engine built for big data analytics that processes streams in small micro-batches, approximating the per-event processing of Flink and Kafka Streams.
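The main behavioral difference among these engines — per-event versus micro-batch processing — can be sketched in plain Python. The function names are illustrative, not any engine's API: the first emits one result per record, the way Flink and Kafka Streams do; the second groups records into small batches before emitting, the way Spark's streaming mode does.

```python
# Per-event style (Flink / Kafka Streams): one result per record.
def per_event(events):
    for e in events:
        yield e * 2

# Micro-batch style (Spark): records grouped into small batches first.
def micro_batches(events, batch_size=3):
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield [x * 2 for x in batch]
            batch = []
    if batch:  # flush the final partial batch
        yield [x * 2 for x in batch]

stream = [1, 2, 3, 4, 5]
```

Both produce the same results in the end; the trade-off is latency per record versus throughput per batch.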
Each of these technologies has its strengths, and there are even use cases where it makes sense to combine these technologies. Whether considering these three technologies or the many others available in the broader ecosystem, organizations need to consider how this decision will further their long-term data strategy and allow them to pursue use cases that will keep them competitive as data streaming becomes more widespread.
How Organizations Can Choose Their Stream-Processing Technologies
Organizations adopting stream processing today often base this decision on the existing skill set of their developer and operations teams. That's why you often see businesses with a significant community of practice around Kafka turning to Kafka Streams, for example.
The developer experience is an important predictor of productivity if you plan to build streaming applications in the near future. For example, using a SQL engine (Flink SQL, ksqlDB or Spark SQL) to process data streams may be the right choice for making real-time data accessible to business analysts in your organization. In contrast, for developers used to working with Java, the ease of use and familiarity of Kafka Streams might be a better fit for their skill set.
While this reasoning helps avoid blocking innovation in the short term, it's not always the most strategic decision and can limit how far you can take your stream-processing use cases.
How to Get Started with Stream Processing Today
Getting started with stream processing looks different from a practitioner perspective versus an organizational one. While organizations need to think about business requirements, practitioners can focus on the technology that helps them launch and learn fast.
Start by looking at side-by-side comparisons of the streaming technologies you want to use. While a company might evaluate several technologies at once, I’d recommend against that approach for developers — you don’t want to do a proof of concept (POC) on five different technologies. Instead, narrow down your list to two options that fit your requirements, and then build a POC for each.
The easiest way to do this is to find a tutorial that closely matches your use case and dive in. A great way to start is by building streaming pipelines that ingest and process data from Internet of Things (IoT) devices or public data sets like Wikipedia updates. Here are some places to start learning:
- Stream Processing Simplified is about Flink for Kafka Users.
- Learn Flink: Hands-On Training is about using Flink’s APIs to manage time and state.
- Get started with Flink in Java with this hands-on exercise.
- Apache Flink 101 discusses Flink’s core concepts and architecture.
- Build a real-time fraud detection pipeline with Kafka Streams.
- Build a real-time stream-processing pipeline with Spark and Kafka.
Developing streaming applications and services can be challenging because they require a different approach than traditional synchronous programming. Practitioners not only need to become familiar with the technology but also learn how to solve problems by reacting to events and streams of data, rather than by applying conditions and operations to data at rest.
While the technology you choose today may not be the one you use tomorrow, the problem-solving and stream-processing skills you’re gaining won’t go to waste.