An Introduction to Stream Processing
Industries across the globe produce a staggering amount of data, and it continues to multiply at an exponential rate. This big data often comes in the form of live streams, known as streaming data, which has become a critical part of modern enterprise data architecture and a core source of data for analytics and data science. This live data can come from server logs, IoT sensors, and clickstream data from websites and apps. Tracking and analyzing this data has become integral to supporting data science in the enterprise.
However, it’s tricky to work with streaming data for two reasons. First, you have to collect large amounts of data from streaming sources that generate events every minute. Second, in its raw form, streaming data lacks structure and schema, which makes it tricky to query with analytic tools.
Today, there’s an increasing need to process, parse, and structure streaming data before any proper analysis can be done. For instance, what happens when someone uses a ride-hailing app? The app uses real-time data of location tracking, traffic data, and pricing to provide the most suitable driver. It also estimates how long it will take to reach the destination based on real-time and historical data. The entire process from the user’s end takes a few seconds. But what if the app fails to collect and process any of this data on time? The app has no value if the data processing isn’t done in real-time.
Traditionally, batch-oriented approaches are used for data processing. However, these approaches are unable to handle the vast streams of data generated in real time. To address these issues, many organizations are turning to stream processing architectures as an effective solution for processing vast amounts of incoming data and delivering real-time insights for end-users.
What Is Stream Processing?
Stream processing is a paradigm that continuously collects and processes real-time or near real-time data. It can collect data streams from multiple sources and rapidly transform or structure this data, which can be used for different purposes. Examples of this type of real-time data include information from social media networks, e-commerce purchases, in-game player activity, and log files of web or mobile users.
As I mentioned in my earlier explanation of stream processing, the main characteristics of data stream processing include:
- Data arrives as an ongoing stream of events
- It requires high throughput processing
- It requires low latency processing
Stream processing can be stateless or stateful. State here refers to the state of the data, that is, how previous data affect the processing of current data. In a stateless stream, the processing of current events is independent of the previous ones. Suppose you are analyzing weblogs, and you need to calculate how many visitors are viewing your page at any moment in time. Since the result of your preceding second doesn’t affect the current second’s outcome, it’s a stateless operation.
With stateful streams, there’s context as current and preceding events share their state. This context can help past events shape the processing of current events. For instance, a global brand would like to check the number of people buying a specific product every hour. Stateful stream processing can help process the users who buy the product in real-time. This data is then shared in a state so that it can be aggregated after one hour.
How Does Stream Processing Work?
Stream processing can use a number of techniques to process unbounded data. It partitions data streams by taking a current fragment so they can become fixed chunks of records that can be analyzed. Based on the use case, this current fragment can be from the last two minutes or the last hour, or even the last 200 events. This fragment is referred to as a window. You can use different techniques to window data and process the windowing outcomes.
Next, data manipulation is applied to data accumulated in a window. This can include the following:
- Basic operations (e.g., filter)
- Aggregate (e.g., sum, min, max)
This way, each window has a result value.
Stream Processing vs. Batch Processing
Batch processing is all about processing batches containing a large amount of data, which is usually data at rest. Stream processing instead works with continuous streams of data where there is no start or endpoint in time for the data. This data is then fed to a streaming analytics tool in real time to generate instant results.
Batch processing requires that the batch data is first loaded into a file system, a database, or any other storage medium before processing can initiate. This doesn’t mean that stream processing cannot deal with large amounts of data. However, batch processing is more practical and convenient if there’s no need for real-time analytics. It’s also easier to write code for batch processing. For example, a fitness-based product company goes through its overall revenues generated from multiple stores across the country. If it wants to look at the data at the end of the day, batch processing is good enough to meet its needs adequately.
Stream processing is better when you have to process data in motion and deliver analytics outcomes rapidly. For instance, the fitness company now wants to boost brand interest after airing a commercial. It can use stream processing to feed social media data into an analytics tool for real-time audience insights. This way, it can determine audience response and look into ways to amplify brand messaging in real time.
Stream Processing Use Cases
The ability of stream processing architectures to analyze real-time data can have a major impact in several areas.
Stream processing architectures can be pivotal in discovering, alerting, and managing fraudulent activities. They go through time-series data to analyze user behavior and look for suspicious patterns. This data can be ingested through a data ingestion tool (e.g., Striim) and can include the following:
- User identity (e.g., phone number)
- Behavioral patterns (e.g., browsing patterns)
- Location (e.g., shipping address)
- Network and device (e.g., IP information, device model)
This data is then processed and analyzed to find hidden fraud patterns. For example, a retailer can process real-time streams to identify credit card fraud during the point of sale. To do this, it can correlate customers’ interactions with different channels and transactions. In this way, any transaction that’s unusual or inconsistent with a customer’s behavior (e.g., using a shipping address from a different country) can be reviewed instantly.
Accenture found that 91% of buyers are more likely to purchase from brands that offer personalized recommendations. Today, businesses need to go the extra mile and improve their customer experience by introducing workflows that automate personalization.
Personalization with batch processing has some limitations. Since it uses historical data, it fails to take advantage of data providing insights into a user’s real-time interactions that are happening at the very moment. In addition, it fails at hyper-personalization since it’s unable to use these real-time streams with customers’ existing data.
Let’s take a seller that deals in computer hardware. Their target market includes both office workers and gamers. With stream processing, the seller can process real-time data to determine which visitors are office workers that need hardware like printers and which are gamers who are more likely to be looking for graphic cards that can run the latest games.
Log analysis is one of the processes engineering teams use to identify bugs by reviewing computer-generated records (also known as logs).
In 2009, PayPal’s network infrastructure faced a technical issue, causing it to go offline for one hour. This downtime led to a loss of transactions worth $7.2 million. In such circumstances, engineering teams don’t have a lot of time; they have to find the root cause of failure quickly via log analysis. To do this, their methods of collecting, analyzing, and understanding data in real time are key to solving the issue. Stream processing architecture makes it a natural solution. Today, PayPal uses stream processing frameworks and recently processed 5.34 billion payments in the fourth quarter of 2021.
Stream processing can improve log analysis by collecting raw system logs, classifying their structure, converting them into a consistent and standardized format, and sending them to other systems.
Sensor-powered devices collect and send large amounts of data quickly, which is valuable to organizations. They can measure a wide variety of data, such as air quality, electricity, gases, time of flight, luminance, air pressure, humidity, temperature, and GPS. After this data is collected, it must be transmitted to remote servers where it can be processed. One of the challenges that occur during this process is the processing of millions of records sent by the device’s sensors every second. You might also need to perform different operations like filtering, aggregating, or discarding irrelevant data.
Stream processing can process data from sensors, which includes data integration from different sources, and perform various actions like normalizing data and aggregating it.
While batch processing has its time and place, it’s usually reserved for processing data at rest. Stream processing instead works with continuous streams of data where there is no start or endpoint in time for the data. This data is then fed to a streaming analytics tool in real time to generate instant insights.
As more and more enterprises turn to data science to compete more effectively, stream processing is likely going to move more into the spotlight. So many applications now rely on real-time data that this progression is nigh unavoidable. While batch processing is going anywhere, we will likely see both augment one another in a mix of applications and use cases. What’s clear is that stream processing has immense potential and a bright future in enterprise data science.