Real-Time Stream Processing Apps, Edge Computing, and Kafka
It was 3 a.m.
In a network operations center (NOC), engineers were alert and ready to spring into action as data points flowed across their screens. Occasionally, they lifted their heads to look up at giant monitors populated with colorful dashboards.
No one spoke. After a few minutes, something flashed across the computer screens, and the dashboards lit up. The engineers yelled in excitement, and customer support reached for the hotline.
These engineers belong to a CDN company’s live-streaming command center, set up to monitor live over-the-top (OTT) streams produced by a major sports broadcaster. The CDN company’s objective was to alert the customer to possible live-stream failures before the customer realized a fault had occurred. The dashboard alerts showed TCP (Transmission Control Protocol) retransmits occurring, and the sports broadcaster was thrilled to be alerted a full minute before its own in-house network operations center received the warning.
It might seem magical that real-time alerts like these are even possible for large-scale online events, but for companies using the power of edge computing combined with Apache Kafka, it’s just another routine monitoring event.
So, What Is Apache Kafka?
Apache Kafka is an open source distributed event-streaming platform that enables real-time stream processing applications at the edge. Applications in industrial IoT, telecommunications, healthcare, automotive and other use cases can benefit from Kafka’s high throughput and ultra-low latency.
Working with Kafka requires an understanding of its architecture as well as its terminology. Central to Kafka’s architecture is the Kafka broker, generally configured as a cluster of servers. A Kafka broker receives and stores data from producers, the data sources that publish to it: connected cars, IoT devices in factories, other smart devices, patient health monitors and more.
Topics are where data is written and logically divided: a topic groups a stream of related messages into a named category. Each topic is written to multiple partitions spread across numerous brokers to achieve parallelism, resulting in better throughput and lower latency. A retention time can be configured, after which messages in a topic expire, keeping storage space available for new incoming data.
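The relationship between keys, partitions and retention can be sketched in a few lines of plain Python. This is a toy model, not Kafka’s actual implementation: the function names and the one-week retention value are illustrative, and Kafka’s Java producer hashes keys with murmur2 rather than Python’s `hash()`.

```python
import time

NUM_PARTITIONS = 3
RETENTION_SECONDS = 7 * 24 * 3600  # e.g., a one-week retention window

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition. Kafka's Java producer uses a
    murmur2 hash; Python's built-in hash() stands in for it here."""
    return hash(key) % num_partitions

def expire_old_messages(partition_log, now=None, retention=RETENTION_SECONDS):
    """Drop (timestamp, message) pairs older than the retention window,
    mimicking how a broker reclaims space for new incoming data."""
    now = now if now is not None else time.time()
    return [(ts, msg) for ts, msg in partition_log if now - ts <= retention]

# Messages with the same key always land in the same partition,
# which preserves per-key ordering.
assert partition_for("car-42") == partition_for("car-42")
```

Because the key alone determines the partition, all events for one device stay ordered relative to each other even though the topic as a whole is spread across brokers.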
Kafka consumers read data from one or more topics and process the stream of events producers publish. Producers and consumers are decoupled, which gives producers the freedom to publish data to topics without worrying about whether consumers are consuming them.
Slow consumers are not affected by fast producers. A consumer can come online once, consume messages from a topic and go offline again. Messages are the individual data elements written into log files spread across multiple partitions.
When we write data into a partition, we create a log of data. The log is immutable. It will remain there until its retention time expires. The log allows us to structure and track our data over time as changes occur. Producers can write to the log, and the data can be read by hundreds of different consumers at different points in the log.
Each data point recorded in the log is referred to as an event. What distinguishes Kafka as an event-streaming platform is the continuous delivery of these events by producers and consumers’ continuous processing of these events.
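A minimal sketch of that idea: a toy append-only log in which producers append events and each consumer tracks its own read position independently. The class and method names are invented for illustration; a real consumer would use a Kafka client library.

```python
class PartitionLog:
    """A minimal append-only log: producers append events, and each
    consumer tracks its own read position (offset) independently."""

    def __init__(self):
        self._events = []    # immutable once appended
        self._offsets = {}   # consumer name -> next offset to read

    def append(self, event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # the event's offset in the log

    def poll(self, consumer: str, max_events: int = 10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

log = PartitionLog()
for e in ["connect", "retransmit", "disconnect"]:
    log.append(e)

# Two consumers read the same log at different positions without
# interfering with each other.
assert log.poll("dashboard", max_events=2) == ["connect", "retransmit"]
assert log.poll("alerting") == ["connect", "retransmit", "disconnect"]
```

This is why hundreds of consumers can read the same data at different points in the log: reading never mutates the log, only each consumer’s own offset.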
On the consumer side, multiple consumers read messages from various partitions across brokers, so instead of being read serially, many messages are read in parallel.
This leads to a scalable data pipeline. The parallelism at the producer and consumer ends allows us to scale the system quickly and efficiently, providing the high throughput and low latency environment required by real-time streaming applications.
Kafka brokers use ZooKeeper to manage cluster membership and to elect a cluster controller. (Newer Kafka releases can replace ZooKeeper with the built-in KRaft consensus protocol.)
A resilient Kafka deployment at the edge will consist of a minimum of three Kafka brokers, each running on a separate server, and a ZooKeeper ensemble of three members.
This allows us to replicate each Kafka partition at least three times and have a cluster that will survive the failure of two nodes without data loss.
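A hedged sketch of what one broker’s `server.properties` might contain in such a three-broker deployment. The broker ID, hostnames and exact values are illustrative; the property names are standard Kafka broker settings.

```properties
# server.properties for one broker of three (a sketch; values illustrative)
broker.id=1
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
default.replication.factor=3        # every partition stored on 3 brokers
min.insync.replicas=2               # acked writes survive a broker failure
offsets.topic.replication.factor=3  # consumer offsets replicated too
```

With every partition replicated three times, the cluster retains at least one copy of the data even after two of the three brokers fail.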
Apache Kafka at the Edge
The architecture for Kafka at the edge can be thought of as a three-layer architecture.
- The client layer: consists of IoT devices and the applications running at the customer location (retail stores, coffee shops, hospitals, etc.). IoT devices generate and send data to the Kafka edge layer.
- The edge layer: A small cluster of Kafka servers deployed at the customer’s site or in edge data centers processes data sent by IoT devices and returns results to the applications that use them for monitoring and reporting.
- The cloud layer: Larger Kafka clusters residing at the cloud layer receive data from the edge layer and combine it with data from other customer sites for further aggregation and analysis.
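One reason for this split is network cost: the edge layer can collapse raw device events into compact summaries before forwarding them to the cloud layer. A toy sketch of that pre-aggregation step, with an invented event shape:

```python
from collections import defaultdict

def summarize_at_edge(raw_events):
    """Edge layer: collapse raw per-device events into per-site counts
    before forwarding, so only a small summary crosses the WAN to the
    cloud layer. The event dictionaries here are an invented shape."""
    counts = defaultdict(int)
    for event in raw_events:
        counts[(event["site"], event["metric"])] += 1
    return dict(counts)

raw = [
    {"site": "store-7", "metric": "retransmit"},
    {"site": "store-7", "metric": "retransmit"},
    {"site": "store-7", "metric": "buffering"},
]
summary = summarize_at_edge(raw)
assert summary[("store-7", "retransmit")] == 2
```

Thousands of raw events become a handful of (site, metric) counts, which is the kind of reduction that lowers network costs between the edge and cloud layers.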
Let’s look at an example of how Kafka is used in real-time stream processing applications at the edge.
Monitoring and Alerting for Live OTT Streaming
CDN providers can leverage their base CDN offering to provide their customers with deep visibility, 24/7 support and mitigation for live OTT streams.
A fraction of the live streaming data (~10%) is sent over a special-purpose network of servers spread across edge data centers for monitoring and alerting on stream failures such as TCP retransmits, buffering and latency issues.
Here’s a high-level architectural diagram:
Producers send live event data to Kafka clusters that store this data in multiple topics across several partitions distributed across multiple Kafka brokers.
Real-time stream processing applications like Spark Streaming divide the incoming streams into micro-batches of specified intervals and process them to return a DStream (discretized stream). A DStream is represented by an underlying set of RDDs (resilient distributed datasets).
Each RDD in a DStream contains data from a specified interval.
DStreams are processed, and the results can be pushed to NoSQL databases for further processing, aggregation and reporting. In this case, Spark Streaming calculates aggregates and stores the results in an Apache Cassandra database. An example of an aggregate would be “the total number of retransmits within the last 10 seconds.”
Applications query the Cassandra database and generate alerts if the results of those queries confirm that the metrics being monitored have breached predefined thresholds. Sending live streams to edge data center servers rather than centralized cloud servers can reduce the latency and response times required by real-time monitoring and alerting systems like this.
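The windowed aggregate and the threshold check can be sketched in plain Python. The 10-second window matches the example aggregate above; the threshold value, function names and (timestamp, event_type) shape are invented for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 10
RETRANSMIT_THRESHOLD = 100   # illustrative alerting threshold

def retransmits_per_window(events):
    """Bucket (timestamp, event_type) pairs into 10-second windows and
    count retransmits per window -- the same shape of aggregate the
    stream processor would write to Cassandra."""
    counts = Counter()
    for ts, event_type in events:
        if event_type == "retransmit":
            counts[ts // WINDOW_SECONDS * WINDOW_SECONDS] += 1
    return counts

def alerts(window_counts, threshold=RETRANSMIT_THRESHOLD):
    """Flag any window whose retransmit count breaches the threshold."""
    return sorted(w for w, n in window_counts.items() if n > threshold)

# 110 retransmits in the first 10-second window, one in the next.
events = [(t, "retransmit") for t in range(0, 10)] * 11
events += [(15, "retransmit"), (16, "buffering")]
counts = retransmits_per_window(events)
assert counts[0] == 110 and counts[10] == 1
assert alerts(counts) == [0]   # only the first window breaches 100
```

In production, the counting would run inside the stream processor and the threshold query against Cassandra; the logic, though, is this simple.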
Note: Applications intolerant of latencies in the seconds range can use Kafka Streams instead of Spark Streaming; Kafka Streams processes records one at a time rather than in micro-batches.
Edge computing combined with Kafka can improve processing speed, lower network costs and enable developers to build large-scale real-time stream-processing systems with zero data loss.