How Kafka and Redis Solve Stream-Processing Challenges
Although streams can be an efficient way of processing huge volumes of data, they come with their own set of challenges. Let’s take a look at a few of them.
1. What happens if the consumer is unable to process the chunks as quickly as the producer creates them? Let’s look at an example: What if the consumer is 50% slower than the producer? If we’re starting out with a 10 gigabyte file, that means by the time the producer has processed all 10GBs, the consumer would only have processed 5GB. What happens to the remaining 5GB while it’s waiting to be processed? Suddenly that 50 to100 bytes allocated for data that still needs to be processed would have to be expanded to 5GB.
Picture 1: If the consumer is slower than the producer, you’ll need additional memory.
2. And that’s just one nightmare scenario. There are others. For example, what happens if the consumer suddenly dies while it’s processing a line? You’d need a way of keeping track of the line that was being processed and a mechanism that would allow you to reread that line and all the lines that follow.
Picture 2: When the consumer fails.
3. Finally, what happens if you need to be able to process different events and send them to different consumers? And, to add an extra level of complexity, what if you have interdependent processing, when the process of one consumer depends on the actions of another? There’s a real risk that you’ll wind up with a complex, tightly coupled, monolithic system that’s very hard to manage. This is because these requirements will keep changing as you keep adding and removing different producers and consumers.
For example (Picture 3), let’s assume we have a large retail shop with thousands of servers that support shopping through web apps and mobile apps.
Imagine that we are processing three types of data related to payments, inventory and webserver logs and that each has a corresponding consumer: a “payment processor,” an “inventory processor” and a “webserver events processor.” In addition, there is an important interdependency between two of the consumers. Before you can process the inventory, you need to verify payment first. Finally, each type of data has different destinations. If it’s a payment event, you send the output to all the systems, such as the database, email system, CRM and so on. If it’s a webserver event, then you send it just to the database. If it’s an inventory event, you send it to the database and the CRM.
As you can imagine, this can quickly become quite complicated and messy. And that’s not even including the slow consumers and fault-tolerance issues that we’ll need to deal with for each consumer.
Picture 3: The challenge of tight coupling because of multiple producers and consumers.
Of course, all of this assumes that you’re dealing with a monolithic architecture, that you have a single server receiving and processing all the events. How would you deal with a microservices architecture? In this case, numerous small servers — that is, microservices — would be processing the events, and they would all need to be able to talk to each other. Suddenly, you don’t just have multiple producers and consumers. You have them spread out over multiple servers.
A key benefit of microservices is that they solve the problem of scaling specific services depending on changing needs. Unfortunately, although microservices solve some problems, they leave others unaddressed. We still have tight coupling between our producers and consumers, and we retain the dependency between the inventory microservices and the payment ones. Finally, the problems we pinpointed in our original streaming example remain problems:
- We haven’t figured out what to do when a consumer crashes.
- We haven’t come up with a method for managing slow consumers that doesn’t force us to vastly inflate the size of the buffer.
- We don’t yet have a way to ensure that our data isn’t lost.
These are just some of the main challenges. Let’s take a look at how to address them.
Picture 4: The challenges of tight coupling in the microservices world
Specialized Stream-Processing Systems
As we’ve seen, streams can be great for processing large amounts of data but also introduce a set of challenges. New specialized systems such as Apache Kafka and Redis Streams were introduced to solve these challenges. In the world of Kafka and Redis Streams, servers no longer lie at the center, the streams do, and everything else revolves around them.
Data engineers and data architects frequently share this stream-centered worldview. Perhaps it’s not surprising that when streams become the center of the world, everything is streamlined.
Picture 5 shows a direct mapping of the tightly coupled example you saw earlier. Let’s see how it works at a high level.
Picture 5: When we make streams the center of the world, everything becomes streamlined.
- Here the streams and the data (events) are first-class citizens as opposed to systems that are processing them.
- Any system that is interested in sending data (producer), receiving data (consumer) or both sending and receiving data (producer and consumer) connects to the stream-processing system.
- Because producers and consumers are decoupled, you can add additional consumers or producers at will. You can listen to any event you want. This makes it perfect for microservices architectures.
- If the consumer is slow, you can increase consumption by adding more consumers.
- If one consumer is dependent on another, you can simply listen to the output stream of that consumer and then do your processing. For example, in the picture above, the inventory service is receiving events from both the inventory stream (Purple) and also the output of the payment-processing stream (Orange) before it processes the inventory event. This is how you solve the interdependency problems.
- The data in the streams are persistent (as in a database). Any system can access any data at any time. If for some reason data wasn’t processed, you can reprocess it.
A number of streaming challenges that once seemed formidable, even insurmountable, can readily be solved just by putting streams at the center of the world. This is why more and more people are using Kafka and Redis Streams in their data layer.
This is why data engineers view streams as the center of the world.
Learn more about how Kafka and Redis deal with these complex data challenges. Download this free book that features 50+ illustrations to help you understand this complex topic in a fun and engaging way.