Decoding Kafka: Why It’s Worth the Complexity
Born out of LinkedIn’s need to handle real-time data streams, Apache Kafka is a distributed event streaming platform with robust and diverse capabilities. It’s an excellent choice for use cases like stream processing, high-performance data pipelines, real-time analytics, log aggregation, event storage and event sourcing.
Numerous companies have embraced Kafka as the backbone of their event-driven architecture, yet others are reluctant to include Kafka in their tech stack. That’s because its steep learning curve and operational complexity can be daunting. This might persuade some organizations to opt for other technologies that are easier to manage. The question is, do simpler alternatives provide the same advantages as Kafka?
The Kafka Advantage
Since its inception over a decade ago, Kafka has matured into the de facto standard for data streaming, because it has the following advantages:
- Scalability — up to trillions of messages per day, thousands of topics split into tens of thousands of partitions and hundreds (or even thousands) of brokers.
- High performance — up to millions of messages and multiple gigabytes of data per second with consistently low latencies (in the single-digit milliseconds range).
- Fault tolerance and high availability — replicas of each partition are maintained across multiple brokers, ensuring no single point of failure. You can even replicate entire Kafka clusters and these replicated clusters can be deployed in different data centers or even different regions.
- Data integrity — guaranteed message ordering (at a partition level), exactly-once semantics and long-term data retention.
- Rich ecosystem — Kafka Streams for stream processing, Kafka Connect for integrations with source and destination systems and client libraries in many programming languages.
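The partition-level ordering guarantee in the list above can be illustrated with a small in-memory simulation. This is only a sketch and does not talk to a real broker: the partition count, the event data and the helper names are hypothetical, and CRC32 stands in for the hash that Kafka's own partitioner uses. The idea it shows is that events sharing a key land in the same partition, so their relative order is preserved, while no order is promised across partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for a single topic


def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition; CRC32 is a stand-in for Kafka's real hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_events(events):
    """Append (key, value) pairs to per-partition, append-only logs."""
    logs = defaultdict(list)
    for key, value in events:
        logs[pick_partition(key)].append((key, value))
    return logs


def events_for(key, logs):
    """Read one key's events back, in the order its partition stored them."""
    return [value for k, value in logs[pick_partition(key)] if k == key]


# Interleaved events for two hypothetical orders.
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]
logs = append_events(events)

print(events_for("order-1", logs))
print(events_for("order-2", logs))
```

Each order's events come back in the order they were produced, whichever partition the key hashes to. This is why choosing a key that groups related events together matters when ordering is important.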
Due to these characteristics, thousands of organizations spanning industries like IT, finance, manufacturing, telecommunications, retail, healthcare, transportation and many others have adopted Kafka as a key technology for handling high-volume, high-frequency data streams.
One example is R3, a company operating in the financial services space. One of R3’s main products is Corda, a distributed ledger technology (DLT) platform that enables you to build financial applications for trade, loans, asset management and insurance. Kafka is one of the technologies R3 used to engineer Corda 5 (also known as Next-Gen Corda).
“When designing the runtime infrastructure for Next-Gen Corda, the primary goal was to achieve a hot-hot, high-availability configuration, with automatic work sharding, maximizing throughput and reducing costs.”
— Divya Taori, senior developer evangelist at R3
Other options were considered before choosing Kafka, including a message bus, Apache Flink and an Akka cluster. However, the selection team concluded that Kafka was the best choice for Corda 5 because it “implements all the required functionality and is widely used in production at scale.” Furthermore, “Kafka’s industry-standard status for high availability and low latency messaging further solidified its suitability for Next-Gen Corda.”
Choosing Kafka as part of Next-Gen Corda’s stack appears to have paid off:
“By leveraging Kafka as the backbone of Corda’s communication infrastructure, Corda 5 achieves the desired high availability, horizontal scalability, and reduced total cost of ownership, ultimately delivering on the rigorous needs of our customers.”
— Divya Taori
Another company that’s relying on Kafka is MoEngage, a customer engagement software provider. Kafka was first introduced in 2016 for a small use case. In time, though, Kafka has become pervasive; nowadays, MoEngage uses Kafka for messaging, stream processing, log aggregation, changelog streams and state management.
MoEngage initially used one large Kafka cluster with very little monitoring. This setup worked well for a while. However, as the organization grew and the volume of data increased, the monolithic cluster became problematic: it introduced a single point of failure, was hard to scale and made it difficult to split load equally across brokers. The MoEngage team ultimately redesigned their Kafka setup, following a multicluster model. It wasn’t an easy task, but it seems it was worth it:
“Our new Kafka setup has brought a lot of reliability into our system. […] we can uphold our SLA commitment to our customers much more than we did with our older cluster, and the kicker is that we can do it with a 20% reduction in costs.”
— Amrit Jangid, data engineer at MoEngage
These were just a couple of examples, but the list of companies that depend on Kafka is much longer. It includes well-known names like LinkedIn, X (formerly Twitter), PayPal, Netflix, Spotify, Uber, Cloudflare, Airbnb, Skyscanner, Slack, Goldman Sachs and more. A good number of organizations have shared how and why they’re using Kafka, at what scale and the benefits they are reaping — I encourage you to check out their experiences.
How Complex Is Kafka?
First off, learning Kafka requires time and dedication. Newcomers might take a few days or weeks to grasp the basics and months to master advanced features and concepts. In addition, you need to constantly monitor and learn from the cluster’s performance as well as keep up with Kafka’s evolution and new features being released.
Setting up your Kafka deployment can be challenging, expensive and time-consuming. This process can take anywhere between a few days and a few weeks, depending on the scale and the specifics of the setup. You may even decide that a dedicated platform team will need to be created specifically to manage Kafka. Here’s a taste of what’s involved:
- Installing multiple Kafka brokers in a cluster, creating topics and partitions, and developing producer and consumer applications. Managing multiple Kafka clusters adds layers of complexity. For example, see how challenging it was for Uber to build a multiregion Kafka infrastructure that provides redundancy and allows for cross-region failover.
- Tuning hundreds of configuration parameters, many of which involve trade-offs. For instance, a higher replication factor enhances data durability, but it also inflates storage requirements. Another example: enabling exactly-once semantics can decrease throughput and increase latency.
- Configuring additional components, such as connectors so you can stream data to other systems, a stream processing component such as Kafka Streams, and either ZooKeeper or KRaft nodes for coordination between Kafka brokers.
- Implementing security, monitoring and testing mechanisms and managing the underlying hardware or virtual machines.
- Continuously monitoring, maintaining and optimizing Kafka post-deployment, which is often more difficult and expensive than all of the above.
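To make the replication-factor trade-off from the list above concrete, here is a back-of-the-envelope storage estimate. It is a rough sketch under stated assumptions, not a sizing tool: the function name and the workload figures are hypothetical, and the calculation ignores compression, index files and headroom for uneven load.

```python
def estimate_storage_gb(messages_per_second, avg_message_bytes,
                        retention_days, replication_factor):
    """Rough cluster-wide disk usage in GB for one workload.

    Assumes uncompressed data and no index or metadata overhead.
    """
    retention_seconds = retention_days * 24 * 60 * 60
    raw_bytes = messages_per_second * avg_message_bytes * retention_seconds
    # Every replica stores a full copy of the retained data.
    return raw_bytes * replication_factor / 1e9


# Hypothetical workload: 10,000 messages/s of 1,000 bytes each, kept for 7 days.
for rf in (1, 2, 3):
    print(f"replication factor {rf}: {estimate_storage_gb(10_000, 1_000, 7, rf):,.0f} GB")
```

Under these assumptions, going from one replica to three raises the estimate from roughly 6 TB to roughly 18 TB. That is the price of being able to lose brokers without losing data.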
To sum it up, Kafka can be hard to host and manage, especially at scale. That said, some misconceptions make Kafka sound more complex than it actually is:
It’s Too Complicated for a Message Broker
Kafka is more than a simple message broker. It offers additional capabilities like stream processing, durability, flexible messaging semantics and better scalability and performance than traditional brokers. While its superior characteristics increase complexity, the trade-off seems justified. Otherwise, why would numerous companies worldwide use Kafka? Some enterprises are migrating from simpler message brokers to the more reliable Kafka, despite the increased sophistication and operational difficulty.
I Have to Use ZooKeeper, Which Complicates Things
Kafka has traditionally relied on ZooKeeper for metadata management and coordination between brokers. However, there’s an ongoing effort to incrementally remove the ZooKeeper dependency and replace it with KRaft, which moves metadata management into Kafka itself. This simplifies Kafka’s architecture and enhances its scalability.
KRaft has been production-ready for new Kafka clusters since Apache Kafka v3.3 (October 2022). With the recent release of Apache Kafka v3.6, it’s even possible to upgrade ZooKeeper-based clusters to KRaft. Meanwhile, ZooKeeper was deprecated in v3.5, and its complete removal is planned for Apache Kafka v4.0.
Kafka Is Only for Java Developers
Kafka is written in Java (and Scala) and it’s beneficial to have at least one developer on your team who is familiar with Java and the JVM. But this doesn’t mean that only Java developers can use Kafka. Quite the contrary — plenty of Kafka client libraries exist in other languages, such as Python, C/C++, Go, .NET, Ruby, PHP and Node.js. These clients allow you to produce, consume and process data in Kafka, as well as integrate and manage Kafka’s ecosystem components.
Kafka Is Only Suitable for Tech Giants
It’s true that large tech companies like LinkedIn, Netflix and Uber leverage Kafka’s capabilities to manage vast amounts of data at scale (and have dedicated teams to do so). But Kafka is just as worthwhile for small and medium-sized enterprises looking to future-proof their backend architecture and make it more efficient, modular and reliable. Plus, if you don’t have the resources and knowledge required to operate Kafka in-house, there’s always the option of offloading Kafka management to one of the many Kafka vendors out there.
When Simpler Isn’t Enough
Considering Kafka’s complexity, you might be tempted to use a simpler event-driven tool instead, such as RabbitMQ (check out this comparison to understand the differences and similarities between the two technologies). But does RabbitMQ provide the same advantages as Kafka? Not quite.
AppDirect, a business-to-business platform for selling technology services, decided to switch from RabbitMQ to Kafka. While RabbitMQ initially performed well, its performance deteriorated when AppDirect moved from a monolith to a microservice architecture and started ingesting data from numerous new sources.
“RabbitMQ became unstable with this surge of data volume and required significant fine-tuning. These changes addressed the scale issue temporarily, but it was not sufficient. With new microservices and new data sources being added, the overall platform latency kept increasing.”
— Abid Khan, senior staff backend engineer at AppDirect
After a seven-step migration process, AppDirect felt the benefits of using Kafka instead of RabbitMQ.
“With Kafka in place, AppDirect is now better positioned to process large volumes of events. The additional tracing and observability system within the new message broker tool will guarantee high availability.”
— Abid Khan
Another example of a company that chose Kafka over RabbitMQ is Livestorm, a provider of web conferencing software. Despite Livestorm developers having more experience with RabbitMQ, and RabbitMQ being a simpler solution, Kafka was preferred due to factors like its huge community, high-quality libraries and superior reliability and throughput. Furthermore, Kafka was considered to be a technology that has the potential to serve Livestorm long term, while RabbitMQ was regarded as more of an interim step:
“A lot of companies decide to go with a RabbitMQ solution because it’s simple to set up and developers understand it quite fast. When it comes to communicating between services, it does that very well. Eventually, as companies grow, they generally shift from this kind of solution to data streams.”
— Laurent Schaffner, principal engineer at Livestorm
Using RabbitMQ for a while and moving to Kafka later down the road would have been problematic:
“[…] when we decide to shift, it’ll be painful and we’ll struggle to get rid of the message queues in place. We’ll have to deal with legacy technology, and it’ll just add complexity to the minds of our developers.”
— Laurent Schaffner
Streamlining Kafka’s Adoption
Not everyone has the time, resources or desire to deal with Kafka’s complexity. But that doesn’t mean they can’t benefit from Kafka’s capabilities. There are vendors out there that simplify the setup, maintenance and use of Kafka deployments.
The most well-known is Confluent. Founded by the creators of Kafka, Confluent offers two products: Confluent Platform and Confluent Cloud. The former is a self-managed Kafka distribution that offers additional capabilities compared to vanilla Apache Kafka. These include Schema Registry, for managing message schemas and for serializing and deserializing data over the network; pre-built connectors for integrating Kafka with various data sources and sinks; ksqlDB, a SQL interface for stream processing; and self-balancing clusters.
Meanwhile, Confluent Cloud is the fully managed, cloud native version of Confluent Platform that abstracts away much of the operational and infrastructure management overhead.
Other Kafka vendors include Amazon MSK, Aiven, Instaclustr, Cloudera, IBM Event Streams, Microsoft Azure Event Hubs and Quix. Each has different strong points. For instance, Cloudera specializes in big data analytics, while Quix excels at serverless stream processing and data pipelines using Python.
It’s also worth mentioning Redpanda, a vendor that provides compatibility with the Kafka API and protocol. You can think of the Redpanda platform as a C++ Kafka clone. For more insights, see how Kafka compares to Redpanda.
Overall, there are plenty of Kafka providers to review and test. There are different factors to consider when making your choice, such as pricing, integrations, features, security and compliance, management tools, the number and location of data centers and vendor lock-in.
The initial learning curve and operational challenges of managing Kafka in-house are steep. However, when you need to reliably handle data streams at scale, simpler alternatives like RabbitMQ often fall short. Kafka brings a mixture of advantages: scalability, high performance, fault tolerance, high availability, data integrity guarantees and a rich, modular ecosystem. These capabilities have been extensively battle-tested over the years and are being used by thousands of companies, from small and medium-sized enterprises to large tech giants. Dealing with Kafka doesn’t have to be a pain. As we have seen, there are a good number of vendors that enable you to reap the rewards of using Kafka, while removing part or all of the complexity associated with it.
Please do check out Quix and let me know what you think. Quix is a Confluent Partner offering a fully managed platform that simplifies the development of Kafka-based event streaming applications with an open source Python client library (Quix Streams). To learn how Quix can help you extract value from streaming data, we’ve made it easy for you to see the Quix platform in action, without an account, with our interactive templates.