Data Streaming: Where Redpanda Differs From Apache Kafka
Still streaming data like it’s 2011?
Think back to 2011. Netflix was still making the transition from DVDs to streaming. Google+ was still a thing. And you could even occasionally catch someone chatting on a Windows Phone.
The world looked a little different back then for engineers. For one, the kind of high-performance, low-cost hardware we have today was still a ways out: SSDs commonly cost thousands of dollars per TB 10 years ago, versus approximately $200 per TB today for NVMe drives that are 1,000 times faster. Further, the modern cloud native concept of disaggregating compute and storage to improve scalability while reducing costs was not yet commonplace.
Cloud computing infrastructure was much more primitive, too. Today you can go to a hyperscaler and provision a virtual machine (VM) with 225 vCPUs, 20 times larger than what you could get a decade ago from the same cloud provider.
Due to these limitations, the reigning software paradigm of 2011 was to use virtualization to exploit low-cost commodity spinning disks. This was the environment in which the LinkedIn engineering team open sourced its distributed publish-subscribe system, Apache Kafka. Kafka grew like crazy, and it is still the most popular streaming data platform, used by 80% of the Fortune 100 (including Netflix) and ranked among the top 10 most popular developer frameworks.
But the world has changed since 2011. Today’s global, mobile, AI and edge applications can process up to trillions of events per day. They can maintain terabytes of state, and they can run on thousands of cores. In this environment of ultra-intensive data requirements, many organizations are seeking alternatives to legacy Kafka.
Apache Kafka and its various commercial permutations just aren’t enough anymore for many builders, due to design elements like page caching and garbage collection, and their dependency on the Java Virtual Machine (JVM), which inherently limits the performance Kafka can get out of modern hardware.
We are in a new paradigm of software development — rooted in the Raft consensus algorithm, multi-core optimization, and an “everything is async” approach — that is finally catching up to modern hardware.
Redpanda was built from the ground up as a streaming data platform with no virtual memory, no page cache, and a thread-per-core architecture that squeezes all the potential out of today’s superscalar CPUs and network cards. This is software designed for microsecond writes versus the milliseconds of a decade ago.
The Downsides of Kafka
So, what makes Redpanda fundamentally different from Kafka? It all comes down to short-circuiting the inherent limitations of Kafka. Kafka’s limitations in the high-performance, high-throughput, low-latency world are threefold:
Kafka is overly complex to deploy and manage. Kafka is a “componentized solution” with software dependencies like the JVM and Apache ZooKeeper (or ZooKeeper’s successor, KRaft). As a result, running Kafka at scale comes with a lot of operational complexity and often requires teams of consultants.
This becomes an even bigger problem at global scale. Think managing five different components is a pain? Try doing it in a multiregion global cloud deployment, or in a supercluster on-premises with hundreds of thousands of partitions. Yikes.
Kafka was designed for old hardware. As discussed at length already, Kafka’s distributed storage system was designed to exploit low-cost commodity spinning disks. This was a huge advantage in the early 2010s, when storage was the main performance bottleneck, but hardware has evolved — CPUs have more cores and cache; disks are 1,000 times faster and 100 times cheaper; and networks are 10 to 100 times faster.
Kafka is cost prohibitive when running at scale. Kafka complexity is not just an issue for operational agility and performance: it also drives up cost.
Clusters require infrastructure not just for the Kafka brokers, but for the additional components like ZooKeeper and Schema Registry. Add in Cruise Control to handle cluster rebalancing, plus the storage costs for historical data, and infrastructure costs quickly escalate for large production deployments.
The Case for Redpanda
The world needed a new streaming data platform. To meet that need, Redpanda offers cloud native simplicity and ease of use, a performance architecture optimized for modern hardware, and highly cost-effective deployments.
Simplicity. Redpanda is deployed as a self-contained, single binary. A typical Kafka cluster may consist of a set of data brokers, an auxiliary ZooKeeper cluster (or KRaft consensus plane), and separately deployed resources for REST proxy and schema registry services.
By contrast, with Redpanda, schema registry, HTTP proxy, and message broker capabilities are built-in, with no need for JVM, ZooKeeper and KRaft dependencies. This is easier to support and lowers infrastructure costs, whether Redpanda is running on your infrastructure or as a fully managed cloud service.
Redpanda also employs native anti-entropy mechanisms to maintain your cluster in its optimal state through data imbalances and node failures. It intelligently redistributes data partitions, a manual process in Kafka that normally involves writing out the partition reassignments for each topic, or using a separate set of tools to administer the cluster.
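To make the manual Kafka process concrete, here is a minimal sketch of what "writing out the partition reassignments" can look like. It builds a plan in the JSON shape that Kafka's `kafka-reassign-partitions` tool reads from its reassignment file. The topic name, broker IDs, and the simple round-robin placement are assumptions chosen for illustration, not a recommended layout.

```python
import json


def build_reassignment(topic, num_partitions, broker_ids, replication_factor):
    """Build a reassignment plan for one topic using round-robin placement.

    The returned dict follows the JSON layout read by Kafka's
    kafka-reassign-partitions tool. The placement strategy itself is a
    simplification chosen for this example.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(num_partitions):
        # Start each partition on the next broker and wrap around the list.
        replicas = [
            broker_ids[(p + r) % len(broker_ids)]
            for r in range(replication_factor)
        ]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


# Hypothetical topic and broker IDs, for illustration only.
plan = build_reassignment("orders", num_partitions=4, broker_ids=[1, 2, 3],
                          replication_factor=2)
print(json.dumps(plan, indent=2))
```

An operator would repeat this for every topic and then apply and verify the plan by hand, which is the work that automatic rebalancing is meant to remove.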
And because the best thing about Kafka is its robust ecosystem, Redpanda is fully Kafka API-compatible, so it works with the entirety of Kafka streaming apps and tools. It also comes with a developer-first CLI and a simple but powerful web console for visibility into data streams.
The result: Redpanda deploys in minutes, spins up in seconds, and runs efficiently wherever you develop — containers, laptops, x86 and ARM hardware, edge platforms, cloud instances, etc.
Performance. Written from scratch in C++, with a completely different internal architecture than Apache Kafka, Redpanda is designed to keep latencies consistent and low. After more than 200 hours of testing various configurations and permutations, our most recent benchmarking results have confirmed that Redpanda performs at least 10 times faster than Kafka at tail latencies (p99.99). On the same hardware, Kafka simply cannot sustain the same throughput.
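For readers less familiar with tail latencies, the sketch below shows why a figure like p99.99 is reported separately from the median. It uses the nearest-rank percentile definition, which is one common convention and not necessarily the one used in the benchmarks above. The latency samples are made up for illustration.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Fraction keeps values like 99.99 exact and avoids float rounding
    # when computing the rank.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up data: 9,998 fast requests and 2 slow ones (milliseconds).
latencies_ms = [2.0] * 9998 + [250.0] * 2

for pct in (50, 99, 99.99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

In this made-up data set the median and p99 both look healthy, and only p99.99 exposes the slow requests. That is why tail percentiles matter for workloads with strict latency objectives.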
And while Kafka suffers severe performance degradation when running on ARM-based hardware, due to previously documented problems running Java on ARM, Redpanda works on ARM-based hardware with no problem. (Run your own tests to evaluate the performance of a self-hosted Redpanda cluster using our benchmarking guide.)
Further, Redpanda uses an optimistic approach to the Raft consensus protocol for managing its replicated log, giving you sound primitives for configuration and data replication. This provides data safety at any scale, without sacrificing performance.
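As background on how Raft-style replication decides that a write is safe, here is a toy sketch of the majority rule from the Raft algorithm: a leader treats a log entry as committed once a majority of nodes have stored it. This is not Redpanda's implementation, and it leaves out the Raft requirement that the entry belong to the leader's current term. The function name and the sample values are hypothetical.

```python
def commit_index(match_index):
    """Return the highest log index stored on a majority of nodes.

    match_index holds, for every node in the group (leader included),
    the highest log index known to be replicated on that node.
    """
    if not match_index:
        raise ValueError("empty replica group")
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th highest value is present on at least `majority` nodes.
    return ordered[majority - 1]


# Hypothetical three-node group: leader at index 10, followers at 9 and 7.
print(commit_index([10, 9, 7]))         # index 9 is on two of three nodes
# Hypothetical five-node group with two lagging followers.
print(commit_index([10, 10, 8, 5, 3]))  # index 8 is on three of five nodes
```

Because only a majority is needed, one slow or failed node in a three-node group does not hold back acknowledged writes.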
Lower total cost. In this economic climate, costs are top of mind more than ever. Benchmarking that we have carried out on public cloud infrastructure, running both Apache Kafka and a Redpanda cluster for real-world data streaming use cases, found the following:
- Redpanda is three to six times more cost-effective than running the equivalent Kafka infrastructure and team, while still delivering superior performance.
- Redpanda Enterprise brings a number of features designed to make operating clusters easier, with Redpanda’s tiered storage delivering infrastructure savings of between $70,000 and $1.2 million, depending on the workload and size of the cluster. That means infrastructure savings of eight to nine times compared to Kafka.
The Present and Future of Streaming Data
Kafka exploded because it enabled organizations to grow their revenue, engage customers faster, improve security outcomes, power new machine learning models, unify analytics systems, manage data at the edge, and much more. But it’s not 2011 anymore.
These days, engineers and operations teams are challenged to support increasingly demanding service-level objectives and service-level agreements for messages processed per second, cluster availability, low latency, and more for their GBps+ applications. In this environment of ultra-high-performance requirements, new alternatives to Kafka are needed.
Our founder, Alex Gallego, likes to quote one of his advisers: “Sometimes you get to reinvent the wheel when the road changes.” Redpanda is building a platform purpose-built for the present and future of streaming data.
Modern applications are in the middle of a real-time renaissance, and Redpanda ensures that you can accelerate existing streaming data workloads, leveraging compatibility with the Kafka ecosystem, while also bringing net new innovation that lays the foundation for tomorrow’s data-driven apps.