Pulsar Takes on Kafka with Uniform Architecture, Speed
Apache Pulsar, an open source stream processing platform, arrived later on the scene than Apache Kafka, but it’s starting to see growth in its community, vendor support and use cases.
This month, cloud database vendor DataStax released Astra Streaming, its scalable, multicloud messaging and event-streaming platform built on Apache Pulsar. Astra Streaming is integrated with the company’s serverless database, Astra DB, and moves the company closer to a full-stack offering that can handle data both in motion and at rest.
Last Thursday’s initial public offering by Confluent, the company founded by three engineers who created Kafka, is further evidence that enterprises are increasingly using proven open source technologies to build business-critical applications. Apache Kafka and Apache Pulsar are both advanced streaming platforms that can meet a wide range of use cases.
Pulsar and Kafka: a Comparison
Apache Kafka, the older project, was originally developed by LinkedIn and was open sourced in early 2011.
Since it’s been around longer, Kafka has a large community around it. For example, Stack Overflow has more than 24,000 questions related to Kafka, and the project has nearly 800 contributors on GitHub.
By comparison, there are just over 400 GitHub contributors for Pulsar.
“Developers are normally lazy, and there’s a huge amount of code copying,” he said. “If my business is trying to solve a problem, is there someone with a similar problem who’s solved the problem before?”
But the Apache Pulsar community is showing strength, which bodes well for its growing adoption. At the 2020 Pulsar Summit, for example, 1,600 global attendees represented more than 300 companies, including American Express, Disney, Google, Microsoft, PayPal and Salesforce.
Originally developed by Yahoo, Pulsar was contributed to the open source community in 2016, and became a top-level Apache Software Foundation project in 2018. Pulsar has some notable architectural advantages over Kafka, which have helped to drive further support and adoption.
Apache Kafka is a more consolidated system, which can make it easier to install, but it also means that processing and storage are tied together. That makes it difficult to add capacity in one independently of the other.
In Kafka, partitions are represented as files on the brokers; in other words, topic persistence is tightly coupled with the broker.
By contrast, Pulsar uses a tiered architecture that splits the message-serving layer from the storage layer. The storage layer is implemented as a distributed ledger using Apache BookKeeper and is completely separate from the broker.
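To make the difference concrete, here is a minimal toy sketch in Python. It is not either project's actual code: the class and function names (`CoupledNode`, `StorageLayer`, `ServingBroker` and the two `move_*` helpers) are hypothetical, and replication is ignored. It only illustrates the idea that when a log lives on the serving node, handing the partition to another node means copying data, whereas with a separate storage layer only ownership changes hands.

```python
from dataclasses import dataclass, field


@dataclass
class CoupledNode:
    """Toy Kafka-style node: serves partitions and stores their logs locally."""
    name: str
    logs: dict = field(default_factory=dict)  # partition name -> list of messages


def move_partition_coupled(partition, src, dst):
    """Move a partition between coupled nodes; returns the number of messages copied."""
    log = src.logs.pop(partition)
    dst.logs[partition] = list(log)  # the data itself has to travel to the new node
    return len(log)


@dataclass
class StorageLayer:
    """Toy stand-in for a separate storage tier (the role BookKeeper plays for Pulsar)."""
    ledgers: dict = field(default_factory=dict)  # topic name -> list of messages


@dataclass
class ServingBroker:
    """Toy serving-layer broker: owns topics but keeps no message data itself."""
    name: str
    owned: set = field(default_factory=set)


def move_topic_tiered(topic, src, dst):
    """Hand topic ownership to another broker; storage is untouched, so 0 messages copied."""
    src.owned.remove(topic)
    dst.owned.add(topic)
    return 0


# Hypothetical data: 1,000 messages on one partition / topic.
messages = [f"event-{i}" for i in range(1000)]

node_a = CoupledNode("node-a", {"orders-0": list(messages)})
node_b = CoupledNode("node-b")
copied_coupled = move_partition_coupled("orders-0", node_a, node_b)

storage = StorageLayer({"orders": list(messages)})
broker_x = ServingBroker("broker-x", {"orders"})
broker_y = ServingBroker("broker-y")
copied_tiered = move_topic_tiered("orders", broker_x, broker_y)

print(copied_coupled, copied_tiered)  # 1000 0
```

In this simplified picture the cost of moving work between coupled nodes grows with the size of the log, while the tiered handoff stays constant. Real clusters add replication, caching and failure handling that the sketch leaves out.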
In Kafka, “its partitions are forever tied to its nodes, which hinders customers’ ability to lower costs with less-expensive resources over time,” wrote GigaOm analyst William McKnight in a June report sponsored by DataStax, a vendor that supports the Pulsar platform. “Kafka nodes can’t easily be added or removed, which means customers frequently must size for peak loads.”
In addition, he asserted, “There’s no easy way to logically separate resources for different users of Kafka. You can’t easily give users of a certain business unit free rein to manage just their resources in Kafka without introducing risks that those users may be able to impact others on the platform.”
For a data-driven company like Gong, which offers an AI-powered sales data analysis platform, “Kafka was the easy choice for its ability to handle and process streams of data efficiently — especially at scale,” wrote Nadav Hoze, a senior software engineer at the company, in a December post on Medium.
But, Hoze wrote, Gong still solves the multitenant issue with a workaround that involves re-partitioning, based on using tenant IDs as intermediate topics.
Pulsar’s architecture avoids the pitfalls of Kafka’s approach, which means it is easier to add, say, storage, without having to rebalance the entire system, said Chris Latimer, vice president of product management at DataStax.
In addition, McKnight’s GigaOm report, which included performance test results of the two technologies, showed Pulsar offering faster throughput and lower latency than Kafka in all scenarios tested.
As a result, Pulsar has been getting some high-profile users, including Splunk, Verizon Media, Chinese internet giant Tencent, cloud computing company Nutanix, and, most recently, online retailer Overstock.
Earlier this year, DataStax acquired Kesque, a startup that had built a cloud messaging service on top of Apache Pulsar, and used the technology to create its own DataStax Luna Streaming product.
And now, with Astra Streaming, DataStax is expanding its support for Pulsar even further.
Vendor backing can make a platform more appealing for enterprise customers, since they can now get support and value-added functionality.
In fact, Splunk is all-in on Pulsar. Its senior director of engineering, Karthik Ramasamy, talked about its use of the technology at the Scale by the Bay conference in 2020. According to Splunk, a single cluster of Pulsar can support many tenants and use cases with seamless cluster expansion without any downtime. It can reach 1.8 million messages in a single partition and offers out-of-the-box support for geographically distributed applications. In addition, it can support millions of topics, making data modeling easier.
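Multitenancy in Pulsar is visible in how topics are addressed: a fully qualified topic name takes the form `persistent://tenant/namespace/topic` (or `non-persistent://...`). The short Python sketch below parses that form without needing a Pulsar client. The `TopicName` type, the `parse_topic` helper and the example tenant, namespace and topic names are illustrative assumptions, and the parser deliberately handles only the fully qualified form.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str      # top-level unit of isolation, e.g. a business unit
    namespace: str   # grouping of related topics within a tenant
    topic: str


def parse_topic(full_name: str) -> TopicName:
    """Split a fully qualified Pulsar topic name into its four parts."""
    domain, sep, rest = full_name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {full_name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {full_name!r}")
    return TopicName(domain, *parts)


# Hypothetical topics belonging to two different tenants on one cluster.
quotes = parse_topic("persistent://finance/quotes/ticks")
clicks = parse_topic("persistent://marketing/web/clicks")

print(quotes.tenant, clicks.tenant)  # finance marketing
```

Because the tenant and namespace are part of every topic's name, policies and permissions can be attached at those levels rather than topic by topic.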
“As users of Apache Pulsar, it is exciting to see DataStax contribute to the Pulsar community and make it easily accessible on a massive scale for companies like ours,” said George Trujillo, vice president of data engineering at Overstock. “DataStax is a trusted partner of Overstock who is advancing the global adoption of Apache Pulsar, enabling our developers to build modern data apps faster with infinite scalability.”
McKnight’s GigaOm report also compared infrastructure costs in three scenarios. In the first, an enterprise with simple linear growth in data streaming, it found a 33% savings in infrastructure costs over a three-year period when using Luna Streaming rather than Kafka. In the second scenario, focusing on peak period workloads, the savings went up to 50%. In the third scenario, with projects that required significant complexity and a high number of topics and partitions, the savings reached 75%. And the savings can go even higher in some use cases.
“Kafka and Pulsar are both streaming solutions,” wrote analyst McKnight. “But with Kafka, you’ll be challenged with throughput, which will result in a potential 81% greater overall cost.”
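The figures above are framed in two different ways: "X% savings" measures the gap relative to the more expensive option, while "Y% greater cost" measures it relative to the cheaper one. The small Python sketch below converts between the two framings. It is plain arithmetic and makes no claim about how the report derived its numbers; the function names are hypothetical.

```python
def premium_from_savings(savings: float) -> float:
    """If option B costs `savings` (a fraction) less than A, return how much more A costs than B."""
    return 1 / (1 - savings) - 1


def savings_from_premium(premium: float) -> float:
    """If option A costs `premium` (a fraction) more than B, return how much B saves versus A."""
    return 1 - 1 / (1 + premium)


# The three savings figures cited above, re-expressed as "greater cost".
for s in (0.33, 0.50, 0.75):
    print(f"{s:.0%} savings  ->  {premium_from_savings(s):.0%} greater cost")

# The "81% greater cost" figure, re-expressed as savings.
print(f"81% greater cost  ->  {savings_from_premium(0.81):.1%} savings")
```

A 50% savings corresponds to the costlier option being 100% more expensive, and a 75% savings to it being 300% more expensive. An 81% greater cost corresponds to savings of roughly 45%.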
Emerging Use Cases
There are other situations where Pulsar offers an advantage over Kafka. For example, there’s the scalability issue, said Latimer. Kafka’s approach is to put message brokers and storage in a single node.
“You can add more brokers, but then you’re going to spend time rebalancing all the data you have on each one,” Latimer said, though the Apache Kafka community has developed tools like Cruise Control that solve this problem.
Latimer drew a distinction between Kafka and Pulsar. “With Pulsar, instead of co-locating computing and storage, we use a distributed system under the covers, so you can scale the brokers and you can scale the storage,” he said. “It’s immediate relief, and you don’t have to do the rebalancing.”
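A rough way to picture the rebalancing point is with a toy calculation in Python. Everything here is an assumption made for illustration: the node names, the 100 GB starting loads, the 1 GB segment size and especially the "write to the least-loaded node" rule, which is a simplification and not BookKeeper's actual placement policy. The first function estimates how much existing data would have to be copied to spread it evenly after a node is added; the second shows new segments simply landing on the new node with nothing moved.

```python
def gb_moved_to_even_out(load_gb: dict) -> float:
    """GB that must be copied so every node ends up holding the same amount."""
    target = sum(load_gb.values()) / len(load_gb)
    return sum(max(0.0, used - target) for used in load_gb.values())


def place_new_segments(load_gb: dict, segments: int, segment_gb: float = 1.0) -> dict:
    """Write each new segment to the currently least-loaded node; existing data stays put."""
    load = dict(load_gb)
    for _ in range(segments):
        target = min(load, key=load.get)
        load[target] += segment_gb
    return load


# Three storage nodes holding 100 GB each, then an empty fourth node is added.
before = {"store-1": 100.0, "store-2": 100.0, "store-3": 100.0}
after_add = {**before, "store-4": 0.0}

# Coupled model: spreading existing data evenly means copying part of it.
moved_coupled = gb_moved_to_even_out(after_add)

# Segment model: the next 60 segments go to the new node, and nothing is copied.
after_writes = place_new_segments(after_add, segments=60)

print(moved_coupled)            # 75.0
print(after_writes["store-4"])  # 60.0
```

In the toy numbers, evening out the coupled cluster means copying 75 GB, while the segment-based cluster starts using the new node immediately and balances out as new data arrives.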
Yahoo!, now Verizon Media, has seen that scalability first-hand, growing from one tenant, Yahoo Finance, and fewer than 100 topics in 2015, to more than 100 tenants and 28 million topics, with a peak of 6 million requests per second.
In the same talk, Joe Francis, who leads the team providing messaging systems for all of Verizon Media, echoed this assessment. “We don’t do any manual operations, we just add hardware. That’s all we do.”
The company has also been moving large use cases from Kafka to Pulsar, Francis added.
“We’re not forcing anyone to go,” he said. “But when it comes time to refresh the stack, they take a look at the choices, and the dollar amount they have to spend and the operational complexity they have to invest in, and they make the choice.”