Pulsar is a highly-scalable, low-latency messaging platform running on commodity hardware. It provides simple pub-sub and queue semantics over topics, lightweight compute framework, automatic cursor management for subscribers, and cross-data center replication.
Meanwhile, the 2018 Apache Kafka Report, which surveyed more than 600 users, found data pipelines and messaging the top two uses for the technology. It found growing use with the rise of microservices architectures.
“There is a big overlap in the use cases for the two systems, but the original designs were very different,” said Matteo Merli, one of its creators who have since formed Streamlio, a startup offering a fast-data platform.
Yahoo created Pulsar as a single multi-tenant system as a solution to its problems with multiple messaging systems and multiple teams deploying them.
It was released it as open source in 2016 and entered the ASF incubator in June 2017. For around four years, it’s been used in Yahoo applications Mail, Finance, Sports, Gemini Ads and Sherpa, Yahoo’s distributed key-value service.
“Apache Pulsar combines high-performance streaming (which Apache Kafka pursues) and flexible traditional queuing (which RabbitMQ pursues) into a unified messaging model and API. Pulsar gives you one system for both streaming and queuing, with the same high performance, using a unified API.”
Said Merli: “There are differences between streaming and queuing; there are a lot of use cases where you need one or the other, but most people need both for different use cases.”
A two-layer design is key to Pulsar, Merli said. There’s a stateless layer of brokers that receive and deliver messages, and a stateful persistence layer, with a set of Apache BookKeeper storage nodes called bookies that provide low-latency durable storage.
Pulsar was built on the idea of having strong data guarantees, Merli said. It was designed for shared consumption, while Kafka was not. And Pulsar enables users to configure a retention period for messages even after all subscriptions consume them.
Its layered architecture and segment-centric storage provide key advantages:
- You can scale the brokers or the storage layer independently.
- Since the brokers are stateless, a topic can be quickly moved to other brokers. That opens up an efficient way to balance traffic across brokers.
- Can have multiple consumers on the same partition and you can add as many as you want.
Since no data is stored locally, it eliminates the need to copy partition data when expanding capacity and no rebalancing is required. When a partitioned topic is created, Pulsar automatically partitions the data in an agnostic way to consumers and producers.
The broker sends message data to multiple BookKeeper nodes, which write the data into a write-ahead log and also keep a copy to memory. Before the node sends out an acknowledgment, the log is force-written to stable storage, which ensures retention even if you lose power. Topic partitions can scale up to the total capacity of the whole BookKeeper cluster, and you can scale up a cluster by simply adding nodes.
Since entering the incubator, a key focus has been on making it easier to get started with Pulsar, Merli said.
Version 2.0 of Pulsar was released in June, including a “stream-native” processing capability called Pulsar Functions which enables users to write processing functions in Java or Python for data as it moves through the pipeline. Version 2.2 will be released soon, which will feature interactive SQL querying.
Pulsar provides multiple language and protocol bindings, including Java, C++, Python, and WebSockets, as well as a Kafka-compatible API.