
13 Reasons Why Enterprises Should Use Apache Pulsar

If you need data streaming and message queuing, you have options, but they're not all equally suited to scaling needs. Here are the advantages that open source Apache Pulsar offers.
Mar 1st, 2023 7:40am by

If you’re in need of a platform to handle data streaming and message queuing, you have a few options. If you’re an enterprise looking to scale, however, things can rapidly get more complicated.

Apache Pulsar, for instance, was designed from the beginning to handle both streaming and message queuing. Apache Kafka, by contrast, is the de facto standard for streaming, and it excels at delivering high-volume, publish-subscribe (pub-sub) messages to multiple consumers.

RabbitMQ and ActiveMQ, on the other hand, are great for the competing consumer use case, where publishers send messages into a topic, but only one set of applications consumes each message. It is easy to scale up and scale down the number of consumers on open source RabbitMQ or ActiveMQ.

Implementing the same scenario on Kafka is trickier because while you can use it for competing consumer use cases, it works on the partition-per-consumer level. If you want to increase the number of consumers, you have to increase the number of partitions on your topic. If you want to decrease the number of consumers, you can do that, but you can end up with extra partitions that you don’t need.

This makes it hard to simply add competing consumers. Consumer rebalancing can also kick in on Kafka, and things can get complicated after a while.

Pulsar, by contrast, is not only built for streaming but is also really good at message queuing. With a shared subscription, you can add and remove consumers automatically, without any change to the topic. New consumers simply join the round-robin distribution of messages, making the process seamless.

If a topic is essentially empty and you have all these idle applications, you can scale them down, saving money on the machines and application resources you need to run. Consolidating streaming and queuing on Pulsar also simplifies operations: you can run a single platform, cutting costs further.
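The shared-subscription behavior described above can be sketched as a toy model in Python. This simulates the round-robin dispatch only; it is not the Pulsar client API:

```python
from collections import defaultdict
from itertools import cycle

class SharedSubscription:
    """Toy model of a Pulsar shared subscription: consumers can join or
    leave at any time, and messages are dealt out round-robin among the
    consumers currently attached -- no topic or partition changes needed."""

    def __init__(self):
        self.consumers = []              # consumer names, in join order
        self.delivered = defaultdict(list)

    def add_consumer(self, name):
        self.consumers.append(name)

    def remove_consumer(self, name):
        self.consumers.remove(name)

    def dispatch(self, messages):
        # Round-robin over whoever is subscribed right now.
        rr = cycle(self.consumers)
        for msg in messages:
            self.delivered[next(rr)].append(msg)

sub = SharedSubscription()
sub.add_consumer("app-1")
sub.add_consumer("app-2")
sub.dispatch(["m1", "m2", "m3", "m4"])   # app-1 gets m1, m3; app-2 gets m2, m4

sub.add_consumer("app-3")                # scale up: no topic change required
sub.dispatch(["m5", "m6", "m7"])         # now dealt across all three consumers
```

Note how scaling consumers up required no change to the topic itself, which is exactly what Kafka's partition-per-consumer model makes awkward.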

Apache Pulsar is especially suited to solving the unified streaming and messaging challenge. We’ll explore 13 more reasons why Apache Pulsar is a must-have for any enterprise. But first, let’s define the relationship between data in motion and data at rest.

Data in Motion and at Rest

Data in motion is data that is actively being transferred or processed, such as streaming data or messages sent and received in real time. Examples include low-volume message queuing data, high-volume streaming data, web clickstreams, Internet of Things (IoT) sensor data, weather and traffic data, vehicle telemetry, business transactions, microservices interactions, airline reservations, bank transactions, point-of-sale records and more.

Data at rest is data stored in a persistent storage system and not currently being used or transmitted, such as flash storage or cloud storage. Event streaming and messaging systems often involve the real-time transfer and processing of data between different systems or applications, which is then stored as data at rest for further analysis or use.

Apache Pulsar can handle data in motion and offers cost-effective solutions for data that is not frequently accessed — data at rest. Read on as we explore what makes Pulsar ideal for enterprises dealing with both kinds of data.

What Is Apache Pulsar?

Apache Pulsar is an open source, distributed messaging system that was developed by Yahoo and donated to the Apache Software Foundation in 2016. It provides a pub-sub model for the efficient exchange of messages between components of a system in real time.

It also offers a shared log abstraction, allowing it to serve as both a message broker and an event broker. Through message deduplication and transaction-based acknowledgments, Pulsar can guarantee delivery to subscribers at least once (“at-least-once” semantics) or exactly once (“effectively-once” semantics), even in the presence of failures or network issues.

Apache Pulsar offers transaction-based “publish and subscribe” for added assurance. This ensures that the message is either consumed by at least one application or not at all. This is especially useful when data consistency is paramount and all consumers must receive the same data.
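The effectively-once guarantee rests on deduplication by producer sequence ID: the broker remembers the highest sequence ID it has persisted per producer and drops retries it has already seen. A minimal sketch of the idea, not Pulsar's actual broker code:

```python
class DedupTopic:
    """Toy model of Pulsar-style message deduplication: remember the highest
    sequence ID stored per producer and silently drop retries that arrive
    with an ID already seen (effectively-once persistence)."""

    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer name -> highest sequence ID stored

    def publish(self, producer, seq_id, payload):
        if seq_id <= self.last_seq.get(producer, -1):
            return False     # duplicate retry: acknowledged but not re-stored
        self.last_seq[producer] = seq_id
        self.log.append(payload)
        return True

topic = DedupTopic()
topic.publish("orders-svc", 0, "order-1")
topic.publish("orders-svc", 1, "order-2")
topic.publish("orders-svc", 1, "order-2")   # network retry: dropped
```

After the retry, the log still contains each order exactly once, even though the producer sent "order-2" twice.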

Here are 13 reasons to use Apache Pulsar in the enterprise:

1. Pulsar Has Kubernetes- and Cloud-Ready Architecture.

One of the key components of its architecture is Apache BookKeeper, used for storing and managing messages produced and consumed by Pulsar. It provides a high-performance, fault-tolerant, and scalable storage layer, allowing Pulsar to handle large amounts of data with low latency and high throughput.

BookKeeper is a distributed write-ahead log (WAL) system that allows Pulsar to use multiple independent logs, called ledgers. This enables the creation of multiple ledgers over time for different topics, enhancing the scalability and performance of the system.

Additionally, BookKeeper offers a range of features, such as data durability and replication, essential for maintaining data integrity and availability in Pulsar.

A Pulsar broker is a key element of Pulsar’s architecture. It receives and routes messages and handles consumer state, topic and namespace management. It acts as a go-between to BookKeeper, connecting to the ledgers to read and write messages. The BookKeeper storage nodes (bookies) keep write-ahead state on a journal disk and ledger data on a ledger disk, so the brokers themselves can remain stateless.

Helm charts are available for easy deployment of Pulsar clusters on Kubernetes. Helm, a package manager for Kubernetes, simplifies the installation, upgrading, and managing of Pulsar clusters.

The Pulsar proxy is also vital and enables connection to Pulsar clusters and traffic forwarding to the corresponding Pulsar Broker. It assists with load balancing, routing, and service discovery, and also facilitates scaling and managing Pulsar clusters.

These features make Pulsar stand out as a choice for cloud and Kubernetes deployments. Its ability to separate storage and compute, together with Helm charts and the Pulsar proxy, makes deploying and managing Pulsar clusters easier for cloud native, Kubernetes-native distributed systems.

2. It’s Easy to Modernize Legacy Applications.

There’s no need to rewrite applications written for messaging systems like RabbitMQ or JMS. For example, Fast JMS for Apache Pulsar provides a drop-in replacement for JMS, turning your Pulsar cluster into a JMS 2.0-compliant broker. This makes it easy for existing JMS applications to swap their previous JMS broker for Pulsar.

Moreover, built-in support for long-term storage and tiered storage allows economical retention and replay of large volumes of data. This is especially helpful for applications that need to store historical data for compliance or analytics.

3. It Can Lower the Total Cost of Ownership.

Apache Pulsar is designed to have a lower total cost of ownership (TCO) compared to other messaging and streaming platforms. As seen in a 2021 GigaOm report, Luna Streaming (Pulsar) can save up to 81% in TCO compared to Apache Kafka.

Splunk has also reported that, by replacing Kafka with Pulsar, capital expenditure costs became one and a half to two times lower than with Kafka, and operating expenditure costs were two to three times lower, due to Pulsar’s efficient, layered architecture.

These cost savings and efficiency benefits demonstrate the power of Pulsar’s capabilities. It is important to take into account that the TCO may differ according to the use case, the magnitude of the deployment and the storage solutions employed.

Pulsar’s features, such as multitenancy, tiered storage, load balancing, and integration with different storage solutions can help organizations attain cost savings and improved efficiency in their messaging and streaming deployments.

4. Multitenancy Is No Problem.

Pulsar enables multiple tenants to use the same cluster, each with its own set of assets. This proves advantageous when many teams or associations need to employ the same messaging system. Each tenant can have their own topics, subscriptions, and access control.

Some enterprises have constructed entire multitenancy wrappers on top of Kafka, using topic-name prefixes to guarantee that multiple groups can share a cluster without interfering with each other; such wrappers must be built and supported internally. With Pulsar, multitenancy comes for free.
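Pulsar bakes tenancy into the topic name itself: topics are addressed as `{persistent|non-persistent}://tenant/namespace/topic`, so team isolation is part of the naming model rather than an internal convention. A small illustration of how such a name decomposes:

```python
def parse_topic(topic):
    """Split a fully qualified Pulsar topic name into its parts.
    Pulsar topics are scoped as scheme://tenant/namespace/topic, so
    isolation between tenants is built into the addressing model."""
    scheme, rest = topic.split("://", 1)
    tenant, namespace, name = rest.split("/", 2)
    return {"scheme": scheme, "tenant": tenant,
            "namespace": namespace, "topic": name}

# The finance tenant's payments namespace owns this topic outright;
# no ad hoc prefix convention is needed to keep teams apart.
info = parse_topic("persistent://finance/payments/transactions")
```

Access control, quotas and storage policies can then be attached at the tenant or namespace level rather than invented per team.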

5. Pulsar Can Handle High Performance.

Pulsar has been demonstrated to handle millions of messages per second in benchmark tests. It can support up to a million topics or more, making it ideal for high-throughput applications such as real-time analytics and stream processing.

6. Latency Is Consistently Low.

Pulsar is designed for low-latency messaging, with a key design goal of having a producer latency of less than 10 milliseconds, making it ideal for real-time applications. By keeping a limited number of messages in memory and using a log-structured storage layer, Pulsar can minimize the time it takes to deliver messages.

7. Tiered Storage Helps Save Costs.

Apache Pulsar has a tiered storage architecture that separates the compute and storage tiers. The compute tier handles message processing, while the storage tier persists the messages. This enables significant cost savings by taking advantage of different storage resources for different data types.

Pulsar offers multiple storage choices, such as on-disk storage for high speed and minimal lag, and off-disk for extended storage. It offers an additional layer of storage, by storing information in various cloud services like Amazon S3, Google Cloud Storage, and Azure Blob Storage.

The tiered system permits configuring separate storage policies according to data retention, message rate and other factors. This way, frequently used data can be kept in on-disk storage, while rarely accessed data can be moved to off-disk storage such as cloud storage, reducing the amount of expensive storage needed for infrequently used data.

Pulsar’s tiered storage can also move data between tiers based on access patterns, optimizing storage cost and performance. This makes it possible to take advantage of cost-effective cloud storage services, such as Amazon S3, Google Cloud Storage and Azure Blob Storage, which are more economical than SSD/NVMe disks.
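A size-based offload decision, similar in spirit to Pulsar's offload threshold, can be sketched as follows. This is a simplification for illustration, not Pulsar's actual offloader:

```python
def plan_offload(segments, threshold_bytes):
    """Given sealed segment sizes (oldest first, in bytes), pick which
    segments to offload to cheap cloud storage so the hot on-disk tier
    stays under the threshold -- mirroring a size-based offload trigger."""
    total = sum(segments)
    offload = []
    for i, size in enumerate(segments):
        if total <= threshold_bytes:
            break                # hot tier is back under budget
        offload.append(i)        # offload the oldest segment first
        total -= size
    return offload

# Segments of 40, 40 and 30 MB with a 60 MB hot-tier budget: the two
# oldest segments get offloaded, leaving 30 MB on fast disks.
to_offload = plan_offload([40_000_000, 40_000_000, 30_000_000], 60_000_000)
```

Because BookKeeper seals ledgers as immutable segments, old segments can be copied to object storage and served from there without rewriting the topic.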

8. Geo-Replication Reduces Risk.

Pulsar facilitates the automatic replication of data across multiple data centers in different regions, making disaster recovery easier and minimizing the effects of outages. It can be managed through Pulsar CLI or REST API and supports active/standby and active/active topologies.

A shared global configuration store is also available, eliminating the need to manually propagate data across data centers.

9. It’s Easy to Scale.

Pulsar is designed to handle a large number of concurrent connections and process a high throughput of messages. It does this by using a segment-based architecture, where each topic is divided into multiple segments and each segment is replicated across multiple brokers. This allows for horizontal scaling of the system, as more brokers can be added as needed.

Brokers in a Pulsar cluster are stateless, with no persistent data or state maintained. This makes scaling a Pulsar cluster up or down easy, by simply adding or removing brokers.

Pulsar also automatically balances the load across the brokers, with “sharding” helping to ensure optimal performance. This feature distributes topic partitions among available brokers, and when a new broker is added, the partitions are reassigned based on the number of brokers in the cluster.

This automatic load balancing, coupled with the ability to easily add or remove brokers, makes it easy to scale a Pulsar cluster to meet the changing needs of an application or service.
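The effect of adding a broker can be illustrated with a simple round-robin assignment. This is a toy model; Pulsar's real load manager weighs actual broker load, not just counts:

```python
def assign_partitions(partitions, brokers):
    """Spread topic partitions across brokers round-robin, the way a
    simple load manager might; adding a broker changes the assignment
    automatically because brokers hold no persistent state."""
    return {p: brokers[i % len(brokers)] for i, p in enumerate(partitions)}

parts = [f"orders-part-{i}" for i in range(6)]

# Two brokers: three partitions each.
before = assign_partitions(parts, ["broker-1", "broker-2"])

# Add a third broker: partitions rebalance to two each, with no data
# migration needed since the messages live in BookKeeper, not the broker.
after = assign_partitions(parts, ["broker-1", "broker-2", "broker-3"])
```

The key point is the last comment: because storage lives in BookKeeper, reassigning a partition to another broker moves ownership, not data.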

10. Pulsar Functions and IO Connectors Enable Real-Time Analysis.

Apache Pulsar features two powerful components for real-time processing and analysis: Pulsar Functions and Pulsar IO connectors. Both components are integrated into Pulsar’s CLI and API, enabling easy deployment, management, and monitoring.

Pulsar Functions are a serverless computing framework used for tasks such as filtering, transforming, and enriching data in real time. Written in popular programming languages like Java, Python, and Go, they can be deployed and managed through the Pulsar Functions API.
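In the Python SDK, a Pulsar Function is essentially a class with a `process(input, context)` method. The sketch below mimics that shape without the SDK import so it runs standalone; deployed in a cluster, the class would extend `pulsar.Function` and be wired to input and output topics:

```python
class EnrichFunction:
    """Sketch of a Pulsar Function: a class exposing process(input, context).
    This standalone version skips the SDK import; in a real deployment it
    would extend pulsar.Function and receive a live context object."""

    def process(self, input, context):
        # Filter out empty events; enrich the rest with a marker suffix.
        if not input:
            return None                  # returning None drops the message
        return input.upper() + "|processed"

fn = EnrichFunction()
result = fn.process("click-event", context=None)   # transformed event
dropped = fn.process("", context=None)             # filtered out
```

The same filter/transform/enrich pattern scales from this trivial example to real pipelines, since the framework handles consumption, delivery and retries around `process`.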

Pulsar IO connectors provide pre-built integration of Pulsar with external sources, such as databases, message queues, and file systems, and with external sinks like data warehouses, and analytics platforms.

Pulsar also supports Change Data Capture (CDC), which allows tracking and streaming the changes occurring in a database. CDC in Pulsar is delivered through Pulsar IO connectors, and the resulting change streams can be processed with Pulsar Functions.

For example, the CDC feature of Pulsar can be used to stream the data changes from Snowflake in real time and perform transformations, analytics, and routing using Pulsar Functions.

If an application is using a Kafka connector for Snowflake to stream data, the same connector can be used with Pulsar with minimal changes. Through its Kafka protocol compatibility layer, Pulsar supports the same client APIs and protocols as Kafka, so existing Kafka clients can be connected to Pulsar without modification.

Furthermore, Pulsar has an IO connector for Snowflake that facilitates easy integration. With the Snowflake connector, data can be ingested from Snowflake and streamed to Pulsar topics, or data can be streamed from Pulsar topics to Snowflake.

11. It Has a Built-In Schema Registry.

Apache Pulsar includes a built-in schema registry that supports Avro and JSON schemas. Avro is a data serialization format that provides a compact binary representation of data and a way to describe its structure. JSON is a lightweight data-interchange format that is simple for humans to read and write, and straightforward for machines to parse and generate.

Avro schema support in Pulsar enables the encoding and decoding of Avro data in a compact binary format, which is useful for minimizing network bandwidth and storage space. It also allows for defining a schema for the data in Avro format and registering it in the schema registry.

JSON schema support in Pulsar allows for encoding and decoding JSON data in a human-readable format. It also permits defining a schema for the data in JSON format and registering it in the schema registry.

The schema registry in Pulsar ensures that the data produced by producers is compatible with the data consumed by consumers. When a producer sends data to a topic, it attaches a schema to the data, which describes the structure of the data, such as the fields and their data types. This schema is then used by the consumer to deserialize the data and access the fields.
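The kind of guarantee a schema gives can be illustrated with a minimal, JSON-style field-and-type check. This is an illustration only, not Pulsar's registry logic:

```python
def conforms(record, schema):
    """Minimal sketch of what a schema check buys you: every field declared
    in the (JSON-style) schema must be present in the record with the
    declared type before a producer's message is accepted."""
    types = {"string": str, "int": int, "float": float, "bool": bool}
    return all(
        field in record and isinstance(record[field], types[t])
        for field, t in schema.items()
    )

order_schema = {"id": "int", "customer": "string", "total": "float"}

ok = conforms({"id": 7, "customer": "acme", "total": 19.99}, order_schema)
bad = conforms({"id": "7", "customer": "acme"}, order_schema)  # wrong type, missing field
```

With the registry enforcing checks like this at publish time, a producer cannot silently emit records that downstream consumers are unable to deserialize.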

In a microservice architecture, data can be produced and consumed independently by different microservices. However, a change to a schema on a microservice can break multiple downstream microservices that rely on that data, making it difficult to ensure that the data produced by producers is compatible with the data consumed by consumers. This can lead to data incompatibility issues without a schema registry.

12. Pulsar Offers Multiprotocol Support.

Apache Pulsar has integrated support for multiple protocols, including the Kafka protocol, MQTT and MQTT-over-WebSocket. This makes Pulsar a multiprotocol messaging system, allowing it to work with different kinds of clients and use cases.

A key feature of Pulsar’s multiprotocol compatibility is “Kafka on Pulsar” (KoP), which enables Pulsar to act as a substitute for Kafka. KoP enables Kafka-native clients to connect to Pulsar using Kafka’s protocol and API, and also benefit from Pulsar’s features such as multitenancy, tiered storage, automatic load balancing, and schema registry.

In addition to Kafka protocol, Pulsar also supports MQTT and MQTT-over-WebSocket protocols. This enables the system to accommodate a wide range of IoT and mobile use cases and easily integrate with existing MQTT and MQTT-over-WebSocket clients and libraries.

13. And Finally: It’s Open Source.

Apache Pulsar is open source software governed by the Apache Software Foundation, free to use, modify and distribute. Businesses can adopt Pulsar without incurring any licensing fees, and the community can contribute directly to the system’s growth.


To sum up, Apache Pulsar’s features make it an incredibly effective and flexible event and messaging system that can be applied to a wide range of enterprise applications.

Its advantages include improved resource utilization, cost-effective data storage, faster data access, greater availability, and more flexibility when it comes to data organization, all of which can be beneficial for many enterprise use cases.
