In a combination seemingly as natural as Reese’s marriage of peanut butter and chocolate, Cloudera and StreamNative have released a new open source integration between Apache Pulsar and Apache NiFi. The two together create a cloud native, scalable, real-time streaming data platform that can ingest, transform, and analyze massive amounts of data.
“[NiFi is] a really nice way to get data in and out of Pulsar very easily and very fast. It’s a really nice way to be able to build streaming applications very simply with low code or no code,” said Tim Spann, developer advocate at StreamNative and a longtime contributor to the NiFi project.
StreamNative was founded by the original creators of Apache Pulsar. Many of NiFi's creators worked on the technology from its origins at the National Security Agency (NSA) through Hortonworks' acquisition of Onyara in 2015. Cloudera and Hortonworks announced their merger in 2018.
While an open source connector between the two has existed for a while, it wasn’t up to date, Spann said, so he decided it was time to do something about that. The two companies worked with the open source communities to bring the two projects in sync and put the integration through its paces using the existing test cases.
With this update, users can consume and produce messages from Pulsar topics at scale with simple configuration settings within Apache NiFi. Cloudera is making four processors available with its Cloudera Dataflow for Data Hub 7.2.14 and newer.
“Cloudera is putting it out there as the first supported processor from another company, so that’s nice to see,” Spann said. “The NiFi ecosystem’s growing, the Pulsar ecosystem’s growing. It’s nice to see that interaction and overlap between the two projects.”
Apache Pulsar is a distributed messaging and streaming platform originally created at Yahoo! and now a top-level Apache Software Foundation project. Its claim to fame is providing both scalable messaging and scalable streaming. While streaming systems like Apache Kafka can scale, they require a lot of work around data rebalancing, Addison Higham, chief architect at StreamNative, wrote in a blog post for The New Stack.
Pulsar uses a distributed publish-subscribe pattern designed to route messages from one endpoint to another without data loss. At its core is a replicated distributed ledger that provides durable stream storage and can easily scale to retain petabytes of data, making long-term retention of event data feasible.
Pulsar makes scaling easy and provides more flexibility, Higham said in an interview.
“It’s very capable as a streaming system comparable to Kafka, so it can move large amounts of data; it can handle lots of parallelism, but it also has some advantages,” he said.
Pulsar clusters can support millions of different topics, offering organizations more flexibility in the way they use it, he said. Messages might be sent by customers, by users, for example.
“Pulsar’s model actually looks more like a messaging API, so it supports a traditional work queue. You can have as many consumers connected as you would like. And you can get higher throughput for out-of-order processing as well as the flexibility to do traditional messaging and fanout workloads, with a lot of consumers getting their own copy of the message,” he said, explaining that makes it a favored technology among marketing companies.
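The two delivery styles Higham describes can be illustrated with a small in-memory model. This is only a sketch, not the Pulsar client API: the Topic class, its subscribe and publish methods, and the round-robin dispatch are assumptions made for illustration. Consumers that share a subscription name split the messages between them like a work queue, while each separate subscription receives its own copy of every message.

```python
from collections import defaultdict
from itertools import count


class Topic:
    """Toy in-memory topic; a hypothetical model, not the Pulsar client API."""

    def __init__(self):
        # subscription name -> list of consumer inboxes attached to it
        self._subscriptions = defaultdict(list)
        # subscription name -> counter used for round-robin dispatch
        self._cursors = defaultdict(count)

    def subscribe(self, subscription_name):
        """Attach a new consumer to a subscription and return its inbox."""
        inbox = []
        self._subscriptions[subscription_name].append(inbox)
        return inbox

    def publish(self, message):
        """Deliver one copy per subscription, to one consumer in that subscription."""
        for name, inboxes in self._subscriptions.items():
            turn = next(self._cursors[name])
            inboxes[turn % len(inboxes)].append(message)


topic = Topic()
# Work queue: two consumers share the "workers" subscription.
worker_a = topic.subscribe("workers")
worker_b = topic.subscribe("workers")
# Fanout: a separate subscription gets its own copy of every message.
audit = topic.subscribe("audit")

for i in range(4):
    topic.publish(f"event-{i}")
```

In this model the two workers each end up with half of the events, and the audit subscription sees all four. Adding more workers to the shared subscription spreads the load further, at the cost of strict ordering across consumers.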
Rather than having a different cluster for each team, organizations can use one Pulsar cluster and, with NiFi, create a kind of data mesh platform that makes enriched data available to less technical users as well, Higham said.
He describes it as one technology that works across a broad range of different use cases and workloads. And at the same time, it aims to provide a lot of simplicity operationally.
“So [you have] this ability to have millions of topics without degrading performance,” he said.
Its users include Tencent, Verizon Media, Comcast and Overstock. In 2020, Splunk unveiled its Pulsar-based Splunk Data Stream Processor (DSP).
The NSA made NiFi available to the Apache Software Foundation in 2014. It became a top-level project the next year.
NiFi supports powerful and scalable directed graphs of data routing, transformation and system mediation logic.
This visual tool uses flow-based programming, enabling users to construct data flows that automate moving data from one platform to another (databases, cloud storage, messaging systems), making data ingestion fast, easy and secure. It also provides event-level data provenance and traceability, allowing you to trace every piece of data back to its origin.
It takes care of dataflow-management needs including prioritization, back pressure and edge intelligence.
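Back pressure, in general terms, means a slow downstream step can make upstream steps pause rather than letting queued data grow without bound. The following toy model shows the idea with Python's standard bounded queue. The limit of three and the variable names are assumptions for illustration, and the sketch does not reproduce NiFi's internals.

```python
import queue

# Hypothetical connection between two processing steps, capped at three queued items.
connection = queue.Queue(maxsize=3)

accepted, deferred = [], []
for item in range(5):
    try:
        # A full queue tells the upstream step to hold off instead of piling up data.
        connection.put_nowait(item)
        accepted.append(item)
    except queue.Full:
        deferred.append(item)

# The downstream step drains one item, which frees room for one deferred item.
processed = connection.get_nowait()
connection.put_nowait(deferred.pop(0))
```

The upstream loop never holds more than three items in flight. Anything beyond that waits until the downstream side catches up.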
The NiFi platform also includes more than 100 pre-built processors that can be used to perform enrichment, routing and other transformations on the data as it flows from the source to destination.
Why the Combo?
NiFi is focused on making it easy to move data between software systems, rather than doing anything with it long term. Pulsar, meanwhile, was designed to act as a long-term repository of event data and provides strong integration with popular stream processing frameworks such as Flink and Spark.
With NiFi, data can be processed and transformed en route, then routed directly to Pulsar’s durable stream storage for long-term retention and made available for a host of more complex streaming processing and analytics use cases.
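The division of labor described here can be sketched in a few lines of plain Python. Everything in the example is hypothetical: the enrich and route functions stand in for flow steps, the retained dictionary stands in for durable topic storage, and the sensor data is made up. It is meant only to show the shape of the pattern, which is to transform and route each record in flight, append it to a retained log, and let a later consumer replay that log.

```python
import json


def enrich(record):
    """Hypothetical enrichment step: add a Fahrenheit reading to the record."""
    enriched = dict(record)
    enriched["temp_f"] = record["temp_c"] * 9 / 5 + 32
    return enriched


def route(record):
    """Hypothetical routing rule: pick a topic name from the record's content."""
    return "alerts" if record["temp_c"] >= 30 else "readings"


# Stand-in for durable stream storage: topic name -> append-only list of messages.
retained = {"alerts": [], "readings": []}

source = [
    {"sensor": "a", "temp_c": 20},
    {"sensor": "b", "temp_c": 35},
]

for record in source:
    message = json.dumps(enrich(record), sort_keys=True)
    retained[route(record)].append(message)

# A later consumer can replay a topic from the start because nothing was dropped.
replayed = [json.loads(m) for m in retained["alerts"]]
```

Because the log keeps every message, an analytics job that starts long after ingestion still sees the full, already enriched history.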
“NiFi is designed to do integration, and it’s really good in grabbing a lot of sources, letting you do your basic enrichment, transformation, lookup, routing. Pulsar’s great for fast transportation of messaging and a lot of other things,” said Spann. “With the flexibility of Pulsar, once the NiFi messages are in there, so many other options can be used, whether it’s for streaming applications, work use, lots of different styles of messaging.
“And Pulsar also has gateways to a lot of other messaging protocols, which makes it like we’re connecting two gateways together. Once you get data into one or the other system, you can connect pretty much anywhere in the modern data stack. Regardless of what source or sink it is in, between the two of them, you have all the connections you need.”
The integration consists of four processors: two for publishing data to Pulsar (PublishPulsar and PublishPulsarRecord) and two for consuming data from Pulsar (ConsumePulsar and ConsumePulsarRecord). Two controller services are also included: one for creating Pulsar clients and another for authenticating to secured Pulsar clusters.