Confluent: Have We Entered the Age of Streaming?
Three years ago, when we posed the question, “Where is Confluent going?”, the company was still at the unicorn stage, but Apache Kafka was already emerging as the default publish/subscribe messaging engine for the cloud era. At the time, we drew comparisons with Spark, which had far more visibility. Since then, Confluent has IPO’ed, while Databricks continues to sit on a huge cash trove as it positions itself as the de facto rival to Snowflake for the data lakehouse. Apache Pulsar has since emerged as a competing project, but has that changed the game? Hyperscalers offer alternatives such as Amazon Kinesis, Azure Event Hubs, and Google Cloud Dataflow, but guess what? AWS co-markets Confluent Cloud, with a similar arrangement with Microsoft Azure on the way.
More to the point, today the question is not so much about Confluent or Kafka, but whether streaming is about to become the norm.
Chicken and Egg
At its first post-pandemic conference last week, CEO Jay Kreps evangelized streaming using electricity as the metaphor, positioning streaming as pivotal to the next wave of apps in chicken-and-egg terms. That is, when electricity was invented, nobody thought of use cases beyond supplanting water or steam power, but then the light bulb came on, literally. So, beyond generalizations about the importance of real-time information, Kreps asserted that the use cases for streaming will soon become obvious, because the world operates in real time and immediate visibility is becoming table stakes.
Put another way, in retrospect, conventional wisdom will ask why we didn’t think of real-time to start off with.
There’s a certain irony here: streaming is just one piece of Kafka, as the crux of the system is really PubSub. And the way Apache Kafka is architected, you can plug in almost any stream processing service; you’re not limited to Kafka Streams. As to PubSub, what’s really new here? The first PubSub implementations predated the dot-com era.
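To make that distinction concrete, here is a minimal, in-memory sketch of the PubSub pattern at Kafka’s core: producers append to a named topic, and any number of independent consumers read from it at their own pace, each tracking its own offset. This toy broker is purely illustrative, a stand-in for Kafka’s distributed, durable log, not a description of its actual implementation.

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory stand-in for a PubSub broker: each topic is an
    append-only log, and each subscriber tracks its own read offset."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> append-only log
        self.offsets = defaultdict(int)   # (topic, subscriber) -> next offset

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not yet seen."""
        offset = self.offsets[(topic, subscriber)]
        messages = self.topics[topic][offset:]
        self.offsets[(topic, subscriber)] = len(self.topics[topic])
        return messages

broker = ToyBroker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})

# Two independent consumers each see the full stream, decoupled from the producer.
print(broker.poll("orders", "billing"))    # [{'id': 1}, {'id': 2}]
print(broker.poll("orders", "shipping"))   # [{'id': 1}, {'id': 2}]
print(broker.poll("orders", "billing"))    # [] -- already caught up
```

The decoupling is the point: the producer knows nothing about who consumes, and each consumer replays the log independently, which is what lets downstream stream processors plug in and out.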
Back in the 1990s, Tibco pioneered a commercial market for technology that capital markets firms had previously been inventing on their own. Vivek Ranadive, Tibco’s CEO and founder, wrote a book about it, The Power of Now, back in the dot-com era. Tibco later refined the message to “The 2-Second Advantage.” The premise was that, even if you didn’t have all the information, having just enough of it two seconds ahead of everybody else should provide competitive advantage in real-time use cases like capital markets tickers; any form of network management (telcos and supply chains come to mind); and — this being the 90s — e-commerce.
Kafka reinvented PubSub messaging for massively distributed scale on commodity hardware, and unlike the Tibco and IBM MQ era, it was open source. OK, Kafka wasn’t the first open source messaging system. But its massive parallelism left RabbitMQ and Java Message Service implementations in the dust. Although not initially conceived for the cloud, Kafka’s scale-out, distributed architecture anticipated it. Apache Pulsar notwithstanding, a critical mass of commercial support and related open source projects made Kafka the de facto standard for PubSub in cloud-native environments. As to the old 2-Second Advantage notion of making decisions before you have all the information, today’s fat pipes and cloud scale make that question academic.
Bottom line? Today, roughly 80% of Fortune 100 companies use Kafka, according to the Apache Software Foundation.
But We’re Talking about Streaming
While PubSub has been the linchpin of Kafka’s success, the overriding theme of the Confluent Current conference last week was heralding the age of streaming. Confluent’s view is that streaming will become the new norm for analytics because the world operates in real time, and so should analytics. It brought out use cases from reference customers like Expedia, for travel pricing; the US Postal Service, for handling requests for free COVID tests from its website; and Pinterest, for real-time content recommendations.
But is every company like born-online outfits like Pinterest or national utilities like the Postal Service? That’s redolent of the older questions of whether every company is a dot com, or whether every company needs what we used to term “big data.” Those are comparisons that Confluent would likely not object to: today it’s unthinkable for any B2C or B2B company not to have an online presence, not to mention that handling many of the “Vs” of big data has become pretty routine with cloud analytics services like Redshift or Snowflake.
Nonetheless, the question of whether streaming has become the new norm was very much up for discussion, and we give Confluent a lot of credit for making room for diverse viewpoints at its event. One hurdle is a matter of perception. In his session, Streaming Is Still Not the Default, Decodable CEO Eric Sammer asked why most analytics systems still run in batch, and rifled through rationalizations such as that specific use cases don’t require streaming. His comments were reinforced by Snowflake engineers Dan Sotolongo and Tyler Akidau in their session on Streaming 101.
But dig deeper and there is fear of the unknown. According to Sammer, streaming technology is still inherently more complex than batch. For now, you still have to orchestrate multiple components, such as message brokers, stream processing frameworks, and connectors; it’s a problem redolent of managing the zoo animals that limited Hadoop adoption. The culprit is not the core Kafka technology but the fact that tooling is disaggregated and too low-level. Architects and developers must juggle low-level concerns such as buffer sizing (which impacts anomaly detection); distinguishing between event time and processing time (to ensure feeds are properly sequenced); and reconciling transactional consistency with external systems.
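The event-time versus processing-time distinction is the kind of subtlety that trips up newcomers, so a toy example may help. The sketch below buckets events into tumbling windows two ways: by when they occurred (event time) and by when they arrived (processing time). It is a deliberately simplified illustration, not how any particular stream processor implements windowing.

```python
from collections import defaultdict

def window_counts(events, window_ms, use_event_time):
    """Count events per tumbling window, keyed by either event time or
    processing time. Each event is a (event_time_ms, processing_time_ms) pair."""
    counts = defaultdict(int)
    for event_ts, proc_ts in events:
        ts = event_ts if use_event_time else proc_ts
        counts[ts // window_ms * window_ms] += 1  # start of the window bucket
    return dict(counts)

# Three events all occur within the window [0, 1000), but the third
# arrives late, at t=1500, due to network delay.
events = [(100, 120), (400, 430), (900, 1500)]

print(window_counts(events, 1000, use_event_time=True))   # {0: 3} -- correct
print(window_counts(events, 1000, use_event_time=False))  # {0: 2, 1000: 1} -- late event miscounted
```

With event time, the late arrival is still credited to the window in which it actually happened; with processing time, it lands in the wrong bucket. Real systems add watermarks to decide how long to wait for stragglers, which is exactly the sort of low-level parameter Sammer argues developers should not have to hand-tune.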
Will Kafka Become Less Kafkaesque?
There’s good news on the horizon. The Apache project is replacing ZooKeeper, the same tool for managing distributed configurations from the Hadoop days, with an internal capability called KRaft (based on the Raft consensus protocol) that is now in early access. Sammer said he would like the Apache project to go further by building its own schema registry. We won’t argue. Just as databases hide a lot of the primitives under the hood, we’d like to see more of that with Kafka. The burden falls on both the open source project and the tooling ecosystem. And we’d like to see Kafka SaaS providers offer serverless options.
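To give a flavor of what dropping ZooKeeper looks like in practice, here is a minimal sketch of a single-node KRaft configuration, assuming a Kafka release where combined broker/controller mode is available. The property names follow the KRaft quickstart, but treat this as illustrative rather than a production recipe:

```properties
# server.properties for a combined broker+controller KRaft node (no ZooKeeper)
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
log.dirs=/tmp/kraft-logs
```

The quorum of controllers replaces the external ZooKeeper ensemble, so metadata consensus happens inside Kafka itself; that is one fewer distributed system to deploy, secure, and monitor, which is precisely the simplification Sammer is calling for.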
Confluent’s growth has not rested simply on Kafka — if it had, its cloud service would likely not be more than doubling year over year while each hyperscaler offers its own managed Kafka service. Like Databricks, Confluent has focused on building an end-to-end platform that unifies what would otherwise be a complex toolchain. It has built its own implementation of Apache Kafka that abstracts and tiers storage, making the service elastic and more economical than the attached block storage that is otherwise the norm. And it has focused on simplifying the user experience for a technology that got its name because the problem was Kafkaesque.
Confluent has made Apache Kafka (which was written in Java) accessible to developers working in Python and other languages; it offers over a hundred out-of-the-box connectors along with its own streaming SQL engine. Announcements at the conference included a new Stream Designer targeted at the no-code/low-code crowd and an enhanced edition of Stream Governance. The governance enhancements encompass a globally available schema registry; a catalog that now includes business metadata; and point-in-time playback for data lineage. In the theme of operational simplification, we would like to see Confluent simplify the setup for observability; for instance, customers should not have to stand up their own time series database clusters just to track performance. Confluent states that, for most customers, this capability would currently be overkill. We say this out of tough love; Confluent has made important strides toward making Kafka enterprise-ready.
Cut to the Chase: Is Streaming Ready for Prime Time?
Confluent, Snowflake, and the hyperscalers have helped demystify and lower the barriers to what we used to term “big data.” Heck, even Cloudera has helped us forget about the Zoo Animals. Yes, there are still operational complexities that are unique to streaming, but then again, we said the same thing about analyzing large volumes and varieties of data. It’s a learning process.
The larger question is whether streaming will cross the chasm, or jump the shark, in the enterprise. We view analytics as a spectrum of requirements depending on the use case, and the answers aren’t necessarily cut and dried. For instance, if much of your business is online, that doesn’t necessarily dictate that it must be driven in real time. Probably the biggest example is large, high-value capital goods, where orders are not likely to spike the way a flashy, long-awaited must-have mobile device might. On the other hand, even if the product or service your organization delivers is not subject to minute-by-minute swings, chances are some aspects of your business are. For instance, that large capital goods manufacturer is likely to see rapid, transient outliers in its supply chain that could impact long-term planning.
Our answer to Confluent’s call to action? Streaming is not the shiny new thing, but it is one of the new pieces of the puzzle that most organizations will need to run their business.