What Is Data Streaming?
I didn’t know a lot about real-time data streaming before attending Confluent’s Current 2022 conference in October. I was invited at the last minute, and it almost felt like getting dropped into the fire. It’s always a little intimidating, but equally gratifying, to sit down with experts to discuss their field and have the privilege of bringing their knowledge to others. Today the topic is real-time data streaming.
When I got home from Current 2022, I was so hyped up about how everything is data that my friend looked at me like I was crazy. But really, everything is data. Data is a boring, blanket term used to describe very important and (sometimes) interesting information. Data is unique to individuals and corporations but equally important to all. Data is the comments in your LinkedIn news feed, the credits and deposits in your bank account, the last photo of your grandma saved on your iPhone. Data is the reason why you have a password.
But What Is Data Streaming?
Streaming isn’t new, even though it’s considered the bleeding edge of technology. It has actually been around for about 10 to 15 years. The name of the game back then was fraud detection. Eric Sammer, CEO of data streaming engineering service Decodable, said of the original use case in an interview with The New Stack, “the go [was] fraud detection. If I can detect fraud, I can stop it.”
Now it seems like streaming is everywhere: GPS, moment-by-moment flight tracking, high-frequency trading firms, online banking. I remember when Lyft started. It was 2013 and I was living in LA. One day I was calling a cab company to come pick me up; the next day I was sitting next to someone in a car with a mustache on its front grille. There are also use cases that impact daily life that I don’t think about.
For example, in-session product recommendations in retail. If you’re actively shopping for something online, the additional recommendations likely come from real-time streaming data.
Sammer explains how it happens now: “They’re actually going to show you recommendations based on what you searched for in the previous page load.” He goes one step further and explains the reasoning: real-time streaming data makes recommendations “more relevant because they are timely” for someone who is actively shopping for something.
Supply chain logistics, including container movement, also use real-time streaming data, something that I, a regular consumer, don’t think about. COVID put a strain on the supply chain as well as on labor (that part I do think about).
Sammer offers an example: “Why are there no Apple laptops in San Francisco? When are they or when’s the next shipment arriving?” In the past, it was OK for that information to be delayed 24 to 48 hours, but not anymore. “Those are the kinds of things that we’re solving today with customers is like, you know, making that real-time because it has an actual value to like somebody downstream of it.”
Microservices and machine learning are two major use cases for data streaming that won’t be covered in this article. Other machines are a large consumer of real-time data, which is partly why streaming tech stacks differ so greatly from the familiar tech stacks of other data systems.
The Challenges of the Streaming Stack
Real-time data streaming isn’t a super-fast batching system; that is micro-batching. Micro-batching is the fastest version of batch, where each batch runs in a matter of seconds or minutes. Streaming is a completely different stack in which each record is processed as it arrives. The steep learning curve of completely new tech is a large obstacle to streaming adoption.
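To make the distinction concrete, here is a minimal sketch in plain Python rather than in any real streaming framework. The function names and the payment amounts are made up for illustration, and the batches are cut by record count only to keep the example short; real micro-batch systems typically cut batches by time interval.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group events into small batches; nothing is emitted until a batch fills."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


def micro_batch_totals(events: Iterable[int], batch_size: int) -> Iterator[int]:
    """Running total that is updated once per batch."""
    total = 0
    for batch in micro_batches(events, batch_size):
        total += sum(batch)
        yield total


def streaming_totals(events: Iterable[int]) -> Iterator[int]:
    """Running total that is updated once per event."""
    total = 0
    for event in events:
        total += event
        yield total


payments = [5, 10, 20, 40, 80]  # hypothetical payment amounts
print(list(micro_batch_totals(payments, batch_size=2)))  # [15, 75, 155]
print(list(streaming_totals(payments)))                  # [5, 15, 35, 75, 155]
```

Both versions end at the same total. The difference is when a downstream consumer, such as a fraud check, gets to see each intermediate value.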
Sammer compared learning the new stack to the move from on-premises to the cloud a few years ago.
“Building on-prem software was weird when it moved to the cloud, okay, because like, the kinds of technologies changed, right was that file or was [Amazon Web Services’ S3 bucket] and like, that works differently? Yeah.
“It wasn’t that it was like launching rockets into space. But it meant, like a bunch of applications couldn’t just port they weren’t a drop-in replacement and so part of this is an education challenge.”
Looking at the vastness of today’s data ecosystems, it’s fair to say there are large providers and main ecosystems that the smaller, more niche groups fit into and work with. Everyone has a place, but there is certainly a set of best practices.
But streaming today is what big data and Hadoop were in 2010: mastered by some and a great unknown to others. Chen Qin, an engineering manager at Pinterest, shared a little insight with The New Stack on what the learning curve might look like for streaming adoption: “[Engineers] need to learn how to use Kafka. They need to learn how to use a streaming framework like Flink. They need to learn how to do RPC calls to different services.”
And that just covers the open source tools. The conversation then moved to how to mold and query the data: “You need to manage your schema. What does [the] request look like?” A few other schema and data shape specifics followed, and then the infrastructure and tooling specifics.
Static Versus Dynamic Data
Once that was covered, Qin began talking about the differences between static and dynamic data. It’s a lot, especially when none of the challenges are abstracted away. In a more casual explanation, Sammer referred to the streaming stacks as “weird” in comparison to the more familiar tech stacks.
Sammer echoes the sentiment that the education has been hard. He even goes one step further and wants the streaming community to meet the larger tech community in the middle in terms of stacks and education. Sammer says, “The real-time world has been trying to teach people about it. I actually think that we’ve hit the limit on how much we can teach. I think what we have to do is change the way these systems work [and make them] sympathetic to the user. In order for streaming to be the primary way data is processed, streaming will need to be accessible to a larger number of engineers so that it isn’t Ph.D. level stuff anymore.”
This is why Decodable chose SQL as its language: to bank on the tech community’s pre-existing understanding of SQL.
Ryanne Dolan, Senior Staff Software Engineer at LinkedIn, also gave a nod to SQL while speaking to The New Stack, saying, “the trend today is to abstract both batch and streaming behind the same constructs, the same language, something like SQL.”
But as of now, streaming has no counterpart to popular big data companies like Snowflake or Databricks. There is room for one, and I imagine one will emerge in the next few years. Sammer hopes it will be his company, Decodable, which currently fills that role for its customers, explaining, “And so we at Decodable think that, you know, customers don’t want a collection of open source projects.”
Sammer thinks that streaming will progress forward faster once there is an “agreed upon” stack that greatly softens the learning curve.
Late-Arriving Data
Late-arriving data is any record that arrives outside of the real-time window it belongs to.
Dolan shared that people are surprised when they hear about it for the first time, and I’ll admit that I was too. But when you think about it, he adds, and I agree, “it’s not unfathomable when you’re processing trillions of events a day it’s not it’s not unfathomable, that one might get missed.”
That’s not the only example of late-arriving data though. Sammer gave the example of someone who performed a lot of tasks on their smartphone, turned it on airplane mode for a flight, then all the tasks were completed once the flight landed. What does that mean for the data?
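One way to picture the airplane-mode case is to give every event two timestamps: when the user acted and when the server heard about it. The sketch below is a toy illustration in plain Python, not any product’s API. The `Event` fields, the 60-second window, and the sample data (a “flight” shrunk to a couple of minutes) are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    name: str
    event_time: int    # seconds since start: when the user acted
    arrival_time: int  # seconds since start: when the server received it


WINDOW_SECONDS = 60  # assumed window length


def window_start(timestamp: int) -> int:
    """Return the start of the fixed-size window that contains the timestamp."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def count_by(events: List[Event], key: Callable[[Event], int]) -> Dict[int, int]:
    """Count events per window, using whichever timestamp `key` selects."""
    counts = Counter(window_start(key(e)) for e in events)
    return dict(sorted(counts.items()))


events = [
    Event("like", event_time=5, arrival_time=6),
    Event("comment", event_time=20, arrival_time=190),  # queued in airplane mode
    Event("share", event_time=45, arrival_time=191),    # queued in airplane mode
    Event("like", event_time=185, arrival_time=186),
]

print(count_by(events, key=lambda e: e.arrival_time))  # {0: 1, 180: 3}
print(count_by(events, key=lambda e: e.event_time))    # {0: 3, 180: 1}
```

The same four events tell two different stories. Counting by arrival time puts the activity after landing; counting by event time puts it before takeoff, but only if the system is willing to reopen a window it thought was finished.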
Let’s take a step back and consider how this is handled in batch because late-arriving data isn’t a new problem.
Dolan gave a great analogy. The best way to think about late-arriving data in streaming vs. late-arriving data in batch is to consider a statically typed programming language vs. a dynamically typed programming language. “You know, you can look at it at a high level and say, Oh, I don’t have to worry about types in this language. Well, the types are there. You just don’t see them in every line of code,” he said.
He underscored that this is a persistent challenge: “If you don’t, if you don’t believe that, then try running your batch job at any time of the day. [People will say] no, we’ve only run this batch job at this time. It has to finish before this time,” he said. I can attest to this, as I have absolutely run into this problem on more than one occasion.
Batch has outwardly expressed rules. Streaming deals with windows, theory, and high-level concepts, and that’s where it gets complicated. Think about this: Apache Spark uses watermarks to handle late-arriving data, keeping the state of each window in memory so that records that arrive late can still be aggregated into an accurate, updated result. Apache Flink can handle late-arriving data too, but what if the earlier result already went to the CFO in a PDF revenue report? What does the update mean then? It’s complicated.
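The watermark idea can be sketched in a few lines of plain Python. This is a toy model, not Spark’s or Flink’s actual API. The class name, the 10-second window, and the 10-second allowed lateness are assumptions chosen for illustration. The watermark here is simply the newest event time seen so far minus the allowed lateness, and a record is dropped if its window ended at or before the watermark.

```python
from collections import defaultdict

WINDOW = 10            # window length in seconds (assumed)
ALLOWED_LATENESS = 10  # how far behind the newest event a record may lag (assumed)


class WindowedSum:
    """Sum amounts per event-time window, tolerating a bounded amount of lateness."""

    def __init__(self) -> None:
        self.sums = defaultdict(int)  # window start -> running sum
        self.max_event_time = 0       # newest event time seen so far
        self.dropped = []             # records that arrived too late to count

    @property
    def watermark(self) -> int:
        return self.max_event_time - ALLOWED_LATENESS

    def add(self, event_time: int, amount: int) -> bool:
        """Add a record; return False if its window is already behind the watermark."""
        window_start = event_time - (event_time % WINDOW)
        window_end = window_start + WINDOW
        if window_end <= self.watermark:
            self.dropped.append((event_time, amount))
            return False
        self.sums[window_start] += amount
        self.max_event_time = max(self.max_event_time, event_time)
        return True


revenue = WindowedSum()
revenue.add(1, 100)   # on time, lands in window [0, 10)
revenue.add(12, 50)   # on time, lands in window [10, 20)
revenue.add(3, 25)    # late, but window [0, 10) is still open: the sum is updated
revenue.add(35, 10)   # moves the watermark forward to 25
revenue.add(4, 5)     # too late: window [0, 10) is behind the watermark

print(dict(revenue.sums))  # {0: 125, 10: 50, 30: 10}
print(revenue.dropped)     # [(4, 5)]
```

The sketch shows the mechanics only. It says nothing about the harder question of what to do when the 100 for the first window has already been reported and the correct figure turns out to be 125.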
There’s no real answer though because streaming is dynamic, constant, and ever-changing. In terms of late-arriving data, the question isn’t always “what time did it happen” but more of “when is this event complete?”
That’s because different streaming events require different results. A corporation’s 2022 financial records come from real-time transactions, and so do the streaming click records of an email campaign, but the two results are not the same. One eventually has a calendar cutoff and can land you in hot water if you miss something major; the other, well, does not. I polled a few people on their thoughts on late-arriving data, and everyone had a different answer. Everyone I asked qualifies as an expert in the field.
A question I have been wondering about since Current 2022: are the use cases pushing the marketplace forward, or were they always there, and the wider adoption of streaming now allows them to come out into the open? I never really thought a lot about data or real-time data, but post-Current 2022 I have been thinking about it a lot.
There were a lot of people at Current 2022 who had opinions on streaming vs batch and what will be the primary way to transfer data in the next three, five, or 10 years. There were a large number of people there who want to see streaming as the go-to way to process data and they are working hard to get the industry there, but it won’t happen right away. There is so much room for expansion, growth, and knowledge. I will definitely be watching over the next few years.