Historical Data and Streaming: Friends, Not Foes

Real-time event streaming has become one of the most prominent tools for software engineers over the last decade. In Stack Overflow’s 2022 Developer Survey, Apache Kafka, the de facto event-streaming platform, ranked among the highest-paying technologies and most-loved frameworks.
Though Kafka was obscure at its outset, there are now countless stories of companies using it at massive scale for use cases like gaming and ride-sharing, where latency must remain incredibly low. Because these examples get the most attention, many people believe event streaming — also called data streaming — is only appropriate for use cases with demanding real-time requirements and not suitable for older, historical data. That thinking, however, is shortsighted and points to a missed architectural opportunity.
Regardless of how fast your business needs to process data, streaming can make your software more understandable, more robust and less vulnerable to bugs — if it’s the right tool for the job. Here are three key factors to think about when you consider adding streaming to your architecture.
Factor 1: Understand Your Data’s Time/Value Curve
How valuable is your data? That’s a trick question. It depends on when the data point happened. The vast majority of data has a time/value curve. In general, data becomes less valuable the older it gets.
Now, older data isn’t commonly mentioned in the same breath as streaming. Why? Until fairly recently, most streaming platforms were designed around relatively small storage capacity. That made sense for their original homes in bare-metal data centers, but it has become an unsound pattern now that nearly everything has moved to the cloud, where object storage offers near-limitless capacity.
Many streaming platforms integrate directly with those object stores and inherit that near-limitless capacity. This matters because it takes forced retention decisions out of the equation: you no longer need to decide how long you can afford to keep data in a stream — you simply keep it as long as it makes sense.
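To make that concrete, here is a minimal sketch using the kafka-python admin client that creates a topic whose data never expires. The broker address and topic name are placeholders, and the remote.storage.enable setting assumes a cluster that has tiered storage configured.

```python
# Minimal sketch: create a topic that keeps its records indefinitely.
# Assumes kafka-python and a broker at localhost:9092; names are illustrative.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

orders_topic = NewTopic(
    name="orders",                        # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    topic_configs={
        "retention.ms": "-1",             # -1 means never delete records by age
        "remote.storage.enable": "true",  # offload older segments to object storage
                                          # (requires a tiered-storage-enabled cluster)
    },
)

admin.create_topics([orders_topic])
```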
One of the most exciting use cases for historical streams is backtesting online machine learning models. Teams often find that when they deploy a trained model to production, they eventually need to change it in some way. But how can they be sure the new model works well? The most reliable check is to test it against all of the historical traffic, and because streaming is lossless, that is exactly what you get.
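As a rough sketch of what that backtest might look like, the snippet below replays an entire topic from the earliest offset and scores two models side by side. It assumes kafka-python; the topic name, broker address and stub models are stand-ins for your own.

```python
# Sketch: replay all historical traffic through two models and compare them.
# Assumes kafka-python; the topic, broker and models are hypothetical placeholders.
import json
from kafka import KafkaConsumer


class StubModel:
    """Stand-in for a real model; replace with your trained models."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, event):
        return event.get("amount", 0) > self.threshold


current_model = StubModel(threshold=100)    # model already in production
candidate_model = StubModel(threshold=120)  # model being evaluated

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the first retained record
    enable_auto_commit=False,       # a backtest should not move any group offsets
    consumer_timeout_ms=10_000,     # stop iterating once the topic is exhausted
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

agreements = total = 0
for record in consumer:
    event = record.value
    agreements += current_model.predict(event) == candidate_model.predict(event)
    total += 1

print(f"Models agreed on {agreements}/{total} historical events")
```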
If your data’s time/value relationship makes sense, streaming is a great way to get value out of both ends of the curve.
Factor 2: Decide on the Direction of Data Flow
In the old days of software engineering, many systems were built around polling — periodic checks to see if something happened. For instance, you might poll a database table every so often to see whether a row was added or changed. This is a recipe for disaster: many things can change between two checks, and you have no way of knowing what all of those changes were.
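To see why, consider this deliberately naive polling loop, sketched against a hypothetical orders table. A row that is updated twice between passes shows up only in its final state, and a row that is inserted and deleted between passes never shows up at all.

```python
# Deliberately naive polling: re-check the table every 30 seconds.
# The database file, table and columns are hypothetical placeholders.
import time
import sqlite3  # stand-in for any SQL database

conn = sqlite3.connect("shop.db")

last_checked = 0.0
while True:
    rows = conn.execute(
        "SELECT id, status FROM orders WHERE updated_at > ?", (last_checked,)
    ).fetchall()
    last_checked = time.time()
    for order_id, status in rows:
        print(f"order {order_id} is now {status}")
    # A row updated twice between passes is seen only in its final state,
    # and a row inserted and deleted between passes is never seen at all.
    time.sleep(30)
```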
Streaming’s superpower is that it forces you to think in terms of lossless, unidirectional dataflows instead of mutable, bidirectional procedure calls. This gives you a simple model for understanding how systems communicate, regardless of whether the data is real time or historical. Instead of polling, you can listen for updates from a system and be guaranteed to see every change, in the order it occurred. For the database example above, change data capture (CDC) has become the standard way to listen for changes.
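By contrast, here is a rough sketch of the push side: consuming change events produced by a CDC tool such as Debezium. The topic name and the envelope fields (op, before, after) follow Debezium’s conventions with the JSON converter, but treat the specifics as assumptions about your particular setup.

```python
# Sketch: listen for every change to a table via CDC instead of polling it.
# Assumes kafka-python and a Debezium-style topic and envelope; names are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "shop.public.orders",             # hypothetical CDC topic for the orders table
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for record in consumer:
    if record.value is None:          # tombstone record emitted after a delete
        continue
    change = record.value["payload"]  # assumes the JSON converter with schemas enabled
    op = change["op"]                 # "c" create, "u" update, "d" delete, "r" snapshot read
    if op in ("c", "r"):
        print("row added:", change["after"])
    elif op == "u":
        print("row changed:", change["before"], "->", change["after"])
    elif op == "d":
        print("row removed:", change["before"])
```

Every change arrives exactly as it happened, in order, with nothing silently collapsed between checks.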
When you think about whether streams are useful for your problem, set aside latency and ask yourself: Does my system benefit from this kind of push model? Are lossless updates important?
Factor 3: Pick an Expiration Strategy
Unbounded, historical streams are great, but there will always come a time when it makes sense to delete your data, perhaps due to GDPR compliance or changes to your business. How do you reconcile streaming’s key primitive — an immutable log of data — with deletion, a mutable operation?
There are two common ways to address this. The first is to implement expiry policies that let the system discard data after a certain period, such as a time-to-live (TTL). A variation on that is compaction, where a record’s older revisions are purged over time, leaving only the most recent value for each key.
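In Kafka terms, both strategies are just a couple of topic settings. The sketch below mirrors the earlier admin example; the topic names and the seven-day retention window are placeholders.

```python
# Sketch: one topic that expires data after seven days (TTL) and one that
# compacts away old revisions, keeping only the latest value per key.
# Assumes kafka-python; topic names and the retention window are placeholders.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

ttl_topic = NewTopic(
    name="clickstream",
    num_partitions=6,
    replication_factor=3,
    topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},  # delete after 7 days
)

compacted_topic = NewTopic(
    name="customer-profiles",
    num_partitions=6,
    replication_factor=3,
    topic_configs={"cleanup.policy": "compact"},  # keep only the newest record per key
)

admin.create_topics([ttl_topic, compacted_topic])
```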
The second is a bit more sophisticated and relies on encryption. An encrypted payload is only useful if you hold the decryption key. In general, deleting a payload’s encryption key is seen as a mistake, but not if you want to ensure no one ever sees that data again! In some systems, intentionally deleting the encryption keys, and only later the encrypted data itself, is a simple way to take data offline.
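Here is a minimal sketch of that idea, often called crypto-shredding, using the cryptography library’s Fernet primitive. The in-memory key store stands in for a real key-management service.

```python
# Sketch of crypto-shredding: encrypt each customer's events with their own key,
# then "delete" their historical data by throwing the key away.
# Assumes the `cryptography` package; the in-memory key store is illustrative only.
from cryptography.fernet import Fernet

key_store = {}  # in production this would be a proper key-management service


def encrypt_event(customer_id: str, payload: bytes) -> bytes:
    key = key_store.setdefault(customer_id, Fernet.generate_key())
    return Fernet(key).encrypt(payload)


def decrypt_event(customer_id: str, ciphertext: bytes) -> bytes:
    return Fernet(key_store[customer_id]).decrypt(ciphertext)


def forget_customer(customer_id: str) -> None:
    # Once the key is gone, every record ever written for this customer is
    # unreadable, even though the bytes may still sit in the stream for a while.
    key_store.pop(customer_id, None)


token = encrypt_event("customer-42", b'{"order": 1, "total": 99.5}')
forget_customer("customer-42")
# decrypt_event("customer-42", token) now fails: the data is gone for good.
```

Because nothing in the stream itself has to be rewritten, this approach fits naturally with an immutable log.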
It’s hard to predict the future of software, but one constant is that there will always be new technologies on the horizon. When you consider streaming for your use case, it’s important to think about these key questions: Is a push model helpful? Is ordered access to older data useful? Is there a simple way to delete old data? If the answer is yes to these questions, you’re investing in streaming technology for the right reasons, and it’s hard to go wrong when you do that.