Ditching Databases for Apache Kafka as System of Record
Databases have long acted as a system of record, with most organizations still using them to store and manage critical data in a reliable and persistent manner.
But times are changing. Many emerging trends are influencing the way data is stored and managed today, forcing companies to rethink data storage and offering lots of paths to innovation.
We store all our data in Kafka, allowing us to cost-effectively and securely store tens or even hundreds of petabytes of data and retain it over many decades.
Instituting this approach not only provided immense flexibility and scalability in our data architecture, it has also enabled lean and agile operations.
In this article, I’ll explain why organizations need to think differently about data storage, describe the benefits of using Kafka as a system of record and offer advice for anyone interested in following this path.
Why Data Storage Requires ‘Outside the Box’ Thinking
A modern flexible data architecture that enables companies to harness data-driven decisions has become more important than ever. And having robust, reliable and flexible data storage is a key component for success.
But the rise of big data, distributed systems, cloud computing and real-time data processing (just a few of the emerging trends mentioned earlier) means traditional databases can no longer keep up with the velocity and volume of data generated every second.
That’s because traditional databases were not designed for this scale, and their rigid structure impedes the flexibility that businesses need from their data architecture.
As an operator of different business-to-business financial trade repositories globally along with complementary modular services, we deal with a ton of data. Our data-streaming-first approach is what differentiates us from our competitors. Our goal: to revolutionize the way the derivatives market and global regulators think about trade reporting, data management and compliance.
This means putting Kafka at the core of our architecture, which enables us to capture events instead of just state. And storing data in Kafka, rather than a database, and using it as our system of record enables us to keep track of all these events, process them and create materialized views of the data depending on our use cases — now or in the future.
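To make the distinction concrete, here is a minimal sketch (with an invented trade model, not KOR's actual code) of what "capturing events instead of state" means: the event log is the source of truth, and the current trade state is just a fold over that log, a materialized view that can be rebuilt at any time.

```python
# Hypothetical sketch: the event log is the record; state is derived.
from dataclasses import dataclass


@dataclass
class Trade:
    trade_id: str
    notional: float = 0.0
    status: str = "NEW"


def apply_event(state: dict, event: dict) -> dict:
    """Fold one event into the materialized view of trade states."""
    trade = state.get(event["trade_id"], Trade(event["trade_id"]))
    if event["type"] == "NEW":
        trade.notional = event["notional"]
        trade.status = "OPEN"
    elif event["type"] == "AMEND":
        trade.notional = event["notional"]
    elif event["type"] == "TERMINATE":
        trade.status = "TERMINATED"
    state[event["trade_id"]] = trade
    return state


# The log is what we keep; the view is derived from it on demand.
events = [
    {"trade_id": "T1", "type": "NEW", "notional": 1_000_000},
    {"trade_id": "T1", "type": "AMEND", "notional": 1_250_000},
    {"trade_id": "T1", "type": "TERMINATE"},
]
view = {}
for e in events:
    view = apply_event(view, e)
print(view["T1"].status, view["T1"].notional)  # prints: TERMINATED 1250000
```

Because the state is only a projection, a different view of the same events can be computed later without touching the log.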
Other trade repositories and intermediary service providers often use databases like Oracle Exadata for their data storage needs, but that approach can be expensive and pose data management challenges. Such systems support SQL queries, yet the real difficulty lies in managing large SQL databases and ensuring data consistency within them.
Being in the business of global mandated trade reporting means you are serving multiple jurisdictions, each with its own unique data model and interpretation. If you consolidate all data into a single schema or model, the task of uniformly managing that becomes increasingly complex. And schema evolution is challenging without a historical overview of the data, as it is materialized in a specific version of the state — further adding to data management woes.
Plus, the scalability of a traditional database is limited when dealing with vast amounts of data.
In contrast, we run Kafka on Confluent Cloud and use its Infinite Storage, which lets us store as much data as we want in Kafka, for as long as we need, while paying only for the storage we use.
While the number of partitions is a consideration, the amount of data you can put in Confluent Cloud is unlimited, and storage grows automatically as you need it without limits on retention time.
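For context, the closest equivalent on self-managed Kafka is setting a topic's retention to infinite; Confluent Cloud's Infinite Storage removes the need to manage this (and the underlying disks) yourself. The topic name and partition count below are illustrative:

```shell
# Self-managed Kafka: retention.ms=-1 disables time-based deletion,
# so records are kept indefinitely on this topic.
kafka-topics --bootstrap-server localhost:9092 \
  --create --topic trade-submissions \
  --partitions 12 \
  --config retention.ms=-1
```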
It allows us to be completely abstracted from how data is stored underneath and enables a cost-effective way to keep all of our data.
This enables us to scale our operations without limitations and to interpret events in any representation that we would like.
Powering the Ability to Replay Data
One of the remarkable advantages of using Kafka as a system of record lies in its ability to replay data, a native capability that traditional databases lack. For us, this feature aligns with our preference to store events versus states, which is critical for calculating trade states accurately.
We receive a stream of deltas, which we call submissions or messages, that contribute to the trade state at a given point in time. Each incoming message or event modifies the trade and alters its current state. If any errors occur in our stream-processing logic, they can result in incorrect state outputs.
If that information is stored directly in a fixed representation or a traditional database, the events leading up to the state are lost. Even if the interpretation of those events were incorrect, there’s no way of revisiting the context that led to that interpretation.
However, by preserving the historical order of events in an immutable and append-only log, Kafka offers the ability to replay those events.
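To illustrate why replay matters, consider a toy example (hypothetical event shapes, not our production logic): a bug in the interpretation logic produces a wrong state, and because the log itself is never modified, replaying the same events through corrected logic repairs the state.

```python
# The log is append-only and immutable; only the interpretation changes.
events = [
    {"trade_id": "T1", "type": "NEW", "notional": 100},
    {"trade_id": "T1", "type": "AMEND", "notional": 150},
]


def interpret_v1(events):
    """Buggy interpretation: amendments are silently ignored."""
    state = {}
    for e in events:
        if e["type"] == "NEW":
            state[e["trade_id"]] = e["notional"]
    return state


def interpret_v2(events):
    """Corrected interpretation: amendments update the notional."""
    state = {}
    for e in events:
        if e["type"] in ("NEW", "AMEND"):
            state[e["trade_id"]] = e["notional"]
    return state


print(interpret_v1(events))  # prints: {'T1': 100}  (wrong state)
print(interpret_v2(events))  # prints: {'T1': 150}  (replay fixes it)
```

With a traditional database storing only the latest state, the 100-to-150 history would be gone, and there would be nothing left to replay.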
Given our business’s regulatory requirements, it is imperative to store everything in an immutable manner. We are required to capture and retain all data as it was originally received. While most databases, including SQL databases, allow modifications, Kafka by design prohibits any changes to its immutable log.
Using Kafka as a system of record and having infinite storage means we can go back in time, analyze how things unfolded, make changes to our interpretation, manage point-in-time historical corrections and create alternative representations without affecting the current operational workload.
This flexibility provides a significant advantage, especially when operating in a highly regulated market where it is crucial to rectify any mistakes promptly and efficiently.
Enabling Flexibility in Our Data Architecture
Using Kafka as a system of record introduces remarkable flexibility to our data architecture. We can establish specific views tailored to each use case and use dedicated databases or technologies that align precisely with those requirements, then read off the Kafka topics that contain the source of those events.
Take customer data management, for instance. We can use a graph database designed specifically for that use case without having our entire system built around a graph database because it’s just a view or a projection based on Kafka.
This approach allows us to use different databases based on use cases without designating them as our system of record. Instead, they serve as representations of the data, enabling us to stay flexible. Otherwise you’re locked into a single database, data lake or data warehouse, which is comparatively rigid and doesn’t allow transforming the data into representations optimized for specific use cases.
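A small sketch of this idea, with invented field names: from one immutable event list, several purpose-built projections can be derived independently, each suited to a different backing store, and none of them is the system of record.

```python
# One log, several projections; field names are illustrative only.
events = [
    {"trade_id": "T1", "jurisdiction": "US", "counterparty": "A"},
    {"trade_id": "T2", "jurisdiction": "EU", "counterparty": "B"},
    {"trade_id": "T3", "jurisdiction": "US", "counterparty": "A"},
]

# Projection 1: key-value view, the kind you might keep in a KV store.
by_trade = {e["trade_id"]: e for e in events}

# Projection 2: aggregate view, the kind you might keep for reporting.
per_jurisdiction = {}
for e in events:
    per_jurisdiction[e["jurisdiction"]] = per_jurisdiction.get(e["jurisdiction"], 0) + 1

# Projection 3: relationship view, the kind a graph database would serve.
edges = {(e["counterparty"], e["trade_id"]) for e in events}

print(per_jurisdiction)  # prints: {'US': 2, 'EU': 1}
```

Dropping or rebuilding any one projection has no effect on the others, because each is computed from the log rather than from another view.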
From a startup perspective (KOR was founded in 2021), this flexibility also allows us to avoid being locked into a specific technology direction prematurely. Following the architectural best practice of deferring decisions until the last responsible moment, we can delay committing to a particular technology choice until it is necessary and aligns with our requirements. This approach means we can adapt and evolve our technological landscape as our business needs evolve and enable future scalability and flexibility.
Apart from flexibility, the use of Schema Registry ensures data consistency so we know the data’s origins and the schema associated with it. Confluent Cloud also allows you to have a clear evolution policy set up with Schema Registry. If we instead put all the data in a data lake, for instance, it gets harder to manage all the different versions, the different schemas and the different representations of that data.
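To illustrate the kind of guarantee an evolution policy gives you, here is a toy version of the BACKWARD compatibility rule: consumers on the new schema must still be able to read data written with the old one, which roughly means any field added in the new schema needs a default. This is a deliberately simplified check, not Schema Registry’s actual implementation.

```python
# Toy backward-compatibility check. Schemas are modeled as
# name -> default mappings, where None means "no default".
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    for name, default in new_fields.items():
        if name not in old_fields and default is None:
            # A new required field: records written with the old
            # schema have no value to supply for it.
            return False
    return True


old = {"trade_id": None, "notional": None}
ok_new = {"trade_id": None, "notional": None, "jurisdiction": "UNKNOWN"}
bad_new = {"trade_id": None, "notional": None, "jurisdiction": None}

print(backward_compatible(old, ok_new))   # prints: True
print(backward_compatible(old, bad_new))  # prints: False
```

A registry enforcing this rule rejects the incompatible schema at registration time, before any producer can write data that downstream consumers would fail to read.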
Kafka as a System of Record: It’s More a Mindset Shift than a Technology Shift
To successfully adopt Kafka as a system of record, a company must foster a culture that encourages everyone to embrace an event-driven model.
This shift in thinking should also extend to the way applications are developed, with stream processing as the default model. Failure to do so will result in a compatibility mismatch. The goal is to help everyone on your team understand that they are dealing with immutable data: once something is written, it can’t simply be changed.
It’s advisable to start with a single team that understands stream processing and the significance of events as a system of proof. By demonstrating the advantages within that team, its members can act as ambassadors to other teams, encouraging the adoption of events as the ultimate truth and of stream processing, with state as an eventual representation.
Watch this on-demand webinar to learn more about how KOR Financial leveraged Kafka and Confluent Cloud to cost-effectively store and secure all data to stay in compliance with financial regulations.