The Coming Era of Data as Code
(This month, The New Stack examines the management of data in cloud native systems, with a series of news posts, contributed essays and podcasts. Check back often on the site for new content.)
Over the last few years, the defining problem with data management has shifted from accommodating volume to creating flexibility. While we’ll never truly leave the Big Data Era — “Big Data” is now just … data — the evolution of cloud services and new infrastructure like Kubernetes has changed the central question around data from “What do we do with all this?” to “How do we use it?”
In our current microservice-rich, cloud native world, we are developing and deploying distributed applications in containers, each with its own datastore. Data is still “big,” but it also has to be flexible. Like the code that creates the applications it feeds, data must be usable across different environments, shareable, and versionable.
We need a new era of Data as Code.
What Is Data as Code?
Data as Code is an approach that gives teams — from DevOps to DataOps, Data Scientists and beyond — the ability to process, manage, consume, and share data in the same way we do for code during software development. It empowers end users to take control of their data to accelerate iterations and increase collaboration.
The DevOps revolution empowered developers and caused a “shift left” that focused on acceleration and problem prevention, while sprouting a new generation of tools like GitHub, Jenkins, CircleCI, Gerrit, and Gradle that allowed end users to ship software. What comparable tooling do we have for data? What enhanced processes do we have?
Think about the end users in each scenario.
When an application needs to be deployed, a DevOps Engineer simply deploys it via automated pipelines. When they need storage provisioned, they programmatically request it from the cloud provider and attach it to their application. When they need to expose application access across the network, they create a service endpoint and call an ingress gateway.
But what happens when a developer or application owner needs data? The developer asks the DataOps team or hosting application owner for the data. What happens when they need to share that data with colleagues or move it between clouds? They wait for DevOps engineers to help them. What happens when they want to synchronize their datasets across lifecycles? They wait for DevOps engineers to help them.
These processes are largely manual, locking entire workflows into an outdated request-and-wait cycle. Much like a manufacturing line at a factory, these manual processes only work if everyone is available. If one link in the chain is missing, requests stall.
By taking a Data as Code approach, companies can manage data programmatically, set up automated continuous integration and deployment pipelines for data, add the ability to version, package, clone, branch, diff and merge data, and also make it collaborative across different clouds and workspaces — just as they do with their code and deployments.
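To make the version, diff, and merge operations above concrete, here is a minimal sketch in Python. It is not any particular product's API: it assumes a dataset is a plain dictionary of records keyed by a record ID, and the function names (`snapshot_id`, `diff`, `merge`) and the sample records are hypothetical, chosen only for illustration.

```python
import hashlib
import json


def snapshot_id(records):
    """Content-address a dataset: identical records always yield the same ID."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def diff(base, other):
    """List the record IDs added, removed, and changed going from base to other."""
    return {
        "added": sorted(k for k in other if k not in base),
        "removed": sorted(k for k in base if k not in other),
        "changed": sorted(k for k in base if k in other and base[k] != other[k]),
    }


def merge(base, ours, theirs):
    """Three-way merge by record ID; a missing record is treated as None."""
    merged = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:        # both branches agree
            value = o
        elif o == b:      # only "theirs" touched this record
            value = t
        elif t == b:      # only "ours" touched this record
            value = o
        else:             # both branches changed it differently
            raise ValueError(f"conflict on record {key!r}")
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical example: two branches of the same dataset, merged back together.
main = {"u1": {"plan": "free"}, "u2": {"plan": "pro"}}
branch_a = dict(main, u3={"plan": "free"})                  # adds record u3
branch_b = {"u1": {"plan": "pro"}, "u2": {"plan": "pro"}}   # changes record u1
merged = merge(main, branch_a, branch_b)
changes = diff(main, merged)
```

Because the snapshot ID is derived from the content, two teams holding the same records get the same version identifier without coordinating, which is the property that makes cloning and sharing cheap to reason about.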
Flexible Data Means Empowered End Users
Despite our best efforts, data is still largely kept in silos. Some of those silos are monolithic and some are distributed, but they’re still silos. As happens with siloed data — even in modern cloud environments — different teams manage each repository and require different processes to access the data inside.
While we are getting better at connecting systems through APIs, we have added entire DataOps teams whose job is to manage the data pipeline alongside the data user. As much as we try to “jazz it up,” we are still doing ETL (extract, transform, and load).
The way we approach data management fundamentally opposes the way we need to use it today. We don’t need silos, or data and storage admins. Instead, we need to think of data in terms of end-user publishers and subscribers, with a third party that defines regulations, access control lists, and other admin responsibilities while versioning and diffing the data.
Much like what GitHub does for code in developer workflows, taking this approach to data management would allow us to move ownership of data to the app level and make data inherently more mobile and shareable. Most notably, it would empower the people who work with data every day.
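The publisher, subscriber, and third-party roles described in this section can be sketched as a small in-memory class. Everything here is an assumption made for illustration: `DataHub`, its methods, and the team names are hypothetical, and a real system would persist versions and authenticate users rather than trust a string.

```python
from dataclasses import dataclass, field


@dataclass
class DataHub:
    """Hypothetical third party that stores versions and enforces access rules."""
    owners: dict = field(default_factory=dict)    # dataset name -> owning publisher
    readers: dict = field(default_factory=dict)   # dataset name -> set of allowed users
    versions: dict = field(default_factory=dict)  # dataset name -> list of snapshots

    def publish(self, publisher, name, records):
        """Store a new version; the first publisher of a dataset becomes its owner."""
        owner = self.owners.setdefault(name, publisher)
        if owner != publisher:
            raise PermissionError(f"{publisher} does not own {name}")
        self.readers.setdefault(name, {publisher})
        history = self.versions.setdefault(name, [])
        history.append(list(records))
        return len(history)  # 1-based version number

    def grant(self, publisher, name, user):
        """Let the owner add a subscriber to the dataset's access control list."""
        if self.owners.get(name) != publisher:
            raise PermissionError(f"{publisher} does not own {name}")
        self.readers[name].add(user)

    def fetch(self, user, name, version=None):
        """Return the latest version, or a specific one, if the user may read it."""
        if user not in self.readers.get(name, set()):
            raise PermissionError(f"{user} may not read {name}")
        history = self.versions[name]
        return history[-1] if version is None else history[version - 1]


# Hypothetical example: an application team publishes, a data science team subscribes.
hub = DataHub()
v1 = hub.publish("app-team", "orders", [{"id": 1}])
hub.grant("app-team", "orders", "data-science")
v2 = hub.publish("app-team", "orders", [{"id": 1}, {"id": 2}])
latest = hub.fetch("data-science", "orders")
first = hub.fetch("data-science", "orders", version=1)
```

The point of the design is that the subscriber never files a request with an administrator: access is a rule the owner sets once, and every later version is available under that rule.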
Pipelines Aren’t Just for Code
Perhaps nobody feels the pain of outdated data workflows more acutely than data scientists. No other applications are as reliant on data as machine learning and artificial intelligence, but the people who build those apps are stuck using outdated processes. Today, when data scientists build and train models, they share new data with their machine learning colleagues and begin iterating on their model development in tools like Jupyter Notebook, Visual Studio Code, or RStudio. Those models get tweaked and changed, all using copies of the same data. Invariably, the data needs to be modified, or an updated version needs to be requested from the application team.
When that happens, data science teams have to manually keep track of model experimentation against both a model and data version, while also training updated models against the entire data set from scratch. It’s an enormous waste of time and resources.
What if, instead, they were able to build, train, and tune their models and push them toward deployment, completely packaged up so the production DevOps and MLOps engineering teams can simply release via familiar CI/CD pipelines?
We need this shift left in the data equation. Data as Code gives data scientists and machine learning engineers the capability to manage data across any cloud, to collaborate on branches of versioned data sets, and continuously retrain their models by merging differential sets as they gather more inputs, just as DevOps has done for software development.
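As a rough illustration of retraining from a differential set rather than from scratch, the sketch below uses a deliberately trivial "model" (a running mean) whose state can be updated from new rows alone. The names `RunningMean` and `differential` are hypothetical, the data is made up, and the sketch only handles rows added between versions; changed or removed rows would need extra bookkeeping.

```python
class RunningMean:
    """Toy 'model' whose state can be updated from new rows alone."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, values):
        """Fold additional observations into the existing state."""
        for value in values:
            self.count += 1
            self.total += value
        return self

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def differential(previous, current):
    """Rows present in the current version but not the previous one, keyed by row ID."""
    return {k: v for k, v in current.items() if k not in previous}


# Hypothetical example: version 2 of a dataset adds one row to version 1.
v1 = {"r1": 2.0, "r2": 4.0}
v2 = {"r1": 2.0, "r2": 4.0, "r3": 9.0}

model = RunningMean().update(v1.values())   # trained on version 1
delta = differential(v1, v2)                # only the rows that are new in version 2
model.update(delta.values())                # incremental update from the delta

full = RunningMean().update(v2.values())    # full retrain, for comparison only
```

Here the incrementally updated model ends in the same state as the full retrain while touching one row instead of three. Real models rarely update this cleanly, but the workflow is the same: version the data, compute the delta, and train on the delta.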
Democratized Data Means Better Everything
In 2002, Jeff Bezos sent out a company-wide email at Amazon that became known as the “Bezos API Mandate.” It directed that every team in the company interact with one another through interfaces over the network: every piece of data, every function, no matter what. It was a call to organize the company around getting things done and to get rid of the stasis of the request-and-wait mentality.
Software development has undergone a similar reckoning over the past decade due to the DevOps revolution. Now, with the start of the Data as Code era, it’s time to do the same for data management. DevOps Engineers and Site Reliability Engineers no longer rely on request-and-wait, ITIL-style workflows that run through infrastructure administrators, and there’s no reason we can’t do the same for people who work with data every day.
An organization where data access is democratized, where everyone has secure access to shareable data whenever they need it, is an organization where important decisions are made faster and more intelligently. It’s an organization where products get shipped more frequently, at lower cost and higher quality, and where everyone working on those products is empowered to be the best at what they do.
This is the promise of the Data as Code era.
Data as Code will require a complete philosophical realignment in our approach to data management. We’ll have to throw away a lot of current processes and practices in order to reorient them around truly flexible data, but we have the infrastructure available to make this happen. Kubernetes in particular has unlocked the pathways that make Data as Code possible. It’s the future of the application control plane and will be the foundation for the technologies that will create the future data control plane.
We’ve already been through a radical shift in how applications are made. It’s time for another radical shift in the way they’re fed.