Data in 2023: Revenge of the SQL Nerds
Yesterday, we cast our predictions for the biggest operational challenges facing the data world in 2023. Today we take on data management.
On one hand, the issue of Data Mesh continues to dominate the data management discussion. Last year, the dialog around data mesh hit a critical mass. The topic was sufficiently new to grab attention, and given the explosion of data that enterprises are accumulating in the cloud, quite timely. Last year, we forecast that the concept would face its first serious scrutiny, and in some quarters, backlash.
If data mesh is an issue that has occupied the foreground, the Data Lakehouse is a concept that emerged in the background over the past year. While it has not drawn the same intensity of discussion as data mesh, in 2022 ecosystem development began to reach critical mass. And for SQL nerds, it provides a form of vindication: while the lakehouse welcomes queries in all languages, it is based on a relational table structure that imposes order on the data lake.
In this post, we offer our predictions on what to expect with these core data management disciplines in the coming year.
Data Mesh: A Work in Progress
Data mesh is the topic that refuses to go away. For the past couple of years running, no posts have drawn more responses than our takes on data mesh. For instance, this LinkedIn post I wrote garnered roughly 10x more responses than our typical post. Clearly, data mesh continues to hit a nerve.
A year ago, I predicted that data meshes would face their first real scrutiny. Anecdotal evidence, both from published content out in the wild and from the comments on my LinkedIn posts alone, indicates significant backlash.
There are good reasons why we’re continuing to have the data mesh discussion. Data continues to accumulate in cloud storage and data lakes, and there is concern about what data is getting in there. We used to call this problem the data swamp.
That concern is the major driver behind data lakehouses, as organizations seek to gain confidence that their data lakes are not chock full of corrupted or obsolete data. The going notion behind data meshes is that the people who know the most about specific tranches of data should not only take formal ownership, but also adopt a product mentality toward managing the lifecycle of that data. The going concern is that, if not properly governed, this could reinforce existing data silos or build new ones.
I believe that, at minimum, teams, business units, or subject matter expert “domains” seeking to embrace data mesh practices need to speak a common language for describing data and data products, and that common language is metadata. Establishing some sort of common metadata backplane, such as through a data fabric (admittedly, a technology that is not yet fully defined), is an important first step.
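To make the "common language" idea concrete, here is a minimal sketch of what a shared metadata descriptor might look like. Everything in it is a hypothetical illustration: the field names, the `DataProductMetadata` and `MetadataCatalog` classes, and the sample products are assumptions, not part of any data fabric or catalog product.

```python
from dataclasses import dataclass

# Hypothetical minimum vocabulary every domain agrees to fill in.
REQUIRED_FIELDS = ("name", "domain", "owner", "description")


@dataclass(frozen=True)
class DataProductMetadata:
    """Shared descriptor that each domain publishes for its data products."""
    name: str
    domain: str
    owner: str
    description: str
    columns: tuple = ()
    tags: tuple = ()


class MetadataCatalog:
    """Toy in-memory stand-in for a common metadata backplane."""

    def __init__(self):
        self._entries = {}

    def register(self, meta):
        # Reject descriptors that leave any agreed-upon field empty.
        missing = [f for f in REQUIRED_FIELDS if not getattr(meta, f)]
        if missing:
            raise ValueError(f"missing metadata: {', '.join(missing)}")
        self._entries[(meta.domain, meta.name)] = meta

    def find_by_tag(self, tag):
        # Discovery works across domains because everyone uses the same fields.
        return sorted(m.name for m in self._entries.values() if tag in m.tags)


catalog = MetadataCatalog()
catalog.register(DataProductMetadata(
    name="auto_policies", domain="automotive", owner="auto-team",
    description="Active automotive policies", tags=("customer", "policy")))
catalog.register(DataProductMetadata(
    name="home_policies", domain="homeowners", owner="home-team",
    description="Active homeowners policies", tags=("customer", "policy")))
```

The point of the sketch is only that discovery across domains becomes a simple lookup once every team describes its products with the same fields.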
While data mesh is about people and processes, technology gets involved because it is the means for scaling human effort. Yet on that front, the misconception that data mesh is itself a technology remains rampant.
In her book, Zhamak Dehghani, who conceived data mesh, used the term “Data Mesh Platform” as shorthand for “Self-Serve Data Platform.” It is unfortunate terminology that adds to the confusion. We hope that the technology message around data mesh will get more clarity this year.
Teams with shared context will be the best candidates for embracing data mesh. That context could come from working in related business areas or domains, or from prior experience of collaboration.
The most likely adopters of data mesh will be diversified organizations that have a mix of subject-specific domains, but also the need to share some common information. Examples could include a property and casualty insurer that offers separate lines of coverage for automotive and homeowners; it will have separate data products that are specific to the respective businesses but shared data products on the journeys of joint customers.
The same could be applied to a diversified travel booking business that includes customer reviews, hotel, airfare, tour, and restaurant bookings; or a life sciences company that specializes in families of related medications.
But the biggest hurdle will be defining the federated governance that will be necessary to provide business domains the autonomy for building and managing the lifecycles of their data products while ensuring conformance with corporate standards, policies, and regulatory mandates regarding data quality, privacy, security, and so on.
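One way to picture federated governance is as two layers of checks: corporate policies that every data product must pass, plus rules that each domain defines for itself. The following is a rough sketch under that assumption. The policy names, the `DataProduct` fields, and the `auto_claims` domain are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str = ""
    pii_columns: list = field(default_factory=list)
    masked_columns: list = field(default_factory=list)
    freshness_hours: int = 24


# Corporate-wide policies (hypothetical): every domain must satisfy these.
def has_owner(product):
    return bool(product.owner)


def pii_is_masked(product):
    return set(product.pii_columns) <= set(product.masked_columns)


GLOBAL_POLICIES = {"has_owner": has_owner, "pii_is_masked": pii_is_masked}

# Domain-level policies (hypothetical): each domain adds its own rules.
DOMAIN_POLICIES = {
    "auto_claims": {"fresh_within_6h": lambda p: p.freshness_hours <= 6},
}


def violations(product):
    """Return the names of all corporate and domain policies the product fails."""
    checks = {**GLOBAL_POLICIES, **DOMAIN_POLICIES.get(product.domain, {})}
    return sorted(name for name, check in checks.items() if not check(product))


stale = DataProduct(name="claims_daily", domain="auto_claims", owner="claims-team",
                    pii_columns=["ssn"], masked_columns=[], freshness_hours=24)
clean = DataProduct(name="claims_hourly", domain="auto_claims", owner="claims-team",
                    pii_columns=["ssn"], masked_columns=["ssn"], freshness_hours=1)
```

In this sketch, the domain keeps the autonomy to tighten its own rules, while the corporate checks apply no matter which domain owns the product.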
There are challenges at the people level: achieving federated governance in practice will be a matter of trial and error and, in many organizations, will require culture change. And in the embrace of data mesh, will teams get carried away proliferating data products to the point where they start overlapping and duplicating each other? How do you define a policy for that? Then there are technology limitations: is the self-service technology for “self-registration” of data products with built-in observability and discoverability ready for prime time? And by the way, whose data catalog will be used, given that most organizations are likely to have multiple catalogs?
That said, 2023 will continue to be a learning process for data mesh. The debate will remain live, and nope, we won’t be able to declare victory 12 months from now and move on.
Data Lakehouse: Revenge of the SQL Nerds
Over the past year, we’ve seen a considerable acceleration in the building of a new commercial ecosystem around data lakehouses. That shouldn’t be surprising given the uptick in Google search activity during 2022.
The data lakehouse, an idea that’s been percolating for roughly five years, is, as the name implies, the hybrid of the data warehouse and data lake. It’s supposed to deliver the best of both worlds: the scale and flexibility of the data lake with the SLAs, repeatability, and mature governance of the data warehouse.
The term was first popularized by Databricks in 2019, and the trigger for data lakehouses was the need to bring ACID transactions to the data lake. The aim was not necessarily to transform the data lake into a transaction database, but rather to build confidence that the data in it is current and not corrupted. The lakehouse imposes a software-defined table structure atop cloud object storage. With that table structure, you get lots of other goodies, such as improved performance that can approach that of data warehouses, and more granular governance that goes down not only to the column level, but also to the row level.
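For readers wondering how a "software-defined table structure" can add transactional guarantees to plain files, here is a toy sketch of one common approach: an ordered commit log sitting next to immutable data files. It is loosely in the spirit of log-based table formats, but it is not the actual protocol of Delta Lake, Hudi, or Iceberg. The `TinyTable` class, the `_log` directory, and the file names are hypothetical, and a local temp directory stands in for cloud object storage.

```python
import json
import os
import tempfile


class TinyTable:
    """Toy log-based table over a directory that stands in for object storage."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, read_version, add=(), remove=()):
        """Try to publish version read_version + 1; return False if it is taken."""
        path = os.path.join(self.log_dir, f"{read_version + 1:08d}.json")
        try:
            # O_EXCL makes file creation fail if the version already exists,
            # so only one writer can claim a given version number.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as handle:
            json.dump({"add": list(add), "remove": list(remove)}, handle)
        return True

    def snapshot(self, version=None):
        """Replay the log up to `version` to get the set of live data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as handle:
                entry = json.load(handle)
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return live


def demo():
    with tempfile.TemporaryDirectory() as root:
        table = TinyTable(root)
        table.commit(-1, add=["part-0.parquet"])
        # Two writers both read version 0 and race to publish version 1.
        first = table.commit(0, add=["part-1.parquet"])
        second = table.commit(0, add=["part-2.parquet"])
        return first, second, sorted(table.snapshot()), sorted(table.snapshot(0))
```

Readers only ever see files listed in a committed log entry, and the losing writer has to re-read the table and retry. That is the sense in which the table layer, rather than the storage layer, provides confidence that the data is current and not corrupted.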
Just as open source commoditized core infrastructure, from the OS to the file system and container orchestration levels, it could have the same effect on data warehouses, with differentiation moving up the stack from the underlying table structure to the analytic engine.
While there are proprietary data lakehouse table structures, introduced by AWS, Oracle, and Teradata, a hive of activity (pun intended) has surfaced on the open source side, where there are three competing formats: Delta Lake, introduced by Databricks, along with community-originated Apache Hudi and Apache Iceberg. The lakehouse has become the newest battleground between Databricks and Snowflake, with Databricks finally open sourcing the remainder of the Delta Lake project and Snowflake going all in on Iceberg.
And support for the lakehouse is coming from the hyperscalers, with AWS and Google Cloud announcing plans for the rollout of support for Iceberg across their analytics portfolios. Down the pike, they will also support Delta Lake and, eventually, Hudi. Keep your eyes on Delta Lake and Iceberg, as they have the head start on building commercial ecosystems. But also keep your eyes on where established names including IBM, Microsoft, Oracle, SAP, SAS, and Teradata place their bets. We expect them to plant their stakes in the ground this year.
While the data lakehouse is intended for all comers, SQL developers and Python programmers alike, the bulk of the appeal is likely to be with the SQL community, as the lakehouse imposes a relational schema on the data lake. In 2023, we expect to see an uptick in lakehouse proofs of concept. In the long run, lakehouses will not replace data lakes, where data scientists need the freedom to roam without the encumbrances of relational table structures for spotting patterns and developing models; neither will they replace specialized data warehouses or data marts created specifically for reporting. But we expect that lakehouses will in the long run co-opt enterprise data warehouses with their support for polyglot data, analytics, and in-database machine learning.