The Game-Changing Appeal of Data Mesh
This article is the first in a four-part series.
Data shapes the modern organization from top to bottom, so much so that a voracious appetite for data often forms the starting point of nearly every business decision.
But as our data-driven ambitions have soared, the architecture for the way important business data is stored, accessed and used across an organization hasn’t kept up.
The so-called democratization of data has largely failed to live up to its promise. Data is still hard to access and often is just a “reach in and grab it for yourself” sort of thing. This has led to a form of data anarchy.
That’s where the data mesh comes in.
If you’ve been anywhere near this site in the past year or so, you’ve probably bumped into the concept of data mesh. It was developed more than a year ago by Zhamak Dehghani, a technology consultant at Thoughtworks, to correct what she saw as major flaws in the way data is generated and consumed in today’s business world.
Data mesh is the latest phase of an ever-evolving process to more intelligently access and use data to make better strategic decisions and serve our customers better. I believe it is designed not only to become a key part of the business intelligence process, but also to serve operational processes.
Broadly, it’s a strategic and tactical construct for designing a more reliable data platform by closing the gap between the operational and analytical planes of each business domain, rejiggering both how data is produced and how it’s consumed. It pulls in ideas from domain-driven design (used to develop microservices), DevOps (automation and self-service infrastructure) and observability (logging and governance) and applies them to the data world.
The data mesh is a formulation of important principles that, when followed, fundamentally change the way organizations produce, use and distribute data. This article is the first of a four-part series designed to lay out the need for data mesh and then advise on how you must adjust your thinking and workflow to make it happen. It provides an outline for starting your own data mesh project, from covering the basic ideas to running a prototype system in your organization.
So … What Is It?
Data is now generated continuously at almost every point in an organization. This has led to widespread event stream processing (ESP), the practice of taking action on a series of data points that originate from a system that never stops generating data. (“Event” refers to each data point in the system, and “stream” refers to the ongoing delivery of those events.)
An event records something business-related that has happened in the organization, such as a user registration, a sale, an inventory change or an employee update. These events are then organized sequentially into a stream, which facilitates their ongoing delivery.
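To make the two terms concrete, here is a minimal sketch in Python of an event and an append-only stream. The class names, fields and the in-memory list standing in for a real streaming platform are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Event:
    """One business fact, e.g. a sale or a user registration (hypothetical shape)."""
    event_type: str
    payload: dict[str, Any]
    occurred_at: datetime


class EventStream:
    """An ordered, append-only log of events (in-memory stand-in, not a real broker)."""

    def __init__(self) -> None:
        self._log: list[Event] = []

    def append(self, event: Event) -> int:
        """Add an event to the end of the stream and return its offset."""
        self._log.append(event)
        return len(self._log) - 1

    def read_from(self, offset: int) -> list[Event]:
        """Return every event at or after the given offset, in arrival order."""
        return self._log[offset:]


stream = EventStream()
stream.append(Event("user_registered", {"user_id": 1}, datetime.now(timezone.utc)))
stream.append(Event("sale", {"user_id": 1, "amount": 25.0}, datetime.now(timezone.utc)))
```

Because events are only ever appended, a reader can return to any offset and replay what happened in the same order.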
Event streams are updated as new data becomes available, and their data can be generated by any business source: sales, streaming video and audio, and text data, to name just a few. ESP enables all forms of operational, analytical and hybrid information to be pooled, whether that information arrives in structured or unstructured form. Event streams play an essential role in most data mesh implementations.
At many organizations, that steady stream of data from all these various systems is poured into a data lake, a repository of information stored in its natural/raw format, or into a data warehouse, which combines and stores data from disparate sources. From there, a team of data analysts cleans up the information so it can be used by different people and in many different contexts.
Merging these petabytes of information into a single system means, theoretically, that insights develop faster. Those insights might lead to analytics that predict future events based on patterns in the data or, as another example, to enrichment that combines data sources to create more context and meaning.
A typical data warehouse has many sources spread across a company, with varying levels of quality. There will be many ETL (extract, transform, load) jobs running in different systems and pulling data sets back to the central warehouse. The analytics teams spend much of their time cleaning up and fixing the data; extracting and loading take up the rest.
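For readers who have not written one, a single ETL job might look roughly like the sketch below. The CSV sample, the field names and the plain list standing in for the warehouse are all hypothetical, chosen only to show where the cleanup effort lands.

```python
import csv
import io

# Hypothetical raw export from one source system: stray whitespace,
# inconsistent casing and a row with a missing amount.
RAW_SALES_CSV = """order_id,customer,amount
1001, Alice ,25.00
1002,BOB,
1003,carol,12.50
"""


def extract(raw):
    """Extract: read rows from the source export as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Transform: normalize names, parse amounts and drop unusable rows."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            # The central team has to guess what a blank amount means;
            # this sketch simply drops the row.
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(amount),
        })
    return cleaned


def load(rows, warehouse):
    """Load: append the cleaned rows to the central store."""
    warehouse.extend(rows)


warehouse = []  # stand-in for the central warehouse table
load(transform(extract(RAW_SALES_CSV)), warehouse)
```

Note that the transform step is where the judgment calls sit, and in this model they are made by a team far from the system that produced the data.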
The data warehouse model is a system designed to be scalable, reliable and durable, but it is fraught with troubles. The problem is that we’ve asked a lot of our data over the past few years. We want it to meet all the requirements for strategic business intelligence. But we also need it for designing apps, keeping customers happy and optimizing operational workflows.
Meanwhile, analytical insights inform every aspect of our business, from the product manager who must understand customer behavior to build personalized recommendations to the engineers who build those solutions.
We’ve tried to tackle the scope of this rapidly increasing data volume with solutions like Apache Hadoop. But those of us in the data space are unfortunately very familiar with the scarcity of consistent, stable and well-defined data. This often shows up as a disparity in analytical reporting: For example, analytics reports that 1,100 product engagements occurred, but the customer was billed for 1,123 engagements. Operational systems and analytic systems do not always agree, and this is in large part due to sourcing data from multiple divergent sources.
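A reconciliation check is one common way such a gap gets noticed. The sketch below is illustrative only: the engagement IDs are synthetic, generated to match the counts in the example above, and the function name is an assumption.

```python
def reconcile(billing_ids, analytics_ids):
    """Compare the engagement IDs held by two systems and report the differences."""
    billing, analytics = set(billing_ids), set(analytics_ids)
    return {
        "billing_count": len(billing),
        "analytics_count": len(analytics),
        "missing_from_analytics": sorted(billing - analytics),
        "missing_from_billing": sorted(analytics - billing),
    }


# Synthetic IDs: 1,123 engagements billed, 1,100 reported by analytics.
billing_ids = [f"eng-{n}" for n in range(1, 1124)]
analytics_ids = [f"eng-{n}" for n in range(1, 1101)]

report = reconcile(billing_ids, analytics_ids)
gap = report["billing_count"] - report["analytics_count"]
```

A check like this can say which records differ, but not which system is right. That question can only be settled by a single agreed source for the data.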
Data architecture often lacks rigor and evolves in an ad hoc way, without as much discipline or structure as we’d like. Users know that when they reach into the data lake to grab data for further processing and analysis, the information can be brittle. Older software may appear reliable but fail when it is presented with unusual data or when it is altered. And as the software in a given project grows larger and gains more users, it becomes less and less malleable.
The data warehouse or data lake strategy, in short, has become error-prone and unsustainable. It leads to disconnected data producers, impatient data consumers and an overwhelmed data team struggling to keep pace. Most importantly, it simply doesn’t provide an adequate support structure for where we are today and where we are heading.
If you want any system to scale, you need to reduce the number of coupling points, the places of synchronization. Following that logic, data architectures can be most easily scaled by being broken down into smaller well-defined components oriented around domains. Other teams and products can subscribe to that data, assured that it is the definitive source of truth, sourcing directly from their peers in a peer-to-peer fashion. Hence, the data mesh.
A Nervous System for Data
The mesh is designed to make a premium product of the important business data in an organization. It does this simply. Data mesh places the onus for providing clean, available and reliable data on the crew that generates, uses and stores the data, not on a centralized analytics team. In other words, responsibility for clean data rests with those who are closest to it and understand it best.
In a data mesh, ownership of an asset is given to the local team that is most familiar with its structure, purpose and value, and that owns its production. In this decentralized approach, many parties work together to ensure excellent data. The parties that own the data must be good stewards of that data and communicate with others to make sure their data needs are met.
Data is no longer treated as a byproduct of applications, but instead is envisioned as a well-defined data product. Think of data mesh as the antithesis of the data warehouse. Data products are sources of well-formed data that are distributed around your company, each treated as a first-class product in its own right, complete with dedicated ownership, life cycle management and service-level agreements. The idea is to carefully craft, curate and present these to the rest of the organization as products for other teams to consume, providing a reliable and trustworthy source for sharing data across the organization.
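One way to picture a data product is as data plus an explicit contract. The following is a minimal sketch, not a standard: the field names, the lifecycle stages and the "orders" example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """A hypothetical descriptor for one data product and its contract."""
    name: str
    owning_team: str            # the domain team accountable for the data
    schema: dict[str, type]     # field name -> expected Python type
    freshness_sla_minutes: int  # service-level target for how stale data may be
    lifecycle_stage: str = "published"  # e.g. draft, published, deprecated

    def validate(self, record: dict) -> list[str]:
        """Return the schema violations found in one record (empty if valid)."""
        problems = []
        for field_name, expected in self.schema.items():
            if field_name not in record:
                problems.append(f"missing field: {field_name}")
            elif not isinstance(record[field_name], expected):
                problems.append(f"wrong type for {field_name}")
        return problems


orders = DataProduct(
    name="orders",
    owning_team="checkout",
    schema={"order_id": int, "amount": float},
    freshness_sla_minutes=5,
)
```

The point of the sketch is that ownership, life cycle and service level are declared alongside the data, so the producing team can check records against the contract before other teams consume them.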
Event streams are the optimal solution for powering the vast majority of data products. They are a scalable, reliable and durable way of storing and communicating important business data and bridging the ever-blurrier gap between analytical and operational processing. They put the consumer in control of an ever-updating, read-only copy of that data to process, remodel, store and update as they see fit (think microservices).
The prevalence of cloud storage and computing products makes this easy to accommodate; analytics consumers can sink data into a cloud object store for massively parallel processing, while operational users can consume the data directly, acting on events as they occur. This eliminates multiple sources of the same data set that so often cause issues with older data acquisition strategies.
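The sketch below illustrates the idea of two consumers reading the same stream at their own pace. It assumes a plain Python list as the stream and a hypothetical Consumer class that tracks its own offset; a real deployment would use a streaming platform and an actual object store.

```python
class Consumer:
    """Tracks its own read position in a shared, append-only stream."""

    def __init__(self, stream):
        self._stream = stream
        self._offset = 0

    def poll(self):
        """Return events this consumer has not yet seen and advance its offset."""
        new_events = self._stream[self._offset:]
        self._offset = len(self._stream)
        return new_events


# One shared stream of sale events (illustrative data).
stream = [
    {"type": "sale", "sku": "A1", "qty": 2},
    {"type": "sale", "sku": "B7", "qty": 1},
]

operational = Consumer(stream)
analytics = Consumer(stream)

# Operational use: act on each event as it occurs.
stock = {"A1": 10, "B7": 5}
for event in operational.poll():
    stock[event["sku"]] -= event["qty"]

stream.append({"type": "sale", "sku": "A1", "qty": 3})

# Analytics use: collect everything seen so far as one batch for bulk processing.
object_store_batch = analytics.poll()

# The operational consumer picks up only the event it has not yet handled.
for event in operational.poll():
    stock[event["sku"]] -= event["qty"]
```

Both consumers draw from the same stream, so neither needs its own extract of the source system, and each decides independently when and how to process the data.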
But there’s a lot more to implementing the data mesh, and I’m going to explore the main considerations over the next three articles:
∙ How Data Is Produced: Data as a product and domain ownership
∙ How Data Is Consumed: Self-serve data and federated governance
∙ How to Organize the Workforce: A teamwork approach to the optimal mesh
Each organization will find that its data mesh implementation may differ in its supported data product types, technical design, governance model and organizational structure.
But one thing is certain: As the demands of data consumers continue to diversify and the scale of our needs continues to grow, I believe that data meshes, with their focus on distributed domain data sets provided through event streams, will become increasingly common and a critical part of our data-driven future.