Data Mesh Demands You View and Handle Your Data in a New Way
This is the second post in a four-part series. Here is Part 1.
Data mesh is designed to make data accessible, secure and interoperable at scale, unlocking access to a growing amount of information from across your organization. Data mesh accomplishes this by squarely addressing problems that can bedevil centralized data storage systems and pipelines. In those centralized systems, analytical and operational data is often served from different sources and hybrid workflows. This leads to frustrating inconsistencies and divergent sources of information. Over time, overlapping ad hoc systems pile up and the data distribution network becomes sclerotic.
A data mesh can clear out the cobwebs and bring order to an aging model that feels improvised and suboptimal, creating a more manageable and scalable data architecture. In Part 1 of this series, we discussed why the data mesh design is a great solution to data architectures that have become brittle and error-prone. Here, in Part 2, we’ll focus on the purpose and intended users of the data and how to best structure the system to meet those needs.
Data mesh has several fundamental organizational and technological dimensions, and I’ll go into two of the essentials below — data as a product and the need for domain ownership.
Creating a Data Product
Your journey to data mesh begins with a fundamental understanding: Data must be viewed as a stand-alone product. It should be handled like any high-quality product that you’d release to the world, with the same rigor applied to its conception, creation and management. You should no longer leave it up to other teams to figure out how to gain access to your domain’s public data (your private data stays within your domain’s boundaries). Instead, provide a well-defined, reliable, consistent and trustworthy endpoint for them to access it. The data must be presented clearly so that other teams in your organization (think of them as your customers) can use it to power their analytics and applications.
Data products each belong to a particular domain and act as a definitive source of information related to that part of the business. A data product can be made up of an event stream, a database, an application or any number of data sets (novel or created from existing data) that are packaged for controlled access. It is curated by the team that owns it for the benefit of other teams. The data then flows from data product to consuming application as needed, and in this way from domain to domain.
Event streams are one of your best options for providing data products. They bridge the operational and analytical divide, a distinction which is getting blurrier with each passing day. Consumers can react to new events in real time, or they can sink the data to a batch data store for future processing. Consumers independently remodel the event streams, including mixing multiple data products together to come up with domain-specific use cases. They can then store their results in a data store best suited for their query patterns, such as a key-value store for fast lookups, or as an event stream to power other consumers.
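To make the remodeling idea concrete, here is a minimal sketch of a consumer that mixes two data products and stores the result in a key-value view. It uses plain Python lists in place of a real event stream, and the event fields (order_id, customer_id, total, region) are hypothetical names chosen for illustration, not part of any particular platform.

```python
from collections import defaultdict

# Hypothetical events from two data products; field names are illustrative only.
customer_events = [
    {"customer_id": "c1", "region": "EMEA"},
    {"customer_id": "c2", "region": "APAC"},
]
order_events = [
    {"order_id": "o1", "customer_id": "c1", "total": 40.0},
    {"order_id": "o2", "customer_id": "c2", "total": 15.5},
    {"order_id": "o3", "customer_id": "c1", "total": 10.0},
]


def materialize_spend_by_region(customers, orders):
    """Fold two event streams into a key-value view keyed by region."""
    region_of = {}
    for event in customers:
        # The latest event for a customer overwrites any earlier one.
        region_of[event["customer_id"]] = event["region"]

    spend = defaultdict(float)
    for event in orders:
        region = region_of.get(event["customer_id"], "UNKNOWN")
        spend[region] += event["total"]
    return dict(spend)


# The consumer's own domain-specific model, ready for fast lookups.
spend_by_region = materialize_spend_by_region(customer_events, order_events)
print(spend_by_region)
```

In a real deployment the lists would be replaced by subscriptions to the two streams and the dictionary by whichever data store suits the consumer's query patterns.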
Depending on your business, you may also choose to use alternative options. A request/response API on top of a database is one such option, though it can be tricky to support all of the query options and performance requirements of a wide range of consumers. Alternatively, old-style batch extract, transform and load tools can create nightly snapshots of data to be served as a data product, though these can be quite slow and have a significant impact on database performance.
Creating and populating the data product is one part, but we also need to provide metadata (literally, “data about data”) that communicates specific information about the data product. This includes general information about ownership, domain, schemas, update cadence and quality metrics, but can also include specific information about the data product substrate (such as how many events per second to expect from an event stream).
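As a rough illustration, the metadata described above could be captured in a small descriptor like the one below. This is only a sketch: the class name, field names and sample values are assumptions made for the example and do not come from any specific data catalog.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass(frozen=True)
class DataProductMetadata:
    """Hypothetical descriptor published alongside a data product."""
    name: str
    domain: str
    owner_contact: str   # how consumers reach the owning team
    schema_ref: str      # pointer to the schema (illustrative value below)
    update_cadence: str  # e.g. "real time" or "nightly"
    quality_metrics: dict = field(default_factory=dict)
    expected_events_per_second: Optional[int] = None  # substrate-specific detail


orders_metadata = DataProductMetadata(
    name="orders-placed",
    domain="orders",
    owner_contact="orders-team@example.com",
    schema_ref="orders-placed-value/v2",
    update_cadence="real time",
    quality_metrics={"completeness": 0.999},
    expected_events_per_second=250,
)

# A plain dictionary is easy to publish to whatever catalog the organization uses.
catalog_entry = asdict(orders_metadata)
print(catalog_entry["owner_contact"])
```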
Here’s a pro tip: When you publish a stream and share your data, you must use schemas to put data contracts in place that ensure appropriate backward and forward compatibility. That way, when you add a new data field to a stream, any existing consumers of that stream can continue to operate without having to make any changes.
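The sketch below shows the idea behind that tip with a hand-rolled reader, assuming a very simplified notion of a schema: a mapping from field name to default value. Real schema formats and registries enforce far richer rules, and the names here (ORDER_V1, ORDER_V2, read_event) are invented for the example.

```python
REQUIRED = object()  # sentinel marking a field that has no default

# Hypothetical reader schemas: field name -> default value (or REQUIRED).
ORDER_V1 = {"order_id": REQUIRED, "total": REQUIRED}
# Version 2 adds a field WITH a default, which keeps both directions compatible.
ORDER_V2 = {"order_id": REQUIRED, "total": REQUIRED, "currency": "USD"}


def read_event(event, reader_schema):
    """Project an event onto the reader's schema.

    Unknown fields are ignored and missing fields take their default;
    a missing field without a default is a contract violation.
    """
    record = {}
    for name, default in reader_schema.items():
        if name in event:
            record[name] = event[name]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {name}")
        else:
            record[name] = default
    return record


old_event = {"order_id": "o1", "total": 40.0}
new_event = {"order_id": "o2", "total": 15.5, "currency": "EUR"}

# Forward compatibility: an existing (v1) consumer reads data written with v2.
forward = read_event(new_event, ORDER_V1)
# Backward compatibility: an upgraded (v2) consumer reads data written with v1.
backward = read_event(old_event, ORDER_V2)
print(forward, backward)
```

The design choice that matters is giving the new field a default. Without one, an upgraded consumer would fail on older events that lack the field.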
Through this design, data producers limit their coupling to downstream consumers to that of the data product API. Producers compile and ensure the quality of the data, while users consume the data. It’s a matter of write once, read many. One of the easiest ways for a data producer to participate in the data mesh is to publish its public data as a stream of well-defined events, so that other teams can consume it.
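One way to picture that limited coupling is a producer that maps its internal records onto a small public event before publishing. The following is a sketch under assumptions: the field names and the PUBLIC_FIELDS contract are hypothetical, and serializing to JSON stands in for whatever format the stream actually uses.

```python
import json

# Hypothetical public contract: only these fields leave the domain.
PUBLIC_FIELDS = ("order_id", "status", "total")


def to_public_event(internal_row):
    """Keep only the fields that belong to the data product API."""
    return {name: internal_row[name] for name in PUBLIC_FIELDS}


# The internal record carries private details that consumers never see.
internal_row = {
    "order_id": "o1",
    "status": "SHIPPED",
    "total": 40.0,
    "warehouse_bin": "B-17",
    "fraud_score": 0.02,
}

event = to_public_event(internal_row)
payload = json.dumps(event, sort_keys=True)  # what would be written to the stream
print(payload)
```

Because consumers only ever see the public event, the producer can rename or drop internal columns such as warehouse_bin without breaking anyone downstream.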
Ownership of this data product is the responsibility of the team that is most familiar with the data’s structure, purpose and value. This team is responsible for being good stewards of that data, curating its usage and ensuring that the needs of its consumers are adequately met. This may even lead to a new job title: Data Product Owner. This owner would be responsible for building, maintaining and serving the domain’s data products. Note that this is an inverted model of responsibility, compared with past data control paradigms. In this new ownership model, the accountability for data quality sits upstream, as close to the source of the data as possible.
Domain ownership is implied in the “ownership” metadata of the data product. Should a prospective or existing consumer need to communicate with the data product owner, they would simply contact the owner through the preferred communication mode — such as email, instant message or telephone. The data product owner remains responsible for the data product management and lifecycle.
A team owns its data in much the same way that it would own the set of services for a specific slice of the business. That team must engage in product thinking about the data. It is wholly responsible for the data’s quality, representation and cohesiveness. Each domain team may serve one or multiple data products.
Domain ownership is an organizational construct that requires buy-in from the teams and managers participating in the data mesh. Each team owns and is responsible for the creation, publishing and serving of any data product from its domain. Although each team has full autonomy in deciding both the contents and the mechanism of data product delivery, it must consult with intended and prospective consumers of the data to ensure that their needs are met.
Generally, only the team responsible for a data set needs to be concerned with its internal processes. Everyone else only cares about the public interface. How that public data is created (implementation details, language, data infrastructure and so on) doesn’t matter as long as it meets the criteria of being addressable, trustworthy and of good quality.
The promises of data mesh are huge. And with this structure in mind, you’re ready to begin your journey. We’ll explore in our next installment how data is consumed and the need for federated governance in a self-service data system.