Data Management on a Decentralized Data Mesh
Data mesh is a hot architectural concept, now listed as a dominant market trend. It is a reaction against the lack of speed to deliver data for decision-making in large organizations thirsty for data, where many data sources, use cases, and user types on ever-changing, complex data landscapes must be reckoned with. It blames centralized data management because of its reliance on a single data team having to develop many data pipelines feeding some central database — a warehouse, lake or master data management (MDM) hub.
Data mesh organizes its software teams by domain: Each team handles its own pipelines and provides its data via APIs as “products,” moving from a single to multiple points of consumption. The architecture’s technology-agnostic principles also address two hurdles arising in decentralized systems: duplication of efforts to maintain pipelines and infrastructure (through a central self-service platform), and non-interoperable data products (via federated governance policies such as standardizing data formats, metadata representation and global identity). These two “layers” work in tandem: Expressing such policies and standards should use computational facilities available within the platform.
We believe data mesh’s popularity is deserved: It addresses a real problem with sound principles. However, in our own experience and as others have reported, the business frequently needs combined insights from its various units, a need that goes beyond the basic interoperability offered by a data product’s API.
For data products to be useful for analysis, query access must be opened for consumption. For the data to be fully understood (that is, joinable to other data products) and relied upon, much agreement must be reached on common data elements, beyond what data mesh prescribes in the platform and federated governance layers (model descriptions and formats).
Indeed, as we discuss below, a data mesh platform should include some important capabilities that data management systems have provided over the years to deal with the difficult data integration problem. This article suggests some capabilities that could be adapted within a data mesh platform.
The Quest for Data Integration
Simply put, data integration involves combining data from several disparate sources into a unified view, enabling the analysis of combined datasets. Businesses very often want insights across multiple business processes, or want to test hypotheses involving them. For example, a consumer products company may want to find the impact of defects in some product category on sales, customer satisfaction and call center personnel, by region.
This requires data managed in processes from different business units that generally use different operational systems. Obtaining this insight needs data integration capabilities. Data mesh should support data integration.
But data integration is hard. We believe this is because of the following challenges:
- Data heterogeneity. In our company example above, there are disparate operational systems housing data of disparate formats and inconsistent semantics.
- Data source explosion in terms of variety and velocity, explained by the pervasive use of new technology such as cloud computing, IoT devices and mobility.
- Poor data quality. On top of the previous problems, a root cause is the lack of data ownership. Data quality has been found to be very deficient in most organizations.
- Mergers and acquisitions. Many organizations grow this way, inheriting the mess of the companies acquired.
Because of its versatility, data warehousing has been the most popular form of integration for analytics. It strives to homogenize every aspect of this heterogeneity problem by moving data to central storage and providing a common unified view across all functional areas. However, according to one survey, only 22% of organizations fully realized data warehouse ROI over the past two years. Most data warehousing project failures can be attributed to the challenges described above.
Data lakes ingest data requiring no conformance to a unified view. Modelling and conformance via cleansing/transforming is delayed until there is a business need. Data lakes don’t solve the problem: Today, adoption has waned, and they also have big failure rates — over 85%, according to Gartner, which cites lack of governance, outdated and irrelevant data as failure scenarios to prevent.
Federation, or virtualization, systems implement another form of integration: They leave data in place and present a virtual model as the unified view. Usage has been low because of limitations in query processing capabilities and scale, in uniform history management, and in the ability to cleanse data without updating it. However, this is changing, as explained below.
Finally, we have master data management systems, which help organizations integrate a subset of the data that has the most data quality problems: their official master data assets, those “big-level” entities — customers, products or places — that appear in more than one operational application.
They provide a central, authoritative hub where individual entity records are standardized with common, unified models and with rules to eliminate incorrect (i.e., dirty, duplicate, inconsistent) data from entering the hub. MDM projects have also been associated with high failure rates, although implementations seem to be maturing.
How Data Mesh Helps Data Integration
By organizing software teams by domain and by decentralizing ownership, data mesh squarely addresses the data quality challenge. Teams are closer to the source of the data, understand it better and should know what their consumers expect, so they are in the best position to fix the main root causes for bad data quality.
The federated governance team determines what to measure — data quality dimensions and KPIs — and the data platform team supplies the “how” part — technology and best practices for automating measurement.
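As a minimal sketch of this division of labor, assume governance publishes a declarative list of quality dimensions and the platform supplies one function that measures them for any data product. The names `QUALITY_POLICY` and `measure_quality`, the two dimensions shown and the sample records are illustrative assumptions, not part of any data mesh standard.

```python
from typing import Any

# Hypothetical governance policy: WHAT to measure for a data product.
QUALITY_POLICY = {
    "completeness": ["customer_id", "email"],  # fields that must be populated
    "uniqueness": ["customer_id"],             # fields that must not repeat
}


def measure_quality(
    records: list[dict[str, Any]], policy: dict[str, list[str]]
) -> dict[str, float]:
    """Platform-side HOW: compute each KPI as a ratio between 0 and 1."""
    total = len(records)
    kpis: dict[str, float] = {}
    for field in policy.get("completeness", []):
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        kpis[f"completeness.{field}"] = filled / total if total else 1.0
    for field in policy.get("uniqueness", []):
        values = [r[field] for r in records if r.get(field) not in (None, "")]
        kpis[f"uniqueness.{field}"] = len(set(values)) / len(values) if values else 1.0
    return kpis


# Made-up sample records from a source-aligned domain.
records = [
    {"customer_id": "C1", "email": "ana@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C2", "email": "bo@example.com"},
    {"customer_id": "C3", "email": None},
]
report = measure_quality(records, QUALITY_POLICY)
```

Keeping the policy as data rather than code is one way to let the governance team change what is measured without touching the platform's measurement logic.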
With no central data delivery team, consumers must now look for data within data products, find the data of interest and, if needed, perform their own integration. This resembles the shadow-IT integration that already happens in many companies beyond the reach of the central team.
However, the situation is much better thanks to the “data as products” principle and federated governance standards: Data products are self-describing via standard metadata, discoverable, accessed uniformly via APIs and securely by global access control, and benefit from managed federated identity.
Developers doing their own integration, whether aggregating domains or aligning with consuming use cases, must nevertheless build a model for it. This is simpler than building a global model for an MDM hub or a data mart, let alone a warehouse, because such models are built iteratively and organically rather than upfront and with a cross-organization aim. Developers must also address data quality within the new data product, including potential inconsistencies among records referring to the same entity.
Improving Data Integration Capabilities in a Data Mesh
MDM system capabilities
Let’s start with MDM system capabilities, as they help with the entity resolution problem and with addressing inconsistencies. But first, we believe it is important to highlight the differences between the MDM domain and data mesh domain concepts. As mentioned above, master data domains refer to “big-level” entities managed across multiple applications and organizational units.
By contrast, a data mesh domain arises from decomposing a complex system into “bounded contexts” along organizational units. The dominant factor drawing boundaries between contexts is human culture. An MDM domain then corresponds to several data mesh domains dealing with the “same concept”:
- Source-aligned domains (e.g., per service line or business unit), and
- Aggregate domains, created when some degree of aggregation of source domains into a more holistic concept is needed.
The qualifiers “some degree” and “more holistic” in the previous sentence are important. Data mesh architect Zhamak Dehghani explicitly warns against creating ambitious, MDM-style aggregate domains that capture all facets of a particular concept, such as “Customer 360,” and so imply a unified cross-organizational model, especially if teams from different cultures participate.
If capturing all facets of a complex domain to publish a trusted master data hub is one of your requirements, then data mesh will not help you. It will even impose an extra level of burden — setting up a self-service infrastructure, implementing data as products — that will go against delivery speed.
In her data mesh principles article, Dehghani identifies global identity as a basic concern of the platform and federated governance layers. Identity sounds simple, but for complex domains such as Customer, it requires a matching engine that performs pair-wise record similarity computations, through authored rules or machine learning methods, using data elements such as person names, organization names, addresses, identifiers and dates.
The market provides general-purpose matching engines (IBM, SAP, Informatica), specialized engines (D&B’s corporation data matching) and registry-style MDM systems providing a matching capability.
A platform-level identity service could deliver global identifiers for domain entities by wrapping such engines with two APIs: batch matching and real-time lookup-before-create. Both source-aligned domains and aggregate domains could call such a service, each choosing its own matching policy.
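To make the idea concrete, here is a minimal sketch of such an identity service. It is not how any commercial matching engine works: the `IdentityService` class, its method names, the identifier format and the 0.85 threshold are all assumptions, and Python's standard `difflib.SequenceMatcher` stands in for a real rule-based or machine-learned similarity function.

```python
from difflib import SequenceMatcher
from itertools import count


def similarity(a: dict, b: dict) -> float:
    """Pair-wise score: average string similarity over name and address."""
    fields = ("name", "address")
    scores = [
        SequenceMatcher(None, a[f].lower().strip(), b[f].lower().strip()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


class IdentityService:
    """Hypothetical platform service that issues global identifiers."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold  # matching policy chosen by the caller
        self._registry: dict[str, dict] = {}
        self._seq = count(1)

    def lookup_before_create(self, record: dict) -> str:
        """Return the ID of the best match above the threshold, else mint one."""
        best_id, best_score = None, 0.0
        for gid, known in self._registry.items():
            score = similarity(record, known)
            if score > best_score:
                best_id, best_score = gid, score
        if best_id is not None and best_score >= self.threshold:
            return best_id
        gid = f"G{next(self._seq):06d}"
        self._registry[gid] = record
        return gid

    def match_batch(self, records: list[dict]) -> list[str]:
        """Batch matching: resolve each record in turn."""
        return [self.lookup_before_create(r) for r in records]


svc = IdentityService(threshold=0.85)
ids = svc.match_batch([
    {"name": "Acme Corporation", "address": "12 Main Street, Springfield"},
    {"name": "ACME Corporation ", "address": "12 Main Street, Springfield"},
    {"name": "Globex Inc", "address": "400 Harbor Road, Portland"},
])
```

The first two records differ only in case and trailing whitespace, so they receive the same global identifier, while the third gets a new one. A domain wanting a stricter or looser policy would instantiate the service with a different threshold.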
Consistency of copies in aggregated domains
When two records refer to the same entity, and they belong to the same (data mesh) domain, the domain must decide which record or record values should be retained. Most recent update is a frequently used policy in MDM systems. When the records for the same entity belong to different domains and their models overlap, the inconsistency problem appears when records don’t agree on the overlapping part.
Aggregate domains could borrow capabilities from consolidation-style MDM systems: They could model the overlap, detect same-entity inconsistent records and decide, based on a predefined survivorship policy, how to resolve the conflict. Frequent policies include domain source, value set and, again, update recency. In fact, implementations of aggregate domains could be done wrapping consolidation-style MDM systems.
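A survivorship policy of this kind can be sketched in a few lines. The example below is an assumption-laden illustration rather than the behavior of any particular MDM product: the domain names, the `SOURCE_PRIORITY` ranking, the per-attribute `POLICY` and the record layout are all hypothetical, and only two of the policies mentioned above (domain source and update recency) are shown.

```python
from datetime import date

# Hypothetical policy: lower rank wins when the rule is "source_priority".
SOURCE_PRIORITY = {"billing": 0, "crm": 1, "support": 2}
# Hypothetical per-attribute survivorship rules for the overlapping model.
POLICY = {"email": "most_recent", "legal_name": "source_priority"}


def conflicts(records: list[dict], policy: dict[str, str]) -> list[str]:
    """Overlapping attributes on which same-entity records disagree."""
    return sorted(
        attr for attr in policy
        if len({r[attr] for r in records if r.get(attr)}) > 1
    )


def resolve(records: list[dict], policy: dict[str, str]) -> dict:
    """Merge same-entity records from different domains into one record."""
    golden = {}
    for attr, rule in policy.items():
        candidates = [r for r in records if r.get(attr)]
        if not candidates:
            continue
        if rule == "most_recent":
            winner = max(candidates, key=lambda r: r["updated"])
        elif rule == "source_priority":
            winner = min(candidates, key=lambda r: SOURCE_PRIORITY[r["domain"]])
        else:
            raise ValueError(f"unknown survivorship rule: {rule}")
        golden[attr] = winner[attr]
    return golden


# Two made-up records that a matching step has linked to the same customer.
same_entity = [
    {"domain": "crm", "updated": date(2023, 5, 1),
     "email": "j.doe@new.example", "legal_name": "J. Doe"},
    {"domain": "billing", "updated": date(2022, 1, 10),
     "email": "jdoe@old.example", "legal_name": "Jane Doe"},
]
disagreements = conflicts(same_entity, POLICY)
golden = resolve(same_entity, POLICY)
```

Here the email survives from the most recently updated record (the CRM one), while the legal name survives from the higher-priority billing domain.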
Synchronization of copies within source-aligned domains
The previous section talked about the consistency problem at the aggregate domain level, but it remains unsolved in source-aligned domains. Other more ambitious MDM styles and capabilities were introduced to solve this problem:
- Co-existence-style systems, which help in synchronizing changes within operational systems, and
- Transaction-style systems, where the MDM hub becomes the system of record for the master domain(s), providing read access to all operational systems via services; the latter no longer create or edit master data.
If the problem of inconsistency of copies on source-aligned domains generates too much pain, a solution is to add co-existence-style synchronization services to a mesh platform and use them as follows:
- Source-aligned domains subscribe to entity changes occurring within an aggregate domain,
- Upon a change, the aggregate domain pushes the changed entity to domains that subscribed, and
- Race conditions on concurrent updates are resolved by governance-level synchronization rules.
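The three steps above can be sketched as a small in-process publish/subscribe loop. This is only an illustration: the class and method names are hypothetical, a real platform would use a message broker rather than direct callbacks, and the synchronization rule assumed here (the highest version number wins, and stale or tied versions are rejected) is just one possible governance choice.

```python
from typing import Callable


class AggregateDomain:
    """Hypothetical aggregate domain that pushes entity changes to subscribers."""

    def __init__(self) -> None:
        self._entities: dict[str, dict] = {}
        self._subscribers: dict[str, Callable[[str, dict], None]] = {}

    def subscribe(self, domain: str, callback: Callable[[str, dict], None]) -> None:
        self._subscribers[domain] = callback

    def apply_change(self, entity_id: str, values: dict, version: int, origin: str) -> bool:
        """Accept a change only if it is newer than the stored one (assumed rule)."""
        current = self._entities.get(entity_id)
        if current is not None and version <= current["version"]:
            return False  # stale or concurrent update loses
        entity = {**(current or {}), **values, "version": version}
        self._entities[entity_id] = entity
        # Push the changed entity to every subscriber except the originator.
        for domain, callback in self._subscribers.items():
            if domain != origin:
                callback(entity_id, entity)
        return True


class SourceAlignedDomain:
    """Hypothetical source-aligned domain holding local copies of entities."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.copies: dict[str, dict] = {}

    def on_change(self, entity_id: str, entity: dict) -> None:
        self.copies[entity_id] = dict(entity)


hub = AggregateDomain()
crm = SourceAlignedDomain("crm")
billing = SourceAlignedDomain("billing")
hub.subscribe(crm.name, crm.on_change)
hub.subscribe(billing.name, billing.on_change)

accepted = hub.apply_change("G1", {"email": "a@example.com"}, version=2, origin="crm")
rejected = hub.apply_change("G1", {"email": "b@example.com"}, version=1, origin="billing")
```

The older billing update is rejected, so the billing copy keeps the value pushed from the CRM change.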
Transaction-style hubs and data meshes are poles apart; we recommend against simultaneously implementing both.
Analysts want query access
The basis for interoperable data products is providing access through APIs, ensuring developers can connect to domains from other domains. Data product developers should provide both operational APIs and analytical APIs, the latter for consumer developers to, for instance, get all current domain records, filter records by criteria and get historical data over time. APIs are both good and necessary because they are technology agnostic; via encapsulation, they limit cascading effects of internal data structure changes; and they promote security.
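A minimal sketch of the three analytical operations just mentioned might look as follows. The `CustomerDataProduct` class, its method names and the versioned record layout are assumptions made for illustration; a real data product would expose these over a network API rather than as in-process methods.

```python
from datetime import date


class CustomerDataProduct:
    """Hypothetical analytical interface; internal storage stays hidden."""

    def __init__(self, versions: list[dict]) -> None:
        # Each row is one version of a record, valid from `valid_from` onward.
        self._versions = sorted(versions, key=lambda v: v["valid_from"])

    def as_of(self, day: date) -> list[dict]:
        """Historical view: the latest version of each record as of `day`."""
        latest: dict[str, dict] = {}
        for version in self._versions:
            if version["valid_from"] <= day:
                latest[version["id"]] = version
        return list(latest.values())

    def current(self) -> list[dict]:
        """All current domain records."""
        return self.as_of(date.max)

    def filter(self, **criteria) -> list[dict]:
        """Current records whose fields equal all the given criteria."""
        return [
            r for r in self.current()
            if all(r.get(key) == value for key, value in criteria.items())
        ]


product = CustomerDataProduct([
    {"id": "C1", "region": "EMEA", "valid_from": date(2022, 1, 1)},
    {"id": "C1", "region": "APAC", "valid_from": date(2023, 6, 1)},
    {"id": "C2", "region": "EMEA", "valid_from": date(2022, 3, 1)},
])
emea_now = [r["id"] for r in product.filter(region="EMEA")]
emea_end_2022 = [
    r["id"] for r in product.as_of(date(2022, 12, 31)) if r["region"] == "EMEA"
]
```

Because consumers only see `current`, `filter` and `as_of`, the provider remains free to change how versions are stored. The price is that consumers can only ask the questions these methods anticipate.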
However, API-only access for analysis is adequate only for data science workloads or small datasets. Business intelligence (BI) analysts prefer ad-hoc query access: Data providers can’t anticipate all the ways analysts will want to filter, slice and dice the data. Without direct query access, analysts must read the entire contents via API and move it to another data product.
If data from a domain is needed for analysis, query access should be opened to consumers. Note that BI tools on the market generally expect a database connection, and API-based sources outnumber the available BI connectors by several orders of magnitude.
Dealing with the impact of changes
If query access is opened, data model changes internal to a data product will have a cascading effect in all consuming queries (consider de-normalizing a model). A way to control the impact of such changes is through data lineage technology available in today’s data catalog products. These products gather metadata from BI tools, databases and data pipeline middleware, stitch them together, and establish dependency chains of queries on tables or views down to the column level. Such functionality is a welcome addition to a data platform. Of course, there’s another obvious reason to adopt data catalogs in the data platform, and that is to support data product discovery.
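The impact analysis that lineage enables boils down to a graph traversal. The sketch below assumes a catalog has already harvested column-level dependency edges; the edge list, the column names and the `downstream` function are made up for illustration and do not reflect any specific catalog product's API.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges harvested by a catalog: (upstream, downstream).
EDGES = [
    ("orders.customer_id", "vw_sales.customer_id"),
    ("orders.amount", "vw_sales.revenue"),
    ("vw_sales.revenue", "dashboard.revenue_by_region"),
    ("vw_sales.customer_id", "report.churn_by_customer"),
]


def downstream(column: str, edges: list[tuple[str, str]]) -> list[str]:
    """Everything that directly or transitively depends on `column`."""
    graph: dict[str, list[str]] = defaultdict(list)
    for up, down in edges:
        graph[up].append(down)
    seen: set[str] = set()
    queue = deque([column])
    while queue:  # breadth-first walk along the dependency chain
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


# Which views and reports break if `orders.amount` is renamed or dropped?
impacted = downstream("orders.amount", EDGES)
```

A data product team could run such a check before releasing a model change and notify the owners of the impacted queries.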
Supporting cross-business process analysis
We turn now to analysis on multiple data products, which is how the original quest for data integration is formulated on a data mesh. Our example above (the impact of product defects on sales and customer satisfaction) requires identifying common data elements that have significance across the data products supporting the sales and customer support business processes. These include region codes, time calendars and product hierarchies.
Supporting data interoperability implies the ability to relate KPIs sliced on, say, the same region and product category. This requires agreement on the values of these common data elements (the equivalent of Kimball’s conformed dimensions). Authoring reference data values, tracking their changes and distributing them to data products must be governed: This capability, akin to and supported by MDM systems, is a candidate for inclusion in a data mesh platform.
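To illustrate why agreed values matter, the sketch below relates a sales KPI and a support KPI that two domains report with different local region codes. Everything in it is an assumption for illustration: the `REGION_MAP` reference data, the domain names, the field names and the figures are invented.

```python
# Hypothetical governed reference data: local region code -> conformed code.
REGION_MAP = {
    "sales": {"N-AM": "NA", "EUR": "EMEA"},
    "support": {"NorthAmerica": "NA", "Europe": "EMEA"},
}


def conform(rows: list[dict], domain: str, measure: str) -> dict[tuple[str, str], float]:
    """Re-key a domain's KPI rows on the conformed (region, category) pair."""
    out: dict[tuple[str, str], float] = {}
    for row in rows:
        key = (REGION_MAP[domain][row["region"]], row["category"].lower())
        out[key] = out.get(key, 0) + row[measure]
    return out


# Made-up KPI rows as published by two different data products.
sales = [
    {"region": "N-AM", "category": "Audio", "units_sold": 1200},
    {"region": "EUR", "category": "Audio", "units_sold": 800},
]
support = [
    {"region": "NorthAmerica", "category": "audio", "defect_calls": 60},
    {"region": "Europe", "category": "audio", "defect_calls": 16},
]

sold = conform(sales, "sales", "units_sold")
calls = conform(support, "support", "defect_calls")

# Cross-process KPI: defect calls per 1,000 units sold, by conformed key.
calls_per_1000_units = {
    key: 1000 * calls[key] / sold[key] for key in sold.keys() & calls.keys()
}
```

Without the shared mapping, the two products could not be joined at all, since "N-AM" and "NorthAmerica" never compare equal. Governing the mapping centrally keeps that knowledge out of every consumer's code.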
In a decentralized data mesh, domain teams have the autonomy to store their data products in the systems of their choice. We mentioned above that the use of virtualized systems has been limited, but recent open source query engines, such as Trino (formerly PrestoSQL, commercially backed by Starburst), have made significant progress toward becoming agnostic to storage technology while still processing data scalably through parallel processing techniques.
Some organizations can be less liberal, having teams store their data in assigned areas of the same underlying data engine, and still be faithful to data mesh’s decentralized ownership principles, as has been discussed elsewhere. In this case, query engines have less data to move during processing, so better performance can be expected.
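The shared-engine arrangement can be sketched with SQLite from Python's standard library, where each domain team owns one attached schema and an analyst joins across them in a single ad-hoc query. SQLite is only a stand-in chosen so that the sketch runs without dependencies; the schema names, table names and rows are hypothetical, and a production mesh would rely on a scalable engine instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each domain team owns one attached schema (its "assigned area").
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

conn.execute("CREATE TABLE sales.orders (region TEXT, category TEXT, units INTEGER)")
conn.execute("CREATE TABLE support.tickets (region TEXT, category TEXT, defect INTEGER)")
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?, ?)",
    [("NA", "audio", 1200), ("EMEA", "audio", 800)],
)
conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?, ?)",
    [("NA", "audio", 1), ("NA", "audio", 1), ("NA", "audio", 0), ("EMEA", "audio", 1)],
)

# One ad-hoc query spanning both domains; no data is copied between them.
rows = conn.execute(
    """
    SELECT o.region, o.units, t.defects
    FROM (SELECT region, category, SUM(units) AS units
          FROM sales.orders GROUP BY region, category) AS o
    JOIN (SELECT region, category, SUM(defect) AS defects
          FROM support.tickets GROUP BY region, category) AS t
      ON t.region = o.region AND t.category = o.category
    ORDER BY o.region
    """
).fetchall()
conn.close()
```

Ownership stays decentralized because each team controls only its own schema, while the join runs inside one engine.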
Data mesh has good chances of succeeding because it has strong and sound principles, and because there is enabling technology that has matured over the years to solve many related data management problems. However, to fully establish itself, there needs to be evidence that data mesh projects are capable of overcoming the data integration problem. A data mesh with a data platform layer containing data management capabilities like the ones described here would be a good place to start.