Analytics in 2022 Means Mastery of Distributed Data Politics

Data is distributed and dispersed by its very nature. How organizations acknowledge and accommodate that reality is a matter of politics and pragmatism.
Dec 29th, 2021 6:00am by
Andrew Brust
Andrew Brust is founder/CEO of Blue Badge Insights, providing strategy and advisory services to data, analytics, BI and AI companies, as well as their partners and customers. Andrew has written about Data and Analytics for over 10 years and is a lead analyst for GigaOm in that same space. He also co-chairs the Visual Studio Live! series of developer conferences, is a Microsoft Regional Director and Data Platform MVP, an entrepreneur and consulting veteran.

For about a decade now, businesses and the tech industry have been obsessing over big data, analytics, machine learning and the establishment of data-driven practices. Meanwhile, COVID-19 lockdowns and their aftermath have accelerated efforts and mandates around digital transformation, which itself is premised on data-driven culture and operation. While this has driven a ton of innovation by vendors and investment by customers, it’s also led to inevitably high project failure rates, a hallmark of damn-the-torpedoes tech adoption.

The year 2021 showcased this breakneck-speed adoption and its fallout. But in 2022, organizations will need to shift from mere implementation of analytics and machine learning technology to its successful application, along with cultural change, user satisfaction and adoption of data-driven practices. In other words, organizations must move from investment to return. And that can be much, much harder.

The good news is that various architectural and methodological approaches have emerged to help avoid and mitigate unsuccessful implementations. But are the approaches customer-ready yet? Are they pragmatic, rather than purist? Have they been stress-tested? Are they even well-understood? In this analysis, I’ll take you through two such leading architectures: Data Fabric and Data Mesh.

A mere explication of the two methodologies wouldn’t be very valuable, since many good ones already exist. So I’ll try to go beyond an explainer, by contextualizing the Data Mesh and Data Fabric models through analogy with types of political government. Perhaps considering past events on the geopolitical side can stoke our analytical thinking and point to lessons on the tech side that can help mitigate technology risk and enhance success.

The Crux of the Problem

One key to project failure is the mismatch between platforms’ requirements and customers’ realities. Most analytics platforms are premised on data being available, discoverable, cleansed and consolidated. And the reality of customer data contrasts sharply with that profile.

While virtually every vendor and practitioner knows that data starts out in an unrefined and scattered form, many view that initial state as a defect, to be corrected and subsequently avoided. While this more complicated reality is acknowledged, it’s also discouraged, disparaged and looked upon as a remedial target.

From there, ETL (extract, transform and load), data pipelines, bespoke code and point-to-point integrations are applied to correct this imperfect state and get the organization on a steady footing of well-integrated data from which a bevy of analytics tasks can be executed. Then, once everything is fully up-and-running, with a single data repository ready to go, organizations can exhale and, implicitly, pretend that the prior, imperfect state never really existed.

In addition to being Utopian and — really — kind of snobby, this approach is deluded and otherwise flawed. The good news is that the collective industry and customer consciousness is absolutely waking up to this futility and denial, and has begun to address it. In 2022, organizations will need to take this issue head-on.

Analytics Inconvenience Isn’t Operational Negligence

The dispersed nature of operational data isn’t a flaw. It’s not an imperfection. It’s not the result of poor planning. Dispersal is operational data’s natural state. The overall operational data corpus is supposed to be scattered. It got that way through optimization, not incompetence or lack of forethought.

Data Fabric and Data Mesh both acknowledge the futility of trying to centralize all data physically. They both recognize that data volumes are only growing and data sources only multiplying. As a result, the two have tended to get conflated at times. But they are quite distinct, both in philosophy and implementation, and there are strong merits to each approach.

Data Fabric: Vendors’ Olive Branch

Data Fabric platforms grant the physical reality of dispersed data, and they seek to mitigate it by creating virtualized access layers that still integrate the data logically. This logical unification means that a central authority — be it IT, the Chief Data Officer or an analytics Center of Excellence — can still manage the data, govern it, and conform it to corporate-wide standards.

The Data Fabric also collects an array of technologies used for data transformation and analysis and makes them available to organizational business units, on an a la carte basis, to enable self-service analytics. In this way, it correlates with vendor stacks, especially for vendors that offer full data platforms.

In many ways, the Data Fabric is an implementation of a vendor data platform that is sensitive to, permissive of and flexible around the imperfect curatorial state of the customer’s data estate. Rather than selling and installing the platform and making the customer conform to its requirements, the customer’s circumstances and requirements drive the implementation.
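As a rough illustration of the virtualized access layer described in this section, the sketch below leaves rows in their source systems and unifies them only logically, at query time, under a centrally defined column naming standard. All names here (`VirtualTable`, `FabricCatalog`, the CRM and ERP fetchers, the column names) are hypothetical and do not correspond to any vendor's product or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


@dataclass
class VirtualTable:
    """A logical table whose rows stay in their source systems until queried."""
    name: str
    sources: List[Callable[[], Iterable[Row]]]  # one fetch function per physical source
    column_map: Dict[str, str] = field(default_factory=dict)  # source column -> corporate standard

    def scan(self) -> Iterator[Row]:
        # Pull rows from each source on demand and rename columns to the shared standard.
        for fetch in self.sources:
            for row in fetch():
                yield {self.column_map.get(col, col): val for col, val in row.items()}


class FabricCatalog:
    """Central registry: the single place where logical tables are defined and governed."""

    def __init__(self) -> None:
        self._tables: Dict[str, VirtualTable] = {}

    def register(self, table: VirtualTable) -> None:
        self._tables[table.name] = table

    def query(self, name: str, predicate: Callable[[Row], bool] = lambda row: True) -> List[Row]:
        return [row for row in self._tables[name].scan() if predicate(row)]


# Two hypothetical operational systems that name the same key differently.
def crm_customers() -> List[Row]:
    return [{"cust_id": 1, "region": "EMEA"}]


def erp_customers() -> List[Row]:
    return [{"customer_id": 2, "region": "APAC"}]


catalog = FabricCatalog()
catalog.register(
    VirtualTable("customers", [crm_customers, erp_customers], {"cust_id": "customer_id"})
)
emea = catalog.query("customers", lambda row: row["region"] == "EMEA")
```

The data never moves into a single repository; only the catalog and the column mapping are centralized, which is where the central authority's standards would be applied.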

Data Mesh: Cooperative Autonomy

Data Mesh, on the other hand, says that different subsets of data should be fully managed by teams within the business domains that work with it most. These teams should make the data available as event streams, tables or API-driven services, to other teams in other business units/domains, and should make them as easy to use as building blocks that can be combined with other data.

In her two major pieces outlining the Data Mesh architecture, Thoughtworks’ Zhamak Dehghani, who pioneered it, describes its rigors. Dehghani says the Data Mesh architecture is based on the principles of domain-oriented, decentralized data ownership and architecture; data as a product; self-service infrastructure as a platform; and federated computational governance. Furthermore, Dehghani says the data products produced by each domain-oriented team should be discoverable, addressable, trustworthy, and possess self-describing semantics and syntax. They should also be interoperable, secure and governed by global standards and access controls.

In other words, relatively small, cross-functional teams own the development, deployment and maintenance of all data assets belonging to their business domain. Domain data sets, services and APIs are developed with a product-driven mentality, putting emphasis on discoverability and usability. Consumers of the data sets represent the team’s customers; their levels of satisfaction and adoption constitute important metrics for the domain team’s success. Infrastructure implementation, provision and maintenance are centralized, as are governance standards and controls. The rest is under the control of business domain teams.
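To make the "data as a product" idea a little more concrete, here is a minimal sketch of how a domain team might describe and publish a data product so that it is discoverable, addressable and self-describing. The `DataProduct` and `ProductRegistry` classes, the field names and the example URL are all assumptions made for illustration; they are not taken from Dehghani's writing or from any particular platform.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataProduct:
    domain: str              # owning business domain
    name: str
    address: str             # stable location consumers can use (hypothetical URL)
    schema: Dict[str, str]   # column name -> type, so the product describes itself
    owner: str               # team accountable for quality and support
    description: str = ""


class ProductRegistry:
    """Shared, centrally provided catalog; the products themselves belong to the domains."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> str:
        key = f"{product.domain}.{product.name}"
        if key in self._products:
            raise ValueError(f"{key} is already published")
        self._products[key] = product
        return key

    def search(self, term: str) -> List[DataProduct]:
        # Simple discovery: match the term against product names and descriptions.
        term = term.lower()
        return [
            p for p in self._products.values()
            if term in p.name.lower() or term in p.description.lower()
        ]


registry = ProductRegistry()
orders = DataProduct(
    domain="sales",
    name="orders",
    address="https://data.example.com/sales/orders",
    schema={"order_id": "int", "customer_id": "int", "total": "decimal"},
    owner="sales-data-team",
    description="Confirmed customer orders, one row per order",
)
key = registry.publish(orders)
matches = registry.search("orders")
```

In this sketch the registry stands in for the centralized infrastructure, while everything inside each `DataProduct` remains under the control of the domain team.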

The thinking behind Data Mesh is very similar to that behind the service-oriented architecture (SOA) movement of the mid-2000s and the microservices of today. It asserts that tightly-coupled, monolithic architectures are brittle, lacking in agility and, ultimately, obsolete. Instead, it’s better to refactor analytics data into loosely-coupled building-block services that can be easily understood, adopted, used by developers and combined with other such services to create something of higher value.

Domain teams in the Data Mesh universe are similar to development teams in the software world, in that the latter are cross-functional and assume full responsibility for the software products they design, develop and deliver. On the downside, differences in implementation style, semantics and development approach, between dev teams and their codebases, can of course occur.

[Figure: Side-by-side list comparing Data Fabric and Data Mesh. Caption: Summary and observations of the Data Fabric and Data Mesh architectures]

Governance and Government

Essentially, the Data Mesh is a model comparable to a federated government, and the Data Fabric is one analogous to a centralized one. I don’t offer this geopolitical analogy just to be novel or clever, but rather in earnest. While the management of data and government aren’t the same, they both involve people, politics, competing interests, preferences, fears and pressure.

If we think about centralized governments, they are often resented by residents in the localities subject to their rule. Laws, taxes and regulations can feel arbitrarily imposed, ill-informed and sometimes unjust or at least impractical. It is for this very reason that regional autonomy is often granted by a central government. Think of the United Kingdom (UK), where Scotland, Northern Ireland and Wales have devolved parliaments. Or consider Spain, with its autonomous communities.

In the federated case, consider the United States, made up of 50 states and additional territories whose own judicial systems have primary authority over their residents. And, of course, the European Union (EU) is an even looser confederation of sovereign entities, which sometimes seems to operate more like an association than a government, per se. Now compare all these autonomous and federated governments to data meshes. While not a perfect analogy, it can help clarify the concept.

Sovereign and autonomous governments often have happier citizens. They may feel more in control and be more compliant with laws and expectations, given they were crafted locally. In some countries, autonomous governments conduct business in the region’s local dialect or language. In the case of Spain, while most citizens have full proficiency in Castilian Spanish, they may be more culturally comfortable speaking to the government in the same language they speak at home.

Similarly, organizations subscribing to the Data Mesh approach may have happier employees, who feel more in control of their own data, able to consume and use it more intuitively. And the semantics of domain-driven data services may feel more comfortable than working with some corporate-wide data model.

Essentially, citizens of autonomous regions get customized government and Data Mesh participants get customized data assets. Both provide comfort, nurture and respect. Both make tactical operation more straightforward, with less friction. Instead of a one-size-fits-all approach, things are more tailored, more sensitive and more…indigenous. This can lead to greater inclusion, more enthusiastic participation and richer dialog between government and citizenry, or between data assets and their users. This reduces friction and enhances political participation or technical adoption. It’s all good… to a point.

Trouble in Paradise

Autonomy can also create problems. For governments, conducting business in several languages can impede communication and multiple levels of government can lead to duplication of efforts, roles and responsibilities. Bureaucracy can increase, and operating in the broader national sphere can be harder, with ensuing economic impact. On the tech side, things are much the same. Co-existence of multiple, cross-functional teams can lead to duplication of efforts, incompatibilities and collective technical debt.

Worse yet, cultural autonomy and weak central governments can lead to efforts of outright secession. That has been the case with Catalonia in Spain and Quebec in Canada, for example. Even at the municipal level, this can come into play, as it did with Staten Island’s 1990s movement to secede from New York City. Secession can even have a cascading effect, as it did with the UK’s exit from the EU and Scotland’s subsequent bid for secession from the UK.

Counterparts for each of these phenomena exist in tech. If each business domain is in charge of its own data, multiple standards can emerge. Even if friendly APIs are established in all domains, their semantics may differ, causing so-called impedance mismatches as knowledge workers move from data set to data set across domains. And some APIs may follow such different semantics as to form siloed services, separated from the others. Suddenly, the devolved power and control that reduced friction at the domain level add friction at the organizational one. While the Data Mesh architecture calls for interoperability of domain services as a core tenet, that may become a neglected detail.
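The kind of semantic drift described above can be detected mechanically if domains publish their schemas. The following is a minimal sketch, assuming each domain exposes a simple mapping of field names to type labels; the function name, domain names and type labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def find_semantic_conflicts(
    schemas: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Return fields that different domains declare with different types.

    Input:  domain -> {field name -> type label}
    Output: field name -> {type label -> sorted list of domains using it},
            including only fields that have more than one type label.
    """
    seen: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for domain, schema in schemas.items():
        for field_name, field_type in schema.items():
            seen[field_name][field_type].append(domain)
    return {
        field_name: {t: sorted(domains) for t, domains in by_type.items()}
        for field_name, by_type in seen.items()
        if len(by_type) > 1
    }


# Three hypothetical domains that each made locally sensible choices.
domain_schemas = {
    "sales": {"customer_id": "int", "revenue": "decimal_usd"},
    "support": {"customer_id": "string", "opened_at": "timestamp"},
    "finance": {"customer_id": "int", "revenue": "decimal_eur"},
}
conflicts = find_semantic_conflicts(domain_schemas)
```

Here both `customer_id` and `revenue` would be flagged, while `opened_at` would not, since only one domain defines it. A check like this could serve as one small piece of federated governance: it does not dictate a model, but it surfaces incompatibilities before they harden into silos.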

Say what you will about the centralized model and its potential for tone-deafness and lack of suitability, but it does avoid situations like those mentioned above. Autonomy may be great, but anarchy is not. Proponents of the Data Mesh approach need to keep that in mind. Sometimes a central mandate to follow a common model is what’s needed to ensure compatibility. Such dictatorial mandates may seem regressive, but they may also be necessary, especially since, unlike many governments, companies are not democracies, functionally or even rhetorically. As enlightened as the Data Mesh architecture may seem, various elements of the Data Fabric approach may be more realistic and feasible. Realpolitik applies in corporate structures as well as political ones.

Now What?

Like so many things in technology, methodologies will be altered to suit particular customer circumstances and tastes and will be implemented with varying degrees of strictness. Even when a given methodology is philosophically compatible with a corporate culture, it may still be impossible to implement in an orthodox manner. Sometimes it’s best to establish an architecture as a baseline and an ideal, with the recognition that variance will occur and explicit permission for it to do so. Elements from other architectures may even be incorporated as modifications to the baseline. And that’s OK.

If an architecture is implemented dogmatically, it will most likely fail. In 2022, organizations must find approaches that work in harmony with the production and consumption of data in their organization and the reality of how different groups share, interact and compete. Operational data are simply point-in-time recordings of people, objects, processes and transactions. They will vary in structure and semantics as much as the teams producing them differ in their demeanor, business philosophy and practices.

Assembling and analyzing data across such groups can be as difficult as governing a diverse set of communities. Similarly, balancing the data and analytics needs of various business groups to the satisfaction of each will take patience, finesse, diplomacy and a willingness by the groups to reach positive cooperative outcomes.

2022 is the year this difficult work must start in earnest. Governments can’t just build and staff institutions; they have to lead, govern and make those institutions work for their constituents. Similarly, organizations can’t just build data infrastructure; they must facilitate the availability, use and analysis of data and mentor their employees in using it even for routine, workaday business decisions.

Enlightened architectures, empowerment and autonomy will help. But they must be grounded in the realities of enforcing compliance and compatibility while avoiding technical debt and fragmentation.

TNS owner Insight Partners is an investor in: Island.