Data Virtualization in the Context of the Data Mesh
The data mesh concept continues to pick up momentum as an approach in which domain-oriented teams own “data products” and have a self-serve data infrastructure platform that both delivers the data product to consumers and allows the data to be analyzed and consumed.
After spending a week in the newly created Data Mesh Learning community, I can report the consensus among experts is that a data fabric and a data mesh both provide an architecture to access data across multiple technologies and platforms; data fabrics are technology-centric, while a data mesh focuses on organizational change.
There is less agreement about data virtualization versus data fabric, but the former term usually refers only to the abstraction of storage across multiple locations. All of these approaches address the same pain points and throw around the word “platform” so much that it becomes meaningless, so it is unsurprising that there is confusion about definitions.
A new survey by Varada, an Israeli data management company that utilizes Presto SQL/Trino, provides a bit of semantic clarity. When 130 data virtualization users were asked how they define data virtualization, 64% said it is the ability to seamlessly connect to any data source or platform, 19% defined it as the ability to run any query without the need to model data, and 17% thought of it as a data lake query engine. If the question had provided more options or had been open-ended, the results might have been significantly different.
Users demonstrated what they truly believe about data virtualization when describing its purported benefits. Organizations with more than 10TB in their data lake or data warehouse say reducing and simplifying DataOps is the top benefit, but those with less than 10TB are more likely to focus on the ability to run all queries on a single platform and the ability to enable self-serve access for data consumers. Most of the organizations in the study had smaller data footprints.
Ori Reshef, Varada's vice president of products, believes data virtualization promotes data democratization by enabling anyone in the organization, subject to proper governance policies, to access any dataset in a data mesh. The end result is that more business units can monetize massive amounts of data. That is an optimistic viewpoint, but there are also a few obstacles, most notably that queries to the data platform need to be re-written for each and every domain-specific use case.
Data mesh calls for cross-functional teams organized around use cases. If each use case requires its own re-write of queries, would this make data virtualization impractical in organizations with cross-functional teams developing their own data pipelines? Does this just mean that the resources to fund the building and maintenance of decentralized data infrastructure platforms will need to be reallocated? How all these questions play out will impact how you assess emerging architectures for modern data infrastructure and the component selection when automating data pipelines.
- Data Mesh Paradox: An ability to address some aspects of data proliferation may result in other problems. Zhamak Dehghani believes data mesh addresses the “proliferation of sources of data, diversity of data use cases and users, and speed of response to change.” However, without a strong chief data officer, it may actually accelerate increases in the number of data pipelines, data platforms or data engineers.
- Diverse Data Pipeline Challenges: At 32%, combining data in motion with data at rest is the top challenge to building and deploying data pipelines according to 400 respondents in a survey conducted by Enterprise Management Associates and sponsored by Starburst Data. Since most of those respondents are trying to integrate static data in data warehouses with dynamically changing data, that’s not surprising. However, the other seven answer options were also cited by at least a fifth of respondents.
- Actual Time Benchmarks Don’t Appear That Bad: Time spent dealing with break-fix issues and manual coding are cited as challenges, but creating and deploying data pipelines takes much less time than creating and refining machine learning models. More than half (54%) of respondents in the EMA study create their average data pipeline in less than a day, with 48% saying it takes less than a day to make that pipeline operational in production. With data pipelines easier to create, and an ever-increasing number of data use cases, there are concerns about a shortage of data engineering resources to handle them.
- Fine Line between Storage, Platform and Pipeline: Half of the respondents in the aforementioned EMA study also have at least five data storage platforms in their data ecosystem. In that context, both data warehouses and data lakes are considered to be data storage platforms, but they are also data architectures. O’Reilly Media’s “The Modern Data Platform: Rise of the Data Lake,” a Databricks-sponsored ebook, surveyed 3,000 data professionals and takes the perspective that architecture is the best way to describe an offering that is commonly marketed as a data platform. This study found that operational complexity, data quality and data governance are all challenges cited by at least two-thirds of the respondents when more than one data architecture is used. From a business perspective, respondents say that business agility and team productivity are constrained by the architecture, but to what degree is this just the standard answer that data professionals have been giving for the last 20 years?
Feature image via Pixabay.