Addressing the Challenges of Real-Time Data Sharing
While conventional data warehouses and data lakes have become common practice for analytics workloads, they don’t solve the broader enterprise problems of sharing real-time operational data among departments or across companies. This three-part series explores the challenges and solutions that arise when integrating business data across different applications, clouds and organizations in a modern IT stack.
- Part 1 highlights the challenges of real-time data sharing, compares operational and analytical data, and reviews legacy solutions and their limitations.
- Part 2 defines the real-time data mesh and discusses the key tenets for incorporating it into modern IT stacks.
- Part 3 focuses on what’s needed to effectively evaluate real-time data-sharing solutions.
Canyon Spanning — The Foundational IT Challenge
One of the most enduring and foundational challenges for IT professionals regardless of their organization’s size or industry is getting data where it belongs. Computing and other types of workload processing, critical as they are, can’t even be contemplated if the data that drives them isn’t readily available. And while this problem has existed in some fashion since the emergence of commercial digital computing in the 1950s, structural trends — including SaaS-style applications, the emergence of public clouds and the commensurate need for multicloud strategies, and the increasingly complex, globe-spanning nature of business partnerships — have massively increased the scale and complexity of these data-sharing and “canyon-spanning” problems.
These “canyons” can take many forms in a modern IT stack:
- Multiple clouds — Whether through an explicit multicloud strategy or implicitly through years of M&A activities, companies are rarely “all in” on a single public cloud in perpetuity. Rather, the desire to select best-of-breed solutions at a service level and the fact that partners and subsidiaries often cross cloud boundaries mean that dealing with different clouds is essentially a given. Along with that realization comes the challenge of managing data transfer fees, multiple authentication and authorization techniques, and all the security challenges inherent in any cross-cloud data-sharing approach at scale.
- Multiple departments and companies — Sharing data across multiple organizations within a company as well as sharing data with business partners such as suppliers, logistics providers, joint manufacturers, distributors, SaaS vendors and others is a large and growing challenge for IT professionals. These different parties also present “boundaries” through varying security policies, assurance and regulatory programs, leading to the need for complex access controls, governance requirements and auditing regimes. Whether truly “decentralized” or not, creating a single source of truth among these parties is a significant IT challenge that only grows as data flees on-premises data centers and the control of central IT teams, heads toward SaaS vendors and spreads across partners and clouds.
- Multiple geographies and accounts — The combined needs of fault tolerance/high availability, low access latency and regionally segregated security boundaries require applications and their data to increasingly span multiple geographies and data centers, including erecting regional account barriers within public cloud solutions. These operational needs dramatically complicate IT solution development, turning conventional monolithic single-region solutions into complex distributed systems that need to span multiple regions and accounts with the requirement to continue operating even when one or more of those regions fail. Most IT teams are ill-equipped to deliver on these increasingly challenging platform considerations, and even outsourced custom development stretches budgets and increases delivery risk as requirements mount.
- Multiple applications — A large percentage of the operational data managed by IT departments has always existed in the form of first- and third-party applications. But as those applications increasingly migrate from on-premises, self-managed deployments to SaaS-delivered solutions, data that was easily accessible to (and under the control of) IT is pulled out into (largely) public cloud repositories managed by vendors. Solutions that worked in previous generations — built-in ERP-sharing solutions, EAI (enterprise application integration) products, and API-centric solutions like MuleSoft — are no longer viable in a cloud- and SaaS-based environment. Meanwhile, even the most modern SaaS-aware ETL solutions in the market target analytics solutions, not operational data sharing, leaving enterprises with few options to deal with their mission-critical operational data.
Data sharing, initially in the form of data warehouses and more recently through data lakes, is a well-known pattern to IT architects when applied to analytics data that drives business intelligence (BI), AI/ML (artificial intelligence/machine learning) model training and similar activities. Vendors such as Snowflake incorporate multicloud data sharing in their solutions, enabling IT professionals to more easily compose and share their analytics workloads.
However, data lakes represent only a fraction of the data under IT’s purview; in fact, the majority of the data stored in, transferred through and computed on by IT systems is actually operational data. Operational data differs from analytics data in several ways:
- Real-time — Operational data reflects a process in motion, such as airline travelers purchasing tickets, boarding planes or using loyalty points to purchase goods and services. Latency requirements for both reading and writing and the need for high levels of throughput are frequent “non-functional” requirements for operational data and the systems that produce and consume it.
- Application-centric data models — The data model or “shape” of operational data is dictated by the nature of the workload and/or by the specific application(s) that operate on the data, rather than by the nature of queries being performed on it.
- Fine-grained access controls — Assurance and regulatory programs, InfoSec and security policies, and partner relationships dictate which data can (or must) be shared and which data cannot be shared. Unlike analytics workloads, in which a dataset is typically made available (or kept private) in toto, operational data accessibility may literally vary “row by row.” Because it powers mission-critical applications, operational data is frequently highly sensitive, and the systems that manipulate it need to conform to the highest standards of security, compliance, governance and auditing as a result.
- Multiple readers and writers — Data lakes and other analytic solutions typically have highly structured patterns: either aggregating multiple sources of data into a single, combined data set or sharing a data set owned by a single company or department with one or more consumers. However, operational data can have highly complex “sharing topologies” in which multiple organizations contribute to and consume from an ever-evolving aggregate collection of data, as would be the case with a supply chain where multiple products, suppliers, manufacturers and distributors all operate in parallel on a “shared source of truth.”
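The last two properties, row-by-row visibility and multiple writers, can be made concrete with a small sketch. Everything here is hypothetical and for illustration only: the `SharedTable` and `Row` names, the party names, and the rule that only a row's creator may overwrite it are assumptions, not features of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Row:
    """One operational record; `readers` lists the parties allowed to see it."""
    key: str
    payload: dict
    owner: str
    readers: frozenset


@dataclass
class SharedTable:
    """A toy multi-writer table with per-row read permissions."""
    rows: dict = field(default_factory=dict)

    def write(self, party: str, key: str, payload: dict, readers: set) -> None:
        existing = self.rows.get(key)
        # Assumed policy: only the party that created a row may overwrite it.
        if existing is not None and existing.owner != party:
            raise PermissionError(f"{party} does not own row {key!r}")
        # The owner can always read its own row.
        self.rows[key] = Row(key, payload, party, frozenset(readers) | {party})

    def read(self, party: str) -> list:
        # Visibility is decided row by row, not for the dataset as a whole.
        return [r for r in self.rows.values() if party in r.readers]


# Three parties write into the same table, each choosing who may read its rows.
table = SharedTable()
table.write("supplier", "po-1", {"qty": 500}, readers={"manufacturer"})
table.write("manufacturer", "build-7", {"status": "started"}, readers={"distributor"})
table.write("distributor", "ship-3", {"eta_days": 4}, readers={"manufacturer", "supplier"})

visible_to_supplier = sorted(r.key for r in table.read("supplier"))
visible_to_distributor = sorted(r.key for r in table.read("distributor"))
print(visible_to_supplier)     # ['po-1', 'ship-3']
print(visible_to_distributor)  # ['build-7', 'ship-3']
```

Each party sees a different slice of the same shared data set, which is the behavior an all-or-nothing analytics share cannot express. A real system would also need authentication, auditing and durable storage, all of which this sketch leaves out.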
Interpreting the Varied Nomenclature around Real-Time Data Sharing
While operational data, and the need to share it, is ubiquitous, the fragmented nature of previous approaches means that there isn’t a clear, distinct set of terminology for either the problem or its solution. The data itself may be referred to variously as “real-time,” “operational,” “transactional,” “OLTP” (online transaction processing) or “application.” Aggregation solutions may be described as real-time data lakes, real-time data warehouses, real-time data-sharing solutions or real-time data meshes.
Older-generation approaches are often identified as EAI (enterprise application integration) and occasionally as EiPaaS (enterprise integration platform-as-a-service), or are named for the protocols they use (EDI, or electronic data interchange, and emerging industry-specific protocols such as FHIR).
Adjacent strategies include “multicloud” (or “cross-cloud” or “polycloud”) architectures and ETL/EL solutions (which may be described as SaaS or “no code”). “Reverse ETL” is a term sometimes used to describe using the results of calculations performed in a data lake to create a feedback loop that informs or updates an operational system (loosely speaking, an inverse of the more typical operational-to-analytics ETL flow).
Legacy Solutions and Their Limitations
Given the long history of operational data sharing in companies, it’s not surprising that a variety of approaches have been developed over the years. Most of these legacy approaches are artifacts of the period in which they were initially conceived. Below, we explore each of the traditional vendor categories and examine their shortcomings when they are applied to modern, usually cloud-based, workloads.
- EAI/EiPaaS approaches — Enterprise application integration (EAI), also known as enterprise integration platform-as-a-service (EiPaaS), was a classic approach to sharing data among applications. EAI approaches, such as Boomi or MuleSoft, attempt to extract and productize the integration platform itself, making it agnostic to the applications and their data models. These solutions and their associated toolsets can lower the cost of building solutions, but they don’t tackle the essential problem of point-to-point integration; namely, that every data source needs to be connected to every data sink (see Figure 1).
- ERP systems — Closely related to EAI systems, enterprise resource planning (ERP) solutions often include some type of “intra-application” integration and data-sharing capability that works among their components. Customers who have large, IT-spanning deployments of SAP or other ERP solutions can likely address a significant percentage of their “in-house” data-sharing and integration challenges with platform-native ERP features. The challenge with this approach, of course, is that it ends at the boundary of the ERP solution: If a business partner or recently acquired subsidiary hasn’t made the same choice in ERP solutions (or doesn’t even possess a classic ERP deployment at all), then the system’s ability to share data or integrate application components is effectively nullified (see Figure 2). These solutions, developed in a pre-cloud world, are also artifacts of their time, often requiring complex, platform-specific data models and limiting flexibility to adopt modern cloud services.
- Public cloud platforms — Public clouds, such as AWS, offer a wide variety of services from infrastructure rental (“Infrastructure-as-a-Service”) to fully managed solutions for databases, data storage, compute and more. Their breadth of offerings is one of their key advantages, but ironically, it is also one of their weaknesses. As a real-time data-sharing solution, they present two difficulties:
  - The heavy lift of turning a “box of Legos” suite of individual services into a secure, scalable solution capable of spanning the various gaps discussed above.
  - More concerning, they are, by their very nature, “walled gardens.” Simply put, it’s not in the best interest of a public cloud provider to simplify or streamline sharing data with any other public cloud. Everything from their feature sets to their data-transfer pricing models is designed to facilitate sharing data and integrating solutions among their own native services while making the use of competing services as difficult and expensive as is economically plausible.
- Classic data lake vendors — Vendors of conventional data lakes, such as Snowflake, Redshift and Databricks, offer powerful analytics platforms and ETL ecosystems. For column-oriented data sets, they can provide secure, compute-attached workloads that include basic unidirectional data sharing. While highly capable within their market position, these platforms are designed solely for analytics workloads: They lack the low latency, multi-writer support, fine-grained ACLs and other capabilities needed for real-time operational data sharing and application connectivity. Their connections to operational systems are generally limited to ETL-based ingestion connectors and occasionally “reverse ETL” solutions to harvest analytics for operational systems tuning and feedback loops.
- Legacy blockchains — Unlike point-to-point EAI and homegrown cloud solutions, blockchains offered a distinctly different approach when they came on the scene: A decentralized technology capable of creating a durable source of truth even while maintaining separate stores of data for each party (see Figure 3). Blockchains achieve a unique balancing act: Each party retains a self-governed, operationally isolated copy of the data but at the same time that data is kept consistent and up to date among all the parties with cryptographic guarantees. This bypasses two of the key limitations of earlier approaches: Constant polling in an attempt to “learn the latest” and the related challenge of building point-to-point solutions. Unlike these earlier approaches, blockchain-based solutions can’t “disagree about the facts.”
Unfortunately, the first generation of blockchains wasn’t operationally ready for enterprise use cases: Their high latency, low throughput, high costs, lack of scalability and fault tolerance, and complex infrastructure deployment and management overhead made them ill-suited to real-world use cases.
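To make the “cryptographic guarantees” mentioned above more concrete, here is a minimal sketch of the hash-chaining idea that lets separately held copies of a log be compared and checked. It is only an illustration under stated assumptions: the `Ledger` class, the `entry_hash` helper and the sample records are hypothetical, and the sketch omits the consensus protocol, signatures and networking that an actual blockchain requires.

```python
import hashlib
import json


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class Ledger:
    """An append-only log whose head hash summarizes its entire history."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries = []  # list of (record, hash) tuples
        self.head = self.GENESIS

    def append(self, record: dict) -> None:
        self.head = entry_hash(self.head, record)
        # Store a copy so each party's ledger holds its own data.
        self.entries.append((dict(record), self.head))

    def verify(self) -> bool:
        """Recompute every hash; False means the stored history was altered."""
        h = self.GENESIS
        for record, stored in self.entries:
            h = entry_hash(h, record)
            if h != stored:
                return False
        return True


records = [
    {"event": "order_placed", "sku": "A-100", "qty": 500},
    {"event": "order_shipped", "sku": "A-100", "qty": 500},
]

# Each party keeps its own, operationally isolated copy of the log.
supplier, distributor = Ledger(), Ledger()
for rec in records:
    supplier.append(rec)
    distributor.append(rec)

# Comparing one short hash is enough to confirm both copies agree.
in_sync = supplier.head == distributor.head

# Quietly editing one party's history breaks the chain and is detectable.
distributor.entries[0][0]["qty"] = 5
tamper_detected = not distributor.verify()
```

Because every hash depends on all earlier entries, two parties whose head hashes match hold identical histories, and an edit to any past record is exposed when the chain is re-verified.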
All of the above approaches suffer from inherent limitations when used to tackle real-time data-sharing challenges; none of them is an ideal real-time data mesh. An ideal solution would offer the single source of truth achievable with a blockchain but with the low latency, high throughput and fine-grained data controls more typical of an EAI-based solution, coupled with all the scalability and fault-tolerance benefits of a public cloud service.
In Part 2 of this series, we will explore how these elements can come together in a best-of-breed data mesh. We define the real-time data mesh and discuss the key tenets for incorporating it into modern IT stacks.