The Real-Time Data Mesh and Its Place in Modern IT Stacks
In Part 1 of this series, we highlighted the challenges of real-time data sharing, discussed operational vs. analytical data, and considered legacy solutions and their limitations. This post defines the real-time data mesh and discusses the key tenets for incorporating them into modern IT stacks.
Facilitating real-time data sharing is a challenging proposition, particularly when multicloud and SaaS applications are included as typical requirements. At the same time, these difficult implementation challenges are surprisingly undifferentiated: They don’t differ significantly across industries, sectors or company sizes, making data sharing ideal for a platform-based solution.
Best-of-breed solutions share with legacy blockchain approaches a key architectural insight: The job of maintaining a single source of truth across multiple parties belongs with the platform, not with each of the parties. This produces several advantages:
- The platform understands the data. Unlike Mulesoft and other “dumb pipe” solutions that are ignorant of the data being carried, having the platform understand the data means it can also ensure that data is consistent, correct and up to date everywhere it flows. This key difference shifts many of the conventional operational and security challenges of handling data from individual applications and IT teams back onto the platform provider.
- The platform can offer a uniform way to model data and access controls. Almost as important as sharing data is ensuring that the wrong data doesn’t get shared. When every party and system require a DIY approach to security, access controls, auditing, compliance and governance, those problems take on a life of their own, becoming heavy lifts that can dwarf the original problem of simply moving data between parties. Letting a platform accomplish this not only shifts the burden of implementation (and spreads it among many customers, making it cost effective to produce and maintain), it ensures that the approach is uniform across all parties and deployments.
Unlike early blockchains, which were essentially commercialized prototypes, modern data mesh offerings are based on solid public cloud engineering. They share the same multitenanted, highly scalable designs as highly adopted public cloud services and exploit modern interfaces, including GraphQL APIs and container-based code sharing. These advances in engineering and architectural patterns have allowed “second-generation” approaches to solve the issues that plagued early (and usually failed) attempts to deploy blockchain technologies in enterprise settings:
- The platform is highly scalable and low latency. Blockchains are plagued by poor performance, with public chains like Ethereum struggling to maintain 14 transactions per second worldwide, shared among all customers! Transaction settle time can approach 15 minutes or longer, and the cost per transaction can be as high as $50 or more. Even private chains, such as Hyperledger Fabric suffer from “one-box deployment” models — unlike a cloud-based service, they are forever, fundamentally limited by the processing and memory capacity of a single server and at most a few cores within that server. That doesn’t bode well for any enterprise workload that needs to scale.
- The platform is delivered as a SaaS service. First-generation blockchains were a throwback to early approaches in more ways than one: Not only do their “single-box deployment” models make them slow with limited throughput, this limitation also means they have to be manually deployed, maintained, monitored, secured, scaled, made fault-tolerant, etc. That’s a huge burden to an already strapped IT team and only adds to the burden of infrastructure management and staffing load. By contrast, next-generation data-sharing solutions are commonly provided as SaaS services, with zero infrastructure footprint and “ilities” such as scaling, fault tolerance, monitoring, infrastructure maintenance, etc., owned by the service provider, rather than left as an exercise to the IT customer.
Why Is Data Sharing so Difficult?
Despite being a ubiquitous need, real-time data sharing isn’t always a well-modeled element in existing IT stacks. Gartner echoes this thought, “IT and business-oriented roles … adopt EiPaaS as a key component of their integration strategy … [but] despite its mainstream use, choices of providers are fragmented and difficult to navigate.” It’s an intriguing question: Why should that be?
The answer lies in the structural shifts our industry is undergoing. “Classic” IT had a relatively simple problem to solve:
- Data, such as sales information, was produced in-house.
- Workloads to process that data, such as calculating accounts receivable and producing and sending invoices, were run in-house over the collected data.
- The data was optionally collected and shipped to an analytics subsystem for offline analysis to produce business intelligence reports for management.
In other words, both production and consumption of data, along with any transmission or “sharing”, was handled in-house — often within the confines of a single mainframe. Whether built in-house, through outsourced delivery partners, or provided via ERP systems, these “data monoliths” were, despite their other challenges, relatively easy to manage from a sharing perspective.
Flash forward to today:
- SaaS vendors increasingly pull data away from central IT and into their own (usually public cloud-based) storage and compute solutions.
- Business partnerships, such as supply chains, are increasingly global, meaning that more and more of the “data of interest” to a company lives outside its four walls. Amazon, for example, estimates that up to 80% of critical business data no longer resides internally.
- Adoption of public clouds in general and multicloud architectures, in particular, require a wholesale shift of data outside of conventional on-premises data centers, often into fully managed services or specialized cloud databases, such as NoSQL, time-series or graph-optimized offerings on AWS, Azure or GCP.
- Customer demands for “always-on” internet-era experiences mean that applications that used to be “9-to-5, Monday-to-Friday” SLAs are now 24x7x365 with 99.99% uptime requirements, a threshold that forces IT teams to design and deploy leading-edge approaches to scalability, fault tolerance and multicloud resiliency for virtually everything with public or partner surface area. That’s an incredibly tall order for teams already struggling to meet business needs and potentially with limited in-house knowledge of advanced distributed systems design techniques.
- “Shadow IT” has forever fragmented the notion of a single team operating a single mainframe into a distributed patchwork of applications, teams and approaches that is challenging to manage even for the most well-run of the Fortune 100. Approaches that embed security, scalability, fault tolerance and other “ilities” and governance models directly into the product or service offering thus confer a huge advantage over approaches that make those challenges DIY, because DIY in a shadow IT org usually implies unbridled heterogeneity and an increasingly chaotic portfolio over time.
With all these structural changes, it’s easy to see why ERP systems developed in the ‘90s, and even EAI approaches that worked fine in the 2000s are no longer able to satisfy the needs of companies and their IT demands: The challenge of disparate data isn’t something they had to worry about, and as a result, they’re ill-equipped to deliver on modern IT experiences in data sharing.
Incorporating Real-Time Data Mesh Solutions into a Modern IT Stack
Because of the challenges cited above, even high-functioning IT teams don’t necessarily have a strong “recipe” for incorporating real-time data sharing into their approach in a uniform, best-practice fashion. This section briefly surveys three deployment approaches with increasing levels of capability and complexity to provide an overview of how these platforms can be incorporated into modern, service-based IT portfolios.
The simplest deployment approaches are those where the data model and connectivity is tied directly to an existing SaaS-based domain, such as sharing marketing or sales information between CRM systems for co-selling purposes. Because the domain is well known in these cases, there is little to no data modeling challenge, and because the systems of record (Salesforce, Microsoft Dynamics) are well known, connectivity is equally easy, usually limited to authorizing the platform against the systems in question.
Setup and configuration typically can be done inside of a week and involves:
- Field name alignment among parties (usually two or more departments or business partners).
- Configuring access controls to ensure that only the authorized data is shared with the right parties.
Figure 1 illustrates a typical application-based deployment using CRM contact sharing as an example.
Application-based solutions are simplified by a shared domain model, such as CRM data, but are able to connect different SaaS vendors across different organizations, even among multiple companies. They represent substantial leverage over point-to-point API-based data-sharing solutions that require building and operating a full, security-hardened and compliant integration between every pair of parties involved.
Because both the applications being connected and the underlying platform are all SaaS-based, there is no infrastructure to deploy or complex data modeling to perform, and deployments can move from prototyping to testing to production in the space of weeks rather than months or years. For teams already familiar with ETL “data exhaust” from these applications, the design pattern is identical, making deployment even more efficient because similar patterns of authorization and enablement can be followed.
This pattern can be easily repeated for other SaaS applications and takes advantage of the industrywide trend toward SaaS: Eventually, every major SaaS application will have real-time data connectors that simplify the sharing of data with similar applications across departmental, cloud or organization lines.
The design is also open-ended: “Hybrid” deployments can take advantage of the simplicity of connection to a SaaS system, such as a CRM provider like Salesforce, while also connecting (internally or through a partner’s implementation) to in-house applications (see Figure 2). This flexibility supports custom development of mission-critical applications without giving up the advantages of simple data connectivity to existing systems.
(For more on fully modeled solutions and their deployments, see below.)
The next step toward custom development is file-based sharing. This pattern shares with application-based sharing the advantage of not requiring the construction of a data model: The data model is essentially just a file system shared among the various parties. File-based approaches are more flexible than pure application-based solutions, however, because they can leverage legacy formats. Many existing cross-company data-sharing solutions are based on files, and a file-based sharing approach is a simple way to maintain compatibility while simultaneously progressing toward a modern data-sharing solution for real-time data needs. Figure 3 illustrates migrating from an sFTP-based “file depot” solution to a real-time data-sharing pattern based on files while preserving existing file formats and application-processing logic.
As with the application-based approach described above, access controls are critical: Each party needs to define, for the files it authors, which other parties should receive the data. In addition, files can be large, and best-of-breed platforms will actually distinguish between sharing the data and copying the data. This additional dimension of control allows the members of a data-sharing arrangement, whether they’re two regional deployments in an application, multiple organizations within a single company or multiple companies with a shared workload (such as a supply chain) to decide how many copies of a file are warranted. Copying controls allow parties to balance the cost of making copies with the operational isolation that “having your own copy” naturally affords.
Real-time data mesh offerings also provide versioning, lineage (who changed what and when), built-in auditing, logging and reporting capabilities. These are essential for governing file-sharing systems over time and at scale; otherwise, the sheer weight of building appropriate compliance and security reporting can overwhelm already taxed teams. The more parties involved and the more “arm’s length” they are from each other, the more critical fine-grained access controls (and commensurate reporting, versioning and auditing capabilities) become. Legacy blockchains and “walled garden” ERP and EAI solutions typically fail at this level of complexity, because they don’t easily provide simple file-sharing capabilities, coupled to production-grade security and versioning controls.
The best file-sharing platforms also provide backward compatibility with existing public cloud blog storage APIs. This compatibility enables existing investments in popular cloud service APIs, such as AWS’s S3, to be preserved intact while still offering seamless data sharing across both organizations and with other clouds. Having cloud-based portability for files built-in means that file-sharing solutions can also be used in-house to create multiregion, multi-account and multicloud strategies with just a few lines of configuration code, rather than the months or years of planning and development usually mandated for a complex “cross-cloud” data-sharing platform.
File-sharing solutions are easily extended to incrementally incorporate additional fine-grained data modeling. This optional process can proceed in graduated steps:
- Attaching simple key/value metadata to files (no real data model, but it allows for incorporating fine-grained “scalar” data).
- Selectively adding a data schema in parallel with the file data.
- Migrating file-based formats to scalar formats, often using the foundation laid in Step 2.
Even for teams that want to adopt fully modeled solutions (see below), file-based approaches can be an easy on-ramp, as they often permit existing application workloads and file formats to remain unchanged in the initial stages of adopting a real-time data mesh framework.
Fully Modeled Data Solutions
The “holy grail” of real-time data sharing is a fine-grained data model capable of automatically powering secure, scalable public APIs. While this approach requires having in hand a data model (also known as a data schema) acceptable to all the parties involved, from there the platform can take over: Modern platform approaches such as Vendia’s can generate APIs automatically, using nothing more than the data model itself. This includes not just sharing current data, but also versioning (“time travel” access to older versions of the data) and lineage/auditing (access to information about “who did what and when,” which is needed to create compliant end-to-end solutions that third parties can successfully audit). Figure 4 illustrates a fully modeled, fine-grained data-sharing architecture among multiple parties.
As discussed above, sharing data is only half the battle: Just as it’s important to get data swiftly from one party to another, it’s important to ensure that only the right data is shared. Access controls, governance mechanism and fully auditable tracing of these settings are key requirements, not just for enterprises but for any company operating in an environment where accidental sharing of personal data makes headlines. Fine-grained data models also provide a natural framework on which to “hang” metadata such as access controls, indexing requirements, and other operational and security annotations, allowing the platform to compile them automatically into a complete, SaaS-delivered solution.
Real-time data mesh solutions don’t make challenges like authorization or authentication harder, but they do emphasize the inherent heterogeneity and security challenges associated with connecting clients that may vary dramatically from party to party. For example, one party might ingest data from a public cloud service and require a cloud native identity and access-control solution, while another party may have elected to distribute shared data to a mobile app running on millions of handheld devices. A successful platform needs to embrace, rather than bypass, these differences by supporting a variety of authentication and authorization mechanisms that can be customized on a per-party basis. As important as a shared-data and governance model is, allowing and supporting required differences among parties is equally critical.
Going Further — Schema Evolution
Business relationships are constantly changing. Business needs, and the data that powers them, is constantly evolving. So to be successful, a real-time data mesh needs to model the “sharing topology” as a first-class element and make it both simple and safe to evolve the data model over time to match the needs of the business.
Successful real-time data meshes incorporate both of these features: The parties sharing data, whether they represent multiple companies, different organizations within a single company, different cloud vendors, multiple SaaS applications, regional deployments or any combination thereof, need to be easy to capture and represent using configuration, rather than requiring complex code or tooling. The data model itself needs to be represented in a standards-based format, not a proprietary representation that could lead to a “walled garden” problem down the road with the ability to augment or alter it in controlled ways over time. By generating APIs and other infrastructure automatically from the data model, the platform can also guarantee backward compatibility for clients, ensuring that as the data model evolves, applications and other parties aren’t left broken and unable to continue sharing data effectively.
Once a deployment strategy has been elected, how can an IT organization perform an effective vendor selection process? The next article provides a methodology for vendor consideration that incorporates the requirements exposed by these design strategies to assist in locating a best-of-breed platform.
Vendia and Real-Time Data Meshes
Looking to learn more about real-time data meshes or their integration with analytical data solutions? The Vendia blog has a number of articles, including how these features surface in modern applications and get exposed through data-aware APIs.
In Part 3 of this series, we provide a vendor checklist that focuses on what’s needed to effectively evaluate real-time data-sharing solutions.