Real-Time Data Access Across Highly Distributed Environments
The goal is straightforward, but getting there has proven to be a challenge: how to offer real- or near real-time access to data that is continually refreshed on an as-needed basis across a number of different distributed environments.
As systems of data and their locations proliferate across network environments (including multicloud and on-premises deployments and, in many cases, multiple geographic zones), organizations can struggle to maintain low-latency connections to the data their applications require. The challenges are especially acute when users expect, and increasingly demand, experiences that are often transactional and that must be delivered in real or near real time with data-intensive backend support.
Many organizations continue to struggle to maintain low-latency connections between multicloud and on-premises environments, whether they rely on data streaming or on other approaches such as so-called “speed layers” backed by in-memory caches.
In this article, we describe the different components necessary to maintain asynchronously updated data sources consisting of different systems of record for which real-time access is essential for the end-user experience.
For the CIO, the challenge is to give applications low-latency access to data that is often dispersed across highly distributed sources and systems of record.
“The CIO often asks ‘how do I get my data out of my existing environment, how will I get it to the cloud and then how can I have it run as a managed service,’” Allen Terleto, field CTO at Redis Labs, told The New Stack. “A lot of times, especially in healthcare and financial services, the CIO will ask ‘how do I maintain consistency with on-prem, because actually, my systems of record are still going to remain there and I made huge investments towards that ecosystem and enterprise integrations that I don’t want to lose.’ The question is how to maintain the entire data pipeline but still expose new digital channels through the public cloud.”
As many enterprises continue to maintain significant on-prem and legacy infrastructure, organizations are typically looking for ways to seamlessly integrate their data across all environments to build next-gen real-time applications, Priya Balakrishnan, director of product marketing for Confluent, told The New Stack. “The challenge with data streaming across cloud and on-premises environments is facilitating real-time data interoperability across a complex web of disparate technology stacks and disparate environments,” she said. “While specialized tools enable migration of data from on-prem to the cloud, many don’t offer a persistent bridge to keep their data in sync at all times.”
Consistency is also key. An organization can offer, for example, a high-powered cloud application, but data access can remain an Achilles’ heel if certain datasets are not kept up to date asynchronously or if the microservices are not as secure and reliable as they need to be. “Everything needs to be held consistent,” Terleto said. “If the inventory is not consistent and I go on the website and it says five hammers are available and then go to the store and there are only three left, I’m not a happy customer.”
A Digital Integration Hub (DIH) Cometh
An emerging alternative for organizations developing end-user applications that require real- or near real-time access to data is what Gartner originally called a digital integration hub (DIH). According to Gartner’s definition, a DIH aggregates data from different systems of record into a “low-latency high-performance data layer, usually accessible via APIs.” Developers can thus use a DIH to create apps for which ultra-low-latency API access to different data sources is key, without having to manage the synchronization between the applications, the cache and the systems of record.
“Faced with competition from disruptors in their domain and with the shift in customer expectations toward extreme user experience, organizations realize that they need to build an architecture that allows continuous and fast introduction of data-centric services over their digital channels,” Eti Gwirtz, vice president, product, for GigaSpaces, a digital in-memory DIH services provider, told The New Stack. “This architecture should allow event-driven data freshness and 24/7 API-enabled data accessibility to satisfy ever-growing business demands.”
Typical adopters of DIH include both cloud-based startups and traditional industries, such as banking, that hold a significant amount of data on traditional, often decades-old infrastructure that must be modernized in order to offer asynchronous data connections. “DIH is where anyone obviously in the banking world and any other sort of traditional industry is heading at this time,” Gwirtz said.
Having a “speed layer” (where computed results are held and cached, and then updated once new data arrives) closer to the point of usage can help local applications continue to function even when streaming messages are delayed, Robert Walters, senior product manager at MongoDB, told The New Stack. “This architecture allows for systems to be more loosely coupled but still connected and can help alleviate network and data latency challenges for real-time use cases,” Walters said. “The key for the speed layer between the application and the data stream must be a technology that can store and process all varieties of data.”
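As a rough illustration of the speed-layer idea, the Python sketch below keeps a computed result (a running total per key) in a local cache, folds in stream events as they arrive, and serves reads from the cache without waiting on the stream. The `SpeedLayer` class, its fields and the account key are hypothetical names chosen for this example; they do not describe MongoDB's or any other vendor's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SpeedLayer:
    """Hypothetical local cache of computed results (running totals per key)."""
    totals: dict = field(default_factory=dict)
    last_event_id: int = 0  # highest stream event folded into the cache so far

    def apply(self, event_id: int, key: str, amount: float) -> None:
        # Skip events that were already applied (for example, redelivered messages).
        if event_id <= self.last_event_id:
            return
        self.totals[key] = self.totals.get(key, 0.0) + amount
        self.last_event_id = event_id

    def read(self, key: str) -> float:
        # Reads never wait on the stream; they return the latest cached result.
        return self.totals.get(key, 0.0)


layer = SpeedLayer()
layer.apply(1, "account-42", 25.0)
layer.apply(2, "account-42", 10.0)

# Suppose the stream stalls here: the application can still read the cached total.
cached_total = layer.read("account-42")

# When the delayed message finally arrives, the cached result is updated.
layer.apply(3, "account-42", 5.0)
updated_total = layer.read("account-42")
```

The trade-off in this sketch is that a read during a stall returns a slightly stale value rather than blocking, which is acceptable only when the use case tolerates it.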
Whether network latency is acceptable for an application also depends on the use case, Walters said. For example, a business report might be generated every night at midnight from data in the cloud, for which very low-latency connections are not necessarily critical. Where more real-time data is needed, as in a fraud detection system, the application is much more sensitive to network challenges, Walters said. “With fraud detection, you want to analyze transactions in real-time and compare them with geodata and previous spending patterns to determine if the financial account has been compromised,” Walters said. “In this use case, you need data from these systems to be available in near real-time, and thus, network latency will be an essential consideration.”
Data must be able to stream to and from a digital integration hub through multiple data pipelines that support not just databases but also different types of microservices and other sources, whether Kafka streams, IoT edge devices or any other potential data source. “The data can feed into multiple pipelines that can write into one or more targets. A single pipeline can also read from multiple sources at the same time,” Steve Wilkes, CTO and co-founder of real-time data-integration provider Striim, said. “We have examples where customers are reading from multiple databases or a combination of databases and files, and are combining that data in real-time within the platform, and correlating and pushing the end result into some target or delivery into multiple targets,” Wilkes said.
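The pattern Wilkes describes, one pipeline reading from several sources, correlating the records and delivering the result to several targets, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Striim's API: the source lists, the `order_id` correlation key, the `run_pipeline` function and the in-memory "targets" are all invented for illustration, and each source is assumed to be already ordered by timestamp.

```python
import heapq

# Hypothetical sources, each already ordered by its timestamp field "ts".
db_changes = [
    {"ts": 1, "source": "orders_db", "order_id": "A1", "status": "created"},
    {"ts": 4, "source": "orders_db", "order_id": "A1", "status": "paid"},
]
file_records = [
    {"ts": 2, "source": "shipping_file", "order_id": "A1", "carrier": "ACME"},
]


def run_pipeline(sources, targets):
    """Read several ordered sources as one stream, correlate by order_id,
    and write each combined record to every target."""
    state = {}  # latest combined view per order_id
    for event in heapq.merge(*sources, key=lambda e: e["ts"]):
        combined = state.setdefault(event["order_id"], {"order_id": event["order_id"]})
        # Fold the event's business fields into the combined view.
        combined.update({k: v for k, v in event.items() if k not in ("ts", "source")})
        for target in targets:
            target.append(dict(combined))  # copy so each target holds its own snapshot


warehouse, cache = [], []  # stand-ins for two delivery targets
run_pipeline([db_changes, file_records], [warehouse, cache])
```

After the run, both targets hold the same sequence of snapshots, ending with a record that combines the order status from the database with the carrier from the file.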
Change Data Capture
However, in-memory caches have limitations as a way to keep up with database updates. Without the ability to update caches in real time across an organization’s distributed environments, problems can arise. An inventory database held in a cache that is only partially updated, for example, can cause problems at different points of sale or in on-premises environments when a device is listed as available but is, in fact, out of stock.
This is where change data capture (CDC) enters the picture. As Striim’s Wilkes stated, change data capture is a technology that can collect database activity such as inserts, updates, deletes, and relevant schema change (DDL) statements and turn them into a stream of events. A platform utilizing CDC needs to be completely change-aware, be capable of processing or modifying the events, and be able to apply those changes to a target database, data warehouse, or other technology.
Change data capture solves, to some degree, the cache problem by collecting changes from the underlying database associated with a cache and updating the cache as necessary whenever data changes, Wilkes explained.
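A minimal Python sketch of that idea follows: change events captured from a source database are replayed, in order, against a cache so the cache tracks the database. The event shape (`op`, `key`, `row`) and the `apply_change` function are assumptions made for this example rather than the format of any particular CDC product, and schema-change (DDL) events are deliberately left out.

```python
def apply_change(cache: dict, event: dict) -> None:
    """Apply one captured change event to the cache."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        cache[key] = event["row"]      # upsert the latest row image
    elif op == "delete":
        cache.pop(key, None)           # tolerate deletes for keys never cached
    else:
        raise ValueError(f"unsupported operation: {op}")


# Hypothetical stream of changes captured from an inventory table.
change_stream = [
    {"op": "insert", "key": "hammer", "row": {"qty": 5}},
    {"op": "insert", "key": "drill", "row": {"qty": 2}},
    {"op": "update", "key": "hammer", "row": {"qty": 3}},
    {"op": "delete", "key": "drill"},
]

inventory_cache = {}
for event in change_stream:
    apply_change(inventory_cache, event)
```

Once the stream has been applied, the cache reports three hammers and no drills, matching the source table instead of a stale snapshot.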
The main connecting thread for data streaming, CDC and distributed database applications in general is microservices. The vast majority of organizations rely on microservices to play a key role in low-latency connections supporting applications across highly distributed environments. According to a Redis Labs survey conducted with analyst firm IDC this year, 84% of the respondents reported using a key-value or NoSQL database for their microservices applications.
“It’s no longer — in terms of a maturity or adoption model — about whether or not companies are using microservices — it’s a matter of scale,” Terleto said. “Microservices are the architecture of the day and they will continue to expand within enterprises as they get over the hump of how they work, what are the proper patterns that we should be using and how do we actually scale this in a feasible way.”
When propagating data with CDC technologies, for example, microservices will almost inevitably be used, Terleto said. By design, microservices must be usable in isolation, consistent and decoupled.
“The whole point is to get to market faster and to not have coupling across all of your microservices, so that way you’re not stuck to a single enterprise release cycle,” Terleto said. “Leveraging a streaming technology can also act as a backbone across the entire microservices architecture, allowing them to communicate asynchronously and provide eventual consistency.”
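To make the idea of asynchronous communication with eventual consistency concrete, here is a small Python sketch. An append-only log stands in for the streaming platform, a producer publishes an event and moves on, and a consumer service catches up at its own pace. The `EventLog` and `InventoryService` classes are hypothetical and far simpler than a real streaming system, which would also handle partitioning, durability and delivery guarantees.

```python
class EventLog:
    """Hypothetical append-only log standing in for a streaming platform."""

    def __init__(self):
        self._events = []

    def publish(self, event: dict) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> list:
        return self._events[offset:]


class InventoryService:
    """Consumes order events at its own pace and knows nothing about the producer."""

    def __init__(self, log: EventLog, stock: dict):
        self.log = log
        self.stock = dict(stock)
        self.offset = 0  # position of the next unread event in the log

    def catch_up(self) -> None:
        for event in self.log.read_from(self.offset):
            self.stock[event["item"]] -= event["qty"]
            self.offset += 1


log = EventLog()
inventory = InventoryService(log, {"hammer": 5})

# The order service publishes and moves on without waiting for any consumer.
log.publish({"item": "hammer", "qty": 2})
stale_stock = inventory.stock["hammer"]   # consumer has not processed the event yet

inventory.catch_up()
fresh_stock = inventory.stock["hammer"]   # views converge once the consumer catches up
```

Between the publish and the catch-up, the inventory view is temporarily out of date; it converges as soon as the consumer has read the log up to the latest event.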