When it comes to storage support in production environments, solutions that meet the needs of businesses, and so enable widespread adoption of cloud native technologies, have been slower to arrive. Data, the beating heart of stateful workloads, seems to be an insurmountable roadblock on the way to realizing the anytime/anywhere vision for workloads that carry a significant amount of state. To stretch the Kubernetes ship metaphor (beyond recognition, perhaps), cloud native for stateful workloads is a boat stranded on the shoals of data gravity. It’s all but impossible to move workloads about freely when they are tethered to large amounts of data.
To better understand the gaps that vendors and the community will need to bridge in order to deliver on the promise of cloud native for stateful workloads running in production, it’s important to recognize the multicluster/multidata center/hybrid-cloud/multicloud reality that is endemic to customer IT departments.
The reasons for this balkanized world are myriad. At the most basic level are different Kubernetes clusters that arise from such things as the separation of dev/test, staging, and production environments, or from clusters for dedicated workloads. At a higher level we see customers whose private clouds span multiple data centers for reasons of DR or locality. And, of course, many customers that have their own data centers also have a footprint in the public cloud to take advantage of IaaS and PaaS services, creating hybrid clouds. Finally, as larger customers increase their IT footprints in the public cloud, we’re seeing increasing numbers adopt multicloud strategies to address issues such as vendor lock-in, dual vendor policies, regulatory constraints, cost management and locality.
To make things more complicated, there is no doubt that in this many-clouds-many-clusters world, significant interaction between these cloud native silos is critical to increasing the effectiveness of these IT investments and maximizing business agility. Some of these issues have garnered significant attention from major industry players. Google’s Anthos, Microsoft’s Azure Arc, and VMware’s Tanzu are some of the more notable efforts to provide customers with a single control plane and a single Kubernetes-based IT environment across all their clouds, private and public.
Still, businesses find it difficult to gain agility and wring business value from their sprawling cloud native infrastructure. Close examination reveals that common customer usage scenarios are inadequately addressed today. A prime example is customers who want to temporarily expand capacity for workloads by leveraging resources from the public cloud in anticipation of higher peak loads (“cloud bursting”). This might happen when a new product is launched, or during a holiday sales season. The pain point is even more acute when the number of workloads that need to be moved simultaneously is large, as in cases where customers want to load balance workloads across private data centers or to migrate between public cloud providers to take advantage of cost efficiencies and new technologies. Urgent regulatory intervention and other types of unexpected crises also require agility and the ability to move large numbers of workloads quickly.
Moving cloud native stateless applications between data centers and public clouds is well supported today; it is simple and fast. Moving stateful workloads that have significant amounts of storage is anything but. Careful planning, flawless execution, and significant lead times are prerequisites for success. Business agility? Forget about it! Data gravity guarantees that nothing will be quick or easy. The heart of the problem is, of course, WAN connectivity between the different clouds and data centers, with its limited bandwidth, high latency and high cost. The problem is compounded when the amount of data involved is large; bear in mind that copying a TB of data over the LAN is no fun either. Moving a single stateful workload with a large amount of data is difficult even when it is allocated a disproportionate amount of WAN bandwidth. When several workloads need to be moved simultaneously to enable business agility, the problem becomes physically impossible, as there simply isn’t enough bandwidth to run the jobs in parallel.
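To put rough numbers on the bandwidth argument, here is a minimal back-of-the-envelope sketch in Python. The function names, the 80% link-efficiency factor and the example sizes are illustrative assumptions, not measurements from any real environment.

```python
from typing import Iterable


def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to copy data_tb (decimal terabytes) over a link_gbps link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable after protocol overhead and competing traffic.
    """
    bits = data_tb * 1e12 * 8                  # terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return bits / usable_bps / 3600


def shared_link_hours(workloads_tb: Iterable[float], link_gbps: float,
                      efficiency: float = 0.8) -> float:
    """Time until ALL workloads have arrived when they share one link.

    Running the copies in parallel does not help: the link still has to
    carry the sum of all the bytes.
    """
    return transfer_hours(sum(workloads_tb), link_gbps, efficiency)


if __name__ == "__main__":
    print(f"one 10 TB workload on 1 Gbps: {transfer_hours(10, 1):.1f} h")
    print(f"five 10 TB workloads on 1 Gbps: {shared_link_hours([10] * 5, 1):.1f} h")
```

Under these assumptions a single 10 TB workload needs roughly 28 hours on a dedicated 1 Gbps link, and five such workloads sharing that link need about 139 hours no matter how the jobs are scheduled.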
The problem is further complicated by the existing technologies available for moving data between clouds and data centers across the WAN. Modern versions of what amounts to a Copy command will copy data based on volume snapshots as exemplified in the Linux Foundation’s OpenSDS project. The problem with copying snapshots is that the snapshot is out of sync with the production data as soon as it is taken and still needs time to get across the WAN. A complex process of incremental snapshots where an application is taken offline before the final snapshot can be copied, followed by an assembly process on the receiving end is required before the application can be brought up at the destination.
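The incremental-snapshot process described above can be modeled with a small Python sketch. It is a simplification under stated assumptions: the change rate is constant, each round ships exactly the data that changed during the previous round, and the parameter names and the cutover threshold are hypothetical. It does not represent the behavior of OpenSDS or any other product.

```python
from typing import List, Tuple


def snapshot_rounds(volume_gb: float, change_gb_per_hour: float,
                    link_gb_per_hour: float, cutover_gb: float = 1.0,
                    max_rounds: int = 50) -> Tuple[List[Tuple[float, float]], float]:
    """Simulate copying a full snapshot followed by incremental snapshots.

    Returns (rounds, pending_gb): rounds is a list of (gb_sent, hours) per
    online round, and pending_gb is the delta left for the final snapshot,
    which must be copied while the application is offline.
    """
    if change_gb_per_hour >= link_gb_per_hour:
        # The data changes faster than the link can ship it, so the
        # deltas never shrink and an online catch-up cannot converge.
        raise ValueError("change rate exceeds link throughput")
    rounds: List[Tuple[float, float]] = []
    pending = volume_gb                      # the first round is the full snapshot
    while pending > cutover_gb and len(rounds) < max_rounds:
        hours = pending / link_gb_per_hour   # time to ship this snapshot
        rounds.append((pending, hours))
        pending = change_gb_per_hour * hours  # data written meanwhile
    return rounds, pending


if __name__ == "__main__":
    rounds, final_delta = snapshot_rounds(1000, 10, 100, cutover_gb=5)
    for number, (gb, hours) in enumerate(rounds, start=1):
        print(f"round {number}: {gb:.1f} GB in {hours:.1f} h")
    print(f"offline cutover delta: {final_delta:.1f} GB")
```

With a 1,000 GB volume that changes at 10 GB per hour and a link that moves 100 GB per hour, this model needs three online rounds totaling about 11 hours before a final delta of about 1 GB can be copied with the application offline.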
When the time and effort necessary to plan and flawlessly execute this process, not to mention the advance notice required by the IT department, are taken into account, the picture that emerges is of a terribly suboptimal business solution. Technologies such as DR and real-time data replication can also be used to move large volumes of data over the WAN but suffer from similar shortcomings: all data has to be copied across the WAN before the application can be started on the destination side. Additionally, DR and replication products are generally expensive in both software and bandwidth costs, as they tend to be bandwidth-hungry. The end result is inescapable: game, set and match to data gravity.
If existing technical solutions are inadequate to address the need to move stateful cloud native workloads between clouds and data centers in support of business agility, what would a good solution look like? I would argue for an “Instant Data Mobility” approach whose central tenet is that stateful workloads and applications must be able to spin up on the destination side within seconds of the beginning of an Instant Data Mobility operation. Data for read operations must be fetched from the source volume on demand, thus obviating the need to move data prior to the operation, while write operations must happen locally at the destination to reduce latency and minimize bandwidth requirements.
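The read and write rules of this approach can be illustrated with a minimal Python sketch. The class name, the block-level interface and the in-memory dictionaries are hypothetical stand-ins chosen for illustration; the sketch also assumes whole-block writes and a source volume that stops changing once the operation begins.

```python
from typing import Dict


class InstantMobilityVolume:
    """Destination-side view of a volume whose data still lives at the source."""

    def __init__(self, source_blocks: Dict[int, bytes]) -> None:
        self._source = source_blocks       # stands in for the volume across the WAN
        self._local: Dict[int, bytes] = {}  # blocks already present at the destination
        self.wan_reads = 0                 # how many reads had to cross the WAN

    def read(self, block_id: int) -> bytes:
        """Serve locally when possible; otherwise fetch on demand and keep a copy."""
        if block_id not in self._local:
            self._local[block_id] = self._source[block_id]
            self.wan_reads += 1
        return self._local[block_id]

    def write(self, block_id: int, data: bytes) -> None:
        """Writes land at the destination only and never cross the WAN."""
        self._local[block_id] = data


if __name__ == "__main__":
    volume = InstantMobilityVolume({0: b"boot", 1: b"index", 2: b"rows"})
    volume.read(0)             # first access crosses the WAN
    volume.read(0)             # second access is local
    volume.write(1, b"index2")  # local write, no WAN traffic
    print(volume.read(1), volume.wan_reads)
```

The application can start as soon as the volume object exists, because nothing is copied up front; only the blocks it actually reads are pulled across the WAN.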
This approach would reduce application downtime to seconds, as the apps are immediately brought up on the destination side, providing unrivaled business agility and allowing businesses to turn their islands of IT from a liability into an asset. This approach would also reduce the WAN bandwidth required for each individual Instant Data Mobility operation, allowing multiple operations to run in parallel. And it would benefit IT organizations immeasurably by removing the need for significant advance notice for mobility operations and by reducing the planning and execution phases of mobility projects to a fraction of what they are today.
Needless to say, Instant Data Mobility technology would need to invest significantly in reducing the inefficiencies associated with fetching data across the WAN. For example, the use of technologies such as deduplication has been shown to reduce data volumes in primary data by as much as 50%. A shared dedupe engine between source and destination would optimize data access so that data already available on the destination is fetched locally rather than over the WAN. An additional 50% reduction in data volume can be achieved through the use of data compression.
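As a rough illustration of how a shared dedupe engine and compression could cut WAN traffic, consider the following Python sketch. It assumes content-addressed blocks identified by SHA-256 digests and zlib compression on the wire; the class and function names are hypothetical, and a real engine would differ in many details.

```python
import hashlib
import zlib
from typing import Dict, List


def digest(block: bytes) -> str:
    """Content address of a block: identical blocks share one digest."""
    return hashlib.sha256(block).hexdigest()


class DedupeFetcher:
    """Fetches blocks by digest, skipping the WAN when the content is already local."""

    def __init__(self, source_store: Dict[str, bytes]) -> None:
        self._source = source_store             # digest -> block at the source
        self.local_store: Dict[str, bytes] = {}  # digest -> block at the destination
        self.wan_bytes = 0                      # compressed bytes sent over the WAN

    def fetch(self, block_digest: str) -> bytes:
        if block_digest in self.local_store:    # dedupe hit: no WAN traffic
            return self.local_store[block_digest]
        payload = zlib.compress(self._source[block_digest])  # compressed at the source
        self.wan_bytes += len(payload)
        block = zlib.decompress(payload)        # expanded at the destination
        self.local_store[block_digest] = block
        return block


def restore(manifest: List[str], fetcher: DedupeFetcher) -> bytes:
    """Rebuild a volume region from its ordered list of block digests."""
    return b"".join(fetcher.fetch(d) for d in manifest)


if __name__ == "__main__":
    blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]  # the third block repeats the first
    fetcher = DedupeFetcher({digest(b): b for b in blocks})
    data = restore([digest(b) for b in blocks], fetcher)
    print(len(data), len(fetcher.local_store), fetcher.wan_bytes)
```

In this toy example three blocks are requested but only two distinct blocks cross the WAN, and each of those travels compressed. The savings on real data depend entirely on how repetitive and compressible that data is.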
Another technique that has shown great benefit is temporal data adjacency, in which a data fetch brings not only the data requested but also data that, historically, has been accessed shortly thereafter, providing an extremely efficient data read-ahead mechanism. It’s also important to realize that while stateful applications can require a very large amount of data, that data is not accessed uniformly, and applications tend to have working sets that are just a small fraction of the application’s data. Once the working set is available in its entirety on the destination side, all ongoing application IO access is local. And, of course, a low-intensity background process needs to run and fetch all the data that is not accessed in an on-demand fashion, so that future data accesses will have a higher likelihood of being local and so that, ultimately, all the data will have been transferred to the destination and WAN access is no longer required.
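The read-ahead and background-fill ideas above can be sketched in Python as follows. This is one possible interpretation, assuming that "historically accessed shortly thereafter" is approximated by counting which block immediately followed which in a recorded access trace; the class name, the fanout parameter and the trace are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, List


class AdjacencyPrefetcher:
    """On a miss, fetch the requested block plus its most frequent historical followers."""

    def __init__(self, source: Dict[int, bytes], fanout: int = 2) -> None:
        self._source = source                  # stands in for the remote volume
        self._local: Dict[int, bytes] = {}
        self._followers: DefaultDict[int, Counter] = defaultdict(Counter)
        self._fanout = fanout                  # followers bundled into each fetch
        self.wan_round_trips = 0

    def train(self, trace: List[int]) -> None:
        """Learn from a past access trace which block tends to follow which."""
        for prev, nxt in zip(trace, trace[1:]):
            self._followers[prev][nxt] += 1

    def read(self, block_id: int) -> bytes:
        if block_id not in self._local:
            likely_next = [b for b, _ in self._followers[block_id].most_common(self._fanout)]
            self.wan_round_trips += 1          # one round trip carries the whole bundle
            for b in [block_id] + likely_next:
                if b in self._source:
                    self._local.setdefault(b, self._source[b])
        return self._local[block_id]

    def hydrate_step(self) -> bool:
        """Background fill: copy one block not yet local; False once everything is local."""
        for b, data in self._source.items():
            if b not in self._local:
                self._local[b] = data
                return True
        return False


if __name__ == "__main__":
    prefetcher = AdjacencyPrefetcher({i: bytes([i]) for i in range(6)})
    prefetcher.train([0, 1, 2, 0, 1, 2, 3])
    for block in (0, 1, 2, 3):
        prefetcher.read(block)
    print("round trips for four reads:", prefetcher.wan_round_trips)
    while prefetcher.hydrate_step():
        pass                                   # a real system would throttle this loop
```

In this toy trace four reads cost two WAN round trips instead of four, and the background loop then brings over the blocks the application never asked for, after which the source is no longer needed.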
To summarize, stateful production workloads are the future of cloud native if cloud native is to fulfill its potential as a new industry paradigm. Nurturing and supporting production workloads must become an industry and community priority. Due to the nature of customer production environments that span private data centers as well as different public clouds, moving stateful cloud native workloads seamlessly is a challenge inadequately addressed by existing technology. This article argues for an Instant Data Mobility approach that leverages WAN and data optimization techniques to fetch data over the WAN on-demand while performing all writes locally at the destination thus allowing customers to move applications across IT islands with near-zero downtime.
Red Hat and VMware are sponsors of The New Stack.