The Data Protection Challenges of Kubernetes
The Cloud Native Computing Foundation sponsored this post in anticipation of KubeCon + CloudNativeCon North America 2020 – Virtual, Nov. 17-20.
Adopting Kubernetes as your de facto standard for container orchestration will accelerate your data-center orchestration and modernization efforts. Companies across all verticals and segments, from SMB to Large Enterprise, have adopted containers as a way to develop differentiated products built on cloud native foundations. But this accelerated adoption is not without its challenges.
In delivering Kubernetes to the enterprise, DevOps engineers have embraced containers as the new virtual machines and started migrating stateful applications (i.e., those that use databases and middleware layers) to containers. With self-service access to provision storage whenever and wherever they need it, DevOps engineers are no longer bound by the delays of traditional IT help desk requests, so container deployments have skyrocketed. The resulting container sprawl, and now cluster sprawl, presents the next challenge to understaffed IT teams.
The Containerization Journey
Why are DevOps engineers taking a technology designed for stateless applications and layering stateful applications on top of it, with all the complexity of persistent volume management? Simple: agility. Kubernetes allows the developer to provision, test, QA and even scale their application, based on business needs or demand. The journey from a traditional monolithic application is multi-phased, and step one is the migration from virtual machines to containers.
Businesses are relearning how to build applications that leverage the on-demand nature and resiliency capabilities of cloud native technologies, to respond to always-on customer demands in a mobile world. We are seeing rapid adoption of VM-to-container migration, followed by multiple refactoring phases, followed by adoption of flexible, software-defined storage to simplify storage management at scale. The ideal end state, which many businesses are striving to achieve, is a microservices architecture.
What is missing from this approach is the underlying tooling and automation to deliver end-to-end data management. There have been some initial projects that attempt to move beyond basic scripts (for example Velero, formerly Heptio Ark, now part of VMware Tanzu). There is the Kubernetes Storage SIG, and the recently established Data Protection Working Group (WG). But fundamentally there are still some basic challenges that need resolution:
- What is the definition of an application within Kubernetes (see v1beta1 Application CRD)?
- How does a developer record the dependencies of an application (e.g., custom resource definitions or other resources)?
- How do protection and recovery work in secure multi-tenant Kubernetes clusters (see Hierarchical Namespace concept)?
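To make the first two questions concrete, here is a minimal sketch of how an application and its dependencies might be declared. It builds an Application manifest as a plain Python dictionary. The field names follow our understanding of the v1beta1 Application CRD and should be checked against the CRD installed in your cluster. The `example.com/depends-on` annotation, the application name and the dependency string are hypothetical, since the CRD itself does not define a dependency field.

```python
import json


def build_application(name, namespace, version, component_kinds, depends_on):
    """Return an Application manifest as a plain dict (nothing is applied)."""
    return {
        "apiVersion": "app.k8s.io/v1beta1",
        "kind": "Application",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"app.kubernetes.io/name": name},
            # Hypothetical annotation: records what must exist before a restore.
            "annotations": {"example.com/depends-on": ",".join(depends_on)},
        },
        "spec": {
            # Resources carrying this label are treated as part of the app.
            "selector": {"matchLabels": {"app.kubernetes.io/name": name}},
            "componentKinds": [
                {"group": group, "kind": kind} for group, kind in component_kinds
            ],
            "descriptor": {"type": name, "version": version},
        },
    }


# All values below are illustrative.
app = build_application(
    name="orders",
    namespace="shop",
    version="1.4.2",
    component_kinds=[("apps", "StatefulSet")],
    depends_on=["crd/kafkatopics.kafka.example.com"],
)
print(json.dumps(app, indent=2))
```

A protection tool could read the label selector to find everything that belongs to the application, and the annotation to decide what has to be restored first.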
In fact, if we look at the traditional monolithic applications that are actively being migrated to Kubernetes applications, we find another list of challenges. Application consistency must be achieved, without the requirement to insert non-application binaries or agents inside the container. Application consistency is the act of coordinating application state and the protection operation (backup, storage snapshot, etc.).
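The coordination described above can be sketched as a quiesce, snapshot, resume sequence. This is an illustration under stated assumptions, not a reference implementation: `FakeDatabase` and its two methods are hypothetical stand-ins for whatever native mechanism the application offers to flush and pause writes, invoked from outside the container rather than by an agent inside it.

```python
from contextlib import contextmanager


class FakeDatabase:
    """Hypothetical stand-in for a real application; records calls made to it."""

    def __init__(self):
        self.events = []
        self.writes_paused = False

    def flush_and_pause_writes(self):
        self.writes_paused = True
        self.events.append("quiesce")

    def resume_writes(self):
        self.writes_paused = False
        self.events.append("resume")


@contextmanager
def quiesced(app):
    """Pause writes for the duration of the block and always resume them."""
    app.flush_and_pause_writes()
    try:
        yield app
    finally:
        # Runs even if the snapshot fails, so the app is never left paused.
        app.resume_writes()


def protect(app, take_snapshot):
    """Call `take_snapshot` while the application is in a consistent state."""
    with quiesced(app):
        snapshot_id = take_snapshot()
        app.events.append(f"snapshot:{snapshot_id}")
    return snapshot_id


db = FakeDatabase()
protect(db, take_snapshot=lambda: "snap-001")
print(db.events)  # ['quiesce', 'snapshot:snap-001', 'resume']
```

The important design point is the `finally` clause: a failed protection operation must not leave the production application with writes paused.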
Storage consistency must be achieved using snapshot mechanisms, to allow for online or ‘live’ protection without impacting the running application. In fact, as businesses adopt a wide variety of storage solutions, the ability to take a cloud native approach to storage management (API-driven, open interface, seamless scalability) is required.
The Container Storage Interface (CSI) provides this cloud native approach today, and while dynamic provisioning, attach/detach, and mount capabilities are stable, snapshot capability is not yet generally available. Large enterprises have come to know and love snapshot-based protection with their traditional enterprise storage array technologies. Snapshot, clone, and consistency group (CG) backups are considered core functionality for a wholesale migration of traditional applications to containers. One example is the ability to provision all-flash storage for production environments, then use CSI cloning to copy snapshots of those volumes to a more cost-effective tier (e.g., for dev/test seeding). These capabilities are still under development within the CSI specification.
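As an illustration of the snapshot-then-clone pattern, the sketch below builds two manifests as Python dictionaries: a VolumeSnapshot of a production claim, and a new PersistentVolumeClaim that uses that snapshot as its data source. All resource, class and namespace names are hypothetical. The API version is the beta one and would change once the feature is generally available. Whether a snapshot can be restored into a different storage class or tier depends on the CSI driver, so treat that part as an assumption.

```python
import json

# Beta API version; adjust to whatever your cluster serves.
SNAPSHOT_API_VERSION = "snapshot.storage.k8s.io/v1beta1"


def volume_snapshot(name, namespace, pvc_name, snapshot_class):
    """Manifest for a point-in-time snapshot of an existing claim."""
    return {
        "apiVersion": SNAPSHOT_API_VERSION,
        "kind": "VolumeSnapshot",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def pvc_from_snapshot(name, namespace, snapshot_name, storage_class, size):
    """Manifest for a new claim pre-populated from a snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageClassName": storage_class,
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names throughout.
snap = volume_snapshot("orders-db-snap", "shop", "orders-db-data", "csi-snapclass")
clone = pvc_from_snapshot("orders-db-devtest", "shop", "orders-db-snap", "standard-tier", "20Gi")
print(json.dumps([snap, clone], indent=2))
```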
It should be noted that while the CSI standard offers a way to take a storage-level, point-in-time volume snapshot, it does not move that snapshot to alternative storage media. A snapshot is considered a ‘recovery point’ and must be copied to cloud, disk or tape media to be considered a true ‘backup copy.’ The Data Protection WG is currently working on this challenge.
We have seen a bifurcated approach to application architecture and resiliency. Depending on the development resources available to it, a business may take one of two disaster recovery approaches:
- Application-centric recovery focused on capturing the entire Kubernetes application (manifests, persistent data, dependent resources) and re-scheduling them in a remote cluster. This approach can be entirely automated, with no reliance on the application owner.
- Infrastructure-centric recovery focused on leveraging next-generation software-defined storage (SDS) that can be tightly integrated into Kubernetes by way of a custom resource definition (CRD) to provide scheduling, replication, cloning, and recovery from the Kubernetes command line (i.e., kubectl).
Both approaches are valid, but they require different levels of IT operations resources, application development resources, and associated automation. As recovery is often a response to an unanticipated event, intelligent automation is required to drive consistent recovery outcomes and meet business recovery time objectives (RTOs).
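One simple way to reason about whether an automated recovery plan can meet its RTO is to add up the expected duration of each step and compare the total with the objective. The sketch below does exactly that. The step names and durations are made-up illustrative figures, and it assumes the steps run one after another.

```python
from datetime import timedelta


def estimated_recovery_time(steps):
    """Total duration of a recovery plan whose steps run sequentially."""
    return sum((duration for _, duration in steps), timedelta())


def meets_rto(steps, rto):
    """True if the plan is expected to finish within the objective."""
    return estimated_recovery_time(steps) <= rto


# Illustrative figures only; measure real ones in recovery drills.
steps = [
    ("restore manifests", timedelta(minutes=2)),
    ("restore persistent volumes", timedelta(minutes=25)),
    ("wait for pods to become ready", timedelta(minutes=5)),
]
rto = timedelta(minutes=30)

print(estimated_recovery_time(steps))  # 0:32:00
print(meets_rto(steps, rto))           # False
```

In this example the plan misses a 30-minute RTO by two minutes, and the volume restore is the obvious step to optimize.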
Beyond the Kubernetes Cluster
Kubernetes-based or container-based applications are made up of a number of new data types distributed throughout the organization. At the recent KubeCon Europe, experienced Kubernetes veterans expressed a desire to not “bypass CI/CD, code reviews, and formal release processes.” Bottom line: for containerization to form a stable building block of the next-generation application landscape, data protection best practices are required.
- Are you protecting developer workstations where the majority of development initiates?
- Are you protecting your source-code control system and CI/CD systems?
- What is the impact on your customers if your CI/CD system is unavailable?
- Are you protecting your etcd (etcd.io) data for on-prem clusters?
- Are you running your own private image registries, and if so, are they protected (goharbor.io)?
- How will you protect modern persistence stores like cloud object stores?
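For the etcd question in the list above, a minimal sketch of assembling an `etcdctl snapshot save` invocation follows. The function only builds the command and environment and does not run anything. The endpoint and certificate paths are assumptions modeled on a kubeadm-style control plane and will differ in other installations.

```python
import os
import shlex


def etcd_snapshot_command(snapshot_path, endpoint, cacert, cert, key):
    """Return (argv, env) for an etcdctl v3 snapshot; nothing is executed."""
    argv = [
        "etcdctl",
        "snapshot", "save", snapshot_path,
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
    ]
    # Select the v3 API explicitly; older etcdctl builds default to v2.
    env = dict(os.environ, ETCDCTL_API="3")
    return argv, env


# Assumed kubeadm-style paths; adjust for your environment.
argv, env = etcd_snapshot_command(
    snapshot_path="/var/backups/etcd-snapshot.db",
    endpoint="https://127.0.0.1:2379",
    cacert="/etc/kubernetes/pki/etcd/ca.crt",
    cert="/etc/kubernetes/pki/etcd/server.crt",
    key="/etc/kubernetes/pki/etcd/server.key",
)
print(shlex.join(argv))
# To run it on a control-plane node:
#   subprocess.run(argv, env=env, check=True)
```

As with volume snapshots, the resulting file is only a recovery point until it has been copied off the node to separate media.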
Considering the End-to-End Challenge
When we step back and review these challenges, we must reflect on why we perform data protection and data management.
- We need to recover a failed application to production.
- We need to recover a failed application or container(s) to an alternate location (disaster recovery).
- We need to migrate applications for infrastructure lifecycle or development (e.g., seeding a new cluster).
- We need to optimize our deployments by consolidating and reducing infrastructure sprawl.
- We need to protect applications to a defined SLA.
- We need to deliver the capabilities regardless of workload location (on-premises, cloud).
These challenges require data management capabilities that enterprises already enjoy, including:
- Centralized policy-based protection across all Kubernetes deployments.
- Holistic reporting, dashboards, trending, and alerting across all protected data.
- Self-service backup, recovery, and insights for authorized individuals.
- Integration into Single Sign On (SSO) systems to provide granular role-based access control.
- Policy-based control to access and use protected data.
- Governance and compliance capabilities for report, audit, log, and persistent data for regulatory requirements.
Many data protection solutions today rely on capturing application manifests and persistent data. Scheduling of protection occurs on a cluster-by-cluster basis, with little visibility across multiple implementations. Additionally, secure multitenancy best practices and even integration with technologies like Open Policy Agent (OPA) are yet to mature.
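By contrast, centralized policy-based protection could look something like the following sketch: policies are defined once, matched against namespaces in every cluster by label, and any namespace that no policy covers is surfaced explicitly. The `Policy` and `Namespace` types, the label keys and the cluster names are all hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    name: str
    selector: dict       # labels a namespace must carry to match
    schedule: str        # cron expression
    retention_days: int


@dataclass
class Namespace:
    cluster: str
    name: str
    labels: dict


def matches(policy, namespace):
    """True if the namespace carries every label the policy selects on."""
    return all(namespace.labels.get(k) == v for k, v in policy.selector.items())


def resolve(policies, namespaces):
    """Map (cluster, namespace) to the first matching policy name, or None."""
    plan = {}
    for ns in namespaces:
        match = next((p for p in policies if matches(p, ns)), None)
        plan[(ns.cluster, ns.name)] = match.name if match else None
    return plan


policies = [
    Policy("gold", {"tier": "prod"}, "0 * * * *", 30),
    Policy("bronze", {"tier": "dev"}, "0 2 * * *", 7),
]
namespaces = [
    Namespace("us-east", "orders", {"tier": "prod"}),
    Namespace("eu-west", "orders", {"tier": "prod"}),
    Namespace("eu-west", "sandbox", {}),
]

plan = resolve(policies, namespaces)
for target, policy_name in plan.items():
    print(target, policy_name or "UNPROTECTED")
```

Because the plan spans clusters, the same structure also feeds reporting: the unlabeled `sandbox` namespace shows up as unprotected instead of being silently skipped.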
Kubernetes has certainly delivered application mobility, with the orchestration of an application from one cluster to another now possible. Challenges moving forward are going to require policy controls, reporting, alerting and even Kubernetes manifest transformations, to support migration between disparate cluster versions and technologies. A community-developed standard approach to protection, one that third-party protection vendors can build on, is urgently needed.
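A manifest transformation of the kind mentioned above might look like this sketch. It strips server-populated fields from a captured manifest and moves a Deployment from a removed API group to `apps/v1`, filling in the selector that `apps/v1` requires. It covers only this one case and is meant as an illustration. The sample manifest is hypothetical.

```python
import copy

# Metadata fields the API server populates; they should not travel between clusters.
SERVER_FIELDS = (
    "uid", "resourceVersion", "creationTimestamp",
    "generation", "selfLink", "managedFields",
)

# Deployment API versions that newer clusters no longer serve.
OLD_DEPLOYMENT_APIS = {"extensions/v1beta1", "apps/v1beta1", "apps/v1beta2"}


def portable(manifest):
    """Return a copy of `manifest` that is safe to apply to another cluster."""
    out = copy.deepcopy(manifest)
    out.pop("status", None)
    for field in SERVER_FIELDS:
        out.get("metadata", {}).pop(field, None)

    if out.get("kind") == "Deployment" and out.get("apiVersion") in OLD_DEPLOYMENT_APIS:
        out["apiVersion"] = "apps/v1"
        spec = out.setdefault("spec", {})
        if "selector" not in spec:
            # apps/v1 requires an explicit selector; reuse the pod template labels.
            labels = spec.get("template", {}).get("metadata", {}).get("labels", {})
            spec["selector"] = {"matchLabels": dict(labels)}
    return out


# Hypothetical manifest as captured from an older cluster.
captured = {
    "apiVersion": "extensions/v1beta1",
    "kind": "Deployment",
    "metadata": {"name": "orders", "uid": "abc-123", "resourceVersion": "4711"},
    "spec": {
        "replicas": 2,
        "template": {
            "metadata": {"labels": {"app": "orders"}},
            "spec": {"containers": [{"name": "orders", "image": "example.com/orders:1.4.2"}]},
        },
    },
    "status": {"readyReplicas": 2},
}

migrated = portable(captured)
print(migrated["apiVersion"], migrated["spec"]["selector"])
```

The function works on a deep copy, so the captured manifest stays intact as the record of what was actually protected.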
To learn more about Kubernetes and other cloud native technologies, consider coming to KubeCon + CloudNativeCon North America 2020, Nov. 17-20, virtually.
The Cloud Native Computing Foundation is a sponsor of The New Stack.