Protecting Kubernetes Data: The Stateful Application Edition
With over 83% of enterprises running Kubernetes in production, it has become the de facto way to deploy applications, and the reasons are clear.
Kubernetes offers declarative configuration that can be stored as code and reused, portable workloads that can be shifted to a different cluster by simply applying a YAML file or Helm chart, and self-healing application resiliency.
These are all very impressive capabilities, at least for stateless applications. But not all applications are stateless; most need to persist data.
For stateful applications, storage is a completely different story, full of the challenges you typically want to avoid:
- Data resiliency: storage is often a single point of failure, and resilient storage architectures are complex and expensive.
- Recovery Point Objective (RPO): a tight RPO and data consistency are hard to achieve even with data protection mechanisms in place.
- Complexity: networking and replication are no cakewalk; they require expertise and time.
- Cost: end-to-end data protection solutions are expensive.
In a nutshell, stateful applications come with storage dilemmas: Which storage should you choose? How do you deploy it so that storage is available on all clusters? How do you defend your data from failure? And how does a stateful app provision and use the available storage?
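To make the provisioning question concrete, here is a minimal sketch of how a stateful app typically claims storage in Kubernetes: it declares a PersistentVolumeClaim against a StorageClass and mounts the claim in a pod. The names, image, and "standard" StorageClass below are illustrative assumptions, not from any specific cluster.

```yaml
# A claim for 10 GiB of storage from a (hypothetical) "standard" StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard
  resources:
    requests:
      storage: 10Gi
---
# A pod that mounts the claim; the cluster provisions a matching volume.
apiVersion: v1
kind: Pod
metadata:
  name: stateful-app
spec:
  containers:
    - name: app
      image: postgres:16        # example workload
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data
```

Everything below the claim, including where the volume actually lives and what failures it survives, depends on the storage option behind the StorageClass, which is what the rest of this post compares.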
First, evaluate your data protection requirements: what can fail, and which failures do you need to be protected from?
In other words, at what level does your application require redundancy?
- Node
- Availability zone
- Region
- Cloud provider
How much data loss can you afford? Does every transaction count? Or is recovering from a day-old backup/snapshot good enough?
How much are you willing to invest? Stronger data protection options increase complexity and cost. Your goal is to match each workload to its most cost-effective RPO.
Let’s review our options:
Local Disks
Let’s start with local disks, directly attached to the node. They do not protect data from failures at the node, AZ, region, or cloud-provider level. If the node fails, the data is lost.
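For reference, a local disk surfaces in Kubernetes as a local PersistentVolume pinned to a single node, which is exactly why the data shares the node's fate. This is a sketch; the path, node name, and StorageClass name are illustrative.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd0        # disk directly attached to the node
  nodeAffinity:                  # the volume exists only on this one node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]
```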
Cloud Block Storage
Next, let’s look at public cloud block storage. Confined to a single availability zone, it protects against failure at the node level, but not at the AZ, region, or cloud-provider level. If the zone fails, the data is gone.
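On AWS, for example, an EBS-backed StorageClass looks roughly like the sketch below (assuming the AWS EBS CSI driver). The volumes it provisions are zonal, which is the root of the AZ-level exposure; `WaitForFirstConsumer` only ensures the volume is created in the same zone as the pod that uses it.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com     # AWS EBS CSI driver
parameters:
  type: gp3
# Delay volume creation until a pod is scheduled, so the zonal volume
# lands in the same availability zone as the consuming pod.
volumeBindingMode: WaitForFirstConsumer
```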
Cloud Storage with Regional Snapshots
How about cloud block storage with regional snapshots? This brings us back to the RPO question: How much data can you lose? Is a point-in-time solution with inherent data loss good enough for your data?
Regional snapshots let you resume operations after a node or AZ failure but will not protect you from region or cloud-provider failure.
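With a CSI driver that supports snapshots, a point-in-time snapshot of a claim is itself declared as a Kubernetes object. A sketch, assuming a snapshot-capable driver; the class and claim names are illustrative.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # provided by the CSI driver
  source:
    persistentVolumeClaimName: app-data    # the PVC to snapshot
# Restoring this snapshot recovers data only up to the moment it was
# taken; anything written afterwards is lost (the RPO question).
```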
Cloud Storage with Snapshot Shipping
Can cloud storage with snapshot shipping save the day? Replicating snapshots to a different region protects data from a region failure. However, it is still a point-in-time solution with a risk of data loss; it covers region failure but is limited to a single cloud provider and carries its own share of complexity and cost.
Managed Services
Some of you might be saying, “I will use a managed service” and let someone else deal with the storage and infrastructure, assuring data resiliency and minimizing overhead. The problem with this approach is that most managed services (for example Aurora or RDS) are confined to a single region and live outside your Kubernetes cluster. If the region is gone, you can’t resume operations in a different region.
Database with Replication
OK, what if we run a database inside the K8s cluster that already has replication capabilities built in? That is a good way to get an end-to-end data resiliency architecture with data loss measured in minutes rather than hours, but you still have to establish all the networking yourself, and it is going to cost you a lot.
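Such a database would typically run as a StatefulSet, where each replica gets its own volume via `volumeClaimTemplates` and replication happens at the application layer. A hedged sketch, not a production manifest; the database image and sizes are placeholders, and the actual replication setup (users, configuration, failover) is omitted.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3                    # application-level replication across pods
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: postgres:16     # example; any database with built-in replication
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:          # one PersistentVolumeClaim per replica
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
```

Note that for cross-region or cross-cloud resiliency, the replicas must be able to reach each other across clusters, which is exactly the networking burden mentioned above.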
SDS with Replication
Another option is running a software-defined storage solution. Some, like OpenEBS or Portworx, provide asynchronous replication between clouds and regions, but they still require managed storage and networking that you establish yourself. These options are costly and complicated, and they will not assure zero data loss.
If we stack up all our options and compare them, we can see a clear trade-off between cost and complexity on one side and data resiliency on the other. But why can’t you have it all?
At Replix, we’ve found that the answer is simple: data resiliency requires data mobility. When one location fails, you had better have an up-to-date replica of the data outside the failed location, and that requires data mobility. Current solutions were simply not designed around it. Add to that the complexity of networking, and the fact that storage and replication are complicated, expensive, and not really promoted by public cloud vendors for risk mitigation, and we can understand why the majority of stateful applications have yet to move to Kubernetes.