The Whys, Whens and Wherefores of Kubernetes Backup

Kubernetes is among the most misunderstood new technologies when it comes to data protection. Several things contribute to this confusion, including:
- Kubernetes’ heritage as a solution used for stateless applications,
- the tendency for applications or even whole clusters to be deployed automatically using infrastructure as code from repositories,
- blurred organizational lines of responsibility in clouds, and
- the tendency for Kubernetes to be managed by development or DevOps teams without a background in traditional IT operations.
The Same, but Different
In many ways, the requirements for backing up Kubernetes are the same as for other critical IT infrastructure, but in other ways, they are very different.
With traditional server infrastructure, it is generally assumed that all servers or VMs that run production applications, and all systems critical to development, will be backed up. Other systems for test/QA and staging are often excluded, but may also be backed up for convenience or to minimize possible development schedule disruptions. These backups may be done at the server or VM level, at the storage level, or, more likely, at both. The process, while not necessarily simple, is well understood.
With cloud infrastructure, configuration data for the infrastructure itself must also be backed up. If cloud infrastructure is deployed automatically using IaC tools, the repositories that contain the IaC files should be backed up. But often this isn’t done, and even when it is, you are left hoping that the running configuration hasn’t drifted from what the repository describes. Storage volumes in the cloud may offer various replication and snapshot capabilities, but these need to be managed and are generally not a replacement for application-level backups. Running in the cloud doesn’t obviate the need for backups; it just changes the requirements.
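For illustration, if your infrastructure is managed with Terraform (one common IaC tool; the repository path below is hypothetical), a periodic drift check can at least tell you when the deployed configuration no longer matches the repository:

```bash
#!/usr/bin/env bash
# Sketch: detect drift between deployed infrastructure and the IaC repository.
# Assumes Terraform-managed infrastructure; the repo path is illustrative.
cd /srv/iac/production || exit 1
terraform init -input=false >/dev/null || exit 1

# -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift detected
terraform plan -detailed-exitcode -input=false >plan.log 2>&1
case $? in
  0) echo "No drift: deployed infrastructure matches the repository." ;;
  2) echo "Drift detected. Review plan.log before relying on repo-only recovery." ;;
  *) echo "terraform plan failed; see plan.log." ;;
esac
```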
Kubernetes adds an additional layer of complexity. A cluster is built on underlying nodes or VMs, whose configuration may change over time. On top of the cluster, containerized applications are deployed that may call for various types of persistent storage volumes and can create custom resources or otherwise modify the cluster state.
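To get a feel for how much of this state lives at the Kubernetes layer rather than in the VMs or storage underneath, you can inventory it with standard kubectl commands:

```bash
# Inventory Kubernetes-level state that node- or storage-level backups don't capture.
kubectl get nodes -o wide                        # the underlying nodes/VMs
kubectl get pv                                   # persistent volumes (cluster-scoped)
kubectl get pvc --all-namespaces                 # the claims that bind applications to them
kubectl get crd                                  # custom resource definitions added by applications
kubectl get deploy,statefulset,daemonset --all-namespaces   # the workloads themselves
```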
Several Methods, Same Outcome
There are multiple approaches to protecting Kubernetes and the applications that run on it, but, not surprisingly, the best approach is to use a solution that actually understands Kubernetes. You could use a tool that protects only the underlying Persistent Volumes (PVs) at the storage level, or you could back up the underlying nodes, which might be easy enough if they are VMs. But where would that leave you when you want to restore? There is a good chance you will want to restore only a single namespace, a single PV, or even a single resource such as a secret. A traditional backup tool that isn’t aware of Kubernetes will be no help with this.
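For example, with Velero, a widely used open-source Kubernetes backup tool, such granular restores are single commands (the backup and namespace names below are hypothetical):

```bash
# Restore just one namespace from an existing cluster backup.
velero restore create --from-backup nightly-cluster --include-namespaces webshop

# Restore only the Secrets from that namespace, nothing else.
velero restore create --from-backup nightly-cluster \
  --include-namespaces webshop --include-resources secrets
```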
It’s become common for stateful applications to run under Kubernetes, making use of persistent volumes and often even running databases on them. As with databases running on traditional server infrastructure, obtaining consistent backups can require application awareness in the form of “hooks” to quiesce the DB or application before volume snapshots are created. On Kubernetes, these hooks must also be cluster-aware so that they are directed to the proper node and container.
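As one concrete mechanism, Velero implements such hooks as pod annotations that run a command in a named container immediately before and after the snapshot (the namespace, pod, and freeze commands here are hypothetical placeholders for a real quiesce procedure):

```bash
# Quiesce the data filesystem before the snapshot and thaw it afterward.
kubectl -n webshop annotate pod/postgres-0 \
  pre.hook.backup.velero.io/container=postgres \
  pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/lib/postgresql/data"]' \
  post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/lib/postgresql/data"]'
```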
To Back Up or Not to Back Up
You might think that your Kubernetes cluster doesn’t need backups at all, because it only runs stateless applications and everything is deployed automatically using CI/CD pipelines and IaC tools from files in a git repo. That may be true. But it may not. Ask yourself:
- Are you sure you can easily rebuild that environment the way it was five minutes ago, or two weeks ago, or nine months ago if called on to do so?
- Are you sure there hasn’t been any configuration drift since your cluster was created by your deployment tools?
- Are you sure, even though you may not use PVs, that important application state isn’t being stored as Kubernetes custom resources, in CronJob entries, and so on? (A quick way to check is sketched below.)
- Are your secrets and certificates protected?
- Most importantly, are you sure that what your developers told you last month about the lack of application state and configuration drift on the cluster is still true today?
If you opt to forgo backups, be sure to check with all stakeholders on an ongoing basis that forgoing them is still appropriate. Then ignore what they tell you and periodically test a full rebuild from scratch. It’s better to be safe than sorry.
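Such an audit needs nothing more than standard kubectl queries; it often turns up state that a “stateless” cluster supposedly doesn’t have:

```bash
# Look for state that redeploying from the git repo would not recreate.
kubectl get crd                                  # custom resources created by applications
kubectl get cronjobs --all-namespaces            # scheduled jobs that may encode state
kubectl get secrets,configmaps --all-namespaces  # credentials, certificates, runtime config
kubectl get pvc --all-namespaces                 # persistent volumes you may have forgotten
```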
High Availability vs. Backup
Kubernetes and cloud infrastructure together can provide an excellent high-availability platform for your applications. Infrastructure redundancy and replication across multiple availability zones or regions can provide fault tolerance and application-level resilience. But high availability is no substitute for backups. HA solutions protect against data loss and unavailability caused by physical failures: failed disks, nodes, power, or network connectivity and, with proper design, even the loss of entire sites. But they don’t protect against logical failures. Since RAID and volume replication became common in the 1990s, physical failures in data centers have seldom been the cause of restore requests. The primary cause of data loss and subsequent restores is logical error: user errors, software errors, operator errors, and security breaches. Using highly available cloud solutions doesn’t relieve you of the responsibility to protect your applications and data with backups.
Happy Outcomes
Think carefully about the whys while deciding how, when, and whether to protect your Kubernetes clusters. A well-chosen, properly configured backup solution can make the difference between a good Kubernetes experience and a bad one.
As a data protection company, we at CloudCasa by Catalogic have heard of many approaches to protecting Kubernetes, and many reasons why customers don’t, or didn’t, think they needed to back it up at all. Some of these reasons were valid, and others weren’t. Sometimes these decisions led to unhappy outcomes that we heard about only afterward, when customers came to us seeking to prevent them from happening again.