How to Simplify Kubernetes Updates and Reduce Risk
One of the advantages of using Kubernetes to run your infrastructure is that it makes keeping applications up to date relatively straightforward. So it’s ironic that keeping Kubernetes itself up to date is considered much more of a problem.
It’s not that the updates themselves are an issue. Some software, such as Mirantis Container Cloud, will even do the updates for you. But that doesn’t mean the update is without risk, or that you can avoid investing the time of some of your most costly people to prevent catastrophe.
In short, Kubernetes updates mean that multiple applications can break, so naturally, everyone is involved in preventing this from happening — developers, team leads, operators and security. Everything else stops until the update is complete, which can really cut into your bandwidth.
Let’s Look at Why Kubernetes Updates Can Be Such a Problem
The Kubernetes project website suggests a basic order of operations for updating clusters:
- Upgrade the control plane.
- Upgrade the nodes in your cluster.
- Upgrade clients such as kubectl.
- Adjust manifests and other resources based on the API changes that accompany the new Kubernetes version.
This seems like a simple process, but each step can be fraught with danger. Kubernetes is a fast-moving project that sometimes introduces breaking changes, for example by deprecating features, extending APIs or introducing new best practices and new software components. These changes can have widespread effects on how your clusters work. Changes can affect:
- How the cluster runs on infrastructure (host operating systems and networks). See this article for an example of a bug in last September’s release of Kubernetes version 1.25 that would make Kubernetes worker nodes unable to communicate over the network.
- How the cluster works with other services and resources like cloud provider APIs, DNS, ingress, service mesh, storage, backup and so on.
- How the cluster works with application-specific resources, used to help Kubernetes orchestrate the things you build and host on it.
- And finally, the applications themselves, which can break when any of these components change.
So updating a Kubernetes cluster is a deep, potentially scary and perhaps expensive proposition. If the application that is at the heart of your business goes down, you’re at a standstill. Can you afford for that to happen? Can anyone? For how long?
To prevent this from happening, first you need your most technical people to read the release notes in detail and flag anything that will break something. These changes may be an immediate blocker, in which case you’ll need to adapt your implementation and/or applications before updating.
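Part of that release-notes review can be made mechanical. The Python sketch below scans manifest text for API versions that a target release no longer serves. Treat it as an illustration under stated assumptions rather than a finished tool: the `REMOVED_IN` table is a small subset that should be checked against the official deprecated API migration guide, and the function name and the regex-based parsing (top-level `apiVersion` and `kind` fields only) are hypothetical choices made for this example.

```python
import re

# Illustrative subset only: (apiVersion, kind) -> first (major, minor) release
# that no longer serves it. Verify against the official migration guide.
REMOVED_IN = {
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): (1, 25),
}

# Matches only unindented (top-level) apiVersion/kind fields.
FIELD = re.compile(r"^(apiVersion|kind):[ \t]*(\S+)", re.MULTILINE)


def flag_removed_apis(manifest_text, target):
    """Return (apiVersion, kind) pairs in the manifest that `target` no longer serves.

    `target` is a (major, minor) tuple such as (1, 25).
    """
    findings = []
    # A manifest file may hold several YAML documents separated by '---'.
    for doc in re.split(r"^---[ \t]*$", manifest_text, flags=re.MULTILINE):
        fields = {}
        for key, value in FIELD.findall(doc):
            # Keep the first occurrence and drop surrounding quotes.
            fields.setdefault(key, value.strip("\"'"))
        pair = (fields.get("apiVersion"), fields.get("kind"))
        removed = REMOVED_IN.get(pair)
        if removed is not None and target >= removed:
            findings.append(pair)
    return findings
```

Run against every manifest in a repository, a check like this turns one class of "will this break?" questions into a list that someone can work through before the update window.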
Then you need to test everything you plan to do before you do it. In practice, that means you need to build a (perhaps substantial) test cluster that duplicates your current environment in as many respects as possible (ideally all of them). You need to deploy the latest versions of your applications on it, set up your integrations with it and make sure everything works “as in production.”
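One way to keep the test cluster honest is to compare a simple inventory of each environment and list where they differ. The sketch below assumes you can produce such an inventory as a dictionary of component names to version strings. The function name, the keys and the version values are all hypothetical and chosen only for illustration.

```python
def environment_drift(production, test):
    """Return {component: (production_value, test_value)} for every mismatch.

    A component missing from one side is reported with None for that side.
    """
    drift = {}
    for component in sorted(set(production) | set(test)):
        prod_value = production.get(component)
        test_value = test.get(component)
        if prod_value != test_value:
            drift[component] = (prod_value, test_value)
    return drift


# Hypothetical inventories; real ones would be collected from each cluster.
production = {"kubernetes": "1.24", "cni": "plugin-a 3.1", "ingress": "controller-b 1.5"}
test = {"kubernetes": "1.24", "cni": "plugin-a 3.2"}

print(environment_drift(production, test))
```

An empty result does not prove the environments behave the same, but a non-empty one shows where the test cluster has stopped resembling production.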
And then you need to perform the update process meticulously, testing for problems as you go and staying ready to halt and roll back as issues are discovered. Each time, you alter assets (manifests and resources) to adapt to the new version of Kubernetes, then retry the update until you can accomplish it without incident.
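The halt-and-roll-back pattern described here can be sketched as a small phase runner. This is a minimal Python illustration, not a real upgrade tool: the `Phase` structure and the `apply`, `verify` and `rollback` callables are hypothetical placeholders for whatever your own tooling does, and in practice rolling back a control plane is far harder than calling a function.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Phase:
    name: str
    apply: Callable[[], None]     # performs the change (placeholder)
    verify: Callable[[], bool]    # smoke test run right after the change
    rollback: Callable[[], None]  # undoes the change (placeholder)


def run_update(phases: List[Phase]) -> Tuple[bool, List[str]]:
    """Apply phases in order; on the first failed check, roll back every
    applied phase in reverse order and stop. Returns (success, log)."""
    log: List[str] = []
    applied: List[Phase] = []
    for phase in phases:
        phase.apply()
        applied.append(phase)
        log.append(f"applied {phase.name}")
        if not phase.verify():
            log.append(f"check failed after {phase.name}")
            for done in reversed(applied):
                done.rollback()
                log.append(f"rolled back {done.name}")
            return False, log
    return True, log
```

The returned log gives you a record of how far each attempt got, which is useful when you adjust manifests and retry.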
The complexity of these operations goes way up as cluster size and sophistication increase, as products external to Kubernetes become important dependencies, and so on. Things also get harder as the applications you run get configured in more complicated ways.
Ways to Simplify the Process and Reduce Risk
The best way to manage this complexity is through automation, though surprisingly, many Kubernetes users use only very basic automation to deploy clusters. Monolithic automation may be able to move a single target cluster (and potentially hosting and surrounding infrastructure) to a new desired state, but it might not be up to the complex task of updating a cluster in several phases, interspersed with testing (and rollbacks and so on).
You might need to compose and test custom automation to manage your particular update process, which will then become something that needs to be tested with each update.
All of this involves cooperation between operators, architects, DevOps engineers and application developers, all of whom must take time away from their primary duties until the update is successfully completed.
The alternative is to work with partners and providers such as ZeroOps practitioners, who will take this burden off your shoulders. This “de-risking” of Kubernetes updates is actually a complex process in itself. Look to a critical-path Kubernetes operations partner to:
- Help you plan software development and operations, and design and build your Kubernetes cluster model. It’s possible to encode best practices from the start, in how you deploy Kubernetes clusters and how you build applications and services for them, so that you anticipate risky dependencies and prevent them from developing.
- Plan for updates, perform necessary tests and build proof-of-concept clusters with limited footprints using pre-GA project software assets. This works best if the partner is actively maintaining and supporting the Kubernetes distribution that you use, which means staying away from the absolute “bleeding edge” of Kubernetes updates while still keeping your implementation contemporary (and, of course, fully supported).
- Provide and support continuously evolving and improving automation — not just to deploy and manage whole clusters, but also to automate the entire update process so that it happens reliably within a short maintenance window. In principle, the goal is to make updates seamless and continuous, entirely without disruption to running applications and processes.
- Extend your software development and operations teams with the deep expertise required to interpret update communications. Work with the Kubernetes community to identify potential impacts and know enough about your operations and applications to flag “gotchas” early. Make plans to remediate, continually improving your way of working so that it becomes freer of dangerous dependencies and increasingly update-friendly.
In short, while Kubernetes updates should, in theory, be straightforward, they can’t be a “set it and forget it” proposition. There’s too much potential for breakage. Whether you’re using a ZeroOps partner or going it alone, Kubernetes updates should always be performed carefully and deliberately, even if it means that everything comes to a stop with all hands on deck until the update is complete.