Everyone using Kubernetes starts with a single cluster, but almost everyone expands to multiple clusters, and as the recent Cloud Native Computing Foundation end-user Technology Radar survey shows, multicluster management is still complicated and fragmented. Today, most organizations have to combine multiple tools: Helm, Kustomize, GitOps tools like Argo and Flux, a variety of operators, and often tooling they create themselves. The Cluster API project can replace a lot of that with Kubernetes-level APIs.
When Kubernetes was first created, it didn’t include cluster management, Kubernetes co-founder and Microsoft Corporate vice president Brendan Burns told The New Stack. “It just assumed that the machines were there; once you had the machines up and the software provisioned onto the machines, the Kubernetes API sprang into existence on top of those machines. And there was a long-standing discussion around ‘is the creation of the cluster a part of the API?’ or is it just assumed that someone creates a cluster and then the API comes up from that?”
“People wanting to create parallel clusters or clusters in private clouds or clouds that didn’t have managed Kubernetes like AKS needed tooling to help them. If you just say ‘here’s the kit of parts’ it’s like Kelsey Hightower’s ‘Kubernetes the Hard Way’; you learn a lot but it’s not a great way to do automation!”
The standard Kubernetes tool for this is kubeadm, a command-line tool to create, initialize, upgrade and administer clusters. But kubeadm was always intended as a bootstrapping tool that other tools would build on top of, and it doesn’t cover either day-to-day management of clusters or the long-term management of a Kubernetes environment.
As Kubernetes workloads become more complex and rely on operators, service meshes and API gateways, which in turn rely on the Kubernetes control plane of the API server, scheduler and etcd, the availability of that control plane becomes more critical. And because the control plane typically includes etcd, it is stateful and operations need to happen in the right order. That makes control plane management complex enough to need automation, especially with so many different Kubernetes distributions, installers and managed services.
Different deployment mechanisms for managing Kubernetes clusters all have different APIs for handling what could be common cluster lifecycle events like creating and deleting clusters, upgrading masters and nodes, and adding or removing the capacity for scaling. That can mean “vendor lock-ins, inconsistencies, and rigidity, especially when there is a desire to move horizontally between different distributions,” Red Hat OpenShift product manager Adel Zaalouk told us.
“There’s a general need for that to be an API so you can do API-driven creation of clusters, just like you can with something like AKS, but to do it in a way that is generic across all these different providers, in a bare-metal environment or in a virtual private cloud and virtualized environments where there isn’t a provider like Azure,” Burns explained.
That’s what the Cluster API project in the Cluster Lifecycle special interest group promises: declarative, Kubernetes-style APIs for creating, configuring and managing clusters and automating cluster lifecycle management for platform operators. The real goal is “to make cluster lifecycle boring” Cluster API maintainer Vince Prignano told us.
Kubeadm is an imperative, command-line tool that Prignano calls a Swiss Army knife for creating clusters from zero; “What we build on top is a much more immutable, much more managed layer.”
Consistent Clusters Everywhere
Burns calls Cluster API the “API-ification of kubeadm” and compared it to an operator for clusters; it builds on kubeadm but can use any bootstrap provider and it uses infrastructure providers to support multiple environments.
“I want to create a virtual machine. Well, do you want to create it on VMware or in Azure or in a different cloud or on OpenStack? In order to build something generic, we need an abstraction that maps from ‘I want a machine’ to ‘I want a machine on this specific fabric.’ Cluster API gives people who are creating Kubernetes clusters centralized tools so they don’t have to DIY everything, but it’s also nice because I can have a relatively generic API for on-premise clusters that I can interact with without having to worry about ‘this was a cluster that was created in OpenStack’ or ‘this is a cluster that’s on bare metal’ and ‘this is a cluster on some other private cloud’.”
Using infrastructure providers with a common API is an opportunity to share best practices and fix bugs, he pointed out. “Even if no customer ever has a hundred different cluster providers, the very fact that you bring all of them together into a single project means that a lot of that code is going to be reused and it’s going to be better as a result.”
“There’s just more people hitting on it and finding bugs and fixing the bugs.”
Cluster API uses custom resource definitions to provide abstractions for things like creating a VM or deploying a pool of VMs, backed by the right service for each provider, whether that’s Amazon Web Services‘ Auto Scaling groups, Azure Virtual Machine Scale Sets or GCP Managed Instance Groups.
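In practice, those abstractions are ordinary Kubernetes resources. A rough sketch of a MachineDeployment asking for a pool of three worker machines might look like the following; the names, version and the AWS template are illustrative, and on another provider the infrastructureRef would point at that provider's machine template instead:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineDeployment
metadata:
  name: my-cluster-md-0        # hypothetical name for illustration
spec:
  clusterName: my-cluster
  replicas: 3                  # scale the worker pool up or down here
  template:
    spec:
      clusterName: my-cluster
      version: v1.22.2         # Kubernetes version for these machines
      bootstrap:
        configRef:             # how each machine is bootstrapped (kubeadm here)
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha4
          kind: KubeadmConfigTemplate
          name: my-cluster-md-0
      infrastructureRef:       # provider-specific machine definition
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
        kind: AWSMachineTemplate
        name: my-cluster-md-0
```

Changing `replicas` and reapplying the manifest is all it takes to resize the pool; the infrastructure provider translates that into the matching cloud operations.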
That’s particularly important at scale. Scripting and command-line tools work for dealing with one or two clusters, “but the minute you’re into tens or hundreds or, in the case of something like AKS, tens of thousands, you have to have API-driven stuff, you have to have automation.”
Initially, the focus of Cluster API is on projects creating tooling and on managed Kubernetes platforms, but in the long run it will be increasingly useful for organizations that want to build out their own Kubernetes platform, Burns suggested.
“It facilitates the infrastructure admin being able to provision a cluster for a user, in an automated fashion or even build the tooling to allow that user to self-service and say ‘hey, I want a cluster’ and press a button and the cluster pops out. By combining Cluster API with something like Logic Apps on Arc, they can come up to a portal, press a button, provision a Kubernetes cluster, get a no-code environment and start building their applications, all through a web browser. That’s something previously they would have only been able to do in the cloud, and not in a portable way.”
Cluster API is a sign of Kubernetes maturing as a project, for both the ways it’s currently used and new opportunities at the edge, he said. “We’re maturing to a place where you don’t have to be an expert; where the person who just wants to put together a data source and a little bit of a function transformation and an output can actually achieve all of that in the environment where it needs to run, whether that’s an airstrip, or an oil rig or a ship or factory,” Burns said.
Cluster API Tools and Concepts
The first tool introduced with Cluster API was clusterctl, a command-line tool for installing Cluster API and managing the cluster lifecycle; it handles deploying, upgrading and migrating resources. If it sounds reminiscent of kubectl, that’s intentional: Cluster API deliberately echoes the API and command-line style of Kubernetes.
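A typical clusterctl workflow looks something like the sketch below; the cluster name is hypothetical, the Docker provider is just one example, and the exact flags and supported providers depend on the clusterctl version you run:

```shell
# Turn an existing Kubernetes cluster into a Cluster API management cluster
# (other infrastructure providers include aws, azure, vsphere, openstack)
clusterctl init --infrastructure docker

# Render the manifests for a new workload cluster and apply them
clusterctl generate cluster my-cluster \
  --kubernetes-version v1.22.2 > my-cluster.yaml
kubectl apply -f my-cluster.yaml

# Inspect the workload cluster's state through the management cluster
clusterctl describe cluster my-cluster
```

From there, the workload cluster is managed the same way as any other Kubernetes object: by editing and reapplying its resources.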
“It’s definitely intended for people to feel like, just like you’re managing your workload, you’re managing the cluster instead of managing your workload,” Burns pointed out.
The v1alpha3 release of Cluster API last year added the Kubeadm-based Control Plane (KCP). This is a declarative API to deploy, scale and upgrade the Kubernetes control plane, including the API server, scheduler, controller manager, DNS and proxy services, and the underlying etcd datastore, instead of manually creating machine resources to scale up and removing members from the etcd cluster to scale down.
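As a resource, that declarative control plane looks roughly like the sketch below; the names are illustrative and the exact field layout varies slightly between Cluster API versions:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha4
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane   # hypothetical name for illustration
spec:
  replicas: 3                      # desired number of control plane machines
  version: v1.22.2                 # bump this to trigger a rolling upgrade
  machineTemplate:
    infrastructureRef:             # provider-specific machine definition
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
      kind: AWSMachineTemplate
      name: my-cluster-control-plane
  kubeadmConfigSpec:               # kubeadm settings for each machine
    clusterConfiguration: {}
```

Scaling from one replica to three, or upgrading Kubernetes, becomes an edit to this object; the controller handles the ordering, including adding and removing etcd members safely.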
KCP can distribute nodes across failure domains. “You can say ‘I want three replicas, I want them to span across availability zones’ and when we get that declarative command we go and create that cluster and control plane and run health checks on it,” Prignano explained. If a node is having problems, because the kubelet process has stopped, the disk has run out of space, or there’s a hardware failure, a bug or a memory leak, Kubernetes will try to work around the failure; but if enough nodes fail, the cluster can run out of resources and pods will be evicted. Removing unhealthy nodes and replacing them avoids the longer-term problem.
It can also perform limited automatic remediation. “If something is really wrong the user has to intervene but we’ve tried to target the 80% of cases where if a machine goes bad, we can detect it, delete it and create a new one.”
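That detect-and-replace behavior is driven by a MachineHealthCheck resource. A minimal sketch, with illustrative labels, timeouts and threshold:

```yaml
apiVersion: cluster.x-k8s.io/v1alpha4
kind: MachineHealthCheck
metadata:
  name: my-cluster-unhealthy-nodes   # hypothetical name for illustration
spec:
  clusterName: my-cluster
  maxUnhealthy: 40%                  # stop remediating if too many fail at once
  selector:
    matchLabels:
      nodepool: nodepool-0           # which machines this check applies to
  unhealthyConditions:               # when to consider a machine unhealthy
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
```

The `maxUnhealthy` threshold is what keeps remediation targeted at the routine failures: if a large fraction of machines goes bad at once, the controller backs off and leaves the decision to the operator.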
Cluster API offers two classes of cluster: the management cluster, where Cluster API itself runs, and workload clusters, where user workloads run. The management cluster is responsible for keeping track of all the workload clusters it creates. That allows you to do GitOps at a multicluster level, not just for nodes or workloads.
Deutsche Telekom is already using Cluster API, vSphere and Flux GitOps to provision clusters with Prometheus, Grafana and other tools as an internal platform. The management cluster doesn’t necessarily need to run on the same infrastructure as the workload clusters, which enables multicloud scenarios, he noted. Some users have a single management cluster connected to multiple clouds over VPNs.
“If you have the network connectivity, that management cluster can manage clusters across all cloud providers that we can offer. It’s not a single point of failure: we built everything from the ground up to be really re-entrant, and to make sure that we can rebuild that state. If you have a backup and restore system like Velero, you can tear down that management cluster, create a new one, restore the whole backup, and everything will be up and running again.”
Is Cluster API Ready to Use?
While the alpha label may put some people off, the reason Cluster API is still in alpha is that the bar the project is setting for user experience is very high, Prignano explained. The core concept is “to make the 80% easy and the 20% possible,” so the focus is first on making the basics reliable and dependable and then on providing extension points for the kind of advanced use cases enterprises will have.
The new v1alpha4 release is a major milestone, with the focus on stability and reproducibility, paving the way for a 1.0 beta release, which he expects to come in the first half of 2022. “We want to make sure that when we get to beta that we have a much better user experience and that we’ve really nailed the 20% advanced cases.”
The stability of the APIs and the concepts that make up Cluster API will allow new features, like cluster classes, to be built on top in a future release, likely in the second half of 2022.
He describes classes as a way to stamp clusters and then reuse the class for multiple clusters. Today, if you have to roll out a new version of Kubernetes because there was a CVE, you have to modify the image and apply the Kubernetes upgrade in place, and you have to check that it’s been successful. A cluster class describes several clusters with the same “shape”; “I can change one Kubernetes version and that rolls out to all the clusters. We can simplify an operation that takes a long time, and put checks along the way. If we detect control plane or worker nodes that have not been able to spin up successfully, we’ll stop and inform the user.”
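At the time of writing the cluster class design was still being finalized, but the shape under discussion looks roughly like the sketch below: a Cluster that references a shared class by name, so that changing one version field rolls out across every cluster built from that class (all names and values here are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster                 # hypothetical name for illustration
spec:
  topology:
    class: my-cluster-class        # the shared "shape" reused across clusters
    version: v1.22.2               # bump here to roll out a Kubernetes upgrade
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
        - class: default-worker    # worker pool shape defined in the class
          name: md-0
          replicas: 3
```

The ClusterClass itself holds the templates for the control plane and worker pools, so an operator responding to a CVE edits the class or the version once instead of patching each cluster by hand.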
Today, Kubernetes users have to build their own repository of cluster information. Cluster API provides it with the clusterctl describe command, which uses the management cluster to introspect the workload clusters.
With cluster classes and simple lifecycle management, health checks and backup and restore, you’ll be able to find problems — and look across your Kubernetes infrastructure to find a suitable cluster to migrate a workload to. “If you have the same shape of cluster, you can say ‘I want to move my application from one cluster to another.’ Instead of using this cluster, I want to spin down all my applications and spin them up into another cluster.”
Users will be also able to cycle machines: “some people don’t want to keep machines running for more than six months.” Cluster API will also be able to help with certificate management, which is often done manually today. “We know where Kubernetes needs to keep its certificate proxy and kubelet certificates and we’re trying to make that easier.” There’s a proposal to do node attestation and secure kubelet authentication; “we can make sure when a node joins a cluster that it’s signed and validated against what you expect it to be.”
Cluster API is deliberately built for incremental adoption. “You don’t have to use all the features we offer,” Prignano explains. “A lot of companies have already invested a lot of time and effort to create their own Kubernetes management system. Because we use Kubernetes to manage other Kubernetes clusters, you don’t have to move all of your infrastructure right away.”
Crucially, Cluster API tools like clusterctl are also provided as Go libraries so tool creators can build them into their own offerings and still take advantage of the infrastructure providers — without needing to know the specific details of how to create a machine on VMware, OpenStack, Azure Stack HCI and multiple cloud providers.
Microsoft, Red Hat and VMware are already using Cluster API in their tools. He calls Cluster API “foundational” for Tanzu, and Microsoft is relying on it for its hybrid Kubernetes solutions, with Arc and Azure Stack HCI using Cluster API for managing Kubernetes clusters.
The Arc client is “a very small Cluster API management cluster that reaches out and manages that customer cluster,” Burns told us. “I have two different places where I can run stuff; I have this place where my users run stuff, and there’s the place where I as an administrator run stuff and I don’t have to worry about interference between the users’ code running and my code running. I have a safe space that they’re not going to break.”
The Microsoft AKS Engine cluster provisioning on Azure Stack Hub doesn’t currently use Cluster API because it was written before Cluster API was available, but he suggested it would migrate to Cluster API in the future.
Red Hat maintains forks of some Cluster API machine providers for OpenShift. “In the future, Red Hat is gravitating towards providing a frictionless path to gradually move the machine management logic from the Machine API Operator used today in OpenShift to Cluster-API,” Zaalouk said. Look for that as an upcoming tech-preview feature.
Cluster API machine providers are already used for testing in Red Hat’s HyperShift and Central Infrastructure Management and those projects will adopt it more broadly.
Not only are VMware and Microsoft contributors collaborating on Cluster API, Prignano noted; they’re also advising Apple’s Kubernetes infrastructure team, which is working with Alibaba to create a nested provider with a pod-based control plane. “We’re helping some folks from Apple to shape how they run clusters, which is really different from what people have been doing today.”
Another place Cluster API is already useful: the Kubernetes project relies on it. Because it’s developed against the tip of the main branch of Kubernetes and has a history of upstreaming issues found by the Lifecycle SIG, Cluster API tests are “release informing” for Kubernetes itself, as part of the feedback loop for releasing a new version.
And while it’s designed for infrastructure administration, Prignano points out that Cluster API can also be useful for developers, who need to work against the Kubernetes API but don’t have an installation to target. “The Docker provider is built into Cluster API, so you can just run it locally on your laptop.”
The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: Docker, Hightower, Real, Bit.
The Cloud Native Computing Foundation, Red Hat and VMware are sponsors of The New Stack.