Living with Kubernetes: Multicluster Management

28 Jun 2021 9:00am, by

This is part 3 of a series entitled Living with Kubernetes:
1. API Lifecycles and You
2. Cluster Upgrades

Justin Garrison
Justin is a senior developer advocate at Amazon Web Services (AWS).

Infrastructure has a tendency to sprawl. Data centers deal with server sprawl and VM sprawl. With the advent of containers, admins adapted to container sprawl with orchestration systems such as Kubernetes. Many environments have now reached the stage of Kubernetes cluster sprawl, meaning admins have to figure out the best way to manage and deploy to tens or hundreds of clusters.

There are plenty of reasons to use multiple clusters. Having different clusters for application environments and regions is recommended. If your apps have different compute needs or security boundaries between teams, you’ll probably need a few more clusters. This number quickly grows for all the same reasons servers and VMs did.

Historically, we’ve solved sprawl issues in one of three ways. The first is creating artifacts to shift the maintenance burden from run time to build time; for example, a custom Amazon Machine Image (AMI) helps deploy lots of Amazon Elastic Compute Cloud (Amazon EC2) instances. The second is automating the lifecycle management of the resources we deploy, such as using configuration management to manage servers. The third is adding layers of management to coordinate a group of resources, like an Auto Scaling group to control a group of servers.

Kubernetes doesn’t currently have a way to bundle everything into an artifact that contains all of the binaries and configuration needed to run a cluster, so using automation and additional management layers are our best options. Each pattern can apply to different areas of concern, so be mindful of what your goals are.

Multicluster management also affects applications: How to deploy applications to multiple clusters, how to expose services between clusters and how to configure disaster recovery. All of these topics have a lot of options and depend on your environment and requirements.

In this article, we’ll focus on cluster configuration and management options to help engineers responsible for the Kubernetes control plane.

Multicluster Needs

Creating your first cluster was probably a steep learning curve. Even with hosted Kubernetes options, there are still components you are responsible for. As you start to understand the requirements, you’re able to automate cluster creation and deploy some of your standard configuration, such as network plugins and container runtimes.

As your environment grows, there are other areas you may need to consider to make sure you can meet your business needs and the requirements of maintaining clusters:

  • Configuration and default services
  • Cluster discovery
  • Access and security
  • Patching and lifecycle management

The specifics on how you solve these problems will depend on your environment, tooling and needs. We can’t solve all these problems for you, but we’ll focus on these areas and how you should think about them.

Configuration and Default Services

Configuration and default services are highly dependent on your installation tooling. If your tools are custom-built, they may not have taken different configurations into consideration. You likely started with a common base, such as Kubernetes version and base services, and then expanded to allow your configuration to include other compute resources, storage and networking options.

You might be using Terraform, eksctl, Cluster API or tools like kubespray to deploy your clusters. In each case, there are ways you can create a cluster, but you may not always be able to install your default cluster services — such as the metrics server, AWS Load Balancer Controller or fluent-bit — at deployment time.

There are also ongoing config requirements, such as adding custom resources and node labels, or syncing ConfigMaps and secrets between clusters. No matter what you need, it will become apparent that there’s no one tool that can do everything. Decoupling cluster creation (servers and load balancers, for example) from configuration (like workload storage and networking) is a good idea. It allows for more flexibility as your needs change and ownership adapts to your use cases.

Your cluster deployments will likely be in some form of data representation like YAML that allows your tooling to create clusters. If you have a centralized team deploying clusters, they’ll likely have a single repo with all the cluster configurations. If smaller teams are responsible for deploying and managing their own clusters, they will probably keep configuration in separate repos.

One of the problems with a centralized repo is that your repo will grow with the number of clusters, and it’s difficult to test changes for things such as upgrades and common configuration. A monorepo will end up with complex CD systems to only apply changes for specific subdirectories, and it will probably have multiple coordination steps for upgrades and rollbacks.

Committing changes to a monorepo will become riskier over time. The alternative, a repo for each cluster, app, team or environment, will take multiple commits and reviews to coordinate changes. This is additional work and harder to validate in a large environment.

A third approach has been used for tooling such as configuration management, which is to use smaller repos for common configuration and use tooling that composes multiple repos into a central deployment. Tools like Flux can enable repo composability, to allow different clusters to pin to specific branches or tags, which can help clusters adopt changes faster or slower as needed.
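As a rough illustration of the composition idea, the sketch below models clusters that pin shared configuration repos to a branch or tag. It is not how Flux works internally; the repo names, refs and the `resolve_sources` helper are all hypothetical and only show how per-cluster pins can override fleet-wide defaults.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Source:
    """A shared configuration repo pinned to a branch or tag (hypothetical)."""
    repo: str
    ref: str


@dataclass
class ClusterConfig:
    name: str
    # Per-cluster pins that override the fleet-wide defaults below.
    overrides: Dict[str, str] = field(default_factory=dict)


# Fleet-wide defaults: every cluster follows these refs unless it overrides one.
DEFAULT_REFS = {
    "base-services": "main",
    "network-policy": "main",
}


def resolve_sources(cluster: ClusterConfig) -> List[Source]:
    """Compose the default repos with this cluster's pinned refs."""
    refs = {**DEFAULT_REFS, **cluster.overrides}
    return [Source(repo, ref) for repo, ref in sorted(refs.items())]


# A dev cluster tracks main and adopts changes quickly.
dev = ClusterConfig("dev-us-west-2")
# A production cluster pins one repo to a tag and adopts changes more slowly.
prod = ClusterConfig("prod-us-west-2", overrides={"base-services": "v1.4.0"})
```

Here `resolve_sources(prod)` keeps `network-policy` on `main` while holding `base-services` at the pinned tag, so promoting a change to production becomes a one-line edit to that cluster's overrides.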

Ongoing configuration will require different tooling from cluster creation. In this case, we can use Kubernetes controller patterns to read data from a Git repo or CRD and then apply those changes to a cluster or a group of clusters. If you’re a GKE customer, tooling like Config Sync exists to solve this problem. If you have a self-managed cluster, a GitOps controller might be your best option.
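The controller pattern boils down to comparing desired state with actual state and applying the difference. The sketch below does this with in-memory dictionaries, assuming the desired state has already been read from a Git repo or custom resource; `plan_changes` and `reconcile` are hypothetical names, and a real controller would call the Kubernetes API instead of mutating a dict.

```python
from typing import Dict, List


def plan_changes(desired: dict, actual: dict) -> Dict[str, List[str]]:
    """Diff desired state against actual state, as one reconcile pass would."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(
            k for k in desired if k in actual and desired[k] != actual[k]
        ),
        "delete": sorted(k for k in actual if k not in desired),
    }


def reconcile(desired: dict, clusters: Dict[str, dict]) -> Dict[str, dict]:
    """Run one reconcile pass per cluster and apply each plan in memory."""
    plans = {}
    for name, actual in clusters.items():
        plan = plan_changes(desired, actual)
        for key in plan["create"] + plan["update"]:
            actual[key] = desired[key]
        for key in plan["delete"]:
            del actual[key]
        plans[name] = plan
    return plans
```

Running `reconcile` a second time with the same inputs yields empty plans, which is the property that makes a control loop safe to run continuously against a group of clusters.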

Cluster Discovery

Once you have multiple clusters, you’ll need some way to find out who is responsible for which cluster, what services run where and how services inside the clusters can discover one another. In many ways this is similar to a configuration management database, which traditionally was stored in a spreadsheet that users could reference to find a server they needed. In the case of Kubernetes, you’ll want to use newer technologies, such as custom resources, to allow for more frequent updates and to use it for automation.

With servers, we call this “service discovery,” and we can rely on similar patterns for multiple clusters. We need a single place to store information about clusters, services and endpoints. We also need some way to look up that information and call the required endpoints.

For hosted services, your cluster inventory will probably be your provider’s service dashboard. You can view all your clusters and automate them via the provider’s API, but it’ll be up to you to attach metadata/tags or use cluster naming schemes to understand what they’re used for or who is responsible for them.

For self-managed clusters, it’s best if your centralized tools for deploying and managing clusters have this information. Tools like Cluster API allow you to put metadata on the cluster definition resource to help you identify this information. You can also store the information as metadata in the data files that are used to deploy clusters in Git.
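To make the inventory idea concrete, here is a minimal sketch of cluster records that carry ownership metadata and a label-based lookup. The record fields, label keys, cluster names and `example.com` endpoints are assumptions for illustration; in practice this data might live in provider tags, Cluster API resources or files in Git.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterRecord:
    """One inventory entry; the labels answer 'who owns this and why?'."""
    name: str
    endpoint: str
    labels: Dict[str, str] = field(default_factory=dict)


def find_clusters(inventory: List[ClusterRecord], **selector: str) -> List[ClusterRecord]:
    """Return clusters whose labels match every key/value in the selector."""
    return [
        cluster
        for cluster in inventory
        if all(cluster.labels.get(key) == value for key, value in selector.items())
    ]


# Hypothetical inventory data.
INVENTORY = [
    ClusterRecord("payments-prod", "https://payments-prod.example.com",
                  {"team": "payments", "env": "prod", "region": "us-west-2"}),
    ClusterRecord("payments-dev", "https://payments-dev.example.com",
                  {"team": "payments", "env": "dev", "region": "us-west-2"}),
    ClusterRecord("search-prod", "https://search-prod.example.com",
                  {"team": "search", "env": "prod", "region": "eu-west-1"}),
]
```

A call such as `find_clusters(INVENTORY, team="payments", env="prod")` answers the "who is responsible for which cluster" question, and the same selector can drive automation that targets a subset of the fleet.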

Kubernetes services use CoreDNS for discovering endpoints. DNS is a great way to allow services to find each other. We can also use DNS for clusters, or tools such as a service mesh, for service-to-service discovery and cross-cluster communication.

Tools such as ExternalDNS and Admiral attempt to solve this problem, along with Kubernetes Cluster Federation (KubeFed), a project of the Kubernetes Multicluster special interest group (SIG). If you’re interested in getting your services to work across multiple clusters, I recommend you get involved with SIG Multicluster.

Access and Security

There are some security concerns we should talk about for multicluster usage:

  • How someone accesses and identifies themselves to a cluster (authentication)
  • What they can do inside the cluster (authorization)
  • Managing secrets

The considerations for these in a multicluster setting are really about centralizing management. Single sign-on (SSO) can help authenticate users in multiple clusters and, depending on your role-based access control (RBAC) needs, traditional CI/CD workflows or GitOps can help manage permissions in multiple clusters.

Single sign-on can be an integration with an external identity provider you run or one offered by your cloud provider. Tools like Dex help you set up OpenID Connect authentication to a cluster with various providers. If you’re using a cloud provider, tools like aws-iam-authenticator and identity and access management (IAM) are what you should use.

Authorization relies on RBAC in the cluster. This takes roles and defines the actions they can perform on resources in the cluster. This is very flexible and highly dependent on your organization’s needs.

One tool that can help in this space is audit2rbac, which can take Kubernetes audit logs and turn them into an RBAC resource definition. rbac-tool and rbac-audit are other options that let you audit your existing RBAC permissions in a cluster.
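The idea of deriving RBAC rules from observed API activity can be sketched in a few lines. This is not audit2rbac's implementation; it is a simplified illustration that assumes events shaped like Kubernetes audit events (with `user.username`, `verb` and `objectRef` fields) and ignores namespaces, subresources and non-resource URLs. The function name and sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def rules_from_audit_events(events: List[dict], username: str) -> List[Dict[str, list]]:
    """Collapse one user's observed requests into RBAC-style policy rules."""
    verbs_by_resource = defaultdict(set)
    for event in events:
        if event["user"]["username"] != username:
            continue
        ref = event["objectRef"]
        # The core API group is represented by an empty string in RBAC rules.
        key = (ref.get("apiGroup", ""), ref["resource"])
        verbs_by_resource[key].add(event["verb"])
    return [
        {"apiGroups": [group], "resources": [resource], "verbs": sorted(verbs)}
        for (group, resource), verbs in sorted(verbs_by_resource.items())
    ]


# Hypothetical, heavily trimmed audit events.
DEPLOYER = "system:serviceaccount:web:deployer"
EVENTS = [
    {"user": {"username": DEPLOYER}, "verb": "get",
     "objectRef": {"apiGroup": "apps", "resource": "deployments"}},
    {"user": {"username": DEPLOYER}, "verb": "patch",
     "objectRef": {"apiGroup": "apps", "resource": "deployments"}},
    {"user": {"username": DEPLOYER}, "verb": "list",
     "objectRef": {"resource": "pods"}},
    {"user": {"username": "alice"}, "verb": "delete",
     "objectRef": {"resource": "secrets"}},
]
```

The returned rules only cover what the account was actually seen doing, which is a reasonable starting point for a least-privilege role that you then review by hand.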

You also will need to allow workloads to access the Kubernetes API and probably cloud resources, via service accounts and role bindings. If you’re a GKE customer, you can do this through Workload Identity, and in Amazon Elastic Kubernetes Service (Amazon EKS) you can use IAM roles for service accounts (IRSA).

Applying your RBAC rules to multiple clusters has the same concerns as any cluster configuration. How to apply and validate the rules will depend on your tooling, but you should be able to apply it to multiple clusters the same way you do any configuration.

Secrets are often needed for your applications, and there are a few ways to get them. The first is to create native secret resources and sync them between clusters. This works with a few clusters, but once you need to exclude secrets or clusters it gets complicated.
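A small sketch shows why syncing gets complicated once exclusions appear. It assumes exclusions are tracked as (cluster, secret) pairs; the function, cluster names and secret names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def plan_secret_sync(
    secrets: Iterable[str],
    clusters: Iterable[str],
    excluded: Set[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Map each cluster to the secret names it should receive.

    `excluded` holds (cluster, secret) pairs that must not be synced.
    """
    secret_names = list(secrets)
    return {
        cluster: sorted(s for s in secret_names if (cluster, s) not in excluded)
        for cluster in clusters
    }


# Hypothetical fleet: dev must never receive the production database password.
SECRETS = ["registry-credentials", "prod-db-password", "tls-wildcard"]
CLUSTERS = ["dev", "staging", "prod"]
EXCLUDED = {("dev", "prod-db-password"), ("staging", "prod-db-password")}
```

Every new cluster or secret adds entries to the exclusion set, so the list of pairs to maintain grows with the product of clusters and secrets rather than with either one alone.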

A better approach is to centralize secret storage with your cloud provider’s secret management offering or a tool like HashiCorp Vault. Just like with SSO, we can centralize the management of secrets, and then pods can access the information at run time. This makes exclusions and permissions easier and revoking secrets faster.

Patching and Lifecycle Management

We discussed patching and lifecycle management in the previous Living with Kubernetes article about cluster upgrades. The patterns still apply to individual clusters, and some patterns work better when you have multiple clusters. The main thing you’ll want to consider with lots of clusters is coordination between clusters so you don’t have too many updates happening in parallel.

By performing lots of cluster upgrades at once, you can potentially break more things with a bad configuration or by overloading dependent resources. If all your clusters pull images from a container registry, you might hit limits or scaling issues that could be solved by doing slower rollouts.

As with any Kubernetes resources, using declarative config is a good option. Tools such as kOps and Cluster API can apply upgrades to your cluster control planes. Coordinating how many clusters you want to upgrade at once, and in what order, will be your responsibility.

Even if you’re using fully managed Kubernetes solutions like Amazon EKS with managed node groups, coordinating cluster upgrades is still a good idea. Upgrading lower environment clusters first and then moving on to production is always recommended. Metrics and automation are the key to making this successful. Trusting your tools to do the right thing without human intervention will let your teams scale.
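One way to think about coordination is as a plan of upgrade waves: lower environments first, with a cap on how many clusters upgrade in parallel. The sketch below is only an illustration of that ordering logic; the function name, environment names and cluster names are hypothetical, and the actual upgrade and health checks between waves are left out.

```python
from typing import Dict, List


def plan_upgrade_waves(
    clusters: Dict[str, str],
    env_order: List[str],
    max_parallel: int,
) -> List[List[str]]:
    """Group clusters into ordered waves.

    Lower environments come first, no wave mixes environments, and no wave
    upgrades more than `max_parallel` clusters at once.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    waves = []
    for env in env_order:
        names = sorted(
            name for name, cluster_env in clusters.items() if cluster_env == env
        )
        for start in range(0, len(names), max_parallel):
            waves.append(names[start:start + max_parallel])
    return waves


# Hypothetical fleet, mapping cluster name to environment.
FLEET = {
    "dev-1": "dev",
    "stage-1": "staging",
    "prod-1": "prod",
    "prod-2": "prod",
    "prod-3": "prod",
}
```

With `max_parallel=2`, the three production clusters are split across two waves, which limits both the blast radius of a bad configuration and the load on shared dependencies such as a container registry.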


For multicluster Kubernetes management, some things can be handled with simple scripts to apply the same resource everywhere. But at some point, you’ll likely outgrow that option and need more logic applied to where and how clusters are configured.

More tools and services are being created in this space frequently, so make sure you check for the latest options from your provider. In any case, we don’t need to invent new patterns, as sprawl has been a concern for infrastructure for a long time.

Reuse the Kubernetes state storage with CRDs when possible and a control loop pattern when needed. Tightly coupling your current needs with automation will let you come up with quick solutions, but decoupling your current implementation and requirements will help you scale over time.

Feature image via Pixabay.
