Kubernetes at Scale without GitOps Is a Bad Idea

As Kubernetes environments scale, consistently managing cluster configurations across multiple environments or clouds becomes difficult. Clusters deployed by different teams may not, at least initially, share the same node sizes, autoscaling settings, networking or RBAC policies that are important for overall governance and security. As a result, the desired cluster configuration is never fully replicated across clusters, and manually identifying drift and maintaining conformance as the fleet scales is, of course, not a viable option.
DevOps teams should be able to solve these conformance challenges with a set of policy-management templates that describe the desired state of their clusters. The templates should let teams create standards-based cluster definitions and then replicate them across all of their clusters through a single interface, with minimal effort.
In this article, we describe the issues and challenges that enterprises face when deploying Kubernetes clusters at scale. We also describe how GitOps processes and tools can allow organizations to gain proper control of these highly distributed environments while improving security and compliance best practices.
Management Challenges and Cluster Sprawl
The shift to virtualization created the problem of VM sprawl, where the sheer number of VMs deployed made effective management impossible. The widespread use of containerization and Kubernetes has created similar problems that enterprises must deal with.
Now, enterprises often must deal with a chaotic environment of cluster and workload sprawl as large and distributed teams provision multiple Kubernetes clusters across their local workstations, data centers, public clouds, edge sites, and at times, on premises at end-customer sites.
Enterprise IT teams working with Kubernetes environments must ensure that clusters launched by internal and field teams are compliant and adhere to organization-wide policies. Clusters created for end-user-facing application deployment must be monitored especially carefully and must not deviate from the desired configuration. However, cluster sprawl and the fragmentation of cloud native infrastructure make it especially difficult to enforce global policies.
Configuration Sprawl
Kubernetes is well known for its declarative API: Every component in a cluster, from configuration settings to applications, no matter how small, is configured via a declarative object called a resource, typically expressed as a YAML file. This means that fully describing and configuring a cluster may involve creating and maintaining dozens or hundreds of different resources, a management nightmare. The problem is aggravated by the plethora of tools that introduce additional abstractions and features (e.g., templating) over those resources, such as Jsonnet, Helm, and Kustomize.
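For instance, even a single team namespace and its resource quota are two separate resources, each typically kept in its own YAML file (the names and limits below are purely illustrative):

```yaml
# A namespace and a quota for it: two separate resources,
# each usually living in its own file in the repository.
apiVersion: v1
kind: Namespace
metadata:
  name: payments        # hypothetical team namespace
  labels:
    team: payments
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
```

Multiply that by every namespace, policy and add-on in a cluster and the file count grows quickly.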
Pet Clusters
Back in the virtualization days, “pet” VMs, large monolithic virtual machines that are difficult to upgrade or maintain and that create a fault-tolerance bottleneck, were unavoidable. A similar problem is occurring in the Kubernetes and container world: As enterprises start deploying clusters at scale, a few large “pet” clusters become a common problem. These clusters at times may have up to 1,200 nodes.
Clusters at this scale must be split into multiple node groups, each requiring its own management and maintenance. CNI plugins and other solutions integrated into the cluster may start breaking or behaving unexpectedly as the number of nodes grows. A better approach is to run a large number of small clusters, each purpose-built for a use case, with stronger overall automation to manage them at scale. This addresses the issue of managing pet clusters, but in turn creates a new one: managing a large inventory of clusters. This is where GitOps-style automation can really add value.
Desired State Management in Kubernetes
One of the biggest advantages of Kubernetes is its declarative system, which handles desired state management for all the applications running within its scope. When a Kubernetes pod that belongs to a replica set is deleted, a Kubernetes controller compares the number of running pods with the deployment specification, and a new pod is automatically scheduled to maintain the desired number of replicas. Controllers oversee the lifecycle of all Kubernetes resources, such as deployments, stateful sets and jobs.
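As a minimal illustration, the Deployment below declares three replicas; if one of its pods is deleted, the controller schedules a replacement until the observed count matches the declared one again (the name and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web              # placeholder name
spec:
  replicas: 3            # desired state: the controller keeps three pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25   # placeholder image
```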
Behind the scenes, the workload state is maintained in etcd, the default key-value store for Kubernetes, which acts as the single source of truth for the resource configuration deployed on the cluster. The etcd database holds both the configuration definition of the workload and its current state. Should discrepancies occur, the kube-controller-manager is responsible for recreating resources to match the original definition.
However, by default, Kubernetes has no mechanism to monitor changes to the cluster itself and its attributes and automatically reconcile that state. For example, if an entire namespace is deleted, Kubernetes doesn’t recreate the namespace or the objects within it. This shortcoming is where GitOps comes into the picture.
GitOps for Effective Cluster Management
GitOps is an operational framework that takes standard development best practices (version control, central source code repositories and CI/CD) and extends them to the management of your infrastructure.
In a multicloud and multicluster environment, GitOps can be a very valuable and effective process to automate configuration management, deployment, updates and policy management of your Kubernetes clusters and the surrounding infrastructure.
When operating clusters at scale using GitOps principles, instead of handcrafting clusters across users’ workstations, customer sites and dev/test/production environments, DevOps teams standardize a set of cluster resources expressed as YAML, Kustomize or Helm (or a combination), grouped into “templates.” Each template captures the desired-state attributes for a certain type of cluster: the shape and size of the cluster, the number of master and worker nodes, the add-ons to be deployed, the networking and security policies to be enforced and so forth. Each template lives in a dedicated Git repository where all configurations are stored, and every update to a template is versioned, which is helpful from a governance and compliance perspective.
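One possible shape for such a template, sketched here with Kustomize and hypothetical file names, is a directory whose kustomization.yaml pulls together the baseline resources that every cluster of that type must carry:

```yaml
# clusters/templates/prod-baseline/kustomization.yaml (hypothetical path and file names)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespaces.yaml          # standard namespaces for this cluster type
  - rbac.yaml                # roles and bindings enforced everywhere
  - network-policies.yaml    # default-deny plus explicitly allowed traffic
  - cluster-autoscaler.yaml  # add-on with the agreed node-group sizing
commonLabels:
  cluster-template: prod-baseline
```

With this layout, changing the standard means committing a change to one of those files and reviewing it like any other code change.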
Enforcement with Flexibility
Having templates to capture desired state for your clusters is only useful if there is an enforcement engine that can ensure that the actual state of your clusters is always consistent with the desired state described in Git. The enforcement engine should allow for creation of new clusters using attributes described in a template, as well as fixing existing clusters to adhere to the template.
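With Flux, for example, the enforcement side can be wired up as two resources: a GitRepository pointing at the template repository and a Kustomization that re-applies its contents on an interval and prunes anything removed from the template. The repository URL and path below are placeholders:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: cluster-templates
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/cluster-templates   # placeholder repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: prod-baseline
  namespace: flux-system
spec:
  interval: 10m                                 # re-apply the template on this schedule
  sourceRef:
    kind: GitRepository
    name: cluster-templates
  path: ./clusters/templates/prod-baseline      # placeholder path within the repository
  prune: true                                   # delete objects that were removed from the template
```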
Let’s take an example of RBAC policy management for your Kubernetes clusters at scale to illustrate how effective template-based management and enforcement can be achieved in a Kubernetes world.
One of the benefits of Kubernetes is the ability to configure and manage clusters at scale using a simple command-line tool, kubectl. Giving every development and ops team member, from the freelance developer to the CTO, kubectl access to manage your clusters, however, is hardly ideal. What’s ideal is an easy-to-use mechanism for configuring RBAC policies that granularly define each user’s access level across your clusters.
Managing Kubernetes RBAC policies, however, is complex by default. Kubernetes forces the user to sift through the complexities of editing and updating various YAML files to properly configure and update RBAC policies. Many, if not most, commercial Kubernetes solutions do not provide an alternative that significantly simplifies the process at scale.
The GitOps-style model described above can significantly simplify this process. In this model, a DevOps or SRE engineer defines one or more “RBAC templates” that capture Kubernetes user roles and role bindings at namespace or cluster scope. Once defined, an RBAC template can be applied to one or more clusters to grant a group of users the appropriate level of access to those clusters.
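A sketch of what such a template might contain: a read-only role for developers, bound to a group at namespace scope (the role, namespace and group names are hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader               # hypothetical role name
  namespace: payments
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "deployments"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: payments
subjects:
  - kind: Group
    name: dev-team               # hypothetical group from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io
```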
The RBAC template gets stored in your Git repository declaratively and immutably, without forcing the user to hand-edit YAML files on each cluster. Once the enforcement engine is instructed to associate a cluster with the repository (and, typically, a path within the repository), the contents of that path define the “source of truth” for the RBAC settings, which are applied to the cluster and then periodically synchronized to ensure they continue to be enforced over time.
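If Argo CD is the enforcement engine, for instance, that association is typically expressed as an Application that points the cluster at a repository path and syncs it automatically, reverting manual changes as they are detected. The repository URL and path are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rbac-baseline
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/cluster-templates   # placeholder repository
    targetRevision: main
    path: rbac/prod                                             # placeholder path
  destination:
    server: https://kubernetes.default.svc   # the cluster being managed
    namespace: default
  syncPolicy:
    automated:
      prune: true      # remove objects that were deleted from Git
      selfHeal: true   # revert changes made directly on the cluster
```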
A proper system must not only enforce a new base state, but should also be able to test committed requests before they are approved and merged into the desired state. The process of arriving at the final, immutable state in Git is iterative and remains flexible, so that it does not just enforce policy but also facilitates change. The system should also provide a complete audit trail of all merge requests and changes, both in Git and on the clusters, which is one of the beautiful things about Git.
Don’t Get the Drift
Changes made to clusters that differ from the configuration in the Git repository are known as “drift,” and the ability to easily and seamlessly audit them is critical. A properly implemented template system, as described above, should provide “drift analytics”: with a single command, the system can determine and report whether there is a difference between the deployed cluster and the desired state captured in the template. When changes are detected, alerts should be sent to the relevant ops owners, depending on the severity and importance of the changes.
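As one way to deliver those alerts, Flux’s notification controller can forward reconciliation and drift events for a given Kustomization to a chat channel. The channel and secret names below are placeholders, and the exact API version depends on your Flux release:

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: ops-slack
  namespace: flux-system
spec:
  type: slack
  channel: k8s-drift-alerts        # placeholder channel
  secretRef:
    name: slack-webhook            # secret holding the webhook URL
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: drift-alerts
  namespace: flux-system
spec:
  providerRef:
    name: ops-slack
  eventSeverity: info
  eventSources:
    - kind: Kustomization
      name: prod-baseline          # the template reconciliation defined earlier
```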
With Flux or Argo CD, drift happens when a particular object in Kubernetes has changed, whether inadvertently by an admin, for example, or in the course of an attack: the cluster configuration policy and other settings are overwritten directly on the cluster instead of being changed in Git first and then rolled out at the cluster level.
All changes should thus first be made in Git before they are rolled out into production, so that the configuration in Git and in the clusters remains reconciled. Git serving as the immutable “single source of truth” is also a major component of cloud native security and compliance, helping meet mandates such as those defined by the Sarbanes-Oxley Act.
What the process should not involve is a line-by-line manual check to determine whether the declared and running resources differ. If drift has occurred, the changes made to the cluster should be identified automatically. Since the cluster configuration no longer conforms to the state defined in Git, the right tools can identify and apply the remediating changes to the cluster automatically, while sending alerts. In other words, configuration drift should always be apparent and treated as a first-class aspect of the system.
The Threshold Starting Point
It is possible to manually handcraft and manage the configuration for a single or a small number of clusters. However, the pain becomes palpable when attempting to manually manage 10 or more Kubernetes clusters consistently, especially if the size of the individual cluster is large.
It can be argued that it is unviable for a small DevOps team to manage a multicluster environment without GitOps tools and processes to rely on for configuration management and enforcement at scale. To put what’s involved into perspective, a multicluster environment can easily consist of more than 50 namespaces and seemingly countless microservices accessed by multiple teams.
As described above, GitOps tools and processes have emerged as the ultimate way to solve many of the management, security, compliance and other challenges organizations face when clusters are deployed at scale. By helping to remove “pet clusters,” drift, cluster sprawl with a lack of centralized control, a lack of standardization, and other issues that rapidly drain DevOps resources and sap productivity, GitOps serves as a necessary operational framework. The Git-centric management and centralized control GitOps provides for CI/CD can also offer productivity boosts by automating many tasks that operations and even developer team members would otherwise be responsible for.
Ultimately, GitOps serves as the framework to take advantage of what Kubernetes was originally created to do: to offer significant computing advantages, resource-savings, and compatibility for apps deployed across multiple and highly distributed containerized environments.