Introducing CloneSet: A Production-Grade Kubernetes Deployment CRD
Many people refer to Kubernetes as the new Linux, in the sense that it is the operating system for cloud computing. The two are alike in another way too: Kubernetes is layered just like Linux, and it allows different libraries and utilities to be built on top of it. Sometimes these programs have overlapping functionality. Sometimes they depend on one another. In Kubernetes (K8s), such extensions are defined through Custom Resource Definitions (CRDs).
In this article, I’ll introduce an open source CRD for workload management called CloneSet.
The CloneSet CRD belongs to a family of CRDs called Kruise, which is part of Alibaba’s open source effort.
Alibaba Group adopted K8s in production early on, and it now runs one of the world’s largest clusters. During the migration to K8s and its day-to-day operation, many limitations of upstream K8s surfaced. For example, Deployment doesn’t support canary rolling updates. StatefulSet does, through partitioned updates, but because it is a StatefulSet it updates pods one by one. If you have hundreds of pods, how long will that take? The rollout strategies available from the upstream workloads are limited, and they are scattered across different workloads. That’s understandable, since K8s is a framework and cannot satisfy every use case.
Another example: Deployment gives pods random names, which creates issues with monitoring after a service restarts. StatefulSet does enforce a strict naming order. However, it starts and updates pods one by one, serially, without much flexibility.
So Alibaba Cloud created the open source project Kruise. It contains several CRDs that have been proven in real production environments and are now shared with everyone. CloneSet is the representative workload CRD, and it has quite a few unique characteristics.
What Is CloneSet?
First, let’s talk about the naming convention.
In Kubernetes, there is a naming convention for controllers and CRDs. The suffix “Set” suggests that the CRD works on pods directly, as “StatefulSet” does. In the same way, CloneSet works on individual pods. But StatefulSet emphasizes “stateful” workloads, while CloneSet focuses on “stateless” workloads.
Feature-wise, CloneSet is a workload for stateless pods. It now supports all the upstream Kubernetes rollout strategies. Yes, every rollout strategy found in the other upstream workloads is supported here.
Here is the table of comparison:
Besides the rollout strategies, CloneSet also offers a wide range of other capabilities. Let’s take a look at a few of them.
In-Place Update
In-place update means the pod is not recreated when it is updated. Only the container image is updated. The Pod object itself, its IP, its PVC, etc., all stay the same. Because Kubernetes works at the granularity of the Pod and not the container, the pod is intact from the perspective of Kubernetes.
This is a cool feature if you have multiple containers in a pod, especially if the container you want to update is not the main container. For example, every time you update Istio, the main container is restarted along with the sidecar, because the whole pod is recreated. Do you want that to happen?
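To make this concrete, here is a minimal sketch in Python that builds a CloneSet manifest asking for in-place updates. The pod name, container name, and image are placeholders I made up. The apiVersion and the updateStrategy type are the values I understand the Kruise API to use, so check them against the release you run.

```python
import json


def cloneset_manifest(name: str, image: str, replicas: int = 3) -> dict:
    """Build a minimal CloneSet manifest that prefers in-place updates.

    All names and the image are hypothetical placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps.kruise.io/v1alpha1",
        "kind": "CloneSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "main", "image": image}]},
            },
            # Swap the container image in place when that is the only change;
            # otherwise fall back to recreating the pod.
            "updateStrategy": {"type": "InPlaceIfPossible"},
        },
    }


# kubectl accepts JSON as well as YAML, so the output can be piped to
# `kubectl apply -f -`.
print(json.dumps(cloneset_manifest("sample", "nginx:alpine"), indent=2))
```

Changing only the image in this manifest and applying it again is the case where the pod should keep its identity.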
MaxSurge
maxSurge is a new feature available in the latest 0.5 release. The use case is a rolling update that keeps the number of running replicas stable. Just as swapping two values in memory needs a temporary variable, a rolling update needs a buffer. maxSurge defines the size of that buffer when swapping out old pods and swapping in new ones. For example, if you have five replicas and maxSurge is 20%, you have one pod as a buffer.
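The arithmetic behind that example can be sketched in a few lines of Python. The function name is mine, and rounding percentages up is an assumption borrowed from how upstream Deployment treats maxSurge, so treat it as an illustration rather than the controller’s exact logic.

```python
from typing import Union


def surge_pods(replicas: int, max_surge: Union[int, str]) -> int:
    """Return how many extra pods may exist during a rolling update.

    max_surge is either an absolute number or a percentage string like "20%".
    Percentages are rounded up (assumption: same as upstream Deployment).
    """
    if isinstance(max_surge, int):
        return max_surge
    percent = int(max_surge.rstrip("%"))
    # Integer ceiling division avoids floating-point surprises.
    return (replicas * percent + 99) // 100


print(surge_pods(5, "20%"))  # 1, the single buffer pod from the example above
print(surge_pods(10, 3))     # 3, absolute values pass through unchanged
```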
Deployment also has a maxSurge rolling update strategy, but it only works together with maxUnavailable.
For CloneSet, maxSurge can be combined with partition and maxUnavailable, and even with in-place update. That is far more powerful than Deployment.
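Here is a rough sketch of what such a combined strategy could look like, again as a Python dictionary that serializes to a manifest fragment. The numbers are arbitrary. I am also assuming that partition in CloneSet means the number of pods kept on the old revision, which is how I read the Kruise documentation; verify it for your version.

```python
import json


def update_strategy(partition: int, max_surge: str, max_unavailable: str,
                    in_place: bool = True) -> dict:
    """Combine partition, maxSurge, maxUnavailable and in-place update."""
    return {
        "type": "InPlaceIfPossible" if in_place else "ReCreate",
        "partition": partition,
        "maxSurge": max_surge,
        "maxUnavailable": max_unavailable,
    }


def pods_to_update(replicas: int, partition: int) -> int:
    """Pods that move to the new revision (assumes partition = pods kept old)."""
    return max(replicas - partition, 0)


strategy = update_strategy(partition=6, max_surge="20%", max_unavailable="10%")
print(json.dumps({"spec": {"updateStrategy": strategy}}, indent=2))
print(pods_to_update(10, 6))  # 4 pods get the new revision, 6 stay as a canary guard
```

Lowering partition step by step, down to zero, is one way to turn this into a staged canary rollout.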
Selective Pod Deletion
Users can specify which pods are removed first when scaling down. Yes, this is not the same as kubectl delete pod. When scaling down, both Deployment and StatefulSet follow their own sequence, which users cannot control. With CloneSet, you can select the pods you want removed first, before the rest are considered.
For example, if the pod sample-9m4hp is listed for deletion, it is the first to be deleted when scaling down. The use case for this feature is draining one or more nodes while scaling down, for resource scheduling purposes. It gives you the power to override the default order.
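A minimal sketch of such a request, assuming the field is spec.scaleStrategy.podsToDelete as I understand the Kruise API, and assuming a CloneSet named sample scaled down to four replicas (both are illustrative values, not taken from a real cluster):

```python
import json
from typing import List


def scale_down_patch(replicas: int, pods_to_delete: List[str]) -> dict:
    """Build a merge patch that scales down and names the pods to remove first."""
    return {
        "spec": {
            "replicas": replicas,
            "scaleStrategy": {"podsToDelete": pods_to_delete},
        }
    }


patch = scale_down_patch(4, ["sample-9m4hp"])
print(json.dumps(patch))
```

The printed JSON could then be passed to something like kubectl patch cloneset sample --type merge -p '<json>'.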
Per-Pod PVC
CloneSet offers per-pod PVCs. In StatefulSet, each pod gets a volume matching its name. But of course, you have to bear with the restrictions of StatefulSet. In Deployment, each pod gets a volume with a random name that is not related to the pod name. If the pod is updated, you can’t find that volume anymore. Per-pod PVC lets each pod store stateful information without being declared as a StatefulSet.
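As a sketch, the spec below pairs a volumeClaimTemplates entry with a matching volumeMounts entry, the same pattern StatefulSet uses. I am assuming CloneSet accepts volumeClaimTemplates in this shape; the claim name, mount path, size, and image are made-up placeholders.

```python
import json


def pvc_template(name: str, size: str) -> dict:
    """One claim template; each pod gets its own PVC stamped from it."""
    return {
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def cloneset_spec_with_pvc(app: str, image: str, claim: str,
                           mount_path: str, size: str) -> dict:
    """CloneSet spec whose container mounts the per-pod claim by name."""
    labels = {"app": app}
    return {
        "replicas": 2,
        "selector": {"matchLabels": labels},
        "template": {
            "metadata": {"labels": labels},
            "spec": {
                "containers": [{
                    "name": "main",
                    "image": image,
                    # The mount name must match the claim template name.
                    "volumeMounts": [{"name": claim, "mountPath": mount_path}],
                }]
            },
        },
        "volumeClaimTemplates": [pvc_template(claim, size)],
    }


spec = cloneset_spec_with_pvc("sample", "nginx:alpine", "data-vol", "/data", "20Gi")
print(json.dumps(spec, indent=2))
```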
There are many other features available from CloneSet. For the detailed list of features available, please check out the git repo of the CRD. Submit an issue if you have a use case that you would like to be added. There is a tutorial available here.
As mentioned above, the requirements for Kruise come from real-world use cases. If you are using K8s in production, I bet you’ve run into some of the issues mentioned above. So give it a try! We would like to hear your story. Here is the link to the project Kruise.