Strategies for Running Stateful Workloads in Kubernetes: Pet Sets

Anyone who has tried running containers in production will agree that managing stateful services is one of the biggest pain points. Whether it is Swarm, Kubernetes, or Mesos, scaling out stateless containers is simple and straightforward compared to ensuring the high availability of stateful containers. Though the container ecosystem is actively working on the problem, it remains a challenge.
Influenced by the pets vs. cattle analogy, the Kubernetes community created Pet Sets as a way of running persistent, stateful workloads in Kubernetes 1.3 and above. The goal of a Pet Set is to bring the flexibility and power of Replica Sets to stateful pods. Just as a Pod managed by a Replica Set is recreated when it crashes, a Pet Set ensures that the desired configuration of a stateful Pod is always maintained.
This may sound simple, but given the way Kubernetes is architected, it calls for a radically different approach to scheduling and managing the lifecycle of Pods.
Before looking at how a Pet Set treats stateful Pods differently, let’s review the characteristics of a regular, stateless Pod:
- Pods are scaled out and scaled in through a Replica Set.
- Pods will be assigned an arbitrary name at runtime.
- Each pod may be scheduled on any available Node unless an affinity rule is in effect.
- Pods may be restarted and relocated at any point in time.
- Pods should never be referenced directly by their name or IP address.
- A Service selects a set of Pods that match specific criteria and exposes them through a well-defined endpoint.
- Any request targeting the Pod(s) goes through the Service, which routes the traffic to one of the Pods.
In the diagram above, each colored circle represents a Service while each square depicts a Pod. When a Service is created with the selector ‘color=red,’ Kubernetes brings all the Pods that carry the label ‘color=red’ under the new Service. Even Pods created later, scheduled on any Node, will start receiving requests as long as they carry the same label. In other words, any number of Pods matching the label is automatically brought under the fold of the Service, which acts as the well-known endpoint for communication.
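As an illustration, a Service that selects on this label could be sketched as follows; the Service name and ports are hypothetical, chosen only for the example:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: red-svc              # hypothetical Service name
spec:
  selector:
    color: red               # every Pod labeled color=red is selected
  ports:
  - port: 80                 # port the Service exposes
    targetPort: 8080         # port the selected Pods listen on (assumed)
```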
Let’s shift gears and consider how a highly available MySQL cluster is deployed in a Master-Master configuration. This requires that both servers talk to each other while remaining available to clients. It is common to use HAProxy to evenly route traffic to one of the MySQL servers.
M1 and M2 need well-known endpoints for replication to happen, while HAProxy becomes the endpoint through which all clients talk to the MySQL cluster. Deploying and managing this configuration in Kubernetes is not easy unless we use Pet Sets.
A stateful Pod participating in a Pet Set is called a Pet. It has the following attributes:
- A stable hostname that is always resolvable through DNS.
- An ordinal index number to represent the order/role of the Pet.
- Stable storage that is linked to the hostname and ordinal index.
When we create the MySQL HA configuration as a Pet Set named mysql, one of the Pets will be named mysql-0 while the other will be named mysql-1. There is quite a bit of significance attached to the ordinal numbering scheme, which becomes obvious when dealing with Persistent Volume Claims.
Before configuring a Pet Set, it is important to have a persistent storage backend in the form of a distributed file system or block storage. All the Nodes should have access to the mount point exposed by the storage backend.
Starting with Kubernetes 1.2, certain backends such as GCE Persistent Disks, OpenStack Cinder, and Amazon EBS volumes can be dynamically provisioned. For details on StorageClasses and dynamic provisioning, refer to the Kubernetes documentation.
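On such a backend, a claim annotated for dynamic provisioning is enough to have a volume created on demand. The sketch below uses the alpha-era annotation from that release line; the claim name and size are placeholders, not values from this setup:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-claim                                   # placeholder name
  annotations:
    volume.alpha.kubernetes.io/storage-class: "any"     # alpha-era hint that triggers dynamic provisioning
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                                     # placeholder size
```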
With the NFS backend, we need to perform the following steps:
- Provision Persistent Volumes for each Pet.
- Create a Persistent Volume Claim for each Pet, bound to one of the Persistent Volumes.
- Create a Service that is used by each Pet to resolve the DNS name of other Pets.
- Create a Service exposed to the external clients.
- Create the Pet Set with the required number of Pets.
First, the Persistent Volumes backed by the NFS share are created.
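A sketch of one such NFS-backed Persistent Volume is shown below; the server address, export path, and size are placeholders, and a second volume (mysql-pv-1) would be defined the same way for the other Pet:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-0                   # one volume per Pet; mysql-pv-1 is analogous
spec:
  capacity:
    storage: 10Gi                    # placeholder size
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.100               # placeholder NFS server address
    path: /exports/mysql-0           # placeholder export path
```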
We then create two Persistent Volume Claims that are bound to the above PVs.
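A sketch of one such claim follows. It assumes the Pet Set’s volume claim template is named db-data (an illustrative name used consistently in the sketches below), so that the claim name matches the <claim template>-<pet name> pattern the Pet Set controller looks for; db-data-mysql-1 would be created the same way:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-mysql-0        # <claim template name>-<pet name>; binds to one of the NFS volumes above
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi            # placeholder size matching the volume
```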
After we create the Pet Set, we see two Pets running as stateful Pods backed by the above PVs.
These two Pets belong to the mysql Pet Set, as the screenshot below shows.
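A Pet Set definition along these lines would produce this setup. PetSets use the apps/v1alpha1 API group in Kubernetes 1.3; the container image, password handling, and sizes below are illustrative placeholders rather than the exact manifest used here:

```yaml
apiVersion: apps/v1alpha1
kind: PetSet
metadata:
  name: mysql
spec:
  serviceName: db                    # headless Service that governs the Pets' DNS subdomain
  replicas: 2                        # yields the Pets mysql-0 and mysql-1
  template:
    metadata:
      labels:
        app: mysql
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"   # alpha debug hook; "true" lets the controller proceed
    spec:
      containers:
      - name: mysql
        image: mysql:5.6             # placeholder image
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "password"          # placeholder; a Secret would be used in practice
        ports:
        - containerPort: 3306
          name: db
        volumeMounts:
        - name: db-data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: db-data                  # yields the claims db-data-mysql-0 and db-data-mysql-1
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi              # placeholder size
```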
The Pet Set relies on two Services: one for the Pets to talk to each other and the other for external communication.
When we describe the Service, we see that it is headless and has a named port called db. The Pet Set definition references this Service by name (through its serviceName field), and that name becomes the subdomain under which each Pet’s DNS entry is created.
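The two Services can be sketched as follows: a headless Service named db (clusterIP: None) that gives each Pet a stable DNS entry, and a regular Service for external clients. The names and ports are consistent with the earlier sketches but remain placeholders for the actual manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db                   # referenced by the Pet Set's serviceName; forms the subdomain mysql-0.db, mysql-1.db
spec:
  clusterIP: None            # headless: DNS resolves directly to the Pets
  selector:
    app: mysql
  ports:
  - name: db
    port: 3306
---
apiVersion: v1
kind: Service
metadata:
  name: mysql                # well-known endpoint for external clients
spec:
  type: NodePort             # or LoadBalancer, depending on the environment
  selector:
    app: mysql
  ports:
  - name: db
    port: 3306
```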
The Pet mysql-0 can always be reached via the DNS name mysql-0.db.default.svc.cluster.local.
Likewise, mysql-0 can always reach mysql-1 through the shorter endpoint mysql-1.db, no matter which Node mysql-1 is scheduled on.
Assuming that replication has been configured between the two MySQL instances, we have almost managed to emulate the MySQL HA configuration on Kubernetes.
What happens if one of the Pets crashes? The Pet Set automatically brings up a new Pet with the same name and ordinal index as the old one. Since the Pet’s volume is automatically bound to the same Persistent Volume Claim, the state is immediately restored.
In some configurations, such as one MySQL master with multiple slaves, it is easy to add new Pets that act as slaves. We just need to create the Persistent Volumes and Claims beforehand, as before, and increase the number of replicas in the Pet Set, as sketched below.
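For example, adding a third Pet (mysql-2) would mean provisioning one more NFS-backed volume and claim along the lines shown earlier, then raising spec.replicas in the Pet Set from 2 to 3; the names, server address, and path below are placeholders:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-2
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.100             # placeholder NFS server address
    path: /exports/mysql-2         # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-mysql-2            # claim name the Pet Set expects for the new Pet
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```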
The objective of this series is to introduce various strategies for running stateful workloads in Kubernetes. In the upcoming articles, I will walk you through all the steps involved in configuring highly available, durable, stateful workloads in Kubernetes. Stay tuned.