Implement Node and Pod Affinity/Anti-Affinity in Kubernetes: A Practical Example

I introduced the concept of node and pod affinity/anti-affinity in last week’s tutorial. We will explore the idea further through a real-world scenario.
Objective
We are going to deploy three microservices in a four-node Kubernetes cluster: MySQL, Redis, and a Python/Flask web application. Since one of the nodes has an SSD persistent disk attached, we want to ensure that the MySQL Pod is scheduled on that node. Redis caches database queries to accelerate application performance; since it acts purely as a cache, it doesn’t make sense to run more than one Redis Pod per node. The next goal is to place the web Pods on the same nodes as the Redis Pods, which keeps latency between the web and cache layers low. Even if we scale the number of web Pod replicas, a web Pod will never be placed on a node that doesn’t run a Redis Pod.
Setting up a GKE Cluster and Adding an SSD Disk
Let’s launch a GKE cluster, add an SSD persistent disk to one of the nodes, and label the node.
gcloud container clusters create "tns" \
  --zone "asia-south1-a" \
  --username "admin" \
  --cluster-version "1.13.11-gke.14" \
  --machine-type "n1-standard-4" \
  --image-type "UBUNTU" \
  --disk-type "pd-ssd" \
  --disk-size "50" \
  --scopes "https://www.googleapis.com/auth/compute","https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --num-nodes "4" \
  --enable-stackdriver-kubernetes \
  --network "default" \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing
This will result in a 4-node GKE cluster.
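If kubectl is not yet configured against the new cluster, fetch the credentials and confirm that all four nodes are ready; this is just a quick verification step:

gcloud container clusters get-credentials tns --zone asia-south1-a
kubectl get nodes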
Let’s create a GCE Persistent Disk and attach it to the first node of the GKE cluster.
gcloud compute disks create \
  mysql-disk-1 \
  --type pd-ssd \
  --size 20GB \
  --zone asia-south1-a
gcloud compute instances attach-disk gke-tns-default-pool-b11f5e68-2h4f \
  --disk mysql-disk-1 \
  --zone asia-south1-a
We need to mount the disk within the node to make it accessible to the applications.
gcloud compute ssh gke-tns-default-pool-b11f5e68-2h4f \
  --zone asia-south1-a
Once you SSH into the GKE node, run the below commands to mount the disk.
# Format the new disk (the device name may differ; confirm with lsblk before formatting)
sudo mkfs.ext4 -m 0 -F -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb
# Create the mount point and mount the disk
sudo mkdir -p /mnt/data
sudo mount -o discard,defaults /dev/sdb /mnt/data
sudo chmod a+w /mnt/data
# Persist the mount across reboots
echo UUID=`sudo blkid -s UUID -o value /dev/sdb` /mnt/data ext4 discard,defaults,nofail 0 2 | sudo tee -a /etc/fstab
Running the lsblk command confirms that the disk is mounted at /mnt/data.
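For reference, the check is a single command run from the same SSH session:

lsblk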
Exit the shell and run the below command to label the node as disktype=ssd.
kubectl label node gke-tns-default-pool-b11f5e68-2h4f \
  disktype=ssd --overwrite
Let’s verify that the node is indeed labeled.
kubectl get nodes -l disktype=ssd |
Deploying the Database Pod
Let’s go ahead and deploy a MySQL Pod targeting the above node. Use the below YAML specification to create the database Pod and expose it as a ClusterIP-based Service.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - port: 3306
    name: mysql
    targetPort: 3306
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - image: mysql:5.6
        name: mysql
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: "password"
        ports:
        - containerPort: 3306
          name: mysql
        volumeMounts:
        - name: mysql-persistent-storage
          mountPath: /var/lib/mysql
      volumes:
      - name: mysql-persistent-storage
        hostPath:
          path: /mnt/data
There are a few things to note in the above spec. First, we implement node affinity by including the below clause:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
This will ensure that the Pod is scheduled on the node that carries the label disktype=ssd. Since we are sure that it will always land on that node, we use a hostPath volume for persistent storage. The hostPath volume points to the mount point of the SSD disk we attached in the previous step.
volumeMounts:
- name: mysql-persistent-storage
  mountPath: /var/lib/mysql
volumes:
- name: mysql-persistent-storage
  hostPath:
    path: /mnt/data
Let’s submit the Pod spec to Kubernetes and verify that it is indeed scheduled on the node that matches the label.
kubectl apply -f db.yaml |
kubectl get nodes -l disktype=ssd |
kubectl get pods -o wide |
It’s evident that the Pod is scheduled on the node that matches the affinity rule.
Deploying the Cache Pod
It’s time to deploy the Redis Pod that acts as the cache layer. We want to make sure that no two Redis Pods run on the same node. For that, we will define an anti-affinity rule.
The below specification creates a Redis Deployment with 3 Pods and exposes them through a ClusterIP Service.
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  ports:
  - port: 6379
    name: redis
    targetPort: 6379
  selector:
    app: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis
  replicas: 3
  template:
    metadata:
      labels:
        app: redis
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
The below clause ensures that a node never runs more than one Redis Pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - redis
      topologyKey: "kubernetes.io/hostname"
Submit the Deployment spec and inspect the distribution of the pods.
kubectl apply -f cache.yaml |
kubectl get pods -l app=redis -o wide |
It’s clear that the Redis Pods have been placed on unique nodes.
Deploying the Web Pod
Finally, we want to place a web Pod on the same node as the Redis Pod.
Submit the Deployment spec to create 3 Pods of the web app and expose them through a Load Balancer.
apiVersion: v1
kind: Service
metadata:
  name: web
  labels:
    app: web
spec:
  ports:
  - port: 80
    name: web
    targetPort: 5000
  selector:
    app: web
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  replicas: 3
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: janakiramm/py-red
        env:
        - name: "REDIS_HOST"
          value: "redis"
kubectl apply -f web.yaml |
The container image used by the web app does nothing but read rows from the database, checking first whether they are available in the cache.
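Once the Load Balancer is provisioned, the external IP of the web Service shows where the application can be reached:

kubectl get service web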
Let’s list all the Pods along with the names of the nodes they are scheduled on.
kubectl get pods -o wide | awk {'print $1" " $7'} | column -t |
We can see that the node gke-tns-default-pool-b11f5e68-2h4f runs three Pods: MySQL, Redis, and web. The other two nodes each run a Redis Pod and a web Pod, co-located for low latency.
Let’s have some fun with the affinity rules. Remember, we are running four nodes in the cluster, yet one of them is not running any Pods because the Kubernetes scheduler is obeying the rule of co-locating the web and Redis Pods.
What happens when we scale the number of replicas of the web Deployment? The anti-affinity rule dictates that no two web Pods can run on the same node, while the affinity rule requires every web Pod to be co-located with a Redis Pod. With only three Redis Pods in the cluster, the scheduler cannot place a fourth web Pod, so it stays in the Pending state indefinitely, even though one node has no Pods running on it at all.
kubectl scale deploy/web --replicas=4 |
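To see why the extra replica cannot be placed, list the web Pods and describe the one stuck in Pending; the Pod name below is a placeholder, so substitute the name reported by the first command:

kubectl get pods -l app=web
kubectl describe pod <pending-web-pod-name>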
Remove the anti-affinity rule from the web Deployment and try scaling the replicas again. Now Kubernetes can schedule web Pods on any node that has a Redis Pod. This makes the Deployment less restrictive, allowing any number of web Pods to run on a node, provided that node also runs a Redis Pod.
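One way to do this, sketched below, is to patch the Deployment to drop the podAntiAffinity section; the subsequent scale command is a no-op if the Deployment is already at four replicas. Editing web.yaml and re-applying it works just as well:

kubectl patch deployment web --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/affinity/podAntiAffinity"}]'
kubectl scale deploy/web --replicas=4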
kubectl get pods -o wide | awk {'print $1" " $7'} | column -t |
From the above output, we see that the node gke-tns-default-pool-b11f5e68-cxvw runs two instances of the Web Pod.
But one of the nodes is still lying idle due to the pod affinity/anti-affinity rules. If you want to utilize it, scale the Redis Deployment so that a Redis Pod lands on the idle node, and then scale the web Deployment to place web Pods alongside it.
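For instance, the sketch below scales Redis first so that its anti-affinity rule pushes the new Pod onto the idle node, and then scales the web Deployment so the scheduler can co-locate additional web Pods with it; the replica counts are illustrative:

kubectl scale deploy/redis --replicas=4
kubectl scale deploy/web --replicas=6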
Continuing the theme of co-locating database and cache layers on the same node, in the next part of this series, we will explore the sidecar pattern to deploy low-latency microservices on Kubernetes.
Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.