Is Your Kubernetes Cluster Healthy? Here are 5 Ways to Find Out
Kubernetes is a highly capable technology, but without the right direction it can respond in unwanted or unexpected ways. As with most “smart” technologies, it is only as smart as its operator. To set teams up for success with Kubernetes, it is vital that they keep a finger on the pulse of their clusters. Here are five ways engineers can identify loose ends when setting up a Kubernetes cluster and keep workloads as healthy as possible. (For more Kubernetes observability deep dives, check out our eBook, Kubernetes Observability.)
Fortunately, there are technologies that collect logs, metrics, events, and security threats across Kubernetes environments to help monitor cluster health. These collectors gather data from all parts of the cluster, and that data can be aggregated into a high-level view of cluster health that surfaces resource utilization, configuration mistakes, and other issues in real time.
1. Set CPU Requests and Limits on all Pods
Requests and limits are the mechanisms Kubernetes uses to schedule pods intelligently given the available resources, such as CPU and memory. CPU is specified in millicores, so 1000m equals one core. A request is how much you anticipate a given container will need, and a limit is the upper bound on how much that container is allowed to use.
Make sure you have CPU requests set for all of your pods. A best practice is to set the request to one core or below and, if you need more compute power, to add replicas. Note that if you set the request too high, say 2000m, but your nodes only have a single core available, the pod will never get scheduled. In step five, I’ll show you how to double-check for unscheduled pods.
Make sure you have CPU limits set for all of your pods. As mentioned above, the limit is the upper bound, so Kubernetes will not allow your pod to use more CPU than you have defined in the limit. That said, CPU is somewhat forgiving, as it is considered a compressible resource. If your pod hits the CPU limit, it will not be terminated, but instead throttled. Your CPU will be restricted, so you could experience a performance hit.
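The CPU checks above can be sketched in a few lines of Python. This is a minimal illustration, assuming pod manifests are available as plain dictionaries shaped like `spec.containers[].resources`; the helper names `parse_cpu_millicores` and `audit_cpu_settings`, the example pod, and the node size are all hypothetical and not part of any Kubernetes client.

```python
def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as "500m" or "1.5" to millicores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(text[:-1])
    return int(round(float(text) * 1000))


def audit_cpu_settings(pod, largest_node_millicores):
    """Return findings for one pod manifest (a plain dict).

    Flags containers without a CPU request or limit, and pods whose
    summed container requests exceed the largest node (assumed input).
    """
    findings = []
    total_request = 0
    for container in pod["spec"]["containers"]:
        name = container["name"]
        resources = container.get("resources", {})
        request = resources.get("requests", {}).get("cpu")
        limit = resources.get("limits", {}).get("cpu")
        if request is None:
            findings.append(f"{name}: no CPU request")
        else:
            total_request += parse_cpu_millicores(request)
        if limit is None:
            findings.append(f"{name}: no CPU limit")
    if total_request > largest_node_millicores:
        findings.append(
            f"pod requests {total_request}m but the largest node offers "
            f"{largest_node_millicores}m"
        )
    return findings


# Hypothetical pod: one container asks for two cores, one sets nothing.
example_pod = {
    "spec": {
        "containers": [
            {
                "name": "web",
                "resources": {
                    "requests": {"cpu": "2000m"},
                    "limits": {"cpu": "2000m"},
                },
            },
            {"name": "sidecar"},
        ]
    }
}

for finding in audit_cpu_settings(example_pod, largest_node_millicores=1000):
    print(finding)
```

With single-core nodes, the sketch reports the missing settings on the sidecar and the 2000m request that can never be scheduled.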
2. Set Memory Requests and Limits on all Pods
Make sure you have memory requests set for all of your pods: The memory request is how much memory you think your pod will consume. As with CPU, Kubernetes will not schedule a pod if its memory request is larger than what your largest node can offer.
Make sure you have memory limits set for all of your pods: The memory limit is the hard upper bound for how much memory your pod will be allowed to use. Unlike CPU, memory is not compressible and cannot be throttled. If a container goes past its memory limit, then it will be terminated.
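A similar sketch works for memory. It assumes you already know a container's request, limit, and observed peak usage as Kubernetes-style quantity strings (for example "512Mi"); the function names and the sample figures are hypothetical, and only the common suffixes up to tera are handled.

```python
# Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes used in quantities.
MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory_bytes(quantity):
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    text = str(quantity).strip()
    # Try two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(MEMORY_UNITS, key=len, reverse=True):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * MEMORY_UNITS[suffix])
    return int(float(text))


def memory_findings(request, limit, observed_peak, largest_node):
    """Return findings for one container's memory settings."""
    findings = []
    if parse_memory_bytes(request) > parse_memory_bytes(largest_node):
        findings.append("request exceeds the largest node; pod cannot be scheduled")
    if parse_memory_bytes(observed_peak) >= parse_memory_bytes(limit):
        findings.append("observed peak reaches the limit; container risks termination")
    return findings


print(memory_findings("256Mi", "512Mi", "300Mi", "16Gi"))
print(memory_findings("64Gi", "64Gi", "1Gi", "32Gi"))
```

Because memory cannot be throttled, it is usually worth leaving some headroom between the observed peak and the limit rather than setting them equal.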
3. Audit Provisioned Resources
Another thing to check for optimal Kubernetes health is whether you have under- or over-provisioned your resources. If you have a surplus of available CPU and memory, then you are under-consuming, and likely paying too much. On the other hand, if you are getting close to 100 percent utilization, you might run into problems when you need to scale or handle an unexpected load.
Check the remaining pod capacity. A useful metric is “kube_node_status_allocatable”, which kube-state-metrics exposes to report how much of each resource (CPU, memory, and pod slots) a node makes available for scheduling. Subtracting the pods already running on each node from its allocatable pod count, and adding up the results, gives a rough guess at how much we can scale out without running into issues.
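As a rough sketch of that calculation, assume each node is described by its allocatable pod count and its current number of running pods. The field names and figures below are made up for illustration.

```python
def remaining_pod_capacity(nodes):
    """Sum the free pod slots across nodes, never counting below zero."""
    return sum(
        max(node["allocatable_pods"] - node["running_pods"], 0)
        for node in nodes
    )


# Hypothetical three-node cluster.
example_nodes = [
    {"name": "node-a", "allocatable_pods": 110, "running_pods": 30},
    {"name": "node-b", "allocatable_pods": 110, "running_pods": 105},
    {"name": "node-c", "allocatable_pods": 110, "running_pods": 110},
]

print(remaining_pod_capacity(example_nodes))
```

This counts pod slots only; a node with free slots may still lack the CPU or memory that a new pod requests.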
Check the total percent CPU usage vs percent CPU requested vs percent CPU limits: The total CPU usage will tell you how much you are using right now, requested tells us how much we guessed we might need, and the limit is the hard limit we set as the upper bound.
In the example below, we are only using 2.5% of our available compute power. We are way over-provisioned and can probably scale back. By contrast, our CPU requested is 46%, so we thought we were going to need way more than we are actually using. Either we guessed wrong or we have highly bursty needs that we haven’t planned for.
Finally, our CPU Burstable tells us the sum of all of our CPU limits. As this is lower than our CPU requested, we might want to go back and check our limit settings. Either we don’t have limits set on everything or we have misconfigured our limits.
Check total percent Memory usage vs percent Memory requested vs percent Memory limits. Just as with CPU, we can check whether our memory has been over-provisioned. At only 3.8% utilization, we are indeed over-provisioned, although it also means we can comfortably scale for a long time.
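The usage, requested, and limits comparison can be expressed as a small calculation. The sketch below assumes you have cluster-wide totals in millicores; the capacity and totals are invented to reproduce the 2.5% and 46% figures discussed above, and the 20% and 80% thresholds are arbitrary assumptions rather than recommended values.

```python
def summarize(capacity, usage, requested, limits):
    """Express usage, requests, and limits as percentages of capacity."""
    def pct(value):
        return round(100.0 * value / capacity, 1)

    return {
        "usage_pct": pct(usage),
        "requested_pct": pct(requested),
        "limits_pct": pct(limits),
    }


def interpret(summary, low=20.0, high=80.0):
    """Turn the percentages into notes. Thresholds are assumptions."""
    notes = []
    if summary["usage_pct"] < low:
        notes.append("usage is low; the cluster may be over-provisioned")
    if summary["usage_pct"] > high:
        notes.append("usage is high; little headroom for scaling or bursts")
    if summary["limits_pct"] < summary["requested_pct"]:
        notes.append("limits total less than requests; check for missing limits")
    return notes


# Hypothetical totals in millicores for a 40-core cluster.
cpu = summarize(capacity=40000, usage=1000, requested=18400, limits=12000)
print(cpu)
for note in interpret(cpu):
    print(note)
```

The same two functions can be reused for memory by passing totals in bytes instead of millicores.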
4. Review Pod Distribution Across Nodes
When we look at how pods are distributed across the available nodes in a cluster, we want a roughly even distribution. If certain nodes are completely overloaded or underloaded, it could be a sign of a larger issue worth some investigation.
Some things to check that might cause uneven distribution include:
- Node affinity: Affinity is a pod setting that causes pods to prefer, or require, nodes with certain properties. For example, pods might need to run on machines with a GPU or SSD attached, or pods might require nodes with specific security isolation or policies. Double-checking your affinity settings could help narrow down the cause of uneven distribution, and reduce the likelihood of scaling issues.
- Taints and tolerations: Taints are the opposite of affinity. A taint is a setting on a node that repels pods, so only pods with a matching toleration can be assigned there. You might use this if you want to reserve nodes for specific pods or ensure that pods on that node have full access to the available resources.
- Limits and requests: Look back at your limit and request settings. This is so often the cause of issues that it is worth mentioning in three sections of this post. If your scheduler doesn’t have the right information about what pods need, it is going to do a bad job of scheduling.
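To spot uneven distribution before digging into those settings, you can count pods per node and flag outliers. The sketch below assumes pods are dictionaries carrying the `spec.nodeName` field found in pod objects; the 50% tolerance and the helper names are assumptions chosen for illustration.

```python
from statistics import mean


def pods_per_node(pods, node_names):
    """Count scheduled pods on each known node, including empty nodes."""
    counts = {name: 0 for name in node_names}
    for pod in pods:
        node = pod.get("spec", {}).get("nodeName")
        if node in counts:
            counts[node] += 1
    return counts


def unbalanced_nodes(counts, tolerance=0.5):
    """Return nodes whose pod count deviates from the mean by more than
    tolerance * mean. The tolerance is an arbitrary assumption."""
    average = mean(list(counts.values()))
    return sorted(
        name for name, count in counts.items()
        if abs(count - average) > tolerance * average
    )


# Hypothetical cluster: 9 pods on node-a, 5 on node-b, 1 on node-c.
example_pods = (
    [{"spec": {"nodeName": "node-a"}}] * 9
    + [{"spec": {"nodeName": "node-b"}}] * 5
    + [{"spec": {"nodeName": "node-c"}}]
)
counts = pods_per_node(example_pods, ["node-a", "node-b", "node-c"])
print(counts)
print(unbalanced_nodes(counts))
```

Raw pod counts are only a first signal, since a node running a few large pods can be busier than one running many small ones.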
5. Check for Pods in a Bad State
In Kubernetes environments, the current state changes from moment to moment, so being overly concerned about every terminated pod will slowly eat away at your time and sanity. However, the following list is worth keeping an eye on to make sure it matches what you might expect based on the current events in your cluster.
- Nodes not ready: Nodes can fall into this state for a number of reasons, but often it is because they ran out of memory or disk space.
- Unscheduled pods: Pods typically end up in an unscheduled state because the scheduler cannot fulfill the CPU or memory requests. Double-check that your cluster has the available resources your pods are requesting.
- Pods that failed to create: Pods often fail at creation because there is an issue with the image, such as a dependency missing in the startup script. In this case, go back to square one.
- Container restarts: Some container restarts are not a cause for concern, but seeing a lot of them could mean containers are being OOMKilled (Out of Memory Killed). Out of Memory is one of the most common errors in Kubernetes, and it can be caused by image issues, downstream dependency issues, or, surprise, limit and request issues.
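The checklist above maps onto fields in node and pod status objects. The sketch below assumes you have those objects as dictionaries, for example parsed from JSON output; the restart threshold of 5 and the set of waiting reasons treated as image or creation problems are assumptions, and the function names are hypothetical.

```python
# Waiting reasons treated here as creation problems (an assumed shortlist).
CREATION_PROBLEM_REASONS = {
    "ErrImagePull", "ImagePullBackOff", "CreateContainerConfigError",
}


def node_is_ready(node):
    """True if the node reports a Ready condition with status "True"."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    return False


def pod_findings(pod, restart_threshold=5):
    """Return findings for one pod based on its status block."""
    findings = []
    status = pod.get("status", {})
    for condition in status.get("conditions", []):
        if condition.get("type") == "PodScheduled" and condition.get("status") == "False":
            findings.append(f"unscheduled ({condition.get('reason', 'unknown')})")
    for cs in status.get("containerStatuses", []):
        name = cs.get("name", "?")
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in CREATION_PROBLEM_REASONS:
            findings.append(f"{name}: failed to create ({waiting['reason']})")
        if cs.get("restartCount", 0) >= restart_threshold:
            findings.append(f"{name}: {cs['restartCount']} restarts")
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            findings.append(f"{name}: last termination reason was OOMKilled")
    return findings


# Hypothetical pod that keeps getting killed for exceeding its memory limit.
restarting_pod = {
    "status": {
        "conditions": [{"type": "PodScheduled", "status": "True"}],
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 7,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled"}},
            }
        ],
    }
}

# Hypothetical pod the scheduler could not place.
pending_pod = {
    "status": {
        "conditions": [
            {"type": "PodScheduled", "status": "False", "reason": "Unschedulable"}
        ]
    }
}

print(pod_findings(restarting_pod))
print(pod_findings(pending_pod))
```

Running this over every pod and comparing the findings with recent deployments or scaling events helps separate expected churn from real problems.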
These cluster health best practices can limit the unexpected behavior in a Kubernetes environment, and ensure you don’t run into issues scaling when the time comes. These also give you a starting point to help you answer those amorphous questions like, “Is my Kubernetes cluster healthy?” If all of these items are in the green, your cluster is likely in good health and you can rest easy.