Kubernetes Troubleshooting Primer

To run an application with minimal downtime, you need troubleshooting skills that extend past the application to the Kubernetes cluster in which it runs. It’s crucial to regularly debug and troubleshoot the entire Kubernetes cluster to offer consistent support and service. Troubleshooting can include identifying, diagnosing and resolving problems in Kubernetes clusters, nodes, pods, containers and other resources.
Because Kubernetes is a complex system, troubleshooting issues can be challenging. Problems can occur in a single container, one or more pods, a controller, a control plane component or a combination of these. This makes it challenging to diagnose and fix bugs even in small, local Kubernetes clusters. And if there’s limited visibility and numerous moving parts in a large-scale production setup, the problems worsen.
Luckily, there are successful approaches to solving these problems. This article explores the most common Kubernetes issues and solutions, including ImagePullBackOff, CrashLoopBackOff, out-of-memory (OOM) errors, BackoffLimitExceeded messages, and liveness and readiness probe issues.
Crash Course in Kubernetes Troubleshooting
The following sections list some of the most common Kubernetes error messages and issues, commands to identify them when they occur and tips for resolving them.
ImagePullBackOff
One reason that a Kubernetes pod fails to start is that the runtime was unable to retrieve a container image from the registry. In other words, the pod won’t launch because the image for at least one container specified in the manifest couldn’t be pulled.
When a pod experiences this issue, the kubectl get pods command will show the status of the pod as ImagePullBackOff. This error can occur when the image name or tag is entered into the pod manifest incorrectly. In this case, use docker pull from any of the cluster nodes connected to the Docker registry to confirm the right image name. Then, change it in the pod manifest.
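For example, you can check the pod’s events for the exact pull error and then try pulling the image by hand. This is only a sketch; the pod name, registry URL, image name and tag below are placeholders you’ll need to replace with your own values.
kubectl describe pod <POD_NAME>                   # check the Events section for the pull error
docker pull <REGISTRY_URL>/<IMAGE_NAME>:<TAG>     # confirm the image name and tag exist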
ImagePullBackOff can also appear when permission and authentication issues with the container registry prevent the pod from retrieving the image. This usually occurs when there’s an issue with the secret holding credentials (an ImagePullSecret) or when the pod lacks the required role-based access control (RBAC) role. To resolve this, ensure that the pod and node have the proper permissions and secrets, then use the docker pull command to try the operation manually.
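As a minimal sketch of the fix, assuming your registry credentials are valid (the secret name, pod name, registry URL, image and credentials below are all placeholders), you can create an image pull secret and reference it from the pod spec:
kubectl create secret docker-registry <SECRET_NAME> \
  --docker-server=<REGISTRY_URL> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD>
Then reference the secret in the pod manifest:
apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  containers:
  - name: app
    image: <REGISTRY_URL>/<IMAGE_NAME>:<TAG>
  imagePullSecrets:
  - name: <SECRET_NAME>    # must match the secret created above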
You can also turn up the log verbosity to get more information about why the error is occurring, for example by raising the kubelet’s --v log level on the affected node or running the Docker CLI in debug mode while you retry the pull.
If you don’t know the credentials or the contents of an ImagePullSecret to log in and pull the image, you can follow the steps below.
First, use the kubectl get secret command, replacing <SECRET_NAME> with the name of the ImagePullSecret that you want to retrieve.
kubectl get secret <SECRET_NAME> -o json
The above command will output the JSON representation of the secret, which includes the data field that contains the base64-encoded credentials.
To decode the base64-encoded credentials, you can use the base64 command that is available in most Unix-like operating systems, including Linux and macOS. For example:
kubectl get secret <SECRET_NAME> -o json | jq -r '.data.".dockerconfigjson"' | base64 --decode
This command uses jq to extract the value of the “.dockerconfigjson” field, which contains the base64-encoded credentials, and then pipes the output to the base64 command to decode it.
Once you have the decoded credentials, you can use them with the docker login command to authenticate with a Docker registry. For example:
docker login -u <USERNAME> -p <PASSWORD> <REGISTRY_URL>
Replace <USERNAME> and <PASSWORD> with the credentials that you decoded from the ImagePullSecret, and <REGISTRY_URL> with the URL of the Docker registry that you want to authenticate with. Then issue the docker pull command to test pulling the image.
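For instance, with placeholder values for the registry, image name and tag:
docker pull <REGISTRY_URL>/<IMAGE_NAME>:<TAG>
If the manual pull succeeds, the image reference and credentials are sound, and the remaining problem is most likely in how the pod references the secret.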
CrashLoopBackOff
Another reason that a pod may not run correctly is that one of its containers starts, crashes and is restarted over and over, with Kubernetes waiting a progressively longer back-off delay between attempts. When a pod experiences this issue, the kubectl get pods command shows the status of the pod as CrashLoopBackOff.
This error can occur if the pod cannot mount the requested volumes or if the node does not have the resources needed to run the pod. To get more information about the error, run the following command:
kubectl describe pod <pod name>
The end of the output will help identify the root cause. If the cause is that the pod cannot mount the requested volume, manually verify the volume by ensuring that the manifest appropriately specifies its details and that the pod can access the storage volume using those definitions.
Alternatively, if the node does not have enough resources, manually delete the pod so that it gets rescheduled on another node, or scale up your node resource capacity.
This scenario can occur if you use a nodeSelector to schedule a pod to run on a specific node in the Kubernetes cluster.
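When a container is crash looping, it also helps to look at the logs from the previous (crashed) container instance and at recent cluster events. The pod name below is a placeholder:
kubectl logs <POD_NAME> --previous     # logs from the last crashed container instance
kubectl get events --sort-by=.metadata.creationTimestamp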
Out-of-Memory
When a container is terminated due to an OOM error, there’s typically a resource shortage or a memory leak.
Execute the kubectl describe pod <pod name> command to determine whether a container in the pod has reached a resource limit. If so, the reason for the termination will appear as OOMKilled. This error indicates that the pod’s container has tried to use more memory than the configured limit.
To resolve OOMKilled, increase the container’s memory limit in the pod specification. If the pod still fails, check for a memory leak in the application and fix it on the application side.
To minimize the chance of an OOM error and optimize your Kubernetes environment, define how much of each resource, like CPU and memory, a container needs when you specify a pod. The kube-scheduler chooses which node to place a pod on based on the resource requests of its containers. Then, the kubelet reserves a portion of that node’s resources for each container. The kubelet also enforces the resource restrictions (limits) defined for a container, preventing the running container from using more of that resource than intended.
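As a minimal sketch (the pod name, image and values are placeholders, not recommendations), memory and CPU requests and limits can be set per container like this:
apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  containers:
  - name: app
    image: <IMAGE_NAME>:<TAG>
    resources:
      requests:
        memory: "256Mi"   # the scheduler reserves this much for the container
        cpu: "250m"
      limits:
        memory: "512Mi"   # the container is OOM-killed if it exceeds this
        cpu: "500m"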
BackoffLimitExceeded
BackoffLimitExceeded indicates that a Kubernetes job has reached its retry limit after multiple failed restarts.
A job in Kubernetes can control a pod’s runtime, monitor its status and restart it if the pod fails. The backoffLimit is a job configuration option that controls the number of times a pod can fail and retry before the job is finally considered failed. The default value for this setting is 6, meaning the job will retry six times, after which retries will cease. You can execute the kubectl describe job <job name> command to determine whether a job has failed due to the BackoffLimitExceeded error.
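For illustration, here is a sketch of a Job spec with an explicit backoffLimit; the job name, image and command are placeholders:
apiVersion: batch/v1
kind: Job
metadata:
  name: <JOB_NAME>
spec:
  backoffLimit: 4            # give up after four failed retries
  template:
    spec:
      restartPolicy: Never   # let the Job controller handle retries
      containers:
      - name: worker
        image: <IMAGE_NAME>:<TAG>
        command: ["/bin/sh", "-c", "<YOUR_COMMAND>"]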
A Kubernetes job’s success or failure state is based on the final exit code of the container it manages. So, if the exit code of a job is anything other than 0, it’s considered a failure. A job can fail for several reasons, including a designated path that doesn’t exist or an input file the job cannot locate for processing.
You can overcome this job failure by performing a failure analysis on the job definition. Execute the kubectl logs <pod name> command to check the pod’s log, which will typically uncover the reason for the failure.
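For example, assuming <JOB_NAME> is a placeholder for your job, you can list the pods the job created (the Job controller labels them with job-name) and read their logs directly:
kubectl get pods --selector=job-name=<JOB_NAME>
kubectl logs job/<JOB_NAME>      # fetches logs from one of the job’s pods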
Probe Failures
To monitor and respond to the state of pods, Kubernetes offers probes (health checks) to ensure only healthy pods serve requests. Each probe (Startup, Liveness and Readiness) helps Kubernetes pods self-heal when unhealthy. Probe problems often show up as pods that stay in the Pending state for too long, never become ready or can’t be scheduled. For more information on the types of Kubernetes probes and their courses of action, check out this article about probes and their impact on autoscaling.
A pod has a four-phase life cycle: Pending, Running, Succeeded and Failed. Its current phase depends on the termination status of its primary containers. A pod stuck in Pending can’t be scheduled onto a node. In most cases, scheduling is prevented by a lack of resources.
Reviewing the output of the kubectl describe command will give you clarity. If the pod stays pending, the root cause is often insufficient resources on the nodes. Alternatively, the pod may not be schedulable if you specify a host port for the container and that port is already in use on every node of the Kubernetes cluster.
Regardless of the reason for the failure, the kubectl describe pod command will usually uncover it, and your next step will vary accordingly. For instance, probe failures can occur if the application running in a container takes longer to respond than the configured probe timeout. Troubleshoot by increasing the probe timeout, monitoring the logs and testing the probe manually. Once you identify the root cause, optimize the application, scale up resources or adjust the probe configuration.
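As a minimal sketch (the pod name, image, port and the /healthz and /ready endpoints are assumptions about your application), liveness and readiness probes with explicit timeouts might look like this:
apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  containers:
  - name: app
    image: <IMAGE_NAME>:<TAG>
    livenessProbe:
      httpGet:
        path: /healthz            # assumed health endpoint
        port: 8080
      initialDelaySeconds: 10     # give the application time to start
      periodSeconds: 10
      timeoutSeconds: 5           # raise this if the app responds slowly
    readinessProbe:
      httpGet:
        path: /ready              # assumed readiness endpoint
        port: 8080
      periodSeconds: 5
      timeoutSeconds: 3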
Conclusion
Troubleshooting in Kubernetes can seem like an overwhelming undertaking. However, by properly diagnosing the problem and understanding the reasons behind it, you’ll find the troubleshooting process more manageable and less frustrating.
Kubernetes troubleshooting allows you to take the proper steps to tackle problems in your components and solve them effectively. Always approach problems from the bottom up. This will help you resolve issues more quickly by focusing on localized resources like pods, rather than resources like services, which span multiple components.
To further hone your troubleshooting skills, check out this Kubernetes resource management cheat sheet and learn more about how to optimize your Kubernetes resources.