With great power comes great responsibility. While Kubernetes is an extremely powerful system with nearly infinite scale and distribution capabilities, it is also a double-edged sword from an operational management perspective. For the most part, managing Kubernetes falls on platform engineering teams (DevOps and SRE), but there are quite a few areas that could be shifted left to improve long-term troubleshooting and maintenance.
Even before deploying to Kubernetes, there are some proactive steps to democratize system management and maintenance, making it possible for ops and devs to troubleshoot more quickly and efficiently.
For this, we’ve rounded up hard-earned tips that, when applied from the design phases of your Kubernetes systems, can save much of the troubleshooting pains. By baking these in from the very beginning, you will be able to distribute the management efforts and provide a greater sense of ownership through the entire engineering organization.
Hacks of Kindness: How to Shift Your Kubernetes Troubleshooting Left in Practice
Below, we’re going to take a dive into the quick wins that you can implement right away in your engineering organization. These will make your Kubernetes system easier for everyone to troubleshoot.. As with all code, there will always be issues and breakage, so you want to make the processes of getting to the root cause as quick and painless as possible.
1. Stateless-First Design
When Kubernetes was initially conceived, it was built for stateless applications, although stateful support has made a lot of progress.. That’s why it remains a good practice to build stateless applications from the start when deploying them to Kubernetes. This allows you to remove a lot of the risk around your applications and also gain the benefits of elasticity.
Applications that you’re required to restart aren’t dependent on external information or data to run the same as they did before restarting, so they’re a lot easier to manage.
Therefore, applications that rely on external state for initialization and startup will have a lot more difficulty adapting to Kubernetes operations, as they require significantly more engineering expertise to enable similar capabilities — elegantly restarting after a crash, for example. In many instances, this cannot be done at all.
On the flip side, stateless applications can scale up or down through simple definitions in the code. It also helps with troubleshooting. For example, when K8s detects a problem, it will just restart your application and container forever if something goes wrong and enable you to resume operations as usual, seamlessly. This type of approach is much more prone to breakage with stateful applications.
2. How to Segregate Environments
Much has been said about managing environments on Kubernetes; a quick search will bring up plenty of great resources. For those who prefer the TL;DR version, a commonly recommended practice is to create an environment for each stage of development: development, QA, staging, production.
With Kubernetes, it is also possible to separate environments logically using the namespaces resource, which segregates environments by “names” that point to specific objects in isolation while still sharing the same underlying resources and infrastructure.
This also maintains these different environments on the same shared resources, such as nodes, meaning that if one of the environments is using too many system resources, such as CPU or memory, your production environment will have less to work with.
One way to work around this is with node taints and tolerations, which enables you, according to your configuration, to let the K8s scheduler decide which node it should live in. This makes it possible to choose which node or specific environment lives on which machine based on the amount of resources it usually consumes.
For someone just getting started with Kubernetes, separating the cluster entirely is likely a better practice. While it is more expensive, it will deliver greater safety overall in addition to being easier to launch and manage.
3. Proper YAML Management (AKA Your K8s Deployment Manifest)
When working with YAML files, including helpful metadata can significantly simplify troubleshooting in the long run.
Some good practices include setting the right labels and annotations, environment variables, secrets and config maps that point to the proper objects and volumes, configuring liveness and readiness probes. In this way, K8s will know when your app is healthy and ready to accept traffic, or otherwise alert you when there is an issue.
Other important data includes cloud-specific configurations — node taints, tolerations, readiness gates etc. — depending on the cloud provider you choose to use.
4. Kubernetes-Aware Logging
With K8s’ inherent complexity, aggregating and collecting logs from a multitude of containers, clusters and machines is a daunting operation.
Do “future you” a favor and take some proactive actions to simplify the troubleshooting processes by making sure to tag and label your logs properly. Some good practices include:
- the proper service name (and not rely on the pod names that are volatile)
- the version, and;
- cluster environment information
Also, note that there are many K8s-specific tags that define the application’s production or runtime environment. This can help with troubleshooting and understanding where an issue originated from.
So, for example, if you have an application with 500 replicas and an error occurs in one of them, understanding which container this error came from will be nearly impossible if you don’t properly tag and label your containers and logs.
5. Invest in Proper Monitoring
Prometheus with Grafana have become very popular ways to monitor applications on Kubernetes due to being open source projects as well. While open source projects provide a lot of flexibility and are built for customizability, they do come with a high learning curve. It’s often difficult to get started with configuring them properly to receive the relevant and actionable information for monitoring your cluster. This can come with the cost of not having accurate information in real time when there is an issue, or can even cause you to miss critical failures.
There are commercial options that provide extensive documentation and a lot of features out of the box to help get started rapidly, such as Datadog and New Relic.
For instance, with the Datadog agent on your cluster, you’ll get an integration with your cloud provider out of the box the moment you deploy along with a default dashboard including useful metrics about your cluster as well as relevant data about your cloud provider.
It’s absolutely possible to get your open source tooling to do this, but takes more effort and familiarity with configuring this tooling.
Tooling aside, the three main things you’ll want to start with monitoring on K8s are:
- Resources: CPU / memory usage
- Container status: Up / down / errors / probe data / restart count
- Application metrics: APMs – Per-request or action metrics / latency / stack traces / custom metadata
The first two bullets provide critical information about your cluster, the third point and likely the most important one, APMs, provide you business-critical information about your application.
Due to Kubernetes’ innate scalability, if you go from a couple to tens to hundreds of servers, running hundreds to thousands of applications, finding the root cause of an issue is going to make you pull some hairs.
At a certain scale, the manual approach simply stops working, and that’s when you’ll need the help of these monitoring tools to help you to detect, alert and understand the business logic of your applications.
6. Knowledge Sharing and Transparency
When it comes to applications running on K8s, there are certainly core concepts that developers should be aware of, and this is where your DevOps teams can help. DevOps teams can and should empower development teams to learn about the platforms and environments on which they deploy their applications. Knowing about the platform that the application lives on is critical to helping developers respond to incidents more quickly.
Understanding the nuances of containers and pod configurations, health checks, cluster orchestration, load balancing and more helps developers troubleshoot issues rapidly and effectively without escalating to the DevOps team when something goes wrong.
Takeaway: The more you empower and entrust to your developers, the more efficiently your Kubernetes systems will run end to end.
Bringing it all Together
Kubernetes is often perceived as DevOps and SRE closed quarters. However, the truth is that there are actually quite a few practices that can be implemented as early as the design phases by DevOps engineers to empower developers to be much more autonomous with troubleshooting down the line.
DevOps can provide developers a deeper understanding of the underlying infrastructure, even from a high-level perspective, so that they can be more proactive and independent with error handling, something that can prove particularly valuable with 2 a.m. pages and alerts for everyone involved.
In addition, proper mapping of your system via metadata, tagging, labels and other good practices provides the benefit of having the right information that is easy to understand in real time. The value of such insights grows hand-in-hand with the scale of your operation and becomes a must-have for troubleshooting large-scale deployments.
We hope these tips will help developers and DevOps engineers alike when designing their applications and deployments to be highly optimized for Kubernetes systems.
To learn more about Kubernetes and other cloud native technologies, consider coming to KubeCon+CloudNativeCon North America 2021 on Oct. 11-15.
Photo by Tim Gouw from Pexels.