5 Tools for Monitoring Kubernetes at Scale in Production
Monitoring a distributed systems environment is fundamentally different from monitoring a client/server network, for a reason that seems obvious in retrospect. The system you are monitoring — whose performance, resilience, and security matter to your organization — is bigger than any one processor that runs any individual part of it. Monitoring a single server or watching a network address therefore conveys far less relevant information for distributed systems and microservices than it once did.
This creates a challenge: you need a monitoring strategy before you choose the tool that will help your organization execute it. Kubernetes by design is not self-monitoring, and a bare installation typically ships with only a subset of the monitoring tooling you will need. A highly tuned, properly architected Kubernetes environment will be self-healing and will protect against most downtime. But that same system treats monitoring as critical: it identifies issues before they cause outages and ensures that self-healing remains a last resort.
Monitoring Kubernetes requires solving many of the same challenges that need to be solved with any highly scalable elastic application, though the tooling or approaches may be different. All of the Kubernetes components — container, pod, node and cluster — must be covered in the monitoring operation. Equally important, processes must be in place to assimilate the results of monitoring and take appropriate corrective measures in response. This last item is often overlooked by DevOps teams.
Kubernetes Monitoring Tools
Choosing a monitoring toolset is certainly important, but not for the reason you might be thinking. Every monitoring toolset has its pros and cons and, shall we say, unique qualities. You may find yourself choosing a combination of toolsets, just as our team at Kenzan has done internally, especially when you need to monitor several facets simultaneously. The New Stack’s Kubernetes User Experience Survey, conducted earlier this year, found that many Prometheus and Heapster users also pair them with other tools for cluster monitoring (see the chart below).
The fact is, the most important thing about a toolset is that you stick to the set you’ve chosen, and use it consistently for your Kubernetes clusters.
We had to find that fact out for ourselves. Although there are many viable options, which you can see in the chart below, here are the monitoring toolsets that are frequently used by our team at Kenzan, and which we recommend for your organization:
- Heapster: Installed as a pod inside of Kubernetes, it gathers data and events from the containers and pods within the cluster.
- Prometheus: Open source Cloud Native Computing Foundation (CNCF) project that offers powerful querying capabilities, visualization and alerting.
- Grafana: Used in conjunction with Heapster for visualizing data within your Kubernetes environment.
- InfluxDB: A time-series database that stores the data captured by all the Heapster pods.
- cAdvisor: Focuses on container-level performance and resource usage. It comes embedded directly in the kubelet and automatically discovers active containers.
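To make the role of Heapster concrete, here is a small Python sketch that averages CPU usage from a Heapster-style metrics payload. The sample payload, field names, and millicore units follow the general shape of Heapster's model API, but treat them as illustrative assumptions rather than a guaranteed contract:

```python
import json

# A sample response shaped like Heapster's model API
# (e.g., .../metrics/cpu-usage); field names and values here
# are assumptions for illustration only.
sample = json.loads("""
{
  "metrics": [
    {"timestamp": "2017-09-01T12:00:00Z", "value": 120},
    {"timestamp": "2017-09-01T12:01:00Z", "value": 180},
    {"timestamp": "2017-09-01T12:02:00Z", "value": 150}
  ],
  "latestTimestamp": "2017-09-01T12:02:00Z"
}
""")

def average_cpu_millicores(payload):
    """Average the 'value' field (CPU in millicores) over all samples."""
    points = payload.get("metrics", [])
    if not points:
        return 0.0
    return sum(p["value"] for p in points) / len(points)

print(average_cpu_millicores(sample))  # 150.0
```

In practice you would not roll aggregation by hand like this — Grafana and Prometheus do it for you — but it shows the kind of per-pod time-series data these tools are passing around.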
Production Monitoring ‘Process’
While this article is focused on tooling in a Kubernetes production environment, I would be remiss not to discuss the process of healthy monitoring in a production environment. At Kenzan, we too often see organizations push monitoring and all monitoring-specific requirements much too late in the development process. We advocate a “shift left” style of monitoring as part of a healthy DevOps culture. Two main rules of thumb that have served us well here are:
- Monitoring as part of development — Features being developed should include their monitoring requirements as part of the development cycle. This work should be included in all estimates and handled no differently than other development activities.
- Monitoring of non-functionals — Where possible, all non-functional requirements should be monitored, with alerts. Good examples are response times and requests per second. We have found this very useful in catching small issues before they turn into large ones.
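As a sketch of what alerting on a non-functional might look like, here is a Prometheus alerting rule (2.x YAML rule-file format) for response time. The metric name, job, threshold, and durations are assumptions you would replace with your own:

```yaml
groups:
  - name: non-functionals
    rules:
      - alert: HighResponseTime
        # 95th-percentile request latency over the last 5 minutes.
        # 'http_request_duration_seconds' and the 0.5s threshold are
        # illustrative placeholders, not a standard.
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th-percentile response time above 500ms"
```

The `for: 10m` clause keeps a transient spike from paging anyone; the alert fires only when the condition holds continuously.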
Monitoring tools need to be just as durable as, if not more durable than, your application as a whole. Nothing is more frustrating than an outage that causes your monitoring tools to go dark, leaving you without insight at the time you need it most. While best practices for monitoring at this level tend to be very specific to the application, you should look at the failure points within your infrastructure and ensure that any outages that could happen would not cause monitoring blind spots.
Most third-party and add-on applications that monitor Kubernetes (e.g., cAdvisor, Heapster) will be deployable inside your environment. Still, make sure either that logging of those applications happens outside of the cluster, or that they are set up with failover capability themselves. It’s remarkable how frequently this simple but critical concept is overlooked.
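One way to avoid losing monitoring data when the cluster itself is in trouble is to ship it off-cluster. As a sketch, Prometheus can forward samples to external long-term storage via its `remote_write` configuration; the endpoint URL below is a placeholder:

```yaml
# Fragment of prometheus.yml: forward all scraped samples to a
# storage endpoint that lives outside the Kubernetes cluster.
remote_write:
  - url: "https://metrics.example.com/api/v1/write"
```

With a setup like this, even if the cluster-local Prometheus pod is lost, the historical data needed for a post-mortem survives outside the blast radius.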
As I mentioned earlier, you will want to monitor all of the Kubernetes components, including the container, pod, node and cluster operations. You also want to ensure that monitoring is a part of all development cycles if you want to maximize its potential. You’ll find a more detailed explanation of monitoring tools and methods for each of these components in The State of the Kubernetes Ecosystem ebook.