Ops Checklist for Monitoring Kubernetes at Scale
By design, the Kubernetes open source container orchestration engine is not self-monitoring, and a bare installation will typically only have a subset of the monitoring tooling that you will need. In a previous post, we covered the five tools for monitoring Kubernetes in production, at scale, as per recommendations from Kenzan.
However, the toolsets your organization chooses to monitor Kubernetes are only half of the equation. You must also know what to monitor, put processes in place to assimilate the results of monitoring, and take appropriate corrective measures in response. This last item is often overlooked by DevOps teams.
All of the Kubernetes components — container, pod, node and cluster — must be covered in the monitoring operation. Let’s go through monitoring requirements for each one.
Containers are the lowest-level entity within a Kubernetes ecosystem. Monitoring at this level is necessary not only for the health of the containerized application, but also to ensure scaling is happening properly. Most metrics provided by Docker can be used for monitoring Docker containers, and you can also leverage many traditional monitoring tools (e.g., Datadog, Heapster). Kenzan tends to focus on the lowest-level data to help determine the health of each individual container; for instance:
- CPU utilization: Rendered as an average per minute, hour, or day.
- Memory utilization: Rendered as an average of usage/limit per minute, hour, or day.
- Network I/O: Reveals any major latencies in the network, since the effects of traffic spikes are often amplified by network latencies. Monitoring I/O may also expose opportunities for better caching or circuit-breaking within the application.
These three categories will reveal whether containers are getting stressed, where the latency is at the container level, and whether scaling is happening when needed.
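As a rough illustration of the per-minute averages and the usage/limit ratio described above, here is a minimal Python sketch. The function names, the sample format (timestamp and value pairs), and the sample readings are all assumptions made for this example; in practice the raw numbers would come from whichever monitoring tool you use.

```python
from collections import defaultdict
from statistics import mean


def average_per_window(samples, window_seconds=60):
    """Average (timestamp, value) samples into fixed-size time windows.

    Returns a dict mapping each window's start time to the mean value
    of the samples that fell inside it.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        window_start = int(timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


def memory_utilization(usage_bytes, limit_bytes):
    """Return memory usage as a fraction of the container's limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return usage_bytes / limit_bytes


# Hypothetical readings: (seconds since start, CPU fraction in use).
cpu_samples = [(0, 0.25), (30, 0.5), (60, 0.75), (90, 1.0)]
cpu_per_minute = average_per_window(cpu_samples)
print(cpu_per_minute)  # {0: 0.375, 60: 0.875}
```

Passing `window_seconds=3600` or `window_seconds=86400` gives the hourly and daily views, and the same averaging can be applied to a series of `memory_utilization` values.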
Pods are Kubernetes’ abstraction layer around the container (or multiple containers). This is the layer where Kubernetes will scale and heal itself. Pods come in and out of existence at regular intervals. The type of monitoring that you may find most useful at the pod level involves the life cycle events for those pods. The data you can harvest from such monitoring may be very useful in understanding whether you are scaling properly, whether spikes are being handled, and whether pods are failing over but correcting themselves.
Lots of data may be captured from Kubernetes; the Kubernetes docs have the full list. Here’s what Kenzan’s experience tells its teams to focus on:
- Scale events take place when pods are created. These events give a higher-level view of how applications are handling scale.
- Pod terminations are useful for knowing which pods were killed and why. Sometimes terminations are on account of load fluctuations, and other times Kubernetes may kill the pod if a simple health check fails. Distinguishing between the two is critical.
- Container restarts are important for monitoring the health of the container within the pod.
- Lengthy boot times are a common signal of an unhealthy application. Containers should start up and shut down very quickly.
If you keep your monitoring somewhat simple and uncluttered, then it’s much easier to manage. When you need to take a deeper, but ad hoc, dive into the metrics, you can rely on custom dashboards from a service like Grafana.
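To make the pod-level signals above concrete, the following Python sketch tallies terminations by cause, counts container restarts per pod, and flags slow boots. The event records, their field names, and the reason labels are hypothetical and chosen only for illustration; real Kubernetes events use their own schema and reason strings, which you would need to map onto something like this.

```python
from collections import Counter

# Hypothetical reason labels for this sketch; map your own event reasons onto them.
LOAD_REASONS = {"scale_down"}
HEALTH_REASONS = {"health_check_failed"}


def classify_termination(reason):
    """Separate load-driven terminations from health-check failures."""
    if reason in LOAD_REASONS:
        return "load"
    if reason in HEALTH_REASONS:
        return "health"
    return "other"


def summarize_pod_events(events, slow_boot_seconds=30.0):
    """Summarize terminations, restarts and slow boots from event records."""
    terminations = Counter()
    restarts = Counter()
    slow_boots = []
    for event in events:
        kind = event["kind"]
        if kind == "terminated":
            terminations[classify_termination(event["reason"])] += 1
        elif kind == "container_restart":
            restarts[event["pod"]] += 1
        elif kind == "started":
            boot_seconds = event["ready_at"] - event["created_at"]
            if boot_seconds > slow_boot_seconds:
                slow_boots.append((event["pod"], boot_seconds))
    return {
        "terminations": dict(terminations),
        "restarts": dict(restarts),
        "slow_boots": slow_boots,
    }


events = [
    {"kind": "terminated", "pod": "web-1", "reason": "scale_down"},
    {"kind": "terminated", "pod": "web-2", "reason": "health_check_failed"},
    {"kind": "container_restart", "pod": "web-3"},
    {"kind": "container_restart", "pod": "web-3"},
    {"kind": "started", "pod": "web-4", "created_at": 100.0, "ready_at": 145.0},
    {"kind": "started", "pod": "web-5", "created_at": 100.0, "ready_at": 104.0},
]
summary = summarize_pod_events(events)
```

The 30-second boot threshold is an arbitrary default for the example, not a recommendation; a sensible value depends on the application.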
The processes for monitoring the different layers of a production system are similar to one another, or are progressively becoming so, particularly in the data they look for. But the rationales for monitoring these different components will vary. At the cluster level, we tend to look at the application much more holistically. We are still monitoring CPU, memory utilization, and network I/O, but with respect to how the entire application is behaving.
Often we see a cluster fail when an application scales beyond its provisioned memory and CPU allotments. An elastic application presents a unique challenge: It will continue to scale until it can’t scale anymore. So the only real signal you get is an out-and-out failure, often at the cluster level. For this reason, it’s very important to keep a close watch on each cluster, looking out for the signals of cluster failure before it happens.
We do a kind of time-series analysis in which we monitor four key variables:
- Overall cluster CPU usage
- CPU usage per node
- Overall cluster memory
- Memory usage per node
CPU usage per node can reveal one node severely underperforming the others, although this is relatively rare, while memory usage per node can uncover routing problems or load-sharing issues. Using time-series analysis, you should be able to plot these variables on a chart and watch how they trend.
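The four variables above can be derived from a single per-node snapshot. The sketch below is one possible way to do that in Python: it computes overall cluster utilization and flags nodes whose utilization strays far from the median node. The snapshot layout, the node names, and the 0.25 tolerance are assumptions for the example.

```python
from statistics import median


def cluster_utilization(nodes, resource):
    """Total usage divided by total capacity across all nodes."""
    used = sum(node[f"{resource}_used"] for node in nodes.values())
    capacity = sum(node[f"{resource}_capacity"] for node in nodes.values())
    return used / capacity


def outlier_nodes(nodes, resource, tolerance=0.25):
    """Names of nodes whose utilization differs from the median by more than `tolerance`."""
    ratios = {
        name: node[f"{resource}_used"] / node[f"{resource}_capacity"]
        for name, node in nodes.items()
    }
    typical = median(ratios.values())
    return sorted(name for name, ratio in ratios.items() if abs(ratio - typical) > tolerance)


# Hypothetical snapshot: CPU in cores, memory in GiB.
snapshot = {
    "node-a": {"cpu_used": 2.0, "cpu_capacity": 4.0, "memory_used": 8.0, "memory_capacity": 16.0},
    "node-b": {"cpu_used": 2.0, "cpu_capacity": 4.0, "memory_used": 8.0, "memory_capacity": 16.0},
    "node-c": {"cpu_used": 3.5, "cpu_capacity": 4.0, "memory_used": 8.0, "memory_capacity": 16.0},
}

print(cluster_utilization(snapshot, "cpu"))  # 0.625
print(outlier_nodes(snapshot, "cpu"))        # ['node-c']
print(outlier_nodes(snapshot, "memory"))     # []
```

Recording these values at a fixed interval gives the time series to plot; a cluster-level figure that climbs steadily toward 1.0 is the kind of early signal worth alerting on before the cluster fails outright.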
Monitoring the Network
As more and more applications shift toward elastic, microservice-based architectures, it’s easy to overlook the extent to which they depend upon the network to be healthy and functioning. A highly elastic microservice application on an underperforming network will never run smoothly. No amount of defensive development or auto-healing will make it run properly.
This is why we take network monitoring very seriously. Fortunately, tools such as Heapster can capture metrics on the network and its performance. While we typically find these metrics to be useful for spotting the bottlenecks, they don’t go far enough in diagnosing the root cause. That requires further digging with network-specific tools.
We typically like to monitor a few items, and find it useful to separate transmitting from receiving:
- Bytes received over network shows the inbound traffic over a designated time frame. We generally look for spikes in this series.
- Bytes transmitted over network, when compared with bytes received, reveals the difference between transmitted and received traffic, which can be very useful.
- Network received errors reveal the number of dropped packets or errors the network is getting over a specified duration.
- Network transmitted errors tell us the number of errors happening in transmission.
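Network metrics like these are often exposed as ever-increasing counters, so a first step is turning them into per-interval figures and then looking for spikes. The Python sketch below shows one simple way to do both. It assumes cumulative counters sampled at a fixed interval, ignores counter resets, and uses an arbitrary spike factor of 3; the sample readings are invented.

```python
def deltas(counter_samples):
    """Turn cumulative counter readings into per-interval increases."""
    return [later - earlier for earlier, later in zip(counter_samples, counter_samples[1:])]


def spikes(series, factor=3.0):
    """Indexes whose value exceeds `factor` times the mean of all earlier values."""
    found = []
    for i in range(1, len(series)):
        baseline = sum(series[:i]) / i
        if baseline > 0 and series[i] > factor * baseline:
            found.append(i)
    return found


# Hypothetical cumulative readings taken at a fixed interval.
rx_bytes_total = [0, 100, 200, 300, 1300, 1400]
rx_errors_total = [0, 0, 2, 2, 5, 5]

rx_bytes = deltas(rx_bytes_total)    # [100, 100, 100, 1000, 100]
rx_errors = deltas(rx_errors_total)  # [0, 2, 0, 3, 0]
rx_spikes = spikes(rx_bytes)         # [3]
```

Running the same two functions over the transmitted counters keeps the two directions separate, so a spike in one can be compared against the other.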
Your insight into the underpinnings of your Kubernetes environment will only be as good as your metrics. We suggest you review findings regularly and develop an actionable plan for resolving the issues you uncover. The action plan will need to be targeted to the application, the Kubernetes environment, or the platform it is running on. Remarkably, teams tend to forget the importance of this feedback loop.
One of the biggest challenges with microservices can be monitoring all of the communication flows between the services. Knowing what is happening inside of containers, pods, and clusters is useful, but understanding what is happening between application components is critical. This is where distributed tracing systems can be very useful. Tools like Zipkin or Jaeger are commonly used for tracing individual threads and can be used to push data into monitoring or dashboard tooling.
While an entire article could be written about distributed tracing systems, they have been very useful tools for us in finding bottlenecks between services, monitoring SLAs between calls, and following the data flow for individual calls as they traverse the system. Distributed tracing does come at a performance cost for the threads that are using it. We typically recommend a sampling approach to tracing (e.g., 10 percent) so that we can minimize the impact on end users.
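One common way to implement a sampling rate like this is to decide from a hash of the trace ID, so that every service makes the same keep-or-drop decision for a given call. The Python sketch below illustrates that idea only; it is not the Zipkin or Jaeger API, both of which ship their own samplers, and the function name and trace ID format are made up for the example.

```python
import hashlib


def should_sample(trace_id, rate=0.10):
    """Decide deterministically whether to trace the call with this ID.

    The same trace_id always gives the same answer, and roughly `rate`
    of all IDs are selected.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# Hypothetical trace IDs; roughly one in ten should be selected.
sampled = sum(should_sample(f"trace-{i}") for i in range(10_000))
```

Hashing rather than calling a random number generator means a trace is either captured end to end or not at all, which avoids partial traces when several services sample independently.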
The Cloud Native Computing Foundation, which manages the Kubernetes project, is a sponsor of The New Stack.