Which Kubernetes Health Metrics Should Be Monitored

Circonus sponsored this post.

I previously wrote an article on the 12 most common health conditions you should be monitoring to ensure that Kubernetes is performing optimally. But which metrics underlying these health conditions (and more) should you be collecting and analyzing?
In a recent survey of Kubernetes operators that Circonus conducted, uncertainty around which metrics to collect was one of the top monitoring challenges operators face. This isn’t surprising, given the millions of metrics that Kubernetes can generate on a daily basis.
In this article, we’re going to share which health metrics are most critical for Kubernetes operators to collect and analyze. We’ll look at three sources of these metrics, define and name each metric by source, and identify the health conditions each one is associated with so you know what to monitor.
#1: Resource and Utilization Metrics
Resource and utilization metrics come from the built-in metrics API and are provided by the kubelets themselves. Of these, only CPU usage maps directly to a critical health condition, but monitoring memory usage and network traffic is also important.
| Metric | Name | Description |
| --- | --- | --- |
| CPU Usage | usageNanoCores | The CPU usage of a node or pod, in nanocores (billionths of a CPU core), averaged over the sampling window. |
| CPU Capacity | capacity_cpu | The number of CPU cores available on a node (not available for pods). |
| Memory Usage | used{resource:memory,units:bytes} | The amount of memory used by a node or pod, in bytes. |
| Memory Capacity | capacity_memory{units:bytes} | The total memory available for a node (not available for pods), in bytes. |
| Network Traffic | rx{resource:network,units:bytes}, tx{resource:network,units:bytes} | The total network traffic seen for a node or pod, both received (rx, incoming) and transmitted (tx, outgoing), in bytes. |
High CPU usage is a critical health condition to monitor and alert on using these metrics. It is also the easiest to understand: you should track how many CPU cycles your nodes are using, and that matters for two reasons. First, you don’t want to run out of processing resources for your application; if your application becomes CPU-bound, you need to increase your CPU allocation or add more nodes to the cluster. Second, you don’t want your CPU to sit there unused; if CPU usage is consistently low, you may have over-allocated your resources and are potentially wasting money. To determine whether CPU usage is getting too high, compare utilization{resource:cpu} (CPU usage as a fraction of capacity) to a pre-decided threshold over a particular window of time, e.g. has it stayed over the threshold for more than five minutes?
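As a rough illustration of that check, here’s a minimal Python sketch. The 80% threshold, the five-minute window, and the sample data are assumptions for the example, not values prescribed by Kubernetes or Circonus.

```python
# Minimal sketch of a sustained-CPU-utilization check.
# The threshold, window, and sample data below are illustrative assumptions.

THRESHOLD = 0.80          # alert when utilization exceeds 80%...
WINDOW_SECONDS = 5 * 60   # ...for a full five minutes

def utilization(usage_nano_cores: float, capacity_cores: int) -> float:
    """Convert usageNanoCores and capacity_cpu into a 0.0-1.0 utilization ratio."""
    return usage_nano_cores / (capacity_cores * 1_000_000_000)

def sustained_high_cpu(samples: list) -> bool:
    """samples: (unix_timestamp, usageNanoCores, capacity_cpu) tuples, oldest first.
    Returns True if every sample in the last WINDOW_SECONDS is above THRESHOLD."""
    if not samples:
        return False
    cutoff = samples[-1][0] - WINDOW_SECONDS
    recent = [s for s in samples if s[0] >= cutoff]
    return bool(recent) and all(
        utilization(usage, capacity) > THRESHOLD for _, usage, capacity in recent
    )

# Example: a 4-core node running at ~3.6 cores for the whole window.
node_samples = [(t, 3.6e9, 4) for t in range(0, 301, 30)]
print(sustained_high_cpu(node_samples))  # True -> raise an alert
```

In practice the samples would come from whatever agent polls the kubelet’s metrics endpoint, but the shape of the check stays the same.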
#2: State Metrics
kube-state-metrics is a component that provides data on the state of cluster objects (nodes, pods, DaemonSets, namespaces, and so on). It exposes these metrics through the same metrics API from which the resource and utilization metrics are served.
| Metric | Name | Description |
| --- | --- | --- |
| Node Status | kube_node_status_condition{status:true, condition:OutOfDisk\|MemoryPressure\|PIDPressure\|DiskPressure\|NetworkUnavailable} | A numeric boolean (0 or 1) for each node/condition combination, indicating whether that node is currently experiencing that condition. |
| Crash Loops | kube_pod_container_status_waiting_reason{reason:CrashLoopBackOff} | A numeric boolean (0 or 1) for each container, indicating whether it is experiencing a crash loop. |
| Job Status (Failed) | kube_job_status_failed | A numeric boolean (0 or 1) for each job, indicating whether it has failed. |
| Persistent Volume Status (Failed) | kube_persistentvolume_status_phase{phase:Failed} | A numeric boolean (0 or 1) for each persistent volume, indicating whether it has failed. |
| Pod Status (Pending) | kube_pod_status_phase{phase:Pending} | A numeric boolean (0 or 1) for each pod, indicating whether it is in a pending state. |
| Latest Deployment Generation | kube_deployment_metadata_generation | Sequence number representing the latest generation of a Deployment. |
| Observed Deployment Generation | kube_deployment_status_observed_generation | Sequence number representing the current generation of a Deployment as observed by the controller. |
| Desired DaemonSet Nodes | kube_daemonset_status_desired_number_scheduled | Number of nodes that should be running each pod in the DaemonSet. |
| Current DaemonSet Nodes | kube_daemonset_status_current_number_scheduled | Number of nodes that are running each pod in the DaemonSet. |
| Desired StatefulSet Replicas | kube_statefulset_status_replicas | Number of replicas desired per StatefulSet. |
| Ready StatefulSet Replicas | kube_statefulset_status_replicas_ready | Number of replicas that are ready per StatefulSet. |
Using these metrics, you should monitor and alert on: Crash Loops, Disk Pressure, Memory Pressure, PID Pressure, Network Unavailable, Job Failures, Persistent Volume Failures, Pod Pending Delays, Deployment Glitches, DaemonSets Not Ready, and StatefulSets Not Ready. Each of these health conditions is defined in the earlier article on the 12 most common health conditions mentioned above.
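To make a couple of these checks concrete, here’s a minimal Python sketch that scrapes a kube-state-metrics endpoint and flags crash loops and node pressure conditions. The service URL is an assumption (point it at wherever kube-state-metrics is reachable in your cluster), and the sketch relies on the requests and prometheus_client libraries.

```python
# Sketch: scrape a kube-state-metrics endpoint and flag crash-looping containers
# and node pressure conditions. The URL below is an assumed address.
import requests
from prometheus_client.parser import text_string_to_metric_families

KSM_URL = "http://kube-state-metrics.kube-system.svc:8080/metrics"  # assumed address

PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure", "NetworkUnavailable"}

def scrape(url: str = KSM_URL):
    """Yield every sample (name, labels, value) from the exposition-format payload."""
    text = requests.get(url, timeout=10).text
    for family in text_string_to_metric_families(text):
        yield from family.samples

def find_problems(url: str = KSM_URL) -> list:
    problems = []
    for sample in scrape(url):
        # A waiting reason of CrashLoopBackOff with value 1 means a crash loop.
        if (sample.name == "kube_pod_container_status_waiting_reason"
                and sample.labels.get("reason") == "CrashLoopBackOff"
                and sample.value == 1):
            problems.append(f"crash loop: {sample.labels['namespace']}/{sample.labels['pod']}")
        # A node condition with status "true" and value 1 means the node has that condition.
        elif (sample.name == "kube_node_status_condition"
                and sample.labels.get("status") == "true"
                and sample.labels.get("condition") in PRESSURE_CONDITIONS
                and sample.value == 1):
            problems.append(f"{sample.labels['condition']} on node {sample.labels['node']}")
    return problems

if __name__ == "__main__":
    for problem in find_problems():
        print(problem)
```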
#3: Control Plane Metrics
The Kubernetes Control Plane comprises the parts of Kubernetes that are considered “system components,” which help with cluster management. In a managed environment such as those Google or Amazon provide, the Control Plane is run by the cloud provider and you typically don’t have to worry about monitoring these metrics. However, if you manage your own cluster, you’ll want to know how to monitor your Control Plane. When they’re available, most of these metrics can be found via the metrics API.
| Metric | Name | Description |
| --- | --- | --- |
| etcd Leader | etcd_server_has_leader | A numeric boolean (0 or 1) for each etcd cluster member, indicating whether that member knows who its leader is. |
| etcd Leader Changes | etcd_server_leader_changes_seen_total | The total number of leader changes that have happened in the etcd cluster. |
| API Latency Count | apiserver_request_latencies_count | The total number of API requests; used to calculate average latency per request. |
| API Latency Sum | apiserver_request_latencies_sum | The total of all API request durations; used to calculate average latency per request. |
| Queue Waiting Time | workqueue_queue_duration_seconds | The total time that action items have spent waiting in each of the controller manager’s work queues. |
| Queue Work Time | workqueue_work_duration_seconds | The total time taken to process action items from each of the controller manager’s work queues. |
| Unsuccessful Pod Scheduling Attempts | scheduler_schedule_attempts_total{result:unschedulable} | The total number of attempts made by the scheduler to schedule pods on nodes that ended up being unsuccessful. |
| Pod Scheduling Latency | scheduler_e2e_scheduling_latency_microseconds (< v1.14) or scheduler_e2e_scheduling_duration_seconds | The total length of time taken to schedule pods onto nodes. |
Control Plane Health Conditions
You should monitor the following Control Plane health conditions:
etcd Leaders
The etcd cluster should always have a leader (except during the process of changing leaders, which should be infrequent). You should keep an eye on all of your etcd_server_has_leader metrics, because if too many cluster members don’t recognize their leader, your cluster performance will be degraded. Also, if you’re seeing a high number of leader changes reflected in etcd_server_leader_changes_seen_total, it could indicate issues with connectivity or resourcing in the etcd cluster.
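As a small illustration, the sketch below evaluates these two etcd signals; the member values and the idea of “churn since the last scrape” are assumptions for the example.

```python
# Sketch: evaluate etcd leader health from the two metrics above.
# The member samples and the change counts are illustrative assumptions.

def members_without_leader(has_leader_by_member: dict) -> list:
    """has_leader_by_member maps member name -> etcd_server_has_leader value (0 or 1)."""
    return [member for member, value in has_leader_by_member.items() if value != 1]

def leader_churn(prev_changes_total: float, current_changes_total: float) -> float:
    """Growth of etcd_server_leader_changes_seen_total between two scrapes."""
    return current_changes_total - prev_changes_total

print(members_without_leader({"etcd-0": 1, "etcd-1": 1, "etcd-2": 0}))  # ['etcd-2']
print(leader_churn(3, 7))  # 4 leader changes since the last scrape -> investigate
```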
API Request Latency
If you divide apiserver_request_latencies_sum by apiserver_request_latencies_count, you’ll get your API server’s average latency per request. Tracking the average request latency over time can let you know when your server is getting overwhelmed.
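For example, here’s a minimal Python sketch of that calculation, taken as a delta between two scrapes so the average reflects recent traffic rather than every request since the server started; the sample numbers are made up.

```python
# Sketch: average API request latency over a scrape interval, derived from the two
# cumulative counters described above. The snapshot values are assumed sample data.

def average_latency(count_prev: float, sum_prev: float,
                    count_now: float, sum_now: float) -> float:
    """Average latency per request between two scrapes of the cumulative
    apiserver_request_latencies_sum / _count counters."""
    requests_seen = count_now - count_prev
    if requests_seen <= 0:
        return 0.0
    return (sum_now - sum_prev) / requests_seen

# e.g. 1,500 new requests that together added 90 units of latency
print(average_latency(10_000, 400.0, 11_500, 490.0))  # 0.06 per request, in the metric's native units
```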
Work Queue Latency
The work queues are action queues managed by the controller manager, and they handle all automated processes in the cluster. Watching for increases in either workqueue_queue_duration_seconds or workqueue_work_duration_seconds will let you know when queue latency is increasing. If this happens, you may want to dig into the controller manager logs to see what’s going on.
Scheduler Problems
There are two aspects of the scheduler that are worth watching. First, you should monitor scheduler_schedule_attempts_total{result:unschedulable}, because an increase in unschedulable pods may mean you have a resourcing issue with your cluster. Second, you should keep an eye on the scheduler latency, using one of the latency metrics indicated above (the metric name and units changed with v1.14). An increase in pod scheduling latency may cause other problems, and may also indicate resourcing issues in your cluster.
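The sketch below shows both ideas in miniature: tracking growth in unschedulable attempts between scrapes, and picking (and normalizing) whichever scheduling-latency metric your cluster version exposes. The sample values are assumptions.

```python
# Sketch: two scheduler checks. All sample values are illustrative assumptions.

OLD_LATENCY_METRIC = "scheduler_e2e_scheduling_latency_microseconds"   # < v1.14
NEW_LATENCY_METRIC = "scheduler_e2e_scheduling_duration_seconds"       # >= v1.14

def unschedulable_growth(prev_total: float, current_total: float) -> float:
    """Growth of scheduler_schedule_attempts_total{result="unschedulable"}
    between two scrapes. Sustained growth is worth investigating."""
    return current_total - prev_total

def pick_latency_metric(exposed_metric_names: set) -> str:
    """Return whichever scheduling-latency metric the scheduler actually exposes."""
    return NEW_LATENCY_METRIC if NEW_LATENCY_METRIC in exposed_metric_names else OLD_LATENCY_METRIC

def latency_in_seconds(metric_name: str, value: float) -> float:
    """Normalize the two metrics to seconds (the older one reports microseconds)."""
    return value / 1_000_000 if metric_name == OLD_LATENCY_METRIC else value

print(unschedulable_growth(42, 57))                     # 15 new unschedulable attempts
print(latency_in_seconds(OLD_LATENCY_METRIC, 250_000))  # 0.25 seconds
```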
Events
In addition to collecting numeric metrics from your Kubernetes cluster, collecting and tracking events from your cluster can also be useful. Cluster events let you monitor the pod lifecycle and watch for significant pod failures, and watching the rate of events flowing from your cluster can be an excellent early warning. If the rate of events changes suddenly or significantly, it may be a sign that something is going wrong.
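As an illustration, here’s a minimal sketch that streams cluster events with the official kubernetes Python client, counts the overall event rate, and prints Warning events. It assumes a kubeconfig (or in-cluster configuration) is available to the process.

```python
# Sketch: stream cluster events and surface Warning events, using the official
# `kubernetes` Python client. Assumes the process can load a kubeconfig.
from kubernetes import client, config, watch

def watch_warning_events(timeout_seconds: int = 300):
    config.load_kube_config()          # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    w = watch.Watch()
    count = 0
    for item in w.stream(v1.list_event_for_all_namespaces, timeout_seconds=timeout_seconds):
        event = item["object"]
        count += 1                      # the overall event rate is itself a useful signal
        if event.type == "Warning":     # e.g. Failed, BackOff, FailedScheduling
            obj = event.involved_object
            print(f"{event.reason}: {obj.kind} {obj.namespace}/{obj.name} - {event.message}")
    print(f"saw {count} events in {timeout_seconds}s")

if __name__ == "__main__":
    watch_warning_events()
```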
Application Metrics
Unlike the rest of the metrics and events we’ve examined above, application metrics aren’t emitted from Kubernetes itself, but rather from your workloads that are run by the cluster. This telemetry can be anything that you consider important from the point of view of your application: error responses, request latency, processing time, etc.
There are two philosophies of how to collect application metrics. The first (which was widely preferred until recently) is that metrics should be “pushed” out from the application to a collection endpoint. This means a client library, such as a StatsD client, has to be bundled with each application to provide a mechanism for pushing metric data out of that application. This technique requires more management overhead to ensure that every application running in your cluster is instrumented properly, so it has begun falling out of favor with cluster managers.
The second metric collection philosophy (which is becoming more widely adopted) is that metrics should be “pulled” from applications by a collection agent. This makes applications easier to write: all they have to do is expose their metrics appropriately, and they don’t have to worry about how those metrics are pulled or scraped. This is how OpenMetrics works, and it is the way Kubernetes cluster metrics are collected. When this technique is combined with service discovery by your collection agent, it creates a powerful method for collecting any kind of metrics you need from your cluster applications.
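As a small example of the pull model, the sketch below uses the prometheus_client Python library to expose request counts, error counts, and latency in the Prometheus/OpenMetrics exposition format. The metric names, port, and simulated workload are assumptions for illustration.

```python
# Sketch of the pull model: the application only exposes its metrics and lets a
# collection agent scrape them. Metric names and port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("myapp_requests_total", "Total requests handled")
ERRORS = Counter("myapp_error_responses_total", "Total error responses")
LATENCY = Histogram("myapp_request_latency_seconds", "Request processing time")

def handle_request():
    REQUESTS.inc()
    with LATENCY.time():                       # records processing time per request
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        if random.random() < 0.05:
            ERRORS.inc()                       # stand-in for an error response

if __name__ == "__main__":
    start_http_server(8000)   # metrics now served at http://localhost:8000/metrics
    while True:
        handle_request()
```

A collection agent that discovers this pod can then scrape /metrics on that port, and the application never needs to know who is collecting the data or how often.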
Final Thoughts
Kubernetes can generate millions upon millions of new metrics daily. This presents two big challenges. First, many conventional monitoring systems simply can’t keep up with the sheer volume of unique metrics needed to properly monitor Kubernetes clusters. Second, all this data “noise” makes it hard to identify which metrics are most important.
Your Kubernetes monitoring solution must be able to handle all of this data, as well as automatically analyze, graph, and alert on the most critical metrics. That way, you know you’ve collected everything you need, filtered out the unnecessary data, and homed in on the most relevant data. As a result, you save substantial time and can rest assured that everything is working as it should.