How to Fix Kubernetes Monitoring
It’s astonishing how much data is emitted by Kubernetes out of the box. A simple three-node Kubernetes cluster with Prometheus will ship around 40,000 active series by default! Do we really need all that data?
It’s time to talk about the unspoken challenges of monitoring Kubernetes. The difficulties include not just the bloat and usability of metric data, but also the high churn rate of pod metrics, configuration complexity when running multiple deployments, and more.
This post is inspired by my recent episode of OpenObservability Talks, in which I spoke with Aliaksandr Valialkin, CTO of VictoriaMetrics, the company behind the open source time series database and monitoring solution of the same name.
Let’s unpack Kubernetes monitoring.
A Bloat of Out-of-the-Box Default Metrics
One of the reasons that Prometheus has become so popular is the ease of getting started collecting metrics. Most of the tools and projects expose metrics in OpenMetrics format, so you just need to turn that on, and then install the Prometheus server to start scraping those metrics.
Prometheus Operator, the standard installation path, installs additional components for monitoring Kubernetes, such as kube-state-metrics, node-exporter and cAdvisor. Using the default Prometheus Operator to monitor even a small three-node Kubernetes cluster results in around 40,000 active time series! That’s the starting point, before even adding any application or custom metrics.
And this number keeps growing at a fast pace. Valialkin shared that since 2018, the number of metrics exposed by Kubernetes has increased 3.5 times. This means users are flooded with monitoring data from Kubernetes. Are all these metrics really needed?
Not at all! In fact, the vast majority of these metrics aren’t used anywhere. Valialkin said that 75% of these metrics are never put to use in any dashboards or alert rules. I see quite a similar trend among Logz.io users.
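One rough way to check this on your own cluster is to compare the metric names you scrape against the names that your dashboards and alert rules actually reference. The Python sketch below illustrates the idea. The metric list and queries are made-up sample inputs, and the token matching is deliberately naive: it treats any identifier-like token in a query as a possible metric name, so it errs on the side of marking a metric as used.

```python
import re

# Hypothetical inputs: metric names the cluster exposes, and the PromQL
# expressions collected from dashboards and alert rules.
scraped_metrics = {
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
    "kube_pod_status_phase",
    "node_network_receive_bytes_total",
    "node_cpu_guest_seconds_total",
}

queries = [
    'sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod)',
    'kube_pod_status_phase{phase="Pending"} > 0',
]

# Identifier-like tokens: letters, digits, underscores and colons.
IDENTIFIER = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def referenced_names(expressions):
    """Collect every identifier-like token that appears in the expressions."""
    names = set()
    for expr in expressions:
        names.update(IDENTIFIER.findall(expr))
    return names


def unused_metrics(scraped, expressions):
    """Return the scraped metric names that no expression mentions."""
    return scraped - referenced_names(expressions)


unused = unused_metrics(scraped_metrics, queries)
print(sorted(unused))
print(f"{len(unused) / len(scraped_metrics):.0%} of scraped metrics are unused")
```

With these sample inputs, three of the five metrics are never referenced. On a real cluster the inputs would come from your metrics backend and your dashboard and rule definitions, and a proper PromQL parser would be more reliable than a regular expression.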
The Metrics We Really Need
Metrics need to be actionable. If you don’t act on them, then don’t collect them. This is even more evident with managed Kubernetes solutions, in which end users don’t manage the underlying system anyway, so many of the exposed metrics are simply not actionable for them.
This drove us to compose a curated set of essential Kubernetes metrics that we recommend collecting, whether from self-hosted Kubernetes or from managed Kubernetes services such as EKS, AKS and GKE. We share our curated sets publicly as part of our Helm charts on GitHub (based on OpenTelemetry, kube-state-metrics and prometheus-node-exporter charts). VictoriaMetrics and other vendors have similarly created their curated lists.
However, we cannot rely on individual vendors to create such sets. And most end users aren’t acquainted enough with the various metrics to determine for themselves what they need, so they look for the defaults, preferring the safest bet of collecting everything so as not to lack important data later.
Rather, we should come together as the Kubernetes and cloud native community, vendors and end users alike, and join forces to define a standard set of golden metrics for each component. Valialkin also believes that “third-party monitoring solutions should not install additional components for monitoring Kubernetes itself,” referring to additional components such as kube-state-metrics, node-exporter and cAdvisor. He suggests that “all these metrics from such companions should be included in Kubernetes itself.”
I’d also add that we should look into removing unused labels. Do we really need from prometheus-node-exporter the details on each network card or CPU core? Each label adds a dimension to the metric, and multiplies the number of time series by the number of distinct values that label takes.
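To make the multiplication concrete, here is a small Python sketch that computes an upper bound on the series count for a single per-CPU metric. The core count and the list of modes are illustrative assumptions, not figures from a real cluster.

```python
from math import prod


def series_count(label_values):
    """Upper bound on the series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a per-CPU metric on a single node.
cpu_labels = {
    "cpu": [str(i) for i in range(16)],  # assume 16 cores
    "mode": ["user", "system", "idle", "iowait", "irq", "softirq", "steal", "nice"],
}

nodes = 3
per_node = series_count(cpu_labels)                    # 16 * 8 = 128
per_cluster = per_node * nodes                         # 384
without_cpu_label = series_count({"mode": cpu_labels["mode"]}) * nodes  # 24

print(per_node, per_cluster, without_cpu_label)
```

Under these assumptions, dropping the per-core label takes this one metric from 384 series to 24 on a three-node cluster, at the cost of losing per-core detail.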
Kubernetes has made it easy to package, deploy and manage complex microservices architectures at scale with containers. The growth in the number of microservices results in an increased load on the monitoring system: Every microservice exposes system metrics, such as CPU, memory, and network utilization. On top of that, every microservice exposes its own set of application metrics, depending on the business logic it implements. In addition, the networking between the microservices needs to be monitored as well for latency, RPS and similar metrics. The proliferation of microservices generates a significant amount of telemetry data, which can get quite costly.
High Churn Rate of Pods
People move to Kubernetes to be more agile and release more frequently. This results in frequent deployments of new versions of microservices. With every deployment in Kubernetes, new instances of pods are created and deleted, in what is known as “pod churn.” The new pod gets a unique identifier, different from previous instances, even if it is essentially a new version of the same service instance.
I’d like to pause here and clarify an essential point about metrics. Metrics data is time series data. A time series is uniquely defined by the metric name and a set of label name-value pairs. If one of the label values changes, then a new time series is created.
Back to our ephemeral pods: many practitioners use the pod name as a label within their metrics time series data. This means that with every new deployment and the associated pod churn, the old time series stops receiving new samples and is effectively terminated, while a new time series is initiated, which causes discontinuity in the logical metric data sequence.
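The effect is easy to see in a toy model. The Python sketch below stores samples in a dictionary keyed by metric name plus the full label set, which is a simplification of how a time series database identifies a series. The service and pod names are made up for illustration.

```python
def series_key(name, labels):
    """Identity of a time series: the metric name plus its full label set."""
    return (name, frozenset(labels.items()))


series = {}


def record(name, labels, timestamp, value):
    """Append a sample to the series identified by name and labels."""
    series.setdefault(series_key(name, labels), []).append((timestamp, value))


# Before the rollout: one pod serves the (hypothetical) checkout service.
old_pod = {"service": "checkout", "pod": "checkout-7d9f-abcde"}
record("http_requests_total", old_pod, 1, 100)
record("http_requests_total", old_pod, 2, 130)

# After the rollout: same logical service, but a new pod name.
new_pod = {"service": "checkout", "pod": "checkout-5c8b-fghij"}
record("http_requests_total", new_pod, 3, 12)

print(len(series))  # 2
for key, samples in series.items():
    print(dict(key[1])["pod"], samples)
```

Although nothing changed from the service's point of view, the store now holds two separate series: one that ended at timestamp 2 and one that started at timestamp 3.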
Kubernetes workloads typically have high pod churn rates due to frequent deployments of new versions of a microservice, as well as autoscaling of pods based on incoming traffic, or resource constraints on the underlying nodes that require eviction and rescheduling of pods. The discontinuity of metric time series makes it difficult to apply continuous monitoring on the logical services and analyze trends over time on their respective metrics.
A potential solution can be to use the ReplicaSet or StatefulSet ID for the metric label, as these remain fixed as the set adds and removes pods. Valialkin, however, refers to this as somewhat of a hack, saying we should push as a community to have first-class citizen nomenclature in Kubernetes monitoring to provide consistent naming.
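As a rough illustration of why a stable label helps, the Python sketch below sums per-pod samples onto a workload-level label, so the resulting series continues across a rollout. The `workload` label name, the pod names and the values are assumptions made for this example, and the aggregation is a simplified stand-in for what a query-time `sum by (...)` would do.

```python
from collections import defaultdict

# Hypothetical gauge samples (for example, memory in MiB): (timestamp, labels, value).
samples = [
    (1, {"workload": "checkout", "pod": "checkout-abcde"}, 200.0),
    (1, {"workload": "checkout", "pod": "checkout-fghij"}, 220.0),
    # After a rollout, both pods have been replaced.
    (2, {"workload": "checkout", "pod": "checkout-klmno"}, 210.0),
    (2, {"workload": "checkout", "pod": "checkout-pqrst"}, 190.0),
]


def sum_by(samples, label):
    """Sum sample values per timestamp, grouped by a single label."""
    grouped = defaultdict(dict)
    for timestamp, labels, value in samples:
        key = labels[label]
        grouped[key][timestamp] = grouped[key].get(timestamp, 0.0) + value
    return {key: sorted(points.items()) for key, points in grouped.items()}


by_pod = sum_by(samples, "pod")
by_workload = sum_by(samples, "workload")

print(len(by_pod))   # four short-lived series
print(by_workload)   # one continuous series for the workload
```

Grouped by pod, the data splits into four series of one sample each. Grouped by workload, it forms a single series that spans the rollout.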
Configuration Complexity with Multiple Deployments
Organizations typically run hundreds or even thousands of different applications. When these applications are deployed on Kubernetes, this results in hundreds or thousands of deployment configurations, along with multiple Prometheus scrape_config configurations that define how to scrape (pull) these metrics: which rules, filters and relabeling to apply, which ports to scrape, and so on. Managing this many different configurations can quickly become unmanageable at scale. Furthermore, it can burden the Kubernetes API server, which needs to serve requests on all these different configurations.
As a community, we can benefit from a standard for service discovery of deployments and pods in Kubernetes on top of the Prometheus service discovery mechanism. In Valialkin’s vision, “in most cases Prometheus or some other monitoring system should automatically discover all the deployments, all the pods which need to be scraped to collect metrics without the need to write custom configuration per each deployment. And only in some exceptional cases when you need to customize something, then you can write these custom definitions for scraping.”
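One existing approximation of this idea is the widely used `prometheus.io/scrape`, `prometheus.io/port` and `prometheus.io/path` pod annotations. These are a community convention rather than a built-in Prometheus feature, and they only take effect when the scrape configuration's relabeling rules honor them. The Python sketch below models the decision logic only: the `Pod` class and the default port are assumptions for illustration, not part of any Kubernetes or Prometheus API.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Minimal stand-in for a discovered pod (hypothetical, for illustration)."""
    name: str
    ip: str
    annotations: dict = field(default_factory=dict)


DEFAULT_PORT = "8080"      # assumed default for this sketch only
DEFAULT_PATH = "/metrics"


def scrape_target(pod):
    """Return the scrape URL for a pod that opted in, or None otherwise."""
    annotations = pod.annotations
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Defaults cover the common case; annotations override the exceptions.
    port = annotations.get("prometheus.io/port", DEFAULT_PORT)
    path = annotations.get("prometheus.io/path", DEFAULT_PATH)
    return f"http://{pod.ip}:{port}{path}"


pods = [
    Pod("checkout-abcde", "10.0.0.5", {"prometheus.io/scrape": "true"}),
    Pod("legacy-xyz", "10.0.0.6", {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "9102",
        "prometheus.io/path": "/stats/prometheus",
    }),
    Pod("batch-job-123", "10.0.0.7"),  # no annotations, so it is skipped
]

targets = [url for pod in pods if (url := scrape_target(pod)) is not None]
print(targets)
```

The first pod needs no per-deployment configuration at all, while the second overrides only the port and path, which mirrors the "defaults first, customization only in exceptional cases" approach described above.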