Why Monitoring Kubernetes Is so Challenging and How to Manage It
Despite the popularity of Kubernetes, operating a Kubernetes cluster is challenging. Becoming knowledgeable on this technology, which has layers upon layers of abstractions, takes considerable time. Managing a cluster deployment is difficult, the ecosystem is constantly changing, and best practices are continually evolving.
As a result, Kubernetes monitoring is also complicated. Knowing metrics on cluster health, identifying issues, and figuring out how to remediate problems are common obstacles organizations face, making it difficult to fully realize the benefits and value of their Kubernetes deployment.
Understanding how to best approach monitoring Kubernetes health and performance requires first knowing why Kubernetes observability is uniquely challenging. This article dives into these complexities, along with how to manage them with the right solution.
Kubernetes Monitoring — Why So Complex?
Kubernetes’ strength is also one of its weaknesses. It abstracts away a lot of complexity to speed deployment; but in doing so, it leaves you blind as to what is actually happening, what resources are being utilized, and even the cost implications of the actions being taken. Also, Kubernetes has many more components — e.g. servers and services — than traditional infrastructure, making it much more difficult to do root cause analysis when something goes wrong. The following are three core complexities that make Kubernetes monitoring challenging.
Complexity #1: Millions of Metrics
Kubernetes is a multilayered solution. The entire deployment is a cluster and inside each cluster are nodes. Each node runs one or more pods, which are the main components that handle your containers, and the nodes and pods in turn are managed by the Control Plane. Inside the Control Plane are many smaller pieces such as the kube-controller, cloud-controller, kube-api-server, kube-scheduler, and etcd. These abstractions all work to help Kubernetes efficiently support your container deployments. While they’re all very helpful, they generate a significant number of metrics.
In addition to Control Plane metrics are “Pod Churn” metrics. Real-world pod usage varies wildly between different organizations. Some organizations design systems where pods may last days, weeks, or even months; while other organizations have systems where pods only last for minutes or seconds. In Kubernetes, each pod produces a collection of its own metrics. “Pod churn” refers to the cycle through which pods and containers are created, destroyed, and later recreated; and every time a pod is created, you have a new set of metrics being created for it. This results in a large volume of high-cardinality (very unique) metrics. A high level of “pod churn” can result in millions upon millions of new metrics being created every single day.
Complexity #2: Ephemerality
In addition to the system Control Plane, there are your deployment elements — which constantly change. Deployments, DaemonSets, Jobs and StatefulSets all can generate new pods to monitor, and sometimes it’s even necessary to scale down; then pods or nodes will disappear forever. The Kubernetes scheduler schedules all of these elements to ensure that resources are always available and allocated where you want them to be. As new deployments are scheduled, Kubernetes may decide that it needs to move a pod in order to free up resources on a given node. This results in pods being moved and recreated — the same pod, just with a different name and in a different place.
Complexity #3: Lack of Observability
Organizations that adopt Kubernetes tend to also follow modern software practices, including using microservices and/or stateless application design. These ultimately lead to application architectures that are very dynamic and hinder observability.
In a microservice-based application, engineers break down the application into components representing the core functions or services of the application. These components are intended to be loosely coupled, so the services are operated independently and designed in such a way that a change to one service won’t significantly affect other services. Modern applications can be composed of dozens of microservices, and Kubernetes keeps track of the state of these various components, ensuring they are available and that there are enough of them to handle the appropriate workload. The microservices themselves are in constant communication with each other, and that communication takes place through a virtual network within the Kubernetes cluster itself.
In a stateless application, the application avoids storing any client session data on the server. Any session data storage (if it needs to occur at all) is handled on the client-side. Since there is no session data stored on the server, there is also no need for any particular client connection to be favored over any other client connection. This allows the application to treat each connection as the first connection and easily balance the processing load across multiple instances. The biggest benefit of stateless application design is that it enables applications to be horizontally scaled, simply by deploying instances of the application on multiple servers and then distributing all incoming client requests amongst the available servers.
Microservices are not required to be stateless (and stateless apps are not required to be organized into microservices), but you do tend to find these two practices being leveraged together for the sake of being able to easily scale the application. This means Kubernetes becomes an ideal platform upon which to deploy this type of software. However, these types of services are (by design) expected to be ephemeral; they scale up to handle a workload and subsequently disappear when no longer needed. As a result, all operational information present within a pod disappears with it when it’s torn down. Nothing is left; it’s all gone.
How does this affect the observability of Kubernetes? Since observability is the ability to infer the state of a system through knowledge of that system’s outputs, it sure seems like Kubernetes is a system with minimal observability. This limited observability is why it’s so difficult to troubleshoot problems with Kubernetes. It’s not uncommon to hear stories of Kubernetes operators finding major software problems months or even years after having migrated to the ecosystem. Kubernetes itself does such a fantastic job of ensuring that services stay running, that given its limited outputs you can easily find yourself in just such a situation without realizing it. On the surface, this is a great success story for Kubernetes, but sooner or later those software problems need to be found, and that’s going to be a problem when the system seems to be a “black box.”
Monitoring Solutions that Manage Kubernetes Complexities for You
Managing the complexities of Kubernetes observability requires knowing what to look for in a monitoring solution. While there are several open source Kubernetes monitoring solutions, they require you to create and install several individual components before you can meaningfully monitor your cluster. Several traditional IT monitoring tool providers have also introduced Kubernetes monitoring solutions, but many are not purpose-built for Kubernetes. As a result, organizations are required to do more tuning and spend considerable time identifying problems, what’s causing them, and how to fix them.
So how do you identify what’s best for you? The following are key criteria to consider when evaluating Kubernetes monitoring solutions.
Adapts to Changes Automatically Yet Keeps a Consistent User Experience
Due to the ephemeral nature of Kubernetes, a monitoring solution needs the ability to detect changes automatically and continue monitoring without interruption. Also, Kubernetes itself is constantly changing. So on top of figuring out how Kubernetes works, there’s the challenge of trying to figure out how to make monitoring work in an ever-evolving ecosystem. The goal of a good Kubernetes monitoring solution should be to keep up with and encapsulate these changes in a way that still provides a consistent, reliable, and repeatable experience for the user — thus removing the burden from them.
Offers Turnkey Tools Built Specifically for Kubernetes
If you’re monitoring Kubernetes, it’s your job to identify problems, the cause of failure, and the steps to quick remediation. This requires developing appropriate domain knowledge, including a discrete set of pathological failures and prescriptive responses to each of them. But when you look at the Kubernetes skills gap in the market today, this sort of knowledge is extremely rare.
An effective Kubernetes monitoring solution should provide turnkey capabilities for identifying and remediating recurrent, specific failures seen in Kubernetes deployments — like crash loops, job failures, CPU utilization, etc. Users should not need to figure out which of these they need to monitor and how. The solution should make you aware of the problem, but not require deep analysis or learning to track it, deal with it, and ensure it doesn’t happen again.
Handles a Lot of Data and Knows Which Metrics Are Important to Pay Attention to
Conventional monitoring systems just can’t keep up with the sheer volume of unique metrics needed to properly monitor Kubernetes clusters. A comprehensive Kubernetes monitoring solution must have the ability to handle all of this data. But which metrics should you watch? You can’t watch all of them, and you don’t need to watch all of them. Your Kubernetes monitoring solution needs to keep tabs on the important metrics, so you can rest assured everything is working as it should.
Kubernetes offers significant benefits for modern cloud-based applications, but to fully reap its benefits organizations need a new approach to monitoring. Kubernetes presents unique observability challenges, and conventional monitoring techniques are not enough to gain insights into cluster health and resource allocation. By understanding the complexities behind Kubernetes monitoring, you can better identify a solution that will allow you to derive more value from your Kubernetes deployment.