The Challenges of Monitoring Kubernetes and OpenShift
Like many organizations, we are embracing Kubernetes as a deployment platform for many of our applications. We use both upstream Kubernetes and OpenShift, an enterprise Kubernetes distribution on steroids. The Kubernetes framework is very powerful. It allows massive deployments at scale, simplifies new version rollouts and multi-variant testing, and offers many levers to fine-tune the development and deployment process.
At the same time, this flexibility makes Kubernetes complex in terms of setup, monitoring and maintenance at scale. Each of the Kubernetes core components (api-server, kube-controller-manager, kubelet, kube-scheduler) has quite a few flags that govern how the cluster behaves and performs. The default values may be okay initially for smaller clusters, but as deployments scale up, some adjustments must be made. We have learned to keep these values in mind when monitoring OpenShift clusters — both from our own pain and from published accounts of other community members who have experienced their own hair-pulling discoveries.
Our company’s demo platform was one of the first technology assets we migrated to Kubernetes. Despite the word “demo” in the name, it is a critical platform with a 24/7 availability requirement. More than 30 application instances are deployed to the platform, and these applications are complex and constantly evolving.
Each app comprises at least a dozen microservices built with various technologies and serving various functions. As an example, one of our more mature applications relies on three database backends (Oracle, MySQL and MongoDB), a Kafka queue, legacy Java components and several microservices developed in Go and Node.js. Over the course of this year alone, more than 14 million lines of code were touched.
The move to Kubernetes allowed us to realize all major benefits that the container management solution provides:
- We develop faster: we do not need to spend time developing infrastructure components that are already built into Kubernetes;
- We deploy more frequently because Kubernetes makes it easy. This year alone we went through 30 major upgrades. The ease of deployment also helps us spend less time on updates and issue resolution. Rapid test cycles allow us to deliver higher-quality code: since the move to Kubernetes, the number of Sev1 issues has fallen by 80 percent;
- We reduced our operational cost: with the higher density of containerized deployments, we use fewer VMs and our cloud spend has decreased. We also leverage the ease of blue-green deployments to maintain the high availability of the platform.
In this article, we will share what we monitor in OpenShift clusters and give suggestions as to how our strategy might be relevant to your own environments.
What We Monitor and Why
At the foundational level, we want monitoring operators to keep an eye on the health of the nodes where the cluster is deployed. Typically, you would have a cluster of masters, where core Kubernetes components (api-server, controller-manager, kube-scheduler, etc.) are deployed, a highly available etcd cluster and a number of worker nodes for guest applications. To paint a complete picture, we combine infrastructure health metrics with the relevant cluster data gathered by our Kubernetes data collectors.
From an infrastructure point of view, we track CPU, memory and disk utilization on all the nodes, and also zoom into the network traffic on etcd. In order to spot bottlenecks, we look at various aspects of the traffic at a granular level (e.g., reads/writes and throughput). Kubernetes and OpenShift clusters may suffer from memory starvation, disks overfilled with logs, or spikes in load on the API server and, consequently, on etcd.
Ironically, monitoring solutions are often responsible for bringing down clusters by pulling excessive amounts of information from the Kubernetes APIs. It is always a good idea to establish how much monitoring is enough and dial it up when necessary to diagnose issues further. If a high level of monitoring is warranted, you may need to add more masters and etcd nodes. Another useful technique, especially with large-scale implementations, is to have a separate etcd cluster just for storing Kubernetes events. This way, the spikes in event creation and event retrieval for monitoring purposes won’t affect performance of the main etcd instances. This can be accomplished by setting the --etcd-servers-overrides flag of the api-server, for example:
--etcd-servers-overrides=/events#https://etcd1.cluster.com:2379;https://etcd2.cluster.com:2379;https://etcd3.cluster.com:2379
From the cluster perspective, we monitor resource utilization across the nodes that allow pod scheduling. We also keep track of the pod counts and visualize how many pods are deployed to each node and how many of them are bad (failed/evicted).
A dashboard widget with infrastructure and cluster metrics combined.
Why is this important? The kubelet, the component responsible for managing pods on a given node, has a setting, --max-pods, which determines the maximum number of pods that can be orchestrated. In Kubernetes the default is 110. In OpenShift it is 250. The value can be changed up or down depending on need. We like to visualize the remaining headroom on each node, which helps with proactive resource planning and helps prevent sudden overflows (which could mean an outage). Another data point we add there is the number of evicted pods per node.
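As a rough illustration of the headroom view described above, here is a minimal sketch. It assumes node and pod objects shaped like the JSON returned by kubectl get nodes -o json and kubectl get pods -o json, and that a node's status.allocatable.pods value reflects its --max-pods setting. The function name pod_headroom is hypothetical, and not counting terminated pods against the limit is our assumption.

```python
from collections import Counter


def pod_headroom(nodes, pods):
    """Return {node name: (remaining pod slots, evicted pod count)}."""
    active = Counter()
    evicted = Counter()
    for pod in pods:
        node = pod.get("spec", {}).get("nodeName")
        if not node:
            continue  # not scheduled yet, so it occupies no node
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Failed" and status.get("reason") == "Evicted":
            evicted[node] += 1
        elif phase not in ("Succeeded", "Failed"):
            # Assumption: only non-terminated pods count against the limit.
            active[node] += 1

    report = {}
    for node in nodes:
        name = node["metadata"]["name"]
        # The API serializes the pod capacity as a string, e.g. "110".
        capacity = int(node["status"]["allocatable"]["pods"])
        report[name] = (capacity - active[name], evicted[name])
    return report
```

A node whose first value approaches zero is close to its pod limit, and a growing second value points at ongoing evictions on that node.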
Evictions are caused by space or memory starvation. We recently had an issue with the disk space on one of our worker nodes due to a runaway log. As a result, the kubelet produced massive evictions of pods from that node. Evictions are bad for many reasons. They will typically affect the quality of service or may even cause an outage. If the evicted pods have an exclusive affinity with the node experiencing disk pressure, and as a result cannot be re-orchestrated elsewhere in the cluster, the evictions will result in an outage. Evictions of core component pods may lead to the meltdown of the cluster.
Long after the incident where pods were evicted, we saw the evicted pods were still lingering. Why was that? Garbage collection of evictions is controlled by a setting in kube-controller-manager called --terminated-pod-gc-threshold. The default value is set to 12,500, which means that garbage collection won’t occur until you have that many evicted pods. Even in a large implementation it may be a good idea to dial this threshold down to a smaller number.
If you experience a lot of evictions, you may also want to check if kube-scheduler has a custom --policy-config-file defined with no CheckNodeMemoryPressure or CheckNodeDiskPressure predicates.
Following our recent incident, we set up a new dashboard widget that tracks threats that may cause a cluster meltdown (e.g., massive evictions). We also associated a health rule with this metric and set up an alert. Specifically, we’re now looking for warning events that tell us a node is about to experience memory or disk pressure and that a pod cannot be reallocated (e.g., NodeHasDiskPressure, NodeHasMemoryPressure, ErrorReconciliationRetryTimeout, ExceededGracePeriod, EvictionThresholdMet).
We also look for daemon pod failures (FailedDaemonPod), as they are often associated with the cluster health rather than issues with the daemon set app itself.
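To make the idea concrete, here is a minimal sketch of such a threat metric. It assumes event objects shaped like the JSON returned by kubectl get events -o json. The names THREAT_REASONS, threat_counts and should_alert are hypothetical, and the sketch matches on the event reason only, without assuming a particular event type.

```python
from collections import Counter

# Event reasons named in the text above; extend as needed.
THREAT_REASONS = frozenset({
    "NodeHasDiskPressure",
    "NodeHasMemoryPressure",
    "ErrorReconciliationRetryTimeout",
    "ExceededGracePeriod",
    "EvictionThresholdMet",
    "FailedDaemonPod",
})


def threat_counts(events, reasons=THREAT_REASONS):
    """Count occurrences of threatening events, grouped by reason."""
    totals = Counter()
    for event in events:
        reason = event.get("reason")
        if reason in reasons:
            # Repeated events are aggregated into a 'count' field;
            # treat a missing or empty count as a single occurrence.
            totals[reason] += event.get("count") or 1
    return totals


def should_alert(totals, threshold=1):
    """Hypothetical health rule: alert once the total reaches the threshold."""
    return sum(totals.values()) >= threshold
```

The threshold of 1 is only a placeholder. In practice a baseline per cluster is likely a better trigger than a fixed number.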
Pod crashes are an obvious target for monitoring, but we are also interested in tracking pod kills. Why would someone be killing a pod? There may be good reasons for it, but it may also signal a problem with the application. For similar reasons, we track deployment scale-downs, which we do by inspecting ScalingReplicaSet events. We also like to visualize the scale-down trend along with the app health state. Scale-downs may happen by design through, for example, auto-scaling when the app load subsides. They may also be issued manually, in error, and expose the application to an excessive load.
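A minimal sketch of the scale-down check follows. It assumes event objects shaped like kubectl get events -o json output and a message of the form "Scaled down replica set NAME to N", which is what we have seen the deployment controller emit. The exact wording may differ between Kubernetes versions, so treat the pattern as an assumption. The function name scale_downs is hypothetical.

```python
import re

# Assumed message format; newer versions may append "from M" after the count.
_SCALE_DOWN = re.compile(r"Scaled down replica set (\S+) to (\d+)")


def scale_downs(events):
    """Return (deployment, replica set, new replica count) for each scale-down."""
    found = []
    for event in events:
        if event.get("reason") != "ScalingReplicaSet":
            continue
        match = _SCALE_DOWN.search(event.get("message", ""))
        if match:
            deployment = event.get("involvedObject", {}).get("name")
            found.append((deployment, match.group(1), int(match.group(2))))
    return found
```

Plotting the returned replica counts over time next to the app health state gives the scale-down trend mentioned above.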
Pending state is supposed to be a relatively short stage in the lifecycle of a pod, but sometimes it isn’t. It may be a good idea to track pods with a pending time that exceeds a certain reasonable threshold (one minute, for example). Even better is if you’re able to baseline metrics like this and track deviations from the baseline. If you catch a spike in pending state duration, the first thing to check is the size of your images and the speed of image download. One big image may clog the pipe and affect other containers. The kubelet has a flag, --serialize-image-pulls, which is set to true by default. It means that images will be pulled one at a time. Change the flag to false if you want to pull images in parallel and avoid the potential clogging by a monster-sized image. Keep in mind, however, that you have to use Docker’s overlay2 storage driver to make this work. In newer Docker versions this setting is the default. In addition to the kubelet setting, you may also need to tweak the max-concurrent-downloads flag of the Docker daemon to ensure the desired parallelism.
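The pending-time check could be sketched as follows. This assumes pod objects shaped like kubectl get pods -o json output and uses the pod's creationTimestamp as an approximation of when it entered the Pending state. The function name long_pending and the one-minute default are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def long_pending(pods, now, threshold=timedelta(minutes=1)):
    """Return (pod name, age) for Pending pods older than the threshold."""
    stuck = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        # The API reports timestamps in UTC, e.g. "2024-01-01T12:00:00Z".
        created = datetime.strptime(
            pod["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        age = now - created
        if age > threshold:
            stuck.append((pod["metadata"]["name"], age))
    return stuck
```

Passing the current time in as an argument, for example datetime.now(timezone.utc), keeps the function easy to test and to replay against historical data.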
Large images that take a long time to download may also cause a different type of issue that results in a failed deployment. The kubelet flag --image-pull-progress-deadline determines the point in time when the image will be deemed as “too long to pull or extract.” If you deal with big images, make sure you dial up the value of the flag to fit your needs.
Many big issues in the cluster stem from small human errors. A typo in a spec (in the image name, for example) may bring down the entire deployment. Similar effects may occur due to a missing image or insufficient rights to the registry. With that in mind, we track image errors closely and pay attention to excessive image-pulling. Unless it is truly needed, image-pulling is something you want to avoid in order to conserve bandwidth and speed up deployments.
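One way to surface image errors is to look at the waiting reason of each container. The sketch below assumes pod objects shaped like kubectl get pods -o json output and checks for the reasons ErrImagePull, ImagePullBackOff and InvalidImageName. The function name image_errors is hypothetical, and the list of reasons is not meant to be exhaustive.

```python
IMAGE_ERROR_REASONS = frozenset({
    "ErrImagePull",
    "ImagePullBackOff",
    "InvalidImageName",
})


def image_errors(pods):
    """Return (pod name, image, reason) for containers stuck on image problems."""
    problems = []
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for container in statuses:
            # A container that cannot start reports its state as 'waiting'.
            waiting = container.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason")
            if reason in IMAGE_ERROR_REASONS:
                problems.append(
                    (pod["metadata"]["name"], container.get("image"), reason)
                )
    return problems
```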
Storage issues also tend to arise due to spec errors, lack of permissions or policy conflicts. We monitor storage issues (e.g., mounting problems) because they may cause crashes. We also pay close attention to resource quota violations, which are easy to miss because they do not trigger pod failures. They will, however, prevent new deployments from starting and existing deployments from scaling up.
Speaking of quota violations, are you setting resource limits in your deployment specs?
Policing the Cluster
On our OpenShift dashboards, we display a list of potential red flags that are not necessarily a problem yet but may cause serious issues down the road. Among these are pods without resource limits or health probes in the deployment specs.
Resource limits can be enforced by resource quotas across the entire cluster or at a more granular level. Violation of these limits will prevent the deployment. In the absence of a quota, pods can be deployed without defined resource limits. Having no resource limits is bad for multiple reasons. It makes the cluster capacity planning challenging. It may also cause an outage. If you create or change a resource quota when there are active pods without limits, any subsequent scale up or redeployment of these pods will result in failures.
The health probes, readiness and liveness, are not enforceable, but it is a best practice to have them defined in the specs. They are the primary mechanism for the pods to tell the kubelet whether the application is ready to accept traffic and if it is still functioning. If the readiness probe is not defined and the pod takes a long time to initialize (based on the kubelet’s default), the pod will be restarted. This loop may continue for some time, taking up cluster resources for no reason and effectively causing a poor user experience or outage.
The absence of the liveness probe may cause a similar effect if the application is performing a lengthy operation and the pod appears to Kubelet as unresponsive.
We provide easy access to the list of pods with incomplete specs, allowing cluster admins to have a targeted conversation with development teams about corrective action.
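A list like this can be derived directly from the pod specs. The following is a minimal sketch, assuming pod objects shaped like kubectl get pods -o json output. The function name spec_red_flags and the wording of the flags are our own.

```python
def spec_red_flags(pods):
    """Return {pod name: sorted list of issues} for pods with incomplete specs."""
    flagged = {}
    for pod in pods:
        issues = set()
        for container in pod.get("spec", {}).get("containers", []):
            # An empty or missing 'limits' mapping means no resource limits.
            if not container.get("resources", {}).get("limits"):
                issues.add("no resource limits")
            if "readinessProbe" not in container:
                issues.add("no readiness probe")
            if "livenessProbe" not in container:
                issues.add("no liveness probe")
        if issues:
            flagged[pod["metadata"]["name"]] = sorted(issues)
    return flagged
```

Grouping the result by namespace or owner label makes it easier to route each finding to the team that owns the deployment.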
Routing and Endpoint Tracking
As part of our OpenShift monitoring, we provide visibility into potential routing and service endpoint issues. We track unused services, including those created by someone in error and those without any pods behind them because the pods failed or were removed.
We also monitor bad endpoints pointing at old (deleted) pods, which effectively causes downtime. This issue may occur during rolling updates when the cluster is under increased load and API request-throttling is lower than it needs to be. To resolve the issue, you may need to increase the --kube-api-burst and --kube-api-qps config values of kube-controller-manager.
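Both checks can be approximated by cross-referencing services, endpoints and pods. The sketch below assumes objects shaped like the JSON returned by kubectl get services, endpoints and pods with -o json. The function names are hypothetical, and services that intentionally have no pods behind them (ExternalName services, for example) would show up as false positives.

```python
def services_without_endpoints(services, endpoints):
    """Return (namespace, name) of services with no ready addresses behind them."""
    backed = set()
    for ep in endpoints:
        subsets = ep.get("subsets") or []
        if any(subset.get("addresses") for subset in subsets):
            backed.add((ep["metadata"]["namespace"], ep["metadata"]["name"]))
    result = []
    for svc in services:
        key = (svc["metadata"]["namespace"], svc["metadata"]["name"])
        # An Endpoints object shares the namespace and name of its service.
        if key not in backed:
            result.append(key)
    return result


def stale_endpoint_targets(endpoints, pods):
    """Return (namespace, service, pod) for addresses that point at missing pods."""
    live = {(p["metadata"]["namespace"], p["metadata"]["name"]) for p in pods}
    stale = []
    for ep in endpoints:
        namespace = ep["metadata"]["namespace"]
        for subset in ep.get("subsets") or []:
            for address in subset.get("addresses") or []:
                ref = address.get("targetRef") or {}
                if ref.get("kind") != "Pod":
                    continue
                target = (ref.get("namespace", namespace), ref["name"])
                if target not in live:
                    stale.append((namespace, ep["metadata"]["name"], ref["name"]))
    return stale
```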
Context plays a significant role in our monitoring philosophy. We always look at application performance through the lens of the end-user experience and desired business outcomes. Unlike specialized cluster-monitoring tools, we are not only interested in cluster health and uptime per se. We’re equally concerned with the impact the cluster may have on application health and, subsequently, on the business objectives of the app.
In addition to having a cluster-level dashboard, we also build specialized dashboards with a more application-centric point of view. There we correlate cluster events and anomalies with application or component availability, end-user experience as reported by real-user monitoring, and business metrics (e.g., conversion of specific user segments).
Leveraging K8s Metadata
Kubernetes makes it super easy to run canary deployments, blue-green deployments, and A/B or multivariate testing. We leverage these conveniences by pulling deployment metadata and using labels to analyze performance of different versions side by side.
By bringing these different datasets together under one umbrella, an APM solution should establish a common ground for diverse groups of operators. On the one hand, you have cluster admins, who are experts in Kubernetes but may not know the guest applications in detail. On the other hand, you have DevOps in charge of the APM or managers looking at the business metrics, neither of whom may be intimately familiar with Kubernetes.
The right solution allows these groups to have a productive monitoring conversation, using terms that are well understood by everyone and a single tool to examine data points on a shared dashboard. AppDynamics provides a unified monitoring platform that centers conversations around the business transaction, i.e., the user-facing pieces of functionality within your applications. With this common starting point, conversations are more productive and we are able to quickly correlate the performance of our various systems, including Kubernetes/OpenShift, using their specific metrics, with what our users are experiencing.