Kubernetes Debugging: What Works and What Fails

KubeCon + CloudNativeCon sponsored this post, in anticipation of KubeCon + CloudNativeCon NA, Nov. 18-21 in San Diego.

Kubernetes is a highly distributed, microservices-oriented technology that allows developers to run code at scale. It has revolutionized cloud infrastructure and made engineers' lives simpler in many ways, a shift reflected in the studies showing rising Kubernetes adoption within enterprises.
But even hardcore Kubernetes enthusiasts will admit debugging Kubernetes pods is still a pain.
In such a highly distributed system, simulating the exact conditions needed to find the root cause of a problem is very difficult. The traditional approach of debugging locally has its limitations, and debugging remotely, directly in the cloud or in production, comes with issues of its own.
Kubernetes is all about making software more elastic, and elastic software requires elastic observability; yet we are still stuck using the same old debugging and logging approaches of decades past.
We need a new, better approach.
The Limitations of Debugging Locally
Every developer debugs locally. It’s a standard part of the process and development lifecycle. However, when it comes to Kubernetes and the related complexities of microservices architecture, this approach becomes significantly more difficult.
Each microservice both serves and consumes services from other microservices. Recreating this complex architecture in a local environment, with all of the same infrastructure and dependencies, requires a great deal of setup work. One way to deal with this is an automation script that lets developers run the microservices on their own machines with a single command. However, because the developer must keep the configuration, and the branch being run, aligned with everyone else's branches, the script breaks often.
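To make this concrete, here is a minimal sketch of the kind of launcher script described above. The service names, paths and ports are purely illustrative, and a real script would also have to manage environment variables, databases and other dependencies:

    # Hypothetical local launcher: start every microservice with one command.
    # It breaks as soon as a service's config or branch drifts from what the
    # script assumes.
    import subprocess

    SERVICES = {
        "orders":   ["python", "orders/app.py", "--port", "8001"],
        "payments": ["python", "payments/app.py", "--port", "8002"],
        "users":    ["python", "users/app.py", "--port", "8003"],
    }

    def run_all():
        procs = [subprocess.Popen(cmd) for cmd in SERVICES.values()]
        for proc in procs:
            proc.wait()

    if __name__ == "__main__":
        run_all()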
There are also open-source solutions, such as Hotel or Docker Compose, that can mitigate some of these frustrations, but they come with their own challenges and disadvantages. As open source tools, they need to be maintained indefinitely, and adopting them forces the development team to dedicate resources to learning a new tool: time and energy that could be saved with a better approach. More fundamentally, the system you care most about lives in production, and even the best copy you create for debugging will be a mere shadow of the reality developers actually face.
The Issues with Debugging Kubernetes Remotely
While developers can debug microservices hosted by cloud providers, Kubernetes has its own orchestration mechanism and optimization methodologies. Those methodologies are what make Kubernetes great, but they also make debugging more difficult. Accessing pods is an inherently unstable operation: if a developer tries to SSH into a Kubernetes pod to run debugging tools, Kubernetes might kill that pod a second before the data is collected.
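As an illustration of how fragile this is, here is a hedged sketch (not from the original article) that uses the official Kubernetes Python client to exec into a pod and grab some process data; the pod name is hypothetical, and if the scheduler evicts or restarts the pod mid-call, the request simply fails and the data is gone:

    # Exec into a pod to inspect it; requires the 'kubernetes' package and a
    # working kubeconfig. If the pod is killed first, we get an API error
    # instead of data.
    from kubernetes import client, config
    from kubernetes.client.rest import ApiException
    from kubernetes.stream import stream

    config.load_kube_config()
    core = client.CoreV1Api()

    try:
        output = stream(
            core.connect_get_namespaced_pod_exec,
            "checkout-7d4b9c",                 # hypothetical pod name
            "default",                         # namespace
            command=["/bin/sh", "-c", "ps aux"],
            stdout=True, stderr=True, stdin=False, tty=False,
        )
        print(output)
    except ApiException as exc:
        print("Pod went away before we got the data:", exc)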
There are ways to address these issues. Relying on logger.info("Got Exit Signal: {}".format(sig)) is the oldest trick in the book, but it now carries the heavy price of a redeployment every time a log line changes. Developers can also redirect traffic from the cluster to their local machine, which can certainly help recreate an issue for effective debugging, but this isn't secure and is often more data than a local computer can handle.
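For reference, that oldest trick looks something like the following sketch: a signal handler that logs why the process is going down. Even a one-line change to this log output means rebuilding the image and redeploying the pod before it produces anything.

    # Classic debug-by-log-line: record the exit signal when the pod is terminated.
    import logging
    import signal
    import sys

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def handle_exit(sig, frame):
        logger.info("Got Exit Signal: {}".format(sig))
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_exit)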
Service mesh solutions such as Istio and Linkerd can help developers track their microservices without changing code, and because these tools proxy both inbound and outbound traffic, they are an ideal place to add debugging and tracing capabilities. The main downside of service mesh debugging, however, is that it cannot find the root cause of an issue. It can highlight that microservice A is slow, for example, but it won't say why. Getting to the bottom of it often means diving back into the code with other tools and, frequently, resorting to redeploying log lines yet again.
On-demand, Live Datapoint Collection
A better approach harnesses the elastic power of on-demand instrumentation, adding logs and debug snapshots at runtime and providing a unified solution for rapid debugging across development, staging and production environments. It lets developers get the data they need from Kubernetes applications without writing more code, restarting or redeploying. The idea is essentially to bring breakpoints back to Kubernetes: set a breakpoint and get any datapoint instantly, only without ever having to stop the application as a traditional breakpoint would. These non-breaking breakpoints work the same way for local, remote and even production deployments. And because the solution is not tied to a single container instance, it works at scale and can also catch issues that only show up intermittently.
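The mechanics can be illustrated with a toy sketch (an illustration of the concept, not any particular vendor's implementation): a Python trace hook that snapshots local variables whenever execution passes a chosen file and line, then lets the program keep running instead of pausing it. The file name and line number are hypothetical, and production-grade tools do this with far lower overhead through bytecode or agent-level instrumentation.

    # Toy "non-breaking breakpoint": capture locals at a target line, never pause.
    import sys

    SNAPSHOTS = []
    TARGET_FILE, TARGET_LINE = "pricing.py", 42   # hypothetical location to watch

    def tracer(frame, event, arg):
        if event == "line":
            if (frame.f_code.co_filename.endswith(TARGET_FILE)
                    and frame.f_lineno == TARGET_LINE):
                SNAPSHOTS.append(dict(frame.f_locals))  # copy locals, keep going
        return tracer

    sys.settrace(tracer)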
Kubernetes and container orchestration frameworks bring new capabilities, but they make it more challenging to get exact and timely data out of our applications. To address this, the industry is heading in a new direction: collecting data instantly and pipelining it on the fly while the application keeps running uninterrupted.
This enables new concepts such as on-the-fly conditional logging (the Kubernetes equivalent of conditional breakpoints) and temporal logging (the ability to limit logging to a time window for a specific need). Decoupling data from code, and liberating it from CI/CD pipelines, complex installations and high overhead, takes observability to a whole new level, eliminating friction and bottlenecks and freeing up dev and DevOps resources along the way. Just as structured logging supported software's initial growth, this next generation of responsive logging can add the flexibility needed to fuel modern and upcoming software trends.
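As a rough sketch of what "conditional" and "temporal" mean here, consider a standard logging.Filter that only passes records matching a condition and silently expires after a time window; the user ID and ten-minute window below are illustrative. The difference with the approach described above is that such conditions are injected at runtime, without this code ever being written, shipped or redeployed.

    # Conditional + temporal logging: emit matching records only for a limited time.
    import logging
    import time

    class TemporalConditionalFilter(logging.Filter):
        def __init__(self, condition, ttl_seconds):
            super().__init__()
            self.condition = condition
            self.expires_at = time.time() + ttl_seconds

        def filter(self, record):
            if time.time() > self.expires_at:
                return False                  # temporal: the logging window is over
            return self.condition(record)     # conditional: e.g. one specific user

    logger = logging.getLogger("checkout")
    logger.addFilter(TemporalConditionalFilter(
        condition=lambda r: getattr(r, "user_id", None) == "user-123",
        ttl_seconds=600,
    ))
    logger.warning("slow cart calculation", extra={"user_id": "user-123"})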
With Kubernetes, the complexity and scale of modern software are exploding. We've come a long way from humble beginnings, and the days of simple, single-instance servers are long gone. It's time to retire the old ways of debugging along with them, decoupling data from code and enabling the elasticity that modern observability requires.
Feature image via Pixabay.