
Troubleshoot Microservices with Dynamic Packet Capture

15 Sep 2021 6:15am, by Dhiraj Sehgal and Joseph Yostos

Dhiraj Sehgal
Dhiraj is director of product marketing at Tigera.

Troubleshooting container connectivity issues and performance hotspots in Kubernetes clusters can be frustrating in a dynamic environment where hundreds, possibly thousands, of pods are continually being created and destroyed. If you are a DevOps or platform engineer who needs to troubleshoot microservice and application connectivity issues, or to figure out why a service or application is performing slowly, you might use traditional packet capture methods such as running tcpdump against a container in a pod. This may work in a siloed, single-developer environment, but enterprise-level troubleshooting comes with its own mandatory requirements and scale. You don’t want to be slowed down by these requirements; you want to address them in a way that shortens the time to resolution.

Dynamic packet capture is a Kubernetes-native way to troubleshoot your microservices and applications quickly and efficiently without granting extra permissions. Let’s look at a specific use case to see some challenges and best practices for live troubleshooting with packet capture in a Kubernetes environment.

Use Case: CoreDNS Service Degradation

Let’s talk about this use case in the context of a hypothetical situation.

Scenario

Your organization’s DevOps and platform teams are trying to figure out what’s wrong with the DNS service, which has degraded several times during the past few days.

The teams notice that, a few minutes before every outage, a massive number of requests, along with packet retransmissions, comes from the logging pod in the storefront namespace.

Problem Observations

Joseph Yostos
Joseph is a technical marketing engineer at Tigera.

The DevOps and platform engineers are presented with the following problems:

  • The issue happens overnight when none of the storefront service owners are present to do live troubleshooting.
  • The DevOps engineer doesn’t have admin privileges in the storefront namespace and cannot run packet capture on this pod.
  • One alternative is to run tcpdump, but it is not available in the storefront images, and patching the app to add it would require further approvals that are hard to get on short notice.
  • As more customers visit the storefront, pods autoscale, introducing new logging pods that also need packet capture.
  • Due to Kubernetes’ dynamic nature, if the pod is recreated, you need to capture the traffic from the new pod.

Desired Outcome

The DevOps and platform engineers want to troubleshoot and resolve the problem quickly, with a minimum number of steps.

  • The DevOps engineer needs self-service, on-demand access to run a dynamic packet capture job in the storefront namespace in order to capture the CoreDNS problem as it happens.
  • Only the DevOps engineer and the storefront service owner should be able to retrieve and review the captured files.
  • Additional filtering is required so that captures are targeted and faster to review, and so that storage does not run out before the relevant traffic is captured.

When troubleshooting microservices and applications in Kubernetes with dynamic packet capture, you should consider the following best practices:

  • Configure packet capture files to be rotated by size and time.
  • Filter the captured traffic based on the port and protocol.
  • Enable a self-service model with RBAC controls to allow teams to troubleshoot workloads within their own namespaces without affecting the rest of the Kubernetes cluster.
  • Leverage commonly used desktop-based networking troubleshooting tools like Wireshark to analyze data from packet capture.
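
As a rough sketch of the first practice, rotation by size and time could be set cluster-wide through Calico's Felix configuration. The field names below are assumptions to verify against the reference for your Calico version, and the values are illustrative only:

```yaml
# Sketch only: capture file rotation settings.
# Assumption: these FelixConfiguration fields exist under these names
# in your Calico version; check the reference before applying.
apiVersion: projectcalico.org/v3
kind: FelixConfiguration
metadata:
  name: default
spec:
  captureMaxSizeBytes: 10000000   # rotate when a file reaches about 10 MB
  captureRotationSeconds: 3600    # rotate at least once an hour
  captureMaxFiles: 2              # rotated files kept per capture
```

In practice these fields would be merged into the existing default configuration rather than replacing it.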

Demo: Addressing the Problem Using Dynamic Packet Capture

Dynamic packet capture is a Kubernetes-native way to capture packets from a specific pod or collection of pods, with specified packet sizes and duration, so you can troubleshoot performance hotspots and connectivity issues faster. It is provided as a custom resource definition in the Kubernetes API and uses the same label-based approach that network policies use to target workloads, identifying single or multiple workload endpoints for capturing live traffic.

The following is a basic example of how to select a single workload:
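
A minimal sketch of such a manifest, assuming the Calico PacketCapture resource in the projectcalico.org/v3 API group. The resource name, namespace and label are hypothetical placeholders:

```yaml
# Sketch: capture traffic for the workload matching one label.
# Assumption: Calico PacketCapture resource, projectcalico.org/v3.
apiVersion: projectcalico.org/v3
kind: PacketCapture
metadata:
  name: sample-capture-nginx      # hypothetical name
  namespace: sample               # hypothetical namespace
spec:
  # Label selector, written the same way as in a network policy.
  selector: k8s-app == "nginx"    # hypothetical label
```

Because the selector matches labels rather than pod names, it keeps matching if the pod is recreated, and it matches every pod in the namespace that carries the label.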

Here is another example of how to select all workload endpoints in a sample namespace:
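
A sketch under the same assumptions (Calico PacketCapture resource, hypothetical name and namespace), using the all() selector to match every workload endpoint in the namespace:

```yaml
# Sketch: capture traffic for all workload endpoints in one namespace.
apiVersion: projectcalico.org/v3
kind: PacketCapture
metadata:
  name: sample-capture-all        # hypothetical name
  namespace: sample               # hypothetical namespace
spec:
  # all() selects every endpoint in the resource's namespace.
  selector: all()
```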

We will select the app “logging” and specify UDP port 53 in our manifest, as follows:
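
A sketch of what that manifest might look like, again assuming the Calico PacketCapture resource. The resource name is hypothetical, and the label key app is inferred from the description above:

```yaml
# Sketch: capture only DNS traffic (UDP port 53) from the logging pods.
apiVersion: projectcalico.org/v3
kind: PacketCapture
metadata:
  name: logging-dns-capture       # hypothetical name
  namespace: storefront
spec:
  selector: app == "logging"      # assumed label key and value
  filters:
    # Keep capture files small by recording only matching packets.
    - protocol: UDP
      ports:
        - 53
```

Under these assumptions, applying the manifest with kubectl apply -f starts the capture on every matching pod, including logging pods created later by autoscaling.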

Using namespace-based RBAC controls, we can give the DevOps engineer’s service account privileges to run packet capture in the storefront namespace.
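
A minimal sketch of such a Role and RoleBinding. The RBAC objects are standard Kubernetes; the packetcaptures resource name in the projectcalico.org API group is an assumption, and the role and service account names are hypothetical:

```yaml
# Sketch: allow one service account to manage packet captures
# in the storefront namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: packet-capture-runner     # hypothetical name
  namespace: storefront
rules:
  - apiGroups: ["projectcalico.org"]
    resources: ["packetcaptures"] # assumed resource name
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: packet-capture-runner     # hypothetical name
  namespace: storefront
subjects:
  - kind: ServiceAccount
    name: devops-engineer         # hypothetical service account
    namespace: storefront
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: packet-capture-runner
```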

NOTE: This RBAC gives the DevOps engineer privileges to run packet capture, but not to retrieve the captured files. If they try to retrieve these files, they should get an HTTP 403 (Forbidden) response: the client does not have access rights to the content, so the server refuses to return the requested resource.

To allow the DevOps engineer to access the capture files generated for the storefront namespace, a role/role binding similar to the one below can be used.
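
A sketch of what that could look like. The packetcaptures/files subresource name is an assumption to check against your Calico version, and the role and service account names are hypothetical:

```yaml
# Sketch: allow the same service account to read capture files
# generated in the storefront namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: packet-capture-files      # hypothetical name
  namespace: storefront
rules:
  - apiGroups: ["projectcalico.org"]
    resources: ["packetcaptures/files"]   # assumed subresource name
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: packet-capture-files      # hypothetical name
  namespace: storefront
subjects:
  - kind: ServiceAccount
    name: devops-engineer         # hypothetical service account
    namespace: storefront
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: packet-capture-files
```

Keeping file access in its own Role means it can be granted to the DevOps engineer and the storefront service owner without widening who may start captures.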

Finally, once the DevOps engineer has the right privileges to retrieve the captured files, they can use the following API to download the pcap files.

Once the DevOps engineer has captured the traffic needed for their analysis, they can stop the packet capture by deleting the PacketCapture resource.

Conclusion

In most incidents that call for a packet capture, the problem doesn’t last long and usually occurs at random. When it does occur, you need to move quickly to capture enough useful information to find the root cause. Given the dynamic and ephemeral nature of Kubernetes, a Kubernetes-native solution like dynamic packet capture is your most efficient option.

Ready to try dynamic packet capture for yourself? Get started with a free 14-day Calico Cloud trial.

The New Stack is a wholly owned subsidiary of Insight Partners. TNS owner Insight Partners is an investor in the following companies: Tigera.

