
Understand Kubernetes with Splunk Observability Cloud

Kubernetes is a different animal and needs a more mature approach to monitoring. Here are the three capabilities you'll need: tagging, AI-guided troubleshooting, and root cause analysis.
Oct 19th, 2022 8:00am

Kubernetes is the hottest trend in software delivery, and it’s only picking up steam as more people find out about it. Kubernetes has many benefits and is extremely powerful, but with great power comes great respon… difficulty in monitoring.

Observability tools built from the ground up to handle cloud native applications, like Splunk Observability Cloud, are your secret weapon for making sense of a complicated Kubernetes environment.

The Role of Observability in a Kubernetes Environment

While Kubernetes provides many benefits (e.g., easy scaling, no need to plan which specific machines an app will run on, etc.), it also creates new challenges. The biggest of these challenges is the added complexity. A Kubernetes-based environment is commonly very large, running dozens or hundreds of microservices across a large number of machines.

The volume of data generated by splitting applications into containers and using orchestration systems to run them is simply too great for classic tools to handle. Each container creates metrics, and each deployed application does as well.
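To get a feel for the scale involved, here is a back-of-envelope estimate of datapoint volume for a modest cluster. Every count below is an illustrative assumption, not a measurement from any real environment:

```python
# Back-of-envelope estimate of Kubernetes metric volume.
# All counts below are illustrative assumptions, not measurements.
nodes = 50
pods_per_node = 30
containers_per_pod = 2
metrics_per_container = 100      # CPU, memory, network, filesystem, ...
scrape_interval_s = 10

containers = nodes * pods_per_node * containers_per_pod
datapoints_per_scrape = containers * metrics_per_container
datapoints_per_day = datapoints_per_scrape * (86_400 // scrape_interval_s)

print(f"{containers:,} containers -> {datapoints_per_day:,} datapoints/day")
```

Even these conservative numbers yield billions of datapoints per day, before counting traces and logs, which is why tooling built for smaller, static fleets struggles.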

Observability, as an evolution of monitoring, helps you make sense of these huge volumes of data using AI and machine learning technologies. One way that Splunk Observability helps you do this is through Kubernetes Navigator within Infrastructure Monitoring — offering a map-based view of your entire Kubernetes cluster, in addition to more specific views of nodes, pods, and containers. See a screenshot of Kubernetes Navigator below:

Screenshot of Kubernetes Navigator

In the rest of this article, I'll discuss the three capabilities you need to effectively operate a Kubernetes environment: tagging, AI-guided troubleshooting, and root cause analysis.

Observability Capabilities: Making Sense of All the Data

Telemetry data is the most essential thing an observability product consumes. This includes metrics, traces, and logs, but also events like user sessions and synthetic tests. This data is what enables all of the other cool stuff observability provides, so throwing it away through sampling is counterproductive to the goal of seeing everything. In a complicated system like Kubernetes, sampling may cause you to miss critical events until it's too late and customers start to notice.

You want to be certain that your observability tool provides full fidelity with your data and does not sample, unless you understand and are willing to accept the potential risks sampling adds to your environment. Splunk Observability Cloud makes it easy to ingest all of your data with no sampling, and in real time. You can see issues anywhere in your infrastructure in seconds, without waiting for an alert rollup, polling, or for data to go through a complicated ingestion pipeline.
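The risk sampling adds is easy to quantify. This short sketch (the 10% sample rate and trace counts are hypothetical) computes the chance that a head-based sampler drops every trace of a rare failure:

```python
# Probability that head-based sampling drops every trace of a rare failure.
# The 10% sample rate and trace counts are illustrative, not from any vendor.
def p_all_dropped(sample_rate: float, error_traces: int) -> float:
    """Chance that none of the error traces survive uniform sampling."""
    return (1 - sample_rate) ** error_traces

for k in (1, 5, 20):
    kept = 1 - p_all_dropped(0.10, k)
    print(f"{k:2d} error traces: {kept:.0%} chance at least one is kept")
```

With a single failing request, a 1-in-10 sampler misses the evidence 90% of the time; you need dozens of failures before you can count on seeing even one.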

The flagship view of all of this data is through our dynamic service map, shown below. On this service map, you can see how the services your Kubernetes workload supports interact with each other. This view can be customized using the dropdowns at the top to specific Kubernetes environments, or even specific business workflows.

Splunk's Dynamic Service Map

  1. Tagging

Tagging is another key capability for a Kubernetes environment. Tagging lets you add pieces of metadata to your observability data: which region an instance is deployed in, which customer ID a request is associated with, which canary deployment or A/B test is in use, or other business dimensions that matter to you, such as a customer's tier or the marketing campaign they're responding to. This metadata is essential for quickly troubleshooting issues and for determining how different types of users are experiencing your application.
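As a minimal sketch, attaching tags to a piece of trace data might look like this. The `Span` class and tag names here are invented for illustration; in a real system you would set equivalent key/value attributes on OpenTelemetry spans rather than on a homemade class:

```python
from dataclasses import dataclass, field

# Minimal sketch of attaching tags (key/value metadata) to a trace span.
# The Span class and tag names are illustrative, not any vendor's API.
@dataclass
class Span:
    name: str
    tags: dict = field(default_factory=dict)

span = Span("checkout")
span.tags.update({
    "deployment.environment": "prod-us-east-1",  # which region/environment
    "customer.id": "cust-48151623",              # per-request business context
    "canary": "v2-experiment",                   # canary / A/B test cohort
    "customer.tier": "enterprise",               # business dimension
})
print(sorted(span.tags))
```

Because tags ride along with every request, you can later slice any metric (error rate, latency) by any of these dimensions without pre-building a dashboard for each combination.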

You’ll want to make sure that your platform supports tags with an arbitrary number of distinct values (known as “cardinality”); being able to tag every request into your app by specific user, and then follow that user's particular experience through your application, is incredibly powerful. Most platforms don’t support this, or charge exorbitantly for tags with a large number of values. Make sure you don’t fall into that trap. Splunk Observability Cloud supports tags with cardinality in the millions and beyond.

In Splunk Observability, we offer a powerful view of tagged data called Tag Spotlight, shown below. Tag Spotlight lets you see at a glance how RED metrics (requests, errors, and duration) differ by specific tag. Quick links at the bottom of Tag Spotlight let you jump to related views, including the relevant Kubernetes clusters, cloud instances, and logs, to solve issues quickly. In this example, the version tag makes it immediately clear that we have a problem with version 350.10: every request tagged with that version is erroring out.

Splunk's tag spotlight
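The rollup Tag Spotlight performs can be sketched in a few lines: group spans by a tag and aggregate RED metrics per group. The spans and version numbers below are made up to mirror the screenshot's scenario:

```python
from collections import defaultdict

# Sketch of a Tag Spotlight-style rollup: RED metrics (requests, errors,
# duration) grouped by the "version" tag. Spans and versions are invented.
spans = [
    {"version": "350.9",  "error": False, "duration_ms": 42},
    {"version": "350.9",  "error": False, "duration_ms": 51},
    {"version": "350.10", "error": True,  "duration_ms": 9},
    {"version": "350.10", "error": True,  "duration_ms": 11},
]

red = defaultdict(lambda: {"requests": 0, "errors": 0, "duration_ms": 0})
for s in spans:
    bucket = red[s["version"]]
    bucket["requests"] += 1
    bucket["errors"] += s["error"]
    bucket["duration_ms"] += s["duration_ms"]

for version, m in sorted(red.items()):
    rate = m["errors"] / m["requests"]
    print(f"version {version}: {m['requests']} reqs, {rate:.0%} errors")
```

Grouping by tag like this is what turns raw spans into the "version 350.10 is 100% errors" insight, with no dashboard built in advance for that particular version.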

  2. AI-Guided Troubleshooting

Most mature software development organizations deploy code many times a day. Each one of these deployments changes the way data flows through the various microservices that make up the application. Additionally, with Kubernetes, deployments will almost certainly result in pods being reassigned to different nodes, resource limits being changed or created, and other internal Kubernetes manipulation that you need to be aware of.

The frequency of these changes and the complexity of a containerized application create thousands of potential interactions and points of failure. Traditionally, you’d need to prepare for all of these failures by creating a dashboard and alert for each of them. This is error-prone and time-consuming. Additionally, with the complexity provided by deploying containerized microservice apps that likely call each other in new ways with each release, this simply doesn’t scale. Finding root causes requires too much effort (finding dashboards, switching tools, etc.) without an AI-guided tool.

In Splunk Observability Cloud, we provide actual end-to-end visibility with a directed troubleshooting experience that includes business context (e.g., what specific business workflow is affected) and tells engineers why a problem occurred, its impact, and where to look to investigate and debug the problem. You don’t need to look through logs hoping to find the problem — our service map shows you which services throughout your architecture are failing. In this example, you can see that the API is returning some errors — about 7% of the time — but you can also see that the API service itself is not the root cause.

End-to-end observability

  3. Root Cause Analysis

Finally, when problems happen, time is money. When something fails, you need to see what the root cause of the failure is and fix it, fast. Splunk Observability Cloud can help you do this via dynamic service maps that draw out the relationship between microservices in your application and then highlight the root cause right away. That way, you’re not swimming through thousands of lines of frontend logs when the issue is a faulty backend service release.

Root cause analysis is a table-stakes feature for observability, so you’ll want to make sure that your platform can identify issues even in a complex deployment across multiple clouds, in a hybrid cloud, or in a serverless world.

In Splunk Observability Cloud, you can see a solid red dot on the service map that shows you the service that is the root cause of upstream errors — in the screenshot above, you can see that the API service is not the root cause of any errors, but the final path points directly to the payment service, which is the root cause of the errors in this example.

Hovering over that service shows that 12% of the requests to it (about 16,000) have errors, and that this service is the root cause of those same 16,000 errors. In other words, this service is the sole cause of all of these issues. We were able to find it with zero clicks once on the service map; all the exploration was done simply by hovering over the map.
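The core idea behind this kind of root-cause isolation can be sketched simply: a service is a root-cause candidate when it reports errors but none of its downstream dependencies do, so the failure cannot be blamed on anything further along the call path. The topology and error counts below are invented to mirror the example:

```python
# Sketch of root-cause isolation on a service map: a service is a root-cause
# candidate when it reports errors but none of its downstream dependencies do.
# The topology and error counts are invented for illustration.
calls = {            # service -> downstream dependencies it calls
    "frontend": ["api"],
    "api": ["payment", "catalog"],
    "payment": [],
    "catalog": [],
}
errors = {"frontend": 16_000, "api": 16_000, "payment": 16_000, "catalog": 0}

root_causes = [
    svc for svc, deps in calls.items()
    if errors[svc] > 0 and not any(errors[d] > 0 for d in deps)
]
print(root_causes)  # -> ['payment']
```

The `api` service shows errors too, but because its `payment` dependency is also failing, the blame propagates downstream and only `payment` survives as the root cause; real platforms apply far more sophisticated statistics, but the direction of reasoning is the same.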

Getting Started

Many vendors offer observability products. The best way to start with any of them is to integrate OpenTelemetry into your applications, so that you’re ready to emit the necessary data to any observability platform. Using a proprietary agent might be faster in the short term, but can make it difficult to change platforms later. Splunk offers a 14-day free trial of Splunk Observability Cloud, which uses OpenTelemetry as its native data format, so it’s a great place to get started. It also has the capabilities mentioned above.
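A common vendor-neutral starting point is to run an OpenTelemetry Collector between your apps and whatever backend you choose. The fragment below is a minimal sketch of that pattern; the endpoint and token are placeholders, not real Splunk values:

```yaml
# Minimal OpenTelemetry Collector config sketch: receive OTLP from your
# apps and forward it to any OTLP-compatible backend.
# The endpoint and token below are placeholders, not real Splunk values.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    endpoint: https://ingest.example.com/otlp   # placeholder backend
    headers:
      authorization: "Bearer ${ACCESS_TOKEN}"   # placeholder token

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
```

Because the apps only ever speak OTLP to the Collector, switching backends later is a config change in the exporter section, not a re-instrumentation of your code.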

However you choose to start your observability journey, it’s important to recognize that Kubernetes is a different animal and needs a more mature tool. Make sure that your tool provides the essential capabilities described above before making any decision. Additionally, make sure that your team has taken some training in Kubernetes and understands how the platform operates.
