
Komodor Workflows Extend Kubernetes Troubleshooting

12 Oct 2021 7:22am

Kubernetes-native troubleshooting platform Komodor promises to ease the painful process of finding out what’s gone wrong inside your Kubernetes cluster. This week at KubeCon+CloudNativeCon North America, the company is releasing a workflow feature intended to make troubleshooting easier still.

“The very nice — or not so nice — thing about Komodor is that when we go to customers and we ask them how troubleshooting is done today, who is responsible, every single org we meet is struggling to find out what’s happening in a Kubernetes cluster once you have an issue. There are so many moving parts and possible failure scenarios that can explain the issue that they are experiencing, they are simply lost,” said Komodor chief technology officer and co-founder Itiel Shwartz in an interview with The New Stack.

Kubernetes Troubleshooting

Resolving a Kubernetes outage, or any outage really, requires at least three steps, Shwartz said:

  1. Understand that you have a problem and where the cause lies.
  2. Manage and communicate to bring more people on to take action.
  3. Prevent it from happening again.

For now, Komodor focuses on the first, tracking both everything within your Kubernetes cluster and anything outside it that might affect it. The platform keeps track of how your cluster behaves and how that changes over time. When something goes wrong within your pods or clusters, it tells you what else happened around the same time, such as someone changing a replica count or source code in GitHub, so you can pinpoint what caused the error much more quickly.
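The idea of surfacing what changed around the time of a failure can be illustrated with a small sketch. This is not Komodor's implementation: the `ChangeEvent` type, the `changes_near` helper, the sample events, and the 30-minute window are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """A hypothetical record of a change in or around the cluster."""
    timestamp: datetime
    source: str        # illustrative labels such as "kubernetes" or "github"
    description: str


def changes_near(events, incident_time, window=timedelta(minutes=30)):
    """Return events within `window` of the incident, most recent first."""
    nearby = [e for e in events if abs(e.timestamp - incident_time) <= window]
    return sorted(nearby, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data, not real output from any tool.
events = [
    ChangeEvent(datetime(2021, 10, 12, 6, 50), "kubernetes", "replica count changed from 3 to 1"),
    ChangeEvent(datetime(2021, 10, 12, 7, 5), "github", "commit merged to main"),
    ChangeEvent(datetime(2021, 10, 11, 14, 0), "kubernetes", "config map updated"),
]
incident = datetime(2021, 10, 12, 7, 10)

for event in changes_near(events, incident):
    print(event.timestamp.isoformat(), event.source, event.description)
```

With this sample data, only the replica change and the merged commit fall inside the window; the config map update from the previous day is filtered out.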

Kubernetes and SRE

More and more organizations have one or two Kubernetes experts or site reliability engineers. But that’s not enough at all. Despite widespread adoption, there’s still a huge Kubernetes talent gap for these more senior, experienced roles.


“People don’t know what to do and even what question to ask or where to look,” Shwartz explained. Even if they have the right question, he said, it still takes a lot of time to dig into their Jenkins pipeline or check all their feature flags. The complexity of Kubernetes systems means that even when engineers know what the likely error is, confirming it can take a long time.

Even for an expert who knows all about Kubernetes, ingesting that data still demands a lot of time and cognitive load. This means the one or two experts within an organization are naturally bombarded with questions, risking burnout.

“Not succeeding troubleshooting is very frustrating, but, also, being the hero day after day is very frustrating,” Shwartz said.

This difficulty in discovering the cause of failure also affects KPIs, as outages last significantly longer.

Komodor looks to answer the first round of questions, giving less senior people far more information about an issue before they reach out to their Kubernetes expert. Its target audience is companies that move fast and are trying to shift left so that developers can make changes themselves. Komodor supplies the questions for those junior engineers out of the box, asking what a very senior SRE or DevOps person would ask.

“We take our very deep understanding on Kubernetes for how the Kubernetes structure should look, and we do this out of the box. And you get answers faster,” Shwartz said.

He also pointed out that while it is already easy to see what’s currently happening within Kubernetes, it’s nearly impossible to find historical data on system changes. Komodor is built not only to address the lack of knowledge and the extreme complexity, but also to let people go back in time a day or even a week to track changes that may have caused an issue.

New Features of Komodor

Today Komodor is launching a new workflow feature to help users reach a deeper understanding of the root cause.

“We wanted our users to understand what is happening in the cluster and, when there’s a problem, we wanted them to have a crystallized view of what is interesting to look at,” Shwartz said.

The new workflow feature runs a series of checks to narrow down where the root cause might lie, then sends the findings to the user to help diagnose the problem.

Imagine your application is crashing. It has three pod replicas, each running on a different Kubernetes node. You know something is wrong, so you use Komodor to review it, and it shows that one of the three replicas is unhealthy. It alerts you that something is wrong on a particular node. You then know you might have a problem with your infrastructure and how you provision the nodes.
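The reasoning in that scenario, narrowing an application problem down to a single node, can be sketched in a few lines. This is a simplified illustration and not how Komodor works internally: the `PodStatus` type, the `unhealthy_by_node` helper, and the pod and node names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PodStatus:
    """A hypothetical, simplified view of one pod replica."""
    name: str
    node: str
    healthy: bool


def unhealthy_by_node(pods):
    """Group the names of unhealthy pods by the node they run on."""
    grouped = defaultdict(list)
    for pod in pods:
        if not pod.healthy:
            grouped[pod.node].append(pod.name)
    return dict(grouped)


# Made-up sample data: three replicas, each on its own node.
pods = [
    PodStatus("app-0", "node-a", True),
    PodStatus("app-1", "node-b", False),
    PodStatus("app-2", "node-c", True),
]

for node, names in unhealthy_by_node(pods).items():
    print(f"{node}: unhealthy replicas {names}")
```

If the unhealthy replicas cluster on one node while the others stay healthy, the node itself becomes the first thing to investigate.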

As of today, Komodor has about a dozen workflows that look to automate common troubleshooting scenarios. The workflow starts by checking and analyzing the state of pods, ingress, load balancers, PersistentVolumeClaims (PVC), services, and endpoints, in order to give you a detailed view of where the problem lies.

Soon they will allow users to customize these workflows for their own needs, “but we are bringing a lot of value out of the box without them doing anything.”
