
Gremlin Soundproofs Chaotic Pods in Kubernetes Clusters

18 Nov 2020 4:00am

Honeycomb sponsored The New Stack’s coverage of Kubecon+CloudNativeCon North America 2020.

Kubernetes has become a sort of myth. Derived from the Greek word for helmsman or pilot, the word “Kubernetes” strikes tech executives as something that steers and accelerates their containerized workloads in some magical way. The hard truth is that Kubernetes still requires a lot of effort to fly safely in the right direction.

A whole segment of the IT industry has been established around making Kubernetes easier and safer to drive. You actually have to experiment with it a lot to understand your system and make sure it works the way you expect it to. And it’s never just your piece of the complex orchestration. Hundreds of different ephemeral services can collide in a single cluster, sharing CPU, memory and security permissions. It makes for a lot of noise that can be frankly chaotic.

This week, for Kubecon+CloudNativeCon North America, chaos testing provider Gremlin updated its chaos engineering platform to run targeted experiments on isolated objects, verifying that the whole Kubernetes environment can cope if something happens to that one object. The aim is to let engineers build confidence in, and understanding of, their Kubernetes deployments.

In effect, Gremlin can soundproof individual pods so that if a service bumps into a neighboring service from another team, performance doesn’t suffer. And, on the flip side, the chaos-dropping devs can then zero in on the specific services they are testing for resilience. Besides letting teams proactively understand how Kubernetes will behave in production, the platform offers the ability to perform more granular attacks on specific services.

Lorne Kligerman, Gremlin’s senior director of product, told The New Stack that adoption of Gremlin typically begins with one team within an organization. Before now, shared resources in Kubernetes meant that when you targeted a Kubernetes deployment, the chaos could potentially rain down on other containers too. While there’s a time and a place to run experiments on an entire cluster, engineering teams often want to get specific without affecting tangential services or applications.

As Kligerman explains in a blog post:

Kubernetes allows for packing multiple pods onto a single node and scaling out each pod individually without impacting neighboring pods. Horizontal Pod Autoscaling (HPA) helps squeeze more utilization out of your infrastructure by scaling out only pods that have reached their resource limits, saving costs versus scaling out entire applications. Resource Limits prevent containers from over-utilizing resources and disrupting other services that share a node. However, if applications aren’t tested for HPA and resource limits, it’s difficult to determine if your application is decoupled enough to scale out pods independently and to know if noisy neighbors can still break services sharing the same node.
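
To make the scaling rule in that passage concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: the desired replica count is the current count multiplied by the ratio of observed to target metric, rounded up, with no change when that ratio sits within a tolerance (0.1 by default). The function and parameter names are illustrative assumptions, not part of Gremlin or the Kubernetes API.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Sketch of the HPA scaling rule (names here are hypothetical).

    current_metric and target_metric share a unit, for example
    average CPU utilization in percent across the pods.
    """
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale proportionally and round up.
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # Four pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
    # Four pods averaging 63% CPU against a 60% target (within tolerance).
    print(desired_replicas(4, 63, 60))
```

Running a CPU attack against a single service and comparing the observed replica count with this calculation is one way to check whether that service really scales out on its own.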

“Customers can now experiment on one service at a time in a multitenant cluster and be confident that only that service will be impacted so that they can make sure that they are diagnosing the problems that they are looking for, not just the unknown,” Kligerman said.

“Failure is going to happen. What’s important is: Can your system mitigate that failure for what’s important to you?” — Lorne Kligerman, Gremlin

Gremlin does this using control groups, or cgroups, the Linux kernel mechanism for limiting and accounting for the resources a group of processes can use, so an attack is scoped to a container’s processes rather than to the whole machine. By getting this granular, you can verify that you’ve architected Kubernetes so that one application spiking CPU or memory usage, for example, doesn’t impact other applications on the same cluster.
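
As a rough illustration of what a cgroup limit means in practice, the Python sketch below converts a Kubernetes-style CPU limit in millicores into a CFS quota (the microseconds of CPU time a cgroup may use per scheduling period) and caps a container’s demand at that limit. It assumes the common 100,000-microsecond period and cgroup v1-style bandwidth control; the function names are hypothetical, and this is not Gremlin’s implementation.

```python
# Assumption: the common default CFS period of 100 ms, in microseconds.
DEFAULT_PERIOD_US = 100_000


def cfs_quota_us(limit_millicores, period_us=DEFAULT_PERIOD_US):
    """CPU time in microseconds a cgroup may consume per period.

    1000 millicores equals one full CPU, so a 500m limit gets
    half of each period.
    """
    return limit_millicores * period_us // 1000


def cpu_granted(demand_millicores, limit_millicores):
    """Simplified model: usage is capped at the limit, however high demand spikes."""
    return min(demand_millicores, limit_millicores)


if __name__ == "__main__":
    # A container limited to half a CPU.
    print(cfs_quota_us(500))
    # The same container trying to burn 1.8 CPUs during an attack.
    print(cpu_granted(1800, 500))
```

In this simplified model, a container under a CPU attack cannot take more than its quota, which is why a correctly limited noisy neighbor should leave the other services on the node alone.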

A common scenario is two teams with 20 services each that don’t realize they are in the same cluster. One team is using Gremlin while the other isn’t. They can be in for a surprise.

Kligerman said this feature has been particularly popular among Gremlin’s most engaged customers in the financial space, where clusters keep getting larger, with hundreds of services spread across clusters and nodes.

He said, “A big part of chaos engineering is not just to blindly unleash chaos. The practice of chaos engineering is about well-thought-out experiments so that, as you inject the failure, you are observing what takes place, whether you can mitigate failure or not, so you can make sure you are providing the best experience for your customers when that failure does occur.”

Chaos engineering is the scientific application of precise attacks on controlled parts of your systems so you learn how they will react. This control is essential to figuring out why something happens and to contain the blast radius, all while not affecting customer experience in production.

While this new feature allows teams to test their services one at a time, it doesn’t mean you should always soundproof yourself from your neighbors. You should still warn them, much as you would tell them ahead of time that you’re going to have a party and ask them to let you know if it bothers them. They will be less likely to complain if you were thoughtful in advance. There is still a compelling test case for seeing how all services within a single cluster react and interact when one or more services are affected. Now, with this new Gremlin feature, you can simply prevent chaos without consent.

Gremlin now allows you to experiment with expanding the impact — including on other teams — as you grow your chaos engineering practice.

Feature image by Kokaleinen from Pixabay.
