
The Rush to Fix the Kubernetes Failover Problem

31 Mar 2022 8:32am

Services and clusters will inevitably fail on Kubernetes, and all too often, an unfortunate SRE or operations person gets the call in the middle of the night to fix things manually. While Kubernetes does offer a failover mechanism, it is not automated in such a way that, when a cluster or service fails, the affected services are instantly transferred to a replica cluster where they resume functioning.

A new automated failover feature gives Linkerd the ability to automatically redirect all traffic from a failing or inaccessible service to one or more replicas of that service, including replicas on other clusters, Buoyant’s Alejandro Pedraza, a senior software developer, wrote in a blog post. “As you’d expect, any redirected traffic maintains all of Linkerd’s guarantees of security, reliability, and transparency to the application, even across clusters boundaries separated by the open internet,” Pedraza said.

Other leading service mesh providers, including Istio and HashiCorp, also offer similar fixes for Kubernetes’ failover shortcomings (more about that below).

Sigh of Relief

This failover functionality should prompt a sigh of relief among operations teams running Linkerd in Kubernetes environments. It prevents those teams “from having to scramble to fix Kubernetes clusters in the middle of the night, simply by automatically rerouting application traffic without any need for code changes or reconfiguration,” Torsten Volk, an analyst at Enterprise Management Associates (EMA), told The New Stack.

With Linkerd’s new automated failover feature, cluster operators can configure failover at the service level in a way that’s fully automated and also transparent to the application, Linkerd co-creator William Morgan, who is also CEO of Buoyant, told The New Stack. This means that if a component fails, all traffic to that component will be automatically routed to a replica, “without the application being aware,” Morgan said.

“If that replica is in a different cluster in a different region or even a different cloud, Linkerd’s mutual TLS implementation means that the traffic remains fully secured even if it is now traversing the open Internet,” Morgan said. “This is something Linkerd users have been asking for a long time and we’re happy to deliver it to them today.”

Istio, for its part, has supported automated failover for Kubernetes “for a while,” Christian Posta, vice president, global field CTO, for Solo.io, told The New Stack, adding “we automate away all of the config” with Solo.io Gloo Mesh.

“It largely stems from locality and priority-aware load balancing that Envoy has,” Posta said.

The locality failover sequence with Istio.
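
To make the idea of priority-aware load balancing more concrete, here is a rough sketch loosely modeled on the spillover behavior described in Envoy's documentation. It is a simplification, not Envoy's or Istio's code: the function name is hypothetical, the 1.4 overprovisioning factor is Envoy's documented default, and the sketch ignores the normalization Envoy applies when total health across all levels falls below 100%.

```python
def priority_traffic_split(healthy_pct, overprovision=1.4):
    """Return the percent of traffic sent to each priority level.

    healthy_pct[i] is the percentage of healthy hosts at priority i,
    where index 0 is the preferred (for example, local) level.
    """
    remaining = 100.0
    split = []
    for pct in healthy_pct:
        # A level can absorb traffic in proportion to its health,
        # scaled by the overprovisioning factor and capped at 100%.
        capacity = min(100.0, pct * overprovision)
        share = min(remaining, capacity)
        split.append(share)
        remaining -= share  # whatever is left spills to the next level
    return split


all_healthy = priority_traffic_split([100, 100])   # local level takes everything
half_healthy = priority_traffic_split([50, 100])   # part of the traffic spills over
local_down = priority_traffic_split([0, 100])      # full failover to the next level
print(all_healthy, half_healthy, local_down)
```

The point of the gradual spillover is that traffic shifts to a remote locality in proportion to how degraded the preferred one is, rather than flipping all at once.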

HashiCorp has also offered automated failover for some time, as described in its documentation.

The push to automate the failover functionality of Kubernetes supports the original concept of policy-driven application placement, Volk said. In this way, “DevOps teams no longer have to exactly define a specific application environment based on application requirements, but instead, developers can declare app requirements within the application code that the service mesh then matches,” Volk said.

Simple Concept

The main issue is that Kubernetes does not provide automated failover in the event of a failure. When services and clusters fail on Kubernetes, “DevOps teams must typically make changes to the application code to change traffic routing in a manner that is specific to the underlying cloud infrastructure,” Volk said. “This means, you would need to write different code for routing workloads to or between clusters on AWS, Azure, Google Cloud or other specific platforms.”

Indeed, the concept of failover is simple, Morgan said. “If a component breaks, send all traffic destined to that component to a replica that’s somewhere else, usually in another cluster. One of the biggest challenges for DevOps teams who want to use failover to improve the resilience of their applications is simply the fact that Kubernetes itself doesn’t provide any automation around this,” Morgan said. “So you can deploy replicas of application components across regions and zones, but failing over between them is left up to you. Worse, if you want to be able to failover individual services, the application somehow needs to understand how to send traffic to different replicas in the event of failure. That conflates application concerns with platform concerns and leads to maintenance problems.”
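
The separation Morgan describes can be illustrated with a small sketch in which the application only ever asks for a logical service name, and a platform-side router decides which replica answers. Everything here (the `Endpoint` and `FailoverRouter` classes, the cluster names, the `orders` service) is hypothetical and is not how any particular service mesh is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """One deployment of a service in some cluster (hypothetical model)."""
    cluster: str
    healthy: bool = True


@dataclass
class FailoverRouter:
    """Platform-side table: logical service name -> ordered endpoints."""
    routes: dict = field(default_factory=dict)

    def resolve(self, service):
        # The first entry is the primary; later entries are replicas elsewhere.
        for endpoint in self.routes[service]:
            if endpoint.healthy:
                return endpoint
        raise RuntimeError(f"no healthy endpoint for {service!r}")


router = FailoverRouter({"orders": [Endpoint("us-east"), Endpoint("eu-west")]})
before = router.resolve("orders").cluster    # primary is healthy
router.routes["orders"][0].healthy = False   # simulate the primary failing
after = router.resolve("orders").cluster     # same call, now served by the replica
print(before, after)
```

The caller's code is identical before and after the failure, which is the sense in which failover stays a platform concern rather than an application concern.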

The new failover feature in Linkerd is built on top of existing Kubernetes and Linkerd features, like health probes and SMI TrafficSplits, and introduces a minimum of new machinery and configuration surface area, Morgan said. “This is the same design principle that has made Linkerd the simplest service mesh to operate, by a wide margin,” Morgan said. “It’s part of our commitment to our users: Kubernetes is complicated enough; your service mesh doesn’t have to be.”
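
As a minimal sketch of how health-probe results could drive traffic-split weights, the function below sends everything to the primary while it is ready and shifts the weight to ready secondaries when it is not. This is an illustration under assumptions, not Linkerd's actual implementation: the function name, the backend names, and the `ready` input format are all hypothetical.

```python
def failover_weights(primary, ready):
    """Return a weight per backend, in the spirit of a traffic split.

    `ready` maps backend name -> whether its health probe currently passes.
    """
    if ready.get(primary, False):
        # Normal operation: the primary takes all traffic.
        return {name: int(name == primary) for name in ready}
    # Failover: give equal weight to every ready secondary.
    return {
        name: int(is_ready and name != primary)
        for name, is_ready in ready.items()
    }


normal = failover_weights("web", {"web": True, "web-west": True})
failed = failover_weights("web", {"web": False, "web-west": True, "web-eu": False})
print(normal, failed)
```

If no backend is ready, every weight in this sketch is zero; a real system would need an explicit policy for that case.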
