When Kubernetes Goes Bad

Thursday marked the one-year anniversary of Kubernetes Failure Stories, a carefully-curated collection of links to very public postmortems.
May 11th, 2020 6:00am by

Prior to being hosted on its own domain name, the project had existed solely in a GitHub repository. There’s a nice succinct description of its mission at the bottom of the page:

Kubernetes is a fairly complex system with many moving parts. Its ecosystem is constantly evolving and adding even more layers (service mesh, …) to the mix.

Considering this environment, we don’t hear enough real-world horror stories to learn from each other! This compilation of failure stories should make it easier for people dealing with Kubernetes operations (SRE, Ops, platform/infrastructure teams) to learn from others and reduce the unknown unknowns of running Kubernetes in production.

The site also thanks Joe Beda, who’d been one of the co-founders of Kubernetes during his time at Google as a senior software engineer, “for contributing his domain for this project!”

Along with the title, each listing includes the company name, where the postmortem appeared, a list of the system components involved, and the impact. (For example, “Partial Production outage” or “delay of 1-3 seconds for outgoing TCP connections.”)

But behind it all is a real-world Kubernetes user with a commitment to quality information, open source software, and real-world workplace results. “This is totally driven by my own and/or our motivation to learn from other people’s experiences and failures,” explained Henning Jacobs, the site’s creator, during a 2019 appearance on Google’s Kubernetes podcast.

“I think one good way is giving back and actually showing people that this is not such a big deal to share these failures.”

Sharing Stories

Jacobs is the head of developer productivity at Zalando, an e-commerce platform that runs over 140 Kubernetes clusters. It is using Kubernetes to support over 1,100 developers, and in a March 2019 blog post, Jacobs explained why. “Kubernetes’ cohesive and extensible API matters…” he wrote, adding encouragingly to his blog’s readers that “learning it is worthwhile, even if you just want to run a bunch of containers.”


Jacobs was responding to a blog post titled “Maybe You Don’t Need Kubernetes,” written by a backend engineer for the hotel listings site Trivago. (“Especially for smaller teams, it can be time-consuming to maintain and has a steep learning curve.”) Jacobs agrees there are many choices for running containers, and “all of them ‘work’, but differ heavily in what interface they provide…”

But he also argued that “Having an extensible API matters as you will sooner or later hit a use case not reflected 100% by your infrastructure API, and/or you need to integrate with your existing organization’s landscape.” He applauded custom resource definitions (CRDs), which “allow building higher-level abstractions on top of core concepts.” Jacobs’ blog post also noted there’s already a vast ecosystem built on top of the Kubernetes API, arguing that the world converged on its feature set as it did with the Linux kernel (or its service and system management component).

“I think this network effect will prevail and we will see more and more high-level tools (apps, operators, ..) for Kubernetes…” he wrote. And he even sees his own site as one of those resources. “I started collecting Kubernetes Failure Stories for no other reason than to leverage the enormous community and improve infrastructure operations (true for managed and self-hosted)…”

Beda even once joked that the domain “af” could stand for “architectural failures,” though technically it is the top-level domain name for Afghanistan.

But how does he really feel about Kubernetes? Jacobs doesn’t seem like he’s driven by antagonistic snark, but rather a sincere desire to make things better. In 2019 he was asked to be the guest on Google’s own Kubernetes podcast, where he shared his own experiences when Zalando migrated to the cloud in 2015 and said the site grew out of their commitment to open source software. Though Zalando shares their own open-source Kubernetes components on GitHub, “it’s not only about code, it’s also about how to deal with the code and components, how they work together, and how to actually set this up.”

“Maybe, when I look for failure stories, often how things are framed is this is kind of how we migrated. These are the success stories. And then on the way, the challenges… I wanted to turn it a little bit around to make it more open to learning about the failures. But I think a lot of these failures and challenges are always hidden in the success story….”

So the site shares another kind of story:

  • 10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You (Datadog’s presentation at KubeCon Barcelona 2019)
  • A Kubernetes failure story (by an anonymous Fullstaq client, sharing slides from a Dutch Kubernetes meetup in 2019)
  • How NOT to do Kubernetes (by a Senior Site Reliability Engineer at Google and a VP of Product Management at the cloud native storage solution Portworx, speaking at a Cloud Native Meetup in 2018)
  • Running Kubernetes in Production: A Million Ways to Crash Your Cluster (Zalando, slides from a presentation at DevOpsCon Munich 2018)

“It’s not about driving people away from Kubernetes,” Jacobs explained in a comment on Hacker News, warning people not to mistake the easy availability of stories for a representative sample. He points out it may not be so easy to collect stories about on-premises failures or even the other more fragmented orchestration frameworks. “We will never hear about them as they are buried inside orgs.”

It’s a point he drove home at the GOTO 2019 conference in Berlin, explaining his philosophy in a talk titled “Why I love Kubernetes Failure Stories and You Should Too.”

“We now have this ecosystem — we now have this common language we can talk about, and we can have these failure stories, which didn’t really make sense for a lot of other things, either proprietary or not so broadly adopted.”

He’s also seen the value of shared experiences as a member of the CNCF End User community. (Jacobs is also the co-chair of the CNCF End User Developer Experience SIG.) And in his talk he pointed out that while lots of people are planning to roll out Kubernetes, calls for advice aren’t always helpful.

Henning Jacobs’ talk at GoTo: Kubernetes advice from Hacker News (screenshot).

“Everybody loves failure stories, but maybe for the wrong reasons,” notes the talk’s description, citing our love of schadenfreude rather than the hope for “continuous improvement through blameless postmortems, sharing incidents, and documenting learnings.”

But the talk’s description explains it’s ultimately designed to highlight “why Kubernetes makes sense despite its perceived complexity.”

A Curated Collection

The stories keep coming. Earlier in the year a DevOps engineer for the travel search engine Omio — which runs 100% of its workloads on Kubernetes — wrote about “CPU limits and aggressive throttling in Kubernetes” resulting in high latency and errors. (“There is a serious, known CFS bug in the kernel that causes unnecessary throttling and stalls…”)
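For readers unfamiliar with the mechanism behind that throttling: Kubernetes enforces a CPU limit through the Linux CFS bandwidth controller, which grants a container a runtime quota per scheduling period (100 ms by default). The sketch below is only back-of-the-envelope arithmetic for that quota, not a reproduction of the kernel bug Omio describes. The 400m limit and the four busy threads are assumed values chosen for illustration, and the function names are hypothetical.

```python
# Rough arithmetic for CFS bandwidth control, assuming the default 100 ms period
# and assuming the node has enough free cores to run every busy thread at once.

CFS_PERIOD_US = 100_000  # default CFS scheduling period, in microseconds


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) a container may consume per period."""
    return limit_millicores * period_us // 1000


def throttled_us_per_period(limit_millicores: int, busy_threads: int,
                            period_us: int = CFS_PERIOD_US) -> int:
    """Wall-clock time per period spent throttled when every thread stays busy."""
    quota = cfs_quota_us(limit_millicores, period_us)
    # All threads draw from the same quota, so it drains busy_threads times faster.
    wall_time_until_exhausted = quota // busy_threads
    return max(0, period_us - wall_time_until_exhausted)


if __name__ == "__main__":
    # Hypothetical container: 400m CPU limit, four continuously busy threads.
    print("quota per period (us):", cfs_quota_us(400))
    print("throttled per period (us):", throttled_us_per_period(400, 4))
```

Under these assumptions the four threads exhaust the quota early in each period and the container sits idle for the rest of it, which is how a modest limit can translate into visible request latency even without any kernel bug.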

The head of DevOps for the travel search site LoveHolidays remembered the time Google Kubernetes Engine (GKE) ran out of IP addresses, resulting in a stuck deployment and blocked autoscaling of both pods and nodes. (“By default, GKE allocates 256 IPs per node, meaning that even large subnets like /16 can run out pretty quickly when you’re running 256 nodes.”)
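The arithmetic in that quote is easy to check. The sketch below takes the 256-IPs-per-node figure from the quote and simply divides a subnet's address count by it; the `10.0.0.0/16` and `10.0.0.0/20` ranges and the function name are assumptions made for illustration, not values from the LoveHolidays postmortem.

```python
import ipaddress


def max_nodes(subnet_cidr: str, ips_per_node: int = 256) -> int:
    """Upper bound on nodes a subnet can hold if each node reserves ips_per_node addresses."""
    network = ipaddress.ip_network(subnet_cidr)
    return network.num_addresses // ips_per_node


if __name__ == "__main__":
    # Example ranges only; any /16 contains 65,536 addresses.
    for cidr in ("10.0.0.0/16", "10.0.0.0/20"):
        print(cidr, "->", max_nodes(cidr), "nodes")
```

A /16 holds 65,536 addresses, so at 256 addresses per node it is fully consumed by 256 nodes, which matches the figure in the quote.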

And in 2019 a DevOps engineer for Exponea’s email list validation site blogged about why they postponed integrating Istio and deploying it into production. (“Istio made some changes that made Istio undeployable in the multitenant setup…”)

It’s undeniably a popular site. When Jacobs blogged about the site back in 2019, the link attracted over 500 upvotes — and another 236 comments. (Like “A team at my work has spent a stupid amount of time trying to nail down networking issues with hand-rolled k8 in AWS… Total pain in the ass.”) Henning himself joined the conversation, sharing another German developer’s list of serverless failure stories.

And his site even scored a mention in Sysdig’s game “Cards Against Containers,” where the answer on one of the cards is “Browsing instead of doing actual work.” But his talk at GoTo concluded with a pointed invitation for others to contribute their own postmortems, along with a reminder of why it’s necessary.

“I think as an industry we need to move there, that we are more open to sharing our failures and incidents and postmortems to learn from each other.”


TNS owner Insight Partners is an investor in: Sysdig.