Kubernetes Networking Bug Uncovered and Fixed
The bad news is that you can install newer versions of Kubernetes and, bang, you can’t network with your worker nodes. The good news is a Mirantis engineer found some fixes and is sharing them.
There’s nothing quite like setting up a bright new Kubernetes cluster, deploying it, and (what the heck!) watching your worker nodes lose network connectivity. You can’t ssh into them. You can’t ping them. They’re “Gone.”
Finding the Problem
Of course, they’re not actually gone; you just can’t get to them. After the problem was discovered in mid-September, the search for its cause began.
The heart of the problem is that iptables 1.8.8 breaks compatibility with older versions for some rules. When a new “-m mark” rule is created with iptables 1.8.8, older versions of iptables cannot read it correctly. This was actually an expected result. The root cause is that newer iptables-nft uses a native nftables expression to match on the packet mark, while older versions use the xtables extension in the kernel. The netfilter developers didn’t expect users to run two different versions of iptables. Of course, with containers, that can often happen.
As Mirantis software engineer Jussi Nummelin explains, “If the host is using iptables 1.8.8, when components like kube-router CNI start to write their own network policy and rules with an older version of iptables — kube-router ships with iptables 1.8.7 — things go BOOM! because kubelet, via iptables 1.8.8 as supplied by the host, writes:
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
“Then kube-router, which uses the earlier version of iptables supplied by kube-proxy, does its normal iptables-save; modify/add rules; iptables-restore, but doing so with the older version results in reading and re-inserting the rule as:”
-A KUBE-FIREWALL -m comment --comment "kubernetes firewall for dropping marked packets" -j DROP
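Get that? As an ancient network administrator, I’ll explain what happens. The rewritten rule has lost its “-m mark” match, so instead of dropping only the packets kubelet marked, it drops everything. Because the rule is “corrupted,” it blocks ALL network traffic on the host. Or, as Nummelin put it, “You know you’re in trouble when you can’t even ping localhost anymore.”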
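To see why losing the mark match is so destructive, here is a minimal Python sketch of how the two versions of the rule behave. It assumes the usual mark-match semantics (the packet mark is ANDed with the mask, then compared with the value). The function names are made up for illustration and are not part of iptables or Kubernetes.

```python
# Toy model of the KUBE-FIREWALL rule, not real netfilter code.

def mark_matches(packet_mark: int, value: int = 0x8000, mask: int = 0x8000) -> bool:
    """Model of `-m mark --mark value/mask`: AND the mark with the mask, then compare."""
    return (packet_mark & mask) == value


def verdict(packet_mark: int, rule_has_mark_match: bool) -> str:
    """Return what the rule does to a packet carrying the given mark.

    With the mark match intact, only marked packets are dropped.
    With the match stripped, every packet reaches `-j DROP`.
    """
    if rule_has_mark_match and not mark_matches(packet_mark):
        return "CONTINUE"  # rule does not apply; packet moves on to the next rule
    return "DROP"


if __name__ == "__main__":
    for mark in (0x0, 0x8000):
        print(hex(mark), "intact rule:", verdict(mark, True),
              "| corrupted rule:", verdict(mark, False))
```

An ordinary, unmarked packet sails past the intact rule but is dropped by the corrupted one, which is why even localhost stops answering.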
In addition, Nummelin explained, “while in this specific case, we’re using kube-router, all of this can actually happen with any other networking components that use iptables in pods/containers.”
Why yes, networking is very fragile in Kubernetes.
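If you suspect a host has been hit, one rough check is to scan iptables-save output for a KUBE-FIREWALL rule that drops packets with no match other than its comment. The sketch below is a heuristic under that assumption, not an official tool. The function names are hypothetical, and it only inspects the text you hand it.

```python
import shlex


def is_unconditional_drop(rule_line: str, chain: str = "KUBE-FIREWALL") -> bool:
    """True if the line appends a DROP rule to `chain` with no match besides a comment."""
    try:
        tokens = shlex.split(rule_line)
    except ValueError:
        return False  # unbalanced quotes; not a line we can reason about
    if tokens[:2] != ["-A", chain]:
        return False

    rest = tokens[2:]
    remaining = []
    i = 0
    while i < len(rest):
        if rest[i] == "-m" and i + 1 < len(rest) and rest[i + 1] == "comment":
            i += 2  # skip the comment module itself
        elif rest[i] == "--comment" and i + 1 < len(rest):
            i += 2  # skip the comment text
        else:
            remaining.append(rest[i])
            i += 1
    # If nothing but the jump is left, the rule drops every packet it sees.
    return remaining == ["-j", "DROP"]


def find_corrupted_rules(save_output: str, chain: str = "KUBE-FIREWALL") -> list:
    """Return the lines of iptables-save output that look like the broken rule."""
    return [line for line in save_output.splitlines()
            if is_unconditional_drop(line, chain)]
```

You would feed find_corrupted_rules the text produced by running iptables-save on the affected host; an empty list means this particular pattern was not found, not that the host is healthy.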
The fix is to “ensure every single networking component is actually using the same version of iptables as the host, and also that all of those are also using the same backend (legacy vs. nftables).”
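As a rough illustration of that check, the Python sketch below compares the text printed by `iptables --version` on the host with the same output captured inside a networking container. It assumes the common output shape, such as "iptables v1.8.7 (nf_tables)" or "iptables v1.8.7 (legacy)". The helper names are hypothetical, and how you collect the two strings is left to you.

```python
import re
from typing import NamedTuple, Optional


class IptablesInfo(NamedTuple):
    version: tuple          # e.g. (1, 8, 7)
    backend: Optional[str]  # "nf_tables", "legacy", or None if not printed


# Matches strings such as "iptables v1.8.7 (nf_tables)"; the backend is optional.
_VERSION_RE = re.compile(r"iptables v(\d+(?:\.\d+)*)(?:\s+\(([\w-]+)\))?")


def parse_version(output: str) -> IptablesInfo:
    """Extract the version tuple and backend from `iptables --version` output."""
    match = _VERSION_RE.search(output)
    if match is None:
        raise ValueError(f"unrecognized iptables version output: {output!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return IptablesInfo(version, match.group(2))


def is_consistent(host_output: str, container_output: str) -> bool:
    """True only when both the version and the backend are identical."""
    return parse_version(host_output) == parse_version(container_output)


if __name__ == "__main__":
    host = "iptables v1.8.8 (nf_tables)"
    container = "iptables v1.8.7 (nf_tables)"
    print("consistent" if is_consistent(host, container) else "MISMATCH")
```

The comparison is deliberately strict: a matching version with a different backend still counts as a mismatch, since the quote above calls for both to agree.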
So why do we need the same version? Can’t Kubernetes just compensate? No, no it can’t. That’s because, Nummelin said, “the netfilter team really does not have any guarantees on version compatibility. (The iptables team, in particular, specifically can’t guarantee backward compatibility, given that it’s a pre-containers era tool, which was built on the assumption that one would never have more than one version of iptables on the host.)”
That’s fair. Netfilter and iptables, after all, date back to 1999. And, if you think they’re bad, you never had to use their predecessor, ipchains.
So what can you do? If you’re using k0s, the fixes are:
- Detecting the iptables mode using the iptables-wrappers script: This gives us the maximum probability of getting everything working in the same mode.
- Shipping the iptables binary with k0s: This way, operating system upgrades can’t break things because k0s never relies on the version provided by the OS.
- Shipping iptables 1.8.7 with k0s: This way, we stay in sync with other components, and we actually test the combinations.
If you’re not using k0s, you do have a couple of other options:
- Downgrade iptables on your host to 1.8.7 to eliminate the incompatibility.
- Run kubelet with --feature-gates=IPTablesOwnershipCleanup=true, which will cause it to not create the problematic “-j DROP” rule. As Dan Winship, @danwinship, points out on GitHub, “Of course, this is an alpha feature, and you may run into problems with components that assume kubelet still creates those rules, but if you do, then you can report them and help that KEP move forward.”
I’ve read through the GitHub comments on both the Kubernetes and iptables sides. It’s, in a word, “messy,” and I don’t see a fix coming anytime soon that will make Kubernetes users happy. So, for now, I think downgrading host iptables to 1.8.7 is your best move.