How Meta Patches Linux at Hyperscale

Patching Linux is easy. Except when you need to patch tens of thousands of servers without downtime. Here's how Meta does it.
Dec 1st, 2023 3:00am
Feature image by Casey Allen on Unsplash.    

RICHMOND, Va. — Anyone with a tech clue can patch a Linux server. But, patching thousands of them without any downtime, that’s not easy.

At the Linux Plumbers Conference, the invite-only conference of top Linux kernel developers earlier this month, Meta Linux kernel engineer Breno Leitao explained how Facebook pulls the trick off with its millions of servers around the world.

With ordinary techniques, Leitao said, it would take more than 45 days to roll out a new kernel to all machines. As he put it, “Draining and un-draining hosts is hard.” You can say that again.

That may be fine if it’s a minor update, but if it’s a security patch, that won’t work.

So, Meta uses Kernel Live Patching (KLP) with Red Hat’s Kpatch to deliver fast patches. With KLP, you can apply the latest security updates to Linux kernels without rebooting, which maximizes system uptime and availability.

Live Kernel Patches

Kernel live patches are delivered as packages of modified code, separate from the main kernel package. The live patches are cumulative, so the latest patch for a given kernel package contains all the fixes from the previous ones. Each kernel live patch package is tied to the exact kernel revision for which it is issued.
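To make the cumulative, revision-locked rule concrete, here is a minimal Python sketch of how a host might choose which live patch to apply. The `LivePatch` record, the `pick_live_patch` helper, and the kernel release strings are all hypothetical, invented for illustration; they are not Meta's tooling or any distribution's package format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LivePatch:
    """Hypothetical record describing one live patch package."""
    kernel_release: str  # exact kernel revision the patch was built for
    patch_version: int   # grows with each cumulative release


def pick_live_patch(running_kernel: str,
                    available: Iterable[LivePatch]) -> Optional[LivePatch]:
    """Return the newest live patch built for exactly this kernel, or None."""
    # A patch built for any other kernel revision is never a candidate.
    matching = [p for p in available if p.kernel_release == running_kernel]
    # Patches are cumulative, so the highest version already contains
    # every earlier fix; nothing older has to be applied first.
    return max(matching, key=lambda p: p.patch_version, default=None)


# Made-up catalog: two patches for one kernel, one for a neighboring build.
catalog = [
    LivePatch("6.4.3-100.example", 1),
    LivePatch("6.4.3-100.example", 3),
    LivePatch("6.4.3-101.example", 2),
]
chosen = pick_live_patch("6.4.3-100.example", catalog)
```

On a real host the running revision could come from Python's `platform.release()`. A host whose kernel has no matching patch gets `None` back and would need a conventional kernel upgrade instead.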

Live patches won’t work on everything, though. You can’t patch data or data structures. Another problem is that extra engineering work is usually required to make a live patch. As Leitao warned, “It’s not just as simple as compiling the live patch, and knowing it’ll be safe and applying it. These are kernel modules, you can break things if you’re not careful. There are no guarantees provided that the patch itself is correct.”

Kpatch compares the original and patched kernels and then uses a customized kernel module to patch the new code into the running kernel. The Kpatch process then watches the stack of existing processes using ftrace to see if a patch can be made without any harmful effects.

When it’s safe, it redirects the running code to the patched functions and then removes the now outdated code. And, there you are, your server’s patched, and there’s been no downtime.
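One way to see whether that hand-off has finished is to read the kernel's live patching sysfs directory. The sketch below assumes a kernel built with the upstream livepatch infrastructure, which exposes `/sys/kernel/livepatch/<patch>/enabled` and `/sys/kernel/livepatch/<patch>/transition`; older out-of-tree setups may lay things out differently. The helper names are illustrative and are not part of Kpatch.

```python
from pathlib import Path
from typing import Dict, Optional


def _read_flag(path: Path) -> Optional[bool]:
    """Read a sysfs flag file holding "0" or "1"; None if it is absent."""
    if not path.is_file():
        return None
    return path.read_text().strip() == "1"


def live_patch_status(
    root: Path = Path("/sys/kernel/livepatch"),
) -> Dict[str, Dict[str, Optional[bool]]]:
    """Map each loaded live patch to its enabled and transition flags."""
    status: Dict[str, Dict[str, Optional[bool]]] = {}
    if not root.is_dir():
        # No livepatch support on this host, or nothing has been loaded.
        return status
    for patch_dir in sorted(root.iterdir()):
        if patch_dir.is_dir():
            status[patch_dir.name] = {
                "enabled": _read_flag(patch_dir / "enabled"),
                # True while tasks are still being moved to the new code.
                "transition": _read_flag(patch_dir / "transition"),
            }
    return status


if __name__ == "__main__":
    for name, flags in live_patch_status().items():
        print(name, flags)
```

A patch that reports `enabled` as true and `transition` as false has been fully applied. The `root` parameter exists so the function can be pointed at a test directory instead of the real sysfs tree.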

Of course, it’s not that simple in practice. Leitao explained, “At Meta, when we apply a live patch, it usually takes one to two seconds to apply the patch to the host. That’s to a single host, obviously not to like the whole fleet of servers, but one to two seconds for a host is really, really fast compared to even kexec,” the Linux kernel mechanism for booting a new kernel. “It doesn’t require any downtime or workload migration, you just apply the live patch, and off you go.”

How to Patch Millions of Machines

But, when you’re talking about millions of machines, that’s not the entire story. Meta expects to find bugs during its patch rollouts, so administrators start by patching a release candidate tier. As the package roller delivers the RPM-based patches, the servers’ health is automatically checked.

Meta looks for crashes, major alarms, application problems, and performance issues in the new kernels. This data is pulled from a variety of sources, including crash reports, netconsole results, and core dumps. If the error rate goes over one crash per thousand servers, the patch is pulled, and the old kernel is restored.
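The rollback rule boils down to a simple rate check. Here is a minimal sketch, assuming the crash count and the number of patched hosts have already been gathered from those sources; the function name, parameter names, and sample numbers are made up for illustration and do not reflect Meta's actual tooling.

```python
def should_roll_back(crashes: int,
                     patched_hosts: int,
                     max_crashes_per_thousand: float = 1.0) -> bool:
    """Return True when the crash rate exceeds the allowed threshold."""
    if patched_hosts <= 0:
        # Nothing has been patched yet, so there is nothing to roll back.
        return False
    crashes_per_thousand = crashes * 1000 / patched_hosts
    # "Goes over" is read as strictly greater than the threshold.
    return crashes_per_thousand > max_crashes_per_thousand


# Made-up numbers for a release candidate tier of 2,500 hosts.
healthy = should_roll_back(crashes=2, patched_hosts=2500)     # 0.8 per thousand
unhealthy = should_roll_back(crashes=3, patched_hosts=2500)   # 1.2 per thousand
```

Running a gate like this on the release candidate tier first means a bad patch trips the threshold on a small slice of the fleet before it reaches everything else.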

With over a billion users, Facebook also keeps a close eye on performance. As Leitao said, “The live patch performance overhead is small, but there is always a concern when a relatively hot function is patched.”

While Meta uses Kpatch, there are alternatives. SUSE offers kGraft, Oracle uses Ksplice, and Canonical supports Livepatch. Regardless of the tool, they all deliver similar results.

So, if you’d rather not have downtime with your servers, data centers, and clouds, follow Meta’s example and use live patching. You’ll be glad you did.
