How to Keep Up with Linux Bugs? Jump Upstream!

In a recent post on the Google Security Blog, Google Kernel Security Engineer Kees Cook published a call to arms titled “Linux Kernel Security Done Right,” aimed at organizations that rely on Linux but don’t contribute to the upstream Linux kernel.
In the post, Cook argues that many of these organizations are caught in a seemingly endless cycle of trying to keep up with the latest updates, often spinning their wheels fixing issues within their own forks of the Linux kernel. The struggle lies in choosing between applying every update and applying only the most “important” ones. Part of the issue, Cook points out, is that even the stable kernel releases, which provide “bug fixes only,” contain close to 100 new fixes per week. As a result, many organizations fall further and further behind on their particular branch and end up writing their own fixes.
Instead, he says, that same effort would be better spent on upstream Linux development: taking an “upstream first” approach and keeping local versions of Linux up to date. In his words, “downstream redundancy can be moved into greater upstream collaboration.” Cook offers both Chrome OS and Android as examples of this method, wherein Google has managed to avoid duplicate “in-house” efforts.
The inspiration for the post comes partly from his time at Google, where approaches to working with the Linux kernel vary by team, leading to precisely this sort of duplicate effort, Cook said in an interview with The New Stack. One team might work heads-down on a specific feature in its own branch of the kernel until it was happy with the result; only then would another team notice and ask how to get it. Because the two teams might be running completely different kernel versions, further effort would then go into porting the feature from one version to the other.
“This creates all this redundant work, whereas having started with developing that against an upstream kernel initially, suddenly everyone has it available, all the other teams, all the other people involved,” said Cook.
Another part of the problem, said Cook, is that companies accumulate technical debt by falling so far behind the latest Linux kernel, but he argues that bringing everything up to date is a one-time effort.
“When you’re starting with some new product or service, it looks really easy to just take a fork and work on it in the corner and you’re like, ‘Yeah, I’ll just stay up to date with the things that are important that need to be fixed,’ and then a year later, you’re thousands and thousands of bugs behind. And so that amplifies quickly over time,” explained Cook.
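To put rough numbers on that drift: the figure of close to 100 stable fixes per week cited in Cook's post works out to roughly 5,200 fixes a year. The short Python sketch below does that arithmetic. The function name and the `backported_share` parameter (the fraction of fixes a fork applies on its own) are hypothetical and used only for illustration; they do not come from the post.

```python
def fixes_behind(weeks, fixes_per_week=100, backported_share=0.0):
    """Estimate how many stable-kernel fixes a fork is missing.

    fixes_per_week: roughly the figure cited in Cook's post.
    backported_share: hypothetical fraction of fixes the fork
        backports itself (an assumption for illustration only).
    """
    if not 0.0 <= backported_share <= 1.0:
        raise ValueError("backported_share must be between 0 and 1")
    # Fixes the fork does not pick up in a given week.
    missed_per_week = fixes_per_week * (1.0 - backported_share)
    return round(weeks * missed_per_week)


# A fork that takes no stable updates for a year.
print(fixes_behind(52))
# A fork that backports a quarter of the fixes it sees.
print(fixes_behind(52, backported_share=0.25))
```

Even under the assumption that a team backports a quarter of the fixes, the fork ends the year several thousand fixes behind, which is consistent with the scale Cook describes.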
Some have argued that the technical debt is simply too great for certain organizations to abandon their branch and move to Linux stable. Cook counters that the move, though it may seem counterintuitive, pays off in the end.
“The way out of the technical debt hole is to start your new products on the latest kernel with that process in place, and then what problems that you encounter will quickly inform what’s needed to catch up old products because you’ll discover what needs testing, what use cases are truly important, and things along those lines,” he said. “That’s less clear when you start from way in the past trying to go into the new kernel.”
In his blog post, Cook says that the “most conservative estimates” put the Linux kernel and its toolchains in need of “at least 100 engineers.” Part of his aim is to convince companies that upstream-first participation with the Linux kernel benefits them first and foremost, with the added benefit of helping the community at large. Testing, said Cook, is another part of how companies can get up to speed.
“Every single organization is going to need a process for updating kernels. The point is to get it to a speed at which you can keep up with at least the stable kernel updates because then you benefit from everyone else’s work,” said Cook. “If you’re working on the upstream kernel, and you’re doing the tests against the current kernel, and sending in those test results, those developers get that immediate feedback. And then those bugs never actually make it out into a release version of the kernel. That’s the best place we can be as a community. It is certainly work to get there, but I think ultimately, the effort to move to that is less than trying to keep up, poorly, with really, really old software.”
If none of that has you convinced, Cook offers one further point: if you develop the fix for something, you (generally) get to dictate how it works, rather than relying on someone else and their interests.
“What’s better than that?” asked Cook. “Why would you want to take some other thing that you have to shape and rearrange that doesn’t quite fit your needs? I think that reminder of development agency is important. If you don’t see the thing you need, you can just create it yourself and show other people why it’s important and do the work to make it happen. And I think there’s a lot of power in people understanding that they can just choose to have that agency.”