Culture / Development

What Happens When You Find a Decades-Old Bug?

21 Feb 2021 6:00am, by

It’s been said that 90% of programming is debugging (and that the other 10% must involve writing those bugs). But what’s often left unsaid is that some bugs offer a welcome challenge — mysteries daring us to unravel a seeming impossibility, in an open-ended process that can be educational if not entertaining. It’s a chance to explore the deepest workings of our code and its supporting tools while solving a modern-day riddle. And at the end of the trail: bragging rights.

This month two intrepid explorers shared their own stories…

J. Paul Reed, a Senior Applied Resilience Engineer at Netflix, Reed recently dug into a potentially decades-old inconsistency he happened to find in cron,  the venerable Unix daemon that runs commands at a scheduled time. With cron, the time when the job should run can be specified with digits, but also with symbolic names (like “Wed” or “Feb”) instead of numbers, Reed pointed out on Twitter — adding that many programmers also use lists and ranges. (“To run the job on weekdays, for example, put 1-5.”)

But suddenly one day, Reed discovered a potentially show-stopping issue. According to its man page, ranges and lists of names “are not allowed.”

This horrified Reed. “Raise your hand if you have crontab entries somewhere with a ‘mon-fri’ or a ‘jan,apr,jul,oct’. Yah… I do too.”

The possibility existed that his protocol-violating cronjobs had not actually been running. Maybe they’d never actually run. No, that couldn’t be right. Some of them had generated emails. Reed had seen it…

This called for an investigation…

First, some tests. Reed created cronjobs using both lists and ranges of weekday names — and then “waited in suspense EVERY MORNING to see if the cronjob ran.” And they did. All week long, the emails arrived. So with one sanity check firmly in place, he’d read the source code that actually handles the parsing of instructions from crontabs — which for Reed was the load_entry() function in the cronie package (used in Red Hat systems).

And he found, even in its get_number() subroutine, that the code was designed for the parsing of strings holding the names of months and the names of weekdays. There was even a comment that had explicitly said as much.

/* No numbers, look for a string if we have any */

Reed delved even further, digging into the origins of the code for cron, which in 2007 had been forked from vixie-cron 4.1…

But its man page still carried that same warning — that ranges and lists of names were not allowed — even while its code still seemed to support ranges and lists with the names of months and weekdays. He had trouble locating any versions of vixie-cron that were any older.

But then the story takes an interesting turn, as Reed reached out to internet pioneer Paul Vixie, who’d created not just vixie-cron, but also several other well-known UNIX programs, including SENDS, proxynet, and rtty.

Vixie was on Twitter and responded in a direct message that earlier versions of the source code had been shared in the online newsgroup comp.unix.sources.

Back in the early 1990s, version 2.0 of vixie-cron had been posted to the newsgroup — as three files that had to be spliced into one. But even then, Reed discovered, its man page still insisted that ranges or lists of names “are not allowed.” Yet this older code still also contained subroutines named get_list() and get_range(), and even a similar comment buried in get_number().

/* try to find the name in the name list */

Reed’s conclusion? Even back in vixie-cron version 2.0, lists and ranges of names were supported.

There was nothing left to do but ping Paul Vixie on Twitter again, and ask him why the man page says the ranges aren’t allowed when the code clearly supports it. And it all led to a triumphant and climactic moment in which Vixie’s reply dispelled the whole mystery with 22 simple words. “Oh, I didn’t support ranges or lists of names at first.

“When I fixed the code, I didn’t fix the man page.”

A Linux Bug from the 1990s

The same month Apollon Oikonomopoulos, the CIO at Skroutz, a comparison shopping engine in Greece, told the story of discovering an even more elusive bug. On the company’s engineering blog, Oikonomopoulos describes how snapshots of its (anonymized) production data are transferred into the development environment, which creates its own writeable snapshots. There’s a 600GB MariaDB instance and a 200-GB MongoDB instance holding this data. “This setup has worked rather well for the better part of a decade and has managed to scale from 15 developers to 150,” Oikonomopoulos wrote.

There was just one problem. “Every now and then, an rsync transfer from a source server would hang indefinitely for no apparent reason.”

They couldn’t reproduce the issue — though adding a rate limit made it happen less often, “with frequency ranging from once or twice per week to once every three months.” But when it started happening every day, it was time for a thorough investigation.

After checking their network interface controller, and using debugging tools like strace and gdb on the hung processes, the debugging team found their first clue with the ss utility (for investigating sockets): the data-receiving process appeared to have consumed all its incoming data, even though the data-sending process showed more data that was queued up and ready to send. Eventually, they decided to investigate how the Linux kernel handles TCP flow control — the nuts and bolts of how incoming packets get their identifying sequence numbers, and how their receipt is acknowledged.

Unfortunately, the actual real-time values for those variables are locked deep within the kernel’s internal state — that is, not exposed to users — so they turned to the SystemTap tool/scripting language. Using it to dump the internal state of the problematic TCP connections, they discovered an issue with one of those variables.

There’s a built-in check to prevent the re-receiving of data that had already been sent earlier. But in TCP the numbers for incoming segments are simple integers, and with a fixed size — meaning that eventually they’ll hit their highest-possible value, and start over again from the lowest possible value (like an odometer flipping from 9999 to 00000). This was foreseen; there’s a subroutine that should accurately compare two sequence numbers, even if one of them has “wrapped around.” But it appeared to sporadically fail.

Oikonomopoulos took what he’d learned to the maintainers of the Linux kernel — specifically, to their network development mailing list at Kernel.org. Apparently, the TCP code had a separate “bulk receiver fast-path” that was optimized for bulk data, where the bug had been lurking over all these years. And one maintainer estimated that the bug had been present since Linux 2.1.8, release in November of 1996. The information from Oikonomopoulos was enough for the kernel maintainers to quickly create a patch — within two hours — and one week later  Oikonomopoulos returned to share his results. “It has been almost a week and we have had zero hangs in 60 rsync runs, so I guess we can call it fixed.

“At the same time, we didn’t notice any ill side-effects.”

He also shared his thoughts with the kernel maintainers about why so many years elapsed while the bug remained undetected. It’s triggered by the peculiarities of rsync’s asynchronous protocols, and is also likely to happen only with a high-volume and “stable, long-running, low- or moderately-paced TCP connections in local networks where packet loss is minimal.” Finally, the bug is likely to be mistaken for a routine network glitch — an ordinary lost connection — which is easily resolved by reconnecting.

“In the unlikely event that your application lingers forever in the persist state,” Oikonomopoulos wrote on the mailing list, “it certainly takes a fair amount of annoyance to sidestep your ignorance, decide that this might indeed be a kernel bug, and go after it :)”

But he added a new thought on his company’s engineering blog. “When you do, you are rewarded with a nice adventure, where you get to learn about internet protocol history, have a glimpse at kernel internals, and witness open source work in motion!”


WebReduce

Feature image: The first bug, found in 1945 when Grace Hopper worked on a Mark II computer at Harvard University (Naval Surface Warfare Center)

A newsletter digest of the week’s most important stories & analyses.