Fuzzing: An Old Testing Technique Comes of Age
Both proprietary and open source development tend to have more developers than testers. As a result, automated testing has become increasingly common. In the last year, fuzzing (testing with dummy or random data) has become particularly widespread, and its popularity seems likely to continue.
Fuzzing’s name is newer than the concept itself. Computer scientist Gerald Weinberg recalls that when he worked at IBM and on Project Mercury in the late 1950s, “it was our standard practice to test programs by inputting decks of punch cards taken from the trash. We also used decks of random number punch cards. We weren’t networked in those days, so we weren’t much worried about security, but our random/trash decks often turned up undesirable behavior.”
Weinberg adds that “every programmer I knew […] used the trash-deck technique.” In the decades that followed, fuzzing continued to be widely used, although the current name only came into use in 1988, following a class project in a course taught by Barton Miller at the University of Wisconsin. Before then, it was known by various names, such as random testing or monkey testing.
Fuzzing’s current popularity might be said to begin in 2012, when Google introduced ClusterFuzz, a cloud service that runs a collection of fuzzing tests to improve security in the Chromium web browser. This practice has had several spinoffs. For example, Behdad Esfahbod used ClusterFuzz in the development of the font renderer HarfBuzz when he ported it to Chromium. Among developers, fuzzing gained even more recognition when it was used to understand the extent of the Shellshock vulnerability in 2014.
However, fuzzing became widely known only in 2016-17. In September 2016, Microsoft began Project Springfield, a cloud-based fuzzing service for detecting security bugs. Similarly, at the start of 2017, Google’s ClusterFuzz expanded into OSS-Fuzz, a project jointly sponsored by the Linux Foundation’s Core Infrastructure Initiative to improve security in key open source applications. In May 2017, OSS-Fuzz announced that it was working with 47 open source projects, and had discovered over one thousand bugs, including 264 potential security vulnerabilities. OSS-Fuzz’s tests include checks for memory leaks; heap, global, and stack buffer overflows; stack overflows; timeouts; and other common bugs.
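At toy scale, the basic loop behind this kind of testing can be sketched in a few lines of Python. The sketch below is only an illustration, not how ClusterFuzz or OSS-Fuzz are implemented: parse_record is a hypothetical parser with a deliberately planted bug, and fuzz is an assumed helper name. It feeds random byte strings to the target and records every input that fails in an undocumented way.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Hypothetical toy parser: the first byte declares the payload length."""
    if not data:
        raise ValueError("empty input")  # documented, expected rejection
    length = data[0]
    payload = data[1:]
    # Deliberately planted bug: the declared length is never checked against
    # the real payload size, so a too-large length raises IndexError.
    last = payload[length - 1] if length else 0
    return payload[:length] + bytes([last])


def fuzz(target, iterations=1000, max_len=8, seed=0):
    """Call target with random byte strings and collect unexpected failures."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # the parser rejected the input cleanly; not a bug
        except Exception as exc:  # anything else counts as a finding
            crashes.append((data, type(exc).__name__))
    return crashes


if __name__ == "__main__":
    findings = fuzz(parse_record)
    print(len(findings), "crashing inputs, for example", findings[0])
```

Production fuzzers such as libFuzzer and afl go well beyond this by using code coverage feedback to guide which inputs to try next, but the principle of throwing unexpected data at a program and watching for failures is the same.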
More recently, the reputation of fuzzing has benefited from an endorsement from Linux creator Linus Torvalds in discussing the development of the 4.14 Linux kernel. It is “worth mentioning,” Torvalds wrote in an email, “how much random fuzzing people are doing, and [how] it’s finding things. We’ve always done fuzzing (who remembers the old ‘crashme’ program that just generated randomcode [sic] and jumped to it? We used to do that quite actively very early on), but people have been doing some nice targetted [sic] fuzzing of driver subsystems, etc., and there[’s] been various fixes … coming out of those efforts.”
The reasons for this burst of interest in fuzzing have not been analyzed, but the general trends seem plain enough. In the past, testing has often been overlooked, and automating it is an obvious way to increase the level of testing without increasing costs or enlisting new developers. Notably, too, public concern about security motivates corporations and open source projects alike to improve their testing quickly. In open source, in particular, the maturity of applications often means that manual testing is more time-consuming and inefficient than ever before. Under these conditions, fuzzing is making more and more sense.
How LibreOffice Uses Fuzzing
To understand how and why fuzzing is becoming a common testing practice, an example is useful. One such example is offered by LibreOffice, the free-licensed office suite. Red Hat developer and long-time LibreOffice contributor Caolán McNamara explained that, in the past, the project used American Fuzzy Lop (afl) on its own hardware. McNamara himself has experimented with using afl to test for user interface crashes, although this line of testing has not been fully implemented.
However, LibreOffice has been using OSS-Fuzz since soon after it was announced, “and [we] have steadily added fuzz targets and enabled additional fuzz engines over the [last] year,” he said. The focus of LibreOffice’s testing is the import filters for graphic formats such as .png and .jpeg, and text document formats such as Microsoft Word’s .docx and LibreOffice’s own .odt (Open Document Text format). Some text file filters are also being fuzzed separately by LibreOffice’s sister project, The Document Liberation Project, which develops filters for outdated and obsolete proprietary formats.
Currently, LibreOffice has 43 fuzz targets that use OSS-Fuzz, and plans to add another three. These targets use three of OSS-Fuzz’s resources: the default libFuzzer, afl, and Undefined Behaviour Sanitizer (ubsan). So far, this combination of tools has identified 349 bugs, all of which have been fixed, except for some timeout bugs that cannot be reproduced.
In addition to fuzzing, McNamara says, “we continuously crash-test our import filters by loading a large corpus of documents (just shy of 100,000) downloaded from various public Bugzilla attachments.” The project uses afl-cmin “to find the smallest subset of each format that exercises the most code paths,” and publicly stores these subsets so that they can be used by other projects that are also working with import filters.
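The idea of shrinking a corpus to a small subset that still exercises the same code paths can be illustrated with a greedy set-cover sketch. This is an assumption-laden toy, not afl-cmin’s actual algorithm: minimize_corpus is a hypothetical function, the file names are invented, and coverage is modeled as plain sets of made-up branch IDs rather than real instrumentation data.

```python
def minimize_corpus(coverage: dict) -> list:
    """Greedily pick files until every covered branch is still covered.

    coverage maps a file name to the set of branch IDs that file exercises
    (hypothetical data; a real tool gathers this from an instrumented build).
    """
    remaining = set().union(*coverage.values())  # branches not yet covered
    chosen = []
    while remaining:
        # Pick the file that covers the most still-uncovered branches;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(coverage), key=lambda name: len(coverage[name] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen


if __name__ == "__main__":
    corpus = {
        "a.odt": {1, 2, 3},
        "b.odt": {3, 4},
        "c.odt": {1, 2},
        "d.odt": {4},
    }
    print(minimize_corpus(corpus))  # prints ['a.odt', 'b.odt']
```

Greedy selection does not guarantee the smallest possible subset, but it is simple and usually close, which is why it makes a reasonable stand-in for illustration.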
“Fuzzing has been really useful for us,” McNamara concludes. “It’s remorseless in ferreting out painful cases. We’ve found and fixed hundreds of legacy bugs, and, with OSS-Fuzz integrated into our continuous testing, what I’m most pleased with is that we are detecting new bugs early on the next day after they appear in our code, and long before they could appear in a released version.”
The Limits of Fuzzing
At the risk of pointing out the obvious, as much as fuzzing can streamline testing, simply implementing it is not enough to reduce bugs or tighten security. As pointed out by Richard Brown, the former openSUSE chair, “fuzzing indicates the possible presence of bugs, not the actual presence of bugs.” In other words, the results of fuzzing still usually require human judgment. Nor are the results of fuzzing much use if the infrastructure does not exist to support or correct them, as Hanno Böck, who writes a developer’s blog about fuzzing, suggests in contrasting the responses to his testing of the .deb and .rpm package managers.
Still, when accompanied by standard testing procedures, fuzzing has proved itself many times to be a way of doing more with the same resources. In the last year, fuzzing has come into its own, and in the future, its adoption is only likely to increase.