
Unsafe Rust in the Wild

A look at research into how unsafe Rust is used in practice, and a warning urging programmers to take extra care when dealing with unsafe code.
Sep 29th, 2022 7:00am

Rust is a systems programming language designed to have much stronger type safety than traditional systems languages such as C. More importantly, the safety guarantee is embedded in the language itself and checked at compile time. Before a Rust program can pass the compiler, common programming errors that plague traditional languages, such as uninitialized variables, dangling pointers, memory leaks and even data races, are mostly eliminated.

Two of the most prominent safety features of Rust are ownership and lifetimes. Generally speaking, every piece of data in memory in a Rust program has exactly one variable that is its owner. When the variable goes out of scope, the memory it owns is released. With this ownership mechanism, Rust can ensure that a program is by and large free of memory leaks without resorting to either explicit release or garbage collection. Ownership can be permanently transferred, or moved, when a variable is assigned to another, or it can be temporarily borrowed when a reference to the variable is created.

Because references can potentially cause aliasing-related memory access problems, Rust imposes strict rules so that, for any given object, only multiple read-only references, or a single mutable reference and no other references, are allowed at any time.
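The move and borrow rules described above can be illustrated with a minimal sketch. The function name ownership_demo and its values are made up for illustration.

```rust
/// Illustrative only: walks through a move, shared borrows and a mutable borrow.
fn ownership_demo() -> (String, usize) {
    let a = String::from("hello");
    // Ownership of the string moves from `a` to `b`; `a` can no longer be used.
    let b = a;

    // Any number of read-only references may coexist.
    let r1 = &b;
    let r2 = &b;
    let total = r1.len() + r2.len();

    // The shared borrows are no longer used, so the value can be moved again
    // and then borrowed mutably. Only one mutable reference may exist at a time.
    let mut c = b;
    {
        let m = &mut c;
        m.push('!');
    }

    // `c` owns the string; its memory is released when the owner goes out of scope.
    (c, total)
}
```

Uncommenting a use of `a` after the move, or creating a second reference while `m` is live, would make the compiler reject the program.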

These safety features, and others, make Rust a much safer language than traditional languages. If a program can be successfully compiled, the programmer can be confident that it is free of memory bugs. Once the initial learning curve is overcome and the programmer has mastered the Rust safety features, the development efficiency can be quite high because most, if not all, of the commonly occurring yet hard-to-pinpoint errors are detected at compile time rather than runtime.

Unsafe Rust

In practice, however, the strict rules of Rust on ownership, lifetimes and references can be overly restrictive, and as a result programmers sometimes have to find ways to get around the rules. For such situations, the Rust language itself provides a mechanism: unsafe Rust.

Unsafe Rust refers to Rust code that is marked with the unsafe keyword and performs operations that are not allowed by the safety rules, such as dereferencing a raw pointer. In other words, unsafe code has the potential to break safety guarantees.

Unsafe Rust is a necessary “escape hatch.” Indeed, according to Brian Anderson and Lars Bergstrom, both formerly at Mozilla, as well as others, many fundamental features of Rust itself are implemented with unsafe code. For example, unsafe code allows the Vec type to manage its buffer efficiently, enables the std::io module to interact with the operating system and so on.

It is important to point out that just because a piece of code is marked as unsafe does not necessarily mean that it is unsafe. Unsafe means that the compiler cannot ensure safety of the code; consequently, the responsibility for safety falls on the programmer who creates the unsafe code.

Although Rust is a relatively young language, it has been in use for many years and has gained impressive popularity. In this article, we investigate the usage of unsafe Rust in the wild. We discuss results and findings from a few recent studies on the real-world use of unsafe Rust, together with results from our own experiments, and present them in a unified way. The purpose is to help demystify unsafe Rust and to provide answers to the most relevant questions about it, such as:

  • How commonly is unsafe Rust used?
  • For what purposes is unsafe Rust mostly used?
  • Is unsafe Rust used in ways that conform to software engineering principles?

Kinds of Unsafe Code

Rust has three kinds of unsafe code: unsafe functions, unsafe blocks and unsafe traits. They are all marked with the keyword unsafe. This is in keeping with the implicit principle that the default case is generally more conservative. Marking unsafe code helps to increase the programmer’s awareness that unsafe code is being created and extra care must be taken. It also makes unsafe code more conspicuous for code reviewing or software engineering tools.

Unsafe function — An unsafe function is defined with the unsafe keyword preceding the fn keyword. For example, the function String::from_utf8_unchecked() from the standard library is declared with the following signature:

    pub unsafe fn from_utf8_unchecked(bytes: Vec<u8>) -> String


This function takes a vector of bytes (u8) and creates a String from it. A Rust String is a sequence of Unicode characters stored in UTF-8 format, which is a variable-length encoding. This function is unsafe because, for efficiency, it does not check whether the input vector of bytes is indeed a sequence of UTF-8-encoded characters. It is up to the caller to ensure that the input is as expected.

In general, a function is marked as unsafe because it has some preconditions regarding its inputs that must be met before the function can be safely called, and the conditions cannot be checked by the compiler. In the example above, the compiler cannot check if a vector of bytes is a valid sequence of UTF-8 characters. Such conditions are sometimes referred to as the function’s “contract.” It is therefore the caller’s responsibility to ensure that the contract is met before safety can be guaranteed.

Unsafe block — An unsafe block is a block of code enclosed in a pair of curly braces that is preceded with the unsafe keyword. For example, when a raw pointer is being dereferenced, it must be enclosed in an unsafe block. Also, an unsafe function can only be called within an unsafe block. The body of an unsafe function is automatically an unsafe block. The following example shows the above unsafe function being called inside an unsafe block.


In this example, the UTF-8 encoding of the euro sign, which consists of the three bytes 0xE2, 0x82 and 0xAC, is put in a vector. Then the unsafe function String::from_utf8_unchecked() is called in the unsafe block to convert the vector into the string s.
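A minimal sketch of such a call could look like the following. The wrapping function euro_sign is a name chosen here for illustration. The byte values come from the description above.

```rust
/// Builds the euro sign from its three UTF-8 bytes.
fn euro_sign() -> String {
    // UTF-8 encoding of the euro sign (U+20AC).
    let bytes: Vec<u8> = vec![0xE2, 0x82, 0xAC];

    // SAFETY: the three bytes above form one valid UTF-8 sequence,
    // which is the precondition of `from_utf8_unchecked`.
    let s = unsafe { String::from_utf8_unchecked(bytes) };
    s
}
```

Removing the unsafe keyword from the block turns the call into a compile-time error, because an unsafe function cannot be called from safe code.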

A natural question at this point is when to use an unsafe function and when to use an unsafe block. Generally speaking, functions, especially public functions, should not be gratuitously marked as unsafe. A function should be marked as unsafe only when it has a calling contract. Note that just because a function is unsafe does not mean that its caller should automatically be unsafe as well. If the caller can ensure the preconditions of the unsafe function, then the call is perfectly safe. For example, we can have the following safe function even though it contains an unsafe block.
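As one possible illustration, the sketch below wraps the unsafe call in a safe function that establishes the precondition itself. The name ascii_to_string and the ASCII-only check are assumptions made for this example. The check is stricter than the contract requires, since every ASCII byte sequence is valid UTF-8 but not the other way around.

```rust
/// Safe function containing an unsafe block: the precondition of the
/// unsafe call is verified before the call is made.
fn ascii_to_string(bytes: Vec<u8>) -> Option<String> {
    // Reject anything that is not plain ASCII.
    if !bytes.is_ascii() {
        return None;
    }
    // SAFETY: every ASCII byte sequence is also valid UTF-8, so the
    // contract of `from_utf8_unchecked` holds here.
    Some(unsafe { String::from_utf8_unchecked(bytes) })
}
```

Callers of ascii_to_string need no unsafe block of their own, because no input can violate the contract of the inner call.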


In fact, it is quite common to have safe functions contain unsafe blocks. It is actually good programming practice to have unsafe code encapsulated this way inside safe functions.

Unsafe trait — A Rust trait is similar to an interface in Java: It generally contains declarations of a set of related methods. As such, from an object-oriented point of view, a trait can be viewed as an abstract base class. Programmers can then implement the trait for a given data type by implementing the methods of the trait.

An unsafe trait has the unsafe keyword preceding the trait keyword. All implementations of an unsafe trait must be marked as unsafe. Unsafe traits can be a source of confusion. For one thing, although the implementation of an unsafe trait must be marked as unsafe impl, the methods of the implementation are not automatically unsafe and therefore do not have to be called inside an unsafe block. Also, it is not very clear when a trait needs to be declared as unsafe. Relevant information is scattered and inadequate.
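To make the syntax concrete, here is a small sketch of an unsafe trait. The trait AllZeroValid and its contract are hypothetical and invented for this example. They are not part of the standard library.

```rust
/// Hypothetical unsafe trait. Contract for implementers: the all-zero
/// bit pattern must be a valid value of `Self`.
unsafe trait AllZeroValid: Sized {
    /// An ordinary method: callers do not need an unsafe block.
    fn zeroed() -> Self {
        // SAFETY: implementers promise that all-zero bits are a valid value.
        unsafe { std::mem::zeroed() }
    }
}

// Each implementation must be marked `unsafe impl`; the implementer, not the
// compiler, vouches that the contract holds for the type.
unsafe impl AllZeroValid for u32 {}
unsafe impl AllZeroValid for [u8; 4] {}
```

Implementing this trait for a reference type would compile just as well, but it would break the contract, because a reference must never be null. This is why the burden falls on the implementer.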

Perhaps due to these reasons, among others, unsafe traits are not common in practice. In the Rust standard library, for example, we have found only 12 unsafe traits, eight of which are marked as experimental, leaving only four that are considered mature.

In summary, when a function is declared unsafe, the caller must ensure its contract is met. Unsafe operations such as calling an unsafe function must be put inside unsafe blocks. And when a trait is declared unsafe, the implementer must take extra care to ensure that the contract for the methods is respected.

Unsafe Rust in the Wild

We will now discuss the practical usage of this feature. Rust has enjoyed widespread adoption. In regard to unsafe Rust, some questions naturally arise, such as:

  • Is unsafe Rust used often?
  • For what purposes is unsafe Rust used?
  • Is unsafe Rust used in sensible ways?

We believe answers to these questions are of great interest to many, in particular programmers new to Rust and creators of software engineering tools. A few recent in-depth analyses of the practical use of unsafe Rust were conducted by examining a large number of real-world Rust crates (libraries or executables). They are the studies by Astrauskas et al. [AST20], Evans et al. [EVA20] and Qin et al. [QIN20]; the bracketed code names are used for convenient reference going forward.

We have collected some data on our own as well. In the following section, we will present the results in a unified way, which should help answer the questions listed above.

Datasets

Astrauskas et al. [AST20] examined all the crates available at crates.io as of January 2020. Of those crates, 31,867 can be compiled and were used in their study. Evans et al. [EVA20] also studied the crates registered at crates.io but from an earlier date, in September 2018. Out of about 18,000 available crates, they selected 13,096 that could be compiled. Of those, a subset of 462 popular crates, which account for 90% of downloaded crates, were studied separately. They also included 400 or so crates used by the Servo web browser engine from Mozilla.

The work by Qin et al. [QIN20] covered five real-world Rust applications and five popular libraries with a total of about 849,000 lines of source code. We have also collected some data on our own from two real-world Rust projects, namely the Redox operating system (415 dependent crates) and the rustc compiler for Rust (98 crates). For Redox, we use its dependent libraries; for rustc, we use libraries from the compiler and library sub-directories. A brief summary of the datasets is given in Table 1.

How Often Is Unsafe Code Used?

To determine how often unsafe code is used, Evans and Astrauskas checked the number of crates for use of the three types of unsafe code. We did the same for Redox and rustc. However, Qin reported on the number of instances for each type of unsafe code, instead of percentage of crates. These results are shown in Table 2.

Between Evans and Astrauskas, the number of investigated crates.io crates increased from about 13K to about 32K. The proportion of crates with unsafe code fell from 29.4% to 23.6%, meaning that among the newly registered crates, more than 80% were free of unsafe code, compared with about 70% in the first snapshot.
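The "more than 80%" figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes that every crate in the earlier snapshot is also present in the later one, which the studies do not state. The function name is made up for illustration.

```rust
/// Share of crates added between two snapshots that contain no unsafe code.
/// Assumption: the later snapshot is a superset of the earlier one.
fn new_crates_free_of_unsafe(
    old_total: f64,
    old_unsafe_share: f64,
    new_total: f64,
    new_unsafe_share: f64,
) -> f64 {
    let old_unsafe = old_total * old_unsafe_share; // crates with unsafe, first snapshot
    let new_unsafe = new_total * new_unsafe_share; // crates with unsafe, second snapshot
    let added = new_total - old_total; // crates registered in between
    1.0 - (new_unsafe - old_unsafe) / added
}
```

With the reported figures, new_crates_free_of_unsafe(13_096.0, 0.294, 31_867.0, 0.236) evaluates to slightly above 0.80.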

Despite this decreasing trend, the number of crates with unsafe code remains quite significant, suggesting that unsafe code is used quite commonly. This is even more prominent if we consider only the most popular crates. As shown in Table 2, 52.5% of the popular crates have unsafe code. Therefore, in an average real-world Rust project, you can expect a significant number of the dependent crates to have unsafe code. This is confirmed by the data we collected on Redox and rustc, which, respectively, have 56.6% and 53.1% of the crates with unsafe code.

Evans et al. analyzed the call graphs of the functions they examined. They consider a safe function containing unsafe blocks to be possibly unsafe. Indeed, if the code in an unsafe block does prove to be unsafe, the enclosing safe function stops being safe. With this notion, they found that the crates that have no unsafe code in themselves and in their dependent crates, namely those that can be considered really safe through static analysis, account for only 27% of all the crates.
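The notion of "possibly unsafe" can be sketched at crate granularity as a reachability check over the dependency graph. This is a simplified illustration, not the analysis tool used by Evans et al. The crate names and the CrateInfo type are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical crate record: does the crate itself contain unsafe code,
/// and which crates does it depend on?
struct CrateInfo {
    has_unsafe: bool,
    deps: Vec<&'static str>,
}

/// Depth-first search: true if `name` or anything it depends on has unsafe code.
fn visit(name: &str, graph: &HashMap<&'static str, CrateInfo>, seen: &mut HashSet<String>) -> bool {
    let info = match graph.get(name) {
        Some(info) => info,
        None => return true, // unknown crate: be conservative
    };
    if info.has_unsafe {
        return true;
    }
    if !seen.insert(name.to_string()) {
        return false; // already explored, or part of a cycle
    }
    info.deps.iter().any(|dep| visit(dep, graph, seen))
}

/// A crate is "possibly unsafe" if unsafe code is reachable through its dependencies.
fn is_possibly_unsafe(name: &str, graph: &HashMap<&'static str, CrateInfo>) -> bool {
    visit(name, graph, &mut HashSet::new())
}

/// Made-up example graph.
fn sample_graph() -> HashMap<&'static str, CrateInfo> {
    let mut g = HashMap::new();
    g.insert("app", CrateInfo { has_unsafe: false, deps: vec!["parser", "ffi_wrapper"] });
    g.insert("parser", CrateInfo { has_unsafe: false, deps: vec!["util"] });
    g.insert("util", CrateInfo { has_unsafe: false, deps: vec![] });
    g.insert("ffi_wrapper", CrateInfo { has_unsafe: true, deps: vec![] });
    g
}
```

In this made-up graph, app contains no unsafe code itself but is still possibly unsafe because it depends on ffi_wrapper, while parser and util are safe in the stricter sense.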

Another way to understand the prevalence of unsafe code is to look at the overall number of unsafe functions. We list the available data in Table 3.

According to Astrauskas, 7.5% of the functions in the crates from crates.io are declared unsafe. For Redox, the percentage is much lower, at 1.7%. For rustc, though, the number is very high, at 31.1%. The reason for this is that one crate, core-arch, which presumably deals with hardware architectures, alone defines 18,524, or 94.6%, of the unsafe functions. This crate has a total of 22,095 functions, so it predominantly consists of unsafe functions. If we exclude this outlier crate, the percentage of unsafe functions in all the other crates is 2.7%.

Unsafe Traits

Compared with unsafe blocks and unsafe functions, unsafe traits are much rarer. As Table 2 showed, for the crates.io crates, both [AST20] and [EVA20] report that about 1% of crates contain unsafe trait declarations.

Our measurements show that for the Redox and rustc datasets, the percentages of crates with unsafe trait declarations are 5.8% and 5.1%, respectively, slightly higher than the popular set of Evans.

Additionally, we investigated unsafe trait declarations in Rust’s standard library. Based on the version accessed in early September 2022, there are 172 traits in the standard library. Among those, 12, or 7%, are marked as unsafe. And among the 12 unsafe traits, eight are experimental.

In Table 4, we show the actual number of unsafe trait declarations compared with the total number of traits. You can see that the numbers are generally rather low, confirming that unsafe traits are not used commonly.

The data in this section suggest that the usage of unsafe Rust is common and prevalent if we consider the concept of possibly unsafe crates.

Kinds of Unsafe Code Operations

What kinds of unsafe code operations are commonly found in unsafe blocks/functions?

Both Astrauskas et al. and Evans et al. answer this question by examining the code in unsafe blocks/functions. Astrauskas et al. list about 13 of the most common unsafe code types, while Evans et al. give the top three types, which coincide with the top three in Astrauskas et al. The results are shown in Table 5.

By far the most prevalent operation in unsafe blocks/functions is a call to an unsafe function. According to both Astrauskas et al. and Evans et al., this kind of unsafe code appears in nearly 90% of the unsafe blocks/functions.

The second most common kind of unsafe code is the dereferencing of raw pointers, which is clearly unsafe. If we only look at the popular crates, however, calling unsafe functions drops to 64%, while the dereferencing of raw pointers becomes much more common than in the entire set of crates (25.9% vs. 6.4%).

An explanation offered by Evans is that the popular crates have more interactions with C libraries than the general crates. The third most common unsafe code type is the use of mutable static variables. Basically these are writeable global variables, which inherently have thread-safety issues and consequently are only allowed in unsafe blocks.
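For readers who have not met them, the three most common operations can be shown side by side in a small sketch. All names here (CALLS, read_through, three_operations) are invented for illustration, and the code is assumed to run on a single thread.

```rust
/// A mutable static: a writable global that may only be touched in unsafe code.
static mut CALLS: u32 = 0;

/// Unsafe function. Contract: `p` must point to a valid, aligned `i32`.
unsafe fn read_through(p: *const i32) -> i32 {
    // Operation 2: dereferencing a raw pointer.
    unsafe { *p }
}

fn three_operations() -> (i32, u32) {
    let x = 41;
    let p = &x as *const i32;

    // Operation 1: calling an unsafe function.
    // SAFETY: `p` was derived from a live reference to `x`.
    let value = unsafe { read_through(p) };

    // Operation 3: reading and writing a mutable static.
    // SAFETY: this sketch assumes no other thread accesses `CALLS`.
    let calls = unsafe {
        CALLS += 1;
        CALLS
    };

    (value + 1, calls)
}
```

Note that the unsafe block in three_operations only calls read_through, yet it ultimately involves a raw pointer dereference inside that function.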

Furthermore, Astrauskas et al. point out that for a vast majority (83.5%) of all the functions that use unsafe code, calling unsafe functions is the only reason for using unsafe. This indicates that to ensure the safety of unsafe code, most of the work lies in ensuring that the contracts of unsafe functions are met when they are called.

Finally, we note that the results in Table 5 could be further refined by tracing the calls to unsafe functions to reveal the basic unsafe code type. For example, suppose we have an unsafe function that dereferences raw pointers and several unsafe blocks that call this function. In that case, if we trace the calls to the unsafe function, we will find that all the unsafe blocks eventually involve dereferencing raw pointers.

Reasons for Using Unsafe Rust

The researchers took different approaches to investigate this topic.

Astrauskas et al. looked at about six probable reasons for using unsafe Rust, such as data sharing, incompleteness of the type checker and documentation. Some cases turn out to be quite rare. The top three reasons from their work are listed in Table 6.

The most common reason is interaction with foreign functions, which are always considered unsafe. This is followed by the need to bypass Rust’s strict safety rules. Although performance is listed as a top reason, they point out that using unsafe code for performance purposes is concentrated in only a few crates that use it heavily.
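As a concrete case of foreign-function interaction, the sketch below declares the C standard library function abs and wraps it in a safe Rust function. The wrapper name c_abs is made up for this example. The `unsafe extern` form is assumed to be available, which requires a reasonably recent compiler. Older toolchains use a plain `extern "C"` block instead.

```rust
// Declaration of a foreign function from the C standard library. The
// compiler cannot check that this signature matches the C definition.
unsafe extern "C" {
    fn abs(input: i32) -> i32;
}

/// Safe wrapper around C `abs`. Returns `None` for `i32::MIN`, whose
/// absolute value is not representable and is undefined behavior in C.
fn c_abs(x: i32) -> Option<i32> {
    if x == i32::MIN {
        return None;
    }
    // SAFETY: `abs` takes a plain integer, and the one input with
    // undefined behavior has been excluded above.
    Some(unsafe { abs(x) })
}
```

The call itself must sit in an unsafe block because Rust has no way to verify what the foreign code does.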

In Evans et al., the authors investigate the reasons for using unsafe code by conducting a survey of 20 programmers. Table 7 lists the top three reasons from the survey.

The first two reasons coincide with top reasons others found. The third reason was not reported by others. Here the programmers appear to be using unsafe code for convenience. This is not an intended purpose of unsafe code, so this reason does not seem to be compelling. Programmers should only use unsafe for its intended purposes.

In Qin et al., the authors analyzed the source code to find reasons for using unsafe code. Their categorization method is different from the others, and the top three reasons are listed in Table 8 along with the percentage of usage. The top reason, reusing existing code, includes operations like calling foreign functions, which is the top reason identified in Astrauskas et al. The other two top reasons are also identified by other researchers. In particular, Qin et al. state that in some cases an unsafe function can be four to five times faster than the corresponding safe version.

Combining these findings, we conclude that the top reasons for using unsafe code include:

  • Interoperations with foreign languages, particularly C.
  • Getting around safety restrictions that are too restrictive for programming tasks.
  • Improving performance.
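The second reason in the list above can be illustrated with a well-known pattern: handing out two mutable slices into the same buffer. The borrow checker rejects two mutable borrows of one slice even when the ranges do not overlap, so a sketch like the following falls back on raw pointers. The function name split_halves is made up here. The standard library offers split_at_mut for the same purpose.

```rust
use std::slice;

/// Splits a mutable slice into two non-overlapping mutable halves.
fn split_halves(values: &mut [i32]) -> (&mut [i32], &mut [i32]) {
    let len = values.len();
    let mid = len / 2;
    let ptr = values.as_mut_ptr();

    // SAFETY: the ranges [0, mid) and [mid, len) both lie inside the
    // original slice and do not overlap, so the two mutable slices
    // never alias each other.
    unsafe {
        (
            slice::from_raw_parts_mut(ptr, mid),
            slice::from_raw_parts_mut(ptr.add(mid), len - mid),
        )
    }
}
```

The function itself is safe to call: the unsafe block is an implementation detail hidden behind a signature that the borrow checker can reason about.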

How Well Is Unsafe Rust Used?

We have seen some aspects of the usage of unsafe Rust in practice. The final question we want to probe is how well it is used. In other words, is unsafe Rust used in a principled manner?

Based on advocated practices, Astrauskas et al. suggested three principles for unsafe Rust:

  1. Unsafe code should be used sparingly.
  2. Unsafe code should be straightforward and self-contained.
  3. Unsafe code should be well encapsulated.

The first principle does not seem to be adhered to in practice. As has been shown, more than 23% of the registered crates have unsafe code, and if we consider a safe function containing unsafe code as possibly unsafe, then Evans et al. have shown that only 27% of the crates they examined are truly safe.

As to the second principle, Astrauskas et al. concluded that it is generally observed in practice. This is evidenced by two indicators. First, most of the unsafe blocks are quite small, with the average being only 22 MIR (the Rust compiler’s intermediate representation) statements. Second, when an unsafe block calls an unsafe function, the target is mostly in the same crate — only 7.4% of calls are to other crates.

For the third principle, Astrauskas et al. mostly measured the proportion of unsafe functions that are declared public. Here we see a bimodal distribution: for 78.5% of the crates they examined, either none (34.7%) or all (43.8%) of the unsafe functions are public. Further analysis of the latter category reveals that those crates are mostly for purposes such as interfacing with foreign languages, embedded programming and so on, and unsafe functions therein are not meant to be encapsulated. Therefore, it appears that there are efforts on the programmers’ part not to expose unsafe functions.

Finally, it is noted in Astrauskas et al. that many unsafe functions do not have their contracts well documented. This is certainly a practice that needs improvement.
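One common convention for documenting a contract is a "# Safety" section in the function's doc comment, paired with a SAFETY comment at each call site. The sketch below shows the idea. The function names are hypothetical and chosen only for this example.

```rust
/// Returns the element at `index` without bounds checking.
///
/// # Safety
///
/// `index` must be less than `items.len()`. Calling this function with an
/// out-of-bounds index is undefined behavior.
unsafe fn nth_unchecked(items: &[u64], index: usize) -> u64 {
    // SAFETY: the caller guarantees that `index` is in bounds.
    unsafe { *items.get_unchecked(index) }
}

/// Safe caller that establishes the contract before the call.
fn last_or_zero(items: &[u64]) -> u64 {
    if items.is_empty() {
        return 0;
    }
    // SAFETY: the slice is non-empty, so `len() - 1` is a valid index.
    unsafe { nth_unchecked(items, items.len() - 1) }
}
```

With the contract written down, a reviewer can check each call site against it instead of reverse engineering the preconditions from the function body.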

The survey conducted by Evans et al. indicated that programmers in general are aware of the potential safety issues of unsafe code and take active measures to test and verify such code. Among the common techniques are reading the code very carefully, adding runtime checks, writing more unit tests, having discussions with others and so on.

Qin et al. noted that in general unsafe code usages are for good or unavoidable reasons. However, they also found that safety bugs are much more likely to occur in unsafe code than in safe code. This further reinforces the principle that unsafe code should be used sparingly and for good reasons. They further suggested that both the language itself and tools should be improved to help reduce the potential safety problems caused by unsafe code.

Conclusion

Unsafe Rust is an indispensable tool when it is necessary to bypass the strict safety rules, interface with foreign languages or improve performance to fulfill systems programming needs.

In this article, we gave an introduction to unsafe Rust and assembled results and findings from a few recent studies on unsafe Rust as well as our own experiments. Based on these results, we found that unsafe Rust is used quite often in practice, and its use largely conforms to software engineering principles. However, safety bugs are more likely to occur in unsafe code than in safe code, so programmers should take extra care when dealing with unsafe code. Language and tool improvements are urgently needed to improve the situation.
