Catch Performance Regressions: Benchmark eBPF Program

This is the fourth in a five-part series. Read Part 1, Part 2 and Part 3.
In this series we learned what eBPF is, the tools to work with it, why eBPF performance is important and how to track it with continuous benchmarking. We created a basic eBPF XDP program, line by line in Rust using Aya. Then we went over how to evolve a basic eBPF XDP program to new feature requirements. When we left off, our last feature change had caused a major performance regression in production. In this installment, we will look at how to prevent that from happening by benchmarking both our eBPF and userspace applications. All of the source code for the project is open source and is available on GitHub.
As promised, production is on fire 🔥🔥🔥, and you’re reconsidering all of the life choices that led you to this moment. It’s been three weeks since we implemented our FizzBuzzFibonacci feature, and everything seemed great… until it wasn’t. After hours of chasing your tail, staring at your “observability” dashboard in disbelief and lots of manual profiling, you’ve finally narrowed down the culprit: our `is_fibonacci` helper function:
```rust
fn is_fibonacci(n: u8) -> bool {
    let (mut a, mut b) = (0, 1);
    while b < n {
        let c = a + b;
        a = b;
        b = c;
    }
    b == n
}
```
Then you realize you are recalculating the Fibonacci sequence every time you receive a packet. Things weren’t so bad when your only endpoint was at `1.0.0.0`, but yesterday you added `1.255.255.255` and `2.255.255.255`.

The fix is simple, since there aren’t that many numbers in the Fibonacci sequence below 256:
```diff
 fn is_fibonacci(n: u8) -> bool {
-    let (mut a, mut b) = (0, 1);
-    while b < n {
-        let c = a + b;
-        a = b;
-        b = c;
-    }
-    b == n
+    matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
 }
```
Instead of recalculating the Fibonacci sequence every time you receive a packet, we can just hard-code the 13 necessary values.
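To guard against a typo in that hard-coded list, a quick unit test can cross-check it against a freshly generated Fibonacci set. This is a minimal sketch, not part of the original project, assuming it lives in the same module as `is_fibonacci`:

```rust
#[cfg(test)]
mod tests {
    use super::is_fibonacci;
    use std::collections::HashSet;

    #[test]
    fn lookup_matches_generated_sequence() {
        // Generate every Fibonacci number that fits in a u8.
        let mut fibs: HashSet<u16> = HashSet::from([0, 1]);
        let (mut a, mut b) = (0u16, 1u16);
        while b <= u8::MAX as u16 {
            fibs.insert(b);
            let c = a + b;
            a = b;
            b = c;
        }
        // The hard-coded lookup must agree for every possible input.
        for n in 0..=u8::MAX {
            assert_eq!(is_fibonacci(n), fibs.contains(&(n as u16)), "mismatch at {n}");
        }
    }
}
```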
With our changes made and an emergency push to production, we are able to put out the fire 🧯. But why did we have to wait until production to catch this‽ We should try to shift this detection as far left as possible, to development. The best way to accomplish this is with software benchmarks.
There are two major categories of software benchmarks: micro-benchmarks and macro-benchmarks. Micro-benchmarks operate at a level similar to unit tests. For example, a benchmark for our `is_fibonacci` function would be a micro-benchmark. Macro-benchmarks operate at a level similar to integration tests. For example, a benchmark for our `spawn_agent` function would be a macro-benchmark.
The three popular options for micro-benchmarking in Rust are: libtest bench, Criterion and Iai.
Though part of the Rust standard library, libtest bench is still considered unstable, so it is only available on nightly compiler releases. To work on the stable Rust compiler, a separate benchmarking harness needs to be used. Nor is libtest bench being actively developed.
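For reference, a libtest bench boils down to a `#[bench]` function driven by `test::Bencher`. A minimal sketch on a nightly compiler, assuming `is_fibonacci` is in scope in the same crate:

```rust
// Nightly only: the built-in bench harness lives behind the unstable `test` feature.
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn bench_is_fibonacci(b: &mut Bencher) {
    // black_box keeps the compiler from optimizing the call away.
    b.iter(|| test::black_box(is_fibonacci(test::black_box(233))));
}
```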
The most actively maintained benchmarking harness within the Rust ecosystem is Criterion. It works on both stable and nightly Rust compiler releases, and it has become the de facto standard within the Rust community. Criterion is also much more feature-rich compared to libtest bench.
An experimental alternative to Criterion is Iai, from the same creator as Criterion. However, it uses instruction counts instead of wall clock time: CPU instructions, L1 accesses, L2 accesses and RAM accesses. This allows for single-shot benchmarking since these metrics should stay nearly identical between runs.
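For comparison, an Iai benchmark is just a plain function registered with a macro. A minimal sketch, again assuming `is_fibonacci` is in scope:

```rust
use iai::black_box;

// Iai runs this once under instruction counting rather than timing it repeatedly.
fn iai_is_fibonacci() -> bool {
    is_fibonacci(black_box(233))
}

iai::main!(iai_is_fibonacci);
```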
We’re going to use the most popular option of the three, Criterion. All three benchmark harnesses are supported by Bencher, the continuous benchmarking tool that we will be using later in this series. To use Criterion to benchmark our code, we will need to refactor things a bit. Criterion does not play nicely with the compiler options required to generate eBPF bytecode. To work around this, we will move some of our eBPF helper functions over to a separate Rust library, called a crate, so they can be tested independently. For convenience, we’ll just add them to the crate we already have for our `SourceAddr` map messages.
Within that crate we will need to update our `Cargo.toml` file. `Cargo.toml` is like `make` for the 21st century. We will add the following:
```toml
[dev-dependencies]
criterion = "0.4"

[[bench]]
name = "source_addr"
harness = false
```
Going line by line:
- The following lines contain development dependencies.
- Add Criterion version 0.4 to our crate.
- –
- Configuration options for `cargo bench`.
- The name of our benchmark test file is `source_addr`.
- Do not use the default benchmarking harness (libtest bench).
Then we will update our `SourceAddr` code to add a `new` associated function:
```rust
impl SourceAddr {
    pub fn new(source_addr: u32) -> Option<SourceAddr> {
        is_fibonacci(source_addr as u8)
            .then_some(SourceAddr::Fibonacci)
            .or(match (source_addr % 3, source_addr % 5) {
                (0, 0) => Some(SourceAddr::FizzBuzz),
                (0, _) => Some(SourceAddr::Fizz),
                (_, 0) => Some(SourceAddr::Buzz),
                _ => None,
            })
    }
}
```
This `new` associated function does exactly the same work as before when creating a `SourceAddr` in the eBPF XDP program. It has just been moved to be defined on the `SourceAddr` type directly. We also have to copy over our `is_fibonacci` helper function that we fixed earlier in this post.
Now we are ready to write some benchmarks! Inside of the `source_addr` benchmark test file that we noted in our `Cargo.toml`, we will add the following:
```rust
fn bench_source_addr(c: &mut Criterion) {
    c.bench_function("SourceAddr", |b| {
        b.iter(|| {
            for i in 0..256 {
                SourceAddr::new(i);
            }
        })
    });
}

criterion_group!(benches, bench_source_addr);
criterion_main!(benches);
```
Going line by line:
- Create a function that takes in a mutable Criterion benchmark collector.
- Using that benchmark collector, add a benchmark named “SourceAddr”.
- The benchmark should be run many times to ensure precision.
- This is our actual benchmark code for all numbers from 0 to 256 exclusive…
- Create a new `SourceAddr` with that number.
- –
- –
- –
- –
- –
- –
- Register our `bench_source_addr` with the `benches` test group.
- Run the `benches` test group as our `main` function for the benchmark harness.
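One caveat worth noting: because the result of `SourceAddr::new(i)` is discarded, a sufficiently aggressive optimizer could elide the call entirely and leave us timing an empty loop. Criterion’s `black_box` guards against that. A hardened variant of the same benchmark might look like this (the `SourceAddr` import from our shared crate is assumed):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_source_addr(c: &mut Criterion) {
    c.bench_function("SourceAddr", |b| {
        b.iter(|| {
            for i in 0..256 {
                // black_box hides the input and output from the optimizer,
                // so the call cannot be constant-folded or removed.
                black_box(SourceAddr::new(black_box(i)));
            }
        })
    });
}

criterion_group!(benches, bench_source_addr);
criterion_main!(benches);
```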
If we then run `cargo bench`, we get something that looks like this:
```
$ cargo bench
    Finished bench [optimized] target(s) in 1m 03s
     Running unittests src/lib.rs (/home/epompeii/Code/bencher/examples/ebpf/target/release/deps/ebpf_common-23fb553d75cce271)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/source_addr.rs (/home/epompeii/Code/bencher/examples/ebpf/target/release/deps/source_addr-090d31387638198e)
SourceAddr              time:   [538.49 ns 540.24 ns 542.58 ns]
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe
```
We have successfully created micro-benchmarks for our eBPF code! 🎉
To test things at a higher level, we will now move on to macro-benchmarking. Things are going to get a bit more complicated, as we are going to have to get our benchmark times from the Linux kernel itself. There are three options for doing so: `kernel.bpf_stats_enabled`, `bpftool prog profile` and `bpftool prog run`.
`kernel.bpf_stats_enabled` collects `run_time_ns` and `run_cnt` on all eBPF programs when it is enabled. However, it is disabled by default. It was added in Linux kernel version 5.1.
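Because a disabled `kernel.bpf_stats_enabled` silently yields zeroed statistics, it can be worth failing fast when it is off. A minimal sketch, assuming the standard procfs view of the sysctl at `/proc/sys/kernel/bpf_stats_enabled`:

```rust
use std::fs;

// Returns true if the kernel is currently collecting eBPF run statistics.
fn bpf_stats_enabled() -> std::io::Result<bool> {
    let value = fs::read_to_string("/proc/sys/kernel/bpf_stats_enabled")?;
    Ok(value.trim() == "1")
}
```

A benchmark harness could call this up front and bail out with a clear error instead of reporting a run time of zero.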
The `bpftool prog profile` command is similar to the Iai harness that we looked at earlier. It collects instruction counts instead of wall clock time: CPU instructions, L1D loads, LLC misses and cycles. It was added in Linux kernel version 5.7 and also requires `bpftool` built with clang >= 10.0.0.
The command `bpftool prog run` runs a specific eBPF program. It must be provided with the input data and context (except for XDP), and it returns the output data and context. Unfortunately, it only works for a narrow subset of eBPF program types at this point. Hopefully this changes in the future. It was added in Linux kernel version 4.12. The supported program types are:
- BPF_PROG_TYPE_CGROUP_SKB
- BPF_PROG_TYPE_FLOW_DISSECTOR
- BPF_PROG_TYPE_LWT_IN
- BPF_PROG_TYPE_LWT_OUT
- BPF_PROG_TYPE_LWT_SEG6LOCAL
- BPF_PROG_TYPE_LWT_XMIT
- BPF_PROG_TYPE_SCHED_ACT
- BPF_PROG_TYPE_SCHED_CLS
- BPF_PROG_TYPE_SOCKET_FILTER
- BPF_PROG_TYPE_XDP
Even though our example is an XDP program, which is supported by `bpftool prog run`, we are going to use `kernel.bpf_stats_enabled` as it is applicable to all eBPF program types. This will require us to update our userspace source code. We will make changes to this code to make it more testable.
```rust
#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    let opt = Opt::parse();

    let shutdown = Arc::new(AtomicBool::new(false));
    let ebpf_shutdown = shutdown.clone();
    ebpf::run(&opt.iface, ebpf_shutdown).await?;

    info!("Waiting for Ctrl-C...");
    signal::ctrl_c().await?;
    info!("Exiting...");
    shutdown.store(true, Ordering::Relaxed);

    Ok(())
}
```
Going line by line:
- This is the same asynchronous main function as before.
- –
- Again, parse the command line arguments.
- –
- Create a thread-safe `shutdown` boolean, which is initialized to `false`.
- Make a copy of our `shutdown` boolean.
- Pass the interface command line argument and the copy of the `shutdown` boolean to our new `run` helper function.
- –
- Again, log that we are waiting for Ctrl-C.
- And await Ctrl-C.
- This time log that we are exiting.
- But then set our `shutdown` boolean to `true`.
- –
- Return an empty `Ok`.
- –
Now let’s take a look at that `run` helper function:
```rust
pub async fn run(iface: &str, shutdown: Arc<AtomicBool>) -> Result<Process, anyhow::Error> {
    env_logger::init();
    let mut bpf = Bpf::load(include_bytes_aligned!("../path/to/ebpf-bin"))?;
    BpfLogger::init(&mut bpf)?;
    let program: &mut Xdp = bpf.program_mut("fun_xdp").unwrap().try_into()?;
    program.load()?;
    program.attach(iface, XdpFlags::default())?;
    let pid = std::process::id();
    let prog_fd = bpf.program("fun_xdp").unwrap().fd().unwrap().as_raw_fd();
    let handle = tokio::spawn(async move { spawn_agent(&mut bpf, shutdown).await });
    Ok(Process {
        pid,
        prog_fd,
        handle,
    })
}

pub struct Process {
    pub pid: u32,
    pub prog_fd: i32,
    pub handle: tokio::task::JoinHandle<Result<(), anyhow::Error>>,
}
```
Going line by line:
- This is another asynchronous function. It takes in the network interface to attach to (`iface`) and a thread-safe `shutdown` boolean. The `Result` is a `Process` if `Ok` and a catch-all `Err` otherwise.
- Initialize logging for userspace.
- Load our compiled eBPF bytecode. Aya makes recompiling our eBPF source code into bytecode easy, so it automatically happens before our userspace code is compiled.
- Initialize logging from our eBPF program.
- From our eBPF bytecode, get our `fun_xdp` eBPF XDP program.
- Load the `fun_xdp` eBPF XDP program into the kernel.
- Attach our `fun_xdp` eBPF XDP program, using the default XDP flags, to the network interface given by the `iface` argument.
- Get the current process ID.
- Get the file descriptor for the `fun_xdp` eBPF XDP program.
- Create a task handle for our updated `spawn_agent` helper function that we will look at next.
- Return an `Ok` `Process`
- With the process ID
- eBPF XDP program file descriptor
- And the `spawn_agent` task handle
- –
- –
- –
- The type definition for a `Process` including
- A process ID
- A program file descriptor
- And an asynchronous task handle
Creating this separate `run` function allows us to perform integration-level testing for our macro-benchmarks. Finally, let’s take a look at that updated `spawn_agent` function:
```rust
async fn spawn_agent(bpf: &mut Bpf, shutdown: Arc<AtomicBool>) -> Result<(), anyhow::Error> {
    let mut xdp_map: aya::maps::Queue<_, SourceAddr> =
        aya::maps::Queue::try_from(bpf.map_mut("SOURCE_ADDR_QUEUE")?)?;

    loop {
        while let Ok(source_addr) = xdp_map.pop(0) {
            info!("{:?}", source_addr);
        }

        if shutdown.load(Ordering::Relaxed) {
            break;
        }
    }

    Ok(())
}
```
This updated version of `spawn_agent` is nearly identical to its previous incarnation, except we now check the `shutdown` boolean on each iteration of the outer loop. This allows for a clean shutdown from our custom benchmarking harness.
Creating a custom benchmarking harness in Rust is not actually as complicated as it sounds. First we will update our `Cargo.toml` file:
```toml
[dev-dependencies]
inventory = "0.3"

[[bench]]
name = "xdp"
harness = false
```
Just as with the micro-benchmarks, we will need to add a development dependency. This time, though, since we’re building our own, we will use the `inventory` crate instead of Criterion. We’ll also configure `cargo bench` to run our `xdp` benchmark test file and not use the default harness.
That `xdp` benchmark test file will contain our custom test harness:
```rust
#[derive(Debug)]
pub struct EBpfBenchmark {
    pub name: &'static str,
    pub benchmark_fn: fn() -> f64,
}

inventory::collect!(EBpfBenchmark);
```
Each of our benchmarks will be represented as an `EBpfBenchmark`, with a `name` and function to run. We then use the `inventory` crate to collect all `EBpfBenchmark`s at compile time.
Next, let’s look at the `main` function:
```rust
fn main() {
    let mut results = Vec::new();

    for benchmark in inventory::iter::<EBpfBenchmark> {
        let benchmark_name = benchmark.name.parse().unwrap();
        let json_metric = JsonMetric::new((benchmark.benchmark_fn)(), None, None);
        results.push((benchmark_name, json_metric));
    }

    let adapter_results = AdapterResults::new_latency(results).unwrap();
    let adapter_results_str = serde_json::to_string_pretty(&adapter_results).unwrap();
    println!("{}", adapter_results_str);
    let mut file = File::create("../target/results.json").unwrap();
    file.write_all(adapter_results_str.as_bytes()).unwrap();
}
```
Going line by line:
- This is the `main` function for our custom benchmark harness.
- Create a vector of `results`.
- –
- For each `EBpfBenchmark` that was collected by `inventory` at compile time…
- Parse the benchmark name.
- Run the benchmark function and store the result as a JSON value.
- Add the result to the `results`.
- –
- –
- Create a JSON object containing all of the results.
- Serialize the JSON results to a string.
- Print the JSON results string.
- Create a results file.
- Save the JSON results string to the results file.
- –
Now we are ready to add our first benchmark to our `inventory`:
```rust
inventory::submit!(EBpfBenchmark {
    name: "fun_xdp",
    benchmark_fn: fun_xdp_benchmark
});
```
This submits a benchmark named `fun_xdp` with a benchmark function `fun_xdp_benchmark` that looks like:
```rust
fn fun_xdp_benchmark() -> f64 {
    let rt = Runtime::new().unwrap();

    let shutdown = Arc::new(AtomicBool::new(false));
    let ebpf_shutdown = shutdown.clone();
    let process = rt.block_on(async { ebpf::run(IFACE, ebpf_shutdown).await.unwrap() });

    let _resp = rt.block_on(async { reqwest::get("https://bencher.dev").await.unwrap() });

    let bpf_stats = get_bpf_stats(&process);

    shutdown.store(true, Ordering::Relaxed);

    bpf_stats
}
```
Going line by line:
- This `fun_xdp_benchmark` function takes in no arguments and returns a 64-bit floating point number.
- Create an asynchronous runtime.
- –
- Create a thread-safe `shutdown` boolean, which is initialized to `false`.
- Make a copy of our `shutdown` boolean.
- Pass a statically defined test network interface and the copy of the `shutdown` boolean to our `run` helper function. Block on this call.
- –
- Create some network traffic, say visiting the home page for the best continuous benchmarking tool. Block on this call.
- –
- Get the `bpf_stats` for the current process by calling our `get_bpf_stats` helper function.
- –
- Set the `shutdown` boolean to `true`.
- –
- Return the `bpf_stats`.
- –
Taking a look at the `get_bpf_stats` helper function:
```rust
fn get_bpf_stats(process: &Process) -> f64 {
    let fd_info = File::open(format!("/proc/{}/fdinfo/{}", process.pid, process.prog_fd)).unwrap();
    let reader = BufReader::new(fd_info);
    let (mut run_time_ns, mut run_ctn) = (None, None);
    for line in reader.lines().flatten() {
        if let Some(l) = line.strip_prefix("run_time_ns:") {
            run_time_ns = l.trim().parse::<u64>().ok();
        } else if let Some(l) = line.strip_prefix("run_cnt:") {
            run_ctn = l.trim().parse::<u64>().ok();
        }
    }
    match (run_time_ns, run_ctn) {
        (Some(run_time_ns), Some(run_ctn)) if run_ctn != 0 => run_time_ns as f64 / run_ctn as f64,
        _ => 0.0,
    }
}
```
Going line by line:
- `get_bpf_stats` takes in a reference to a `Process` and returns a 64-bit floating point number.
- Open the `fdinfo` pseudo-file for our eBPF XDP program.
- Create a buffered reader for the `fd_info` file so we don’t load it into memory all at once.
- Initialize our `run_time_ns` and `run_ctn` statistics to `None`.
- Read each line in the `fd_info` file.
- If the line starts with `run_time_ns:`…
- Then trim and parse the remainder as a 64-bit unsigned integer.
- Otherwise, if the line starts with `run_cnt:`…
- Then trim and parse the remainder as a 64-bit unsigned integer.
- –
- –
- Check the values of `run_time_ns` and `run_ctn`.
- If `run_time_ns` and `run_ctn` have both been set to `Some` and `run_ctn` doesn’t equal zero, then find the average run time by dividing `run_time_ns` by `run_ctn`.
- Otherwise, just return zero.
- –
- –
Now we are almost ready to run our macro-benchmark! First, we have to enable `bpf_stats`, as it is disabled by default.
```
$ sudo sysctl -w kernel.bpf_stats_enabled=1
kernel.bpf_stats_enabled = 1
```
Then we need to compile our eBPF code in release mode, as our benchmarks are run in release mode.
```
$ cargo xtask build-ebpf --release
…
```
Then inside of the userspace crate, we can finally run our benchmarks. We will have to run as `root`, though, because loading an eBPF program requires elevated permissions.
```
$ sudo -E $(which cargo) bench
    Finished bench [optimized] target(s) in 0.13s
…
     Running benches/xdp.rs (/home/epompeii/Code/bencher/examples/ebpf/target/release/deps/xdp-d7c85fd4c85d089e)
{
  "fun_xdp": {
    "latency": {
      "value": 18249.727272727272,
      "lower_bound": null,
      "upper_bound": null
    }
  }
}
```
Boom! The output is the JSON we serialized and printed in our `main` function above. The JSON format used is Bencher Metric Format (BMF). BMF is the JSON format for Bencher, the continuous benchmarking tool that we will use later in this series. Macro-benchmarking an eBPF program in Rust with a custom benchmarking harness is complete! 🎉
In this post, we have come a long way toward preventing performance regressions. We have refactored both our eBPF and userspace source code to be more testable, and added both micro-benchmarks and macro-benchmarks. Working locally, we can easily see the performance improvements and regressions our changes make. For the same reasons that unit tests are run in CI to prevent feature regressions, benchmarks should also be run in CI to prevent performance regressions. This will require a continuous benchmarking tool. In the next and final installment in this series, we will look at using Bencher to track both our micro- and macro-benchmarks to catch performance regressions in CI.