
Catch Performance Regressions: Benchmark eBPF Program 

A look at how to prevent a major performance regression in production by benchmarking both our eBPF and userspace applications.
Jul 14th, 2023 8:39am

This is the fourth in a five-part series. Read Part 1, Part 2 and Part 3.

In this series we learned what eBPF is, the tools to work with it, why eBPF performance is important and how to track it with continuous benchmarking. We created a basic eBPF XDP program, line by line in Rust using Aya. Then we went over how to evolve a basic eBPF XDP program to new feature requirements. When we left off, our last feature change had caused a major performance regression in production. In this installment, we will look at how to prevent that from happening by benchmarking both our eBPF and userspace applications. All of the source code for the project is open source and is available on GitHub.

As promised, production is on fire 🔥🔥🔥, and you’re reconsidering all of the life choices that have led you to this moment. It’s been three weeks since we implemented our FizzBuzzFibonacci feature, and everything seemed great… until it wasn’t. After hours of chasing your tail, staring at your “observability” dashboard in disbelief and lots of manual profiling, you’ve finally narrowed down the culprit: our is_fibonacci helper function.

  1. fn is_fibonacci(n: u8) -> bool {
  2.     let (mut a, mut b) = (0, 1);
  3.     while b < n {
  4.         let c = a + b;
  5.         a = b;
  6.         b = c;
  7.     }
  8.     b == n
  9. }

Then you realize you are recalculating the Fibonacci sequence every time you receive a packet. Things weren’t so bad when you only had a single endpoint, but yesterday you added more.

The fix is simple, since there aren’t that many numbers in the Fibonacci sequence below 256:

  1. fn is_fibonacci(n: u8) -> bool {
  2.     matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
  3. }

Instead of recalculating the Fibonacci sequence every time you receive a packet, we can just hard-code the 13 necessary values.
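As a quick sanity check, the loop version and the table version agree on every u8 input except 0: the loop starts at b = 1 and so never returns true for 0, even though 0 is a Fibonacci number (and is in the table). A standalone sketch, widened to u16 internally so the loop can’t overflow for inputs above 233:

```rust
fn is_fibonacci_loop(n: u8) -> bool {
    // The original loop, using u16 so `a + b` cannot overflow past 233.
    let (mut a, mut b) = (0u16, 1u16);
    while b < n as u16 {
        let c = a + b;
        a = b;
        b = c;
    }
    b == n as u16
}

fn is_fibonacci_table(n: u8) -> bool {
    matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
}

fn main() {
    // The two versions only disagree at 0.
    assert!(!is_fibonacci_loop(0) && is_fibonacci_table(0));
    for n in 1..=255u8 {
        assert_eq!(is_fibonacci_loop(n), is_fibonacci_table(n), "mismatch at {n}");
    }
    println!("agreement verified for 1..=255");
}
```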

With our changes made and an emergency push to production, we are able to put out the fire 🧯. But why did we have to wait until production to catch this‽ We should try to shift this detection as far left as possible, to development. The best way to accomplish this is with software benchmarks.

There are two major categories of software benchmarks: micro-benchmarks and macro-benchmarks. Micro-benchmarks operate at a level similar to unit tests. For example, a benchmark for our is_fibonacci function would be a micro-benchmark. Macro-benchmarks operate at a level similar to integration tests. For example, a benchmark for our spawn_agent function would be a macro-benchmark.

The three popular options for micro-benchmarking in Rust are: libtest bench, Criterion and Iai.

Though part of the Rust standard library, libtest bench is still considered unstable, so it is only available on nightly compiler releases. To work on the stable Rust compiler, a separate benchmarking harness crate has to be used instead. Neither it nor libtest bench is under active development, though.

The most actively maintained benchmarking harness within the Rust ecosystem is Criterion. It works on both stable and nightly Rust compiler releases, and it has become the de facto standard within the Rust community. Criterion is also much more feature-rich compared to libtest bench.

An experimental alternative to Criterion is Iai, from the same creator as Criterion. It uses instruction counts instead of wall clock time: CPU instructions, L1 accesses, L2 accesses and RAM accesses. This allows for single-shot benchmarking, since these metrics should stay nearly identical between runs.

We’re going to use the most popular option of the three, Criterion. All three benchmark harnesses are supported by Bencher, the continuous benchmarking tool that we will be using later in this series. To use Criterion to benchmark our code, we will need to refactor things a bit. Criterion does not play nicely with the compiler options required to generate eBPF bytecode. To work around this, we will move some of our eBPF helper functions over to a separate Rust library, called a crate, so they can be tested independently. For convenience, we’ll just add them to the crate we already have for our SourceAddr map messages.


Within that crate we will need to update our Cargo.toml file. Cargo.toml is like make for the 21st century. We will add the following:

  1. [dev-dependencies]
  2. criterion = "0.4"
  3. [[bench]]
  4. name = "source_addr"
  5. harness = false

Going line by line:

  1. The following lines contain development dependencies.
  2. Add Criterion version 0.4 to our crate.
  3. Configuration options for cargo bench.
  4. The name of our benchmark test file is source_addr.
  5. Do not use the default benchmarking harness (libtest bench).

Then we will update our SourceAddr code to add a new associated function:

  1. impl SourceAddr {
  2.     pub fn new(source_addr: u32) -> Option<SourceAddr> {
  3.         is_fibonacci(source_addr as u8)
  4.             .then_some(SourceAddr::Fibonacci)
  5.             .or(match (source_addr % 3, source_addr % 5) {
  6.                 (0, 0) => Some(SourceAddr::FizzBuzz),
  7.                 (0, _) => Some(SourceAddr::Fizz),
  8.                 (_, 0) => Some(SourceAddr::Buzz),
  9.                 _ => None,
  10.             })
  11.     }
  12. }

This new associated function does exactly the same work as before when creating a SourceAddr in the eBPF XDP program. It has just been transposed to be defined on the SourceAddr type directly. We also have to copy over our is_fibonacci helper function that we fixed earlier in this post.
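To see the transposed logic in action outside of the eBPF context, here is a standalone sketch of the classification (the enum is trimmed to just the variants shown above; note that the Fibonacci check takes precedence over the modulo checks):

```rust
// Standalone sketch of the SourceAddr classification from the post.
#[derive(Debug, PartialEq)]
enum SourceAddr {
    Fizz,
    Buzz,
    FizzBuzz,
    Fibonacci,
}

fn is_fibonacci(n: u8) -> bool {
    matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
}

impl SourceAddr {
    fn new(source_addr: u32) -> Option<SourceAddr> {
        // `as u8` truncates, so e.g. 256 maps to 0 before the Fibonacci check.
        is_fibonacci(source_addr as u8)
            .then_some(SourceAddr::Fibonacci)
            .or(match (source_addr % 3, source_addr % 5) {
                (0, 0) => Some(SourceAddr::FizzBuzz),
                (0, _) => Some(SourceAddr::Fizz),
                (_, 0) => Some(SourceAddr::Buzz),
                _ => None,
            })
    }
}

fn main() {
    assert_eq!(SourceAddr::new(8), Some(SourceAddr::Fibonacci)); // Fibonacci wins
    assert_eq!(SourceAddr::new(9), Some(SourceAddr::Fizz));
    assert_eq!(SourceAddr::new(10), Some(SourceAddr::Buzz));
    assert_eq!(SourceAddr::new(30), Some(SourceAddr::FizzBuzz));
    assert_eq!(SourceAddr::new(7), None);
    println!("classification sketch ok");
}
```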

Now we are ready to write some benchmarks! Inside of the source_addr benchmark test file that we noted in our Cargo.toml, we will add the following:

  1. fn bench_source_addr(c: &mut Criterion) {
  2.     c.bench_function("SourceAddr", |b| {
  3.         b.iter(|| {
  4.             for i in 0..256 {
  5.                 SourceAddr::new(i);
  6.             }
  7.         })
  8.     });
  9. }
  10. criterion_group!(benches, bench_source_addr);
  11. criterion_main!(benches);

Going line by line:

  1. Create a function that takes in a mutable Criterion benchmark collector.
  2. Using that benchmark collector, add a benchmark named “SourceAddr”.
  3. The benchmark should be run many times to ensure precision.
  4. This is our actual benchmark code for all numbers from 0 to 256 exclusive…
  5. Create a new SourceAddr with that number.
  6. Register our bench_source_addr with the benches test group.
  7. Run the benches test group as our main function for the benchmark harness.
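One caveat with the loop above: the result of SourceAddr::new(i) is discarded, so an aggressive optimizer could in principle delete the call entirely and leave the benchmark measuring nothing. Criterion ships a black_box helper for this, and the standard library has std::hint::black_box since Rust 1.66. A standalone sketch of the idea, with a stand-in for the classification function:

```rust
use std::hint::black_box;

// Stand-in for SourceAddr::new: a pure function whose result would
// otherwise be unused.
fn classify(i: u32) -> u32 {
    (i % 3 == 0) as u32 + (i % 5 == 0) as u32
}

fn main() {
    let mut acc = 0u32;
    for i in 0..256u32 {
        // black_box on both the input and the output keeps the optimizer
        // from proving the computation dead and removing it.
        acc += black_box(classify(black_box(i)));
    }
    // 86 multiples of 3 plus 52 multiples of 5 in 0..=255
    println!("{acc}");
}
```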

If we then run cargo bench, we get something that looks like this:
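Criterion prints a summary for each benchmark along these lines (the timings here are illustrative, not from a real run):

```
SourceAddr              time:   [1.0037 µs 1.0070 µs 1.0107 µs]
                        change: [-0.31% +0.19% +0.74%] (p = 0.47 > 0.05)
                        No change in performance detected.
```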

We have successfully created micro-benchmarks for our eBPF code! 🎉

To test things at a higher level, we will move now to macro-benchmarking. Things are going to get a bit more complicated as we are going to have to get our benchmark times from the Linux kernel itself. There are three options for doing so: kernel.bpf_stats_enabled, bpftool prog profile and bpftool prog run.

kernel.bpf_stats_enabled collects run_time_ns and run_cnt on all eBPF programs when it is enabled. However, it is disabled by default. It was added in Linux kernel version 5.1.

The bpftool prog profile command is similar to Iai, which we looked at earlier. It collects instruction counts instead of wall clock time: CPU instructions, L1d loads, LLC misses and cycles. It was added in Linux kernel version 5.7 and also requires bpftool built with clang >= 10.0.0.

The command bpftool prog run runs a specific eBPF program. It must be provided with the input data and context (except for XDP), and it returns the output data and context. Unfortunately, it is only for a narrow subset of eBPF program types at this point. Hopefully this changes in the future. It was added in Linux kernel version 4.12.


Even though our example is an XDP program, which is supported by bpftool prog run, we are going to use kernel.bpf_stats_enabled as it is applicable to all eBPF program types. This will require us to update our userspace source code. We will make changes to this code to make it more testable.

  1. #[tokio::main]
  2. async fn main() -> Result<(), anyhow::Error> {
  3.     let opt = Opt::parse();
  4.     let shutdown = Arc::new(AtomicBool::new(false));
  5.     let ebpf_shutdown = shutdown.clone();
  6.     ebpf::run(&opt.iface, ebpf_shutdown).await?;
  7.     info!("Waiting for Ctrl-C...");
  8.     signal::ctrl_c().await?;
  9.     info!("Exiting...");
  10., Ordering::Relaxed);
  11.     Ok(())
  12. }

Going line by line:

  1. This is the same asynchronous main function as before.
  2. Again, parse the command line arguments.
  3. Create a thread-safe shutdown boolean, which is initialized to false.
  4. Make a copy of our shutdown boolean.
  5. Pass the interface command line argument and the copy of the shutdown boolean to our new run helper function.
  6. Again, log that we are waiting for Ctrl-C.
  7. And await Ctrl-C.
  8. This time log that we are exiting.
  9. But then set our shutdown boolean to true.
  10. Return an empty Ok.

Now let’s take a look at that run helper function:

  1. pub async fn run(iface: &str, shutdown: Arc<AtomicBool>) -> Result<Process, anyhow::Error> {
  2.     env_logger::init();
  3.     let mut bpf = Bpf::load(include_bytes_aligned!("../path/to/ebpf-bin"))?;
  4.     BpfLogger::init(&mut bpf)?;
  5.     let program: &mut Xdp = bpf.program_mut("fun_xdp").unwrap().try_into()?;
  6.     program.load()?;
  7.     program.attach(iface, XdpFlags::default())?;
  8.     let pid = std::process::id();
  9.     let prog_fd = bpf.program("fun_xdp").unwrap().fd().unwrap().as_raw_fd();
  10.     let handle = tokio::spawn(async move { spawn_agent(&mut bpf, shutdown).await });
  11.     Ok(Process {
  12.         pid,
  13.         prog_fd,
  14.         handle,
  15.     })
  16. }
  17. pub struct Process {
  18.     pub pid: u32,
  19.     pub prog_fd: i32,
  20.     pub handle: tokio::task::JoinHandle<Result<(), anyhow::Error>>,
  21. }

Going line by line:

  1. This is another asynchronous function that takes in the network interface to attach to iface and a thread-safe shutdown boolean. The Result is Process if Ok and a catch-all Err otherwise.
  2. Initialize logging for userspace.
  3. Load our compiled eBPF bytecode. Aya makes recompiling our eBPF source code into bytecode easy, so it automatically happens before our userspace code is compiled.
  4. Initialize logging from our eBPF program.
  5. From our eBPF bytecode, get our fun_xdp eBPF XDP program.
  6. Load the fun_xdp eBPF XDP program into the kernel.
  7. Attach our fun_xdp eBPF XDP program, with the default flags, to the network interface passed in via the iface argument.
  8. Get the current process ID.
  9. Get the file descriptor for the fun_xdp eBPF XDP program.
  10. Create a task handle for our updated spawn_agent helper function that we will look at next.
  11. Return an Ok Process
  12. With the process ID
  13. eBPF XDP program file descriptor
  14. And the spawn_agent task handle
  15. The type definition for a Process including
  16. A process ID
  17. A program file descriptor
  18. And an asynchronous task handle

Creating this separate run function allows us to perform integration-level testing for our macro-benchmarks. Finally, let’s take a look at that updated spawn_agent function:

  1. async fn spawn_agent(bpf: &mut Bpf, shutdown: Arc<AtomicBool>) -> Result<(), anyhow::Error> {
  2.     let mut xdp_map: aya::maps::Queue<_, SourceAddr> =
  3.         aya::maps::Queue::try_from(bpf.map_mut("SOURCE_ADDR_QUEUE")?)?;
  4.     loop {
  5.         while let Ok(source_addr) = xdp_map.pop(0) {
  6.             info!("{:?}", source_addr);
  7.         }
  8.         if shutdown.load(Ordering::Relaxed) {
  9.             break;
  10.         }
  11.     }
  12.     Ok(())
  13. }

This updated version of spawn_agent is nearly identical to its previous incarnation, except we now check the shutdown boolean on each iteration at lines 8 through 10. This allows for a clean shutdown from our custom benchmarking harness.

Creating a custom benchmarking harness in Rust is not actually as complicated as it sounds. First we will update our Cargo.toml file:

  1. [dev-dependencies]
  2. inventory = "0.3"
  3. [[bench]]
  4. name = "xdp"
  5. harness = false

Just as with the micro-benchmarks, we will need to add a development dependency. This time, though, since we’re building our own, we will use the inventory crate instead of Criterion. We’ll also configure cargo bench to run our xdp benchmark test file and not use the default harness.

That xdp benchmark test file will contain our custom test harness:

  1. #[derive(Debug)]
  2. pub struct EBpfBenchmark {
  3.     pub name: &'static str,
  4.     pub benchmark_fn: fn() -> f64,
  5. }
  6. inventory::collect!(EBpfBenchmark);

Each of our benchmarks will be represented as an EBpfBenchmark, with a name and function to run. We then use the inventory crate to collect all EBpfBenchmarks at compile time.

Next let’s look at the main:

  1. fn main() {
  2.     let mut results = Vec::new();
  3.     for benchmark in inventory::iter::<EBpfBenchmark> {
  4.         let benchmark_name =;
  5.         let json_metric = JsonMetric::new((benchmark.benchmark_fn)(), None, None);
  6.         results.push((benchmark_name, json_metric));
  7.     }
  8.     let adapter_results = AdapterResults::new_latency(results).unwrap();
  9.     let adapter_results_str = serde_json::to_string_pretty(&adapter_results).unwrap();
  10.     println!("{}", adapter_results_str);
  11.     let mut file = File::create("../target/results.json").unwrap();
  12.     file.write_all(adapter_results_str.as_bytes()).unwrap();
  13. }

Going line by line:

  1. This is the main function for our custom benchmark harness.
  2. Create a vector of results.
  3. For each EBpfBenchmark that was collected by inventory at compile time…
  4. Parse the benchmark name.
  5. Run the benchmark function and store the results as a JSON value.
  6. Add the result to the results.
  7. Create a JSON object containing all of the results.
  8. Serialize the JSON results to a string.
  9. Print the JSON results string.
  10. Create a results file.
  11. Save the JSON results string to the results file.

Now we are ready to add our first benchmark to our inventory:

  1. inventory::submit!(EBpfBenchmark {
  2.     name: "fun_xdp",
  3.     benchmark_fn: fun_xdp_benchmark
  4. });

This submits a benchmark named fun_xdp with a benchmark function fun_xdp_benchmark that looks like:

  1. fn fun_xdp_benchmark() -> f64 {
  2.     let rt = Runtime::new().unwrap();
  3.     let shutdown = Arc::new(AtomicBool::new(false));
  4.     let ebpf_shutdown = shutdown.clone();
  5.     let process = rt.block_on(async { ebpf::run(IFACE, ebpf_shutdown).await.unwrap() });
  6.     let _resp = rt.block_on(async { reqwest::get("").await.unwrap() });
  7.     let bpf_stats = get_bpf_stats(&process);
  8., Ordering::Relaxed);
  9.     bpf_stats
  10. }

Going line by line:

  1. This fun_xdp_benchmark function takes in no arguments and returns a 64-bit floating point number.
  2. Create an asynchronous runtime.
  3. Create a thread-safe shutdown boolean, which is initialized to false.
  4. Make a copy of our shutdown boolean.
  5. Pass a statically defined test network interface and the copy of the shutdown boolean to our run helper function. Block on this call.
  6. Create some network traffic, say visiting the home page for the best continuous benchmarking tool. Block on this call.
  7. Get the bpf_stats for the current process by calling our get_bpf_stats helper function.
  8.  Set the shutdown boolean to true.
  9. Return the bpf_stats.

Taking a look at the get_bpf_stats helper function:

  1. fn get_bpf_stats(process: &Process) -> f64 {
  2.     let fd_info = File::open(format!("/proc/{}/fdinfo/{}",, process.prog_fd)).unwrap();
  3.     let reader = BufReader::new(fd_info);
  4.     let (mut run_time_ns, mut run_ctn) = (None, None);
  5.     for line in reader.lines().flatten() {
  6.         if let Some(l) = line.strip_prefix("run_time_ns:") {
  7.             run_time_ns = l.trim().parse::<u64>().ok();
  8.         } else if let Some(l) = line.strip_prefix("run_cnt:") {
  9.             run_ctn = l.trim().parse::<u64>().ok();
  10.         }
  11.     }
  12.     match (run_time_ns, run_ctn) {
  13.         (Some(run_time_ns), Some(run_ctn)) if run_ctn != 0 => run_time_ns as f64 / run_ctn as f64,
  14.         _ => 0.0,
  15.     }
  16. }

Going line by line:

  1. get_bpf_stats takes in a reference to a Process and returns a 64-bit floating point number.
  2. Open the fdinfo pseudo-file for our eBPF XDP program.
  3. Create a buffered reader for the fd_info file so we don’t load it into memory all at once.
  4. Initialize our run_time_ns and run_ctn statistics to None.
  5. Read each line in the fd_info file.
  6. If the line starts with run_time_ns:
  7. Then trim and parse the remainder as a 64-bit unsigned integer.
  8. Otherwise, if the line starts with run_cnt:
  9. Then trim and parse the remainder as a 64-bit unsigned integer.
  10. Check the value of run_time_ns and run_ctn
  11. If run_time_ns and run_ctn have both been set to Some and run_ctn doesn’t equal zero, then find the average run time by dividing run_time_ns by run_ctn.
  12. Otherwise just return zero
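The same parsing logic can be exercised without a live eBPF program by feeding it fdinfo-shaped text directly — a standalone sketch with made-up numbers:

```rust
use std::io::{BufRead, BufReader, Cursor};

// Average ns per run from fdinfo-style text: run_time_ns / run_cnt.
fn avg_run_time_ns(fd_info: &str) -> f64 {
    let reader = BufReader::new(Cursor::new(fd_info));
    let (mut run_time_ns, mut run_cnt) = (None, None);
    for line in reader.lines().map_while(Result::ok) {
        if let Some(l) = line.strip_prefix("run_time_ns:") {
            run_time_ns = l.trim().parse::<u64>().ok();
        } else if let Some(l) = line.strip_prefix("run_cnt:") {
            run_cnt = l.trim().parse::<u64>().ok();
        }
    }
    match (run_time_ns, run_cnt) {
        // Guard against dividing by a zero run count.
        (Some(t), Some(c)) if c != 0 => t as f64 / c as f64,
        _ => 0.0,
    }
}

fn main() {
    // An excerpt of the shape the kernel exposes when bpf_stats is enabled.
    let sample = "prog_type:\t6\nrun_time_ns:\t200754\nrun_cnt:\t11\n";
    println!("{}", avg_run_time_ns(sample)); // 200754 / 11
    assert_eq!(avg_run_time_ns("run_cnt:\t0\n"), 0.0);
}
```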

Now we are almost ready to run our macro-benchmark! First, we have to enable bpf_stats as it is disabled by default.

  1. $ sudo sysctl -w kernel.bpf_stats_enabled=1
  2. kernel.bpf_stats_enabled = 1

Then we need to compile our eBPF code in release mode, as our benchmarks are run in release mode.

  1. $ cargo xtask build-ebpf --release

Then inside of the userspace crate, we can finally run our benchmarks. We will have to run as root, though, because loading an eBPF program requires elevated permission.

  1. $ sudo -E $(which cargo) bench
  2. Finished bench [optimized] target(s) in 0.13s
  3. Running benches/ (/home/epompeii/Code/bencher/examples/ebpf/target/release/deps/xdp-d7c85fd4c85d089e)
  4. {
  5.     "fun_xdp": {
  6.         "latency": {
  7.             "value": 18249.727272727272,
  8.             "lower_bound": null,
  9.             "upper_bound": null
  10.         }
  11.     }
  12. }

Boom! The output is the JSON we serialized and printed in our main function above. The JSON format used is Bencher Metric Format (BMF). BMF is the JSON format for Bencher, the continuous benchmarking tool that we will use later in this series. Macro-benchmarking an eBPF program in Rust with a custom benchmarking harness is complete! 🎉

In this post, we have come a long way toward preventing performance regressions. We have refactored both our eBPF and userspace source code to be more testable, and added both micro-benchmarks and macro-benchmarks. Working locally, we can easily see the performance improvements and regressions our changes make. For the same reasons that unit tests are run in CI to prevent feature regressions, benchmarks should also be run in CI to prevent performance regressions. This will require a continuous benchmarking tool. In the next and final installment in this series, we will look at using Bencher to track both our micro- and macro-benchmarks to catch performance regressions in CI.
