THE NEW STACK

# Catch Performance Regressions: Benchmark eBPF Program

A look at how to prevent a major performance regression in production by benchmarking both our eBPF and userspace applications.
Jul 14th, 2023 8:39am

This is the fourth in a five-part series. Read Part 1, Part 2 and Part 3.

In this series we learned what eBPF is, the tools to work with it, why eBPF performance is important and how to track it with continuous benchmarking. We created a basic eBPF XDP program, line by line in Rust using Aya. Then we went over how to evolve a basic eBPF XDP program to new feature requirements. When we left off, our last feature change had caused a major performance regression in production. In this installment, we will look at how to prevent that from happening by benchmarking both our eBPF and userspace applications. All of the source code for the project is open source and is available on GitHub.

As promised, production is on fire 🔥🔥🔥, and you’re reconsidering all of your life choices that have led you to this moment. It’s been three weeks since we implemented our `FizzBuzzFibonacci` feature, and everything seemed great… until it wasn’t. After hours of chasing your tail, staring at your “observability” dashboard in disbelief and lots of manual profiling, you’ve finally narrowed down the culprit. Our `is_fibonacci` helper function:

```rust
fn is_fibonacci(n: u8) -> bool {
    let (mut a, mut b) = (0, 1);
    while b < n {
        let c = a + b;
        a = b;
        b = c;
    }
    b == n
}
```

Then you realize you are recalculating the Fibonacci sequence every time you receive a packet. Things weren’t so bad when your only endpoint was at `1.0.0.0`, but yesterday you added `1.255.255.255` and `2.255.255.255`.

The fix is simple, as there aren’t that many numbers in the Fibonacci sequence below 256:

```rust
fn is_fibonacci(n: u8) -> bool {
    matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
}
```

Instead of recalculating the Fibonacci sequence every time we receive a packet, we can just hard code the 13 necessary values.
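Those 13 values are easy to double-check. The sketch below (illustrative only, not from the article's repository) regenerates every Fibonacci number that fits in a `u8` and compares it against the hard-coded list:

```rust
// Generate every Fibonacci number below 256, starting from 0.
fn fibonacci_below_256() -> Vec<u8> {
    let (mut a, mut b): (u16, u16) = (0, 1);
    let mut fibs = vec![0u8];
    while b < 256 {
        fibs.push(b as u8);
        let c = a + b;
        a = b;
        b = c;
    }
    fibs
}

fn main() {
    let hard_coded = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233];
    // The raw sequence is 0, 1, 1, 2, 3, ... so dedup the repeated 1.
    let mut generated = fibonacci_below_256();
    generated.dedup();
    assert_eq!(generated, hard_coded);
    println!("all {} values match", hard_coded.len());
}
```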

With our changes made and an emergency push to production, we are able to put out the fire 🧯. But why did we have to wait until production to catch this‽ We should try to shift this detection as far left as possible, to development. The best way to accomplish this is with software benchmarks.

There are two major categories of software benchmarks: micro-benchmarks and macro-benchmarks. Micro-benchmarks operate at a level similar to unit tests. For example, a benchmark for our `is_fibonacci` function would be a micro-benchmark. Macro-benchmarks operate at a level similar to integration tests. For example, a benchmark for our `spawn_agent` function would be a macro-benchmark.
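To make the distinction concrete, the crudest possible micro-benchmark can be hand-rolled with `std::time::Instant`. This sketch is illustrative only; it shows the idea, while a real harness also handles warm-up, outlier detection and statistics:

```rust
use std::time::Instant;

// The fixed helper from earlier in this post.
fn is_fibonacci(n: u8) -> bool {
    matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
}

fn main() {
    // A single call is far too fast to measure, so time many iterations.
    let iterations = 1_000_000u32;
    let mut hits = 0u32;
    let start = Instant::now();
    for i in 0..iterations {
        if is_fibonacci((i % 256) as u8) {
            hits += 1; // keep a live result so the loop isn't optimized away
        }
    }
    let elapsed = start.elapsed();
    println!("{iterations} calls in {elapsed:?} ({hits} Fibonacci hits)");
}
```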

The three popular options for micro-benchmarking in Rust are: libtest bench, Criterion and Iai.

Though part of the Rust standard library, libtest bench is still considered unstable, so it is only available on nightly compiler releases. To work on the stable Rust compiler, a separate benchmarking harness needs to be used. Neither libtest bench nor that separate harness is being actively developed, though.

The most actively maintained benchmarking harness within the Rust ecosystem is Criterion. It works on both stable and nightly Rust compiler releases, and it has become the de facto standard within the Rust community. Criterion is also much more feature-rich compared to libtest bench.

An experimental alternative to Criterion is Iai, from the same creator as Criterion. It uses instruction counts instead of wall clock time: CPU instructions, L1 accesses, L2 accesses and RAM accesses. This allows for single-shot benchmarking, since these metrics should stay nearly identical between runs.

We’re going to use the most popular option of the three, Criterion. All three benchmark harnesses are supported by Bencher, the continuous benchmarking tool that we will be using later in this series. To use Criterion to benchmark our code, we will need to refactor things a bit. Criterion does not play nicely with the compiler options required to generate eBPF bytecode. To work around this, we will move some of our eBPF helper functions over to a separate Rust library, called a crate, so they can be tested independently. For convenience, we’ll just add them to the crate we already have for our `SourceAddr` map messages.

Within that crate we will need to update our `Cargo.toml` file. `Cargo.toml` is like `make` for the 21st century. We will add the following:

1. [dev-dependencies]
2. criterion = "0.4"
3. [[bench]]
4. name = "source_addr"
5. harness = false

Going line by line:

1. The following lines contain development dependencies.
2. Add Criterion version 0.4 to our crate.
3. Configuration options for `cargo bench`.
4. The name of our benchmark test file is `source_addr`.
5. Do not use the default benchmarking harness (libtest bench).

Then we will update our `SourceAddr` code to add a `new` associated function:

```rust
            _ => None,
        })
    }
}
```

This `new` associated function does exactly the same work as before when creating a `SourceAddr` in the eBPF XDP program. It has just been transposed to be defined on the `SourceAddr` type directly. We also have to copy over our `is_fibonacci` helper function that we fixed earlier in this post.
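If you have not been following along with the earlier parts, here is a rough sketch of what such a `new` associated function can look like. The variant names and classification rules below are assumptions based on the `FizzBuzzFibonacci` narrative, not code copied from the project repository:

```rust
// Illustrative sketch only: the real SourceAddr lives in the shared crate,
// and the exact variant mapping is assumed, not copied, from the series.
#[derive(Debug, PartialEq)]
pub enum SourceAddr {
    Fizz,
    Buzz,
    FizzBuzz,
    Fibonacci,
    FizzBuzzFibonacci,
}

fn is_fibonacci(n: u8) -> bool {
    matches!(n, 0 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 | 144 | 233)
}

impl SourceAddr {
    pub fn new(source_addr: u32) -> Option<Self> {
        let fizz = source_addr % 3 == 0;
        let buzz = source_addr % 5 == 0;
        let fib = is_fibonacci(source_addr as u8);
        match (fizz, buzz, fib) {
            (true, true, true) => Some(SourceAddr::FizzBuzzFibonacci),
            (true, true, false) => Some(SourceAddr::FizzBuzz),
            (true, false, _) => Some(SourceAddr::Fizz),
            (false, true, _) => Some(SourceAddr::Buzz),
            (false, false, true) => Some(SourceAddr::Fibonacci),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(SourceAddr::new(15), Some(SourceAddr::FizzBuzz));
    assert_eq!(SourceAddr::new(13), Some(SourceAddr::Fibonacci));
    assert_eq!(SourceAddr::new(7), None);
    println!("sketch behaves as expected");
}
```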

Now we are ready to write some benchmarks! Inside of the `source_addr` benchmark test file that we noted in our `Cargo.toml`, we will add the following:

1. fn bench_source_addr(c: &mut Criterion) {
2. c.bench_function("source_addr", |b| {
3. b.iter(|| {
4. for i in 0..256 {
5. SourceAddr::new(i);
6. }
7. })
8. });
9. }
10. criterion_group!(benches, bench_source_addr);
11. criterion_main!(benches);

Going line by line:

1. Create a function that takes in a mutable Criterion benchmark collector.
2. Register a benchmark named `source_addr` with the collector.
3. The benchmark should be run many times to ensure precision.
4. This is our actual benchmark code, run for all numbers from 0 to 256 exclusive…
5. Create a new `SourceAddr` with that number.
10. Register our `bench_source_addr` with the `benches` test group.
11. Run the `benches` test group as our `main` function for the benchmark harness.

If we then run `cargo bench`, Criterion reports timing statistics for each of our benchmarks.

We have successfully created micro-benchmarks for our eBPF code! 🎉

To test things at a higher level, we will move now to macro-benchmarking. Things are going to get a bit more complicated as we are going to have to get our benchmark times from the Linux kernel itself. There are three options for doing so: `kernel.bpf_stats_enabled`, `bpftool prog profile` and `bpftool prog run`.

`kernel.bpf_stats_enabled` collects `run_time_ns` and `run_cnt` on all eBPF programs when it is enabled. However, it is disabled by default. It was added in Linux kernel version 5.1.
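Since `sysctl -w` only lasts until reboot, the setting can also be persisted with a standard sysctl configuration file (the file name below is an arbitrary choice, not something this project requires):

```
# /etc/sysctl.d/99-bpf-stats.conf
kernel.bpf_stats_enabled = 1
```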

The `bpftool prog profile` command is similar to Iai, which we looked at earlier. It collects instruction counts instead of wall clock time: CPU instructions, L1D loads, LLC misses and cycles. It was added in Linux kernel version 5.7, and it requires `bpftool` built with clang >= 10.0.0.

The command `bpftool prog run` runs a specific eBPF program. It must be provided with the input data and context (except for XDP), and it returns the output data and context. It was added in Linux kernel version 4.12. Unfortunately, it only works for a narrow subset of eBPF program types at this point, though hopefully this changes in the future:

• BPF_PROG_TYPE_CGROUP_SKB
• BPF_PROG_TYPE_FLOW_DISSECTOR
• BPF_PROG_TYPE_LWT_IN
• BPF_PROG_TYPE_LWT_OUT
• BPF_PROG_TYPE_LWT_SEG6LOCAL
• BPF_PROG_TYPE_LWT_XMIT
• BPF_PROG_TYPE_SCHED_ACT
• BPF_PROG_TYPE_SCHED_CLS
• BPF_PROG_TYPE_SOCKET_FILTER
• BPF_PROG_TYPE_XDP

Even though our example is an XDP program, which is supported by `bpftool prog run`, we are going to use `kernel.bpf_stats_enabled` as it is applicable to all eBPF program types. This will require us to update our userspace source code. We will make changes to this code to make it more testable.

1. #[tokio::main]
2. async fn main() -> Result<(), anyhow::Error> {
3. let opt = Opt::parse();
4. let shutdown = Arc::new(AtomicBool::new(false));
5. let ebpf_shutdown = shutdown.clone();
6. ebpf::run(&opt.iface, ebpf_shutdown).await?;
7. info!("Waiting for Ctrl-C...");
8. signal::ctrl_c().await?;
9. info!("Exiting...");
10. shutdown.store(true, Ordering::Relaxed);
11. Ok(())
12. }

Going line by line:

1. This is the same asynchronous main function as before.
2. Again, parse the command line arguments.
3. Create a thread-safe `shutdown` boolean, which is initialized to `false`.
4. Make a copy of our `shutdown` boolean.
5. Pass the interface command line argument and the copy of the `shutdown` boolean to our new `run` helper function.
6. Again, log that we are waiting for Ctrl-C.
7. And await Ctrl-C.
8. This time log that we are exiting.
9. But then set our `shutdown` boolean to `true`.
10. Return an empty `Ok`.

Now let’s take a look at that `run` helper function:

1. pub async fn run(iface: &str, shutdown: Arc<AtomicBool>) -> Result<Process, anyhow::Error> {
2. env_logger::init();
3. let mut bpf = Bpf::load(include_bytes_aligned!("../path/to/ebpf-bin"))?;
4. BpfLogger::init(&mut bpf)?;
5. let program: &mut Xdp = bpf.program_mut("fun_xdp").unwrap().try_into()?;
6. program.load()?;
7. program.attach(iface, XdpFlags::default())?;
8. let pid = std::process::id();
9. let prog_fd = bpf.program("fun_xdp").unwrap().fd().unwrap().as_raw_fd();
10. let handle = tokio::spawn(async move { spawn_agent(&mut bpf, shutdown).await });
11. Ok(Process {
12. pid,
13. prog_fd,
14. handle,
15. })
16. }
17. pub struct Process {
18. pub pid: u32,
19. pub prog_fd: i32,
20. pub handle: JoinHandle<Result<(), anyhow::Error>>,
21. }

Going line by line:

1. This is another asynchronous function that takes in the network interface to attach to `iface` and a thread-safe `shutdown` boolean. The `Result` is `Process` if `Ok` and a catch-all `Err` otherwise.
2. Initialize logging for userspace.
3. Load our compiled eBPF bytecode. Aya makes recompiling our eBPF source code into bytecode easy: it happens automatically before our userspace code is compiled.
4. Initialize logging from our eBPF program.
5. From our eBPF bytecode, get our `fun_xdp` eBPF XDP program.
6. Load the `fun_xdp` eBPF XDP program into the kernel using the default flags.
7. Attach our `fun_xdp` eBPF XDP program to a network interface that was set by the `iface` command line argument to our binary.
8. Get the current process ID.
9. Get the file descriptor for the `fun_xdp` eBPF XDP program.
10. Create a task handle for our updated `spawn_agent` helper function that we will look at next.
11. Return an `Ok` `Process`
12. With the process ID
13. eBPF XDP program file descriptor
14. And the `spawn_agent` task handle
17. The type definition for a `Process`, including
18. A process ID
19. A program file descriptor
20. And an asynchronous task handle

Creating this separate `run` function allows us to perform integration-level testing for our macro-benchmarks. Finally, let’s take a look at that updated `spawn_agent` function:

1. async fn spawn_agent(bpf: &mut Bpf, shutdown: Arc<AtomicBool>) -> Result<(), anyhow::Error> {
2. let mut xdp_map: aya::maps::Queue<_, SourceAddr> =
4. loop {
5. while let Ok(source_addr) = xdp_map.pop(0) {
7. }
8. if shutdown.load(Ordering::Relaxed) {
9. break;
10. }
11. }
12. Ok(())
13. }

This updated version of `spawn_agent` is nearly identical to its previous incarnation, except we now check the `shutdown` boolean on each iteration at lines 8 through 10. This allows for a clean shutdown from our custom benchmarking harness.
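The shutdown handshake itself is plain `std` and worth seeing in isolation. Here is a minimal sketch of the same pattern, with an ordinary thread standing in for the asynchronous `spawn_agent` task:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    let shutdown = Arc::new(AtomicBool::new(false));
    let worker_shutdown = shutdown.clone();

    // Stand-in for spawn_agent: loop doing work until the flag flips.
    let worker = thread::spawn(move || {
        let mut iterations = 0u64;
        loop {
            iterations += 1; // drain the queue, process messages, etc.
            if worker_shutdown.load(Ordering::Relaxed) {
                break;
            }
        }
        iterations
    });

    thread::sleep(Duration::from_millis(10));
    shutdown.store(true, Ordering::Relaxed);
    let iterations = worker.join().unwrap();
    assert!(iterations > 0);
    println!("worker exited cleanly after {iterations} iterations");
}
```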

Creating a custom benchmarking harness in Rust is not actually as complicated as it sounds. First we will update our `Cargo.toml` file:

```toml
[dev-dependencies]
inventory = "0.3"

[[bench]]
name = "xdp"
harness = false
```

Just as with the micro-benchmarks, we will need to add a development dependency. This time, though, since we’re building our own, we will use the `inventory` crate instead of Criterion. We’ll also configure `cargo bench` to run our `xdp` benchmark test file and not use the default harness.

That `xdp` benchmark test file will contain our custom test harness:

```rust
#[derive(Debug)]
pub struct EBpfBenchmark {
    pub name: &'static str,
    pub benchmark_fn: fn() -> f64,
}

inventory::collect!(EBpfBenchmark);
```

Each of our benchmarks will be represented as an `EBpfBenchmark`, with a `name` and function to run. We then use the `inventory` crate to collect all `EBpfBenchmark`s at compile time.

Next let’s look at the `main`:

1. fn main() {
2. let mut results = Vec::new();
3. for benchmark in inventory::iter::<EBpfBenchmark> {
4. let benchmark_name = benchmark.name.parse().unwrap();
5. let json_metric = JsonMetric::new((benchmark.benchmark_fn)(), None, None);
6. results.push((benchmark_name, json_metric));
7. }
8. let json_results = results.into_iter().collect::<HashMap<_, _>>();
9. let json_str = serde_json::to_string(&json_results).unwrap();
10. println!("{json_str}");
11. let mut file = File::create("../target/results.json").unwrap();
12. file.write_all(json_str.as_bytes()).unwrap();
13. }

Going line by line:

1. This is the `main` function for our custom benchmark harness.
2. Create a vector of `results`.
3. For each `EBpfBenchmark` that was collected by `inventory` at compile time…
4. Parse the benchmark name.
5. Run the benchmark function and store the result as a JSON value.
6. Add the result to the `results`.
8. Create a JSON object containing all of the results.
9. Serialize the JSON results to a string.
10. Print the JSON results string.
11. Create a results file.
12. Save the JSON results string to the results file.
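Stripped of the `inventory` registry and the serde plumbing, a custom harness is just a list of named benchmark functions whose results get collected and written out. The following dependency-free sketch, with a fabricated `fake_latency` benchmark standing in for the real thing, shows the same idea end to end:

```rust
use std::fs::File;
use std::io::Write;

// Same shape as EBpfBenchmark, minus the compile-time registry.
struct Benchmark {
    name: &'static str,
    benchmark_fn: fn() -> f64,
}

fn fake_latency() -> f64 {
    18_249.7 // a real benchmark would measure something here
}

fn main() {
    let benchmarks = [Benchmark { name: "fun_xdp", benchmark_fn: fake_latency }];

    // Run every registered benchmark and collect (name, value) pairs.
    let mut results = Vec::new();
    for benchmark in &benchmarks {
        results.push((benchmark.name, (benchmark.benchmark_fn)()));
    }

    // Hand-rolled JSON in roughly the shape shown later in this post.
    let mut json = String::from("{");
    for (i, (name, value)) in results.iter().enumerate() {
        if i > 0 {
            json.push(',');
        }
        json.push_str(&format!("\"{name}\":{{\"latency\":{{\"value\":{value}}}}}"));
    }
    json.push('}');
    println!("{json}");

    let mut file = File::create("results.json").unwrap();
    file.write_all(json.as_bytes()).unwrap();
}
```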

Now we are ready to add our first benchmark to our `inventory`:

```rust
inventory::submit!(EBpfBenchmark {
    name: "fun_xdp",
    benchmark_fn: fun_xdp_benchmark,
});
```

This submits a benchmark named `fun_xdp` with a benchmark function `fun_xdp_benchmark` that looks like:

1. fn fun_xdp_benchmark() -> f64 {
2. let rt = Runtime::new().unwrap();
3. let shutdown = Arc::new(AtomicBool::new(false));
4. let ebpf_shutdown = shutdown.clone();
5. let process = rt.block_on(async { ebpf::run(IFACE, ebpf_shutdown).await.unwrap() });
6. let _resp = rt.block_on(async { reqwest::get("https://bencher.dev").await.unwrap() });
7. let bpf_stats = get_bpf_stats(&process);
8. shutdown.store(true, Ordering::Relaxed);
9. bpf_stats
10. }

Going line by line:

1. This `fun_xdp_benchmark` function takes in no arguments and returns a 64-bit floating point number.
2. Create an asynchronous runtime.
3. Create a thread-safe `shutdown` boolean, which is initialized to `false`.
4. Make a copy of our `shutdown` boolean.
5. Pass a statically defined test network interface and the copy of the `shutdown` boolean to our `run` helper function. Block on this call.
6. Create some network traffic, say visiting the home page for the best continuous benchmarking tool. Block on this call.
7. Get the `bpf_stats` for the current process by calling our `get_bpf_stats` helper function.
8. Set the `shutdown` boolean to `true`.
9. Return the `bpf_stats`.

Taking a look at the `get_bpf_stats` helper function:

1. fn get_bpf_stats(process: &Process) -> f64 {
2. let fd_info = File::open(format!("/proc/{}/fdinfo/{}", process.pid, process.prog_fd)).unwrap();
3. let reader = BufReader::new(fd_info);
4. let (mut run_time_ns, mut run_ctn) = (None, None);
5. for line in reader.lines().flatten() {
6. if let Some(l) = line.strip_prefix("run_time_ns:") {
7. run_time_ns = l.trim().parse::<u64>().ok();
8. } else if let Some(l) = line.strip_prefix("run_cnt:") {
9. run_ctn = l.trim().parse::<u64>().ok();
10. }
11. }
12. match (run_time_ns, run_ctn) {
13. (Some(run_time_ns), Some(run_ctn)) if run_ctn != 0 => run_time_ns as f64 / run_ctn as f64,
14. _ => 0.0,
15. }
16. }

Going line by line:

1. `get_bpf_stats` takes in a reference to a `Process` and returns a 64-bit floating point number.
2. Open the `fdinfo` pseudo-file for our eBPF XDP program.
3. Create a buffered reader for the `fd_info` file so we don’t load it into memory all at once.
4. Initialize our `run_time_ns` and `run_ctn` statistics to `None`.
5. Read each line in the `fd_info` file.
6. If the line starts with `run_time_ns:`
7. Then trim and parse the remainder as a 64-bit unsigned integer.
8. Otherwise, if the line starts with `run_cnt:`
9. Then trim and parse the remainder as a 64-bit unsigned integer.
10. Check the value of `run_time_ns` and `run_ctn`
11. If `run_time_ns` and `run_ctn` have both been set to `Some` and `run_ctn` doesn’t equal zero, then find the average run time by dividing `run_time_ns` by `run_ctn`.
12. Otherwise just return zero
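The parsing logic is easy to verify against a fabricated `fdinfo` snippet. In this self-contained sketch, the numbers are made up rather than real kernel output:

```rust
use std::io::{BufRead, BufReader};

// The same parsing as get_bpf_stats, but over an in-memory string.
fn parse_bpf_stats(fd_info: &str) -> f64 {
    let reader = BufReader::new(fd_info.as_bytes());
    let (mut run_time_ns, mut run_cnt) = (None, None);
    for line in reader.lines().flatten() {
        if let Some(l) = line.strip_prefix("run_time_ns:") {
            run_time_ns = l.trim().parse::<u64>().ok();
        } else if let Some(l) = line.strip_prefix("run_cnt:") {
            run_cnt = l.trim().parse::<u64>().ok();
        }
    }
    match (run_time_ns, run_cnt) {
        (Some(run_time_ns), Some(run_cnt)) if run_cnt != 0 => run_time_ns as f64 / run_cnt as f64,
        _ => 0.0,
    }
}

fn main() {
    // Fabricated fdinfo contents: 1,000 ns over 8 runs is a 125 ns average.
    let fd_info = "prog_type:\t6\nrun_time_ns:\t1000\nrun_cnt:\t8\n";
    assert_eq!(parse_bpf_stats(fd_info), 125.0);
    // A missing or zero run_cnt falls back to 0.0.
    assert_eq!(parse_bpf_stats("run_time_ns:\t1000\n"), 0.0);
    println!("average run time: {} ns", parse_bpf_stats(fd_info));
}
```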

Now we are almost ready to run our macro-benchmark! First, we have to enable `bpf_stats` as it is disabled by default.

```
$ sudo sysctl -w kernel.bpf_stats_enabled=1
kernel.bpf_stats_enabled = 1
```

Then we need to compile our eBPF code in release mode, as our benchmarks are run in release mode.

```
$ cargo xtask build-ebpf --release
```

Then, inside of the userspace crate, we can finally run our benchmarks. We will have to run as `root`, though, because loading an eBPF program requires elevated permissions.

```
$ sudo -E $(which cargo) bench
    Finished bench [optimized] target(s) in 0.13s
     Running benches/xdp.rs (/home/epompeii/Code/bencher/examples/ebpf/target/release/deps/xdp-d7c85fd4c85d089e)
{
  "fun_xdp": {
    "latency": {
      "value": 18249.727272727272,
      "lower_bound": null,
      "upper_bound": null
    }
  }
}
```

Boom! The output is the JSON we serialized and printed in our `main` function above. The JSON format used is Bencher Metric Format (BMF). BMF is the JSON format for Bencher, the continuous benchmarking tool that we will use later in this series. Macro-benchmarking an eBPF program in Rust with a custom benchmarking harness is complete! 🎉

In this post, we have come a long way toward preventing performance regressions. We have refactored both our eBPF and userspace source code to be more testable, and added both micro-benchmarks and macro-benchmarks. Working locally, we can easily see the performance improvements and regressions our changes make. For the same reasons that unit tests are run in CI to prevent feature regressions, benchmarks should also be run in CI to prevent performance regressions. This will require a continuous benchmarking tool. In the next and final installment in this series, we will look at using Bencher to track both our micro- and macro-benchmarks to catch performance regressions in CI.
