
Linux Networking, Tracing and IO Visor, a New Systems Performance Tool for a Distributed World

28 Sep 2015 11:36am, by

There is more raw power in silicon per square millimeter than there has ever been. With the advent of distributed computing — and cloud and container technology running on specialized hardware — performance analysis tools are primarily facing two challenges:

  • Providing performance analysis across all these layers in a uniform, yet flexible way.
  • Doing it with minimum impact on the running system.

The main issue we discuss in this article is minimizing the overhead, or impact, on the system and giving the analysis tools as much flexibility as possible. We start with a small primer on tracing tools before moving on to IO Visor internals.

Tracing

Traditional debugging and profiling tools cannot always pinpoint errors with confidence. With debuggers, you could miss transient errors, which occur only once in a while over very long runs. Pausing the application in a debugger can also distort its time-dependent execution profile, so you may miss timing-related bugs such as race conditions because the program runs more slowly than it otherwise would. Aptly, these are called heisenbugs.

To overcome this diagnosis nightmare, we use tracing: Run the program while recording information about its execution with minimal overhead. Similar to breakpoints in the code, special tracepoints can be inserted either statically (compile time) or dynamically (runtime). Operating system kernels have been a very important target for these tracers, owing to their complexity and importance in any infrastructure. Solaris, Linux and even Windows have had support for tracing tools for quite some time now. Tracing can be performed at both kernel and userspace level and has proved to be an indispensable tool for performance analysis.

Dynamic Tracing

The most important factor for ease of use and flexibility is the ability to dynamically trace the kernel and userspace applications at specific probe points. Dynamic tracing serves two purposes: First, the probes can be inserted as required, and second, we can remove/disable the probes when we are done with the analysis. Thus, we can precisely control the overhead of the tracing tool on the system.
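
To make the attach/detach idea concrete, here is a minimal user-space sketch in Python. It is only an analogy: the Probe class, the app object and open_file() are hypothetical names invented for illustration, and real kernel probes such as kprobes work quite differently under the hood. The point is simply that the recording code exists only while the probe is attached.

```python
import functools
import types


class Probe:
    """Hypothetical user-space stand-in for a dynamic probe (not a kernel API)."""

    def __init__(self, owner, name):
        self.owner = owner
        self.name = name
        self.original = getattr(owner, name)
        self.hits = []  # arguments seen while the probe was attached

    def attach(self):
        original = self.original
        hits = self.hits

        @functools.wraps(original)
        def wrapper(*args, **kwargs):
            hits.append(args)  # record the event, then run the real function
            return original(*args, **kwargs)

        setattr(self.owner, self.name, wrapper)

    def detach(self):
        # Restore the untouched function, so no tracing cost remains.
        setattr(self.owner, self.name, self.original)


def open_file(path):
    return "handle:" + path


app = types.SimpleNamespace(open_file=open_file)
probe = Probe(app, "open_file")

app.open_file("/etc/hosts")   # probe not attached yet: not recorded
probe.attach()
app.open_file("/etc/passwd")  # recorded
probe.detach()
app.open_file("/tmp/x")       # probe removed: not recorded

print(probe.hits)  # [('/etc/passwd',)]
```

Only the call made while the probe was attached is recorded; before and after that, open_file() runs unmodified.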

As early as 2000, the LTT + DProbes project achieved an amazing feat: dynamic tracing for Linux; however, the project wasn’t integrated into the Linux kernel. In the intervening years, other projects have reinvented dynamic tracing, including Sun’s heavily-marketed and production-safe DTrace tool. Fortunately for Linux, various smaller-scale capabilities have been integrated over the years, including tracepoints, ftrace, perf_events, kprobes and uprobes. Numerous newer projects for Linux — including SystemTap, LTTng, and ktap — only build upon these tracing capabilities. For example, SystemTap uses the uprobes and kprobes mechanism to provide dynamic tracing. LTTng provides kprobes-based dynamic kernel tracing as well. However, these newer tools have not been integrated into the Linux kernel.

To save the day, another effort has started in the form of eBPF, which has found its way into the kernel mainline. As of Linux 4.3, it’s now possible to build a feature-rich tracer for Linux by writing a user space frontend to what is built into the kernel. That’s what the IO Visor and BCC projects are inching towards. Let’s have a look at them in detail.

If a performance engineer is given an in-kernel facility to collect, probe or filter data dynamically, it becomes an instant hit!

IO Visor

IO Visor was recently announced as a Linux Foundation collaborative project during LinuxCon Seattle. Even though tech jargon can be a bit vague sometimes, I would still try to define it as an infrastructure to efficiently and securely exploit I/O and networking applications in Linux – making them programmable and dynamic.

The main technology behind IO Visor is the extended Berkeley Packet Filter (eBPF). For quite some time, the Linux kernel has included classic BPF, a tiny packet-filtering VM. Apart from packet filtering, it was also used in seccomp for syscall filtering. Some kernel developers then decided to take this facility beyond its origins in the kernel network subsystem and make it more generic. They extended it, improving its architecture and adding support for JIT compilation of the BPF bytecode. It also comes with a new, easy-to-use bpf() syscall and efficient data-sharing mechanisms in the form of BPF maps. And before we knew it, the classic BPF had evolved into eBPF!

Now, we also need a way to express higher-level tracing needs in the form of a script/program, and then have a mechanism to convert that to eBPF bytecode for insertion and subsequent execution in the kernel eBPF VM. IO Visor developers soon started working on efficient ways to generate the eBPF bytecode and settled on an LLVM backend-based approach. From LLVM v3.7 onwards, a new BPF target can be used to generate eBPF bytecode using clang. Things are getting interesting now. Have a look at how IO Visor is shaping up:

[Figure: IO Visor block diagram]

Now an eBPF program can be written in a “restricted C”-like format and converted to an eBPF binary. The eBPF bytecode can then be extracted from the generated binary and loaded into the kernel by the bpf() syscall for eventual execution by the eBPF VM. A tiny eBPF JIT compiler can convert this bytecode to native code, making eBPF program execution roughly three times faster on average.

This speed is crucial. Why? Because an eBPF program can be dynamically attached to certain locations, such as kprobes and socket filters. This means that at each kprobe hit, the BPF program will be executed. To minimize execution overhead, we need this eBPF program to be as fast as possible. With BPF maps, the dynamically inserted code can collect and aggregate data in hash maps or arrays and pass it to userspace for further processing, analysis and display. With such an infrastructure for executing code securely in the kernel, the possibilities are endless, specifically in networking, tracing and other performance analysis domains.
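
As a rough sketch of the map-based aggregation described above, the following BCC script counts open() syscalls per PID in a BPF hash map and reads the totals from userspace after a few seconds. Treat it as an illustration under assumptions rather than a finished tool: the names counts, count_open, format_counts and main are my own, it assumes a BCC version that provides get_syscall_fnname(), and it needs root to load the program.

```python
#!/usr/bin/env python
import os
import time

# Restricted-C program; `counts` and `count_open` are illustrative names.
prog = """
BPF_HASH(counts, u32, u64);   // key: PID, value: number of open() calls

int count_open(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;  // upper 32 bits hold the PID
    counts.increment(pid);    // aggregate in the kernel, no per-event output
    return 0;
}
"""


def format_counts(pairs):
    """Format (pid, count) pairs as text lines, busiest process first."""
    lines = ["{:>8} {:>8}".format("PID", "OPENS")]
    for pid, count in sorted(pairs, key=lambda p: p[1], reverse=True):
        lines.append("{:>8} {:>8}".format(pid, count))
    return lines


def main(duration=5):
    try:
        from bcc import BPF  # imported lazily so the helper above works without BCC
    except ImportError:
        print("bcc is not installed; nothing traced")
        return
    if os.geteuid() != 0:
        print("run as root to load BPF programs")
        return
    b = BPF(text=prog)
    # Assumption: this BCC version can resolve the kernel's name for open().
    b.attach_kprobe(event=b.get_syscall_fnname("open"), fn_name="count_open")
    time.sleep(duration)  # let the in-kernel map fill up
    pairs = [(k.value, v.value) for k, v in b["counts"].items()]
    print("\n".join(format_counts(pairs)))


if __name__ == "__main__":
    main()
```

The design choice worth noticing is that nothing crosses into userspace per event; the kernel side only bumps a counter, and the script reads the whole map once at the end.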

BCC

BPF Compiler Collection (BCC) is an ongoing effort by multiple contributors like PLUMgrid and Big Switch Networks to create an improved, simpler approach for using eBPF for tracing and networking needs. Using BCC, a developer can use a mix of the BPF syntax and Python to create complex performance analysis tools very easily.

Tracing Example

To use eBPF with BCC, have a look at the requirements and installation instructions first. Here is a simple example of an eBPF program I just came up with that prints “sys_open”, along with the process name and a timestamp, each time a process tries to make an open() syscall:

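#!/usr/bin/env python

from bcc import BPF

# Restricted-C program that runs in the kernel each time the probe fires.
prog = """
int message(void *ctx) {
  bpf_trace_printk("sys_open\\n");
  return 0;
}
"""

# Compile the program and attach it to a kprobe on sys_open().
b = BPF(text=prog)
b.attach_kprobe(event="sys_open", fn_name="message")

# Print a header, then stream the task name {0}, timestamp {4} and message {5}.
print("{:>20} {:>20} {:>10}".format("TASK", "TIMESTAMP", "MESSAGE"))
b.trace_print(fmt="{0:>20} {4:>20} {5:>10}")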

Here, the BPF program message() is essentially a call to the bpf_trace_printk() helper function, and the program is attached to a kprobe on the sys_open() function in the Linux kernel. The output is written to the kernel’s debugfs trace pipe (/sys/kernel/debug/tracing/trace_pipe), which trace_print() then prints to stdout with proper formatting. I chose to extract and print the task name {0}, timestamp {4} and the message {5} out of the trace pipe. Here is how the output from the trace tool looks:

$ sudo ./opentrace.py
                TASK            TIMESTAMP    MESSAGE
                 vim        5022410.92052   sys_open
                 vim        5022410.92054   sys_open
                 vim        5022410.92058   sys_open
                 vim        5022413.99273   sys_open
                 vim        5022413.99282   sys_open
          irqbalance        5022416.97293   sys_open
          irqbalance        5022416.97326   sys_open
          irqbalance        5022416.97339   sys_open
             systemd        5022422.77757   sys_open
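
The format string passed to trace_print() picks fields out of each trace line by position. As a rough sketch, assuming the fields arrive in the order (task, pid, cpu, flags, timestamp, message), which is how I understand BCC to split a trace line, the same formatting can be reproduced in plain Python. The PID, CPU and flags values below are made up for illustration; only the task, timestamp and message come from the output above.

```python
# One trace line split into fields (field order is an assumption, see above).
# The pid, cpu and flags values are hypothetical samples.
fields = ("vim", 1234, 0, "d...", 5022410.92052, "sys_open")

header = "{:>20} {:>20} {:>10}".format("TASK", "TIMESTAMP", "MESSAGE")
line = "{0:>20} {4:>20} {5:>10}".format(*fields)  # task, timestamp, message

print(header)
print(line)

# Adding {1} to the format string would also show the PID.
with_pid = "{0:>20} {1:>8} {4:>20} {5:>10}".format(*fields)
print(with_pid)
```

Each index in braces refers to a position in the tuple, so showing more or fewer columns is only a matter of editing the format string.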

As you can see, with a few lines of BCC and eBPF code, one can create an inexpensive dynamic tracing tool! This was one of the simplest examples, and IO Visor technology is getting stronger now. Have a look at some BCC examples and tools, which can now gather data more efficiently and accurately thanks to the low overhead of eBPF. Brendan Gregg, along with PLUMgrid developers like Brenden Blanco, has been working to improve BCC to make developing tracing tools easier and more efficient. To get you more interested, here is a sample output from one of my favorite tools, funclatency, which Brendan recently ported to eBPF:

$ ./funclatency -u do_sys_open
Tracing do_sys_open... Hit Ctrl-C to end.
^C
usecs               : count     distribution
    0 -> 1          : 20       |******                                  |
    2 -> 3          : 125      |****************************************|
    4 -> 7          : 83       |**************************              |
    8 -> 15         : 18       |*****                                   |
   16 -> 31         : 7        |**                                      |
   32 -> 63         : 2        |                                        |
Detaching...

You can see how many times do_sys_open() was called in the kernel and the distribution of the time it took. Most often, the function latency was two to three microseconds. Brendan Gregg presented more tools in his latest blog post as well. You can learn more about BCC and its “superpowers” at BCC’s GitHub repo. If your appetite for IO Visor and its underlying technologies has still not been satisfied, have a closer look under the hood by reading some docs.
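
To read the funclatency histogram above: each bucket covers a power-of-two range of microseconds, so the bucket for a given latency follows from its highest set bit. The sketch below is plain Python, not funclatency’s actual code; log2_bucket(), histogram() and render() are hypothetical helpers. It re-derives the bucket bounds and summarizes the counts copied from the sample output.

```python
def log2_bucket(usecs):
    """Return the (low, high) power-of-two bucket for a non-negative integer."""
    if usecs <= 1:
        return (0, 1)
    low = 1 << (usecs.bit_length() - 1)  # highest power of two <= usecs
    return (low, 2 * low - 1)


def histogram(samples):
    """Count samples per power-of-two bucket."""
    counts = {}
    for s in samples:
        bucket = log2_bucket(s)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def render(counts, width=40):
    """Draw one text bar per bucket, scaled so the busiest bucket fills the width."""
    peak = max(counts.values())
    lines = []
    for (low, high), count in sorted(counts.items()):
        stars = "*" * (count * width // peak)
        lines.append("{:>6} -> {:<8} : {:<8} |{}|".format(
            low, high, count, stars.ljust(width)))
    return lines


# Counts copied from the funclatency output above.
article_counts = {(0, 1): 20, (2, 3): 125, (4, 7): 83,
                  (8, 15): 18, (16, 31): 7, (32, 63): 2}

for row in render(article_counts):
    print(row)

total = sum(article_counts.values())
under_8 = sum(c for (low, high), c in article_counts.items() if high <= 7)
print("calls: {}".format(total))                                             # calls: 255
print("2-3 us: {:.1f}%".format(100.0 * article_counts[(2, 3)] / total))      # 2-3 us: 49.0%
print("under 8 us: {:.1f}%".format(100.0 * under_8 / total))                 # under 8 us: 89.4%
```

With these counts, about 49% of the 255 calls fall in the 2 to 3 microsecond bucket and about 89% finish in under 8 microseconds, which fits the reading that the latency was usually two to three microseconds.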

Thanks to Alexei Starovoitov for his recent work on eBPF in the Linux Kernel, and Brendan Gregg for his input and feedback on this article.

Feature image: “Traces in the sand” by fdecomite is licensed under CC BY 2.0.

