If you’ve ever inspected a running Linux kernel, you probably heard of ftrace. It’s an amazing tool for tracing flows and events in the kernel, and it is very versatile — but what if it doesn’t fully support your use-case?
Recently, while conducting research, I wanted to follow a specific kernel code flow: I was about to modify the behavior of an existing kernel mechanism, and I needed to understand the flow of data between different functions.
As with any open source project you’re trying to analyze, your go-to assistant is the source code. Linux, besides being open source, provides tons of dynamic tools for debugging, and they are a great supplement to static reading of the code. In this post, I’ll focus on one of these tools, ftrace, and present a modification I made to make its filtering capabilities more versatile.
At its base, ftrace (Function Tracer) is a dynamic function instrumentation infrastructure. It can be used to set dynamic traces on virtually all kernel functions, and also supports a large set of static tracepoints, used to record core kernel events. It is available in most modern Linux distributions.
It’s a powerful tool for tracing flows and events in the kernel. Providing both in-kernel API and usermode control interface via
tracefs, it’s arguably one of the most comprehensive tracing tools in Linux.
ftrace has many modes of operation, in this post I’ll focus on the
function_graph mode: It allows you to record the call graph originating from a given function, recursively.
In case you’ve never used it (or just as a slight reminder) — here’s how it looks when tracing the kernel path for a
vfs_read call on
That call graph can provide great aid in understanding flow, alongside reading the code.
This isn’t a starters guide for ftrace, so I will assume you know its basics from this point onward. If you want a longer introduction on ftrace and its additional modes, I recommend reading these articles: Debugging the kernel using Ftrace , Ftrace: The hidden light switch , and ftrace: trace your kernel functions.
Back here, I was trying to understand the receive flow of a specific TCP packet (which functions does it go through).
The TCP receive entry-point function is tcp_v4_rcv (https://elixir.bootlin.com/linux/v5.0/source/net/ipv4/tcp_ipv4.c#L1781). We can
function_graph it, but we’ll soon encounter a problem — there are heaps of data generated by graph-tracing this function: On my desktop, under a not-so-heavy network load, I was getting a 5MB/s trace file! That’s a lot of data.
There are two limitations pointed out here:
- The calls didn’t undergo any filtering — all calls were written to the trace file. I wanted to see the graph of
tcp_v4_rcvon a specific packet, not on all TCP packets processed by the system.
- There’s no information attached to the calls (for example, function arguments). Take a look at the sample graph above — there are only function names and durations. How can we differentiate the “interesting” calls?
Let’s seek help from existing ftrace features. For this purpose, we can start with the very long documentation.
ftrace can record the current PID/TGID and the command (
/proc/pid/comm), and report them in the trace file. As far as I can tell, no more specific details are available to be recorded on each entry.
ftrace also supports filtering by PID: Using the
set_ftrace_pid file, we can limit the traced calls to specific processes. For example, this will limit traces to the init process:
# cd /sys/kernel/tracing
# echo 1 > set_ftrace_pid
Sounds nice, eh? For many filtering use-cases, this is just enough: A simple program that writes its PID to
set_ftrace_pid and then invokes some kernel code you wish to trace.
In the network RX case, it’s impossible (though strictly speaking, this can be implemented by pinning a RX flow to a specific CPU [and its ksoftirqd/cpu thread]. But I wanted a more generic solution.): Network packets are processed in software interrupts running possibly on any CPU, in the context of virtually any thread of the system. There goes filtering by PID. (In Linux, software interrupts are a set of common, high priority jobs executed regularly in response to certain kernel/hardware events. Sometimes they complete work deferred from a hardware interrupt, but not necessarily. You can read more about them here).
I’m not aware of any other sophisticated filters allowed by ftrace. We can try to filter out the events externally (e.g in the case of network RX — filter away unwanted packets with iptables, this way they won’t end up in
… or, we can tweak ftrace to provide the filtering we need!
Tweaking ftrace’s Filtering
All following Linux references are based on kernel 5.0+.
What I had in mind: ftrace has its PID-based filtering. I can provide external filtering, and after a call, I wish to trace passes my filter, the PID of the currently running process can be “added” to
set_ftrace_filter, effectively making the process traced (and I’ll get the call graph I desired)
What happens when you write to
set_ftrace_pid? A quick search of this string yields
ftrace_init_tracefs (https://elixir.bootlin.com/linux/v5.0/source/kernel/trace/ftrace.c#L6593) which registers the tracefs file; And
ftrace_pid_write(https://elixir.bootlin.com/linux/v5.0/source/kernel/trace/ftrace.c#L6526) actually implements the write.
ftrace_pid_write is rather long for its seemingly simple purpose. Summing it up, this function does three things:
- Creates a
trace_pid_liststruct from the input PIDs (you can write multiple PIDs for tracing), and stores it for later use.
- Registers a
sched_switchevent handler called ftrace_filter_pid_sched_switch_probe https://elixir.bootlin.com/linux/v5.0/source/kernel/trace/ftrace.c#L6336).
- Updates tracers that use PID filtering to have ftrace_pid_func (https://elixir.bootlin.com/linux/v5.0/source/kernel/trace/ftrace.c#L146) as their tracing handler.
Effectively, the 3rd step enables filtering.
ftrace_pid_func is very simple: It is called instead of the original tracing handler and checks whether the current task should be traced or not, by
reading tr->trace_buffer.data->ftrace_ignore_pid. If (and only if) tracing should be performed, it calls the original handler.
Let’s see where this field is updated.
ftrace_pid_write has previously registered
ftrace_filter_pid_sched_switch_probe for the
sched_switch event. This handler will be called for each task switch in the system. It knows which is the next task to run.
ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
struct task_struct *prev, struct task_struct *next)
struct trace_array *tr = data;
struct trace_pid_list *pid_list;
pid_list = rcu_dereference_sched(tr->function_pids);
It calls trace_ignore_this_task (https://elixir.bootlin.com/linux/v5.0/source/kernel/trace/trace.c#L354)to see whether the new task’s PID was marked for tracing:
trace_ignore_this_task(struct trace_pid_list *filtered_pids, struct task_struct *task)
* Return false, because if filtered_pids does not exist,
* all pids are good to trace.
return !trace_find_filtered_pid(filtered_pids, task->pid);
The result is written into the
ftrace_ignore_pid field, itself a percpu. Per-CPU variables are variables that have multiple copies, one for each CPU of the system. Kernel code can easily access the percpu instance for the currently executing CPU. If an object can be logically bound to a CPU, then defining it as a percpu variable will remove all locking needs, because each CPU has its own copy (This has many other uses, and you can read about its kernel implementation here). This is a caching optimization: instead of filtering PIDs through the list whenever a task hits a traced function, the check of a PID is done once on task switch, and later the tracing decision is a mere cache read of this single boolean. Neat, and a very nice use of percpu as well! The result will be automatically invalidated upon the next call to
ftrace_filter_pid_sched_switch_probe, which will write a new value for the next task.
I think we know enough. If we can update the percpu
ftrace_ignore_pid, we can control whether tracing is enabled for the current CPU. If we can set it once we want to start tracing and unset it once we want to stop, we can use it to limit filtering at will.
So, I started writing a kernel module to work this out on my test machine. Shortly enough, I recalled I worked on a MicroPython port for the Linux kernel just for such occasions! Using it, we can quickly define a few hooks that’ll do the job.
Our goal: given an input TCP packet that matches a precondition (say, destination port is 9000) we’d like to get its ftrace call-graph of
We’ll start with a hook that’ll do the actual filtering on the packets. It has to be called before
ftrace-graph, otherwise, we won’t be able to control the filtering! If I hook
tcp_v4_rcv itself with ftrace, it runs after the ftrace-graph filtering decision, which is too late for us. I’m sure there’s a way around it, but I didn’t bother checking — instead, I hooked the
tcp_protocol->handler pointer, which is used to call
tcp_v4_rcv. This way, we’re surely called before it.
At this point, we have the
sk_buff and we can make the tracing decision. If we wish to trace, we can write to
tr->trace_buffer.data->ftrace_ignore_pid… Which is not in the exported headers, sadly.
I wanted this to be fully compatible with standard kernel headers, so accessing unexported structs is a bad idea. A workaround is to hook
trace_ignore_this_task with a simple handler that returns one only for our traced task. Then we can call
ftrace_filter_pid_sched_switch_probe as if a scheduler switch has happened. It will update the tracing status for the current task — and since we forced
trace_ignore_this_task to return 0, the current task will be marked for tracing.
Funny anecdote: ftrace won’t allow placing hooks on ftrace own code (such as
trace_ignore_this_task). But you can place
kprobe hooks, which are sometimes backed by ftrace, but in this case, it’s useful they are not :)
Using these two hooks, I can enable filtering based on the input
sk_buff. The final snippet is available here.
It can be pasted into the Python shell (in paste mode) to enable the hooks. Finally, to start tracing we’ll do the following:
# cd /sys/kernel/tracing
# echo 2 > set_ftrace_pid
# echo tcp_v4_rcv > set_graph_function
# echo function_graph > current_tracer
Two is the PID of
kthreadd — we need to write something into
PID-based filtering is enabled. This kernel thread is mostly asleep (it’s the thread from which new kernel threads are forked. As it isn’t scheduled in often, the chances it’ll handle any software interrupts are rare).
That’s it! After enabling ftrace and sending the to-be-filtered packet, I received exactly one trace in the trace buffer! It took some time, but this trace sure was helpful in helping me understand the relevant flow.
What we can take from here, besides the new ability of dynamic filtering, is this: As long as you know exactly what you seek, you can do it in Linux, even if it’s not supported in the code. All it takes is a bit of digging in the code, and perhaps adding a tiny patch here and there.
Note: The idea presented here — and the kernel Python itself — are both experimental. Do NOT try them on a production system. They are given as research and debugging tools for use on test machines only!
The Linux Foundation is a sponsor of The New Stack.
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Granulate.