Watching activity in the kernel with the bpftrace tool

Programming Snapshot – bpftrace

© Lead Image © Zoya Fedorova, 123RF.com

© Lead Image © Zoya Fedorova, 123RF.com

Article from Issue 230/2020
Author(s):

Who is constantly creating the new processes that are paralyzing the system? Which process opens the most files and how many bytes is it reading or writing? Mike Schilli pokes inside the kernel to answer these questions with bpftrace and its code probes.

If you are tasked with discovering the cause of a performance problem on a Linux system that has slowed down to a crawl, you will typically turn to tools such as iostat, top, or mpstat to see exactly what is throwing a spanner in the works [1]. Not enough RAM? Lame hard disk? CPU overloaded? Or is network throughput the bottleneck?

Although a tool like top shows you the running processes, it cannot detect short-lived instances that start and end again immediately. Periodically querying the process list only makes sense to visualize long-running processes.

Fortunately, the Linux kernel already contains thousands of test probes known as Kprobes and tracepoints. Users can inject code, log events, or create statistics there. One totally hot tool for doing this is bpftrace. With simple one-liners, it injects into the kernel scripts that determine in real time metrics like bytes heading off into the network or onto the hard drive, or lists which processes open or close which files.

BPF stands for Berkeley Package Filter and testifies to the origin of the corresponding tool from the BSD world as a tracing tool for network packets. The practice of scattering probes throughout the code that are usually tacit, but run small snippets of code when triggered, proved so practical that it soon entered the Linux world as eBPF.

Once there, it lost its ties to network packets and conquered wide areas of the kernel code as a generic tracing concept. Good naming is hard work that engineers often shy away from, so the author of eBPF changed the name of his now popular work back to BPF. Of course, that complicates things for authors writing tutorials like this one, who are hard pressed to find an explanation as to why the BPF name has lost all meaning with respect to the product as it is today.

The approach of distributing dynamically deployable probes in kernel code came from the Sun world. For a long time, Solaris was the only operating system that allowed administrators to use DTrace to activate small pieces of D language code at strategic points, such as the system call entry point, and fire off counters or timers for performance analysis.

BPF on newer Linux kernels works in a similar way to DTrace, but has been rewritten (also for patent reasons). It executes instructions assigned to the probes in the BPF language, either in an interpreter or via a JIT compiler in native code, directly inside the kernel.

Status: Improving

The bpftrace programming language is very reminiscent of scripting with Unix veteran Awk, but it's still incomplete, and programmers sometimes struggle to complete even the simplest of tasks.

The bpftrace parser (implemented via the Unix veterans Lex and Yacc) is in a sorry state that doesn't even come close to the functionality of Awk – but maybe it will at some point. Netflix engineer Brendan Gregg and some open source friends are working on fixing it. Brendan's book on BPF [2] will be published in December 2019 (a preview is already available).

Back to the task at hand: How do you enable a probe in the kernel that outputs a message each time any userspace program calls the open() function to open a file? With this function, you'll be able to monitor in real time processes of active files. Turns out this is really easy to do. Listing 1 [3] shows the program code; Figure 1 shows the program output.

Listing 1

sys-open.bt

01 #!/usr/bin/bpftrace
02
03 interval:s:5
04 {
05   exit();
06 }
07
08 kprobe:do_sys_open
09 {
10   printf("%s %s\n", comm, str(arg1));
11 }
Figure 1: The code from Listing 1 reports in real time which processes are trying to open which files.

Compact Code

The actual work starts in line 8 with the definition of the kprobe:do_sys_open probe; the following block contains instructions to be executed when the probe triggers. When triggering it, the kernel tells the probe which file the open() system call wants to open. In the block, the printf() instruction outputs the Unix command of the triggering Unix process stored in the comm variable along with the first argument arg1, which carries the name of the file to be opened. Because printf() expects a string, but BPF saves arg1 as a character pointer, the standard str() function converts the pointer appropriately.

The code for the interval:s:5 event starting in line 3 is just some optional feature that cancels the program after five seconds. The event defines an interval of five seconds at which bpftrace jumps into the code block. The call to exit(), which shuts down the program, occurs here as soon as the block has been accessed for the first time. Tracing tools often use intervals like this to output consolidated statistics every few seconds. Once bpftrace has been installed on a Ubuntu system like this:

$ sudo apt-get update
$ sudo apt-get install bpftrace

all you need to do is run Listing 1 with sudo. It launches in the blink of an eye and keeps showing you which processes on the system are currently attempting to open which files. Before you get too excited, however, please note that bpftrace only works on relatively new kernels. Its creators recommend at least version 4.9, and preferably a series 5 kernel.

It is a very powerful tool. Astonished users will rub their eyes in amazement thinking about what just happened behind the scenes during the inconspicuous call: Bpftrace activated the do_sys_open kprobe in the kernel and translated the printf() statement into an internal format. It then installed the compiled code on the probe, causing it to display a message every time the kernel passes the probe. When the bpftrace call terminates, it deactivates the probe in the kernel and removes the injected code.

Full Tilt While Idle

How does this work inside the kernel? It would obviously be devastating for kernel performance if it had to check whether each probe is currently active and then carry on normally in the program in almost 100 percent of the cases when the probe is inactive. There are always very few, if any, probes active from thousands of possible ones.

Instead, the BPF technology, just like DTrace under Solaris, uses a trick: Normally, when the probe is inactive, it inserts a 5 byte no-op instruction into the code, which the processor skips with practically no impact at run time. If the user activates the probe, for example by calling bpftrace, BPF replaces the no-op instruction in the kernel with a jump address to the interpreter that executes the desired code.

No doubt, the CPU will consume time when executing the BPF instructions, which will slow down the kernel a bit. But since the processor stays in kernel mode and doesn't have to switch to user space every time, the probe can quickly refresh the desired statistics – then the flow continues with the actual kernel code.

However, if the infiltrated code were to block the kernel, the result would be devastating: The entire system would stop, which is tantamount to a computer crash. That's why BPF verifies the code before it is introduced and only inserts it if the analysis shows that it will terminate relatively quickly. This is why the bpftrace language does not offer for loops or similar constructs for which it cannot predict with certainty whether they will stop running in the foreseeable future.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • perf

    The kernel supports performance analysis with built-in tools via the Linux performance counters subsystem. perf is easy to use and offers a detailed view of performance data.

  • Tracing Tools

    Programs rarely reveal what they are doing in the background, but a few clever tools, of interest to both programmers and administrators, monitor this activity and log system functions.

  • Kernel News

     

  • How Does ls Work?

    A simple Linux utility program such as ls might look simple, but many steps happen behind the scenes from the time you type "ls" to the time you see the directory listing. In this article, we look at these behind-the-scene details.

  • Userspace Drivers

    New versions of the Linux kernel will support a special userspace driver
    model, but some technical pitfalls might limit the use of this interesting
    new feature.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News