Watching activity in the kernel with the bpftrace tool
Huge Selection
There's plenty of choice of probes in the kernel. From vfs_read (the function that reads bytes from files and can pass a byte count to a probe), through do_execve (for monitoring newly created Unix processes), to trace_pagefault_reg (which is triggered when a memory page is reloaded), users can inspect the kernel's workings at will and discover in real time what's going on and where the bottlenecks are.
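As a minimal sketch of what such a probe can look like, the following script attaches to the return of vfs_read and collects the values it returns in a histogram. The map name @bytes is chosen here purely for illustration, and the script assumes that vfs_read is available as a probe target on the running kernel.

```bpftrace
#!/usr/bin/bpftrace

// Fires whenever vfs_read() returns; the builtin retval holds
// the function's return value
kretprobe:vfs_read
{
  // Sort the return values into a power-of-2 histogram
  @bytes = hist(retval);
}
```

bpftrace prints the histogram when the script is stopped with Ctrl-C.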
Figure 2 lists the probes that bpftrace prints when called with the -l switch. BPF distinguishes between kprobes, which track important kernel functions by name, and tracepoint probes, which the kernel maintainers manually maintain at a slightly higher logical level and which are thus more resilient to changes in the kernel. In contrast to userspace-facing kernel APIs, the kernel's internal functions are by no means guaranteed to be stable.
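To make the difference between the two probe types concrete, here is a hedged sketch that uses one of each. The map names are made up for this example, and the sys_enter_read tracepoint with its count argument is assumed to be present on the running kernel.

```bpftrace
#!/usr/bin/bpftrace

// kprobe: attaches to an internal kernel function by name; it may
// stop working if the function changes in a later kernel
kprobe:vfs_read
{
  @kprobe_reads[comm] = count();
}

// tracepoint: attaches to a maintained hook that exposes its
// arguments by name in the args structure
tracepoint:syscalls:sys_enter_read
{
  @requested_bytes[comm] = sum(args->count);
}
```

Both maps are keyed by the command name in comm, so the output groups the results per program.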
Potential for More
How about a script that outputs all newly created processes on the system in real time, including the command that was used to start them and their parameters? Listing 2 shows a short script that activates the sys_enter_execve tracepoint and prints its argument list argv in the args structure.
Listing 2
procs-new.bt
01 #!/usr/bin/bpftrace
02
03 BEGIN
04 {
05   printf("New processes with arguments\n");
06 }
07
08 tracepoint:syscalls:sys_enter_execve
09 {
10   join(args->argv);
11 }
Here you can see that the range of functions in bpftrace still has potential for more. For example, there is the join() function, which uses spaces to join and output the elements of a command line in args->argv. It cannot, however, return the result as a string that you could then format with printf(). Hopefully, upcoming versions will resolve this issue.
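One way to live with this limitation, sketched here under the assumption that the output of printf() and join() appears in the order the calls are made, is to print a prefix without a trailing newline and let join() complete the line. The column widths are arbitrary choices for this example.

```bpftrace
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_execve
{
  // printf() writes the prefix and deliberately omits the newline ...
  printf("%-6d %-16s ", pid, comm);
  // ... and join() appends the argument list and ends the line
  join(args->argv);
}
```

Note that comm still holds the name of the calling program at this point, because the tracepoint fires before the new program image is loaded.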
The BEGIN block starting in line 3 simply prints a greeting for the user. If you want the script to display a message or initialize a variable right at startup, this happens in the BEGIN block as shown in Listing 2, following the Awk programming model.
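The Awk analogy extends to the end of the run as well: bpftrace also offers an END block. The following sketch, with an illustrative map name @execs, shows both blocks framing a probe that counts execve calls per calling command.

```bpftrace
#!/usr/bin/bpftrace

BEGIN
{
  // Runs once, before any probe fires
  printf("Counting new processes, press Ctrl-C to stop\n");
}

tracepoint:syscalls:sys_enter_execve
{
  @execs[comm] = count();
}

END
{
  // Runs once when the script is interrupted
  printf("Processes started, by calling command:\n");
}
```

After the END block has run, bpftrace prints the contents of @execs on its own.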
In the Thick of It
However, things become more complicated if a probe that detects a problem cannot output the desired data because the data is only available somewhere else. For example, to look at processes that try to open files that do not exist (or to which they have no access), Listing 3 taps into the sys_exit_openat tracepoint, which the kernel runs through when the open() system call returns.
Listing 3
opens-failed.bt
01 #!/usr/bin/bpftrace
02
03 tracepoint:syscalls:sys_enter_openat
04 {
05   @filename[tid] = args->filename
06 }
07
08 tracepoint:syscalls:sys_exit_openat
09 / @filename[tid] /
10 {
11   if ( args->ret < 0 ) {
12     printf("%s %s\n", comm, str(@filename[tid]));
13   };
14   delete(@filename[tid]);
15 }
Using the condition args->ret < 0, bpftrace checks whether the return code from the system call was negative, which indicates that the desired file could not be opened. If so, the code should output the name of the process in question and the file name at this point. However, the exit tracepoint does not have access to the file name, which was only present when the kernel previously entered the open() function, tied to the sys_enter_openat tracepoint (notice the subtle difference between enter and exit).
The solution in this case is to have bpftrace create a data structure during the open() call and carry it over to exit, which then extracts the file name from it and reports the error with the desired context. For this to happen, the script stores the name of every file being opened in a map data structure when entering open() (i.e., in the sys_enter_openat tracepoint), under the key of the current kernel thread ID, which is present in the predefined tid variable. If the file fails to open later on, the sys_exit_openat tracepoint can look up the name of the file in question in the map and notify the user, along with the command of the process that experienced the error, which is available in comm.
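The same enter/exit handover works for other per-call data. As a sketch of the technique rather than part of the original listing, the following script stores a timestamp instead of a file name and uses it on exit to measure how long each call took. The map names @start and @usecs are illustrative.

```bpftrace
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_openat
{
  // Remember when this thread entered the call
  @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_openat
/ @start[tid] /
{
  // Time spent in the call, in microseconds, as a histogram
  @usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}
```

Keying the map by tid keeps concurrent calls from different threads apart, just as it does for the file names.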
The filter set in line 9 of Listing 3 is / @filename[tid] /, and it ensures that the probe executes the following code only if the kernel thread has previously set a file name in the map. If no matching sys_enter_openat event was recorded for the thread (for example, because the call began before the script started), the map entry won't exist, and the filter lets bpftrace ignore the event.
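Filters are not limited to map lookups. As a small illustration, assuming you only care about a single program (the name "cat" is just a placeholder), a filter can compare the builtin comm against a string:

```bpftrace
#!/usr/bin/bpftrace

// The block only runs for events raised by processes named "cat"
tracepoint:syscalls:sys_enter_openat
/ comm == "cat" /
{
  printf("%s opens %s\n", comm, str(args->filename));
}
```

Events from all other processes never reach the block.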
After reporting the incident, the code proceeds to line 14, which calls delete() to remove the map entry. Without this step, the map would keep growing and eventually consume too much memory if the bpftrace script were to run for a longer period of time.
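Entries belonging to calls that are still in progress when the script is interrupted stay in the map, and bpftrace prints any remaining maps on exit. A common way to suppress that output, sketched here as an optional addition to the pattern, is to clear the map in an END block:

```bpftrace
#!/usr/bin/bpftrace

tracepoint:syscalls:sys_enter_openat
{
  @filename[tid] = args->filename;
}

tracepoint:syscalls:sys_exit_openat
/ @filename[tid] /
{
  // Per-call cleanup, as in the listing above
  delete(@filename[tid]);
}

END
{
  // Drop whatever is left so bpftrace does not print it on exit
  clear(@filename);
}
```

The error reporting is left out of this sketch to keep the focus on the cleanup.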