Core Technology

Article from Issue 203/2017

We all want our programs to run fast. With profiling, you can locate the code that is slowing you down and fix it.

While hardware becomes faster and faster, the software it runs becomes slower and slower. Although this is a joke, it carries a bit of truth. Quite often, the software uses suboptimal algorithms, introduces costly abstractions, or misses hardware caches. This is okay most of the time, because hardware is forgiving and has enough capacity for many tasks. The rest of the time, you want your code to perform as fast as possible.

Optimizing a program is not the same as rewriting it from scratch the right way. The majority of code is generally not in the "hot path": it's executed only once in a while during the program's lifetime, and investing optimization effort into these parts will never pay off. So, you need a tool to tell you which code paths consume the most CPU time.

A technique called "profiling" comes into play here. You build an execution profile of the program (hence the name) that tells you which functions the program runs and how much time they take; then, you look at the profile and decide what you can do about it. While the last step is a bit of an art, the first step just requires a tool that is essential to any developer. Luckily, Linux offers several such tools.

Ye Olde gprof

In a nutshell, any profiler records functions it sees on the CPU and how long they last. A naive approach would be to inject a hook at the beginning and the end of every function. An "entry hook" can remember the function's name and the timestamp, so the "exit hook" can calculate the time elapsed with a mere subtraction.

Simple as it sounds, it won't work in a multiprocess operating system like Linux, because the kernel may preempt the function being executed virtually anywhere, and the wall clock difference between entry and exit times could be very different from the CPU time. Therefore, you might end up optimizing a function that was really preempted by a higher priority process. That's not good.

A real-world approach is to sample the program at predefined intervals of virtual execution time. Virtual execution time stops when the process is de-scheduled, so if you multiply the sample interval by the number of samples, you get an estimate of how long the function was running. You don't want to sample so often that it affects performance, and you don't want the interval to be unnecessarily large, because it is the natural upper bound of the estimation error. Many Unix systems provide a dedicated profil(2) system call for such sampling. In Linux, it's a library function [1] that runs on top of setitimer(2), thus having more overhead than it should. But you don't call profil() directly. GCC, glibc, and friends wrap these nuts and bolts to build a high-level tool: gprof.

I have never seen a Linux distribution that provides a build-essential equivalent but no gprof, so it is safe to assume you'll find gprof in the repositories, typically within binutils. In fact, gprof is just a utility to postprocess and display collected profiles. Before you can use it, you need to build your code with profiling instrumentation enabled.

To do this, pass the -pg switch to GCC to add some instrumentation (see the "A Word on Function Instrumentation" box): an initialization function, __monstartup(), and the corresponding atexit handler [2], _mcleanup(). GCC also embeds a call to mcount() in every function in your program. __monstartup() sets things up and calls profil(); mcount() records the call stack. For this reason, you can't use -fomit-frame-pointer together with -pg. When your program stops, the _mcleanup() function calls write_gmon(), which saves the call graph, the program counter histogram that profil() collects, and some other data in a file, typically named gmon.out.

A Word on Function Instrumentation

The -pg switch of gcc is an example of a technique known as "function instrumentation," and it has many more applications, really.

You know that C++ programs can throw exceptions. Other languages, such as Python, do this as well, but, when it happens, you get a neat backtrace that you can use to learn the cause of the problem. What if you could do a similar thing with C++?

Function instrumentation is the answer. When you pass -finstrument-functions to GCC, it embeds two calls to each function. When a function is entered, __cyg_profile_func_enter() is called, and __cyg_profile_func_exit() runs on exit. Both accept a pointer to the function being called and a pointer to the site that called it. It is straightforward to use these two hooks to record the call chain. Then, if an exception happens, you can readily construct the backtrace, as in Python.

The best part is that you don't need to implement it yourself, because ready-to-use libraries take care of everything. For instance, try libcsdbg [4].

For postprocessing, you run gprof. The tool accepts many command-line switches [3], but generally you run it as

gprof foo gmon.out

where foo is the name of the binary you were profiling. This yields a lengthy output that looks similar to Figure 1. This output spans two main parts: the flat profile and the call graph. To show the flat profile only, run gprof -p; to display the call graph, run gprof -q. Both accept a symbol (e.g., function name) to use as a filter. You can also make gprof less verbose with -b: This way, you get data only and no explanatory text (Figure 2).

Figure 1: A typical gprof output. Here, I profile a program that calls either f() or g() randomly a million times.
Figure 2: The same profile as in Figure 1, but terser because of the -b switch.

The flat profile shows which functions consumed CPU time. The most important piece of information is in the self column, which tells how much time the function spent executing. The cumulative column is the running sum, so if you are interested in the notorious 20 percent of functions that run 80 percent of the time, just cut the output after 80-something in this column. You also get the number of calls for each function (calls) and the average number of milliseconds spent in the function itself (self) or together with its children (total) per call.

The call graph tells you how these functions are reached. main() should be the top function, accounting for 100 percent of CPU time, because this is the function your program executes at the top level. Below it should be the functions that main() calls, along with the time accounted for them (self) and their children. It is possible that several paths lead to the same function, and it makes sense to optimize the path that is on the top of this list.

Profiling System-Wide

gprof is a venerable tool, and an even older technology (books) describes it in detail. gprof is still relevant today, but there are other options as well.

The main pain point with gprof is that you need to instrument a program before you can profile it. Instrumented programs are generally slower, but that's not the point. The point is that you need to recompile your code, which makes gprof unsuitable for run-time analysis: You can't just attach to a process and see where it spends its time. You also instrument just one process, so if the system as a whole is performing poorly, gprof won't tell you the reason. Finally, gprof doesn't help when it comes to kernel-space code.

This doesn't mean gprof is bad. It was designed to profile userspace code during development, and it handles this job quite well. For everything else, you just need a different tool. One such tool is perf [5], which comes from the Linux kernel team. perf relies on hardware support, such as performance counters, to monitor fairly tricky events. It will help you understand whether your code is underperforming (e.g., because it has a high L2 cache miss rate).

But perf is also useful for "ordinary" profiling, and it can do it system-wide or per-process. It also profiles both in userspace and in kernel space, but there are some prerequisites. First, your code should be compiled with frame pointer support. This is likely the case, as I explain in the "Your Friendly Neighbor the Frame Pointer" box. perf claims to support programs built with -fomit-frame-pointer as well, via other methods, such as DWARF, but I have never succeeded in getting a decent profile that way. Your mileage may vary. Second, you'll need debug symbols for the code you profile and the libraries it relies on (e.g., glibc). Luckily, many distributions ship debug symbols in separate packages these days, so you can add them on an as-needed basis.

The perf tool (actually, a set of tools) should be in your distribution's repository. It comes from the kernel sources and is naturally tied to the kernel version your Linux box runs. On Ubuntu, install linux-tools-X.Y.Z-generic (or the corresponding package if you run a non-generic kernel, such as lowlatency), where X.Y.Z is your kernel's version.

As I mentioned earlier, perf builds around performance events. There can be quite a few of them, depending on your hardware and kernel options, because some events (e.g., tracepoints or context switches) are software based. You can get a complete list of events supported on your system with sudo perf list. Note the use of sudo: perf list runs without superuser privileges, too, but the number of events it reports will be limited. Many perf subcommands understand the -e or --events switch, which tells perf which events you are interested in. It defaults to cycles, which counts CPU cycles and is a natural fit for a profiling task.

For system-wide performance analysis, perf top should be your first stop. It displays an ncurses-based, top-like interface (Figure 3). The entries with higher overhead (i.e., executed most often, unless you set otherwise with -e) come first. Each entry corresponds to a symbol – usually, a function. With debug symbols installed, you see the functions' names; otherwise, you get hexadecimal addresses. Humans are not good at reading hexadecimals, which is why it's important to have debug symbols installed. Note that the Linux kernel shows up in the list as well.

Figure 3: perf top displaying a high-level overview of my system's performance profile. Firefox pretends it does something in the background.

Each entry is attributed several properties, such as comm (the command being executed) and DSO. The latter stands for Dynamic Shared Object: if your binary is a dynamic executable (many are), DSO is where you get the name of the library the symbol comes from. Note that the binary itself is also a DSO.

perf defaults to function-level granularity. For a higher level overview, you can sort the output by comm and then by DSO with:

sudo perf top --sort comm,dso

This way, you get process-level details.

Zoom into Processes

After you have learned which process is the culprit, it's time to drill down into it. The default function-level granularity works well here, but you can add "another dimension" with -g, which enables call graph collection. There are some ways to fine-tune this [6]. You get a new column, Children, which accounts for the time spent in functions called from the highlighted entry, but the main effect is that you can zoom into functions by pressing Enter. While you can employ this to monitor individual processes in perf top (use the -p switch), there is another way.

The perf record command gathers the performance counter profile from a given command and saves it in a file, usually perf.data. Then you can postprocess this data the way you want: I'll show some options in a second. The command for gathering the profile can be specified on the command line, or you can again use -p to sample a given process or processes. You specify a sampling frequency with -F or --freq, and it is a good idea not to use "beautiful" values, such as 10 or 100, here: these numbers readily spring to mind whenever one just needs "some frequency," so the code you profile may already have something that ticks at exactly the same interval. Instead of profiling the code, you would be profiling that timer's callbacks, which is not what you want.

Given all of these conditions, a typical perf record invocation may look like:

sudo perf record -g -F 99 -a -p $PID -- sleep 10

The -a switch tells perf to sample across all CPUs. The sleep command just sets the duration for which you want perf to collect samples – here, 10 seconds. Finally, -p tells perf which process you want to profile.

To format the output collected, you use perf report (Figure 4). It shows the same ncurses-based text user interface you saw in perf top. You can zoom in to the call graph with the Enter key, and other key bindings are available as well (press h). For instance, the Esc key zooms out, and pressing a annotates the symbol under the cursor: You get a disassembly of the function showing you which instructions are most stressed.

Figure 4: perf report for the Firefox process. You see the reverse of de-mangled C++ symbols (is it a bug?) and call stacks.

The most popular profile viewer is probably perf report, but it's certainly not the only one. Hotspot [7], not to be confused with the Java VM, is a GUI counterpart that combines many features in a single Qt-based package (Figure 5), including built-in FlameGraphs (see the "Command of the Month" box) and more intuitive aggregation. Hotspot was released to the public while I was writing this column, so I didn't have much time to give it a try. If you did, please share your thoughts.

Figure 5: The hotspot front end wraps many perf goodies in a Qt5-based interface (Figure 5 is from Hotspot sources and is redistributed under GPLv3).

Command of the Month: FlameGraph

Coverage of perf would be rather incomplete without mentioning FlameGraphs.

A FlameGraph is a call graph visualization that makes finding hot paths in the code easier. With an appropriate palette, these visualizations look much like flames (Figure 6), hence the name.

FlameGraphs are two-dimensional. The horizontal axis shows the stack population (that is, the functions called). It is sorted alphabetically and doesn't indicate the passage of time. Each function (or, more precisely, call stack frame) is represented as a rectangle, and the wider the rectangle, the more often that frame appears in the profile. The vertical axis represents the call sequence: Functions at the bottom call those at the top, so the largest flame is what is seen on the CPU the most.

Figure 6: A FlameGraph example. The real one is interactive: You can click to zoom in/out and look up symbols.

Brendan D. Gregg, the inventor of FlameGraph, also built a toolset to create FlameGraphs from perf.data files (and many other sources). It's available online [8]; the main piece of this toolset (yes, it's Perl) renders FlameGraphs as interactive SVG files. The README [8] provides all the details.

If you are serious about doing performance analysis in Linux, I highly recommend Brendan Gregg's other tutorials, blog posts, and tools available on his homepage [9].

Your Friendly Neighbor the Frame Pointer

You have already seen enough examples of how useful it can be to store the call chain that leads to a given point in the code. The frame pointer is a way to do it at the hardware level.

The frame pointer is just a CPU register. On x86, the BP/EBP/RBP registers serve this purpose. In code that uses frame pointers, each function typically begins with:

push %rbp
mov %rsp, %rbp

That is, %rbp always points to the current function's stack frame, and the previous function's stack frame pointer is just above it (remember that on x86, the stack grows down to the lower addresses). When the function returns, it does the reverse:

mov %rbp, %rsp
pop %rbp

There are even dedicated instructions for these sequences: enter for the prologue and leave for the epilogue.

Frame pointers take one register to store (and a few CPU cycles to maintain), so historically, compilers have provided a way to disable this feature. Using it is often discouraged now: Registers aren't as scarce as they were in the 32-bit world, and the ability to reconstruct the call stack usually outweighs the gains of this optimization.

The Author

Valentine Sinitsyn works in a cloud infrastructure team and teaches students completely unrelated subjects. He also has a KDE Developer account he's never really used.
