Tools and techniques for performance tuning in Linux

An Example

Linux provides quick allocation and deallocation of frequently used objects in caches called "slabs." To provide better performance, Christoph Lameter introduced a new slab allocator called SLUB.

However, we found that hackbench, a scheduler performance benchmark, shows a large difference in run time on kernel 2.6.24/2.6.25-rc between a system with 16 CPU cores and a system with eight CPU cores. Hackbench should run faster on the 16-core system than on the 8-core system, but in our tests the 16-core machine needed three times the run time of the 8-core machine, which indicates a possible performance issue.

The vmstat utility provides the output shown in Listing 6.

Listing 6: Starting with vmstat

Notice the high context switch (cs) count and large number of running processes. In this case, hackbench simulates many chat rooms with a large number of users passing messages back and forth in each room. The lack of idle time in the system indicates that the CPU is very busy.
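The same three indicators (run queue, context switches, and idle time) can be pulled out of vmstat output with a few lines of shell. The following is only a sketch and not part of our original test setup; the function name vmstat_summary is made up for illustration, and the column positions are taken from the vmstat header line instead of being hard-coded.

```shell
# Sketch: average the r, cs, and id columns of vmstat output read from stdin.
# vmstat_summary is a hypothetical helper name, not an existing tool.
vmstat_summary() {
    awk '
        # Look for the header line that names the r, cs, and id columns.
        !found {
            r = cs = id = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "r")  r  = i
                if ($i == "cs") cs = i
                if ($i == "id") id = i
            }
            if (r && cs && id) found = 1
            next
        }
        # Data rows start with a number; repeated header lines are skipped.
        $1 ~ /^[0-9]+$/ {
            n++
            sum_r  += $r
            sum_cs += $cs
            sum_id += $id
        }
        END {
            if (n)
                printf "samples=%d avg_r=%.1f avg_cs=%.0f avg_id=%.1f\n",
                       n, sum_r / n, sum_cs / n, sum_id / n
        }
    '
}
```

For example, vmstat 1 10 | vmstat_summary averages 10 samples. Keep in mind that the first line vmstat prints reports averages since boot, so a longer sampling period gives a more representative picture. An average id value near zero together with an average r value well above the number of cores matches the situation described above.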

The next step is to use oprofile to find out where the CPU is spending its time. The oprofile data in Listing 7 shows that about 88% of the CPU time is spent allocating slabs, adding to partially filled slabs, and freeing slabs. The benchmark generates a large number of messages that must be allocated, passed between processes, and freed again, so memory management is where the program spends most of its time.

Listing 7: Studying CPU Usage with oprofile

This result indicates the need to take a closer look at what is going on with the slabs. A utility called slabinfo provides a report on slab activity. (The source code for the slabinfo utility is included with the kernel source under Documentation/vm/slabinfo.c.) To obtain information about the most actively used objects, invoke the slabinfo utility (see Listing 8).

Listing 8: slabinfo

The objects of size 192 and 512 are heavily used by hackbench messages: One is for the socket buffer header and one is for the message body.

The SLUB implementation keeps a per-CPU cache for each slab type. When the kernel allocates an object, it checks the per-CPU cache first without locking. This kind of allocation is very fast and is called the fast path. If the per-CPU cache has no free objects, the kernel allocates from shared pages under a lock, which is slower. The more allocations take the slow path, the more lock contention occurs. The free procedure also has a fast path and a slow path. Because free uses a distributed lock (the page lock) and allocation uses more exclusive locks, keeping allocation on the fast path is more important.

For these two objects, we noted that the free operation is quite slow, and allocation is not fast either. For example, for objects of size 512, only 68% of allocations and 7% of frees take the fast path.
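The fast path percentage is simply the fast path count divided by the sum of the fast path and slow path counts. As a rough sketch, the same figure can be computed directly from the SLUB statistics files. This assumes a kernel built with CONFIG_SLUB_STATS, which exposes counters such as alloc_fastpath and alloc_slowpath in each cache directory under /sys/kernel/slab; the function name fastpath_pct and the cache directory in the usage note are illustrative assumptions.

```shell
# Sketch: percentage of operations served by the fast path for one slab cache.
# Assumes CONFIG_SLUB_STATS counters in the given cache directory.
# Usage: fastpath_pct <cache-dir> <alloc|free>
fastpath_pct() (
    dir=$1
    op=$2
    # Each counter file starts with the total, followed by per-CPU values.
    read -r fast rest < "$dir/${op}_fastpath" || exit 1
    read -r slow rest < "$dir/${op}_slowpath" || exit 1
    total=$((fast + slow))
    if [ "$total" -eq 0 ]; then
        echo "n/a"          # no operations recorded yet
        exit 0
    fi
    # Integer arithmetic: the result is rounded down to a whole percent.
    echo "$((100 * fast / total))%"
)
```

A call might look like fastpath_pct /sys/kernel/slab/kmalloc-512 alloc, where kmalloc-512 is assumed to be the cache that holds the 512-byte objects.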

To reduce slow path allocation, we can ask for bigger slabs, which increases the per-CPU object cache. To raise max_order above its default of 1 and set min_objects to 32, we add slub_max_order=3 slub_min_objects=32 to the kernel boot command line. This increases the minimum number of objects that must fit into one slab, which reduces the chance that the kernel allocates objects by the slow path.
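After rebooting, it is worth confirming that the parameters actually reached the kernel. The following sketch extracts a key=value parameter from the kernel command line; the function name cmdline_param is made up for illustration, and the file argument defaults to /proc/cmdline.

```shell
# Sketch: print the value of a key=value kernel boot parameter.
# Returns non-zero if the parameter is not present.
# Usage: cmdline_param <name> [cmdline-file]
cmdline_param() (
    name=$1
    file=${2:-/proc/cmdline}
    # Put each parameter on its own line, then match on the part before "=".
    # If a parameter is repeated, the last occurrence is reported.
    tr -s ' \t' '\n\n' < "$file" | awk -F= -v k="$name" '
        $1 == k { v = substr($0, length(k) + 2); hit = 1 }
        END { if (hit) print v; exit !hit }
    '
)
```

With the settings above, cmdline_param slub_max_order should print 3 and cmdline_param slub_min_objects should print 32.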

This step improved the throughput significantly: The run required just one tenth of the time needed in the previous test. Through extensive testing with different slub_min_objects settings, we found a correlation between the best slub_min_objects value and the number of CPUs.

In most cases, we get the best result with slub_min_objects=cpu_number*2. Setting slub_min_objects to a larger value does not provide much additional improvement.
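This rule of thumb is easy to turn into a boot parameter. The sketch below is only an illustration: the function name suggest_min_objects is hypothetical, and it assumes that getconf _NPROCESSORS_ONLN reports the number of online CPUs, as it does on typical Linux systems.

```shell
# Sketch: apply the cpu_number*2 rule of thumb from the text.
# Usage: suggest_min_objects [cpu-count]   (defaults to the online CPU count)
suggest_min_objects() (
    cpus=${1:-$(getconf _NPROCESSORS_ONLN)}
    echo "slub_min_objects=$((cpus * 2))"
)
```

For the 16-core test machine, suggest_min_objects 16 prints slub_min_objects=32, which matches the value used above.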

At this point, we went back to the 8-core machine and did extensive testing to confirm our findings. After we discussed the problem with the SLUB maintainers, a patch that scales slub_min_objects as a function of the number of CPU cores was merged into the Linux kernel.

Conclusions

In this article, we provided a quick tour of some useful tools for diagnosing common performance issues. Of course, this brief introduction is not intended as a comprehensive description of the performance tuning craft, but it should provide you with a good starting point for discovering and fixing performance bottlenecks on your Linux systems.

Power Performance

Power consumption is another aspect of system performance. Most recent processors are equipped with processor performance states (P-states) and sleep states (C-states). If the system is not fully loaded, it is better to switch to a P-state that operates the processor at a lower frequency and voltage. If the processor is idle, the system should switch to a sleep state.

To take advantage of these features, make sure the SpeedStep and C-state features are enabled in the BIOS. To use the P-state feature of the CPU, you also need to make sure that a suitable CPU frequency governor is enabled for the system. To see which governors are available, use:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
ondemand userspace performance

With the following command, you can determine the current governor:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

The ondemand governor has the best power-saving characteristics and is typically recommended, whereas the performance governor will put the CPU at the maximum frequency and voltage. To switch to the ondemand governor, issue the following command:

# echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
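The command above changes the governor for cpu0 only. On a multi-core system, each CPU that has its own cpufreq directory needs the same setting. A minimal sketch, assuming the standard sysfs layout shown above (the function name set_governor is made up for illustration, and writing to sysfs requires root):

```shell
# Sketch: write one governor to every CPU that exposes a cpufreq directory.
# Usage: set_governor <governor> [cpu-root]
# cpu-root defaults to /sys/devices/system/cpu.
set_governor() (
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    status=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue            # the glob matched nothing
        echo "$gov" > "$f" || status=1     # e.g., governor not available
    done
    exit "$status"
)
```

Running set_governor ondemand as root then applies the ondemand governor to all CPUs in one step.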

To take advantage of the CPU C-states, you need to enable the tickless idle feature in the kernel. Traditionally, the Linux kernel has a periodic timer tick that wakes up the CPU. This tick prevents the CPU from staying in a sleep state. With the recent addition of tickless idle, the kernel no longer fires this timer tick while the CPU is idle, which allows the CPU to sleep for a longer time in power-saving mode. If you compile your own kernel, you should enable the option CONFIG_NO_HZ=y.
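If you use a distribution kernel, you can check whether the option is already set. The sketch below assumes that the distribution installs the kernel configuration as /boot/config-<release>, which is common but not universal; the function name has_tickless is hypothetical.

```shell
# Sketch: succeed if the given kernel config file enables CONFIG_NO_HZ.
# Usage: has_tickless [config-file]
# The default path is a common distribution convention, not a guarantee.
has_tickless() (
    cfg=${1:-/boot/config-$(uname -r)}
    grep -q '^CONFIG_NO_HZ=y' "$cfg"
)
```

For example, has_tickless && echo "tickless idle enabled" prints the message only if the option is set in the configuration of the running kernel.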

The PowerTOP utility [3] is a useful tool for checking P-state and C-state status in the system. PowerTOP will show the current P-state and C-state, report on which applications wake up the CPU, and provide additional power-saving hints tailored to your system.

Additional power-saving tips can be found at the Less Watts website [4].

The Author

Tim Chen is a staff engineer of the Open Source Technology Center at Intel Corporation. His current focus is mainly on Linux performance. Before working at Intel, he worked at Trillium Digital Systems on telecommunications systems and at Hughes Space and Communications on mobile satellite systems. He graduated from UCLA in 1995 with a Ph.D. degree in Electrical Engineering.

Alex Shi joined Intel's Open Source Technology Center as a software engineer in 2005. He works on Linux performance and power tuning.

Yanmin Zhang, from the Open Source Technology Center at Intel Corporation, has worked on Linux projects for five years, including processor and chipset enabling for the Intel i386, x86-64, and Itanium architectures and PCI-Express. He is currently working on the Linux Kernel Performance project. Before joining Intel, Yanmin worked for Bell Labs at Lucent Technologies on network management system development.
