Zack's Kernel News

Article from Issue 179/2015

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Background Memory Compaction

Vlastimil Babka remarked that memory compaction only occurs at certain times – for example, when the kernel swap daemon, kswapd, balances a zone of memory after failing to find a large enough contiguous region of free memory to satisfy a user request. Compaction reduces this fragmentation by migrating allocated pages together so that the remaining free space coalesces into larger contiguous blocks.

Vlastimil acknowledged that there were other times when Linux might compact memory, but he wasn't satisfied. If the system waited until an allocation request came in before compacting the memory it needed to satisfy that request, this would create latency problems for larger allocation requests. Vlastimil proposed that, "To improve the situation, we need an equivalent of kswapd, but for compaction. E.g. a background thread which responds to fragmentation and the need for high-order allocations (including hugepages) somewhat proactively."

He considered and discarded the various existing threads that could be used. He felt that extending kswapd to include memory compaction would complicate its design too much, and extending khugepaged would have the drawback of being tied to THP (Transparent HugePages) configurations. Transparent HugePages is an abstraction layer that sits on top of normal memory allocation and deals in very large blocks of memory. Extending khugepaged would fail to compact memory that was allocated in any other way.

Vlastimil's approach was to create a new daemon, kcompactd, which would run as a separate kernel thread on each NUMA node and perform compaction at 15-second intervals. If system memory started to get tight, the shortage would inevitably slow down any compaction attempt; in that case, Vlastimil said, kcompactd would detect the situation and skip compaction.

David Rientjes replied, saying he liked the code and the idea, but he worried that other kernel folks might object to adding another per-node thread. He suggested returning to the idea of piggybacking compaction on the existing khugepaged thread. He pointed out that, in practice, there might not be anyone actually doing memory allocation without using THPs that would have an effect on fragmentation, in which case, no harm, no foul.

David identified what he felt was the likeliest candidate – the SLUB allocator, an allocation system for kernel objects needing high efficiency. If SLUB didn't create a need for on-demand memory compaction, David said, nothing would. And he hadn't seen any reports of that need, so he said, "I'm inclined to think that the current trouble spot for memory compaction is THP allocations."

Vlastimil agreed that he hadn't seen any reports of such problems in a SLUB context. But he said that even so, while THP represented the most significant need for compaction, "I wouldn't discount the others."

David suggested a two-pronged approach. Put the compaction code into khugepaged to deal with the biggest case – THP compaction – and then also to "schedule a workqueue in the page allocator when MIGRATE_ASYNC compaction fails for a high-order allocation on a node and to have that local compaction done in the background." He felt this solution would be relatively trivial to implement.

But Vlastimil objected, saying, "I think pushing compaction in a workqueue would meet a bigger resistance than new kthreads. It could be too heavyweight for this mechanism and what if there's suddenly lots of allocations in parallel failing and scheduling the work items?"

And if separate per-node threads were needed in this case, Vlastimil went on, it made sense to include THP compaction in those threads as well.

The two continued their debate, delving into the more technical details of which aspects of memory allocation and defragmentation happened at which times, trying to identify the best place for a generalized compaction system. But meanwhile, Joonsoo Kim offered some input on SLUB allocator compaction issues: "In fact, some of our product had trouble with SLUB's high-order allocation 5 months ago. At that time, compaction didn't help with high-order pages, and compaction attempts were frequently deferred. It also caused many reclaims to make a high order page, so I suggested masking out __GFP_WAIT and adding __GFP_NO_KSWAPD when trying SLUB's high-order allocation to reduce reclaim/compaction overhead. Although using high-order page in SLUB has some gains, such as reducing internal fragmentation and reducing management overhead, the benefit is marginal compared to the cost of making a high-order page. This solution improves system response time in our case."

David replied:

On a fragmented machine, I can certainly understand that the overhead involved in allocating the high-order page outweighs the benefit later, and it's better to fall back more quickly to lower page orders if the cache allows it.

I believe that this would be improved by the suggestion of doing background synchronous compaction. So regardless of whether __GFP_WAIT is set, if the allocation fails, then we can kick off background compaction that will hopefully defragment memory for future callers. That should make high-order atomic allocations more successful as well.

Joonsoo also added, "In the embedded world, there is another candidate, the ION allocator. When launching a new app, it tries to allocate high-order pages for graphic memory and fallback to low order pages as following sequence (8, 4, 0). Success affects system performance. It looks like case similar to THP. I guess it can be also benefit from periodic compaction."

The discussion continued, and at one point, David clarified a distinction he wanted to make between different approaches to doing compaction that he felt should both be incorporated into any solution. First, he said there was "periodic compaction that would be done at certain intervals regardless of fragmentation or allocation failures to keep fragmentation low." Second, he said there was "background compaction that would be done when a zone reaches a certain fragmentation index for high orders, similar to extfrag_threshold, or an allocation failure."

Vlastimil asked, regarding background compaction, "Is there a smart way to check the fragmentation index without doing it just periodically, and without polluting the allocator fast paths?" David averred, "We certainly don't want to add fastpath overhead for this in the page allocator nor in any compound page constructor." He added, "The downside to doing it only in the slowpath, of course, is that the allocation has to initially fail. I don't see that as being problematic, though, because there's a good chance that the initial MIGRATE_ASYNC direct compaction will be successful: I think we can easily check the fragmentation index here and then kick off background compaction if needed."

The discussion continued. Eventually Mel Gorman came into it, saying he approved of the idea of some kind of background compaction process in general. However, he offered some technical suggestions about the ongoing debate between Vlastimil and David, and he also remarked, "There will be different opinions on periodic compaction, but to be honest, periodic compaction also could be implemented from userspace using the compact_node sysfs files. The risk with periodic compaction is that it can cause stalls in applications that do not care if they fault the pages being migrated. This may happen even though there are zero requirements for high-order pages from anybody."

To which David replied:

When THP is enabled, I think there is always a non-zero requirement for high-order pages. That's why we've shown an increase of 1.4% in cpu utilization over all our machines by doing periodic memory compaction. It's essential when THP is enabled and no amount of background compaction kicked off with a trigger similar to kswapd (which I have agreed with in this thread) is going to assist when a very large process is exec'd.

That's why my proposal was for background compaction through kcompactd kicked off in the allocator slowpath and for periodic compaction on, at the minimum, THP configurations to keep fragmentation low. Dave Chinner seems to also have a use case, absent THP, for high-order page cache allocation.

I think it would depend on how aggressive you are proposing background compaction to be, whether it will ever be MIGRATE_SYNC over all memory, or whether it will only terminate when a fragmentation index meets a threshold.

The discussion ended there. But it seems clear that everyone agrees that some form of background memory compaction should be implemented. Whether the kernel needs a new set of per-CPU threads, whether it should distribute the work among existing threads as much as possible, or whether some other solution should be found isn't known. Meanwhile, whatever code the next set of patches touches, new input will undoubtedly come from the folks deeply involved in that code. So, the final design of Vlastimil's patches might still be some ways away.

Enabling Per-Subsystem Tracepoints

Tal Shorer noticed that enabling the CONFIG_TRACING option in the kernel would compile tracepoints for all subsystems, even if the user only wanted to test one particular subsystem. This caused him enough of a performance penalty that he felt he could never enable CONFIG_TRACING on his system.

He posted a patch to allow users to enable tracepoints on a subsystem-by-subsystem basis. Rather than implement this for every subsystem at once, he selected gpio as a sample, saying that if it passed muster, he would create equivalent patches for all other subsystems.
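Assuming the option name from Tal's patch series (the gpio knob ended up in the kernel as TRACING_EVENTS_GPIO), the per-subsystem switch looks like an ordinary .config fragment – tracing infrastructure on, gpio events compiled in, everything else left out:

```kconfig
# Tracing infrastructure (Kernel hacking -> Tracers)
CONFIG_FTRACE=y
# Compile trace events for the gpio subsystem only
CONFIG_TRACING_EVENTS_GPIO=y
```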

Unfortunately he picked an odd time in the development cycle to post these patches. He posted just before the merge window opened, and Steven Rostedt had to put new code on hold during the merge window. So a couple of weeks passed between the time Tal posted his patches and the time Steven could give them a look-see. Note to all: Don't post new code when a merge window is about to open.

Finally, Steven's biggest suggestion was just to add a comment to part of the code and create a macro to avoid one place that duplicated some code. Tal posted a new version of the code, and that was that.

This is one of those types of features that seems to sail into the kernel without anyone taking notice. No one objects, and there are no controversial implementation details or security concerns. Someone writes a patch, everyone essentially likes it, and in it goes. I think this has to do with the fact that Tal is working on higher-level code that mostly adjusts configuration options, as opposed to delving into deeper kernel features, fast paths, and other crucial areas.

Exposing Internal CPU Data to Users Doing System Monitoring

Prarit Bhargava wanted to allow non-root users to have read-access to the /dev/cpu/X/msr files. He said, "No damage can be done by reading the MSR values, and it allows non-root users to run system monitoring software."

MSRs (model-specific registers) are a class of registers in the x86 processor family that Intel explicitly does not guarantee will behave the same from one chip version to the next. Instead, Intel provides the rdmsr and wrmsr assembly language instructions, which allow the operating system to read and write MSRs, along with the means to identify the set of specific MSR features available on a given chip. The OS can then make its own choices about which features to use and which not.

In practice, Intel has decided that various MSRs are valuable enough to keep in future processors, and so the term MSR is becoming more of a misnomer over time. Typically, these registers implement features useful for debugging, tracing, monitoring, and other meta-operations on a running system.

Prarit's rationale for letting regular users read MSRs was that regular users sometimes had legitimate reasons to write system monitoring software. For example, he said, anyone trying to do load balancing would benefit from knowing how busy a given CPU was at a given time. But, he said, "the only way to get this data is to run as root, or use setcap to allow userspace access for particular programs. Both of these options are clunky at best."

He posted a patch to open up read access to regular users, but H. Peter Anvin felt Prarit was too quick to dismiss the security concerns. Peter felt that damage could indeed be done by a malicious user who gained read access to MSR registers, and Brian Gerst said, "Some MSRs contain addresses of kernel data structures, which can be used in security exploits." He added, "The proper way to do this is to write a driver to only expose the MSRs that the user tools need, and nothing else."

This made sense to Prarit, and he agreed to code up another version of his patch.

At the same time, Andy Lutomirski got interested in the whole problem and started pursuing what he felt was an alternative approach to Prarit's. Ingo Molnár and Andy had speculated together that writing a specific PMU (Performance Monitoring Unit) driver would be the way to go. This driver could access the MSRs itself, providing only the information needed for system monitoring. There was no need, the two agreed, to expose the whole MSR interface to arbitrary potential exploits.

Andy posted his own set of patches, to which Prarit said, "I just sat down to do something similar to what Andy had proposed here :)."

Prarit agreed with this approach and liked Andy's code, and said he'd start working on a way to expand on what Andy had accomplished already.

In general, with input from various folks, Prarit concluded that whether they implemented a PMU driver or a more general MSR-reading driver, the driver itself had to implement a whitelist of data items that could be accessed by the user. Specifically, if a new bit of MSR data appears in a future chip, the driver must default to keeping that data private from the user, because there is no way to know what that new data might be, and it could therefore expose a security vulnerability.

Prarit felt that there were really three options at this point: implement Andy and Ingo's PMU driver; expose a whitelisted set of data at /dev/cpu/X/msr; or expose a whitelisted set of data somewhere in SysFS, with each piece of exposed data represented by its own virtual file.

Meanwhile, Len Brown pointed out that Lawrence Livermore National Laboratory was keenly interested in the outcome of this discussion. Len said that, for security, they agreed with the importance of whitelisting data; but he also said, "For performance, they absolutely can not afford a system call for every single MSR access. Here an ioctl to have the msr driver perform a vector of accesses in a single system call seems the way to go."

Andy was surprised at this, saying that "On a sane kernel, a syscall is about 120 cycles." But he didn't squawk too loudly against an ioctl. Peter, however, did squawk. He remarked, "Every time I have heard about people having issues with performance for MSR access, it is because they are doing cross-CPU accesses, which means a neverending stream of IPIs. You get immensely better performance by tying a thread to a CPU and only accessing the local CPU from that thread. This has addressed any performance problems anyone has ever come to me with. As Andy and Ingo have already pointed out, the MSR access itself is pretty much as expensive as the system call overhead."

The discussion petered out with no clear conclusion, but it does at least seem that users will gain some form of limited, read-only access to MSRs, paving the way for new types of userland system monitoring software.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
