Zack's Kernel News

Zack's Kernel News

Article from Issue 187/2016
Author(s):

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Speeding Up Futexes…Maybe

Thomas Gleixner pointed out that the kernel's 'futex' locks used a global hash value to keep track of state changes, but that the hash values were not guaranteed to be unique. The kernel could handle these collisions, but it would slow things down. He posted a patch to provide a reference counting mechanism to allow threads to guarantee no hash collisions. He said, "this creates a futex private state that avoids hash collisions and on NUMA systems also cross-node memory accesses."

To make use of the mechanism, he said, "each thread has to attach to the futex before any other operations on that futex." He explained the usage of the sys_futex() call to increment and decrement the futex reference count.

Thomas added that, although the user interface could be simplified by having threads automatically attach to the futex, he didn't want to implement it that way. Apart from increasing memory usage, the futex attachment code would have its own performance issues if all threads on the system made use of it. As long as it was used only by threads that actually needed to avoid those hash collisions – in other words, for real-time software – the code would provide a measurable performance increase.

As an alternative to his approach, Thomas said that it would be possible to increase the size of the global hash – but he didn't want to do that because "that just reduces the probability and does not exclude the chance to hit a bad spot."

To illustrate the importance of finding a fix that would work for real-time software, he described a hash collision detector that he and others had written to measure the extent of the problem. As he described it, "In case of a collision, the time jumps massively due to the atomic_inc/dec and the spinlock operation on the hash bucket and the resulting cache line bouncing. Once it detects a collision it stays there for 60 seconds and then switches to attached mode. This immediately brings back the performance on the involved scanners. A collision with two threads and therefore two futexes on different nodes results in a performance degradation of 30 and more percent in this test."

Thomas concluded, "On real-time enabled systems, even a one-time collision can have fatal consequences due to possibly unbound priority inversions."

Various folks had comments. Some pointed out a few bugs in the code, and there was a consideration of alternatives. At one point, Rasmus Villemoes said that even with Thomas's code, it was still possible to have a hash collision. And, he went on, "since different threads can attach to different sets of futexes, one thread may successfully attach to a futex, while another fails – the second thread is then permanently prevented from operating on that futex."

Thomas agreed that this was a problem, but he said, "There is not much we can do about that except adding it to the documentation."

Elsewhere, Ingo Molnár thought that Thomas might have been too quick to dismiss the possibility of having threads automatically attach to futexes, to hide that requirement from the user. Ingo pointed out that the memory allocation and other issues were mostly setup costs and were not associated with each and every usage. As he put it, "allocation/deallocation costs are a second order concern IMHO, because most of the futex's usage is the lock/unlock operations."

Ingo predicted that "large systems will want to have collision-free futexes most of the time, and they don't want to modify every futex using application or library. So this is a mostly kernel-side system sizing question/decision, not really a user-side system purpose policy question."

And, Ingo concluded, "an ABI distinction and offloading the decision to every single application that wants to use it and hardcode it into actual application source code via an ABI is pretty much the _WORST_ way to go about it."

Given the need to avoid adding permanent and unchangeable application-binary-interface elements to the kernel, Ingo suggested, "don't add any ABI details, but make futexes auto-attached on NUMA systems (and obviously PREEMPT_RT systems), i.e. make it a build-time or boot-time decision at most; don't start a messy 'should we used attached futexes or not' decisions on the ABI side, which we know from Linux ABI history won't be answered and utilized very well by applications!"

Linus Torvalds agreed, saying:

"Do *not* make this a visible new ABI.

You will find that people will make exactly the wrong choices – either not using it (because the futex is deep in a standard library!) when they want to, or using it when they shouldn't (because the futex is deep in a standard library, and the library writer knows *his* code is so important that it should get a special faster futex).

So I absolutely detest this approach. It's the wrong way to go about things. User space does *not* know whether they want to use this or not, and they *will* be wrong.

So automatically using a local hashtable (for private mutexes – I think people need to just accept that a shared mutex is more costly) according to some heuristic is definitely the way to go. And yes, the heuristic may be well be – at least to start – 'this is a preempt-RT system' (for people who clearly care about having predictable latencies) or 'this is actually a multi-node NUMA system, and I have heaps of memory'.

Then, add a tunable (for root, not per-futex) to allow people to tweak it.

Because the *last* thing you want is programmers saying 'I'm so important that I want the special futex'. Because every single programmer thinks they are special and that _their_ code is special. I know – because I'm special."

Torvald Riegel responded, saying, "From a glibc perspective, I agree that this shouldn't require an extension of the ABI unless it's really the only possible way to solve this. For "special" mutex kinds such as PI mutexes, the change in the interface might be justifiable – but for ordinary mutexes, there's no good place to add the attach/detach calls in each thread: An implementation of, say, C11 mutexes cannot easily estimate whether it should use attached futexes, and it would have to track whether a particular mutex has been attached to by the current thread; this might just move the overhead of tracking and caching associations from the kernel to userspace."

Carlos O'Donell also agreed that changing the ABI would not be good. He said:

"We had similar requests in glibc to add APIs to tweak the parameters of the elision for locks backed by hardware transactional memory.

The person submitting the patches always thinks this is a great API because it allows them to write tests to verify their own work (which means it still might be useful for internal testing or developing auto-tuning).

Users have no clue what to do with the API and, worse, the state space of the parameters is immense. You can't possibly do any kind of sensible optimization without knowing a lot about the hardware.

So no public API was ever added in glibc for pthread_mutex_lock elision parameters. Either the parameters work by default or you have to post patches to change the auto-tuning used internally in glibc."

That was the end of the discussion. It seems as though any attempt to modify the kernel's ABI is going to need a very strong justification – one that hasn't been met by this futex optimization code. The optimization does give a significant speed improvement, but it's likely to go into the kernel as something behind-the-scenes that applies to all threads, or not at all.

Expanding Cgroups to Include Workqueues

Bandan Das posted some patches to implement cgroup-aware workqueues. Cgroups are used to partition off parts of a running system and make them appear to be an independent system. A lot of cloud service offerings use cgroups to offer remote access to virtual servers, without exposing the real servers underneath to potentially malicious hacking. Using cgroups, you can have many virtual systems running apparently independently on a single piece of hardware.

The problem with cgroups is that it's incredibly complex to isolate the various portions of a given hardware system and make them appear to be completely independent. The effort to implement cgroups under Linux is very much a gradual growing-out of features. Over time, as cgroups become more featureful, the virtual servers that use them appear to be more and more like fully independent running systems.

Workqueues are a kernel feature that allow code to defer certain low-priority actions to a time when the system load can better handle them. For example, if you know you'll need to allocate memory, but you also know you won't need that memory right now, you could send the allocation to a workqueue, to be done when the system has time to do it.

One of the features of Bandan's patch was to associate the items in a cgroup-aware workqueue with worker threads in the given cgroup. This would allow the kernel to make sure that resources allocated to a given cgroup couldn't be overrun by workqueues in that cgroup.

As with any attempt to implement something for cgroups, the technical details tend to become insane. Security issues that would not crop up when implementing a feature for the regular kernel become crucial when implementing it for cgroups. As a result, many kernel features go through an agonizing process of partial implementation for cgroups while the security issues are resolved.

But, in the case of workqueues, the technical issues seemed less security-centric and more focused on unexplained speed and resource issues. For example, in his initial post, Bandan reported some performance issues with his patch, and Tejun Heo replied, "Where is performance regression coming from? Why is there *any* performance penalty?"

Bandan replied, "I am still investigating this but creating more worker threads could be one. Since all work gets queued to the default pwq in this implementation, we do end up creating workers in the middle of a run."

Tejun offered to help track down the slowdown, and Michael Rapoport said that "we better understand what causes regression with your current patches and maybe then we'll be smarter to get to the right direction." Bandan replied, "Agreed, let's try to understand the cause of the "underperformance" with wqs. I disabled WQ_CGROUPS; that effectively disables my changes and I can still consistently reproduce the lower numbers."

So, it's possible the slowdown isn't even associated with Bandan's workqueue code. Regardless, it's clear that cgroup-aware workqueues is still just beginning to come together, and there will undoubtedly be a whole slew of security-related objections once the patches are submitted for wider testing.

Cgroups are insane, but they seem to be the best way to do virtualization. Instead of running inside an emulator that slows down everything on the virtual server, all software on the virtual system runs natively and simply believes itself to be on a completely different machine.

Cleaning Up Media Device Registration

Shuah Khan noticed that a media device could sometimes hang the system if the user tried to release the device while a media ioctl was in progress. She proposed some basic ways to stop that from happening. When a media device belonged to more than one driver, for example, she wanted the kernel to maintain a reference count so that any driver unregistering the device would not inadvertently free the device while other drivers still needed it. Likewise, she said, if a media device is still in use when an application unregisters it, the device should not be released until after the application exits. Shuah posted some patches to accomplish these goals and a few other related features.

Takashi Iwai had a procedural objection, saying that Shuah should focus first on stabilizing the API, so that drivers could be successfully converted. At that point, he said, "we create a solid git branch that may be used for multiple subsystems, and I'll merge usb-audio stuff through the sound git tree." This made sense to Shuah. She explained that her implementation was mostly a useful test-case, and that the Git tree would be the right way to go, once the patches needed wider testing.

Mauro Carvalho Chehab also liked this plan, saying, "After we have this properly fixed and stabilized on media, I'll pass you a stable topic branch. This way, you can test a new version of the sound/usb patch and apply it on your tree when it fits well for you."

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

  • sysdig

    Many Linux diagnostic tools require knowledge of a special syntax, which complicates handling and confuses the output. Sysdig groups several important tools into a single interface.

  • Kernel News

    Zack Brown reports on container-aware cgroups, a different type of RAM chip on a single system, new SARA security framework, and improving GPIO interrupt handling.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News