Zack's Kernel News

Article from Issue 183/2016
Author(s): Zack Brown

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Incremental Updates to the OOM Killer

Michal Hocko gathered up a few ideas from Mel Gorman and Oleg Nesterov about how to improve the OOM (out-of-memory) killer. This is a bizarre little corner of the kernel whose job is to decide which processes to kill when RAM has become so overused that it threatens to lock up the whole system. The idea is that if it chooses correctly, the OOM killer can restore the system to usability. Of course, if it chooses incorrectly, nobody's happy.

As Michal put it, "The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path."

However, a variety of folks, such as Tetsuo Handa, had shown that under some workloads the oom victim could get trapped in the D state and never exit, thus holding onto the extra memory reserves indefinitely.

Michal posted a patch to create a new kernel thread called oom_reaper that would reclaim that memory if possible. The problem with this, Michal noted, was that if the oom victim was in the process of core dumping and didn't have enough memory to complete the job, it would terminate without producing a proper debuggable dump. He justified this, however, by saying that "the overall system health is more important than debugability of a particular application."

Typically, the kernel developers would try to avoid creating a whole new kernel thread because the system can only tolerate so many threads at a time, and the more kernel threads there are, the fewer user threads there can be. Michal felt that it was necessary in this case because an out-of-memory condition naturally makes normal execution paths less reliable – that's the whole point of having an OOM killer. An independent kernel thread would have the best chance of actually getting to run.

Johannes Weiner felt that Michal put too much emphasis on making the most common cases run smoothly. He agreed that this would be necessary, but he felt that the big problem was how to resolve memory deadlocks. There had to be some way for the OOM killer to switch from trying to kill one process to trying to kill another, or perhaps even all running processes, in order to keep the kernel itself up. Before solving that problem, any other consideration seemed cosmetic to him.

However, Michal said he wasn't trying to solve that issue; he just wanted to make one troubling class of deadlocks and corner cases go away. He said there was never going to be a truly correct OOM killer implementation because it was always going to be a mishmash of heuristics and best guesses. Instead of trying to solve the problem in the code, Michal thought it would be better to allow the system administrator to choose their own policy via configuration files. That way, some amount of customization could be based on the use case.

Tetsuo agreed with Johannes, though, saying that "the last resort solution has higher priority than smoothening the common case." He added, "Please do consider offering the last resort solution first. That will help reducing unexplained hangup/reboot troubles."

But Michal really wanted to stay on topic. He said that he didn't have a last resort solution to offer and that all the last resort proposals he'd seen had their own difficult problems to solve. He was offering an incremental improvement to the current OOM killer, and, he said, "I really do not want to make this thread yet another mess of unrelated topics."

There followed a technical discussion of Michal's patch with Tetsuo and others, with only minor deviations into larger problems that he didn't want to address. He reined those in, and at one point, Mel Gorman gave his thumbs up for the patch, offering a couple of minor suggestions. The technical discussion continued for a bit, with Michal posting fixes and updates in response to comments.

The OOM killer is really one of the thorny roses of the Linux kernel. It's a lovely, beautiful thing to have until you actually try to touch it, and then you notice the blood and stinging pain. Michal's effort to clamp down on any broader discussion of overall OOM killer behavior shows how difficult it can be to make even minor changes to that part of the kernel. Everyone wants the big magical fix: the perfect rose that has no thorns.

Locks Are Hard; Watch Out!

Linus Torvalds had some advice for anyone writing kernel code: Be careful with locks! Some folks were doing some work with kthreads, and they needed to use some locking, so they posted this as part of their patch:

while (!trylock(worker)) {
    if (work->canceling)
        return;
    cpu_relax();
}
queue;
unlock(worker);

Linus immediately said:

People, you need to learn that code like the above is *not* acceptable. It's busy-looping on a spinlock, and constantly trying to *write* to the spinlock.

It will literally crater performance on a multi-socket SMP system if it ever triggers. We're talking 10x slowdowns, and absolutely unacceptable cache coherency traffic.

These kinds of loops absolutely *have* to have the read-only part. The 'cpu_relax()' above needs to be a loop that just tests the lock state by *reading* it, so the cpu_relax() needs to be replaced with something like

while (spin_is_locked(lock)) cpu_relax();

instead (possibly just "spin_unlock_wait()" – but the explicit loop might be worth it if you then want to check the "canceling" flag independently of the lock state too).

In general, it's very dangerous to try to cook up your own locking rules. People *always* get it wrong.
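
The read-before-write pattern Linus describes can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions, not the kernel's spinlock code or the patch under discussion: the `demo_*` names are hypothetical stand-ins for trylock(), spin_is_locked(), unlock(), and cpu_relax(), and the "queue" step is reduced to incrementing a counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a spinlock; not the kernel API. */
struct demo_lock {
    atomic_bool locked;
};

/* Atomic exchange is a *write*: it pulls the cache line in exclusively. */
static bool demo_trylock(struct demo_lock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

/* Plain load is a *read*: waiting CPUs can share the cache line. */
static bool demo_is_locked(struct demo_lock *l)
{
    return atomic_load_explicit(&l->locked, memory_order_relaxed);
}

static void demo_unlock(struct demo_lock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Placeholder for cpu_relax(); a real one would issue a pause hint. */
static void demo_cpu_relax(void)
{
}

/* Returns true if the work was queued, false if it was being canceled. */
static bool demo_queue_work(struct demo_lock *l, atomic_bool *canceling,
                            int *queued)
{
    while (!demo_trylock(l)) {
        /* Wait by reading only; retry the write once the lock looks free. */
        do {
            if (atomic_load(canceling))
                return false;
            demo_cpu_relax();
        } while (demo_is_locked(l));
    }
    (*queued)++;            /* stands in for the "queue" step */
    demo_unlock(l);
    return true;
}
```

The write attempt happens only when a read has just seen the lock free, so a contended lock is polled with loads rather than hammered with exchanges. The explicit inner loop also keeps the "canceling" check independent of the lock state, which is the case Linus singles out.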

Linus went on to say:

… people need to realize that locking is harder than they think, and not cook up their own lock primitives using things like trylock without really thinking about it a *lot*.

Basically, 'trylock()' on its own should never be used in a loop. The main use for trylock should be one of:

  • [some]thing that you can just not do at all if you can't get the lock.
  • avoiding ABBA deadlocks: if you have an A->B locking order, but you already hold B, instead of 'drop B, then take A and B in the right order', you may decide to first 'trylock(A)' – and if that fails you then fall back on the 'drop and relock in the right order'.

But if what you want to create is a 'get lock using trylock', you need to be very aware of the cache coherency traffic issue at least.

It is possible that we should think about trying to introduce a new primitive for that 'loop_try_lock()' thing. But it's probably not common enough to be worth it – we've had this issue before, but I think it's a 'once every couple of years' kind of thing rather than anything that we need to worry about.

The 'locking is hard' issue is very real, though. We've traditionally had a *lot* of code that tried to do its own locking, and not getting the memory ordering right etc. Things that happen to work on x86 but don't on other architectures etc.
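
The ABBA-avoidance use of trylock that Linus lists can be sketched with POSIX mutexes. This is a userspace analogy under assumptions, not kernel code: `lock_a`, `lock_b`, and `demo_take_a_while_holding_b()` are hypothetical names, and the sketch assumes the rule that A is always taken before B.

```c
#include <pthread.h>

/* Assumed lock order for this sketch: lock_a is always taken before lock_b. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/*
 * The caller already holds lock_b and now needs lock_a as well.
 * On return, both locks are held.
 *
 * Returns 0 if lock_a was taken without ever dropping lock_b, or 1 if
 * lock_b had to be dropped and retaken, in which case the caller must
 * revalidate anything that lock_b was protecting.
 */
static int demo_take_a_while_holding_b(void)
{
    /* Fast path: a single, non-looping trylock out of order. */
    if (pthread_mutex_trylock(&lock_a) == 0)
        return 0;

    /* Slow path: drop B, then take both locks in the documented order. */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    return 1;
}
```

The trylock is attempted exactly once. If it fails, the code falls back to blocking acquisition in the correct order rather than spinning on trylock, which is the distinction Linus draws.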

POSIX Compliance

Linus Torvalds and Al Viro had an interesting exchange recently about POSIX and practicality. Al had noticed that the behavior of the readlink() system call seemed wrong in certain places, and he wanted to fix them.

The readlink() system call returns the name of the thing pointed to by a symbolic link. One problem Al noticed was that calls to readlink() would update the atime value (the time of the most recent access) even in the event of an input/output error or an out-of-memory error. Typically, depending on the kind of access, Linux might update the atime of an underlying filesystem object (the file, directory, or whatnot) or of the symlink itself. Regardless, according to POSIX, readlink() should not update the atime after either of those errors.

Linus replied, "I really don't think anybody cares, but I also don't think anybody cares about the current behavior, so we can certainly fix it to match POSIX wording."

Another problem Al noticed was that readlink() was supposed to fail when used on anything other than a symlink. According to POSIX, it should fail with an EINVAL error (invalid argument). However, Al found a weird and unlikely case where readlink() could succeed when run on a directory. The Andrew File System (AFS) is a distributed filesystem that automounts directories as the user attempts to go into them. Those directories look like directories to the stat() system call. But, instead of using the open() system call, AFS uses readlink() to find out what it should open.
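
The ordinary POSIX behavior described above can be seen with a small wrapper around readlink(). This is a minimal sketch; `demo_readlink()` is a hypothetical helper, not part of any library, and it does not reproduce the AFS special case.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the target of the symlink at path into buf.
 * Returns 0 on success, or the errno value reported by readlink().
 */
static int demo_readlink(const char *path, char *buf, size_t size)
{
    ssize_t n;

    if (size == 0)
        return EINVAL;

    n = readlink(path, buf, size - 1);
    if (n < 0)
        return errno;   /* EINVAL when path is not a symbolic link */

    buf[n] = '\0';      /* readlink() does not NUL-terminate the result */
    return 0;
}
```

Called on an ordinary directory such as "/", the helper returns EINVAL, which is the failure POSIX calls for. Called on a symlink, it returns 0 and fills the buffer with the link's target.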

Al thought that the whole AFS approach was a kludge that should be fixed, but Linus came back with his position on POSIX compliance in general:

I don't think POSIX is necessarily relevant here.

We have had magic file behavior outside the scope of POSIX before, and we will have it in the future. It makes perfect sense to use readlink() for management tools for automounting, even if the normal operation is to treat the thing as a directory.

Not everything is within the domain of POSIX.

Al pointed out that since the AFS directories could only be opened via readlink(), while appearing to stat() to be a directory, there was no way for user code to know that it should open those directories with readlink() instead of the usual open().

Linus replied that only the AFS management tools needed to know about it, and they already did because it was their own filesystem. Linus added:

Not everything has to be "generic". Sometimes it's good enough to just have the ability to get the work done. Now, if it turns out that others also want to do this, maybe somebody decides "let's add flag -V to 'ls', which forces a 'readlink()' on all the targets, whether links or not, and shows the information".

I could imagine other special files having "a single line of information about the file" that they'd expose with readlink(). Who knows?

So there is *potential* for just making it generic, but that doesn't mean that it necessarily has to act that way.

Linus went on to say:

… it's not necessarily just readlink() either. I still think it might be a perfectly fine idea to allow non-directories to act as directories in some case (by exposing "readdir" and "lookup").

But readdir() really doesn't sound horrible either. How about unix domain sockets (or named pipes) giving their link information when you do readdir() on them?

Quite frankly, I think allowing those kinds of unified interfaces is better than the current situation where you have to use a "getpeername()" system call etc. If it's a filesystem object, why not allow filesystem operations to work on it?

We expose some things in /proc as symlinks, things that actually would work better as non-symlinks, exactly *because* we want to expose not just the end result of what they point to, but also a *description* of what they point to. So we have those odd "pseudo-symlinks" in /proc that don't actually really do a pathname walk on the symlink content they expose, but still *look* like symlinks just because readdir() is such a useful thing to have.
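
One place these descriptive pseudo-symlinks show up is /proc/self/fd on Linux, where each open file descriptor appears as a link. The sketch below assumes a Linux system with /proc mounted, and `demo_describe_fd()` is a hypothetical helper. For a pipe, the link content is typically a description of the form pipe:[inode] rather than a path that could be walked.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (Linux-specific): describe an open file descriptor
 * by reading its entry under /proc/self/fd.
 * Returns 0 on success, -1 on failure.
 */
static int demo_describe_fd(int fd, char *buf, size_t size)
{
    char path[64];
    ssize_t n;

    if (size == 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, buf, size - 1);
    if (n < 0)
        return -1;

    buf[n] = '\0';      /* readlink() does not NUL-terminate the result */
    return 0;
}
```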

Al cautioned against exposing lookup() for non-directories, which he said would cause a major nightmare with locking code. He added, "The situation is convoluted enough as it is; playing with parallel lookups is going to be interesting in itself and I'd rather not mix it with attempts to accommodate for hybrid objects."

At that point, the discussion veered back to a technical consideration of AFS behaviors, but it's interesting to see the two of them go at it. Al basically knows everything, while Linus has a clearer grasp of big-picture issues. Having said that, though, Al has made it clear in the past that if certain big-picture issues go the wrong way, he'd fork off his own "VirAl" kernel and take it in his own direction. He probably has the clout to bring a lot of developers with him, but so far it's never come to that.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
