Zack's Kernel News

Zack's Kernel News

Article from Issue 266/2023

Chronicler Zack Brown reports on the little links that bring us closer within the Linux kernel community.

Best Laid Plans

When the Linux kernel encountered a runtime warning, Alexander Popov didn't like that it had only two possible responses: Ignore the whole thing, or panic (i.e., crash 'n' burn). The crash 'n' burn response would trigger if the user had set the panic_on_warn flag. Otherwise, the warning would be ignored. Alexander felt that a nice middle-of-the-road response would be for the kernel to simply stop whatever it was that caused the warning. This way at least the system could still function.

Alexander also pointed out some security problems with the current state of affairs. He said that to avoid the extreme response, "panic_on_warn is usually disabled on production systems." And, "From a security point of view, kernel warning messages provide a lot of useful information for attackers. Many GNU/Linux distributions allow unprivileged users to read the kernel log, so attackers use kernel warning info leak in vulnerability exploits."

Alexander proposed a compromise so that system administrators and distribution maintainers would not feel the need to completely disable all responses to kernel warnings. He said, "Let's introduce the pkill_on_warn boot parameter. If this parameter is set, the kernel kills all threads in a process that provoked a kernel warning. This behavior is reasonable from a safety point of view described above. It is also useful for kernel security hardening because the system kills an exploit process that hits a kernel warning."

Alexander posted a patch for consideration.

Kees Cook replied approvingly, "I like this idea. I can't tell if Linus would tolerate it, though. But I really have wanted a middle ground like BUG(). Having only WARN() and panic() is not very friendly."

Paul E. McKenney suggested that if Alexander's patch was accepted, it should blacklist a few specific kernel threads that should cause a kernel panic in response to a warning, even if the gentler boot parameter was selected.

Andrew Morton suggested that the whole feature didn't need to be a boot parameter but could be runtime configurable instead.

Meanwhile, Petr Mladek felt that the whole kit and caboodle was such an obscure kernel feature to begin with, that "I wonder who uses it in practice and what is the experience. The problem is that many developers do not know about this behavior. They use WARN() when they are lazy to write more useful message or when they want to see all the provided details: task, registry, backtrace."

He added, "It somehow reminds me the saga with %pK. We were not able to teach developers to use it correctly for years and ended with hashed pointers."

The %pK saga refers to an option to the printk() kernel function, used for outputting pointers that needed to be hidden from unprivileged users. The behavior of that particular format specifier was dependent on the configuration state that needed to be set, and the specific cases when %pK should or should not be used were dependent on who would be reading that output. You can see how developers might be confused.

But that's neither here nor there. Regarding Alexander's current proposal, Petr suggested that the whole panic-on-warning scenario was overkill, and he pointed out, "What about pr_err(), pr_crit(), pr_alert(), pr_emerg()? They inform about even more serious problems. Why a warning should cause panic/pkill while an alert message is just printed?"

To which Alexander replied, "That's a good question. [...] From a security point of view, a kernel warning output is interesting for attackers as an infoleak. The messages printed by pr_err(), pr_crit(), pr_alert(), pr_emerg() provide less information."

As for Petr's assertion that WARN() was used incorrectly by kernel developers, Steven Rostedt replied:

"WARN() Should never be used just because of laziness. If it is, then that's a bug. Let's not use that as an excuse to shoot down this proposal. WARN() should only be used to test assumptions where you do not believe something can happen. I use it all the time when the logic prevents some state, and have the WARN() enabled if that state is hit. Because to me, it shows something that shouldn't happen happened, and I need to fix the code.

"Basically, WARN should be used just like BUG. But Linus hates BUG, because in most cases, these bad areas shouldn't take down the entire kernel, but for some people, they WANT it to take down the system."

From this point, the discussion descended into implementation details, with everyone chiming in about exactly what the behavior should be under circumstances X, Y, and Z. At a certain point, Linus Torvalds offered his analysis:

"There are only two valid uses for panic-on-warn:

(a) test boxes (particularly VM's) that are literally running things like syzbot and want to report any kernel warnings

(b) the 'interchangeable production machinery' fail-fast kind of situation

"So in that (a) case, it's literally that you consider a warning to be a failure case, and just want to stop. Very useful as a way to get notified by syzbot that 'oh, that assert can actually trigger'.

"And the (b) case is more of a 'we have 150 million machines, we expect about a thousand of them to fail for any random reason any day _anyway_ – perhaps simply due to hardware failure, and we'd rather take a machine down quickly and then perhaps look at why only much later when we have some pattern to the failures'.

"You shouldn't expect panic-on-warn to ever be the case for any actual production machine that _matters_. If it is, that production maintainer only has themselves to blame if they set that flag.

"But yes, the expectation is that warnings are for 'this can't happen, but if it does, it's not necessarily fatal, I want to know about it so that I can think about it'.

"So it might be a case that you don't handle, but that isn't necessarily _wrong_ to not handle. You are ok returning an error like -ENOSYS for that case, for example, but at the same time you are 'If somebody uses this, we should perhaps react to it'.

"In many cases, a 'pr_warn()' is much better. But if you are unsure just _how_ the situation can happen, and want a call trace and information about what process did it, and it really is a 'this shouldn't ever happen' situation, a WARN_ON() or a WARN_ON_ONCE() is certainly not wrong.

"So think of WARN_ON() as basically an assert, but an assert with the intention to be able to continue so that the assert can actually be reported. BUG_ON() and friends easily result in a machine that is dead. That's unacceptable.

"And think of 'panic-on-warn' as people who can deal with their own problems. It's fundamentally not your issue. They took that choice, it's their problem, and the security arguments are pure BS – because WARN_ON() just shouldn't be something you can trigger anyway."

And regarding Alexander's patch, Linus said "Honestly, I don't see the point. [...] I'd like to hear of an actual _use_ case. That's different. That's somebody actually _using_ that pkill to good effect for some particular load. [...] That said, I don't much care in the end. But it sounds like a pointless option to just introduce yet another behavior to something that should never happen anyway."

The discussion continued briefly. But without support from Linus, the issue petered out after a short time.

This is a very common occurrence in kernel development and probably quite common throughout the open source world: Someone has an idea that seems to address something significant, and the only way to raise the issue is to actually code it up and send it out for review. But then having done all that work to enable others to join in the conversation, it turns out the true situation is slightly different, and the idea ends up being discarded.

Revision Control Theory

Not everyone remembers that Linus Torvalds wrote the Git revision control system like they remember he wrote Linux. The Git backstory is pretty amazing, but it's current story is ongoing. Recently Linus made some comments comparing Git to other revision control systems and specifically to Darcs – a powerful Git alternative by David Roundy.

The subject came up when Greg Kroah-Hartman submitted a massive pull request for a giant pile of kernel driver code coming from dozens of other developers. In submitting the pull request, Greg remarked, "Note, you will have a merge conflict in the drivers/net/wireless/silabs/wfx/sta.c file, please just take the change that came in from the wifi tree. We thought as I had pulled the same merge point from the wifi developers this type of conflict wouldn't have happened, but for some reason git flags it as something to pay attention to and couldn't resolve it itself."

Linus replied:

"That 'some reason' is because the networking tree made other changes to the file since (ie commit 2c33360bce6a: 'wfx: use container_of() to get vif').

"So both branches had done the same change (the merge), but one branch had then done other changes on top of that same change.

"Broken SCM thinking then thinks that means that 'oh, then we obviously have to take the extra change' (eg darcs 'patch algebra'), and make that the basis of their resolution strategy. It's not actually a valid model, because it just assumes that the additional patches were right. Maybe there was a _reason_ that extra patch wasn't done in the other branch? The extra patch might have been due to particular issues in that branch, you can't just make the darcs assumption of reordering patches and taking some union of them (which is an over-simplification of the patch algebra rules).

"Now, that's not to say that git can't get things wrong too when resolving things. But at least it doesn't make some fundamental mistake like that.

"The git rules are basically that it will resolve changes that aren't overlapping, using the traditional 3-way model (it then has that whole 'recursion and rename detection' thing, but that's more of a higher-level metadata thing separate from the actual code merge).

"So git doesn't assume any 'semantics' to the changes. If it sees that two branches changed the same code in different ways, git will go 'this is a conflict', and leave it to human (or scripted) intervention.

"Again, it's not that the git model is always right – you can obviously have changes that do *not* overlap at all, but still have a very fundamental semantic conflict, and git will happily merge those things and think it is all good.

"So the git model is basically practical and straightforward (also 'stupid', but in a good way – do the common truly obvious 3-way merges, don't try to do anything clever when that fails). There's no 'theory' behind it that might turn out to be completely wrong."

To which Greg replied, "That makes more sense now, git is being 'safe' by asking for the developer to look and resolve it themselves. Thanks for the explanation."

And that was the end of the conversation.

Concurrent Directory Updates

The Virtual Filesystem (VFS) is the central, core, generic filesystem code within the kernel, around which all other filesystems revolve. All filesystems interact with the VFS to provide their special data-storing features, and all applications interact with the VFS to access all that stored data.

A major goal of the VFS is to provide as fast as possible communication between applications and the data on the underlying filesystems. Recently, Neil Brown asked folks to discuss an issue he had noticed in the VFS:

"VFS currently holds an exclusive lock on a directory during create, unlink, rename. This imposes serialisation on all filesystems though some may not benefit from it, and some may be able to provide finer grained locking internally, thus reducing contention.

"This series allows the filesystem to request that the inode lock be shared rather than exclusive. In that case an exclusive lock will be held on the dentry instead, much as is done for parallel lookup.

"The NFS filesystem can easily support concurrent updates (server does any needed serialisation) so it is converted.

"This series also converts nfsd to use the new interfaces so concurrent incoming NFS requests in the one directory can be handled concurrently.

"As a net result, if an NFS mounted filesystem is reexported over NFS, then multiple clients can create files in a single directory and all synchronisation will be handled on the final server. This helps hide latency on link from client to server."

Daire Byrne replied with exuberant enthusiasm, thanking Neil for his work on this. Daire said, "I'm probably the main beneficiary of this (NFSD) effort atm so I feel extra special and lucky!" He ran some tests and reported up to a 40-fold increase in the number of file creations per second he could achieve, from 2.4 per second, to up to 121 per second. He posted some additional numbers. There were some problems remaining, but Daire concluded, "all in all, the performance improvements in the knfsd re-export case is looking great and we have real world use cases that this helps with." In addition, he offered to do more testing for the remaining difficult cases.

Neil was very happy for the feedback and posted some updated patches to address some of the problems Daire's tests had uncovered.

Anna Schumaker and Daire both started testing the new patches on their systems, and both reported some further timing data. Neil was thrilled to have more testers and posted additional patches addressing the issues they had uncovered.

At one point, Daire said, "This patch does the job for me – no more stack traces and things have been stable all day. I'm going to run some production loads over the weekend and then I'll do some more artificial scale testing next week. Thanks again for this work! Improving the parallelism anywhere we can for single clients and then nfsd is great for reexport servers (especially once you add some 'cloud' latency)."

There was no further discussion, just a few more test results. This does not necessarily mean the code will go into the kernel – it still has to pass through Al Viro's gauntlet because he maintains the VFS code and will want to be sure there are no new problems introduced by the patches. In any event, Neil isn't actually submitting the code yet; he's just requesting comments. However, it does seem very likely that this code or something like it will go into the VFS at some point in the near future, just because there are filesystems like NFS that do seem to show a significant benefit from it.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Kernel News

    In kernel news: Vulnerabilities using a 32-Bit Kernel on a 64-Bit CPU; Working Around Hardware Security Vulnerabilities; and When It's OK to Panic.

  • Kernel News

    In Kernel News: Removing the Thorn and What Do You Do When There's Nothing to Do?

  • Kernel Tips

    Worried about a recent security exploit? Want to take advantage of a new hardware feature? You don’t need to be a Linux expert to patch and compile the Linux kernel. We'll show you how to get started.

  • Kernel News

    This month in Kernel News: Chasing the Dream; The Power of the FUSE Side; NTFS3 Maintainership Issues: and Crashing and Warning.

comments powered by Disqus
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

Learn More