Zack's Kernel News

Article from Issue 243/2021

This month in Kernel News: Debugging Production Systems; Patch Submission Guidelines; New Filesystem for SSD and Flash Drives; and Unlocking Memory Access.

Debugging Production Systems

In general, a developer might debug the Linux kernel by compiling a bunch of special debugging features, performance analysis features, and whatnot. Then they'll patch any kernel bugs they find. However, the actual users who benefit from those bug fixes would not run the debugging features themselves, because there would be too much overhead. Regular Linux users want fast, low-resource, secure systems that generally kick ass in all directions.

Recently, however, Marco Elver of Google submitted a patch implementing Kernel Electric-Fence (KFENCE), a debugging tool that's intended to ship as a default feature of the standard Linux source tree. Why would he turn the world upside down in this way?

KFENCE's goal is to identify memory-safety bugs such as out-of-bounds accesses and use-after-free errors. Other debugging tools such as the Kernel Address Sanitizer (KASAN) do this too, but they take a system-wide performance hit in order to find those problems.

KFENCE, on the other hand, sacrifices accuracy in order to achieve virtually no performance hit. However, as Marco explained, given enough time on a large enough set of running systems, KFENCE will eventually accurately detect memory issues. The inaccuracy of any individual test will disappear into statistical background noise as the truth is revealed.
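To see why a sloppy-but-cheap check still wins out over time, here is a rough back-of-the-envelope sketch. It is a simplified model and not how KFENCE actually picks which allocations to guard: it just assumes each run of a buggy code path has some small, independent chance of being sampled. The function name and the sample rate are made up for illustration.

```python
def detection_probability(sample_rate: float, executions: int) -> float:
    """Chance that at least one of `executions` runs of a buggy path is sampled.

    Assumes (hypothetically) that every run is sampled independently with
    probability `sample_rate`.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if executions < 0:
        raise ValueError("executions must not be negative")
    # Probability that every single run escapes sampling, subtracted from 1.
    return 1.0 - (1.0 - sample_rate) ** executions


# A made-up 0.1 percent sample rate, on one machine versus a whole fleet.
single_machine = detection_probability(0.001, 100)
whole_fleet = detection_probability(0.001, 10_000)
```

Under these assumptions, any one machine will probably miss the bug, but across enough runs on enough machines, a miss becomes vanishingly unlikely.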

So, why bother? Why not just use tools like KASAN during development and leave debugging tools out of production systems entirely?

Marco explained that "KFENCE will detect bugs in code paths not typically exercised by non-production test workloads."

This is often an issue that frustrates kernel developers. The kernel's process scheduler is a nightmare to maintain, because it has to be developed on a tiny number of systems, but still provide the best possible scheduling performance for hundreds of millions of systems across the galactic quadrant. The amount of pure bile disgorged by developers, one upon another, in debate on that subject alone has hollowed out the nostrils of many a horrified onlooker.

KFENCE takes a different approach. By targeting production systems, it hopes to identify bugs in kernel code that are really and truly along the most-used code paths. Then when those bugs get fixed, they'll be doing the most good for the most people.

There was no debate, but plenty of developers piled in to take a look at Marco's code, point out bugs, and offer suggestions. Jonathan Cameron from Huawei had a few fixes to offer, which Marco said would get into the next version of the patch. Dmitry Vyukov from Google also pointed out some problems with Marco's code. And Alexander Potapenko, on Marco's team at Google, said they'd address these problems in the next version.

SeongJae Park from Amazon was very interested in the KFENCE approach and offered his own evaluation of the patch. Marco took SeongJae's suggestions and said they'd be ready for the version after the next patch release.

Discussion of bugs was soon replaced with suggestions for naming alternatives and code organization preferences, which would seem to indicate that people were generally happy with the direction of the patch.

No one raised any alarm bells, for example, by saying, "we will never allow debugging code in a production release, my friend. Never." Though that doesn't mean no one will raise such objections as the patch gets closer towards acceptance into the main source tree.

For the moment, all seems favorable, at least among the corporate entities who in all likelihood would be running the patch on their production systems in millions upon millions of servers housed in secret data centers around the world. So they, at least, seem highly motivated to root out memory issues in the code fast paths upon which they themselves rely. And the rest of us would benefit from those discoveries as well.

Patch Submission Guidelines

Linus Torvalds recently gave some advice to Jakub Kicinski about pull requests. Jakub had sent a pull request for a big pile of bug fixes in the networking tree – deadlocks, authentication, connectivity, the works. There were dozens of patches from a slew of contributors. All of which was fine.

Linus said he used scripts to filter his incoming email, and having the words "git" and "pull" somewhere in the email body (as opposed to the subject line, for example) would make sure he got his eyeballs on it as soon as possible. Without those keywords in the email, he said, the eyeballs would still work, but they might be slower.
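Linus didn't share his scripts, so the following is only a hypothetical sketch of the kind of check he described: does the email body contain both "git" and "pull" as separate words? The function name and the sample text are invented for illustration.

```python
import re

REQUIRED_WORDS = {"git", "pull"}


def body_mentions_git_pull(body: str) -> bool:
    """Return True if the body contains both 'git' and 'pull' as whole words."""
    # Split on anything that isn't a letter, so "github" does not count as "git".
    words = set(re.findall(r"[a-z]+", body.lower()))
    return REQUIRED_WORDS <= words


flagged = body_mentions_git_pull("Please git pull the networking fixes for this week.")
```

A real mail filter would more likely be a procmail or Sieve rule than Python, but the logic would be the same.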

Linus also said that for the actual description, the present tense ("Ido fixes failure to add bond interfaces to a bridge") should be replaced with the imperative ("Fix failure to add bond interfaces to a bridge"). The reason, he said, was that these Git log entries would be read by future developers as past events. So if they were written as if they were current at all times, their truth value would likely fluctuate or just fall out of date entirely.

Linus remarked, "Using present tense in particular is very odd when somebody fixed something a year ago and you go back to the description that says 'Ido fixes'. No, he fixed things long ago."

He also explained that for pull requests involving tons of contributors who all deserved credit, "if you want to call out the developer, please do it _after_ describing the actual fix. Because the commit log (whether for an individual patch or for a merge message) is primarily about what the change is about. Authorship is separate (and generally shows up as such)."

Jakub replied that he'd stick with all those guidelines for next time.

Some of the guidance, as Linus pointed out, was already in the kernel's documentation for submitting patches, though that doc was intended for plain patches rather than pull requests.

New Filesystem for SSD and Flash Drives

Mikulas Patocka announced that he was designing a new filesystem called NVFS. The name apparently derives from the now-defunct NovaFS project. NovaFS was intended to mount directory trees on nonvolatile memory, such as flash drives and solid-state drives (SSDs). Mikulas's NVFS project, he said, was smaller and faster than its predecessor; it also had more features and ran on more recent kernels.

Dan Williams was pleased and impressed to see this code. But he did caution Mikulas to avoid some of the pitfalls that had caught earlier projects – not just NovaFS, but ext4fs, the default filesystem in most Linux distributions. Specifically, in Mikulas's original post, he'd suggested that fsck ran very slowly on NVFS, because the kernel used the buffer cache to map some memory devices. Mikulas felt that the buffer cache simply might be too heavyweight, and that fsck could be sped up by 500 percent or 1,000 percent if the kernel would directly map block devices based on Direct Access (DAX), bypassing the buffer cache.

Dan cautioned against this, saying that a whole passel of problems had come up for ext4fs when they tried to bypass the cache and do things directly. Dan acknowledged that those problems all very likely had solutions in some kind of ultimate-truth-of-the-universe sense. However, no one had been able to wend their way to that ultimate truth at the time, and going through the buffer cache solved everything in one fell swoop. Going through that same mess with NVFS, Dan implied, might just lead to the same spiraling misery, and maybe fsck didn't really need to run at the absolute highest possible speed after all.

Mikulas was not thoroughly discouraged by this and gave some thought to exactly what might be involved. But he did conclude that "it isn't as easy as it looks." A little while later, he reported that he had "implemented this functionality," although it was a bit of a compromise – essentially, for some files, DAX-based mapping could work, while for others it was necessary to fall back to the standard buffer cache solution. Mikulas was hopeful that he could continue finessing this, gaining more and more speed benefits for fsck along the way.
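The "direct where possible, fall back otherwise" compromise can be pictured with a loose user-space analogy. This is not NVFS or kernel code, and every name in it is hypothetical: the sketch tries to memory-map a file and, when mapping isn't possible (Python refuses to map an empty file, for example), falls back to an ordinary buffered read.

```python
import mmap
import os
import tempfile


def read_all(path: str) -> tuple[bytes, str]:
    """Return (data, method), where method is "mmap" or "buffered"."""
    with open(path, "rb") as f:
        try:
            # Fast path: map the file directly instead of copying through read().
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
                return view[:], "mmap"
        except (ValueError, OSError):
            # Fallback: the mapping was refused, so use a plain buffered read.
            f.seek(0)
            return f.read(), "buffered"


def demo() -> tuple[str, str]:
    """Read one non-empty and one empty temp file; report the method used."""
    methods = []
    for payload in (b"superblock", b""):
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(payload)
        try:
            data, method = read_all(tmp.name)
            assert data == payload
            methods.append(method)
        finally:
            os.remove(tmp.name)
    return methods[0], methods[1]
```

The caller gets the same bytes either way; only the speed differs, which is the tradeoff Mikulas described for fsck.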

Dan didn't seem optimistic about going down this road, but he offered what help he could. And if Mikulas did end up finding the ultimate DAX-based solution, then all filesystems would likely benefit, so why not try?

Linux supports an uncountable number of filesystems. In fact, if you're looking for a fun entry point into kernel hacking, writing your own filesystem could be a good way to go. The number is uncountable because there is such a low barrier to entry. Large numbers of people have whipped up filesystems for their own particular use, using Filesystem in Userspace (FUSE). This delectable part of the kernel was originally created in response to Hurd's claim that it would be able to offload many core kernel features into user space, including things like networking and filesystems. Hurd developers claimed that monolithic kernels like Linux would never be able to support such things. So some Linux developers got together and did it.

Note that NVFS is not a FUSE-based filesystem. I only mention FUSE here because it's so cool, and because cool filesystems are fun, and because anyone wanting to play around with creating their own filesystems can do it right now without waiting.
