Zack's Kernel News

Article from Issue 203/2017

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Improving the Kernel Clock

Miroslav Lichvar recently tried to make the Linux system clock more accurate. The problem wasn't that the clock itself would drift; it was that the kernel had to round off the time values for old vsyscalls, align frequency adjustments to the clock tick, and deal with the fact that numbers couldn't be stored with arbitrary precision. Each of these things would introduce small errors that would eventually build up.
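To get a feel for how a tiny rounding error can build up, here is a toy model in Python. It is not the kernel's timekeeping code. It simply assumes that the length of one clock cycle in nanoseconds is stored as an integer multiplier divided by a power of two, and the function name, the 3MHz frequency, and the shift values are all made up for illustration.

```python
from fractions import Fraction


def accumulated_error_ns(freq_hz: int, shift: int, seconds: int) -> Fraction:
    """Error (in ns) after `seconds` when ns-per-cycle is stored as mult / 2**shift."""
    ideal_ns_per_cycle = Fraction(1_000_000_000, freq_hz)
    # Fixed-point approximation: an integer multiplier and a power-of-two divisor.
    mult = round(ideal_ns_per_cycle * 2**shift)
    stored_ns_per_cycle = Fraction(mult, 2**shift)
    cycles = freq_hz * seconds
    # The per-cycle rounding error is tiny, but it is paid on every cycle.
    return (stored_ns_per_cycle - ideal_ns_per_cycle) * cycles


if __name__ == "__main__":
    one_day = 24 * 60 * 60
    for shift in (8, 16, 24):
        err = accumulated_error_ns(freq_hz=3_000_000, shift=shift, seconds=one_day)
        print(f"shift={shift:2d}: error after one day = {float(err):.1f} ns")
```

In this model, more fractional bits shrink the daily error but never remove it unless the cycle length happens to be exactly representable, which is the general shape of the problem described above.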

The real problem was that correcting for these errors would itself take time. Miroslav instead wanted to remove some of the sources of the errors. He posted a patch to do this, which resulted in a significant improvement in his test suite.

John Stultz liked the patches, but he wanted Miroslav to add his test suite to the kernel test directory, so anyone could track the effect of future patches on clock accuracy. However, Miroslav replied that his test suite was "a mess that breaks frequently as the timekeeping and other kernel code changes."

John pointed out that with the test suite in the kernel, folks would be less likely to submit patches that would break it. Or at least, they'd include patches to fix whatever breakage they introduced.

But Miroslav felt that the test suite was really and truly too fragile to inflict on other kernel developers. He suggested that a more stable solution would be to support the test suite in userspace. But John wasn't convinced. He suggested they just go for it, and in the worst-case scenario they could just take it out again later. And Rusty Russell added, "we did it with nfsim, but forward porting was a PITA. Good luck!"

Miroslav, however, couldn't live with it. Instead of submitting the test suite, he redesigned it to be more maintainable at the cost of making the code slower, less precise, and more likely to give different results after each run. In response, John bemoaned the lack of deterministic output, but felt the new suite was also acceptable.

The discussion ended there. For me, the interesting thing, aside from the more accurate clock, is that a developer had code that would have been accepted into the kernel but didn't want to submit it because he felt it was too messy. Usually that particular discussion plays out in reverse – a developer has a pile of really messy code that they really want to get into the kernel, while the higher-ups insist on cleaning it up first.

New Firmware Mailing List?

Luis R. Rodriguez felt that, in addition to the many existing kernel-related mailing lists, there needed to be another one specifically for firmware discussions. As the maintainer of the linux-firmware code, he personally had no trouble CCing all the relevant people whenever he submitted a patch, but other contributors weren't always aware of who to CC on their patches. A dedicated mailing list would ensure that everyone who needed to see a patch would see it.

Kalle Valo pointed out that an alias was already in use for submitting firmware patches to the linux-firmware.git repository, a separate project that fed back into the kernel. He felt that adding a full-fledged mailing list alongside it might be confusing. He suggested renaming one of them to make it obvious which was which.

Luis was fine with that, but David Miller (the mailing list postmaster) said there was no need for two lists; they could just use the existing alias for everything. Luis replied that the existing alias was normally full of binary blobs shooting into the Git tree and wouldn't be so fun to read as a mailing list. But Greg Kroah-Hartman sided with David, saying that there just weren't enough firmware patches to justify a whole new mailing list.

Luis pointed out that the lack of a mailing list had been the cause of unnecessary regressions and other problems with the code over the years. But Linus Torvalds made the call, saying:


Boutique mailing lists are generally a _bad_ thing. All it means that there's an increasingly small "in group" that thinks that they generate consensus because nobody disagrees with their small boutique list, because nobody else even _sees_ that small list.

We should only have mailing lists if they really merit the volume, and are big enough that there are lots of users.

Luis agreed that the "in-group" problem was an issue, but pointed out that device drivers often had their own mailing lists with very few members, and "a few folks would be a bit disturbed if they were requested to subscribe and read lkml to get their driver updates they need to review."

Greg was not convinced, saying that firmware was kernel infrastructure rather than an essentially separate project like a device driver, and that it just wasn't a big enough piece of the kernel to justify a whole list of its own. Luis shrugged and said OK, and that was that.

It's interesting because no one contradicted Luis's main point – that the wrong people were getting CCed on patches, and that this was resulting in a poorer review process and bugs slipping into the kernel. No one offered an alternative solution, and yet the people who rejected his request – David, Greg, and Linus – were big-time heavyweights. I think the conclusion is that they feel Luis still has other ways to ensure that the right people get CCed on firmware patches, ways that don't have the drawbacks of adding new communication channels.

Regularizing Virtualization

David Howells felt that virtual systems (i.e., containers) had become an unwieldy agglomeration of namespaces, control groups, and files that, taken together, defined a virtual system. But for the outside userspace to "upcall" into that virtual system, Linux seemed to have no standard approach.

The result, David said, was that certain data was given essentially the wrong security scope, or at least a non-intuitive structure. The DNS resolver, for example, would best be handled on a per-network basis, but it ended up being associated with a particular mountpoint and a particular process ID space.

David posted some patches to implement container objects. Each container object would contain the namespaces, root mountpoint, list of processes, security policies, and the credentials of the outer user who owned the running virtual system. The current containers and all subcontainers would be visible within the /proc/containers hierarchy.
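As a rough illustration of what such an object bundles together, here is a Python sketch. It is not taken from David's patches; the class name, the field names, the simplification of credentials to a single UID, and the nested path layout under /proc/containers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class ContainerObject:
    """Hypothetical model of the state a container object would tie together."""
    name: str
    owner_uid: int          # credentials of the outer user, simplified to a UID
    root_mountpoint: str
    namespaces: Dict[str, str] = field(default_factory=dict)   # kind -> identifier
    pids: List[int] = field(default_factory=list)              # member processes
    security_policies: List[str] = field(default_factory=list)
    children: List["ContainerObject"] = field(default_factory=list)


def walk(container: ContainerObject, prefix: str = "/proc/containers") -> Iterator[str]:
    """Yield one path per container, nesting subcontainers under their parent."""
    path = f"{prefix}/{container.name}"
    yield path
    for child in container.children:
        yield from walk(child, path)


if __name__ == "__main__":
    web = ContainerObject("web", owner_uid=1000, root_mountpoint="/srv/web",
                          namespaces={"net": "net-1", "pid": "pid-1"})
    web.children.append(ContainerObject("db", owner_uid=1000,
                                        root_mountpoint="/srv/web/db"))
    for path in walk(web):
        print(path)
```

The point of the sketch is only that everything defining a container hangs off one named object, which is the property the rest of the discussion argues about.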

His patches also implemented some container handling functions to create containers, do various filesystem operations, and to set up various communication channels between the container and the outside world.

James Bottomley took an immediate dislike to David's approach. Instead of creating a regular structure for all containers to follow, James said, it would be better to recognize that there were all sorts of different needs when it came to containers, and someone might have legitimate reasons for almost any way of putting one together. He said, "the strength of the current container interfaces in Linux is that people who set up containers don't have to agree what they look like. So I can set up a user namespace without a mount namespace or an architecture emulation container with only a mount namespace."

He also addressed each of the problems David had cited with the current approach, showing either how containers could be set up differently or how there were legitimate use cases that worked better under the current situation. Overall, he felt that David's approach created a set of unnecessary restrictions that would bite everyone in the butt later on.

Jessica Frazelle made a similar point, saying, "Adding a container object seems a bit odd to me because there are so many different ways to make containers, aka not all namespaces are always used as well as not all cgroups, various LSM objects sometimes apply, mounts blah blah blah. The OCI spec was made to cover all these things so why a kernel object?"

She went on to say that it was a lot less work to allow people to stick to the OCI specification by choice than to codify essentially the exact same thing into the kernel where it would be subject to maintenance costs and new bugs.

Aleksa Sarai also pointed out that "if the kernel APIs for containers massively change, then the OCI will have to completely rework how we describe containers (and so will all existing runtimes)."

He added that even though it was currently difficult to set up a secure container properly, there were real benefits to being able to set up only those parts of a virtual system that were actually needed for a given project.

Meanwhile, Jeff Layton came down more in favor of David's code. He said that even though David's code provided a way to construct a given container, it left a lot of flexibility in terms of what you actually did with it and how you structured it. The value of David's code, Jeff said, was that it gave the kernel a clear awareness of how all the different pieces of a container did in fact fit together.

Eric W. Biederman also felt that David's idea could be useful, though he felt that David's approach was not quite right. The good part, Eric said, was that a clear abstraction would make it easier to make clean, secure containers. But the bad part, he went on, was that the specific abstraction David wanted to implement was prone to bugs and could even lock the kernel into supporting an application binary interface (ABI) that it didn't like.

The binary interface is one of the most sacrosanct elements of the entire Linux kernel. If a piece of compiled code relies on a particular interface available in the kernel, Linus would rather chew glass than produce a kernel that no longer supported that interface. Just about the only thing that could induce him to do it would be the need to patch a security hole. Absent that, he'll tolerate nearly any extreme of absurd ugliness rather than break the ABI.

Eric said:

Let me suggest a concrete alternative:

  • At the time of mount observe the mounters user namespace.
  • Find the mounters pid namespace.
  • If the mounters pid namespace is owned by the mounters user namespace walk up the pid namespace tree to the first pid namespace owned by that user namespace.
  • If the mounters pid namespace is not owned by the mounters user namespace fail the mount it is going to need to make upcalls as will not be possible.
  • Hold a reference to the pid namespace that was found.

Then when an upcall needs to be made fork a child of the init process of the specified pid namespace. Or fail if the init process of the pid namespace has died.

That should always work and it does not require keeping expensive state where we did not have it previously. Further because the semantics are fork a child of a particular pid namespace's init as features get added to the kernel this code remains well defined.

For ordinary request-key upcalls we should be able to use the same rules and just not save/restore things in the kernel.

A huge advantage of my alternative (other than not being a bit-rot magnet) is that it should drop into existing container infrastructure without problems. The rule for container implementors is simple to use security key infrastructure you need to have created a pid namespace in your user namespace.
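Eric's steps can be modeled in a few lines of Python. This is only a sketch of the logic as described in his message, not kernel code: the classes, function names, and exceptions are hypothetical, and "walk up to the first pid namespace owned by that user namespace" is read here as "keep moving to the parent for as long as the parent has the same owner," which is one interpretation of his wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNamespace:
    name: str


@dataclass(eq=False)
class PidNamespace:
    name: str
    owner: UserNamespace                     # user namespace that owns this pid namespace
    parent: Optional["PidNamespace"] = None  # None for the initial pid namespace
    init_alive: bool = True                  # whether this namespace's init still runs


class MountError(Exception):
    pass


class UpcallError(Exception):
    pass


def pidns_for_upcalls(mounter_userns: UserNamespace,
                      mounter_pidns: PidNamespace) -> PidNamespace:
    """Pick, at mount time, the pid namespace that later upcalls will use."""
    if mounter_pidns.owner is not mounter_userns:
        # Upcalls would not be possible, so the mount is refused.
        raise MountError("pid namespace is not owned by the mounter's user namespace")
    ns = mounter_pidns
    # Walk up while the parent is still owned by the same user namespace.
    while ns.parent is not None and ns.parent.owner is mounter_userns:
        ns = ns.parent
    return ns  # the caller holds on to this reference


def make_upcall(pidns: PidNamespace) -> str:
    """Stand-in for forking a helper as a child of the namespace's init."""
    if not pidns.init_alive:
        raise UpcallError(f"init of {pidns.name} has died")
    return f"helper forked from init of {pidns.name}"
```

Note that the only state kept around is a single reference to a pid namespace, which matches Eric's claim that the scheme needs no expensive new bookkeeping.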

Jeff seemed to be essentially convinced by this. David, however, was not ready to give up on his approach. He argued that a lot of things that seemed to be left out of his approach were on his to-do list; that his code was not meant to unduly constrain anyone, but simply to avoid creating bad containers, and that his intention was not to replace important tools like Docker, but to provide a new tool for them to use.

The discussion grew increasingly technical, as potential security violations came into the picture. At one point while discussing a particular element of David's design, Eric said, "the filesystem implementations in the kernel are not prepared to handle hostile filesystem data structures so that that is the definition of a kernel exploit. The attack surface of the kernel gets quite a bit larger in that case."

At one point James remarked, "OK, so rather than getting into the technical back and forth below can we agree that the kernel can't have a unitary view of 'container' because the current use cases (the orchestration systems) don't have one? Then the next step becomes how can we add an abstraction that gives you what you want (as far as I can tell basically identifying a set of namespaces for an upcall) in a way that doesn't bind the kernel to have a unitary view of a container? And then we can tack the ideas on to the Jeff/Eric subthread."

At this point, the discussion seemed to be moving away from David's original intention and more toward identifying something similar that would meet more needs without the issues that had been raised about David's code.

To put the nail in the coffin, Eric officially rejected David's patches with a "Nacked-By" statement, in keeping with kernel patch submission practice. He gave as his final conclusion, "As a user visible entity I see nothing this container data structure helps solve; it only muddies the waters and makes things more brittle. Embracing the complexity of namespaces head on tends to mean all of the goofy scary semantic corner cases are visible from the first version of the design, and so developers can't take short cuts that result in buggy kernel code that persists for decades."

Even so, David continued to explain his approach, and ultimately the discussion petered out without any real resolution. By the end, it still wasn't clear that his approach was bad or that something else would be good, but it was clear that he hadn't won anyone over.

The thing about this kind of discussion is that there are a lot of stakeholders and a lot of security constraints that can pop out of the woodwork at any time. An approach that might seem perfect could be rejected for an obscure reason. Ultimately, the person standing alone trying to explain why their approach is correct could very well represent a solution that wins out over the larger group of people trying to come up with something better. Or the reverse could be true. It's a completely unpredictable situation, with sometimes truly bizarre resolutions.
