Zack's Kernel News

Article from Issue 250/2021

Chronicler Zack Brown reports on: Trusting Trusted Computing; New Userspace Load-Balancing Framework; and Ending Big Endian.

Trusting Trusted Computing

There's a fundamental conflict between user and vendor in the commercial world. For example, if the vendor had full control over your system, they'd be able to offer streaming video services without the risk of you copying the stream and sharing the file. On the flip side, that level of control would also allow the vendor to control unrelated ways you wanted to use your system.

The Linux development philosophy – and open source philosophy in general – believes the user should have full control over their system. If a "feature" can't be implemented without taking that control away, then according to that philosophy, the feature simply shouldn't be implemented.

Not surprisingly, this is a controversial topic and a source of tension between Linux developers and commercial enterprises, many of which contribute truly massive numbers of person-hours of development to Linux.

The problem is that humanity has already experienced what happens when the vendor controls the user's system. It leads to the same sort of lockouts, poor interoperability, and general loss of configurability that existed before Linux took over the world. Linux was the cure, but the disease is always waiting for its chance to come back.

It's similar to those southern U.S. states that for decades were legally prevented from passing laws disenfranchising minorities. After all that time, they argued that the laws were no longer necessary because we lived in a post-race world where disenfranchisement was a thing of the past. So the laws were repealed, and the states proceeded to pass laws aggressively disenfranchising minorities.

As in that case, we shouldn't let success make us forget what we were protecting ourselves from in the first place.

Lately, Eric Snowberg posted some patches to retain user control over the encryption keys used to keep the kernel secure. As Eric put it, "Currently, pre-boot keys are not trusted within the Linux boundary. Pre-boot keys include UEFI [Unified Extensible Firmware Interface] Secure Boot DB keys and MOKList [Machine Owner Key List] keys. These keys are loaded into the platform keyring and can only be used for kexec. If an end-user wants to use their own key within the Linux trust boundary, they must either compile it into the kernel themselves or use the insert-sys-cert script. Both options present a problem. Many end-users do not want to compile their own kernels. With the insert-sys-cert option, there are missing upstream changes. Also, with the insert-sys-cert option, the end-user must re-sign their kernel again with their own key, and then insert that key into the MOK db. Another problem with insert-sys-cert is that only a single key can be inserted into a compressed kernel."

Eric proposed adding a new MOK variable to the kernel, to let the user employ a new MOK keyring containing their own personal security keys. After bootup, the keys would be destroyed, leaving them completely inaccessible to any hostile attacker.

As Eric explained, "The MOK facility can be used to import keys that you use to sign your own development kernel build, so that it is able to boot with UEFI Secure Boot enabled. Many Linux distributions have implemented UEFI Secure Boot using these keys as well as the ones Secure Boot provides. It allows the end-user a choice, instead of locking them into only being able to use keys their hardware manufacture provided or forcing them to enroll keys through their BIOS."

Eric and Mimi Zohar had a bit of a technical discussion over whether the MOK keyring needed to be destroyed after bootup or if it could be kept around like the other keys. The benefit, Mimi said, was that since the other keys were kept anyway, it would make sense to avoid adding exceptional cases to the code. Exceptional cases are always good places for hostile actors to look for security holes, so the fewer of them, the better.

There was not much debate, but neither was there a roar of acclamation. Security is security, and objections generally come from surprising directions. But at least for now, Eric's patches seem to be moving forward, providing an easier way for users to ensure that they, and not a vendor, have the final say on how to use their system.

As a very favorable sign, Linus Torvalds replied to a later version of the patch with no technical objections, saying simply, "I saw that you said elsewhere that MOK is 'Machine Owner Key', but please let's just have that in the sources and commit messages at least for the original new code cases. Maybe it becomes obvious over time as there is more history to the code, but when you literally introduce a new concept, please spell it out."

New Userspace Load-Balancing Framework

I always love seeing companies release code under open source licenses. Recently, Peter Oskolkov from Google put out some very early patches for consideration by the Linux kernel developers. Peter said, "'Google Fibers' is a userspace scheduling framework used widely and successfully at Google to improve in-process workload isolation and response latencies. We are working on open-sourcing this framework, and UMCG (User-Managed Concurrency Groups) kernel patches are intended as the foundation of this."

He went on, "Unless the feedback here points to a different approach, my next step is to add timeout handling to sys_umcg_wait/sys_umcg_swap, as this will open up a lot of Google-internal tests that cover most of use/corner cases other than explicit preemption of workers (Google Fibers use cooperative scheduling features only). Then I'll work on issues uncovered by those tests. Then I'll address preemption and tracing."

Jonathan Corbet remarked, "I have to ask … is there *any* documentation out there on what this is and how people are supposed to use it? Shockingly, typing 'Google fibers' into Google leads to a less than fully joyful outcome …. This won't be easy for anybody to review if they have to start by reverse-engineering what it's supposed to do."

Peter gave links to a video and a PDF, adding that on the kernel mailing list, external links were generally discouraged, so he hadn't wanted to violate that convention. However, Randy Dunlap replied, "for links to email, we prefer to use archives. Are links to other sites discouraged? If so, that's news to me."

Peter Zijlstra replied:

"Discouraged in so far as that when an email solely references external resources and doesn't bother to summarize or otherwise recap the contents in the email proper, I'll ignore the whole thing.

"Basically, if I have to click a link to figure out basic information of a patch series, the whole thing is a fail and goes into the bit bucket.

"That said, I have no objection against having links, as long as they're not used to convey the primary information that _should_ be in the cover letter and/or changelogs."

Meanwhile, Jonathan pointed out that Peter O.'s video was from 2013, and "the syscall API appears to have evolved considerably since then." He went on, "This is a big change to the kernel's system-call API; I don't think that there can be a proper discussion of that without a description of what you're trying to do."

Peter O. said he'd put together some documentation and submit it with the next patch set. And he added that there were some documentation comments in the code itself. To this, Jonathan suggested, "A good overall description would be nice, perhaps for the userspace-api book. But *somebody* is also going to have to write real man pages for all these system calls; if you provided those, the result should be a good description of how you expect this subsystem to work."

Peter O. wrote up some documentation and posted it to the list – adding that it might be a bit early for full man pages, as he expected the API to change significantly before the whole thing went into the kernel.

In his documentation file, Peter O. said that UMCG "lets user space application developers implement in-process user space schedulers."

The document pointed out that the Linux kernel default scheduler was good for general purpose load-balancing, while Google's approach allowed certain processes to be considered more "urgent" than others. Peter O. said in the document:

"For example, a single DBMS process may receive tens of thousands [of] requests per second; some of these requests may have strong response latency requirements as they serve live user requests (e.g., login authentication); some of these requests may not care much about latency but must be served within a certain time period (e.g., an hourly aggregate usage report); some of these requests are to be served only on a best-effort basis and can be NACKed under high load (e.g., an exploratory research/hypothesis testing workload).

"Beyond different work item latency/throughput requirements as outlined above, the DBMS may need to provide certain guarantees to different users; for example, user A may 'reserve' 1 CPU for their high-priority/low latency requests, 2 CPUs for mid-level throughput workloads, and be allowed to send as many best-effort requests as possible, which may or may not be served, depending on the DBMS load. Besides, the best-effort work, started when the load was low, may need to be delayed if suddenly a large amount of higher-priority work arrives. With hundreds or thousands of users like this, it is very difficult to guarantee the application's responsiveness using standard Linux tools while maintaining high CPU utilization.

"Gaming is another use case: Some in-process work must be completed before a certain deadline dictated by [the] frame rendering schedule, while other work items can be delayed; some work may need to be cancelled/discarded because the deadline has passed; etc."

Aside from this, Peter O. said in the document, there could be security benefits as well. For example, "Fast, synchronous on-CPU context switching can also be used for fast IPC (cross-process). For example, a typical security wrapper intercepts syscalls of an untrusted process, consults with external (out-of-process) 'syscall firewall', and then delivers the allow/deny decision back (or the remote process actually proxies the syscall execution on behalf of the monitored process). This roundtrip is usually relatively slow, consuming at least 5-10 usec, as it involves waking a task on a remote CPU. A fast on-CPU context switch not only helps with the wakeup latency but also has beneficial cache locality properties."

Jonathan liked the document and did reiterate his desire for real API documentation for the system calls. As he put it, "it will really be necessary to document the system calls as well. *That* is the part that the kernel community will have to support forever if this is merged."

Peter Z. found the documentation less useful and complained to Peter O., "You present an API without explaining, *at*all*, how it's supposed to be used, and I can't seem to figure it out from the implementation either."

He went on:

"I'm confused by the proposed implementation. I thought the whole point was to let UMCG tasks block in kernel, at which point we'd change their state to BLOCKED and have userspace select another task to run. Such BLOCKED tasks would then also be captured before they return to userspace, i.e., the whole admission scheduler thing.

"I don't see any of that in these patches. So what are they actually implementing? I can't find enough clues to tell."

He had many more technical comments about Peter O.'s patches, all negative. However, Peter O. replied, "Finally, a high-level review – thanks a lot, Peter!"

It was starting to become clear to Peter O., from Peter Z.'s and others' reactions, that UMCG's overall approach "is not resonating with kernel developers/maintainers – you are the third person asking why there is no looping in sys_umcg_wait, despite the fact that I explicitly mentioned pushing it out to the userspace."

Peter O. tried to explain the main approach. Primarily, he said, the new system calls were not intended to do all the work – they were only supposed to handle the in-kernel requirements. Then, for things that were easier to handle in user space, the system calls would just kick the problem out to be handled at that layer. This made sense to him, because things overall would be simpler and clearer. But he did acknowledge that this would leave the new system calls "logically incomplete." He asked if this would be permissible, or if system calls were expected to handle everything rigorously themselves.
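One way to picture this split is a kernel primitive that may return before its condition actually holds, with the retry loop living entirely in userspace. The sketch below is purely illustrative; the names and behavior are stand-ins of my own, not the real sys_umcg_wait API:

```python
def make_umcg_wait_stub(spurious_wakeups):
    """Stand-in for a 'logically incomplete' kernel primitive: the first
    few calls return without the awaited condition holding, the way a
    real wait syscall may wake up spuriously."""
    calls = {"n": 0}
    def umcg_wait(state):
        calls["n"] += 1
        return calls["n"] > spurious_wakeups and state["ready"]
    return umcg_wait

def wait_until_ready(umcg_wait, state, max_tries=100):
    # Userspace owns the retry loop: re-invoke the primitive and re-check
    # the condition, instead of expecting the kernel to loop internally.
    for _ in range(max_tries):
        if umcg_wait(state):
            return True
    return False
```

In this picture the kernel-side call stays minimal, and the policy (how often to retry, when to give up) lives in the application, which is the division of labor Peter O. was arguing for.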

This was a relatively new idea for Peter Z., who replied that intuitively, he felt rigorous system calls would be the way to go.

Peter Z. and Peter O. went on to discuss many more implementation details, which seemed to give Peter O. a lot of inspiration for the next patch set.

At one point, Thierry Delisle joined the technical discussion, saying, "I am one of the main developers on the Cforall programming language, which implements its own M:N user-threading runtime. I want to state that this RFC is an interesting feature, which we would be able to take advantage of immediately, assuming performance and flexibility closely match state-of-the-art implementations."

The discussion is ongoing. To me, it seems like this would be a very useful feature to get into the kernel in one form or another. A large portion of Google's product infrastructure certainly involves massively distributed software running on millions of globally distributed, relatively low-end hardware systems. Here they are open sourcing some of the keys to that scale of clustering. It's possible that their implementation has problems, but I would bet that eventually this patch set, or something similar, will go into the kernel.

Ending Big Endian

The Linux kernel is not exclusively written in the C language. There are other languages, including Rust – a C-like language that's been getting a lot of attention, not least because Linus Torvalds has accepted it into the kernel. But I'm not here to talk about that; I'm here to talk about a tiny related detail that came up recently on the mailing list.

Miguel Ojeda submitted some patches recently to deal with the large size of Rust symbols in the kernel code. Symbols are names that correspond to memory locations. Linux uses a symbol table so that the kernel can refer to memory locations that may change, via a consistent symbol name. It's not a Rust thing; it's a standard part of Linux. However, with Rust, these symbols were getting a bit long, and Miguel wanted to make sure each symbol name had enough space.

Most symbol name lengths, Miguel said, had no trouble fitting into a single byte, though some needed two. But increasing the length field to two bytes for all symbols would be a big waste of space, kernel-wide. Miguel wanted to finagle it a little.

His idea was to distinguish between regular-sized symbols and "big" symbols. His patch accomplished this by testing the length of the symbol at certain points in the kernel code. If the kernel reported the length as zero, that would mean the symbol was actually "big" and would use two bytes.
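The escape scheme can be sketched in a few lines of Python. This is an illustration of the idea as described above, not Miguel's actual kallsyms code; the function names are my own:

```python
def encode_symbol(name: bytes) -> bytes:
    # Ordinary symbols: one length byte (1..255), then the name itself.
    # "Big" symbols: a 0 escape byte, then a two-byte length, then the
    # name. The byte order of that two-byte length is a separate choice;
    # Miguel's first version used big endian.
    n = len(name)
    if 0 < n <= 0xFF:
        return bytes([n]) + name
    if n <= 0xFFFF:
        return bytes([0]) + n.to_bytes(2, "big") + name
    raise ValueError("symbol too long")

def decode_symbol(buf: bytes) -> bytes:
    if buf[0] != 0:                        # common case: one length byte
        n, off = buf[0], 1
    else:                                  # escape: length read as "zero"
        n, off = int.from_bytes(buf[1:3], "big"), 3
    return buf[off:off + n]
```

Short names pay only one byte of overhead, and the rare long Rust symbol pays three, which is the space savings Miguel was after.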

That's standard magic. Of course the length isn't really zero; it's just a pathological case that Miguel could make use of by assigning a meaning to it. As long as such weirdness is documented in the code, the top kernel developers will often approve. In fact, it's fairly normal.

However, Linus noticed that the two byte "big" symbols were in "big endian" order in Miguel's patch. Whenever you have a multi-byte piece of data, the order of bytes is considered "big endian" if the most significant byte occupies the lowest-numbered memory address, and "little endian" if the most significant byte occupies the highest-numbered memory address. Endianness is just a convention; it doesn't do anything special. But whichever endianness you've got, your code has to handle it.

Linus, on seeing this, said:

"Why is this in big-endian order?

"Let's just try to kill big-endian [BE] data, it's disgusting and should just die already.

"BE is practically dead anyway, we shouldn't add new cases. Networking has legacy reasons from the bad old days when byte order wars were still a thing, but those days are gone."

When I said above that endianness was just a convention, it was true, but there are details. For example, CPUs have their endianness hard-coded, and each CPU's endianness choices must be accommodated by the operating system. Also, as Linus pointed out, networking protocols have got some endianness standards that are hard to shake.

But in general, from a computational standpoint, little endian is more efficient to handle. Certain operations, such as casting a piece of data from a larger size to a smaller one, are a simple matter of ignoring the extra bytes, while in big endian the system has to do some extra work to produce the desired cast.
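Python's struct module makes both points easy to see (a toy illustration, not kernel code):

```python
import struct

value = 0x11223344

le = struct.pack("<I", value)  # little endian: least significant byte first
be = struct.pack(">I", value)  # big endian: most significant byte first
assert le == b"\x44\x33\x22\x11"
assert be == b"\x11\x22\x33\x44"

# On a little-endian layout, truncating a 32-bit value to 16 bits is just
# reading the first two bytes at the same address; no shifting is needed.
assert le[:2] == struct.pack("<H", value & 0xFFFF)
# On big endian, the 16-bit value lives at the *end* of the 32-bit one,
# so the address has to change.
assert be[2:] == struct.pack(">H", value & 0xFFFF)
```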

In this particular case, Miguel had no stake in the endianness debate and agreed to switch his patch to use little endian. In fact, the developers ultimately went with a hybrid solution from Matthew Wilcox that was more little endian-ish than big endian-ish and also packed more storage into a smaller space. So Linus preferred it over straight little endian.

To me, those details are fun because they show how much fuss and bother the developers take – especially Linus – to make the kernel code as clean and sweet as possible. Sure, there are some ungodly messes in there and will be for the foreseeable future. But the developers really care about smoothing things out as much as possible. It's unusual in a world where a lot of software projects are just pure spaghetti.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
