Zack's Kernel News
Zack's Kernel News
Chronicler Zack Brown reports on avoiding bloat in the Kernel that does everything, and particularly odd occurrences of Stardust.
Avoiding Bloat in the Kernel That Does Everything
Sometimes prospective features are useful, and sometimes they're not. Sometimes they're useful, but only to a very specific set of users that somehow straddle the divide between first- and second-class citizens. These special users are the kernel developers themselves.
Steven Rostedt recently posted a patch that would generate a permanent file in the TraceFS filesystem. The file would identify the directory entries (dentries) and their reference counts, for dynamic file creation in the EventFS filesystem. Steven pointed to a recent debugging session where part of the debugging process involved creating such a file. He felt it would be useful for future debugging to have such a file available by default.
There followed a fascinating exchange between Linus Torvalds and Steven. Linus's take on the situation was that "this is neither a bug-fix, nor does it seem to make any sense at all in the main tree. This really feels like a 'temporary debug patch for tracing developers'."
Steven replied that it did seem to be generally useful, because "it can be used to see what's happening internally." He said he'd wrap the feature in an #ifdef
statement, so that developers would be able to use it and other similar resources in the future for easy access to filesystem internals.
But Linus reiterated that this was not a feature he wanted in the kernel. He said:
"Honestly, you copied the pattern from the /proc filesystem.
"The /proc filesystem is widely used and has never had this kind of random debugging code in mainline.
"Seriously, that eventfs_file thing is not worthy of this kind of special debug code.
"That debug code seems to be approaching the same order of size as all the code eventfs_file code itself is.
"There's a point where this kind of stuff just becomes ridiculous. At least wait until there's a *reason* to debug a simple linked list of objects.
"If you have a hard time figuring out what the eventfs entries are, maybe you should just have made 'iterate_shared' show them, and then you could use fancy tools like 'ls' to see what the heck is up in that directory?"
Steven replied that he hadn't actually copied the code from the /proc
filesystem, though he acknowledged there were similarities. He said, "I tried to look at how /proc does things and I couldn't really use it as easily, because proc uses its own set of 'proc_ops', and I had some different requirements."
But in terms of Linus's suggestion of a simpler way to see the EventFS entries, Steven replied, "I was more interested in what did not exist than what existed. I wanted to make sure that things were cleaned up properly. One of my tests that I used was to do a: find /sys/kernel/tracing/events, and then run my ring_buffer memory size stress test (that keeps increasing the size of the ring buffer to make sure it fails safely when it runs out of memory). Then I check to make sure all the unused dentries and inodes were reclaimed nicely, as they hang around until a reclaim is made."
However, Steven saw which way the wind was blowing and didn't intend to get blood on his sword over a potentially useful debugging patch. He asked, "Are you entirely against this file, or is it fine if it's just wrapped around an CONFIG_EVENTFS_DEBUG?"
Linus explained:
"I think [it's] extra code that we'd carry around – probably for much too long – with absolutely _zero_ indication that it's actually worth it.
"Not worth asking people about, but also not worth carrying around.
"You worry about bugs in it now, because it's new code. That's normal. That doesn't make your debug interface worth any kind of future.
"Keep it around as a private patch. Send it out to people if there are actual issues that might indicate this debug support would help. And if it has shown itself to be useful several times, at that point you have an argument for the code.
"As it is, right now I look at that code and I see extra BS that we'll carry around forever that helps *zero* users, and I find it very questionable whether it would help you.
"And if you really think that people need to know what the events exist in eventfs, then dammit, make 'readdir()' see them. Not some stupid specialty debug interface. That's what filesystems *have* readdir for."
But Linus replied to himself a couple of hours later, with a slightly different take. He said:
"Alternatively, if you have noticed that it's just a pain to not be able to see the data, instead of introducing this completely separate and illogical debug interface, just say 'ok, it was a mistake, let's go back to just keeping things in dentries since we can _see_ those'.
"Put another way: this is all self-inflicted damage, and you seem to argue for this debug interface purely on 'I can't see what's going on any more, the old model was really nice because you could *see* the events'.
"To me, if that's really a major issue, that just says 'ok, this eventfs abstraction was mis-designed, and hid data that the main developer actually wants'.
"We don't add new debug interfaces just because you screwed up the design. Fix it."
Steven remarked with a wry smile, "The entire tracing infrastructure was created because of the 'I can't see what's going on' ;-) Not everyone is as skilled with printk as you."
He also explained the historical reasoning behind the current design, saying, "The old eventfs model was too costly because of the memory footprint, which was the entire objective of this code patch. The BPF [Berkeley Packet Filter] folks told me they looked into use a tracing instance but said it added too much memory overhead to do so. That's when I noticed that the copy of the entire events directory that an instance has was the culprit, and needed to be fixed."
So Steven felt the "design" Linus had complained about was correct and didn't need to be "fixed." But he added, "I get your point. I will agree that this interface will likely be mostly useful for the first year or two after the new code is added. But after a few years, we could delete it too." And in a subsequent email, he also said, "I'll keep the code around locally, and if vfs ever changes and breaks this code where this file helps in solving it, I'll then do another pull request to put this file upstream ;-)."
And the thread ended there.
This was a short debate and probably fairly low-cost, because it didn't represent a huge amount of effort on Steven's part – he simply packaged up some debugging code that had recently proven useful. So the rejection from Linus didn't cost Steven very much. But it's very interesting to me personally the way Linus balances the needs of developers against the needs of the rest of us. The Linux kernel project is completely dependent on the contributions of developers like Steven, while the rest of us – aside from possibly submitting a bug report once in awhile – are simply the beneficiaries. But as far as Linus is concerned, Steven's bit of debugging code, benefitting only developers, had no place in the kernel, even as a relatively temporary aid until the feature it helped had stabilized. It's a fascinating balancing act on Linus's part, intended to keep the Linux kernel – an operating system that supports virtually every piece of computer hardware on the planet – from becoming bloated with extra code that might make it more difficult to maintain.
Particularly Odd Occurrences of Stardust
Recently, a Spectre variant 1 (V1) vulnerability may or may not have appeared in the Linux kernel. Spectre V1 is a bizarre vulnerability that takes advantage of CPU optimizations that make a reasonable guess at the result of conditionals, so it can begin to execute code along the path that's most likely to be taken after the conditional is performed. If it guesses right, it keeps those calculations; otherwise, it abandons them and starts again along the proper path. And because its guess is generally pretty good, the CPU tends to save time this way and increases overall performance.
The problem is that for those wrong guesses, the unneeded calculations aren't really abandoned at all – they still leave traces of data behind them (e.g., data such as passwords), which malicious programs can read and use.
When Spectre V1 was discovered, the Linux developers patched the kernel to prevent those data traces from lingering or being created in the first place. However, to maintain security, it's important that new kernel features and other patches avoid re-exposing those things.
Luis Gerhorst recently identified a patch that had previously gone into the Linux kernel as potentially re-exposing the Spectre V1 vulnerability under certain circumstances. The patch had allowed the kernel to compare the pointers used to access packets of data sent across a network – and specifically to allow the size of the data packets to be variable. According to Luis, it was the variability of the packet size that let Spectre V1 rear its head again.
If the packets had a fixed size, then the kernel could simply check the bounds. But with the variable packet size, hostile code could load more data beyond the packet itself, which would then be exposed when the kernel ran its comparison and the CPU optimized that conditional.
But it's not as clear as all that!
Alexei Starovoitov looked over Luis's argument and concluded that, in fact, there was no way for an attacker to actually get access to useful data in this particular situation. The attacker, Alexei said, could indeed expose sensitive data. However, because they would not have control over the various pointers involved, they would not be able to actually read that data in such a way as to know what data they were reading. Exposing the data was not enough! As Alexei put it, "the attack cannot be repeated for the same location. The attacker can read one bit 8 times in a row and all of them will be from different locations in the memory. Same as reading 8 random bits from 8 random locations. Hence I don't think this revert is necessary. I don't believe you can craft an actual exploit."
Daniel Borkmann agreed with Alexei, but he felt there could be additional security vulnerabilities to take into account. Specifically beyond the end of a given networking data packet, the kernel stored a data structure that contained memory addresses used by the kernel. And although Daniel agreed with Alexei that the attacker would not be able to access the data that worried Luis, he felt that the attacker would indeed have access to parts of those memory addresses.
The reason you want to keep Linux kernel memory addresses out of the hands of an attacker is because the addresses let the attacker make guesses about the overall layout of the kernel in system memory. The kernel relies on Kernel Address Space Layout Randomization (KASLR) to prevent such access for this reason. This feature loads the kernel into a random place in system memory, specifically to prevent attackers from knowing where a given part of the system is located, in order to target that part for an attack. Daniel's point was that by exposing even a portion of those kernel addresses, the kernel would allow the attacker to mitigate the effect of KASLR protections. So the vulnerability wouldn't give the attacker direct access to sensitive kernel data like passwords, but it would help the attacker identify other potential exploits that they might attempt.
Alexei, however, was still not convinced. He felt that the attacker would still not be able to identify the data it was accessing. Just as the attacker couldn't access passwords, the attacker would not be able to access those kernel addresses.
However, Luis was not convinced by Alexei being unconvinced. He felt that he had identified aspects of Spectre V1's basic vulnerability – things that could indeed be exploited. Also, in terms of Alexei's response to Daniel's specific case, Luis countered, "It is true that this is not easily possible using the method most exploits use, at least to my knowledge (i.e., accessing the same address from another core). However, it is still possible to evict the cacheline with skb->data/data_end from the cache in between the loads. [...] For a CPU with 64KiB of per-core L1 cache all 64-byte cachelines can be evicted by iterating over a 64KiB array using 64-byte increments, that's only 1k iterations."
Luis also posted some actual assembly code that he felt would leak data in this case.
Alexei acknowledged that "I have to agree that the above approach sounds plausible in theory and I've never seen anyone propose to mispredict a branch this way." But he added that this probably "means that no known speculation attack was crafted. I suspect that's a strong sign that the above approach is indeed a theory and it doesn't work in practice."
Alexei concluded sternly, "So I will insist on seeing a full working exploit before doing anything else here. It's good to discuss this cpu speculation concerns, but we have to stay practical. Even removing bpf from the picture there is so much code in the network core that checks packet boundaries. One can find plenty of cases of 'read past skb->end' under speculation and I'm arguing none of them are exploitable."
Luis posted code that leaked some otherwise inaccessible data via Spectre V1. But he also acknowledged to Alexei, "However, you are right in that there does not appear to be anything extremely useful behind skb->data_end, destructor_arg is NULL in my setup but I have also not found any code putting a static pointer there. Therefore if it stays like this and we are sure the allocator introduces sufficient randomness to make OOB reads useless, the original patches can stay. If you decide to do this I will be happy to craft a patch that documents that the respective structs should be considered 'public' under Spectre v1 to make sure nobody puts anything sensitive there."
The discussion ended there.
It's still unclear whether an actual useful exploit for either Luis's or Daniel's cases exists. But it's also true that Alexei's approach to this problem seems to follow Linus Torvalds's general principle that security fixes must address actual exploits, rather than people simply implementing speculative protections that might not actually be needed.
Security is an inherently nightmarish topic in software development, in which strange dreamscapes continually seem to turn the simplest truths on their heads. Whatever the most obvious assumption might be, it also might be exactly where a sudden vulnerability will be revealed. Many strange features and constraints in the Linux kernel boil down to the need to avoid particular vulnerabilities. And the answer to many of the oddest questions in kernel development is often, simply, security.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
AlmaLinux 10.0 Beta Released
The AlmaLinux OS Foundation has announced the availability of AlmaLinux 10.0 Beta ("Purple Lion") for all supported devices with significant changes.
-
Gnome 47.2 Now Available
Gnome 47.2 is now available for general use but don't expect much in the way of newness, as this is all about improvements and bug fixes.
-
Latest Cinnamon Desktop Releases with a Bold New Look
Just in time for the holidays, the developer of the Cinnamon desktop has shipped a new release to help spice up your eggnog with new features and a new look.
-
Armbian 24.11 Released with Expanded Hardware Support
If you've been waiting for Armbian to support OrangePi 5 Max and Radxa ROCK 5B+, the wait is over.
-
SUSE Renames Several Products for Better Name Recognition
SUSE has been a very powerful player in the European market, but it knows it must branch out to gain serious traction. Will a name change do the trick?
-
ESET Discovers New Linux Malware
WolfsBane is an all-in-one malware that has hit the Linux operating system and includes a dropper, a launcher, and a backdoor.
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.