Zack's Kernel News
Supporting Larger Address Space, Pt 2
Previously, I described Liang Li's effort to extend EPT support to five levels in order to accommodate 52 to 56 bits of hardware address space. Coming at this issue from a different angle, Kirill A. Shutemov posted patches to allow userspace to successfully access 56-bit memory addresses. He said, "On x86, five-level paging enables 56-bit userspace virtual address space. Not all user space is ready to handle wide addresses. It's known that at least some JIT compilers use high bit in pointers to encode their information. It collides with valid pointers with five-level paging and leads to crashes."
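To make the collision concrete, here is a minimal sketch of the kind of pointer tagging Kirill describes. The 48-bit pointer width, the 16-bit tag, the helper names, and the sample addresses are all assumptions chosen for illustration; they are not taken from any particular JIT compiler.

```python
# Hypothetical tagging scheme: the JIT assumes every pointer fits in the
# low 48 bits of a 64-bit word and stores its own tag in the top 16 bits.
PTR_BITS = 48
PTR_MASK = (1 << PTR_BITS) - 1


def tag_pointer(addr, tag):
    """Pack a 16-bit tag above the pointer, silently dropping high address bits."""
    return ((tag & 0xFFFF) << PTR_BITS) | (addr & PTR_MASK)


def untag_pointer(word):
    """Split a tagged word back into (address, tag)."""
    return word & PTR_MASK, word >> PTR_BITS


# An address below 2^47 (illustrative value).
low_addr = 0x7FFF_DEAD_B000
# An address that only fits in a 56-bit userspace (illustrative value).
high_addr = 0xFF_FFFF_DEAD_B000

# The low address survives a tag/untag round trip; the high one does not,
# because its upper bits overlap the space the tag occupies.
print(untag_pointer(tag_pointer(low_addr, 5)) == (low_addr, 5))
print(untag_pointer(tag_pointer(high_addr, 5)) == (high_addr, 5))
```

A program written this way keeps working only as long as the kernel never hands it an address above the range it assumed, which is why some opt-in mechanism is needed.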
Kirill wanted to accomplish this by setting an rlimit, which places limits on resource usage. But Andy Lutomirski felt that an ELF flag (aka note) would be more appropriate, because it would allow the setuid bit to behave better, or else a personality, which would allow flags to control the way in which a particular program could run. Kirill replied that he'd intended to implement ELF flags on top of the rlimit implementation, but admitted to not fully grasping the setuid issue. Andy explained, "If a setuid program depends on the lower limit, then a malicious program shouldn't be able to cause it to run with the higher limit. The personality code should already get this case right because personalities are reset when setuid happens."
Meanwhile, Arnd Bergmann liked Kirill's patch, saying also that, "This seems to nicely address the same problem on arm64, which has run into the same issue due to the various page table formats that can currently be chosen at compile time."
Resolving these issues led to a fairly technical consideration of the ins and outs of each implementation: what might go wrong with each, and what special capabilities each might offer, such as better or worse support for database users.
After quite a bit of discussion, Dave Hansen remarked to Kirill, "The farther we get into this, the more and more I think using an rlimit is a horrible idea. Its semantics aren't a great match, and you seem to be resistant to making *this* rlimit differ from the others when there's an entirely need to do so. We're already being bitten by 'legacy' rlimit. IOW, being consistent with *other* rlimit behavior buys us nothing, only complexity."
And elsewhere, Andy remarked at one point:
"Taking a step back, I think it would be fantastic if we could find a way to make this work without any inheritable settings at all. Perhaps we could have a per-mm value that is initialized to 2^47-1 on execve() and can be raised by ELF note or by prctl()? Getting it right for 32-bit would require a bit of thought. The ELF note would make a high stack possible and, without the ELF note, we'd get a low stack but high mmap(). Then the messy bits can be glibc's problem and a toolchain problem as it should be, given that the only reason we need a limit at all is because of messy userspace code.
Sure, the low stack prevents the *whole* address space from being used in one big block for databases, but 2^57 to 2^47 ought to be good enough.
I'm not 100% sure this is workable, but, if it is, it makes everyone's life easier. There's no need to muck around with setarch(1) or similar hacks."
Linus Torvalds stepped in at this point, saying definitely that "this is the right model. No inheritable settings, no suid issues, no worries. Make people who want the large address space (and there aren't going to be a lot of them) just mark their binaries at compile time."
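The model Andy and Linus favor can be sketched as a toy simulation. Nothing here is kernel code: the Task class, its method names, and the boolean standing in for the ELF note are hypothetical, and the 56-bit ceiling is assumed from Kirill's figure. The point is only that a value reset on every execve() cannot be smuggled into a setuid program by its parent.

```python
DEFAULT_LIMIT = (1 << 47) - 1   # the default Andy suggests
WIDE_LIMIT = (1 << 56) - 1      # assumed ceiling, from Kirill's 56-bit figure


class Task:
    """Toy stand-in for a process; addr_limit models the per-mm value."""

    def __init__(self, addr_limit=DEFAULT_LIMIT):
        self.addr_limit = addr_limit

    def fork(self):
        # A child starts as a copy of its parent's address space.
        return Task(self.addr_limit)

    def execve(self, has_wide_note=False):
        # Every exec starts from the default; only the new binary's own
        # ELF note (modeled here as a boolean) can raise it.
        self.addr_limit = WIDE_LIMIT if has_wide_note else DEFAULT_LIMIT
        return self

    def raise_limit(self):
        # Stands in for the prctl() call Andy mentions; no real option is implied.
        self.addr_limit = WIDE_LIMIT


parent = Task()
parent.raise_limit()              # the parent opts in for itself
child = parent.fork().execve()    # legacy (possibly setuid) binary, no note
print(child.addr_limit == DEFAULT_LIMIT)
```

With an inherited setting, the last line would instead depend on whatever the parent chose, which is exactly the setuid problem Andy raised.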
But Kirill didn't want to give up on the inheritance idea just yet. He argued, "One thing that inheritance give[s] us is ability to change available address space from outside of binary. Both ELF note and prctl() don't really work here. Running a legacy binary with a full address space is a valuable option – as is limiting address space for a binary with ELF note or prctl() in case of breakage in a field. Sure, we can use personality(2) or invent [an]other interface for this. But to me, rlimit covers both normal and emergency use cases relatively well."
But Linus didn't agree. He felt that inheritance was "simply not valuable enough to worry about. Especially when there is a fairly trivial wrapper approach: Just make a full-address-space wrapper that acts as a binary loader (think 'specialized ld.so'). Sure, the wrapper may be fairly trivial but not necessarily pleasant: you have to parse ELF sections etc. and basically load the binary by hand. But there are libraries for that, and loading an ELF executable isn't rocket surgery; it's just possibly tedious."
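To give a feel for the tedium Linus mentions, here is a rough sketch of just the first step such a wrapper would take: reading the ELF64 header and walking the program headers to see whether the binary carries a note segment at all. It handles only little-endian ELF64, the function names are made up for this example, and it does not load anything; mapping the segments by hand is the larger part of the job.

```python
import struct

# Layouts follow the standard ELF64 file header and program header.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")   # 64 bytes
ELF64_PHDR = struct.Struct("<IIQQQQQQ")             # 56 bytes
PT_NOTE = 4


def program_header_types(image):
    """Return the p_type of each program header in a little-endian ELF64 image."""
    fields = ELF64_HEADER.unpack_from(image, 0)
    ident, phoff, phentsize, phnum = fields[0], fields[5], fields[9], fields[10]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("only little-endian ELF64 is handled in this sketch")
    return [
        ELF64_PHDR.unpack_from(image, phoff + i * phentsize)[0]
        for i in range(phnum)
    ]


def has_note_segment(image):
    """True if any program header is a PT_NOTE segment."""
    return PT_NOTE in program_header_types(image)
```

A wrapper would read the target binary into a bytes object, call has_note_segment() on it, and then still have to find and interpret the specific note it cares about before mapping the program's segments itself.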
Meanwhile Andi Kleen said he favored Kirill's inheritance-based approach. He felt Linus was minimizing the complexity of parsing ELF sections. Andi said, "Compile time is inconvenient if you want to test some existing random binary. I tried to write a tool that patched ELF notes into binaries some time ago for another project, but it ran into difficulties and didn't work everywhere. An inheritance scheme is much nicer for such use cases."
The discussion petered out at this point, though I'd expect Linus to have the last word eventually.
The difficulty of making a change of this kind – adjusting the way fundamental resources like RAM are handled by the system – is extreme. There are all sorts of caveats and corner cases, semantics that change in odd ways, speedups and slowdowns that occur in odd places, and also security concerns. It wouldn't surprise me to see this issue kicked around by the developers for months before anything resembling an acceptable patch can emerge.
ARM SPE Support
Will Deacon said that the ARMv8.2 architecture "introduces the Statistical Profiling Extension (SPE). SPE provides a way to configure and collect profiling samples from the CPU in the form of a trace buffer, which can be mapped directly into userspace using the perf AUX buffer infrastructure."
He posted a patch to add a new perf driver to support ARM SPE. Peter Zijlstra asked for a high-level explanation of SPE, and Will replied:
"Sure, I can try, although there is no public documentation yet, so it's a bit fiddly.
SPE can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e., a dynamic instruction trace) or CPU-specific uops, and the choice is fixed statically in the hardware and advertised to userspace via caps/. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation to avoid falling into patterns where you continuously profile the same instruction in a hot loop.
After each operation is decoded, the interval counter is decremented. When it hits zero, an operation is chosen for profiling and tracked within the pipeline until it retires. Along the way, information such as TLB lookups, cache misses, time spent to issue, etc. is captured in the form of a sample. The sample is then filtered according to certain criteria (e.g., load latency) that can be specified in the event config (described under format/), and, if the sample satisfies the filter, it is written out to memory as a record, otherwise it is discarded. Only one operation can be sampled at a time.
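(As an editorial aside, the sampling loop Will describes can be approximated in a few lines. This is a toy simulation, not driver code: the function name, the dictionary used for an operation, and the single latency filter are assumptions made for illustration. Each operation is treated as retiring immediately, so the one-sample-at-a-time rule holds trivially, and the interval is assumed to be at least 1.)

```python
import random


def spe_sample(ops, interval, jitter=0, min_latency=0, seed=0):
    """Return (index, op) pairs that would be written out as records."""
    rng = random.Random(seed)

    def reload():
        # Nominal interval plus an optional random perturbation.
        return interval + (rng.randint(0, jitter) if jitter else 0)

    records = []
    counter = reload()
    for index, op in enumerate(ops):
        counter -= 1                 # one decrement per decoded operation
        if counter > 0:
            continue
        counter = reload()           # pick this operation and restart the count
        if op["latency"] >= min_latency:
            records.append((index, op))   # passes the filter: keep the record
        # otherwise the sample is discarded
    return records


# Twelve made-up operations whose latency equals their position.
ops = [{"latency": i} for i in range(12)]
print([i for i, _ in spe_sample(ops, interval=4)])
print([i for i, _ in spe_sample(ops, interval=4, min_latency=8)])
```

(Passing a nonzero jitter varies the gap between samples, which is the perturbation that keeps a hot loop from being sampled at the same instruction every time.)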
The in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up. The PMU driver handles these interrupts to give the appearance of a ring buffer, as expected by the AUX code.
The in-memory trace-like format is self-describing (though not parsable in reverse) and written as a series of records, with each record corresponding to a sample and consisting of a sequence of packets. These packets are defined by the architecture, although some have CPU-specific fields for recording information specific to the microarchitecture.
As a simple example, a record generated for a branch instruction may consist of the packets shown in Table 1.
Table 1
Branch Instruction Packets
0 (Address) | Virtual PC of the branch instruction
1 (Type) | Conditional direct branch
2 (Counter) | Number of cycles taken from Dispatch to Issue
3 (Address) | Virtual branch target + condition flags
4 (Counter) | Number of cycles taken from Dispatch to Complete
5 (Events) | Mispredicted as not-taken
6 (END) | End of record
You can also toggle things like timestamp packets in each record.
Since SPE is an optional extension to the architecture, I'm sure there will be big.LITTLE systems where only one of the clusters has SPE support, so the driver is slightly complicated by handling that."
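The record structure in Table 1 suggests how a consumer would walk the buffer: read packets forward and close a record at each END packet. The sketch below assumes packets have already been decoded into (type, payload) pairs with string names; the real packets are binary, and their encoding is not described here, so the function and the sample data are purely illustrative.

```python
def split_records(packets):
    """Group a forward stream of (type, payload) packets into records."""
    records, current = [], []
    for ptype, payload in packets:
        if ptype == "END":
            records.append(current)
            current = []
        else:
            current.append((ptype, payload))
    # Packets after the last END belong to an incomplete record; drop them.
    return records


# The branch record from Table 1, with descriptions standing in for payloads.
branch_record = [
    ("Address", "virtual PC of the branch instruction"),
    ("Type", "conditional direct branch"),
    ("Counter", "cycles from Dispatch to Issue"),
    ("Address", "virtual branch target + condition flags"),
    ("Counter", "cycles from Dispatch to Complete"),
    ("Events", "mispredicted as not-taken"),
    ("END", None),
]

records = split_records(branch_record * 2)
print(len(records), [ptype for ptype, _ in records[0]])
```

Because a record is only delimited by the END packet that follows it, the stream has to be read from the start, which fits Will's remark that the format is not parsable in reverse.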
As you might imagine, a technical discussion ensued. With no public documentation, I would expect the kernel people to be reluctant to accept code into the actual kernel – though maybe they'd accept it in the staging branch for now. In either case, folks had questions about the implementation details as well. It's clear that this feature is in a very early stage of kernel support. But I wanted to include it here because of Will's very cool description of something that's barely hit the public eye.
Zack Brown
The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.