Zack's Kernel News

Supporting Larger Address Space, Pt 2

Previously, I described Liang Li's effort to extend EPT support to five levels in order to accommodate 52 to 56 bits of hardware address space. Coming at this issue from a different angle, Kirill A. Shutemov posted patches to allow userspace to successfully access 56-bit memory addresses. He said, "On x86, five-level paging enables 56-bit userspace virtual address space. Not all user space is ready to handle wide addresses. It's known that at least some JIT compilers use high bit in pointers to encode their information. It collides with valid pointers with five-level paging and leads to crashes."
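To make the collision concrete, here is a minimal sketch of the kind of pointer tagging Kirill describes. The layout (a 16-bit tag above a 48-bit pointer) and the function names are assumptions chosen for illustration, not taken from any particular JIT compiler. Addresses are modeled as plain 64-bit integers so that nothing is ever dereferenced.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tagging scheme: the top 16 bits of a 64-bit word hold a
 * type tag, and the low 48 bits hold the address. */
#define TAG_SHIFT 48
#define PTR_MASK  ((UINT64_C(1) << TAG_SHIFT) - 1)

/* Pack an address and a tag into one word; address bits 48 and up are lost. */
uint64_t tag_pointer(uint64_t addr, uint16_t tag)
{
    return ((uint64_t)tag << TAG_SHIFT) | (addr & PTR_MASK);
}

/* Recover the address by masking off the tag bits. */
uint64_t untag_pointer(uint64_t word)
{
    return word & PTR_MASK;
}

/* True when an address survives a tag/untag round trip unchanged. */
bool roundtrips(uint64_t addr)
{
    return untag_pointer(tag_pointer(addr, 0x7ff8)) == addr;
}
```

Under these assumptions, an address that fits in 47 bits round-trips cleanly, while an address with bit 55 set comes back as a different value, which is the crash scenario Kirill warns about.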

Kirill wanted to accomplish this by setting an rlimit, which places limits on resource usage. But Andy Lutomirski felt it would be more appropriate to use an ELF flag (aka note), which would allow the setuid bit to behave better, or a personality, which would allow flags to control the way in which a particular program could run. Kirill replied that he'd intended to implement ELF flags on top of the rlimit implementation, but admitted not fully grasping the setuid issue. Andy explained, "If a setuid program depends on the lower limit, then a malicious program shouldn't be able to cause it to run with the higher limit. The personality code should already get this case right because personalities are reset when setuid happens."
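The inheritance that worries Andy is ordinary rlimit behavior, and it can be observed with existing limits. The sketch below uses RLIMIT_CORE only because lowering it is harmless; the address-space rlimit from Kirill's patches is not something this code can name, and the helper function is hypothetical. Note that the call lowers the caller's own soft core-dump limit as a side effect.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if a child process sees the soft limit set by its parent,
 * 0 if it does not, and -1 on error. */
int child_inherits_rlimit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = 0;                /* lowering a soft limit needs no privilege */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: report through the exit status whether the limit carried over. */
        struct rlimit seen;
        int ok = getrlimit(RLIMIT_CORE, &seen) == 0 && seen.rlim_cur == 0;
        _exit(ok ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Because the child simply receives whatever its parent set, a limit of this kind follows a process into anything it later runs, which is why Andy prefers a mechanism that is reset when setuid happens.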

Meanwhile, Arnd Bergmann liked Kirill's patch, adding, "This seems to nicely address the same problem on arm64, which has run into the same issue due to the various page table formats that can currently be chosen at compile time."

Resolving these issues led to a fairly technical discussion of the ins and outs of each approach: what might go wrong with each, and what special capabilities each might offer, such as better or worse support for database users.

After quite a bit of discussion, Dave Hansen remarked to Kirill, "The farther we get into this, the more and more I think using an rlimit is a horrible idea. Its semantics aren't a great match, and you seem to be resistant to making *this* rlimit differ from the others when there's an entirely need to do so. We're already being bitten by 'legacy' rlimit. IOW, being consistent with *other* rlimit behavior buys us nothing, only complexity."

Elsewhere, Andy remarked:

"Taking a step back, I think it would be fantastic if we could find a way to make this work without any inheritable settings at all. Perhaps we could have a per-mm value that is initialized to 2^47-1 on execve() and can be raised by ELF note or by prctl()? Getting it right for 32-bit would require a bit of thought. The ELF note would make a high stack possible and, without the ELF note, we'd get a low stack but high mmap(). Then the messy bits can be glibc's problem and a toolchain problem as it should be, given that the only reason we need a limit at all is because of messy userspace code.

Sure, the low stack prevents the *whole* address space from being used in one big block for databases, but 2^57 to 2^47 ought to be good enough.

I'm not 100% sure this is workable, but, if it is, it makes everyone's life easier. There's no need to muck around with setarch(1) or similar hacks."
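Andy's proposal can be modeled in a few lines. This is a toy userspace model, not kernel code: the struct, the function names, and the choice of 2^56-1 as the raised limit (taken from Kirill's 56-bit figure) are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define LEGACY_LIMIT ((UINT64_C(1) << 47) - 1)  /* default named in the quote */
#define WIDE_LIMIT   ((UINT64_C(1) << 56) - 1)  /* assumed: 56-bit userspace */

/* Stand-in for the per-mm value; not the kernel's struct mm_struct. */
struct mm_model {
    uint64_t addr_limit;
};

/* Every exec starts from the binary's own marking; nothing is inherited. */
void model_exec(struct mm_model *mm, bool binary_has_wide_note)
{
    mm->addr_limit = binary_has_wide_note ? WIDE_LIMIT : LEGACY_LIMIT;
}

/* Stand-in for the prctl() opt-in: raises the limit for this mm only. */
void model_raise(struct mm_model *mm)
{
    mm->addr_limit = WIDE_LIMIT;
}

/* Would a mapping of len bytes at addr fit under the current limit? */
bool model_may_map(const struct mm_model *mm, uint64_t addr, uint64_t len)
{
    return len != 0 && addr <= mm->addr_limit &&
           len - 1 <= mm->addr_limit - addr;
}
```

In this model, a process that raises its own limit and then executes an unmarked binary is back at the 47-bit default, so there is no inherited state for a setuid program to trip over.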

Linus Torvalds stepped in at this point, saying definitively that "this is the right model. No inheritable settings, no suid issues, no worries. Make people who want the large address space (and there aren't going to be a lot of them) just mark their binaries at compile time."

But Kirill didn't want to give up on the inheritance idea just yet. He argued, "One thing that inheritance give[s] us is ability to change available address space from outside of binary. Both ELF note and prctl() don't really work here. Running a legacy binary with a full address space is a valuable option – as is limiting address space for a binary with ELF note or prctl() in case of breakage in a field. Sure, we can use personality(2) or invent [an]other interface for this. But to me, rlimit covers both normal and emergency use cases relatively well."

But Linus didn't agree. He felt that inheritance was "simply not valuable enough to worry about. Especially when there is a fairly trivial wrapper approach: Just make a full-address-space wrapper that acts as a binary loader (think 'specialized'). Sure, the wrapper may be fairly trivial but not necessarily pleasant: you have to parse ELF sections etc. and basically load the binary by hand. But there are libraries for that, and loading an ELF executable isn't rocket surgery; it's just possibly tedious."
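For a sense of what the first step of such a wrapper involves, here is a rough sketch that validates a 64-bit ELF image in memory and counts its PT_NOTE program headers, which is where a loader would look for notes. It relies on the glibc <elf.h> definitions and assumes the image uses the host's byte order; the function name is hypothetical, and a real loader would have to do far more (map segments, set up the stack, handle the interpreter).

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Counts PT_NOTE program headers in a 64-bit ELF image held in memory.
 * Returns -1 if the buffer does not look like a well-formed ELF64 image. */
int count_note_segments(const unsigned char *image, size_t size)
{
    Elf64_Ehdr eh;

    if (size < sizeof(eh))
        return -1;
    memcpy(&eh, image, sizeof(eh));             /* avoid unaligned access */
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;
    if (eh.e_phnum != 0 && eh.e_phentsize != sizeof(Elf64_Phdr))
        return -1;
    /* Every program header must lie inside the buffer. */
    if (eh.e_phoff > size ||
        (size - eh.e_phoff) / sizeof(Elf64_Phdr) < eh.e_phnum)
        return -1;

    int notes = 0;
    for (size_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        memcpy(&ph, image + eh.e_phoff + i * sizeof(ph), sizeof(ph));
        if (ph.p_type == PT_NOTE)
            notes++;
    }
    return notes;
}
```

Even this small piece needs several bounds checks, which gives some weight to both sides: the parsing is not hard, but it is tedious and easy to get wrong.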

Meanwhile Andi Kleen said he favored Kirill's inheritance-based approach. He felt Linus was minimizing the complexity of parsing ELF sections. Andi said, "Compile time is inconvenient if you want to test some existing random binary. I tried to write a tool that patched ELF notes into binaries some time ago for another project, but it ran into difficulties and didn't work everywhere. An inheritance scheme is much nicer for such use cases."

The discussion petered out at this point, though I'd expect Linus to have the last word eventually.

The difficulties of making a change of this kind, adjusting the way fundamental resources like RAM are handled by the system, are extreme. There are all sorts of caveats and corner cases, semantics that change in odd ways, speedups and slowdowns that occur in odd places, and also security concerns. It wouldn't surprise me to see this issue kicked around by the developers for months before anything resembling an acceptable patch can emerge.

ARM SPE Support

Will Deacon said that the ARMv8.2 architecture "introduces the Statistical Profiling Extension (SPE). SPE provides a way to configure and collect profiling samples from the CPU in the form of a trace buffer, which can be mapped directly into userspace using the perf AUX buffer infrastructure."

He posted a patch to add a new perf driver to support ARM SPE. Peter Zijlstra asked for a high-level explanation of SPE, and Will replied:

"Sure, I can try, although there is no public documentation yet, so it's a bit fiddly.

SPE can be used to profile a population of operations in the CPU pipeline after instruction decode. These are either architected instructions (i.e., a dynamic instruction trace) or CPU-specific uops, and the choice is fixed statically in the hardware and advertised to userspace via caps/. Sampling is controlled using a sampling interval, similar to a regular PMU counter, but also with an optional random perturbation to avoid falling into patterns where you continuously profile the same instruction in a hot loop.

After each operation is decoded, the interval counter is decremented. When it hits zero, an operation is chosen for profiling and tracked within the pipeline until it retires. Along the way, information such as TLB lookups, cache misses, time spent to issue, etc. is captured in the form of a sample. The sample is then filtered according to certain criteria (e.g., load latency) that can be specified in the event config (described under format/), and, if the sample satisfies the filter, it is written out to memory as a record, otherwise it is discarded. Only one operation can be sampled at a time.

The in-memory buffer is linear and virtually addressed, raising an interrupt when it fills up. The PMU driver handles these interrupts to give the appearance of a ring buffer, as expected by the AUX code.

The in-memory trace-like format is self-describing (though not parsable in reverse) and written as a series of records, with each record corresponding to a sample and consisting of a sequence of packets. These packets are defined by the architecture, although some have CPU-specific fields for recording information specific to the microarchitecture.

As a simple example, a record generated for a branch instruction may consist of the packets shown in Table 1.

Table 1: Branch Instruction Packets

Packet         Contents
0 (Address)    Virtual PC of the branch instruction
1 (Type)       Conditional direct branch
2 (Counter)    Number of cycles taken from Dispatch to Issue
3 (Address)    Virtual branch target + condition flags
4 (Counter)    Number of cycles taken from Dispatch to Complete
5 (Events)     Mispredicted as not-taken
6 (END)        End of record

You can also toggle things like timestamp packets in each record.

Since SPE is an optional extension to the architecture, I'm sure there will be big.LITTLE systems where only one of the clusters has SPE support, so the driver is slightly complicated by handling that."
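The sampling flow Will describes (decrement an interval counter per decoded operation, sample when it reaches zero, then filter before writing a record) can be mimicked with a toy software model. Everything here is an assumption for illustration: the struct, the function names, and the single latency filter are invented, the random perturbation is left out, and real SPE also tracks the sampled operation through the pipeline until it retires.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of the sampling flow; not the hardware's or driver's state. */
struct spe_model {
    uint32_t interval;     /* reload value for the interval counter */
    uint32_t counter;      /* decremented once per decoded operation */
    uint32_t min_latency;  /* filter: keep samples at least this slow */
    size_t   written;      /* samples written out as records */
    size_t   discarded;    /* samples rejected by the filter */
};

void spe_model_init(struct spe_model *m, uint32_t interval,
                    uint32_t min_latency)
{
    m->interval = interval ? interval : 1;  /* guard against a zero interval */
    m->counter = m->interval;
    m->min_latency = min_latency;
    m->written = 0;
    m->discarded = 0;
}

/* Feed one decoded operation, with the latency it would show if sampled. */
void spe_model_decode(struct spe_model *m, uint32_t latency)
{
    if (--m->counter != 0)
        return;                     /* not selected for profiling */
    m->counter = m->interval;       /* reload; real hardware may add jitter */
    if (latency >= m->min_latency)
        m->written++;               /* sample passes the filter */
    else
        m->discarded++;             /* sample is dropped */
}
```

With an interval of 4 and a latency filter of 10, for example, only every fourth operation is considered at all, and of those only the slow ones end up in the buffer, which is how the filter keeps the trace small.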

As you might imagine, a technical discussion ensued. With no public documentation, I would expect the kernel people to be reluctant to accept code into the actual kernel, though maybe they'd accept it in the staging tree for now. In either case, folks had questions about the implementation details as well. It's clear that this feature is in a very early stage of kernel support. But I wanted to include it here because of Will's very cool description of something that's barely hit the public eye.

Zack Brown

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
