Zack's Kernel News

Article from Issue 197/2017

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Increasing Maximum Address Space in x86-64

Liang Li pointed out that the x86-64 architecture currently supports only 46-bit physical memory addresses, which limits all x86-64 systems to a maximum of 64 TiB. But he said that some hardware vendors, notably Intel, were going to start building support for 52-bit addresses into their hardware. Liang said that Linux's Extended Page Table (EPT) code supported only four-level page tables, which could go as far as 48-bit physical addresses, but no further. To reach 52-bit, he said, it would be necessary to extend EPT support to five levels.
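
The arithmetic behind those limits can be checked with a short sketch. It assumes the usual x86-64 layout of 4 KiB pages (12 offset bits) and 9 index bits per page-table level, neither of which is spelled out above, and the function names are made up for illustration.

```python
PAGE_OFFSET_BITS = 12  # assumption: 4 KiB pages
BITS_PER_LEVEL = 9     # assumption: 512 entries per page table


def addressable_bits(levels):
    """Address width that a page-table walk with `levels` levels can translate."""
    return PAGE_OFFSET_BITS + BITS_PER_LEVEL * levels


def tib(bits):
    """Size in TiB of an address space that is `bits` bits wide."""
    return 2 ** bits // 2 ** 40


def levels_needed(bits):
    """Smallest number of page-table levels that covers `bits` address bits."""
    levels = 1
    while addressable_bits(levels) < bits:
        levels += 1
    return levels


print(tib(46))              # 64 (TiB reachable with 46-bit addresses)
print(addressable_bits(4))  # 48
print(addressable_bits(5))  # 57
print(levels_needed(52))    # 5
```

Under those assumptions, four levels stop at 48 bits and a fifth level reaches 57, which is why 52-bit physical addresses call for five-level EPT.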

This is the same story told in 2004, except that instead of migrating to four levels, Liang now wants to migrate to five. Liang posted a patch to do this.

Valdis Kletnieks had no major objection, but noticed that Liang's patch mentioned support for 52 bits in some places and 56 bits in others. He asked if Liang was trying to get all the way to 56 bits in an effort to "future-proof" the patch. Liang replied that 56 bits were "the virtual address width which will be supported in the future CPU with 52-bits physical address width. 5 level EPT can support a maximum 57-bit physical address width, as long as the future CPU use[s] no more than 57-bits physical address width; no more work is needed."

Kirill A. Shutemov liked Liang's patch and has already incorporated it into his own code tree.

Paolo Bonzini felt that Liang's patch seemed straightforward, though he wanted to investigate more. But he was slightly concerned that the way Liang had designed the patch might make it difficult to migrate virtual machines from one piece of hardware to another. Specifically, Paolo said, the LA57 mode in modern Linux would allow writing to high addresses, while older systems without LA57 mode couldn't. This meant that, as Liang's code stood, if a virtual system did a single write using LA57 mode, it would permanently block that system from migrating to an older system that didn't support LA57.

Ordinarily, Paolo said, the hypervisor would trap any unavailable features, but in this case, it wasn't able to trap LA57 – at least the code to do it hadn't been written yet, partly because the required change might slow the system down.

Essentially, Paolo said this wasn't a problem with Liang's code at all, but was a hardware issue. He said, "I am seriously thinking of blacklisting FSGSBASE completely on LA57 machines until the above is fixed in hardware."

Liang replied that he'd already forwarded that particular issue to the hardware people. But to some extent, that's that. If there's a hardware glitch with no viable workaround, it could simply delay adoption of this particular patch indefinitely, barring some new design idea.

Securing ext4

Yi Zhang described an exploit that a hostile user could use to cause memory corruption on systems running ext4. He posted a patch to prevent that sequence of actions from causing the corruption.

Andreas Dilger had minor quibbles but generally liked the patch. Valdis Kletnieks also liked Yi's work but wanted to find a technique to identify which filesystem on a multi-filesystem setup had been attacked. So Yi posted a modification of his patch, to identify the location of the problem as well as fix it. But Darrick J. Wong came up with an alternative solution, which Andreas liked better. However, Ted Ts'o pointed out that Darrick's improvement made some calls that wouldn't be possible from that particular context; so Yi's code stood as it was.

Yi also claimed that similar exploits existed on other filesystems. He asked, "do you think we should put these detections on the VFS layer? Thus other filesystems would not need to do the same things, but the disadvantage is that we cannot call ext4_error to report ext4 inconsistency."

Ted replied that some filesystems didn't use inodes, and therefore wouldn't be susceptible to this kind of attack. But he added, "We'll have to see if Al and other filesystem developers are agreeable, but one thing that we could do is to do the detection in the VFS layer (which it is actually easier to do), and if they find an issue, they can just pass a report via a callback function found in the struct_operations structure. If there isn't such a function defined, or the function returns 0, the VFS could just do nothing; if it returns an error code, then that would get reflected back up to userspace, plus whatever other action the filesystem sees fit to do."
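
The callback arrangement Ted describes might look roughly like the following toy model. It is only a sketch of the control flow in his description, not kernel code: the class, the function names, and the choice of EIO as the error value are all hypothetical.

```python
import errno


class FsOperations:
    """Hypothetical per-filesystem operations table (illustrative only)."""

    def __init__(self, report_corruption=None):
        # Optional callback; None models "no such function defined".
        self.report_corruption = report_corruption


def vfs_check(ops, looks_corrupt):
    """Generic-layer detection: returns 0 or a negative errno for user space."""
    if not looks_corrupt:
        return 0
    if ops.report_corruption is None:
        return 0  # filesystem registered no callback: do nothing
    # A return of 0 also means "do nothing"; an error code is passed back up.
    return ops.report_corruption()


ext4_log = []


def ext4_report_corruption():
    """Hypothetical ext4-side callback: log the problem, then fail the call."""
    ext4_log.append("ext4: inconsistency detected")
    return -errno.EIO  # error value chosen only for illustration


print(vfs_check(FsOperations(), True))
print(vfs_check(FsOperations(ext4_report_corruption), True))
```

The point of the design is that detection lives in one place, while each filesystem keeps control over how it reports and reacts, which addresses Yi's concern about losing ext4's own error reporting.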

Al Viro replied with some technical questions that seemed to indicate he was open to the idea, but at this point, the discussion petered out or went offline.

But clearly, everyone would be happy to plug a memory corruption hole, so we can expect the fix to go in very quickly, in whatever form is most pleasing.

Stabilizing RCU

Paul McKenney pointed out that the read-copy-update (RCU) implementation in Linux had some problems during boot. RCU is a synchronization technique that lets readers access shared data structures without taking locks. An updater makes a copy of the data, modifies the copy, and publishes it in place of the original; the old version can be freed only after a grace period, the interval after which every reader that might still hold a reference to the old version is guaranteed to have finished. Code that needs to block until that point asks for a synchronous grace period.

Paul explained that RCU passes through three distinct phases during bootup. During the first phase, you've got a single CPU running, with preemption disabled, so nothing else can interrupt it. At that point, a synchronous grace period is simply a no-op, because no reader can be running concurrently.

In the second phase, preemption is enabled and the scheduler is running, but RCU has not yet got all its kthreads and workqueues up. If code requests a synchronous grace period during this phase, RCU isn't ready to provide one, and the system may crash.

In the third phase, the CPUs are running, RCU is fully running, and everything's fine.

As Paul explained, that second phase has begun to be a problem, as requests for synchronous grace periods have begun to creep in and mess things up. He posted some patches to rework the second phase so that it handles them properly.
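
The three phases can be summarized in a toy model like the one below. It only restates the behavior described above; the names are invented, and the `mid_boot_handled` flag merely stands in for Paul's patches without showing how they actually work.

```python
from enum import Enum


class BootPhase(Enum):
    EARLY = 1    # one CPU, preemption disabled
    MID = 2      # scheduler running, RCU kthreads and workqueues not yet up
    RUNNING = 3  # RCU fully initialized


def synchronous_grace_period(phase, mid_boot_handled=False):
    """Describe what a synchronous grace-period request does in each phase."""
    if phase is BootPhase.EARLY:
        return "no-op"
    if phase is BootPhase.MID:
        # Hypothetical flag: True models a kernel with the rework applied.
        return "handled" if mid_boot_handled else "may crash"
    return "normal grace period"


for phase in BootPhase:
    print(phase.name, synchronous_grace_period(phase))
```

Only the middle phase changes when the flag is set, which matches the narrow scope of the rework: the first and third phases already behaved correctly.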

Borislav Petkov liked the patch, though he had a few wording suggestions for the commit message and a couple of coding suggestions. He also tested the patch on a bunch of systems and confirmed that it seemed to work OK.

Paul posted a new patch, though regarding Borislav's coding suggestions, he said, "note that the code was buggy even before this commit, as it was subject to failure on real-time systems that forced all expedited grace periods to run as normal grace periods."

Rafael J. Wysocki wondered aloud if ACPI systems would need a separate fix for this whole issue, or if Paul's patch covered them as well. Borislav felt there was no need for a targeted ACPI patch, though he did say, "If you still think it would be better to have it regardless, you could pick it up (i.e., making ACPI more robust, yadda yadda). I dunno, though, perhaps it is only complicating the code unnecessarily and then can be safely ignored with a mental note for future freezes."

Lv Zheng also remarked, "ACPI fix is unnecessary as ACPI is just a user of the RCU APIs. And it's pointless to add special checks in the user side in order to use one of them."

Paul said that, if his RCU patch was delayed for any reason, the ACPI patch might be needed as a stopgap.

Boot time is one of those dark regions of the kernel, where few dare go and fewer still come out alive. If you ever want to feel fear, don't bother skydiving or bungee jumping. Just research what computers have to do in order to boot up successfully. Start with earlier history and work your way forward. The people who code in this area have the scars to go with every story. Paul's RCU patch was mercifully bloodless.
