Zack's Kernel News

Article from Issue 211/2018

Improving Netfilter Efficiency; Protecting Memory from Malicious Modification; and Speeding Up Workarounds for Intel Security Flaws.

Improving Netfilter Efficiency

Netfilter has some speed issues. Speed is always a focus of Linux development, but recent workarounds for widespread Intel hardware security flaws have resulted in significant slowdowns in the kernel. So lately, there's been even more incentive to improve speed wherever possible.

Netfilter is a generic kernel tool that allows system administrators to perform a wide array of operations on data packets moving through a network. However, as Imre Palik pointed out recently, netfilter was implemented with flexibility in mind, rather than efficiency. Even when a system performs no operations at all on network packets, simply hitting the netfilter hooks can slow things down a lot.

Imre posted a patch to address this issue. His idea was that if netfilter wasn't being used, then the kernel shouldn't hit its code at all. This would eliminate the slowdown. Of course, for systems that did use netfilter, the slowdown would remain. And this proved to be the big stumbling block for his patch.
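The trade-off can be illustrated with a toy model. The following Python sketch is not kernel code and does not reflect Imre's actual patch; the `HookPoint` class and all of its method names are hypothetical, invented here for illustration. It shows a dispatch point that skips the hook machinery entirely when nothing is registered, and it also shows why a system that does register hooks gains nothing from the shortcut.

```python
from typing import Callable, List

ACCEPT = "accept"
DROP = "drop"

Hook = Callable[[bytes], str]


class HookPoint:
    """Toy model of a packet hook point (hypothetical, not the netfilter API)."""

    def __init__(self) -> None:
        self.hooks: List[Hook] = []
        self.slow_path_calls = 0  # counts how often the hook machinery ran

    def register(self, hook: Hook) -> None:
        self.hooks.append(hook)

    def _run_hooks(self, packet: bytes) -> str:
        # Slow path: walk every registered hook until one drops the packet.
        self.slow_path_calls += 1
        for hook in self.hooks:
            if hook(packet) == DROP:
                return DROP
        return ACCEPT

    def process(self, packet: bytes) -> str:
        # Fast path: with no hooks registered, skip the machinery entirely.
        if not self.hooks:
            return ACCEPT
        return self._run_hooks(packet)


def drop_empty(packet: bytes) -> str:
    """Example hook: drop zero-length packets."""
    return DROP if len(packet) == 0 else ACCEPT
```

With no hooks registered, `process()` returns without ever touching `_run_hooks()`. As soon as `drop_empty` is registered, every packet takes the slow path again, which is the case David wanted to see improved.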

Originally he reported a 15 percent speedup when using his patch. That would be enough to get anyone's attention, except for the fact that, as David Miller pointed out, Imre's measurements were not done with the standard perf tool; they were just his own observations, and thus could not be verified or even well understood. As one of the main gatekeepers of networking code in the kernel, David did not want to apply patches that were too speculative.

But David also objected to the whole idea behind Imre's patch. The solution to netfilter being too slow, he felt, was not to bypass netfilter in the case where it wasn't needed. The solution was to speed up netfilter so that it would run faster for everybody. He said, "I definitely would rather see the fundamental issue addressed rather than poking at it randomly with knobs for this case and that."

An interesting aspect of the whole discussion is the effort of a kernel "lieutenant" to guide developers towards working on a problem that might be more difficult than what they had originally attempted, but that would ultimately provide a better solution in general. There's always a little push and pull at this level, because sometimes the thing that really needs to be worked on is a really tough nut to crack, while there might be significant benefits to plucking whatever low-hanging fruit might be available. At what point does the lieutenant or the feature maintainer accept that an improvement is still good, even if it doesn't solve the big problem?

Protecting Memory from Malicious Modification

Sometimes patches slide so smoothly into the kernel that it almost doesn't seem as though anyone worked on them at all. Examples include patches that don't touch anything in the "fast path" and therefore won't have a critical impact on the speed of the whole system, patches that enhance an existing feature in a simple and obvious way, and patches that implement an optional security feature.

Igor Stoppa had an idea for an optional security feature and recently posted a patch to implement it. He wanted a way to set a memory pool dynamically to be read only. This would let the user do all the work of setting up a batch of data structures and then freeze them so that no hostile attackers could modify them on the running system.

Memory "pools" are just groups of memory allocations that are all the same size. Linux groups them together this way because it helps make allocations take the same amount of time to set up, which in turn makes the performance of any given piece of code more predictable.

With Igor's code, the user could perform a one-time, non-reversible operation to make all the allocations in a given memory pool read only. He felt it was important that the operation be non-reversible, because if the user were able to reverse it, an attacker potentially could as well.
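The one-way nature of the operation can be modeled in a few lines. This Python sketch is only a behavioral illustration; `SealablePool`, `SealedPoolError`, and the method names are hypothetical and are not the interface of Igor's patch. A language-level flag like the one below could be flipped back by anyone with access to the object, so a real implementation would need enforcement beneath the code being protected, which this model cannot show.

```python
class SealedPoolError(Exception):
    """Raised when code tries to modify a pool after it has been sealed."""


class SealablePool:
    """Toy model of a memory pool that can be made read only exactly once.

    Hypothetical names; this models the behavior described in the text,
    not the interface of the actual patch.
    """

    def __init__(self) -> None:
        self._items: dict = {}
        self._sealed = False

    def store(self, key: str, value: bytes) -> None:
        # Writes are only allowed during the setup phase.
        if self._sealed:
            raise SealedPoolError("pool is read only")
        self._items[key] = bytes(value)

    def load(self, key: str) -> bytes:
        # Reads are always allowed, before and after sealing.
        return self._items[key]

    def seal(self) -> None:
        # One-way transition: there is deliberately no unseal() method.
        self._sealed = True

    @property
    def sealed(self) -> bool:
        return self._sealed
```

A caller would `store()` everything it needs during setup, call `seal()` once, and from then on only `load()`. The absence of any reverse operation is the point: if the legitimate user has no way to undo the seal, an attacker cannot borrow one.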

There was immediate interest in this patch, particularly from the XFS people. Dave Chinner remarked that XFS had a lot of data that was simply static and unchanging; so it would be a short jump to just lock it down securely with Igor's patch. In fact, a bunch of people had been discussing exactly this feature at a recent Linux conference, so it was ready to be welcomed with open arms.

Several folks immediately began discussing possible improvements, naming conventions, and other elements of the patch. But everyone seemed to think it was a winner and should go into the tree.

It's fun to read this kind of discussion on the Linux Kernel Mailing List, because it could have just as easily been the case that Igor's patch would have touched a piece of untouchable code, or mangled something that shouldn't be mangled, or slowed something down that should be fast. You never know who's going to pop out of some distant part of the kernel infrastructure, to say, "Hey! Your code is stomping my feature!" So it must be very satisfying for developers when they write a patch, and it just snaps right in.

Speeding Up Workarounds for Intel Security Flaws

Intel's Meltdown flaw has really been an annoyance to Linux developers. The primary workaround is page table isolation (PTI), a heavy-handed patch that makes the system a lot slower but that definitely handles the problem case. Before PTI, there was no obvious way to deal with Meltdown. With PTI in place, the security issues have been eliminated, which has bought time to think about how to replace PTI piecemeal throughout the kernel, with solutions that don't carry as big a performance hit.

It's essentially the same approach that they ultimately took to eliminate the big kernel lock (BKL) some years back. When they tried to get rid of it all at once, they couldn't do it, and there was much gnashing of teeth and clawing of one another's eyes. But when they finally isolated all the BKL uses so that each occurrence could be dealt with individually, they were able to begin the gradual process of replacing them one by one with more targeted and efficient locking code.

That experience is possibly what made PTI such an obviously good choice in spite of its speed impact. Like the BKL, it solves the crucial problem, and like the BKL, there's no other apparent solution that can work everywhere in the kernel. But now like the BKL, PTI can be isolated, and each part of the kernel eventually purged of it, replacing PTI with whatever works best at that particular spot.

A recent patch from Nadav Amit tried to replace PTI at one spot in the kernel. The resulting discussion helped clarify the sort of issues the developers have to consider and why it might be much more difficult to eliminate PTI than it was to eliminate the BKL.

Nadav's idea was that Linux's compatibility mode had some of its own natural safeguards against Meltdown and that the remaining problems within compatibility mode could be handled individually, thus eliminating the need for PTI.

Compatibility mode is how Linux supports older 32-bit user software on modern 64-bit systems. Code that is confined to 32-bit addresses cannot reach kernel memory mapped above the 4GB boundary, which gives it a natural safeguard against the Meltdown bug. Nadav wanted his code to detect when 32-bit user code was running in compatibility mode and to ditch PTI for that activity.

It was essentially a corner-case solution, since it wouldn't help users running regular programs on 64-bit systems, which is the much more common case. However, it seemed like a legitimate inroad against the need for PTI.

But there were problems. First of all, as Andrew Cooper pointed out, it was possible for 32-bit code to break out of compatibility mode, get back into a 64-bit environment, and then take advantage of the Meltdown bug after all. As he put it, "Being 32 bit is itself sufficient protection against Meltdown (as long as there [is] nothing interesting of the kernels mapped below the 4G boundary). However, a 32-bit compatibility process [tries] to attack with Spectre/SP2 to redirect speculation back into user space, at which point (if successful) the pipeline will be speculating in 64-bit mode, and Meltdown is back on the table."
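The first half of Andrew's point comes down to address width. The sketch below is a bit of arithmetic in Python, not an exploit and not a test for the flaw: a 32-bit pointer can only name addresses below the 4G boundary, so a kernel mapped far above that boundary cannot even be referenced. The sample kernel address is an assumed value chosen for illustration, and the function names are hypothetical.

```python
FOUR_GIB = 1 << 32  # the "4G boundary" from the quote

# Assumed example value, for illustration only: a 64-bit address
# high in the address space, where a 64-bit kernel might be mapped.
EXAMPLE_KERNEL_ADDR = 0xFFFFFFFF81000000


def reachable_with_32bit_pointer(addr: int) -> bool:
    """True if a 32-bit pointer can hold this address without truncation."""
    return 0 <= addr < FOUR_GIB


def as_32bit_pointer(addr: int) -> int:
    """Return what is left of an address once it is squeezed into 32 bits."""
    return addr & (FOUR_GIB - 1)
```

Here `reachable_with_32bit_pointer(EXAMPLE_KERNEL_ADDR)` is false, and squeezing the address into 32 bits yields an unrelated low address rather than the kernel mapping. That protection disappears as soon as execution, even speculative execution, continues in 64-bit mode, which is the second half of Andrew's point.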

However, he went on to say that Supervisor Mode Execution Protection (SMEP) could guard against the problem he'd just identified. SMEP is a CPU feature that prevents the processor, while running in supervisor (kernel) mode, from executing code that resides in user-space pages, which is the kind of redirection into user space that the attack Andrew described relies on. As Dave Hansen put it, "SMEP is valuable. It's valuable to everything, compatibility-mode or not."

But there was further criticism of Nadav's patch. Andy Lutomirski pointed out that compatibility mode was not a true hardware "mode" like SMEP, but was just a form of software emulation implemented by Linux itself. As a result, he said, he wanted to discourage Nadav and others from trying to distinguish between software running in compatibility mode or not in compatibility mode. He said there wasn't really any such thing; it was all just code running in Linux. As such, Andy said, trying to identify a test for when something was in or out of compatibility mode would be complex and prone to error.

A better approach, he said, though not without other problems, would be for the user software itself to put itself into a 32-bit execution environment. Once that happened, Nadav's code could swing into action. This would eliminate any need to guess when someone was in compatibility mode, though as Andy himself acknowledged, it would require old user software to be updated. That's not always possible, for example, if only a binary executable is available. And Linus Torvalds has traditionally been adamant that old user code should continue to work under Linux unless breaking it is absolutely unavoidable.

Linus also had his own reservations about Nadav's patch, as well as about the general prospects of eliminating PTI in Linux anywhere else. Speaking only about Nadav's patch, though, Linus said that it seemed too easy for user code to escape compatibility mode and get back to a 64-bit environment; in which case, disabling PTI would be opening the door for an attacker.

Nadav replied that his code attempted to cover nearly all cases where user code might try to break out of compatibility mode, though he acknowledged, "There is one corner case I did not cover (LAR), and Andy felt this scheme is too complicated. Unfortunately, I don't have a better scheme in mind."

But even covering all the cases, Linus said, was "some really shady stuff" that might be too difficult to maintain safely. He said, "if you get it wrong, things will happily work, except you've now defeated PTI. But you'll never notice, because you won't be testing for it, and the only people who will are the black hats." Linus said that the fragility of trying to cover all these cases, "makes me go 'eww' about the whole model. Get one thing wrong, and you'll blow all the PTI code out of the water."

Linus also pointed out that Nadav's whole patch was covering a small corner case that wouldn't make any difference to most users, but that could potentially be used by malicious attackers to defeat PTI across the entire system. At one point Linus remarked, "I just feel this all is a nightmare."

So it's not quite as simple as getting rid of the BKL. And getting rid of the BKL wasn't simple by any stretch of the imagination. But in the case of PTI, the risks are greater. Get one thing wrong among a complex set of safeguards in any part of the kernel, and the whole system is vulnerable to attack. Meanwhile, the reason PTI had to be so heavy-handed in the first place is exactly because any other approach will have to be complex, difficult, and error prone. The decision to have a near-complete separation between kernel and user page tables is simple and solves the problem. Re-exposing those kernel page tables to user space will only be possible under very delicate and context-sensitive conditions.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
