Zack's Kernel News


Article from Issue 180/2015
Author(s): Zack Brown

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

NMI Refactor And Simplification

Andy Lutomirski wasn't satisfied with the non-maskable interrupt (NMI) system as implemented in the Linux x86 architecture. An NMI occurs when the system hardware interrupts the normal flow of code execution to perform a task so important that it can't be ignored or deferred by the operating system. For example, if the system detects bad memory, it might trigger an NMI that would halt the OS.

But even – maybe especially – in such dire circumstances, it's important for the kernel to behave in as orderly a way as possible, to preserve user data and leave the system in as consistent a state as possible. One NMI may occur within another, depending on the cause of the interrupts. That nesting has to be handled gracefully, and code sometimes needs to know whether it was called from within another NMI – information that is notoriously tricky to determine. However, Andy reported that in the current code, in certain corner cases, NMIs could nest improperly, malicious user code could cause some NMIs to be ignored entirely, and some code paths simply handled NMIs incorrectly.

He posted some patches to try to fix things up but quickly found himself plunging down the rabbit hole of nested NMI call subtlety, as he discovered more about how the existing NMI code behaved. The whole discussion was fascinating.

The problem seemed to be that because NMIs could nest, the full scope of possible context switches had to be detected and accounted for, and the behaviors in all cases had to be sane. Or not! Maybe, as Andy suggested, a fully paranoid approach would handle all cases, but at an unacceptable speed cost. A less thorough approach, he said, might catch nearly all cases, but without the cost. And, maybe the discrepancies wouldn't really matter in practice.

As Borislav Petkov put it, the worst case was just that some breakpoints (out of a vast bunch) might get ignored. Ultimately, based on this and similar logic, Andy decided to focus on the simpler, less correct, yet possibly good enough approach.

Andy said regretfully, "I admit I'm rather fond of xadd as a way to switch rsp and set a flag at the same time, though." To which Borislav replied, "I know you are. But people will rip your head out if you added 60 cycles to the IRQ path."

At that point, Linus Torvalds offered his opinion of Andy's approach, saying, "Yes, it's irq-safe. But it's not NMI-safe, so if an NMI happens there, when the NMI returns, an interrupt might occur there too. But since we're back on the original stack where the original fault happened, and since interrupts were enabled, I don't see why that would be horrible. In theory, we might have a growing stack if this keeps happening, but since the only way to get that is to get the NMI in that one-instruction window (and apparently on at least _some_ microarchitectures the sti shadow stops even NMIs), I don't see how any kind of unbounded growth would happen."

In other words, yes it could be a problem, but to hit the badness would require actively aiming at it.

So, Linus supported the less thorough approach, although he added, "it still worries me a bit."

A bit of implementation discussion ensued, and Andy forked off a new thread to summarize the debate and to introduce a new possible approach. As he put it, the first possibility was to be fully paranoid and catch every possible case. This would allow the kernel to fully support NMI nesting, but at the speed cost mentioned above. Throughout the discussion, no one really advocated for that.

The second option was to disallow IRET inside NMIs – in other words, to prevent NMIs from nesting within each other entirely. This would be the simpler though less pretty option.

Andy now proposed a third option – forbid faults within an NMI, with the exception of MCEs (Machine-Check Exceptions). An MCE indicates a hardware error serious enough that the system may not be able to continue running.

By forbidding other sorts of faults within an NMI, Andy reasoned, the problem could be dramatically simplified. The only two faults that might conceivably be nested within an NMI were #PF (Page Fault) and #DB (Debug). And, Andy said that the debug fault case could be handled using existing patches.

This left only page faults. For those, there were only two cases where Andy could see a page fault occurring within an NMI. The first was a vmalloc fault (memory allocation fault), and the second was a user memory access fault.

However, Andy said that Ingo Molnár was already working on patches that would eliminate vmalloc faults entirely. So, user memory access faults were the only remaining ones that had to be removed from NMIs. And, he said, those faults didn't belong in NMI context anyway. Andy explained:

The reason we access user state in general from an NMI is to allow perf to capture enough user stack data to let the tooling backtrace back to user space. What if we did it differently? Instead of capturing this data in NMI context, capture it in prepare_exit_to_usermode. That would let us capture user state *correctly*, which we currently can't really do. There's a never-ending series of minor bugs in which we try to guess the user register state from NMI context, and it sort of works. In prepare_exit_to_usermode, we really truly know the user state.

It looked promising. All the cases seemed to be accounted for, and he went on to consider various implementation details. However, Linus torpedoed that whole third idea. Linus said that forbidding faults within an NMI, "depends on us getting many things right, and never introducing new cases in the future."

Linus preferred Andy's second solution – the relatively simple compromise of forbidding IRETs within NMIs. He thought that was the most localized solution. As Linus saw it:

The point of RF is to make forward progress in the face of debug register faults, but I don't see what was wrong with the whole 'disable any debug events that happen with interrupts disabled'.

And no, I do *not* believe that we should disable debug faults ahead of time. We should take them, disable them, and return with 'ret'. No complex 'you can't put breakpoints in this region' crap, no magic rules, no subtle issues.

I really think your 'disallow #DB' is pointless. I think your 'prevent instruction breakpoints in NMI' is wrong. Let them happen. Take them and disable them. Return with RF clear. Go on with your life.

And the 'take them and disable them' is really simple. No 'am I in an NMI context' thing (because that leads to the whole question about 'what is NMI context'). That's not the real rule anyway.

No, make it very simple and straightforward. Make the test be 'uhhuh, I got a #DB in kernel mode, and interrupts were disabled – I know I'm going to return with 'ret', so I'm just going to have to disable this breakpoint'.

Nothing clever. Nothing subtle. Nothing that needs 'this range of instructions is magical'. No. Just a very simple rule: if the context we return to is kernel mode and interrupts are disabled, we're using 'ret', so we cannot suppress debug faults.

Andy accepted the course correction and started thinking about particular implementation details for this approach, and a bunch of other folks joined in. Peter Zijlstra was not thrilled about disabling breakpoints entirely, though, because, he said, it "would make the whole debug register thing entirely unreliable."

But, Linus said that re-enabling the breakpoints was something to think about afterwards – it wasn't necessary for Andy's NMI reworking. He said that re-enabling breakpoints "should not be a requirement for the basic stability and core integrity of the kernel. Not like the current horrid mess with NMI nesting and ESP fixing etc." And, he added, "realistically, nobody will ever even notice. So the whole 'ok, we can use _TIF_USER_WORK_MASK to re-enable dr7' is a tiny tiny detail that is more like cleaning up things, not a core issue."

Peter wasn't the only one concerned about the need to re-enable breakpoints. Linus may have compartmentalized the issue to his satisfaction, but Andy also had issues with letting breakpoints go like that and suggested, "Or we just re-enable them on the way out of NMI (i.e., the very last thing we do in the NMI handler). I don't want to break regular userspace gdb when perf is running."

Linus wanted to go even further. In terms of disarming breakpoints, he said:

… for simplicity, I'd make it cover not just NMI code, but any 'kernel code with interrupts disabled'.

Because that's the test we'd use for 'use ret instead of iret'.

And that wider test is exactly because it's so damn hard to get the exact instruction boundaries right. Let's *not* go down the path (again) of having to get the whole %rip range and 'magic stack pointer values' etc.

Make it simple and completely unambiguous. The rule really would be:

- if we return to kernel space and interrupts are disabled, we will use 'ret' rather than 'iret'. Hard rule. Simple. Straightforward. No random %rip values. No random %rsp values. NO CRAP.

- but because we use 'ret' rather than 'iret' we can't get RF semantics, it means that #DB is special. RF is supposed to make us make forward progress. So for that reason, #DB just says 'if the breakpoint happened during that interrupts-off region, I will clear %dr7 to guarantee forward progress'

So those would be the two main rules. Very simple, and avoiding all nasty cases.

Now, I'd be willing to then hide the 'oops, we clear dr7 very aggressively' issue by having a few additional _heuristics_. But I call them 'heuristics' because unlike the current NMI nesting games, they aren't about core stability. They are about 'ok, maybe somebody wants to trigger those faults, and we'll be _nice_ and try to make it easy for them', but nothing more.

So, for example, if that '#DB clears %dr7' happened, it sounds easy to set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a cached value, so that if we disabled things because of some user stack trace access, it will be re-enabled by the time we return to user space. I think that sounds reasonable, but it's not something the core low-level entry x86 assembly code needs to even care about. It's not that level of 'core', it's just being polite.

Elsewhere, Linus gave more explanation about why he wanted to draw such a sharp distinction between the core behavior and any other "polite" behavior that might be implemented later. This was in response to Andy's suggestion that the code re-enable breakpoints as the very last thing the NMI did before returning. Linus said:

I'd really prefer it if we don't touch NMI code in those kinds of ways. The NMI code is fragile as hell. All the problems we have with it is exactly due to 'where is the boundary' issues.

That's why I *don't* want NMI code to do magic crap. Anything that says 'disable this during this magic window' is broken. The problems we've had are exactly about atomicity of the entry/exit conditions, and there is no really good way to get them right.

I'd be much happier with a _TIF_USER_WORK_MASK approach exactly because it's so *obvious* that it's not a boundary condition.

I dislike the 'disable and re-enable dr7 in the NMI handler' exactly because it smells like 'we can only handle faults in _this_ region'. It may be true, but it's also what I want us to get away from. I'd much rather have the 'big picture' be that we can take faults anywhere at all, and that none of the core code really cares. Then we 'fix up' user space.

But Andy was still *very* reluctant to go along with Linus on this, because the core behavior Linus was aiming for would break GDB. And then, re-enabling breakpoints in a polite-yet-unreliable way would not necessarily be sufficient, because GDB could still set breakpoints that might be disarmed by this code and then not re-enabled. Andy dug further into the code and came up with a way that the NMI code would only need to disarm a specific set of breakpoints, which were outside the scope of what GDB might do. He posted some pseudocode illustrating his idea.

Linus liked Andy's approach. Linus also realized he hadn't fully understood the details of hardware behaviors. He said:

Hmmm. I thought watchpoints were 'before the instruction' too, but that's just because I haven't used them in ages, and I didn't remember the details. I just looked it up.

You're right – the memory watchpoints trigger after the instruction has executed, so RF isn't an issue. So yes, the only issue is instruction breakpoints, and those are the only ones we need to clear.

And that makes it really easy.

So yes, I agree. We only need to clear all kernel breakpoints.

So we don't even need that _TIF_USER_WORK_MASK thing, because user space isn't setting kernel code breakpoints, it's just kgdb.

Sounds good to me.

There was another burst of implementation discussion, patch submission, and investigation into whether things truly behaved the way they seemed. The discussion got very deep, analyzing opcodes, lock locations, and lots of other technical details.

Ultimately, this seems like the sort of project that leaves a trail of controversy in its wake – as soon as anyone comes up with a new approach that relocates the problem, developers working in that new area uncover issues of their own. So Andy and everyone else involved have to keep searching for that magical spot where the fix won't break 10 other things.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
