Zack's Kernel News

Reading from the Ring Buffer

Wang Nan wanted to be able to pause and resume output to the perf ring buffer, so that a tool could read from it without worrying that anything might write to it at the same time. The perf ring buffer is where the kernel stores the records generated by performance events, and it is what tools like perf read from.

Peter Zijlstra had no objection to this kind of patch, but he did want to see the man pages updated as well. And Vince Weaver agreed that updating the man pages would be good for something like this, since it represented an application binary interface (ABI) change.

An ABI-breaking change means that user code compiled before the change might not run on kernels compiled after it. Traditionally, ABI changes are something kernel developers desperately want to make and Linus Torvalds absolutely refuses to allow. Developers want them because they must otherwise support ancient legacy features forever, even broken or inconsistent ones. Linus refuses because breaking the ABI means real user code starts to break in the real world. As a side issue, it becomes more difficult to find bugs in the kernel itself if the search goes across the boundary of the ABI change.

The question of why Wang's patch was important enough to justify an ABI change was not made clear during the mailing list discussion. Possibly it only added to the ABI instead of changing something that was there already. However, it does seem to be an important change, because as Wang said, "Before reading caller must ensure the ring buffer is frozen, or the reading is unreliable."
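The pause-then-read pattern Wang described can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: the class and method names (PausableRingBuffer, pause, resume, snapshot) are invented for illustration and are not the kernel's interface, and the model simply drops records that arrive while output is paused.

```python
from collections import deque


class PausableRingBuffer:
    """Toy model of an overwriting ring buffer (hypothetical, not kernel code)."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest record once it is full.
        self._records = deque(maxlen=capacity)
        self._paused = False
        self.dropped = 0  # records rejected while output was paused

    def write(self, record):
        """Store a record unless output is paused; return True if stored."""
        if self._paused:
            self.dropped += 1
            return False
        self._records.append(record)
        return True

    def pause(self):
        self._paused = True

    def resume(self):
        self._paused = False

    def snapshot(self):
        """Freeze output, copy the contents, then let writers continue."""
        self.pause()
        try:
            return list(self._records)
        finally:
            self.resume()


buf = PausableRingBuffer(capacity=3)
for n in range(5):
    buf.write(n)
print(buf.snapshot())  # [2, 3, 4]
```

Without the pause step, a writer could overwrite the oldest records while the reader was still copying them, which is the unreliability the quote refers to.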

Migrating Processes Between Cgroups

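John Stultz posted a patch to allow a process to migrate other processes from one cgroup to another without needing root privileges. Cgroups (control groups) are the kernel mechanism for grouping processes, and they underpin containers and similar virtualized Linux instances. He got the idea from Michael Kerrisk; the need itself had originated in Android, which wanted to avoid running its process manager with root privileges.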

Typically process migration between virtual Linux instances is a risky business because it represents a potential point of attack, where hostile code might break out of the sandbox and escape to the host system.

Still, a potential security hole is different from an actual security hole, and folks like Kees Cook were glad to see the code. Andy Lutomirski, on the other hand, said that cgroups were about to expand their entire scope to do more than simply resource control. Future cgroups might have powers and abilities beyond those of mortal virtualized systems. Simply migrating a process from one cgroup to another, Andy said, might expose vulnerabilities that today's cgroups would not.

Without a concrete alternative, John didn't know how to update his patch to avoid the problems Andy was describing. Finally, Andy suggested adding some form of privilege, not only to the task but to the cgroup itself, so that a process could only migrate from one cgroup to another if the user had permissions over both the process and the target cgroup.
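Andy's two-sided rule can be sketched as a small permission check. Everything here is hypothetical: the User type, its writable_cgroups field, and the may_migrate function are invented for illustration, and the thread had not settled on which capabilities or permission bits would actually be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user: a uid plus the cgroups it is allowed to modify."""
    uid: int
    writable_cgroups: frozenset = frozenset()


def may_migrate(user, task_owner_uid, target_cgroup):
    """Allow migration only if the user controls both the task and the target."""
    controls_task = user.uid == task_owner_uid
    controls_target = target_cgroup in user.writable_cgroups
    return controls_task and controls_target


alice = User(uid=1000, writable_cgroups=frozenset({"background"}))
print(may_migrate(alice, task_owner_uid=1000, target_cgroup="background"))
print(may_migrate(alice, task_owner_uid=1000, target_cgroup="foreground"))
```

The point of checking both sides is that owning a process is not enough to push it into a cgroup the user has no rights over, and rights over a cgroup are not enough to pull in someone else's process.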

Beyond that, it was a question of exactly which capabilities to use and how to organize them properly. At least the possibility does exist to support cgroup migration in the future, without compromising security.

Overall, cgroups are a strange and dangerous world. They never perfectly imitate a host system, and there is always the temptation to add bizarre features that could only exist in a virtualized environment. Ultimately, I suspect virtualized OSs will look quite a bit different from the hosts.

Performance Events Limits

Jeffrey Vander Stoep wanted to limit certain potential attack vectors, so he wrote a patch such that if kernel.perf_event_paranoid were set to 3, users would have to have CAP_SYS_ADMIN to gain access to performance events.
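The proposed rule can be modeled in a few lines. This is a sketch, not the patch itself: the function names are invented, the model covers only the new level 3 and ignores the finer restrictions that lower levels already impose, and treating any value of 3 or above as "restricted" is an assumption. The sysctl is exposed as /proc/sys/kernel/perf_event_paranoid.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_paranoid_level(path=PARANOID_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def perf_open_allowed(level, has_cap_sys_admin):
    """Model of the proposed gate: at level 3 or above, require CAP_SYS_ADMIN."""
    if has_cap_sys_admin:
        return True
    return level < 3


level = read_paranoid_level()
if level is None:
    print("perf_event_paranoid is not readable on this system")
else:
    print("unprivileged perf access allowed:", perf_open_allowed(level, False))
```

Because the setting is a single system-wide integer, it cannot distinguish one unprivileged application from another, which is the coarseness the rest of the discussion turns on.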

Jeffrey's idea was that performance events were great for debugging purposes, but they were rarely used on production systems and represented a potential security hole. He pointed to a slew of examples and said, "This new level of restriction allows for a safe default to be set on production systems while leaving a simple means for developers to grant access."

Kees Cook was enthusiastic about the patch, but Peter Zijlstra said plainly, "We have bugs; we fix them; we don't kill complete infrastructure because of them." He went on, "the problem I have with this is that it will completely inhibit development of things like JITs that self-profile to recompile frequently used code. I would much rather have an LSM hook where the security stuff can do more fine grained control of things, allowing some apps perf usage while denying others."

Arnaldo Carvalho de Melo also pointed out other areas of development that would be stifled by "such big hammer restrictions." Daniel Micay, on the other hand, came down in support of Jeffrey's patch. He said that it would still be possible, with Jeffrey's patch, to give certain processes the privileges they needed to use performance data. He said to Peter, "You're forcing people to have common local privilege escalation and information leak vulnerabilities for something few people actually use."

Daniel added, "This patch is now a requirement for any Android devices with a security patch level above August 2016. The only thing that not merging it is going to accomplish is preventing a mainline kernel from ever being used on Android devices."

Kees also replied directly to Peter's statement, "we have bugs; we fix them." He said, "it isn't what things look like for the average end-user of Linux. The lifetime on bugs is very long, even in upstream (see both Jon Corbet and my talks about this: an average of five years from introduction to fix), and gets drawn out even further by vendors with slow (or missing) update processes. Being able to remove attack surface is a fundamental first step of security defense, and things like perf, user namespaces, and similar APIs, expose a lot of attack surface when they are enabled. And the evidence for this attack surface being a real-world risk is in the history of security vulnerabilities (that we know about!) in these various APIs."

He went on to say, "the APIs are needed, but they lack the appropriate knobs to control their availability. And this isn't just about Android: regular distro kernels (like Debian, who also uses this patch) tend to build in everything so people can use whatever they want. But for admins that want to reduce their systems' attack surface, there needs to be ways to disable things like this."

Peter agreed with the knob concept, but he felt that the specific knob being proposed was not the right one. He said, "Having this knob will completely inhibit development of such applications. Worse, it will probably render perf dead for quite a large body of developers. The moment you frame it like: perf or sekjurity, and even default to no-perf-because-sekjurity, a whole bunch of corporate IT departments will not enable this, even for their developers."

The current proposal, he said, was too coarse and inhibiting. A better way had to be found.

Kees said, "The vast majority of people running Linux do not use perf (right now). I've never suggested it be default disabled: I'm wanting to upstream the sysctl setting that is already in use on distros where the distro kernel teams have deemed this is [a] needed knob for their end-users." He pointed out, "All of the objections you're talking about assume that the knob doesn't exist, but it does already. It's just not in upstream."

Jeffrey also put in, "Far from trying to kill perf, we want (and require) perf to be available to developers on Android. All that this patch enables us to do is gate it behind developer settings – just like we do with other developer targeted features."

Ingo Molnár, however, agreed with Peter. He said that it made no difference whether the default was on or off. The coarse/limiting aspect was simply too significant and had to be dealt with properly. Ingo said, "This isn't some narrow debugging mechanism we can turn on/off globally and forget about, this is a wide scope performance measurement and event logging infrastructure that is being utilized not just by developers but by apps and runtimes as well."

He went on to say, "in practice what will happen is that if the only option is to do something drastic for sekjurity, IT departments will do it – while if there's a more flexible mechanism that does not throw out the baby with the bath water that is going to be used."

Ingo compared the current patch with a situation that might have played out in the past. He said:

This is as if 20 years ago you had submitted a patch to the early Linux TCP/IP networking code to be on/off via a global sysctl switch and told people that "in developer mode you can have networking, talk to your admin."

We'd have told you: "this switch is too coarse and limiting, please implement something better, like a list of routes which defines which IP ranges are accessible, and a privileged range of listen sockets ports and some flexible kernel side filtering mechanism to inhibit outgoing/incoming connections."

Global sysctls are way too coarse.

Daniel argued that at least with the current patch, there was a way to turn access to perf events on and off at run time. If, for example, this was a compile-time configuration option, he said, it would require a reboot to gain access to perf events.

He also said that the "wide scope" infrastructure Ingo had referred to was exactly why the security problem was so big. He said, "If it wasn't such a frequent source of vulnerabilities, it wouldn't have been disabled for unprivileged users in grsecurity, Debian, and then Android."

He reiterated that Android and Debian already included the current patch. The baby wasn't in danger of being thrown out with the bath water – it had already happened – and the official kernel could recognize that or not. He said, "They'll keep doing it whether or not this lands. If it doesn't land, it will only mean that mainline kernels aren't usable for making Android devices."

Peter suggested coming up with a new capability to govern access to perf events. Specifically, he suggested that processes operating across a network connection would drop all capabilities. This would allow perf access at the local level, but not to networked applications.

Eric W. Biederman reiterated the main objection to the current patch. He said, "the problem with a system wide off switch is what happens when you have a single application that needs to use the feature. Without care your system wide protection disappears. That is very brittle design."

Eric suggested using a sandboxing approach instead, in which a given sandbox could have a given feature turned on or off without affecting anyone else. He added, "One of the strengths of Linux is applications of features the authors of the software had not imagined. Your proposals seem to be trying to put the world [in] a tiny little box where if someone had not imagined and preapproved a use of a feature it should not happen. Let's please avoid implementing totalitarianism to avoid malicious code exploiting bugs in the kernel. I am not interested in that future."

Kees replied, saying that he was "interested in giving system owners greater control over what's exposed. That's not about limiting access everywhere. And I'm interested in making sure that the upstream kernel actually provides what end-users want. In the most extreme version of this is when distros carry kernel patches to get it done (this was true with userns and is true again here with perf). This IS a desired feature, and it exists in the world. I want to avoid the confusion that arises from people running patched kernels: upstream developers don't realize what state their features are in when they reach end users, documentation doesn't match, etc., etc."

Daniel also said, "There are perf event vulnerabilities being exploited in the wild to gain root on Android. It's not a theoretical attack vector. They're used in both malware and rooting tools. Local privilege escalation bugs in the kernel are common so there are a lot of alternatives but it's one of the major sources for vulnerabilities."

The discussion became somewhat disjointed at this point. There was some effort to explore technical alternatives to the original patch, but there was still disagreement over exactly how dangerous the various vulnerabilities were and how crucial it would be to eliminate absolutely all of them in one fell swoop. At the same time, several people who were expert in the areas they'd been discussing so far were less expert in some of the proposals that began to emerge, so various folks had to catch up to others.

Ultimately, no agreement could be reached, and the debate seemed to be shaping into one of epic proportions. The kernel loyalists were on one side, saying that the feature would be unacceptable, and the distro makers were on the other side, saying that the feature in question was already included in systems around the world, including everything running Android or Debian.

It's impossible to know how the debate will eventually play out. These sorts of things can take years, with neither side willing to budge. In this particular case, though, it does seem there is room for a more subtle approach than the original patch would allow.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
