Capabilities
Core Technology
Everyone wants to be root, because root can do anything. But in fact, its powers are now split. Learn more in this overview of capability sets.
Today's Linux is somewhat like a famous sightseeing city you might have visited on your last vacation. There is a historic part that's of no practical use now, yet it is what keeps the city's identity. There are some well-known tourist spots that everybody seems to visit. And, finally, there are some secluded locations you never find advertised in a travel agency. These are places a friend living there would show you, and they are essential for sensing the real spirit of the city, not its picturesque pamphlet image.
Okay, maybe I've taken the analogy a bit too far here. But if you agree to follow it for a second, capabilities would be one of these secluded locations. Introduced with Linux 2.2, they are what really determines whether process X can do Y. Yet they are often lost in the shadows of traditional Unix privileges, SELinux, eBPF, and many others. By the end of this Core Tech article, you'll know who really sets your limits in the city of Linux.
An All-Mighty Root (Actually, Not)
Back in ye olde days, the permission system of Linux was pretty simple. A user with UID 0, often called "root," could perform any privileged operation and wasn't subject to permission checks. Note that it is the UID, not the name, that matters: A user called "val" with UID 0 holds all the powers of the root user as well.
This "all-or-nothing" approach served well but wasn't very flexible. What if you do not want someone to install new packages or add new users, yet want them to be able to create the raw sockets that the ping command uses? Granting someone permission to adjust the system date doesn't mean you would be happy if he or she reconfigured Nginx or MySQL on this server.
The sudo tool solves this, kind of. You can tell it which commands a given user can execute as root, so being able to run date doesn't imply permission to execute passwd. But command-level granularity is sometimes too coarse to be useful. If a single command writes files and adjusts dates, /etc/sudoers provides no way to restrict the former and grant the latter. This means that you leave your system potentially open to attack.
Let's look at how the kernel implements permission checks for privileged operations. Listings 1 and 2 show the relevant part of the inet_create() function, which is called in response to the socket(AF_INET, ...) system call. The listings also show the functions that do the actual permission checks; they are really defined in a separate file, linux/sched.h.
The code in Listing 1 comes from Linux 2.0. It's straightforward to see that it only evaluates whether the current process's effective UID is zero. However, Linux 2.2 is not concerned with the user ID anymore. Instead, it checks for specific flags in the process descriptor. These bit flags are essentially the capabilities of the process or, more precisely, of a thread. Each capability can be either set or cleared. (See the "Secure Bits" box for more information.)
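As a toy model of the difference (Python rather than kernel C, with a hypothetical Task class standing in for the process descriptor), the two styles of check might be sketched like this. The function names suser and capable are used for illustration only; the bit number for CAP_NET_RAW is the one from linux/capability.h.

```python
# Toy model, not kernel code: contrasts a UID-based privilege check
# with a capability-bit check. Task is a hypothetical stand-in for
# the kernel's process descriptor.
CAP_NET_RAW = 13  # bit number from linux/capability.h


class Task:
    def __init__(self, euid, cap_effective=0):
        self.euid = euid                    # effective user ID
        self.cap_effective = cap_effective  # bitmap, one bit per capability


def suser(task):
    """Old style: privileged only if the effective UID is zero."""
    return task.euid == 0


def capable(task, cap):
    """New style: test a single bit in the effective capability bitmap."""
    return bool(task.cap_effective & (1 << cap))


if __name__ == "__main__":
    pinger = Task(euid=1000, cap_effective=1 << CAP_NET_RAW)
    print(suser(pinger))                 # UID-based check
    print(capable(pinger, CAP_NET_RAW))  # capability-based check
```

In this model, a thread with a non-zero UID can still pass the capability check for raw sockets without passing the UID check.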
Secure Bits
Capabilities are a tricky yet flexible system, and you may be wondering why we still keep root users today. The short answer is that it preserves backward compatibility and works well in many cases. The longer answer is that you really don't have to.
Starting with Linux 2.6.26, it is possible to establish a root-less, capabilities-only environment. In this environment, UID 0 is treated no differently from any other UID. Because process permissions have really been granted per capability since Linux 2.2, establishing such an environment only needs some flags to disable the special handling of UID 0. These flags are commonly known as "secure bits."
Perhaps the most important secure bit is SECBIT_NOROOT, which disables setting the permitted and inheritable file capabilities to all ones, as described in this article. Two other flags, SECBIT_KEEP_CAPS and SECBIT_NO_SETUID_FIXUP, remove the effect of switching between zero and non-zero UIDs.
All these "base" flags also have companion "locked" flags. A locked flag forbids modifications to the corresponding base flag, and it can't be cleared. This means you can set up a secure bits environment the way you want, lock it, and be confident no process could ever change it. Secure bits are managed with the prctl(2) system call, and a PAM module would be an appropriate place to do so.
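To see which secure bits are set for the current process, something along these lines could work. It is a minimal sketch that assumes the flag layout from linux/securebits.h (each base flag followed by its locked companion) and the PR_GET_SECUREBITS option from linux/prctl.h; the helper names decode_securebits and read_securebits are made up for this example, and bits added by later kernels are simply ignored.

```python
import ctypes
import sys

# prctl(2) option from linux/prctl.h
PR_GET_SECUREBITS = 27

# Bit order from linux/securebits.h: base flag, then its "locked" companion.
SECURE_BITS = [
    "SECBIT_NOROOT",
    "SECBIT_NOROOT_LOCKED",
    "SECBIT_NO_SETUID_FIXUP",
    "SECBIT_NO_SETUID_FIXUP_LOCKED",
    "SECBIT_KEEP_CAPS",
    "SECBIT_KEEP_CAPS_LOCKED",
]


def decode_securebits(bits):
    """Return the names of the flags set in a secure bits value."""
    return [name for i, name in enumerate(SECURE_BITS) if bits & (1 << i)]


def read_securebits():
    """Ask the kernel for the calling thread's secure bits (Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    bits = libc.prctl(PR_GET_SECUREBITS, 0, 0, 0, 0)
    if bits < 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_GET_SECUREBITS) failed")
    return bits


if __name__ == "__main__" and sys.platform.startswith("linux"):
    print(decode_securebits(read_securebits()))
```

On an ordinary system with no secure bits configured, the list is typically empty.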
Listing 1
Permission Checks in Linux Kernels Before 2.2
Linux understands a few dozen capabilities now; see linux/capability.h [1] or the capabilities(7) man page [2]. The number of the highest available capability (zero-based) is also in /proc/sys/kernel/cap_last_cap:
$ cat /proc/sys/kernel/cap_last_cap
37
I'd be happy to say that every privileged operation has a dedicated capability flag now, but that isn't the case. Some capabilities span several operations. For instance, CAP_NET_ADMIN permits one to configure network interfaces, manage firewall rules, and modify routing tables (among other things). You see the grouping is natural, so when a capability feels coarser than you might expect, it's usually not a problem.
As you may have guessed by now, CAP_NET_RAW allows creating raw (and packet) network sockets, which is useful for the ping command and for sniffing tools such as tcpdump.
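Because capabilities are just numbered bits, turning a bitmap into names is a small exercise. The sketch below hard-codes the names in the order they appear in linux/capability.h up to number 37 (the cap_last_cap value shown above); newer kernels define more, which this example reports only by bit number. The cap_names helper is a name invented for this illustration.

```python
# Capability names in bit order, as listed in linux/capability.h
# (numbers 0 through 37).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def cap_names(mask):
    """List the names of all capabilities whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append(f"bit {bit}")  # not known to this table
        bit += 1
    return names


if __name__ == "__main__":
    # Bits 12 and 13 set
    print(cap_names(0x3000))  # ['CAP_NET_ADMIN', 'CAP_NET_RAW']
```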
Capability Sets
You may have noticed that the code in Listing 2 checks capabilities in the cap_effective member of the process descriptor. There are a few other cap_something members as well because each thread in Linux has several associated capability sets. Effective is, of course, what defines the capabilities currently in action. Other sets are used, for example, when a thread does an execve(2) system call to execute some new code for which you may want different capabilities.
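From userspace, the per-thread sets are visible as hexadecimal bitmaps in the Cap* lines of /proc/<pid>/status. The following sketch parses them; the parse_cap_sets helper is a hypothetical name, and the CapAmb line only appears on kernels that support ambient capabilities.

```python
import os

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def parse_cap_sets(status_text):
    """Extract capability bitmaps from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            sets[key] = int(value.strip(), 16)  # fields are hex bitmaps
    return sets


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    with open("/proc/self/status") as status:
        for name, mask in parse_cap_sets(status.read()).items():
            print(f"{name}: {mask:#018x}")
```

CapInh, CapPrm, and CapEff are the inheritable, permitted, and effective sets; CapBnd is the bounding set and CapAmb the ambient set.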
Listing 2
Permission Checks in Linux Kernels 2.2 and Newer
First, there is the permitted capability set. It contains all capabilities a thread may ever assume, that is, add to any other capability set. If a thread drops a capability from the permitted set, there is no way back, at least until the thread executes a setuid-root program or a file that has this capability attached.
This brings us to the inheritable capability set. As the name implies, these are capabilities that are preserved across the execve(2) system call. Inheritable capabilities are automatically added to the permitted set when a program is executed. However, this only applies when the process is privileged (it runs as root or executes a setuid-root binary) or when the executed file lists the capability in its own inheritable set, as discussed later. For everything else, inheritable capabilities are simply ignored. So, if ping had CAP_NET_RAW in its inheritable set, and you somehow tricked it into running a Python interpreter for you, you still wouldn't be able to create arbitrary raw network sockets. Only ping could do it, and it properly restricts the use of this powerful feature to innocent ICMP echo requests.
This raises a question: How do you execute a privileged helper then? This is, in fact, a common scenario: Consider a network management app. You don't need privileges to fill in stuff like an IP address or a gateway. Yet when you apply these settings, the app calls some helper script (often setuid-root) to put the configuration you want into effect.
Before Linux 4.3, there was no straightforward way to do this using capabilities. Now we have the ambient capability set. A capability in this set must be both permitted and inheritable (the kernel enforces it automatically), and these capabilities are preserved across execve(2) calls in unprivileged programs. When you execute a setuid or a setgid program, the kernel clears ambient capabilities to keep things safe.
A process can also directly change capabilities in the ambient set using the prctl(2) system call. Keep in mind, however, that everything I described so far applies to execve(2) only. Forks are nothing special from the capabilities point of view: Both the parent and the child get a bitwise copy of all capability sets. It's execve(2) that matters, as it decides which code a thread will ultimately execute.
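The rules for the ambient set and for forks can be captured in a small model. This is a sketch in plain Python, not a binding to prctl(2): the ThreadCaps class and its methods are hypothetical, and sets of capability names stand in for the kernel's bitmaps. It assumes, as capabilities(7) states, that a capability leaves the ambient set as soon as it stops being permitted or inheritable.

```python
class ThreadCaps:
    """Toy model of a thread's permitted, inheritable, and ambient sets."""

    def __init__(self, permitted=(), inheritable=()):
        self.permitted = set(permitted)
        self.inheritable = set(inheritable)
        self.ambient = set()

    def can_raise_ambient(self, cap):
        # Invariant: ambient capabilities must be permitted AND inheritable.
        return cap in self.permitted and cap in self.inheritable

    def raise_ambient(self, cap):
        if not self.can_raise_ambient(cap):
            raise PermissionError(f"{cap} is not both permitted and inheritable")
        self.ambient.add(cap)

    def drop_permitted(self, cap):
        # Dropping a permitted capability also lowers it in the ambient set.
        self.permitted.discard(cap)
        self.ambient.discard(cap)

    def drop_inheritable(self, cap):
        # Same for the inheritable set.
        self.inheritable.discard(cap)
        self.ambient.discard(cap)

    def fork(self):
        # A fork copies every set unchanged.
        child = ThreadCaps(self.permitted, self.inheritable)
        child.ambient = set(self.ambient)
        return child


if __name__ == "__main__":
    parent = ThreadCaps(permitted={"CAP_NET_RAW"}, inheritable={"CAP_NET_RAW"})
    parent.raise_ambient("CAP_NET_RAW")
    child = parent.fork()
    print(child.ambient == parent.ambient)
```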
Capability Math
Now you have an idea of how the kernel implements thread capabilities, but where do these capabilities come from? Nowadays, they're usually attached to an executable file. Linux stores capabilities in a dedicated extended attribute within the security namespace [3]:
$ getfattr -m - -d /usr/bin/ping
# file: usr/bin/ping
security.capability=0sAQAAAgAwAAAAAAAAAAAAAAAAAAA=
...
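The value after the 0s prefix is base64. Assuming it follows the revision 2 layout of struct vfs_cap_data in linux/capability.h (a little-endian 32-bit magic word followed by two pairs of permitted and inheritable words), it can be unpacked as follows. The decode_file_caps helper is a name made up for this sketch, and other revisions are deliberately rejected.

```python
import base64
import struct

# Constants from linux/capability.h
VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_REVISION_2 = 0x02000000
VFS_CAP_FLAGS_EFFECTIVE = 0x000001


def decode_file_caps(b64_value):
    """Decode a revision 2 security.capability value (without the 0s prefix)."""
    raw = base64.b64decode(b64_value)
    (magic_etc,) = struct.unpack_from("<I", raw)
    if (magic_etc & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION_2 or len(raw) != 20:
        raise ValueError("only revision 2 attributes are handled here")
    # Low and high 32-bit halves of the permitted and inheritable bitmaps.
    _, perm_lo, inh_lo, perm_hi, inh_hi = struct.unpack("<5I", raw)
    return {
        "effective": bool(magic_etc & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": perm_lo | (perm_hi << 32),
        "inheritable": inh_lo | (inh_hi << 32),
    }


if __name__ == "__main__":
    caps = decode_file_caps("AQAAAgAwAAAAAAAAAAAAAAAAAAA=")
    print(caps["effective"], hex(caps["permitted"]), hex(caps["inheritable"]))
```

For the value shown above, the permitted bitmap is 0x3000, that is, bits 12 and 13 (CAP_NET_ADMIN and CAP_NET_RAW), and the effective bit is set.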
Interestingly, there is a dedicated capability, CAP_SETFCAP, which grants permission to set file capabilities. This is a sort of chicken-and-egg problem, although the "all-mighty root" concept solves it easily.
As with thread capabilities, there are several file capability sets. Perhaps the most important one is the permitted set. Capabilities in this set are automatically granted when you execute a file, even if they aren't in the inheritable set of the thread doing the execve(2) call. So, if an executable file has CAP_KILL attached, the process will be able to send signals to arbitrary processes, even if it doesn't run as root. Note that adding a capability to the file's permitted set isn't enough. You should also set a so-called "effective bit" in the file's capabilities. This bit makes permitted capabilities effective, that is, raised in the effective capability set after execve(2).
Files also have an inheritable capability set, which is ANDed with the thread's inheritable capabilities at execve(2) time. This is a way of saying "a thread executing this code should never be granted CAP_X." If you know the program is going to adjust the system clock and nothing else, limiting the file's inheritable capability set to CAP_SYS_TIME would mean dropping any other capability a thread may have gained.
If a process calling execve(2) runs as root or the binary itself is setuid-root and has no capabilities attached (Figure 1), both permitted and inheritable file sets are assumed to be all ones (remember they are really just bitmaps). That's how the kernel preserves an all-mighty root illusion in 2017.
If the previous text was too verbose for you, the capabilities(7) man page neatly summarizes the rules in just four formulas. Think of a process doing execve(2), and let P(something) be the capabilities in the respective thread set and F(something) those in the respective file set. Then, the new capabilities, P'(something), are defined as:
P'(ambient) = (file is privileged) ? 0 : P(ambient)
If the file is setuid/setgid-root or has capabilities attached, ambient capabilities are cleared.
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & cap_bset) | P'(ambient)
This one is trickier. Thread inheritable capabilities are put in the permitted set if the file's inheritable capabilities don't disable them. Then, the file's permitted capabilities are dropped into the mix, subject to the capability bounding set (see the man page [2] for details). Finally, ambient capabilities are added for non-privileged processes.
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
If the file's effective bit is set, permitted capabilities become effective ones. Note that they can never be stricter than F(permitted), cap_bset aside. Otherwise, only ambient capabilities are in effect.
Inheritable capabilities remain unchanged during execve(2): P'(inheritable) = P(inheritable). If you are interested in the (somewhat mind-bending) implementation details, refer to [4] (see also Figure 2).
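The four formulas translate almost directly into code. The sketch below works on integer bitmaps and is only a model of the rules as quoted above; the function name execve_caps and its parameters are invented for illustration, and the special treatment of root (all-ones file sets) is not modeled.

```python
def execve_caps(p_inheritable, p_ambient, cap_bset,
                f_permitted=0, f_inheritable=0, f_effective=False,
                file_is_privileged=False):
    """Compute a thread's capability sets after execve(2).

    All sets are integer bitmaps. file_is_privileged means the file is
    setuid/setgid or has capabilities attached.
    """
    # P'(ambient) = (file is privileged) ? 0 : P(ambient)
    new_ambient = 0 if file_is_privileged else p_ambient
    # P'(permitted) = (P(inh) & F(inh)) | (F(permitted) & cap_bset) | P'(ambient)
    new_permitted = ((p_inheritable & f_inheritable)
                     | (f_permitted & cap_bset)
                     | new_ambient)
    # P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
    new_effective = new_permitted if f_effective else new_ambient
    return {
        "ambient": new_ambient,
        "permitted": new_permitted,
        "effective": new_effective,
        "inheritable": p_inheritable,  # unchanged by execve(2)
    }


if __name__ == "__main__":
    CAP_NET_RAW = 1 << 13
    ALL_CAPS = (1 << 38) - 1  # bits 0..37
    # A ping-like binary: CAP_NET_RAW permitted, effective bit set.
    result = execve_caps(0, 0, ALL_CAPS,
                         f_permitted=CAP_NET_RAW, f_effective=True,
                         file_is_privileged=True)
    print(hex(result["effective"]))  # 0x2000
```

Calling it with a thread that holds CAP_NET_RAW only in its inheritable set, and a file with no capabilities, yields an empty permitted set, which matches the earlier point that inheritable capabilities alone are ignored.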