Zack's Kernel News
Zack's Kernel News
Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.
System Call Organization
Andy Lutomirski pointed out that Linux system calls were insane. In particular, he said it was virtually impossible to code for them in an architecture-independent way. There was no clean way, he said, to map between system call names and their numbers, to map between system call argument registers and logical system call arguments, or to identify which architectures implemented which system call.
The current source tree took care of all of these things on a per-architecture basis that was very messy. Various tools like strace, libseccomp, and glibc had various approaches that worked but that were also messy.
Andy said, "I'd like to see a master list in the kernel that lists, for every syscall, the name, the number for each architecture that implements it (using the AUDIT_ARCH semantics, probably), and the signature. The build process could parse this table to replace the current per-arch mess." He added, "More importantly, we could add a library in tools that exposes this information to userspace."
Eric Paris loved this idea and pointed out that, currently, userspace tools invariably had to implement their own messy versions of this themselves, which was awful. He was all in favor of an in-tree readable list.
Paul Moore also remarked that libseccomp's implementation was horrible and unmaintainable. He said, "I would be very happy if the kernel had some header/file/whatever that could be used by userspace applications to map syscall names/numbers for each architecture."
He said Andy had actually submitted libseccomp patches to him a while back that he had not yet applied because of how tricky the system call handling code was. He added, "Improvements here could make life much easier for us and remove a lot of my hesitation in merging Andy's code."
H. Peter Anvin replied to Andy's original email, saying "I have brought that up a lot of times, originally dating back from my work on klibc. I have tried to keep the klibc syscall list in a sane format with architecture annotations, but it doesn't contain all the syscalls in the system." He added that "making it encompass everything the kernel exports would be highly useful, but it would take a lot of work."
David Howells also liked Andy's idea and said "It would be really nice if we could do this in such a manner that we could build strace from it."
Andy and H. Peter started talking about the best way to go about gathering the necessary information, and Michael Kerrisk remarked that the man pages project would be very interested in the outcome of all this.
This all seems pretty cool. Even the few tools that handled system call information properly were clamoring to get away from it. And now it seems as though instead of an unreadable, unnavigable wad of intractability, there may soon be a clean, easily read, easily used, user-space-accessible table that solves the whole problem.
Khalid Aziz from Oracle introduced a controversial issue recently. He wanted to add some real-time-ish code to deal with a particular problem that database folks had been seeing. In his scenario, "a thread grabs the mutex, runs out of its timeslice and gets switched out which then causes another thread to run which tries to grab the same mutex, spins for a while and finally gives up. This can happen with multiple threads until [the] original lock owner gets the CPU again and can complete executing its critical section."
Khalid knew that other operating systems addressed this problem by allowing threads to request an additional timeslice. This would give them time to give up their lock and avoid holding up the whole queue. He estimated that a similar feature in Linux could speed up certain database workflows by 3-5 percent.
He posted a patch based on some earlier work by Venkatesh Pallipadi to create a
/proc/<tgid>/task/<tid>/sched_preempt_delay file for each running process. If a thread wrote a value of
1 to their file, they'd get all available CPU time until they subsequently wrote a
0 to it. Because he was dealing with tiny fractions of a second, Khalid also added code to memory-map those files, to make writing to them an extremely quick operation.
Davidlohr Bueso didn't object to the feature itself, but he felt this was something that should go into one of Linux's real-time variants, rather than the official tree. He worried that some software might abuse the ability to snarf more cycles, and he thought a voluntary preemption model might solve Khalid's problem better.
Khalid replied that for this particular workload, the database folks didn't need to run in full real time, nor could they expect to have total control of a particular server running their software. There was just this one little moment where a slight tweak could save a lot of time. As far as the potential for abuse, Khalid added, "I have thought about allowing sysadmins to lock this functionality down some but that does add more complexity." He said he'd be willing to do it if the kernel folks wanted him to.
Oleg Nesterov was also skeptical about Khalid's patch, but he had no initial comment on the actual scheduling changes. Instead, he noticed some problems with the way Khalid memory mapped his proc files. So, he made some suggestions on how to improve that.
Al Viro also noticed race conditions in Khalid's memory mapping code, which Khalid promised to fix. Alan Cox also pointed out race conditions in the same area of code.
H. Peter Anvin had some comments on the actual scheduling parts of Khalid's patch. He agreed with the possibility of abuse, saying it would allow user space to transform the kernel's normal preemptive multitasking approach (in which the kernel simply decided when to reschedule a process) into a cooperative multitasking approach (in which user software told the kernel when it was ready to give up control).
Aside from that, H. Peter pointed out a couple of other practical issues. For one thing, a process would need to know whether the kernel would have preempted it before it called
sched_yield() to give control to another process. Otherwise, it might end up taking less than its full timeslice. So, H. Peter said that the scheduler would have to set an additional flag with that information when allowing a thread to claim additional timeslices.
He also remarked that once a process stopped claiming timeslices, if it did not also then call
sched_yield() to give away the time it no longer needed immediately, then the kernel should penalize it somehow.
Khalid agreed with both of these ideas and suggested a couple of penalties he might impose on processes that failed to call
sched_yield() when they were supposed to. Essentially, the penalties involved losing or delaying the process's next timeslice.
Alan Cox also suggested a way for Khalid to avoid adding the whole scheduling feature. He suggested using control groups to cluster a set of processes together and to implement cooperative preemption just within that group – in other words, he said, "they only pre-empt each other by yielding but they can be pre-empted by other tasks outside the group."
Khalid replied, "I had suggested that to database folks initially, but they use mutexes that are shared across processes and they do not believe they can control the environment in [a] customer scenario where they can ensure right tasks will always be in the right control group. Besides they want to use a common mechanism across multiple OSs and pre-emption delay is already in use on other OSs."
Thomas Gleixner retched into his hand at the whole idea of Khalid's patch. He said that just because other OSs jumped off a cliff, it was no reason for Linux to follow them down. He said, "In fact its a horrible idea. What you are creating is a crystal ball based form of time bound priority ceiling with the worst user space interface I've ever seen."
Andi Kleen remarked that Thomas was being a nay-sayer and asked him to propose some kind of solution that would work better than what Khalid had implemented. Andi said that a lot of workloads were affected by this problem and that "the workarounds I've seen for it are generally far worse than this. Often people do all kinds of fragile tunings to address this, which then break on the next kernel update that does even a minor scheduler change." He added, "The real time scheduler is also [a] really poor fit for these workloads and needs a lot of hacks to scale."
Andi, Oleg, and others started to consider what alternatives might be available to Khalid's approach. Part of the problem was that there was very little time available for any code to run. Context switching happened tons of times per second, and any user code that affected context switching had to run entirely within one of those slices.
Also, any solution that involved invoking a system call would already add enough overhead to completely outweigh the benefits of Khalid's patch.
There's a strong incentive to avoid giving user code the ability to claim additional CPU cycles for itself. You don't want to have to trust user code. On the other hand, allowing user code to yield CPU cycles to another process is perfectly fine. User code can already yield all cycles by simply not running. Once it runs, there's no problem if it wants to yield any of its time to other processes.
One of Oleg's suggestions involved this idea – instead of having the process holding the lock claim more CPU to give it time to release the lock, why not have other processes yield their time back to whichever process held the lock? That way, there'd be no queue of processes wasting their timeslices spinning for the lock, and the original process could have as much time as it needed to release the lock itself.
Khalid rejected this, saying, "by the time a thread finds out it can not acquire the lock because someone else has it, we have already paid the price of context switch. What I am trying to do is to avoid that cost."
David Lang replied, "Yes, you have paid the cost of the context switch, but your original problem description talked about having multiple other threads trying to get the lock, then spinning trying to get the lock (wasting time if the process holding it is asleep, but not if it's running on another core) and causing a long delay before the process holding the lock gets a chance to run again. Having the threads immediately yield to the process that has the lock reduces this down to two context switches, which isn't perfect, but it's a LOT better than what you started from."
But Khalid replied that the next thread vying for the lock might be on a different CPU. It wouldn't be able to yield back to the first thread, because the first thread had already been preempted by something else on that CPU. Meanwhile, the first thread couldn't rely on subsequent threads on that same CPU knowing enough to yield their time back to it – the first thread could have been preempted by anything else running on the system.
This brought Khalid back to the conclusion that the thread should be able to prevent itself from being preempted, just long enough to release its lock.
David replied that actually, even from the other CPU, the next thread could yield back to the thread that held the lock. That thread would just have to wait until the next timeslice opened up, but it wouldn't have to wait for the whole scheduling queue to roll around. It would only pay the penalty of a couple of context switches.
Khalid felt that his solution was still faster than what David and Oleg were talking about, so he continued to prefer it.
At a deeper level, Peter Zijlstra said, "The thing is; unless userspace is a RT program or practises the same discipline to such an extent that it make no practical difference, there's always going to be the case where you fail to cover the entire critical section, at which point you're back to your pile-up fail. So while the limited preemption guard helps the best case, it doesn't help the worst case at all." Khalid agreed with this assessment, but he said that his first priority was to avoid the pile-up when possible and to deal efficiently with pile-ups that he couldn't avoid.
Thomas also said to Khalid:
We understand that you want to avoid preemption in the first place and not getting into the contention handling case.
But, what you're trying to do is essentially creating an ABI which we have to support and maintain forever. And that definitely is worth a few serious questions.
Let's ignore the mm related issues for now as those can be solved. That's the least of my worries.
Right now you are using this for a single use case with a well-defined environment, where all related threads reside in the same scheduling class (FAIR). But that's one of a gazillion use cases of Linux.
If we allow you to special case your database workload then we have no argument why we should not do the same thing for realtime workloads where the SCHED_FAIR housekeeping thread can hold a lock shortly to access some important data in the SCHED_FIFO realtime computation thread. Of course the RT people want to avoid the lock contention as much as you do, just for different reasons.
Add SCHED_EDF, cgroups and hierarchical scheduling to the picture, and hell breaks loose.
David also pointed out that Khalid had originally described his fix as producing a 3-5 percent improvement. If something like his and Oleg's suggestion could produce a similar improvement, he said, it might outweigh the drawbacks of giving userspace code control over its own preemption.
Khalid agreed with Thomas that a new ABI (application binary interface) should not be created lightly and that he himself would argue against creating one to satisfy just a single use case. But, he said that his code solved problems for a very highly used set of database tools. It wasn't a special case for an uncommon workload; it was a general case for a common workload.
Somewhere in the course of discussion, Khalid posted a new version of his patch, incorporating some of the suggestions he'd received. The new version no longer relied on memory mapping files in the proc directory, and instead used a futex-type communication between the kernel and user space. It also added a flag to let the user process know if it had gained a new timeslice, and it penalized processes that failed to yield after they no longer needed additional timeslices.
Andrew Morton commented at this point, saying "The feature makes sense and sounds useful, but I'd like to see some real world performance testing results so we can understand what the benefit will be to our users?"
Various folks made technical suggestions regarding the new version of the patch, but it still wasn't universally popular. Eric W. Biederman in particular thought it was of the Devil, and said, "You allow any task to extend it's timeslice. Which means I will get the question why does really_important_job only miss its latency guarantees when running on the same box as sched_preempt_using_job?" He added, "Your change appears to have extremely difficult to debug non-local effects."
Meanwhile, Davidlohr jumped up and clicked his heels at the prospect of Khalid's patch going into the kernel. He said, "Good timing! The topic came up just yesterday in LSF/MM. This functionality is on the wish list for both Facebook and Postgres."
The discussion ended around there. There seem to be two clusters of opinions on the issue: There are the actual users, hoping and praying for this exact feature, and there are the kernel old-timers who say that this is a can of worms that should remain forever unopened.
I suspect that the "we shouldn't create a crazy ABI" argument, the "this creates undebuggable problems" argument, and the "just yield back to the lock holder" argument may ultimately tilt the balance against the patch. But, it's never clear. At the very least, Khalid seems to be getting plenty of solid technical suggestions – in some cases, from the very folks who think his whole approach is wrong. So, one way or another, the patch itself is likely to become acceptable, if the idea behind it does as well.
The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
Buy this article as PDF
News site for the openSUSE community falls victim to a Wordpress exploit.
The source code is available online.
One out of three virtual machines on Microsoft Azure Cloud run Linux.
The form factor of the board makes it a drop-in replacement for Raspberry Pi.
Makes it easier for customers to move workloads into container-centric applications.
SUSE’s answer to container-centric operating systems.
Linux 4.9 is the biggest release in terms of number of commits.
The latest version of the official RHEL clone is here.
New release targets Linux professionals.
The Fedora project adds Wayland and Gnome 3.22