Zack's Kernel News

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

cpusets, isolcpus, cgroups, Oh My!

Christopher Lameter wanted to restrict new process threads to just a subset of the CPUs installed on a system. This approach would allow real-time software to monopolize certain CPUs and, thus, guarantee their low latency requirements. He posted a patch to implement this at a very early stage of the boot cycle, so that the init process wouldn't start up user daemons on the wrong CPUs.
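As a rough userspace illustration of the underlying idea (this is not Christopher's patch, and it is not the isolcpus or cpusets mechanism either), the per-process CPU affinity mask can confine a process and anything it later spawns to a subset of CPUs. The sketch below assumes Linux and Python's os.sched_setaffinity(); the helper name confine_to_cpus and the choice of "lowest allowed CPU" are made up for the example.

```python
import os


def confine_to_cpus(cpus):
    """Restrict the caller to `cpus` (a set of CPU numbers).

    Hypothetical helper for illustration. Returns the resulting affinity
    mask, or None on platforms without sched_setaffinity (non-Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # pid 0 refers to the caller; processes and threads it creates
    # afterwards inherit this mask.
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)
    housekeeping = {min(allowed)}  # assumption: keep only the lowest allowed CPU
    print("before:", sorted(allowed))
    print("after: ", sorted(confine_to_cpus(housekeeping)))
```

Because the mask is inherited, setting it in an early parent process affects everything that parent starts later, which is why the point in the boot cycle at which the restriction takes effect matters so much in this discussion.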

Gilad Ben-Yossef was very excited about this, but he didn't like the sheer number of parameters that users needed to set in order to use the feature. He suggested that folding Christopher's feature into the existing isolcpus feature might make this simpler. The isolcpus feature is a mechanism for removing specified CPUs from the scheduler's awareness and letting them run in isolation.

But, as Mike Galbraith pointed out, the isolcpus feature was going to be taken out of the kernel at some point in favor of the existing cpusets feature. Like isolcpus, cpusets lets the user hew off sets of CPUs (and associated regions of memory), so they can run in isolation from the rest of the system.

Gilad was in favor of cpusets taking over from isolcpus because he thought the cpusets feature was more elegant. He pointed out, however, that cpusets couldn't quite do everything Christopher's patch attempted, such as migrating a running timer from one CPU to another. He wouldn't mind seeing Christopher's code folded into cpusets instead of isolcpus, as long as something allowed the kind of workloads he was interested in.

One thing isolcpus could do that cpusets couldn't – and that Christopher's code tried to handle – was come into effect earlier in the boot cycle. Thus, Gilad also suggested reimplementing isolcpus so that all it did was configure cpusets earlier in the boot process.

Mike liked this idea, but Christopher said, "isolcpus is broken as far as I can tell. Let's lay it to rest and come up with a sane way to configure these things. Autoconfig if possible."

Christopher also mentioned that cpusets was going to be replaced by cgroups, a Google project to implement generic resource limitation and isolation.

The discussion veered off at this point to a consideration of how best to migrate threads off of a CPU. For example, what if you identify all the threads to migrate, and then one of them spawns a new thread before you can migrate it off the CPU? If a clean kernel solution exists, would it still be better to do it more messily in userspace? What are the various race conditions that might arise? And, how can the kernel support people in the finance industry who want as close to raw hardware access as possible?
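The spawn-before-migrate race can be made concrete with a userspace sketch. It assumes Linux, a readable /proc/<pid>/task directory, and permission to change the affinity of the target's threads; the function name and the max_passes limit are hypothetical, and this is not code from the thread. A thread created between the directory scan and the affinity call inherits its parent's old mask, so a single pass is not enough.

```python
import os


def migrate_all_threads(pid, target_cpus, max_passes=10):
    """Move every thread of `pid` onto `target_cpus`, rescanning for late arrivals.

    Hypothetical helper. Returns the number of passes that were needed.
    """
    target = set(target_cpus)
    for attempt in range(1, max_passes + 1):
        moved = 0
        for name in os.listdir(f"/proc/{pid}/task"):
            tid = int(name)
            try:
                if os.sched_getaffinity(tid) != target:
                    os.sched_setaffinity(tid, target)
                    moved += 1
            except ProcessLookupError:
                continue  # thread exited between listing and migration
        if moved == 0:
            return attempt  # this pass found nothing left on the old mask
    raise RuntimeError("threads kept appearing; gave up")
```

Rescanning narrows the window rather than closing it: the directory listing is not an atomic snapshot, and a thread can still change its own mask afterwards. That gap is one reason the participants asked whether a clean in-kernel solution would be preferable.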

No definitive conclusions were reached on the mailing list – just a very fun discussion and debate. It seems clear, however, that there are currently several different attempts to do the same thing in the kernel; some will go away, while one will (hopefully) gather together the best ideas of the rest.

Developer Backups

Linus's hard drive crashed suddenly one day, costing him some time and a few days of work. He planned to try to recover some of the data, but he said, "If worst comes to worst, I'll just do the last next days of the merge window on the laptop that I was planning on finishing it off with anyway, since I have travel coming up. At least this didn't happen at the very beginning of the merge window."

John Stoffel suggested using two drives and mirroring them. This way, if either of them died, the other would allow a quick recovery. Linus replied, "I long ago gave up on doing backups. I have actively moved to a model where I use replaceable machines instead. I've got the stuff I care about generally on a couple of different machines, and then keys etc backed up on a separate encrypted USB key."

H. Peter Anvin remarked, "I won't get any stationary machines without mirrored drives anymore. Storage just isn't reliable enough." John also said, "And I won't trust a single USB thumb drive to hold my most important stuff. And how do you hold onto family pictures and such? It's amazing how much crap can accumulate, but also how important it can be to have good backups that are remote. If the house burns down, don't matter how many machines the stuff is spread across if it's not local."

That's it. No kernel discussion, just some ideas about backing up one's system.

H8/300 Architecture Going Away

Guenter Roeck suggested that "H8/300 has been dead for several years, the kernel for it has not compiled for ages, and recent versions of gcc for it are broken. It is time to drop support for it."

He posted a patch to delete all associated files, but he also acknowledged that "it is not that simple to drop an architecture, and it may need some discussion, but someone has to put a stake into the ground. Keeping a virtually dead architecture on life support takes resources which are better spent elsewhere."

Greg Kroah-Hartman gave his blessings to this patch, remarking, "If this doesn't build, and no one is using it anymore, I agree, it should be removed. If someone wants to revive it, 'git revert' works just fine."

Joe Perches added the maintainer Yoshinori Sato to the CC list, just in case there was a plan or a hope to revive the architecture at some point. Guenter mentioned that he'd done that as well but possibly had misspelled the email address.

David S. Miller also agreed with Greg and Guenter – the code should go. He gave his "Acked-by" alongside Greg's. Wim Van Sebroeck gave his "Acked-by" as well.

In terms of scheduling, Guenter said: "My plan is create a branch on my repository on kernel.org, ask Stephen to add it to linux-next, and then ask Linus to pull it after one release cycle (ie for 3.13). This should give people enough time to find out about it and complain, and give everyone else enough time to find any missing pieces."

Linus Torvalds said: "I'm ok with code deletion patches, I don't think that would be a problem. I didn't check them, but I assume this is all literally just removing code that is conditional on h8/300 config options?"

Geert Uytterhoeven suggested waiting and hearing from Yoshinori. Geert said Yoshinori had planned to come to the Kernel Summit, so there would be a chance to discuss it directly.

It's very unusual to drop a whole architecture from the kernel. Even in this case, where it wouldn't even compile and had been broken for a long time, there was a reluctance to remove the code without making every effort to confirm that the code truly was dead.

The True Meaning of EXPORT_SYMBOL_GPL

Richard Yao asked why the LZ4 code, which was clearly under a BSD license, had its symbols exported with EXPORT_SYMBOL_GPL.

Matthew Garrett replied, "EXPORT_SYMBOL_GPL is intended [as] an indication that using a symbol is likely to result in you producing a derived work of the kernel, and the kernel as a whole is under the GPL. It has nothing to do with additional licenses that individual pieces of code may be available under."

Joe Perches gave a link to an interesting summary of the debate over this issue [1]. Rob Landley thought that the whole discussion was legally dangerous. He said that EXPORT_SYMBOL_GPL was intended to indicate when the use of kernel source code constituted a derived work (and thus when any distribution of that work would need to be licensed under the GPL). He didn't want to give the anti-GPL lawyers any ammunition to claim that anything other than that might be the case.

Joe, however, replied that Matthew's "declarative statement that EXPORT_SYMBOL_GPL is 'intended [as] an indication that using [the] symbol is likely to result … .' is incomplete. There are competing histories as to what EXPORT_SYMBOL_GPL was intended to do."

Matthew replied that the history and meaning of EXPORT_SYMBOL_GPL weren't really in doubt. And there, this time, the discussion ended.

Atomic Renames

Miklos Szeredi posted a patch to add a new system call, rename2(), to go alongside the rename() system call. The rename() call changes the name of a file. The rename2() call takes two files and swaps their names, so the first ends up with the name of the second, and the second ends up with the name of the first. Eventually, Miklos would like to see the behavior of rename2() folded into that of rename(), but he didn't want to do that at the start because it would make the patch balloon up in size.

About his patch, he said: "This allows interesting things, which were not possible before, for example atomically replacing a directory tree with a symlink."

He added, "The other reason to introduce this is for whiteout handling in union/overlay solutions in an atomic manner without having to add complex code to each filesystem's rmdir, mkdir and rename just for handling whiteouts." A whiteout is when you have a union filesystem and want to delete a file that resides on a read-only filesystem. When you delete the file under those circumstances, the union filesystem uses a whiteout to make it look to the user as though the file has really been deleted.
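The whiteout idea can be shown with a toy model. It represents a union mount as two Python dictionaries, a read-only lower layer and a writable upper layer; the class and method names are invented for this sketch and do not correspond to any real filesystem's code.

```python
WHITEOUT = object()  # sentinel marking "deleted in the union view"


class UnionDir:
    """Toy union mount: a writable upper layer over a read-only lower one."""

    def __init__(self, lower):
        self.lower = dict(lower)  # read-only layer: never modified after this
        self.upper = {}           # writable layer: new files and whiteouts

    def lookup(self, name):
        # The upper layer wins; a whiteout there hides the lower file.
        if name in self.upper:
            entry = self.upper[name]
            return None if entry is WHITEOUT else entry
        return self.lower.get(name)

    def write(self, name, data):
        self.upper[name] = data

    def unlink(self, name):
        if self.lookup(name) is None:
            raise FileNotFoundError(name)
        if name in self.lower:
            self.upper[name] = WHITEOUT  # cannot touch lower, so mask it
        else:
            del self.upper[name]

    def listdir(self):
        names = set(self.lower) | set(self.upper)
        return sorted(n for n in names if self.lookup(n) is not None)


u = UnionDir({"passwd": "root:x:0:0", "motd": "hello"})
u.unlink("motd")
print(u.listdir())        # ['passwd']
print("motd" in u.lower)  # True: the read-only copy is untouched
```

In a real filesystem, creating or removing the whiteout has to happen together with the directory operation that made it necessary, which is where an atomic exchange of two names becomes useful.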

He pointed out that, although most whiteout cases would be solved with rename2(), there were still cases that wouldn't, and he mentioned that a new flag would be needed eventually to handle those remaining cases.

H. Peter Anvin suggested that instead of a simple A-to-B and B-to-A name exchange, Miklos should use a more complicated rename3() solution that would rename A to B, but then if B already existed it would rename B to C. Peter pointed out that this would encompass Miklos's original behavior, because a call such as rename3(A, B, A) would accomplish the same file swap Miklos had envisioned.

Linus Torvalds stepped in at that point, saying that, actually, Miklos had already implemented that three-name version, but that it had been much more complicated and didn't fit as well with the rest of the API. Miklos's rename2() was the simpler and preferable version, Linus said. He added, "I was actually very relieved to see this much simpler and cleaner model, because the alternative really was nasty. This one looks fairly simple and clean and straightforward."

Andy Lutomirski suggested adding a flag to the new system call that would prevent a rename if the destination filename already existed on the filesystem. Linus agreed that this would be good, and the thread ended there.
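The semantics discussed in this thread can be compared in a small model that treats a directory as a dictionary from names to file contents. This is only a sketch of the naming behavior: it says nothing about atomicity, which is the actual point of doing this in the kernel, and the function signatures (including the noreplace keyword) are assumptions made for the example, not the interface from Miklos's patch.

```python
def rename(d, old, new):
    """Classic rename(): `old` takes the name `new`, replacing any existing target."""
    d[new] = d.pop(old)


def rename2(d, a, b, noreplace=False):
    """Exchange the names of `a` and `b`, as the article describes rename2().

    `noreplace` is a hypothetical spelling of the flag Andy suggested:
    move `a` to `b` only if `b` does not exist yet.
    """
    if noreplace:
        if b in d:
            raise FileExistsError(b)
        d[b] = d.pop(a)
        return
    d[a], d[b] = d[b], d[a]


def rename3(d, a, b, c):
    """Peter's variant: rename `a` to `b`; if `b` existed, it is renamed to `c`."""
    moved = d.pop(a)
    if b in d:
        d[c] = d.pop(b)
    d[b] = moved


x = {"A": "first file", "B": "second file"}
y = dict(x)
rename2(x, "A", "B")
rename3(y, "A", "B", "A")
print(x == y)  # True: the three-name form with c == a is a plain swap
```

The model shows why the three-name form is a superset of the exchange, and also why it has more cases to get right: it must handle the target existing, not existing, and the third name colliding with something else.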

Speeding Up AMD Kernels

Austin Hemmelgarn posted a patch to optimize a bunch of AMD CPUs. He said, "These provide noticeable improvements over the K8 config option, and allow the kernel to take full advantage of AMD specific instruction set extensions, such as ABM, LZCNT, and POPCNT."

Borislav Petkov replied that "A patch like that keeps popping up every couple of months. Please show us those noticeable improvements because the guy last time failed to do so." Borislav added that distribution kernels always shipped with only generic CPU support, so Austin's patch, at best, would help a very small subset of users who downloaded and compiled their own kernels.

Austin asked why it was so important to show measurable speedup, if the worst-case scenario was that almost no one would use the configuration option. However, he agreed to run the tests. Borislav replied, "Just having the option for no good reason at all is a no-no." A couple of posts later, he added, "If it doesn't bring any performance improvement – and I don't want to rain on your parade but I think it won't, at least not enough to warrant a serious look – there's absolutely no reason to add it."

Austin ran a few tests using his PILEDRIVER config option and reported that "build jobs appear to be much improved. Building kernel 3.12-rc2 with allmodconfig using 8 jobs on a FX-8320 takes 22 minutes and 57 seconds on a kernel with CONFIG_MK8, 21 minutes and 35 seconds on a kernel with CONFIG_GENERIC, and 19 minutes and 11 seconds on a kernel with CONFIG_PILEDRIVER. I see similar results for a build of GCC 4.7 (45m1s, 41m39s, and 38m56s)."

He added, "I don't know about you, but that sure seems to be a worthwhile performance increase to me." Linus Torvalds replied, "That's certainly noticeable. Surprisingly so."
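To put the times Austin reported on a common scale, the short calculation below converts them to seconds and expresses the PILEDRIVER result as a percentage reduction against each of the other two configurations. Only the timings come from his message; the helper names are made up for this example.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


def percent_faster(baseline, candidate):
    """Relative reduction in wall-clock time, in percent of the baseline."""
    return 100.0 * (baseline - candidate) / baseline


# Wall-clock times as quoted: (CONFIG_MK8, CONFIG_GENERIC, CONFIG_PILEDRIVER)
runs = {
    "kernel 3.12-rc2 allmodconfig": [(22, 57), (21, 35), (19, 11)],
    "GCC 4.7": [(45, 1), (41, 39), (38, 56)],
}

for workload, times in runs.items():
    mk8, generic, piledriver = (to_seconds(m, s) for m, s in times)
    print(f"{workload}: {percent_faster(mk8, piledriver):.1f}% vs MK8, "
          f"{percent_faster(generic, piledriver):.1f}% vs GENERIC")
```

For the kernel build this works out to roughly 16 percent against MK8 and 11 percent against GENERIC; for the GCC build, roughly 14 and 7 percent. These are single runs as reported, with no indication of run-to-run variance.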

Linus asked if Austin had run any kernel profiling, to see exactly where the speedup occurred. Borislav also ran his own tests using Austin's PILEDRIVER option. He reported, "I don't really see any of those improvements above. Actually, -march=bdver2 is even slightly worse in comparison to mk8. And the workload is of building a config specific to that machine but allmodconfig looks very similar, the numbers being simply higher."

Austin remarked, "Part of the difference between our results may be that I have my entire userspace built with -mtune=bdver2, so less of the time is spent in userspace." He added, "With regards to the differences shown above relative to CONFIG_MK8, that does actually make sense; with CONFIG_MK8, gcc makes very minimal use of extension instructions (afaik, only MMX, SSE, and 3Dnow!), this improves performance slightly on bulldozer derivatives because there are only half as many SSE and FP units as CPU cores."

Borislav replied, "That still cannot explain the huge difference between building on a mk8 vs bdver2 kernel. Provided your userspace is the same and only the kernels are different, I don't see how that happens."

The discussion ended around there with no conclusive performance improvement shown for Austin's patch, and no conclusive disproof either. To get into the kernel, however, proof of value would be the deciding factor.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.