Zack's Kernel News

Article from Issue 221/2019

Considering Plan 9 extensions for type conversion, supporting heterogeneous systems, and optimizing CPU idle states.

Considering Plan 9 Extensions for Type Conversion

For a long time, the Linux kernel would only compile with the GNU C Compiler (GCC). Now, several compilers can do it, but each compiler has its own way of doing things, offering various extensions to the C language and optimizing code in different ways. The question of which compiler features to depend on can have an effect on whether other compilers can keep supporting Linux.

Recently, Matthew Wilcox suggested using the -fplan9-extensions GCC option to handle some implicit type conversions. This way, a particular cyclic memory allocation could be made to embed a needed reference instead of requiring it to be passed explicitly to the function. If the code used the Plan 9 extensions, the functions would not need to be tweaked to accept the additional input.

However, as Nick Desaulniers pointed out, other compilers might not support that particular extension. Even if Matthew successfully argued in favor of it, Nick suggested making it optional, so that other compilers could continue to support the kernel.

Linus Torvalds had even stronger reservations. He said:

The full Plan 9 extensions are nasty and make it much too easy to write "convenient" code that is really hard to read as an outsider because of how the types are silently converted. And I think what you want is explicitly that silent conversion. So no. Don't do it. Use a macro or inline function that makes the conversion explicit so that it's shown when grepping.

Linus also added, "We've used various GCC extensions since day #1 ('inline' being perhaps the biggest one that took forever to become standard C), but these things need to have very strong arguments."
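To illustrate the trade-off under discussion, here is a hedged sketch in plain C. The struct and function names are invented for illustration and are not taken from Matthew's patches. Roughly speaking, GCC's -fplan9-extensions allows a pointer to a structure with an embedded member to be passed silently where a pointer to the member's type is expected; Linus's preferred alternative is an explicit, greppable helper like the one below, which works in standard C on any compiler.

```c
/* Hypothetical example: an allocation context embedded in a larger
 * device structure.  These names are illustrative, not kernel code. */
struct mem_ctx {
    int node;
};

struct my_device {
    struct mem_ctx ctx;   /* embedded member */
    const char *name;
};

/* With -fplan9-extensions, GCC would (roughly) let a struct my_device *
 * be used where a struct mem_ctx * is expected, converting silently.
 * Linus's suggested alternative: an explicit conversion helper that
 * shows up when grepping and works with every compiler. */
static inline struct mem_ctx *my_device_ctx(struct my_device *dev)
{
    return &dev->ctx;
}

/* A function that only needs the embedded context. */
static int ctx_node(struct mem_ctx *ctx)
{
    return ctx->node;
}
```

The explicit helper costs one extra call site per conversion, which is the "slight inconvenience" referred to below.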

So that was that. The kernel will not use the Plan 9 extensions. On the other hand, Matthew probably won't suffer too much implementing his cyclic memory allocations, because modifying the necessary functions will only be a slight inconvenience. The interesting question is when Linus would say it's legit to use a particular extension. Judging from this conversation, it seems like the main justification for an extension is if the feature truly belongs in the standard C language. I'd love to see that sort of debate play out at some point.

Supporting Heterogeneous Systems

As hardware systems become more complex, operating systems need to accommodate a much wider variety of configurations. It used to be that system memory was all the same type. Then came systems like cell phones, which were built with both fast and slow RAM. Now a system can host a wide range of memory types and many different buses for sending data between them.

Among the many people working on supporting all this, Jérôme Glisse wanted to take the bull by the horns and come up with an overarching memory and bus management system to enable the kernel to operate at maximum efficiency on any hardware configuration.

One of his main goals was first to find a way simply to express the configuration of a given system. Only once it could be referred to within the kernel could the strange mix of different RAMs and bus topologies begin to be supported in a methodical way. Even the CPU had to come into play, given that graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) were now taking on more and more computational tasks.

To express all these things, he defined "targets," which were any memory on the system; "initiators," which were CPUs; "links," which were fast connections between targets and initiators; and "bridges," which were remote connections to other groups of initiators and targets.
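As a toy model of that vocabulary, a graph of targets, initiators, links, and bridges might be sketched as below. All names, fields, and the bandwidth metric are my own illustrative assumptions, not the actual data structures from Jérôme's patches.

```c
/* Toy sketch of the targets/initiators/links/bridges vocabulary.
 * Everything here is illustrative, not the real patch's structures. */
enum node_kind { TARGET, INITIATOR };

struct hms_node {
    enum node_kind kind;
    const char *name;      /* e.g. "DDR4", "CPU0", "GPU0" */
};

struct hms_link {           /* fast, direct connection */
    int initiator, target;  /* indices into a node array */
    unsigned bandwidth;     /* illustrative metric, say MB/s */
};

struct hms_bridge {         /* remote connection to another group */
    int from_group, to_group;
};

/* Pick the highest-bandwidth link for a given initiator; returns the
 * target index, or -1 if the initiator has no links at all. */
static int fastest_target(const struct hms_link *links, int nlinks,
                          int initiator)
{
    int best = -1;
    unsigned best_bw = 0;
    for (int i = 0; i < nlinks; i++) {
        if (links[i].initiator == initiator &&
            links[i].bandwidth >= best_bw) {
            best_bw = links[i].bandwidth;
            best = links[i].target;
        }
    }
    return best;
}
```

The point of such a graph is exactly what the article describes next: once the topology can be expressed, it can be exposed to user space and reasoned about methodically.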

With these identifiers, the entire graph of a given system could be exposed to user space via the sysfs directory. This was Jérôme's first goal.

But his patches went beyond identifying the hardware graph to expressing actual memory policy for the running system. This would be used to decide which parts of memory should host which processes. In general, one would presumably always prefer to use fast memory when it's available, and slower memory when no fast memory is free.
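The stated policy (prefer fast memory when available, fall back to slower memory otherwise) can be sketched in a few lines. This is only an illustration of the general idea; the structure and function names are invented, not part of Jérôme's API.

```c
/* Illustrative-only sketch of the policy described above: prefer the
 * fastest memory target that still has free pages. */
struct mem_target {
    unsigned speed;       /* higher = faster; an invented metric */
    unsigned free_pages;
};

/* Return the index of the fastest target with free capacity, or -1 if
 * every target is full. */
static int pick_target(const struct mem_target *t, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (t[i].free_pages == 0)
            continue;             /* full: fall back to slower memory */
        if (best < 0 || t[i].speed > t[best].speed)
            best = i;
    }
    return best;
}
```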

One of the major problems that Jérôme encountered was that the kernel's existing memory policy is handled on a per-CPU basis. He felt this was not fine-grained enough to deal with modern systems properly. A really robust set of memory policies, he felt, would require an entirely new API.

But this was easier said than done. Revising the existing memory policy API would necessarily break vast tracts of existing user code, all of which would need to be ported to the new API. That type of change, unless absolutely necessary, was unlikely to find sympathetic ears among the top kernel people.

Jérôme's solution was to write an entirely new API that could sit alongside the original. New user code could use the new hotness, while existing code could stick with old reliable.

Yet this solution also carried its own drawback. As Aneesh Kumar put it in response to Jérôme's proposal, "we now have multiple entities tracking CPU and memory."

Aneesh also drew out of Jérôme the confession that even once the new API was in place, it might not grant all initiators optimal access to all targets. Mostly Jérôme seemed to feel that the situation was still too unknown to have real clarity. He wanted first of all to expose the topology to user space and then, once that was accomplished, see what could be seen. He wanted to climb the mountain first and only then look out across the landscape.

Dave Hansen had the sternest objections to Jérôme's work. He said that since the kernel already had infrastructure to deal with various types of RAM on the system, it made more sense to enhance those existing features than to write something new. In particular, he said, the Heterogeneous Memory Attribute Table (HMAT) was already present in firmware for this very purpose – to express the system's hardware topology to the operating system. He also pointed out that non-uniform memory access (NUMA) support already existed in the kernel to deal with multiple types of RAM. In fact, he said, NUMA had already been embraced by the Advanced Configuration and Power Interface (ACPI) specification. So, Dave said, there was quite a lot of work already underway to address this whole issue.

However, Jérôme was very conciliatory. He said his code was not intended to replace any of the infrastructure Dave had mentioned; his own work was essentially separate. He said he couldn't see any way to extend NUMA to support device memory, because that memory was not cache coherent – it didn't have the same characteristics as other memory on the system. In some cases, the memory couldn't be seen by the CPU at all. NUMA, he felt, just wasn't able to handle that sort of case, and he intended his own interfaces to take up some of that slack.

And Dave essentially agreed with all that and affirmed that NUMA really was intended to handle memory that was visible universally and could be allocated normally.

So the two agreed that there didn't seem to be any conflict. But Dave still had some technical objections to address. For one thing, exposing the system topology in sysfs could lead to a metric ton of files on systems with large numbers of CPUs and RAM resources, each with its own links and bridges to the others.

Additionally, there was the question of time. Even given the non-overlapping nature of their work, it was possible that enhancing NUMA would yield solutions faster than writing an entirely new system from the ground up. Maybe it would be better to focus on NUMA rather than wait potentially years for Jérôme's approach to bear fruit.

These issues and others formed the rest of the technical discussion, with various other folks pitching in with suggestions. It seems that for the moment, at least, Jérôme's idea has made it past the breakers.

The issue of trying to support a widening array of increasingly complex systems is an interesting one. It's possible that in the not-so-distant future, Linux might migrate processes over WiFi, sharing the resources of every device in the house or even a whole city. It's easy to see the value in managing a vast range of unequal resources.

Optimizing CPU Idle States

Rafael J. Wysocki wanted to improve the kernel's menu governor, which is used to put a CPU into a power-saving state when the CPU is inactive. A number of such power-saving states may be available to choose from: some that save less power but can be awakened quickly, and others that save more power but take longer to awaken. If there's good reason to believe it knows how long the CPU will remain idle, the menu governor puts the CPU into the appropriate state directly. There's also something called the ladder governor, which walks a CPU into deeper and deeper power-saving states on the basis of simple heuristics, like how long the CPU has already been in its current state.
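The trade-off any idle governor faces can be sketched simply: each state has a target residency (the minimum idle time that makes entering it worthwhile) and an exit latency (the time needed to wake back up). The sketch below is a generic illustration of that selection problem, with invented state values; it is not the menu governor's actual algorithm.

```c
/* Generic sketch of idle-state selection.  States are ordered shallow
 * to deep; the field values used in practice come from the hardware
 * driver, and everything here is illustrative. */
struct idle_state {
    unsigned target_residency_us;  /* minimum idle time worth entering */
    unsigned exit_latency_us;      /* time needed to wake back up */
};

/* Pick the deepest state whose target residency fits the predicted
 * idle duration and whose exit latency fits the latency constraint. */
static int pick_idle_state(const struct idle_state *s, int n,
                           unsigned predicted_idle_us,
                           unsigned latency_limit_us)
{
    int pick = 0;  /* state 0 assumed always usable */
    for (int i = 1; i < n; i++) {
        if (s[i].target_residency_us <= predicted_idle_us &&
            s[i].exit_latency_us <= latency_limit_us)
            pick = i;
    }
    return pick;
}
```

The hard part, and the subject of the rest of this section, is producing a good value for the predicted idle duration.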

Rafael felt that the existing logic used by the menu governor to pick CPU states was a horrifying violation of all natural law, which could only be fixed by a thorough rewrite. He offered a lot of reasons. For one thing, the menu governor used pattern matching to identify timer events, but mixed timer data together with data from other sources. This, he said, could cause the menu governor to perceive a time-based wake-up at a point in the code where no time-based wake-ups were possible.

The current menu governor, he said, also relied on data about processes that might not be running on the target CPU at all, which could have no bearing on when to wake that CPU. Rafael also pointed out that some of the menu governor's heuristics depended on whether a process was waiting for I/O, which he said was not actually related to the problem and seemed essentially random. And lastly, he found that the menu governor would sometimes analyze time frames so large that they were completely irrelevant, making it a waste of resources even to run that code.

However, Rafael did acknowledge that a wholesale replacement, while good overall, might make some workloads perform worse. Specifically, any workload that had been highly tuned to work well with the current menu governor might not work so well with the replacement.

And, since those highly tuned workloads were most likely to be the ones that needed peak performance, Rafael suggested keeping both governors, at least for a while, and letting people choose their favorite.

He called his new one the Timer Events Oriented (TEO) governor. Like the menu governor, it always attempts to find the deepest (most power-saving) state in which to put the CPU, but it uses a cleaner strategy for identifying that state. He explained:

First, it doesn't use "correction factors" for the time till the closest timer, but instead it tries to correlate the measured idle duration values with the available idle states and use that information to pick up the idle state that is most likely to "match" the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at all, and the pattern detection code in it avoids taking timer wake-ups into account. It also only uses idle duration values less than the current time till the closest timer (with the tick excluded) for that purpose.
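A rough sketch of that first idea – correlating measured idle durations with the available states – might look like the following. This is my loose interpretation for illustration, not the actual TEO code: it bins recent measured idle durations by which state's residency range they fell into, ignoring durations longer than the time until the next timer, and picks the most populated bin.

```c
/* Loose, illustrative interpretation of the TEO idea: bin recent
 * measured idle durations by the state they would have justified, then
 * pick the state with the most populated bin, considering only
 * durations shorter than the time until the closest timer. */
#define NSTATES 3

/* Invented target residencies, shallow to deep, in microseconds. */
static const unsigned residency_us[NSTATES] = {1, 20, 400};

/* Map an idle duration to the deepest state it would have justified. */
static int duration_to_state(unsigned dur_us)
{
    int st = 0;
    for (int i = 1; i < NSTATES; i++)
        if (residency_us[i] <= dur_us)
            st = i;
    return st;
}

static int teo_like_pick(const unsigned *recent, int n,
                         unsigned until_timer_us)
{
    unsigned bins[NSTATES] = {0};
    for (int i = 0; i < n; i++)
        if (recent[i] < until_timer_us)   /* exclude timer wake-ups */
            bins[duration_to_state(recent[i])]++;

    int best = duration_to_state(until_timer_us); /* default: fit timer */
    for (int i = 0; i < NSTATES; i++)
        if (bins[i] > bins[best])
            best = i;
    return best;
}
```

With a history of mostly short idle periods, such a scheme resists over-committing to a deep state even when the next timer is far away.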

Doug Smythies replied with some test results comparing several different workloads under the plain kernel vs. Rafael's patched kernel. In some cases he found no significant performance difference between them, and in some cases, he found a 1.4 percent speed-up under Rafael's patch. Doug also looked over Rafael's code and posted some bug fixes, which Rafael accepted for the next iteration.

Giovanni Gherdovich also replied with his own benchmarks, saying that Rafael's patch was much better than an earlier version he'd tested and also better than the current menu governor. He also posted a set of tests he performed, which came back with no significant difference between the two governors.

Rafael looked over Giovanni's results and felt that actually the tremendous speed improvements might mean that the patch was being too aggressive. He remarked that other tests even showed a slowdown in some cases. He said he would soon put out a new patch that was a little more energy efficient, but he said that if this resulted in too much speed degradation, he'd return to the current version of his patch.

There was a bit more back-and-forth between test results and patch tweaks before the thread ended. There seems to be no controversy whatsoever with this patch, and inclusion in the main tree may just be a question of a few more tweaks to the code. It does seem as though there is enough variation between Rafael's TEO governor and the menu governor to warrant keeping the menu governor around for a while longer. But eventually I'd expect most user code that's optimized for the menu governor to start optimizing for the better organized and more usable TEO governor instead.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
