Zack's Kernel News

Article from Issue 240/2020

Chronicler Zack Brown reports on a stone so large..., a global stats-gathering interface, and the kernel development process.

A Stone So Large …

A lot of kernel patches just fix little things or make something slightly more convenient for system administrators and kernel developers. Recently, Joe Perches pointed out that the Arm architecture could conceivably invoke the Linux kernel with a command line that was longer than what printk() was able to output. So you could launch the kernel, but you couldn't then go back and see exactly how you'd done it.

He posted a patch to split long lines up so that printk() could output them.

Sergey Senozhatsky liked the overall idea, but pointed out a bug in Joe's code – he noted that printk() would also output a bit of prefix text, to let the user know what was being output. In this case, the prefix was "Kernel command line". Sergey reported that Joe's patch neglected to take the length of that prefix text into account when seeing whether the full output string was over the maximum size. So the text could still be slightly too big for printk(). Joe agreed and fixed it.

Sergey also commented that it would be good to have some way to let the calling routine know what printk()'s maximum line length was, "so that printk() will still have sane limits and won't print a 1G string for example." He suggested exporting a variable that calling routines could see, which simply gave the maximum line length printk() supported, such as 256 characters or something around there. Joe agreed with that, saying probably a simple #define would do the trick. So there would be the double solution: printk() would split lines that were too long, and calling routines would know how small to split their data before calling printk().
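
To make the prefix problem concrete, here is a minimal userspace sketch in C. It is not Joe's patch: MAX_LINE_LEN, chunk_len(), and print_split() are names invented for illustration, the value 256 is only the ballpark figure mentioned in the thread, and printf() stands in for printk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limit; the thread only suggests "256 characters or so". */
#define MAX_LINE_LEN 256

/*
 * Return how many bytes of `text` fit on one output line once a prefix
 * of `prefix_len` bytes has been subtracted from the budget.
 */
size_t chunk_len(const char *text, size_t prefix_len)
{
    size_t len = strlen(text);
    size_t room;

    assert(prefix_len < MAX_LINE_LEN);
    room = MAX_LINE_LEN - prefix_len;
    return len < room ? len : room;
}

/*
 * Print `text` in pieces so that prefix plus piece never exceeds
 * MAX_LINE_LEN. Returns the number of lines written.
 */
int print_split(const char *prefix, const char *text)
{
    size_t prefix_len = strlen(prefix);
    int lines = 0;

    while (*text != '\0') {
        size_t n = chunk_len(text, prefix_len);

        printf("%s%.*s\n", prefix, (int)n, text); /* printf stands in for printk */
        text += n;
        lines++;
    }
    return lines;
}
```

The point of the sketch is the subtraction in chunk_len(): comparing only the text length against the limit, without the prefix, is the kind of oversight Sergey spotted.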

Problem solved! Except… Sergey suggested a different approach, in fact one that was already in use in the kernel's print_modules() function, which also frequently had to output tons of data. It used memchr() and pr_cont() to split its input in a loop. The benefit over Joe's printk() implementation was that pr_cont() had its own overflow control.

Joe was fine with this – though he did notice that pr_cont() also had a maximum size limit of 8,192 characters. But for the purposes of the Arm architecture, he added that kernel command lines would not hit that limit, so it didn't much matter either way.

Andrew Morton sensed controversy on the horizon between the two proposals. He suggested an alternative approach, creating a new putsk() function, which would be a kernel-specific version of the puts() function from the C library. The main benefit, Andrew said, would be to replace a lot of printk() invocations inside the kernel, which would reduce the size of the compiled binary.

Joe didn't care either way, but Sergey was not enthusiastic about this suggestion. He said, "A function that prints the kernel command line is a bit different in the way that we can split command line arguments – they are space separated, which is very convenient – so we would pr_cont() parts of command line individually. This has an advantage that we won't \r\n in the middle of the parameter."
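
Sergey's point about space-separated arguments can be sketched in userspace C. This is only an illustration under assumptions, not the print_modules() code or anything posted to the list: split_point() and print_cmdline() are hypothetical names, the `room` parameter is an assumed per-line budget, and printf() stands in for pr_cont().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Decide how many bytes of `s` to emit on the next line. If everything
 * fits in `room` bytes, emit it all. Otherwise break at the last space
 * that still fits, so no parameter is cut in half. A single parameter
 * longer than `room` is split hard as a last resort.
 */
size_t split_point(const char *s, size_t len, size_t room)
{
    const char *p = s, *last = NULL, *sp;

    assert(room > 0);
    if (len <= room)
        return len;

    /* Walk forward with memchr(), remembering the last usable space. */
    while ((sp = memchr(p, ' ', (size_t)(s + room + 1 - p))) != NULL) {
        last = sp;
        p = sp + 1;
    }
    if (last != NULL && last > s)
        return (size_t)(last - s);
    return room;
}

/* Print a command line in pieces; returns the number of lines written. */
int print_cmdline(const char *cmdline, size_t room)
{
    size_t len = strlen(cmdline);
    int lines = 0;

    while (len > 0) {
        size_t n;

        /* Skip the separator spaces between pieces. */
        while (len > 0 && *cmdline == ' ') {
            cmdline++;
            len--;
        }
        if (len == 0)
            break;

        n = split_point(cmdline, len, room);
        printf("%.*s\n", (int)n, cmdline); /* printf stands in for pr_cont */
        cmdline += n;
        len -= n;
        lines++;
    }
    return lines;
}
```

With a budget of 16 bytes, the string "root=/dev/sda1 ro quiet console=ttyS0" comes out as three lines, each ending on a parameter boundary.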

In fact, the conversation descended into the intricacies of potential implementations, and it never came up for air. The discussion ended after a few more emails, with no clear decision in any direction.

Obviously, there'll be a solution eventually – the Arm architecture can't be allowed to have a command line so long that even it can't see it. But the interesting thing is the care and attention given to such a minor feature, even when multiple prospective solutions are available.

Implementing a Global Stats-Gathering Interface

Often the need for a particular feature will only gradually emerge into the collective kernel developer consciousness as a variety of one-off implementations start popping up everywhere. Then finally someone says, "shouldn't all this stuff be in just one place?" And that's when the fun begins. Or maybe multiple people all start simultaneously implementing their own visions for a glorious unity without saying anything to anyone else. And the fun starts there as well.

Recently, Emanuele Giuseppe Esposito said, "There is currently no common way for Linux kernel subsystems to expose statistics to userspace shared throughout the Linux kernel; subsystems have to take care of gathering and displaying statistics by themselves, for example in the form of files in debugfs. For example KVM has its own code section that takes care of this in virt/kvm/kvm_main.c, where it sets up debugfs handlers for displaying values and aggregating them from various subfolders to obtain information about the system state."

In fact, Emanuele's purpose was to introduce the statsfs filesystem written by Paolo Bonzini in 2019 to replace KVM's use of debugfs for statistics. In his announcement post on the KVM list in November, Paolo had said, "statsfs is a proposal for a new Linux kernel synthetic filesystem, to be mounted in /sys/kernel/stats, which exposes subsystem-level statistics in sysfs. Reading need not be particularly lightweight, but writing must be fast. Therefore, statistics are gathered at a fine-grain level in order to avoid locking or atomic operations, and then aggregated by statsfs until the desired granularity."
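
The trade-off Paolo describes, cheap writes paid for by slower reads, can be illustrated with a small C sketch. It is not taken from the statsfs patches: struct fine_counter, counter_inc(), counter_read(), and NR_SLOTS are hypothetical names, and the sketch simply assumes that each writer owns one slot and never touches another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 4 /* hypothetical number of independent writers */

/* One slot per writer, so writers never contend with each other. */
struct fine_counter {
    uint64_t slot[NR_SLOTS];
};

/*
 * Write path: touch only the caller's own slot. Under the assumption
 * above, this needs neither a lock nor an atomic operation.
 */
void counter_inc(struct fine_counter *c, unsigned int writer)
{
    assert(writer < NR_SLOTS);
    c->slot[writer]++;
}

/* Read path: walk every slot and add them up. Slower, but reads are rare. */
uint64_t counter_read(const struct fine_counter *c)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < NR_SLOTS; i++)
        sum += c->slot[i];
    return sum;
}
```

All the aggregation cost lands in counter_read(), which matches the stated priority: reading need not be lightweight, but writing must be fast.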

Paolo's goal was to create a special-purpose filesystem with a stable API that other parts of the kernel could also use for the same purpose of gathering and processing statistics.

Now, apparently, statsfs was ready to be adopted by the wider community of subsystems within the kernel. In Emanuele's announcement he said, "In this patch series I introduce statsfs, a synthetic ram-based virtual filesystem that takes care of gathering and displaying statistics for the Linux kernel subsystems. The file system is mounted on /sys/kernel/stats and would be already used by kvm."

He went on to say, "Statsfs offers a generic and stable API, allowing any kind of directory/file organization and supporting multiple kind[s] of aggregations (not only sum, but also average, max, min and count_zero) and data types (all unsigned and signed types plus boolean). The implementation, which is a generalization of KVM's debugfs statistics code, takes care of gathering and displaying information at run time; users only need to specify the values to be included in each source."
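
As a rough picture of what those aggregation kinds compute, here is a hedged C sketch. The enum and function names are invented for illustration and are not the statsfs API. The sketch also assumes that count_zero means "how many sources currently report zero" and that the average is truncated to an integer, neither of which is spelled out in the announcement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the aggregation kinds listed in the announcement. */
enum agg_kind { AGG_SUM, AGG_AVG, AGG_MAX, AGG_MIN, AGG_COUNT_ZERO };

/* Fold the values reported by `n` sources into a single number. */
uint64_t aggregate(enum agg_kind kind, const uint64_t *vals, size_t n)
{
    uint64_t sum = 0, max = 0, min = UINT64_MAX, zeros = 0;
    size_t i;

    assert(vals != NULL && n > 0);
    for (i = 0; i < n; i++) {
        sum += vals[i];
        if (vals[i] > max)
            max = vals[i];
        if (vals[i] < min)
            min = vals[i];
        if (vals[i] == 0)
            zeros++;
    }

    switch (kind) {
    case AGG_SUM:
        return sum;
    case AGG_AVG:
        return sum / n; /* integer average, truncated (assumption) */
    case AGG_MAX:
        return max;
    case AGG_MIN:
        return min;
    case AGG_COUNT_ZERO:
        return zeros;   /* number of sources reporting zero (assumption) */
    }
    return 0; /* not reached for valid kinds */
}
```

For the five source values 4, 0, 7, 0, and 9, the sketch yields a sum of 20, an average of 4, a maximum of 9, a minimum of 0, and a zero count of 2.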

David Rientjes was very excited to see this – in fact, he said he'd been looking into doing something similar with Jonathan Adams. So he had some very specific comments and concerns. His main desire was optimization. David saw some dangerous overhead in the way individual values were gathered in preparation for generating statistical data. Values could be as fine-grained as the amount of RAM used by a single data structure. Minimizing the number of operations required for each saved value, David said, would be crucial.

By way of suggestions, David said:

"A couple of ideas:

- an interface that allows gathering of all stats for a particular interface through a single file that would likely be encoded in binary and the responsibility of userspace to disseminate, or

- an interface that extends beyond this proposal and allows the reader to specify which stats they are interested in collecting and then the kernel will only provide these stats in a well formed structure and also be binary encoded.

"We've found that the one-file-per-stat method is pretty much a show stopper from the performance view and we always must execute at least two syscalls to obtain a single stat."
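
David's first idea, a single binary file that userspace picks apart, might look something like the following C sketch. Nothing in the thread specifies a record layout, so struct stat_record and find_stat() are purely hypothetical. The sketch assumes fixed-size records in host byte order and a buffer that one read() of such a file would have filled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size record: a numeric stat ID and its value. */
struct stat_record {
    uint32_t id;
    uint32_t pad;   /* explicit padding keeps the layout predictable */
    uint64_t value;
};

/*
 * Look up one stat in a buffer holding many records. Returns 0 and
 * stores the value in *out on success, or -1 if the ID is absent.
 */
int find_stat(const void *buf, size_t len, uint32_t id, uint64_t *out)
{
    size_t off;

    assert(buf != NULL && out != NULL);
    for (off = 0; off + sizeof(struct stat_record) <= len;
         off += sizeof(struct stat_record)) {
        struct stat_record r;

        /* memcpy avoids alignment assumptions about the raw buffer. */
        memcpy(&r, (const char *)buf + off, sizeof(r));
        if (r.id == id) {
            *out = r.value;
            return 0;
        }
    }
    return -1;
}
```

In this picture, the cost of opening and reading a file is paid once for the whole set of statistics rather than once per value, which is the overhead David objected to in the one-file-per-stat layout.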

Emanuele replied that a binary format for holding data in statsfs had been considered from the beginning and seemed feasible.

Jim Mattson also stood up in favor of a binary format for encoding raw values. He felt that storing things in ASCII format was simply not scalable.

Paolo, who wrote the original code for the statsfs filesystem, said:

"I am totally in favor of having a binary format, but it should be introduced as a separate series on top of this one – and preferably by someone who has already put some thought into the problem (which Emanuele and I have not, beyond ensuring that the statsfs concept and API is flexible enough).

"ASCII stats are necessary for quick userspace consumption and for backwards compatibility with KVM debugfs (which is not an ABI, but it's damn useful and should not be dropped without providing something as handy), so this is what this series starts from."

But David was worried. He said that once the new filesystem was merged into the kernel, then /sys/kernel/stats could indeed be considered an application binary interface (ABI) and would therefore have much stricter controls on whether and how it could ever be changed.

Linus Torvalds has traditionally been utterly unwilling to accept any changes to an ABI that is already in the kernel, unless such a change is necessary to fix a security hole, or if developers can be reasonably certain that no one – but no one! – is still using that particular ABI.

The reason for Linus's reluctance is tied to the difference between an ABI and an application programming interface (API). If an API changes, it means that source code using a particular library call will have to be changed in order to use a new library call. This is generally no problem, because, if you have the source code, you can fix it to use the new call and then recompile your program. Presto!

When an ABI changes, however, it means that a compiled binary using a particular publicly exposed kernel feature can no longer find that kernel feature. The user program stops working, and it will never work again. The kernel developers can't make the assumption that users will definitely have access to the source code for a particular binary. Maybe the binary is a closed-source product from a now-defunct company. Maybe the source code has simply been lost in the mists of time.

Linus considers it unacceptable to break user space in this way. If something can run on Linux, then it must continue to be able to run on Linux.

The same problem doesn't exist for API changes. By definition, if the interface used by your source code changes, you therefore have the source code and can write a patch for it to use the new interface. Because an existing compiled binary is so much harder to patch in that way, it's basically not reasonable to expect anyone to be able to do it.

This was David's concern about adopting statsfs into the kernel before ironing out ABI issues that would become permanent once the code was merged.

Paolo felt that binary and ASCII data formats should complement each other – they should each be available. The binary format would be available for highly efficient operations, while the ASCII format would be available for quick-and-easy user operations.

He affirmed that as far as the binary format itself went, he hadn't thought about it and wasn't sure what feature set to aim for. But he agreed that the ASCII format should be an optional item – easy to remove via mount options or during kernel compilation.

Meanwhile, Jonathan, who had been working with David and others at Google to develop metricfs, a project with goals similar to those of statsfs, weighed in. Regarding this new project, Jonathan said, "It's designed in a slightly different fashion than statsfs here is, and the statistics exported are mostly fed into our OpenTelemetry-like system. We're motivated by wanting an upstreamed solution, so that we can upstream the metrics we create that are of general interest, and lower the overall rebasing burden for our tree."

He added, "I agree with the folks asking for a binary interface to read statistics, but I also agree that it can be added on later. I'm more concerned with getting the statistics model and capabilities right from the beginning, because those are harder to adjust later."

He offered to collaborate on an overall statsfs design and proposed several ideas for Emanuele and Paolo to consider. These ideas were highly welcome, and the two groups descended into a technical design discussion.

In the midst of that discussion, Jonathan offered an interesting view of Google's metricfs behavior. He said:

"Here's a summary of the types of statistics we use in metricfs in google, to give a little context:

- integer values (single value per stat, source also a single value); a couple of these are boolean values exported as '0' or '1'.

- per-CPU integer values, reported as a <cpuid, value> table

- per-CPU integer values, summed and reported as an aggregate

- single-value values, keys related to objects:

  - many per-device (disk, network, etc) integer stats

  - some per-device string data (version strings, UUIDs, and occasional statuses.)

  - a few histograms (usually counts by duration ranges)

  - the 'function name' to count for the WARN statistic I mentioned.

- A single statistic with two keys (for livepatch statistics; the value is the livepatch status as a string)

"Most of the stats with keys are 'complete' (every key has a value), but there are several examples of statistics where only some of the possible keys have values, or (e.g. for networking statistics) only the keys visible to the reading process (e.g. in its namespaces) are included."

The discussion continued without any significant controversy. It seems clear that the two groups of developers want almost exactly the same thing and will eventually solve the technical implementation details. So we can look forward to a nice, clean, standardized statistics-gathering filesystem at some point in the not-too-distant future.

The Kernel Development Process

Linus Torvalds has changed his preferred development process many times over the years – so often, in fact, that discrete changes can be hard to identify, and the whole process seems more fluid and cultural than formal and rigid.

There was a short exchange recently in which David Howells wanted to submit a few minor bug fixes directly to Linus, but wasn't sure when would be the right time to send them. In recent years, Linus has become stricter about when he wants new features versus fixes. But exactly when and which has not been entirely clear. In this particular case, the next release was going to be Release Candidate 1, and David wasn't sure what Linus's plans for that particular release were.

So David asked. Linus replied, "No, I'll take fixes at any time, and the better shape rc1 is in, the happier everybody will be and the more likely we'll have testers."

And that was that. So, fixes are welcome at any time during the development cycle. It's only for new features and perhaps more complex changes that submitters need to consider exactly which release candidate is the right one for them.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
