Zack's Kernel News

Article from Issue 206/2018

Chronicler Zack Brown reports on I3C support, fixing mmap(), and tracing RAM usage in OOM conditions.

I3C Support

Boris Brezillon posted some patches to implement a portion of the I3C core infrastructure. I3C is a wholesale upgrade of I2C, the serial bus protocol that many sensor devices use because it's a simple two-wire interface. However, as that simplicity brings a proliferation of sensor devices, it becomes more important to manage the increased bandwidth and interrupt needs they create. I3C is designed to do that.

Boris's approach would handle I2C backward compatibility transparently, for minimal user pain. However, he also made certain compromises that would make his APIs more difficult to use, and he left a fair chunk of the I3C API unimplemented for now, although he intends to fill it out in the future.

One implementation compromise was to require code that calls the API to run in a non-atomic context (i.e., only where the calling code is allowed to sleep). That's a slight annoyance, because it requires callers to be aware of their current context when calling the I3C API. However, Boris indicated he'd be fine with changing that. He'd mainly done it as a shortcut.

Among the missing API calls, Boris left out hot plugging support, which might be a deal-breaker for some users. Of course, the missing APIs would all be added later as the I3C code was fleshed out.

Wolfram Sang took a look at the code, but had no serious objections and generally approved the patches.

Arnd Bergmann asked why Boris had created a whole new subsystem for I3C, instead of simply extending the existing I2C code to support I3C as well. Boris replied:

I3C and I2C are really different. I'm not talking about the physical layer here, but the way the bus has to be handled by the software layer. Actually, I thin[k] the I3C bus is philosophically closer to auto-discoverable buses like USB than I2C or SPI.

Indeed, all I3C devices can be discovered and do not need to be described at the board level (using DT, board files, ACPI, or whatever). Also, some I3C devices are hot pluggable, and most importantly, all I3C devices describe themselves during the discovery procedure (called DAA in the I3C world).

There is some kind of "device class" concept. In the I3C world it's called DCR (Device Characteristic Register), but it plays the same role: It's a set of generic interfaces devices have to comply with when they declare themselves as being compatible with a DCR ID (like accelerometer, gyroscope, or whatever). [...]

Devices also expose a 48-bit Provisional ID which is made of sub-fields. Two of them are particularly interesting: the manufacturer ID and the part ID, which are comparable to the vendor and product ID in the USB world.

These three [pieces of] information (DCR, ManufacturerID, and PartID) can be used to match drivers instead of the compatible string or driver-name used for I2C devices.

So, as you can imagine, dealing with an I3C bus is really different from dealing with an I2C bus.

Boris added, "Of course, I can move all the code in drivers/i2c/, but that won't change the fact that I3C and I2C buses are completely different with little to share between them."
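The matching scheme Boris describes can be sketched in a few lines of C. This is only an illustration, not code from his patches: the struct `demo_i3c_match`, the function names, and the table contents are hypothetical, and the bit positions of the manufacturer ID and part ID within the 48-bit Provisional ID are an assumption based on my reading of the I3C specification, since the discussion does not spell them out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMED layout of the 48-bit Provisional ID (not stated in the thread):
 * bits 47..33 hold the manufacturer ID, bits 31..16 hold the part ID.
 */
static uint16_t pid_manuf_id(uint64_t pid)
{
    return (uint16_t)((pid >> 33) & 0x7fff);
}

static uint16_t pid_part_id(uint64_t pid)
{
    return (uint16_t)((pid >> 16) & 0xffff);
}

/* Hypothetical match-table entry; not the kernel's real structure. */
struct demo_i3c_match {
    uint8_t dcr;          /* Device Characteristic Register value */
    uint16_t manuf_id;    /* comparable to a USB vendor ID */
    uint16_t part_id;     /* comparable to a USB product ID */
    const char *driver;   /* name of the driver to bind */
};

/* Return the driver name for a discovered device, or NULL if none matches. */
static const char *match_driver(const struct demo_i3c_match *tbl, size_t n,
                                uint8_t dcr, uint64_t pid)
{
    assert(pid < (1ULL << 48)); /* a Provisional ID is 48 bits wide */

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].dcr == dcr &&
            tbl[i].manuf_id == pid_manuf_id(pid) &&
            tbl[i].part_id == pid_part_id(pid))
            return tbl[i].driver;
    }
    return NULL;
}
```

A bus core built along these lines would call something like `match_driver()` once per device reported by the discovery procedure, which is why no board-level description is needed. Compare this with I2C, where the match key is a compatible string or driver name supplied by the device tree or board file.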

Arnd didn't want to let the idea go quite yet, though. He agreed that there were reasons to oppose extending I2C to cover I3C as well, but he felt there were also reasons to go through with it, even if it involved creating an ugly mess behind the scenes. He said, "there is value in representing the physical bus hierarchy in the software model, and if I2C and I3C devices can be attached to the same host bus, a good abstraction should show them under the same parent. This is true for both the kernel representation (in sysfs and the data structures) as well as the device tree binding (assuming we will need to represent I3C devices at all). The two don't have to use the same model, but it's easier if they do."

He also added, "We have discussed whether I2C and SPI should be merged into a single bus_type in the past, as a lot of devices can be attached to either of them. If it's common enough for I3C devices to support an I2C fallback mode, having a common bus_type might noticeably simplify device drivers by only requiring a single i2c_driver structure. Simplifying many drivers a little bit can in turn offset the added complexity in the subsystem."

Boris didn't really see the benefit of these ideas and expressed surprise that I2C and SPI might ever be merged. He asked Arnd for an explanation, and Arnd replied, "well, we never changed it, so at least the work required to merge the two was considered too much to justify any advantages." But he described the rationale as simplifying kernel build-time config options. He said:

The main problem with having one driver that can operate on different bus types (I2C plus either SPI or I3C) is the handling for the various combinations in configurations (e.g., I2C=m, SPI=y).

The easy case is having a module_init function that registers two device drivers, but that requires having a Kconfig dependency on both subsystems, and you can't use the module_i2c_driver() helper.

The second way is to have a number of #ifdef and complex Kconfig dependencies for the driver to only register the device_driver objects for the buses that are enabled. This is also doable, but everyone gets the logic wrong the first time.

What we end up doing to work around this for other drivers is to have the base driver in one library module, and separate modules for the bus-specific portions, which can then use module_i2c_driver again. There are many instances for combined I2C/SPI drivers in the kernel, and it works fine, but it adds a fair bit of overhead compared to having one driver that would, e.g., use regmap to abstract the differences in the probe() function and otherwise keeps everything in one place.
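To make Arnd's second option concrete, here is a user-space mock of a driver init function that registers with whichever buses are enabled. Nothing here is kernel code: the `DEMO_HAVE_*` macros stand in for Kconfig symbols, and the `mock_*` functions stand in for the real bus registration calls. The part that is easy to get wrong is the unwinding, where a failure on the second bus has to undo the registration on the first.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for Kconfig symbols; a real build would take these from .config. */
#define DEMO_HAVE_I2C 1
#define DEMO_HAVE_SPI 1

/* Mock bus cores: each one only records whether the demo driver is registered. */
static bool i2c_registered, spi_registered;
static int spi_fail; /* set to a negative errno-style value to simulate failure */

static int mock_i2c_add_driver(void)
{
    assert(!i2c_registered); /* catch double registration */
    i2c_registered = true;
    return 0;
}

static void mock_i2c_del_driver(void)
{
    i2c_registered = false;
}

static int mock_spi_register_driver(void)
{
    assert(!spi_registered);
    if (spi_fail)
        return spi_fail;
    spi_registered = true;
    return 0;
}

static void mock_spi_unregister_driver(void)
{
    spi_registered = false;
}

/* Register with every enabled bus; leave nothing registered on failure. */
static int demo_init(void)
{
    int ret = 0;

#if DEMO_HAVE_I2C
    ret = mock_i2c_add_driver();
    if (ret)
        return ret;
#endif
#if DEMO_HAVE_SPI
    ret = mock_spi_register_driver();
    if (ret) {
#if DEMO_HAVE_I2C
        mock_i2c_del_driver(); /* undo the first registration */
#endif
        return ret;
    }
#endif
    return ret;
}

/* Unregister in the reverse order of registration. */
static void demo_exit(void)
{
#if DEMO_HAVE_SPI
    mock_spi_unregister_driver();
#endif
#if DEMO_HAVE_I2C
    mock_i2c_del_driver();
#endif
}
```

Even in this toy form, the conditional blocks nest inside one another, and the mock does not attempt to model the case where one bus is built in and the other is a module. That gives some feel for why drivers tend to split the bus-specific glue into separate modules instead.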

This made sense to Boris, and he could see the value of having a single subsystem, but he still hated the ugliness of the potential merged implementation. He asked, "Can't we solve this problem with a module_i3c_i2c_driver() macro that would hide all this complexity from I2C/I3C drivers?"

Wolfram, who had earlier approved Boris's patches, felt that merging I2C with I3C was possibly significantly different from merging I2C with SPI. In the latter case, there was a decent likelihood that the system might only have one or the other, in which case supporting both would cover the most possibilities. But he doubted there was any hardware currently implementing both I2C and I3C. And since the I3C code was backward compatible with I2C, that seemed to obviate the need for a merger – the user could simply plug the I2C device into the port and use it.

Boris replied that although he didn't know of any devices implementing both protocols at the moment, "the spec clearly describe[s] what legacy/static addresses are for and one of their use case[s] is to connect an I3C device on an I2C bus and let it act as an I2C device," which Wolfram agreed made it more likely that a device would one day implement both protocols.

At some point, Boris took a whack at merging I2C and I3C, but he found it difficult to conceptualize a workable design. He also added, "It's kind of hard to design something when you don't have real devices. I guess I can mimic I2C for now and make it evolve based on users' needs."

Elsewhere, Greg Kroah-Hartman asked Boris to split the documentation out from the patch and submit it separately, to make the code slightly easier to go through. He also went through the code himself and offered several technical criticisms, which Boris said he'd address.

Greg also deduced from a missing data type that Boris had never actually tested removing an I3C device once it had been installed. Boris replied, "You got me, never tried to remove a device." He asked some technical questions about how best to handle that case.

At this point, the discussion veered off into technical locking issues, as Greg and Boris worked on resolving the need to remove devices from a running system.

There was no ultimate resolution to any of the questions raised in the discussion, but it seems clear that Boris's first attempt will probably be reworked to address Greg's and Arnd's objections. It also seems like the task might be bigger than Boris had first anticipated, and he may not have the physical equipment he needs. (He remarked at one point, "all my tests have been done with dummy/fake I3C slaves emulated with a slave IP.") So, although it seems like I3C will definitely be getting into the kernel sooner rather than later, there are still some significant hurdles to overcome.

Fixing mmap()

Dan Williams had a thorny conundrum. The mmap() system call didn't validate unknown flags, which meant that a program requesting a new mmap behavior on an older kernel couldn't fail gracefully: The old kernel would silently ignore the flag rather than report an error. Dan wanted to implement a new system call, mmap3(), that would validate all flags.
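The difference between the two behaviors can be modeled in a few lines of C. This is a toy model under stated assumptions, not the kernel's mmap() code: the `DEMO_*` flag values are made up for the example and are not the real `MAP_*` constants, and `lenient_map()` and `strict_map()` are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bits for the demo; these are NOT the real MAP_* values. */
#define DEMO_MAP_SHARED  0x01u
#define DEMO_MAP_PRIVATE 0x02u
#define DEMO_MAP_NEWFLAG 0x80u /* a feature an old kernel has never heard of */

#define DEMO_OLD_KNOWN (DEMO_MAP_SHARED | DEMO_MAP_PRIVATE)
#define DEMO_NEW_KNOWN (DEMO_OLD_KNOWN | DEMO_MAP_NEWFLAG)

/*
 * Lenient behavior: unknown bits are dropped and the call still succeeds.
 * *honored receives the flags the "kernel" actually acted on.
 */
static int lenient_map(unsigned known, unsigned flags, unsigned *honored)
{
    assert(honored != NULL);
    *honored = flags & known;
    return 0;
}

/* Strict behavior: any bit outside the known set fails the whole call. */
static int strict_map(unsigned known, unsigned flags, unsigned *honored)
{
    assert(honored != NULL);
    if (flags & ~known)
        return -EINVAL;
    *honored = flags;
    return 0;
}
```

With the lenient version, a caller passing `DEMO_MAP_SHARED | DEMO_MAP_NEWFLAG` to an "old" kernel gets a successful return and no hint that the new flag was dropped. With the strict version, the same call returns `-EINVAL`, and the caller can fall back to something else.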

There were a couple of problems. For one thing, Christoph Hellwig pointed out that "Adding new syscalls is extremely painful; it will take forever to trickle this through all architectures (especially with the various 32-bit architectures having all kinds of different granularities for the offset) and then the various C libraries, never mind applications."

Christoph suggested just using an existing __MAP_VALID hack as a workaround.

Dan replied, "I agree with the mess and delays it causes for other archs and libc, but at the same time this is for new applications and libraries that know to look for the new flag, so they need to do the extra work to check for the new syscall."

Regarding the possibility of using __MAP_VALID, he also pointed out that "any new mmap flag with this hack must be documented to only work with MAP_SHARED and that MAP_PRIVATE is silently ignored."

However, he said he wasn't totally opposed to doing things this way if it turned out not to be too onerous.

Christoph went digging around in the mmap code and thought he had found a way to avoid the problem with ignoring MAP_PRIVATE. If so, he felt, it certainly beat the hell out of adding a new system call.

However, Kirill A. Shutemov looked at Christoph's discovery and felt that certain architectures, in particular PA-RISC, wouldn't be able to support the fix Christoph had in mind. Christoph rejoined, "I'd be happy to say that we should not care about PA-RISC for persistent memory. We'll just have to find a way to exclude PA-RISC without making life too ugly." Kirill hated this idea, though, saying that system call interfaces should be universal and not have different behaviors depending on which machine they were run on.

The debate continued, although it seemed to be getting further afield from the original issue. Eventually Dan brought the conversation back around to the real question, saying:

The problem here is that to support the new mmap flags the arch needs to find a flag that is guaranteed to fail on older kernels. Defining MAP_DIRECT to 0x8 on PA-RISC doesn't work because it will simply be ignored on older PA-RISC kernels.

However, it's already the case that several archs have their own sys_mmap entry points. Those archs that can't follow the common scheme (only PA-RISC it seems) will need to add a new mmap syscall. I think that's a reasonable tradeoff to allow every other architecture to add this support with their existing mmap syscall paths.

Helge Deller objected to this plan, saying, "I don't want other architectures to suffer just because of PA-RISC. But adding a new syscall just for usage on PA-RISC won't work either, because nobody will add code to call it then."

Helge proposed breaking the ABI for PA-RISC in this case. This, he said, would allow a proper fix without having to do any major contortions. He added that there were not a lot of PA-RISC users, anyway, and that most of them updated their kernels regularly, because of all the recent fixes that had gone into the code.

However, Dan replied, "The whole point is to avoid an ABI regression and the chance for false positive results. We're immediately stuck if some application was expecting 0x8 to be ignored, or conversely, an application that absolutely needs to rely on MAP_SYNC/MAP_DIRECT semantics assumes the wrong result on a PA-RISC kernel where they are ignored."
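Dan's worry about false positives is easier to see from the application's side. The sketch below models an application that asks for the new behavior and falls back if the kernel refuses. It is a simulation only: `struct demo_kernel`, `demo_mmap()`, `app_map()`, and the `DEMO_FLAG_SYNC` bit are all hypothetical and stand in for a real kernel, the real mmap() call, and a real flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_FLAG_SYNC 0x8u /* hypothetical bit for the new behavior */

/* Model of how a given kernel responds to the new flag. */
struct demo_kernel {
    bool knows_flag;      /* implements the new behavior */
    bool rejects_unknown; /* fails the call on bits it does not know */
};

/*
 * Simulated mmap(): returns 0 on success or -EINVAL on rejection.
 * *sync_active reports what the "kernel" really did with the mapping.
 */
static int demo_mmap(const struct demo_kernel *k, unsigned flags,
                     bool *sync_active)
{
    assert(k && sync_active);
    *sync_active = false;

    if (flags & DEMO_FLAG_SYNC) {
        if (k->knows_flag)
            *sync_active = true;
        else if (k->rejects_unknown)
            return -EINVAL;
        /* otherwise the unknown bit is silently dropped */
    }
    return 0;
}

/*
 * Application-side probe: request the new behavior, fall back to a plain
 * mapping if refused. Returns what the application BELIEVES it got.
 */
static bool app_map(const struct demo_kernel *k, bool *sync_active)
{
    if (demo_mmap(k, DEMO_FLAG_SYNC, sync_active) == 0)
        return true; /* the application assumes the flag took effect */

    (void)demo_mmap(k, 0, sync_active); /* graceful fallback */
    return false;
}
```

On a kernel with `rejects_unknown` set, the probe works as intended. On one that ignores the bit, `app_map()` returns true while `*sync_active` is false. That mismatch is the false positive Dan describes, and it is why the flag value has to be one that older kernels are guaranteed to reject.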

The debate petered out at that point, and it's not yet clear what solution they'll ultimately use for mmap(). I found it interesting to watch how each proposed solution ran into its own stumbling blocks. New system calls take time to trickle down. The proposed workaround would only work for certain cases – or if it could be fixed to work for all cases, there was still a single holdout architecture that wouldn't go along with it. Choosing to break ABI compatibility for that one case was a nonstarter because the whole point was not to do that in the first place.

Of course, many parts of the kernel would love the opportunity to break ABI compatibility, but this seems to be one of the most sacrosanct elements of the kernel. Almost the only thing capable of trumping ABI compatibility is a security hole. Security trumps all. But one of these days, it would be wonderful to have a special ABI-breaking kernel release, where every part of the kernel is free to break the ABI for just that one time. Oh the feeding frenzy! Oh the carnage! And afterward … oh the regret! Oh the recrimination! It would be glorious.

Tracing RAM Usage in OOM Conditions

Yang Shi was annoyed when his kernel ran out of memory, and the out-of-memory (OOM) killer couldn't find a process to kill to stave off the kernel panic. If there's no process to kill, how could the system be out of memory? It turned out that the system had put all its memory into unreclaimable slabs. Yang posted a patch to add a new -U option to the slabinfo program, so that the program would output only data about unreclaimable slabs. This, he said, would at least allow the user to troubleshoot the problem and possibly find a way to deal with it properly.
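A quick way to see whether a system is heading toward this situation is to look at the summary counters the kernel already exports. The helper below is a minimal sketch, unrelated to Yang's patch: it extracts one field from text in the format of /proc/meminfo, which reports unreclaimable slab memory as `SUnreclaim`. The function name `meminfo_kb()` is made up for the example, and the caller is assumed to have read the file into a NUL-terminated buffer already.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for a line starting with "key:" in meminfo-format text and return
 * its value in kB, or -1 if the key is missing or has no number.
 */
static long meminfo_kb(const char *text, const char *key)
{
    size_t klen;

    assert(text && key);
    klen = strlen(key);

    for (const char *line = text; line && *line; ) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;

            /* %ld skips the padding spaces between the colon and the number */
            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n'); /* advance to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}
```

Comparing `meminfo_kb(buf, "SUnreclaim")` with `meminfo_kb(buf, "MemTotal")` gives the overall picture. The per-cache breakdown that the slabinfo tool prints is what you would need to find out which slab is actually responsible.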

Michal Hocko had no objection to the patch going into the kernel, but he did note that Yang's code might produce a metric ton of output, on top of the already highly verbose OOM killer report. As such, he suggested leaving the feature disabled by default.

Yang disagreed, saying the output would only be produced in the event of catastrophic failure and not as a part of normal operations. He added that the code could easily add a file to the proc filesystem (procfs) to control the amount of output, even in that case.

Michal said he didn't care that much about this particular issue, because "most OOM reports I have seen were simply user space pinned memory." He was fine leaving the output more verbose rather than less.

They kept talking, and eventually they seemed to reach a compromise when Yang suggested, "Maybe we can set a unreclaimable slab/total mem ratio. For example, when unreclaimable slab size >= 50% total memory size, then we print out slab stats in OOM? And, the ratio might be adjustable in /proc."

This made sense to Michal, and the thread came to an end.
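The check Yang proposed amounts to a single comparison. Here is a minimal sketch, assuming both sizes are available in the same unit and that the threshold is given as a whole percentage. The function name is hypothetical, and how the ratio would be exposed under /proc was left open in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the OOM report should include slab statistics:
 * true when unreclaimable slab memory is at least ratio_percent
 * of total memory.
 */
static bool should_dump_slab_stats(unsigned long unreclaim_kb,
                                   unsigned long total_kb,
                                   unsigned ratio_percent)
{
    assert(ratio_percent <= 100);

    if (total_kb == 0)
        return false;

    /*
     * unreclaim / total >= ratio / 100, rearranged to avoid division
     * and floating point; widened so the multiplication cannot overflow
     * on systems where unsigned long is 32 bits.
     */
    return (unsigned long long)unreclaim_kb * 100 >=
           (unsigned long long)total_kb * ratio_percent;
}
```

With Yang's example threshold of 50, a machine with 16GB of RAM would print slab statistics in its OOM report only if 8GB or more were tied up in unreclaimable slabs, which keeps the extra output out of the far more common reports that Michal described.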

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
