Zack's Kernel News

© Zack Brown

Article from Issue 201/2017

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Tracking Inode Versioning

Just as the double-slit experiment behaves one way when it's being watched, and another when it thinks it can get away with something, Jeff Layton suggested that Linux could get away with incrementing the inode version number only when some other part of the kernel was actually paying attention.

The inode version number (i_version in the inode structure) is generally incremented whenever the inode's metadata changes. Or, as Jeff explained, filesystems increment the version number whenever something happens that they care about. In the case of Btrfs, ext4, and XFS, they increment the version for file data changes as well. The problem with too much version change is that the inode has to be written back to disk each time, which can be slow on write-heavy operations.

Jeff believed he'd discovered that the only time the version number really mattered was when some other bit of code queried the inode. If there was no observer, there would be no need to keep the inode version precisely up to date. And, when an observer came along, that would be a good time for the inode to show it had been paying attention, by updating its version number.

So, Jeff posted a patch that would update inode version numbers in those filesystems if both a change and a query had occurred. He asked if folks thought this was the right way to go – and he predicted that eventually other kernel changes would make something like this necessary in any case.
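The idea can be sketched in a few lines of C. This is a userspace model, loosely patterned on the queried-flag scheme that this work eventually produced; the names and the use of the low bit as a "someone looked" flag are illustrative, not the patch's actual code:

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of a "lazy" i_version: the low bit records whether anyone has
 * queried the value since the last increment. A change only bumps the
 * counter (and dirties the inode) if an observer was actually watching. */

#define IV_QUERIED 1ULL

/* Called on a change: returns true if the inode must be written back. */
static bool iv_maybe_bump(uint64_t *iv)
{
    if (*iv & IV_QUERIED) {
        *iv = ((*iv >> 1) + 1) << 1;  /* bump counter, clear the flag */
        return true;                  /* version changed: dirty inode */
    }
    return false;                     /* nobody looked: skip the write */
}

/* Called on a query: remember that someone is now watching. */
static uint64_t iv_query(uint64_t *iv)
{
    *iv |= IV_QUERIED;
    return *iv >> 1;                  /* report only the counter part */
}
```

On a write-heavy, query-free workload, `iv_maybe_bump()` keeps returning false, so the inode never needs the extra writeback that an unconditional increment would force.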

Christoph Hellwig asked for some performance numbers to show that Jeff's patch made a real improvement. Jeff offered up some numbers showing a two-fold speed improvement, though he also pointed out that the numbers would vary depending on workload.

Bruce Fields also had some comments but was enthusiastic about the whole idea. His main concern was making sure versioning worked properly for all filesystems. At one point, he wrote up notes for an i_version man page, describing the semantics as he understood them:

  • It would be a 64-bit number, which was big enough that there should be no question of rolling over when the version got too high, although those tend to be famous last words.
  • It works at least for files and directories and possibly for symlinks and other special file types.
  • It increments between separate queries if and only if the data or metadata changes between the two queries.
  • It works even if a reboot occurs between two version queries.

Jeff liked Bruce's write-up and said it covered the semantic issues very nicely. But then he and Bruce went back and forth for a bit on the feasibility of i_version working across reboots. The problem was, how would the system behave if the system crashed in between the time the version number increased and the relevant data made it onto the drive? If they couldn't guarantee consistency across reboots, then what guarantees could they reasonably make?

Jeff and Bruce agreed it was a real problem and needed to be addressed to avoid data inconsistency. At one point, Jeff said, "I think we'll have a hard time making this fully atomic. We may end up having to settle for something less (and doing our best to warn users of that possibility)."

But he also said that his i_version patch didn't make any atomicity issues worse than they already were in the kernel. In particular, as Dave Chinner had pointed out, NFS had always had a consistency issue between client and server. And Jan Kara also pointed out, "similarly as with XFS, ext4 doesn't guarantee atomicity unless fsync() has completed on the file. Until that you can see arbitrary combination of data & i_version after the crash. We do take care to keep data and metadata in sync only when there are security implications to that (like exposing uninitialized disk blocks) and if not, we are as lazy as we can to improve performance."

Finally, Jeff remarked, "I think what we'll have to do here is ensure that those filesystems do an fsync prior to reporting the i_version getattr codepath. It's not pretty, but I don't see a real alternative." But Dave replied, "I think that's even more problematic. ->getattr currently runs completely unlocked for performance reasons – it's racy w.r.t. to ongoing modifications to begin with, so /nothing/ that is returned to user space via stat/statx can be guaranteed to be 'coherent'."

Jan suggested just updating i_version for all inodes after any system crash. Since the version was a 64-bit number, such updates wouldn't really risk hitting the limit. One problem with this would be that it would invalidate any existing cached data, which would then have to be reloaded from disk. As Bruce put it, "Even if it's rare, it may be really painful when all your clients are forced to throw out and repopulate their caches after a crash. But, yes, maybe we can live with it."

At one point, Dave said, "The big question is how do we know there was a crash? The only thing a journaling filesystem knows at mount time is whether it is clean or requires recovery. Filesystems can require recovery for many reasons that don't involve a crash (e.g. root fs is never unmounted cleanly, so always requires recovery). Further, some filesystems may not even know there was a crash at mount time because their architecture always leaves a consistent filesystem on disk."

Jan pointed out that each filesystem had its own level of awareness of the state of the system or the occurrence of crashes. He suggested that each filesystem could implement a separate "crash flag," although he acknowledged that this would require each filesystem to change its on-disk format, which would be undesirable.

The discussion continued, with a variety of suggestions proposed and rejected. Ultimately the problem seems to boil down to having a vast array of filesystems, all of which would ideally be supported. At the same time, the goal of Jeff's code is to speed things up, so there's a limit to how bad a performance hit any supporting infrastructure will be allowed to introduce, and, of course, filesystem maintainers are always reluctant to change on-disk data formats, so there are limits to how big a change Jeff can expect from those codebases.

Controlling Console Output

Calvin Owens from Facebook pointed out that some hardware consoles were slower than others, making it difficult to know how much debugging output a piece of code should produce. Too much output could clog a slow console, while a fast one would have no problem. But because Linux offered no per-console way to control console verbosity, users would have to configure the kernel to match the slowest console attached to the system. Calvin wanted to change that.

He posted a patch to let the user control the verbosity of each individual console. Joe Perches gave an immediate thumbs up, and Sergey Senozhatsky also really liked the concept, although he had some technical suggestions, which Steven Rostedt agreed with. They both thought that Calvin's code should respect the ignore_loglevel kernel parameter, which it currently did not. Calvin said he'd update the patch to take care of it.

Petr Mladek also suggested, "people want to see all messages even on the slow console during panic(), oops(), with ignore_loglevel. It means that the new per-console setting must not limit it. Also any console must not go below minimum_console_level."
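Putting those constraints together, the filtering decision per console might look like the following sketch. This is an illustrative model, not Calvin's actual patch; kernel loglevels run from 0 (KERN_EMERG) upward, with lower numbers being more severe, and a message is emitted when its level is below the console's threshold:

```c
#include <stdbool.h>

/* Model of per-console filtering under the constraints Petr described:
 * each console carries its own verbosity cap, but ignore_loglevel wins
 * outright, and no console may be quieter than minimum_console_level. */
static bool console_emit(int msg_level, int per_console_level,
                         bool ignore_loglevel, int minimum_console_level)
{
    /* The effective threshold never drops below the global minimum,
     * so the most severe messages always get through. */
    int effective = per_console_level > minimum_console_level
                  ? per_console_level
                  : minimum_console_level;

    if (ignore_loglevel)
        return true;              /* panic()/oops(): show everything */

    return msg_level < effective; /* lower msg_level = more severe */
}
```

A slow serial console could then run with a low per-console cap, filtering out debug chatter, while a fast console on the same machine shows everything.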

The discussion didn't go very far. Everyone seemed to be on board with the basic idea, and there didn't appear to be any big stumbling blocks. I'd probably put this in the category of features everyone will appreciate but that no one knew they wanted until companies like Facebook, Google, and the rest started building massive data centers.

Cleaning Up Data Transfer Code

Al Viro knew that the mechanisms for copying large amounts of data between kernel and userspace had become more and more unmanageable over time, with code duplication, inconsistent semantics, and a big pile of bugs. The problem was in the uaccess.h code, which was all over the map.

His first step toward fixing it was to force everyone to use the same uaccess.h file. He'd already done this in a recent kernel release, and it allowed the rest of the issues to at least be addressed in a central location.

He posted a fresh patch that began to address the remaining problems, although he acknowledged that more would be needed. He said, "About 2.8K [lines of code] removed, a lot of cruft is gone and semantics is hopefully in sync now. All but two architectures (ia64 and metag) had been switched to new mechanism; for these two I'm afraid that I'll need serious help from maintainers."

The patch defined several standard function calls for copying data to and from userspace and kernel space. Each function also had clear behaviors for failure conditions.
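One of those standardized conventions is that the copy helpers report how many bytes could not be copied, with zero meaning complete success, so callers can detect partial faults. The following userspace model illustrates that calling convention; the function name and the simulated fault offset are hypothetical, for illustration only:

```c
#include <stddef.h>
#include <string.h>

/* Userspace model of the kernel convention: copy helpers such as
 * copy_from_user() return the number of bytes that could NOT be
 * copied (0 on success). faulting_offset simulates where an access
 * to the source would fault, something a real kernel copy hits when
 * a userspace page is unmapped mid-buffer. */
static size_t model_copy_from_user(void *dst, const void *src, size_t n,
                                   size_t faulting_offset)
{
    size_t ok = n < faulting_offset ? n : faulting_offset;

    memcpy(dst, src, ok);  /* copy the part that doesn't fault */
    return n - ok;         /* bytes left uncopied, per the convention */
}
```

A caller treats any nonzero return as a failure, typically translating it into -EFAULT; this uniform rule is part of what lets the eight surviving functions replace the old per-architecture sprawl.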

Ultimately, the patch reduced the total number of functions to just eight, with standardized calling conventions. He added, "Some of per-architecture branches might be even done right; however, most of them got no testing whatsoever, so any help with testing (as well as 'Al, for f k sake, dump that garbage of yours, here's the correct patch' from maintainers) would be very welcome."

He mentioned that the reason he hadn't implemented his patches for the ia64 and metag architectures was that "both architectures zero-pad in __copy_from_user_inatomic() and that really needs fixing. In case of metag there's __copy_to_user() breakage as well, AFAICS, and I've been unable to find any documentation describing the architecture wrt exceptions, and that part is apparently fairly weird. In case of ia64 … I can test mckinley side of things, but not the generic __copy_user() and ia64 is about as weird as it gets. With no reliable emulator, at that … . So these two are up to respective maintainers."

Vineet Gupta was happy that Al had taken on this work, and he posted an additional short patch of his own, to inline some of the code (i.e., to automatically replace function calls with duplicated code at compile time). Sometimes it's faster.

In this particular case, Al thought that any speed improvement would probably depend on the architecture and shouldn't be done for the whole kernel by default. He wanted to see some performance numbers showing that inlining really offered a speedup. And regarding inlining, Linus Torvalds said:

I would suggest against it.

The only part I think is worth inlining is the compile time size checks for kasan – and that only because of the obvious 'sizes are constant only when inlining' issue.

We used to inline a *lot* of user accesses historically, pretty much all of them were bogus.

The only ones that really want inlining are the non-checking ones that people should never use directly, but that are just helper things used by other routines (ie, the unsafe_copy_from_user() kind of things that are designed for strncpy_from_user()).

Once you start checking access ranges, and have might_fault debugging etc, it shouldn't be inlined.

Al agreed and reiterated that inlining could speed things up on certain architectures, but probably not for the general case. However, he also felt that keeping the option to inline available for any given architecture was not a problem and wouldn't make things more complicated in the general case. But Linus replied:

I disagree.

I think one of the biggest problems with our current uaccess.h mess is just how illegible the header files are, and the INLINE_COPY_{TO,FROM}_USER thing is not helping.

I think it would be much better if the header file just had:

extern unsigned long _copy_from_user(void *, const void __user *, unsigned long);

and nothing else. No unnecessary noise.

The same goes for things like [__]copy_in_user() – why is that thing still inlined? If it was a *macro*, it might be useful due to the might_fault() thing giving the caller information, but that's not even the case here, so we'd actually be much better off without any of that inlining stuff. Do it all in lib/usercopy.c, and move the might_fault() in there too.

Al was neither for nor against. But he wanted to keep certain issues separate from the current patch, given the scope and complexity of the problems addressed. He first wanted to get all the architecture-dependent fixes to assembly language routines out of the way; in fact, he wanted to get everything out of the arch/ directory altogether and to simplify all the many uaccess.h files.

He pointed out a large steaming pile of remaining disgustingness that would be higher on his list than the inlining issue.

But on that issue, in a nearby post, Al said, "Just to make it clear – I'm less certain than Linus that uninlined is uniformly better, but I have a strong suspicion that on most architectures it *is*. And not just in terms of kernel size – I would expect better speed as well. The only reason why these knobs are there is that I want to separate the 'who should switch to uninlined' from this series and allow for the possibility that for some architectures inlined will really turn out to be better. I do _not_ expect that there'll be many of those; if it turns out that there's none, I'll be only glad to make the guts of copy_{to,from}_user() always out of line."

Elsewhere, Linus agreed that there were bigger fish to fry than the inlining issue, and there was no need to address everything at once. But he added, "I just think that we really *should* aim for a simpler uaccess.h in the long term, so I would prefer we not encourage architectures to do things that simply won't matter."

Meanwhile, Martin Schwidefsky tested Al's patch on the S/390 architecture and found that it worked fine, except for one bug, which he patched and Al accepted.

Alexey Dobriyan also did some testing, and offered some technical suggestions to Al.

Russell King did some testing on the ARM architecture and confirmed that the system would boot successfully, although he added, "there's definitely a reproducible drop in performance for some reason when switching between current and Al's uaccess patches, which is partly recovered by switching to the out of line versions." He said, "I'd suggest that we immediately switch to the uninlined versions on ARM so that the impact of that change is reduced. We end up with a 1.9% performance reduction rather than a 6% reduction with the inlined versions."

Al posted a test patch that was not intended to go into the kernel but only to help uncover the behavior Russell had seen. Linus also speculated on what might cause the slowdown.

Kees Cook also tested Al's code on the ARM and x86 architectures and reported no problems. He also said to Al, "Thanks for working on this! I've wanted to see this done for a long time; I'm glad you had the time for it!"

Al posted an updated patch based on the discussion so far, and the thread ended.

This is the kind of stuff Al works on – things the end user never sees or hears about but that make life easier for massive numbers of kernel developers, make it harder to introduce security bugs, and in general speed the kernel up, or at least won't really slow it down.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
