Zack's Kernel News

Zack's Kernel News

Article from Issue 229/2019

Chronicler Zack Brown reports on string handling routines and speeding up database workloads.

String Handling Routines

In the midst of a discussion on patch handling in the kernel development process, it came up that apparently a lot of patches are scripted constructs that are hard to synchronize between the various maintainers. So, someone asked for an example of such a patch, and someone else offered a recent set of changes from Julia Lawall for the stracpy() code.

A wrapper around strscpy(), stracpy() is a replacement for strncpy(), which is a function for copying strings. It was at this point that Linus Torvalds, essentially off-topic, weighed in on the string handling issue:

"I'm not actually convinced about stracpy() and friends.

"It seems to be yet another badly thought out string interface, and there are now so many of them that no human being can keep track of them.

"The 'badly thought out' part is that it (like the original strlcpy garbage from BSD) thinks that there is only one size that matters – the destination.

"Yes, we fixed part of the 'source is also limited' with strscpy(). It didn't fix the problem with different size limits, but at least it fixed the fundamentally broken assumption that the source has no size limit at all.

"Honestly, I really really REALLY don't want yet another broken string handling function, when we still have a lot of the old strlcpy() stuff in the tree from previous broken garbage.

"The fact is, when you copy strings, both the destination *AND* the source may have size limits. They may be the same. Or they may not be.

"This is particularly noticeable in the 'str*_pad()' versions. It's simply absolutely and purely wrong. I will note that we currently have not a single user or strscpy_pad() in the whole kernel outside of the testing code.

"And yes, we actually *do* have real and present cases of 'source and destination have different sizes'. They aren't common, but they do exist.

So I'm putting my foot down on yet another broken string copy interface from people who do not understand this fundamental issue."

Joe Perches replied, "I think you are mistaken about the stracpy limits as the only limit is not the source size but the dest. Why should the source be size limited?" Joe asked if Linus had actually looked at the stracpy() code. How could it be a problem, since it was only a wrapper around another function?

Linus replied:

"Yes, Joe, I have.

"What part of 'there are now so many of them that no human being can keep track of them' didn't you see as a problem?

"How many broken string functions are we going to do, adding yet another one when you notice that the _last_ one wasn't great?

"We never seem to remove the broken ones. We just add yet another one, and have a never-ending jumble of random letters."

And in response to Joe indicating that he didn't feel the source size needed to be limited, Linus replied with some detailed analysis, including samples of broken code he found in the current kernel tree:

"You just proved my point. You don't understand that sources can also be limited, and the limit on a source can be *smaller* than the limit of a destination.

"Did we learn *NOTHING* from the complete and utter disaster that was strlcpy()?

"Do you not understand why strlcpy() was unacceptably bad, and why the people who converted strncpy() to it introduced real bugs?

"The fact is, it's not just the destination that has a size limit. The source often has one too.

"And no, the source is not always guaranteed to be NUL-terminated, nor is the source buffer guaranteed to be larger than the destination buffer.

"Now, if you *know* that the source is smaller than the destination size, you can do:

len = strnlen(src, srclen);
memcpy(dst, len);
dst[len] = 0;

"and that's not wrong, but that works only when

(a) you actually do the above

(b) you have no data races on src (or you at least only require that 'dst' is NUL-terminated, not that 'len' is necessarily the correct length of the result

(c) you actually know as the programmer that yes, the source is definitely smaller than the destination.

"and honestly, people don't get _any_ of that right.

"For one thing, the buffer sizes of the source and destination may be two different things and some #define. It's hard to tell that one is always smaller than the other (or that they are always the same size). So then to get it right in the *general* case, you may need to do something like

if (srcsize < dstsize) {
  .. do the above ..
} else {

"do you see the reason?

"Do you see why strlcpy() is only safe to do when the allocation size of the source buffer is at least as big as the allocation size of the destination buffer?

"For example, I just grepped for some patterns, and I can find code like this in the kernel

name_len = strnlen(fileName, PATH_MAX);
name_len++;  /* trailing null */
strncpy(pSMB->fileName, fileName,   name_len);

"where pretty much everything is wrong. The comment is fundamentally wrong, and even spells "nul" wrong. Christ.

"Here's another one:

/* will be less than a page size */
len = strnlen(link,   ocfs2_fast_symlink_chars(  inode->i_sb));
kaddr = kmap_atomic(page);
memcpy(kaddr, link, len + 1);

"and notice how this time at least the comment is (hopefully) correct. But the code is wrong once again, because it doesn't actually do the correct pattern I showed above, it does a "memcpy(len+1)" instead. Bzzt. WRONG!

"What I think the code *wants* to do is

kaddr = kmap_atomic(page);
  // destination and destination
  // size limit
  kaddr, PAGE_SIZE,
  // source and source size limit
  link, ocfs2_fast_symlink_chars(    inode->i_sb)

"ie the destination has one size, and the source has another size, and the source may or may not be NUL-terminated.

"And then the programmer wouldn't have needed the comment, and wouldn't have needed to make sure that yes, ocfs2_fast_symlink_chars() is guaranteed to be less than PAGE_SIZE.

"Again, the code we actually _have_ in the kernel is not sensible. It doesn't actually nul-terminate the destination, despite clearly _trying_ to (note that "len+1" in the memcpy).

"Now, it's possible that it doesn't care about properly nul-terminating things. And it's possible that the source is always nul-terminated to begin with unless the filesystem is corrupted. But the code clearly _tries_ to do something, and fails.

"Because copying a string is complicated, particularly when the allocations for source and destination may be entirely different.

"On a happier note, I actually found a correct code case too. Our 'kstrndup()' function seems to actually be at a first glance entirely bug-free, and actually takes a limited source string size, and gives you back a nul-terminated destination string of a different size. Of course, that's a simple case, because the size of the destination is something that that function actually controls, so getting it right is actually somewhat simpler."

The main discussion (about patch handling in the kernel development process) continued, but I wanted to bring up this side issue. It's clearly not an urgent thing, because Linus grepped around in the kernel tree and found lots of broken and buggy code, but he didn't ask anyone to fix it. His concern is broader. He doesn't like the proliferation of special-case string handling routines that he feels should all be replaced by something clear, simple, and correct.

When Linus comes out with this sort of opinion, it's generally followed by various developers trying to give him what he requested. So probably in the next six months, we'll see an effort to make a proper string handling function and get rid of all the rest. I expect it'll look a lot like the struggle to create a good revision control system, except at the end he probably won't have to roll this one himself.

Speeding Up Database Workloads

Recently, Khalid Aziz from Oracle wanted to reclaim unused memory as fast as possible, to be available for other processes. In particular, he said, memory reclamation and compaction were normally triggered when the system was already running low on available RAM, which meant that running processes may have already begun to experience allocation delays.

Memory compaction is distinct from compression. In compression, you try to take up less space by recognizing repeating patterns in a given block of memory. In compaction, you just try to group allocated memory blocks together, in order to have larger available blocks of free space.

Ordinarily I'd expect this sort of issue to come from someone in the embedded systems industry, but Khalid is in the database universe, where computing resources are often tight as well. He pointed out that "In real life, forced direct reclamation has been seen to cause sudden spike in time it takes to populate a new database or an extraordinary unpredictable latency in launching a new server on cloud platform. These events create SLA [Service Level Agreement] violations which are expensive for businesses."

He wanted to implement a kernel feature to track memory trends over time, in order to predict when reclamation and compaction would become necessary, so that a running system could take steps to free up memory before it started to represent numbers of dollars lost.

The question was, how to actually capture memory usage data on a running system, such that the data collection itself didn't slow everything down. Essentially, even at the risk of lower accuracy, Khalid wanted to identify opportune moments to grab relevant data in order to graph free pages over time. He said, "Capturing these data points and computing trend line for pages of order 0-MAX_ORDER allows us to not only foresee free pages exhaustion point but also severe fragmentation points in future."

Khalid posted two patches to implement this. However, Michal Hocko objected, saying that this sort of analysis could be done just as easily from user space. Free memory pages could be tracked from anywhere, and he saw no need to load up the kernel with any of this logic.

Khalid replied, "the initial prototype to assess the feasibility of this approach was written in userspace for a very limited application. We wrote the initial prototype to monitor fragmentation and used /sys/devices/system/node/node*/compact to trigger compaction."

He went on, "The primary reason to implement this logic in the kernel is to make the kernel self-tuning. The more knobs we have externally, the more complex it becomes to tune the kernel externally. If we can make the kernel self-tuning, we can actually eliminate external knobs and simplify kernel admin. In spite of availability of tuning knobs and large number of tuning guides for databases and cloud platforms, allocation stalls is a routinely occurring problem on customer deployments. A best fit line algorithm shows immeasurable impact on system performance yet provides measurable improvement and room for further refinement."

But Michal pointed out that there were actually many approaches to measuring resource pressures. If Khalid's approach would be valuable in Oracle's particular use case, that didn't necessarily mean it would be beneficial for other users. Why have what seemed to be a specialized solution, in the official kernel tree?

Khalid didn't argue that particular point, but he said that in this case, the feature could really only be effective as an in-kernel mechanism. From inside the kernel, it would be possible for the kernel to tune itself very rapidly and delicately in response to the changing environment. If done in user space, those same subtle responses would not be available. Delays in accessing the data would lower their value.

But he explained that his code was intended to be generally useful, not just for Oracle workloads. He said, "It is true different workloads will have different requirements and that is what I am attempting to address here. Instead of tweaking the knobs statically based upon one workload requirements, I am looking at the trend of memory consumption instead. A best fit line showing recent trend can be quite indicative of what the workload is doing in terms of memory."

Michal wasn't convinced. And he also pointed out that the kernel did already have self-tuning mechanisms for these things. Rather than implementing new self-tuning features, he said, "let's talk about what is missing in the existing tuning we already provide. I do agree that compaction needs some love but I am under impression that min_free_kbytes and watermark_*_factor should give a decent abstraction to control the background reclaim. If that is not the case then I am really interested on examples because I might be easily missing something there."

Khalid remarked, "Something so fundamental to kernel memory management as making free pages available when they are needed really should be taken care of in the kernel itself. Moving it to userspace just means the kernel is hobbled unless one installs and tunes a userspace package correctly."

In terms of what is missing in existing internal memory-tuning mechanisms, Khalid said, "last week an email crossed my mailbox where an order 4 allocation failed on a server that has 768 GB memory and had 355,000 free pages of order 2 and lower available at the time. That allocation failure brought down an important service and was a significant disruption."

His goal was to allow the kernel to address these situations itself, without requiring system administrators to tweak configurations for each new workload.

Michal still didn't like it. He said, "existing autotuning works mostly ok for a vast variety of workloads. A more clever tuning is possible and people are doing that already. Especially for cases when the machine is heavily overcommitted." And he pointed out that in order for Khalid's patches to be accepted into the source tree, they would have to first be tested on a wide variety of workloads, in order to prove they did not pose a risk to existing users. Short of that, Michal would remain skeptical. So, given the difficulties of actually doing that level of testing, Michal said, "I would really focus on discussing whether we have sufficient APIs to tune the kernel to do the right thing when needed. That requires to identify gaps in that area."

There was a little more back-and-forth, but the discussion essentially petered out at that point. Khalid was disappointed, because, in terms of testing different workloads, he said, "I have seen DB and cloud server workloads suffer under default behavior of reclaim/compaction. It manifests itself as prolonged delays in populating new database and in launching new cloud applications."

But he acknowledged, "It is fair to ask for the predictive algorithm to be proven before pulling something like this in kernel. I will implement this same algorithm in userspace and use existing knobs to tune kernel dynamically. Running that with large number of workloads will provide data on how often does this help."

There are a lot of areas of kernel development that developers find frustrating. Security is one, because you never know when some obscure aspect of the kernel will turn out to have a security implication for some new feature you've just spent weeks implementing. There's just no knowing until you post your patch and someone says, "wait, there's a problem."

It's similar here. Any time you try to adjust the way memory is handled, or how the kernel switches between running processes, or anything else that is meant to speed up a particular workload, you're going to run into the problem that your fix needs to work not just on your workload, but on everyone else's too – in spite of the fact that you don't know "everyone else" and have no way to test your patch on all the systems of the world.

But over time, strangely enough, these difficult features do come into being, and they tend to become cleaner, more secure, and more generalizable.

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Kernel News

    Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

  • Virtual Performance

    Virtual performance tuning is a lot like ordinary performance tuning – but not exactly.

  • Working with the Kernel

    If you work with third-party hardware drivers, or even if you just need to fix a broken system, someday you might need to upgrade the Linux kernel.

  • Performance Tuning Toolbox

    Tune up your systems and search out bottlenecks with these handy performance tools.

  • Kernel Hacks Intro

    If you get right down to it, the Linux kernel is the real Linux. This month we focus on tools for tuning and tailoring the kernel.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95