Zack's Kernel News

Article from Issue 174/2015

Chronicler Zack Brown reports on the latest news, views, dilemmas, and developments within the Linux kernel community.

Permanent Deletion

Alexander Holler was unsatisfied with the way filesystems typically delete files. To save time, deleting a file usually just means marking its range of data as available rather than in use. The problem with this approach is that relatively easy-to-use tools let a stranger who obtains your hard drive recover your private data. Alexander wanted to allow regular users to truly wipe their data from storage media, rather than just have it *appear* to be gone.

Alexander's simple solution was to "overwrite the contents of a file by request from userspace. Filesystems do know where on the storage they have written the contents to, so why not just let them delete that stuff themselves instead."
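In the meantime, the closest userspace can get on its own is to overwrite a file's contents in place before unlinking it, which is roughly what tools like shred do. The Python sketch below illustrates the idea; as Alan Cox explains further on, the drive or filesystem may silently remap the writes, so this only raises the bar for recovery rather than guaranteeing anything.

```python
import os

def overwrite_and_unlink(path, passes=1):
    """Overwrite a file's contents with zeros, then unlink it.

    Caveat: the storage device or filesystem may remap writes
    (copy-on-write, wear leveling, bad-block sparing), so the old
    data may survive elsewhere on the medium.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(b"\x00" * size)
            f.flush()
            os.fsync(f.fileno())  # push the overwrite to the device
    os.unlink(path)
```

This is also exactly what Russ Dill's point undermines: an editor that copies the data around before you ever call such a function leaves stray copies this code never touches.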

He posted some patches, implementing a new system call that would delete files this way. Alan Cox, however, rained all over that parade. Alan said:

The last PC hard disks that were defined to do what you told them were ST-506 MFM and RLL devices. IDE disks are basically 'disk emulators', SSDs vastly more so.

An IDE disk can do what it likes with your I/O so long as your requests and returns are what the standard expects. So for example if you zero a sector, it's perfectly entitled to set a bit in a master index of zeroed sectors. You can't tell the difference and externally it looks like an ST506 disc with extensions. Even simple devices may well move blocks around to deal with bad blocks, or high usage spots to avoid having to keep rewriting the tracks either side.

An SSD internally has minimal relationship to a disc. If you have the tools to write a file, write over it, discard it and then dump the flash chips you'll probably find it's still there.

Alexander thanked Alan for the info but said that he wasn't looking for ways to truly make data recovery impossible. He just wanted to make it inconvenient for ordinary "black hat" type people, who didn't have government-sized resources.

Russ Dill pointed out that Alexander's hopes were most likely doomed to failure. He posted some strace output from a vim session, showing that the data was copied to new locations as a matter of course, as a way to avoid catastrophic misery after unpredicted system crashes. He also reiterated what Alexander himself had said – that filesystems don't cooperate with real deletion.

Alexander pointed out that, in spite of these obstacles, his patch was still an improvement over the kernel's current behavior. So, even if the data still existed to some extent on the drive, it would at least require significant resources to re-humpty-dumptify.

The discussion ended with no real conclusion. I would guess, however, that Alexander's patch would not be seen as a true improvement by Linus Torvalds or the other big-timers. They'd probably say that if data were still available to be recovered, then folks would write code to make it easier to recover. They'd also probably say that the right place to implement Alexander's features would be at the filesystem layer, providing a given filesystem with the ability to track and permanently delete all data associated with a given file. But, I don't know for sure.

Resource Constraints in cgroups

Aleksa Sarai wanted to enhance cgroups (the building blocks of the whole anything-aaS explosion currently sweeping the globe) to limit the number of processes a group could run. The whole point of cgroups is to create a bubble of limited resources that resembles an independently running Linux system. The bubble includes CPUs, RAM, physical storage, and whatnot. Aleksa wanted to add a process-count constraint to the bubble and posted a patch to implement it.
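A constraint like this would presumably be configured the same way as other cgroup knobs: by writing a value into a control file under the cgroup filesystem. The sketch below assumes a pids.max control file with "max" meaning unlimited – in line with Aleksa's proposal, though the exact file name and semantics here are illustrative rather than taken from the patch.

```python
import os

def set_pids_limit(cgroup_dir, limit):
    """Set a task-count limit on a cgroup directory.

    Assumes a pids-controller-style interface exposing a 'pids.max'
    control file; passing None writes 'max', removing the limit.
    """
    value = "max" if limit is None else str(limit)
    with open(os.path.join(cgroup_dir, "pids.max"), "w") as f:
        f.write(value)
```

In practice you would point this at a directory under /sys/fs/cgroup, which requires appropriate privileges.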

Tejun Heo, however, replied that this type of resource wasn't appropriate for cgroups to control. He said that a better approach, and one that had already been implemented, was to have cgroups constrain the amount of kernel memory available to each group of processes.

Richard Weinberger asked Tejun if the plan was "to limit kernel memory per cgroup such that fork bombs and stuff cannot harm other groups of processes." Tejun said that, yes, they were very close to implementing that in the kmemcg code.

Austin Hemmelgarn, however, pointed out that RAM limitation wasn't the only reason to want to limit the number of processes. Constraining the process count would make it easier to ensure that certain tools, like the NTP daemon, which needed just so many processes and no more, were running properly. It would also prevent certain denial-of-service attacks.

Tejun thought that all of Austin's examples represented niche areas that could be handled in a simpler and less heavy-handed way than adding another cgroup controller. Tejun added, "I'm pretty strongly against adding controllers for things which aren't fundamental resources in the system." So, he went on, constraints on things like the number of open files, number of pipe buffers, and so on, were all things he'd oppose.

Tim Hockin, however, pointed out that Tejun's idea of limiting kernel memory via kmemcg had been promised years earlier and was so long overdue that something like Aleksa's patch might as well be accepted as actually addressing the problem right now.

Tejun agreed that the kmemcg plan was taking longer than expected but noted that "kmemcg reclaimer just got merged and … the new memcg interface which will tie kmemcg and memcg together." And, he told Tim to butt out or make a meaningful contribution.

Tim replied, "I'm just vocalizing my support for this idea in defense of practical solutions that work NOW instead of 'engineering ideals' that never actually arrive. As containers take the server world by storm, stuff like this gets more and more important."

Tejun said, "As for the never-arriving part, well, it is arriving. If you still can't believe, just take a look at the code." He added:

Note that this is [a] subset of a larger problem … there's a patchset trying to implement writeback IO control from the filesystem layer. cgroup control of writeback has been a thorny issue for over three years now and the rationale for implementing this reversed controlling scheme is about the same – doing it properly is too difficult, let's bolt something on the top as a practical measure.

I think it'd be seriously short-sighted to give in and merge all those. These sorts of shortcuts are crippling in the long term. Again, similarly, proper cgroup writeback support is literally right around the corner.

The situation sure can be frustrating if you need something now but we can't make decisions solely on that. This is a lot longer term project and we better, for once, get things right.

Austin reentered the discussion at this point, addressing Tejun's idea of only wanting to constrain fundamental system resources like RAM size and disk space. He said, "PIDs are a fundamental resource, you run out and it's an only marginally better situation than OOM, namely, if you don't already have a shell open which has kill built in (because you can't fork), or have some other reliable way to terminate processes without forking, you are stuck either waiting for the problem to resolve itself, or have to reset the system."

So, Austin supported Aleksa's patch as a way to constrain the number of PIDs used by a virtual system. Tejun acknowledged that this was a valid point and said he'd give it more thought and see what he could come up with. On a technical note, he added, "Currently, we're capping max pid at 4M which translates to some tens of gigs of memory which isn't a crazy amount on modern machines. The hard(er) barrier would be around 2^30 (2^29 from futex side, apparently) which would also be reachable on configurations w/terabytes of memory."

The thread actually devolved into a minor flame war between Tejun and Tim, and so the technical side of things petered out. However, if Tejun is right that the kmemcg code is nearly ready, the disagreement may become moot at some point. In the meantime, nothing more on the PID issue was said.

Ultimately, it seems that Tejun's fundamental point is that cgroups should be implemented in the way that makes the best abstract sense, rather than the way that solves the most immediately desired problems. The unspoken argument behind this is that cgroup security is hard, and we don't want our future selves to regret shortcuts we took today.

On the Aleksa side of things, the main point seems to be that cgroups are useful and they should support useful features rather than mapping to an arbitrary metaphor like creating a "virtual system." Both sides of the argument have merit, but I'm betting the security-obsessed side will tend to win out when it comes time to convince Linus Torvalds to accept a patch.

Tracing Gets Its Own FS

Steven Rostedt submitted patches to implement a new TraceFS filesystem for the tracing subsystem, which had relied on DebugFS up until that point. The problem with using DebugFS for tracing, Steven said, was that mounting DebugFS exposed the debugging interfaces of subsystems throughout the kernel, which might not be what you wanted. He said, "there are systems that would like to perform tracing but do not mount debugfs for security reasons. That is because any subsystem may use debugfs for debugging, and these interfaces are not always tested for security." A new TraceFS would allow users to access the tracing subsystem without all that overhead and risk.

Steven also pointed out that tracing was beginning to outgrow DebugFS's features. He said, "debugfs does not support the system calls for mkdir and rmdir. Tracing uses these system calls to create new instances for sub buffers. This was done by a hack that hijacked the dentry ops from the 'instances' debugfs dentry, and replaced it with one that could work. Instead of using this hack, tracefs can provide a proper interface to allow the tracing system to have a mkdir and rmdir feature."

He added, "To maintain backward compatibility with older tools that expect that the tracing directory is mounted with debugfs, the tracing directory is still created under debugfs and tracefs is automatically mounted there."

It seems very clear that Linus Torvalds will accept this code – he tried to accept it into Linux 4.0, but Steven held it back himself. It turned out that there were some technical obstacles to overcome before the code would fit properly into the kernel.

Specifically, the perf tools had hardcoded the assumption that the tracing directory would be mounted under DebugFS; so they wouldn't see the tracing directory if it were mounted in any other way. Steven posted a patch to fix this, and it was accepted by Arnaldo Carvalho de Melo. However, that change didn't make it into Linus's 4.0 code, so Steven decided to wait for perf to catch up before resubmitting the TraceFS filesystem.
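The fix in perf amounts to probing a list of candidate mount points instead of hardcoding one. A tool written against both old and new kernels might do something like the following sketch, where the helper name and the exact fallback list are illustrative: try the native tracefs location first, then fall back to the backward-compatible directory under debugfs that Steven preserved.

```python
import os

# Candidate mount points, newest first: a native tracefs mount,
# then the legacy location under debugfs.
TRACING_DIRS = ["/sys/kernel/tracing", "/sys/kernel/debug/tracing"]

def find_tracing_dir(candidates=TRACING_DIRS, exists=os.path.isdir):
    """Return the first tracing directory that exists, or None.

    The 'exists' predicate is injectable so the lookup can be
    exercised without a mounted tracefs or debugfs.
    """
    for path in candidates:
        if exists(path):
            return path
    return None
```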

Another interesting issue that emerged briefly but went nowhere was the possibility that TraceFS should be based on KernFS. Greg Kroah-Hartman originally made the suggestion, and Tejun Heo argued in favor of this as well. However, it turned out that KernFS had its own complexities, as well as poor documentation – Tejun said at one point, "I didn't write any while extracting it out of sysfs. Sorry about that. I should get to it."

In response to Greg's suggestion, Al Viro said, "I would recommend against that – kernfs is overburdened by their need to accommodate cgroup weirdness. IMO it's not a good model for anything, other than an anti-hard-drugs poster ('don't shoot that shit, or you might end up hallucinating _this_')."

Steven remarked, "OK, I'm not the only one that thought kernfs seemed to go all over the place. I guess I now know why. It was more of a hook for cgroups. I can understand why cgroups needed it, as I found that creating files from a mkdir and removing them with rmdir causes some pain in vfs with handling of locking." Eventually, he said, "I think I'm convinced that kernfs is not yet the way to go. I'm going to continue on with my current path."

The Author

The Linux kernel mailing list comprises the core of Linux development activities. Traffic volumes are immense, often reaching 10,000 messages in a week, and keeping up to date with the entire scope of development is a virtually impossible task for one person. One of the few brave souls to take on this task is Zack Brown.
