Charting expletives from the Linux Kernel Mailing List
Climate Study
Kernel amateurs are best advised to read summaries of the heated discussions on the Linux Kernel Mailing List (LKML) before they delve in. We analyze 2.5 million postings to study the density of cursing.
Every now and then, a message reaches social media that Linux boss Linus Torvalds has flipped out once again and dressed down kernel colleagues with rude words. Some Linux enthusiasts look on this with amusement, enjoying the tirades of the great dictator over a cool drink after work; others see the harsh nature of the language as representing an intimidating boys' club culture that privileges insiders.
The issue of language on the kernel list has been in the foreground for the last few years. In 2013, Intel developer Sarah Sharp led an effort to improve civility among kernel developers [1], and Red Hat's Lennart Poettering has also spoken up for more politeness and less abusive language [2].
In 2015, Linus responded to criticism by posting a Code of Conflict [3] that affirms the need for civility in the code review process, instructing developers to contact the Linux Foundation's Technical Advisory Board if they feel the process is threatening or abusive, and ending with a directive to not let things get personal:
As a reviewer of code, please strive to keep things civil and focused on the technical issues involved. We are all humans, and frustrations can be high on both sides of the process. Try to keep in mind the immortal words of Bill and Ted, "Be excellent to each other."
Whether you favor the harsh language of some on the kernel list, or whether you still see room for reform, you might have noticed that most of the discussion centers around anecdotes and opinions – no one ever seems to quantify it.
We decided to work through this phenomenon mathematically. For the dataset, we used 2.5 million LKML posts, which were first fed into a MySQL database, and then beaten with Perl and R scripts and presented graphically.
Figure 1 shows how the LKML has grown, measured in posts per year, over the 20 years from 1996 to the present day, with the partial year 2016 extrapolated proportionally. The almost linear increase, from 20,000 posts in 1996 to an estimated figure exceeding 270,000 for the current year of 2016, is evidence of the natural growth of the project and its uninterrupted popularity.
Long Tail
What about the number of members? Do most of the posts come from a few extra-active highfliers, with the rest forming a long tail of Linux hobbyists who only write once or twice a year? An R script reads the metadata exported from MySQL in CSV format and draws the graphic in Figure 2.
It turns out that a few top posters over the decades have fired off more than 30,000 emails; a few dozen members, Torvalds himself among them, more than 10,000; and then around another 100 have exceeded 5,000. As expected, the curve levels off on its right side.
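The ranking behind a plot like Figure 2 can be sketched in a few lines. The article's own scripts were written in Perl and R; the following is a Python stand-in under assumed inputs: a list with one sender name per post. The names, counts, and threshold are invented for illustration and are not LKML data.

```python
from collections import Counter

def rank_posters(senders):
    """Return (sender, post_count) pairs sorted from most to least active."""
    return Counter(senders).most_common()

def count_above(ranking, threshold):
    """Count how many senders wrote more than `threshold` posts."""
    return sum(1 for _, n in ranking if n > threshold)

# Hypothetical sample data: one entry per post, not real LKML figures.
senders = ["alice"] * 5 + ["bob"] * 3 + ["carol"] + ["dave"]
ranking = rank_posters(senders)

print(ranking[:2])              # the two most active posters
print(count_above(ranking, 2))  # senders with more than 2 posts
```

Plotting the counts from `ranking` in order gives the falling curve with the long tail on the right; `count_above` answers questions such as how many members have exceeded 5,000 posts.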
Expletives
Before analyzing civility on the LKML, it is necessary to clarify when exactly a word is a swear word. Clearly, what is considered profane depends strongly on the cultural environment. One possible approach is offered by the gold standard prevailing in the US: the "Seven Words You Can Never Say on Television" compiled by the comedian George Carlin in 1972, referencing words that no broadcast television or radio station in the US could send into the ether without first masking them with an annoying 1 kHz tone [4] (subscription channels like HBO are the exception).
You can probably guess most of the seven words, which, predictably, center on sex acts, body parts, and bodily functions, but if you have any questions, search for the "seven dirty words" on Wikipedia [5]. If you do not know them all, you are very welcome to use an online dictionary on your own for clarification, but please only do this with your browser set to "incognito" mode.
The CPAN Perl module Regexp::Common is available to determine whether a text includes one of the vulgarities; it searches for them at lightning speed with regular expressions using the profanity key. The filter, however, will not find coded phrasings or blanked-out words such as f*ck; the regular expressions would have to be expanded for this.
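To illustrate the limitation, here is a minimal Python sketch of a word-list filter and one possible expansion for blanked-out spellings. It is not the Regexp::Common pattern. The two-word list, the names `PLAIN`, `MASKED`, `masked_pattern`, and `has_profanity`, and the rule that any interior letter may be replaced by an asterisk are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny word list; a real filter would be far longer.
WORDS = ["crap", "damn"]

# Plain filter: whole words only, case-insensitive.
PLAIN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, WORDS)) + r")\b", re.IGNORECASE
)

def masked_pattern(word):
    """Build a pattern in which any interior letter may be replaced by '*'."""
    inner = "".join(f"(?:{re.escape(c)}|\\*)" for c in word[1:-1])
    return re.escape(word[0]) + inner + re.escape(word[-1])

# Expanded filter: also catches blanked-out spellings such as "cr*p".
MASKED = re.compile(
    r"(?<!\w)(?:" + "|".join(masked_pattern(w) for w in WORDS) + r")(?!\w)",
    re.IGNORECASE,
)

def has_profanity(text, pattern=PLAIN):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None

print(has_profanity("What a bunch of crap"))          # plain word is found
print(has_profanity("What a bunch of cr*p"))          # masked word slips through
print(has_profanity("What a bunch of cr*p", MASKED))  # expanded pattern finds it
```

The word-boundary checks keep harmless words such as "scrappy" from matching. Each further disguise (digits for letters, inserted spaces) would need its own extension of the pattern.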
But it also finds words that sound offensive to European ears. While an American might think nothing of the expression "a bunch of crap," except perhaps to find it funny depending on the context, Her Britannic Majesty might not be amused at high tea.
If you use regexes to trawl through the historic contributions to the LKML by Linus Torvalds, the filter jumps to July 1996 for the first instance. The member Aaron Tiensivu had written, under the title "Not a Bible Thumper," that the most amazing profanities were concealed in the kernel code (Figure 3). The discussion took its course until Torvalds exercised his authority and stated that, although he was opposed to political correctness, he also didn't see a point in being intentionally rude for no reason, adding ambiguously, "The reason the active kernel messages should be nice is that while I hate politically correct, I do not believe in being actively offensive either except when I _want_ to offend somebody. And there is no point in offending the occasional user."
More recently, Torvalds has also not shied away from arguing with a coarse tone that, if used against work colleagues in an American company, probably would have seen the HR department called to the scene immediately. At the end of 2012, he berated a maintainer who had not, in his opinion, understood the first rule of kernel maintenance: "We do not break userspace." He told the maintainer to "shut the fuck up"; a kernel change that causes problems for a userland program would always be a bug in the kernel (Figure 4).
How has the use of profanities on the LKML developed over time? Figure 5 shows two peaks, in 2000 and 2008, at around 1,200 expletive emails, with the last decade exhibiting a strongly falling trend. Because the number of postings per year keeps rising, the share of posts containing profanity is dropping even faster than the raw count suggests. However, the figure for 2016 only shows the postings up to July, so the adjusted figure would probably be around the 2015 level.
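Both adjustments mentioned here are simple arithmetic: relating the flagged posts to the total volume, and scaling a partial year up to twelve months. The Python sketch below shows one way to do it. It is not the article's Perl or R code, the function names are hypothetical, and the numbers are invented for illustration rather than taken from the LKML data.

```python
def rate_per_thousand(flagged, total):
    """Flagged posts per 1,000 postings in a year."""
    return 1000.0 * flagged / total

def project_full_year(count_so_far, months_covered):
    """Scale a partial-year count to 12 months, assuming a steady rate."""
    return count_so_far * 12 / months_covered

# Invented numbers for illustration only.
print(rate_per_thousand(flagged=600, total=200_000))          # 3.0
print(project_full_year(count_so_far=350, months_covered=7))  # 600.0
```

The projection assumes that posts arrive evenly over the year, which is only a rough approximation for a mailing list with seasonal lulls.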
Who uses the most swear words? Listing 1 shows how many expletive posts each of the ten biggest boors sent out. At the top is the dictator himself. The list includes a number of non-native speakers; in my experience, non-natives often fling around expletives in English with little sensitivity, to disguise their limited vocabulary. That said, the top 10 also includes some native English speakers.
Listing 1
Top Swearers
01 Linus Torvalds ........ 1308
02 Alexander Viro ........  759
03 Peter Zijlstra ........  548
04 Rik van Riel ..........  397
05 Thomas Gleixner .......  324
06 Alan Cox ..............  322
07 Andrew Morton .........  278
08 Ingo Molnar ...........  250
09 Christoph Hellwig .....  243
10 Benjamin Herrenschmidt   180
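A ranking like Listing 1 comes down to filtering posts with the profanity pattern and counting the senders of the hits. The Python sketch below is a stand-in for the article's Perl scripts and makes several assumptions: posts arrive as (sender, body) pairs, the one-word `PATTERN` replaces the real filter, and the sample posts are invented.

```python
from collections import Counter
import re

# Hypothetical stand-in for the profanity filter: one mild word only.
PATTERN = re.compile(r"\bcrap\b", re.IGNORECASE)

def top_swearers(posts, n=10):
    """posts: iterable of (sender, body) pairs. Each flagged post counts once."""
    counts = Counter(sender for sender, body in posts if PATTERN.search(body))
    return counts.most_common(n)

# Invented sample posts, not real LKML messages.
posts = [
    ("alice", "This patch is crap."),
    ("bob", "Looks fine to me."),
    ("alice", "Crap, I missed the lock."),
    ("carol", "What crap is this?"),
]

for sender, count in top_swearers(posts):
    print(f"{sender:<10} {count:>4}")
```

Counting posts rather than individual words means that one message with several expletives weighs the same as a message with a single one.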
What range of words do the maintainers use during their stressful work? Nothing out of the ordinary, as you can see from the pie chart in Figure 6: The list fits pretty closely with the usual repertoire of the American construction worker. The clear favorite is the word "crap."
Conclusion
When used in moderation, a strong word can definitely prevent any possible misunderstandings. Linus has said his use of language is intended to keep developers alert and doing their best work – to fix the problems first before sending problematic code up the development tree. On the other hand, Linux bills itself as a meritocracy, and if worthy and potentially productive programmers are choosing not to participate because they are put off by intimidating and sometimes abusive language, the result is a loss for Linux.
Of course, the study described in this article does not attempt to uncover intimidation or abuse but is only searching for the presence of words. As Sarah Sharp points out in a 2013 kernel list post summarizing her position [7], it is possible to use obscenities in a way that is not personally abusive. Saying "If you give a flying fuck about diversity, you should avoid verbal abuse" is not the same as saying "SHUT THE FUCK UP."
Still, real numbers offer real insights into the use of language on the kernel list, and the fact that foul language is on a downward trend should be of some comfort to those who argue for better word choice.
Infos
[1] Sarah Sharp post on civility: https://lkml.org/lkml/2013/7/15/329
[2] Lennart Poettering post on civility: https://plus.google.com/app/basic/stream/z13rdjryqyn1xlt3522sxpugoz3gujbhh04
[3] Linux Code of Conflict: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b0bc65729070b9cbdbb53ff042984a3c545a0e34
[4] Bleep censor: https://en.wikipedia.org/wiki/Bleep_censor
[5] Seven Dirty Words: https://en.wikipedia.org/wiki/Seven_dirty_words
[6] Linus Torvalds, "Re: Not a bible thumper. . .": https://lkml.org/lkml/1996/7/20/1
[7] Sarah Sharp's summary: https://lkml.org/lkml/2013/7/19/634
[8] Listings for this article: ftp://www.linux-magazine.com/pub/listings/magazine/192/Perl