Charting expletives from the Linux Kernel Mailing List

Climate Study

Lead Image © Cornelius, Fotolia.com

Article from Issue 192/2016

Kernel amateurs are best advised to read summaries of the heated discussions on the Linux Kernel Mailing List (LKML) before they delve in. We analyze 2.5 million postings to study the density of cursing.

Every now and then, word reaches social media that Linux boss Linus Torvalds has flipped out once again and dressed down kernel colleagues with rude words. Some Linux enthusiasts look on this with amusement, enjoying the tirades of the great dictator over a cool drink after work; others see the harsh nature of the language as representing an intimidating boys' club culture that privileges insiders.

The issue of language on the kernel list has been in the foreground for the last few years. In 2013, Intel developer Sarah Sharp led an effort to improve civility among kernel developers [1], and Red Hat's Lennart Poettering has also spoken up for more politeness and less abusive language [2].

In 2015, Linus responded to criticism by posting a Code of Conflict [3] that affirms the need for civility in the code review process, instructing developers to contact the Linux Foundation's Technical Advisory Board if they feel the process is threatening or abusive, and ending with a directive to not let things get personal:

As a reviewer of code, please strive to keep things civil and focused on the technical issues involved. We are all humans, and frustrations can be high on both sides of the process. Try to keep in mind the immortal words of Bill and Ted, "Be excellent to each other."

Whether you favor the harsh language of some on the kernel list, or whether you still see room for reform, you might have noticed that most of the discussion centers around anecdotes and opinions – no one ever seems to quantify it.

We decided to work through this phenomenon mathematically. For the dataset, we used 2.5 million LKML posts, which were first fed into a MySQL database, and then beaten with Perl and R scripts and presented graphically.

Figure 1 shows how the LKML has grown, measured by the number of posts per year over 20 years, from 1996 to the present day; the partial count for 2016 is projected proportionally to a full year. The almost linear increase, from 20,000 posts in 1996 to an estimated figure exceeding 270,000 for the current year of 2016, is evidence of the natural growth of the project and its uninterrupted popularity.

Figure 1: Number of posts on the LKML over 20 years.
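The proportional projection for the incomplete year can be sketched in a few lines. The following is a minimal Python sketch, not the Perl and R code used for the study; the function name and the partial count of 157,500 posts are hypothetical, chosen only so that a seven-month count scales to the roughly 270,000 posts estimated above.

```python
def project_full_year(posts_so_far, months_elapsed):
    """Scale a partial-year post count linearly to twelve months."""
    if not 1 <= months_elapsed <= 12:
        raise ValueError("months_elapsed must be between 1 and 12")
    return round(posts_so_far * 12 / months_elapsed)


# Hypothetical input: 157,500 posts counted in the first seven months.
estimate = project_full_year(157_500, 7)
print(estimate)
```

A linear projection like this assumes that posting activity is spread evenly across the year, which is a simplification.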

Long Tail

What about the members? Do most of the posts come from a few extra-active highfliers, with the rest forming a long tail of Linux hobbyists who only write once or twice a year? An R script reads the metadata re-exported from MySQL into CSV format and prints the graphic in Figure 2.

Figure 2: A few top posters contribute more than 30,000 posts, and a few dozen members more than 10,000.

It turns out that a few top posters over the decades have fired off more than 30,000 emails; a few dozen members, Torvalds himself among them, more than 10,000; and then around another 100 have exceeded 5,000. As expected, the curve levels off on its right side.
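The ranking behind such a long-tail curve boils down to counting posts per author and sorting the result. The sketch below uses Python rather than the R script mentioned above; the CSV layout (an author column, one row per post) and the sample rows are assumptions for illustration, not the study's actual export format.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export: one row per post, with an "author" column.
SAMPLE_CSV = """author,year
alice,2001
alice,2002
alice,2003
bob,2001
bob,2004
carol,2010
"""


def posts_per_author(csv_text):
    """Count how many rows (posts) each author has."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["author"] for row in reader)


def authors_above(counts, threshold):
    """Number of authors with more than `threshold` posts."""
    return sum(1 for n in counts.values() if n > threshold)


counts = posts_per_author(SAMPLE_CSV)
ranked = counts.most_common()  # sorted from most to fewest posts
print(ranked)
print(authors_above(counts, 1))
```

Plotting the sorted counts from left to right yields the steep head and flat tail described above.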

Expletives

Before analyzing civility on the LKML, it is necessary to clarify when exactly a word counts as a swear word. Clearly, what is considered profane depends strongly on the cultural environment. One possible approach is offered by the gold standard prevailing in the US: the "Seven Words You Can Never Say on Television," compiled by the comedian George Carlin in 1972. The list names words that no broadcast television or radio station in the US could send into the ether without first masking them with an annoying 1 kHz tone [4] (subscription channels like HBO are the exception).

You can probably guess most of the seven words, which, predictably, center on sex acts, body parts, and bodily functions, but if you have any questions, search for the "seven dirty words" on Wikipedia [5]. If you do not know them all, you are very welcome to use an online dictionary on your own for clarification, but please only do this with your browser set to "incognito" mode.

The CPAN Perl module Regexp::Common is available to determine whether a text includes one of the vulgarities; it searches for them at lightning speed with regular expressions using the profanity key. The filter, however, will not find coded phrasings or blanked-out words such as f*ck; the regular expressions would have to be expanded for this.

But it also finds words that sound offensive to European ears. While an American might think nothing of the expression "a bunch of crap," except perhaps to find it funny depending on the context, Her Britannic Majesty might not be amused at high tea.
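To illustrate how a word-list filter could be widened to catch blanked-out spellings, here is a minimal Python sketch. It does not use Regexp::Common; the two-word placeholder list, the set of masking characters, and the function names are all assumptions made for this example.

```python
import re

# Placeholder word list; the study relied on Regexp::Common's built-in list.
WORDS = ["crap", "damn"]


def masked_pattern(word):
    """Allow each inner letter to be replaced by a masking character."""
    if len(word) < 3:
        return re.escape(word)
    inner = "".join(f"(?:{re.escape(c)}|[*#@$!_-])" for c in word[1:-1])
    return re.escape(word[0]) + inner + re.escape(word[-1])


# Require a non-word character (or the string boundary) on both sides,
# so that "scrap" does not match "crap".
PROFANITY = re.compile(
    r"(?<!\w)(?:" + "|".join(masked_pattern(w) for w in WORDS) + r")(?!\w)",
    re.IGNORECASE,
)


def contains_profanity(text):
    """Return True if the text contains a listed word, plain or masked."""
    return PROFANITY.search(text) is not None


print(contains_profanity("what a load of cr*p"))
print(contains_profanity("scrap the patch"))
```

The strict boundaries keep false positives down but also skip inflected forms; a production filter would need a longer list and more variants.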

If you use regexes to trawl through the historic contributions to the LKML by Linus Torvalds, the filter jumps to July 1996 for the first instance. The member Aaron Tiensivu had written, under the title "Not a Bible Thumper," that the most amazing profanities were concealed in the kernel code (Figure 3). The discussion took its course until Torvalds exercised his authority and stated that, although he was opposed to political correctness, he also didn't see a point in being intentionally rude for no reason, adding ambiguously, "The reason the active kernel messages should be nice is that while I hate politically correct, I do not believe in being actively offensive either except when I _want_ to offend somebody. And there is no point in offending the occasional user."

Figure 3: A post denounces curse words used in the kernel code.

More recently, Torvalds has also not shied away from arguing with a coarse tone that, if used against work colleagues in an American company, probably would have seen the HR department called to the scene immediately. At the end of 2012, he berated a maintainer who had not, in his opinion, understood the first rule of kernel maintenance: "We do not break userspace." He told the maintainer to "shut the fuck up"; a kernel change that causes problems for a userland program would always be a bug in the kernel (Figure 4).

Figure 4: Linus Torvalds goes after a maintainer.

How has the use of profanity on the LKML developed over time? Figure 5 shows two peaks, in 2000 and 2008, with around 1,200 expletive emails each, and a strongly falling trend over the last decade. Because the number of postings per year is constantly increasing, the share of posts containing foul language is dropping even more sharply than the raw count suggests. However, the figure for 2016 only shows the postings up to July, so the adjusted figure would probably be around the 2015 level.

Figure 5: Post count with foul language over the years.
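Because the list itself keeps growing, the raw count is easier to interpret when normalized by the total number of posts per year. The following Python sketch shows one way to compute such a rate; the function name and all numbers are hypothetical and do not come from the study's data.

```python
def profanity_rate(flagged, total):
    """Flagged posts per 1,000 posts for each year with a nonzero total."""
    return {
        year: 1000 * flagged[year] / total[year]
        for year in sorted(flagged)
        if total.get(year)
    }


# Hypothetical counts for illustration only; not the article's data.
flagged = {2001: 50, 2002: 50}
total = {2001: 1000, 2002: 2000}
rates = profanity_rate(flagged, total)
print(rates)
```

In this made-up example, the flagged count stays constant while the rate halves, which is the effect described above.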

Who uses the most swear words? Listing 1 shows how many posts containing expletives the ten biggest boors sent out. At the top is the dictator himself. The list includes a number of non-native speakers – in my experience, non-native speakers often fling around English expletives with little sensitivity, sometimes to disguise a limited vocabulary. That said, the top 10 also includes some native English speakers.

Listing 1

Top Swearers

01 Linus Torvalds ........ 1308
02 Alexander Viro ........  759
03 Peter Zijlstra ........  548
04 Rik van Riel ..........  397
05 Thomas Gleixner .......  324
06 Alan Cox ..............  322
07 Andrew Morton .........  278
08 Ingo Molnar ...........  250
09 Christoph Hellwig .....  243
10 Benjamin Herrenschmidt   180

What range of words do the maintainers use during their stressful work? Nothing out of the ordinary, as you can see from the pie chart in Figure 6: The list fits pretty closely with the usual repertoire of the American construction worker. The clear favorite is the word "crap."

Figure 6: The most popular expletives on the mailing list.
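The shares in a chart like this come from tallying each matched word across all posts. Below is a minimal Python sketch of that tally; the word list and sample posts are placeholders, and the helper name is hypothetical.

```python
import re
from collections import Counter

WORDS = ["crap", "damn"]  # placeholder list, not the study's word list
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, WORDS)) + r")\b", re.IGNORECASE
)


def word_shares(posts):
    """Return each matched word's share of all matches, most frequent first."""
    counts = Counter(
        match.lower() for post in posts for match in PATTERN.findall(post)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.most_common()}


# Made-up sample posts for illustration.
sample_posts = ["This is crap.", "Crap, damn it.", "crap again", "fine patch"]
shares = word_shares(sample_posts)
print(shares)
```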

Conclusion

When used in moderation, a strong word can definitely prevent any possible misunderstandings. Linus has said his use of language is intended to keep developers alert and doing their best work – to fix the problems first before sending problematic code up the development tree. On the other hand, Linux bills itself as a meritocracy, and if worthy and potentially productive programmers are choosing not to participate because they are put off by intimidating and sometimes abusive language, the result is a loss for Linux.

Of course, the study described in this article does not attempt to uncover intimidation or abuse but is only searching for the presence of words. As Sarah Sharp points out in a 2013 kernel list post summarizing her position [7], it is possible to use obscenities in a way that is not personally abusive. Saying "If you give a flying fuck about diversity, you should avoid verbal abuse" is not the same as saying "SHUT THE FUCK UP."

Still, real numbers offer real insights into the use of language on the kernel list, and the fact that foul language is on a downward trend should be of some comfort to those who argue for better word choice.
