Needle in a Haystack

A Case for Redundancy

One reason to extract everything by default is just to show what is possible: If you really want or need it, you can automatically save the text content of every email you ever received as one single file.

The other reason to extract everything is the same reason why, by default, the script uses four different tools: the very nature and age of email. The good part of email being an open, decades-old standard is that we can still read very old messages. The bad part, especially when dealing with old email archives, is that, over the years, countless email applications of variable quality, plus careless email users, applied that standard with great creativity.

A generic old email archive may contain attachments in long-forgotten formats, or attachments named with weird characters and non-standard extensions, or with no extension at all.

Email bodies can be nested in the strangest ways: For example, think of an email in both text and HTML format, with attachments, that has been forwarded and then replied to from an email digest. Even a message's "plain text" may be encoded in any way possibly conceived over the last 50 years. Oh, and have you ever tried to count how many different combinations of language and format there are to write a "simple" date?

Even the way in which archives have been stored on disk is relevant. Think of a Maildir first compressed and saved to a floppy, later copied on a CD-ROM and then back onto a hard drive, and finally to the cloud – every time on a different filesystem. In general, each of these migrations might alter the name, access, or creation time of an email file.

This is why the script does the "same" job four times. I have found that, no matter how you set it, each tool returns a slightly different set of files. However, for the reasons above, I cannot predict which of them might return the best file (whatever that means) for all the mailboxes that you may encounter. So you get four tools, and you decide whether to keep them all or not. I will discuss this issue further later in the tutorial.

Regardless of which tool you use and how, it is almost certain that you will end up with lots of duplicate or useless files. Even one single email thread can produce duplicates because of emojis, HTML email with the same image in the signature of each message, or an attachment that was sent to many people with you CC'd each time. There is no way to avoid this, which is the reason for the last part of the script.

Because of space constraints, I cannot describe in detail all the options used for the several commands present in the code: To know how they work, please check each command's man page.

The script takes two arguments. The first, saved in the MAILBOX variable, is the Maildir folder that the script should scan for attachments. The second, TARGET, is the folder where it should save the files it finds.

TMPDIR is, as its name suggests, just a temporary work folder, and CNT is a counter used to generate unique names for its subfolders. The purpose of the associative array CHECKSUMS defined in line 7 is explained in Listing 3.

Lines 9 to 18 define a shell function called emaildirname that first sets three variables: the time the file containing the current email message was created (MSGTS), the time that email was received (MSGDATE), and its subject (MSGSUBJ).

Keep in mind the many different date formats, which can also be ambiguous if they don't mention a time zone. File creation times are often not preserved when files are copied from one medium to another. Consequently, more often than not, neither of those timestamps will be correct. They are just the least worst guess possible to make in a simple shell script.

In practice, line 10 saves in MSGTS the timestamp of the current email file, expressed in seconds in Unix time (i.e., seconds that have elapsed since January 1, 1970). That number is then converted with the date command in two different formats in ORIGTS and FILENAMETS (lines 11-12), for reasons that will become clear later.
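As a rough sketch of this step (not the article's actual listing), the three timestamp variables could be filled as shown below. It assumes GNU stat and GNU date, uses the file's modification time as a stand-in for its creation time, and the layout chosen for FILENAMETS is only a guess.

```bash
# Sketch only, not the article's listing. Assumes GNU stat and GNU date.
msg_timestamps() {
    local file="$1"
    # Modification time in seconds since the Unix epoch (a true creation
    # time is rarely available, so mtime stands in for it here)
    MSGTS=$(stat -c %Y "$file")
    # Format that "touch -t" accepts: CCYYMMDDhhmm.ss
    ORIGTS=$(date -d "@$MSGTS" +%Y%m%d%H%M.%S)
    # Sortable, filename-friendly format (this layout is an assumption)
    FILENAMETS=$(date -d "@$MSGTS" +%Y%m%d-%H%M%S)
}
```

Calling msg_timestamps on one message file sets all three variables in the current shell, so later steps can reuse them.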

MSGDATE, in line 13, stores the content of the first Date: header in the current email (there may be many of them, if the email is a reply to a reply to a reply). MSGSUBJ, instead, stores the first 200 characters (minus the initial Subject: substring) of the subject email header.
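One possible way to fetch those two headers with standard tools is sketched here. The function name msg_headers is hypothetical, and the sketch ignores folded (multi-line) headers, which a more careful script would need to handle.

```bash
# Sketch only: msg_headers is a hypothetical helper, not part of the script.
msg_headers() {
    local file="$1"
    # Keep only the first Date: header; replies and forwards may quote more
    MSGDATE=$(grep -m 1 -i '^Date:' "$file" | sed 's/^[^:]*: *//')
    # First Subject: header, prefix stripped, capped at 200 characters
    MSGSUBJ=$(grep -m 1 -i '^Subject:' "$file" | sed 's/^[^:]*: *//' | cut -c 1-200)
}
```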

The emaildirname function's main job is to generate a unique folder name (DIR) that includes, in a more or less readable way, both the current received date and subject of the email. To do this, line 15 concatenates the two corresponding variables, replacing with sed any non-alphanumeric characters with dashes. Next, to guarantee that DIR has a unique value, the $CNT variable is appended right before incrementing it. Without this trick, two distinct copies of the same email would be processed in the same folder.
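The naming trick can be sketched as follows. The function name make_dirname and the exact separator are assumptions; only the idea (dashes for non-alphanumeric characters, counter appended and then incremented) comes from the description above.

```bash
# Sketch only: make_dirname is a hypothetical stand-in for emaildirname.
CNT=0
make_dirname() {
    local msgdate="$1" msgsubj="$2"
    # Replace every non-alphanumeric character with a dash
    DIR=$(printf '%s-%s' "$msgdate" "$msgsubj" | LC_ALL=C sed 's/[^[:alnum:]]/-/g')
    # Append the counter, then increment it, so two copies of the same
    # message never share a folder
    DIR="$DIR-$CNT"
    CNT=$((CNT + 1))
}
```

The function must be called directly rather than inside a command substitution; otherwise the increment of CNT would be lost in a subshell.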

In this script, any line containing the strings EXTR or DUPX simply print to the terminal some information about what is happening. Here, lines 23 to 25 print the name of the input mailbox, how many messages it contains, how big it is, and where the attachments will be stored.

The four Linux command-line tools – all available as binary packages in the most popular distributions – with the ability to split the email parts into separate files are: mu, uudeview, munpack, and ripmime. Listing 2 runs them all, one at a time, on all of the given $MAILBOX messages by means of the for loop starting in line 29. A separate folder for each tool is created inside $TMPDIR (line 31), and the $CNT variable is reset.

Then, for each single email message contained in the given $MAILBOX (line 35), four things happen in sequence. First, the emaildirname function updates the values of the variables previously described. Second (lines 38-39), the script creates a unique folder to store all the current email parts. Third, the case statement (line 40) passes to the active tool the current email, making it dump all its separate parts inside that same folder (lines 42, 46, 50, and 54). A numeric index $TOOLNUM is also associated with each tool (more on this later).
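The dispatch step might look roughly like the sketch below. The option sets are my reading of each tool's man page, not necessarily those used in Listing 2, and the TOOLNUM values and the DRY_RUN variable are assumptions added for illustration. Check the man pages before relying on any of them.

```bash
# Sketch only: options and TOOLNUM values are assumptions, see lead-in.
extract_parts() {
    local tool="$1" msg="$2" dir="$3"
    case "$tool" in
        mu)       TOOLNUM=1; set -- mu extract --save-all --target-dir="$dir" "$msg" ;;
        uudeview) TOOLNUM=2; set -- uudeview -i -p "$dir" "$msg" ;;
        # munpack changes to $dir before reading, so $msg should be absolute
        munpack)  TOOLNUM=3; set -- munpack -q -t -C "$dir" "$msg" ;;
        ripmime)  TOOLNUM=4; set -- ripmime -i "$msg" -d "$dir" ;;
        *)        echo "EXTR unknown tool: $tool" >&2; return 1 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"      # only show the command that would run
    else
        "$@"           # run the selected tool
    fi
}
```

Setting DRY_RUN=1 before calling the function prints each command instead of running it, which is a cheap way to check the options on a new mailbox.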

Sometimes, the email decoder tool will find nothing to save inside the $TMPDIR/$TOOL/$DIR folder. In that case, the loop started in line 35 will just move on to the next email.

If $NUMFILES is greater than zero (lines 59-60), $TMPDIR/$TOOL/$DIR will contain potentially tons of files (e.g., plain text body, HTML body, many icons, attachments, etc.) with the weirdest possible names and often without any extensions. Therefore, the script saves all those file names into a list (line 62) then reads them all with the while loop of line 63, and renames all the corresponding files. This fourth step is necessary to make both the rest of the script and, generally speaking, your future management of those files easier. Trust me on this.

The renaming that happens in lines 65 to 67 is probably the blackest piece of magic of the whole thing but luckily is much easier to explain:

  • Line 65 is standard Bash syntax to save into $NEWNAME the name of the current file, minus its extension, if present.
  • Line 66 just repeats the same trick from line 15 (Listing 1): All the non-alphanumeric characters become dashes.
  • Line 67 is, like line 65, standard Bash voodoo to fetch the extension (if present) of a file name.

If a file has no extension, both $EXT and $NEWNAME will have the same value. In my experience, at least 95 percent of the time, this happens with files that are the body of an email, in plain or HTML format. Therefore, I just give those files the telling probablyemailbody.txt extension (lines 69 to 72).
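The parameter expansions behind this renaming can be sketched as below. The function name split_name is hypothetical, and comparing $EXT against the original file name (rather than against $NEWNAME) is my own simplification.

```bash
# Sketch only: split_name is a hypothetical helper, not the article's code.
split_name() {
    local f="$1"
    NEWNAME=${f%.*}     # drop the shortest trailing ".something", if any
    NEWNAME=$(printf '%s' "$NEWNAME" | LC_ALL=C sed 's/[^[:alnum:]]/-/g')
    EXT=${f##*.}        # keep whatever follows the last dot
    # Without a dot, ${f##*.} returns the whole name: treat it as a body
    if [ "$EXT" = "$f" ]; then
        EXT=probablyemailbody.txt
    fi
}
```

For example, "Invoice 2017.pdf" yields NEWNAME "Invoice-2017" and EXT "pdf", whereas a file simply called "part1" gets the probablyemailbody.txt extension.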

Finally, line 73 gives the file a new name composed of $FILENAMETS (calculated by the emaildirname function) plus the current $TOOLNUM, $CNT, $NEWNAME, and, of course, its extension.

The -t option of the touch command in line 74 sets the freshly renamed file's modification and access times to the timestamp contained in $ORIGTS (touch cannot set a creation time). That value is the same as $FILENAMETS, but because the -t option wants a different format, I put it into a separate variable for clarity.
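Put together, the rename and the timestamp reset might be sketched like this. The function name, the argument order, and the dash separators in the new name are assumptions; the article only states which pieces the name contains.

```bash
# Sketch only: rename_and_stamp and its separators are assumptions.
rename_and_stamp() {
    local src="$1" filenamets="$2" origts="$3"
    local toolnum="$4" cnt="$5" newname="$6" ext="$7"
    local dst
    dst="$(dirname "$src")/${filenamets}-${toolnum}-${cnt}-${newname}.${ext}"
    mv -- "$src" "$dst"
    # touch -t expects CCYYMMDDhhmm.ss and sets access and modification times
    touch -t "$origts" "$dst"
    printf '%s\n' "$dst"
}
```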

The final part of the main loop (lines 79 to 82) just does some reporting and moves all the files that were extracted and given unique names to the $TARGET/tmp folder.

At this point in the script, all the files extracted by all the tools are inside the folder $TARGET/tmp. Depending on the content of your mailboxes and on how you tweaked the script, you may also get a lot of empty files: Line 86 (Listing 3) finds and deletes them.
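Finding and deleting empty files is a one-liner with GNU find. The helper below, whose name is hypothetical, also counts them first so the number can go into a report.

```bash
# Sketch only: remove_empty is a hypothetical helper. Assumes GNU find.
remove_empty() {
    local dir="$1" n
    # Count the empty regular files, then delete them
    n=$(find "$dir" -type f -empty | wc -l)
    find "$dir" -type f -empty -delete
    echo "$n"
}
```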

Now is the time to tackle a bigger problem: Regardless of how the code was tweaked, that folder will contain many duplicate files (Figure 2). Removing them is the task of the loop in lines 92 to 104.

Figure 2: Running the script with the default options will produce hundreds of files with meaningless names and lots of duplicates.

Line 92 finds and sorts by name all the files inside $TARGET/tmp to look at them one by one. Line 94 stores the checksum of the current file in the $CK variable. If the CHECKSUMS associative array declared in line 7 (Listing 1) already contains a key equal to $CK (line 92), it means the loop has already found a first copy of that file, and so it can safely remove the other copy found in this iteration (to be precise, it leaves it in a folder that will be moved elsewhere in line 106, but it has the same result). Therefore, line 97 prints what will happen, and line 98 increments the counter of duplicate files.

If there is no key equal to $CK inside CHECKSUMS, then $F is the first copy of that particular file. Therefore, it is moved to the parent directory, and another key-value element for $CK is added to the CHECKSUMS array in line 102.
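The deduplication logic can be sketched in Bash 4 or later as follows. The function name dedupe, the use of md5sum as the checksum tool, and the message format are assumptions; the associative array and the sort-by-name order follow the description above.

```bash
# Sketch only: dedupe and md5sum are assumptions. Needs Bash 4+.
declare -A CHECKSUMS
dedupe() {
    local tmp="$1" keep="$2" f ck dups=0
    while IFS= read -r f; do
        ck=$(md5sum "$f" | cut -d ' ' -f 1)
        if [[ -n ${CHECKSUMS[$ck]+x} ]]; then
            # Checksum already seen: leave this copy behind in $tmp
            echo "DUPX duplicate: $f"
            dups=$((dups + 1))
        else
            # First copy of this content: remember it and keep the file
            CHECKSUMS[$ck]=$f
            mv -- "$f" "$keep/"
        fi
    done < <(find "$tmp" -maxdepth 1 -type f | LC_ALL=C sort)
    echo "DUPX $dups duplicates left in $tmp"
}
```

The loop reads from a process substitution instead of a pipe so that CHECKSUMS and the duplicate counter are updated in the current shell rather than in a subshell.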

By now, you can also appreciate all the effort put into giving each file a unique name that starts with its (assumed!!!) creation time: Because line 92 sorts files by name, the first copy of any file that will be processed (i.e., the only one that will be kept) also is the one with the oldest timestamp. For the same reason, should you want to keep the newest copy instead, you should just add the -r (reverse) sort switch, in line 92.

The last four lines of code generate some final statistics, after moving the $TARGET/tmp folder outside of $TARGET, so you can still look at all those files if something went wrong.

Scanning a Real Mailbox

To give you an idea of the performance of both the complete script and each of the individual tools it uses, I ran the script three times on one of the largest mailboxes in my email archive. In the first run, I used all four tools, with exactly the options shown in Listings 1 through 3, to save every component of each message. Listing 4 shows the complete report generated by the script (compare it with the source code to understand exactly how each piece of information was generated):

Listing 4

First Run – All Four Tools

EXTR mbox      : /home/marco/mbox-extract/original/2017.06
EXTR contains  : 1685 messages ( 150224 KBytes )
EXTR target    : /home/marco/mbox-extract/mu-2017.06
EXTR
EXTR mu        : 08:57:25   start
EXTR mu        : 08:59:10   end
EXTR mu        : 348        files extracted (      0 empty)
EXTR
EXTR uudeview  : 08:59:11   start
EXTR uudeview  : 09:01:06   end
EXTR uudeview  : 648        files extracted (      0 empty)
EXTR
EXTR munpack   : 09:01:09   start
EXTR munpack   : 09:03:50   end
EXTR munpack   : 2780       files extracted (      1 empty)
EXTR
EXTR ripmime   : 09:03:58   start
EXTR ripmime   : 09:07:18   end
EXTR ripmime   : 4334       files extracted (   1017 empty)
EXTR
EXTR cleaning  : 09:07:32   start
EXTR    total  : 5369       files found, after removing 1723 duplicates
EXTR cleaning : 09:08:42   end

The report shows that my email archive for June 2017 contains 1,685 messages and takes about 150MB of disk space. It also shows that the time spent by each tool, in addition to the number of files it found, varies greatly.
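To compare the tools, the start and end times in Listing 4 can be turned into seconds with a small helper. The function name elapsed is hypothetical, GNU date is assumed, and the sketch does not handle runs that cross midnight.

```bash
# Sketch only: elapsed is a hypothetical helper. Assumes GNU date.
elapsed() {
    local s e
    # Pin both times to the same day in UTC and subtract
    s=$(date -u -d "1970-01-01 $1" +%s)
    e=$(date -u -d "1970-01-01 $2" +%s)
    echo $((e - s))
}

elapsed 08:57:25 08:59:10   # mu:       105 seconds
elapsed 08:59:11 09:01:06   # uudeview: 115 seconds
elapsed 09:01:09 09:03:50   # munpack:  161 seconds
elapsed 09:03:58 09:07:18   # ripmime:  200 seconds
```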

The mu tool saved 348 files in less than two minutes, whereas ripmime saved more than 4,300 files in a little over three minutes, including more than 1,000 empty files (not its fault, as explained below). Eventually, more than 1,700 duplicates were removed, leaving more than 5,300 files, between attachments, email bodies, digital signatures, and whatnot in the attachments-2017.06 folder.

In the second run, I made munpack and ripmime ignore almost everything but the attachments by:

  • Not passing the -t switch to munpack, which makes it ignore the plain text MIME parts of multipart messages.
  • Adding the --no-nameless switch to ripmime to make it skip nameless attachments.

Running the script on the same mailbox in "attachment-only mode" produced the results shown in Listing 5.

Listing 5

Second Run – Attachment-Only Mode

EXTR mu        : 10:52:23   start
EXTR mu        : 10:54:07   end
EXTR mu        : 348        files extracted (      0 empty)
EXTR
EXTR uudeview  : 10:54:08   start
EXTR uudeview  : 10:55:50   end
EXTR uudeview  : 648        files extracted (      0 empty)
EXTR
EXTR munpack   : 10:55:52   start
EXTR munpack   : 10:57:34   end
EXTR munpack   : 655        files extracted (      0 empty)
EXTR
EXTR ripmime   : 10:57:36   start
EXTR ripmime   : 10:59:25   end
EXTR ripmime   : 325        files extracted (      2 empty)
EXTR
EXTR cleaning  : 10:59:26   start
EXTR    total : 846       files found, after removing 1128 duplicates
EXTR cleaning : 10:59:43   end

As expected, both munpack and ripmime were much more efficient, and overall, the script produced many fewer files to deal with manually. However, a direct inspection of this run's 846 results showed that about three quarters of them were email bodies, not attachments, almost all extracted by uudeview or munpack. Therefore, in the third run, I removed those two tools from line 29 (Listing 2), leaving the script to use only mu and ripmime. Listing 6 shows what remained after deduplication.

Listing 6

Third Run – mu and ripmime

EXTR cleaning : 11:35:08  start
EXTR    total : 245       files found, after removing 426 duplicates
EXTR cleaning : 11:35:16  end

This third run did produce, almost exclusively, actual attachments (plus a few logos embedded in HTML email). These numbers show, I hope, why I present a script that can extract every single part of an email message in four different ways. Email messages can be so different from each other that, especially in older archives, one tool might find what the others miss.

The reports show that if the content of your email archive is similar to mine and you only care about attachments, you can very likely get away with enabling only mu and ripmime with the --no-nameless option.

At the same time, if you also need to save the bodies of your messages in separate files, if you have old archives created with non-standard clients, or if you want to be as sure as technically possible that you did not miss anything, you can get there using all four tools sequentially and waiting just a few more minutes. Again, in my opinion this is a decision that you have to make. I do suggest, however, making at least one trial run with all four tools.

Other Scenarios

Regardless of how you configure it, this script is a lifesaver whenever you have to recover one or many attachments quickly from large email archives. Of course, in many cases it would not be enough or it would be overkill. For lack of space, I cannot explore all those cases in detail, but the following sections have some pointers on how to handle the most common possibilities.
