Needle in a Haystack
A Case for Redundancy
One reason to extract everything by default is just to show what is possible: If you really want or need it, you can automatically save the text content of every email you ever received as one single file.
The other reason to extract everything is the same reason why, by default, the script uses four different tools: the very nature and age of email. The good part of email being an open, decades-old standard is that we can still read very old messages. The bad part, especially when dealing with old email archives, is that, over the years, countless email applications of variable quality, plus careless email users, applied that standard with great creativity.
A generic old email archive may contain attachments in long-forgotten formats, or files named with weird characters, with non-standard extensions, or with no extension at all.
Email bodies can be nested in the strangest ways: For example, think of email, in both text and HTML format with attachments, that has been forwarded and then replied to from an email digest. Even a message's "plain text" may be encoded in any way possibly conceived over the last 50 years. Oh, and have you ever tried to count how many different combinations of language and format there are to write a "simple" date?
Even the way in which archives have been stored on disk is relevant. Think of a Maildir first compressed and saved to a floppy, later copied on a CD-ROM and then back onto a hard drive, and finally to the cloud – every time on a different filesystem. In general, each of these migrations might alter the name, access, or creation time of an email file.
This is why the script does the "same" job four times. I have found that, no matter how you set it, each tool returns a slightly different set of files. However, for the reasons above, I cannot predict which of them might return the best file (whatever that means) for all the mailboxes that you may encounter. So you get four tools, and you decide whether to keep them all or not. I will discuss this issue further later in the tutorial.
Regardless of which tool you use and how, it is almost certain that you will end up with lots of duplicate or useless files. Even one single email thread can produce duplicates because of emojis, HTML email with the same image in the signature of each message, or an attachment that was sent to many people with you CC'd each time. There is no way to avoid this, which is the reason for the last part of the script.
Because of space constraints, I cannot describe in detail all the options used for the several commands present in the code: To know how they work, please check each command's man page.
The script takes two arguments. The first, saved in the MAILBOX variable, is the Maildir folder that the script should scan for attachments. The second, TARGET, is the folder where it should save the files it finds. TMPDIR is, as its name suggests, just a temporary work folder, and CNT is a counter used to generate unique names for its subfolders. The purpose of the associative array CHECKSUMS defined in line 7 is explained in Listing 3.
Lines 9 to 18 define a shell function called emaildirname that first defines three variables: the time the file containing the current email message was created (MSGTS), the time that email was received (MSGDATE), and its subject (MSGSUBJ).
Keep in mind the many different date formats, which can also be ambiguous if they don't mention a time zone. File creation times are often not preserved when files are copied from one medium to another. Consequently, more often than not, neither of those timestamps will be correct. They are just the least bad guesses possible in a simple shell script.
In practice, line 10 saves in MSGTS the timestamp of the current email file, expressed in Unix time (i.e., the seconds that have elapsed since January 1, 1970). That number is then converted with the date command into two different formats, stored in ORIGTS and FILENAMETS (lines 11-12), for reasons that will become clear later.
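Those three steps can be sketched as follows; GNU stat and date are assumed, and the exact format strings are my reconstruction, not necessarily the ones in the listing:

```shell
#!/bin/sh
# Minimal sketch of lines 10-12, assuming GNU stat/date; the format
# strings below are guesses, not the article's exact code.
F=$(mktemp)                                        # stand-in for an email file
MSGTS=$(stat -c %Y "$F")                           # mtime in Unix seconds
ORIGTS=$(date -d "@$MSGTS" +%Y%m%d%H%M.%S)         # format wanted by 'touch -t'
FILENAMETS=$(date -d "@$MSGTS" +%Y-%m-%d-%H.%M.%S) # readable, sortable prefix
echo "$ORIGTS"
echo "$FILENAMETS"
rm -f "$F"
```

The two derived variables describe the same instant; only the notation differs.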
MSGDATE, in line 13, stores the content of the first Date: header in the current email (there may be many of them if the email is a reply to a reply to a reply). MSGSUBJ, instead, stores the first 200 characters (minus the initial Subject: substring) of the email's subject header.
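Header extraction along these lines can be sketched like this; the grep/sed/cut details are an assumption, not the article's exact code, and the sample message is made up:

```shell
#!/bin/sh
# Sketch: grab the first Date: header and a truncated subject.
# grep -m 1 stops at the first match, so replies with multiple
# Date: headers yield only the newest one.
F=$(mktemp)
cat > "$F" <<'EOF'
Date: Mon, 12 Jun 2017 10:00:00 +0200
Subject: Quarterly report, final version
Date: Sun, 11 Jun 2017 09:00:00 +0200

body text
EOF

MSGDATE=$(grep -m 1 '^Date:' "$F" | sed 's/^Date: *//')
MSGSUBJ=$(grep -m 1 '^Subject:' "$F" | sed 's/^Subject: *//' | cut -c1-200)

echo "$MSGDATE"
echo "$MSGSUBJ"
rm -f "$F"
```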
The emaildirname function's main job is to generate a unique folder name (DIR) that includes, in a more or less readable way, both the received date and the subject of the current email. To do this, line 15 concatenates the two corresponding variables, replacing any non-alphanumeric character with a dash using sed. Next, to guarantee that DIR has a unique value, the $CNT variable is appended right before being incremented. Without this trick, two distinct copies of the same email would be processed in the same folder.
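A minimal sketch of that name-mangling step, with made-up header values:

```shell
#!/bin/sh
# Sketch of lines 15-16: build a folder name from date + subject,
# turn every non-alphanumeric character into a dash, then append a
# counter so two copies of the same email get distinct folders.
MSGDATE='Mon, 12 Jun 2017 10:00:00 +0200'   # hypothetical values
MSGSUBJ='Quarterly report, final version'
CNT=1

DIR=$(echo "$MSGDATE-$MSGSUBJ" | sed 's/[^a-zA-Z0-9]/-/g')
DIR="$DIR-$CNT"
CNT=$((CNT + 1))

echo "$DIR"
```

Spaces, commas, colons, and the plus sign all collapse into dashes, which keeps the folder names safe on any filesystem.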
In this script, any line containing the string EXTR or DUPX simply prints to the terminal some information about what is happening. Here, lines 23 to 25 print the name of the input mailbox, how many messages it contains, how big it is, and where the attachments will be stored.
The four Linux command-line tools – all available as binary packages in the most popular distributions – with the ability to split email parts into separate files are mu, uudeview, munpack, and ripmime. Listing 2 runs them all, one at a time, on all of the given $MAILBOX messages by means of the for loop starting in line 29. A separate folder for each tool is created inside $TMPDIR (line 31), and the $CNT variable is reset.
Then, for each single email message contained in the given $MAILBOX (line 35), four things happen in sequence. First, the emaildirname function updates the values of the variables previously described. Second (lines 38-39), the script creates a unique folder to store all the current email parts. Third, the case statement (line 40) passes the current email to the active tool, making it dump all its separate parts inside that same folder (lines 42, 46, 50, and 54). A numeric index, $TOOLNUM, is also associated with each tool (more on this later).
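The dispatch structure can be sketched as follows. The real commands are replaced with echo so the sketch runs even where the tools are not installed, and the option strings shown are illustrative, not necessarily the article's exact flags:

```shell
#!/bin/sh
# Structural sketch of the per-tool dispatch (line 40). Each branch
# sets the tool's numeric index and would invoke the extractor;
# here echo just shows a plausible command line for each tool.
MSG=message.eml                 # hypothetical input file
OUT=outdir                      # hypothetical output folder
for TOOL in mu uudeview munpack ripmime; do
    case "$TOOL" in
        mu)       TOOLNUM=1; echo "mu extract --target-dir=$OUT $MSG" ;;
        uudeview) TOOLNUM=2; echo "uudeview -p $OUT $MSG" ;;
        munpack)  TOOLNUM=3; echo "munpack -C $OUT $MSG" ;;
        ripmime)  TOOLNUM=4; echo "ripmime -i $MSG -d $OUT" ;;
    esac
done
```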
Sometimes, the email decoder tool will find nothing to save inside the $TMPDIR/$TOOL/$DIR folder. In that case, the loop started in line 35 will just move on to the next email.
If $NUMFILES is greater than zero (lines 59-60), $TMPDIR/$TOOL/$DIR will contain potentially tons of files (e.g., plain text body, HTML body, many icons, attachments, etc.) with the weirdest possible names and often without any extension. Therefore, the script saves all those file names into a list (line 62), then reads them all with the while loop of line 63 and renames all the corresponding files. This fourth step is necessary to make both the rest of the script and, generally speaking, your future management of those files easier. Trust me on this.
The renaming that happens in lines 65 to 67 is probably the blackest piece of magic of the whole thing, but luckily it is much easier to explain:
- Line 65 is standard Bash syntax to save into $NEWNAME the name of the current file, minus its extension, if present.
- Line 66 just repeats the same trick from line 15 (Listing 1): All the non-alphanumeric characters become dashes.
- Line 67 is, like line 65, standard Bash voodoo to fetch the extension (if present) of a file name.
If a file has no extension, both $EXT and $NEWNAME will have the same value. In my experience, at least 95 percent of the time, this happens with files that are the body of an email, in plain or HTML format. Therefore, I just give those files the telling probablyemailbody.txt extension (lines 69 to 72).
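The whole extension dance can be condensed into a small, self-contained sketch (the file names are made up):

```shell
#!/bin/sh
# Sketch of lines 65-72: ${F%.*} strips the extension, ${F##*.}
# fetches it. If the two expansions match, the name had no dot at
# all, so the file gets the probablyemailbody.txt extension.
for F in 'report.pdf' 'part_0001'; do
    NEWNAME="${F%.*}"                  # name minus extension (line 65)
    EXT="${F##*.}"                     # extension, if present (line 67)
    if [ "$EXT" = "$NEWNAME" ]; then   # no extension found
        EXT='probablyemailbody.txt'
    fi
    echo "$NEWNAME.$EXT"
done
```

Here report.pdf keeps its extension, while the extension-less part_0001 is flagged as a likely email body.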
Finally, line 73 gives the file a new name composed of $FILENAMETS (calculated by the emaildirname function) plus the current $TOOLNUM, $CNT, $NEWNAME, and, of course, its extension.
The -t option of the touch command in line 74 sets the freshly renamed file's creation timestamp to the one contained in $ORIGTS. That value is the same as $FILENAMETS, but because the -t option wants a different format, I put it into a separate variable for clarity.
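The relationship between the two formats can be sketched as follows, assuming GNU date; the round trip back-dates a scratch file and reads the timestamp back:

```shell
#!/bin/sh
# Sketch of line 74, assuming GNU date: touch -t expects a
# [[CC]YY]MMDDhhmm[.ss] stamp, so the Unix timestamp is converted
# into that notation first.
MSGTS=1497254400                             # hypothetical Unix timestamp
ORIGTS=$(date -d "@$MSGTS" +%Y%m%d%H%M.%S)   # e.g. 201706120800.00 in UTC
F=$(mktemp)
touch -t "$ORIGTS" "$F"                      # back-date the scratch file
date -r "$F" +%s                             # reads the timestamp back
rm -f "$F"
```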
The final part of the main loop (lines 79 to 82) just does some reporting and moves all the files that were extracted and given unique names to the $TARGET/tmp folder.
At this point in the script, all the files extracted by all the tools are inside the folder $TARGET/tmp. Depending on the content of your mailboxes and on how you tweaked the script, you may also get a lot of empty files: Line 86 (Listing 3) finds and deletes them.
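That cleanup step can be sketched with GNU find's -empty and -delete options; the folder and file names here are made up:

```shell
#!/bin/sh
# Sketch of line 86: delete every zero-byte file under the
# temporary target folder, leaving non-empty files untouched.
T=$(mktemp -d)
touch "$T/empty-part"          # a zero-byte extraction leftover
echo data > "$T/real-part"     # a file with actual content
find "$T" -type f -empty -delete
ls "$T"                        # only real-part survives
rm -rf "$T"
```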
Now is the time to tackle a bigger problem: Regardless of how the code was tweaked, that folder will contain many duplicate files (Figure 2). Removing them is the task of the loop in lines 92 to 104.
Line 92 finds and sorts by name all the files inside $TARGET/tmp to look at them one by one. Line 94 stores the checksum of the current file in the $CK variable. If the CHECKSUMS associative array declared in line 7 (Listing 1) already contains a key equal to $CK, it means the loop has already found a first copy of that file, and so it can safely remove the other copy found in this iteration (to be precise, it leaves it in a folder that will be moved elsewhere in line 106, but with the same result). Therefore, line 97 prints what will happen, and line 98 increments the counter of duplicate files.
If there is no key equal to $CK inside CHECKSUMS, then $F is the first copy of that particular file. Therefore, it is moved to the parent directory, and another key-value element for $CK is added to the CHECKSUMS array in line 102.
By now, you can also appreciate all the effort put into giving each file a unique name that starts with its (assumed!!!) creation time: Because line 92 sorts files by name, the first copy of any file to be processed (i.e., the only one that will be kept) is also the one with the oldest timestamp. For the same reason, should you want to keep the newest copy instead, you just need to add the -r (reverse) switch to the sort command in line 92.
The last four lines of code generate some final statistics, after moving the $TARGET/tmp folder outside of $TARGET, so you can still look at all those files if something went wrong.
Scanning a Real Mailbox
To give you an idea of the performance of both the complete script and each of the individual tools it uses, I ran the script three times on one of the largest mailboxes in my email archive. In the first run, I used all four tools, with exactly the options shown in Listings 1 through 3, to save every component of each message. Listing 4 shows the complete report generated by the script (compare it with the source code to understand exactly how each piece of information was generated):
Listing 4
First Run – All Four Tools
EXTR mbox     : /home/marco/mbox-extract/original/2017.06
EXTR contains : 1685 messages ( 150224 KBytes )
EXTR target   : /home/marco/mbox-extract/mu-2017.06
EXTR
EXTR mu       : 08:57:25 start
EXTR mu       : 08:59:10 end
EXTR mu       : 348 files extracted ( 0 empty)
EXTR
EXTR uudeview : 08:59:11 start
EXTR uudeview : 09:01:06 end
EXTR uudeview : 648 files extracted ( 0 empty)
EXTR
EXTR munpack  : 09:01:09 start
EXTR munpack  : 09:03:50 end
EXTR munpack  : 2780 files extracted ( 1 empty)
EXTR
EXTR ripmime  : 09:03:58 start
EXTR ripmime  : 09:07:18 end
EXTR ripmime  : 4334 files extracted ( 1017 empty)
EXTR
EXTR cleaning : 09:07:32 start
EXTR total    : 5369 files found, after removing 1723 duplicates
EXTR cleaning : 09:08:42 end
The report shows that my email archive for June 2017 contains 1,685 messages and takes about 150MB of disk space. It also shows that the time spent by each tool, in addition to the number of files it found, varies greatly.
The mu tool saved 348 files in less than two minutes, whereas ripmime saved more than 4,300 files in six minutes – including more than 1,000 empty files (not its fault, as explained below). Eventually, more than 1,700 duplicates were removed, leaving more than 5,300 files – attachments, email bodies, digital signatures, and whatnot – in the attachments-2017.06 folder.
In the second run, I made munpack and ripmime ignore almost everything but the attachments by:

- Not passing the -t switch to munpack, which means "ignore the plain text MIME parts of multipart messages."
- Adding the --no-nameless switch to ripmime to make it skip nameless attachments.
Running the script on the same mailbox in "attachment-only mode" produced the results shown in Listing 5.
Listing 5
Second Run – Attachment-Only Mode
EXTR mu       : 10:52:23 start
EXTR mu       : 10:54:07 end
EXTR mu       : 348 files extracted ( 0 empty)
EXTR
EXTR uudeview : 10:54:08 start
EXTR uudeview : 10:55:50 end
EXTR uudeview : 648 files extracted ( 0 empty)
EXTR
EXTR munpack  : 10:55:52 start
EXTR munpack  : 10:57:34 end
EXTR munpack  : 655 files extracted ( 0 empty)
EXTR
EXTR ripmime  : 10:57:36 start
EXTR ripmime  : 10:59:25 end
EXTR ripmime  : 325 files extracted ( 2 empty)
EXTR
EXTR cleaning : 10:59:26 start
EXTR total    : 846 files found, after removing 1128 duplicates
EXTR cleaning : 10:59:43 end
As expected, both munpack and ripmime were much more efficient, and overall, the script produced many fewer files to deal with manually. However, a direct inspection of this run's 846 results showed that about three quarters of them were email bodies, not attachments, almost all extracted by uudeview or munpack. Therefore, in the third run, I removed those two tools from line 29 (Listing 2), leaving the script to use only mu and ripmime. Listing 6 shows what remained after deduplication.
Listing 6
Third Run – mu and ripmime
EXTR cleaning : 11:35:08 start
EXTR total    : 245 files found, after removing 426 duplicates
EXTR cleaning : 11:35:16 end
This third run did produce, almost exclusively, actual attachments (plus a few logos embedded in HTML email). All the numbers prove, I hope, why I present a script that can extract every single part of an email message in four different ways. Email messages can be so different from each other that, especially in older archives, one tool might find what the others miss.
The reports show that if the content of your email archive is similar to mine and you only care about attachments, you can very likely get away with enabling only mu and ripmime with the --no-nameless option.
At the same time, if you also need to save the bodies of your messages in separate files, if you have old archives created with non-standard clients, or if you want to be as sure as technically possible that you did not miss anything, you can get there by using all four tools sequentially and waiting just a few more minutes. Again, in my opinion this is a decision that you have to make. I do suggest, however, making at least one trial run with all four tools.
Other Scenarios
Regardless of how you configure it, this script is a lifesaver whenever you have to recover one or many attachments quickly from large email archives. Of course, in many cases it would not be enough or it would be overkill. For lack of space, I cannot explore all those cases in detail, but the following sections have some pointers on how to handle the most common possibilities.