Purifying your scanned PDF files
New View

Lead Image © Sasin Paraksa, 123RF.com
Having trouble reading that scanned PDF? You can add a little more contrast with some help from ImageMagick.
Gone are the days when you needed to go to the library for a book. Now you can download the book electronically, load it on your e-reader or tablet, and start enjoying it. But not all electronic books are equal. Particularly infuriating are electronic books that are actually scanned images of old print books. These scans, which typically come in PDF format, are difficult to read on a black-and-white E Ink screen, where fading text and yellowing pages appear as blurs, blotches, and dark-gray backgrounds.
Luckily, you can clear up that blurry scanned image with a few tricks from ImageMagick. This article describes a method you can use to spruce up a scanned electronic book. Note: If you obtained the book from a lender or through another vendor, be sure the license supports this type of file manipulation.
Getting Started
I needed a copy of a sociology book, A Place on the Corner, by Elijah Anderson; the only place I could find it in electronic format was The Internet Archive. They had a scanned copy, so I loaded it on my Sony DPT-RP1/B e-reader, but the text was difficult to read (Figure 1). Dark spots appeared on the page, and the text was just a bit darker than the background, with poor contrast (Figure 2).
Hoping to find a better view, I installed pdfimages on my Ubuntu 18.04 system. The tool ships as part of the poppler-utils package:
sudo apt install poppler-utils
pdfimages extracts images from PDF files and saves them as Portable Pixmap (PPM), Portable Bitmap (PBM), or another format. Since the scanned pages in the book are image files, all you need to do is drop the PDF into an empty folder and enter the following command, where page is the prefix for the output file names:
pdfimages A_place_on_the_corner.pdf page
This command will generate a large number of PPM and PBM image files (Figures 3 and 4). The PPM files contain all the shadows and imperfections of the scanned pages, and the PBM files are a clean, white-on-black negative. Now all you need to do is reverse the colors in the PBM negatives. You don't need the PPM files, so you can remove them:
rm *.ppm
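A more cautious variant of the extraction step can be sketched as follows. It lists the embedded images first, extracts into a scratch folder, and counts the files of each type before anything is deleted. The function names extract_pages and count_by_ext, the page prefix, and the pages folder are assumptions made for this illustration, not part of pdfimages.

```shell
# Sketch: inspect a PDF, then extract its images into a scratch folder.
# "page" is an arbitrary prefix; pdfimages appends a number and extension.
extract_pages() {
    pdf=$1
    outdir=${2:-pages}
    mkdir -p "$outdir" || return 1
    # -list prints one row per embedded image without writing any files
    pdfimages -list "$pdf" | head -n 12
    pdfimages "$pdf" "$outdir/page"
}

# Count the files with a given extension directly inside a folder
count_by_ext() {
    dir=$1
    ext=$2
    find "$dir" -maxdepth 1 -type f -name "*.$ext" | wc -l | tr -d ' '
}
```

After running extract_pages A_place_on_the_corner.pdf pages, a call such as count_by_ext pages pbm shows how many bitmap pages were written. If you use the scratch folder, change into it before continuing with the remaining steps.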
To maintain quality, convert all remaining PBM files into PNG format using mogrify, which is part of the ImageMagick package. (See also the "More Memory?" box for using ImageMagick with large files.) First be sure to install ImageMagick:
sudo apt install imagemagick
More Memory?
On some Linux systems, you'll receive an error message when you use the ImageMagick command-line tools with large files. This message is due to a resource limit that you can raise by editing the /etc/ImageMagick-6/policy.xml file and increasing the memory and disk values. Everything should work fine with the policy.xml resource values modified as shown in Listing 1.
Listing 1
policy.xml Settings
<policy domain="resource" name="memory" value="2256MiB"/>
<policy domain="resource" name="map" value="512MiB"/>
<policy domain="resource" name="width" value="16KP"/>
<policy domain="resource" name="height" value="16KP"/>
<policy domain="resource" name="area" value="128MB"/>
<policy domain="resource" name="disk" value="20GiB"/>
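To see which values a policy file currently sets before and after editing, a small helper along these lines may be useful. This is only a sketch: policy_value is a hypothetical name, and the filter for commented-out entries only catches comments that open on the same line as the policy element.

```shell
# Sketch: print the value configured for one resource in a policy.xml file.
# Lines containing an XML comment opener are skipped; multi-line comments
# are not handled.
policy_value() {
    file=$1
    name=$2
    grep -v '<!--' "$file" |
        sed -n "s/.*<policy domain=\"resource\" name=\"$name\" value=\"\([^\"]*\)\".*/\1/p"
}
```

For example, policy_value /etc/ImageMagick-6/policy.xml memory prints the configured memory limit. The limits ImageMagick actually applies at run time can be checked with identify -list resource.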
mogrify is used to manipulate graphic files: rotate, crop, flip, blur, and join. You can also use mogrify to convert from one format to another. Search for all PBM files in the folder and use mogrify to convert them to PNG format automatically:
find -name '*.pbm' -print0 | xargs -0 -r mogrify -format png
Now that the pages of the book are in PNG format, you don't need the PBM files anymore, so you can delete them:
rm *.pbm
convert is another command-line tool shipped with ImageMagick that does about the same thing as mogrify, but it writes its result to a new file instead of overwriting the original. The PNG files are currently in negative, so you can convert them to look like regular book pages using the -negate option of convert. Output the results as JPG files to keep the file sizes low:
ls -1 *.png | xargs -n 1 bash -c 'convert "$0" -negate "${0%.png}.jpg"'
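The same inversion can also be written as a plain shell loop, which avoids parsing the output of ls and copes with unusual file names. The sketch below is one way to do it; jpg_name and negate_all are hypothetical helper names, and skipping pages that already have a JPG is a design choice made here so that an interrupted run can be resumed.

```shell
# Derive the output name: page-001.png -> page-001.jpg
jpg_name() {
    printf '%s\n' "${1%.png}.jpg"
}

# Invert every PNG in the current folder
negate_all() {
    for f in *.png; do
        [ -e "$f" ] || continue       # the glob matched nothing
        out=$(jpg_name "$f")
        [ -e "$out" ] && continue     # page already converted, skip it
        convert "$f" -negate "$out" || return 1
    done
}
```

Calling negate_all in the folder that holds the PNG files produces one JPG per page and stops at the first file that convert cannot process.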
You can now delete the PNG files and inspect the remaining JPG images. Should you feel they need more contrast, you can use the -level option of the convert command to bulk-modify the contrast of your files:
ls -1 *.jpg | xargs -n 1 bash -c 'convert "$0" -level 60 "${0%.jpg}.jpg"'
Replace 60 with a value ranging from 1 to 100; a higher value gives more contrast, and a lower value gives less.
The next step is to insert the now-cleaned and easily readable scanned pages into a PDF file. Should you wish to keep the resulting PDF as small as possible, you can batch-resize the JPG files beforehand to 50 percent of their actual size using mogrify:
mogrify -resize 50% *.jpg
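For the contrast step, the -level option also accepts an explicit black point and white point written as percentages, which can make the effect easier to predict. The following sketch assumes a symmetric stretch (for example, 20 percent black and 80 percent white) and writes its result to a separate folder so you can compare a page before committing to a value. The names level_arg and stretch_contrast are invented for this illustration.

```shell
# Build a symmetric -level argument: 20 -> 20%,80%
# Values below 50 keep the black point under the white point.
level_arg() {
    black=$1
    white=$((100 - black))
    printf '%s%%,%s%%\n' "$black" "$white"
}

# Apply the stretch to one file, leaving the original untouched
stretch_contrast() {
    file=$1
    black=$2
    outdir=$3
    mkdir -p "$outdir" || return 1
    convert "$file" -level "$(level_arg "$black")" "$outdir/$(basename "$file")"
}
```

A call such as stretch_contrast page-010.jpg 20 preview writes the adjusted page to the preview folder, where you can inspect it next to the original.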
All the files were extracted from the scanned PDF book and numbered in order, and the file names haven't changed through all the conversion steps, so the entire book is ready to insert, page by page, back into a PDF. You can easily add the converted pages back in with:
convert *.jpg a_place_on_the_corner-purified.pdf
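The wildcard expands in plain alphabetical order, which matches the page order only as long as every numeric suffix has the same number of digits. For a very long book where the numbering outgrows its zero padding, a sketch like the following sorts the names numerically first. It relies on the -V option of GNU sort, pages_in_order is a hypothetical helper name, and the usage shown afterward assumes file names without spaces.

```shell
# Print the given file names in natural (version) order, one per line
pages_in_order() {
    printf '%s\n' "$@" | sort -V
}
```

With that helper defined, convert $(pages_in_order *.jpg) a_place_on_the_corner-purified.pdf assembles the pages in numeric order.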
The result is a PDF file that has crisp black text on a white background, and the difference is noticeable (Figure 5). You can load it on your e-reader or tablet and read it without eye strain.
Conclusion
Using this PDF purifying method, you can get rid of pinkish or yellow backgrounds from scanned documents. Figure 6 shows a comparison of the purified text with the original version of the book. This method will also help you remove other artifacts of the scanning process, such as the faint text lines that appear from the text on the other side of the scanned page, as well as dirty fingerprint marks, light coffee stains, or pencil marks.