PDF creators, extractors, and editors tested
Mutool
The small command-line tool Mutool [5] is part of MuPDF, a simple PDF viewer. Its developer, Artifex, describes it as the "Swiss army knife of PDF manipulation tools." If that is true, something is wrong with Swiss engineering: Mutool can only regenerate a PDF, extract its fonts and images, display some information about it, and arrange its pages on a giant poster.
Like MuPDF, Mutool is released under the Affero GPL. We looked at version 1.2.2, which can be found in the repositories of Ubuntu 13.10 as the mupdf-tools package.
Mutool uncomplainingly extracted all the images and the associated text from the documents written with InDesign, LibreOffice, and Scribus. Because the tool had no idea what to do with vector graphics, it only returned the font used in the Inkscape PDF, DejaVu Sans.
Mutool stores the fonts in the file format that it finds in the PDF. In InDesign documents, this meant that PostScript fonts in CFF and CID [6] formats were generated. In contrast, the free applications had embedded TrueType fonts. The exception is Scribus, which embeds fonts as PFA files.
Mutool always provides images in PNG format; the tool converts all other image types without asking. Mutool will open encrypted PDF files if the user tells it the password via an extra parameter.
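The extraction described above can be scripted. The following Python sketch is not taken from our tests; it assumes that mutool is in the search path, that the extract subcommand accepts the password through -p, and that it writes the extracted fonts and images to the current working directory. The function names are made up for this example.

```python
import subprocess
from pathlib import Path


def mutool_extract_command(pdf_path, password=None):
    """Build the argument list for `mutool extract` (helper name is hypothetical)."""
    command = ["mutool", "extract"]
    if password is not None:
        # Assumption: -p hands the password of an encrypted PDF to mutool.
        command += ["-p", password]
    command.append(str(pdf_path))
    return command


def extract_assets(pdf_path, output_dir, password=None):
    """Run mutool inside output_dir and return the names of the files found there."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Resolve the PDF path first, because the command runs in another directory.
    command = mutool_extract_command(Path(pdf_path).resolve(), password)
    subprocess.run(command, cwd=output_dir, check=True)
    return sorted(entry.name for entry in output_dir.iterdir())
```

A call such as extract_assets("article.pdf", "extracted", password="secret") would then collect the images and fonts in the extracted directory; the file names are placeholders.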
Poppler Utilities
The pdftotext command-line tool now belongs to the Poppler Toolbox, which in turn was created as a fork of Xpdf [7]. Most distributions provide Poppler in their repositories; on Ubuntu 13.10, pdftotext resides in the poppler-utils package. We tested version 0.24.1.
As its name suggests, pdftotext extracts the text from a PDF document. The results will always require postprocessing: In multicolumn documents, the extraction begins at the top left and stops at the bottom right on a page. The author box in the sample article ended up in the middle of the text, but at least no text was lost.
Using command-line options, you can restrict the analysis to individual pages and rectangular areas. On request, pdftotext will try to keep the layout (Figures 8 and 9). Columns and indentation are simulated with spaces. This makes it possible to read the LibreOffice test document, but spaces hinder further processing.
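One way to deal with the padding is to split each line at longer runs of spaces. The Python sketch below is only a heuristic: it assumes that two or more consecutive spaces mark a column gap or an indentation, which will not hold for every document. The function names are invented for this example.

```python
import re

# Assumption: a run of two or more spaces separates columns or indents a line.
_GAP = re.compile(r" {2,}")


def split_layout_line(line):
    """Split one line of layout-preserving output into its column fragments."""
    return [part for part in _GAP.split(line.strip()) if part]


def layout_to_rows(text):
    """Return one list of fragments per non-empty line of the extracted text."""
    return [split_layout_line(line) for line in text.splitlines() if line.strip()]
```

The resulting rows still mix fragments from neighboring columns on the same line, so reassembling the running text remains a manual step.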

Although users can also specify the character encoding, many non-standard characters were not recognized in the text generated from the sample documents. Pdftotext had no problems with password protection in the LibreOffice document; users only need to pass in the password with a parameter.
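The options mentioned for pdftotext can be combined in a single call. This Python sketch only assembles the command line; it relies on the switches documented in the pdftotext man page (-f and -l for the page range, -x, -y, -W, and -H for the rectangular area, -layout, -enc, and -upw for the user password). The helper name and its parameters are made up for illustration.

```python
def pdftotext_command(pdf_path, txt_path, first_page=None, last_page=None,
                      crop=None, keep_layout=False, encoding=None,
                      user_password=None):
    """Build a pdftotext argument list (hypothetical helper)."""
    command = ["pdftotext"]
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    if crop is not None:
        # crop = (x, y, width, height); x and y give the top left corner.
        x, y, width, height = crop
        command += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    if keep_layout:
        command.append("-layout")
    if encoding is not None:
        command += ["-enc", encoding]
    if user_password is not None:
        command += ["-upw", user_password]
    command += [str(pdf_path), str(txt_path)]
    return command
```

The list can be handed to subprocess.run(command, check=True) from the standard library, for example to extract only pages 2 and 3 of a protected file with the layout preserved.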
Fishing for Photos
In addition to pdftotext, the Poppler Tools also include pdfimages, which extracts images, and pdftohtml, which converts a PDF into HTML pages. Pdfimages only extracts bitmap images from the PDF and stores them in PPM format by default. With the -j switch, images that are stored in the PDF as JPEG data are written out as JPEG files instead. As in pdftotext, you can pass in the password; the tool cannot handle vector graphics.
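A call to pdfimages might be assembled as in the following Python sketch. It assumes the usual invocation with the PDF file followed by a root name for the image files, plus the -j and -upw switches from the man page; the helper itself is hypothetical.

```python
def pdfimages_command(pdf_path, image_root, jpeg=False, user_password=None):
    """Build a pdfimages argument list (hypothetical helper)."""
    command = ["pdfimages"]
    if jpeg:
        # -j writes images that are stored as JPEG data in the PDF as JPEG files.
        command.append("-j")
    if user_password is not None:
        command += ["-upw", user_password]
    # image_root is the prefix that pdfimages uses for the files it writes.
    command += [str(pdf_path), str(image_root)]
    return command
```

Passing the list to subprocess.run(command, check=True) runs the extraction.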
Pdftohtml behaves like a mixture of pdftotext and pdfimages: It extracts the images and dumps the text into one or more HTML files. To avoid a jumble of characters, we explicitly specified the character encoding. Pdftohtml will add the links in the PDF document to the HTML results, if so desired.
The extracted text was just as jumbled as the output of pdftotext, and users cannot force the tool to keep the layout in this case. To compensate, the tool can create a "complex document." Here, the layout and the images are transferred into a large PNG image, onto which the browser then superimposes the text (Figure 10). The result is indeed reminiscent of the original layout, but only the text can be copied and edited. Moreover, pdftohtml totally ignores all vector graphics.
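The following Python sketch shows how such a pdftohtml call could be put together. It assumes the -c switch for the complex document, -enc for the character encoding, and -upw for the user password, as listed in the pdftohtml man page; the helper name and its parameters are invented for this example.

```python
def pdftohtml_command(pdf_path, html_root=None, complex_document=False,
                      encoding=None, user_password=None):
    """Build a pdftohtml argument list (hypothetical helper)."""
    command = ["pdftohtml"]
    if complex_document:
        # -c requests the "complex document" with the page rendered as an image.
        command.append("-c")
    if encoding is not None:
        command += ["-enc", encoding]
    if user_password is not None:
        command += ["-upw", user_password]
    command.append(str(pdf_path))
    if html_root is not None:
        # Optional base name for the generated HTML files.
        command.append(str(html_root))
    return command
```

As with the other tools, subprocess.run(command, check=True) executes the assembled call.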