PDF creators, extractors, and editors tested


The small command-line tool Mutool [5] is part of the simple PDF viewer, MuPDF. Manufacturer Artifex describes it as the "Swiss army knife of PDF manipulation tools." If this is true, something is definitely wrong with Swiss engineering: Mutool can only regenerate the PDF, extract the fonts and images, display some information, and arrange the pages on a giant poster.

Like MuPDF, Mutool is released under the Affero GPL. We looked at version 1.2.2, which can be found in the repositories of Ubuntu 13.10 as the mupdf-tools package.

Mutool uncomplainingly extracted all the images and the associated fonts from the documents written with InDesign, LibreOffice, and Scribus. Because the tool has no idea what to do with vector graphics, it only returned the font used in the Inkscape PDF, DejaVu Sans.

Mutool stores the fonts in the file format that it finds in the PDF. In InDesign documents, this meant that PostScript fonts in CFF and CID [6] formats were generated. In contrast, the free applications had embedded TrueType fonts. The exception is Scribus, which embeds fonts as PFA files.

Mutool always provides images in PNG format; the tool converts all other image types without asking. Mutool will open encrypted PDF files if the user tells it the password via an extra parameter.
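As a rough illustration of the password handling described above, the following Python sketch assembles a mutool extract call. The helper names mutool_extract_cmd and extract_assets are our own, and the sketch assumes that the -p option takes the password and that the extracted files land in the current working directory; check the mutool usage output of your version before relying on either.

```python
import subprocess
from typing import List, Optional


def mutool_extract_cmd(pdf_path: str, password: Optional[str] = None) -> List[str]:
    """Build the argument list for `mutool extract` (hypothetical helper)."""
    cmd = ["mutool", "extract"]
    if password is not None:
        # Assumption: -p hands over the password for an encrypted PDF.
        cmd += ["-p", password]
    cmd.append(pdf_path)
    return cmd


def extract_assets(pdf_path: str, workdir: str, password: Optional[str] = None) -> None:
    """Run mutool inside workdir so that the extracted files are collected there.

    pdf_path should be absolute, because the command runs with workdir
    as its current directory.
    """
    subprocess.run(mutool_extract_cmd(pdf_path, password), cwd=workdir, check=True)
```

Calling extract_assets("/home/user/test.pdf", "/tmp/assets", password="secret") would then run the tool in /tmp/assets; both paths and the password are placeholders.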

Poppler Utilities

The pdftotext command-line tool now belongs to the Poppler Toolbox, which in turn was created as a fork of Xpdf [7]. Most distributions provide Poppler in their repositories; on Ubuntu 13.10, pdftotext resides in the poppler-utils package. We tested version 0.24.1.

As its name suggests, pdftotext extracts the text from a PDF document. The results will always require postprocessing: In multicolumn documents, the extraction begins at the top left and stops at the bottom right on a page. The author box in the sample article ended up in the middle of the text, but at least no text was lost.

Using command-line options, you can restrict the analysis to individual pages and rectangular areas. On request, pdftotext will try to keep the layout (Figures 8 and 9). Columns and indentation are simulated with spaces. This makes it possible to read the LibreOffice test document, but spaces hinder further processing.

Figure 8: If you attempt to convert the PDF document created by LibreOffice using pdftotext with the -layout parameter, …
Figure 9: … the results will look like this.
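If the padding spaces get in the way of further processing, one possible workaround is to treat longer runs of spaces as column gaps and split each line there. The following Python sketch shows the idea; the function name and the threshold of two spaces are assumptions on our part, and the heuristic will split a line incorrectly if the text itself contains wide gaps.

```python
import re
from typing import List


def columns_from_layout_line(line: str, min_gap: int = 2) -> List[str]:
    """Split one line of layout-preserving text output into its column cells."""
    # Runs of at least min_gap spaces are treated as gaps between columns;
    # single spaces between words stay untouched.
    parts = re.split(r" {%d,}" % min_gap, line.strip())
    # Drop empty cells, which occur for blank lines.
    return [part for part in parts if part]
```

Applied line by line, this turns the simulated columns back into separate strings that can be reassembled column by column.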

Although users can also specify the character encoding, many non-standard characters were not recognized in the text generated from the sample documents. Pdftotext had no problems with password protection in the LibreOffice document; users only need to pass in the password with a parameter.
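The options mentioned in this section can be combined in a single call. The Python sketch below only assembles the argument list; the function name pdftotext_cmd and its parameters are our own invention, whereas the switches -f, -l, -x, -y, -W, -H, -layout, -enc, and -upw are those documented in the pdftotext man page.

```python
from typing import List, Optional, Tuple


def pdftotext_cmd(
    pdf_path: str,
    txt_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    crop: Optional[Tuple[int, int, int, int]] = None,  # x, y, width, height
    layout: bool = False,
    encoding: Optional[str] = None,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build a pdftotext command line (hypothetical helper)."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    if crop is not None:
        x, y, width, height = crop
        # Restrict the extraction to a rectangular area of the page.
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    if layout:
        cmd.append("-layout")            # try to keep the original layout
    if encoding is not None:
        cmd += ["-enc", encoding]        # output character encoding
    if user_password is not None:
        cmd += ["-upw", user_password]   # user password of the PDF
    cmd += [pdf_path, txt_path]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True); for example, pdftotext_cmd("test.pdf", "test.txt", first_page=1, last_page=2, layout=True) covers only the first two pages in layout mode.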

Fishing for Photos

In addition to pdftotext, the Poppler utilities also include pdfimages, which extracts images, and pdftohtml, which converts a PDF into HTML pages. Pdfimages only extracts bitmap images from the PDF and then stores them in PPM format. You need to specify the -j switch to create JPEG images. As with pdftotext, you can pass in the password; the tool cannot handle vector graphics.
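A call to pdfimages might be put together as in the following Python sketch. The helper name and the output prefix are placeholders; -j and -upw are switches from the pdfimages man page. According to that man page, -j only affects images that are stored in the PDF in DCT (JPEG) format, so other images may still end up as PPM files.

```python
from typing import List, Optional


def pdfimages_cmd(
    pdf_path: str,
    image_root: str,
    jpeg: bool = False,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build a pdfimages command line (hypothetical helper)."""
    cmd = ["pdfimages"]
    if jpeg:
        cmd.append("-j")                # write DCT-encoded images as JPEG files
    if user_password is not None:
        cmd += ["-upw", user_password]  # user password of the PDF
    # image_root is the prefix for the numbered output files.
    cmd += [pdf_path, image_root]
    return cmd
```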

Pdftohtml behaves like a mixture of pdftotext and pdfimages: It extracts the images and dumps the text into one or more HTML files. To avoid a jumble of characters, we explicitly specified the character encoding. Pdftohtml will add the links in the PDF document to the HTML results, if so desired.

The extracted text was just as jumbled as with pdftotext, and users cannot force the layout in this case. To compensate, the tool can create a "complex document." Here, the layout and the images are transferred into a large PNG image, onto which the browser then superimposes the text (Figure 10). The result is indeed reminiscent of the original layout, but you can only copy and edit the text. Moreover, pdftohtml totally ignores all vector graphics.
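The complex mode and the explicit character encoding could be requested as in the Python sketch below. The helper name and the UTF-8 default are our assumptions; -c and -enc are the switches listed in the pdftohtml man page.

```python
from typing import List


def pdftohtml_cmd(
    pdf_path: str,
    html_path: str,
    complex_output: bool = False,
    encoding: str = "UTF-8",
) -> List[str]:
    """Build a pdftohtml command line (hypothetical helper)."""
    cmd = ["pdftohtml"]
    if complex_output:
        cmd.append("-c")          # generate the "complex document"
    cmd += ["-enc", encoding]     # set the output encoding explicitly
    cmd += [pdf_path, html_path]
    return cmd
```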

Figure 10: The first, extremely good impression was deceptive: The HTML page produced here by pdftohtml consists of a huge PNG image with the text superimposed.
