PDF creators, extractors, and editors tested
Mutool
The small command-line tool Mutool [5] is part of the simple PDF viewer MuPDF. Manufacturer Artifex describes it as the "Swiss army knife of PDF manipulation tools." If that is true, something is definitely wrong with Swiss engineering: The tool can only regenerate a PDF, extract fonts and images, display some information, and arrange pages on a giant poster.
Like MuPDF, Mutool is released under the Affero GPL. We looked at version 1.2.2, which can be found in the repositories of Ubuntu 13.10 as the mupdf-tools package.
Mutool uncomplainingly extracted all the images and the associated text from the documents written with InDesign, LibreOffice, and Scribus. Because the tool had no idea what to do with vector graphics, it only returned the font used in the Inkscape PDF, DejaVu Sans.
Mutool stores the fonts in the file format that it finds in the PDF. For the InDesign documents, this produced PostScript fonts in CFF and CID [6] formats. In contrast, the free applications had embedded TrueType fonts. The exception was Scribus, which embedded fonts as PFA files.
Mutool always outputs images in PNG format, converting all other image types without asking. It will open encrypted PDF files if the user supplies the password via an extra parameter.
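The extraction call described above can be scripted. The following is a minimal sketch, not the article's test setup: it assumes the `extract` subcommand takes the password through a `-p` option, as the tool's usage text suggests, and the helper names `mutool_extract_cmd` and `run_if_available` are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def mutool_extract_cmd(pdf_path: str, password: Optional[str] = None) -> List[str]:
    """Build the argument list for `mutool extract` (fonts and images)."""
    cmd = ["mutool", "extract"]
    if password is not None:
        # Assumption: -p hands the password for an encrypted PDF to mutool.
        cmd += ["-p", password]
    cmd.append(pdf_path)
    return cmd


def run_if_available(cmd: List[str]) -> Optional[int]:
    """Run cmd only if its executable is on PATH; return the exit code or None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode
```

Calling `run_if_available(mutool_extract_cmd("sample.pdf", "secret"))` would write the extracted files to the current directory; `sample.pdf` and `secret` are placeholders.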
Poppler Utilities
The pdftotext command-line tool now belongs to the Poppler Toolbox, which in turn was created as a fork of Xpdf [7]. Most distributions provide Poppler in their repositories; on Ubuntu 13.10, pdftotext resides in the poppler-utils package. We tested version 0.24.1.
As its name suggests, pdftotext extracts the text from a PDF document. The results will always require postprocessing: In multicolumn documents, the extraction begins at the top left and stops at the bottom right on a page. The author box in the sample article ended up in the middle of the text, but at least no text was lost.
Using command-line options, you can restrict the analysis to individual pages and rectangular areas. On request, pdftotext will try to keep the layout (Figures 8 and 9). Columns and indentation are simulated with spaces. This makes it possible to read the LibreOffice test document, but spaces hinder further processing.
![](/var/linux_magazin/storage/images/issues/2014/160/pdf-tools/figure-8/607520-1-eng-US/Figure-8_large.png)
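Because layout mode simulates columns with runs of spaces, one possible postprocessing step is to split or collapse those runs. This is only a sketch under the assumption that two or more consecutive spaces mark a column gap; the function names and sample strings are hypothetical and not part of the test.

```python
import re
from typing import List


def split_layout_line(line: str) -> List[str]:
    """Split one layout-preserved line into column fragments.

    Assumption: a run of two or more spaces separates columns.
    """
    return [part for part in re.split(r" {2,}", line.strip()) if part]


def collapse_spaces(text: str) -> str:
    """Collapse each run of spaces or tabs into one space, keeping line breaks."""
    return "\n".join(
        re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()
    )
```

Splitting keeps the column structure for further analysis, whereas collapsing yields plain running text but mixes the columns line by line.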
Although users can also specify the character encoding, many non-standard characters were not recognized in the text generated from the sample documents. Pdftotext had no problems with password protection in the LibreOffice document; users only need to pass in the password with a parameter.
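The options mentioned for pdftotext (page range, rectangular area, layout, encoding, and password) can be combined in one call. The sketch below only assembles the argument list; it relies on the flags documented in the pdftotext man page (`-f`, `-l`, `-x`, `-y`, `-W`, `-H`, `-layout`, `-enc`, `-upw`), and the function name and file names are hypothetical.

```python
from typing import List, Optional, Tuple


def pdftotext_cmd(
    pdf_path: str,
    out_path: str,
    first_page: Optional[int] = None,
    last_page: Optional[int] = None,
    area: Optional[Tuple[int, int, int, int]] = None,  # (x, y, width, height)
    keep_layout: bool = False,
    encoding: Optional[str] = None,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build a pdftotext argument list from the chosen options."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if area is not None:
        x, y, width, height = area
        # Restrict extraction to a rectangular crop area.
        cmd += ["-x", str(x), "-y", str(y), "-W", str(width), "-H", str(height)]
    if keep_layout:
        cmd.append("-layout")
    if encoding is not None:
        cmd += ["-enc", encoding]
    if user_password is not None:
        cmd += ["-upw", user_password]
    cmd += [pdf_path, out_path]
    return cmd
```

The resulting list can be passed to `subprocess.run`, for example `subprocess.run(pdftotext_cmd("in.pdf", "out.txt", keep_layout=True), check=True)`.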
Fishing for Photos
In addition to pdftotext, the Poppler Tools also include pdfimages, which extracts images, and pdftohtml, which converts a PDF into HTML pages. Pdfimages only extracts bitmap images from the PDF and then stores them in PPM format. You need to specify the -j switch to create JPEG images. As in pdftotext, you can pass in the password; the tool cannot handle vector graphics.
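A pdfimages call with the -j switch might be assembled as follows. This is a sketch that assumes the usual `pdfimages [options] PDF-file image-root` invocation and the `-upw` password option from the man page; the function name and the `img` prefix are hypothetical.

```python
from typing import List, Optional


def pdfimages_cmd(
    pdf_path: str,
    image_root: str,
    jpeg: bool = False,
    user_password: Optional[str] = None,
) -> List[str]:
    """Build a pdfimages argument list; image_root is the output file prefix."""
    cmd = ["pdfimages"]
    if jpeg:
        # -j requests JPEG files instead of the default PPM output.
        cmd.append("-j")
    if user_password is not None:
        cmd += ["-upw", user_password]
    cmd += [pdf_path, image_root]
    return cmd
```

With `image_root="img"`, the tool numbers its output files using that prefix.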
Pdftohtml behaves like a mixture of pdftotext and pdfimages: It extracts the images and dumps the text into one or more HTML files. To avoid a jumble of characters, we explicitly specified the character encoding. Pdftohtml will add the links in the PDF document to the HTML results, if so desired.
The extracted text was just as jumbled as with pdftotext, and users cannot force the layout in this case. To compensate, the tool can create a "complex document." Here, the layout and the images are transferred into a large PNG image, onto which the browser then superimposes the text (Figure 10). The result is indeed reminiscent of the original layout, but you can only copy and edit the text. Moreover, pdftohtml totally ignores all vector graphics.
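The "complex document" mode and the explicit encoding correspond to pdftohtml's `-c` and `-enc` options according to its man page. The following sketch only builds the argument list under that assumption; the function name and file names are hypothetical.

```python
from typing import List, Optional


def pdftohtml_cmd(
    pdf_path: str,
    html_base: str,
    complex_output: bool = False,
    encoding: Optional[str] = None,
) -> List[str]:
    """Build a pdftohtml argument list for simple or complex output."""
    cmd = ["pdftohtml"]
    if complex_output:
        # -c generates the "complex document" with a background image per page.
        cmd.append("-c")
    if encoding is not None:
        cmd += ["-enc", encoding]
    cmd += [pdf_path, html_base]
    return cmd
```

For instance, `pdftohtml_cmd("in.pdf", "out", complex_output=True, encoding="UTF-8")` yields the list for a complex conversion with UTF-8 output.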