PDF creators, extractors, and editors tested
PDF Full Wash Cycle
PDF is always a good choice, or so some people say. As a test, we produced PDF files, only to maltreat them with several open source programs. Some of the editors and extractors do a very good job, but others fail completely.
Adobe designed the Portable Document Format (PDF) as a layout-preserving transport format for final documents; PDF files cannot be easily edited later on, apart from using the standard annotation and comment features. The manufacturer, however, does offer a proprietary, commercial editor for Mac and Windows that gives users a restricted ability to delete and move items and correct typos in text.
Because Adobe has disclosed the PDF specification, some tools have fortunately emerged, including some for Linux, that can open PDF files, extract items from them, or even postprocess the files. This only works satisfactorily if the author has set the numerous export settings wisely and uses a program that creates a standards-compliant PDF.
To find out whether, and how well, the tools harmonize, we launched a test on Ubuntu 13.10. Using Inkscape, LibreOffice Writer, and Scribus, the Linux Magazine test team first designed several test documents that they then exported as PDF files with the default settings in each case (see the box "Test Documents").
Test Documents
In Inkscape, the test team laid out a police car from the Openclipart gallery on a blank page. This vector graphics image contains gradients and numerous overlapping objects. Below the image are some text boxes with dummy text, and two columns simulated by using overlapping text frames. The font we used was Liberation Sans.
In LibreOffice Writer, the test team composed a multipage document with a table of contents, headers and footers, and references and hyperlinks. As a basis, we used the Linux Magazine article "PHP for the Command Line". The pictures were imported in PNG format and labeled with serial numbers, along with matching captions. Code was given a colored background and a monospaced font; the body text was again formatted in Liberation Sans.
We added a text box with a frame to each listing, allowing the text to flow around one side. In some passages, the body text was also squeezed into two columns. Two simple charts from LibreOffice Draw served as additional illustrations. The document also included some LibreOffice-specific elements that the PDF editors and exporters had to deal with. Finally, we defined an access password to protect the PDF.
Working in Scribus, the test team exported some templates that came with the distribution: Brochure, Business Card Collection, Menu, Newsletter, and Cover Page. All of them torture the importers and converters with complex layouts and color gradients. In the flyer-like brochure, several text boxes and bitmap and vector graphic objects overlap. The text flows around some objects. The same applies to the newsletter, which mimics a three-page magazine article. The Business Card Collection consists of 50 colorful business cards on one page. Their backgrounds include vector graphics with gradients. The cover page again uses a large complex gradient and a number of overlapping vector objects.
As a kind of reference, we used a one-page original Linux Magazine article in our lab, of the kind that anyone can purchase from the publisher's store. It contains multiple text boxes, three columns, two bitmap images, and multiple fonts. The PDF was generated by the Adobe InDesign desktop publishing program on Mac OS X.
Because all three programs can also modify PDF documents, the testers unceremoniously fed their results to all of these programs. Additionally, we fed the test documents to the converters gPDFText, Mutool, pdftotext, pdfimages, pdftohtml, and pdf2svg. Although these are not editors, they do promise to extract text, images, and, in the case of Mutool, even the fonts.
Exporting elements from PDF files is a fairly normal thing to do: In everyday life, people often receive whitepapers, e-books, or presentations as PDF files, or they find them on the web. If you want to cite them or use a graphic, it makes sense to extract the desired components digitally, rather than grabbing pixelated Adobe Reader screenshots.
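For readers who want to try several of the command-line converters in one pass, the following Python sketch builds a basic call for each of pdftotext, pdfimages, pdftohtml, and pdf2svg and runs only those that are installed. It is a minimal sketch, not the procedure used in the test: the function names, the input file `article.pdf`, and the output directory are assumptions made for illustration, and only the plain invocations without options are used. Mutool and the graphical gPDFText are left out.

```python
import shutil
import subprocess
from pathlib import Path


def extraction_commands(pdf, outdir):
    """Build one basic command line per tool; nothing is executed here."""
    out = Path(outdir)
    stem = Path(pdf).stem  # hypothetical naming scheme for the output files
    return {
        # Poppler: plain text of the whole document
        "pdftotext": ["pdftotext", pdf, str(out / f"{stem}.txt")],
        # Poppler: embedded bitmaps, written as PPM/PBM files by default
        "pdfimages": ["pdfimages", pdf, str(out / stem)],
        # Poppler: HTML rendition of the document
        "pdftohtml": ["pdftohtml", pdf, str(out / stem)],
        # pdf2svg: converts a single page (page 1 here) to SVG
        "pdf2svg": ["pdf2svg", pdf, str(out / f"{stem}.svg"), "1"],
    }


def run_available(pdf, outdir):
    """Run only the tools found on PATH and return their names."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    ran = []
    for tool, cmd in extraction_commands(pdf, outdir).items():
        if shutil.which(tool) is None:
            continue  # tool is not installed, so skip it
        subprocess.run(cmd, check=False)  # a failing tool does not stop the rest
        ran.append(tool)
    return ran


# Example call (assumes article.pdf exists in the current directory):
# run_available("article.pdf", "extracted")
```

Keeping the command construction separate from the execution makes it easy to inspect or log the calls before anything touches the files.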
Users of the Inkscape drawing program have very little in the way of options for influencing the generated PDF. For example, you can only choose between PDF versions 1.4 and 1.5. You can also restrict the export to selected parts of the drawing, and the text can be converted into paths or polylines (see the box "Good Fonts, Bad Fonts"). If you do not do the conversion, Inkscape only embeds a subset of the fonts used.
Good Fonts, Bad Fonts
Because you will never find all possible fonts on any given computer, you can embed the fonts used in your PDF documents. The manufacturers of commercial fonts do not like you to do this, however; thus, many PDF creators do not pack the complete font set in the PDF, but only the glyphs for the visible characters.
This, in turn, makes postprocessing difficult because the proofreader will probably not have all the required characters available. Alternatively, some PDF exporters convert text to vector graphics letter by letter. Although these curves scale without loss, it is virtually impossible to edit the text.
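Why a font subset hampers later corrections can be shown with a toy model in Python. The sketch below simply treats every visible character as one glyph, which is a deliberate simplification (real fonts also involve ligatures, glyph variants, and encodings); the function names are made up for this illustration and do not belong to any PDF library.

```python
def subset_glyphs(text):
    """Characters a subsetting exporter would need to embed for this text."""
    return {ch for ch in text if not ch.isspace()}


def missing_for_edit(original, edited):
    """Characters the edited text needs but the embedded subset lacks."""
    return subset_glyphs(edited) - subset_glyphs(original)


# The document only contained "PDF file"; the proofreader wants "PDF files".
print(missing_for_edit("PDF file", "PDF files"))
```

In this model, the corrected text needs an "s" that never made it into the embedded subset, so an editor would have to fall back on a locally installed font.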
When you import a PDF document, you can replace the PDF fonts with similar fonts that are installed locally. Inkscape does an exemplary job of importing its self-written PDF, although the superimposed text frames were merged into a single large one. The other PDFs also held up well under inspection; even the complex layout of the page created by InDesign was well preserved (Figure 1).
However, Inkscape puts each line of text or each word group into a separate text frame. Attempts to edit one of the articles were also sobering: Inkscape does not adjust the size of the text frame; it simply superimposes the characters. We first had to draw a new text frame and then copy the text into it. Also, Inkscape made some silly mistakes with the templates exported from Scribus. The complex gradients were repeatedly missing; for example, the red and blue gradient on the cover page disappeared (Figures 2 and 3).
The business cards burdened the lab computers to the extent that Inkscape's response was very slow. Additionally, the drawing program is unable to import password-protected PDFs and can only import and display one selected page. At least a small preview helps you select the page.
LibreOffice Writer in version 4.1.2 produces PDF version 1.4 files. At the request of the author, Writer can also deliver PDF/A-1a (see the box "Many Versions"). Users can influence the exports to an amazing degree – for example, defining the quality of the exported images as a percentage that LibreOffice Writer uses when exporting to JPEG format. Alternatively, Lossless Compression is possible.
Many Versions
Adobe launched the Portable Document Format back in 1993. During the past 20 years, the company has expanded the specification six times. Among other things, it added transparency and forms and repeatedly introduced new encryption algorithms. The final version, PDF 1.7, dates from 2006. Additionally, PDF serves as the basis for various ISO document formats. The PDF/X standards are used in the prepress field, whereas the PDF/A variants are designed to facilitate archiving.
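Which PDF version a program claims to have written can be read from the first line of the file, which has the form `%PDF-1.4`. The following Python sketch extracts that number; the function name is an assumption for illustration. Note that the header is only the first hint: Since PDF 1.4, a Version entry in the document catalog can override it, and this sketch does not look at that entry.

```python
import re

# The header comment looks like "%PDF-1.4" and sits at the start of the file.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Example call (assumes export.pdf exists):
# with open("export.pdf", "rb") as handle:
#     print(pdf_header_version(handle.read(1024)))
```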
LibreOffice can also embed the Open Document file in the PDF document, which is designed to facilitate editing later on: A PDF reader then sees the document in PDF format, whereas LibreOffice sees the embedded Open Document. However, Writer does not add the office data to the PDF as a normal attachment, and it consequently does not appear in the Adobe Reader Attachments tab.
On request, however, bookmarks, comments, and a watermark generated from arbitrary text can be migrated to the PDF. Automatically generated tags are designed, among other things, to facilitate access to the PDF by people with special needs. The fonts used are fully embedded in the PDF file; if the user prefers, this includes even the standard fonts. The word processor converts the form elements into a PDF form; users can choose between FDF, PDF, HTML, and XML as the format in which the form submits its data.
More settings affect the display in Adobe Reader: Among other things, you can hide the menubar and display the document directly as a double-page spread after opening. LibreOffice Writer converts references, links, footnotes, and the entries in the table of contents into appropriate PDF links. Later, one click is all it takes to reach the desired chapter or website.
Authors can password-protect their documents against prying eyes, as well as limit the range of functions by, for example, prohibiting printing of the document. A restricted action of this kind is then only possible for users who enter a second password that the author has set.
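In the PDF standard security handler, restrictions of this kind are stored as bit flags in an integer entry named P in the file's encryption dictionary. The Python sketch below decodes four of these flags from such a value. It only illustrates the bit layout: The function name and the dictionary keys are made up for this example, and the code does not read or decrypt a real PDF.

```python
# Bit positions follow the PDF specification, which counts from bit 1.
PERMISSION_BITS = {
    "print": 1 << 2,     # bit 3: print the document
    "modify": 1 << 3,    # bit 4: modify the contents
    "copy": 1 << 4,      # bit 5: copy or extract text and graphics
    "annotate": 1 << 5,  # bit 6: add or modify annotations
}


def decode_permissions(p_value):
    """Map a P value (a signed integer) to a dict of allowed actions."""
    return {name: bool(p_value & flag) for name, flag in PERMISSION_BITS.items()}


# All bits set except the print bit: everything is allowed but printing.
print(decode_permissions(-1 & ~PERMISSION_BITS["print"]))
```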
LibreOffice does not open PDF documents in the Writer word processor, but in the Draw drawing module. This action produced mixed results: The article from InDesign was barely recognizable as such (Figure 4), but the text was complete and editable. In the document from Inkscape, however, LibreOffice mangled the vector graphic (Figure 5). The PDF files from Scribus contained numerous small layout errors, a particularly common one being that long text frames jutted out beyond the page margin, as in the article in Figure 4.
The best result was the PDF document exported by Writer, albeit without the embedded Open Document file: The result was complete, and the layout was precisely preserved. However, LibreOffice Draw dumped each line of text in a separate text frame. The more complex graphics in the PDF also delayed the import process on our lab machine. When it came to the business cards from Scribus, LibreOffice crashed reproducibly. The fact that Draw can handle password-protected PDF documents barely compensates for the above-mentioned shortcomings.