Scan Tailor and Paperwork

Papers, Please

Article from Issue 189/2016

Transform piles of paper into a neatly organized and searchable digital library with Scan Tailor and Paperwork.

Magazine articles, important documents, receipts, and whatnot – paper is still the most commonly used storage and distribution medium, and it's not going anywhere in the foreseeable future. The problem with paper is that it tends to pile up and take up all available space at an astonishing speed. Worse yet, every time you discard documents you think you no longer need, there is a risk of throwing away something important.

Fortunately, tools like Scan Tailor and Paperwork provide a solution to this conundrum. Using these applications, you can set up an efficient system for turning paper documents into a searchable library of scanned and cleaned up files.

Processing Scanned Pages with Scan Tailor

Most mainstream Linux distributions come with a scanning utility preinstalled. Ubuntu, for example, ships with Simple Scan, a no-frills tool that's more than adequate for all but the most complex scanning tasks. But scanning documents is only half of the battle. In most cases, you might want to clean up and tweak scanned pages, and this is where Scan Tailor [1] comes into the picture (Figure 1). This application is designed for post-processing scanned pages, and it allows you to split and deskew pages, add and remove borders, as well as generate cleaned up files. Keep in mind, though, that Scan Tailor is not a scanning application, so you need to scan pages before you start using the application.

Figure 1: Scan Tailor interface in all its spartan beauty.

Scan Tailor is available in the software repositories of many popular Linux distributions, so you can install it using the default package manager. On Ubuntu, installing Scan Tailor is a matter of running the sudo apt install scantailor command.

After you launch Scan Tailor, you need to set up a new project (a directory containing scanned pages ready for processing and various project-related files). To do this, press the New Project button and specify the folder containing scanned pages as the input directory. You should then see a list of all scanned files in the Files in Project pane of the Project Files dialog window.

At this point, you can exclude certain pages from the project, if necessary. Then, press the OK button. If the project files have an unspecified or incorrect dots per inch (DPI) resolution, Scan Tailor prompts you to fix it. In the Fix DPI dialog window, select the All Pages entry under the Need Fixing tab, specify the correct DPI resolution (300x300 is a good starting point), and press Apply. Press OK to create and open the project.
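
To see why the DPI value matters, it helps to relate pixel dimensions to the physical page size. The following is a minimal Python sketch of that arithmetic, assuming an A4 page of 210 x 297 mm; the function names are invented for illustration and are not part of Scan Tailor.

```python
MM_PER_INCH = 25.4


def pixels_for_page(width_mm, height_mm, dpi):
    """Return the (width, height) in pixels of a page scanned at the given DPI."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def guess_dpi(width_px, width_mm):
    """Estimate the scan resolution from the pixel width and the known paper width."""
    return width_px / (width_mm / MM_PER_INCH)


# An A4 page (210 x 297 mm) scanned at 300 DPI
print(pixels_for_page(210, 297, 300))  # (2480, 3508)
```

If a scan of a known paper size has very different pixel dimensions from what this calculation suggests, the DPI stored in the file is probably wrong, and guess_dpi gives a rough value to enter in the Fix DPI dialog.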

Although Scan Tailor's tools are easy to master, you need to understand the overall workflow to use them optimally. Scan Tailor is a batch processor that works in several stages: Fix Orientation, Split Pages, Deskew, Select Content, Margins, and Output (Figure 2). For each stage, you adjust the available settings for the pages in the project. When you press the Run button, the configured action is applied to the pages.

Figure 2: Pages in Scan Tailor are batch-processed in stages.

Here is how this works in practice. Assume you have a project consisting of three scanned pages, and you start post-processing from the Split Pages stage. This action can split double-page scans into two separate pages. Select the first page in the thumbnail sidebar to open the page in the working area. Select the Split Pages stage, and Scan Tailor should automatically detect the appropriate layout.

If the application fails to do this, press the Change button, choose the manual mode, and select the appropriate layout in the Page Layout section. To apply this configuration to other pages, press Change and enable the All option in the Scope section. Alternatively, you can specify settings for each page individually by selecting the next page in the thumbnail sidebar and adjusting the available options. When you're done, press the Run button next to the Split Pages stage to run the action on the pages.

Although the Fix Orientation and Deskew stages in Scan Tailor can be skipped, the other stages must be completed before you can move to the Output stage. This last stage is where the application generates processed pages in the TIFF format (Figure 3). Besides the output resolution, you can select one of three modes: Black and White (generates black-and-white pages), Color/Grayscale (produces color or grayscale pages), and Mixed. The last of these generates pages where text is treated as black-and-white areas, while images are handled as color zones.
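
The stage order can be summarized as a small checklist. The Python sketch below is only a mental model of the workflow described here, not Scan Tailor code: the list marks Fix Orientation and Deskew as skippable, and the hypothetical helper missing_before_output reports which mandatory stages still have to run.

```python
# (stage name, can be skipped) in the order the stages appear in Scan Tailor
STAGES = [
    ("Fix Orientation", True),
    ("Split Pages", False),
    ("Deskew", True),
    ("Select Content", False),
    ("Margins", False),
    ("Output", False),
]


def missing_before_output(completed):
    """Return the mandatory stages that still have to run before Output."""
    return [name for name, optional in STAGES[:-1]
            if not optional and name not in completed]


print(missing_before_output({"Split Pages"}))  # ['Select Content', 'Margins']
```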

Figure 3: Generating output result.

In many cases, Scan Tailor does a good job of identifying images, but the application also allows you to specify so-called picture zones manually (Figure 4). To do this, switch to the Picture Zones tab on the right side of the main working area. Use the mouse to draw boxes around images on the page. To clean up the page even further and reduce the size of the final file, you can mark large empty white areas on the page under the Fill Zones tab.

Figure 4: Scan Tailor lets you specify picture zones on a page.

Scan Tailor features two more useful tools for fixing pages. The Dewarping tool allows you to manually straighten the page, which comes in handy when working with scans of wrinkled or otherwise distorted pages (Figure 5). This tool can be used for perspective adjustment, too, so it's well suited for processing images taken with a mobile device. The Despeckling tool cleans up the page by removing small stray artifacts (speckles). Once you've configured the output settings, press Run next to the Output stage to generate processed TIFF images in the specified output folder.

Figure 5: The Dewarping tool is useful for straightening warped pages.

At this point, Scan Tailor's job is done. You still need to assemble multiple pages into a single document, run them through optical character recognition (OCR) software, and perform other operations, but those tasks are beyond the scope of Scan Tailor's functionality. So, how do you actually turn a collection of single pages into a well-organized and searchable library? Enter Paperwork [2].

Building an Archive with Paperwork

Paperwork offers powerful yet easy-to-use tools for archiving, organizing, and searching scanned documents (Figure 6). Written in Python, it is easy to install using a few simple commands. On Ubuntu, you can start by installing the required packages:

sudo apt install python-pip python-setuptools python-dev python-pil libenchant-dev

Figure 6: Paperwork is an ideal tool for building a searchable library of scanned pages.

Next, install Paperwork by running the sudo pip install paperwork command. Finally, you need to check for remaining dependencies and install them using the paperwork-chkdeps command. Once you've done that, you can launch Paperwork by issuing the paperwork command in the terminal.

The first step is to populate the application with scanned pages, and Paperwork provides two ways to do that. If you already have scanned pages, you can import them into Paperwork. This option is ideal for importing pages processed with Scan Tailor. However, Paperwork doesn't support the TIFF format, so you need to convert .tif files from Scan Tailor into the JPEG format first.

The easiest way to do this is to use the mogrify tool that is part of the ImageMagick suite. In the terminal, switch to the directory containing the .tif files and run the mogrify -format jpg *.tif command. Then, switch to Paperwork, select the New Document item in the left sidebar, and choose Import file(s) from the Scan menu. Add all the pages you need, and Paperwork will import them into a single document – no need to collate them manually.
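
If you convert Scan Tailor output regularly, the same mogrify call can be wrapped in a small script. The following Python sketch is one possible way to do it, assuming ImageMagick's mogrify is on the PATH; the function names and the example file names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_mogrify_command(filenames):
    """Build the mogrify call that converts the .tif files in `filenames` to JPEG."""
    tif_files = sorted(n for n in filenames if n.lower().endswith(".tif"))
    if not tif_files:
        return None  # nothing to convert
    # -format jpg writes a .jpg next to each input and leaves the .tif untouched
    return ["mogrify", "-format", "jpg", *tif_files]


def convert_directory(directory):
    """Run the conversion inside `directory` (requires ImageMagick)."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    command = build_mogrify_command(names)
    if command is not None:
        subprocess.run(command, cwd=directory, check=True)


# Show the command that would be run for a few example file names
print(build_mogrify_command(["page_2.tif", "page_1.tif", "notes.txt"]))
# ['mogrify', '-format', 'jpg', 'page_1.tif', 'page_2.tif']
```

Calling convert_directory with the path of the Scan Tailor output folder then performs the conversion in one step.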

You can also scan pages directly into Paperwork using the Scan button. When you press the Scan button for the first time, Paperwork prompts you to configure some basic settings (Figure 7). To change the default directory for the Paperwork library, specify the desired location in the Work directory field. If Paperwork has successfully detected the connected scanner, you should see its name in the Device field. Then select the appropriate scan type from the Source drop-down list. Interestingly, Paperwork can scan not only regular pages (the Normal source) but also slides (Transparency) and negatives (Negative).

Figure 7: Configuring Paperwork's settings.

Obviously, you wouldn't want to use Paperwork as your preferred tool for scanning negatives, but it can be useful for maintaining records for physical negatives and transparencies. Select the desired scanning resolution (300 is good, 600 is even better) and make sure that the OCR (Optical Character Recognition) option is enabled. Press the Scan button to perform a test scan and make sure that everything works properly. Close the Settings dialog, choose the New document item in the left sidebar, and press Scan to start scanning pages to a new document.

Although Paperwork is not built for processing scanned pages, it has basic cropping and rotating tools. To access them, click on the page you want to edit, and press the Edit button.

When you scan pages in Paperwork, it automatically performs optical character recognition, and the application allows you to run full-text search queries on all scanned documents as if they were regular text files. Start typing a search term in the Search field, and Paperwork returns a list of pages containing the search term. Better still, the application highlights all occurrences of the search term in each returned page. This functionality alone makes Paperwork an indispensable tool for managing scanned pages, but the application has yet another clever trick up its sleeve.
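
To illustrate the general idea behind searching OCR results, here is a toy inverted index in Python. It is a simplified sketch and not Paperwork's actual implementation; the page identifiers and texts are invented sample data standing in for OCR output.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose OCR text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def highlight(text, term):
    """Wrap each case-insensitive occurrence of `term` in square brackets."""
    return re.sub(re.escape(term), lambda m: f"[{m.group(0)}]", text,
                  flags=re.IGNORECASE)


# Invented sample data standing in for recognized page text
pages = {
    "invoice-p1": "Invoice for scanner repair",
    "manual-p3": "Clean the scanner glass before each scan",
}
index = build_index(pages)
print(sorted(search(index, "scanner")))       # ['invoice-p1', 'manual-p3']
print(highlight(pages["manual-p3"], "scan"))  # Clean the [scan]ner glass before each [scan]
```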

You can add multiple labels to each document in the library and assign a unique color to each label. This allows you to visually identify specific documents in the library, as well as search documents by their labels. In addition to that, you can assign multiple keywords to each document and run keyword-based search queries.

To add labels and keywords to a document, press the Properties button next to it (Figure 8). In the Properties sidebar, press +, give the label a name, and assign a color to it. Enter the desired keywords in the Additional keywords field.

Figure 8: Assigning labels and keywords to a document.

To search documents by labels and keywords, press the Advanced search button next to the Search field. In the Search dialog window, you can configure search criteria that include multiple rules. For example, the advanced search query in Figure 9 finds all documents added to the library between December 31, 2015, and May 25, 2016, and containing the amateur-photographer and sensor-design labels.
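
A query of this kind boils down to a filter with several conditions that must all hold. The Python sketch below models it with made-up documents; the Document class and advanced_search function are hypothetical, and treating the date range as inclusive is an assumption rather than documented Paperwork behavior.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    title: str
    added: date
    labels: set = field(default_factory=set)


def advanced_search(documents, start, end, required_labels):
    """Return documents added within [start, end] that carry all required labels."""
    wanted = set(required_labels)
    return [doc for doc in documents
            if start <= doc.added <= end and wanted <= doc.labels]


# Invented sample library
library = [
    Document("Sensor review", date(2016, 2, 14),
             {"amateur-photographer", "sensor-design"}),
    Document("Lens test", date(2016, 3, 1), {"amateur-photographer"}),
    Document("Old article", date(2015, 6, 30),
             {"amateur-photographer", "sensor-design"}),
]

hits = advanced_search(library, date(2015, 12, 31), date(2016, 5, 25),
                       ["amateur-photographer", "sensor-design"])
print([doc.title for doc in hits])  # ['Sensor review']
```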

Figure 9: Paperwork lets you create advanced queries.

Paperwork stores scanned pages as image files in a dedicated directory, but the application also allows you to export individual pages and entire documents in the library as PDF files (see the "Digitize Documents with Open Note Scanner" box for additional information). This feature gives you an easy way to produce collated documents that can be read on practically any platform. To export a page or a document, choose Export | Page or Export | Document from the Options (a.k.a. Hamburger) menu. Select the desired paper size and quality, specify a path and name for the output file, and select Export.

Digitize Documents with Open Note Scanner

A regular scanner is not the only option you have for scanning paper. If your Android device has a decent camera, you can use it to scan paper documents using a specialized app like Open Note Scanner [3]. This is not the most advanced scanning app for Android, but it does the job, and it's released under the GPLv3 license.

Open Note Scanner captures and processes pages automatically, and it works best with pages that have dark (or preferably black) borders around them. To get the best possible result, you might want to place a loose page on a black surface. When activated, Open Note Scanner detects the page's boundaries, captures the document, corrects the perspective of the captured image, and converts it to black and white. If you plan to do post-processing in Scan Tailor, you can turn off the image-processing functionality in Open Note Scanner.

Final Word

Scan Tailor and Paperwork make a powerful combination for processing and organizing scanned documents. So, if you need to bring order to your paper chaos, these tools will handle the task with aplomb.
