OCR under Linux

Beyond the Basics

Article from Issue 184/2016
Author(s):

Linux OCR software lags behind proprietary applications. We describe some ways to get better results.

Optical character recognition (OCR) is the extraction of text from images. Users often expect OCR to be as straightforward and easy as photocopying, but that is generally true only in the simplest of cases. More often, OCR is a painstakingly slow series of trials and errors, and that is especially true in free software OCR, which lags far behind the leading proprietary applications.

The reasons that OCR is so labor intensive are obvious when you stop to think. At first, an OCR application with more than 98 percent accuracy sounds reliable, but, assuming 300 words per page, that means an average of three to six errors per page. With a complex layout that includes columns and graphics, the number of errors can easily rise to more than 10 per page [1].
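The arithmetic behind that estimate is quick to check. The sketch below assumes that accuracy is counted per word, which the figures above imply but do not state; the function name is illustrative only.

```shell
#!/bin/sh
# Expected misread words per page, assuming accuracy is counted per word.
errors_per_page() {
    words=$1      # words on the page
    accuracy=$2   # recognition accuracy in whole percent
    echo $(( words * (100 - accuracy) / 100 ))
}

errors_per_page 300 99   # prints 3
errors_per_page 300 98   # prints 6
```

If accuracy were counted per character instead, the same percentages would mean considerably more errors per page.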

To make matters worse, some characters are difficult to tell apart, such as the number one (1) and the lowercase L (l), or zero (0) and the letter O in either case. Other characters, such as the ampersand and question mark, can have a bewildering range of shapes (Figure 1). In some cases, too, a short descender (the part of a letter below the baseline) might cause a "y" to be read as a "v" instead. Similarly, a "d" might be read as an "a" if its ascender (the part of the letter above the x-height or medium height of letters) is short.

Figure 1: The variety of different shapes for some characters like ampersands can sometimes defeat OCR applications. This is only one of the problems that all OCR applications face.

In fact, even if the application reads the character set, a font with thin lines or one that has been manually kerned or has anything except a horizontal baseline can be difficult to interpret. The darkness of letters and their background can also affect the success of OCR.

In the case of free software, such difficulties are compounded by a relative lack of attention to OCR. Projects like GOCR [2] or Ocrad [3] proceed so slowly that at times they appear to be inactive. Today, most OCR under Linux depends on Tesseract [4] or CuneiForm [5]. The accuracy of both is roughly equivalent for blocks of text (Figure 2), but CuneiForm tends to be less accurate on highly formatted text (Figure 3), and some users may prefer to avoid CuneiForm because its code is only partially released under a free license. Other OCR applications exist, such as YAGF [6], but they are only front ends for Tesseract or CuneiForm. For better or worse, free software OCR remains primarily at the command line.

Figure 2: Free-licensed OCR does a reasonable job on solid blocks of text, although some letter combinations (e.g., fi in LibreOffice) defeat it.
Figure 3: When a page is highly formatted, free-licensed OCR gives results so poor as to be useless.

Working with Tesseract

Tesseract was first developed by Hewlett Packard from 1985 to 1996. Little work was done on it for a decade, until the code was housed by Google in 2006. It is now housed on GitHub. Tesseract generally installs with an English language pack, but you can also download almost 50 other languages. In fact, much of the recent development work on Tesseract seems to consist of adding languages.

I keep hearing rumors that Tesseract supports multiple graphics formats. However, the versions available in Debian support only .tif images. If you are extracting text from another format, convert it first with the ImageMagick convert utility, which is installed in many distributions by default.

To use the convert utility, enter the original file name and a name for the output file. For example:

convert ORIGINAL OUTPUT
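As a concrete sketch, convert chooses the output format from the extension of the output name, so turning a PNG into a TIFF needs no extra options. The file name scan.png and both helper functions are hypothetical.

```shell
#!/bin/sh
# Build the .tif name for an input image (hypothetical helper).
tif_name() {
    printf '%s.tif\n' "${1%.*}"   # strip the last extension, add .tif
}

# Convert one image to TIFF with ImageMagick.
to_tiff() {
    convert "$1" "$(tif_name "$1")"
}

# scan.png is a placeholder; nothing runs unless the file and tool exist.
if [ -f scan.png ] && command -v convert >/dev/null 2>&1; then
    to_tiff scan.png              # writes scan.tif
fi
```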

When you have a .tif image, text extraction can also be straightforward. Tesseract adds the .txt extension itself, so give only the base name for the output:

tesseract FILE.tif OUTPUT

The output is produced with no indication of progress except a return to the command prompt when the process is complete. Output is to plain text, making Tesseract a salvage tool, rather than a means to reproduce the original format.

However, you can also add a few options to the basic command. With -l LANGUAGE, you can specify a language other than English, using the abbreviations given in the man page. Multiple languages can be listed if necessary.
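For instance, a page that mixes English and German might be handled as in the sketch below. The codes eng and deu are the usual three-letter abbreviations, and joining several with a plus sign is the form the man page describes. The file name and the helper functions are placeholders, and each language pack must already be installed.

```shell
#!/bin/sh
# Join language codes with "+", the separator Tesseract expects.
join_langs() {
    langs=$1
    shift
    for code in "$@"; do
        langs="$langs+$code"
    done
    echo "$langs"
}

# OCR FILE into OUTBASE.txt using one or more languages.
ocr_langs() {
    file=$1
    outbase=$2
    shift 2
    tesseract "$file" "$outbase" -l "$(join_langs "$@")"
}

# Placeholder file name; runs only where the file and Tesseract exist.
if [ -f letter.tif ] && command -v tesseract >/dev/null 2>&1; then
    ocr_langs letter.tif letter eng deu   # writes letter.txt
fi
```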

Another useful option is -psm NUMBER (spelled --psm NUMBER in Tesseract 4 and later), which sets the page segmentation mode, as shown in Table 1. Depending on the image, you might want to try one of these modes in the hope of getting more accurate results.

Table 1: Tesseract Options

0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD or OCR.
3 = Fully automatic page segmentation, but no OSD (default).
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
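Because the best mode depends on the image, one approach is simply to try several and compare the results. The sketch below is one way to do that, not a recommended procedure: the file name page.tif, the choice of modes 3, 4, and 6, and the word count as a rough yardstick are all assumptions. Newer Tesseract releases spell the option --psm, so the flag is kept in a variable.

```shell
#!/bin/sh
# Flag name: -psm in Tesseract 3.x, --psm in 4.x and later.
# Defaults to -psm unless PSM_FLAG is already set in the environment.
PSM_FLAG=${PSM_FLAG:--psm}

# Output base name for a given mode, e.g. page-psm4 (hypothetical scheme).
psm_outbase() {
    printf '%s-psm%s\n' "${1%.*}" "$2"
}

# Run one image through several segmentation modes.
try_modes() {
    image=$1
    shift
    for mode in "$@"; do
        tesseract "$image" "$(psm_outbase "$image" "$mode")" "$PSM_FLAG" "$mode"
    done
}

# Placeholder file name; runs only where the file and Tesseract exist.
if [ -f page.tif ] && command -v tesseract >/dev/null 2>&1; then
    try_modes page.tif 3 4 6
    wc -w page-psm*.txt   # more words is a hint, not proof, of a better read
fi
```

Reading the candidate files side by side is still the only reliable comparison; the word count merely flags a mode that missed large parts of the page.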

Tesseract also supports the option -c configvar=VALUE, which can be added multiple times to set multiple variables. However, the only list of configuration variables I have been able to find is a partial one from an outdated Google page [7]; most of the variables are for Japanese, and none of them is likely to improve accuracy for English. Perhaps the option is primarily for future development, but, for now, Tesseract either works or it doesn't. If it doesn't, -psm NUMBER is the only tool within Tesseract itself that might improve accuracy.
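To show the syntax of repeated -c options, the sketch below uses tessedit_char_whitelist, a variable that restricts recognition to a given set of characters. Whether it has any effect depends on the Tesseract build, so treat it as an assumption to test rather than a fix; the file name and the helper function are placeholders.

```shell
#!/bin/sh
# Turn NAME=VALUE pairs into repeated "-c NAME=VALUE" options.
# Values must not contain spaces in this simple version.
config_opts() {
    opts=""
    for pair in "$@"; do
        opts="$opts -c $pair"
    done
    echo "${opts# }"   # drop the leading space
}

# Digits only, e.g. for a scanned column of numbers (placeholder file name).
if [ -f numbers.tif ] && command -v tesseract >/dev/null 2>&1; then
    # Left unquoted on purpose so the shell splits it into separate words.
    tesseract numbers.tif numbers $(config_opts tessedit_char_whitelist=0123456789)
fi
```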

Working with CuneiForm

CuneiForm is a mixture of freeware and software released under a BSD license. For this reason, in Debian and many of its derivatives, CuneiForm is classified as non-free and will not appear in your list of available packages unless the non-free section of the repositories is enabled.

CuneiForm's basic command structure is even more straightforward than Tesseract's:

cuneiform FILE

However, CuneiForm has several advantages. To start, CuneiForm supports most common graphics formats, so in most cases you have no need to convert the original file. Unless you specify an output file, it writes to cuneiform-out.EXTENSION, although with -o OUTPUT, you can give the output a different name. Its default output, like Tesseract's, is plain text, but you can also set the -f FORMAT option to html or rtf. For simple text, you may also be able to improve CuneiForm's accuracy for articles, essays, and many other genres with --singlecolumn.
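Put together, these options might be used as in the following sketch. The file name page.png and the helper functions are hypothetical; only the -f, -o, and --singlecolumn options come from the description above.

```shell
#!/bin/sh
# Output name for a given format, e.g. page.png + html -> page.html.
cf_outname() {
    printf '%s.%s\n' "${1%.*}" "$2"
}

# Recognize IMAGE and write FORMAT (html or rtf here) under a matching name.
cf_ocr() {
    cuneiform -f "$2" -o "$(cf_outname "$1" "$2")" "$1"
}

# Placeholder file name; runs only where the file and CuneiForm exist.
if [ -f page.png ] && command -v cuneiform >/dev/null 2>&1; then
    cf_ocr page.png html                            # writes page.html
    cuneiform --singlecolumn -o page.txt page.png   # plain text, one column
fi
```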

For non-English speakers, CuneiForm's main disadvantage is that it supports only half of the languages that Tesseract does. For all users, CuneiForm may also have the disadvantage of being unstable. In my experience, it has an alarming tendency to end in segmentation faults.

Improving OCR Accuracy

CuneiForm includes options for --dotmatrix and --fax, both of which can sometimes help it read text that is fragmented or faint. Otherwise, with both CuneiForm and Tesseract, efforts to increase accuracy require editing the original graphic or, safer still, a copy of the original.

Using ImageMagick's convert utility or an editor like GIMP, you can sometimes get better results by:

  • Increasing the contrast
  • Changing the background color
  • Reducing a complex background to a single color
  • Converting the image to grayscale
  • Increasing the size of the image
  • Increasing the resolution (dpi)

Of all these edits, increasing the resolution generally has the best results. That is especially true if the image is a screenshot, which is rarely more than 120dpi, and may be 96dpi or lower. Greatly increasing the resolution – sometimes as high as 5000dpi – can often be effective, although with large images, such resolutions can seriously slow or even prevent the handling of the file.

You can also try different combinations of these edits, depending on the circumstances.
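Several of these edits can be combined in a single convert call. The sketch below converts to grayscale, stretches the contrast, enlarges the image, and records a resolution. The scale and dpi values, the file names, and the helper functions are illustrative assumptions, not tested recommendations.

```shell
#!/bin/sh
# Options for a cleanup pass: grayscale, stretched contrast, enlarged,
# and tagged with a resolution. SCALE and DPI are illustrative defaults.
prep_opts() {
    scale=${1:-300}
    dpi=${2:-300}
    echo "-colorspace Gray -normalize -resize ${scale}% -units PixelsPerInch -density $dpi"
}

# Always write to a new file so the original stays untouched.
prep_image() {
    # prep_opts is unquoted on purpose so each option becomes its own word.
    convert "$1" $(prep_opts "$3" "$4") "$2"
}

# Placeholder file names; runs only where the file and ImageMagick exist.
if [ -f screenshot.png ] && command -v convert >/dev/null 2>&1; then
    prep_image screenshot.png screenshot-prep.tif 300 300
fi
```

Note that for a bitmap, -density only records the dpi value in the file, whereas -resize is what actually adds pixels, so it is worth trying each separately to see which one the OCR engine responds to.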
