Fuzzy text searches with agrep

Better Finds

© Lead Image © 3dalia, 123RF.com

© Lead Image © 3dalia, 123RF.com

Article from Issue 184/2016
Author(s):

The agrep tool expands on grep by adding fuzzy search capabilities to text string-matching operations.

The grep command, which allows users to find strings and patterns in text files, is probably familiar to Linux users who use the command line; however, its variants are less well known. For example, the following two commands are precisely equivalent:

egrep <term> <file(s)>
grep -e <term> <file(s)>

The tool interprets the <term> as an extended regular expression. By contrast, fgrep interprets this as being equivalent to grep -f, where all components in <term> are normal characters, thus ignoring their potential regex meanings. It thus works a little faster than a plain vanilla grep, and this is especially noticeable when searching large volumes of data. A third candidate, rgrep, works like grep -r, recursively parsing folder structures, which affects its speed. All these commands have one thing in common: They only find direct hits for <term>; a search for similar terms only works if you design a matching regex.

agrep

The grep variant agrep implements the methods provided by the Levenshtein algorithm (see the "Levenshtein Distance" box) to find strings or patterns in text files. In contrast to grep, agrep can only interpret relatively simple patterns, but it can also find similar terms and often works faster than the original.

Levenshtein Distance

Consider the strings "grep" and "gerp," which differ by two letters in different positions, whereas "grap" or "grip" change one letter, and "egrep" adds a letter. To define such deviations in a mathematically precise way, the Russian mathematician Vladimir Iosifovich Levenshtein defined the Levenshtein distance [1] in 1965. This value, also known as the edit distance, is used as a measure of the minimum number of insertions, deletions, and replacement operations for converting one string to another [2]. If you weight the algorithmic "cost" of the necessary operations, you arrive at the weighted Levenshtein distance (WLD). Evaluating the Levenshtein distance makes it possible to find similar words, compensate for spelling errors, and generally determine minor word differences and ignore them where appropriate.

Originally, agrep [3] was written for an index system named Glimpse [4] – a kind of predecessor of today's desktop search engine – at the University of Arizona. Today, Glimpse is hardly used; on the desktop, Recoll [5] or similar, much more powerful programs have taken over its role. The syntax of agrep closely follows that of grep and adheres to the pattern:

$ agrep [<options>] <pattern> [<file>] ...

Linux currently uses two versions of agrep: the classic version from the Glimpse suite [6] and a newer one based on the TRE library [7]. The former often works faster, but does not support Unicode characters and other multibyte encoding [8]. Therefore, most distributions now provide the TRE version of the program. The two versions differ not only in performance but also in terms of options (Table 1).

Table 1

agrep Options

Function

TRE agrep

Glimpse agrep

Define patterns; useful for patterns that start with "-"

-e <pattern>

Use content of stated file as pattern

-f <filename>

Ignore case

-i

-i

Do not ignore case

-i0

Ignore case; replace numbers with numbers, letters with letters

-i#

Do not evaluate non-standard characters in the pattern

-k

-k

Pattern describes a whole word

-w

-w

Pattern describes a whole line

-w

Recursive processing

-r

Set details (preset: 1)

Cost(1) of deleting characters

-D<value>

-D<value>

Cost of inserting characters

-I<value>

-I<value>

Cost of replacing characters

-S<value>

-S<value>

Maximum cost of a match

-E<value>

-E<value>

Maximum permissible error count for match (independent of cost)

-#<number>

-<number>

Manage Output

Show cost for a match

-s

Show match with lowest cost(2)

-B

-B

Suppress prompt for -B

-y

Color highlight matches

--color

Show match count

-c

-c

Show matches without filename

-h

-h

Show matches with filename

-H

-G

Show filenames instead of matches

-l

-l

Always show filenames with matches

-A

Enumerate matches

-n

-n

No output

-q(3)

-s

Show position of rist character of a match(4)

--show-position

-b

Output line breaks beind matches instead of in front of matches

-M

Line break characters

-d <pattern>(5)

Only show lines without matches

-v

-v

Output additional details

-V<number>

(1) The "cost" <value> weights the operation, influencing its effect on the Levenshtein distance.(2) Requires twice the time (two runs) and does not work for input from STDIN.(3) Exit with return code 0 for a match.(4) Important for developing patterns.(5) Default: \n.

Which variant of agrep you use depends on the purpose and availability. When processing Unicode characters, there is no alternative to the TRE variant; but, because this usually requires significantly more resources and computing time than the classic version, there are good reasons to use the original as well.

Hands On

In principle, you use agrep like the standard grep command and can thus search arbitrary data streams (i.e., data from files as well as the STDIN channel):

$ agrep-tre --col -s -E1 -I 2 Tst *
wizard.pdf:1:stream
wizard.pdf:1:endstream

Like grep, agrep parses streams record by record, searching for patterns. The delimiter decides what constitutes a record. By default, this is a line break (\n), which means that agrep views each line as a record and applies the pattern defined by -e <pattern>.

However, you will see several significant differences between the (GNU) grep and agrep variants. For example, agrep does not support many of the options normal grep command has at its disposal (e.g., context control of the output -A, -B, -C) or the respective include or exclude options as for regular expressions.

The option -d <pattern> for separating the records also deserves special attention. It does not work as you would expect in all cases. For example, if you define -d '', the records should be delimited by spaces, but this does not work in all cases. For such problems with the record boundaries, the options --color, --show-position, and -n (TRE variant) and -b, -n, and -V (classic version) can help you discover the problem.

The tr (translate) tool is also a relatively simple workaround for these problems. You can use the simple command to wrap data streams quickly and reliably to create lines. The syntax is quite simple:

$ tr <options> <search_char> <replace_char>

To convert spaces to line breaks, for example, you would use the command,

tr ' ' '\n'

and to replace several <search_chars>, use:

$ echo "Hello World" | tr "A-Za-z" "a-zA-Z"
hELLO wORLD

The advantage of this variant is that because agrep uses the new lines as record boundaries, the results then match your expectations.

Behind the Scenes

Because almost all distributions provide agrep as one of its standard tools, some shell scripts and GUIs use the command. A typical example is the build program ding [9]. Spellcheckers like Aspell also apply the agrep method implicitly. A phonetic transcription occurs first; the program uses the Levenshtein algorithm to find the best approximation. More complex search engines such as Elasticsearch (Recoll cannot do this) also use the Levenshtein algorithm to generate the best possible results.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Command Line – tre-agrep

    Tre-agrep has all of grep's functionality but can also do ambiguous or fuzzy searches without deep knowledge of regular expressions.

  • Charly's Column: Biabam and Tre-agrep

    Most of the tools that show up in this column are small, smart, fast, and easily explained. This month is no exception; we feature a dynamic duo of tools.

  • ICgrep

    One of the most common tasks when working on computers involves browsing texts for search patterns. Here, ICgrep offers a modern, parallel, and Unicode-enabled alternative to the classic grep.

  • Command Line – Pgrep

    Pgrep is a valuable tool for tracking down processes.

  • McFly

    When it comes to working at the command line, using Bash history effectively can save you time. McFly extends the Bash history's features and helps you find past commands more quickly.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News