Fuzzy text searches with agrep
Better Finds
The agrep tool expands on grep by adding fuzzy search capabilities to text string-matching operations.
The grep command, which allows users to find strings and patterns in text files, is probably familiar to Linux users who use the command line; however, its variants are less well known. For example, the following two commands are precisely equivalent:
egrep <term> <file(s)> grep -e <term> <file(s)>
The tool interprets the <term>
as an extended regular expression. By contrast, fgrep
interprets this as being equivalent to grep -f
, where all components in <term>
are normal characters, thus ignoring their potential regex meanings. It thus works a little faster than a plain vanilla grep, and this is especially noticeable when searching large volumes of data. A third candidate, rgrep
, works like grep -r
, recursively parsing folder structures, which affects its speed. All these commands have one thing in common: They only find direct hits for <term>
; a search for similar terms only works if you design a matching regex.
agrep
The grep variant agrep implements the methods provided by the Levenshtein algorithm (see the "Levenshtein Distance" box) to find strings or patterns in text files. In contrast to grep, agrep can only interpret relatively simple patterns, but it can also find similar terms and often works faster than the original.
Levenshtein Distance
Consider the strings "grep" and "gerp," which differ by two letters in different positions, whereas "grap" or "grip" change one letter, and "egrep" adds a letter. To define such deviations in a mathematically precise way, the Russian mathematician Vladimir Iosifovich Levenshtein defined the Levenshtein distance [1] in 1965. This value, also known as the edit distance, is used as a measure of the minimum number of insertions, deletions, and replacement operations for converting one string to another [2]. If you weight the algorithmic "cost" of the necessary operations, you arrive at the weighted Levenshtein distance (WLD). Evaluating the Levenshtein distance makes it possible to find similar words, compensate for spelling errors, and generally determine minor word differences and ignore them where appropriate.
Originally, agrep [3] was written for an index system named Glimpse [4] – a kind of predecessor of today's desktop search engine – at the University of Arizona. Today, Glimpse is hardly used; on the desktop, Recoll [5] or similar, much more powerful programs have taken over its role. The syntax of agrep closely follows that of grep and adheres to the pattern:
$ agrep [<options>] <pattern> [<file>] ...
Linux currently uses two versions of agrep: the classic version from the Glimpse suite [6] and a newer one based on the TRE library [7]. The former often works faster, but does not support Unicode characters and other multibyte encoding [8]. Therefore, most distributions now provide the TRE version of the program. The two versions differ not only in performance but also in terms of options (Table 1).
Table 1
agrep Options
Function | TRE agrep | Glimpse agrep |
---|---|---|
Define patterns; useful for patterns that start with " |
|
– |
Use content of stated file as pattern |
– |
|
Ignore case |
|
|
Do not ignore case |
– |
|
Ignore case; replace numbers with numbers, letters with letters |
– |
|
Do not evaluate non-standard characters in the pattern |
|
|
Pattern describes a whole word |
|
|
Pattern describes a whole line |
– |
|
Recursive processing |
– |
|
Set details (preset: 1) |
||
Cost(1) of deleting characters |
|
|
Cost of inserting characters |
|
|
Cost of replacing characters |
|
|
Maximum cost of a match |
|
|
Maximum permissible error count for match (independent of cost) |
|
|
Manage Output |
||
Show cost for a match |
|
– |
Show match with lowest cost(2) |
|
|
Suppress prompt for |
– |
|
Color highlight matches |
|
– |
Show match count |
|
|
Show matches without filename |
|
|
Show matches with filename |
|
|
Show filenames instead of matches |
|
|
Always show filenames with matches |
– |
|
Enumerate matches |
|
|
No output |
|
|
Show position of rist character of a match(4) |
|
|
Output line breaks beind matches instead of in front of matches |
|
– |
Line break characters |
|
– |
Only show lines without matches |
|
|
Output additional details |
|
– |
(1) The "cost" <value> weights the operation, influencing its effect on the Levenshtein distance.(2) Requires twice the time (two runs) and does not work for input from STDIN.(3) Exit with return code 0 for a match.(4) Important for developing patterns.(5) Default: \n. |
Which variant of agrep you use depends on the purpose and availability. When processing Unicode characters, there is no alternative to the TRE variant; but, because this usually requires significantly more resources and computing time than the classic version, there are good reasons to use the original as well.
Hands On
In principle, you use agrep like the standard grep command and can thus search arbitrary data streams (i.e., data from files as well as the STDIN channel):
$ agrep-tre --col -s -E1 -I 2 Tst * wizard.pdf:1:stream wizard.pdf:1:endstream
Like grep, agrep parses streams record by record, searching for patterns. The delimiter decides what constitutes a record. By default, this is a line break (\n
), which means that agrep views each line as a record and applies the pattern defined by -e <pattern>
.
However, you will see several significant differences between the (GNU) grep and agrep variants. For example, agrep does not support many of the options normal grep command has at its disposal (e.g., context control of the output -A
, -B
, -C
) or the respective include or exclude options as for regular expressions.
The option -d <pattern>
for separating the records also deserves special attention. It does not work as you would expect in all cases. For example, if you define -d ''
, the records should be delimited by spaces, but this does not work in all cases. For such problems with the record boundaries, the options --color
, --show-position
, and -n
(TRE variant) and -b
, -n
, and -V
(classic version) can help you discover the problem.
The tr
(translate) tool is also a relatively simple workaround for these problems. You can use the simple command to wrap data streams quickly and reliably to create lines. The syntax is quite simple:
$ tr <options> <search_char> <replace_char>
To convert spaces to line breaks, for example, you would use the command,
tr ' ' '\n'
and to replace several <search_chars>
, use:
$ echo "Hello World" | tr "A-Za-z" "a-zA-Z" hELLO wORLD
The advantage of this variant is that because agrep uses the new lines as record boundaries, the results then match your expectations.
Behind the Scenes
Because almost all distributions provide agrep as one of its standard tools, some shell scripts and GUIs use the command. A typical example is the build program ding
[9]. Spellcheckers like Aspell also apply the agrep method implicitly. A phonetic transcription occurs first; the program uses the Levenshtein algorithm to find the best approximation. More complex search engines such as Elasticsearch (Recoll cannot do this) also use the Levenshtein algorithm to generate the best possible results.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Latest Cinnamon Desktop Releases with a Bold New Look
Just in time for the holidays, the developer of the Cinnamon desktop has shipped a new release to help spice up your eggnog with new features and a new look.
-
Armbian 24.11 Released with Expanded Hardware Support
If you've been waiting for Armbian to support OrangePi 5 Max and Radxa ROCK 5B+, the wait is over.
-
SUSE Renames Several Products for Better Name Recognition
SUSE has been a very powerful player in the European market, but it knows it must branch out to gain serious traction. Will a name change do the trick?
-
ESET Discovers New Linux Malware
WolfsBane is an all-in-one malware that has hit the Linux operating system and includes a dropper, a launcher, and a backdoor.
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.
-
Gnome OS Transitioning Toward a General-Purpose Distro
If you're looking for the perfectly vanilla take on the Gnome desktop, Gnome OS might be for you.