Fuzzy text searches with agrep
Better Finds
The agrep tool expands on grep by adding fuzzy search capabilities to text string-matching operations.
The grep command, which allows users to find strings and patterns in text files, is probably familiar to Linux users who use the command line; however, its variants are less well known. For example, the following two commands are precisely equivalent:
egrep <term> <file(s)> grep -e <term> <file(s)>
The tool interprets the <term>
as an extended regular expression. By contrast, fgrep
interprets this as being equivalent to grep -f
, where all components in <term>
are normal characters, thus ignoring their potential regex meanings. It thus works a little faster than a plain vanilla grep, and this is especially noticeable when searching large volumes of data. A third candidate, rgrep
, works like grep -r
, recursively parsing folder structures, which affects its speed. All these commands have one thing in common: They only find direct hits for <term>
; a search for similar terms only works if you design a matching regex.
agrep
The grep variant agrep implements the methods provided by the Levenshtein algorithm (see the "Levenshtein Distance" box) to find strings or patterns in text files. In contrast to grep, agrep can only interpret relatively simple patterns, but it can also find similar terms and often works faster than the original.
Levenshtein Distance
Consider the strings "grep" and "gerp," which differ by two letters in different positions, whereas "grap" or "grip" change one letter, and "egrep" adds a letter. To define such deviations in a mathematically precise way, the Russian mathematician Vladimir Iosifovich Levenshtein defined the Levenshtein distance [1] in 1965. This value, also known as the edit distance, is used as a measure of the minimum number of insertions, deletions, and replacement operations for converting one string to another [2]. If you weight the algorithmic "cost" of the necessary operations, you arrive at the weighted Levenshtein distance (WLD). Evaluating the Levenshtein distance makes it possible to find similar words, compensate for spelling errors, and generally determine minor word differences and ignore them where appropriate.
Originally, agrep [3] was written for an index system named Glimpse [4] – a kind of predecessor of today's desktop search engine – at the University of Arizona. Today, Glimpse is hardly used; on the desktop, Recoll [5] or similar, much more powerful programs have taken over its role. The syntax of agrep closely follows that of grep and adheres to the pattern:
$ agrep [<options>] <pattern> [<file>] ...
Linux currently uses two versions of agrep: the classic version from the Glimpse suite [6] and a newer one based on the TRE library [7]. The former often works faster, but does not support Unicode characters and other multibyte encoding [8]. Therefore, most distributions now provide the TRE version of the program. The two versions differ not only in performance but also in terms of options (Table 1).
Table 1
agrep Options
Function | TRE agrep | Glimpse agrep |
---|---|---|
Define patterns; useful for patterns that start with " |
|
– |
Use content of stated file as pattern |
– |
|
Ignore case |
|
|
Do not ignore case |
– |
|
Ignore case; replace numbers with numbers, letters with letters |
– |
|
Do not evaluate non-standard characters in the pattern |
|
|
Pattern describes a whole word |
|
|
Pattern describes a whole line |
– |
|
Recursive processing |
– |
|
Set details (preset: 1) |
||
Cost(1) of deleting characters |
|
|
Cost of inserting characters |
|
|
Cost of replacing characters |
|
|
Maximum cost of a match |
|
|
Maximum permissible error count for match (independent of cost) |
|
|
Manage Output |
||
Show cost for a match |
|
– |
Show match with lowest cost(2) |
|
|
Suppress prompt for |
– |
|
Color highlight matches |
|
– |
Show match count |
|
|
Show matches without filename |
|
|
Show matches with filename |
|
|
Show filenames instead of matches |
|
|
Always show filenames with matches |
– |
|
Enumerate matches |
|
|
No output |
|
|
Show position of rist character of a match(4) |
|
|
Output line breaks beind matches instead of in front of matches |
|
– |
Line break characters |
|
– |
Only show lines without matches |
|
|
Output additional details |
|
– |
(1) The "cost" <value> weights the operation, influencing its effect on the Levenshtein distance.(2) Requires twice the time (two runs) and does not work for input from STDIN.(3) Exit with return code 0 for a match.(4) Important for developing patterns.(5) Default: \n. |
Which variant of agrep you use depends on the purpose and availability. When processing Unicode characters, there is no alternative to the TRE variant; but, because this usually requires significantly more resources and computing time than the classic version, there are good reasons to use the original as well.
Hands On
In principle, you use agrep like the standard grep command and can thus search arbitrary data streams (i.e., data from files as well as the STDIN channel):
$ agrep-tre --col -s -E1 -I 2 Tst * wizard.pdf:1:stream wizard.pdf:1:endstream
Like grep, agrep parses streams record by record, searching for patterns. The delimiter decides what constitutes a record. By default, this is a line break (\n
), which means that agrep views each line as a record and applies the pattern defined by -e <pattern>
.
However, you will see several significant differences between the (GNU) grep and agrep variants. For example, agrep does not support many of the options normal grep command has at its disposal (e.g., context control of the output -A
, -B
, -C
) or the respective include or exclude options as for regular expressions.
The option -d <pattern>
for separating the records also deserves special attention. It does not work as you would expect in all cases. For example, if you define -d ''
, the records should be delimited by spaces, but this does not work in all cases. For such problems with the record boundaries, the options --color
, --show-position
, and -n
(TRE variant) and -b
, -n
, and -V
(classic version) can help you discover the problem.
The tr
(translate) tool is also a relatively simple workaround for these problems. You can use the simple command to wrap data streams quickly and reliably to create lines. The syntax is quite simple:
$ tr <options> <search_char> <replace_char>
To convert spaces to line breaks, for example, you would use the command,
tr ' ' '\n'
and to replace several <search_chars>
, use:
$ echo "Hello World" | tr "A-Za-z" "a-zA-Z" hELLO wORLD
The advantage of this variant is that because agrep uses the new lines as record boundaries, the results then match your expectations.
Behind the Scenes
Because almost all distributions provide agrep as one of its standard tools, some shell scripts and GUIs use the command. A typical example is the build program ding
[9]. Spellcheckers like Aspell also apply the agrep method implicitly. A phonetic transcription occurs first; the program uses the Levenshtein algorithm to find the best approximation. More complex search engines such as Elasticsearch (Recoll cannot do this) also use the Levenshtein algorithm to generate the best possible results.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Juno Computers Launches Another Linux Laptop
If you're looking for a powerhouse laptop that runs Ubuntu, the Juno Computers Neptune 17 v6 should be on your radar.
-
ZorinOS 17.1 Released, Includes Improved Windows App Support
If you need or desire to run Windows applications on Linux, there's one distribution intent on making that easier for you and its new release further improves that feature.
-
Linux Market Share Surpasses 4% for the First Time
Look out Windows and macOS, Linux is on the rise and has even topped ChromeOS to become the fourth most widely used OS around the globe.
-
KDE’s Plasma 6 Officially Available
KDE’s Plasma 6.0 "Megarelease" has happened, and it's brimming with new features, polish, and performance.
-
Latest Version of Tails Unleashed
Tails 6.0 is based on Debian 12 and includes GNOME 43.
-
KDE Announces New Slimbook V with Plenty of Power and KDE’s Plasma 6
If you're a fan of KDE Plasma, you'll be thrilled to hear they've announced a new Slimbook with an AMD CPU and the latest version of KDE Plasma desktop.
-
Monthly Sponsorship Includes Early Access to elementary OS 8
If you want to get a glimpse of what's in the pipeline for elementary OS 8, just set up a monthly sponsorship to help fund its continued existence.
-
DebConf24 to be Held in South Korea
Busan will be the location of the latest DebConf running July 28 through August 4
-
Fedora Unleashes Atomic Desktops
Fedora has combined its solid distribution with rpm-ostree system to make it possible to deliver a new family of Fedora spins, called Fedora Atomic Desktops.
-
Bootloader Vulnerability Affects Nearly All Linux Distributions
The developers of shim have released a version to fix numerous security flaws, including one that could enable remote control execution of malicious code under certain circumstances.