Fuzzy text searches with agrep
Better Finds
The agrep tool expands on grep by adding fuzzy search capabilities to text string-matching operations.
The grep command, which allows users to find strings and patterns in text files, is probably familiar to Linux users who use the command line; however, its variants are less well known. For example, the following two commands are precisely equivalent:
egrep <term> <file(s)> grep -e <term> <file(s)>
The tool interprets the <term>
as an extended regular expression. By contrast, fgrep
interprets this as being equivalent to grep -f
, where all components in <term>
are normal characters, thus ignoring their potential regex meanings. It thus works a little faster than a plain vanilla grep, and this is especially noticeable when searching large volumes of data. A third candidate, rgrep
, works like grep -r
, recursively parsing folder structures, which affects its speed. All these commands have one thing in common: They only find direct hits for <term>
; a search for similar terms only works if you design a matching regex.
agrep
The grep variant agrep implements the methods provided by the Levenshtein algorithm (see the "Levenshtein Distance" box) to find strings or patterns in text files. In contrast to grep, agrep can only interpret relatively simple patterns, but it can also find similar terms and often works faster than the original.
Levenshtein Distance
Consider the strings "grep" and "gerp," which differ by two letters in different positions, whereas "grap" or "grip" change one letter, and "egrep" adds a letter. To define such deviations in a mathematically precise way, the Russian mathematician Vladimir Iosifovich Levenshtein defined the Levenshtein distance [1] in 1965. This value, also known as the edit distance, is used as a measure of the minimum number of insertions, deletions, and replacement operations for converting one string to another [2]. If you weight the algorithmic "cost" of the necessary operations, you arrive at the weighted Levenshtein distance (WLD). Evaluating the Levenshtein distance makes it possible to find similar words, compensate for spelling errors, and generally determine minor word differences and ignore them where appropriate.
Originally, agrep [3] was written for an index system named Glimpse [4] – a kind of predecessor of today's desktop search engine – at the University of Arizona. Today, Glimpse is hardly used; on the desktop, Recoll [5] or similar, much more powerful programs have taken over its role. The syntax of agrep closely follows that of grep and adheres to the pattern:
$ agrep [<options>] <pattern> [<file>] ...
Linux currently uses two versions of agrep: the classic version from the Glimpse suite [6] and a newer one based on the TRE library [7]. The former often works faster, but does not support Unicode characters and other multibyte encoding [8]. Therefore, most distributions now provide the TRE version of the program. The two versions differ not only in performance but also in terms of options (Table 1).
Table 1
agrep Options
Function | TRE agrep | Glimpse agrep |
---|---|---|
Define patterns; useful for patterns that start with " |
|
– |
Use content of stated file as pattern |
– |
|
Ignore case |
|
|
Do not ignore case |
– |
|
Ignore case; replace numbers with numbers, letters with letters |
– |
|
Do not evaluate non-standard characters in the pattern |
|
|
Pattern describes a whole word |
|
|
Pattern describes a whole line |
– |
|
Recursive processing |
– |
|
Set details (preset: 1) |
||
Cost(1) of deleting characters |
|
|
Cost of inserting characters |
|
|
Cost of replacing characters |
|
|
Maximum cost of a match |
|
|
Maximum permissible error count for match (independent of cost) |
|
|
Manage Output |
||
Show cost for a match |
|
– |
Show match with lowest cost(2) |
|
|
Suppress prompt for |
– |
|
Color highlight matches |
|
– |
Show match count |
|
|
Show matches without filename |
|
|
Show matches with filename |
|
|
Show filenames instead of matches |
|
|
Always show filenames with matches |
– |
|
Enumerate matches |
|
|
No output |
|
|
Show position of rist character of a match(4) |
|
|
Output line breaks beind matches instead of in front of matches |
|
– |
Line break characters |
|
– |
Only show lines without matches |
|
|
Output additional details |
|
– |
(1) The "cost" <value> weights the operation, influencing its effect on the Levenshtein distance.(2) Requires twice the time (two runs) and does not work for input from STDIN.(3) Exit with return code 0 for a match.(4) Important for developing patterns.(5) Default: \n. |
Which variant of agrep you use depends on the purpose and availability. When processing Unicode characters, there is no alternative to the TRE variant; but, because this usually requires significantly more resources and computing time than the classic version, there are good reasons to use the original as well.
Hands On
In principle, you use agrep like the standard grep command and can thus search arbitrary data streams (i.e., data from files as well as the STDIN channel):
$ agrep-tre --col -s -E1 -I 2 Tst * wizard.pdf:1:stream wizard.pdf:1:endstream
Like grep, agrep parses streams record by record, searching for patterns. The delimiter decides what constitutes a record. By default, this is a line break (\n
), which means that agrep views each line as a record and applies the pattern defined by -e <pattern>
.
However, you will see several significant differences between the (GNU) grep and agrep variants. For example, agrep does not support many of the options normal grep command has at its disposal (e.g., context control of the output -A
, -B
, -C
) or the respective include or exclude options as for regular expressions.
The option -d <pattern>
for separating the records also deserves special attention. It does not work as you would expect in all cases. For example, if you define -d ''
, the records should be delimited by spaces, but this does not work in all cases. For such problems with the record boundaries, the options --color
, --show-position
, and -n
(TRE variant) and -b
, -n
, and -V
(classic version) can help you discover the problem.
The tr
(translate) tool is also a relatively simple workaround for these problems. You can use the simple command to wrap data streams quickly and reliably to create lines. The syntax is quite simple:
$ tr <options> <search_char> <replace_char>
To convert spaces to line breaks, for example, you would use the command,
tr ' ' '\n'
and to replace several <search_chars>
, use:
$ echo "Hello World" | tr "A-Za-z" "a-zA-Z" hELLO wORLD
The advantage of this variant is that because agrep uses the new lines as record boundaries, the results then match your expectations.
Behind the Scenes
Because almost all distributions provide agrep as one of its standard tools, some shell scripts and GUIs use the command. A typical example is the build program ding
[9]. Spellcheckers like Aspell also apply the agrep method implicitly. A phonetic transcription occurs first; the program uses the Levenshtein algorithm to find the best approximation. More complex search engines such as Elasticsearch (Recoll cannot do this) also use the Levenshtein algorithm to generate the best possible results.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
New Slimbook EVO with Raw AMD Ryzen Power
If you're looking for serious power in a 14" ultrabook that is powered by Linux, Slimbook has just the thing for you.
-
The Gnome Foundation Struggling to Stay Afloat
The foundation behind the Gnome desktop environment is having to go through some serious belt-tightening due to continued financial problems.
-
Thousands of Linux Servers Infected with Stealth Malware Since 2021
Perfctl is capable of remaining undetected, which makes it dangerous and hard to mitigate.
-
Halcyon Creates Anti-Ransomware Protection for Linux
As more Linux systems are targeted by ransomware, Halcyon is stepping up its protection.
-
Valve and Arch Linux Announce Collaboration
Valve and Arch have come together for two projects that will have a serious impact on the Linux distribution.
-
Hacker Successfully Runs Linux on a CPU from the Early ‘70s
From the office of "Look what I can do," Dmitry Grinberg was able to get Linux running on a processor that was created in 1971.
-
OSI and LPI Form Strategic Alliance
With a goal of strengthening Linux and open source communities, this new alliance aims to nurture the growth of more highly skilled professionals.
-
Fedora 41 Beta Available with Some Interesting Additions
If you're a Fedora fan, you'll be excited to hear the beta version of the latest release is now available for testing and includes plenty of updates.
-
AlmaLinux Unveils New Hardware Certification Process
The AlmaLinux Hardware Certification Program run by the Certification Special Interest Group (SIG) aims to ensure seamless compatibility between AlmaLinux and a wide range of hardware configurations.
-
Wind River Introduces eLxr Pro Linux Solution
eLxr Pro offers an end-to-end Linux solution backed by expert commercial support.