Using fuzzy searches with tre-agrep
A Grep Replacement
Tre-agrep has all of grep's functionality but can also do ambiguous or fuzzy searches without deep knowledge of regular expressions.
Grep [1] is a standard command-line tool. It searches files for regular expressions, then displays any lines that include a match. In expert hands, grep can be a flexible tool, but gaining expertise can take years of practice. Nor do related commands like egrep [2] or fgrep [3] make grep any easier to use. For these reasons, those lacking expertise might want to check out TRE [4], which includes a reimplementation of agrep (approximate grep) [5] as a command-line utility. tre-agrep is a grep-like tool that has all of grep's functionality but can also do ambiguous or fuzzy searches that are much easier to learn.
Grep and tre-agrep share similar options, such as --ignore-case and --count. However, the logic of their searches can be different. (I say "can be" because often both commands have multiple ways of getting the same result.) To give a simple example, imagine that you are searching for files that contain either "Linux" or "Linus." Using grep, you would probably use a regular expression of one kind or another. Probably the simplest would be:
grep 'Linu.' *.txt
Here, the period in Linu. indicates that any character can be substituted for it, giving results with both "Linux" and "Linus." This use of a regular expression is relatively simple, but it must be entered and positioned accurately. If it were more complicated, newer users might be put off by a series of familiar and unfamiliar characters used with non-standard syntax.
By contrast, with tre-agrep, the command is more likely to use an option for ambiguity:
tre-agrep -1 'Linux' *.txt
The option here means that the results should include those with at most one character different from the string "Linux." This command requires both less precision and less user knowledge, but perhaps at the price of more irrelevant results (Figure 1). Moreover, the entered command would find typos anywhere in the string, not just in the last letter. Notice, too, that both commands begin displaying the results with the name of the file and end with the current account and the file path.
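As a rough illustration of the difference, the following sketch runs both commands against a small sample file. The file name notes.txt and its contents are invented for the example, and the tre-agrep call is skipped if the tool is not installed.

```shell
# Throwaway sample file (the name and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'Linux is a kernel' 'Linus wrote it' 'Lines of code' > "$dir/notes.txt"

# grep: the period stands for exactly one character after "Linu"
grep_out=$(grep 'Linu.' "$dir/notes.txt")
printf 'grep:\n%s\n' "$grep_out"

# tre-agrep: allow one error anywhere in "Linux" (skipped if not installed)
if command -v tre-agrep >/dev/null 2>&1; then
    agrep_out=$(tre-agrep -1 'Linux' "$dir/notes.txt")
    printf 'tre-agrep:\n%s\n' "$agrep_out"
fi
```

With this sample data, both commands are expected to list the first two lines and skip "Lines of code," which is two characters away from "Linux."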
By default, tre-agrep displays only exact matches, just as grep does. If you want approximate matches as well, you can specify the number of errors allowed. According to the tool's help text, this number is an upper limit, so a search that allows four errors also returns results with three or fewer errors; you might therefore want to make several searches with minor differences to see how the results change.
The original version of agrep was developed in 1988-2001 by Udi Manber and Sun Wu. Originally written for Unix, this version was widely ported to other operating systems, but it's rare in Linux distributions, because for years, it was released under a non-free license. Since 2014, it has been released under the ISC Open Source License [6], but either the new license is not recognized as free, or the change has gone unnoticed, because Debian still includes it in the non-free section of its repositories.
Today, the most common version is tre-agrep, written in 2002-2004 by Ville Laurikari. Tre-agrep uses a different library from the Manber and Wu version and is released under a BSD license. Most distributions include it in their repositories, although not as part of the default installation.
When used without any options, tre-agrep's output is identical to grep's. However, it is the options that make tre-agrep's results different. All of tre-agrep's options come under one of three categories: options for approximations, regular expressions, and output filtering and formatting.
Options for Setting Approximations
Approximations or fuzzy logic are at the heart of tre-agrep. The man page describes the number of differences as the cost (based on the Levenshtein distance [7]), which is a count of the number of characters by which a match can depart from the precise string entered in the command. By default, a missing, an extra, or a substituted character all have a cost of 1, although you can change these costs with --delete-cost=NUMBER (-D NUMBER), --insert-cost=NUMBER (-I NUMBER), or --substitute-cost=NUMBER (-S NUMBER) to reflect your needs.
The concept of cost is used without explanation in the command's help, but the usefulness of the concept soon becomes clear enough. Cost is a way to judge output records and sort through them. Most of the time, although not always, the lower the cost, the closer the result is likely to be to your intention. Conversely, the higher the cost, the greater the chance that an output record is irrelevant. However, if you know, for example, that relevant results are most likely to be a substitution, you can set the cost of substitutions to 0, lowering their cost and making them easy to find with an output option such as --best-match or --show-cost (see below).
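As a sketch of how costs can be inspected and adjusted, the commands below print the cost of each match with --show-cost, first with the default costs and then with missing and extra characters made more expensive than substitutions. Raising the other costs is only one alternative to lowering the substitution cost, chosen here to keep the example easy to follow. The file name and contents are invented, and the commands are skipped if tre-agrep is not installed.

```shell
# Throwaway sample file (the name and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'Linux is a kernel' 'Linus wrote it' 'A typo: Linu' > "$dir/notes.txt"

if command -v tre-agrep >/dev/null 2>&1; then
    # Default costs: a missing, extra, or substituted character each costs 1
    default_costs=$(tre-agrep -2 --show-cost 'Linux' "$dir/notes.txt")
    printf 'default costs:\n%s\n' "$default_costs"

    # Missing (-D) and extra (-I) characters now cost 2; substitutions still cost 1
    weighted_costs=$(tre-agrep -2 -D 2 -I 2 --show-cost 'Linux' "$dir/notes.txt")
    printf 'weighted costs:\n%s\n' "$weighted_costs"
else
    echo "tre-agrep is not installed; skipping"
fi
```

Comparing the two listings should show which records rely on a missing or extra character rather than a substitution, because only their costs change.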
If you are not interested in changing the cost of approximations, the concept of fuzzy results is straightforward. The most useful option for approximations is -#, in which # should be replaced in a command by a digit between 0 (an exact match) and 9 errors, with "error" being the name for any deviation from the string entered as part of the command. You can also further filter output records via --max-errors=NUMBER (-E NUMBER). These are simple but powerful options, and they are easily remembered.
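One way to get a feel for these options is to widen the search one error at a time. The sketch below does so against a hypothetical word list; the file name and contents are assumptions made for the example, and the commands are skipped if tre-agrep is not installed.

```shell
# Throwaway word list (the name and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'Linux' 'Linus' 'Lines' 'Windows' > "$dir/words.txt"

if command -v tre-agrep >/dev/null 2>&1; then
    # Widen the search one error at a time and watch the result list grow
    for n in 0 1 2; do
        echo "--- up to $n error(s) ---"
        tre-agrep "-$n" 'Linux' "$dir/words.txt"
    done

    # The long form takes the limit as an argument
    two_errors=$(tre-agrep --max-errors=2 'Linux' "$dir/words.txt")
    printf 'with --max-errors=2:\n%s\n' "$two_errors"
else
    echo "tre-agrep is not installed; skipping"
fi
```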
Options for Regular Expressions
Regular expressions are search patterns, in which characters stand for other groups of characters in files, the contents of files, or locations in a file [8]. Both grep and tre-agrep can use the same standard set of regular expressions (Table 1).
Table 1
Common Regular Expressions
Character Keys | Meaning |
---|---|
. | Any single character |
* | Any number of repetitions of the preceding character, or none |
^ | The following regular expression at the start of a line |
$ | The preceding regular expression at the end of a line |
[ ] | Any of the characters in the brackets |
\ | Turn off the next character's meaning as a regular expression |
\< | Characters at the start of a word |
\> | Characters at the end of a word |
? | One or zero instances of the preceding regular expression |
Regular expressions can be entered directly into the string part of the command. However, ambiguity sometimes can be reduced by using the option --regexp=PATTERN (-e PATTERN). In particular, this option can be useful if a search includes a hyphen (-), which might be misinterpreted as introducing an option, or a forward slash (/), which might be read as introducing a directory.
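The hyphen case can be sketched as follows. Because -e is an option that grep also accepts, the example falls back to grep if tre-agrep is not installed; the file name and contents are invented for the illustration.

```shell
# Throwaway sample file (the name and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'use -i to ignore case' 'a plain line' > "$dir/opts.txt"

# -e is shared by both tools, so fall back to grep if tre-agrep is missing
tool=grep
if command -v tre-agrep >/dev/null 2>&1; then tool=tre-agrep; fi

# Without -e, the pattern "-i" would be parsed as the --ignore-case option
dash_hits=$("$tool" -e '-i' "$dir/opts.txt")
echo "$dash_hits"
```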
As in grep, a search for regular expressions can be refined in several ways. With --ignore-case (-i), a regular expression treats lower and upper case letters the same, both in the search pattern and in the contents of input files. With --literal (-k), the search pattern is read as though it has no special characters in it. You can also use --word-regexp (-w) to match only whole words, or --invert-match (-v) to select records that do not match the regular expression you entered. These refinements can help filter results, but they add another level of complexity; therefore, unless you have a special need, you might prefer to focus first on regular expressions alone until you are comfortable with basic patterns.
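A minimal sketch of these refinements follows. The options -i, -w, and -v are also accepted by grep, so the example falls back to grep if tre-agrep is not installed, whereas -k is only tried with tre-agrep. The file name and contents are assumptions made for the example.

```shell
# Throwaway sample file (the name and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'Linux kernel' 'linux desktop' 'GNU/Linuxes' 'see a.c' 'abc' > "$dir/notes.txt"

# -i, -w, and -v are shared by both tools, so fall back to grep if needed
tool=grep
if command -v tre-agrep >/dev/null 2>&1; then tool=tre-agrep; fi

# -i: ignore case, so "linux" also matches "Linux"
any_case=$("$tool" -i 'linux' "$dir/notes.txt")

# -w: match "Linux" only as a whole word, so "Linuxes" is skipped
whole_word=$("$tool" -w 'Linux' "$dir/notes.txt")

# -v: print the records that do NOT match
inverted=$("$tool" -v -i 'linux' "$dir/notes.txt")

printf '%s\n' '-i:' "$any_case" '-w:' "$whole_word" '-v:' "$inverted"

# -k is specific to tre-agrep: read "a.c" literally, so "abc" should not match
if [ "$tool" = tre-agrep ]; then
    tre-agrep -k 'a.c' "$dir/notes.txt"
fi
```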
Output Options
Some of tre-agrep's output options are less well known than those for approximations, but some can be almost as useful. Some are identical to grep's, such as --quiet (-q), which suppresses output, letting you know only that a match has been found, or --files-with-matches (-l), which lists only the names of files with matching results. Still another option shared with grep is --count (-c), which only tells you the number of matches in each file, but does not display them (Figure 2).
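The three shared options can be tried out with a sketch like the one below. Because grep accepts the same options, the example falls back to grep if tre-agrep is not installed; the file names and contents are invented for the illustration.

```shell
# Two throwaway files (the names and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'Linux' 'Linux again' > "$dir/a.txt"
printf '%s\n' 'BSD' > "$dir/b.txt"

# -q, -l, and -c are shared by both tools, so fall back to grep if needed
tool=grep
if command -v tre-agrep >/dev/null 2>&1; then tool=tre-agrep; fi

# -q: no output; the exit status alone reports whether a match was found
if "$tool" -q 'Linux' "$dir/a.txt"; then found=yes; else found=no; fi
echo "match found: $found"

# -l: list only the names of files that contain a match
matching_files=$("$tool" -l 'Linux' "$dir"/*.txt)
echo "$matching_files"

# -c: print a count of matching records for each file
"$tool" -c 'Linux' "$dir"/*.txt
```

The -q form is mainly useful in scripts, where only the exit status matters.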
However, by far the most useful option for filtering results is --best-match (-B), which displays only the records with the lowest cost, that is, those closest to the string you entered in the command. By using this option, especially with approximations, you can reduce the results through which to scroll, although possibly at the cost of missing serendipitous results.
Another way to judge results is to add --show-cost (-s), which displays the cost directly after the file name at the start of the result. By seeing how far a result differs from the string you enter, you might be able to judge each result's reliability and usefulness.
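The two options can be compared with a sketch such as the following, which uses a hypothetical word list that contains no exact match for "Linux." The file name and contents are assumptions, and the commands are skipped if tre-agrep is not installed.

```shell
# Throwaway word list without an exact match (made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'Linus' 'Lines' 'Windows' > "$dir/words.txt"

if command -v tre-agrep >/dev/null 2>&1; then
    # Every record within two errors, each prefixed with its cost
    tre-agrep -2 -s 'Linux' "$dir/words.txt"

    # Only the closest records, that is, those with the lowest cost
    best=$(tre-agrep -B 'Linux' "$dir/words.txt")
    printf 'best match:\n%s\n' "$best"
else
    echo "tre-agrep is not installed; skipping"
fi
```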
Other output options format rather than filter results. For example, --color (--colour) is almost always useful, because it highlights matches in the output records, using the GREP_COLOR environment variable. Similarly, you can use --show-position (Figure 3) to prefix each output record with the start and end positions of the first match in the record (the first character of the match and the first character after the match). You might also help organize results by prefacing each output record with the name of the file in which it is located, using --with-filename (-H). As you continue with your work, you might also find it useful to number each output record by adding the option --record-number (-n).
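The formatting options can be combined, as in the sketch below. The options -H and -n are also accepted by grep, so the example falls back to grep if tre-agrep is not installed, whereas --show-position is only tried with tre-agrep. The file name and contents are invented for the illustration.

```shell
# Throwaway sample file (the name and contents are made up for this sketch)
dir=$(mktemp -d)
printf '%s\n' 'BSD' 'Linux kernel' > "$dir/notes.txt"

# -H and -n are shared by both tools, so fall back to grep if needed
tool=grep
if command -v tre-agrep >/dev/null 2>&1; then tool=tre-agrep; fi

# Prefix each matching record with its file name and record number
numbered=$("$tool" -H -n 'Linux' "$dir/notes.txt")
echo "$numbered"

# --show-position is specific to tre-agrep
if [ "$tool" = tre-agrep ]; then
    tre-agrep --show-position 'Linux' "$dir/notes.txt"
fi
```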