Faster search filter: ICgrep
Well Filtered
One of the most common tasks when working on computers involves browsing texts for search patterns. Here, ICgrep offers a modern, parallel, and Unicode-enabled alternative to the classic grep.
Grep is arguably the most import text-browsing tool in Unix. Although its origins are not entirely clear, likely, the name for this tool developed from the command g/re/p
(global, regular expression, print) from the Unix standard editor Ed.
In any case, grep searches entries line by line for certain formulated patterns, as regular expressions. Depending on the options used, the tool can display the matches, their location, number, and so on.
With today's texts, this concept is stretched to its limits in several respects. For example, modern systems no longer use the antiquated ASCII character set with its 128 characters or its larger, 256-character siblings such as Latin 1 or increasingly Unicode [1] (see the "Using Unicode Characters" box).
Using Unicode Characters
Unicode aims to assign a digital code to each existing meaningful character or text element of all known writing cultures and character systems. The character range is equally extensive – encoding a single character requires several bytes.
Due to the large scope of the Unicode character range, it often isn't possible to enter characters on the keyboard directly because there simply aren't any keys for them. However, in Linux, Qt and GTK allow you to enter characters using key codes. To do so, you just need to precede the codes with the keyboard shortcut Ctrl+U or Ctrl+Shift+U – but not all programs support this.
It is often easier just to search for these – especially if you don't know the code for a particular character. There are two utilities for GTK-based desktops that are both used regularly: Gucharmap (Figure 1) and Gnome Characters (Figure 2). The latter is particularly useful for finding special characters and copying their codes.
Another problem is the definition of input lines. A plethora of variants already existed for identifying or inducing a line break with the previous, non-Unicode-enabled systems [2]. The default in Unix is NL
(New Line), and there are other options with Unicode, in particular NEL
(New Line), but also LS
(Line Separator) or PS
(Paragraph Separator).
The relatively low operating speed of grep – with large input volumes and complex regular expressions – makes matters worse. Even in the best case scenario, regexes [3] are pretty complicated to enter. When editing complex expressions, the computer requires a corresponding amount of computing power. On top of that, there are also conditional, non-greedy, and greedy regexes – which attempt to generate matches for samples to include as much text as possible – and many other advanced variants. All grep varieties process regular expressions at run time so that the complexity is mapped directly in the calculation time.
An online study [4] shows just how quickly ICgrep [5] works in comparison to the standard grep. It describes a scenario in which the standard grep needs more than a minute to search for a pattern, whereas ICgrep handles the same tasks in just over a second. According to the ICgrep developers, there might be differences in speed with a factor of up to 100.
Installation
Installing ICgrep proved to be a little tricky and quite time consuming in the test; its prerequisites are Subversion, Cmake, and Clang (Listing 1, line 1).
Listing 1
Creating ICgrep
To begin, load the latest ICgrep version onto your computer using Subversion. The message Checked out, revision
Number
marks the end of this action (line 5). Now switch to the icgrep1.0/icgrep-devel/
directory. There is an installation guide, among other things, in the README-icgrep-1.00a.txt
file. It essentially consists of two parts: the interpretation and the installation of Low Level Virtual Machine (LLVM) and the subsequent construction of ICgrep itself.
To install LLVM, it is important to generate the required Makefiles. To do this, switch to the llvm-build/
directory (line 6) and let Cmake perform its role there (line 7). After a short time, you can initiate the build of the LLVM using make && make install
(line 10). Depending on the computing power, the interpretation process takes about an hour, and the subsequent installation only takes a few seconds to complete.
The second step involves creating ICgrep (Listing 2). To do so, switch to the icgrep1.0/icgrep-devel/icgrep-build/
directory and create the Makefiles there using the command from the second line. The compiler automatically detects which CPU extensions the program can use to speed up processing. Alternatively, you can present one of the options SSE2
, SSE3
, SSE4_1
, SSE4_2
, AVX1,
or AVX2
using the -DSIMD_SUPPORT
button.
Listing 2
Generate the Executable File
The next step is to generate the executable file (Listing 2, line 15) without options using make
– this again takes a little longer. The generated program icgrep
can be made smaller using strip
. Last but not least, you can install the program manually in the /usr/local/bin/
directory.
In Practice
You can largely control ICgrep – and its cousins grep or agrep – using options. As this tool is a piece of research software, it contains two or three sets of options with different relevance for typical users. As usual, -help
shows the most important ones; -help-list
goes a step further: It generates a short, compact list of important buttons that, however, do require quite a bit of knowledge of regular expressions. The -help-list-hidden
option ultimately gives you the full list of options; however, many of these are only important in special cases.
You can define patterns, that is, regular expressions, using the -e
button ("expression"). If there is a risk that the shell will detect and evaluate or replace metacharacters in it, you should include the pattern between the single or double quotes. An equals sign can (but does not have to) follow the call, which may occur more than once:
$ icgrep -e Patt -e=ern
These types of expressions then stand for a logical OR: Condition 1 OR Condition 2 OR both must be met. If the number of patterns is too large and the commands get too confusing, it is advisable to outsource the pattern to an external file, one per line. You can specify this file using icgrep -f <SampleFile>
. Table 1 summarizes the most important buttons.
Table 1
Important Options in ICgrep
Option | Function |
---|---|
|
Defines a sample, can be specified multiple times |
|
Sample file |
|
Only counts matches, doesn't show them |
|
Displays the line numbers of matches |
|
Displays file name |
|
Ignores upper/lowercase |
|
Only |
|
Normalizes line breaks |
|
Shows the regular expressions used |
|
Shows named expressions |
One of ICgrep's significant shortcomings is that there's no option for recursively editing, like GNU grep does with the -r
button. Instead, you have to resort to the combination of Find, Xargs, and ICgrep:
$ find <Path> -type f | xargs icgrep <Sample>
Alternatively, today many shells, including Bash, support recursive directory searching using **
. However, this functionality must be activated via shopt -s globstar
. The shopt globstar
command then shows globstar on
. Caution: This process might lead to overly long argument lists and error messages like -bash: /usr/local/bin/icgrep: The argument list is too long
.
The other important options currently missing in ICgrep include -v
("invert match"), -C
, -A
, -B
(displays context lines), -m
(defines maximum number of matches), -a
, -b
(interprets inputs as text or binary), and --color
(highlights matches in color). The developers are planning to implement some of them soon [6].
Testing Regexes
With large data streams, it is advisable to test regular expressions first before applying them to all the data. Tools that visualize regular expressions make life easier for less experienced users. Two such tools are available in Ubuntu in the repositories: Kodos and Kiki [7].
Although Kodos can only be installed on older distributions with difficulty, Kiki is less demanding (Figure 3). It uses regular expressions defined and generated by Python, which it then applies to a previously entered text.
The regexp testing tool regexper
[8] works somewhat differently. The Java program naturally uses Java-based regular expressions and applies them to a previously loaded test file (Listing 3). The test data may be relatively large, which sometimes proves advantageous.
Listing 3
Test File
All of these tools, however, have weaknesses when it comes to processing Unicode characters. Therefore, when constructing regular expressions, they are only suitable for initial tests.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Gnome 47.2 Now Available
Gnome 47.2 is now available for general use but don't expect much in the way of newness, as this is all about improvements and bug fixes.
-
Latest Cinnamon Desktop Releases with a Bold New Look
Just in time for the holidays, the developer of the Cinnamon desktop has shipped a new release to help spice up your eggnog with new features and a new look.
-
Armbian 24.11 Released with Expanded Hardware Support
If you've been waiting for Armbian to support OrangePi 5 Max and Radxa ROCK 5B+, the wait is over.
-
SUSE Renames Several Products for Better Name Recognition
SUSE has been a very powerful player in the European market, but it knows it must branch out to gain serious traction. Will a name change do the trick?
-
ESET Discovers New Linux Malware
WolfsBane is an all-in-one malware that has hit the Linux operating system and includes a dropper, a launcher, and a backdoor.
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.