Search more efficiently with ugrep

Filters

Ugrep tries to determine the type of each file it examines from the data it contains, the file name extension, and the signature (the "magic bytes"). This lets you prepare, or filter, the search for specific file types.

A filter extracts the text components from a data stream before the search runs. Each filter executes a command or a script, chained with pipes if necessary. You place filters in front of the search process with the --filter=<Filter> or --filter-magic-label=<Label>:<Magic> option.

In the form --filter=<Filter>, <Filter> is an expression of the form <Ext>:<command>. <Ext> is a comma-separated list of file name extensions to which the filter applies, written without the leading dot as in the examples here, such as doc,docx,xls. The * character is a special case that applies the filter to all files, including those for which there are no other filters.

The <command> must read its input from standard input and write its results to standard output. Typical commands include cat (pass everything through) and head (pass only the first lines of text), but tools such as exiftool (extract and pass metadata) or pdftotext (extract text from PDFs) can also be used this way. Some commands, like pdftotext, need arguments to work as a filter, in this case pdftotext % -, where ugrep replaces % with the path of the file being searched and - tells pdftotext to write the text to standard output. Because the command line then contains spaces, quote the whole filter expression to protect it from the shell:

--filter='pdf:pdftotext % -'
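The input/output contract is easy to try out with head as a stand-in filter. The following is only a sketch: the sample file and its contents are made up for illustration, and the ugrep call is skipped if ugrep is not installed.

```shell
#!/bin/sh
# Create a throwaway sample file (contents are invented for this example).
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$sample"

# A filter is just a command that reads stdin and writes stdout.
# This is what head, used as a filter, would pass on to the search:
filtered=$(head -n 2 < "$sample")
printf '%s\n' "$filtered"

# The same command handed to ugrep as a filter for all files (*).
# The search then only sees the filtered text, not the raw file.
if command -v ugrep >/dev/null 2>&1; then
    ugrep --filter='*:head -n 2' 'gamma' "$sample" || echo "no match"
fi

rm -f "$sample"
```

Running the filter command by hand first, as in the middle of the sketch, is a quick way to check what ugrep will actually get to search.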

The --filter-magic-label=<Label>:<Magic> option lets you extend the filtering mechanism to data streams that ugrep classifies by their magic bytes rather than by a file name extension. Details can be found in the man page.
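As a rough sketch of how the two options might work together, the example below labels files that start with the PDF signature %PDF- as pdf, so that a pdf filter can pick them up even without an extension. That a label is matched against the extensions given to --filter is an assumption based on a reading of the man page, so check it there before relying on it. The sample file and the search term budget are invented for illustration.

```shell
#!/bin/sh
# Stand-in file: starts with the PDF signature but has no extension.
work=$(mktemp -d)
printf '%%PDF-1.4\n' > "$work/report"

# The leading bytes are what a <Magic> pattern has to match.
signature=$(head -c 5 "$work/report")
printf 'signature: %s\n' "$signature"
rm -rf "$work"

# Assumed usage: label such files "pdf" and let the pdf filter handle them.
# Runs only when a directory is passed and both tools are installed.
dir=${1:-}
if [ -n "$dir" ] && command -v ugrep >/dev/null 2>&1 \
                 && command -v pdftotext >/dev/null 2>&1; then
    ugrep -r --filter-magic-label='pdf:%PDF-' \
             --filter='pdf:pdftotext % -' 'budget' "$dir" || true
fi
```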

Multiple filters can be combined in a single option, separated by commas. A combined definition for PDF and Office documents might look like the one shown in Listing 3.

Listing 3

Combined Filter Definition

--filter="pdf:pdftotext % -,odt,doc,docx,rtf,xls,xlsx,ppt,pptx:soffice --headless --cat %"
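A definition this long is easier to handle if it is stored once and reused. The sketch below wraps it in a small shell function. The name search_docs and the choice of the -r and -i options are assumptions made for illustration, not something ugrep provides.

```shell
#!/bin/sh
# Filter definition from Listing 3, stored once for reuse.
FILTERS='pdf:pdftotext % -,odt,doc,docx,rtf,xls,xlsx,ppt,pptx:soffice --headless --cat %'

# search_docs is a hypothetical helper, not part of ugrep:
# recursive, case-insensitive search through the filters above.
# $1 is the pattern, $2 the directory (defaults to the current one).
search_docs() {
    ugrep -r -i --filter="$FILTERS" "$1" "${2:-.}"
}

# The filters only work if the helper programs are installed,
# so report any that are missing.
missing=''
for tool in ugrep pdftotext soffice; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
[ -z "$missing" ] || echo "not installed:$missing" >&2
```

With the tools in place, a call such as search_docs 'invoice' ~/Documents would then search PDF and Office files in that tree. Starting soffice for every document can be slow, so it is worth limiting the search to the directories that matter.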

Conclusions

Ugrep belongs on every computer. It replaces and complements the standard search commands very well, and anyone who deals with text searches should get to know it. The incremental search alone is useful enough to justify the minimal training time.
