Needle in a Haystack
How odfgrep Works
If you thought that an odfgrep script would have to be very complicated, think again. The code in Listing 1 is the odfgrep that I use from time to time on my own GNU/Linux computers, and it is less than 30 lines.
Listing 1
odfgrep Script
 1 #! /bin/bash
 2
 3 OPTIONS=$@
 4 ODFOPTIONS=`echo $@ | sed -e 's/\.\(odt\|odp\|ods\)\b/\.\1\.odfgrep\.txt/g'`
 5
 6 for FF in $OPTIONS
 7 do
 8   if [ -f "$FF" ]
 9   then
10     case "$FF" in
11       *.odt|*.odp|*.ods)
12         odt2txt --width=-1 $FF > $FF.odfgrep.txt
13         FILES2REMOVE="$FILES2REMOVE $FF.odfgrep.txt"
14         ;;
15       *) # non-ODF file found, nothing to do
16         ;;
17     esac
18   fi
19 done
20
21 grep $ODFOPTIONS | sed -e 's/\.odfgrep\.txt//'
22
23 if [[ -n "${FILES2REMOVE// }" ]]
24 then
25   rm $FILES2REMOVE
26 fi
27 exit
There are surely many other ways to write an odfgrep script, and many of those ways may be faster than the example here or handle some weird combinations of search patterns and ODF files better. Personally, however, I have not had any problems yet, and the simple odfgrep discussed here should be enough for the needs of the great majority of Linux desktop users. Its high-level flow diagram (Figure 2) is easy to describe:

- Get the list of all the files to analyze from the user.
- Figure out which of those files are in ODF format.
- Make a temporary plain text version of each of those files, each in the same folder as the original.
- Pass to the standard grep the same options specified by the user, but a different list of files, in which plain text versions of the ODF files are used.
- "Massage" the output of grep on the fly so that the user sees the names of the original ODF files.
- Remove all the temporary plain text files created in step 3.
I look at the assumptions behind this algorithm, its limits, and some ways to expand it at the end of the tutorial. For now, I'll just look at how the code that implements it works, line by line.
Commands typed at a prompt are interpreted and executed on the spot, line by line, by special programs called "shells" in the Linux world. You can save long sequences of commands in files that a shell may then execute automatically, one line at a time. These special files are called scripts, and odfgrep is just that: a script.
The weird stuff on line 1 is the standard header of every shell script. The two initial characters (#!, the "shebang" in Unix slang) introduce the path of the program that must interpret the file. Here, that program is bash in the /bin folder, which means that the file uses the syntax of the default shell on GNU/Linux systems, called Bourne Again Shell (Bash).
Each shell script can receive options, or switches, that modify its default behavior. In Bash, everything typed after the script name (options, search patterns, and file names) is saved in the special variable called $@. Lines 3 and 4 copy all those arguments, for readability, into two string variables called OPTIONS and ODFOPTIONS. The first will only be used to figure out which of the files that grep is to scan are in ODF format (line 6).
The $ODFOPTIONS variable is filled in line 4 with the customized file list mentioned above. In that line, in fact, sed (Stream EDitor) receives the original options and appends on the fly the string .odfgrep.txt to each occurrence of the .odt, .odp, and .ods file extensions.
In other words, if you asked odfgrep to find all the occurrences of "Linux" in two files called thesis.odt and thesis-slides.odp with the command
#> odfgrep Linux thesis.odt thesis-slides.odp
then $ODFOPTIONS would assume the value Linux thesis.odt.odfgrep.txt thesis-slides.odp.odfgrep.txt.
sed achieves this by substituting each occurrence of the text pattern between the first two forward slashes with the other pattern between the second and third forward slashes. A complete sed tutorial would not fit here (and would be off topic), but you need to understand two pieces of line 4: The \1 means "put here the string just found with the pattern in the set of parentheses to the left" (i.e., odt, ods, or odp; the period in front of it is matched outside the parentheses and written back explicitly in the replacement).
The \b pattern modifier makes sed only act on word boundaries: Without it, line 4 would modify a file name like notes.odtconference.odt to notes.odt.odfgrep.txtconference.odt.odfgrep.txt. This is not a 100% bulletproof solution, since it would also act, say, on strings like my.odt.notes.txt, where the period after .odt counts as a word boundary. In practice, that has never been a problem for me.
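The behavior of line 4 can be checked in isolation by feeding a few strings to the same sed expression. The following is a minimal sketch that assumes GNU sed (the \| and \b operators are GNU extensions); the function names with_boundary and without_boundary are made up for this illustration and are not part of odfgrep.

```shell
#!/bin/bash
# Same substitution as line 4 of Listing 1 (GNU sed assumed).
with_boundary() {
  echo "$1" | sed -e 's/\.\(odt\|odp\|ods\)\b/\.\1\.odfgrep\.txt/g'
}

# Hypothetical variant without \b, only to show what \b prevents.
without_boundary() {
  echo "$1" | sed -e 's/\.\(odt\|odp\|ods\)/\.\1\.odfgrep\.txt/g'
}

with_boundary "Linux thesis.odt thesis-slides.odp"
with_boundary "notes.odtconference.odt"
without_boundary "notes.odtconference.odt"
with_boundary "my.odt.notes.txt"
```

The first call prints the $ODFOPTIONS value from the example above. The second and third show that only the version with \b leaves the .odt in the middle of notes.odtconference.odt alone. The last call prints my.odt.odfgrep.txt.notes.txt, which is the imperfect case just mentioned.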
The loop in lines 6 to 19 creates the plain text copies of each ODF file, saving their names (which are the same names previously written in $ODFOPTIONS, remember?) in the $FILES2REMOVE variable. To do this, odfgrep copies the substrings inside $OPTIONS, one at a time, in the variable $FF (line 6) and looks at them. But nothing happens unless:
- $FF is the name of an actual file (line 8), and
- Its extension is .odt, .odp, or .ods (line 11).
In that case, and only in that case, the odt2txt utility (line 12) writes a plain text copy of $FF in a temporary file with the same suffix used in line 4 to build $ODFOPTIONS – that is, .odfgrep.txt.
Please note that, even if I just called it a "file name," $FF includes the path to a file (i.e., it may have values like work/essays/phd-thesis.odt). In this case, odt2txt would save the plain text copy as work/essays/phd-thesis.odt.odfgrep.txt, so it is in the same folder as the original. The same string is also appended, in line 13, to the variable $FILES2REMOVE, which is necessary for reasons that will be clear in a moment.
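To see which names the loop would act on, the extension test and the naming rule can be tried without converting anything. This is only a sketch: temp_name_for is a hypothetical helper, it uses the three extensions named in the text, and it skips the "actual file" check of line 8 so that it can run on made-up paths.

```shell
#!/bin/bash
# Hypothetical helper: print the temporary file name odfgrep would use,
# or nothing if the argument does not end in an ODF extension.
# Unlike Listing 1, it does not check that the file exists.
temp_name_for() {
  case "$1" in
    *.odt|*.odp|*.ods)
      echo "$1.odfgrep.txt"
      ;;
    *) # non-ODF name, nothing to print
      ;;
  esac
}

temp_name_for work/essays/phd-thesis.odt
temp_name_for notes/go-linux.txt
```

The first call prints work/essays/phd-thesis.odt.odfgrep.txt, with the path intact; the second prints nothing.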
The --width option in line 12 tells odt2txt the width at which text lines should be wrapped. Its default value is 65 characters. Setting it to -1 means "do not wrap lines." This adjustment is necessary because grep works line by line. If you were searching for a sentence like Linux is great, but odt2txt split it across two consecutive lines, grep would not find it.
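The line-by-line behavior of grep is easy to reproduce without any ODF file. In the sketch below, printf stands in for the converter output (the sample sentences are invented), and count_matches is a hypothetical helper that counts matching lines.

```shell
#!/bin/bash
# Count the lines on standard input that contain the phrase.
count_matches() {
  grep -c 'Linux is great' || true   # grep exits with status 1 when the count is 0
}

# The phrase wrapped across two lines, as a fixed-width conversion might emit it.
printf 'I think that Linux is\ngreat and I love it\n' | count_matches

# The same text on a single line, as with --width=-1.
printf 'I think that Linux is great and I love it\n' | count_matches
```

The first command prints 0 and the second prints 1: the words are identical, but only the unwrapped version keeps the phrase on one line.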
Once the loop that started in line 6 ends, $ODFOPTIONS contains three types of "objects":
- The options and search patterns that the ordinary grep should use.
- The paths to all the non-ODF files passed by the users.
- The paths to all the plain text copies of ODF files generated in line 12.
The objects of the first two types are not modified in any way, because they were not file names with ODF extensions; therefore, neither the sed command in line 4 nor the loop did anything to them!
At this point, you can finally run grep with the $ODFOPTIONS (line 21), but with one trick: Filter its output with sed in a way that makes all the .odfgrep.txt strings disappear. This will make odfgrep always return the names of the original ODF files, instead of their plain text copies, which are the only ones that grep sees. Without that sed command, the output of grep could be something like
phd-thesis.odt.odfgrep.txt: Linux is great and I love it..
and this would confuse the users. The sed part of line 21, instead, transforms the output line above in this way, pointing to the original ODF file:
phd-thesis.odt: Linux is great and I love it..
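The filter can be tried on its own by piping a sample grep output line through it. The sketch below reuses the sed expression of line 21; strip_suffix is just an illustrative wrapper name.

```shell
#!/bin/bash
# Same filter as the sed part of line 21 in Listing 1.
strip_suffix() {
  sed -e 's/\.odfgrep\.txt//'
}

echo 'phd-thesis.odt.odfgrep.txt: Linux is great and I love it..' | strip_suffix
```

Because the expression has no g flag, sed removes only the first .odfgrep.txt on each line. When grep prints file names, that first occurrence is the file name prefix.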
After this, the only thing left to do is clean up (lines 23 to 26). Line 23 means "check if, after removing all spaces from the FILES2REMOVE variable, it has a number of characters greater than 0." If that is true, it means that at least one plain text file was created, and its name was appended to $FILES2REMOVE (lines 12 and 13). In that case, execute line 25, which removes all the files listed therein. Done!
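The test in line 23 relies on a Bash-specific parameter expansion, which can be explored separately. In this sketch, has_files is a hypothetical function that applies the same test to its first argument instead of to $FILES2REMOVE.

```shell
#!/bin/bash
# Report whether a FILES2REMOVE-style string names at least one file.
# ${1// } is Bash parameter expansion: delete every space from $1.
has_files() {
  if [[ -n "${1// }" ]]
  then
    echo yes
  else
    echo no
  fi
}

has_files ""                          # nothing was converted
has_files "   "                       # only spaces
has_files " thesis.odt.odfgrep.txt"   # one name appended, as in line 13
```

The three calls print no, no, and yes. Stripping the spaces first means that a variable holding nothing but separators is treated as empty, so rm is never called without file names.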
As an example, Listing 2 shows the output of odfgrep on a test directory that contains several files of different kinds in different subfolders. The command says "show me all the lines containing the word linux (case insensitive) in all the files inside testdir and all its subfolders" (some lines were truncated for better formatting).
Listing 2
odfgrep for "linux"
#> find testdir -type f -exec odfgrep -i linux {} /dev/null \;
testdir/references/mfioretti.odp:Writer for several Linux magazines
testdir/references/mfioretti.odp:any Gnu/Linux distribution is OK
testdir/references/open-business-models.odt:Yochai Benkler, Linux and the Nature
testdir/notes/go-linux.md:what trouble? Why not check your data table inside a spreadsheet or database? Because it's often...
testdir/notes/go-linux.txt:Linux(1) is the best kernel around
testdir/notes/go-linux.txt:Linux,1
testdir/notes/go-linux.txt:he actually said "Linux is the best kernel around"
As you can see, odfgrep works and generates output in the same way as the standard grep, always returning the right file names, both on actual plain text files like go-linux.txt and on ODF slide shows (mfioretti.odp) or document files (open-business-models.odt).
In another example, I ask odfgrep to tell me how many times the word politics appears in the ODF text documents inside a certain folder:
#> odfgrep -c politics testing/references/*odt
testing/references/conference-proceedings.odt:2
testing/references/openness-essay.odt:3
Here, odfgrep found two matching documents in that folder; the word politics appeared two times in one file and three times in another file.
Installing odt2txt and odfgrep
The odt2txt program [2] is present in the repositories of the main GNU/Linux distributions. On Ubuntu, for example, you can install it by simply typing:
#> sudo apt-get install odt2txt
To install odfgrep, first save the code in Listing 1 (except the line numbers [3]) in a plain text file called odfgrep with the use of an editor like Gedit, Kate, or the venerable Vi or Emacs. Then, move that file to a directory where all users of your computer can access it (e.g., /usr/local/bin), and make it executable with the commands:
#> sudo mv odfgrep /usr/local/bin
#> sudo chmod 755 /usr/local/bin/odfgrep
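After installing, a quick check that both programs can be found through the search path may save some head scratching. The following sketch uses the shell built-in command -v; check_cmd is a hypothetical helper name, and the script only reports what it finds without changing anything.

```shell
#!/bin/bash
# Print "found" or "missing" for each command name given.
check_cmd() {
  for CMD in "$@"
  do
    if command -v "$CMD" > /dev/null 2>&1
    then
      echo "$CMD: found"
    else
      echo "$CMD: missing"
    fi
  done
}

check_cmd odt2txt odfgrep
```

If either line says missing, repeat the corresponding installation step before trying the examples.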
Caveats and Limits
The odfgrep script explained here is simple but very useful, provided you acknowledge some of its limitations or underlying assumptions.
The first things to know are about folder and file names. Depending on the language settings of your computer, this odfgrep may fail on names containing characters that are not ASCII alphanumerical characters, periods, underscores, or hyphens. It will surely fail if it comes across folder or file names containing spaces. At the same time, it will not detect, and therefore will not convert, ODF files that do not have their own default extensions (e.g., .odt, .ods, or .odp).
Personally, I consider these "sure failures" more of a feature than a bug for one simple reason: In my humble opinion, "limiting" yourself to files and folder names without spaces, apostrophes, and non-ASCII letters guarantees that any software or filesystem on the planet will deal with them without surprises, and it makes it much simpler to write all sorts of file-managing scripts for any purpose.
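For readers curious about why spaces are fatal, the sketch below counts how many words the loop sees for the same arguments. The first function mimics lines 3 and 6 of Listing 1; the second is a hypothetical alternative that iterates over "$@" directly. Both function names are invented, and a truly space-safe odfgrep would also have to rebuild the argument list for grep, which is not shown here.

```shell
#!/bin/bash
# Count the words the loop in Listing 1 would see for the given arguments.
count_as_listing() {
  OPTIONS=$@            # flattens all arguments into one string (line 3)
  N=0
  for FF in $OPTIONS    # unquoted, so split again on whitespace (line 6)
  do
    N=$((N + 1))
  done
  echo "$N"
}

# Hypothetical alternative: looping over "$@" keeps each argument whole.
count_quoted() {
  N=0
  for FF in "$@"
  do
    N=$((N + 1))
  done
  echo "$N"
}

count_as_listing Linux "my thesis.odt"
count_quoted Linux "my thesis.odt"
```

The first call prints 3, because my thesis.odt is broken into my and thesis.odt, neither of which is an existing file. The second prints 2.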
A more substantial limitation is the inability to work as intended in folders where you have no permission to create new files. Running odfgrep with sudo or giving it special SUID powers, as explained in an article online [4], would solve this problem. However, even that will not be enough to work on non-writable media like DVD archives, in which the normal grep tool would work just fine.
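One way to find out in advance whether a given file can be handled is to test whether its folder is writable. This is a sketch of such a pre-check and not something Listing 1 does; the function name can_convert_in_place is an assumption made for this example.

```shell
#!/bin/bash
# Hypothetical pre-check: can a temporary copy be created next to this file?
can_convert_in_place() {
  DIR=$(dirname "$1")
  if [ -d "$DIR" ] && [ -w "$DIR" ]
  then
    echo yes
  else
    echo no
  fi
}

can_convert_in_place work/essays/phd-thesis.odt
```

The call prints yes only if work/essays exists and the current user may create files in it. A script could use a test like this to skip a file with a warning instead of failing halfway.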
You must also take into account that odt2txt cannot fix stuff that "disturbs" the main text flow, like footnotes. If, for example, your ODF text contains a sentence like Linux(1) is the best kernel around, and (1) is a footnote with the text a Unix-like kernel by Linus Torvalds, then odt2txt will split the original text over five lines:
Linux,1
**
a Unix-like kernel by Linus Torvalds
**
is the best kernel around
If you were looking for the exact phrase Linux is the best kernel around, odfgrep would miss it, exactly because it was spread over multiple lines with extra text in it.