Metadata in ODF Files
A Simple ODF Metadata Reader
Listing 1 shows a script, called odfmetareader.sh
that follows the Unix philosophy of small tools that each do just one thing but can be connected in a pipeline. It just prints out, one per line, all the explicit and hidden metadata it finds in the single ODF file passed to it as an argument. Analysis of the output, or its insertion into some database or spreadsheet, is delegated to other tools. You can use this script inside a loop to work on as many files as you like, as shown later in the tutorial. Of course, you also can, and should, change the script to format its output to best suit your needs. Listing 1 shows how the code works.
Listing 1
odfmetareader.sh
01 #! /bin/bash
02
03 rm -rf /tmp/odfmetareader
04 mkdir /tmp/odfmetareader
05 cp $1 /tmp/odfmetareader/odf.zip
06 cd /tmp/odfmetareader
07
08 unzip odf.zip >& /dev/null
09
10 echo "## METADATA DOC START for document $1;"
11 echo "## METADATA ODF START for document $1;"
12
13 # extract explicit ODF metadata
14
15 cat meta.xml | perl -e 'while (<>) {s/document-statistic//g; s/<(meta|dc):([^>]+)>/\n$2=/g; s/user-defined /user-defined-/g; s/<\/(meta|dc).*//g; s/ meta:value-type=/ value-type/g; s/ meta:/\n/g; s/\/=//g; s/<\/office:[^>]+>//g; print} print "\n"' | grep -v '<office:document' | grep -v '^<?xml version' | grep -v '^generator=' | grep '='
16
17 echo "## METADATA ODF END for document $1;"
18 echo
19
20 # extract metadata about macros
21 if [ -d "Basic" ]
22 then
23   echo "## METADATA MACRO START for document $1;"
24
25   MACRONUM=`find Basic -type f -name "*xml" | grep -v /script- | wc -l`
26
27   echo "macronumber=$MACRONUM"
28   for M in `find Basic -type f -name "*xml" | grep -v /script-`
29   do
30     echo macrofile:$M
31     grep 'sub ' $M
32   done
33   echo "## METADATA MACRO END for document $1;"
34   echo
35 fi
36
37 # extract metadata from images
38
39 if [ -d "Pictures" ]
40 then
41   for P in `find Pictures -type f`
42   do
43     N=`basename $P`
44     echo "## METADATA PICTURE START for document $1 / Picture $N;"
45     echo picturename: $N
46     exiftool $P | egrep '^(Artist|GPS)'
47     echo "## METADATA PICTURE END for document $1 / Picture $N;"
48   done
49 fi
50 # final cleanup
51
52 echo
53 echo "## METADATA DOC END for document $1;"
54 echo
55 #rm -rf /tmp/odfmetareader
56
57 exit
The overall flow is very simple: The script makes a copy of the given file and unzips it in the temporary folder /tmp/odfmetareader
(lines 3-8). The final command on line 55 removes that folder, but I recommend leaving it commented until you have figured out (by looking into that same folder) the internal structure of ODF files.
The central part of Listing 1 prints out the variables in the meta.xml
file and two lists: one of macros and one of pictures, with all their own embedded metadata.
The echo
commands containing the ## METADATA
string (e.g., lines 10 and 11) all have the same purpose: They separate the output sections, which (one hopes) makes them more readable and easier for other scripts to parse.
Line 15 extracts all the metadata from the meta.xml
file. It does seem like ancient Martian, but it is less obscure than it may seem at first sight. It is a concatenation of one long command in Perl and four invocations of the grep
utility.
The Perl part is, basically, a series of regular expressions separated by semicolons that remove all the XML markup you don't need to see in the output. For example, this part
s/<\/(meta|dc).*//g;
replaces, with an empty string, every string that begins with </meta
or </dc
, plus all the characters that follow it until the end of the current line (that is what the .*
part means). The four grep
commands just remove header and footer lines in the XML file that don't contain any metadata. The best way to understand what line 15 actually does, and how to customize it for your needs, is to run the script on any ODF file and compare its output with the original content of the meta.xml
file.
Native macros in ODF files are stored, if present, inside the Basic
folder of the ZIP archive, and line 21 checks if this folder exists. If it does, the script finds all the macro files inside the folder and prints the value in the variable MACRONUM
(lines 25-27). The loop in lines 28 to 32 finds and prints all the lines in the macro files that contain macro names.
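The macro part can be tried in isolation on a throwaway folder. What follows is only a sketch, not the code of Listing 1: the list_macros function, the fake module, and its content are invented, and the loop reads file names line by line so that paths containing spaces would survive.

```shell
# Build a throwaway folder that imitates the Basic/ tree of an unzipped ODF file.
# The module name and its content are invented for this sketch.
WORKDIR=$(mktemp -d)
mkdir -p "$WORKDIR/Basic/Standard"
printf 'sub Main\nprint "hello"\nend sub\n' > "$WORKDIR/Basic/Standard/samplemodule.xml"
printf '<library/>\n' > "$WORKDIR/Basic/Standard/script-lb.xml"

# list_macros DIR: DIR is the folder that contains the Basic/ directory.
list_macros() {
  (
    cd "$1" || exit 1
    # Count the XML files whose path does not contain "/script-".
    MACRONUM=$(find Basic -type f -name '*xml' | grep -v -c /script-) || true
    echo "macronumber=$MACRONUM"
    find Basic -type f -name '*xml' | grep -v /script- | while IFS= read -r M
    do
      echo "macrofile:$M"
      # A module without any "sub " line is not an error.
      grep 'sub ' "$M" || true
    done
  )
}

list_macros "$WORKDIR"
```

The output has the same three lines as the macro section of the real script: the count, the file name, and the sub Main line.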
The last loop of the script, in lines 39 to 49, checks if a Pictures
folder exists. If the answer is yes, it scans all the pictures inside it (line 41), to print their names (lines 43-45) and then runs the exiftool
command on them (line 46). exiftool
is free software capable of reading and writing all the metadata stored inside today's digital photographs that use Exif and other similar standards.
When given a file name, as in line 46, exiftool
just prints all the metadata in that file, one per line. The egrep
command in line 46 discards all lines, except those that begin with either Artist or GPS, probably the most sensitive data.
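To see the filter at work without installing exiftool, you can feed it canned text. In the sketch below, the Artist and GPS values are those of Listing 2, while the first two lines and the function names are invented. I use grep -E, which is the current spelling of egrep (recent GNU grep versions warn that egrep is obsolescent).

```shell
# Canned text in the "Tag : value" layout that exiftool prints.
# Artist and GPS values come from Listing 2; the first two lines are made up.
sample_exif_output() {
  cat <<'EOF'
File Name                       : sample-picture.jpg
Image Width                     : 1024
Artist                          : Marco Fioretti
GPS Latitude                    : 47 deg 30' 20.53" N
GPS Longitude                   : 19 deg 2' 43.75" E
EOF
}

# Keep only the lines that begin with Artist or GPS.
sensitive_tags() {
  grep -E '^(Artist|GPS)'
}

sample_exif_output | sensitive_tags
```

Extending the pattern, for example to '^(Artist|GPS|Copyright)', is all it takes to report more tags.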
Listing 2 shows a small excerpt, heavily edited for clarity, of the odfmetareader.sh
output from the sample ODF document shown in Figure 5, which contains one macro and one photograph.
Listing 2
odfmetareader Results
01 ## METADATA ODF START for document odf-sample-text.odt;
02 initial-creator=Marco Fioretti
03 creation-date=2018-07-22T17
04 date=2018-07-22T18:07
05 creator=Marco Fioretti
06 editing-duration=PT33M32S
07 editing-cycles=9
08 description=Let's see where all these metadata end up...
09 keyword=ODF
10 keyword=Metadata
11 keyword=text processing
12 keyword=text mining
13 subject=showing the way in which ODF format stores metadata
14 title=Just A Sample ODF Text Document
15 image-count="1"
16 word-count="81"
17 character-count="468"
18 user-defined-meta:name="Approved" value-type"boolean"=false
19 user-defined-meta:name="Status"=Confidential
20
21 ## METADATA MACRO START for document odf-sample-text.odt;
22 macronumber=1
23 macrofile:Basic/Standard/samplemodule.xml
24 sub Main
25 ## METADATA MACRO END for document odf-sample-text.odt;
26
27 ## METADATA PICTURE START for document odf-sample-text.odt / Picture sample-picture.jpg;
28 picturename: sample-picture.jpg
29 Artist : Marco Fioretti
30 GPS Latitude : 47 deg 30' 20.53" N
31 GPS Longitude : 19 deg 2' 43.75" E
Publishing online ODF files (or office files in general, probably) without "cleaning" them first may mean letting everybody know where, and by whom, each photograph contained in the file was taken (as shown, starting in line 27). Sometimes this is OK; sometimes it is not.
The macro section (lines 21-25), as commented, lists the number, location, and names of all the macros inside the document. The initial section (lines 1 to 19) is just a plain text version of the metadata shown in Figures 1 to 4. It is easy to imagine how many of the lines above, from editing cycles and duration to word count and keywords, may be filtered or fed to some other script to answer any kind of question.
As an example, the following lines show how you may discover which ODF files in a whole directory tree have Linux Magazine as the creator:
for F in `find . -type f | egrep '(odt|ods|odp)$'`
do
  FOUND=`odfmetareader $F | grep -i ^creator | grep -i -c 'Linux Magazine'`
  if [ $FOUND -gt 0 ]
  then # = "there was at least one line with that string"
    echo found $F
  fi
done
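The ## METADATA markers also make it possible to pull a single field out of one section only. The sketch below assumes the output format produced by Listing 1; odf_field is a hypothetical helper of mine, not part of the script, and the canned input reuses values from Listing 2.

```shell
# Canned odfmetareader output; values copied from Listing 2.
sample_reader_output() {
  cat <<'EOF'
## METADATA ODF START for document odf-sample-text.odt;
creator=Marco Fioretti
editing-cycles=9
keyword=ODF
keyword=Metadata
## METADATA ODF END for document odf-sample-text.odt;
## METADATA MACRO START for document odf-sample-text.odt;
macronumber=1
## METADATA MACRO END for document odf-sample-text.odt;
EOF
}

# odf_field KEY: print the value(s) of KEY, looking only at the lines
# between the ODF START and ODF END markers.
odf_field() {
  awk -v key="$1" '
    /^## METADATA ODF START/ { inside = 1; next }
    /^## METADATA ODF END/   { inside = 0 }
    inside && index($0, key "=") == 1 { print substr($0, length(key) + 2) }
  '
}

sample_reader_output | odf_field editing-cycles
sample_reader_output | odf_field keyword
```

The first call prints 9 and the second prints ODF and Metadata, one per line. Asking for macronumber prints nothing, because that line sits outside the ODF section.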
Writing ODF Metadata
Extracting metadata from ODF files is great. Being able to erase or modify it is even better. You can learn how to do so by playing with the odfmetawriter
script in Listing 3, which was written expressly for didactic purposes. For simplicity, it performs only one operation per run, always in the same way: It extracts the file(s) that must be changed, processes them, and puts them back in a copy of the zipped ODF file. To give you an idea of how you might alter both explicit and "hidden" ODF metadata, the script can do the following:
Listing 3
odfmetawriter.sh
01 #! /bin/bash
02
03 if [ ! -e "$1" ]
04 then
05   echo "script launched on non-existing file: $1; aborting"
06   exit
07 fi
08
09 STARTINGDIR=`pwd`
10
11 rm -rf /tmp/odfmetawriter
12 mkdir /tmp/odfmetawriter
13 cp $1 /tmp/odfmetawriter/odf.zip
14 cp $1 /tmp/odfmetawriter/new-$1
15 cd /tmp/odfmetawriter
16
17 unzip odf.zip >& /dev/null
18 cp meta.xml meta.orig.xml
19
20 case "$2" in
21   creator|title|description)
22     echo "Changing $2 to: $3"
23     sed -i -- "s/<dc:$2>.*<\/dc:$2>/<dc:$2>$3<\/dc:$2>/" meta.xml
24     zip -f new-$1 meta.xml
25     ;;
26
27   addkeyword)
28     sed -i -- "s/<meta:keyword>/<meta:keyword>$3<\/meta:keyword><meta:keyword>/" meta.xml
29     zip -f new-$1 meta.xml
30     ;;
31
32   addcustom)
33     sed -i -- "s/<meta:user-defined/<meta:user-defined meta:name=\"$3\">$4<\/meta:user-defined><meta:user-defined/" meta.xml
34     zip -f new-$1 meta.xml
35     ;;
36
37   renamefromtitle)
38     EXT="${1##*.}"
39     TITLE=`perl -e 'while (<>) {next unless m/.*<dc:title>(.*)<\/dc:title>/; $T = $1;} $T =~ s/\W+/-/g; print $T' meta.xml`
40     mv -i new-$1 $STARTINGDIR/$TITLE.$EXT
41     exit
42     ;;
43
44   watermark)
45     if [ -d "Pictures" ]
46     then
47       for P in `find Pictures -type f`
48       do
49         convert $P -font Arial -pointsize 60 -draw "gravity center fill yellow text 1,11 '$3' " temp-watermarked
50         mv temp-watermarked $P
51         zip -f new-$1 $P
52       done
53     else
54       echo "No Pictures in this ODF Document!"
55       exit
56     fi
57     ;;
58
59   removepicsdata)
60     if [ -d "Pictures" ]
61     then
62       for P in `find Pictures -type f`
63       do
64         exiftool -all= $P
65         zip -f new-$1 $P
66       done
67     else
68       echo "No Pictures in this ODF Document!"
69       exit
70     fi
71     ;;
72
73   *)
74     echo "unknown or unsupported option, please retry: $2;"
75     rm -rf /tmp/odfmetawriter
76     exit
77     ;;
78 esac
79
80 mv -i new-$1 $STARTINGDIR/
81
82 #rm -rf /tmp/odfmetawriter
83
84 exit
- Rewrite title, creator, or description
- Add an extra keyword
- Add a custom field
- Rename the file to match the document title
- Insert a textual watermark in all pictures
- Remove Exif data from pictures
The script must always be launched in the same way:
#> odfmetawriter <ODF-file-name> <operation> <options>
The beginning and end are almost the same as odfmetareader
: Create a temporary folder, work inside it, and remove it when done. Pay attention to line 14, though, which makes a copy of the file passed as an argument with the new-
prefix: It is this file that will be "filled" with the new metadata and eventually (line 80) moved to the directory where the script was launched.
The core of the script is the case
statement (lines 20-78). It has seven branches: one for each of the operations listed above and a final one (lines 73-77) that exits with an error message in all other cases.
Lines 21 to 35 all do the same thing – that is, update or add a variable in the meta.xml
file.
If the variable passed as a second argument ($2
) is creator
, title
, or description
, the first branch (lines 21-25) of the case
statement finds the corresponding variable and, using the sed
command, replaces its value with the string passed as the third argument.
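The substitution of line 23 can be tried on a plain string before touching a real document. The sketch below is a variant, not the code of Listing 3: set_dc_field and escape_replacement are hypothetical helper names, the sample fragment is invented, and the extra escaping step is my addition, because a slash or an ampersand in the new value would otherwise be interpreted by sed itself. The value is still assumed to be valid XML text already (note the &amp; entity).

```shell
# Hypothetical meta.xml fragment on a single line.
SAMPLE='<dc:title>Old title</dc:title><dc:creator>Marco Fioretti</dc:creator>'

# Escape the characters that sed treats specially in a replacement
# string: backslash, ampersand, and the slash used as delimiter.
escape_replacement() {
  printf '%s' "$1" | sed -e 's|[\\&/]|\\&|g'
}

# set_dc_field FIELD VALUE: same substitution as line 23 of Listing 3,
# applied to standard input instead of editing meta.xml in place.
set_dc_field() {
  value=$(escape_replacement "$2")
  sed -e "s/<dc:$1>.*<\/dc:$1>/<dc:$1>$value<\/dc:$1>/"
}

printf '%s\n' "$SAMPLE" | set_dc_field title 'Q1/Q2 &amp; more'
```

The title element now contains Q1/Q2 &amp; more, and the creator element is untouched. Without the escaping step, the slash in the value would end the sed expression early and the command would fail.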
The two other branches add keywords or custom fields (with a value equal to $3
) when $2
is equal to addkeyword
or, respectively, addcustom
. They work almost in the same way as the first one, with the only difference being that they prepend the XML markup defining the new variable to the other variables of the same kind.
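Here is a minimal sketch of the "prepend" trick, using the same sed expression as line 28 on two invented fragments; add_keyword is a hypothetical wrapper, and the keyword is assumed to contain no characters that are special to sed or XML. Because the expression has no g flag, only the first opening tag is rewritten.

```shell
# Hypothetical fragments: one with two keywords, one with none.
WITH_KEYWORDS='<meta:keyword>ODF</meta:keyword><meta:keyword>Metadata</meta:keyword>'
NO_KEYWORDS='<dc:title>Just A Sample</dc:title>'

# add_keyword WORD: same substitution as line 28 of Listing 3, applied to
# standard input. Only the first <meta:keyword> tag is replaced.
add_keyword() {
  sed -e "s/<meta:keyword>/<meta:keyword>$1<\/meta:keyword><meta:keyword>/"
}

printf '%s\n' "$WITH_KEYWORDS" | add_keyword 'text mining'
printf '%s\n' "$NO_KEYWORDS" | add_keyword 'text mining'
```

The first call puts the new keyword in front of ODF. The second call prints its input unchanged: When there is no existing keyword to attach to, the substitution silently does nothing.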
In all cases, after the meta.xml
file has been "updated," it is put back in the copy of the ODF file (lines 24, 29, and 34).
The fourth supported operation does not change anything in the file. When the $2
parameter is equal to renamefromtitle
, the script:
- Takes note of the original file extension (EXT, line 38)
- Uses Perl to extract the title string from meta.xml, replace all of its non-alphanumeric characters with single dashes (line 39), and save the result in the TITLE variable
- Makes a copy of the original file, with the name TITLE.EXT, in the original directory
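The three steps above can be sketched with sed instead of Perl. This is an approximation, not the code of Listing 3: extract_title and title_to_filename are hypothetical names, the XML fragment is invented, and the bracket expression only mimics Perl's \W for plain ASCII text.

```shell
# extract_title: print the content of the <dc:title> element found on stdin.
extract_title() {
  sed -n -e 's/.*<dc:title>\(.*\)<\/dc:title>.*/\1/p'
}

# title_to_filename FILE TITLE: build the new file name from the title
# and from the extension of the original file.
title_to_filename() {
  ext="${1##*.}"   # text after the last dot, as in line 38
  # Replace each run of characters that are not letters, digits, or
  # underscores with a single dash.
  slug=$(printf '%s' "$2" | sed -e 's/[^[:alnum:]_][^[:alnum:]_]*/-/g')
  printf '%s.%s\n' "$slug" "$ext"
}

# Invented fragment, just to have a title to work on.
TITLE=$(printf '%s\n' '<office:meta><dc:title>New title for Linux Magazine</dc:title></office:meta>' | extract_title)
title_to_filename odf-sample.odt "$TITLE"
```

With this input, the result is New-title-for-Linux-Magazine.odt.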
The last two operations supported by odfmetawriter
are insertion of the textual watermark passed as the third parameter inside all the pictures (lines 44-57) and removal of all Exif metadata from the same pictures (lines 59-71).
The watermark is inserted with ImageMagick's convert
tool. The code in line 49 is copied almost verbatim from the relevant ImageMagick documentation [1]. Line 64, instead, tells exiftool
to give all Exif variables in the current picture an empty value [2]. As before, the modified pictures ($P
) are zipped back in the right place, in the copy of the original document (lines 51 and 65).
Running the following commands, in sequence, on the sample ODF document shown in Figure 6
#> odfmetawriter odf-sample.odt title 'New title for Linux Magazine'
#> odfmetawriter odf-sample.odt description 'Here is an ODT file with its metadata changed by a script'
#> odfmetawriter odf-sample.odt addkeyword 'ODF metadata processing'
#> odfmetawriter odf-sample.odt renamefromtitle
#> odfmetawriter New-title-for-Linux-Magazine.odt watermark 'Watermarked for Linux Magazine'
produces the results shown in Figure 7. (For simplicity, the renaming commands after each operation have been omitted.) As you can see for yourself, the metadata has the new values, and the picture is properly watermarked. Isn't ODF great to hack?
Code Limits
I already said this, but let me repeat it: The two scripts above do work, but they are neither perfect nor robust. At a minimum, they would need extra checks to refuse input files not in ODF format and to properly handle non-alphabetic languages or strings with quotes inside them. In odfmetawriter
, for example, addcustom
will fail if there isn't already at least one custom field present. Also, odfmetawriter
does not change the initial-creator
of an ODF file. Another issue is dates: It is trivial to alter dates in the meta.xml
file, but unless you do it right, you will end up with inconsistent documents (e.g., having ODF files with last-modified
timestamps that are earlier than some of the revisions they contain). Finally, neither script is optimized for performance.
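The first of the missing checks, refusing files that are not in ODF format, could look like the sketch below. It relies on the ODF packaging convention that a file called mimetype is stored first and uncompressed in the ZIP archive, so that its name and content sit at fixed offsets; a package written by a tool that ignores this convention would be rejected. The is_odf name and the two demo files, which imitate only the first bytes of a package, are invented for this example.

```shell
# is_odf FILE: succeed when FILE looks like an ODF package, that is, when
# bytes 31-38 spell "mimetype" and the text right after them starts
# with the common prefix of the ODF media types.
is_odf() {
  name=$(dd if="$1" bs=1 skip=30 count=8 2>/dev/null | tr -d '\000')
  type=$(dd if="$1" bs=1 skip=38 count=35 2>/dev/null | tr -d '\000')
  [ "$name" = "mimetype" ] && [ "$type" = "application/vnd.oasis.opendocument." ]
}

# Demo on two throwaway files. fake.odt is NOT a valid archive: it only
# has a ZIP signature, 26 filler bytes, and the mimetype entry.
DEMO=$(mktemp -d)
{
  printf 'PK\003\004'
  printf '%026d' 0
  printf 'mimetypeapplication/vnd.oasis.opendocument.text'
} > "$DEMO/fake.odt"
printf 'just some text\n' > "$DEMO/notes.txt"

for F in "$DEMO/fake.odt" "$DEMO/notes.txt"
do
  if is_odf "$F"
  then
    echo "ODF: ${F##*/}"
  else
    echo "not ODF: ${F##*/}"
  fi
done
```

A call such as is_odf "$1" || exit near the top of either script would be enough to stop before anything is unzipped.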
Still, look at the result in Figure 7: A quick and dirty mix of a few standard Linux commands and utilities is all you need to analyze or produce automatically any number of perfectly valid documents with just the metadata you want (or don't want). Is this cool, or what?