Metadata in ODF Files
It is no secret that the native file format of LibreOffice and OpenOffice, the OpenDocument Format (ODF), is a truly open standard for word processing documents, spreadsheets, and presentations. What most people do not know is that ODF files contain lots of metadata that is very easy to read or modify.
Metadata means "data about data." The text messages you exchange using your phone, for example, are a form of data. The people with whom you exchange those messages, when, how often, from where, and so on are metadata about your messaging habits and connections.
Metadata is really important. I once heard French philosopher Bernard Stiegler observe that "the production of metadata has been the principal activity of those in power from the time of the proto-historical empires right up to today."
On a less philosophical and more practical level, lots of metadata is stored in your office documents, and you'll find many valid reasons for messing with the metadata in office files. This tutorial describes the most common of those reasons and offers a general approach to reading and writing metadata in ODF files – an approach that is quite easy and really extendable, because an ODF file is really just a standard ZIP archive of different kinds of plain text or image files.
Why Read and Write ODF Metadata?
Analyzing ODF metadata can help you work better and sometimes learn more about your organization than you thought possible. Editing the same metadata means controlling what everybody else knows about you. Together, these two procedures help to identify and fix many problems, from privacy and security to compliance and indexing. You may, among other things, automatically find, report, and "fix" (see below) ODF files that contain:
- Dangerous, obsolete, or redundant macros
- Information not compliant with your company policies
- Images containing location, author name, or other sensitive information
The raw metadata in ODF files can also be aggregated to create statistics, graphs, or reports about whole collections of documents or to feed the same data into some external database. Numeric data that may be averaged ranges from word counts to the number and overall duration of edits to each document. This, in turn, may facilitate both simple decisions ("which documents should be updated first?") and more complex ones ("is our team working in the most efficient way?").
On the editing side, you may do the following, for example:
- Normalize and complete metadata (e.g., insert missing author names or titles, all with the same spelling, or change company or department names after a reorganization)
- Hide sensitive data (e.g., remove authors or comments inserted for internal use before sharing documents online, as an ODF, or even as a PDF)
- Add or update disclaimers for compliance with new regulations or company rules
- Add custom properties for better indexing
- Give files names that match the title of the document (or vice versa)
- Insert watermarks into pictures
- Remove metadata from inside pictures
Methodology and Scope
In this tutorial, I introduce a relatively simple way to read or write ODF metadata that works even on systems where neither LibreOffice nor OpenOffice is installed, including systems running Windows or Mac OS. All you need is support for shell scripts and a few other command-line utilities like grep, sed, exiftool, and ImageMagick: They are all included, or installable as binary packages, on almost every Linux distribution. Besides, the approach to ODF metadata processing that you are going to learn can be useful in many other text-processing contexts.
When I say "introduce" or "approach," I mean that, while I provide working code, it is not a complete solution, but rather a collection of examples to use as inspiration and as building blocks for your own ODF metadata problems. One reason for this is that the mere printing of a script that could handle all possible cases with optimal performance would be longer than this whole article.
The other, more important reason is that almost nobody would need such a solution or "top" performance. ODF metadata hacks can save you many days of work, if not many weeks. They did for me. However, unless you really have to process thousands of files every day, you (like me) will only use these hacks in two ways:
- A few times a year, maybe in a different way every time
- Regularly, once per day or less, but as jobs that can run slowly in the background only on the files that have changed since the previous run
In scenarios like these, it is more efficient to put some code together quickly that just works, instead of optimizing it to death. What matters is knowing how to put that code together when the need suddenly arises.
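As a rough sketch of the second usage pattern (a background job that only touches files changed since the previous run), the standard find command can compare each file against a timestamp file. The function name changed_odf_files and the idea of keeping a "stamp" file are assumptions made for this example, not part of the scripts in this tutorial.

```shell
#!/bin/sh
# Hypothetical helper: list the ODF files under a directory that were
# modified after the timestamp file left behind by the previous run.
#   $1 = directory to scan
#   $2 = timestamp ("stamp") file; its modification time marks the last run
changed_odf_files() {
  find "$1" -type f \
    \( -name '*.odt' -o -name '*.ods' -o -name '*.odp' \) \
    -newer "$2"
}
```

A nightly job could loop over the output of changed_odf_files and then run touch on the stamp file, so that the next run starts from that moment.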
ODF Metadata
Mainly, there are two types of metadata in ODF files. The first consists of the data that you may read or set in the LibreOffice File | Properties tabs shown in Figures 1 to 4. Some of those variables are present in every ODF file, others only in certain types, but they are all saved in a file called meta.xml inside the ODF ZIP archive.
In addition to this, so to speak, "official" metadata, there is what I would call "hidden" metadata – metadata in, or about, the "non textual" content of an ODF document, which is mainly macros and images. I will now show you how to read, and then write, both types of ODF metadata.
A Simple ODF Metadata Reader
Listing 1 shows a script, called odfmetareader.sh, that follows the Unix philosophy of small tools that each do just one thing but can be connected in a pipeline. It just prints out, one per line, all the explicit and hidden metadata it finds in the single ODF file passed to it as an argument. Analysis of the output, or its insertion into some database or spreadsheet, is delegated to other tools. You can use this script inside a loop to work on as many files as you like, as shown later in the tutorial. Of course, you also can, and should, change the script to format its output to best suit your needs. The rest of this section explains how the code in Listing 1 works.
Listing 1
odfmetareader.sh
01 #! /bin/bash
02
03 rm -rf /tmp/odfmetareader
04 mkdir /tmp/odfmetareader
05 cp $1 /tmp/odfmetareader/odf.zip
06 cd /tmp/odfmetareader
07
08 unzip odf.zip >& /dev/null
09
10 echo "## METADATA DOC START for document $1;"
11 echo "## METADATA ODF START for document $1;"
12
13 # extract explicit ODF metadata
14
15 cat meta.xml | perl -e 'while (<>) {s/document-statistic//g; s/<(meta|dc):([^>]+)>/\n$2=/g; s/user-defined /user-defined-/g; s/<\/(meta|dc).*//g; s/ meta:value-type=/ value-type/g; s/ meta:/\n/g; s/\/=//g; s/<\/office:[^>]+>//g; print} print "\n"' | grep -v '<office:document' | grep -v '^<?xml version' | grep -v '^generator=' | grep '='
16
17 echo "## METADATA ODF END for document $1;"
18 echo
19
20 # extract metadata about macros
21 if [ -d "Basic" ]
22 then
23   echo "## METADATA MACRO START for document $1;"
24
25   MACRONUM=`find Basic -type f -name "*xml" | grep -v /script- | wc -l`
26
27   echo "macronumber=$MACRONUM"
28   for M in `find Basic -type f -name "*xml" | grep -v /script-`
29   do
30     echo macrofile:$M
31     grep 'sub ' $M
32   done
33   echo "## METADATA MACRO END for document $1;"
34   echo
35 fi
36
37 # extract metadata from images
38
39 if [ -d "Pictures" ]
40 then
41   for P in `find Pictures -type f`
42   do
43     N=`basename $P`
44     echo "## METADATA PICTURE START for document $1 / Picture $N;"
45     echo picturename: $N
46     exiftool $P | egrep '^(Artist|GPS)'
47     echo "## METADATA PICTURE END for document $1 / Picture $N;"
48   done
49 fi
50 # final cleanup
51
52 echo
53 echo "## METADATA DOC END for document $1;"
54 echo
55 #rm -rf /tmp/odfmetareader
56
57 exit
The overall flow is very simple: The script makes a copy of the given file and unzips it in the temporary folder /tmp/odfmetareader (lines 3-8). The final command on line 55 removes that folder, but I recommend leaving it commented out until you have figured out (by looking into that same folder) the internal structure of ODF files.
The central part of Listing 1 prints out the variables in the meta.xml file and two lists: one of macros and one of pictures, with all their own embedded metadata.
The echo commands containing the ## METADATA string (e.g., lines 10 and 11) all have the same purpose: They separate the output sections, which (one hopes) makes them more readable and easier to parse by other scripts.
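To illustrate why those markers help, here is a minimal sketch of a filter that prints only one section of the odfmetareader.sh output. The function name metadata_section is hypothetical; the only thing it relies on is the "## METADATA <KIND> START" and "## METADATA <KIND> END" lines that Listing 1 prints.

```shell
#!/bin/sh
# Hypothetical helper: read odfmetareader.sh output on standard input and
# print only the lines between the START and END markers of one section.
#   $1 = section kind, as written by Listing 1 (ODF, MACRO, or PICTURE)
metadata_section() {
  awk -v kind="$1" '
    # a START marker for the requested section opens it (marker not printed)
    index($0, "## METADATA " kind " START") == 1 { inside = 1; next }
    # the matching END marker closes it again
    index($0, "## METADATA " kind " END") == 1   { inside = 0 }
    # print every line seen while inside the section
    inside
  '
}
```

For example, piping the reader output through metadata_section MACRO should leave only the macronumber, macrofile, and sub lines.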
Line 15 extracts all the metadata from the meta.xml file. It does look like ancient Martian, but it is less obscure than it may seem at first sight. It is a concatenation of one long command in Perl and four invocations of the grep utility.
The Perl part is, basically, a series of regular expressions separated by semicolons that remove all the XML markup you don't need to see in the output. For example, this part
s/<\/(meta|dc).*//g;
replaces, with an empty string, every string that begins with </meta or </dc, plus all the characters that follow it until the end of the current line (that is what the .* part means). The four grep commands just remove header and footer lines in the XML file that don't contain any metadata. The best way to understand what line 15 actually does, and how to customize it for your needs, is to run the script on any ODF file and compare its output with the original content of the meta.xml file.
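If the Perl one-liner feels too dense as a starting point, the same idea can be sketched with grep and sed alone. This is a simplified alternative, not the method used in Listing 1: It only handles elements without attributes, such as <dc:title> or <meta:keyword>, and it assumes a grep that supports the -o option (GNU and BSD grep both do). The function name extract_simple_fields is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read meta.xml content on standard input and print
# name=value pairs for simple <dc:...> and <meta:...> elements.
# Elements that carry attributes (e.g., user-defined fields) are skipped.
extract_simple_fields() {
  # Step 1: isolate each <prefix:name>value</prefix:name> element on its own line
  grep -oE '<(dc|meta):[a-z-]+>[^<]*</(dc|meta):[a-z-]+>' |
  # Step 2: drop the markup, keeping only "name=value"
  sed -E 's/^<(dc|meta):([a-z-]+)>([^<]*)<\/.*$/\2=\3/'
}
```

Running cat meta.xml | extract_simple_fields inside the temporary folder is a quick way to compare this reduced output with what line 15 produces.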
Native macros in ODF files are stored, if present, inside the Basic folder of the ZIP archive, and line 21 checks if this folder exists. If it does, the script counts all the macro files inside the folder, saves that number in the variable MACRONUM, and prints it (lines 25-27). The loop in lines 28 to 32 finds and prints all the lines in the macro files that contain macro names.
The last block of the script, in lines 39 to 49, checks if a Pictures folder exists. If the answer is yes, it scans all the pictures inside it (line 41), prints their names (lines 43-45), and then runs the exiftool command on them (line 46). exiftool is free software capable of reading and writing all the metadata stored inside today's digital photographs that use Exif and other similar standards.
When given a file name, as in line 46, exiftool just prints all the metadata in that file, one per line. The egrep command in line 46 discards all lines, except those that begin with either Artist or GPS, probably the most sensitive data.
Listing 2 shows a small excerpt, heavily edited for clarity, of the odfmetareader.sh output from the sample ODF document shown in Figure 5, which contains one macro and one photograph.
Listing 2
odfmetareader Results
01 ## METADATA ODF START for document odf-sample-text.odt;
02 initial-creator=Marco Fioretti
03 creation-date=2018-07-22T17
04 date=2018-07-22T18:07
05 creator=Marco Fioretti
06 editing-duration=PT33M32S
07 editing-cycles=9
08 description=Let's see where all these metadata end up...
09 keyword=ODF
10 keyword=Metadata
11 keyword=text processing
12 keyword=text mining
13 subject=showing the way in which ODF format stores metadata
14 title=Just A Sample ODF Text Document
15 image-count="1"
16 word-count="81"
17 character-count="468"
18 user-defined-meta:name="Approved" value-type"boolean"=false
19 user-defined-meta:name="Status"=Confidential
20
21 ## METADATA MACRO START for document odf-sample-text.odt;
22 macronumber=1
23 macrofile:Basic/Standard/samplemodule.xml
24 sub Main
25 ## METADATA MACRO END for document odf-sample-text.odt;
26
27 ## METADATA PICTURE START for document odf-sample-text.odt / Picture sample-picture.jpg;
28 picturename: sample-picture.jpg
29 Artist : Marco Fioretti
30 GPS Latitude : 47 deg 30' 20.53" N
31 GPS Longitude : 19 deg 2' 43.75" E
Publishing ODF files online (or office files in general, probably) without "cleaning" them first may mean letting everybody know where, and by whom, each photograph contained in the file was taken (as shown, starting in line 27). Sometimes this is OK; sometimes it is not.
The macro section (lines 21-25) lists the number, location, and names of all the macros inside the document. The initial section (lines 1 to 19) is just a plain text version of the metadata shown in Figures 1 to 4. It is easy to imagine how many of the lines above, from editing cycles and duration to word count and keywords, may be filtered or fed to some other script to answer any kind of question.
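As a sketch of that kind of post-processing, the two functions below turn an editing-duration value into seconds and average the word-count lines of several reader runs. Both function names are invented for this example, and the duration parser assumes values in the form shown in Listing 2 (hours, minutes, and whole seconds after the T); days or fractional seconds are not handled.

```shell
#!/bin/sh
# Hypothetical helper: convert a duration such as PT33M32S to seconds.
#   $1 = value of the editing-duration field
duration_to_seconds() {
  # keep only the time part after the "T", then pick out H, M, and S
  printf '%s\n' "${1#*T}" | awk '{
    h = 0; m = 0; s = 0
    if (match($0, /[0-9]+H/)) h = substr($0, RSTART, RLENGTH - 1)
    if (match($0, /[0-9]+M/)) m = substr($0, RSTART, RLENGTH - 1)
    if (match($0, /[0-9]+S/)) s = substr($0, RSTART, RLENGTH - 1)
    print h * 3600 + m * 60 + s
  }'
}

# Hypothetical helper: read the concatenated reader output of many documents
# on standard input and print the average word count (truncated to an integer).
average_word_count() {
  # lines look like word-count="81", so the number is the second "-separated field
  awk -F'"' '/^word-count=/ { sum += $2; n++ }
             END { if (n) printf "%d\n", sum / n }'
}
```

With the sample in Listing 2, duration_to_seconds PT33M32S prints 2012, that is, 33 minutes and 32 seconds.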
As an example, the following lines show how you may discover which ODF files in a whole directory tree have Linux Magazine as the creator:
for F in `find . -type f | egrep '(odt|ods|odp)$'`
do
  FOUND=`odfmetareader $F | grep -i ^creator | grep -i -c 'Linux Magazine'`
  if [ $FOUND -gt 0 ]   # = "there was at least one line with that string"
  then
    echo found $F
  fi
done
Writing ODF Metadata
Extracting metadata from ODF files is great. Being able to erase or modify it is even better. You can learn how to do so by playing with the odfmetawriter script in Listing 3, which was written expressly for teaching purposes. For simplicity, it performs only one operation per run, always in the same way: It extracts the file(s) that must be changed, processes them, and then puts them back in a copy of the zipped ODF file. To give you an idea of how you might alter both explicit and "hidden" ODF metadata, the script can do the following:
Listing 3
odfmetawriter.sh
01 #! /bin/bash
02
03 if [ ! -e "$1" ]
04 then
05   echo "script launched on non-existing file: $1; aborting"
06   exit
07 fi
08
09 STARTINGDIR=`pwd`
10
11 rm -rf /tmp/odfmetawriter
12 mkdir /tmp/odfmetawriter
13 cp $1 /tmp/odfmetawriter/odf.zip
14 cp $1 /tmp/odfmetawriter/new-$1
15 cd /tmp/odfmetawriter
16
17 unzip odf.zip >& /dev/null
18 cp meta.xml meta.orig.xml
19
20 case "$2" in
21   creator|title|description)
22     echo "Changing $2 to: $3"
23     sed -i -- "s/<dc:$2>.*<\/dc:$2>/<dc:$2>$3<\/dc:$2>/" meta.xml
24     zip -f new-$1 meta.xml
25     ;;
26
27   addkeyword)
28     sed -i -- "s/<meta:keyword>/<meta:keyword>$3<\/meta:keyword><meta:keyword>/" meta.xml
29     zip -f new-$1 meta.xml
30     ;;
31
32   addcustom)
33     sed -i -- "s/<meta:user-defined/<meta:user-defined meta:name=\"$3\">$4<\/meta:user-defined><meta:user-defined/" meta.xml
34     zip -f new-$1 meta.xml
35     ;;
36
37   renamefromtitle)
38     EXT="${1##*.}"
39     TITLE=`perl -e 'while (<>) {next unless m/.*<dc:title>(.*)<\/dc:title>/; $T = $1;} $T =~ s/\W+/-/g; print $T' meta.xml`
40     mv -i new-$1 $STARTINGDIR/$TITLE.$EXT
41     exit
42     ;;
43
44   watermark)
45     if [ -d "Pictures" ]
46     then
47       for P in `find Pictures -type f`
48       do
49         convert $P -font Arial -pointsize 60 -draw "gravity center fill yellow text 1,11 '$3' " temp-watermarked
50         mv temp-watermarked $P
51         zip -f new-$1 $P
52       done
53     else
54       echo "No Pictures in this ODF Document!"
55       exit
56     fi
57     ;;
58
59   removepicsdata)
60     if [ -d "Pictures" ]
61     then
62       for P in `find Pictures -type f`
63       do
64         exiftool -all= $P
65         zip -f new-$1 $P
66       done
67     else
68       echo "No Pictures in this ODF Document!"
69       exit
70     fi
71     ;;
72
73   *)
74     echo "unknown or unsupported option, please retry: $2;"
75     rm -rf /tmp/odfmetawriter
76     exit
77     ;;
78 esac
79
80 mv -i new-$1 $STARTINGDIR/
81
82 #rm -rf /tmp/odfmetawriter
83
84 exit
- Rewrite title, creator, or description
- Add an extra keyword
- Add a custom field
- Rename the file to match the document title
- Insert a textual watermark in all pictures
- Remove Exif data from pictures
The script must always be launched in the same way:
#> odfmetawriter <ODF-file-name> <operation> <options>
The beginning and end are almost the same as in odfmetareader: Create a temporary folder, work inside it, and remove it when done. Pay attention to line 14, though, which makes a copy of the file passed as an argument with the new- prefix: It is this file that will be "filled" with the new metadata and eventually (line 80) moved to the directory from which the script was launched.
The core of the script is the case statement (lines 20-78). It has seven branches: one for each of the operations listed above and a final one (lines 73-77) that exits with an error message in all other cases.
Lines 21 to 35 all do the same thing: They update or add a variable in the meta.xml file.
If the variable passed as a second argument ($2) is creator, title, or description, the first branch (lines 21-25) of the case statement finds the corresponding variable and, using the sed command, replaces its value with the string passed as the third argument.
The two other branches add a keyword (with a value equal to $3) or a custom field (with a name equal to $3 and a value equal to $4) when $2 is equal to addkeyword or, respectively, addcustom. They work almost in the same way as the first one, with the only difference being that they prepend the XML markup defining the new variable to the other variables of the same kind.
In all cases, after the meta.xml file has been "updated," it is put back in the copy of the ODF file (lines 24, 29, and 34).
The fourth supported operation does not change anything in the file. When the $2 parameter is equal to renamefromtitle, the script:
- Takes note of the original file extension (EXT, line 38)
- Uses Perl to extract the title string from meta.xml, replace all of its non-alphanumeric characters with single dashes (line 39), and save the result in the TITLE variable
- Saves the copy of the original file, with the name TITLE.EXT, in the original directory (line 40)
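The title-to-file-name step can also be sketched without Perl. The function below is an illustrative alternative to line 39, not a copy of it: The name title_to_filename is made up, and trimming leading and trailing dashes is an extra that Listing 3 does not perform.

```shell
#!/bin/sh
# Hypothetical helper: build a file name from a document title.
#   $1 = document title
#   $2 = file extension, without the dot
title_to_filename() {
  slug=$(printf '%s' "$1" | sed -E \
    -e 's/[^[:alnum:]_]+/-/g' \
    -e 's/^-+//' \
    -e 's/-+$//')
  # first expression: every run of non-word characters becomes one dash
  # second and third: remove dashes left at the start or end of the name
  printf '%s.%s\n' "$slug" "$2"
}
```

Called as title_to_filename 'New title for Linux Magazine' odt, it prints New-title-for-Linux-Magazine.odt.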
The last two operations supported by odfmetawriter are insertion of the textual watermark passed as the third parameter inside all the pictures (lines 44-57) and removal of all Exif metadata from the same pictures (lines 59-71).
The watermark is inserted with ImageMagick's convert tool. The code in line 49 is copied almost verbatim from the relevant ImageMagick documentation [1]. Line 64, instead, tells exiftool to delete all the metadata tags in the current picture [2]. As before, the modified pictures ($P) are zipped back in the right place, in the copy of the original document (lines 51 and 65).
Running the following commands, in sequence, on the sample ODF document shown in Figure 6
#> odfmetawriter odf-sample.odt title 'New title for Linux Magazine'
#> odfmetawriter odf-sample.odt description 'Here is an ODT file with its metadata changed by a script'
#> odfmetawriter odf-sample.odt addkeyword 'ODF metadata processing'
#> odfmetawriter odf-sample.odt renamefromtitle
#> odfmetawriter New-title-for-Linux-Magazine.odt watermark 'Watermarked for Linux Magazine'
produces the results shown in Figure 7. (For simplicity, the renaming commands after each operation have been omitted.) As you can see for yourself, the metadata has the new values, and the picture is properly watermarked. Isn't ODF great to hack?
Code Limits
I already said this, but let me repeat it: The two scripts above do work, but they are not perfect or robust. As a minimum, they would need extra checks to refuse input files not in ODF format and to properly handle non-alphabetic languages or strings with quotes inside them. In odfmetawriter, for example, addcustom will fail if there isn't already at least one custom field present. Also, odfmetawriter does not change the initial-creator of an ODF file. Another issue is dates: It is trivial to alter dates in the meta.xml file, but unless you do it right, you will end up with inconsistent documents (e.g., having ODF files with last-modified timestamps that are earlier than some of the revisions they contain). Finally, neither script is optimized for performance.
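One of those limits, new values that contain characters special to XML or to sed, can be worked around by escaping the value twice before substituting it. The sketch below shows the idea for the title only; all three function names are hypothetical, set_title works on meta.xml content passed through standard input rather than on the file itself, and multi-line values are still not handled.

```shell
#!/bin/sh
# Hypothetical helper: escape the characters that are special in XML text.
xml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper: escape backslash, slash, and ampersand, which are
# special in the replacement part of a sed "s/.../.../" command.
sed_escape_replacement() {
  printf '%s' "$1" | sed -e 's,[\\/&],\\&,g'
}

# Hypothetical helper: read meta.xml content on standard input and print it
# with the title replaced by $1, whatever characters $1 contains.
set_title() {
  safe=$(sed_escape_replacement "$(xml_escape "$1")")
  sed -e "s/<dc:title>.*<\/dc:title>/<dc:title>$safe<\/dc:title>/"
}
```

The same two escaping steps could be applied to $3 in lines 23, 28, and 33 of Listing 3 before the sed commands run.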
Still, look at the result in Figure 7: A quick and dirty mix of a few standard Linux commands and utilities is all you need to analyze or produce automatically any number of perfectly valid documents with just the metadata you want (or don't want). Is this cool, or what?
Final Thoughts and Warnings and a Request
In general, metadata hacking has issues that have nothing to do with the code or with ODF, as such. As Spider-Man's Uncle Ben would put it (and Voltaire did), "With great power comes great responsibility." Years ago, in a discussion over this same topic, someone commented "maybe we shouldn't teach our documents lying." Use the techniques you learned here responsibly. Be aware that digital signatures are the only way to guarantee that no part of an ODF file has been modified.
Last, but not least, other parts of an ODF file also contain stuff that maybe should count as metadata, even if some people (including me, to some extent) may disagree: I'm talking of multiple revisions, but also of hidden paragraphs (or cells in spreadsheets), and of the content, author, and timestamps of embedded comments. All of this stuff may still be analyzed or "updated" with the same general approach presented here, thanks to the ODF format's openness and simplicity, but that is a different problem left as an exercise for the reader, with the suggestion that you use my ODF scripting examples [3] as a basis.
What's left? The request, of course: Please share how you use or modify these scripts for your own ODF metadata processing!
Infos
[1] Watermarking: http://www.imagemagick.org/Usage/annotating/#wmark_text
[2] Removing Exif metadata: http://www.linux-magazine.com/Online/Blogs/Productivity-Sauce/Remove-EXIF-Metadata-from-Photos-with-exiftool
[3] ODF scripting: http://freesoftware.zona-m.net/tag/odf-scripting