Metadata in ODF Files

It is no secret that the native file format of LibreOffice and OpenOffice, the OpenDocument Format (ODF), is a truly open standard for word processing documents, spreadsheets, and presentations. What most people do not know is that ODF files contain lots of metadata that is very easy to read or modify.

Metadata means "data about data." The text messages you exchange using your phone, for example, are a form of data. The people with whom you exchange those messages, when, how often, from where, and so on are metadata about your messaging habits and connections.

Metadata is really important. I once heard French philosopher Bernard Stiegler observe that "the production of metadata has been the principal activity of those in power from the time of the proto-historical empires right up to today."

On a less philosophical and more practical level, lots of metadata is stored in your office documents, and you'll find many valid reasons for messing with the metadata in office files. This tutorial describes the most common of those reasons and offers a general approach to reading and writing metadata in ODF files – an approach that is quite easy and really extendable, because an ODF file is really just a standard ZIP archive of different kinds of plain text or image files.

Why Read and Write ODF Metadata?

Analyzing ODF metadata can help you work better and sometimes learn more about your organization than you thought possible. Editing the same metadata means controlling what everybody else knows about you. Together, these two procedures help to identify and fix many problems, from privacy and security to compliance and indexing. You may, among other things, automatically find, report, and "fix" (see below) ODF files that contain:

  • Dangerous, obsolete, or redundant macros
  • Information not compliant with your company policies
  • Images containing location, author name, or other sensitive information

The raw metadata in ODF files can also be aggregated to create statistics, graphs, or reports about whole collections of documents or to feed the same data into some external database. Numeric data that may be averaged ranges from word counts to the number and overall duration of edits to each document. This, in turn, may facilitate both simple decisions ("which documents should be updated first?") and more complex ones ("is our team working in the most efficient way?").

On the editing side, you may do the following, for example:

  • Normalize and complete metadata (e.g., insert missing author names or titles, all with the same spelling, or change company or department names after a reorganization)
  • Hide sensitive data (e.g., remove authors or comments inserted for internal use before sharing documents online, as an ODF, or even as a PDF)
  • Add or update disclaimers for compliance with new regulations or company rules
  • Add custom properties for better indexing
  • Give files names that match the title of the document (or vice versa)
  • Insert watermarks into pictures
  • Remove metadata from inside pictures

Methodology and Scope

In this tutorial, I introduce a relatively simple way to read or write ODF metadata that works even on systems where LibreOffice or OpenOffice are not installed, including systems running Windows or Mac OS. All you need is support for shell scripts and a few other command-line utilities like grep, sed, exiftool, and ImageMagick: they are all included, or installable as binary packages, on almost every Linux distribution. Besides, this ODF metadata processing approach that you are going to learn can be useful in many other text-processing contexts.

When I say "introduce" or "approach," I mean that, while I provide working code, it is not a complete solution, but rather a collection of examples to use as inspiration and as building blocks for your own ODF metadata problems. One reason for this is that the mere printing of a script that could handle all possible cases with optimal performance would be longer than this whole article.

The other, more important reason is that almost nobody would need such a solution or "top" performance. ODF metadata hacks can save you many days of work, if not many weeks. They did for me. However, unless you really have to process thousands of files every day, you (like me) will only use these hacks in two ways:

  • A few times a year, maybe in a different way every time
  • Regularly, once per day or less, but as jobs that can run slowly in the background only on the files that have changed since the previous run

In scenarios like these, it is more efficient to put some code together quickly that just works, instead of optimizing it to death. What matters is knowing how to put that code together when the need suddenly arises.

ODF Metadata

Mainly, there are two types of metadata in ODF files. The first consists of the data that you may read or set in the LibreOffice File | Properties tabs shown in Figures 1 to 4. Some of those variables are present in every ODF file, others only in certain types, but they are all saved in a file called meta.xml inside the ODF ZIP archive.

Figure 1: Almost all the metadata stored in the meta.xml component of any ODF file is accessible through the File | Properties tab in LibreOffice or OpenOffice. These are the general variables.
Figure 2: These are the descriptive metadata variables.
Figure 3: Users can also add custom metadata fields of several types, as they like.
Figure 4: Some metadata, especially that in the Statistics category, is only defined for certain types of ODF files. Number of Sheets and Number of Cells, for example, only exist for spreadsheets.

In addition to this, so to speak, "official" metadata, there is what I would call "hidden" metadata – metadata in, or about, the "non textual" content of an ODF document, which is mainly macros and images. I will now show you how to read, and then write, both types of ODF metadata.

A Simple ODF Metadata Reader

Listing 1 shows a script, called odfmetareader.sh, that follows the Unix philosophy of small tools that each do just one thing but can be connected in a pipeline. It just prints out, one per line, all the explicit and hidden metadata it finds in the single ODF file passed to it as an argument. Analysis of the output, or its insertion into some database or spreadsheet, is delegated to other tools. You can use this script inside a loop to work on as many files as you like, as shown later in the tutorial. Of course, you also can, and should, change the script to format its output to best suit your needs.

Listing 1

odfmetareader.sh

01 #! /bin/bash
02
03 rm -rf /tmp/odfmetareader
04 mkdir  /tmp/odfmetareader
05 cp $1  /tmp/odfmetareader/odf.zip
06 cd     /tmp/odfmetareader
07
08 unzip odf.zip >& /dev/null
09
10 echo "## METADATA DOC START     for document $1;"
11 echo "## METADATA ODF START     for document $1;"
12
13 # extract explicit ODF metadata
14
15 cat meta.xml | perl -e 'while (<>) {s/document-statistic//g; s/<(meta|dc):([^>]+)>/\n$2=/g; s/user-defined /user-defined-/g; s/<\/(meta|dc).*//g; s/ meta:value-type=/ value-type/g; s/ meta:/\n/g; s/\/=//g; s/<\/office:[^>]+>//g; print} print "\n"' | grep -v '<office:document' | grep -v '^<?xml version' | grep -v '^generator=' | grep '='
16
17 echo "## METADATA ODF END       for document $1;"
18 echo
19
20 # extract metadata about macros
21 if [ -d "Basic" ]
22 then
23   echo "## METADATA MACRO START   for document $1;"
24
25   MACRONUM=`find Basic -type f -name "*xml" | grep -v /script- | wc -l`
26
27   echo "macronumber=$MACRONUM"
28   for M in `find Basic -type f -name "*xml" | grep -v /script-`
29   do
30   echo macrofile:$M
31   grep 'sub ' $M
32   done
33   echo "## METADATA MACRO END     for document $1;"
34   echo
35 fi
36
37 # extract metadata from images
38
39 if [ -d "Pictures" ]
40 then
41   for P in `find Pictures -type f`
42   do
43   N=`basename $P`
44   echo "## METADATA PICTURE START for document $1 / Picture $N;"
45   echo picturename: $N
46   exiftool $P | egrep '^(Artist|GPS)'
47   echo "## METADATA PICTURE END   for document $1 / Picture $N;"
48   done
49 fi
50 # final cleanup
51
52 echo
53 echo "## METADATA DOC END       for document $1;"
54 echo
55 #rm -rf /tmp/odfmetareader
56
57 exit

The overall flow is very simple: The script makes a copy of the given file and unzips it in the temporary folder /tmp/odfmetareader (lines 3-8). The final command on line 55 removes that folder, but I recommend leaving it commented until you have figured out (by looking into that same folder) the internal structure of ODF files.
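A fixed path like /tmp/odfmetareader means that two runs at the same time would overwrite each other's files. One possible variation, which is not part of Listing 1, is to let mktemp create a unique work folder and have a trap remove it when the script ends. The WORKDIR variable and the odfmeta prefix are names I made up for this sketch:

```shell
#!/bin/bash
# Sketch: a unique scratch folder instead of the fixed /tmp/odfmetareader.
# WORKDIR and the "odfmeta" prefix are illustrative names, not part of Listing 1.

WORKDIR=$(mktemp -d "${TMPDIR:-/tmp}/odfmeta.XXXXXX") || exit 1

# Remove the folder when the script ends, however it ends.
# While you are still studying the ODF internals, comment this line out,
# just like line 55 of Listing 1.
trap 'rm -rf "$WORKDIR"' EXIT

echo "working in $WORKDIR"
```

The rest of the script would then copy and unzip the ODF file inside $WORKDIR rather than in the hard-coded folder.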

The central part of Listing 1 prints out the variables in the meta.xml file and two lists: one of macros and one of pictures, with all their own embedded metadata.

The echo commands containing the ## METADATA string (e.g., lines 10 and 11) all have the same purpose: They separate the output sections, which (one hopes) makes the output more readable and easier for other scripts to parse.
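As a minimal sketch of what "easier to parse" can mean, the marker lines let sed cut one section out of the output. The extract_section helper is hypothetical, and the sample text only imitates the format the script prints:

```shell
#!/bin/bash
# Sketch: pull one section out of the reader's output by using the
# "## METADATA ... START/END" marker lines.
# extract_section is a hypothetical helper, not part of Listing 1.

extract_section() {
  # $1 = section name (ODF, MACRO, or PICTURE); input comes from stdin
  sed -n "/^## METADATA $1 START/,/^## METADATA $1 END/p" |
    grep -v '^## METADATA'
}

# Sample input shaped like the script's output
SAMPLE='## METADATA ODF START     for document sample.odt;
title=Just A Sample ODF Text Document
## METADATA ODF END       for document sample.odt;
## METADATA MACRO START   for document sample.odt;
macronumber=1
macrofile:Basic/Standard/samplemodule.xml
## METADATA MACRO END     for document sample.odt;'

printf '%s\n' "$SAMPLE" | extract_section MACRO
```

With the sample above, only the two lines between the MACRO markers are printed.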

Line 15 extracts all the metadata from the meta.xml file. It does seem like ancient Martian, but it is less obscure than it may seem at first sight. It is a concatenation of one long command in Perl and four invocations of the grep utility.

The Perl part is, basically, a series of regular expressions separated by semicolons that remove all the XML markup you don't need to see in the output. For example, this part

s/<\/(meta|dc).*//g;

replaces, with an empty string, every string that begins with </meta or </dc, plus all the characters that follow it until the end of the current line (that is what the .* part means). The four grep commands just remove header and footer lines in the XML file that don't contain any metadata. The best way to understand what line 15 actually does, and how to customize it for your needs, is to run the script on any ODF file and compare its output with the original content of the meta.xml file.
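If line 15 is more than you need, a narrower filter may be easier to start from. The following sketch, which is an alternative and not a rewrite of line 15, prints only the Dublin Core (dc:) elements as name=value pairs. The dc_fields name and the sample XML are invented for the demonstration, and the pattern assumes that the values contain no nested markup:

```shell
#!/bin/bash
# Sketch: print only the <dc:...> elements of meta.xml as name=value pairs.
# dc_fields is a hypothetical helper; it reads the XML from stdin.

dc_fields() {
  # Put each <dc:name>value</dc:name> element on its own line,
  # then turn the opening tag into "name=" and drop the closing tag
  grep -o '<dc:[a-z-]*>[^<]*</dc:[a-z-]*>' |
    sed -e 's/^<dc:\([a-z-]*\)>/\1=/' -e 's/<\/dc:[a-z-]*>$//'
}

# Invented sample: like LibreOffice, it keeps all elements on one long line
SAMPLE='<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta><office:meta><dc:title>Just A Sample ODF Text Document</dc:title><dc:creator>Marco Fioretti</dc:creator><meta:keyword>ODF</meta:keyword></office:meta></office:document-meta>'

printf '%s\n' "$SAMPLE" | dc_fields
```

On a real document, you would run it as dc_fields < meta.xml inside the unzipped folder.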

Native macros in ODF files are stored, if present, inside the Basic folder of the ZIP archive, and line 21 checks if this folder exists. If it does, the script counts all the macro files inside the folder, saves the count in the variable MACRONUM, and prints it (lines 25-27). The loop in lines 28 to 32 finds and prints all the lines in the macro files that contain macro names.

The last loop of the script, in lines 39 to 49, checks if a Pictures folder exists. If the answer is yes, it scans all the pictures inside it (line 41), prints their names (lines 43-45), and then runs the exiftool command on them (line 46). exiftool is free software capable of reading and writing all the metadata stored inside today's digital photographs that use Exif and other similar standards.

When given a file name, as in line 46, exiftool just prints all the metadata in that file, one per line. The egrep command in line 46 discards all lines, except those that begin with either Artist or GPS, probably the most sensitive data.
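Which tags count as sensitive depends on your policies, so the pattern in line 46 is the first thing you may want to extend. The sketch below applies a slightly longer pattern to canned text, so it runs even without exiftool. The sample lines only imitate the "Tag : value" layout and are invented, and whether tags such as Copyright or Camera Model Name appear at all depends on the picture:

```shell
#!/bin/bash
# Sketch: the filter from line 46 with two extra tag names, applied to
# canned text. sensitive_tags is a hypothetical helper.
# grep -E is the current spelling of egrep.

sensitive_tags() {
  grep -E '^(Artist|GPS|Copyright|Camera Model Name)'
}

# Invented sample in the layout exiftool uses for its output
sensitive_tags <<'EOF'
File Name                       : sample-picture.jpg
Camera Model Name               : ExampleCam 100
Artist                          : Marco Fioretti
Image Width                     : 1024
GPS Latitude                    : 47 deg 30' 20.53" N
EOF
```

In the script itself, the equivalent would be to pipe the exiftool output into this function instead of into egrep.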

Listing 2 shows a small excerpt, heavily edited for clarity, of the odfmetareader.sh output from the sample ODF document shown in Figure 5, which contains one macro and one photograph.

Listing 2

odfmetareader Results

01 ## METADATA ODF START     for document odf-sample-text.odt;
02 initial-creator=Marco Fioretti
03 creation-date=2018-07-22T17
04 date=2018-07-22T18:07
05 creator=Marco Fioretti
06 editing-duration=PT33M32S
07 editing-cycles=9
08 description=Let's see where all these metadata end up...
09 keyword=ODF
10 keyword=Metadata
11 keyword=text processing
12 keyword=text mining
13 subject=showing the way in which ODF format stores metadata
14 title=Just A Sample ODF Text Document
15 image-count="1"
16 word-count="81"
17 character-count="468"
18 user-defined-meta:name="Approved" value-type"boolean"=false
19 user-defined-meta:name="Status"=Confidential
20
21 ## METADATA MACRO START   for document odf-sample-text.odt;
22 macronumber=1
23 macrofile:Basic/Standard/samplemodule.xml
24 sub Main
25 ## METADATA MACRO END     for document odf-sample-text.odt;
26
27 ## METADATA PICTURE START for document odf-sample-text.odt / Picture sample-picture.jpg;
28 picturename: sample-picture.jpg
29 Artist                          : Marco Fioretti
30 GPS Latitude                    : 47 deg 30' 20.53" N
31 GPS Longitude                   : 19 deg 2' 43.75" E

Figure 5: Basic macros in ODF documents can be organized in groups, which correspond to subfolders in the Basic folder of an ODF file. The macro in this figure will be saved in the file Basic/Standard/sample.xml.

Publishing ODF files online (or office files in general, probably) without "cleaning" them first may mean letting everybody know where, and by whom, each photograph contained in the file was taken (as shown, starting in line 27). Sometimes this is OK; sometimes it is not.

The macro section (lines 21-25), as noted, lists the number, location, and names of all the macros inside the document. The initial section (lines 1 to 19) is just a plain text version of the metadata shown in Figures 1 to 4. It is easy to imagine how many of the lines above, from editing cycles and duration to word count and keywords, may be filtered or fed to some other script to answer all kinds of questions.
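For instance, before you can average word counts or editing times, the values must become plain numbers. The two helpers below are hypothetical and assume the exact line format shown in Listing 2. The duration helper only understands the PT...H...M...S form with whole numbers:

```shell
#!/bin/bash
# Sketch: turn two lines of the reader's output into plain numbers.
# Both helpers are hypothetical; the input format is the one in Listing 2.

# word-count="81"  ->  81
word_count() {
  sed -n 's/^word-count="\([0-9]*\)"$/\1/p'
}

# editing-duration=PT33M32S  ->  seconds
# (handles optional whole-number H, M, and S parts only)
duration_seconds() {
  sed -n 's/^editing-duration=PT//p' |
    awk '{
      h = m = s = 0
      if (match($0, /[0-9]+H/)) h = substr($0, RSTART, RLENGTH - 1)
      if (match($0, /[0-9]+M/)) m = substr($0, RSTART, RLENGTH - 1)
      if (match($0, /[0-9]+S/)) s = substr($0, RSTART, RLENGTH - 1)
      print h * 3600 + m * 60 + s
    }'
}

SAMPLE='editing-duration=PT33M32S
editing-cycles=9
word-count="81"'

printf '%s\n' "$SAMPLE" | word_count        # prints 81
printf '%s\n' "$SAMPLE" | duration_seconds  # prints 2012
```

Numbers in this form can go straight into a spreadsheet, or into another awk command that sums or averages them.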

As an example, the following lines show how you may discover which ODF files in a whole directory tree have Linux Magazine as the creator:

for F in `find . -type f | egrep '\.(odt|ods|odp)$'`
  do
  FOUND=`odfmetareader $F | grep -i '^creator' | grep -i -c 'Linux Magazine'`
  if [ "$FOUND" -gt 0 ]
    then # = "there was at least one line with that string"
    echo found $F
  fi
done
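A loop over the output of find breaks as soon as a file name contains a space. If your archive has such names, one possible variation is to let find separate the names with NUL characters, as sketched below. The has_creator and list_odf_files helpers are hypothetical, the sketch assumes that the reader script is on your PATH as odfmetareader.sh, and Listing 1 itself would also need "$1" quoted (e.g., on line 5) before such names work all the way through:

```shell
#!/bin/bash
# Sketch: the same search, written so that file names with spaces survive.
# has_creator and list_odf_files are hypothetical helpers.
# Assumption: the reader script is installed on the PATH as odfmetareader.sh.

has_creator() {
  # $1 = ODF file, $2 = creator to look for (case-insensitive)
  odfmetareader.sh "$1" | grep -i '^creator=' | grep -qi -- "$2"
}

list_odf_files() {
  # $1 = directory to scan; prints NUL-separated paths
  find "$1" -type f \( -name '*.odt' -o -name '*.ods' -o -name '*.odp' \) -print0
}

SCANDIR="${1:-.}"   # directory to scan, current one by default

list_odf_files "$SCANDIR" | while IFS= read -r -d '' F
do
  if has_creator "$F" 'Linux Magazine'
  then
    echo "found $F"
  fi
done
```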

Writing ODF Metadata

Extracting metadata from ODF files is great. Being able to erase or modify it is even better. You can learn how to do so by playing with the odfmetawriter script in Listing 3, which was written specifically for teaching purposes. For simplicity, it performs only one operation per run, always in the same way: It extracts the file(s) that must be changed, processes them, and then puts them back in a copy of the zipped ODF file. To give you an idea of how you might alter both explicit and "hidden" ODF metadata, the script can do the following:

Listing 3

odfmetawriter.sh

01 #! /bin/bash
02
03 if [ ! -e "$1" ]
04 then
05   echo "script launched on non-existing file: $1; aborting"
06   exit
07 fi
08
09 STARTINGDIR=`pwd`
10
11 rm -rf /tmp/odfmetawriter
12 mkdir /tmp/odfmetawriter
13 cp $1 /tmp/odfmetawriter/odf.zip
14 cp $1 /tmp/odfmetawriter/new-$1
15 cd    /tmp/odfmetawriter
16
17 unzip odf.zip >& /dev/null
18 cp meta.xml meta.orig.xml
19
20 case "$2" in
21   creator|title|description)
22   echo "Changing $2 to: $3"
23   sed -i -- "s/<dc:$2>.*<\/dc:$2>/<dc:$2>$3<\/dc:$2>/" meta.xml
24   zip -f new-$1 meta.xml
25   ;;
26
27   addkeyword)
28   sed -i -- "s/<meta:keyword>/<meta:keyword>$3<\/meta:keyword><meta:keyword>/" meta.xml
29   zip -f new-$1 meta.xml
30   ;;
31
32   addcustom)
33   sed -i -- "s/<meta:user-defined/<meta:user-defined meta:name=\"$3\">$4<\/meta:user-defined><meta:user-defined/" meta.xml
34   zip -f new-$1 meta.xml
35   ;;
36
37   renamefromtitle)
38   EXT="${1##*.}"
39   TITLE=`perl -e  'while (<>) {next unless m/.*<dc:title>(.*)<\/dc:title>/; $T = $1;} $T =~ s/\W+/-/g; print $T' meta.xml`
40   mv -i new-$1 $STARTINGDIR/$TITLE.$EXT
41   exit
42   ;;
43
44   watermark)
45     if [ -d "Pictures" ]
46   then
47     for P in `find Pictures -type f`
48     do
49     convert $P  -font Arial -pointsize 60 -draw "gravity center   fill yellow  text 1,11 '$3' " temp-watermarked
50     mv temp-watermarked $P
51     zip -f new-$1 $P
52     done
53   else
54     echo "No Pictures in this ODF Document!"
55     exit
56   fi
57   ;;
58
59   removepicsdata)
60     if [ -d "Pictures" ]
61   then
62     for P in `find Pictures -type f`
63     do
64     exiftool -all= $P
65     zip -f new-$1 $P
66     done
67   else
68     echo "No Pictures in this ODF Document!"
69     exit
70   fi
71   ;;
72
73   *)
74   echo "unknown or unsupported option, please retry: $2;"
75   rm -rf /tmp/odfmetawriter
76   exit
77   ;;
78 esac
79
80 mv -i new-$1 $STARTINGDIR/
81
82 #rm -rf /tmp/odfmetawriter
83
84 exit

  • Rewrite title, creator, or description
  • Add an extra keyword
  • Add a custom field
  • Rename the file to match the document title
  • Insert a textual watermark in all pictures
  • Remove Exif data from pictures

The script must be launched always in the same way:

#> odfmetawriter <ODF-file-name> <operation> <options>

The beginning and end are almost the same as in odfmetareader: Create a temporary folder, work inside it, and remove it when done. Pay attention to line 14, though, which makes a copy of the file passed as an argument with the new- prefix: It is this file that will be "filled" with the new metadata and eventually (line 80) moved to the directory from which the script was launched.

The core of the script is the case statement (lines 20-78). It has seven branches: one for each of the operations listed above and a final one (lines 73-77) that exits with an error message in all other cases.

Lines 21 to 35 all do the same thing – that is, update or add a variable in the meta.xml file.

If the variable passed as a second argument ($2) is creator, title, or description, the first branch (lines 21-25) of the case statement finds the corresponding variable and, using the sed command, replaces its value with the string passed as the third argument.
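Because the new value is pasted straight into the sed expression, a slash or an ampersand in it would either break the command or be misread by sed, and a bare & or < would make meta.xml invalid XML. A minimal sketch of one way around this is to escape the value first. The escape_value and set_dc_field helpers are hypothetical and work on standard input instead of editing meta.xml in place:

```shell
#!/bin/bash
# Sketch: the substitution from line 23, with the new value escaped first.
# escape_value and set_dc_field are hypothetical helpers.

escape_value() {
  # Step 1: make the text XML-safe (& and <)
  # Step 2: protect /, & and \ from sed's replacement syntax
  printf '%s' "$1" |
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' |
    sed -e 's|[/&\\]|\\&|g'
}

set_dc_field() {
  # $1 = field (creator, title, or description), $2 = new value
  # XML comes from stdin, the result goes to stdout
  local value
  value=$(escape_value "$2")
  sed "s/<dc:$1>.*<\/dc:$1>/<dc:$1>$value<\/dc:$1>/"
}

SAMPLE='<office:meta><dc:title>Old title</dc:title></office:meta>'
printf '%s\n' "$SAMPLE" | set_dc_field title 'R&D / Q3 report'
```

With the sample above, the title element ends up containing R&amp;D / Q3 report, which an XML parser reads back as the original string.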

The two other branches add keywords or custom fields (with a value equal to $3) when $2 is equal to addkeyword or, respectively, addcustom. They work almost in the same way as the first one, with the only difference being that they prepend the XML markup defining the new variable to the other variables of the same kind.

In all cases, after the meta.xml file has been "updated," it is put back in the copy of the ODF file (lines 24, 29, and 34).
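Prepending to an existing element only works when such an element is already there. One possible fallback, sketched here for keywords, is to insert the new element just before the closing office:meta tag when no keyword exists yet. The add_keyword helper is hypothetical, and it assumes that the new keyword contains no characters special to sed or XML:

```shell
#!/bin/bash
# Sketch: add a keyword even when meta.xml has none yet.
# add_keyword is a hypothetical helper; it reads XML from stdin.
# Assumption: the keyword contains no characters special to sed or XML.

add_keyword() {
  # $1 = new keyword
  local xml
  xml=$(cat)
  if printf '%s\n' "$xml" | grep -q '<meta:keyword>'
  then
    # Same idea as line 28: prepend to the first existing keyword
    printf '%s\n' "$xml" |
      sed "s/<meta:keyword>/<meta:keyword>$1<\/meta:keyword><meta:keyword>/"
  else
    # No keyword yet: insert just before the end of the office:meta element
    printf '%s\n' "$xml" |
      sed "s/<\/office:meta>/<meta:keyword>$1<\/meta:keyword><\/office:meta>/"
  fi
}

WITHOUT='<office:meta><dc:title>T</dc:title></office:meta>'
printf '%s\n' "$WITHOUT" | add_keyword 'ODF'
```

The same pattern should carry over to custom fields, with meta:user-defined in place of meta:keyword.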

The fourth supported operation does not change anything in the file. When the $2 parameter is equal to renamefromtitle, the script:

  • Takes note of the original file extension (EXT, line 38)
  • Uses Perl to extract the title string from meta.xml, replace all of its non-alphanumeric characters with single dashes (line 39), and save the result in the TITLE variable
  • Makes a copy of the original file, with the name TITLE.EXT, in the original directory
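
The three steps above can also be sketched without Perl. The title_to_filename helper below is hypothetical. It expects the content of meta.xml on standard input, and its character class only roughly matches what \W means in Perl:

```shell
#!/bin/bash
# Sketch: build a file name from the document title, without Perl.
# title_to_filename is a hypothetical helper.
# $1 = original file name; the content of meta.xml comes from stdin.

title_to_filename() {
  local ext title
  # Keep the text after the last dot as the extension
  ext="${1##*.}"
  # Isolate the first title element, strip its tags, then replace every
  # run of characters that are not letters, digits, or _ with one dash
  title=$(grep -o '<dc:title>[^<]*</dc:title>' |
            head -n 1 |
            sed -e 's/<[^>]*>//g' -e 's/[^[:alnum:]_][^[:alnum:]_]*/-/g')
  printf '%s.%s\n' "$title" "$ext"
}

SAMPLE='<office:meta><dc:title>New title for Linux Magazine</dc:title></office:meta>'
printf '%s\n' "$SAMPLE" | title_to_filename odf-sample.odt
# prints New-title-for-Linux-Magazine.odt
```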

The last two operations supported by odfmetawriter are insertion of the textual watermark passed as the third parameter inside all the pictures (lines 44-57) and removal of all Exif metadata from the same pictures (lines 59-71).

The watermark is inserted with ImageMagick's convert tool. The code in line 49 is copied almost verbatim from the relevant ImageMagick documentation [1]. Line 64, instead, tells exiftool to delete all the metadata in the current picture (the -all= syntax assigns an empty value to every tag) [2]. As before, the modified pictures ($P) are zipped back in the right place, in the copy of the original document (lines 51 and 65).

Running the following commands, in sequence, on the sample ODF document shown in Figure 6

#> odfmetawriter odf-sample.odt title 'New title for Linux Magazine'
#> odfmetawriter odf-sample.odt description 'Here is an ODT file with its metadata changed by a script'
#> odfmetawriter odf-sample.odt addkeyword 'ODF metadata processing'
#> odfmetawriter odf-sample.odt renamefromtitle
#> odfmetawriter New-title-for-Linux-Magazine.odt watermark 'Watermarked for Linux Magazine'

produces the results shown in Figure 7. (For simplicity, the renaming commands after each operation have been omitted.) As you can see for yourself, the metadata has the new values, and the picture is properly watermarked. Isn't ODF great to hack?

Figure 6: A sample ODF text file, with metadata and pictures inserted manually.

Figure 7: The same ODF text file, after the odfmetawriter script has automatically updated some metadata and watermarked the picture.

Code Limits

I already said this, but let me repeat it: The two scripts above do work, but they are not perfect or robust. As a minimum, they would need extra checks to refuse input files not in ODF format and to properly handle languages written in non-alphabetic scripts or strings with quotes inside them. In odfmetawriter, for example, addcustom will fail if there isn't already at least one custom field present. Also, odfmetawriter does not change the initial-creator of an ODF file. Another issue is dates: It is trivial to alter dates in the meta.xml file, but unless you do it right, you will end up with inconsistent documents (e.g., ODF files with last-modified timestamps that are earlier than some of the revisions they contain). Finally, neither script is optimized for performance.
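As an example of the first missing check, here is a minimal sketch of a test for "is this an ODF file at all?" It relies on a convention of the ODF package format that LibreOffice follows: The first entry in the ZIP archive is an uncompressed file called mimetype, so its name and content sit at fixed offsets (30 and 38) from the start of the file. The is_odf helper is hypothetical, and files written by other tools may be valid ZIP archives without following this layout:

```shell
#!/bin/bash
# Sketch: refuse files that do not look like ODF packages.
# is_odf is a hypothetical helper. Assumption: the archive starts with an
# uncompressed "mimetype" entry, as in files written by LibreOffice.

is_odf() {
  local name type
  # A ZIP local file header is 30 bytes long; the entry name follows it
  name=$(dd if="$1" bs=1 skip=30 count=8 2>/dev/null | tr -d '\0')
  # The stored content of the entry starts right after the 8-byte name
  type=$(dd if="$1" bs=1 skip=38 count=35 2>/dev/null | tr -d '\0')
  [ "$name" = "mimetype" ] && [ "$type" = "application/vnd.oasis.opendocument." ]
}

# Demonstration on a fake file that has only the relevant leading bytes
DEMO=$(mktemp)
{
  printf 'PK\003\004'        # ZIP signature
  head -c 26 /dev/zero       # rest of the 30-byte header, zeroed out
  printf 'mimetypeapplication/vnd.oasis.opendocument.text'
} > "$DEMO"

if is_odf "$DEMO"
then
  echo "looks like an ODF file"
fi
rm -f "$DEMO"
```

Both scripts could call such a function right after checking that the file exists and abort if it fails.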

Still, look at the result in Figure 7: A quick and dirty mix of a few standard Linux commands and utilities is all you need to analyze or produce automatically any number of perfectly valid documents with just the metadata you want (or don't want). Is this cool, or what?

Final Thoughts and Warnings and a Request

In general, metadata hacking has issues that have nothing to do with the code or with ODF, as such. As Spider-Man's Uncle Ben would put it (and Voltaire did), "With great power comes great responsibility." Years ago, in a discussion over this same topic, someone commented "maybe we shouldn't teach our documents lying." Use the techniques you learned here responsibly. Be aware that digital signatures are the only way to guarantee that no part of an ODF file has been modified.

Last, but not least, other parts of an ODF file also contain stuff that maybe should count as metadata, even if some people (including me, to some extent) may disagree: I'm talking about multiple revisions, but also about hidden paragraphs (or cells in spreadsheets), and about the content, author, and timestamps of embedded comments. All of this stuff may still be analyzed or "updated" with the same general approach presented here, thanks to the ODF format's openness and simplicity, but that is a different problem left as an exercise for the reader, with the suggestion that you use my ODF scripting examples [3] as a basis.

What's left? The request, of course: Please share how you use or modify these scripts for your own ODF metadata processing!

The Author

Marco Fioretti (http://mfioretti.com) is a freelance author, trainer, and researcher based in Rome, Italy. He has been working with free/open source software since 1995 and on open digital standards since 2005. Marco also is a Board Member of the Free Knowledge Institute (http://freeknowledge.eu).