Reinventing file storage with semantic tagging

Tag It

Article from Issue 236/2020
Assigning file names based on defined criteria saves time and maximizes your chances of finding the file later.

Everyone is familiar with the problem of losing data neatly stored on your own computer. Navigating through countless directories does not lead to the desired result, because the data you are searching for might be in a file with a random name – or a name you thought was logical at the time but proved forgettable later.

A full-text search would seem to offer a remedy, but full-text search usually requires additional – and often quite considerable – resources. Moreover, despite a sophisticated full-text search, you might overlook the desired document in the flood of results if the keywords are too broadly defined.

Many studies show that almost all computer users have experienced this situation. This problem is often not due to having a bad memory, or a lack of computer skills, but is instead attributable to the design of modern environments. All too often, the environment requires the user to adapt to the computer, rather than the computer adapting to the situation.

The foundations of "modern" file management were laid by developers in the middle of the last century. Nevertheless, today's systems follow largely the same premises. The concept of nested directories was introduced to make it easier to manage a few dozen, or at most, a few hundred files; following the advent of the desktop metaphor [1], directories were referred to as folders.

At the level of the filesystem, the concept is still the directory, whereas the folder as a concept is more apt for the level of the graphical user interface. In line with this, file is a conceptual term on the filesystem level, and the term document is used more often at the user-interface level.

The constantly increasing volume of lost information, in combination with the massively increasing number of files per user, requires a fundamentally new way of thinking about file management. Research in the field of Personal Information Management (PIM) has produced substantial improvements over three decades. However, virtually none of these findings have found their way into the computer systems we use today (see the box entitled "Only in Research").

Only in Research

Over the last decades, with the exception of the now-established local search engines, file management has undergone few fundamental changes. When it comes to data on your own computer or on the local network, users still prefer to browse with the file manager and very rarely use local search engines. In contrast, the research discipline Personal Information Management (PIM), which emerged in the 1980s, has focused on searching rather than navigation over the last two decades.

The industry has an enormous need for research and new strategies. For many decades, scientists have been aware that managing files in strict hierarchies of directories unnecessarily restricts users. In addition, there is a massive increase in the number of files one user has to manage. These complicating factors lead to frustration, lost information, and redundant data. The volume of redundant data alone is in the range of 15 to 50 percent of shared storage in both private and corporate environments.

Although technical solutions help to reduce such redundancies through deduplication, it is not always possible to eliminate them. In addition, deduplication techniques do not improve the situation when searching for information or where problems result from different versions. Every day, searching causes an unnecessary loss of time for everyone; in my experience, this amounts to at least 15 to 30 minutes. With a fundamentally new file management strategy, it might even be possible to save several times this amount of time, depending on scope and foresight [5]. The only promising advance in this direction came from Microsoft with WinFS, but it did not find its way into everyday working environments.

Backwards compatibility still outweighs advanced concepts. Inadequate education in the field of PIM on the one hand, and a lack of problem awareness among the majority of users on the other, further aggravate the situation.

Research results such as those from the Tagstore Project [2] show that even small, incremental improvements to current computing environments have a huge amount of potential. Tagstore is the result of file management research at the Institute for Software Technology at the Graz University of Technology in Austria. The purpose of the Tagstore project is to create "…a better method to manage files and folders on the local hard drive."

This article describes a collection of scripts developed from lessons learned working with the Tagstore project. The goal of the script collection is to provide an easy way for interested users to get started with applying the principles of semantic file tagging. You'll find all the scripts described in this article at GitHub. The methods I'll discuss work on a small scale – even if you use only part of the total package.

The Problem

An example at the Tagstore website best illustrates the problem of the traditional file storage architecture. Suppose a user called Bob sends you a file with his thoughts about a project called MyProject. In the classic storage paradigm, you have to decide whether to store the file in a directory with other files containing thoughts from Bob (say, the People/Bob directory), or whether to save it with other files associated with MyProject (say, the Projects/MyProject directory). In other words, you need to choose whether to file the information with Bob stuff or with MyProject stuff – there is no universal and practical way to put it in both places.

Of course, you could make a copy of the file and paste it into both directories, but duplicating files wastes space and, even worse, invites version control problems. Some operating systems let you create a symbolic link or shortcut from one directory to a file in another directory, but links and shortcuts are difficult to manage, easy to lose track of, and cumbersome to create and configure.

A better approach is to build a system around attributes or tags that let you associate a single file with both Bob and MyProject. Certain file formats, such as image file formats, allow you to associate metadata with a file in a way that would support tagging operations, but this approach only works for the particular file format. Some Linux filesystems offer the possibility of adding metadata through extended file attributes, but this kind of tinkering can require some significant programming skill – and the results aren't portable if you copy the file to a different filesystem.

A simple, portable, and easily extensible solution for adding tags to a file is simply to append the tag to the file name. The tag then follows the file wherever it goes – without the need for additional complexity or metadata conventions that will not translate across filesystems or file types.

The scripts described in this article offer a uniform framework for attaching tags to a file by modifying the file name. As you will learn, the collection also includes options for visualizing file lists sorted by tag, thus creating a virtual directory of files called a TagTree. In the preceding example, you could simply store Bob's notes on MyProject in a general Storage directory and call up TagTrees to display the file with both Bob stuff and MyProject stuff.
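The TagTree idea can be illustrated with a few lines of Python. The following is only a minimal sketch, not the code from the script collection: it assumes the " -- " separator introduced in the Conventions section, builds the tree in memory rather than on disk, and uses hypothetical file names and function names (tags_of, build_tagtree).

```python
from collections import defaultdict
from pathlib import PurePosixPath

SEPARATOR = " -- "  # separator between base name and tags (see Conventions)


def tags_of(filename: str) -> list[str]:
    """Return the tags found in a file name, or an empty list if it has none."""
    stem = PurePosixPath(filename).stem  # drop directory and extension
    if SEPARATOR not in stem:
        return []
    return stem.rsplit(SEPARATOR, 1)[1].split()


def build_tagtree(filenames: list[str]) -> dict[str, list[str]]:
    """Map each tag to the files carrying it; a file can appear under several tags."""
    tree = defaultdict(list)
    for name in filenames:
        for tag in tags_of(name):
            tree[tag].append(name)
    return dict(tree)


# Hypothetical content of a general Storage directory
files = [
    "Storage/Thoughts on the schedule -- bob myproject.txt",
    "Storage/Budget draft -- myproject.ods",
    "Storage/Holiday greetings -- bob.txt",
]
tree = build_tagtree(files)
for tag in sorted(tree):
    print(tag, "->", tree[tag])
```

Bob's notes exist only once in Storage, yet they show up under both bob and myproject. A real tool would typically turn each tag into a directory of links instead of printing a list.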

Conventions

The concept begins with a convention for file names. In most cases, the file name starts with a date or timestamp in an adapted ISO-8601 format [3]. The adaptation is necessary because the standard separates hours and minutes with a colon, and Microsoft systems do not allow colons in file names.

If possible, start by asking yourself what timestamp you want to include in the file name. I usually rely on one that is related to the origin or publication of the information. As a fallback, the date of system entry is used; this is usually the download or digitization date [4].
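As a rough sketch of how such a prefix could be generated, the Python snippet below formats a date in the notation used in Listing 1, with a period instead of the colon between hours and minutes. The function names are hypothetical, and using the file's modification time as the fallback is an assumption: it only approximates the download or digitization date.

```python
import os
from datetime import datetime


def date_prefix(moment: datetime, with_time: bool = False) -> str:
    """Format a datetime as YYYY-MM-DD or YYYY-MM-DDThh.mm (no colon)."""
    fmt = "%Y-%m-%dT%H.%M" if with_time else "%Y-%m-%d"
    return moment.strftime(fmt)


def fallback_prefix(path: str) -> str:
    """Assumed fallback: derive the prefix from the file's modification time."""
    return date_prefix(datetime.fromtimestamp(os.path.getmtime(path)))


taken = datetime(2014, 4, 20, 17, 9)
print(date_prefix(taken))                  # 2014-04-20
print(date_prefix(taken, with_time=True))  # 2014-04-20T17.09
```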

The optional date or timestamp is followed by the actual file name. The name should be as meaningful as possible: long enough to clearly describe the file and short enough to be readable in a list.

The base file name is followed by an optional part consisting of a separator and a series of keywords (tags) (see the box entitled "Tagging"). In the example, the separator consists of one space, two minus signs and one more space. Spaces are inserted between the tags; in the best case, they consist only of lowercase letters and numbers. For an example that follows this convention, see Listing 1.

Listing 1

Naming Convention

/a/path/Picknick in Graz -- food graz.jpg
/a/path/2014-04-20 Picknick in Graz -- food graz.jpg
/a/path/2014-04-20T17.09 Picknick in Graz -- food graz.jpg
/a/path/2014-04-20T17.09 Picknick in Graz.jpg
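
The convention in Listing 1 is regular enough to be parsed mechanically. The following Python sketch splits a name into its optional timestamp, base name, tags, and extension, and puts it back together. It is an illustration under stated assumptions, not the implementation used by the scripts: the names ParsedName, parse_name, and build_name are hypothetical, and only the two timestamp forms shown in Listing 1 are recognized.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

SEPARATOR = " -- "  # one space, two minus signs, one space
# Date, optionally followed by T and hh.mm, then a space and the rest
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}(?:T\d{2}\.\d{2})?) (.+)$")


class ParsedName(NamedTuple):
    timestamp: Optional[str]
    base: str
    tags: list
    extension: str


def parse_name(path: str) -> ParsedName:
    """Split a file name into timestamp, base name, tags, and extension."""
    name = PurePosixPath(path).name
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem or " " in ext:
        # No usable extension, e.g. the only dot belongs to the timestamp
        stem, extension = name, ""
    else:
        extension = "." + ext
    tags = []
    if SEPARATOR in stem:
        stem, tag_part = stem.rsplit(SEPARATOR, 1)
        tags = tag_part.split()
    timestamp = None
    match = STAMP.match(stem)
    if match:
        timestamp, stem = match.groups()
    return ParsedName(timestamp, stem, tags, extension)


def build_name(parsed: ParsedName) -> str:
    """Assemble a file name from its parts, following the same convention."""
    name = parsed.base
    if parsed.timestamp:
        name = f"{parsed.timestamp} {name}"
    if parsed.tags:
        name += SEPARATOR + " ".join(parsed.tags)
    return name + parsed.extension


parsed = parse_name("/a/path/2014-04-20T17.09 Picknick in Graz -- food graz.jpg")
print(parsed.timestamp, "|", parsed.base, "|", parsed.tags, "|", parsed.extension)
print(build_name(parsed._replace(tags=parsed.tags + ["friends"])))
```

Adding a tag then amounts to parsing the name, extending the tag list, and renaming the file to the rebuilt name.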

Tagging

Keywording files is a science in itself. This article does not consider the many implications of sharing files and directories among multiple users. Both from personal practice and based on the findings of some scientific work, I recommend the following guidelines:

  • Limit yourself to a predefined set of tags, or, to use the scientific term, a controlled vocabulary (CV). Its scope should be as small as possible. A CV of several hundred entries is more confusing than it is helpful.
  • If you need many tags per file, a full-text search is the better choice. The tags used are not intended to supplement the actual file name but merely extend it to include generalized concepts. If you limit the number of tags, you will also prevent problems caused by synonyms and indirectly by homonyms.
  • By convention, the tags are defined in the plural to avoid having to decide between singular and plural forms – for example, manuals instead of manual or templates instead of template.
  • Tags that result directly from the file type, such as images and movies for files with the extensions .jpeg and .avi, do not add any significant value. In practice, I made an exception to this rule. The presentations tag is useful for LibreOffice Impress files and for the corresponding photos, movies, or audio files.
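
A controlled vocabulary is easy to enforce with a small check before renaming a file. The sketch below is a hypothetical helper, not part of the script collection: the vocabulary is an invented example, and the format rule (lowercase letters and digits only) follows the recommendation in the Conventions section.

```python
import re

# Hypothetical controlled vocabulary; keep the real one as small as possible
VOCABULARY = {"manuals", "templates", "presentations", "food", "graz"}
TAG_FORMAT = re.compile(r"^[a-z0-9]+$")  # lowercase letters and digits only


def check_tags(tags, vocabulary=VOCABULARY):
    """Return a list of problems; an empty list means all tags are acceptable."""
    problems = []
    for tag in tags:
        if not TAG_FORMAT.match(tag):
            problems.append(f"{tag}: use only lowercase letters and digits")
        elif tag not in vocabulary:
            problems.append(f"{tag}: not in the controlled vocabulary")
    return problems


for problem in check_tags(["food", "manual", "Graz"]):
    print(problem)
# manual: not in the controlled vocabulary
# Graz: use only lowercase letters and digits
```

The singular manual is rejected simply because only the plural manuals is in the vocabulary, which is how a small CV keeps the singular-versus-plural question from coming up.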

It is also better not to mark versions in file names, such as Document v2.pdf. Instead, it is worthwhile to use mnemonic tags, as in final paper -- draft.pdf. If you need even more detailed versioning, it may make sense to use a (local) Git repository.

In contrast to other approaches, the metadata appears here in the form of tags directly in the file name. This offers several advantages. First of all, it offers compatibility with any application. No special software is needed to access specific data, as would be necessary with Exif and IPTC (images) or ID3 (music).

Furthermore, the tags survive editing with any program. With the image and music standards mentioned above, you can lose the metadata as soon as you edit the file with a tool that does not transfer it correctly to the result.

This method also ensures that there are no difficulties when exchanging data or copying between operating systems. Metadata stored in Alternate Data Streams (ADS, NTFS) or their equivalents in HFS+ or APFS can be lost. Copying often creates sidecar files, which can become separated from the corresponding file when you edit it.

Make no mistake: The convention described here involves additional overhead, but you might find that you save time in the long run. The following sections provide assistance and introduce tools that make your digital life easier. You can find a series of videos online that demonstrate the main functions of the tools [6].

Sample Environment

I currently work with Debian, Xubuntu, and Windows 10. On Linux, I use the Z shell, the Thunar graphical file browser, and the Geeqie image viewer. Other graphical tools can be integrated in the same way, provided they offer a way to call external tools.

Most of the programs mentioned in this article run equally well on Linux, Windows, and macOS. It makes sense to embed the tools in your own environment for quick and easy use with files. The README files explain how to install and integrate the tools.
