Analyzing file metadata in the shell

Taking Stock

© Lead Image © Sebastian Duda, 123RF.com

Armed with the right shell commands, you can quickly identify and evaluate file and directory metadata.

Imagine you have a directory with hundreds or even thousands of files (without uniform extensions) that you want to organize. Or maybe you want to know the last access date of a file for backup, forensics, or version management purposes.

Instead of tediously clicking your way through the files in a graphical file manager, a shell script with the test command can help identify filesystem objects as well as provide additional information about the files.

Determining File Type

The file command provides information about a file's contents (Figure 1). Because it tests for patterns in the content, file cannot be misled by file extensions (Figure 2).

Figure 1: By adding the -i option, file determines that a file is a plain text file.
Figure 2: file reliably identifies files whose extensions do not match the actual content.

file gets its pattern information from the magic file /usr/share/misc/magic.mgc. For special cases, you can create your own magic file and pass it to file with the option -m <filename>. If you are particularly interested in MIME types, use the --mime-type option.
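As a minimal sketch of the --mime-type option, the following helper prints only the MIME type of a file. The function name mime_of is hypothetical, and the sketch assumes that the file utility is installed.

```shell
# Print only the MIME type of the file given as $1.
# -b (brief) omits the "<filename>: " prefix from the output.
mime_of() {
  file -b --mime-type -- "$1"
}

# Example call (hypothetical file name):
#   mime_of report.pdf
```

Because file looks at the content, a text file that carries a misleading extension should still be reported as text.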

You can even access and evaluate device files with file. It also outputs the major and minor numbers (Figure 3): The major number specifies the kernel's device driver, while the minor number specifies the individual device managed by that driver.

Figure 3: If you call up the information for a device file, file shows the major and minor numbers if possible.

Given the appropriate privileges, you can obtain information about the filesystem (Figure 4). To do so, use the -s <device file> option. If you only enter the device for the hard disk without the partition number, the output contains block size details, among other things.

Figure 4: Because devices in Linux act like files, file can also output information on them.

file usually outputs the information in the form <filename>: <data>. You can take advantage of this when using the tool in shell scripts. A for loop in a script needs one clean file name per pass. Typing ls -1 gives you one file name per line, but the subshell in the loop header only provides reliable arguments as long as no file name contains a space, because the shell splits the output at whitespace. To avoid this, you have to convert or quote the names.
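One way around the whitespace problem is to let the shell expand a glob instead of parsing the output of ls. The following sketch assumes a hypothetical helper named list_by_type that takes a directory and a search pattern; it is an illustration, not the approach used in the listings of this article.

```shell
# List every regular file in directory $1 whose description from
# file contains the pattern $2 (for example, PDF).
# The glob "$1"/* keeps names with spaces intact, unlike $(ls -1).
list_by_type() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue   # skip directories and unmatched globs
    if file -b -- "$f" | grep -q -- "$2"; then
      printf '%s\n' "$f"
    fi
  done
}

# Example call (hypothetical directory):
#   list_by_type "$HOME/Data" PDF
```

Quoting "$f" everywhere is what keeps a name such as my report.pdf together as a single argument.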

Listing 1 shows a sample script that specifically searches for PDF files and displays them for selection (Figure 5). The dynamic selection menu is created with the help of Smenu.

Listing 1

pdflist.sh

#!/bin/bash

# Default quit menu entry
menu="E-N-D"

cd "$HOME/Data" || exit 1

# Search for PDF files; the glob * replaces $(ls -1)
# and hands each name to file in one piece
for i in *; do
  # -b omits the file name, so only the description is searched
  if file -b "$i" | grep -q PDF; then
    # Add to selection (the menu is a space-separated list,
    # so names with spaces still show up as separate entries)
    menu="$menu $i"
  fi
done

# Selection menu with Smenu and PDF display
while true; do
  choice=$(echo $menu | smenu -n 10 -t1 )
  if [ "$choice" = "E-N-D" ]; then
    exit
  fi
  atril "$choice"
  clear
done
Figure 5: With just a couple of lines of shell code and Smenu, you can create a simple selection menu for a specific file type.

Status Information

Similar to ls, the stat command provides file and directory details. Without specifying any other options, stat outputs a full set of data for the listed files (Figure 6). Figure 6 also shows the effect of read access – note the Access line with the date and time information. However, this feature does not work for filesystems mounted with the noatime option. noatime speeds up data access, because the filesystem does not have to create an entry whenever something is read.

Figure 6: Without specifying any options, stat's output contains a large amount of information.

Using stat -c <format> you can read specific information about a file (or a filesystem) and evaluate it in a script. Table 1 shows formatting information for stat. Figure 7 shows some calls, including querying access rights in numerical form. This information could be useful for an installation script.

Table 1

Stat Format Information

Syntax  Meaning
%a      Access rights in numerical (octal) format
%A      Access rights in detailed format
%d      Device number (decimal) of the device that holds the file
%t      Major number of a device file (hexadecimal)
%T      Minor number of a device file (hexadecimal)
%F      File type
%m      Mount point of the filesystem where the file resides
%u      Owner UID
%U      Owner username
%g      GID
%G      Group name
%x      Last read access (plain text)
%X      Last read access (Unix seconds)
%y      Last modification of the contents (plain text)
%Y      Last modification of the contents (Unix seconds)
%z      Last status change (plain text)
%Z      Last status change (Unix seconds)

Figure 7: Examples of formatted stat queries.
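The format sequences from Table 1 can be combined in a single call. The following sketch wraps two such calls in small helpers; the function names owner_and_mode and has_mode are hypothetical, and the -c option assumes the stat version from GNU coreutils.

```shell
# Print "<octal mode> <owner> <group>" for the file given as $1.
owner_and_mode() {
  stat -c '%a %U %G' -- "$1"
}

# Succeed if the octal mode of file $1 equals $2 (for example, 644).
has_mode() {
  [ "$(stat -c %a -- "$1")" = "$2" ]
}

# Example call in an installation script (hypothetical file name):
#   has_mode settings.conf 600 || echo "settings.conf: unexpected permissions"
```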

Note that stat's output is a bit ambiguous: Access means the last read access, but mount options like noatime influence this value. Modify refers to the contents of the file; it shows the last write access, which for a file that has never been changed is also the creation date. Change shows when the file's status information, such as the access rights or the owner, last changed. Whether stat shows a value for Birth on Linux depends on the version of stat, the kernel, and the filesystem; older versions of stat do not report it.
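The difference between Modify and Change is easiest to see when only the permissions of a file change. As a small sketch (the helper name mtime_ctime is hypothetical), the following function prints both values in Unix seconds so that they can be compared before and after a chmod call.

```shell
# Print the modification time (%Y) and the status change time (%Z)
# of the file given as $1, both in Unix seconds.
mtime_ctime() {
  stat -c '%Y %Z' -- "$1"
}

# Example (hypothetical file name):
#   mtime_ctime notes.txt
#   chmod 600 notes.txt
#   mtime_ctime notes.txt   # first value unchanged, second may be newer
```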

Listing 2 shows a small script that reads the access rights of a file and then changes them if they are too permissive. It calls the command to change permissions with the -v option so you can see what it is doing in the terminal. Figure 8 shows the initial state of the files; Figure 9 shows the script running.

Listing 2

restrictive.sh

#!/bin/bash

# Use $1 as the directory,
# else cancel
if [ -z "$1" ]; then
  exit 1
fi

# Change to directory
cd "$1" || exit 1

for i in *; do
  # Evaluate access permissions; the pattern matches
  # any mode that contains "75", such as 755 or 750
  if stat -c %a "$i" | grep -q 75; then
    # Restrict to the owner if group
    # or anyone can execute
    # the file.
    chmod -v 700 "$i"
  fi
done
Figure 8: Access rights that are too permissive can give unauthorized persons access to the data under certain circumstances.
Figure 9: The shell script from Listing 2 showing the changed files.
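The pattern 75 in Listing 2 matches the digits anywhere in the mode, so it catches 755 and 750 but not, for example, 711. As an alternative sketch, which is not the method used in Listing 2, the execute bits can be tested directly with shell arithmetic. The function name group_or_other_can_execute is hypothetical.

```shell
# Succeed if group or others may execute the file given as $1.
# The leading 0 makes the shell read the mode as an octal number;
# 011 is the octal mask for the group and other execute bits.
group_or_other_can_execute() {
  mode=$(stat -c %a -- "$1") || return 2
  [ $(( 0$mode & 011 )) -ne 0 ]
}

# Example call (hypothetical file name):
#   group_or_other_can_execute tool.sh && chmod -v 700 tool.sh
```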

Changing Timestamps

Applications that work with a file will typically modify the timestamp information. You can do this manually with the touch command (see Table 2). If you run touch for a nonexistent file name, the system creates a corresponding entry in the filesystem (i.e., it creates a file without any content). If you call touch <file> without any options, the program updates all the timestamps in the file to the system time.

Table 2

Touch Options

Option      Action
-a          Change access time
-m          Change modification time
-t <time>   Use <time> instead of system time
-r <file>   Provides a reference file from which touch takes the timestamp

The time specification for the -t option takes the form of <MMDDhhmm>. You can also add the calendar year and seconds to the specification: <YYYYMMDDhhmm.ss>. Figure 10 shows how to change the access time using touch. Figure 11 shows an example of referencing an existing file for the timestamp. The stat command's resulting output shows the special access date set by the command in Figure 10.

Figure 10: Using touch, you can modify a file's timestamp.
Figure 11: A timestamp taken from a reference file.
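The calls described above can be sketched as two small helpers. The function names set_access_time and copy_timestamps are hypothetical; the options -a, -t, and -r are the ones listed in Table 2.

```shell
# Set only the access time of file $1 to the timestamp $2,
# given in the form YYYYMMDDhhmm.ss (local time).
set_access_time() {
  touch -a -t "$2" -- "$1"
}

# Copy the timestamps of the reference file $1 to the file $2.
copy_timestamps() {
  touch -r "$1" -- "$2"
}

# Example calls (hypothetical file names):
#   set_access_time notes.txt 202001011200.00
#   copy_timestamps notes.txt copy.txt
#   stat -c '%x %y' notes.txt copy.txt
```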

Testing Files

Usually, the test command is not used in full, but rather as a notation using square brackets and matching options:

if [ $? -eq 0 ];
# is the same as
if test $? -eq 0;

test evaluates the type and timestamps of objects in the directory tree. It returns 0 if the tested condition is true. See Table 3 for test's options.

Table 3

Test Options

Test                      True, if …
-e <Object>               Object exists

True, if <Object> exists and …
-b <Object>               It is a block device file
-c <Object>               It is a character device file
-d <Object>               It is a directory
-f <Object>               It is a regular file
-g <Object>               The set-group-ID bit is set
-G <Object>               It is owned by the effective group ID of the calling process
-h <Object>               It is a symbolic link
-k <Object>               The sticky bit is set
-L <Object>               It is a symbolic link
-O <Object>               It is owned by the effective user ID of the calling process
-p <Object>               It is a FIFO
-r <Object>               It is readable
-s <Object>               Its size is not 0
-S <Object>               It is a socket
-u <Object>               The set-user-ID bit is set
-w <Object>               It is writable
-x <Object>               It is executable

True, if <Object1> exists and …
<Object1> -ef <Object2>   <Object2> points to the same object
<Object1> -nt <Object2>   Is newer than <Object2>
<Object1> -ot <Object2>   Is older than <Object2>
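Several of the tests in Table 3 can be chained to identify what kind of object a name refers to. The following sketch uses a hypothetical function named classify. It checks -L first, because the other tests follow symbolic links and a dangling link would otherwise be reported as missing.

```shell
# Print the type of the filesystem object given as $1.
classify() {
  if [ -L "$1" ]; then echo "symbolic link"
  elif [ ! -e "$1" ]; then echo "missing"
  elif [ -d "$1" ]; then echo "directory"
  elif [ -f "$1" ]; then echo "regular file"
  elif [ -b "$1" ]; then echo "block device"
  elif [ -c "$1" ]; then echo "character device"
  elif [ -p "$1" ]; then echo "FIFO"
  elif [ -S "$1" ]; then echo "socket"
  else echo "other"
  fi
}

# Example call:
#   classify /etc
```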

The test command is useful for making a distinction in an if construct. The range of applications is extensive. In a script for saving data, for example, it would be possible to check whether a file named BACKUP.INFO exists. If so, the script creates a copy of all files that are newer than this file's timestamp. Otherwise, the script creates a full backup.
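The backup decision described above could look like the following sketch. The function name changed_since_backup is hypothetical, and the sketch only lists the candidate files instead of copying them: If BACKUP.INFO exists in the directory, it prints the regular files that are newer than this marker; otherwise it prints all regular files, which corresponds to a full backup.

```shell
# List the regular files in directory $1 that need to be saved.
changed_since_backup() {
  marker="$1/BACKUP.INFO"
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    # With a marker present, keep only files newer than the marker
    if [ -e "$marker" ] && ! [ "$f" -nt "$marker" ]; then
      continue
    fi
    printf '%s\n' "${f##*/}"
  done
}

# Example call (hypothetical directory):
#   changed_since_backup "$HOME/Data"
```

After a successful run, a backup script would typically refresh the marker with touch so that the next pass starts from the new timestamp.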

Listing 3 shows the code for a script that creates a directory named BACKUPTEST if it doesn't already exist, quickly performing a common task (Figure 12).

Listing 3

tester.sh

#!/bin/bash
uvznew () {
  # Is there a regular FILE with this name?
  if [ -f BACKUPTEST ]; then
    read -r -p "File with same name exists! Rename (y)? " we
    if [ "$we" = "y" ]; then
      mv BACKUPTEST BACKUPTEST.file
    else
      echo "Either delete or move the BACKUPTEST file!"
      echo "END OF SCRIPT"
      # Stop here; mkdir would fail while the file exists
      return 1
    fi
  fi
  # Enter the directory if it exists, else create it
  if [ -d BACKUPTEST ]; then
    cd BACKUPTEST || return 1
    echo "The current directory is $PWD"
  else
    echo "Creating BACKUPTEST"
    mkdir BACKUPTEST
  fi
  return 0
}
uvznew
Figure 12: With automation, the tester.sh script in Listing 3 handles tasks more quickly and reliably.

Timestamps and Rights

The find tool not only searches for file and directory names; if required, it also filters by timestamps, access rights, and file size. See find's man page to learn about the full scope of this command. For the most important find options, see Table 4.

Table 4

Find Options

Option                      Action                                  Note
-type <type>                Search by type
-size +/-<size>             Search by size                          - = smaller than <size>; + = larger than <size>; nothing = exactly <size>
-perm <file permissions>    Search for file permissions
-newer <file>               Search for files newer than <file>
-mtime -/+<N>               Search by last modification, in days    + = more than <N> days ago; - = less than <N> days ago
-atime -/+<N>               Search by last access, in days          + = more than <N> days ago; - = less than <N> days ago
-execdir <command> "{}" +   Execute <command> for the found files   Runs in the directory of each found file; safer than -exec
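Two of the filters from Table 4 can be sketched as small wrappers around find. The function names older_than_days and newer_than_file are hypothetical.

```shell
# List regular files below directory $1 that were last modified
# more than $2 days ago.
older_than_days() {
  find "$1" -type f -mtime +"$2"
}

# List regular files below directory $1 that are newer than
# the reference file $2.
newer_than_file() {
  find "$1" -type f -newer "$2"
}

# Example calls (hypothetical paths):
#   older_than_days "$HOME/Data" 30
#   newer_than_file "$HOME/Data" "$HOME/Data/BACKUP.INFO"
```

If the results are passed on to another command and file names may contain spaces, GNU find's -print0 action together with xargs -0 keeps each name intact.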

If necessary, you can forward find's output to a pipe or process it using xargs, which passes the result to other commands, like tar. As an example, Listing 4 creates subdirectories for files as a function of their modification date and moves them there. Figure 13 shows the directory's contents before running the script. Figure 14 shows the script in action, and Figure 15 shows the output.

Listing 4

sortme.sh

#!/bin/bash
# $1 = directory to process
if [ -z "$1" ]; then
  echo "No input"
  exit 1
fi

cd "$1" || exit 1

# Note: this parsing assumes file names without spaces or colons
for i in $(stat -c %y:%n * | sort -r | tr ' ' ':'); do
  # Populate variables subdir (subdirectory)
  # and fn (filename)
  subdir=$(echo "$i" | cut -d: -f1)
  fn=$(echo "$i" | cut -d: -f6)
  # Do not process if directory;
  # restart loop
  if [ -d "$fn" ]; then
    echo "$fn: Skipping directory"
    continue
  fi
  # Create subdir if needed, and
  # move file to it
  if [ -d "$subdir" ]; then
    mv -v "$fn" "$subdir"
  else
    mkdir "$subdir"
    mv -v "$fn" "$subdir"
  fi
done
Figure 13: The directory content before running sortme.sh.
Figure 14: sortme.sh evaluates the metadata and sorts files into the appropriate directory structure.
Figure 15: Files end up in the corresponding subdirectories in the new directory structure.

The for loop receives the data courtesy of stat. Since spaces are used as separators, it replaces them with colons. The output is sorted by date, starting with the newest files.

In the for loop, the subdir variable contains the subdirectory to be created and fn takes the file name. The script checks whether fn is a directory. If so, it skips the entry, and the loop continues with the next pass. This prevents the script from moving a directory.

If, on the other hand, fn is a file, the routine uses another test to check whether the subdirectory already exists. If this is not the case, it creates the directory and moves the file to it. If the directory already exists, it simply moves the file. Figure 15 shows a visual overview created with tree.
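Because Listing 4 splits the output of stat at spaces and colons, it depends on file names that contain neither. The following variant is a sketch of a different approach, not the method of Listing 4: It loops over a glob and queries stat once per file, so names with spaces survive. The function name sort_by_mdate is hypothetical, and the variant does not sort the files by date before moving them.

```shell
# Move every regular file in directory $1 into a subdirectory
# named after the file's modification date (YYYY-MM-DD).
# The body runs in a subshell, so the caller's directory is unchanged.
sort_by_mdate() (
  cd "$1" || exit 1
  for fn in *; do
    [ -f "$fn" ] || continue   # skip directories
    # %y starts with the date; keep only the part before the first space
    subdir=$(stat -c %y -- "$fn" | cut -d ' ' -f1)
    mkdir -p -- "$subdir"
    mv -- "$fn" "$subdir/"
  done
)

# Example call (hypothetical directory):
#   sort_by_mdate "$HOME/Data"
```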

Conclusions

With the appropriate shell commands, you can identify, evaluate, and change various file and directory metadata. As a result, many operations can be simplified using scripts, which avoid errors and save valuable time.

The Author

Harald Zisler has focused on FreeBSD and Linux since the early 1990s. He is the author of various articles and books on technology and IT topics. The fifth edition of his book Computer-Netzwerke (Computer Networks) was recently published by the Rheinwerk Verlag publishing company. He also works as an instructor, teaching Linux and database topics in small groups.