Checking for broken links in directory structures

Dead End

Photo by Adam Birkett on Unsplash

Author(s): Frank Hofmann

Broken links can wreak havoc in directory structures. This article shows you how to use scripts to avoid having your links lead to a dead end.

During a restore process, nothing is more disappointing than discovering that some of the links in your backed-up data no longer work. The link itself is still there, but its target no longer exists, so the link points nowhere. Broken links like these can also cause problems when you develop software and publish it as an archive, or when you need to install different versions of an application.

Finding and fixing broken links manually takes a lot of effort. You can avoid this scenario by using scripts and Unix/Linux tools to check for broken links in directory structures. In this article, I'll look at several ways to check the consistency of these data structures and detect broken links. Read on to avoid hiccups for you and your users.

Sample Data

As an example, I will use the directory structure shown in Figure 1, which is similar to a piece of software or a project directory that you might encounter in the wild. You can easily display a tree view like the one in Figure 1 with the tree command [1].

Figure 1: The example project directory structure.

The directory tree in Figure 1 contains two versions of the software. There are three links: One points to the old version (named old), one to the current version (named current), and the third to a data file named dataset3 (which is missing).
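If you want to follow along, a tree like this one can be set up in a few lines of Python. The following is only a sketch: the file names dataset1 and dataset2, the helper name build_sample_tree(), and the relative link targets are assumptions for illustration; only the overall layout follows the description above.

```python
import os
import tempfile
from pathlib import Path


def build_sample_tree(base):
    """Create a small project tree with two working links and one broken link."""
    project = Path(base) / "project"
    for version in ("version1", "version2"):
        (project / version / "data").mkdir(parents=True)

    # Real data files exist only in version1 (the file names are made up).
    for name in ("dataset1", "dataset2"):
        (project / "version1" / "data" / name).write_text("sample\n")

    # Relative targets are resolved from the directory that holds the link.
    os.symlink("version1", project / "old")
    os.symlink("version2", project / "current")
    # dataset3 does not exist in version1, so this link dangles.
    os.symlink("../../version1/data/dataset3",
               project / "version2" / "data" / "dataset3")
    return project


if __name__ == "__main__":
    root = build_sample_tree(tempfile.mkdtemp())
    print("sample tree created in", root)
```

The tree is created in a fresh temporary directory, so you can experiment with the scripts in this article without touching real data.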

Options

A small, manageable project structure like Figure 1 can be tested and checked manually. With larger projects, however, this quickly leads to errors because you are bound to overlook something. To automate the testing procedure, I rummaged around in my Unix/Linux toolbox and came up with four options that are suitable for everyday use: a shell script that combines a recursive function with a for loop over all the files and directories, a special call to find, a Python script, and a set of dedicated tools (symlinks, FSlint, rmlint, and chase).

Shell Script

The shell script (Listing 1) uses a recursive function named check(). check() only expects one parameter: the directory you want it to check for broken links (line 16). In the function, a for loop iterates across all entries (lines 2 to 14).

Listing 1

find-broken-links.sh

01 function check {
02   for entry in "$1"/*; do
03     # echo "check $entry ... "
04     if [ -d "$entry" ]; then
05       check "$entry"
06     else
07       if [ -L "$entry" ]; then
08         target=$(readlink "$entry")
09         if [ ! -e "$entry" ]; then  # -e follows the link to its target
10           echo "broken link: from $entry to $target"
11         fi
12       fi
13     fi
14   done
15 }
16 check "$1"

For each entry, check() first tests whether the entry is a directory (line 4). If so, the function is called again with this directory as a parameter (line 5). Otherwise, two more tests are made: Is it a link (line 7), and, if so, where does it point to (lines 8 and 9)? In line 8, the readlink command returns the target to which the link points, and the script stores the result in the variable target.

In line 9, the script checks if the link target exists. If not, the function sends an error message to that effect to stdout (line 10). The routine ignores entries in the directory that are not links. Once the entire list has been processed, check() returns to the call point. After processing the entire original directory, the script exits.

If you now call the script, you will see output similar to that in Listing 2. I made the call using a period (.) for the current directory as the starting point. The output includes two lines because current points to version2 and the directory test in line 4 follows the link, so the same broken link is reached by two paths.

Listing 2

Output of find-broken-links.sh

$ ./find-broken-links.sh .
broken link: from ./project/current/data/dataset3 to project/version1/data/dataset3
broken link: from ./project/version2/data/dataset3 to project/version1/data/dataset3
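When reading this output, keep in mind that readlink prints the link target exactly as it is stored in the link. A relative target is interpreted relative to the directory that contains the link, not relative to the directory you call the script from. The following Python sketch shows one way to turn the stored target into a path you can inspect. The function name resolve_target() is hypothetical, and the normalization is purely lexical, so it assumes that no directory along the way is itself a symbolic link.

```python
import os


def resolve_target(link):
    """Return the path a symlink refers to, seen from the link's directory."""
    target = os.readlink(link)  # raw text stored in the link
    # os.path.join() discards the directory part if target is absolute.
    combined = os.path.join(os.path.dirname(link), target)
    # normpath() collapses "." and ".." without touching the filesystem.
    return os.path.normpath(combined)
```

Passing the result to os.path.exists() then tells you whether the target is present, no matter which directory the script was started from.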

find

Old hands will now object to the complexity of the shell script option and contend that a far more elegant solution is available with the find tool. I'm happy to field this objection. With find, I will use the -xtype l test, which matches symbolic links whose target cannot be resolved and is therefore well suited to detecting broken links. Listing 3 shows that this option also works.

Listing 3

Searching with find

01 $ find . -xtype l
02 ./project/version2/data/dataset3

Now I know which link is broken, but not yet where it points. In Listing 4, I will now combine find with a for loop (which takes less than half as many lines as Listing 1). In line 1, I let find do all the work and get a list of all the broken links found below the starting directory. In the for loop (lines 2 to 5), readlink then determines the respective link target. Note that the unquoted $entrylist in line 2 is split on whitespace, so the script assumes file names without spaces.

Listing 4

find-broken-links2.sh

01 entrylist=$(find "$1" -xtype l)
02 for entry in $entrylist; do
03   target=$(readlink "$entry")
04   echo "broken link: from $entry to $target"
05 done

If you now call the script from Listing 4, the output is reduced to the single broken reference in the example project directory (Listing 5).

Listing 5

Output from find-broken-links2.sh

$ ./find-broken-links2.sh .
broken link: from ./project/version2/data/dataset3 to project/version1/data/dataset3

Python Script

If you don't like to use the shell for programming, Python may be a better option. Listing 6 shows a Python script that is very similar in operation to Listing 1. It uses functions from the two standard modules os and sys. Line 1 imports them into the current namespace. Lines 3 to 19 define a function named walk() that walks the directory passed in as the top parameter and checks all entries in it. The call to walk() is made in line 22, after previously evaluating the directory passed in as a parameter in line 21.

Listing 6

find-broken-links.py

01 import os, sys
02
03 def walk(top):
04   try:
05     entries = os.listdir(top)
06   except OSError:
07     return
08
09   for name in entries:
10     path = os.path.join(top, name)
11     if os.path.isfile(path):
12       pass
13     if os.path.isdir(path):
14       walk(path)
15     if os.path.islink(path):
16       destination = os.readlink(path)
17       if not os.path.exists(path):  # exists() follows the link
18         print("broken link: from %s to %s" % (path, destination))
19   return
20
21 startingDir = sys.argv[1]
22 walk(startingDir)

First, the program creates and validates a directory listing (lines 4 to 7). In case of an error, the walk() function terminates here. Then the code checks each entry in the directory to see if it is a file (line 11), a directory (line 13), or a link (line 15). The routine skips files. For directories, the walk() function is called recursively, again with the directory name as parameter.

For a link, however, the readlink() function from the os module finds the target in line 16. Line 17 then calls os.path.exists(), which follows the link. If it returns False, the target is missing, the link is broken, and the function prints a message to that effect. After checking all the entries in the directory, the function returns to the call point. If you call the script in the directory tree in my example, you will see the same two entries as in Listing 2.
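If you prefer to report every broken link only once, as find does in Listing 3, the standard library's os.walk() can do the traversal. The sketch below is an alternative to Listing 6, not a corrected version of it, and the function name find_broken_links() is made up for this example. Because os.walk() does not descend into symbolic links to directories by default, a link such as current is checked but not followed, which also avoids endless recursion through circular links.

```python
import os


def find_broken_links(top):
    """Return (link, target) pairs for every dangling symlink below top."""
    broken = []
    # followlinks=False (the default) keeps os.walk() out of symlinked
    # directories, so each link is visited exactly once.
    for dirpath, dirnames, filenames in os.walk(top, followlinks=False):
        # Links to directories show up in dirnames, all other links in filenames.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append((path, os.readlink(path)))
    return sorted(broken)
```

Returning a list instead of printing directly makes it easy to count the results or feed them into a report.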

symlinks

The symlinks [2] tool is designed to clean up symbolic links, for example, by converting absolute links to relative links and removing broken links. There are two parameters, -r and -v, that let you tell symlinks to recursively search a directory structure and output detailed information about the links.

Listing 7 shows the call for the example project directory: symlinks finds one link that it classifies as broken ("dangling") and two relative links. In terms of runtime, I saw no significant difference compared with Listing 1 and Listing 3. To filter out only the broken links, just combine the symlinks call with egrep (Listing 8); on systems where egrep is deprecated, grep -E does the same job.

Listing 7

symlinks

$ symlinks -rv .
dangling: /home/frank/project/version2/data/dataset3 -> project/version1/data/dataset3
relative: /home/frank/project/old -> project/version1
relative: /home/frank/project/current -> project/version2

Listing 8

symlinks and egrep

$ symlinks -rv . | egrep "^dangling:"
dangling: /home/frank/project/version2/data/dataset3 -> project/version1/data/dataset3
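If you want to process the report in a script instead of on the command line, the output format is simple enough to parse. The following Python sketch groups the lines by their leading label. It assumes the "label: link -> target" layout seen in Listing 7 and that no file name contains the string " -> ". The function name parse_symlinks_output() is hypothetical and not part of the symlinks package.

```python
def parse_symlinks_output(text):
    """Group 'label: link -> target' lines by their label."""
    groups = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(": ")
        link, arrow, target = rest.partition(" -> ")
        if not sep or not arrow:
            continue  # skip lines that do not match the expected shape
        groups.setdefault(label, []).append((link, target))
    return groups


if __name__ == "__main__":
    sample = (
        "dangling: /home/frank/project/version2/data/dataset3"
        " -> project/version1/data/dataset3\n"
        "relative: /home/frank/project/old -> project/version1\n"
    )
    for link, target in parse_symlinks_output(sample).get("dangling", []):
        print(link, "is broken, target was", target)
```

In practice, you would feed the function the captured output of symlinks -rv rather than a hard-coded string.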

FSlint

The GUI-based FSlint [3] tool, which is based on the findbl command-line tool (in the fslint package), belongs in the same category as symlinks. If you call findbl without any other parameters (or -d), it searches the current directory for broken links and prints the matches one by one.

Listing 9 shows the result of the call, which is practically identical to the results in Listing 2 and Listing 8. The behavior of findbl becomes clear after a closer look: It is simply a shell script that relies on find for searching.

Listing 9

findbl

01 $ /usr/share/fslint/fslint/findbl .
02 project/version2/data/dataset3 -> project/version1/data/dataset3

rmlint and chase

I combined the rmlint [4] and chase [5] tools as a final option. (Shredder, rmlint's graphical front end, looked really great on the rmlint website, but I could not reproduce it on Debian GNU/Linux 11.)

Similar to FSlint, rmlint aims to find and clean up inconsistencies in entries in the filesystem, including detecting broken links. You can see the call to do this in line 1 of Listing 10.

Listing 10

rmlint and chase

01 $ rmlint -T bl -o pretty:stdout .
02 $ chase old
03 /project/version1

The -T switch lets you select what rmlint will look for; bl is the abbreviation for "broken links." The -o option determines the output format, and the pretty:stdout value gives you a prettified display. Figure 2 shows a sample call in which rmlint detects two broken links (and that's how it's supposed to be).

Figure 2: The rmlint search results for the example.

The chase tool also performs an exciting task: It tracks down the file to which a symbolic link actually points. If the target does not exist, chase signals an error with a return value of 1. Line 2 in Listing 10 shows the call for the old link from the example, and the result is the name of the target.
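For comparison, Python's standard library can follow a chain of links in a similar way. This sketch is not how chase is implemented; the function name chase() is borrowed only for readability. It returns None where the command-line tool would signal an error.

```python
from pathlib import Path


def chase(link):
    """Return the final target of link, or None if it cannot be resolved."""
    try:
        # strict=True raises if a path component is missing or the links loop.
        return Path(link).resolve(strict=True)
    except (OSError, RuntimeError):
        # Depending on the Python version, a loop raises RuntimeError or OSError.
        return None
```

Unlike readlink, which reports only the next hop, resolve() keeps going until it reaches something that is not a link.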

Avoiding Mistakes

How do you prevent the occurrence of broken links in the first place? Basically, the only advice here is to be more careful because (apart from the filesystem) there is no place where all the links are stored. I'm not aware of a service that checks in the background to make sure that links remain intact and warns you before you break a link.

Before you delete a file, it makes sense to check with any of the tools discussed in this article whether symbolic links to that file exist. If you only have access to find and readlink, follow the steps shown in Listing 11. The call lists both components, the link and the link target, side by side. To do this, find uses the -exec option twice: first to echo the name of the link, and then to display the link target determined by readlink.

Listing 11

find and readlink

$ find . -type l -exec echo -n {} "-> " ';' -exec readlink {} ';'
./project/version2/data/dataset3 -> project/version1/data/dataset3
./project/old -> project/version1
./project/current -> project/version2
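To answer the more specific question of which links point to one particular file, you can compare the resolved path of every link with the resolved path of the file. The following Python sketch does this; the function name links_pointing_to() and its two parameters are assumptions for illustration.

```python
import os


def links_pointing_to(target, root):
    """Return all symlinks below root that resolve to target."""
    wanted = os.path.realpath(target)
    hits = []
    # os.walk() does not descend into symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # realpath() follows the whole chain, so indirect links count too.
            if os.path.islink(path) and os.path.realpath(path) == wanted:
                hits.append(path)
    return sorted(hits)
```

If the returned list is not empty, deleting the file would leave those links dangling, so you can decide to update or remove them first.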

Keep in mind that symbolic links can cross filesystem boundaries, so a link on one mounted filesystem may point to a file on another. To be thorough, your only option is to check everything that is mounted in the filesystem.

Conclusions

Reliably detecting broken links in file structures involves some overhead, but it can be done. The tools discussed here will help you handle this situation with ease.

The Author

Frank Hofmann works on the road, preferably in Berlin, Geneva, and Cape Town, as a developer, trainer, and author. He is coauthor of the book Debian Package Management (http://www.dpmb.org/index.en.html).