Implementing fast queries for local files in Go

Programming Snapshot – Go

To find files quickly in the deeply nested subdirectories of his home directory, Mike whips up a Go program to index file metadata in an SQLite database.

Those were the days when you could simply enter a line of code or a variable name into Google's code search page (Figure 1), and the data crawler would reveal in next to no time the open source repositories in which it appeared. Unfortunately, Google discontinued the service several years ago, probably because it failed to generate sufficient profit.

Figure 1: This is what Google Code Search looked like in 2007.

But not all is lost, because the GitHub Codesearch [1] project, with its indexer built in Go, at least lets you browse locally available repositories, index them, and then search for code snippets in a flash. Its author, Russ Cox, then an intern at Google, explained later how the search works [2].

How about using a similar method to create an index of files below a start directory to perform quick queries such as: "Which files have recently been modified?" "Which are the biggest wasters of space?" Or "Which file names match the following pattern?"

Unix filesystems store metadata in inodes, which reside in flattened structures on disk that cause database-style queries to run at a snail's pace. To inspect a file's metadata, run the stat command on it and examine the file size and timestamps, such as the time of the last modification (Figure 2).

Figure 2: Inode metadata of a file, here determined by stat, can be used to build an index.

Newer filesystems like ZFS or Btrfs take a more database-like approach in the way they organize the files they contain but do not go far enough to be able to support meaningful queries from userspace.

Fast Forward Instead of Pause

For example, if you want to find all files over 100MB on the disk, you can do this with a find call like:

find / -type f -size +100M

If you are running the search on a traditional hard disk, take a coffee break. Even on a fast SSD, you need to prepare yourself for long search times in the minute range. The reason for this is that the data is scattered in a query-unfriendly way across the sectors of the disk.

To speed this up, I want an indexer to collect the metadata at regular intervals (e.g., during quiet periods in the early morning hours) and store them in an SQLite database that I can run a query tool against. The updatedb utility on Linux does similar things so that the user can search for file names at lightning speed with locate.

Because the "Programming Snapshot" series never shies away from new challenges, and this particular edition was being written during a vacation in Hawaii with me in a good mood and plenty of time on my hands, I used the Go programming language for the implementation – just as in the Codesearch project. An SQLite database is used as the index; it stores all data in a single file but packs enough punch to support optimized and reasonably fast queries, even for medium-sized data volumes.

To do this, Listing 1 [3] creates an SQLite database named files.db with a table whose fields are path (full file path), modified (timestamp of the file's last modification), and size (file size); the indexer later inserts one row per file found in the directory tree.

Listing 1

create.go

01 package main
02
03 import (
04   "database/sql"
05   _ "github.com/mattn/go-sqlite3"
06 )
07
08 func main() {
09   db, err := sql.Open("sqlite3", "./files.db")
10   if err != nil {
11     panic(err)
12   }
13
14   _, err = db.Exec("CREATE TABLE files " +
15       "(path text, modified int, size int)")
16   if err != nil {
17     panic(err)
18   }
19 }

To install the required SQLite driver, the go get tool included in every Go installation retrieves the driver's source code from GitHub, resolves any dependencies, and compiles and installs the result locally:

go get github.com/mattn/go-sqlite3

Scripts like Listing 1 can then load the driver in their import sections under the full name starting with github.com. I had to employ one minor quirk to prevent the rather adamant Go compiler from exploding, though: It apparently felt that loading the driver package imported code that does not appear to be referenced anywhere in Listing 1. Go does not tolerate unnecessary dependencies, but in this case, the driver is actually required, and it is indeed used under the covers in the database/sql libraries. To calm down the Go compiler, line 5 imports the driver under the name _ (underscore), which turns off the is-it-being-used check.

The start of each Go program is a package main with a function main(), as in line 8 of Listing 1. The Open() method there uses the database/sql package, Go's standard database interface (with the SQLite driver under the hood), to set up a connection to the files.db flat file database; the actual connection is only established lazily, once the first query runs.

If an error occurs, the return value err contains a value other than nil, and line 11 sounds the alarm with panic() and abruptly terminates the program flow. Note that sql.Open() does little more than validate its arguments; if the files table already exists from a previous run, the error only surfaces in line 16, when the CREATE TABLE statement fails.

Old School Error Checking

The SQL command CREATE TABLE in line 14 creates a new database table. Here, too, line 16 checks whether an error was returned and terminates with a message if necessary. Go still believes in old-school checking of individual return values instead of letting the programmer throw exceptions that bubble up for processing at a higher level.

Also, take a closer look at the different assignments in lines 9 and 14, which first use := and then =. The first is Go's assignment with a declaration. If a variable has not yet been declared, the short form declaration with := does the trick.

Watch Out!

If the declaration list of variables with a := assignment contains only variables that have already been declared, though, Go stubbornly refuses to compile the code at all. In this case, the assignment must be made with an =, as in line 14.

Also characteristic of Go are functions with several return values; often an error variable is among them. If it is set to nil, everything is fine. If a return value is of no interest, such as the first return value of db.Exec() in line 14, Go developers type an underscore in place of the unnecessary variable. Picking up only one return value from a function that returns two would trigger a compiler error.

The code is easily compiled by entering:

go build create.go

A fraction of a second later a create executable is available; it already contains all the dependencies and libraries, so it can be easily copied to another machine using the same operating system without a pile of dependencies having to be resolved. The resulting binary is about 4.5MB, which is not exactly lightweight either.

Closure as a Bridge

Listing 2 shows the indexer, which uses the Walk() function from the standard path/filepath package to navigate through a file hierarchy, starting with the start directory specified by the user on the command line. Arguments passed to the program end up in the os.Args slice, much like in C, with the program name as the first element and the call parameters in the following ones.

Listing 2

index.go

01 package main
02
03 import (
04   "database/sql"
05   _ "github.com/mattn/go-sqlite3"
06   "os"
07   "path/filepath"
08 )
09
10 type Walker struct {
11   Db *sql.DB
12 }
13
14 func main() {
15   if len(os.Args) != 2 {
16     panic("usage: " + os.Args[0] +
17           " start_dir")
18   }
19   root := os.Args[1]
20
21   db, err :=
22       sql.Open("sqlite3", "./files.db")
23   checkErr(err)
24   w := &Walker{
25     Db: db,
26   }
27
28   err = filepath.Walk(root, w.Visit)
29   checkErr(err)
30
31   db.Close()
32 }
33
34 func (w *Walker) Visit(path string,
35           f os.FileInfo, err error) error {
36   stmt, err := w.Db.Prepare(
37       "INSERT INTO files VALUES(?,?,?)")
38   checkErr(err)
39
40   _, err = stmt.Exec(
41       path, f.ModTime().Unix(), f.Size())
42   checkErr(err)
43
44   return nil
45 }
46
47 func checkErr(err error) {
48   if err != nil {
49     panic(err)
50   }
51 }

Browsing a file tree isn't rocket science, but the Walk() call in line 28 communicates with the program via the Visit() callback function. The problem here is that no database handle exists within the scope of this callback starting in line 34, which it needs to make the necessary changes to the database. The solution to this dilemma is to give Visit() closure-like access to the handle.

To do this, Visit() in line 34 defines a so-called receiver between the func keyword and the function name, thus telling Go to connect the Walker data structure (line 10), which contains a database handle, with the Visit() function. This allows Visit() to access the handle via the w variable used for defining the receiver. With the handle, it inserts new records into the database.

The actual work of setting up a database query is done by the Prepare() method, which prepares an SQL command and returns a statement handle. Line 40 then fires the Exec method at the latter and passes the parameters to be stored to the SQL command: the path to the file, its last modification timestamp, and its size.

To save the program from checking after each function call whether the error variable err is nil (meaning everything is okay), line 47 defines a function named checkErr() that does just this and aborts the program with panic() if something unforeseen happens.

Finders, Keepers

After the indexer had finished its work, the database table files on my computer held more than a million entries, as shown in Figure 3. The reason for the high number of files was probably the numerous cloned Git repositories and Snapshot articles from more than 20 years. With this data in the files.db SQLite database, an SQLite client can now fire off quick queries and determine, for example, which files in my home directory have changed recently.

Figure 3: After the indexer has finished, there are more than one million file entries in the flat file SQLite database.

To do this, Listing 3 connects to the SQLite database and issues a SELECT command that queries all rows in the table, sorts them in descending order of the timestamp in the modified column, and then outputs the first 10 matches.

Listing 3

latest.go

01 package main
02
03 import (
04   "database/sql"
05   "fmt"
06   _ "github.com/mattn/go-sqlite3"
07 )
08
09 func main() {
10   db, err :=
11     sql.Open("sqlite3", "./files.db")
12   checkErr(err)
13
14   rows, err := db.Query("SELECT path, " +
15     "modified FROM files " +
16     "ORDER BY modified DESC LIMIT 10")
17   checkErr(err)
18
19   var path string
20   var mtime string
21
22   for rows.Next() {
23     err = rows.Scan(&path, &mtime)
24     checkErr(err)
25     fmt.Printf("%s %s\n", path, mtime)
26   }
27 }
28
29 func checkErr(err error) {
30   if err != nil {
31     panic(err)
32   }
33 }

The rows.Next() call in line 22 works its way through the matches step by step, and rows.Scan() retrieves the first two column values of each match, assigning them to the path and mtime variables passed in as pointers; both were previously declared as strings. Go supports pointers, but it does not leave memory management up to the user, and it does not go up in smoke like C if an address is wrong because of a bug; instead, it quits with helpful error messages.

Which files in my home directory take up the most space? Listing 4 finds this out quickly by sorting all entries in descending order (ORDER BY size DESC) using the SELECT query from line 25 and LIMITing the output to a maximum number of matches. The user defines this number with the --max-files parameter at the command line, and Go provides a convenient interface for parsing the parameters of a command with the flag package.

Listing 4

max-size.go

01 package main
02
03 import (
04   "database/sql"
05   "flag"
06   "fmt"
07   "os"
08   "strconv"
09   _ "github.com/mattn/go-sqlite3"
10 )
11
12 func main() {
13   db, err :=
14     sql.Open("sqlite3", "./files.db")
15   checkErr(err)
16
17   max_files := flag.Int("max-files", 10,
18     "max number of files")
19
20   flag.Parse()
21   if len(flag.Args()) != 0 {
22     panic("usage: " + os.Args[0])
23   }
24
25   rows, err := db.Query("SELECT path," +
26     "size FROM files " +
27     "ORDER BY size DESC LIMIT " +
28     strconv.Itoa(*max_files))
29   checkErr(err)
30
31   var path string
32   var size string
33
34   for rows.Next() {
35     err = rows.Scan(&path, &size)
36     checkErr(err)
37     fmt.Printf("%s %s\n", path, size)
38   }
39 }
40
41 func checkErr(err error) {
42   if err != nil {
43     panic(err)
44   }
45 }

The flag package first expects the declaration of the variable that will hold the value passed in at the command line (max_files in line 17). The call to flag.Int() specifies that only integers are valid values and sets a default of 10. Then flag.Parse() (line 20) analyzes the existing command-line parameters and, if the user has set --max-files, assigns this value to the variable that the max_files pointer references.

The Itoa() function from the strconv package converts the integer behind the dereferenced *max_files pointer back into a string, and line 28 splices it into the SQL command's LIMIT clause. The advantage of this conversion is that a genuine integer ends up in the query rather than an arbitrary character string that could be abused for SQL injection attacks.
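Alternatively, the driver itself can take care of safe quoting: Query() accepts bind values after the SQL string, and SQLite also binds parameters inside a LIMIT clause. A sketch of how lines 25 to 28 of Listing 4 might look with a ? placeholder instead of string concatenation (the strconv detour then becomes unnecessary):

```go
rows, err := db.Query(
  "SELECT path, size FROM files "+
    "ORDER BY size DESC LIMIT ?",
  *max_files)
```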

In comparison, Listing 5 shows that a database client in a scripting language like Python is easier to program. Since SQLite also features a Python driver, the same database created by Go earlier can be used by Listing 5 without further ado. It digs out all database entries whose file paths correspond to a predefined pattern. It expects a regular expression at the command line, stuffs it into an SQL query, and outputs the matches.

Listing 5

like.py

01 #!/usr/bin/env python3
02 import sys
03 import sqlite3
04
05 try:
06   _, pattern = sys.argv
07 except ValueError:
08   raise SystemExit(
09      "usage: " + sys.argv[0] + " pattern")
10
11 conn = sqlite3.connect('files.db')
12 c = conn.cursor()
13 like = "%" + pattern + "%"
14 for row in c.execute('SELECT path,size FROM files WHERE path LIKE ?', [like]):
15   print(row)

More Luxury, More Lines

Go's type checking, and the fact that code does not run inside a bytecode interpreter but as a compiled binary with more elegant memory management than a C or C++ program, have their price: Go requires more detailed instructions and generally more lines of code. Go programs run faster than Python scripts but, as is so often the case, the bottleneck in the use case at hand is not processing instructions but communicating with external systems. In this case, database calls consume most of the compute time; whether the program code itself runs 10 or 100 percent faster is largely irrelevant.

However, the compact binary format with embedded libraries and no dependency worries is a big advantage, and probably one of the reasons Go has become the first choice for all types of system programming tasks.

Infos

  1. Google Code Search: https://github.com/google/codesearch
  2. Russ Cox, "Regular Expression Matching with a Trigram Index," 2012: https://swtch.com/~rsc/regexp/regexp4.html
  3. Listings for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/215/

The Author

Mike Schilli works as a software engineer in the San Francisco Bay Area, California. Each month in his column, which has been running since 1997, he researches practical applications of various programming languages. If you email him at mschilli@perlmeister.com, he will gladly answer any questions.