A download manager for the shell

Convenient Downloads

Lead Image © xyzproject, 123RF.com

A few lines of shell code and the Gawk scripting language make downloading files off the web a breeze.

Almost everyone downloads data from the Internet. In most cases, you use the web browser as the interface. Usually, websites offer files for download that you can click on individually in the browser, and the files then end up in your Downloads/ folder by default. If you need to download a large number of files, manually selecting the storage location can quickly test your patience. Wouldn't it be easier if you had a tool to handle this job?

In this article, I will show you how to program a convenient download manager using Bash and a scripting language like Gawk. As an added benefit, you can add features that aren't available in similar programs, such as the ability to monitor the disk capacity or download files to the right folders based on their extensions.

Planning

Programming a download manager definitely requires good planning and a sensible strategy. Obviously, the download manager should download files, but it should also warn you when the hard disk threatens to overflow due to the download volume. Because most files have an extension (e.g., .jpg, .iso, or .epub), the download manager can also sort your files into appropriately named folders based on their extensions.

To build your own download manager, you will need two command-line programs that do not come preinstalled by default on modern Linux distributions: xclip and lynx. The xclip command-line tool gives you access to the clipboard, while lynx functions as a web browser for the terminal. Lynx also comes with several command-line options that make it usable as a link spider with many ways of displaying links.

During planning, the basic thing to consider is how to pass the download page's URL and its links into the terminal; xclip and lynx handle this. The project also includes functions for specific subtasks, such as capturing and selecting the links. Even if you don't yet know what these functions will contain, I recommend creating a script with empty functions as placeholders for the time being (see Listing 1).

Listing 1

Script Framework Without Functions

#!/bin/bash
function capture () { :; }
function splitter () { :; }
function rename () { :; }
function download () { :; }
function menu () { :; }

The framework in Listing 1 serves as a basis for building the individual functions one by one. Focus on a single function first rather than on the big picture. For each function, you need to consider, separately, what tasks the function handles, how many and what kind of parameters the function needs, and whether there are any return values.

For convenience, you can outsource parts of the script and then include the source code using dot or source notation such as

. outsourcedFunction

By making these functions as abstract as possible and keeping them independent of the script, I was able to include a function for renaming files in other scripts without needing to modify them.

In abstract terms, a function takes either no parameters, one parameter, or multiple parameters and eventually returns something, regardless of whether you use the function in this script or in a completely different context.
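For example, if the rename function (shown later in Listing 6) lived in its own file, any script could pull it in. This is just a minimal sketch; the path functions/rename.sh is an example, not part of the article's download:

#!/bin/bash
# pull in the outsourced rename function (same as: source functions/rename.sh)
. functions/rename.sh
# photo.jpg already exists, so rename appends an underscore
new_name=$(rename photo.jpg "photo.jpg")
echo ${new_name}    # prints: photo_.jpg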

Fundamentals

Listing 2 shows a couple of basic things you need to do for the download manager script: declare basic variables that will determine the program flow later on and store file types in an array.

Listing 2

Basic Variables

VERBOSE=true
LYNX="$(which lynx)"
XCLIP="$(which xclip)"
download_directory=~/Downloads
# file extensions the download manager looks for, grouped by type
filetypes=(jpg jpeg png tiff gif bmp swf svg)
filetypes+=(mp4 mp3 mpg mpeg vob m2p ts mov avi wmf asf mkv webm 3gp flv)
filetypes+=(gzip zip tar gz tar.gz 7zip)
filetypes+=(pdf doc xlsx odt ods epub txt mobi azw azw3)
filetypes+=(iso dmg exe deb rpm)
filetypes+=(java kt py sh zsh)
# append uppercase variants of all extensions (GNU sed's \U operator)
filetypes+=($(echo ${filetypes[@]} | sed -r 's/.+/\U&/'))
# free space on /home in kilobytes (fourth column of df output)
free=$(df /home | gawk 'NR == 2{print $4}')

Because you can download a variety of file types off the web, it is up to you which file types you add to the array and how you structure it. However, I recommend keeping some kind of order, for example, by putting the graphics files in one line of your script and video or other file types in another. The next-to-last line appends uppercase variants of all the extensions, and the last line records the available storage space in the free variable.
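You can see what the uppercase line does by trying it in an interactive shell with a shortened array (GNU sed's \U operator handles the conversion):

$ filetypes=(jpg png)
$ filetypes+=($(echo ${filetypes[@]} | sed -r 's/.+/\U&/'))
$ echo ${filetypes[@]}
jpg png JPG PNG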

Listing 3 contains two functions that the download manager uses to output warnings and error messages whenever the VERBOSE variable is set to true. Also notice the two if statements that check for the presence of lynx and xclip. If either tool is missing, the script outputs an error message and terminates with exit 1. If the download directory does not exist, the last line creates it. If overly verbose warnings or error messages bother you, set VERBOSE to false.

Listing 3

Outputting Error Messages

function warn () {
  if ${VERBOSE}; then
    echo "WW: ${1}";
  fi;
}
function err () {
  if ${VERBOSE}; then
    echo "EE: ${1}";
  fi;
}
if [ -z "${LYNX}" ]; then
  err "Lynx not available on the system."
  err "Cancel."
  exit 1
fi
if [ -z "${XCLIP}" ]; then
  err "Xclip not available on the system."
  err "Cancel."
  exit 1
fi
[ ! -e "${download_directory}" ] && mkdir -p "${download_directory}"

Functions

It makes sense to follow the same approach for the functions as for the variables. You should design functions so that they work independently, both in this script and in any other.

If it is not immediately apparent from a function how many parameters it takes and what it returns, add a comment describing this. The capture function takes care of capturing the links from the web page and storing them. Listing 4 first creates three arrays that the functions fill with values; the first array stores the download links.

Listing 4

Functions

01 declare -a download_links
02 declare -a indexed_downloads
03 declare -a indexed_indexes
04
05 function capture () {
06   lynx_options="-dump -listonly -nonumbers" # further potential options
07   lynx_command="lynx $lynx_options $url"    # -hiddenlinks=[option], -image_links
08   grep_searchstring="http.+($(sed 's/ /|/g' <<<${filetypes[@]}))$"
09   grep_command="grep -Eoi $grep_searchstring"
10   download_links=($($lynx_command | $grep_command))
11   for x in ${download_links[@]}; do
12     file_size=$(wget --spider $x 2>&1 | gawk -F " " '/Length/{print $2}')
13     while true; do
14       [ -z ${indexed_downloads[$file_size]} ] &&
15       indexed_downloads[$file_size]=$x &&
16       break || (( file_size++ ))
17     done
18   done
19   indexed_indexes=(${!indexed_downloads[@]})
20 }

Because I want the script to arrange the downloads by size, wget uses the --spider option (in line 12) to discover the size without downloading anything. Then the indexed_downloads array captures each downloadable file, using the file size as the index and the download link as the value (line 15). Instead of the typical indexes (0, 1, 2, 3, and so on), the file sizes give you indexes of, say, 233, 1004, 780, and so on, which Bash prints in ascending order of size when listing all indexes. This happens in line 19, where the indexed_indexes array stores the file sizes.

Later, you will see that the potential downloads appear in ascending order of size. Occasionally, two files are exactly the same size; the while loop in lines 13 to 17 handles this problem. To keep the index unique, the script increases the stored size by one (virtual) byte in this case.
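A minimal sketch with made-up sizes shows the collision handling: the second 1,004-byte file moves up to index 1005.

declare -a demo
for size in 233 1004 1004; do
  while true; do
    # keep bumping the size by one byte until the index is free
    [ -z "${demo[$size]}" ] && demo[$size]="link_$size" && break || (( size++ ))
  done
done
echo ${!demo[@]}    # prints: 233 1004 1005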

Short and Painless

Listing 5 shows a function that sensibly splits up the download selection. After the downloads are listed, you make your selection by typing, for example, 1,2,3,4-10. In this case, the script takes the downloads from 1 to 10, expanding the selection into a list of individual numbers (1 2 3 4 ... 10).

Listing 5

Splitter Function

function splitter () {
  sed 's/,/\n/g' <<<$* | sed -r '/-/ s/([0-9]+)-([0-9]+)/seq \1 \2/e' | sort -nu
}

The splitter function replaces the commas with newlines and uses GNU sed's e flag to expand ranges such as 4-10 into complete number sequences by running seq; the result is a sorted list of unique selection numbers. You can use these selection numbers later on to find the downloads contained in the indexed_downloads array.

Keep in mind that the count for Bash arrays always starts at 0. For example, to select file number 5, you need to find it by querying indexed_indexes[4]. You can use the value stored there as the index into the indexed_downloads array to retrieve the associated download link.
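For example, once the arrays are populated, mapping selection number 5 to its download link looks like this (the variable names match the script; the number is just an example):

sel=5                               # the user picked download number 5
size=${indexed_indexes[sel-1]}      # fifth-smallest file size
echo ${indexed_downloads[$size]}    # the matching download link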

Listing 6 shows the function that renames files at download time. You pass in the base name of the URL as the first parameter. At this point, the script is already in the right directory for the file extension (e.g., jpg/). The remaining parameters, collected in line 3, list the files that are already in the directory before the download starts. The while loop in line 4 checks whether a file with the same name exists. If so, line 5 inserts an underscore (_) between the name and the dot that separates it from the file extension. This also tells you how many times the file has been renamed: If the name contains one underscore, the file was renamed once; if it contains two, the file was renamed twice; and so on.

Listing 6

Rename Function

01 function rename () {
02   filename=$1
03   other_filenames=`echo ${@:2}`
04   while grep -q -F "${filename}" <<<${other_filenames}; do
05     filename=$(sed -r 's/(.+)(\.)(.+)/\1_\2\3/' <<<${filename})
06   done
07   echo ${filename}
08 }
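A quick interactive check shows the renaming in action (the file names are made up): if photo.jpg and photo_.jpg already exist, the function needs two rounds to find a free name.

$ rename photo.jpg "photo.jpg photo_.jpg"
photo__.jpg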

At this point, you should debug the functions in detail by isolating them. For example, using sed and the command from Listing 7, you can write the warn and rename functions to a separate file named debug, which you can then run with

bash -x debug

Listing 7

Debug Function

$ sed -r -n '/function (warn|rename)/,/^}/p' downloader_optimized2.bash > debug

after appending a call to the function under test, with the appropriate parameters, to the end of the file.
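For example, you could append a test call like this before running the file through bash -x (the parameters are made up):

$ echo 'rename photo.jpg "photo.jpg"' >> debug
$ bash -x debug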

Listing 8 shows the download function, which first filters out the base name from the download link. To do this, it deletes all path information, as well as http://... or https://..., until only the actual file name remains. The script then finds the file extension and, if it does not already exist, creates a directory with this name. Then it changes to the directory and starts the download after running the rename function.

Listing 8

Download Function

function download () {
  name=$(basename "$1")
  suffix=$(cut -f 2 -d "." <<<"${name}")  # second dot-separated field (e.g., jpg)
  [ ! -e "${download_directory}/${suffix}" ] && mkdir "${download_directory}/${suffix}"
  cd "${download_directory}/${suffix}" && files_in_directory=$(ls)
  future_name=$(rename "${name}" ${files_in_directory})
  wget -O "${future_name}" "$1"
}

Listing 9 generates a menu that lists the available downloads. The function starts a loop that iterates over the indexed_downloads array and outputs, one line at a time, the size (the array index) and the base name of each download. At the end of the loop in line 6, everything is piped to gawk.

Listing 9

Menu Function

01 function menu () {
02   for index in ${!indexed_downloads[@]}; do
03     local base_name=$(basename ${indexed_downloads[$index]})
04     local size=${index}
05     echo "${size} ${base_name}"
06   done | gawk --assign free=${free} -F " " -f cutter.awk
07 }

Thanks to the -f cutter.awk option, gawk knows which AWK file to use as the program text. The call has an additional --assign free=${free} option, which ensures that the gawk script knows the free disk space previously determined in Bash. gawk then reads the file size and the base name from each line and formats both for output.
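A minimal sketch shows how --assign hands a shell value to gawk (the numbers here are made up):

$ echo "1024 example.jpg" | gawk --assign free=500000 '{print free, $1, $2}'
500000 1024 example.jpg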

Formatted Displays

The gawk script, cutter.awk, in Listing 10 starts with two functions I defined myself. The first function, cutter, truncates long base names for the display by cutting them into two parts and dropping three dots into the middle. The second function, separating_line, generates separator lines in the display to improve clarity for download pages with a large number of links.

Listing 10

cutter.awk

function cutter( word ){
  l = length(word)
  part1 = substr(word,1,8)
  part2 = substr(word,l-22)
  return part1"..."part2
}
function separating_line ( lesser_equal ) {
  for ( p = 0; p <= lesser_equal ; p++){
    printf "%s" (p == lesser_equal ? "\n" : "") ,"="
  }
}
BEGIN {
  i = 1
  printf "%8s %18s      %10s   %13s     %s\n", "Download", "Kilobytes", "Megabytes", "Gigabytes", "Filename"
  printf "%-5s %21.2f      %10.2f   %13.2f     %s\n", "Disc:", free, free/1024, free/(1024*1024),"home or /"
  separating_line(75)
}
{
  if ( length($2) > 40 ) {
    $2 = cutter($2)
  }
  printf "%2i => %21.2f   %13.2f   %13.2f     %s\n", i++, $1/1024, $1/(1024*1024), $1/(1024*1024*1024), $2
  total += $1
}
END {
  separating_line(75)
  printf "Totals: %19.2f   %13.2f   %13.2f     All downloads together\n", total/1024, total/(1024*1024), total/(1024*1024*1024)
}

The BEGIN block defines some basics as well as header formats. In addition, it shows the free space on the hard disk, or in your home directory, in the line following the header.

Finally, the main command block, which has no pattern specification and therefore runs for every input line, displays the download size (in kilobytes, megabytes, and gigabytes) from the first field and the base name from the second. This is useful if you are downloading smaller files, such as wallpapers, ebooks, or MP3 files – output in gigabytes only, for example, would not make much sense there. The block also adds up the total size of all downloads, which the END block finally outputs (Figure 1).

Figure 1: The formatted output shows the files provided by a website. To select a file, enter a number from the left.
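If you want to verify the unit conversion in isolation, a gawk one-liner with a made-up byte count does the trick:

$ gawk 'BEGIN{ bytes=1536000; printf "%.2f KB  %.2f MB\n", bytes/1024, bytes/(1024*1024) }'
1500.00 KB  1.46 MB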

Control

Finally, Listing 11 shows the main function that controls the entire program flow. As soon as you copy a URL from the browser by pressing Ctrl+C, xclip accesses it (line 1).

Listing 11

Main Function

01 url=$(xclip -o 2>/dev/null)
02
03 if [ -z $url ]; then
04   warn "No URL present."
05   warn "Use [Ctrl]+[C] to copy the URL from the browser to the clipboard."
06   exit 1
07 fi
08
09 capture
10
11 if [ ${#download_links[@]} -gt 0 ]; then
12   menu
13   read -p "Select files (Example: 1,2,3-8,10 ): " selection
14 else
15   warn "NO DOWNLOADS PRESENT" && exit 1
16 fi
17
18 declare -i total_size_downloads=0
19
20 for sel in $(splitter $selection); do
21   total_size_downloads+=${indexed_indexes[((sel - 1))]}
22   current_download=${indexed_downloads[${indexed_indexes[((sel - 1))]}]}
23   if [[ ${free}-5000 -lt ${total_size_downloads}/1024 ]]; then
24     warn "Not enough free disc space."
25     warn "Canceling ${current_download}."
26     exit 1
27   else
28     download $current_download
29   fi
30 done

If the graphical interface uses multiple clipboards (like FVWM), you need to use the -selection primary or -selection secondary option to explicitly specify which clipboard xclip should read. After capturing the clipboard, the script checks whether the url variable actually received the clipboard contents. If it has a length of zero, the program cancels the operation.
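For example, to read the primary X selection explicitly, line 1 of Listing 11 would change as follows (this is standard xclip syntax):

url=$(xclip -selection primary -o 2>/dev/null)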

Lines 11 to 16 then check whether any download links have been captured on the page and whether they can be output. If there are no files to download, the script also terminates at this point.

Finally, a for loop processes all the selected downloads, adding up their sizes and comparing the total against the free disk space. If the total size of the selected downloads would use up the free disk space (the script keeps back a reserve of around 5MB), the script terminates.

Because all you need to do is copy the URL of the download page from the browser's address bar, this script works with any web browser. The script retrieves the URL from the clipboard, scans the page for downloadable files, and shows them to you sorted by size. You can then conveniently choose which of them you want to download.

Conclusions

Generally speaking, this custom download manager offers a simple and reliable approach for conveniently downloading files off the web. You just need to copy the download page's URL to the clipboard and then select the desired files in the shell.

Because I designed the scripts to use a very structured approach, and the functions included here mostly work on their own, they can also be used in other shell programs. The scripts for this article (download_optimized2.bash) along with cutter.awk are available for download at [1].

The Author

Goran Mladenovic is a hobby developer and inventor, who believes programming is a passion.