Process structured text files with Miller

Ordering and Sorting

You can use the Miller commands group-by, group-like, head, join, rename, reorder, sample, sort, tail, and uniq to sort and group your data.

The example from Figure 7 demonstrates the difference between the commands group by and join (Listing 3, lines 1 and 2). group by groups identical lines; join returns only matching lines. With join, you need to specify the -u option in order to handle arbitrarily sorted files.

Figure 7: The group by and join commands let you summarize lines with the same content.

Listing 3

group by and join

$ mlr --csv --rs lf group-by Name csv3.txt csv4.txt
$ mlr --csv --rs lf join -u -j Name -f csv3.txt csv4.txt
$ mlr --csv --rs lf head -n 1 csv3.txt
$ mlr --csv --rs lf tail -n 1 csv3.txt

Keep in mind that both methods require you to specify the columns on which the output is based. Miller always outputs the headers. You need to know that the headers are present for statements that reference specific line numbers, such as the head and tail commands in the last two lines of Listing 3, the result of which appears in Figure 8. If the headers interfere with downstream processing, you might be better off with using the classic Unix head and tail commands instead of Miller.

Figure 8: Miller always outputs the headers.

If you using rename to change the name of the column, you will need to enter the identifier in the following form: oldname,newname. Renaming works for several identifiers at the same time, but sometimes the command looks confusing, because you need to specify the names without breaks, separated only by commas.

Make sure the number of old and new column identifiers gives you an even number – if so, you have not forgotten anything. The command does not change the field names in the file header; the new names only appear in the output. The first line of Listing 4 shows an example of renaming, where the --opprint option gives the output a cleaner appearance (Figure 9).

Listing 4


$ mlr --csv --rs lf --opprint rename name,last_name,amount,annual_charge csv3.txt
$ mlr --csv --rs lf reorder -f amount,name,first_name csv3.txt
$ mlr --csv --rs lf sort -f name,first_name csv3.txt csv4.txt
$ mlr --csv --rs lf --opprint uniq -c -g name,first_name,amount csv3.txt csv4.txt
Figure 9: If necessary, you can change the names of the columns and their order. In this case, the --opprint option provides a clearer overview.

To change the column position, use the reorder command. If you specify all existing column names (Listing 4, line 2), the output occurs in the desired order. Miller otherwise puts the changes at the end or beginning.

Using the sort and uniq commands, you can influence the sequence of records or lines. Whereas sort and group-by output equivalent lines in succession, uniq summarizes them (Listing 4, lines 3 and 4).

For the example with uniq, I used the -c option, which gives you a frequency count of the identical rows (Figure 10).

Figure 10: If you need the records in a particular order, without duplicates, use the Miller commands sort and uniq.

Miller Computes

For calculations and statistical analyses, Miller offers the commands stats1, step, and top. You can use stats1 to perform statistical evaluations. The commands in Listing 5 determine the total, the number, the average, the minimum, and the maximum of the amount column; Figure 11 shows the lines of code with the results.

Listing 5


$ mlr --csv --rs lf stats1 -a sum -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf stats1 -a count -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf stats1 -a mean -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf stats1 -a min -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf stats1 -a max -f amount csv3.txt csv4.txt
Figure 11: Miller can count and perform simple statistical calculations.

Using step gives you evaluations for each line. The accrued sum per line is given by rsum. You can output the difference between two consecutive lines with delta and the ratio with ratio.

counter gives you a running count of the lines. from-first outputs the difference between the first and the nth line. For examples of such commands, see Listing 6 and Figure 12.

Listing 6

More Stat Tricks

$ mlr --csv --rs lf --opprint step -a rsum -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf --opprint step -a delta -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf --opprint step -a ratio -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf --opprint step -a counter -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf --opprint step -a from-first -f amount csv3.txt csv4.txt
Figure 12: Using Miller operators for linewise calculations.

top lets you determine the highest numeric value in a column. To output the full line with the result, use the -a option. --min gives you the minimum instead. If you want to output several lines by value, use -n lines (Listing 7); Figure 13 shows the process flow.

Listing 7

High and Low

$ mlr --csv --rs lf top -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf top -a -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf top --min -f amount csv3.txt csv4.txt
$ mlr --csv --rs lf --opprint top -n 3 -a -f amount csv3.txt csv4.txt@LI:§§nonumber
Figure 13: Output the highest or lowest value in a column.

Descriptive Data

Use the bar command to visualize numeric values. The options --connect (start display), --hi (maximum value), and -w (number of positions in display) make sure the results fit on the screen or paper (Listing 8). Figure 14 shows examples of simple charts.

Listing 8

Graphing with bar

$ mlr --csv --rs lf --opprint bar --lo 0 --hi 100 -f amount csv3.txt
$ mlr --csv --rs lf --opprint bar -c "*" --lo 20 --hi 50 -w 50 -f amount csv3.txt csv4.txt
Figure 14: The bar command creates simple bar charts from the data.

Depending on the application, you can output the column either as a graphic with bar or with its numeric value, but not both. To get around this problem, use the data twice with different names for the columns.

Listing 9 shows an example that determines the usage of the folders in the home directory and outputs the output file usage.txt twice, once in the usage(MB) column, and additionally in the value column. A different separator is also used for the fields. Then, the script uses Miller to evaluate the resulting file. An example of the output file is shown in Listing 10.

Listing 10

Output of Disk Usage


Listing 9

Disk Usage

#! /bin/sh
# Determine disk usage...
echo "VZ-Name:usage(MB):value" > /home/tmp/belegung.txt "
cd /home
for i in $(ls -1 /home); do
  # ... and save as CSV
  k=$(du -shm $i | cut -f1)
  echo $i:$k:$k >> /home/tmp/usage.txt
# Format output
mlr --csv --ifs : --rs lf --opprint bar -c "*" --lo 0 --hi 500 -w 50 -f value /home/tmp/usage.txt

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Statistics with gawk

    With very little overhead, you can access statistics on the spread of COVID-19 using gawk scripts and simple shell commands.

  • Zenity Dialogs

    The Zenity command-line utility lets you create simple dialog boxes with your own data or with the output of utilities and applications.

  • Command Line: sort

    sort helps you organize file lists and program

    output. And if you like, you can even use this small

    but powerful tool to merge and sort multiple files.

  • Tool Tips

    We test Yuck, Uftpd, Guncat, Kiwix, Miller, and Debian Package Search.

  • gdu, godu, duf

    Three modern tools, gdu, godu, and duf, make the task of checking the utilization level of hard disks easier thanks to fast execution speed and a good graphical implementation.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Find SysAdmin Jobs