Process structured text files with Miller
Ordering and Sorting
You can use the Miller commands group-by
, group-like
, head
, join
, rename
, reorder
, sample
, sort
, tail
, and uniq
to sort and group your data.
The example from Figure 7 demonstrates the difference between the commands group by
and join
(Listing 3, lines 1 and 2). group by
groups identical lines; join
returns only matching lines. With join
, you need to specify the -u
option in order to handle arbitrarily sorted files.
Listing 3
group by and join
$ mlr --csv --rs lf group-by Name csv3.txt csv4.txt $ mlr --csv --rs lf join -u -j Name -f csv3.txt csv4.txt $ mlr --csv --rs lf head -n 1 csv3.txt $ mlr --csv --rs lf tail -n 1 csv3.txt
Keep in mind that both methods require you to specify the columns on which the output is based. Miller always outputs the headers. You need to know that the headers are present for statements that reference specific line numbers, such as the head
and tail
commands in the last two lines of Listing 3, the result of which appears in Figure 8. If the headers interfere with downstream processing, you might be better off with using the classic Unix head
and tail
commands instead of Miller.
If you using rename
to change the name of the column, you will need to enter the identifier in the following form: oldname,newname
. Renaming works for several identifiers at the same time, but sometimes the command looks confusing, because you need to specify the names without breaks, separated only by commas.
Make sure the number of old and new column identifiers gives you an even number – if so, you have not forgotten anything. The command does not change the field names in the file header; the new names only appear in the output. The first line of Listing 4 shows an example of renaming, where the --opprint
option gives the output a cleaner appearance (Figure 9).
Listing 4
Renaming
$ mlr --csv --rs lf --opprint rename name,last_name,amount,annual_charge csv3.txt $ mlr --csv --rs lf reorder -f amount,name,first_name csv3.txt $ mlr --csv --rs lf sort -f name,first_name csv3.txt csv4.txt $ mlr --csv --rs lf --opprint uniq -c -g name,first_name,amount csv3.txt csv4.txt
To change the column position, use the reorder
command. If you specify all existing column names (Listing 4, line 2), the output occurs in the desired order. Miller otherwise puts the changes at the end or beginning.
Using the sort
and uniq
commands, you can influence the sequence of records or lines. Whereas sort
and group-by
output equivalent lines in succession, uniq
summarizes them (Listing 4, lines 3 and 4).
For the example with uniq
, I used the -c
option, which gives you a frequency count of the identical rows (Figure 10).
Miller Computes
For calculations and statistical analyses, Miller offers the commands stats1
, step
, and top
. You can use stats1
to perform statistical evaluations. The commands in Listing 5 determine the total, the number, the average, the minimum, and the maximum of the amount
column; Figure 11 shows the lines of code with the results.
Listing 5
Statistics
$ mlr --csv --rs lf stats1 -a sum -f amount csv3.txt csv4.txt $ mlr --csv --rs lf stats1 -a count -f amount csv3.txt csv4.txt $ mlr --csv --rs lf stats1 -a mean -f amount csv3.txt csv4.txt $ mlr --csv --rs lf stats1 -a min -f amount csv3.txt csv4.txt $ mlr --csv --rs lf stats1 -a max -f amount csv3.txt csv4.txt
Using step
gives you evaluations for each line. The accrued sum per line is given by rsum
. You can output the difference between two consecutive lines with delta
and the ratio with ratio
.
counter
gives you a running count of the lines. from-first
outputs the difference between the first and the nth line. For examples of such commands, see Listing 6 and Figure 12.
Listing 6
More Stat Tricks
$ mlr --csv --rs lf --opprint step -a rsum -f amount csv3.txt csv4.txt $ mlr --csv --rs lf --opprint step -a delta -f amount csv3.txt csv4.txt $ mlr --csv --rs lf --opprint step -a ratio -f amount csv3.txt csv4.txt $ mlr --csv --rs lf --opprint step -a counter -f amount csv3.txt csv4.txt $ mlr --csv --rs lf --opprint step -a from-first -f amount csv3.txt csv4.txt
top
lets you determine the highest numeric value in a column. To output the full line with the result, use the -a
option. --min
gives you the minimum instead. If you want to output several lines by value, use -n lines
(Listing 7); Figure 13 shows the process flow.
Listing 7
High and Low
$ mlr --csv --rs lf top -f amount csv3.txt csv4.txt $ mlr --csv --rs lf top -a -f amount csv3.txt csv4.txt $ mlr --csv --rs lf top --min -f amount csv3.txt csv4.txt $ mlr --csv --rs lf --opprint top -n 3 -a -f amount csv3.txt csv4.txt@LI:§§nonumber
Descriptive Data
Use the bar
command to visualize numeric values. The options --connect
(start display), --hi
(maximum value), and -w
(number of positions in display) make sure the results fit on the screen or paper (Listing 8). Figure 14 shows examples of simple charts.
Listing 8
Graphing with bar
$ mlr --csv --rs lf --opprint bar --lo 0 --hi 100 -f amount csv3.txt $ mlr --csv --rs lf --opprint bar -c "*" --lo 20 --hi 50 -w 50 -f amount csv3.txt csv4.txt
Depending on the application, you can output the column either as a graphic with bar
or with its numeric value, but not both. To get around this problem, use the data twice with different names for the columns.
Listing 9 shows an example that determines the usage of the folders in the home directory and outputs the output file usage.txt
twice, once in the usage(MB)
column, and additionally in the value
column. A different separator is also used for the fields. Then, the script uses Miller to evaluate the resulting file. An example of the output file is shown in Listing 10.
Listing 10
Output of Disk Usage
VZ-Name:usage(MB):value jefe:108:108 tmp:1:1
Listing 9
Disk Usage
#! /bin/sh # Determine disk usage... echo "VZ-Name:usage(MB):value" > /home/tmp/belegung.txt " cd /home for i in $(ls -1 /home); do # ... and save as CSV k=$(du -shm $i | cut -f1) echo $i:$k:$k >> /home/tmp/usage.txt done # Format output mlr --csv --ifs : --rs lf --opprint bar -c "*" --lo 0 --hi 500 -w 50 -f value /home/tmp/usage.txt
« Previous 1 2 3 Next »
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Linux Servers Targeted by Akira Ransomware
A group of bad actors who have already extorted $42 million have their sights set on the Linux platform.
-
TUXEDO Computers Unveils Linux Laptop Featuring AMD Ryzen CPU
This latest release is the first laptop to include the new CPU from Ryzen and Linux preinstalled.
-
XZ Gets the All-Clear
The back door xz vulnerability has been officially reverted for Fedora 40 and versions 38 and 39 were never affected.
-
Canonical Collaborates with Qualcomm on New Venture
This new joint effort is geared toward bringing Ubuntu and Ubuntu Core to Qualcomm-powered devices.
-
Kodi 21.0 Open-Source Entertainment Hub Released
After a year of development, the award-winning Kodi cross-platform, media center software is now available with many new additions and improvements.
-
Linux Usage Increases in Two Key Areas
If market share is your thing, you'll be happy to know that Linux is on the rise in two areas that, if they keep climbing, could have serious meaning for Linux's future.
-
Vulnerability Discovered in xz Libraries
An urgent alert for Fedora 40 has been posted and users should pay attention.
-
Canonical Bumps LTS Support to 12 years
If you're worried that your Ubuntu LTS release won't be supported long enough to last, Canonical has a surprise for you in the form of 12 years of security coverage.
-
Fedora 40 Beta Released Soon
With the official release of Fedora 40 coming in April, it's almost time to download the beta and see what's new.
-
New Pentesting Distribution to Compete with Kali Linux
SnoopGod is now available for your testing needs