Using Bash one-liners for stats

Bash Out Some Stats

© Photo by Carlos Muza on Unsplash

Article from Issue 294/2025
Author(s):

With just one line of Bash you can use tools like AWK and gnuplot to quickly analyze and plot your data.

Typically when I'm looking to do some data analysis, I'll import the data files into Pandas DataFrames or an SQL database. During a recent project, I was happily surprised to learn that I could do a lot of basic statistics with only one line of Bash code.

For simple applications, command-line tools such as sort and bc (the arbitrary precision calculator) can be used from Bash to find maximums, minimums, averages, and sums from arrays or columns of data (Listing 1).

Listing 1

Basic Stats Using sort and bc

$ # Basic stats using sort and bc
$ data=(3 4 18 7 2 19 15)
$
$ # Find the Max value in an array
$ printf "%s\n" "${data[@]}" | sort -n | tail -n 1
19
$ # Find the Min value in an array
$ printf "%s\n" "${data[@]}" | sort -n | head -n 1
2
$ # Sum up an array (the subshell keeps the IFS change local)
$ (IFS="+" ; bc <<<"${data[*]}")
68
$ # Average from an array, with 2 decimals
$ sum=$(IFS="+" ; bc<<<"${data[*]}")
$ bc  <<<"scale=2; ${sum}/${#data[@]}"
9.71

For CSV data files, a single line of Bash that combines AWK [1] and gnuplot [2] can be used to view statistics or graph a column of data (Figure 1).

Figure 1: Use Bash for stats and plotting data on a graph.

In this article, I will cover using AWK to filter and extract data from CSV files and then turn to gnuplot to gather statistics and present charts.

Mimicking SQL SELECT Statements

Both a Linux command-line tool and a programming language, AWK can be used for data extraction and reporting. AWK can work directly on CSV files and output results based on both column and row filtering conditions. The syntax for creating an SQL SELECT-style statement in AWK is:

awk -F, 'condition {print column_numbers}' filename

Figure 2 shows an example comparing an SQL SELECT statement with an equivalent AWK statement. The first parameter in the AWK line is -F,, which sets the field (column) separator to a comma. In AWK, the conditions (the equivalent of the WHERE clause) come first, followed by print to output the required columns.

Figure 2: Use AWK like an SQL SELECT statement.

Unlike SQL, AWK uses column numbers instead of column names, so $1 for the first column, $2 for the second column, and so on.
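As a concrete sketch of this pattern, the following mimics SELECT day, temp FROM readings WHERE temp > 5. The sample data, the day and temp column names, and the select_warm_days function name are all made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: day,temp (no header row)
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday,1.1' 'Tuesday,-3.6' 'Wednesday,9.81' 'Thursday,6.0' > "$csv"

# SQL:  SELECT day, temp FROM readings WHERE temp > 5;
# AWK:  the condition comes first, then print lists the columns
select_warm_days() {
  awk -F, '$2 > 5 {print $1, $2}' "$1"
}

select_warm_days "$csv"
```

With this sample data, only the Wednesday and Thursday rows are printed.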

While this wouldn't be my first choice, AWK can be used to do stats on a column of data. Listing 2 shows an example of how to get some basic stats on the second ($2) column of a CSV file, as well as some additional AWK features.

Listing 2

Basic Stats Using AWK

$ # Use AWK to get stats on a CSV file
$ cat numbers.csv
Monday, 1.1
Tuesday, -3.6
Wednesday, 9.81
Thursday, 6.0
$ # find a min, use a large starting value
$ awk -F, -v min=9999 '{if ($2<min) min=$2} END {print min}' numbers.csv
-3.6
$ # find a max, use a small starting value
$ awk -F, -v max=-9999 '{if ($2>max) max=$2} END {print max}' numbers.csv
9.81
$ # find the sum of column 2
$ awk -F, '{ sum += $2 } END {print sum }' numbers.csv
13.31
$ # find the average of column 2
$ awk -F, '{ sum += $2 } END {print sum/NR }' numbers.csv
3.3275

In the min and max calculations in Listing 2, the -v option defines a variable and gives it a starting value before any input is read. The if statement then checks and updates the variable row by row. The average calculation runs in two steps. The main block runs once per row and accumulates column $2 into a variable called sum. The END block runs once after the last row and prints sum divided by NR, AWK's built-in count of the rows read. More complex AWK scripts can also add a BEGIN block, which runs before any input is read.
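The sentinel starting values (9999 and -9999) give wrong answers if the data falls outside that range. One possible alternative, sketched below, seeds min and max from the first row and collects all three results in a single pass. The col_summary function name and its arguments are my own; the data is the numbers.csv content from Listing 2 written to a temporary file.

```shell
#!/usr/bin/env bash
# Same content as numbers.csv in Listing 2
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday, 1.1' 'Tuesday, -3.6' 'Wednesday, 9.81' 'Thursday, 6.0' > "$csv"

# col_summary FILE COLUMN (hypothetical helper)
col_summary() {
  awk -F, -v col="$2" '
    NR == 1 { min = max = $col + 0 }   # seed from the first row, no sentinel
    {
      val = $col + 0                   # adding 0 forces a numeric value
      if (val < min) min = val
      if (val > max) max = val
      sum += val
    }
    END { if (NR > 0) printf "min=%s max=%s mean=%.4f\n", min, max, sum / NR }
  ' "$1"
}

col_summary "$csv" 2
```

For the Listing 2 data, this prints min=-3.6 max=9.81 mean=3.3275, which matches the separate one-liners.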

The beauty of AWK is that it can filter or preprocess the data for other Bash commands. For example, AWK can be used to extract column $2 data from a CSV file, and then the results can be piped to sort and tail to find the maximum value:

$ awk -F, '{print $2}' numbers.csv | sort -n | tail -n1
9.81
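The same extract-then-sort pipeline can be stretched to a median, which needs sorted input. This is only a sketch: the col_median name is hypothetical, and the second AWK call holds the whole sorted column in memory, so it suits modest file sizes.

```shell
#!/usr/bin/env bash
# Same content as numbers.csv in Listing 2
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday, 1.1' 'Tuesday, -3.6' 'Wednesday, 9.81' 'Thursday, 6.0' > "$csv"

# col_median FILE COLUMN (hypothetical helper)
col_median() {
  awk -F, -v col="$2" '{print $col + 0}' "$1" | sort -n |
    awk '{ v[NR] = $1 }                # store the sorted values by position
         END {
           if (NR == 0) exit
           if (NR % 2) print v[(NR + 1) / 2]             # odd count: middle value
           else print (v[NR / 2] + v[NR / 2 + 1]) / 2    # even count: mean of the middle pair
         }'
}

col_median "$csv" 2
```

With four rows, the result is the mean of the two middle values, 1.1 and 6.0.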

Several command-line statistics tools are also available. The sta [3] tool is an excellent utility for finding basic stats on a column of data. Listing 3 uses AWK to send column $5 data to sta.

Listing 3

Using AWK with sta

$ # Use AWK with the sta utility
$ awk -F, '{print $5}' london_weather.csv | sta
N      min   max   sum     mean     sd      sderr
15336  -6.2  37.9  235987  15.3878  6.5555  0.0529
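If sta is not installed, the mean, standard deviation, and standard error can be approximated in AWK alone. The sketch below uses Welford's running-variance method, which is my substitution and not necessarily how sta works internally. It also assumes the sample (n-1) standard deviation, which may differ from sta's default. The col_spread name is hypothetical.

```shell
#!/usr/bin/env bash
# Same content as numbers.csv in Listing 2
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday, 1.1' 'Tuesday, -3.6' 'Wednesday, 9.81' 'Thursday, 6.0' > "$csv"

# col_spread FILE COLUMN (hypothetical helper)
col_spread() {
  awk -F, -v col="$2" '
    {
      x = $col + 0
      n++
      delta = x - mean
      mean += delta / n                # running mean
      m2 += delta * (x - mean)         # running sum of squared deviations
    }
    END {
      if (n < 2) exit 1                # sample sd needs at least two rows
      sd = sqrt(m2 / (n - 1))          # assumption: sample sd, not population
      printf "N=%d mean=%.4f sd=%.4f sderr=%.4f\n", n, mean, sd, sd / sqrt(n)
    }
  ' "$1"
}

col_spread "$csv" 2
```

The standard error here is the standard deviation divided by the square root of N, which is consistent with the sd and sderr columns in Listing 3.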

Now that you know how to filter and extract data from a CSV file, the next step is to use gnuplot to do some advanced statistics and charting.

gnuplot

Gnuplot's stats command can be used as a standalone tool or integrated with Bash commands. To use gnuplot with CSV files, the data separator will need to be set before the stats can be calculated:

$ gnuplot
gnuplot> set datafile separator ','
gnuplot> # Get stats on column 3 in a file
gnuplot> stats 'filename.csv' using 3

Gnuplot natively supports data filtering by rows and columns. However, the filtering syntax is not as user friendly or as complete as AWK. To pipe AWK results to gnuplot, you can use

awk -F, 'condition {print column}' filename | gnuplot -e 'stats "<cat" '

The gnuplot -e option executes a string of statements. The "<cat" parameter tells gnuplot to read its data from the output of the cat command, which in turn passes along the piped standard input.

Figure 3 shows a statistical example that compares similar AWK/gnuplot commands and results with an SQL statement. The gnuplot stats command returns a fairly complete list of calculations. To extract a specific value, give the results a variable name prefix with the name option; each statistic is then available as a variable of the form prefix_stat. For example, to get the median value of a column, you would use

gnuplot -e 'stats "<cat" name "TEMPS" nooutput; print TEMPS_median'

Figure 3: Get detailed stats with gnuplot.
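One practical wrinkle is that gnuplot's print command writes to standard error by default, so the value does not land in a Bash variable through plain command substitution. A possible workaround, sketched here, is to redirect print to standard output with set print "-". The col_median_gnuplot name and the COL prefix are my own choices, and the example assumes gnuplot is installed.

```shell
#!/usr/bin/env bash
# Same content as numbers.csv in Listing 2
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday, 1.1' 'Tuesday, -3.6' 'Wednesday, 9.81' 'Thursday, 6.0' > "$csv"

# col_median_gnuplot FILE COLUMN (hypothetical helper)
col_median_gnuplot() {
  awk -F, -v col="$2" '{print $col + 0}' "$1" |
    gnuplot -e 'set print "-"; stats "<cat" using 1 name "COL" nooutput; print COL_median'
}

# Only run the example if gnuplot is available
if command -v gnuplot >/dev/null 2>&1; then
  median=$(col_median_gnuplot "$csv" 2)
  echo "median: ${median}"
fi
```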

If two columns are passed to the stats command, calculations such as slope, intercept, and correlation will be returned.
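As a rough illustration of the two-column case, the sketch below uses the AWK row number (NR) as the x value and a CSV column as the y value, then prints the slope, intercept, and correlation. The col_trend name and the FIT prefix are hypothetical, set print "-" sends the output to standard output, and gnuplot is assumed to be installed.

```shell
#!/usr/bin/env bash
# Same content as numbers.csv in Listing 2
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday, 1.1' 'Tuesday, -3.6' 'Wednesday, 9.81' 'Thursday, 6.0' > "$csv"

# col_trend FILE COLUMN (hypothetical helper)
# Prints: slope intercept correlation
col_trend() {
  awk -F, -v col="$2" '{print NR, $col + 0}' "$1" |
    gnuplot -e 'set print "-"; stats "<cat" using 1:2 name "FIT" nooutput; print FIT_slope, FIT_intercept, FIT_correlation'
}

# Only run the example if gnuplot is available
if command -v gnuplot >/dev/null 2>&1; then
  col_trend "$csv" 2
fi
```

Using the row number as x only makes sense when the rows are evenly spaced in time, as in a daily log.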

Visualizing with gnuplot

Like the earlier stats example, one-line statements can be created that pipe AWK output to a gnuplot chart. The syntax for an AWK/gnuplot line chart is

awk -F, 'condition {print column}' csvfile | gnuplot -p -e 'plot "<cat" w l'

The gnuplot persist option, -p, keeps the plot window open after the statement is executed, and w l is shorthand for with lines, which draws a line chart.
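On a machine without a graphical display, such as a server reached over SSH, one option is gnuplot's dumb terminal, which draws the chart with ASCII characters in the terminal. The sketch below assumes gnuplot is installed; the plot_column_ascii name is hypothetical.

```shell
#!/usr/bin/env bash
# Same content as numbers.csv in Listing 2
csv=$(mktemp)
trap 'rm -f "$csv"' EXIT
printf '%s\n' 'Monday, 1.1' 'Tuesday, -3.6' 'Wednesday, 9.81' 'Thursday, 6.0' > "$csv"

# plot_column_ascii FILE COLUMN (hypothetical helper)
plot_column_ascii() {
  awk -F, -v col="$2" '{print $col + 0}' "$1" |
    gnuplot -e 'set terminal dumb; plot "<cat" with lines notitle'
}

# Only run the example if gnuplot is available
if command -v gnuplot >/dev/null 2>&1; then
  plot_column_ascii "$csv" 2
fi
```

The -p option is not needed here, because no plot window is opened.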

Figure 4 shows an example of an AWK/gnuplot call that creates a line chart. For comparison, the equivalent SQL SELECT statement with a DB Browser line plot is also shown.

Figure 4: Create complex AWK/gnuplot statements that are equivalent to SQL, but that can also plot data in graphs.

Gnuplot offers a good variety of chart types. For example, Figure 5 shows a box plot, which can help identify outlier data. In my project, I could see that July had some skewing of high temperature values.

Figure 5: Create a box plot to show data outliers.
