How to Do Math in Bash
Bash: “In the beginning… was the command line.” Years ago, we didn’t have fancy frameworks that handled our distributed computing for us, or applications that could read files intelligently and give us accurate results. If we did, the technology was very expensive or only worked for a small problem set, very few people had access to it, and it was mostly proprietary.
For newcomers to the world of data science, you might have used the command line for a small number of things. Maybe you moved a file from one place to another using mv, or read a file using cat. Or you might never have used the command line at all, or at least not for data science. In this article, we hope to show you some tools and ways to perform math in bash.
Why bash?
We’re focusing on Bourne-again shell (bash) for multiple reasons. First, it’s the most popular shell and you can find it everywhere. In fact, for the majority of Linux distributions, bash is the default shell. It’s a great first shell to learn and very easy to work with. There are a number of examples and resources available to help you with bash if you ever get stuck. From a bare-metal installation in a data center to an instance running in the cloud, bash is there, installed, and waiting for input.
There are a number of other shells you can choose from, such as the Z shell (zsh). The Z shell is fairly new (and by new I mean released in 1990, which is new in shell land) and provides a number of powerful features. Other notable shells are tcsh, ksh, and fish. The C shell (tcsh), the Korn shell (ksh), and the Friendly Interactive Shell (fish) are still widely used today. FreeBSD has made tcsh its default shell for the root user, and ksh is still used on a lot of Solaris operating systems. Fish is also a great starter shell, with a lot of features that help the user navigate the shell without feeling lost.
While these shells are still very powerful and stable, we will be using bash, because we want consistency across multiple platforms and to help you learn a very active and popular shell that’s been around for 30 years.
Math in bash itself
Bash itself is able to do simple integer arithmetic. When a little more capability is required, two command-line tools, bc and awk, are capable of doing many types of calculations. There are at least three different ways to accomplish simple arithmetic in bash.
Using let
You can use the let command to do simple bash arithmetic:
$ let x=1
$ echo $x
1
$ let x=$x+1
$ echo $x
2
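let also accepts richer arithmetic expressions; here is a minimal sketch (quoting the expression lets you use spaces and operators such as * without escaping):
$ let x=10
$ let "x = x * 4"
$ echo $x
40
$ let x++
$ echo $x
41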
Basic arithmetic
You can do addition, subtraction, multiplication (be sure to escape the * operator with \*), and integer division with the expr command:
expr 1 + 2
3
expr 3 \* 10
30
The numbers must be separated by spaces.
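Spacing matters: if you leave the spaces out, expr treats the whole expression as a single string and simply prints it back, and keep in mind that division is integer-only. A quick sketch:
expr 10 / 3
3
expr 1+2
1+2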
Double-parentheses
Similar to let, you can do simple integer arithmetic in bash using doubled parentheses:
a=$((1 + 2))
echo $a
3
((a++))
echo $a
4
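The double-parentheses syntax supports the usual integer operators; here is a short sketch of a few more of them (modulo, exponentiation, and a compound assignment):
echo $((17 % 5))
2
echo $((2 ** 10))
1024
b=7
((b += 3))
echo $b
10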
To see the full range of operations available in the shell, check out the GNU reference page at https://www.gnu.org/software/bash/manual/html_node/Shell-Arithmetic.html.
bc, the Unix basic calculator
bc is a calculator scripting language. Scripts in bc can be executed with the bc command. Imagine a test.bc file contains the following code:
scale = 2;
(10.0*2+2)/7;
That means you can run bc like this:
cat test.bc | bc
3.14
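You don’t need a separate script file for quick calculations; you can also pipe an expression straight into bc, for example:
echo "scale=4; 22/7" | bc
3.1428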
bc can do far more than just divide two numbers. It’s a fully fledged scripting language on its own, and you can do arbitrarily complex things with a bc script. A bc script might be the ending point of a pipeline of data, where, initially, the data files are massaged into a stream of data rows, and then a bc script is used to compute the values we’re looking for. Let’s illustrate this with a simple example.
In this example, we need to take a data file and compute the average of the second number in each row, as well as the sum of the fourth number in each row. Say we also have a bc function to compute something interesting on these two numbers, such as a harmonic mean. We can use awk to output the numbers into a bc script and then feed the result into bc using a pipe.
So, say our bc function to compute the harmonic mean of two numbers looks like this:
scale=5;
define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); }
We can use awk to find the two numbers, construct the bc script, and then pipe it to bc to execute:
awk '{s+=$2 ; f+=$4}END{print "scale=5;\n define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); } \n harmonic(",s/NR,",",f,")"}' data.txt | bc
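To make the pipeline concrete, here is a small, hypothetical data.txt (whitespace-separated, with the numbers of interest in the second and fourth fields) and the bc script that the awk command generates from it:
cat data.txt
a 2 x 10
b 4 x 20
c 6 x 30
awk '{s+=$2 ; f+=$4}END{print "scale=5;\n define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); } \n harmonic(",s/NR,",",f,")"}' data.txt
scale=5;
 define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); }
 harmonic( 4 , 60 )
Piping that generated script into bc then evaluates harmonic(4, 60), which comes out to roughly 7.5.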
See the bc documentation at https://www.gnu.org/software/bc/manual/html_mono/bc.html for more things you can do with bc.
Math in (g)awk
awk (including the GNU implementation, gawk) is designed for streaming text processing, data extraction, and reporting. A large percentage of practical statistics is made up of counting things in specific ways, and this is one of the things awk excels at. Tallying totals, histograms, and grouped counts are all very easy in awk.
An awk program is structured as a set of patterns that are matched, and actions to take when those patterns are matched:
pattern {action}
pattern {action}
pattern {action}
…
For each record (usually each line of text passed to awk), each pattern is tested to see whether the record matches, and if so, the action is taken. Additionally, each record is automatically split into a list of fields by a delimiter. The default action, if none is given, is to print the record. The default pattern is to match everything. There are two special patterns, BEGIN and END, which are matched only before any records are processed, or after, respectively.
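As a minimal sketch of this structure, assume the same small, whitespace-separated data.txt used in the bc example earlier: the pattern selects records whose second field is greater than 3, the action counts them, and the END block reports the result.
awk '$2 > 3 {count++} END {print count, "of", NR, "records have a 2nd field above 3"}' data.txt
2 of 3 records have a 2nd field above 3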
The power of awk lies in its variables: variables can be used without a declaration. There are some special variables already available to you that are useful for math:
$0: The text of the entire record.
$1, $2, …: The text of the 1st, 2nd, etc. fields in the record.
NF: The number of fields in the current record.
NR: The current count of records (equal to the total number of records in the END step).
Additionally, you can assign values to your own variables. awk natively supplies variables that can hold strings, integers, floating-point numbers, regular expressions, and associative arrays.
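As a quick sketch of these variables in action (again assuming the hypothetical data.txt with a numeric fourth field), you can accumulate a sum in a variable of your own and combine it with NR to compute a mean:
awk '{total += $4} END {print "sum:", total, "mean:", total/NR}' data.txt
sum: 60 mean: 20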
As an example, say we want to count the word frequency in the reviews of our test data. Run this code:
zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz | tail -n +2 | head -n 10000 | cut -f14 | awk 'BEGIN {FS="[^a-zA-Z]+"}; {for (i=1;i<NF;i++) words[$i] ++}; END {for (i in words) print words[i], i}' | head
It will print a count followed by each word, one pair per line.
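Because awk iterates over an associative array in arbitrary order, you can sort the counts numerically to surface the most frequent words; here is a sketch using the same pipeline:
zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz | tail -n +2 | head -n 10000 | cut -f14 | awk 'BEGIN {FS="[^a-zA-Z]+"}; {for (i=1;i<NF;i++) words[$i]++}; END {for (i in words) print words[i], i}' | sort -rn | head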
Say we’d like to compute a histogram of the star values of the reviews. This is also very easy with awk:
zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz | tail -n +2 | cut -f8 | awk '{star[$0]++}; END {for (i in star) print i,star[i]}'
The preceding code prints each star rating alongside its count of reviews; in this dataset, four- and five-star reviews dominate. Besides counting, awk is also great for manipulating the format of strings.
This article showed how the command line offers several options for doing arithmetic and other mathematical operations. Simple arithmetic and grouped tallies can be performed using bash itself, bc, or awk. If you found it interesting, you can explore Hands-On Data Science with the Command Line, which covers big data processing and analytics at speed and scale, and will help you learn how to speed up your workflow and perform automated tasks using command-line tools.