How to Do Math in Bash

“In the beginning… was the command line.” Years ago, we didn’t have fancy frameworks that handled distributed computing for us, or applications that could read files intelligently and give us accurate results. Where such technology did exist, it was very expensive, worked only for a small set of problems, was mostly proprietary, and was available to very few people.

If you are new to the world of data science, you might have used the command line for only a small number of things. Maybe you moved a file from one place to another using mv or read a file using cat. Or you might never have used the command line at all, or at least not for data science. In this article, we show some tools and techniques for doing math in bash.

Why bash?

We’re focusing on Bourne-again shell (bash) for multiple reasons. First, it’s the most popular shell and you can find it everywhere. In fact, for the majority of Linux distributions, bash is the default shell. It’s a great first shell to learn and very easy to work with. There are a number of examples and resources available to help you with bash if you ever get stuck. From a bare-metal installation in a data center to an instance running in the cloud, bash is there, installed, and waiting for input.

There are a number of other shells you can choose from, such as the Z shell (zsh). The Z shell is fairly new (and by new I mean released in 1990, which is new in shell land) and provides a number of powerful features. Other notable shells are tcsh, ksh, and fish. The C Shell (tcsh), the Korn Shell (ksh), and the Friendly Interactive Shell (fish) are still widely used today. FreeBSD has made tcsh its default shell for the root user, and ksh is still used on a lot of Solaris systems. Fish is also a great starter shell, with a lot of features that help the user navigate the shell without feeling lost.

While these shells are still very powerful and stable, we will be focusing on using bash, as we want to focus on consistency across multiple platforms and help you learn a very active and popular shell that’s been around for 30 years.

Math in bash itself

Bash itself is able to do simple integer arithmetic, and there are at least three different ways to do it, each covered below. When a little more capability is required, two command-line tools, bc and awk, are capable of doing many types of calculations.

Using let

You can use the let command to do simple bash arithmetic:

$ let x=1
$ echo $x
1
$ let x=$x+1
$ echo $x
2
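As a small extension of the example above, let also accepts quoted expressions and compound assignments. The variable names y and z below are only illustrative:

```shell
let "y = 3 * 4"   # quoting allows spaces and an unescaped *
let y+=2          # compound assignment: y is now 14
let "z = y % 5"   # remainder of 14 divided by 5
echo "$y $z"      # prints: 14 4
```

One thing to keep in mind: let returns a non-zero exit status when the last expression evaluates to 0, which matters in scripts that stop on errors.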

Basic arithmetic

With the expr command, you can do addition, subtraction, multiplication (be sure to escape the * operator as \* so the shell does not expand it), and integer division:

expr 1 + 2
3
expr 3 \* 10
30

The numbers and operators must be separated by spaces, because expr expects each one as a separate argument.
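A short sketch of how expr results can be captured in variables with command substitution. The variable names are made up for illustration:

```shell
total=$(expr 7 + 5)       # addition
quotient=$(expr 7 / 2)    # integer division drops the fractional part
remainder=$(expr 7 % 2)   # remainder
product=$(expr 6 \* 7)    # the * must be escaped
echo "$total $quotient $remainder $product"   # prints: 12 3 1 42
```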

Double-parentheses

Similar to let, you can do simple integer arithmetic in bash using doubled parentheses:

a=$((1 + 2))
echo $a 
((a++))
echo $a

3
4

To see the full range of operations available in the shell, check out the GNU reference page at https://www.gnu.org/software/bash/manual/html_node/Shell-Arithmetic.html.
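A few more of the shell's arithmetic operators in action, as a rough sketch. The variable names are only illustrative:

```shell
quotient=$(( 7 / 2 ))    # integer division truncates
remainder=$(( 7 % 3 ))   # remainder
power=$(( 2 ** 10 ))     # exponentiation
n=5
(( n *= 3 ))             # compound assignment, no $ needed inside (( ))
echo "$quotient $remainder $power $n"   # prints: 3 1 1024 15

# (( )) can also serve as a numeric test in a condition
if (( n > 10 )); then
  echo "n is greater than 10"
fi
```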

bc, the Unix basic calculator

bc is a calculator scripting language. Scripts in bc can be executed with the bc command. Imagine a test.bc file contains the following code:

scale = 2;
(10.0*2+2)/7;

That means you can run bc like this:

cat test.bc | bc
3.14
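If you don't want a separate script file, the same expression can be piped to bc as a string. This is a minimal sketch that assumes bc is installed; the variable name ratio is just for illustration:

```shell
# scale sets the number of digits kept after the decimal point
ratio=$(echo "scale=2; (10.0*2+2)/7" | bc)
echo "$ratio"               # prints: 3.14

# For comparison, bash's own arithmetic truncates to an integer
echo $(( (10*2+2)/7 ))      # prints: 3
```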

bc can do far more than just divide two numbers. It’s a fully-fledged scripting language on its own and you can do arbitrarily complex things with a bc script. A bc script might be the ending point of a pipeline of data, where, initially, the data files are massaged into a stream of data rows, and then a bc script is used to compute the values we’re looking for. Let’s illustrate this with a simple example.

In this example, we need to take a CSV data file, compute the average of the second number in each row, and compute the sum of the fourth number in each row. Say we also have a bc function that computes something interesting from these two numbers, such as a harmonic mean. We can use awk to write the numbers into a bc script and then feed the result into bc using a pipe.

So, say our bc function to compute the harmonic mean of two numbers looks like this:

scale=5; 
define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); }

We can use awk to find the two numbers, construct the bc script, and then pipe it to bc to execute. Note that awk splits fields on whitespace by default; for a comma-separated file, pass -F',' to awk:

awk '{s+=$2 ; f+=$4}END{print "scale=5;\n define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); } \n harmonic(",s/NR,",",f,")"}' data.txt | bc
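To see what awk actually hands to bc, it can help to drop the final | bc and look at the generated script. The sketch below uses a made-up, whitespace-separated sample file in place of data.txt:

```shell
# Hypothetical sample data: numbers of interest are in columns 2 and 4
data=$(mktemp)
printf 'a 2 x 10\nb 4 y 20\nc 6 z 30\n' > "$data"

# Same awk program as above, without "| bc", so the bc script is printed
script=$(awk '{s+=$2 ; f+=$4}END{print "scale=5;\n define harmonic(x,y){ return 2.0/((1.0/x) + (1.0/y)); } \n harmonic(",s/NR,",",f,")"}' "$data")
echo "$script"
rm -f "$data"
```

For this sample, the last line of the generated script is harmonic( 4 , 60 ): the mean of the second column (12/3) and the sum of the fourth column.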

See the bc documentation at https://www.gnu.org/software/bc/manual/html_mono/bc.html for more things you could do with bc.

Math in (g)awk

awk (including the GNU implementation, gawk) is designed for streaming text processing, data extraction, and reporting. A large percentage of practical statistics is made up of counting things in specific ways, and this is one of the things awk excels at. Tallying totals, histograms, and grouped counts are all very easy in awk.

An awk program is structured as a set of patterns that are matched and actions to take when those patterns are matched:

pattern {action}
pattern {action}
pattern {action}
…

For each record (usually each line of text passed to awk), each pattern is tested to see whether the record matches, and if so, the action is taken. Additionally, each record is automatically split into a list of fields by a delimiter. The default action, if none is given, is to print the record. The default pattern is to match everything. There are two special patterns, BEGIN and END, which are matched only before any records are processed, or after, respectively.
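A minimal sketch of this structure, with one BEGIN rule, one ordinary pattern, and one END rule. The input numbers are made up for illustration:

```shell
report=$(printf '3\n8\n5\n12\n' | awk '
  BEGIN  { print "values greater than 4:" }        # runs once, before any record
  $1 > 4 { print; n++ }                            # runs for each matching record
  END    { print n, "of", NR, "records matched" }  # runs once, after the last record
')
echo "$report"
```

This prints the header, then 8, 5, and 12, and finally "3 of 4 records matched".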

The power of awk lies in its variables: variables can be used without a declaration. There are some special variables already available to you that are useful for math:

$0: The text of the entire record.
$1, $2, … : The text of the 1st, 2nd, etc fields in the record.
NF: The number of fields in the current record.
NR: The current count of records (equal to the total number of records in the END step)
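As a quick illustration of these variables on made-up input, the following prints the record number, the field count, and the first and last fields of each line ($NF is the value of the last field, since NF holds the field count):

```shell
fields=$(printf 'a b c\nd e\n' | awk '{ print NR, NF, $1, $NF }')
echo "$fields"
# prints:
# 1 3 a c
# 2 2 d e
```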

Additionally, you can assign values to your own variables. awk natively supplies variables that can hold strings, integers, floating-point numbers, and regular expressions, as well as associative arrays.
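Here is a small sketch of an associative array holding floating-point totals per key. The fruit names and amounts are invented sample data, and the output is piped through sort because awk does not guarantee the order of for (k in array):

```shell
totals=$(printf 'apples 1.5\npears 2.25\napples 3\n' |
  awk '{ total[$1] += $2 } END { for (k in total) printf "%s %.2f\n", k, total[k] }' |
  sort)
echo "$totals"
# prints:
# apples 4.50
# pears 2.25
```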

As an example, say we want to count the word frequency in the reviews of our test data. Run this code:

zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz | tail -n +2 | head -n 10000 | cut -f14 | awk 'BEGIN {FS="[^a-zA-Z]+"}; {for (i=1;i<=NF;i++) if ($i != "") words[$i]++}; END {for (i in words) print words[i], i}' | head

It will produce these results:

[Figure: Counting the word frequency in the reviews of our test data]
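Because awk's for (i in words) loop visits the keys in no particular order, a plain head shows an arbitrary slice of the counts. One possible refinement is to sort numerically in descending order before taking the top entries. This sketch uses a made-up sentence instead of the review data:

```shell
text='the cat and the dog and the bird'
top=$(echo "$text" |
  awk 'BEGIN {FS="[^a-zA-Z]+"} {for (i=1;i<=NF;i++) if ($i != "") words[$i]++} END {for (w in words) print words[w], w}' |
  sort -rn | head -n 2)
echo "$top"
# prints:
# 3 the
# 2 and
```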

Say we’d like to compute a histogram of the star values of the reviews. This is also very easy with awk:

zcat amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz | tail -n +2 | cut -f8 | awk '{star[$0]++}; END {for (i in star) print i,star[i]}'

The preceding code produces this:

[Figure: Computing a histogram of the star values of the reviews]

We can see that four- and five-star reviews dominate this dataset. Besides counting, awk is also great for manipulating the format of strings.

This article showed how the command line offers several options for doing arithmetic and other mathematical operations. Simple arithmetic and grouped tallies can be performed using bash itself or awk. If you found it interesting, you can explore Hands-On Data Science with the Command Line, which covers big data processing and analytics at speed and scale and will help you learn how to speed up your work and automate tasks using command-line tools.