Prev: Functions, Up: Index, Next: Redirections

The right tool for the job

While Bash alone is a very powerful tool, and we have already seen that in many cases solutions written in pure Bash outperform chains of small UNIX commands, it still should not be treated as a general-purpose programming language, simply because it was not designed to be one. For large and complex programs, languages like C++, Java or C# are most often a better choice. And even when you know Bash is a perfect fit for the given requirements, if performance is not a key factor, it is usually better to stick with the good old UNIX commands rather than re-invent the wheel.

The true purpose of Bash is to be the glue between all the tiny tools that do just one thing and do it well. Still, choosing the proper tool for a given task is not always trivial. Consider the following script:

sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY LINE SUM

    SUM=0
    while read -r LINE; do
        USERNAME=$(echo "${LINE}" | cut -d: -f1)
        SALARY=$(echo "${LINE}" | cut -d: -f3)
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM+=${SALARY}))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}

Let’s see how it performs against a file with 5k lines:

time sum_of_salaries_for_user jsmith salaries_5k.csv

real    0m16,115s
user    0m15,684s
sys     0m3,881s

We can certainly improve it a bit. Let’s start by reducing the number of subprocesses:

sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY LINE SUM

    SUM=0
    while read -r LINE; do
        USERNAME=$(cut -d: -f1 <<< "${LINE}")   # redirection
        SALARY=$(cut -d: -f3 <<< "${LINE}")     # instead of a pipe
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM+=${SALARY}))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
time sum_of_salaries_for_user jsmith salaries_5k.csv

real    0m14,035s
user    0m11,486s
sys     0m3,361s
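A herestring feeds a string directly to a command’s standard input, so only the cut process itself gets forked, whereas an echo | cut pipeline spawns an extra subshell for its left-hand side. A minimal side-by-side sketch (the record contents are invented):

```shell
LINE="jsmith:Engineering:5000"

# Pipe: an extra subshell is forked for the echo side, plus cut itself
USERNAME=$(echo "${LINE}" | cut -d: -f1)

# Herestring: cut is the only external process started
USERNAME=$(cut -d: -f1 <<< "${LINE}")
echo "${USERNAME}"
```

Both produce the same result; the herestring version simply does it with one process fewer per field extracted, which is where the saving above comes from.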

We could also get rid of the external cut command altogether:

sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY TMP LINE SUM

    SUM=0
    while read -r LINE; do
        USERNAME=${LINE%%:*}        # skip columns 2+
        TMP=${LINE#*:}              # skip the first column
        TMP=${TMP#*:}               # skip the second column
        SALARY=${TMP%%:*}           # skip columns 4+
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM+=${SALARY}))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
time sum_of_salaries_for_user jsmith salaries_5k.csv

real    0m0,280s
user    0m0,134s
sys     0m0,146s
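The expansions used above do the splitting entirely inside the shell: ${VAR%%pattern} removes the longest suffix matching the pattern, while ${VAR#pattern} removes the shortest matching prefix. A standalone illustration with an invented record:

```shell
LINE="jsmith:Engineering:5000:full-time"

FIRST=${LINE%%:*}     # longest ':*' suffix removed  -> jsmith
REST=${LINE#*:}       # shortest '*:' prefix removed -> Engineering:5000:full-time
REST=${REST#*:}       # one more column removed      -> 5000:full-time
THIRD=${REST%%:*}     # longest ':*' suffix removed  -> 5000

echo "${FIRST} ${THIRD}"
```

No process is forked at any point, which is why this version is two orders of magnitude faster than the cut-based ones.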

That’s already a major improvement, but let’s try to squeeze out even more.

sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY FIELDS SUM

    SUM=0
    # Split into an array while reading
    while IFS=':' read -r -a FIELDS; do
        USERNAME=${FIELDS[0]}
        SALARY=${FIELDS[2]}
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM+=${SALARY}))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
time sum_of_salaries_for_user jsmith salaries_5k.csv

real    0m0,168s
user    0m0,067s
sys     0m0,101s
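The read -a trick splits the whole line into array elements in a single built-in call: the IFS=':' assignment applies only to that one read invocation, and -a fills the named array with the resulting fields. A small standalone sketch (record contents invented):

```shell
line="jsmith:Engineering:5000"

# IFS=':' is scoped to this single read; -r keeps backslashes literal;
# -a stores the split fields in the FIELDS array
IFS=':' read -r -a FIELDS <<< "${line}"

echo "user:   ${FIELDS[0]}"
echo "salary: ${FIELDS[2]}"
```

Compared with the previous version, one built-in call replaces four parameter expansions per line, which accounts for the further speed-up.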

So let’s try with something bigger:

time sum_of_salaries_for_user jsmith salaries_500k.csv

real    0m16,620s
user    0m7,951s
sys     0m8,668s

With this we’ve hit the limits of what we can do in pure Bash. Let’s compare that with a tool that is highly optimized for exactly this kind of work (i.e. filtering, projection, aggregation):

sum.awk:

BEGIN { sum = 0 }
$1 == "jsmith" { sum += $3 }
END { print sum }

time awk -F: -f sum.awk < salaries_500k.csv

real    0m0,152s
user    0m0,143s
sys     0m0,009s
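One thing worth noting: sum.awk hardcodes the user name, but awk’s -v option lets the script take it as a parameter, so the same one-liner works for any user. A sketch against a tiny invented data file (the names and amounts are made up):

```shell
# Build a small sample data file in the same user:dept:salary layout
printf '%s\n' \
    'jsmith:Engineering:5000' \
    'akowalski:Sales:4000' \
    'jsmith:Engineering:5500' > salaries_sample.csv

# The same aggregation, parameterized with -v instead of a hardcoded name
awk -F: -v user=jsmith '$1 == user { sum += $3 } END { print sum }' salaries_sample.csv
```

For jsmith this prints 10500 — the sum of the two matching rows.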

And it works even for much bigger data sets (82 million records, 4.6 GB):

time awk -F: -f sum.awk < salaries_82m.csv

real    0m18,991s
user    0m17,945s
sys     0m1,045s

AWK is also a very powerful tool in its own right: a separate programming language designed strictly for processing structured streams of text. Yet its most common use in shell scripts seems to be something like this:

ps aux | grep jsmith | awk '{ print $2 }' | xargs kill

which is far below its true potential.
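For instance, awk can do the matching itself, folding the grep stage into the same program. A sketch on canned ps-like output (PIDs and commands invented, so the example is safe to run):

```shell
# Sample 'ps aux'-style output: user, PID, CPU, command
PS_OUTPUT='root        1  0.0 /sbin/init
jsmith   4242  1.3 /usr/bin/foo
jsmith   4243  0.1 /usr/bin/bar'

# Filtering and projection in a single awk program:
echo "${PS_OUTPUT}" | awk '$1 == "jsmith" { print $2 }'
```

In a real pipeline this replaces the grep | awk pair, and matching exactly on the first field also avoids the classic pitfall of grep matching its own process entry or unrelated substrings elsewhere on the line.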
