Prev: Functions, Up: Index, Next: Redirections
The right tool for the job
While Bash alone is a very powerful tool, and we’ve already seen that in many cases things written in Bash outperform small UNIX commands, it still should not be treated as a general-purpose programming language, simply because it is not designed to be one. For large and complex solutions, languages like C++, Java, or C# are most often a better choice. And even when you know Bash is a perfect fit for the given requirements, if performance is not a key factor, it is usually better to stick with the good old UNIX commands and not reinvent the wheel.
The true purpose of Bash is to be the glue for all the tiny tools that do just one thing and do it well. Still, choosing the proper tool for a given task is not always trivial. Consider the following script:
sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY LINE SUM
    SUM=0
    while read -r LINE; do
        USERNAME=$(echo "${LINE}" | cut -d: -f1)
        SALARY=$(echo "${LINE}" | cut -d: -f3)
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM += SALARY))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
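The exact layout of salaries_5k.csv is not shown, so let’s assume the minimum the script relies on: one colon-separated record per line, with the username in the first field and the salary in the third (the remaining fields here are made up for illustration):

jsmith:IT:5000:full-time
akowalski:HR:4200:part-time
jsmith:IT:5200:full-time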
Let’s see how it performs against a file with 5k lines:
time sum_of_salaries_for_user jsmith salaries_5k.csv
real 0m16,115s
user 0m15,684s
sys 0m3,881s
For sure, we can improve it a bit. Let’s start by reducing the number of subprocesses:
sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY LINE SUM
    SUM=0
    while read -r LINE; do
        USERNAME=$(cut -d: -f1 <<< "${LINE}")   # redirection
        SALARY=$(cut -d: -f3 <<< "${LINE}")     # instead of a pipe
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM += SALARY))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
time sum_of_salaries_for_user jsmith salaries_5k.csv
real 0m14,035s
user 0m11,486s
sys 0m3,361s
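Each side of a pipeline runs in its own subprocess, even for a builtin like echo, so the here-string saves one fork per extracted field. A quick check of the syntax (sample values as before):

$ cut -d: -f2 <<< "jsmith:IT:5000:full-time"
IT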
We could also get rid of the external cut command altogether:
sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY LINE TMP SUM
    SUM=0
    while read -r LINE; do
        USERNAME=${LINE%%:*}   # skip columns 2+
        TMP=${LINE#*:}         # skip the first column
        TMP=${TMP#*:}          # skip the second column
        SALARY=${TMP%%:*}      # skip columns 4+
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM += SALARY))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
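These expansions happen entirely inside Bash: # strips the shortest matching prefix, %% the longest matching suffix. Traced step by step on a sample line (field values assumed as before):

LINE="jsmith:IT:5000:full-time"
echo "${LINE%%:*}"   # jsmith            (drop everything from the first ':')
TMP=${LINE#*:}       # IT:5000:full-time (drop up to and including the first ':')
TMP=${TMP#*:}        # 5000:full-time
echo "${TMP%%:*}"    # 5000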
time sum_of_salaries_for_user jsmith salaries_5k.csv
real 0m0,280s
user 0m0,134s
sys 0m0,146s
That’s already a major improvement, but let’s try to squeeze out even more.
sum_of_salaries_for_user() {
    local USER="${1}"
    local DATAFILE="${2}"
    local USERNAME SALARY SUM
    local -a FIELDS
    SUM=0
    # Split into an array while reading
    while IFS=':' read -r -a FIELDS; do
        USERNAME=${FIELDS[0]}
        SALARY=${FIELDS[2]}
        if [[ ${USERNAME} = "${USER}" ]]; then
            ((SUM += SALARY))
        fi
    done < "${DATAFILE}"
    echo "${SUM}"
}
time sum_of_salaries_for_user jsmith salaries_5k.csv
real 0m0,168s
user 0m0,067s
sys 0m0,101s
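The IFS=':' prefix applies only to the read command itself and makes it split the line into array elements at the colons, with no subprocess involved at all. For example (sample values assumed):

$ IFS=':' read -r -a FIELDS <<< "jsmith:IT:5000:full-time"
$ echo "${FIELDS[0]} earns ${FIELDS[2]}"
jsmith earns 5000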
So let’s try it with something bigger:
time sum_of_salaries_for_user jsmith salaries_500k.csv
real 0m16,620s
user 0m7,951s
sys 0m8,668s
With this we’ve hit the wall of what we can do in pure Bash. Let’s compare that with the possibilities of a tool that is highly optimized for this particular kind of work (i.e. filtering, projection, aggregation). The whole job fits in a three-line AWK program, sum.awk:
BEGIN { sum = 0 }
$1 == "jsmith" { sum += $3 }
END { print sum }
time awk -F: -f sum.awk < salaries_500k.csv
real 0m0,152s
user 0m0,143s
sys 0m0,009s
And it works even for much bigger data sets (82 million records, 4.6 GB):
time awk -F: -f sum.awk < salaries_82m.csv
real 0m18,991s
user 0m17,945s
sys 0m1,045s
AWK is also a very powerful tool on its own. It’s a separate programming language designed strictly for processing structured streams of text. Yet its most common use in shell scripts seems to be as a mere column extractor, something like this:
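some_command | grep pattern | awk '{ print $2 }'   # AWK reduced to "print column 2"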
which is far below its true potential.
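To give a taste of that potential: the grep above is redundant, since a pattern can guard the action directly (awk '/pattern/ { print $2 }' does both jobs at once), and AWK’s associative arrays make one-pass aggregations trivial. As a sketch, a hypothetical totals.awk that extends our sum.awk to report every user at once:

# totals.awk: salary totals for all users, in a single pass
{ sum[$1] += $3 }
END { for (user in sum) print user, sum[user] }

awk -F: -f totals.awk < salaries_500k.csv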