Powers combined

We’ve already seen that a specialized tool like AWK, used correctly, can beat a shell script in terms of performance.

sum.awk
BEGIN { sum = 0 }
$1 == "jsmith" { sum += $3 }
END { print sum }
time awk -F: -f sum.awk < salaries_100m.csv

real    0m12,830s
user    0m11,314s
sys     0m1,516s

That shouldn’t come as a surprise, though. An AWK program is static: it can be parsed once and then executed efficiently against every line of input. Bash is a very dynamic language: all those expansions and substitutions force it to re-parse a line of the script each time it’s executed.
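To see what that dynamism costs, here’s a pure-shell counterpart of sum.awk. The sample file below is made up for illustration; all the script assumes about the real salaries_100m.csv is what sum.awk itself assumes: field 1 is the user, field 3 the salary, with ':' as the separator.

```shell
# Hypothetical sample standing in for salaries_100m.csv
# (field 1 = user, field 3 = salary, ':' as separator).
cat > /tmp/salaries_sample.csv <<'EOF'
jsmith:eng:100
adoe:ops:200
jsmith:eng:50
EOF

# Pure-shell equivalent of sum.awk. Unlike the AWK version, the test,
# the arithmetic expansion, and the field splitting are all
# re-evaluated on every single line — which is where the time goes.
sum=0
while IFS=: read -r user dept salary; do
    [ "$user" = jsmith ] && sum=$((sum + salary))
done < /tmp/salaries_sample.csv
echo "$sum"
```

On a 100-million-line input this loop would run far slower than the AWK program, precisely because of that per-line re-evaluation.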

It turns out, however, that there’s still some room for improvement. AWK works in a very straightforward way: it passes every line of input through a set of rules, executing each one only when its condition matches. The less input we feed to AWK, the less time it spends on data that will never match any rule.

time { grep '^jsmith:' salaries_100m.csv > salaries_filtered.csv ; awk -F: -f sum.awk < salaries_filtered.csv; } > /dev/null

real    0m3,539s
user    0m3,024s
sys     0m0,500s

Despite being highly optimized for text processing, AWK is still a multi-purpose tool, capable of filtering, transforming, and aggregating data. grep, on the other hand, covers just one of those tasks, and can therefore use more specialized algorithms.
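While we’re on the subject of grep speed: GNU grep’s matching is locale-sensitive, and forcing the C locale often makes it faster on plain ASCII data. A small sketch — the sample file is hypothetical, and the actual speedup varies by grep version and locale:

```shell
# Hypothetical sample in the same layout as salaries_100m.csv.
cat > /tmp/salaries_sample.csv <<'EOF'
jsmith:eng:100
adoe:ops:200
jsmith:eng:50
EOF

# LC_ALL=C disables multibyte, locale-aware matching, letting grep
# fall back to plain byte comparisons; on large ASCII inputs this is
# often noticeably faster. -c just counts the matching lines here.
LC_ALL=C grep -c '^jsmith:' /tmp/salaries_sample.csv
```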

Now, the previous examples suggest that replacing the intermediate file with a pipeline should yield even better results.

time { grep '^jsmith:' salaries_100m.csv | awk -F: -f sum.awk; } > /dev/null

real    0m3,298s
user    0m2,991s
sys     0m0,585s

This time, however, the difference is less spectacular. Measuring those two commands separately reveals the cause:

time grep '^jsmith:' salaries_100m.csv > salaries_filtered.csv

real    0m3,325s
user    0m2,876s
sys     0m0,445s

time awk -F: -f sum.awk < salaries_filtered.csv > /dev/null

real    0m0,237s
user    0m0,229s
sys     0m0,008s

Since grep does more than 90 percent of the work here, running the two commands in parallel doesn’t gain much in terms of speed. Still, the pipeline:

  • is shorter and cleaner,

  • lets us avoid writing to disk (which may slow things down on older or slower drives),

  • lets us optimize the parts of the process independently.
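As an example of that last point: once grep guarantees that every surviving line belongs to jsmith, the condition in sum.awk is redundant, and the AWK stage can be trimmed to a bare accumulator. The sample file below is hypothetical, in the same layout as salaries_100m.csv:

```shell
# Hypothetical sample standing in for salaries_100m.csv.
cat > /tmp/salaries_sample.csv <<'EOF'
jsmith:eng:100
adoe:ops:200
jsmith:eng:50
EOF

# grep has already filtered on the first field, so awk no longer
# needs the $1 == "jsmith" check; it just sums field 3.
grep '^jsmith:' /tmp/salaries_sample.csv |
    awk -F: '{ sum += $3 } END { print sum }'
```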