Powers combined
We’ve already seen that a specialized tool like AWK, used correctly, can outperform a shell script.
BEGIN { sum = 0 }
$1 == "jsmith" { sum += $3 }
END { print sum }
time awk -F: -f sum.awk < salaries_100m.csv
real 0m12,830s
user 0m11,314s
sys 0m1,516s
That shouldn’t come as a surprise though. AWK programs are static — they can be parsed once and executed efficiently against every line of input. Bash is a very dynamic language — all those expansions and substitutions force it to parse a line of the script each time it’s executed.
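To make the contrast concrete, here is a minimal sketch of the same aggregation written as a plain shell loop, next to the AWK rules from sum.awk. The function names, the sample data, and the meaning of the second field (`dept`) are assumptions made for illustration; the source only tells us that fields are colon-separated, that the first holds the name, and that the third holds the amount.

```shell
# Sum the third field of every line whose first field is "jsmith".
# The loop body, with all its expansions, is evaluated again for each line.
# Assumes exactly three colon-separated fields per line (hypothetical layout).
sum_with_shell_loop() {
    total=0
    while IFS=: read -r name dept amount; do
        if [ "$name" = "jsmith" ]; then
            total=$((total + amount))
        fi
    done < "$1"
    echo "$total"
}

# The same rules as sum.awk, inlined so the sketch is self-contained.
sum_with_awk() {
    awk -F: 'BEGIN { sum = 0 } $1 == "jsmith" { sum += $3 } END { print sum }' "$1"
}

# Tiny hypothetical sample; the real file in the text has 100 million lines.
sample=$(mktemp)
printf '%s\n' 'jsmith:sales:100' 'adoe:it:250' 'jsmith:sales:200' > "$sample"

echo "shell loop: $(sum_with_shell_loop "$sample")"
echo "awk:        $(sum_with_awk "$sample")"
rm -f "$sample"
```

Both functions should print the same total; only the amount of work the interpreter does per line differs.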
It turns out, however, that there’s still some room for improvement. AWK works in a very straightforward way: it passes every line of input through a set of rules, executing them only when the condition applies. The less input gets passed to AWK, the less time is spent on data which will never match any rule.
time { grep '^jsmith:' salaries_100m.csv > salaries_filtered.csv ; awk -F: -f sum.awk < salaries_filtered.csv; } > /dev/null
real 0m3,539s
user 0m3,024s
sys 0m0,500s
Despite being highly optimized for text files, AWK is still a multi-purpose tool, capable of filtering, transforming and aggregating the data. On the other hand, grep covers just one of those things, and thus can utilize more specialized algorithms.
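Pre-filtering is only safe if it does not change the result. The sketch below checks that on a tiny, made-up sample: the grep pattern must select exactly the lines that the AWK condition would match. The function name `sum_jsmith` and the sample records are assumptions for illustration.

```shell
# The rules from sum.awk, inlined; reads the named file, or stdin if none is given.
sum_jsmith() {
    awk -F: 'BEGIN { sum = 0 } $1 == "jsmith" { sum += $3 } END { print sum }' "$@"
}

sample=$(mktemp)
filtered=$(mktemp)

# "jsmithson" is included on purpose: the trailing colon in the grep
# pattern keeps it out, just as the AWK comparison $1 == "jsmith" does.
printf '%s\n' 'jsmith:sales:100' 'adoe:it:250' \
              'jsmith:sales:200' 'jsmithson:it:999' > "$sample"

direct=$(sum_jsmith "$sample")

grep '^jsmith:' "$sample" > "$filtered"
prefiltered=$(sum_jsmith "$filtered")

echo "awk alone: $direct, grep then awk: $prefiltered"
rm -f "$sample" "$filtered"
```

If the two numbers ever differ, the grep pattern is not equivalent to the AWK condition and the faster variant is computing something else.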
Now, previous examples suggest that using a pipeline should yield even better results.
time { grep '^jsmith:' salaries_100m.csv | awk -F: -f sum.awk; } > /dev/null
real 0m3,298s
user 0m2,991s
sys 0m0,585s
This time, however, the difference is less spectacular. Measuring those two commands separately reveals the cause:
time grep '^jsmith:' salaries_100m.csv > salaries_filtered.csv
real 0m3,325s
user 0m2,876s
sys 0m0,445s
time awk -F: -f sum.awk < salaries_filtered.csv > /dev/null
real 0m0,237s
user 0m0,229s
sys 0m0,008s
Since grep accounts for more than 90 percent of the run time here, running the two commands in parallel gains little speed. Still, the pipeline:
- is shorter and cleaner,
- lets us avoid writing to the disk (which may hinder the execution on older devices),
- lets us optimize the parts of the process independently.
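As a rough illustration of the last point, each stage of the pipeline can be wrapped in its own function and tuned without touching the other. The function names and sample data below are hypothetical, and running grep under `LC_ALL=C` is only one possible tweak: it was not measured in the text above, and whether it helps depends on the grep implementation and the active locale, so it should be timed before being adopted.

```shell
# Filtering stage: can be changed on its own. Forcing the C locale is an
# assumption worth measuring, not a guaranteed speed-up.
filter_stage() {
    LC_ALL=C grep '^jsmith:' "$@"
}

# Aggregating stage: the rules from sum.awk, reading from stdin.
sum_stage() {
    awk -F: 'BEGIN { sum = 0 } $1 == "jsmith" { sum += $3 } END { print sum }'
}

# Tiny hypothetical sample standing in for salaries_100m.csv.
sample=$(mktemp)
printf '%s\n' 'jsmith:sales:100' 'adoe:it:250' 'jsmith:sales:200' > "$sample"

# No intermediate file: grep's output goes straight into awk.
pipeline_total=$(filter_stage "$sample" | sum_stage)
echo "total: $pipeline_total"
rm -f "$sample"
```

Because the stages communicate only through the pipe, swapping the body of `filter_stage` for a different filter leaves `sum_stage` untouched.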