Pipelines
One of the greatest features of UNIX shells is the ability to run commands in pipelines: sequences in which every adjacent pair of commands is connected via a standard output/input bridge called a pipe. Shell scripts are full of pipelines, mostly because they’re quick to write and easy to read.
ps aux | grep jsmith | awk '{ print $2 }' | xargs kill
However, that ease of use may come at the cost of performance, especially when the underlying mechanisms are not properly understood.
Consider the following example:
cat /etc/passwd | grep '/home'
This pipeline usage (anti-)pattern is quite common, especially among beginners. First of all, using cat to print the contents of a single file to standard output just to pipe it to grep is unnecessary, since grep can accept a file name as a command-line parameter as well. But the issue is not just one of syntax; it hurts performance as well:
time for ((i=0; i<10000; i++)); do
cat /etc/passwd | grep '/home'
done >/dev/null
real 0m17,502s
user 0m19,546s
sys 0m8,964s
time for ((i=0; i<10000; i++)); do
grep '/home' /etc/passwd
done >/dev/null
real 0m9,889s
user 0m7,302s
sys 0m2,990s
The reason for this lies in the way pipelines are executed. Because the commands in a pipeline are connected via pipes, and a pipe needs to be open on both ends (closing one end breaks the pipe), all of the commands are started simultaneously, each in its own subshell. By passing the file name directly to grep, we avoid creating a subprocess for cat and the I/O operations needed for the communication between the two.
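One quick way to observe this (a small sketch, assuming Bash 4.0 or newer for BASHPID and the default behaviour in which every pipeline segment runs in a subshell) is to check what happens to state changed inside a pipeline:
COUNT=0
# The while loop runs in a subshell, so the increments are lost:
printf 'a\nb\nc\n' | while read -r LINE; do COUNT=$((COUNT + 1)); done
echo "${COUNT}"    # prints 0, not 3
# BASHPID reports the PID of the current (sub)shell and differs from $$
# inside a pipeline segment, confirming that a new process was created:
echo "main shell: $$"
true | { echo "pipeline segment: ${BASHPID}"; }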
Redirections
If we really need to pass the data via stdin (e.g. because the command we want to use doesn’t accept a file name as a parameter), we can use a redirection, which gives a comparable result:
time for ((i=0; i<10000; i++)); do
grep '/home' < /etc/passwd
done >/dev/null
real 0m10,086s
user 0m7,427s
sys 0m3,043s
Another common use of pipelines is extracting something from a string we already have at hand:
DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
PHONE=$(echo "${DATA}" | cut -d: -f3)
done >/dev/null
real 0m16,079s
user 0m15,555s
sys 0m4,013s
This is trickier, because there’s no easy way to do it in the Bourne shell. We could use a here document:
DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
PHONE=$(cut -d: -f3 <<-EOF
${DATA}
EOF
)
done >/dev/null
real 0m12,768s
user 0m10,460s
sys 0m3,091s
It is faster, but it doesn’t look too pretty. Fortunately, Bash solves the problem with a special redirection syntax, the here string:
DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
PHONE=$(cut -d: -f3 <<< "${DATA}");
done >/dev/null
real 0m12,378s
user 0m10,161s
sys 0m2,977s
It looks better, but it is still quite slow. That is because we are (a) calling an external command and (b) using command substitution, both of which create a new process. Both problems can be avoided with a built-in command:
DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
IFS=':' read LOGIN NAME PHONE <<< "${DATA}"
done >/dev/null
real 0m0,454s
user 0m0,128s
sys 0m0,323s
The last trick works only for cases covered by the read built-in command (like splitting the string into fields). However, working with CSV files is so common that it pays off to remember it.
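As a side note (a small sketch along the same lines, not part of the benchmark above), when the number of fields is not fixed, read -a can split the line into an array instead of a list of named variables:
DATA="jsmith:John Smith:555-1111"
# Split on ':' into an array; indices start at 0, so field 3 is index 2.
IFS=':' read -r -a FIELDS <<< "${DATA}"
echo "${FIELDS[2]}"    # prints 555-1111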
Temporary files
Writing data to the filesystem was historically considered slow, mostly because disk drives used to be much slower. However, storage technology has advanced to the point where it can make sense to store data in a temporary file just to avoid starting a new subprocess. Let’s consider the following two functions.
test_passwd() {
for ((id=0; id<100; id++)); do
echo "u${id}:x:${id}:${id}:User ${id}:/home/u${id}:/bin/bash"
done
}
my_cut() {
local FIELDS SEPARATOR="${1:0:1}" COLUMN="${2}"
while IFS="${SEPARATOR}" read -a FIELDS; do
echo "${FIELDS[${COLUMN}]}"
done
}
The first one isn’t that important: it only generates some test data (one hundred passwd-like lines) and prints it to stdout. The second one is more interesting, as it re-implements the basic functionality of cut. If we run them both in a pipeline, we get two subprocesses on each execution, which, as we already know, may be costly, especially when run in a loop:
time for ((i=0;i<10000;i++)); do
test_passwd | my_cut : 6
done >/dev/null
real 0m4,761s
user 0m5,279s
sys 0m1,847s
If we split the pipeline into two phases and store the data in a temporary file, we might get quite good results:
time for ((i=0;i<10000;i++)); do
test_passwd > temp_file # Write to a temp file
my_cut : 6 < temp_file
done >/dev/null
real 0m0,191s
user 0m0,138s
sys 0m0,052s
The actual times will, of course, heavily depend on the kind of device and filesystem, and thus may vary.
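If temporary files become a regular tool in a script, it is worth creating them safely and cleaning them up. The following is only a sketch and assumes a Linux-like system where mktemp is available and /dev/shm is a RAM-backed tmpfs; adjust the location to your environment:
TMP=$(mktemp -p /dev/shm 2>/dev/null || mktemp)   # prefer tmpfs, fall back to the default temp dir
trap 'rm -f "${TMP}"' EXIT                        # remove the file when the script exits
test_passwd > "${TMP}"
my_cut : 6 < "${TMP}"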
So far, every time we dropped a pipeline in favor of another mechanism, we got better performance. Because of this, you might have gotten the impression that pipelines are slow. In fact, they are only slow when used on small data sets (like parsing a single line), especially in a loop, because the overhead of creating a subshell for each segment is greater than the computation itself. The true power of pipelines emerges when there is a lot of data to process in a single pipeline, and the chain consists of more than two components (but not too many).
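To close with a positive example (the log file path and the field number below are only hypothetical placeholders), this is the kind of job a pipeline is made for: a single multi-stage chain run once over a large stream, so the cost of starting a few processes is amortized over the whole data set:
# Run once over a large file, not once per line:
grep 'ERROR' /var/log/app.log | cut -d' ' -f5 | sort | uniq -c | sort -rn | head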