Prev: Specialized tools, Up: Index, Next: Pipelines

Pipelines

One of the greatest features of UNIX shells is the ability to run commands in pipelines — sequences in which every adjacent pair of commands is connected via a standard output/input bridge called a pipe. Shell scripts are full of pipelines, mostly because they’re quick to write and easy to read.

ps aux | grep jsmith | awk '{ print $2 }' | xargs kill

However, that ease of use may come at the cost of performance, especially without a proper understanding of the underlying mechanisms.

Consider the following example:

cat /etc/passwd | grep '/home'

This pipeline usage (anti-)pattern is quite common, especially among beginners. First of all, using cat to print the contents of a single file to standard output just to pipe it into grep is unnecessary, since grep accepts a file name as a command-line argument. But the issue is not merely stylistic; it hurts performance as well:

time for ((i=0; i<10000; i++)); do
    cat /etc/passwd | grep '/home'
done >/dev/null

real    0m17,502s
user    0m19,546s
sys     0m8,964s

time for ((i=0; i<10000; i++)); do
    grep '/home' /etc/passwd
done >/dev/null

real    0m9,889s
user    0m7,302s
sys     0m2,990s

The reason for this lies in the way pipelines are executed. Because the commands in a pipeline are connected via pipes, and a pipe must be open on both ends (closing one end breaks it), all of the commands are started concurrently, each in its own subshell. By passing the file name directly to grep, we avoid both the creation of a subprocess for cat and the I/O operations needed to push the data through the pipe.
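
The subshell behaviour is easy to observe: each segment expands ${BASHPID} in its own child process, and a variable assigned inside a segment is gone once the pipeline finishes. A minimal illustration (not part of the benchmark above):

echo "parent  PID: ${BASHPID}"
echo "segment PID: ${BASHPID}" | cat    # prints a different PID

echo "555-1111" | read PHONE            # read runs in a subshell...
echo "PHONE is '${PHONE:-unset}'"       # ...so this prints: PHONE is 'unset'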

Redirections

If we really need to pass the data via stdin (e.g. because the command we want to use doesn’t accept a file name as an argument), we can use a redirection, which gives a comparable result:

time for ((i=0; i<10000; i++)); do
    grep '/home' < /etc/passwd
done >/dev/null

real    0m10,086s
user    0m7,427s
sys     0m3,043s

Another common usage of pipelines involves extracting something from a string we already have at hand:

DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
    PHONE=$(echo "${DATA}" | cut -d: -f3)
done >/dev/null

real    0m16,079s
user    0m15,555s
sys     0m4,013s

This is trickier, because there’s no easy way to do that in the Bourne shell. We could use a here document:

DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
    PHONE=$(cut -d: -f3 <<-EOF
		${DATA}
		EOF
    )
done >/dev/null

real    0m12,768s
user    0m10,460s
sys     0m3,091s

It is faster, but it doesn’t look too pretty. Fortunately, Bash solves the problem with a dedicated redirection syntax, the here string:

DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
    PHONE=$(cut -d: -f3 <<< "${DATA}")
done >/dev/null

real    0m12,378s
user    0m10,161s
sys     0m2,977s

It looks better, but it is still quite slow. That is due to (a) calling an external command and (b) using command substitution. Both of these problems can be fixed with a built-in command:

DATA="jsmith:John Smith:555-1111"
time for ((i=0; i<10000; i++)); do
    IFS=':' read -r LOGIN NAME PHONE <<< "${DATA}"
done >/dev/null

real    0m0,454s
user    0m0,128s
sys     0m0,323s

The last trick works only for cases covered by the read built-in command (like splitting a string into fields). However, parsing delimiter-separated records such as CSV or passwd-style files is so common that it pays off to remember it.
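
When the number of fields isn’t known in advance, read can populate an array instead of a fixed list of variables. A small sketch, using the same sample record:

DATA="jsmith:John Smith:555-1111"
IFS=':' read -r -a FIELDS <<< "${DATA}"
echo "${FIELDS[2]}"    # prints: 555-1111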

Temporary files

Writing data to the filesystem was historically considered slow, mostly because disk drives used to be much slower. However, storage technology has advanced to the point where it can make sense to store data in a temporary file just to avoid starting a new subprocess. Let’s consider the following two functions.

test_passwd() {
    for ((id=0; id<100; id++)); do
        echo "u${id}:x:${id}:${id}:User ${id}:/home/u${id}:/bin/bash"
    done
}

my_cut() {
    local FIELDS SEPARATOR="${1:0:1}" COLUMN="${2}"
    while IFS="${SEPARATOR}" read -r -a FIELDS; do
        echo "${FIELDS[${COLUMN}]}"
    done
}

The first one isn’t that important, as it only generates some sample data for our test and prints it to stdout. The second one is more interesting: it re-implements a basic piece of cut’s functionality. If we run them both in a pipeline, we get two subprocesses on each execution, which, as we already know, may be costly, especially when run in a loop:

time for ((i=0;i<10000;i++)); do
    test_passwd | my_cut : 6
done >/dev/null

real    0m4,761s
user    0m5,279s
sys     0m1,847s

If we split the pipeline into two phases and store the data in a temporary file, we can get much better results:

time for ((i=0;i<10000;i++)); do
    test_passwd > temp_file         # Write to a temp file
    my_cut : 6 < temp_file
done >/dev/null

real    0m0,191s
user    0m0,138s
sys     0m0,052s

The actual times will, of course, depend heavily on the storage device and the filesystem, and thus may vary.
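
In a real script the temporary file should be created with mktemp and removed afterwards, so that concurrent runs don’t clobber each other’s data. A sketch of the same two phases with that housekeeping added (the variable name is arbitrary):

TEMP_FILE=$(mktemp) || exit 1        # unique, private temporary file
trap 'rm -f "${TEMP_FILE}"' EXIT     # remove it when the script exits

test_passwd > "${TEMP_FILE}"
my_cut : 6 < "${TEMP_FILE}"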

So far, every time we dropped a pipeline in favor of another mechanism, we got better performance. Because of this, you might have gotten the impression that pipelines are slow. In fact, they are only slow when used on small data sets (like parsing a single line), especially in a loop, because the overhead of creating a subshell for each segment is greater than the computation itself. The true power of pipelines emerges when there is a lot of data to process in a single pipeline run and the chain consists of more than two components (but not too many).
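
For example, a single pipeline that crunches a whole log file in one pass pays the process start-up cost only once (the file path below is just a hypothetical example):

grep 'ERROR' /var/log/app.log | cut -d' ' -f1 | sort | uniq -c | sort -rn

The processes are created once and stream the entire file between them, so the subshell and pipe overhead is negligible compared to the actual filtering, counting and sorting.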
