Pipelines

Now that we know how not to use pipelines, let’s talk about the right way.

If you've ever worked with Java's Stream API, LINQ in C#, or Haskell's List monad, you might have noticed some similarities to shell pipelines: there's a stream of data, be it objects in a collection, rows from an SQL query, or lines from a text file, and there's a chain of operations manipulating the stream datum by datum (filtering, mapping, reduction):

Java:

    Files.lines(Paths.get("/etc/passwd"))
         .filter(line -> line.contains("/home"))
         .map(line -> line.replace("/home", "/new_home"))
         .forEach(System.out::println);

Shell:

    cat /etc/passwd \
        | grep '/home' \
        | sed 's#/home#/new_home#' \
        | xargs -l echo

Using the Stream API on a collection of just a few elements in a method called on every HTTP request, while trendy, might not be the best idea if you care about every microsecond or GC cycle. On the other hand, if the stream is part of a long-running thread that consumes data from a queue, transforms it, and sends it to another queue, this overhead can be ignored, because creating the stream is a one-time operation.

Although the machinery behind shell pipelines is a bit different (especially with regard to concurrency), the rule is much the same: it's best if the pipeline (or stream) is created only once and then fed a large amount of data.
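To make that one-time setup cost concrete, here is a minimal sketch (using a small hypothetical users.csv as a stand-in for the real input) contrasting a pipeline built per datum with a pipeline built once:

```shell
# Sample input; a hypothetical stand-in for the real data file.
printf '%s\n' 'alice:/home/alice' 'bob:/home/bob' > users.csv

# Anti-pattern: a fresh pipeline per line. Every iteration forks a
# subshell and a sed process, so the setup cost is paid per datum.
while IFS= read -r line; do
    printf '%s\n' "$line" | sed 's#/home#/new_home#'
done < users.csv

# Better: create the pipeline once and stream all lines through it.
sed 's#/home#/new_home#' < users.csv
```

Both variants print the same transformed lines, but with 100 million lines of input the first one forks two processes per line, while the second forks them exactly once.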

Concurrency

Shell pipelines are inherently concurrent. Let's take a closer look at the performance benefit this brings compared to the same set of commands executed sequentially.

The table below shows execution times for sequential execution running on various types of block devices:

    time {
        grep '/home' < users_100m.csv > temp1
        cut -d: -f6 < temp1 > temp2
        sed 's/home/new_home/' < temp2
    } > /dev/null

    SATA                  NVMe                  Ramdisk
    real    0m6,036s      real    0m32,598s     real    0m4,830s
    user    0m4,994s      user    0m8,434s      user    0m4,063s
    sys     0m1,041s      sys     0m3,967s      sys     0m0,765s

The same set of commands, run as a pipeline, outperforms all of them:

time {
    grep '/home' users_100m.csv | cut -d: -f6 | sed 's/home/new_home/'
} >/dev/null

real    0m3,767s
user    0m5,830s
sys     0m0,940s

As already noted, each command in a pipeline is executed in its own subshell, and these subshells run in parallel. The longer the pipeline, the more processing steps can potentially be parallelized (with the limit being the number of available CPUs).
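A quick way to observe the stages running in parallel is to chain commands whose runtime we know exactly (a sketch; the sleeps stand in for real processing steps):

```shell
# Three one-second sleeps chained in a pipeline are all started at
# the same time, so the whole pipeline finishes in about one second,
# not three.
start=$(date +%s)
sleep 1 | sleep 1 | sleep 1
end=$(date +%s)
echo "elapsed: $((end - start))s"
```

Run sequentially (`sleep 1; sleep 1; sleep 1`), the same commands would take about three seconds.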
