Simplest command

Before we can dive into more complex scenarios, we need to get to know Bash’s natural environment and understand its limits. The central concept in practically any OS shell is the process.

Simply put, a process describes a single execution of a program. It has a unique identifier, it encapsulates the code that needs to be executed, its dependencies (shared libraries) and the memory required to perform its task.
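The unique identifier mentioned above is easy to observe from the shell itself. The following is a minimal sketch: it compares the PID of the running shell with the PID of a child `sh` started just for this purpose (the variable names are illustrative only).

```shell
# $$ expands to the process ID of the current shell.
shell_pid=$$

# Start a separate sh process and let it report its own ID.
child_pid=$(sh -c 'echo $$')

printf 'shell: %s, child: %s\n' "$shell_pid" "$child_pid"
```

The two numbers differ because the child is a separate execution of a program, and therefore a separate process.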

Every standalone program stored separately on the disk requires a separate process. UNIX-like systems are famous for their remarkable set of small single-responsibility programs (utilities), which the shell combines to perform complex tasks. It therefore seems natural to expect that the more sophisticated the task, the more utilities we will use, and thus the more processes will be created.

Let’s see how much overhead comes with executing the simplest of commands: /usr/bin/true. Its source code, written in C, boils down to something like this:

int main(int argc, char **argv)
{
    return 0; // Exit with a status code indicating success
}

The GNU implementation actually does a bit more than that (processing arguments, printing a localized usage message, etc.), but these are things every UNIX command does. We can’t completely disregard them, since they still contribute to the total overhead of process instantiation, but we don’t need to care about them right now.
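From the shell's point of view, the only visible effect of this program is its exit status. A small sketch of how to observe it follows; the fallback to /bin/true is an assumption for systems where the binary does not live under /usr/bin.

```shell
# Locate the external binary (not the shell builtin of the same name).
true_bin=$(command -v /usr/bin/true || command -v /bin/true)

"$true_bin"
status=$?   # exit status of the most recently executed command

echo "exit status: $status"
```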

Since a single execution takes less than a millisecond, we’ll run it 10,000 times in a loop to see how it does on average:

time for ((i=0; i<10000; i++)); do /usr/bin/true; done

real    0m5.829s
user    0m4.226s
sys     0m1.942s

Roughly 0.58 ms for a single run might not seem like much, but it’s actually quite a lot for a program that essentially does nothing! To understand why, we need a rough idea of how an average process is structured.
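The per-run average can also be computed by the script itself rather than by hand. The sketch below assumes Bash 5.0 or newer, which provides the EPOCHREALTIME variable; the variable names and the smaller run count are illustrative choices, not part of the original measurement.

```shell
runs=1000
true_bin=$(command -v /usr/bin/true || command -v /bin/true)

# EPOCHREALTIME looks like 1700000000.123456; dropping the decimal
# separator (dot or comma, depending on locale) yields microseconds.
start=${EPOCHREALTIME/[.,]/}
for ((i = 0; i < runs; i++)); do "$true_bin"; done
end=${EPOCHREALTIME/[.,]/}

# Integer average in microseconds, printed as milliseconds.
avg_us=$(( (end - start) / runs ))
printf 'average: %d.%03d ms per run\n' $((avg_us / 1000)) $((avg_us % 1000))
```

The result includes the cost of the loop itself, so it is best read as an upper bound on the cost of starting the process.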

Process structure

[Figure: process structure]

The image above (borrowed from here) presents an overview of the memory layout of a Linux process. While a full understanding of it is not necessary to write efficient Bash scripts, a few details hint at the reasons for our measurement results. Knowing the structure also makes it easier to understand the output of strace and pmap.

First, there are quite a few memory segments serving different purposes. Starting from the bottom:

  1. Text segment stores the code of the program that needs to be executed. For it to be there, something had to load it. This might not be much of an issue nowadays, because hard drives are much faster than they were even a few years ago, but it still adds to the total process startup time. The size of the program itself might also matter here.

  2. Data and BSS segments store initialized and uninitialized static variables respectively. Their size depends on the particular program.

  3. Heap is the place for dynamic data produced at runtime. It can grow according to the needs of the program, and the way it is used can sometimes affect the performance.

  4. Memory Mapping Segment holds mappings of shared libraries and other files used by the program.

  5. Stack stores things like local variables, function parameters and return addresses.

  6. Kernel space is a memory area reserved for the kernel.
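On Linux, the segments listed above can be seen for any running process in /proc/<pid>/maps. The following sketch lists the distinct names of the mappings of the current shell; it assumes a Linux system with /proc mounted, and the variable names are illustrative.

```shell
# Each line of the maps file describes one mapping:
# address range, permissions, offset, device, inode, and (optionally) a name.
maps_file="/proc/$$/maps"

# Collect the distinct names: the executable, shared libraries,
# and special regions such as [heap] and [stack].
segments=$(awk '$6 != "" { print $6 }' "$maps_file" | sort -u)

printf '%s\n' "$segments"
```

The pmap utility presents the same information in a more readable form.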

System calls

In order to access any resource managed by the operating system, programs use the API exposed by the kernel known as system calls. That includes reading and writing files, mapping files to the memory, accessing disks, network devices, processes and more.

Our example with true showed that about 33% of the execution time (sys relative to real) was spent in the kernel, largely on system calls. But what would such a simple command need to do? Let’s have a look at the output of strace:

$ strace /usr/bin/true
execve("/usr/bin/true", ["/usr/bin/true"], 0x7ffe63ac7b80 /* 67 vars */) = 0
brk(NULL)                              = 0x56347df72000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffd5257feb0) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb64014b000
access("/etc/", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=97021, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 97021, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fb640133000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\3206\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=2072888, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2117488, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb63fe00000
mmap(0x7fb63fe22000, 1544192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7fb63fe22000
mmap(0x7fb63ff9b000, 356352, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19b000) = 0x7fb63ff9b000
mmap(0x7fb63fff2000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f1000) = 0x7fb63fff2000
mmap(0x7fb63fff8000, 53104, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fb63fff8000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb640130000
arch_prctl(ARCH_SET_FS, 0x7fb640130740) = 0
set_tid_address(0x7fb640130a10)         = 157948
set_robust_list(0x7fb640130a20, 24)     = 0
rseq(0x7fb640131060, 0x20, 0, 0x53053053) = 0
mprotect(0x7fb63fff2000, 16384, PROT_READ) = 0
mprotect(0x56347c79b000, 4096, PROT_READ) = 0
mprotect(0x7fb640181000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7fb640133000, 97021)           = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Surprisingly, there are a lot of things to be done even for such a simple command: memory allocation, loading shared libraries, memory protection and more. If we could avoid at least some of them, we would surely gain some speed. But what can we do to achieve that? And does it really make a difference?
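To get a feel for which calls dominate, a trace can be summarized instead of read line by line. The sketch below defines a hypothetical helper, count_syscalls, that tallies the syscall names in a log written with strace -o; it assumes strace is installed and allowed to trace, and skips the live trace otherwise.

```shell
# count_syscalls FILE: count syscall names in an strace log, most frequent first.
count_syscalls() {
    grep -oE '^[a-z_0-9]+\(' "$1" | tr -d '(' | sort | uniq -c | sort -rn
}

if command -v strace >/dev/null 2>&1; then
    trace=$(mktemp)
    # -o writes the trace to a file instead of stderr.
    if strace -o "$trace" /usr/bin/true 2>/dev/null; then
        count_syscalls "$trace"
    fi
    rm -f "$trace"
fi
```

strace also has a built-in summary mode, strace -c, which reports call counts and time per system call.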