strace: The Basics -

strace is a utility available on Linux that allows you to display all of the calls between a userspace program and the kernel, called system calls. System calls are used by userspace applications when they need to do something that requires the kernel. This covers pretty much and data entering or leaving the application, requests for more memory, libraries getting loaded, and more.

strace is similar to other utilities from other systems such as ktrace on BSD/OSX and truss on Solaris/FreeBSD. Although these tools work differently, they fill the same puspose. Tips from this article should apply to the others, given they might need to be modified to match the output and system calls available.

Very few applications call system calls directly, as this would make it hard to port to a different operating system that had different system calls. Instead of using system calls directly most programs will use a library. When programming in C you might use glibc, uClibc, musl or others. When programming Go or Java the standard library included with the language will make the system calls.

So lets copy a file using cat under strace and see what the result is. This can be acheived with the following command.

strace cat /tmp/test > /tmp/test-cat

Now we will dissect the output below, but if you’d like it all in one file you can download it here.

execve("/usr/bin/cat", ["cat", "/tmp/test"], [/* 67 vars */]) = 0

So here is our very first system call. We can see in the line above that it is named execve(). If we want to find out what a system call does we can look it up in the Linux Programmers Manual, in particular section 2 deals with system calls. So to read about the execve() system call you would type man 2 execve.

The execve() system call takes a string called filename, and array of strings named argv and another array of strings names envp. The manpage tells us that it starts a program using the executable located at filename, with the arguments in argv and with the environment envp. For future system calls we will assume that you look up the system call before you continue reading.

In Unix starting a program consists of two system calls, fork() and exec(). In Linux these have many variations such as execve() and execveat() for exec() and vfork() and clone() for fork(). fork() duplicates the current process and exec() replaces the current process with a different program. Here we can’t see the fork() because that is done by the launching shell, however we can see the exec() call because it is done by strace after it has setup tracing.

brk(NULL)                               = 0x55a943856000

Our second system call! If you have already loaded the manpage you’ll now know that brk() allows a program to manipulate the location of the program break. The program break is the address of the top of the heap. Increasing the address of the program break allocated more memory to a program, and decreasing it deallocates memory as you can see below.

However this call to brk() is called with an address of NULL. When brk() is called this way it simply returns the address of the program break. The system call returned 0x55a943856000 so this address is the end of our heap, addresses above this are not part of our heap (though they may be mapped pages).

Since we haven’t done anything yet this wil be the Linux loader asking for the location of the program break to help it dynamically load the required libraries.

mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1904bae000

mmap() requests that a file or device be mapped to a location in the programs memory space. This means that particular pages above the heap, and below the stack may end up referring to a file, a section of memory shared with another process, a device or just some pages of physical memory. This is a second way to allocate memory

In this case we see that mmap is being called to request a page be mapped at and location (because addr is NULL), the page should be at least 8192 bytes long, it needs to be readable and writable (because the PROT_READ and PROT_WRITE flags have been ORed), it should be private to this application (thats what MAP_PRIVATE does), not linked to any file (MAP_ANONYMOUS, and fd is -1). We can see that this page is successfully mapped at 0x7f1904bae000.

As we are still loading the application, this will be some memory required by the Linux loader.

access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)

The access() system call checks a processes ability to access a particular file. In this case cat is looking for read only access to /etc/ld.so.preload.

You can read about the /etc/ld.so.preload file on the manpage for the Linux dynamic linker (man ld.so). Essentially though it contains a list of comma or whitespace seperated list of libraries that should be loaded before anything else. This feature of the linker allows system administrators to inject code into every dynamically linked program on the system. As you can imagine it is also liable to be abused by viruses.

The return code of ENOENT indicates that this file does not exist. As the file does not exist on this system, nothing will be loaded at this time.

open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=224517, ...}) = 0
mmap(NULL, 224517, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f1904b77000
close(3)                                = 0

These 4 calls are all related. First we open() a file descriptor /etc/ld.so.cache, for reading (O_RDONLY) and to be closed on the next exec() call (O_CLOEXEC). The return code indicates that this file has ben opened at file descriptor 3.

The next call is to fstat() on the file descriptor we just opened. This call gathers information about the current status of the file such as, who owns it, the permissions, the size, the time stamps for last access and so on. There is a lot to print here so strace only prints the mode (0755) and the size (2115824).

Next we call mmap to map the entire file (using the size we read earlier) into read only (PROT_READ) private memory (MAP_PRIVATE). This will make it available to be read from directly in the processes memory. The return code indicates that the file will start at 0x7f1904b77000 in the processes memory.

Finally as the file is now mapped into memory we no longer need the file descriptor, so we close() it.

/etc/ld.so.cache is also mentioned on the ld.so manpage. It contains a cached list of libraries on the system so that the linker can quickly fine them. It is created by ldconfig and you can get a list of the included libraries with ldconfig -p. The Linux Linker will need this to load the libraries the program needs.

open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \5\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2115824, ...}) = 0
mmap(NULL, 3955040, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f19045c5000
mprotect(0x7f1904782000, 2093056, PROT_NONE) = 0
mmap(0x7f1904981000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f1904981000
mmap(0x7f1904987000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1904987000
close(3)                                = 0

Here we follow a similar pattern to above. We open() /lib64/libc.so.6 read only. Then we read() the first 832 bytes, presumably to confirm that the file is what we think it is by reading a header. We fstat() the file to figure out its size. When we mmap() the file into memory, however we map in more bytes than the size of the file to make room for other mappings we will make. this file gets mmapped to the address 0x7f19045c5000.

mprotect() is used to change the protection set on pages of memory. When we mmap()ed the file into memory we specified that the memory should be readable and executable (PROT_READ|PROT_EXEC). Now with mprotect() we are denying all access to part of it, presumably because the application does not need it.

We then use mmap() to make two more mappings overlapping the region we mapped and mprotected above. The first one contains a small part (24567 bytes) of /lib64/libc.so.6 at an offset of 1818624 bytes in. The second mmap() is anonymous memory. Both these are marked PROT_READ|PROT_WRITE meaning they are readable and writible, so they likely contain some data glibc needs. These mmap() calls are different from the other because the loader is asking for a specific part of memory by setting MAP_FIXED.

Finally we close() the no longer needed file descriptor.

If you want to know more about what is happening in the step you can read elf/dl-load.c in the glibc source code.

Here is a map of the memory that got setup for glibc.

mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1904b75000
arch_prctl(ARCH_SET_FS, 0x7f1904b75700) = 0
mprotect(0x7f1904981000, 16384, PROT_READ) = 0
mprotect(0x55a942c9d000, 4096, PROT_READ) = 0
mprotect(0x7f1904bb0000, 4096, PROT_READ) = 0
munmap(0x7f1904b77000, 224517)          = 0

Another mmap() allocating 8192 bytes of memory which the OS sets up at 0x7f1904b75000. Followed by a call to arch_prctl(). This system call is used to set the state of the FS and GS registers. FS and GS have a long history, while they were originally implemented to allow access to segmented memory, they have been adopted by both Windows and Linux to represent the thread local and CPU local storage. Here it is used to setup the thread local storage (FS register) about 1 and 3/4 of the way into this new allocation.

Then three different sections of RAM are made read only. The first is the first 16 kilobytes of the peice of memory we marked OFFSET above, which had some data from /lib64/libc.so.6 in it. The second is a value before the program break, which has some part of our own programs data section in it. Finally the last address is somewhere inside the memory that was allocated by the first mmap().

This set of system calls above is likely glibc running its initialisation and setting itself up.

brk(NULL)                               = 0x55a943856000
brk(0x55a943877000)                     = 0x55a943877000
brk(NULL)                               = 0x55a943877000

As we saw earlier the brk() system call is used to move the program break, and essentially allocate memory on the heap. Here it is called three times, once to check where the current program break is, another time to move the program break upwards 132 kilobytes and finally once more to confirm the program break has moved. This means that the program now has access to 132 kilobytes more memory than before.

open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=111950656, ...}) = 0
mmap(NULL, 111950656, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f18fdb01000
close(3)                                = 0

Here we again follow the same pattern we have seen earlier, open()ing a file, fstat()ing to find the size, mmap()ing the file into memory and finally close()ing the file descriptor.

The file in question (/usr/lib/locale/locale-archive) keeps an archive of all the available locales on the system. It is stored in a binary format and designed to be easily mapped into memory for use by different programs. There is a command named build-locale-archive which is responsible for building it from the available locale information.

fstat(1, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0

We are finally at the meat of the program. Here cat is actually running instead of all its setup code. fstat() is being used to get information on one of the standard Unix file descriptors stdout. This is the main source of output for a program, and in this case it is redirected to /tmp/test-cat. The other source of output is stderr. Below is a list of the standard file descriptors.

Number	Name
0	stdin
1	stdout
2	stderr

open("/tmp/test", O_RDONLY)             = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=5, ...}) = 0

We open() the file named /tmp/test read only, and it gets assigned file descriptor 3. We then get its attributes using fstat().

fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0

The system call fadvise64() is used to tell the Linux kernel how we plan to use a file. This allows the kernel to optimise its IO and caching. Here we tell the Kernel we plan to read the file (fd = 3 so it means /tmp/test) sequentially (POSIX_FADV_SEQUENTIAL) from start (offset = 0) to finish (len = 0). This Kernel could choose to optimise our access by prefetching the next piece of a file, every time we read the piece before.

mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1904b8c000

cat uses mmap() to allocate a 136kb buffer.

read(3, "test\n", 131072)               = 5

cat tries to read 128kb of data from file descriptor 3 (/tmp/test). The value of the bytes read are test\n. Although strace doesn’t tell us, its likely that this was read into the buffer we just allocated. The return code is 5 because only 5 bytes were read.

write(1, "test\n", 5)                   = 5

cat is doing its job, data that it just read from the file is now being written to stdout.

read(3, "", 131072)                     = 0

cat tries to read more data, but there is no more data left in the file. This causes it to get a return code of zero.

munmap(0x7f1904b8c000, 139264)          = 0

cat has no more use for the buffer it allocated, so it unmaps it.

close(3)                                = 0

As all of /tmp/test has been read, it is no longer needed, and is closed.

close(1)                                = 0
close(2)                                = 0

We have no more output to give, so we close both stdout and stderr.

exit_group(0)                           = ?
+++ exited with 0 +++

cat exits with a return code of zero to signify that no error occurred.

strace debug