perf is a tool you can use in Linux for analyzing performance-related issues. It has many features (e.g. it can report statistics on cache misses and set dynamic probes on kernel functions), but the one I’m concerned at this point is callchain sampling. That is, we can use perf as a sampling profiler.
A sampling profiler periodically inspects the stacktrace of the processes running in the CPUs at that time. During the sampling tick, it will record what function is currently runnig, what function called it, and so on recursively.
Sampling profilers are a go-to tool for figuring out where time is spent when running code. Given enough samples you can draw a clear correlation between the number of samples a function was found and what percentage of time that function was in the stack. Furthermore, since callers and callees are also tracked, you can know what other function called this one and how much time was spent on other functions inside this one.
What is using perf like?
You can try this on your own system by running perf top -g
where -g
stands for “Enable call-graph recording”. perf top gives you real time information about where time is currently spent. Alternatively, you can also record a capture and then open it, for example.
perf record -g ./my-program # or use -p PID to record an already running program
perf report
Samples: 11 of event 'cycles', Event count (approx.): 7158501
Children Self Command Shared Object Symbol
- 86.52% 0.00% xz xz [.] _start
_start
- __libc_start_main
- 72.04% main
- 66.16% coder_run
lzma_code
stream_encode
block_encode
lz_encode
lzma2_encode
- lzma_lzma_encode
- 37.36% lzma_lzma_optimum_normal
lzma_mf_find
lzma_mf_bt4_find
__dabt_usr
do_DataAbort
do_page_fault
handle_mm_fault
- wp_page_copy
37.36% __memset64
28.81% rc_encode
- 5.88% args_parse
lzma_check_is_supported
ret_from_exception
do_PrefetchAbort
do_page_fault
handle_mm_fault
...
The percentage numbers represent total time spent in that function. You can show or hide the callees of each function by selecting it with the arrow keys and then pressing the +
key. You can expect the main function to take a significant chunk of the samples (that is, the entire time the program is running), which is subdivided between its callees, some taking more time than others, forming a weighted tree.
For even more detail, perf also records the position of the Program Counter, making it possible to know how much time is spent on each instruction within a given function. You can do this by pressing enter and selecting “Annotate code”. The following is a real example:
│ while (!feof(memInfoFile)) {
5.75 │180:┌─→mov r0, sl
│ │→ bl feof@plt
17.67 │ │ cmp r0, #0
│ │↓ bne 594
│ │char token[MEMINFO_TOKEN_BUFFER_SIZE + 1] = { 0 };
6.15 │ │ vmov.i32 q8, #0 @ 0x00000000
6.08 │ │ ldr r3, [fp, #-192] @ 0xffffff40
5.14 │ │ str r0, [fp, #-144] @ 0xffffff70
│ │if (fscanf(memInfoFile, "%" STRINGIFY(MEMINFO_TOKEN_BUFFER_SIZE) "s%zukB", token, &amount) != 2)
│ │ mov r2, r6
4.96 │ │ mov r1, r5
│ │ mov r0, sl
│ │char token[MEMINFO_TOKEN_BUFFER_SIZE + 1] = { 0 };
5.98 │ │ vstr d16, [r7, #32]
6.61 │ │ vst1.8 {d16-d17}, [r7]
11.91 │ │ vstr d16, [r7, #16]
5.52 │ │ vstr d16, [r7, #24]
5.67 │ │ vst1.8 {d16}, [r3]
│ │if (fscanf(memInfoFile, "%" STRINGIFY(MEMINFO_TOKEN_BUFFER_SIZE) "s%zukB", token, &amount) != 2)
│ │ mov r3, r9
11.83 │ │→ bl __isoc99_fscanf@plt
6.75 │ │ cmp r0, #2
│ └──bne 180
perf automatically attempts to use the available debug information from the binary to associate machine instructions with source lines. It can also highlight jump targets making it easier to follow loops. By default the left column shows the estimated percentage of time within this function where the accompanying instruction was running (other options are available with --percent-type
).
The above example is a 100% CPU usage bug found in WebKit caused by a faulty implementation of fprintf
in glibc. We can see the looping clearly in the capture. It’s also possible to derive—albeit not visible in the fragment— that other instructions of the function did not appear in virtually any of the samples, confirming the loop never exits.
What do I need to use perf?
- A way to traverse callchains efficiently in the target platform that is supported by perf.
- Symbols for all functions in your call chains, even if they’re not exported, so that you can see their names instead of their pointers.
- A build with optimizations that are at least similar to production.
- If you want to track source lines: Your build should contain some debuginfo. The minimal level of debugging info (
-g1
in gcc) is OK, and so is every level above. - The perf binary, both in the target machine and in the machine you want to see the results. They don’t have to be the same machine and they don’t need to use the same architecture.
If you use x86_64 or ARM64, you can expect this to work. You can stop reading and enjoy perf.
Things are not so happy in the ARM32 land. I have spent roughly a month troubleshooting, learning lots of miscellaneous internals, patching code all over the stack, and after all of that, finally I got it working, but it has certainly been a ride. The remaining parts of this series cover how I got there.
This won’t be a tutorial in the usual sense. While you could follow this series like a tutorial, the goal is to get a better understanding of all the pieces involved so you’re more prepared when you have to do similar troubleshooting.