{"id":106,"date":"2025-09-09T12:55:45","date_gmt":"2025-09-09T12:55:45","guid":{"rendered":"https:\/\/blogs.igalia.com\/aboya\/?p=106"},"modified":"2025-09-09T12:55:45","modified_gmt":"2025-09-09T12:55:45","slug":"getting-perf-to-work-on-arm32-linux-part-1-the-tease","status":"publish","type":"post","link":"https:\/\/blogs.igalia.com\/aboya\/2025\/09\/09\/getting-perf-to-work-on-arm32-linux-part-1-the-tease\/","title":{"rendered":"Getting perf to work on ARM32 Linux: Part 1, the tease"},"content":{"rendered":"\n<p><a href=\"https:\/\/perfwiki.github.io\/main\/\">perf<\/a> is a tool you can use in Linux for analyzing performance-related issues. It has many features (e.g. it can report statistics on cache misses and set dynamic probes on kernel functions), but the one I&#8217;m concerned with here is callchain sampling. That is, we can use perf as a <em><strong>sampling profiler<\/strong><\/em>.<\/p>\n\n\n\n<p>A sampling profiler periodically inspects the stack trace of the processes running on the CPUs at that moment. At each sampling tick, it records which function is currently running, which function called it, and so on up the call chain.<\/p>\n\n\n\n<p>Sampling profilers are a go-to tool for figuring out where time is spent when running code. Given enough samples, the number of samples in which a function appears is a good estimate of the percentage of time that function was on the stack. Furthermore, since callers and callees are also tracked, you can tell which functions called this one and how much time was spent in the functions it calls.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What is using perf like?<\/h3>\n\n\n\n<p>You can try this on your own system by running <code>perf top -g<\/code>, where <code>-g<\/code> stands for <em>\u201cEnable call-graph recording\u201d<\/em>. <em>perf top<\/em> gives you real-time information about where time is currently spent.
Alternatively, you can record a capture and then open it, for example:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nperf record -g .\/my-program  # or use -p PID to record an already running program\nperf report\n<\/pre><\/div>\n\n\n<pre class=\"wp-block-code\"><code>Samples: 11  of event 'cycles', Event count (approx.): 7158501\n  Children      Self  Command  Shared Object      Symbol\n-   86.52%     0.00%  xz       xz                 &#091;.] _start\n     _start\n   - __libc_start_main\n      - 72.04% main\n         - 66.16% coder_run\n              lzma_code\n              stream_encode\n              block_encode\n              lz_encode\n              lzma2_encode\n            - lzma_lzma_encode\n               - 37.36% lzma_lzma_optimum_normal\n                    lzma_mf_find\n                    lzma_mf_bt4_find\n                    __dabt_usr\n                    do_DataAbort\n                    do_page_fault\n                    handle_mm_fault\n                  - wp_page_copy\n                       37.36% __memset64\n                 28.81% rc_encode\n         - 5.88% args_parse\n              lzma_check_is_supported\n              ret_from_exception\n              do_PrefetchAbort\n              do_page_fault\n              handle_mm_fault\n...<\/code><\/pre>\n\n\n\n<p>The percentages represent the total time spent in each function. You can <strong>show or hide the callees<\/strong> of each function by selecting it with the arrow keys and then pressing the <strong><code>+<\/code> key<\/strong>.
You can expect the main function to take a significant chunk of the samples (that is, the entire time the program is running), which is subdivided between its callees, some taking more time than others, forming a weighted tree.<\/p>\n\n\n\n<p>For even more detail, perf also records the position of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Program_counter\"><strong>Program Counter<\/strong><\/a>, making it possible to know how much time is spent on each instruction within a given function. You can do this by pressing enter and selecting <em>\u201c<strong>Annotate code<\/strong>\u201d<\/em>. The following is a real example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>       \u2502     while (!feof(memInfoFile)) {\n  5.75 \u2502180:\u250c\u2500\u2192mov          r0, sl\n       \u2502    \u2502\u2192 bl           feof@plt\n 17.67 \u2502    \u2502  cmp          r0, #0\n       \u2502    \u2502\u2193 bne          594\n       \u2502    \u2502char token&#091;MEMINFO_TOKEN_BUFFER_SIZE + 1] = { 0 };\n  6.15 \u2502    \u2502  vmov.i32     q8, #0  @ 0x00000000\n  6.08 \u2502    \u2502  ldr          r3, &#091;fp, #-192] @ 0xffffff40\n  5.14 \u2502    \u2502  str          r0, &#091;fp, #-144] @ 0xffffff70\n       \u2502    \u2502if (fscanf(memInfoFile, \"%\" STRINGIFY(MEMINFO_TOKEN_BUFFER_SIZE) \"s%zukB\", token, &amp;amount) != 2)\n       \u2502    \u2502  mov          r2, r6\n  4.96 \u2502    \u2502  mov          r1, r5\n       \u2502    \u2502  mov          r0, sl\n       \u2502    \u2502char token&#091;MEMINFO_TOKEN_BUFFER_SIZE + 1] = { 0 };\n  5.98 \u2502    \u2502  vstr         d16, &#091;r7, #32]\n  6.61 \u2502    \u2502  vst1.8       {d16-d17}, &#091;r7]\n 11.91 \u2502    \u2502  vstr         d16, &#091;r7, #16]\n  5.52 \u2502    \u2502  vstr         d16, &#091;r7, #24]\n  5.67 \u2502    \u2502  vst1.8       {d16}, &#091;r3]\n       \u2502    \u2502if (fscanf(memInfoFile, \"%\" STRINGIFY(MEMINFO_TOKEN_BUFFER_SIZE) \"s%zukB\", token, &amp;amount) != 2)\n       
\u2502    \u2502  mov          r3, r9\n 11.83 \u2502    \u2502\u2192 bl           __isoc99_fscanf@plt\n  6.75 \u2502    \u2502  cmp          r0, #2\n       \u2502    \u2514\u2500\u2500bne          180<\/code><\/pre>\n\n\n\n<p>perf automatically attempts to use the available debug information from the binary to associate machine instructions with source lines. It can also highlight jump targets, making it easier to follow loops. By default, the left column shows the estimated percentage of the time spent in this function during which the accompanying instruction was running (other options are available with <code>--percent-type<\/code>).<\/p>\n\n\n\n<p>The above example is a 100% CPU usage bug found in WebKit, caused by a faulty implementation of <code>fscanf<\/code> in glibc. The loop is clearly visible in the capture. Although it is not visible in this fragment, the other instructions of the function appeared in virtually none of the samples, confirming that the loop never exits.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What do I need to use perf?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A <strong>way to traverse callchains<\/strong> efficiently on the target platform that is supported by perf.<\/li>\n\n\n\n<li><strong>Symbols for all functions<\/strong> in your call chains, even if they&#8217;re not exported, so that you can see their names instead of their addresses.<\/li>\n\n\n\n<li>A build with <strong>optimizations that are at least similar to production<\/strong>.<\/li>\n\n\n\n<li>If you want to track source lines: your build should contain some <strong>debuginfo<\/strong>. The minimal level of debugging info (<code>-g1<\/code> in gcc) is enough, and so is every level above.<\/li>\n\n\n\n<li>The <strong>perf<\/strong> binary, both on the target machine and on the machine where you want to view the results.
They don&#8217;t have to be the same machine and they don&#8217;t need to use the same architecture.<\/li>\n<\/ul>\n\n\n\n<p>If you use x86_64 or ARM64, you can expect this to work. You can stop reading and enjoy perf.<\/p>\n\n\n\n<p>Things are not so happy in ARM32 land. I spent roughly a month troubleshooting, learning lots of miscellaneous internals, and patching code all over the stack. After all of that I finally got it working, but it has certainly been a ride. The remaining parts of this series cover how I got there.<\/p>\n\n\n\n<p>This won&#8217;t be a tutorial in the usual sense. While you could follow this series like a tutorial, the goal is to get a better understanding of all the pieces involved so you&#8217;re better prepared when you have to do similar troubleshooting.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>perf is a tool you can use in Linux for analyzing performance-related issues. It has many features (e.g. it can report statistics on cache misses and set dynamic probes on kernel functions), but the one I&#8217;m concerned with here is callchain sampling. That is, we can use perf as a sampling profiler.
A sampling &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blogs.igalia.com\/aboya\/2025\/09\/09\/getting-perf-to-work-on-arm32-linux-part-1-the-tease\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Getting perf to work on ARM32 Linux: Part 1, the tease&#8221;<\/span><\/a><\/p>\n","protected":false},"author":57,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-106","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts\/106","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/users\/57"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/comments?post=106"}],"version-history":[{"count":6,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts\/106\/revisions"}],"predecessor-version":[{"id":146,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts\/106\/revisions\/146"}],"wp:attachment":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/media?parent=106"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/categories?post=106"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/tags?post=106"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}