{"id":112,"date":"2025-09-29T10:42:12","date_gmt":"2025-09-29T10:42:12","guid":{"rendered":"https:\/\/blogs.igalia.com\/aboya\/?p=112"},"modified":"2025-09-29T10:42:12","modified_gmt":"2025-09-29T10:42:12","slug":"getting-perf-to-work-on-arm32-linux-part-2-the-isas","status":"publish","type":"post","link":"https:\/\/blogs.igalia.com\/aboya\/2025\/09\/29\/getting-perf-to-work-on-arm32-linux-part-2-the-isas\/","title":{"rendered":"Getting perf to work on ARM32 Linux: Part 2, the ISAs"},"content":{"rendered":"\n<p>Welcome to the second part in this series on how to get perf to work on ARM32. If you just arrived here and want to know what perf is and why it is useful, refer to <a href=\"https:\/\/blogs.igalia.com\/aboya\/?p=106\">Part 1<\/a>\u2014it is very brief. If you&#8217;re already familiar with perf, you can skip it.<\/p>\n\n\n\n<p>To put it bluntly, <strong>ARM32 is a bit of a mess<\/strong>. Navigating this mess is a significant part of the difficulty in getting perf working. This post will focus on one of these messy parts: the ISAs, plural.<\/p>\n\n\n\n<p>The <em>ISA (Instruction Set Architecture)<\/em> of a CPU defines the set of instructions and registers available, as well as how they are encoded in machine code. ARM32 CPUs generally have not one but <strong>two coexisting ISAs: ARM and Thumb<\/strong>, with significant differences between them.<\/p>\n\n\n\n<p>Unlike, let&#8217;s say, 32-bit x86 and 64-bit x86 executables running in the same operating system, ARM and Thumb can, and often do, coexist in the same process. They have different sets of instructions and\u2014to a certain extent\u2014registers available, all while targeting the same hardware, and neither ISA is meant as a replacement for the other.<\/p>\n\n\n\n<p>If you&#8217;re interested in this series as a tutorial, you can probably skip this one.
If, on the other hand, you want to understand these concepts to be better prepared for when they inevitably pop up in your troubleshooting\u2014like they did in mine\u2014keep reading. This post will explain some consequential features of both ARM and Thumb, and how they are used in Linux.<\/p>\n\n\n\n<p>I highly recommend having a look at old ARM manuals while following this post. As is often the case with ISAs, old manuals are much more compact and easier to follow than current versions, making them a good choice for grasping the fundamentals. They often also have better diagrams, which were only possible when the CPUs were simpler\u2014the manuals for the ARM7TDMI (a very popular ARMv4T design for microcontrollers from the late 90s) are particularly helpful for introducing the architecture.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Some notable features of the ARM ISA<\/h3>\n\n\n\n<p>(Recommended introductory reference: <a href=\"https:\/\/blogs.igalia.com\/aboya\/files\/2025\/08\/ARM7TDMI-Manual-1995.pdf\">ARM7TDMI Manual (1995)<\/a>, Part 4: ARM Instruction Set. 64 pages, including examples.)<\/p>\n\n\n\n<p>The ARM ISA has a <strong>fixed instruction size of 32 bits<\/strong>.<\/p>\n\n\n\n<p>A notable feature is that the 4 most significant bits of each instruction contain a <strong>condition code<\/strong>. When you see <code>movge<\/code> in ARM assembly, that is the regular <code>mov<\/code> instruction with the condition code <code>1010<\/code> (<code>GE<\/code>: <em>Greater or Equal<\/em>). The condition code <code>1110<\/code> (<code>AL<\/code>: <em>Always<\/em>) is used for non-conditional instructions.<\/p>\n\n\n\n<p>ARM has 16 directly addressable registers, named <strong>r0 to r15<\/strong>.
Instructions use 4-bit fields to refer to them.<\/p>\n\n\n\n<p>The ABIs give specific purposes to several registers, but as far as the CPU itself goes, there are <strong>very few special registers<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>r15 is the <em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Program_counter\">Program Counter (PC)<\/a><\/em>: it contains the address of the instruction about to be executed. (Due to the pipeline design of early ARM CPUs, reading it as an operand yields the address of the current instruction plus 8 in ARM state, or plus 4 in Thumb state.)<\/li>\n\n\n\n<li>r14 is meant to be used as <em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Link_register\">Link Register (LR)<\/a><\/em>\u2014it contains the address a function will jump to on return.<br>This is used by the <code>bl<\/code> (<em>Branch with link<\/em>) instruction, which, before branching, also stores the return address (the address of the instruction following the <code>bl<\/code>) in r14 (lr). <code>bl<\/code> is the main instruction used for function calls in ARM.<\/li>\n<\/ul>\n\n\n\n<p>All calling conventions I&#8217;m aware of use r13 as a <strong><em>full-descending stack<\/em><\/strong>. \u201c<em>Full stack<\/em>\u201d means that the register points to the last item pushed, rather than to the address that will be used by the next push (\u201c<em>empty stack<\/em>\u201d). \u201c<em>Descending stack<\/em>\u201d means that as items are pushed, the address in the stack register decreases, as opposed to increasing (\u201c<em>ascending stack<\/em>\u201d). This is the same type of stack used in x86.<\/p>\n\n\n\n<p>The ARM ISA does not make assumptions about what type of stack programs use or what register is used for it, however. For stack manipulation, ARM has a <strong><em>Store Multiple<\/em> (<code>stm<\/code>)\/<em>Load Multiple<\/em> (<code>ldm<\/code>)<\/strong> instruction, which accepts any register as the \u201cstack register\u201d (the base register) and has flags for whether the stack is full or empty, ascending or descending, and whether the stack register should be updated at all (<em>\u201cwriteback\u201d<\/em>).
The \u201cmultiple\u201d in the name comes from the fact that instead of taking a single register argument, it takes a 16-bit register list, with one bit for each of the 16 registers. It loads or stores every register whose bit is set, with lower-numbered registers matched to lower addresses in the stack.<\/p>\n\n\n\n<p><code>push<\/code> and <code>pop<\/code> are assembler aliases for <code>stmfd r13!<\/code> (<em>Store Multiple Full-Descending<\/em> on r13 with writeback) and <code>ldmfd r13!<\/code> (<em>Load Multiple Full-Descending<\/em> on r13 with writeback) respectively\u2014the exclamation mark means writeback in ARM assembly code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Some notable features of the Thumb ISA<\/h3>\n\n\n\n<p>(Recommended introductory reference: <a href=\"https:\/\/blogs.igalia.com\/aboya\/files\/2025\/08\/ARM7TDMI-Manual-1995.pdf\">ARM7TDMI Manual (1995)<\/a>, Part 5: Thumb Instruction Set. 47 pages, including examples.)<\/p>\n\n\n\n<p>The <strong>Thumb-1 ISA has a fixed instruction size of 16 bits<\/strong>. This is meant to reduce code size, improve cache performance and make ARM32 competitive in applications previously reserved for 16-bit processors. Registers are still 32 bits in size.<\/p>\n\n\n\n<p>As you can imagine, having a fixed 16-bit size for instructions greatly limits what functionality is available: Thumb instructions generally have an ARM counterpart, but often not the other way around.<\/p>\n\n\n\n<p>Most instructions\u2014with the notable exception of the <em>branch<\/em> instruction\u2014<strong>lack condition codes<\/strong>. In this regard, Thumb works much more like x86.<\/p>\n\n\n\n<p>The vast majority of instructions only have 3-bit fields for register indices. This effectively means Thumb has <strong>only 8 registers<\/strong>\u2014so-called <em><strong>low registers<\/strong><\/em>\u2014available to most instructions.
The remaining registers\u2014referred to as <em><strong>high registers<\/strong><\/em>\u2014are only available in special encodings of a few select instructions.<\/p>\n\n\n\n<p><em>Store Multiple<\/em> (<code>stm<\/code>)\/<em>Load Multiple<\/em> (<code>ldm<\/code>) is largely replaced by <code>push<\/code> and <code>pop<\/code>, which here are not aliases but actual ISA instructions. They can <strong>only operate on low registers<\/strong> and\u2014as a special case\u2014<code>push<\/code> can also save LR and <code>pop<\/code> can also restore PC. The only stack supported is <em>full-descending<\/em> on r13, and writeback is always performed.<\/p>\n\n\n\n<p>A limited form of <em>Store Multiple<\/em> (<code>stm<\/code>)\/<em>Load Multiple<\/em> (<code>ldm<\/code>) that accepts any low register as base is available, but it can only load\/store low registers, writeback is still mandatory, and it only supports one addressing mode (\u201cincrement after\u201d). This is not meant for stack manipulation, but for copying several registers to or from memory at once.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Switching between ARM and Thumb<\/h3>\n\n\n\n<p>(Recommended reading: <a href=\"https:\/\/blogs.igalia.com\/aboya\/files\/2025\/08\/ARM7TDMI-Manual-1995.pdf\">ARM7TDMI Manual (1995)<\/a>, Part 2: Programmer&#8217;s Model. 3.2 Switching State. It&#8217;s just a few paragraphs.)<\/p>\n\n\n\n<p>ARM instructions are always <strong>32-bit aligned<\/strong> in memory, and Thumb instructions are always 16-bit aligned. Conveniently, this means the least significant bit of a code address is always zero, which frees it to be used as a <strong>flag<\/strong>, and ARM CPUs make use of this.<\/p>\n\n\n\n<p>When branching with the <code>bx<\/code> (<em>Branch with exchange<\/em>) instruction, the least significant bit of the register holding the branch address indicates whether, after the jump, the CPU should switch to ARM mode (0) or Thumb mode (1).<\/p>\n\n\n\n<p>It&#8217;s important to note that this bit in the address is just a flag: Thumb instructions themselves still lie at even addresses in memory.<\/p>\n\n\n\n<p>As a result, ARM and Thumb code can coexist in the same program, and an application compiled in one mode can use libraries compiled in the other. This is far from an esoteric feature; as an example, <a href=\"https:\/\/github.com\/buildroot\/buildroot\/blob\/feaf535\/package\/glibc\/glibc.mk#L65-L68\">buildroot always compiles glibc in ARM mode<\/a>, even if Thumb is used for the rest of the system.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Thumb-2 extension<\/h3>\n\n\n\n<p>(Recommended reference: <a href=\"https:\/\/web.archive.org\/web\/20250524021240\/https:\/\/class.ece.iastate.edu\/cpre288\/resources\/docs\/Thumb-2SupplementReferenceManual.pdf\">ARM Architecture Reference Manual: Thumb-2 Supplement (2005)<\/a>\u2014This one is already much longer, but it&#8217;s nevertheless the documentation from when Thumb-2 was introduced.)<\/p>\n\n\n\n<p>Thumb-2 is an extension of the original Thumb ISA. Instructions are no longer a fixed 16 bits in size; instead, <strong>instructions have variable size<\/strong> (16 or 32 bits).<\/p>\n\n\n\n<p>This makes it possible to reintroduce much of the functionality that was missing in Thumb, while only paying for the increased code size in the instructions that need it. For instance, <code>push<\/code> can now <strong>save high registers, but it becomes a 32-bit instruction<\/strong> when doing so.<\/p>\n\n\n\n<p>Just like in Thumb-1, most instructions still lack condition codes.
Instead, Thumb-2 introduces a different mechanism for making instructions conditional: the <strong><em>If-Then<\/em> (<code>it<\/code>) instruction<\/strong>. <code>it<\/code> receives a 4-bit condition code (same as in ARM) and a clever 4-bit \u201cmask\u201d. The <code>it<\/code> instruction makes each of the following (up to 4) instructions conditional on either the condition or its negation. The first instruction is never negated. In assembly, the letters after the initial <code>i<\/code> describe the block, one per instruction: <code>t<\/code> (<em>then<\/em>) for the condition and <code>e<\/code> (<em>else<\/em>) for its negation.<\/p>\n\n\n\n<p>An \u201c<em><strong>IT block<\/strong><\/em>\u201d is the sequence of instructions made conditional by a preceding <code>it<\/code> instruction.<\/p>\n\n\n\n<p>For instance, the 16-bit instruction <code>ittet ge<\/code> means: make the next 2 instructions conditional on \u201cgreater or equal\u201d, the following instruction conditional on \u201cless than (i.e. not greater or equal)\u201d, and the following instruction conditional on \u201cgreater or equal\u201d. <code>ite eq<\/code> makes the next instruction conditional on \u201cequal\u201d and the one after it conditional on \u201cnot equal\u201d.<\/p>\n\n\n\n<p><strong>The IT block deprecation mess:<\/strong> <a href=\"https:\/\/developer.arm.com\/documentation\/dui0802\/b\/A32-and-T32-Instructions\/IT\">Some ARM documentation pages<\/a> state that <code>it<\/code> instructions followed by 32-bit instructions, or by more than one instruction, are deprecated. According to <a href=\"https:\/\/reviews.llvm.org\/D118044?id=402518\">clang commits from 2022<\/a>, this decision has since been reverted. The <a href=\"https:\/\/developer.arm.com\/documentation\/ddi0487\/lb\/?lang=en\">current (2025) version of the ARM reference manual for the A series of ARM CPUs<\/a> remains vague about this, stating \u201c<em>Many uses of the IT instruction are deprecated for performance reasons<\/em>\u201d but without naming any specific use as deprecated on that same page.
The next time you see gcc or the GNU Assembler complaining about an IT block being \u201cperformance deprecated\u201d, this is what it is about.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Assembly code compatibility<\/h3>\n\n\n\n<p>Assemblers try to keep ARM and Thumb assembly as interchangeable as possible, so that the same source can be assembled as either, as long as it is restricted to instructions available in both\u2014something much more feasible since Thumb-2. ARM calls this common syntax the <em>Unified Assembler Language<\/em> (UAL); in GNU Assembler it is enabled with the <code>.syntax unified<\/code> directive.<\/p>\n\n\n\n<p>For instance, you can still use <code>it<\/code> instructions in code you assemble as ARM. The assembler checks that the IT block is consistent with the condition codes of the instructions it covers, so that the code would behave the same in Thumb as the ARM conditional instructions do, and then emits nothing for it. Conversely, instructions inside an IT block need to be tagged with the right condition code for the assembler not to complain, even though those conditions are not encoded in the resulting Thumb instructions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What determines if code gets compiled as ARM or Thumb<\/h3>\n\n\n\n<p>If you use buildroot, one of the settings you can tweak (<em>Target options\/ARM instruction set<\/em>) is whether ARM or Thumb-2 should be used by default.<\/p>\n\n\n\n<p>When you build gcc from source, one of the options you can pass to <code>.\/configure<\/code> is <code><strong>--with-mode=arm<\/strong><\/code> (or similarly, <code>--with-mode=thumb<\/code>). This determines <strong>which one is used by default<\/strong>\u2014that is, if the gcc command line does not specify either. In buildroot, when \u201cToolchain\/Toolchain type\u201d is configured to use \u201cBuildroot toolchain\u201d, buildroot builds its own gcc and uses this option.<\/p>\n\n\n\n<p>To specify which ISA to use <strong>for a particular file<\/strong>, you can use the gcc flags <code><strong>-marm<\/strong><\/code> or <code><strong>-mthumb<\/strong><\/code>.
In buildroot, when \u201cToolchain\/Toolchain type\u201d is configured to use \u201cExternal toolchain\u201d\u2014in which case the compiler is not compiled from source\u2014the corresponding flag is added to CFLAGS to make that mode the default for packages built with buildroot scripts.<\/p>\n\n\n\n<p>A mode can also be overridden on a per-function basis with <code>__attribute__((target(\"thumb\")))<\/code> or <code>__attribute__((target(\"arm\")))<\/code>. This is not very common, however.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">GNU Assembler and ARM vs Thumb<\/h3>\n\n\n\n<p>In GNU Assembler, ARM or Thumb is selected with the <code>.arm<\/code> or <code>.thumb<\/code> directives respectively\u2014alternatively, <code>.code 32<\/code> and <code>.code 16<\/code> respectively have the same effect.<\/p>\n\n\n\n<p>Each function that starts with Thumb code must be prefaced with the <code>.thumb_func<\/code> directive. This is necessary so that the symbol for the function includes the Thumb bit, and therefore branching to the function is done in the correct mode.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">ELF object files<\/h3>\n\n\n\n<p>There are several ways ELF files can encode the mode of a function, but the most common and most reliable indicator is the address of the function symbol. ELF files use the same \u201c<strong>lowest address bit means Thumb<\/strong>\u201d convention as the CPU.<\/p>\n\n\n\n<p>Unfortunately, while tools like objdump need to figure out the mode of each function in order to, for example, disassemble it correctly, I have not found any high-level flag in either objdump or readelf to query this information.
Instead, here are a couple of Bash one-liners using readelf.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: bash; title: ; notranslate\" title=\"\">\nsyms_arm() { &quot;${p:-}readelf&quot; --syms --wide &quot;$@&quot; |grep -E &#039;^\\s*&#x5B;&#x5B;:digit:]]+: &#x5B;0-9a-f]*&#x5B;02468ace]\\s+\\S+\\s+(FUNC|IFUNC)\\s+&#039;; }\nsyms_thumb() { &quot;${p:-}readelf&quot; --syms --wide &quot;$@&quot; |grep -E &#039;^\\s*&#x5B;&#x5B;:digit:]]+: &#x5B;0-9a-f]*&#x5B;13579bdf]\\s+\\S+\\s+(FUNC|IFUNC)\\s+|THUMB_FUNC&#039;; }\n<\/pre><\/div>\n\n\n<ol class=\"wp-block-list\">\n<li>The regular expressions match on the parity of the symbol address, that is, on its last hexadecimal digit.<\/li>\n\n\n\n<li><code>$p<\/code> is an optional variable I assign to my compiler prefix (e.g. <code>\/br\/output\/host\/bin\/arm-buildroot-linux-gnueabihf-<\/code>).<br>Note, however, that since the above commands only use <code>readelf<\/code>, which can read ELF files for any architecture, they work even without a cross-compiling toolchain.<\/li>\n\n\n\n<li><code>THUMB_FUNC<\/code> is written by readelf when a symbol has type <code>STT_ARM_TFUNC<\/code>. This is another mechanism I&#8217;m aware of that object files can use to mark functions as Thumb, so I&#8217;ve included it for completeness, but I have not found any usage of it in the wild.<\/li>\n<\/ol>\n\n\n\n<p>If your binaries are not stripped (for instance, because you&#8217;re keeping <strong>debug symbols<\/strong>), ranges of ARM and Thumb code are also marked with <strong><code>$a<\/code><\/strong> and <strong><code>$t<\/code><\/strong> mapping symbols respectively. You can see them with <code>readelf --syms<\/code>. This has the advantage\u2014at least in theory\u2014of working even when ARM and Thumb code are mixed within the same function.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Closing remarks<\/h3>\n\n\n\n<p>I hope someone else finds this mini-introduction to ARM32 useful.
Now that we have an understanding of the ARM ISAs, in the next part we will go one layer higher and discuss the ABIs (plural again, tragically!)\u2014that is, what expectations functions have of each other when they call one another.<\/p>\n\n\n\n<p>In particular, we are interested in how the different ABIs handle\u2014or not\u2014frame pointers, which we will need in order for perf to do sampling profiling of large applications on low-end devices with acceptable performance.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to the second part in this series on how to get perf to work on ARM32. If you just arrived here and want to know what is perf and why it would be useful, refer to Part 1\u2014it is very brief. If you&#8217;re already familiar with perf, you can skip it. To put it &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blogs.igalia.com\/aboya\/2025\/09\/29\/getting-perf-to-work-on-arm32-linux-part-2-the-isas\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Getting perf to work on ARM32 Linux: Part 2, the
ISAs&#8221;<\/span><\/a><\/p>\n","protected":false},"author":57,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-112","post","type-post","status-publish","format-standard","hentry","category-uncategorized","entry"],"_links":{"self":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts\/112","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/users\/57"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/comments?post=112"}],"version-history":[{"count":9,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts\/112\/revisions"}],"predecessor-version":[{"id":147,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/posts\/112\/revisions\/147"}],"wp:attachment":[{"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/media?parent=112"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/categories?post=112"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.igalia.com\/aboya\/wp-json\/wp\/v2\/tags?post=112"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}