Getting perf to work on ARM32 Linux: Part 2, the ISAs

Welcome to the second part in this series on how to get perf to work on ARM32. If you just arrived here and want to know what perf is and why it would be useful, refer to Part 1—it is very brief. If you’re already familiar with perf, you can skip it.

To put it bluntly, ARM32 is a bit of a mess, and navigating that mess is a significant part of the difficulty in getting perf working. This post will focus on one of these messy parts: the ISAs, plural.

The ISA (Instruction Set Architecture) of a CPU defines the set of instructions and registers available, as well as how they are encoded in machine code. ARM32 CPUs generally have not one but two coexisting ISAs: ARM and Thumb, with significant differences between them.

Unlike, say, 32-bit and 64-bit x86 executables running in the same operating system, ARM and Thumb can and often do coexist in the same process. They have different sets of instructions and—to a certain extent—registers available, all while targeting the same hardware, and neither ISA is meant as a replacement for the other.

If you’re interested in this series as a tutorial, you can probably skip this one. If, on the other hand, you want to understand these concepts so you’re better prepared when they inevitably pop up in your troubleshooting—as they did in mine—keep reading. This post will explain some consequential features of both ARM and Thumb, and how they are used in Linux.

I highly recommend having a look at old ARM manuals while following this post. As often happens with ISAs, old manuals are much more compact and easier to follow than the current versions, making them a good choice for grasping the fundamentals. They often also have better diagrams, which were only possible when the CPUs were simpler—the manuals for the ARM7TDMI (a very popular ARMv4T design for microcontrollers from the late 90s) are particularly helpful for introducing the architecture.

Some notable features of the ARM ISA

(Recommended introductory reference: ARM7TDMI Manual (1995), Part 4: ARM Instruction Set. 64 pages, including examples.)

The ARM ISA has a fixed instruction size of 32 bits.

A notable feature is that the 4 most significant bits of each instruction contain a condition code. When you see mov.ge in assembly for ARM, that is the regular mov instruction with the condition code 1010 (GE: Greater or Equal). The condition code 1110 (AL: Always) is used for unconditional instructions.
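
This can be made concrete with a minimal sketch of mine (the helper name and lookup table are illustrative, not from any real ARM tool): treating an instruction as a 32-bit integer, the condition code is simply its top 4 bits.

```python
# Minimal sketch: the 4 most significant bits of an ARM instruction
# word are its condition code. The table is a subset of the 16 codes.
COND_NAMES = {0b1010: "GE", 0b1110: "AL"}

def cond_of(instruction_word):
    # Shift the 32-bit word right so only bits 31:28 remain.
    return COND_NAMES.get(instruction_word >> 28, "other")

print(cond_of(0xE1A00000))  # mov r0, r0 (a classic ARM nop): AL
print(cond_of(0xA1A00000))  # the same mov, made conditional: GE
```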

ARM has 16 directly addressable registers, named r0 to r15. Instructions use 4-bit fields to refer to them.

The ABIs give specific purposes to several registers, but as far as the CPU itself goes, there are very few special registers:

  • r15 is the Program Counter (PC): it contains the address of the instruction about to be executed.
  • r14 is meant to be used as Link Register (LR)—it contains the address a function will jump to on return.
    This is used by the bl (Branch with link) instruction, which, before branching, also updates r14 (lr) with the value of r15 (pc); it is the main instruction used for function calls in ARM.

All calling conventions I’m aware of use r13 as a full-descending stack. “Full stack” means that the register points to the last item pushed, rather than to the address that will be used by the next push (“open stack”). “Descending stack” means that as items are pushed, the address in the stack register decreases, as opposed to increasing (“ascending stack”). This is the same type of stack used in x86.

The ARM ISA does not make assumptions about what type of stack programs use or what register is used for it, however. For stack manipulation, ARM has a Store Multiple (stm)/Load Multiple (ldm) instruction, which accepts any register as “stack register” and has flags for whether the stack is full or open, ascending or descending, and whether the stack register should be updated at all (“writeback”). The “multiple” in the name comes from the fact that instead of taking a single register argument, it operates on a 16-bit field representing all 16 registers. It will load or store every set register, with lower-index registers matched to lower addresses in the stack.

push and pop are assembler aliases for stmfd r13! (Store Multiple Full-Descending on r13 with writeback) and ldmfd r13! (Load Multiple Full-Descending on r13 with writeback) respectively—the exclamation mark means writeback in ARM assembly code.
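
To make the addressing mode concrete, here is a small Python simulation of my own (not real tooling) of the full-descending store with writeback that push aliases: the base register is decremented first, and lower-index registers land at lower addresses.

```python
# Sketch: simulating the semantics of stmfd (the addressing mode
# behind the push alias). Memory is a dict from addresses to values;
# regs is the register file, indexed by register number.
def stmfd(memory, regs, base_reg, reg_list, writeback=True):
    # Full-descending: decrement first, so the stack register ends up
    # pointing at the last item pushed (the lowest address written).
    base = regs[base_reg] - 4 * len(reg_list)
    for offset, reg in enumerate(sorted(reg_list)):
        memory[base + 4 * offset] = regs[reg]
    if writeback:
        regs[base_reg] = base  # the '!' in "stmfd r13!, {...}"

regs = {13: 0x1000, 0: 0xAA, 1: 0xBB, 14: 0xCC}
memory = {}
stmfd(memory, regs, 13, [0, 1, 14])   # push {r0, r1, lr}
print(hex(regs[13]))                  # 0xff4: sp moved down 12 bytes
print(hex(memory[regs[13]]))          # 0xaa: r0 at the lowest address
```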

Some notable features of the Thumb ISA

(Recommended introductory reference: ARM7TDMI Manual (1995), Part 5: Thumb Instruction Set. 47 pages, including examples.)

The Thumb-1 ISA has a fixed instruction size of 16 bits. This is meant to reduce code size, improve cache performance and make ARM32 competitive in applications previously reserved for 16-bit processors. Registers are still 32 bits in size.

As you can imagine, having a fixed 16-bit size for instructions greatly limits what functionality is available: Thumb instructions generally have an ARM counterpart, but often not the other way around.

Most instructions—with the notable exception of the branch instruction—lack condition codes. In this regard, Thumb works much more like x86.

The vast majority of instructions only have space for 3 bits for indexing registers. This effectively means Thumb has only 8 registers—so-called low registers—available to most instructions. The remaining registers—referred to as high registers—are only available in special encodings of a few select instructions.

Store Multiple (stm)/Load Multiple (ldm) is largely replaced by push and pop, which here are not aliases but actual ISA instructions; they can only operate on low registers and—as a special case—push can store LR and pop can load PC. The only stack supported is full-descending on r13, and writeback is always performed.

A limited form of Store Multiple (stm)/Load Multiple (ldm) with support for an arbitrary low register as base is available, but it can only load/store low registers, writeback is still mandatory, and it only supports one addressing mode (“increment after”). This is not meant for stack manipulation, but for reading and writing several registers from/to memory at once.
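
For contrast with the full-descending sketch above, the “increment after” mode can be simulated the same way (again a sketch of mine, with illustrative names): addresses start at the base and grow upward, and the base register is always written back.

```python
# Sketch: Thumb-1's only ldm/stm addressing mode, "increment after".
# Loads start at the base address and move upward; writeback of the
# base register is mandatory in Thumb-1.
def ldmia(memory, regs, base_reg, reg_list):
    addr = regs[base_reg]
    for reg in sorted(reg_list):       # lower-index regs, lower addresses
        regs[reg] = memory[addr]
        addr += 4
    regs[base_reg] = addr              # base now points past the block

memory = {0x100: 1, 0x104: 2, 0x108: 3}
regs = {0: 0x100}
ldmia(memory, regs, 0, [1, 2, 3])      # ldmia r0!, {r1, r2, r3}
print(regs[1], regs[2], regs[3])       # 1 2 3
print(hex(regs[0]))                    # 0x10c: base advanced 12 bytes
```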

Switching between ARM and Thumb

(Recommended reading: ARM7TDMI Manual (1995), Part 2: Programmer’s Model. 3.2 Switching State. It’s just a few paragraphs.)

All memory accesses in ARM must be 32-bit aligned. Conveniently, this leaves the 2 least significant bits of addresses free to be used as flags, and ARM CPUs make use of this.

When branching with the bx (Branch and exchange) instruction, the least significant bit of the register holding the branch address indicates whether the CPU should switch after the jump to ARM mode (0) or Thumb mode (1).

It’s important to note that this bit in the address is just a flag: Thumb instructions actually lie at even addresses in memory.
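
A tiny sketch (the helper name is hypothetical) of how the CPU interprets a bx operand, stripping the flag bit before using the address:

```python
# Sketch: the low bit of a bx target selects the mode; the actual
# branch address always has that bit cleared.
def decode_bx_target(address):
    mode = "thumb" if address & 1 else "arm"
    return address & ~1, mode

print(decode_bx_target(0x8001))  # (0x8000, 'thumb'): flag stripped
print(decode_bx_target(0x8000))  # (0x8000, 'arm')
```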

As a result, ARM and Thumb code can coexist in the same program, and applications can use libraries compiled in the other mode. This is far from an esoteric feature; as an example, buildroot always compiles glibc in ARM mode, even if Thumb is used for the rest of the system.

The Thumb-2 extension

(Recommended reference: ARM Architecture Reference Manual: Thumb-2 Supplement (2005)—this one is already much longer, but it’s nevertheless the documentation from when Thumb-2 was introduced.)

Thumb-2 is an extension of the original Thumb ISA. Instructions no longer have a fixed 16-bit size; instead, they have a variable size (16 or 32 bits).

This reintroduces a lot of functionality that was previously missing in Thumb while only paying the increased code size in the instructions that require it. For instance, push can now save high registers, but it becomes a 32-bit instruction when doing so.

Just like in Thumb-1, most instructions still lack condition codes. Instead, Thumb-2 introduces a different mechanism for making instructions conditional: the If-Then (it) instruction. it receives a 4-bit condition code (the same as in ARM) and a clever 4-bit “mask”. The it instruction makes the execution of up to 4 following instructions conditional on either the condition or its negation. The first instruction is never negated.

An “IT block” is the sequence of instructions made conditional by a previous it instruction.

For instance, the 16-bit instruction ittet ge means: make the next 2 instructions conditional on “greater or equal”, the following instruction conditional on “less than” (i.e. not greater or equal), and the one after that conditional on “greater or equal” again. ite eq would make the following instruction conditional on “equal” and the one after it conditional on “not equal”.
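
The suffix letters of the mnemonic map directly to per-instruction conditions. The following sketch (the helper is hypothetical, and you pass the negated condition explicitly) expands an it mnemonic the way the assembler syntax reads:

```python
# Sketch: expanding an it-instruction mnemonic (e.g. "ittet") into the
# condition each of the up-to-4 following instructions executes under.
def it_block_conditions(mnemonic, cond, negated):
    conds = [cond]  # the first instruction of the block is never negated
    for letter in mnemonic[2:]:   # remaining 't' (then) / 'e' (else) letters
        conds.append(cond if letter == "t" else negated)
    return conds

print(it_block_conditions("ittet", "ge", "lt"))  # ['ge', 'ge', 'lt', 'ge']
print(it_block_conditions("ite", "eq", "ne"))    # ['eq', 'ne']
```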

The IT block deprecation mess: Some ARM documentation pages state that it instructions followed by 32-bit instructions, or by more than one instruction, are deprecated. According to clang commits from 2022, this decision has since been reverted. The current (2025) version of the ARM reference manual for the A series of ARM CPUs remains vague about this, claiming that “Many uses of the IT instruction are deprecated for performance reasons” but not declaring any specific use deprecated on that same page. Next time you see gcc or GNU Assembler complaining about a certain IT block being “performance deprecated”, this is what that is about.

Assembly code compatibility

Assemblers try to keep ARM and Thumb mutually interchangeable where possible, so that you can write assembly code that can be assembled as either, as long as you restrict your code to instructions available in both—something much more feasible since Thumb-2.

For instance, you can still use it instructions in code you assemble as ARM. The assembler will do some checks to make sure your IT block would work the same in Thumb as it would as ARM conditional instructions, and then ignore it. Conversely, instructions inside an IT block need to be tagged with the right condition code for the assembler not to complain, even though those conditions are stripped when producing Thumb.

What determines if code gets compiled as ARM or Thumb

If you use a buildroot environment, one of the settings you can tweak (Target options/ARM instruction set) is whether ARM or Thumb-2 should be used by default.

When you build gcc from source one of the options you can pass to ./configure is --with-mode=arm (or similarly, --with-mode=thumb). This determines which one is used by default—that is, if the gcc command line does not specify either. In buildroot, when “Toolchain/Toolchain type” is configured to use “Buildroot toolchain”, buildroot builds its own gcc and uses this option.

To specify which ISA to use for a particular file you can use the gcc flags -marm or -mthumb. In buildroot, when “Toolchain/Toolchain type” is configured to use “External toolchain”—in which case the compiler is not compiled from source—either of these flags is added to CFLAGS as a way to make it the default for packages built with buildroot scripts.

The mode can also be overridden on a per-function basis with __attribute__((target("thumb"))). This is not very common, however.

GNU Assembler and ARM vs Thumb

In GNU Assembler, ARM or Thumb is selected with the .arm or .thumb directives respectively—alternatively, .code 32 and .code 16, respectively, have the same effect.

Each function that starts with Thumb code must be preceded by the .thumb_func directive. This is necessary so that the symbol for the function includes the Thumb bit, and therefore branches to the function are done in the correct mode.

ELF object files

There are several ways ELF files can encode the mode of a function, but the most common and most reliable is to check the addresses of the symbols. ELF files use the same “lowest address bit means Thumb” convention as the CPU.

Unfortunately, while tools like objdump need to figure out the mode of functions in order to e.g. disassemble them correctly, I have not found any high-level flag in either objdump or readelf to query this information. Instead, here are a couple of Bash one-liners using readelf.

syms_arm() { "${p:-}readelf" --syms --wide "$@" |grep -E '^\s*[[:digit:]]+: [0-9a-f]*[02468ace]\s+\S+\s+(FUNC|IFUNC)\s+'; }
syms_thumb() { "${p:-}readelf" --syms --wide "$@" |grep -E '^\s*[[:digit:]]+: [0-9a-f]*[13579bdf]\s+\S+\s+(FUNC|IFUNC)\s+|THUMB_FUNC'; }
  1. The regular expression matches on the parity of the address.
  2. $p is an optional variable I assign to my compiler prefix (e.g. /br/output/host/bin/arm-buildroot-linux-gnueabihf-).
    Note however that since the above commands just use readelf, they will work even without a cross-compiling toolchain.
  3. THUMB_FUNC is written by readelf when a symbol has the type STT_ARM_TFUNC. This is another mechanism I’m aware object files can use for marking functions as Thumb, so I’ve included it for completeness; but I have not found any usages of it in the wild.

If you’re building or assembling with debug symbols, ranges of ARM and Thumb code are also marked with $a and $t symbols respectively. You can see them with readelf --syms. This has the advantage—at least in theory—of working even in the presence of ARM and Thumb mixed in the same function.

Closing remarks

I hope someone else finds this mini-introduction to ARM32 useful. Now that we have an understanding of the ARM ISAs, in the next part we will go one layer higher and discuss the ABIs (plural again, tragically!)—that is, what expectations functions have of each other as they call one another.

In particular, we are interested in how the different ABIs handle—or not—frame pointers, which we will need in order for perf to do sampling profiling of large applications on low end devices with acceptable performance.

Getting perf to work on ARM32 Linux: Part 1, the tease

perf is a tool you can use in Linux for analyzing performance-related issues. It has many features (e.g. it can report statistics on cache misses and set dynamic probes on kernel functions), but the one I’m concerned with at this point is callchain sampling. That is, we can use perf as a sampling profiler.

A sampling profiler periodically inspects the stack traces of the processes running on the CPUs at that time. During the sampling tick, it will record what function is currently running, what function called it, and so on recursively.

Sampling profilers are a go-to tool for figuring out where time is spent when running code. Given enough samples, you can draw a clear correlation between the number of samples in which a function was found and the percentage of time that function was on the stack. Furthermore, since callers and callees are also tracked, you can know which other functions called this one and how much time was spent in other functions inside this one.
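
The aggregation idea can be sketched in a few lines (synthetic samples of mine; function names borrowed from the perf output shown in this post): a function’s inclusive percentage is just the fraction of samples whose callchain contains it.

```python
# Sketch: turning raw stack samples into inclusive percentages, like
# the 'Children' column of perf report. Each sample is a callchain
# from the outermost caller to the currently running function.
samples = [
    ("main", "coder_run", "lzma_code"),
    ("main", "coder_run", "lzma_code"),
    ("main", "coder_run", "rc_encode"),
    ("main", "args_parse"),
]

def inclusive_percent(samples, function):
    # Count samples where the function appears anywhere on the stack.
    hits = sum(1 for stack in samples if function in stack)
    return 100.0 * hits / len(samples)

print(inclusive_percent(samples, "main"))       # 100.0
print(inclusive_percent(samples, "coder_run"))  # 75.0
```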

What is using perf like?

You can try this on your own system by running perf top -g, where -g stands for “Enable call-graph recording”. perf top gives you real-time information about where time is currently being spent. Alternatively, you can record a capture and then open it, for example:

perf record -g ./my-program  # or use -p PID to record an already running program
perf report
Samples: 11  of event 'cycles', Event count (approx.): 7158501
  Children      Self  Command  Shared Object      Symbol
-   86.52%     0.00%  xz       xz                 [.] _start
     _start
   - __libc_start_main
      - 72.04% main
         - 66.16% coder_run
              lzma_code
              stream_encode
              block_encode
              lz_encode
              lzma2_encode
            - lzma_lzma_encode
               - 37.36% lzma_lzma_optimum_normal
                    lzma_mf_find
                    lzma_mf_bt4_find
                    __dabt_usr
                    do_DataAbort
                    do_page_fault
                    handle_mm_fault
                  - wp_page_copy
                       37.36% __memset64
                 28.81% rc_encode
         - 5.88% args_parse
              lzma_check_is_supported
              ret_from_exception
              do_PrefetchAbort
              do_page_fault
              handle_mm_fault
...

The percentage numbers represent total time spent in that function. You can show or hide the callees of each function by selecting it with the arrow keys and then pressing the + key. You can expect the main function to take a significant chunk of the samples (that is, the entire time the program is running), which is subdivided between its callees, some taking more time than others, forming a weighted tree.

For even more detail, perf also records the position of the Program Counter, making it possible to know how much time is spent on each instruction within a given function. You can do this by pressing enter and selecting Annotate code. The following is a real example:

       │     while (!feof(memInfoFile)) {
  5.75 │180:┌─→mov          r0, sl
       │    │→ bl           feof@plt
 17.67 │    │  cmp          r0, #0
       │    │↓ bne          594
       │    │char token[MEMINFO_TOKEN_BUFFER_SIZE + 1] = { 0 };
  6.15 │    │  vmov.i32     q8, #0  @ 0x00000000
  6.08 │    │  ldr          r3, [fp, #-192] @ 0xffffff40
  5.14 │    │  str          r0, [fp, #-144] @ 0xffffff70
       │    │if (fscanf(memInfoFile, "%" STRINGIFY(MEMINFO_TOKEN_BUFFER_SIZE) "s%zukB", token, &amount) != 2)
       │    │  mov          r2, r6
  4.96 │    │  mov          r1, r5
       │    │  mov          r0, sl
       │    │char token[MEMINFO_TOKEN_BUFFER_SIZE + 1] = { 0 };
  5.98 │    │  vstr         d16, [r7, #32]
  6.61 │    │  vst1.8       {d16-d17}, [r7]
 11.91 │    │  vstr         d16, [r7, #16]
  5.52 │    │  vstr         d16, [r7, #24]
  5.67 │    │  vst1.8       {d16}, [r3]
       │    │if (fscanf(memInfoFile, "%" STRINGIFY(MEMINFO_TOKEN_BUFFER_SIZE) "s%zukB", token, &amount) != 2)
       │    │  mov          r3, r9
 11.83 │    │→ bl           __isoc99_fscanf@plt
  6.75 │    │  cmp          r0, #2
       │    └──bne          180

perf automatically attempts to use the available debug information from the binary to associate machine instructions with source lines. It can also highlight jump targets making it easier to follow loops. By default the left column shows the estimated percentage of time within this function where the accompanying instruction was running (other options are available with --percent-type).

The above example is a 100% CPU usage bug found in WebKit caused by a faulty implementation of fprintf in glibc. We can see the looping clearly in the capture. It’s also possible to derive—albeit not visible in the fragment—that other instructions of the function did not appear in virtually any of the samples, confirming the loop never exits.

What do I need to use perf?

  • A way to traverse callchains efficiently on the target platform that is supported by perf.
  • Symbols for all functions in your call chains, even if they’re not exported, so that you can see their names instead of raw addresses.
  • A build with optimizations that are at least similar to production.
  • If you want to track source lines: Your build should contain some debuginfo. The minimal level of debugging info (-g1 in gcc) is OK, and so is every level above.
  • The perf binary, both in the target machine and in the machine you want to see the results. They don’t have to be the same machine and they don’t need to use the same architecture.

If you use x86_64 or ARM64, you can expect this to work. You can stop reading and enjoy perf.

Things are not so happy in ARM32 land. I spent roughly a month troubleshooting, learning lots of miscellaneous internals, and patching code all over the stack; after all of that, I finally got it working, but it has certainly been a ride. The remaining parts of this series cover how I got there.

This won’t be a tutorial in the usual sense. While you could follow this series like a tutorial, the goal is to get a better understanding of all the pieces involved so you’re more prepared when you have to do similar troubleshooting.

Setting up Visual Studio Code to work with WebKitGTK using clangd

Lately I’ve been working on a refactor in the append pipeline of the MediaSource Extensions implementation of WebKit for the GStreamer ports. Working on refactors often triggers many build issues, not only because they often encompass a lot of code, but also because it’s very easy to miss errors in client code when updating an interface.

The traditional way to tackle this problem is by doing many build cycles: compile, fix the topmost error and maybe some other errors in view that seem legit (note that in C++ it’s very common to have chains of errors that are a consequence of previous errors), and repeat until it builds successfully.

This approach is not very pleasant in a project like WebKit, where an incremental build of a single file takes just long enough to invite a distraction. It’s worse when it’s not just one file but a complete build, which may stop at any time depending on the order the build system chooses for the files. Often it takes more time to wait for the compiler to show the error than to fix the error.

Unpleasantness hurts motivation, and lack of motivation hurts productivity, and by the end of the day you are tired and still not done. Somehow it feels like the time spent fixing trivial build issues is substantially more than the time of a build cycle times the number of errors. Whether that perception is accurate or not, I am acutely aware of the huge impact helpful tooling has on both productivity and quality of life, both while you’re working and after you’re done, so I decided to have a look at the state of modern C++ language servers on a large codebase like WebKit. Previous experiences were very unsuccessful, but there are people dedicated to this and progress has been made.

Creating a WebKit project in VS Code

  1. Open the directory containing the WebKit checkout in VS Code.
  2. WebKit has A LOT of files. If you use Linux you will see a warning telling you to increase the number of inotify watchers. Do so if you haven’t done it before; but even then it will not be enough, because WebKit has more files than the maximum number of inotify watchers supported by the kernel. Watchers also use memory.
  3. Go to File/Preferences/Settings, click the Workspace tab, search for Files: Watcher Exclude and add the following patterns:
    **/CMakeFiles/**
    **/JSTests/**
    **/LayoutTests/**
    **/Tools/buildstream/cache/**
    **/Tools/buildstream/repo/**
    **/WebKitBuild/UserFlatpak/repo/**

    This will keep the number of watches at a workable 258k. Still a lot, but under the 1M limit.

How to set up clangd

The following instructions assume you’re using WebKitGTK with the WebKit Flatpak SDK. They should also work for WPE with minimal substitutions.

  1. Microsoft has its own C++ plugin for VS Code, which may be installed by default. The authors of the clangd plugin recommend uninstalling it, as running both doesn’t make much sense and could cause conflicts.
  2. Install the clangd extension for VS Code from the VS Code Marketplace.
  3. The WebKit flatpak SDK already includes clangd, so there’s no need to install it separately if you’re using the SDK. On the other hand, because the flatpak has a virtual filesystem, it’s necessary to map paths from the flatpak to the outside. You can create this wrapper script for this purpose. Make sure to give it execution rights (chmod +x).
    #!/bin/bash
    set -eu
    # https://stackoverflow.com/a/17841619
    function join_by { local d=${1-} f=${2-}; if shift 2; then printf %s "$f" "${@/#/$d}"; fi; }
    
    local_webkit=/webkit
    include_path=("$local_webkit"/WebKitBuild/UserFlatpak/runtime/org.webkit.Sdk/x86_64/*/active/files/include)
    if [ ! -f "${include_path[0]}/stdio.h" ]; then
      echo "Couldn't find the directory hosting the /usr/include of the flatpak SDK."
      exit 1
    fi
    include_path="${include_path[0]}"
    mappings=(
      "$local_webkit/WebKitBuild/GTK/Debug=/app/webkit/WebKitBuild/Debug"
      "$local_webkit/WebKitBuild/GTK/Release=/app/webkit/WebKitBuild/Release"
      "$local_webkit=/app/webkit"
      "$include_path=/usr/include"
    )
    
    exec "$local_webkit"/Tools/Scripts/webkit-flatpak --gtk --debug run -c clangd --path-mappings="$(join_by , "${mappings[@]}")" "$@"

    Make sure to set the path of your WebKit repository in local_webkit.

    Then, in VS Code, go to File/Preferences/Settings, and in the left pane, search for Extensions/clangd. Change Clangd: Path to the absolute path of the script saved above. I recommend making these changes in the Workspace tab, so they apply only to WebKit.

  4. Create a symlink named compile_commands.json inside the root of the WebKit checkout directory pointing to the compile_commands.json file of the WebKit build you will be using, for instance: WebKitBuild/GTK/Debug/compile_commands.json
  5. Create a .clangd file inside the root of the WebKit checkout directory with these contents:
    If:
        PathMatch: "(/app/webkit/)?Source/.*\\.h"
        PathExclude: "(/app/webkit/)?Source/ThirdParty/.*"
    
    CompileFlags:
        Add: [-include, config.h]

    This includes config.h in WebKit header files, with the exception of those in Source/ThirdParty. Note: If you need to add additional rules, this is done by adding additional YAML documents, which are separated by a --- line.

  6. The VS Code clangd plugin doesn’t read .clangd by default. Instead, it has to be instructed to do so by adding --enable-config to Clangd: Arguments. Also add --limit-results=5000, since the default limit for cross-reference search results (100) is too small for WebKit.

    Additional tip: clangd will also add #include lines when you autocomplete a type. While the intention is good, this can often lead to spurious redundant includes. I have disabled it by adding --header-insertion=never to clangd’s arguments.
  7. Restart VS Code. The next time you open a C++ file you will get a prompt asking you to confirm your edited configuration.

VS Code will start indexing your code, and you will see a progress count in the status bar.

Debugging problems

clangd has a log. To see it, click View/Output, then in the Output panel combo box, select clangd.

The clangd database is stored in .cache/clangd inside the WebKit checkout directory. rm -rf’ing that directory will reset it back to its initial state.

For each compilation unit indexed, you’ll find a file following the pattern .cache/clangd/index/<Name>.<Hash>.idx. For instance: .cache/clangd/index/MediaSampleGStreamer.cpp.0E0C77DCC76C3567.idx. This way you can check whether a particular compilation unit has been indexed.

Bug: Some files are not indexed

You may notice VS Code has not indexed all your files. This is apparent when using the Find all references feature, since you may be missing results. This particularly affects generated code, especially unified sources (.cpp files generated by concatenating, via #include, a series of related .cpp files with the purpose of speeding up the build compared to compiling them as individual units).

I don’t know the reason for this bug, but I can confirm the following workaround: Open a UnifiedSources file. Any UnifiedSources file will do. You can find them in paths such as WebKitBuild/GTK/Debug/WebCore/DerivedSources/unified-sources/UnifiedSource-043dd90b-1.cpp. After you open any of them, you’ll see VS Code indexing over a thousand files that were skipped before. You can close the file now. Find all references should work once the indexing is done.

Things that work

Overall I’m quite satisfied with the setup. The following features work:

  • Autocompletion:
  • . gets replaced by -> when autocompleting a member inside an object accessible by dereferencing a pointer or smart pointer. (. will autocomplete not only the members of the object, but also those of the pointee).
  • Right click/Find All References: What it finds is accurate, although I don’t feel very confident in it being exhaustive, as that requires a full index.
  • Right click/Show Call Hierarchy: This is a useful tool that shows what functions call the selected function, and so on, automating what is otherwise a very manual process, at least when it’s exhaustive enough.
  • Right click/Type hierarchy: It shows the class tree containing a particular class (ancestors, child classes and siblings).
  • Error reporting: the right bar of VS Code will show errors and warnings that clangd identifies in the code. It’s important to note that there is a maximum number of errors per file, after which checking stops, so it’s a good idea to start from the top of the file. The errors seem quite precise and avoid a lot of trips to the compiler. Unfortunately, they’re not completely exhaustive, so even after the file shows no errors in clangd, it might still show errors in the actual compiler; but clangd still catches most of them, with very detailed information.
  • Signature completion: after typing a function call, you get help showing what types its parameters expect.

Known issues and workarounds

“Go to definition” not working sometimes

If “Go to definition” (ctrl+click on the name of a function) doesn’t work on a header file, try opening the source file by pressing Ctrl+o, then go back to the header file by pressing Ctrl+o again and try going to definition again.

Base functions of overridden functions don’t show up when looking for references

Although this is supposed to be a closed issue, I can still reproduce it. For instance, when searching for uses of SourceBufferPrivateGStreamer::enqueueSample(), calls to the parent class method, SourceBufferPrivate::enqueueSample(), get ignored.

This is also a common issue when using Show Call Hierarchy.

Lots of strange errors after a rebase

Clean the cache and reindex the project: close VS Code, rm -rf .cache/clangd/index inside the WebKit checkout directory, then open VS Code again. Remember to open a UnifiedSources file to create a complete index.

validateflow: A new tool to test GStreamer pipelines

It has been a while since GstValidate has been available. GstValidate has made it easier to write integration tests that check that playback and transcoding work as expected while executing actions (like seeking, changing subtitle tracks, etc.), testing at a high level rather than checking exact, fine-grained data flow.

As GStreamer is applied to an ever wider variety of cases, testing often becomes cumbersome for those cases that resemble typical playback less. On one hand there is the C testing framework intended for unit tests, which is admittedly low level: even when using something like GstHarness, checking that an element outputs the correct buffers and events requires a lot of manual coding. On the other hand, gst-validate has so far focused mostly on assets that can be played with a typical playbin, requiring extra effort and coding for the less straightforward cases.

This has historically left many specific test cases in that middle ground without an effective way to be tested. validateflow attempts to fill this gap by allowing gst-validate to test that custom pipelines, acted on in a certain way, produce the expected results.

validateflow itself is a GstValidate plugin that monitors buffers and events flowing through a given pad and records them in a log file. The first time a test is run, this log becomes the expectation log. Further executions of the test still create a new log file, but this time it’s compared against the expectation log, and any difference is reported as an error. The user can rebaseline a test by removing the expectation log file and running it again. This is very similar to how many web browser tests work (e.g. Web Platform Tests).
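
The record-then-compare cycle can be sketched as follows (a simplification of my own, not validateflow’s actual code or log format): on the first run the observed flow becomes the baseline; later runs diff against it.

```python
# Sketch: an expectation-log workflow. The first run writes the
# baseline; subsequent runs return the lines that diverge from it.
import os
import tempfile

def check_flow(observed_lines, expectation_path):
    if not os.path.exists(expectation_path):
        # First run: record the observed flow as the expectation log.
        with open(expectation_path, "w") as f:
            f.write("\n".join(observed_lines))
        return []
    with open(expectation_path) as f:
        expected_lines = f.read().splitlines()
    # Report every (expected, observed) pair that differs.
    return [(e, o) for e, o in zip(expected_lines, observed_lines) if e != o]

path = os.path.join(tempfile.mkdtemp(), "flow-expected.log")
log = ["event stream-start", "event caps", "event segment"]
print(check_flow(log, path))  # []: baseline created on the first run
print(check_flow(["event stream-start", "event eos", "event segment"], path))
# [('event caps', 'event eos')]: the second run diverged
```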

How to get it

validateflow has been landed recently on the development versions of GStreamer. Before 1.16 is released you’ll be able to use it by checking out the latest master branches of GStreamer subprojects, preferably with something like gst-build.

Make sure to update both gst-devtools and gst-integration-testsuites. Update the latter by running the following command, which will update the repo and fetch the media files; otherwise you will get errors.

gst-validate-launcher --sync -L

Writing tests

The usual way to use validateflow is through pipelines.json, a file parsed by the validate test suite (the one run by default by gst-validate-launcher) where all the necessary elements of a validateflow test can be placed together.

For instance:

"qtdemux_change_edit_list":
{
    "pipeline": "appsrc ! qtdemux ! fakesink async=false",
    "config": [
        "%(validateflow)s, pad=fakesink0:sink, record-buffers=false"
    ],
    "scenarios": [
        {
            "name": "default",
            "actions": [
                "description, seek=false, handles-states=false",
                "appsrc-push, target-element-name=appsrc0, file-name=\"%(medias)s/fragments/car-20120827-85.mp4/init.mp4\"",
                "appsrc-push, target-element-name=appsrc0, file-name=\"%(medias)s/fragments/car-20120827-85.mp4/media1.mp4\"",
                "checkpoint, text=\"A moov with a different edit list is now pushed\"",
                "appsrc-push, target-element-name=appsrc0, file-name=\"%(medias)s/fragments/car-20120827-86.mp4/init.mp4\"",
                "appsrc-push, target-element-name=appsrc0, file-name=\"%(medias)s/fragments/car-20120827-86.mp4/media2.mp4\"",
                "stop"
            ]
        }
    ]
},

The elements of these test definitions are:

  • pipeline: A string with the same syntax as gst-launch describing the pipeline to use. Python string interpolation can be used to get the path to the medias directory, where audio and video assets are placed in the gst-integration-testsuites repo, by writing %(medias)s. It can also be used to get a video or audio sink that can be muted, with %(videosink)s or %(audiosink)s.

  • config: A validate configuration file. Among other things that can be set here, validateflow overrides are defined, one per line, with %(validateflow)s, which expands to validateflow plus some options defining where the logs will be written (which depends on the test name). Each override monitors one pad. The settings here define which pad, and what will be recorded.

  • scenarios: Usually a single scenario is provided: a series of actions performed in order on the pipeline. These are normal GstValidate scenarios, but new actions have been added, e.g. for controlling appsrc elements (so that you can push chunks of data in several steps instead of relying on a filesrc pushing a whole file and being done with it).

Running tests

The tests defined in pipelines.json are automatically run by default when running gst-validate-launcher, since they are part of the default test suite.

You can get the list of all the pipelines.json tests like this:

gst-validate-launcher -L | grep launch_pipeline

You can use these test names to run specific tests. The -v flag is useful to see the actions as they are executed. --gdb runs the test inside the GNU debugger.

gst-validate-launcher -v validate.launch_pipeline.qtdemux_change_edit_list.default

In the command line argument above, validate. defines the name of the test suite Python file, testsuites/validate.py. The rest, launch_pipeline.qtdemux_change_edit_list.default, is actually a regex: here . just happens to match a period, but it would match any character (it would be more correct, albeit also more inconvenient, to use \. instead). You can use this feature to run several related tests, for instance:

$ gst-validate-launcher -m 'validate.launch_pipeline\.appsrc_.*'

Setting up GstValidate default tests

[3 / 3]  validate.launch_pipeline.appsrc_preroll_test.single_push: Passed

Statistics:
-----------                                                  

           Total time spent: 0:00:00.369149 seconds

           Passed: 3
           Failed: 0
           ---------
           Total: 3
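The point above about the test name portion being a regex, where an unescaped . matches any character, can be verified with Python's re module:

```python
import re

pattern = "launch_pipeline.qtdemux_change_edit_list.default"
# The unescaped '.' happens to match the literal periods in the test name...
assert re.search(pattern, "launch_pipeline.qtdemux_change_edit_list.default")
# ...but it would equally match any other character in those positions:
assert re.search(pattern, "launch_pipelineXqtdemux_change_edit_listXdefault")
# Escaping as '\.' restricts the match to an actual period:
assert not re.search(r"launch_pipeline\.qtdemux", "launch_pipelineXqtdemux")
```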

Expectation files are stored in a directory named flow-expectations, e.g.:

~/gst-validate/gst-integration-testsuites/flow-expectations/qtdemux_change_edit_list/log-fakesink0:sink-expected

The actual output log (which is compared to the expectations) is stored as a log file, e.g.:

~/gst-validate/logs/qtdemux_change_edit_list/log-fakesink0:sink-actual

Here is how a validateflow log looks.

event stream-start: GstEventStreamStart, flags=(GstStreamFlags)GST_STREAM_FLAG_NONE, group-id=(uint)1;
event caps: video/x-h264, stream-format=(string)avc, alignment=(string)au, level=(string)2.1, profile=(string)main, codec_data=(buffer)014d4015ffe10016674d4015d901b1fe4e1000003e90000bb800f162e48001000468eb8f20, width=(int)426, height=(int)240, pixel-aspect-ratio=(fraction)1/1;
event segment: format=TIME, start=0:00:00.000000000, offset=0:00:00.000000000, stop=none, time=0:00:00.000000000, base=0:00:00.000000000, position=0:00:00.000000000
event tag: GstTagList-stream, taglist=(taglist)"taglist\,\ video-codec\=\(string\)\"H.264\\\ /\\\ AVC\"\;";
event tag: GstTagList-global, taglist=(taglist)"taglist\,\ datetime\=\(datetime\)2012-08-27T01:00:50Z\,\ container-format\=\(string\)\"ISO\\\ fMP4\"\;";
event tag: GstTagList-stream, taglist=(taglist)"taglist\,\ video-codec\=\(string\)\"H.264\\\ /\\\ AVC\"\;";
event caps: video/x-h264, stream-format=(string)avc, alignment=(string)au, level=(string)2.1, profile=(string)main, codec_data=(buffer)014d4015ffe10016674d4015d901b1fe4e1000003e90000bb800f162e48001000468eb8f20, width=(int)426, height=(int)240, pixel-aspect-ratio=(fraction)1/1, framerate=(fraction)24000/1001;

CHECKPOINT: A moov with a different edit list is now pushed

event caps: video/x-h264, stream-format=(string)avc, alignment=(string)au, level=(string)3, profile=(string)main, codec_data=(buffer)014d401effe10016674d401ee8805017fcb0800001f480005dc0078b168901000468ebaf20, width=(int)640, height=(int)360, pixel-aspect-ratio=(fraction)1/1;
event segment: format=TIME, start=0:00:00.041711111, offset=0:00:00.000000000, stop=none, time=0:00:00.000000000, base=0:00:00.000000000, position=0:00:00.041711111
event tag: GstTagList-stream, taglist=(taglist)"taglist\,\ video-codec\=\(string\)\"H.264\\\ /\\\ AVC\"\;";
event tag: GstTagList-stream, taglist=(taglist)"taglist\,\ video-codec\=\(string\)\"H.264\\\ /\\\ AVC\"\;";
event caps: video/x-h264, stream-format=(string)avc, alignment=(string)au, level=(string)3, profile=(string)main, codec_data=(buffer)014d401effe10016674d401ee8805017fcb0800001f480005dc0078b168901000468ebaf20, width=(int)640, height=(int)360, pixel-aspect-ratio=(fraction)1/1, framerate=(fraction)24000/1001;

Prerolling and appsrc

By default, scenarios don't start executing actions until the pipeline is playing. Also by default, sinks require a preroll for that to occur (that is, a buffer must reach the sink before the state transition to playing is completed).

This poses a problem for scenarios using appsrc: no action will be executed until a buffer reaches the sink, but a buffer can only be pushed as the result of an appsrc-push action, creating a chicken-and-egg problem.

For many cases that don’t require playback we can solve this simply by disabling prerolling altogether, setting async=false in the sinks.

For cases where prerolling is desired (like playback), handles-states=true should be set in the scenario description. This makes the scenario actions run without having to wait for a state change. appsrc-push will notice the pipeline is in a state where buffers can't flow yet and will enqueue the buffer without waiting, so that the next action can run immediately. Then the set-state action can be used to set the state of the pipeline to playing, which will let the appsrc emit the buffer.

description, seek=false, handles-states=true
appsrc-push, target-element-name=appsrc0, file-name="raw_h264.0.mp4"
set-state, state=playing
appsrc-eos, target-element-name=appsrc0
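Putting the pieces together, a hypothetical pipelines.json entry using this prerolling scenario could look like the following. Note the test name, pad name and file name here are illustrative, not taken from the actual test suite:

```json
"appsrc_preroll_example":
{
    "pipeline": "appsrc ! qtdemux ! %(videosink)s",
    "config": [
        "%(validateflow)s, pad=fakevideosink0:sink, record-buffers=false"
    ],
    "scenarios": [
        {
            "name": "single_push",
            "actions": [
                "description, seek=false, handles-states=true",
                "appsrc-push, target-element-name=appsrc0, file-name=\"raw_h264.0.mp4\"",
                "set-state, state=playing",
                "appsrc-eos, target-element-name=appsrc0"
            ]
        }
    ]
},
```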

Documentation

The documentation of validateflow, explaining its usage in more detail, can be found here:

https://gstreamer.freedesktop.org/documentation/gst-devtools-1.0/plugins/validateflow.html