Debugging early boot issues on ARM64 (without console!)
In the last blog post, about booting the upstream kernel on Inforce 6640, I mentioned that there was an interesting issue with ftrace that led to a form of debug-by-rebooting approach, and that is worth a blog post. Of course it only took me 1 year+ to write it (!) – so apologies for this huge delay, and let’s debug some early boot failures!
First of all, a reminder of what the issue was: while trying to boot the upstream kernel on the Inforce 6640 board, I found that for some reason ftrace doesn’t work in such a setup, starting around kernel 5.7. The bisect wasn’t precise, and even talking with the ftrace maintainer on IRC didn’t provide many clues (since more debugging was required).
In order to be accurate and provide up-to-date results, I re-tested this with kernel v6.2-rc7 and…the same issue is observed: the kernel just can’t boot with ftrace, and not a single hint shows up on the serial console!
A parenthesis here: kernel 6.2-rc7 does boot on the Inforce 6640, but with no graphics (likely due to some changes in the msm driver), with some SCSI timeouts at boot time (sd 0:0:0:0: [sda] tag#13 timing out command, waited 120s) and…no USB(!), unless the “old” device-tree (from v5.16) is used. I think it’s related to some dwc3/usb changes merged into the qcom device-trees. But anyway, this might be a topic for a whole new blog post, so let’s get back to debugging early boot issues.
Once we face a completely empty serial output after the usual bootloader messages, a basic question comes up: has the kernel even booted? In other words, more technically meaningful: did the bootloader properly load the kernel and jump to the code? Did the kernel code effectively start execution?
The first idea is always “let’s print something”, but how, if I don’t know where I’d put the print statement? Not to mention…that super early code has nothing initialized, so even a printk could be “too much” for such code. Hence, the only alternative seemed to me…to reboot the machine from kernel code! With that, at least I’d validate that kernel code was running. And thinking about that…I even extended this logic: if I can reboot the kernel, I could kinda bisect the code to determine where the failure is. One question remains though – how to reboot a kernel so early?
Rebooting/shutdown is definitely a non-trivial task. By checking the kernel code, one can see it’s full of callbacks to the architecture code; naturally, it’s a platform/arch low-level process. Checking machine_restart() under arch/arm64, we can see it goes through EFI calls (if EFI is supported). Noticing I was walking an unknown and potentially daunting path, I decided to first seek help on IRC, and that brought me gold: Marc Zyngier (the KVM/arm64 maintainer) introduced me to PSCI (Power State Coordination Interface). If my device’s firmware supports this specification (and the Inforce 6640’s does!), I could issue a PSCI reset SMC (Secure Monitor Call) to get a board reset. In other words, with a few assembly instructions I could perhaps reboot the kernel! Marc even gave me information about the register I should write, and after some tinkering (and more code study), I came up with this function:
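A minimal sketch of such a function, reconstructed under assumptions rather than copied from the original hack. It assumes the firmware uses SMC (not HVC) as the PSCI conduit and that the function ID is passed in x0, as the Arm SMC Calling Convention describes. The macro names and the helper psci_system_reset_id() are made up for illustration; only the name arm64_smc_reset() comes from this post.

```c
#include <assert.h>
#include <stdint.h>

/*
 * SMCCC function ID layout (32-bit calling convention):
 *   bit 31      1 = fast call
 *   bit 30      0 = SMC32 convention
 *   bits 29:24  owning entity, 4 = standard secure service (covers PSCI)
 *   bits 15:0   function number, 9 = PSCI SYSTEM_RESET
 * Macro names below are illustrative, not the kernel's own.
 */
#define SMCCC_FAST_CALL    (1u << 31)
#define SMCCC_OWNER_STD    (4u << 24)
#define PSCI_FN(n)         (SMCCC_FAST_CALL | SMCCC_OWNER_STD | (uint32_t)(n))
#define PSCI_SYSTEM_RESET  PSCI_FN(9)

/* Compile-time check that the fields compose to the well-known ID. */
static_assert(PSCI_SYSTEM_RESET == 0x84000009u,
              "PSCI SYSTEM_RESET function ID mismatch");

/* Hypothetical helper so the value can be inspected on any host. */
static inline uint32_t psci_system_reset_id(void)
{
    return PSCI_SYSTEM_RESET;
}

#if defined(__aarch64__)
/*
 * Only meaningful in kernel context (EL1): executing SMC from user
 * space traps. On success the call does not return; the board resets.
 * Assumption: SMC conduit. Firmware using HVC would need "hvc #0".
 */
static inline void arm64_smc_reset(void)
{
    register uint64_t x0 __asm__("x0") = PSCI_SYSTEM_RESET;

    __asm__ volatile("smc #0"
                     : "+r"(x0)
                     :
                     : "x1", "x2", "x3", "memory");
}
#endif
```

The appeal of this approach for early debugging is that it needs no initialized kernel infrastructure at all: one register load and one instruction.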
This was first tested as an alternative implementation of the sysrq-b reset handler, and that worked like a charm! Now, where to plug such code into the kernel’s early boot path? The entry point seemed to make sense, so I tried this hack on head.S:
And voilà! With that, the kernel was bootlooping…meaning the board reached kernel code, so the first question was answered and the debugging could proceed! I decided to take a lucky step and jump directly to start_kernel(), which is C code and way easier to play with, without fear of causing another issue while debugging! And the lucky step paid off: the kernel was indeed executing that function. So, through a set of attempts moving the arm64_smc_reset() call above around and annotating the results, I got the following “bisect” from the code:
It was a bit ugly to see the diff above in code, hence I’ve added it as an image. Basically, it tells us the issue likely happens in setup_machine_fdt(), and by digging more (with more “bisect” reboots), I found the issue happens in fdt_check_header(). As a next step I really wanted to print some values from the guilty function, but the earlycon kernel command-line option didn’t help: it turned out it does not kick in early enough.
But what if I could write a very simple, really early console, or take an even easier approach: copy/port the earlycon code for this board to a slightly “earlier” point in the code? And that’s what I tried next: by inspecting (and instrumenting) the msm_serial driver, I came up with the following alternative (a big code snippet here would be just terrible to read, so I uploaded the patch instead): https://people.igalia.com/gpiccoli/arm64-really-early-console-msm-qcom.patch
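The reboot-driven “bisect” described above can be modeled on a host machine. The sketch below is a simulation, not kernel code: boot_reaches_probe() is a hypothetical stand-in for one real boot attempt (place the reset call after a given step, then watch whether the board bootloops), and using a strict binary search is an assumption, since the real process was more of a manual back-and-forth.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of one boot attempt. The reset probe sits right after step
 * `probe_after`; the board bootloops (returns true) only if every step
 * up to and including that one completed. `first_bad` is the index of
 * the step that hangs, which is what we are trying to discover.
 */
static bool boot_reaches_probe(size_t probe_after, size_t first_bad)
{
    return probe_after < first_bad;
}

/*
 * Find the first step the kernel never gets past, using one simulated
 * boot per iteration. Requires that some step in [0, nsteps) hangs.
 */
static size_t find_hanging_step(size_t nsteps, size_t first_bad)
{
    size_t lo = 0;
    size_t hi = nsteps - 1;

    assert(nsteps > 0 && first_bad < nsteps);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (boot_reaches_probe(mid, first_bad))
            lo = mid + 1;   /* rebooted: steps 0..mid are fine */
        else
            hi = mid;       /* silence: the hang is at or before mid */
    }
    return lo;
}
```

Calling find_hanging_step(8, 5) narrows eight candidate steps down to index 5 in three simulated boots, which is why even a slow reboot cycle is tolerable as a debugging tool.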
Notice the code is quite hacky: it ports the __msm_console_write() function, with a pre-configured MMIO address for the serial port, to early arch code. But despite the code being merely a “mock-up”, using it with sprintf() (for formatting) allowed me to print stuff, which was quite nice! Unfortunately I didn’t have time to debug the ftrace issue further on this board. I guess the best path forward would be to first submit a proper ifc6640 device-tree, and then continue with the debugging. I think it’s likely related to the bootloader version and kernel/DT offsets, so it’s not an easy debug and it’d definitely be time-consuming.
But I hope at least the PSCI reset trick, and maybe this really early serial console prototype/idea, will be useful to somebody in the world of ARM board debugging! Thanks for reading and see you in the next blog post (which I expect will take waaay less than 1 year heheh).
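The “format into a buffer, then push bytes at the UART” idea behind the really early console can be sketched as below. This is a host-runnable model, not the linked patch: early_uart_putc() is a hypothetical stand-in that appends to a capture buffer, whereas real code would poll the transmitter and store to the serial port’s MMIO register. All function names here are assumptions, and no MSM register layout is implied.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CAPTURE_SIZE 256

/* Host-side stand-in for the wire: everything "transmitted" lands here. */
static char capture[CAPTURE_SIZE];
static size_t capture_len;

/* In real early code this would be a polled MMIO write to the UART. */
static void early_uart_putc(char c)
{
    if (capture_len < CAPTURE_SIZE - 1)
        capture[capture_len++] = c;
    capture[capture_len] = '\0';
}

/* Write n bytes, expanding "\n" to "\r\n" as serial terminals expect. */
static void early_write(const char *s, size_t n)
{
    assert(s != NULL);
    for (size_t i = 0; i < n; i++) {
        if (s[i] == '\n')
            early_uart_putc('\r');
        early_uart_putc(s[i]);
    }
}

static void early_puts(const char *s)
{
    early_write(s, strlen(s));
}

/* Format into a small stack buffer, then hand the bytes to early_write(). */
static void early_printf(const char *fmt, ...)
{
    char buf[128];
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    if (len < 0)
        return;
    if ((size_t)len >= sizeof(buf))
        len = (int)sizeof(buf) - 1;   /* output was truncated */
    early_write(buf, (size_t)len);
}
```

For instance, early_printf("magic=0x%08x\n", 0xd00dfeedu) would emit the flattened device-tree magic value, the kind of thing worth dumping when fdt_check_header() is the suspect.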
I enjoy the low-level world, like kernel, firmware, virtualization and all sorts of HW/SW interactions. Free software is both part of my work and personal beliefs - I’m an enthusiast of Linux overall!