{"id":61,"date":"2023-02-14T19:04:33","date_gmt":"2023-02-14T22:04:33","guid":{"rendered":"https:\/\/blogs.igalia.com\/gpiccoli\/?p=61"},"modified":"2023-02-21T17:11:55","modified_gmt":"2023-02-21T20:11:55","slug":"debugging-early-boot-issues-on-arm64-without-console","status":"publish","type":"post","link":"https:\/\/blogs.igalia.com\/gpiccoli\/2023\/02\/debugging-early-boot-issues-on-arm64-without-console\/","title":{"rendered":"Debugging early boot issues on ARM64 (without console!)"},"content":{"rendered":"\nIn the last blog post, about <a href=\"https:\/\/blogs.igalia.com\/gpiccoli\/2021\/12\/booting-upstream-kernel-on-inforce-6640\/\">booting the upstream kernel on Inforce 6640<\/a>, I mentioned that there was an interesting issue with ftrace that led to a form of debug-by-rebooting approach, and that is worth a blog post. Of course it only took me 1 year+ to write it (!) &#8211; so apologies for this huge delay, and let&#8217;s debug some early boot failures!\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nFirst of all, a reminder of what was the issue: in an attempt to boot the upstream kernel on the Inforce 6640 board, I found that for some reason ftrace doesn&#8217;t work in such a setup, starting around kernel 5.7 &#8211; bisect wasn&#8217;t precise, even talking with the ftrace maintainer on IRC didn&#8217;t provide many clues (since more debugging was required).\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nIn order to be accurate and provide up-to-date results, I re-tested this with kernel v6.2-rc7 and&#8230;same issue is observed, the kernel just can&#8217;t boot with ftrace, and not a single hint is provided in serial console!\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nA parenthesis here: kernel 6.2-rc7 does boot on inforce 6640, but with no graphics (likely some changes in the <code>msm<\/code> driver), having some SCSI timeouts on boot time (<code>sd 0:0:0:0: [sda] tag#13 timing out command, waited 120s<\/code>) and&#8230;no USB(!), unless the &#8220;old&#8221; device-tree (from v5.16) is used &#8211; I think it&#8217;s related with some <code>dwc3\/usb<\/code> changes merged on qcom device-trees. But anyway, this might be a topic for a whole new blog post, so let&#8217;s get back to debugging early boot issues.\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nOnce we face a completely empty serial output after the usual bootloader messages, a basic question comes up: has the kernel even booted? In other words, more technically meaningful: did the bootloader properly load the kernel and jump to the code? Did the kernel code effectively start execution?\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nThe first idea is always &#8220;let&#8217;s print something&#8221;, but how, if I don&#8217;t know where I&#8217;d put the print statement? Not to mention&#8230;that super early code has nothing initialized, so even a <code>printk<\/code> could be &#8220;too much&#8221; for such code. Hence, the only alternative seemed to me&#8230;to reboot the machine from kernel code! With that, at least I&#8217;d validate that kernel code was running. And thinking about that&#8230;I even extended this logic: if I can reboot the kernel, I could kinda bisect the code to determine where the failure is. One question remains though &#8211; how to reboot a kernel so early?\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nRebooting\/shutdown is definitely a non-trivial task. By checking the kernel code (<a href=\"https:\/\/git.kernel.org\/pub\/scm\/linux\/kernel\/git\/torvalds\/linux.git\/tree\/kernel\/reboot.c?h=v6.2-rc7#n265\">here<\/a>, for example), one can see it&#8217;s full of callbacks to the architecture code; naturally, it&#8217;s a platform\/arch low-level process. Checking <code>machine_restart()<\/code> under <code>arch\/arm64<\/code>, we can see it goes through EFI calls (if EFI is supported). Noticing I was walking an unknown and potentially daunting path, the decision was to first seek help on IRC, and that brought me gold: Marc Zyngier (the ARM64 maintainer) presented me <a href=\"https:\/\/developer.arm.com\/Architectures\/Power%20State%20Coordination%20Interface\">PSCI<\/a> (Power State Coordination Interface) &#8211; if my device&#8217;s firmware supports this specification (and Inforce 6640 does!), I could issue a PSCI reset SMC (Secure Monitor Call) to get a board reset. In other words, with some assembly instructions I could perhaps reboot the kernel! Marc even provided me information about the register I should write, and after some tinkering (and more code study), I came up with this function:\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\n\n<pre class=\"wp-block-code\"><code>static inline void arm64_smc_reset(void)\n{\n    asm(\"mov x0, #0x9\");\n    asm(\"mov x3, #0x0\");\n    asm(\"mov x2, #0x0\");\n    asm(\"mov x1, #0x0\");\n    asm(\"movk x0, #0x8400, lsl #16\");\n    asm(\"smc #0\");\n}<\/code><\/pre>\n\n\n\n\nThis was first tested as an alternative way for the <a href=\"https:\/\/git.kernel.org\/pub\/scm\/linux\/kernel\/git\/torvalds\/linux.git\/tree\/drivers\/tty\/sysrq.c?h=v6.2-rc7#n164\">sysrq-b<\/a> reset handler, and that worked like a charm! Now, where to plug such code in kernel early  path? Entry point seems to make sense, so I tried this hack on <code>head.S<\/code>:\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\n\n<pre class=\"wp-block-code\"><code>--- a\/arch\/arm64\/kernel\/head.S\n+++ b\/arch\/arm64\/kernel\/head.S\n@@ -87,6 +87,12 @@\n         *\/\n SYM_CODE_START(primary_entry)\n        bl      preserve_boot_args\n+       mov x0, #0x9\n+       mov x3, #0x0\n+       mov x2, #0x0\n+       mov x1, #0x0\n+       movk x0, #0x8400, lsl #16\n+       smc #0\n        bl      init_kernel_el    \/\/ w0=cpu_boot_mode\n        mov     x20, x0\n        bl      create_idmap<\/code><\/pre>\n\n\n\n\nAnd voil\u00e1! With that, the kernel was bootlooping&#8230;meaning the board reached kernel code, so the first question was answered and debug could proceed! I decided to take a lucky step and jump directly to <code>start_kernel()<\/code>, which is C code and way easier to play with no fears of causing another issue while debugging! And the lucky step paid off: kernel was indeed executing that function. So, through a set of attempts using the <code>arm64_smc_reset()<\/code> above and commenting the results, I got the following &#8220;bisect&#8221; from the code:\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blogs.igalia.com\/gpiccoli\/files\/2023\/02\/bisect_comments.png\" alt=\"\" class=\"wp-image-64\" width=\"762\" height=\"635\" srcset=\"https:\/\/blogs.igalia.com\/gpiccoli\/files\/2023\/02\/bisect_comments.png 913w, https:\/\/blogs.igalia.com\/gpiccoli\/files\/2023\/02\/bisect_comments-580x483.png 580w, https:\/\/blogs.igalia.com\/gpiccoli\/files\/2023\/02\/bisect_comments-768x640.png 768w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><\/figure>\n\n\n\nIt was a bit ugly to see the diff above in code, hence I&#8217;ve added as an image. Basically, it tells us the issue likely happens in <code>setup_machine_fdt()<\/code>, and by digging more (with more &#8220;bisect&#8221; reboots), I found the issue happens in <code>fdt_check_header()<\/code>. Now, as a next step I really wanted to print some values from the guilty function, but using the kernel command-line <code>earlycon<\/code> didn&#8217;t help &#8211; it turned out this is not early enough.\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nBut what if I could write a very simple really early console, or even an easier approach: copy\/port the earlycon code for this board to a bit &#8220;earlier&#8221; point in the code? And that&#8217;s what I tried next: by inspecting (and instrumenting) the <code>msm_serial<\/code> driver, I came up with this following alternative (a big code snippet here would be just terrible to read, so uploaded the patch instead): <a href=\"https:\/\/people.igalia.com\/gpiccoli\/arm64-really-early-console-msm-qcom.patch\">https:\/\/people.igalia.com\/gpiccoli\/arm64-<\/a><em><a href=\"https:\/\/people.igalia.com\/gpiccoli\/arm64-really-early-console-msm-qcom.patch\">really<\/a><\/em><a href=\"https:\/\/people.igalia.com\/gpiccoli\/arm64-really-early-console-msm-qcom.patch\">-early-console-msm-qcom.patch<\/a>\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nNotice the code is quite hacky &#8211; it ports the <code>__msm_console_write()<\/code> function with a pre-configured MMIO address for the serial port to early arch code. But despite the code being merely a &#8220;mock-up&#8221;, using it with <code>sprintf()<\/code> (for formatting) allowed me to print stuff, which was quite nice! Unfortunately I didn&#8217;t have time to debug the ftrace issue more in this board &#8211; I guess the best path forward would be first to submit a proper <code>ifc6640<\/code> device-tree, and then follow with the debug. I think it&#8217;s likely related to the bootloader version and kernel\/DT offsets, so that&#8217;s not an easy debug and definitely it&#8217;d be time-consuming.\n\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n\nBut I hope at least the PSCI reset trick, and maybe this really early serial console prototype\/idea could be useful to somebody in the world of ARM boards debugging! Thanks for reading and see you in the next blog post (which I expect takes waaay less than 1 year heheh).\n","protected":false},"excerpt":{"rendered":"<p>In the last blog post, about booting the upstream kernel on Inforce 6640, I mentioned that there was an interesting issue with ftrace that led to a form of debug-by-rebooting approach, and that is worth a blog post. Of course it only took me 1 year+ to write it (!) &#8211; so apologies for this &hellip; <a href=\"https:\/\/blogs.igalia.com\/gpiccoli\/2023\/02\/debugging-early-boot-issues-on-arm64-without-console\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Debugging early boot issues on ARM64 (without console!)&#8221;<\/span><\/a><\/p>\n","protected":false},"author":67,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-61","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/posts\/61","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/users\/67"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/comments?post=61"}],"version-history":[{"count":26,"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/posts\/61\/revisions"}],"predecessor-version":[{"id":89,"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/posts\/61\/revisions\/89"}],"wp:attachment":[{"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/media?parent=61"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/categories?post=61"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.igalia.com\/gpiccoli\/wp-json\/wp\/v2\/tags?post=61"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}