So, I got a packed Debian image with 4.14 kernel and after some looking, noticed that the kernel was a modified / custom version from Linaro, which they call Qualcomm LT (Landing Tree). Hence, the first idea was to boot this image and see how well it worked. And unsurprisingly, it was working flawlessly – given this was provided by the board vendor, the expectation was that good amount of validation was performed, so we had a starting point here – the working version was the Linaro’s Qualcomm-LT 4.14. In a rush of optimism, I tried to test the latest released mainline kernel at the moment (v5.14) and…it didn’t work! Why would it, right? Where would be the fun? So, next step was to bisect it and I decided to start by using the Linaro’s LT tree – it failed on 5.7+, but succeeded to boot 5.4, although the board seemed pretty slow! First thought was likely a problem in the storage, like a bad device tree config for UFS (the storage device) or a driver change between 4.14 and 5.4. But before looking into that, we needed to understand why kernels 5.4+ weren’t booting…so how to proceed here, if ssh wasn’t working? Yeah…exactly: serial output to the rescue! So I got some jumper cables and decided to use a Raspberry Pi as the other end for serial connection. By connecting all the proper cables and using “minicom”, I could boot a 5.7 kernel and check the progress..and guess what? Kernel was booting normally!!!
But without PCIe – no network / GPU, for example. Also, it was still very slow. And then, an epiphany: it was the device-tree! By dumping the device-tree (from procfs) in both the working 4.14 and the non-working 5.7, a bunch of stuff were missing, like all the PCI structures. But why is that? Somebody just removed stuff from a working device-tree? Nah, it was just an obvious thing I didn’t realize before: up to kernel 5.6, the Inforce 6640 board was relying on the their “cousin’s” device-tree, the Dragonboard 820c! A pretty similar board, so relying on its DT made the board working like a charm, but in kernel 5.6 a proper device-tree for Inforce 6640 was added, although it’s very simple and minimal to boot the board. In order to have a fully working system after that finding, all we needed was to force the usage of the “old” db820c.dts, with the following hack patch:
OK, so now we have a working upstream kernel (with network!), so let’s investigate the remaining: why the board is so slow? And what about graphics, is it working? I started with the graphics and again, had some issues: board had graphics working on Linaro’s LT tree (v5.13), but not in the upstream 5.13/5.14. Another round of bisects later plus code inspection, and I found the “culprit”: a downstream patch that was also required in order to make the GPU work. The patch added a regulator with “always-on” property and a higher voltage than found in upstream. So, changed that and voilá: we had now a working 5.14 with graphics but…still slow! Moving then to the next journey: why the newer kernels are so much slower? In order to investigate that, I began by comparing the db-820c device-trees among versions, and despite there were differences, nothing rang a bell. So, jumped to sysfs and started investigating there…when it comes to my attention that the CPUs were running slow, very slow – CPUs frequency were tiny . Taking a look in the commits and checking cpufreq drivers, the troublemaker was found: after kernel 4.18, a patch added a driver for handling OPP voltage scaling for Kryo CPUs, which is the case for Snapdragon™ 820. After loading this new module, CPU clocks were back to normal/fast rates, the board was as fast (or even more) than running the original Linaro’s LT 4.14 downstream kernel. But… (yeah, another but)…after some simple CPU benchmarks, just some seconds after the beginning of the tests, board rebooted into a firmware error state. It was like as if it couldn’t handle all the CPU speed in the newer kernels. Here, I used the very powerful ftrace tooling to determine the reason of this behavior. And guess what was found? Cooling! The responsible for that was kernel lack of throttling. The following trace snippet shows how the governor would reduce the CPU clock/voltage in order to preserve CPUs health:
--- a/arch/arm64/boot/dts/qcom/Makefile +++ b/arch/arm64/boot/dts/qcom/Makefile dtb-$(CONFIG_ARCH_QCOM) += apq8096-db820c.dtb -dtb-$(CONFIG_ARCH_QCOM) += apq8096-ifc6640.dtb +#dtb-$(CONFIG_ARCH_QCOM) += apq8096-ifc6640.dtb dtb-$(CONFIG_ARCH_QCOM) += ipq6018-cp01-c1.dtb
The key here is the function
=> cpufreq_governor_limits.part.6 => cpufreq_set_policy => cpufreq_update_policy => cpufreq_set_cur_state => thermal_cdev_update => step_wise_throttle => handle_thermal_trip => thermal_zone_device_update.part.10 => thermal_zone_device_check => process_one_work => worker_thread
step_wise_throttlewhich downscale the CPU frequency to prevent from overheating. In order to have it properly working, cooling maps/trip points are required in the device-tree, it’s present in some downstream patches on Linaro’s kernel – but when that work was upstream’ed, it lacked some portions, hence if we have cpufreq driver and high frequencies, we’re prone to overheating / firmware reboots. The patch that fixes that was re-submitted by another person and I tried to get that accepted, by “lobbying” in the mailing-list and showing our test results, but up to 5.16-rc6 it isn’t present in Linus tree. Finally, worth to mention 2 things: for some reason, building a kernel with ftrace after v5.7 prevents it from booting. This led to a very interesting debug, like debug-by-rebooting (using a firmware triggered reboot in assembly), (re-)implementing a very early console to show output much before “earlycon” is available – stay tuned for a blog post about that! Second, we still lack that “regulator-always-on” necessary to have a functional GPU – after submitting a patch for that, I’ve discussed that with the maintainer, and it seems we should have the user of such regulator to request it properly – due to other tasks I couldn’t continue this work. So, in summary, if you want to boot an upstream kernel (5.14+) on Inforce 6640, the steps are: