Herostratus’ legacy

words from a lazy coder

The SSU nightmare (2)

Do you remember the third party policy nightmare [1]? Well, the SSU saga continues. As some of you may know, in the N900 the root file system is stored in a OneNAND chip with 256M of space. Meanwhile, /home and /home/user/MyDocs live on an eMMC in two different partitions: ~2GB (ext2) for /home and ~29GB (vfat) for /home/user/MyDocs. This new layout has brought new limitations. The most visible one is the /opt problem [2]. But others, more subtle, have arisen, especially in HAM.

The SSU is the operation that upgrades the system in the device without reflashing, and it is the preferred way to keep the device updated. Release 1.0 of Maemo 5 is already in the wild, and has been thoroughly tested and promptly fixed, so release 1.1 has a huge change set. And the SSU failed miserably in the early testing: the root file system ran out of space before the end of the process.

A first analysis exposed a regression in HAM (it was not closing the running applications before installing the SSU), and a problem with the Gtk+ icon cache [3]: as it is a mapped file, when a package updates the icon cache, the rest of the running processes keep a whole copy of the previous file. And closing the running user applications did not fix the issue, because there are many other system processes linked with the toolkit library. The Gtk guys cunningly proposed a fix using the triggers in dpkg [4].
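To make the mapped-file problem concrete, here is a minimal sketch, assuming the cache is updated by writing a new file and renaming it over the old one, and assuming a Linux-like file system. The file name and the function demo_stale_mapping are made up for illustration; this is not the Gtk+ code.

```python
import mmap
import os
import tempfile


def demo_stale_mapping():
    """Return (bytes seen through the old mapping, bytes now at the path)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name, standing in for the icon cache.
        cache = os.path.join(tmp, "icon-theme.cache")
        with open(cache, "wb") as f:
            f.write(b"old cache")

        # A running process maps the cache read-only.
        with open(cache, "rb") as f:
            mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        # An upgrade writes a new cache and renames it into place.
        new_cache = cache + ".new"
        with open(new_cache, "wb") as f:
            f.write(b"new cache")
        os.replace(new_cache, cache)

        # The mapping still refers to the old, now unlinked, file.
        seen_by_process = mapping[:]
        with open(cache, "rb") as f:
            on_disk = f.read()
        mapping.close()
        return seen_by_process, on_disk
```

As long as the mapping stays open, the blocks of the old file cannot be reclaimed, so every update of the cache costs the space of both copies until the last process lets go.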

Nevertheless, only unstable apt uses the trigger delaying strategy [5]. So we extracted the related commits and cherry-picked them into Maemo’s apt [6]. I must say that I’m still crossing my fingers, begging for the stability of those commits.

But the problem remained: the space in the root file system got exhausted. Other mapped files were noticed. “Kill’em all!”, they shouted, referring to all the processes running while the SSU is performed. At this moment all the dirty ad-hoc hacks started to get integrated into the source code of HAM (shame on me): first a list of services to stop (camera-ui, browserd, hildon-desktop, among others), and then a list of processes to kill (yes, signal 9) [8].

Yet the free space got exhausted. “HAM does not calculate the free space correctly!”, they complained. There was a remarkable difference between what df showed and the number of bytes that HAM recognized as free. We reviewed the source code of df in busybox [9], finding that it uses the f_bavail value in the statfs structure, whereas HAM had used f_bfree since Diablo. So in the spirit of prevention we changed it to f_bavail.
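The difference between the two fields is easy to see from a script. This is only a sketch: it uses Python's os.statvfs, which wraps statvfs rather than the statfs call mentioned above, and the function name free_space_report is made up. f_bfree counts all free blocks, including those reserved for the superuser, while f_bavail counts only the blocks available to unprivileged processes.

```python
import os


def free_space_report(path="/"):
    """Return the free space in bytes according to f_bfree and f_bavail."""
    st = os.statvfs(path)
    # In statvfs, block counts are expressed in units of f_frsize.
    return {
        "f_bfree_bytes": st.f_bfree * st.f_frsize,    # includes reserved blocks
        "f_bavail_bytes": st.f_bavail * st.f_frsize,  # excludes reserved blocks
    }


if __name__ == "__main__":
    report = free_space_report("/")
    for name, value in report.items():
        print(f"{name}: {value}")
```

On a file system with reserved blocks the two numbers differ, and the f_bavail figure is the more conservative one.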

Still, the required free space was more than estimated. “Use the rescue mode as the last resort”, they demanded. So we added a new ad-hoc code path: if the SSU fails because of space exhaustion, reboot and get into the rescue mode. This ensures that no process is running. But it is not nice to use a rescue mode for a normal operation. The upgrade must be as smooth as possible.

A further analysis showed that the packages install their documentation, and only a posteriori does docpurge [10] delete it. We should avoid writing the documentation to the root file system. “Do a bind-mounting to another partition, dumping all the documentation there”, they said, and more ad-hoc hacks were integrated. At least Marius brought some sanity and proposed a patch to dpkg to filter out some directories [11] before they get written to disk. Sadly, his change did not make it into release 1.1.
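The idea of filtering at unpack time can be sketched in a few lines. This is not Marius' patch nor dpkg's code, just an illustration of the principle; the function filter_paths and the excluded directory used in the example are assumptions.

```python
def filter_paths(paths, excluded_dirs):
    """Drop every path that lives under one of the excluded directories."""
    # Normalize to "dir/" so that /usr/share/doc does not match /usr/share/docs.
    prefixes = tuple(d.rstrip("/") + "/" for d in excluded_dirs)
    return [p for p in paths if not p.startswith(prefixes)]


if __name__ == "__main__":
    package_files = [
        "/usr/bin/foo",
        "/usr/share/doc/foo/README",  # hypothetical documentation path
    ]
    print(filter_paths(package_files, ["/usr/share/doc"]))
```

Skipping the files before they are written means the space is never consumed, instead of being consumed and freed later.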

In the end, all this madness and these hackish solutions seem to fix the SSU for now. But we shall clean all this mess out of HAM.