27
Feb 13

Optimizing Qt for MIPS: QString::fromLatin1

During the last months at Igalia we have been working on optimizing open source components for the MIPS architecture, using the SIMD instructions supported by the latest MIPS 74K cores. More specifically, I have been working on the Qt framework.

One of the most used parts of Qt are the basic utilities in the QtCore module, being QString a basic building block that all applications are expected to use heavily, so making it faster would be a gain for any application. One of the functions I first worked on is QString::fromLatin1: it converts an array of characters in ISO–8859–1 encoding (also known as “Latin 1”) and returns a QString object, which internally stores text data in UTF-16.

Let’s take a look at the generic version of QString::fromLatin1. Taking apart the surrounding code, the loop contains a simple “pull-up” operation: each input byte is converted to an unsigned 16-bit integer, padding the high 8 bits with zeroes (the method works for ISO–8859–1, do not try this for other input encodings):

// "dst": array of 16-bit integers
// "src": array of  8-bit integers
 
while (len--)
   *dst++ = (uchar) *src++;

Almost every compiler out there —including GCC— will implement this doing 1-byte loads into a register, zeroing the upper part of the register, and then doing a halfword (16-bit) store. Essentially:

; a0: "dst" array
; a1: "src" array
; a2: "len"
 
1:  lbu    t1, 0 (a1)
    addiu  a2, a2, -1
    sh     t1, 0 (a0)
    addiu  a0, a0, 2
    bnez   a2, 1b
     addiu a1, a1, 1

(Note how the branch delay slot is taken advantage of to increment one of the pointers; some compilers may just insert a nop instead.)

The most important thing to do is unrolling the loop, but some care is needed: as Thiago Macieira mentions in one of his articles, in average most of the strings in Qt have a length of 17 characters. This means that doing a long unrolling of the loop could be worse because of the calculations needed to determine how many times the unrolled loop is executed and the handling of the remaining items. The first thing to do was making two versions of the algorithm: one with the loop unrolled 4 times, the other with it unrolled 8 times.

As next step, the usual optimization is changing the memory access so as load/store operations are aligned, and handle full words (instead of bytes and halfwords). Loading a word means that the loop is implicitly unrolled 4 times, and some bit-juggling needs to be done in order to extract the values before storing them. Something like:

1:  lw     t1, (a1)  ; t1=ABCD
    addiu  a1, a1,  4
    ; extract bytes from t1 to t2/t3, padding
    ; them with zeroes: t2=0A0B, t3=0C0D
    addiu  a2, a2, -4
    sw     t2, 0 (a0)
    sw     t3, 4 (a0)
    bnez   a2, 1b
     addiu a0, a0, 8

If the code needed to unpack the data needs more than 14 instructions, then this loop would use more instructions than the basic one above (maybe it still would be marginally faster due to the aligned load/store). This is where the new DSP instructions of the 74K cores come in handy: the preceu.ph.qbl and preceu.ph.qbr instructions unpack two 8-bit parts of a register to two 16-bit integers in another register. Like this:

Using the preceu.ph.* instructions avoids having to do quite a lot of shifting and masking to get the values right. The final loop is still quite easy to grasp:

1:  lw     t1, (a1)  ; t1=ABCD
    addiu  a1, a1,  4
    preceu.ph.qbl t2, t1  ; t2=0A0B
    preceu.ph.qbr t3, t1  ; t3=0C0D
    addiu  a2, a2, -4
    sw     t2, 0 (a0)
    sw     t3, 4 (a0)
    bnez   a2, 1b
     addiu a0, a0, 8

Note that the actual implementation is longer, handles the extra items remaining after the unrolled loop, and has a preamble to make the pointers aligned—which, most of the time, is worth to do.

For the sake of completeness I made an additional version of the function in assembler with the loop unrolled 8 times, which uses byte loads and halfword stores. For the benchmarking different string lengths were checked, being each version of the function called a million times. The plot uses the average of 10 runs:

Legend for the plot above:

  • Qt: This is the original code included in Qt, which is s simple pull-up operation that converts chars to short integers. The code was built with -O2 -march=74kf -mdspr2 to get the best possible optimizations from GCC.
  • lbu+unroll8: Loads each character using lbu (byte load) instructions, stores results using sh (store halfword), and the main body of the loop is 8-unrolled. It also uses cache prefetch hinting for both source and destination memory areas.
  • lw+unroll4+dsp: The main body of the loop is 4-unrolled, using lw instructions to load aligned words, DSP instructions to unpack bytes into halfwords, and finally store results with sw instructions.
  • lw+unroll8+dsp: Similar to the previous one, main body is 8-unrolled, and if the source address is not aligned, up to three bytes are consumed before starting to use lw instructions to load aligned words.

The following observations can be done:

  • All the optimized versions perform better than the simple Qt version, being the speedup up to 3.x (note: the time scale in the graphic is logarithmic).
  • For strings up to 20 characters, the 4-unrolled function performs clearly better, because of the small data set and the fact that it avoids the additional preamble used to align the source pointer address. Also, some string lengths (4, 12, 20) can be better handled by the 4-unrolled loop because the lengths are not divisible by 8.
  • For strings bigger than 20 characters, the 8-unrolled with aligned loading version is the best, and it reaches a 3x speedup for strings longer than 32 characters. For longer strings, the speedup is maintaned and becomes clearer as string are longer.

With this data, we can conclude that the best implementation is an hybrid approach, using both the code for the 4-unrolled loop and the 8-unrolled loop, and selecting one of them depending on the length of the input data. Checking the length of the input string and branching takes just two extra instructions (slti and a bnez), which is okay to have in the function preamble because it is not costly.

Wrapping up, the final version is up to 3 times faster than the original Qt implementation, without sacrificing the speed for the rather common small strings.

Last but not least, I would like to thank MIPS Technologies for sponsoring this work, and for making it possible to have a development board at Igalia for testing and benchmarking.


11
Feb 13

Back from FOSDEM — 2013 Edition

A couple of weekends ago I attended FOSDEM, my third time in a row sponsored by Igalia.

Apart from the appeal of wandering around Brussels and enjoying what the European capital has to offer —mostly in the gastronomic field—, this year I had a presentation to make about the progress done in WebKitGTK+ during the last year. Because it was The room was quite full, so for those who could not make it on time or decided to not step in, here are the slides I used:

  • HTML slides: you will need a quite recent browser, WebKit-based if possible. The latest version of Epiphany the GNOME Web browser will do fine ;-)
  • PDF slides: Essentially the same, generated from the HTML plus some pdfutils mojo to crop the margins. Any reasonable PDF viewer should handle those well.

Those should be a good summary while the videos of the “crossdistro” room slowly start to appear thanks to the good work done by the video team.

Last but not least, in the middle of the huge amount of interesting talks, there were two more Igalians with talks at FOSDEM: Felipe had a talk about user experience design and API made a summary on how accessibility in GNOME got to be enabled by default.

There is only one small “complaint” I have about FOSDEM: would there ever be enough time to attend all the sessions one would like to? I bet not. And that is, in a way, quite amazing :-)


23
Jul 12

GUADEC, second round (for me)

Last summer I had the opportunity of attending the Desktop Summit in Berlin, and being it a great experience I was more than looking forward to attend GUADEC this year. Not just because it is going to be great to meet fellow hackers and friends after one year and attend the sessions, but also because it is taking place in A Coruña, where I lived for a decade. Last, but not least, there will be a GNOME OS hackfest on July 30th (check it out in the BoF/hackfests programme) and lately that has been something which has been in my list of interests.

See you there! :-D


27
Jun 12

Recipe: Running qmail-send under systemd

Call me old-fashioned, crazy or whatever, but I do use qmail (only the qmail-send service) as local MTA and for relaying outgoing mail to a smarthost. Over the years I tried several other options like msmtp, nbsmtp, or nullmailer, which would allow to not run a full-fledged MTA, but they either:

  • Do not queue up messages. (Queuing is interesting for sometimes-offline boxes, like a laptop: you still “send” messages normally, and once there is again a working Internet connection , they are delivered all in a row.)
  • Do not provide a sendmail-compatible binary. (Tools like git send-email expect this to be available, so one absolutely wants this.)
  • Are too dumb to provide some sort of reasonable “local” delivery to a remote mail address. (Certain tools like cron may send messages to local accounts, so I want the mailer to be able to forward those local messages to a remote address.)

Then at some point I though “hey, but qmail is no more than a number of small tools glued together which do one task well… why not running just qmail-send for deliveries?” and the idea indeed works like a charm. Traditionally I have been running it with dmon for running it, but lately I have been using systemd in my laptop, and decided to run it directly using it instead of letting it run the SysV init script. It turned out to be reasonably easy.

Those are the contents for the /etc/systemd/system/qmail-send.service unit file:

[Unit]
Description=qmail delivery daemon
After=syslog.target
After=local-fs.target
ConditionFileIsExecutable=/var/qmail/bin/qmail-start

[Install]
WantedBy=multi-user.target

[Service]
Type=simple
Restart=always
StandardOutput=syslog
StandardError=inherit
SyslogFacility=mail
SyslogIdentifier=qmail-send
Environment=PATH=/var/qmail/bin:/usr/bin:/bin:/usr/sbin:/sbin
ExecStart=/var/qmail/bin/qmail-start ~/.maildir/
ExecReload=/var/qmail/bin/qmail-tcpok ; /bin/kill -ALRM ${MAINPID}

Some notes:

  • This assumes that you have a LWQ-style qmail installation, hence the paths pointing to /var/qmail.
  • The Environment= line defines the PATH variable so it contains /var/qmail/bin. This is needed for qmail to work properly.
  • The ExecReload= clears the timeout tables with qmail-tcpok and then sends SIGALRM to the main process. This makes it try to deliver the queued messages. I have a handy script that runs systemctl reload qmail-send.service when NetworkManager connects to Internet.
  • Like all DJB-ware, qmail logs using the stdio streams, so we make them go to both the systemd journal and syslog (I want to be able to tail -f the log file for the mail facility).

Once this unit file is in place, use the following to reload the unit files, and then start and enable the service:

systemctl daemon-reload
systemctl start qmail-send.service
systemctl enable qmail-send.service

Voilà! :-)


26
Jun 12

List of articles about systemd

Even when my personal opinion on systemd is a bit skeptical regarding some of its components, it is undeniable that in the mid-term the main GNU/Linux distributions will be shipping it (or parts of it), and with Fedora having already adopted, it is just a matter of time before GNOME (as in “GNOME OS”) starts using it. So, for getting acquainted with it, I have installed it following the corresponding Arch Wiki page and started reading documentation. This article is mainly a compilation to remind myself about where to easily reach the relevant information, but it may still be useful to the people out there so I here it goes.

Lennart Poettering (main developer of the thing) has a “systemd for administrators” series of blog posts (thirteen articles so far):

  1. Verifying bootup
  2. Which service owns which processes?
  3. How do I convert a SysV init script to a systemd service file?
  4. Killing services
  5. The three levels of “off” (aka how to disable services)
  6. Changing roots (aka chroot()ing services)
  7. The Blame Game (aka determining which services are slowing down the boot process)
  8. The new configuration files (contains interesting notes for third party packages)
  9. On /etc/sysconfig and /etc/default (this is more like an opinion column, but interesting nevertheless)
  10. Instantiated services (aka service templates, like the multiple virtual console services)
  11. Converting inetd services
  12. Securing your services (aka how to restrict what services can do to harden them)
  13. Log and service status
  14. The self-explanatory boot (aka how unit files —services, etc— can provide hints to documentation in them)
  15. Watchdogs

Here you can find a list with links to the manual pages and some more documentation. For newcomers, I would totally recommend reading the FAQ and the Tips & Tricks pages.


15
Jun 12

The GNOME, the OS and the OSTree

Traditionally, GNOME has been defined —and seen— as a project aiming to produce a desktop environment made of Free Software. Maybe you have read (or heard) the term “GNOME OS” lately, which started to pop out here and there a while ago, but still it may sound a lot like vaporware… far from that, it is being shaped up in this very moment, and GNOME 3 is just the tip of the iceberg.

But isn’t making an entire operating system a lot more of work than just caring about the user experience? Yes. Kinda. Depends on how one does it. Fortunately there are a number of components already available: for example nobody has to take care about writing a kernel or the base graphics infrastructure—but adopting them instead, assembling a minimal host system on top of which the rest of the GNOME platform can be run and developed. Add some grease —a process to install the system—, some polish —a process to upgrade it— and some duct tape —a developer story— and then the thing will be ready to crank it up. This is precisely what OSTree is about.

OSTree

Trees: a new home for (the) GNOME (Creative-Commons image by Alan Moote)

A while ago Colin Walters started to implement Hacktree, later renamed to OSTree, with the aim to cover the inner guts of system deployment —installation, upgrades— and development. The basic idea is to have multiple, complete, bootable file system trees, with the possibility of chosing among them at boot time and versioning the contents in a repository. Roughly, operations then are carried away this way:

  • A new system installation consists in cloning a remote repository and checking out a tree that contains the latest release of the system.
  • An upgrade consists in pulling contents from the remote repository, checking out a new tree and rebooting into it. Note that the old tree can be kept around and be used as fallback when something fails in the new one.

If the above sounds a bit like the Git version control system, it is because OSTree draws quite some inspiration and ideas from it ;-)

As a nice consequence of allowing multiple file system trees, developers automatically get some extras for free:

  • They may want to check out a tree containing additional aids targeted to them and use it for development.
  • It would be possible to ensure that all developers work on top of the same set of platform components.
  • Also, a developer may chose to check out a particular revision of the system tree, to ease reproduce bugs in a controlled environment.

But developers deserve even more than that, isn’t it? Why not revamping the venerable JHBuild into something that knows how to interact with OSTree? Such a thing exists, and it’s called ostbuild.

Making a mark in the world

Some of us at Igalia think that OSTree has a very good potential to become an integral part of a GNOME system in a not-so-far future, so why not helping out to try to make that happen a bit earlier? It is not ready for prime time yet, hence the investment we are doing devoting hacking time in OSTree. Expect more posts about it later on ;-)


02
May 12

Recipe: Fixed number of workspaces in GNOME 3.x

Type this in a terminal (change 4 to the number of desired workspaces):

gsettings set org.gnome.desktop.wm.preferences num-workspaces 4
gsettings set org.gnome.shell.overrides dynamic-workspaces false

Note: I have only checked this with GNOME 3.4 — YMMV :)


20
Feb 12

Some Tracker + SPARQL bits

Those are some notes about Tracker with links and information which has been useful to me when using Tracker as data storage backend in a project. Maybe those will be useful to someone else out there :)

Reference documentation

The basics are the W3C standards:

Tracker-specific info

  • Tracker adds some extra features which can be used in SPARQL queries. Some of them can be used to make queries faster (see below).
  • It is possible to subscribe to notifications of changes to the Tracker database. However, using the low-level D-Bus support is crude, therefore it is better to use something like TrackerLiveQuery.

Optimizing queries

At some point one may realize that a particular query is not running fast enough, and usually the first thought would be to blame Tracker being slow: as a matter of fact Tracker tends to be quite fast, but sometimes the way it translates the SPARQL queries to SQL makes them be slower than they could be. And the good news is that most of the queries can be tweaked to facilitate the job of the query optimizer. There are some hints in the Tracker wiki:

Also, I would recommend reading those two nice articles by Adrien Bustany:

Undocumented behavior

Undocumented behavior exists in Tracker to a certain degree. Some of the undocumented behavior is caused by the fact that SQLite is used underneath and that the SPARQL parser included in Tracker is quite permissive and will just pass-through certain constructs when generating the SQL queries.

As an example, the regular expression syntax used y SPARQL does not include predefined character classes, but as SQLite uses POSIX regular expressions internally, the following filter expression works:

  SELECT ?name
  WHERE { ?urn nco:imNickname ?name
          FILTER (bound(?name) && !REGEX(?name, "[[:space:]]+")) }

(Obtains the nick names of instant messaging contacts which do not have spaces in them.)

Another example of undocumented behaviour is the fact that aggregate functions (e.g. COUNT) can be used on the result of a property function. For example, take this query:

  SELECT nie:url(?urn) COUNT(?regions)
  WHERE { ?urn rdf:type nmm:Photo .
          OPTIONAL { ?urn nfo:hasRegionOfInterest ?region } }
  GROUP BY ?urn

(Obtains the URLs of images and the number of associated regions of interest.)

The above query would run faster if it would be possible to get rid of the OPTIONAL clause, by using a property function, but when an attribute has multiple values, the property function returns the concatenation of the values separated by commas, so COUNT would be expected choke when applied to that… No! Actually, the following works as expected:

  SELECT nie:url(?urn) COUNT(nfo:hasRegionOfInterest(?urn))
  WHERE { ?urn rdf:type nmm:Photo }
  GROUP BY ?urn

(Obtains the URLs of images and the number of associated regions of interest — faster version.)


30
Jan 12

Recipe: Synchronize music folder to an external device

This is easy to come up with, but as I end up having to dig through rsync‘s manual page to find the relevant options to copy my music to a device that has a non-Unix filesystem, it may be good to just write the recipe down here. Note that I am assuming that the file system in the device does not permission bits and ownerships (e.g. FAT), so those are not to be kept, and using rsync --archive is not a good option rsync will fail to set those extra attributes.

rsync --progress --delete -vrtxmhi ~/Music/ /media/${target_device}/Music/

This is what I use to keep my music library synchronized across my laptop, my phones and the extra backup in an external hard drive :-)


29
Jan 12

FOSDEM 2012

This year I will be attending FOSDEM for the second time, and I am really looking forward to it. Not only because my flight from Helsinki this year will allow me to arrive on time for the beer event, but because it is a great event that fosters the Libre Software / Open Source community, it is packed with loads of interesting talks and people have the possibility to get in direct touch with each other in good atmosphere.

Once again, I am glad to be part of an über-cool company like Igalia, which is sponsoring my trip to Brussels. See you all there!


Bad Behavior has blocked 219 access attempts in the last 7 days.