<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Claudio Saavedra's ChangeLog</title>
    <link>https://blogs.igalia.com/csaavedra/news.html</link>
    <description>Claudio's day to day</description>

    <atom:link href="https://blogs.igalia.com/csaavedra/rss.xml" rel="self" type="application/rss+xml" />

    <copyright>2022 Claudio Saavedra</copyright>
    <managingEditor>csaavedra@gnome.org (Claudio Saavedra)</managingEditor>
    <webMaster>csaavedra@gnome.org (Claudio Saavedra)</webMaster>
    <language>en</language>
    <lastBuildDate>Mon, 03 Oct 2022 15:18:00 +0300</lastBuildDate>

    <item>
      <title>Mon 2022/Oct/03</title>
      <link>https://blogs.igalia.com/csaavedra/news-2022-10.html#D03</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2022-10.html#D03</guid>
      <pubDate>Mon, 03 Oct 2022 14:28:00 +0300</pubDate>
      <description><![CDATA[
<p>
The series on the WPE port by the WebKit team at Igalia grows, with
several new articles that go deep into different areas of the engine:
</p>
<ul>
  <li>
    <a href="https://wpewebkit.org/blog/03-wpe-graphics-architecture.html">WPE graphics overview</a>, by <a href="https://blogs.igalia.com/magomez">Miguel</a>.
  </li>
  <li>
    <a href="https://wpewebkit.org/blog/04-wpe-qa-tooling.html">QA and tooling</a> by <a href="https://dev.to/lauromoura">Lauro</a>.
  </li>
  <li>
    <a href="https://wpewebkit.org/blog/04-wpe-networking-overview.html">WPE networking overview</a>, by <a href="https://blog.tingping.se/">Patrick</a>.
  </li>
</ul>
<p>
These articles are an interesting read not only if you're working on
WebKit, but also if you are curious about how a modern browser engine
works and some of the moving parts beneath the surface. So go check them out!
</p>
<p>
On a related note, the WebKit team is always on the lookout for talent
to join us. Experience with WebKit or browsers is not necessarily a
must, as we know from experience that anyone with a strong C/C++
background and enough curiosity will be able to ramp up and start
contributing soon enough.  If these articles spark your curiosity,
feel free to reach out to me to find out more
or <a href="https://www.igalia.com/jobs/browsers_webkit_position">to
apply directly</a>!
</p>
]]></description>
    </item>

    <item>
      <title>Fri 2022/Jul/01</title>
      <link>https://blogs.igalia.com/csaavedra/news-2022-07.html#D01</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2022-07.html#D01</guid>
      <pubDate>Fri, 01 Jul 2022 13:39:00 +0300</pubDate>
      <description><![CDATA[
<p>
I wrote <a href="https://wpewebkit.org/blog/02-overview-of-wpe.html">a
technical overview of the WebKit WPE project</a> for the <a href="https://wpewebkit.org/blog/">WPE WebKit
blog</a>, for those interested in WPE as a potential solution to the
problem of browsers in embedded devices.
</p>

<p>This article begins a series of technical writeups on the
architecture of WPE, and we hope to publish during the rest of the
year further articles breaking down different components of WebKit,
including graphics and other subsystems, that will surely be
of great help for those interested in getting more familiar
with WebKit and its internals.</p>
]]></description>
    </item>

    <item>
      <title>Thu 2020/Oct/29</title>
      <link>https://blogs.igalia.com/csaavedra/news-2020-10.html#D29</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2020-10.html#D29</guid>
      <pubDate>Thu, 29 Oct 2020 15:10:00 +0200</pubDate>
      <description><![CDATA[
<p>In this line of work, we all stumble at least once upon a problem
that turns out to be extremely elusive and very tricky to narrow down
and solve. If we&apos;re lucky, we might have everything at our
disposal to diagnose the problem but sometimes that&apos;s not the
case &ndash; and in embedded development it&apos;s often not the
case. Add to the mix proprietary drivers, lack of debugging symbols, a
bug that&apos;s very hard to reproduce under a controlled environment,
and weeks in partial confinement due to a pandemic and what you have
is better described as a very long lucid nightmare. Thankfully,
even the worst of nightmares end when morning comes, even if sometimes
morning might be several days away. And when the fix to the problem is
in an unimaginable place, the story is definitely one worth
telling.</p>
<h3 id="the-problem">The problem</h3>
<p>It all started with one
of <a href="https://www.igalia.com">Igalia</a>&apos;s customers deploying
a <a href="https://wpewebkit.org">WPE WebKit</a>-based browser in
their embedded devices. Their CI infrastructure had detected a problem
caused when the browser was tasked with creating a new webview (in
layman&apos;s terms, you can imagine that to be the same as opening a new tab
in your browser). Occasionally, this view would never load, causing
ongoing tests to fail. For some reason, the test failure had a
reproducibility of ~75% in the CI environment, but during manual
testing it would occur less than 1% of the time. For reasons
that are beyond the scope of this post, the CI infrastructure was not
reachable in a way that would allow access to running
processes in order to diagnose the problem more easily. So with only
logs at hand and less than a 1-in-100 chance of reproducing the bug
myself, I set out to debug this problem locally.</p>
<h3 id="diagnosis">Diagnosis</h3>
<p>The first thing that became evident was that, whenever this bug
occurred, the WebKit feature known as <em>web extension</em> (an
application-specific loadable module that is used to allow the program
to have access to the internals of a web page, as well as to enable
customizable communication with the process where the page contents
are loaded &ndash; the web process) wouldn&apos;t work. The browser would
wait forever for the web extension to load, and since that wouldn&apos;t
happen, the expected page wouldn&apos;t load. The first place to look into
then is the web process, to try to understand what is preventing
the web extension from loading. Enter our good friend GDB, with
less than spectacular results thanks to stripped libraries.</p>

<pre><code>#0  0x7500ab9c in poll () from target:/lib/libc.so.6
#1  0x73c08c0c in ?? () from target:/usr/lib/libEGL.so.1
#2  0x73c08d2c in ?? () from target:/usr/lib/libEGL.so.1
#3  0x73c08e0c in ?? () from target:/usr/lib/libEGL.so.1
#4  0x73bold6a8 in ?? () from target:/usr/lib/libEGL.so.1
#5  0x75f84208 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#6  0x75fa0b7e in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#7  0x7561eda2 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#8  0x755a176a in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#9  0x753cd842 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#10 0x75451660 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#11 0x75452882 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#12 0x75452fa8 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#13 0x76b1de62 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#14 0x76b5a970 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#15 0x74bee44c in g_main_context_dispatch () from target:/usr/lib/libglib-2.0.so.0
#16 0x74bee808 in ?? () from target:/usr/lib/libglib-2.0.so.0
#17 0x74beeba8 in g_main_loop_run () from target:/usr/lib/libglib-2.0.so.0
#18 0x76b5b11c in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#19 0x75622338 in ?? () from target:/usr/lib/libWPEWebKit-1.0.so.2
#20 0x74f59b58 in __libc_start_main () from target:/lib/libc.so.6
#21 0x0045d8d0 in _start ()</code></pre>

<p>From all threads in the web process, after much tinkering around it
slowly became clear that one of the places to look into is
that <a href="https://man7.org/linux/man-pages/man2/poll.2.html"><code>poll()</code></a>
call. I will spare you the details related to what other threads were
doing; suffice it to say that whenever the browser would hit the bug,
there was a similar stacktrace in one thread, going
through <a href="https://www.khronos.org/egl/">libEGL</a> to a
call to <code>poll()</code> on top of the stack, that would never
return. Unfortunately, a stripped EGL driver coming from a proprietary
graphics vendor was a bit of a showstopper, as was the inability to
have proper debugging symbols running inside the device (did you know
that a non-stripped WebKit library binary with debugging symbols can
easily get GDB and your device out of memory?). The best one could do
to improve that was to use the
<a href="https://man7.org/linux/man-pages/man1/gcore.1.html"><code>gcore</code></a>
feature in GDB, and extract a core from the device for post-mortem
analysis. But for some reason, such a stacktrace wouldn&apos;t give
anything interesting below the <code>poll()</code> call to understand
what&apos;s being polled here. Did I say this was tricky?</p>
<h3 id="what-polls-">What polls?</h3>
<p>Because WebKit is a multiprocess web engine, having system calls
that signal, read, and write in sockets communicating with other
processes is an everyday thing. Not knowing what a <code>poll()</code>
call is doing and who it is trying to listen to is not
very good. Because the call is happening under the EGL library, one
can presume that it&apos;s graphics related, but there are still
different possibilities, so trying to find out what this is polling is
a good idea.</p>
<p>A trick I learned while debugging this is that, in the absence of
debugging symbols that would give a straightforward look into
variables and parameters, one can examine the CPU registers and try to
figure out from them what the parameters to function calls are. Let&apos;s
  do that with <code>poll()</code>. First, its signature.</p>
<pre><code class="c">int poll(struct pollfd *fds, nfds_t nfds, int timeout);</code></pre>
<p>Now, let's examine the registers.</p>
<pre><code class="c">(gdb) f 0
#0  0x7500ab9c in poll () from target:/lib/libc.so.6
(gdb) info registers
r0             0x7ea55e58	2124766808
r1             0x1	1
r2             0x64	100
r3             0x0	0
r4             0x0	0</code></pre>

<p>Registers <code>r0</code>, <code>r1</code>, and <code>r2</code>
contain <code>poll()</code>&apos;s three
parameters. Because <code>r1</code> is 1, we know that there is only
one file descriptor being polled.  <code>fds</code> is a pointer to an
array with one element then. Where is that first element? Well, right
there, in the memory pointed to directly by
<code>r0</code>. What does <code>struct pollfd</code> look like?</p>

<pre><code class="c">struct pollfd {
  int   fd;         /* file descriptor */
  short events;     /* requested events */
  short revents;    /* returned events */
};</code></pre>

<p>What we are interested in here is the contents of <code>fd</code>,
the file descriptor that is being polled. Memory alignment is again on
our side: <code>fd</code> is the first member of the structure, so we
don&apos;t need any pointer arithmetic here. We can directly inspect the
memory that register <code>r0</code> points to and find out what the
value of <code>fd</code> is.</p>

<pre><code>(gdb) print *0x7ea55e58
$3 = 8</code></pre>
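<p>The same read can be written down in C. This is only a sketch to
illustrate the idea: <code>first_polled_fd()</code> is a made-up
helper, and the layout check assumes a platform such as Linux,
where <code>fd</code> is the first member of <code>struct pollfd</code>.</p>

```c
#include <poll.h>
#include <stddef.h>

/* Assumption: fd sits at offset 0, as it does on Linux. Fail the build
 * on any platform that lays the structure out differently. */
_Static_assert(offsetof(struct pollfd, fd) == 0,
               "fd is expected to be the first member of struct pollfd");

/* Reads the first polled descriptor the way `print *ADDRESS` does in GDB:
 * by treating the start of the array as a plain int. */
int first_polled_fd(const struct pollfd *fds)
{
    const int *raw = (const int *)fds;
    return *raw;
}
```

<p>Calling such a function with the address held in <code>r0</code> is
what the GDB command above does by hand.</p>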

<p>So we now know that the EGL library is polling the file descriptor
with an identifier of 8. But where is this file descriptor coming
from? What is on the other end? The <code>/proc</code> file system can
be helpful here.</p>

<pre><code class="console"># pidof WPEWebProcess
1944 1196
# ls -lh /proc/1944/fd/8
lrwx------    1 x x      64 Oct 22 13:59 /proc/1944/fd/8 -> socket:[32166]
</code></pre>
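<p>The same lookup can be done from code
with <code>readlink()</code>. A minimal sketch,
where <code>describe_fd()</code> is a hypothetical helper name and a
mounted <code>/proc</code> is assumed:</p>

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolves /proc/PID/fd/FD into buf, for example "socket:[32166]".
 * Returns the length of the link target, or -1 on error. */
ssize_t describe_fd(pid_t pid, int fd, char *buf, size_t buflen)
{
    char path[64];

    if (buflen == 0)
        return -1;

    snprintf(path, sizeof path, "/proc/%ld/fd/%d", (long)pid, fd);

    /* readlink() does not null-terminate, so leave room and do it by hand. */
    ssize_t len = readlink(path, buf, buflen - 1);
    if (len < 0)
        return -1;

    buf[len] = '\0';
    return len;
}
```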

<p>So we have a socket. What else can we find out about it? Turns out,
not much without
the <a href="https://cateee.net/lkddb/web-lkddb/UNIX_DIAG.html"><code>unix_diag</code></a>
kernel module, which was not available in our device.  But we are
slowly getting closer. Time to call another good friend.</p>

<h3 id="where-gdb-fails-printf-triumphs">Where GDB fails, <code>printf()</code> triumphs</h3>

<p>Something I have learned from many years working with a project as
large as WebKit, is that debugging symbols can be very difficult to
work with. To begin with, it takes ages to build WebKit with them.
When cross-compiling, it&apos;s even worse. And then, very often the
target device doesn&apos;t even have enough memory to load the symbols
when debugging. So they can be pretty useless. That&apos;s when
just
using <a href="https://man7.org/linux/man-pages/man3/fprintf.3.html"><code>fprintf()</code></a>
and logging useful information can simplify things. Since we know that
it&apos;s at some point during initialization of the web process that
we end up stuck, and we also know that we&apos;re polling a file
descriptor, let&apos;s find some early calls in the code of the web
process and add some
<code>fprintf()</code> calls with a bit of information, especially in
those that might have something to do with EGL. What can we find out
now?</p>

<pre><code>Oct 19 10:13:27.700335 WPEWebProcess[92]: Starting
Oct 19 10:13:27.720575 WPEWebProcess[92]: Initializing WebProcess platform.
Oct 19 10:13:27.727850 WPEWebProcess[92]: wpe_loader_init() done.
Oct 19 10:13:27.729054 WPEWebProcess[92]: Initializing PlatformDisplayLibWPE (hostFD: 8).
Oct 19 10:13:27.730166 WPEWebProcess[92]: egl backend created.
Oct 19 10:13:27.741556 WPEWebProcess[92]: got native display.
Oct 19 10:13:27.742565 WPEWebProcess[92]: initializeEGLDisplay() starting.</code></pre>

<p>Two interesting findings from the <code>fprintf()</code>-powered
logging here: first, it seems that file descriptor 8 is one known to
<a href="https://github.com/WebPlatformForEmbedded/libwpe">libwpe</a>
(the general-purpose library that powers the WPE WebKit port). Second,
that the last EGL API call right before the web process hangs
on <code>poll()</code> is a call
to <a href="https://www.khronos.org/registry/EGL/sdk/docs/man/html/eglInitialize.xhtml"><code>eglInitialize()</code></a>. <code>fprintf()</code>,
thanks for your service.</p>
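<p>For reference, a logging helper along these lines is all it
takes. This is a sketch, not the code that went into
WebKit: <code>trace_line()</code> is a hypothetical name, and the
format only imitates the <code>name[pid]:</code> prefix seen in the
log lines above.</p>

```c
#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

/* Writes "name[pid]: message\n" to out and flushes right away, so that the
 * line is not lost if the process hangs immediately afterwards.
 * Returns the number of bytes written, or -1 on error. */
int trace_line(FILE *out, const char *name, const char *fmt, ...)
{
    int head = fprintf(out, "%s[%ld]: ", name, (long)getpid());
    if (head < 0)
        return -1;

    va_list args;
    va_start(args, fmt);
    int body = vfprintf(out, fmt, args);
    va_end(args);

    if (body < 0 || fputc('\n', out) == EOF || fflush(out) == EOF)
        return -1;

    return head + body + 1;
}
```

<p>The explicit flush matters when chasing a hang: a message that is
still sitting in a buffer when the process blocks never reaches the
log.</p>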
<h3 id="number-8">Number 8</h3>
<p>We now know that the file descriptor 8 is coming from WPE and is
not internal to the EGL library. libwpe gets this file descriptor from
the UI process,
as <a href="https://trac.webkit.org/browser/webkit/trunk/Source/WebKit/Shared/WebProcessCreationParameters.h?rev=268978#L191">one
of the many creation parameters</a> that are passed via IPC to the
nascent process in order to initialize it. Turns out that this file
descriptor in particular, the so-called host client file descriptor,
is the one that the freedesktop backend of libWPE, from here onwards
<a href="https://github.com/Igalia/WPEBackend-fdo">WPEBackend-fdo</a>,
creates when a new client is set to connect to its Wayland display. In
a nutshell, in the presence of a new client, a Wayland display is supposed
to create a pair of connected sockets, create a new client on the
display side, give it one of the file descriptors, and pass the other
one to the client process. Because this will be useful later on,
let&apos;s see how
that is <a href="https://github.com/Igalia/WPEBackend-fdo/blob/1d1a01452fb5df6c7cba1aff5a21636ab6cf838b/src/ws.cpp#L369">currently
implemented in WPEBackend-fdo</a>.</p>

<pre><code class="c">    int pair[2];
    if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, pair) < 0)
        return -1;

    int clientFd = dup(pair[1]);
    close(pair[1]);

    wl_client_create(m_display, pair[0]);
</code></pre>

<p>The file descriptor we are tracking down is the client file
descriptor, <code>clientFd</code>. So we now know what&apos;s going on in this socket:
Wayland-specific communication. Let&apos;s enable Wayland debugging next,
by running all relevant processes with <code>WAYLAND_DEBUG=1</code>. We&apos;ll get back
to that code fragment later on.</p>
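<p>To make the role of each descriptor clearer, here is that fragment
wrapped in a function that can be compiled on its own. It is a
sketch: <code>make_client_channel()</code> is a hypothetical name, and
the descriptor that WPEBackend-fdo hands
to <code>wl_client_create()</code> is simply returned to the caller
here.</p>

```c
#include <sys/socket.h>
#include <unistd.h>

/* Creates a connected socket pair. The display side keeps *server_fd;
 * *client_fd is a duplicate of the other end, meant for the client process.
 * Returns 0 on success, -1 on error. */
int make_client_channel(int *server_fd, int *client_fd)
{
    int pair[2];
    if (socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, pair) < 0)
        return -1;

    /* The duplicate does not inherit the close-on-exec flag. */
    int duplicate = dup(pair[1]);
    close(pair[1]);
    if (duplicate < 0) {
        close(pair[0]);
        return -1;
    }

    *server_fd = pair[0];
    *client_fd = duplicate;
    return 0;
}
```

<p>Anything written to one end can be read from the other, which is
all that Wayland needs to talk to its client.</p>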
<h3 id="a-heisenbug-is-a-heisenbug-is-a-heisenbug">A Heisenbug is a Heisenbug is a Heisenbug</h3>
<p>Turns out that enabling Wayland debugging output for a few
processes is enough to alter the state of the system in such a way
that the bug does not happen at all when doing manual
testing. Thankfully the CI&apos;s reproducibility is much higher, so
after waiting overnight for the CI to continuously run until it hit
the bug, we have logs. What do the logs say?</p>
<pre><code>WPEWebProcess[41]: initializeEGLDisplay() starting.
  -> wl_display@1.get_registry(new id wl_registry@2)
  -> wl_display@1.sync(new id wl_callback@3)
</code></pre>

<p>So the EGL library is trying to fetch the Wayland
registry and it&apos;s doing a <code>wl_display_sync()</code> call
afterwards, which will block until the server responds. That&apos;s
where the blocking <code>poll()</code> call comes from.  So, it turns
out, the problem is not necessarily on this end of the Wayland socket,
but perhaps on the other side, that is, in the so-called UI process
(the main browser process). Why is the Wayland display not
replying?</p>
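<p>The situation the web process is in can be imitated with a plain
socket: send a request, then poll for an answer that never
arrives. The sketch below is not Wayland
code; <code>request_and_wait()</code> is a hypothetical helper, and it
takes a timeout only so that it can be tried out without hanging.</p>

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Sends a request and waits up to timeout_ms for the peer to answer.
 * Returns 1 if a reply is readable, 0 on timeout, -1 on error. */
int request_and_wait(int fd, const void *request, size_t len, int timeout_ms)
{
    if (send(fd, request, len, 0) != (ssize_t)len)
        return -1;

    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready < 0)
        return -1;

    return (ready > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

<p>With a negative timeout, <code>poll()</code> waits indefinitely, so
a peer that never replies leaves the caller stuck.</p>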

<h3 id="the-loop">The loop</h3>

<p>Something that is worth mentioning before we move on is how the
WPEBackend-fdo Wayland display integrates with the system.  This
display is a nested display, with each web view a client, while it is
itself a client of the system&apos;s Wayland display. This can be a bit
confusing if you&apos;re not very familiar with how Wayland works, but
fortunately there is
<a href="https://wayland-book.com/introduction.html">good
documentation about Wayland</a> elsewhere.</p>

<p>The way that the Wayland display in the UI process of a WPEWebKit
browser is integrated with the rest of the program, when it uses
WPEBackend-fdo, is through the
<a href="https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html">GLib
main event loop</a>. Wayland itself has an event loop implementation
for servers, but for a GLib-powered application it can be useful to
use GLib&apos;s and integrate Wayland&apos;s event processing with the different
stages of the GLib main loop. That is precisely how WPEBackend-fdo is
handling its clients&apos; events. As discussed earlier, when a new client
is created a pair of connected sockets is created and one end is
given to Wayland to control communication with the
client. <a href="https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html#GSourceFuncs"><code>GSourceFunc</code>
functions</a> are used to integrate Wayland with the application main
loop. In these functions, we make sure that whenever there are pending
messages to be sent to clients, those are sent, and whenever any of
the client sockets has pending data to be read, Wayland reads from
them and dispatches the events that might be necessary in response
to the incoming data. And here is where things start getting really
strange, because after doing a bit of
<code>fprintf()</code>-powered debugging inside the Wayland-GSourceFuncs functions,
it became clear that the Wayland events from the clients were never
dispatched, because the <code>dispatch()</code> <code>GSourceFunc</code> was not being called,
as if <em>there was nothing coming from any Wayland client</em>. But how is
that possible, if we already know that the web process client is
actually trying to get the Wayland registry?</p>

<p>To move forward, one needs to understand how the GLib main loop
works, in particular, with Unix file descriptor sources. A very brief
summary of this is that, during an iteration of the main loop, GLib
will poll file descriptors to see if there are any interesting events
to be reported back to their respective sources, in which case the
sources will decide whether to trigger the <code>dispatch()</code>
phase.  A simple source might decide in its <code>dispatch()</code>
method to directly read or write from/to the file descriptor; a
Wayland display source (as in our case), will
call <code>wl_event_loop_dispatch()</code> to do this for us.
However, if the source doesn&apos;t find any interesting events, or if
the source decides that it doesn&apos;t want to handle them,
the <code>dispatch()</code> invocation will not happen. More on the
GLib main event loop in
its <a href="https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html#glib-The-Main-Event-Loop.description">API
documentation</a>.</p>
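<p>To make these stages more concrete, here is a heavily simplified
imitation of one loop iteration written with
plain <code>poll()</code>. None of this is GLib
API: <code>struct mini_source</code>, <code>mini_check()</code>, <code>mini_dispatch()</code>
and <code>mini_iterate()</code> are invented for the illustration, and
priorities and the <code>prepare()</code> stage are left out.</p>

```c
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MINI_MAX_SOURCES 16

struct mini_source {
    struct pollfd pfd;   /* descriptor and events this source cares about */
    int dispatched;      /* how many times dispatch ran, for inspection */
};

/* check(): the source asks to be dispatched if polling reported anything. */
static int mini_check(const struct mini_source *source)
{
    return source->pfd.revents != 0;
}

/* dispatch(): a simple source reads directly from its descriptor. */
static void mini_dispatch(struct mini_source *source)
{
    char buf[64];
    if (read(source->pfd.fd, buf, sizeof buf) >= 0)
        source->dispatched++;
}

/* Runs one iteration: poll every descriptor, then check and dispatch.
 * Returns the number of sources dispatched, or -1 on error. */
int mini_iterate(struct mini_source *sources, size_t n, int timeout_ms)
{
    struct pollfd fds[MINI_MAX_SOURCES];

    if (n > MINI_MAX_SOURCES)
        return -1;

    for (size_t i = 0; i < n; i++) {
        fds[i] = sources[i].pfd;
        fds[i].revents = 0;
    }

    if (poll(fds, (nfds_t)n, timeout_ms) < 0)
        return -1;

    int count = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hand the polling result back to the source before asking it. */
        sources[i].pfd.revents = fds[i].revents;
        if (mini_check(&sources[i])) {
            mini_dispatch(&sources[i]);
            count++;
        }
    }
    return count;
}
```

<p>A source whose descriptor has nothing to report is never
dispatched, which is the behaviour described above.</p>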

<p>So it seems that for some reason the <code>dispatch()</code> method is not being
called. Does that mean that there are no interesting events to read
from? Let&apos;s find out.</p>
<h3 id="system-call-tracing">System call tracing</h3>
<p>Here we resort to another helpful
tool, <a href="https://man7.org/linux/man-pages/man1/strace.1.html"><code>strace</code></a>. With <code>strace</code>
we can try to figure out what is happening when the main loop polls
file descriptors. The <code>strace</code> output is huge (because it
takes easily over a hundred attempts to reproduce this), but we know
already some of the calls that involve file descriptors from the code
we looked at above, when the client is created. So we can use those
calls as a starting point when searching through the several MBs of
logs. Fast-forward to the relevant logs.</p>

<pre><code class="c">socketpair(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0, [128<socket:[168468]>, 130<socket:[168469]>]) = 0
dup(130<socket:[168469]>)               = 131<socket:[168469]>
close(130<socket:[168469]>)             = 0
fcntl64(128<socket:[168468]>, F_DUPFD_CLOEXEC, 0) = 130<socket:[168468]>
epoll_ctl(34<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 130<socket:[168468]>, {EPOLLIN, {u32=1639599928, u64=1639599928}}) = 0</code></pre>

<p>What we see there is, first, WPEBackend-fdo creating a new socket
pair (128, 130) and duplicating the client end (130 becomes 131). Then,
when file descriptor 128 is passed to
  <a href="https://wayland.freedesktop.org/docs/html/apc.html#Server-structwl__display"><code>wl_client_create()</code></a> to
create a new client, Wayland duplicates it (the duplicate reuses the
number 130, which had just been closed) and adds that duplicate to its
  <a href="https://man7.org/linux/man-pages/man7/epoll.7.html"><code>epoll()</code></a> instance
  for monitoring clients, which is referred to by file descriptor 34. This way, whenever there are
events in file descriptor 130, we will hear about them in file descriptor 34.</p>
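<p>The key detail here is that an epoll descriptor can itself be
polled: it reports <code>POLLIN</code> as soon as any descriptor
registered in it has pending events. A small sketch of that nesting,
with hypothetical helper names (<code>watch_with_epoll()</code>
and <code>epoll_fd_has_input()</code>) and assuming Linux:</p>

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Creates an epoll instance that watches fd for input.
 * Returns the epoll descriptor, or -1 on error. */
int watch_with_epoll(int fd)
{
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/* Polls the epoll descriptor itself, without blocking.
 * Returns 1 if it reports POLLIN, 0 if it does not, -1 on error. */
int epoll_fd_has_input(int epfd)
{
    struct pollfd pfd = { .fd = epfd, .events = POLLIN, .revents = 0 };
    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

<p>This is what allows an outer loop to watch a single descriptor (34
in our logs) on behalf of all the clients registered behind it.</p>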

<p>So what we would expect to see next is that, after the web process
is spawned, when a Wayland client is created using the passed file
descriptor and the EGL driver requests the Wayland registry from the
display, there should be a <code>POLLIN</code> event coming in file
descriptor 34 and, if the <code>dispatch()</code> method for the source
was called,
an <a href="https://man7.org/linux/man-pages/man2/epoll_wait.2.html"><code>epoll_wait()</code></a>
call on it, as that is
what <a href="https://github.com/wayland-project/wayland/blob/53dd99793dd95fcfc187a0ee81ab289dfbe7fc2a/src/event-loop.c#L995"><code>wl_event_loop_dispatch()</code></a>
would do when called from the source&apos;s <code>dispatch()</code>
method. But what do we have instead?</p>

<pre><code class="c">poll([{fd=30<socket:[21590]>, events=POLLIN}, {fd=34<anon_inode:[eventpoll]>, events=POLLIN}, {fd=59<socket:[19231]>, events=POLLIN}, {fd=110<socket:[40913]>, events=POLLIN}, {fd=114<socket:[166891]>, events=POLLIN}, {fd=132<socket:[168470]>, events=POLLIN}], 6, 0) = 1 ([{fd=34, revents=POLLIN}])
recvmsg(30<socket:[21590]>, {msg_namelen=0}, MSG_DONTWAIT|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
</code></pre>

<p><code>strace</code> can be a bit cryptic, so let&apos;s explain
those two function calls.  The first one is a poll on a series of file
descriptors (including 30 and 34) for <code>POLLIN</code> events. The
return value of that call tells us that there is a <code>POLLIN</code>
event in file descriptor 34 (the Wayland display <code>epoll()</code>
instance for clients). But unintuitively, the call right after is
trying to read a message from socket 30 instead, which we know
doesn&apos;t have any pending data at the moment, and consequently
returns an error value with an <code>errno</code>
of <code>EAGAIN</code> (Resource temporarily unavailable).</p>
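<p>The <code>EAGAIN</code> result is easy to reproduce. In this
sketch, <code>socket_has_pending_data()</code> is a hypothetical
helper that peeks at a socket with <code>MSG_DONTWAIT</code>, the same
flag seen in the <code>recvmsg()</code> call above:</p>

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Peeks at a socket without blocking and without consuming data.
 * Returns 1 if data is pending, 0 if the read would block (EAGAIN),
 * and -1 if the peer closed the connection or another error occurred. */
int socket_has_pending_data(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_DONTWAIT | MSG_PEEK);

    if (n > 0)
        return 1;
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;
    return -1;
}
```

<p>A non-blocking read on a socket that polling did not report as
readable ends up in the <code>EAGAIN</code> branch.</p>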

<p>Why is the GLib main loop triggering a read from 30 instead of 34?
  And who is 30?</p>

<p>We can answer the latter question first. Breaking on a running UI
process instance at the right time shows who is reading from
the file descriptor 30:</p>

<pre><code>#1  0x70ae1394 in wl_os_recvmsg_cloexec (sockfd=30, msg=msg@entry=0x700fea54, flags=flags@entry=64)
#2  0x70adf644 in wl_connection_read (connection=0x6f70b7e8)
#3  0x70ade70c in read_events (display=0x6f709c90)
#4  wl_display_read_events (display=0x6f709c90)
#5  0x70277d98 in pwl_source_check (source=0x6f71cb80)
#6  0x743f2140 in g_main_context_check (context=context@entry=0x2111978, max_priority=<optimized out>, fds=fds@entry=0x6165f718, n_fds=n_fds@entry=4)
#7  0x743f277c in g_main_context_iterate (context=0x2111978, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
#8  0x743f2ba8 in g_main_loop_run (loop=0x20ece40)
#9  0x00537b38 in ?? ()</code></pre>

<p>So it&apos;s also Wayland, but on a different level. This
is the Wayland client source (remember that the browser is also a
Wayland client?), which is installed
by <a href="https://github.com/Igalia/cog">cog</a> (a thin browser
layer on top of WPE WebKit that makes writing browsers easier to do)
to process, among others, input events coming from the parent Wayland
display. <a href="https://github.com/Igalia/cog/blob/9a26af69b2fa0cc188de6d5d1ee2d527aff49f77/platform/cog-platform-fdo.c#L299">Looking
at the cog code</a>, we can see that the
<a href="https://wayland.freedesktop.org/docs/html/apb.html#Client-classwl__display"><code>wl_display_read_events()</code></a>
call happens only if GLib reports that there is
a <a href="https://developer.gnome.org/glib/stable/glib-IO-Channels.html#GIOCondition"><code>G_IO_IN</code></a>
(<code>POLLIN</code>) event in its file descriptor, but we already
know that this is not the case, as per the <code>strace</code>
output. So at this point we know that there are two things here that
are not right:</p>
<ol>
<li>A FD source with a G_IO_IN condition is not being dispatched.</li>
<li>A FD source without a G_IO_IN condition is being dispatched.</li>
</ol>
<p>Someone here is not telling the truth, and as a result the main loop
is dispatching the wrong sources.</p>
<h3 id="the-loop-part-ii-">The loop (part II)</h3>
<p>It is at this point that it would be a good idea to look at what
exactly the GLib main loop is doing internally in each of its stages
and how it tracks the sources and file descriptors that are polled and
that need to be processed. Fortunately, debugging symbols for GLib are
very small, so debugging this step by step inside the device is rather
easy.</p>
<p>Let&apos;s look at how the main loop decides which sources
to dispatch, since for some reason it&apos;s dispatching the wrong ones.
Dispatching happens in
the <a href="https://gitlab.gnome.org/GNOME/glib/-/blob/c686e1a0/glib/gmain.c#L3267"><code>g_main_dispatch()</code></a>
method. This method goes over a list of pending source dispatches and
after a few checks and setting the stage, the dispatch method for the
source gets called.  How is a source set as having a pending dispatch?
This happens in
<a href="https://gitlab.gnome.org/GNOME/glib/-/blob/c686e1a0/glib/gmain.c#L3827"><code>g_main_context_check()</code></a>,
where the main loop checks the results of the polling done in this
iteration and runs the <code>check()</code> method for sources that
are not ready yet so that they can decide whether they are ready to be
dispatched or not. Breaking into the Wayland display source, I know
that
the <a href="https://github.com/Igalia/WPEBackend-fdo/blob/1d1a01452fb5df6c7cba1aff5a21636ab6cf838b/src/ws.cpp#L62"><code>check()</code>
method</a> is called. How does this method decide to be dispatched or
not?</p>
<pre><code class="cpp">    [](GSource* base) -> gboolean
    {
        auto& source = *reinterpret_cast<Source*>(base);
        return !!source.pfd.revents;
    },</code></pre>
<p>In this lambda function we&apos;re returning <code>TRUE</code> or
<code>FALSE</code>, depending on whether the <code>revents</code>
field in
the <a href="https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html#GPollFD"><code>GPollFD</code></a>
structure has been filled during the polling stage of this iteration
of the loop. A return value of <code>TRUE</code> indicates to the main
loop that we want our source to be dispatched. From
the <code>strace</code> output, we know that there is a
<code>POLLIN</code> (or <code>G_IO_IN</code>) condition, but we also know that the main loop is
not dispatching it. So let&apos;s look at what&apos;s in this <code>GPollFD</code> structure.</p>
<p>For this, let&apos;s go back to <code>g_main_context_check()</code> and inspect the array
of <code>GPollFD</code> structures that it received when called. What do we find?</p>

<pre><code class="c">(gdb) print *fds
$35 = {fd = 30, events = 1, revents = 0}
(gdb) print *(fds+1)
$36 = {fd = 34, events = 1, revents = 1}</code></pre>

<p>That&apos;s the result of the <code>poll()</code> call! So far so good. Now the method
is supposed to update the polling records that it keeps and uses when
calling each of the sources&apos; <code>check()</code> functions. What do these records
  hold?</p>

<pre><code class="c">(gdb) print *pollrec->fd
$45 = {fd = 19, events = 1, revents = 0}
(gdb) print *(pollrec->next->fd)
$47 = {fd = 30, events = 25, revents = 1}
(gdb) print *(pollrec->next->next->fd)
$49 = {fd = 34, events = 25, revents = 0}</code></pre>

<p>We&apos;re not interested in the first record quite yet, but clearly
there&apos;s something odd here. The polling records are showing a
different value in the <code>revents</code> fields for both 30 and 34. Are these
records updated correctly? Let&apos;s look at the algorithm that is doing
this update, because it will be relevant later on.</p>

<pre><code class="c">  pollrec = context->poll_records;
  i = 0;
  while (pollrec && i < n_fds)
    {
      while (pollrec && pollrec->fd->fd == fds[i].fd)
        {
          if (pollrec->priority <= max_priority)
            {
              pollrec->fd->revents =
                fds[i].revents & (pollrec->fd->events | G_IO_ERR | G_IO_HUP | G_IO_NVAL);
            }
          pollrec = pollrec->next;
        }

      i++;
    }</code></pre>

<p>In simple words, this algorithm traverses the polling records and
the <code>GPollFD</code> array simultaneously,
updating the polling records&apos; <code>revents</code> with the results of
polling. From
reading <a href="https://gitlab.gnome.org/GNOME/glib/-/blob/c686e1a0/glib/gmain.c#L4500">how
the <code>pollrec</code> linked list is built internally</a>, it&apos;s
possible to see that it&apos;s purposely sorted by increasing file
descriptor identifier value. So the first item in the list will have
the record for the lowest file descriptor identifier, and so on.  The
<code>GPollFD</code> array is also built in this way, allowing for a
nice optimization: if more than one polling record &ndash; that is, more
than one polling source &ndash; needs to poll the same file descriptor,
this can be done at once. This is why this otherwise O(n^2) nested
loop can actually be reduced to linear time.</p>
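<p>To experiment with this traversal outside of GLib, the loop can be
reproduced over simplified structures. This is a sketch under
assumptions: <code>struct poll_result</code>
and <code>struct poll_record</code> are stand-ins for GLib&apos;s
types, <code>update_records()</code> is a hypothetical name, and the
error conditions that GLib adds to the mask
(<code>G_IO_ERR</code>, <code>G_IO_HUP</code>, <code>G_IO_NVAL</code>)
are left out.</p>

```c
#include <stddef.h>

/* Stand-in for GPollFD. */
struct poll_result {
    int fd;
    short events;
    short revents;
};

/* Stand-in for a polling record: one per source, sorted by descriptor. */
struct poll_record {
    struct poll_result *fd;
    struct poll_record *next;
    int priority;
};

/* Copies the polling results in fds (sorted by descriptor) into the
 * records (also sorted by descriptor), walking both sequences once. */
void update_records(struct poll_record *records,
                    const struct poll_result *fds, size_t n_fds,
                    int max_priority)
{
    struct poll_record *rec = records;
    size_t i = 0;

    while (rec && i < n_fds) {
        /* Several records may share one descriptor: update them all. */
        while (rec && rec->fd->fd == fds[i].fd) {
            if (rec->priority <= max_priority)
                rec->fd->revents = fds[i].revents & rec->fd->events;
            rec = rec->next;
        }
        i++;
    }
}
```

<p>When every record has a matching entry in the array, each of them
receives the result for its descriptor after a single pass over both
sequences.</p>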
<p>One thing stands out here though: the linked list is only advanced
when we find a match. Does this mean that we always have a match
between polling records and the file descriptors that have just been
polled? To answer that question we need to check how the array of
<code>GPollFD</code> structures is
filled. <a href="https://gitlab.gnome.org/GNOME/glib/-/blob/c686e1a0/glib/gmain.c#L3762">This
is done in <code>g_main_context_query()</code></a>, as we hinted
before. I&apos;ll spare you the details, and just focus on what seems
relevant here: when is a poll record <em>not</em> used to fill
a <code>GPollFD</code>?</p>

<pre><code class="c">  n_poll = 0;
  lastpollrec = NULL;
  for (pollrec = context->poll_records; pollrec; pollrec = pollrec->next)
    {
      if (pollrec->priority > max_priority)
        continue;
  ...
</code></pre>

<p>Interesting! If a polling record belongs to a source whose priority
is lower than the maximum priority that the current iteration is
going to process, the polling record is skipped. Why is this?</p>
<p>In simple terms, this happens because each iteration of the main
loop finds out the highest priority among the sources that are ready
in the <code>prepare()</code> stage, before polling, and then only
those file descriptor sources with at least that priority are
polled. The idea behind this is to make sure that high-priority
sources are processed first, and that no file descriptor sources with
lower priority are polled in vain, as they shouldn&apos;t be
dispatched in the current iteration.</p>
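<p>The filtering step can be sketched roughly as follows. This is a
simplified model built on hypothetical <code>PollFD</code>
and <code>PollRec</code> types and a
hypothetical <code>fill_poll_array()</code> helper, not the real body
of <code>g_main_context_query()</code>. It only assumes what was said
above: the list is sorted by file descriptor, and a numerically
greater priority value means a less urgent source.</p>

```c
#include <stddef.h>

/* Simplified, hypothetical stand-ins for GPollFD and GPollRec. */
typedef struct {
  int fd;
  unsigned short events;
  unsigned short revents;
} PollFD;

typedef struct PollRec PollRec;
struct PollRec {
  PollFD *fd;
  PollRec *next;
  int priority;
};

/* Fill fds (capacity n_fds) from a list sorted by fd, skipping every
 * record whose priority value is greater than max_priority.
 * Consecutive records for the same fd are merged into one entry.
 * Returns the number of entries that were needed. */
int
fill_poll_array (const PollRec *records, int max_priority,
                 PollFD *fds, int n_fds)
{
  const PollRec *rec;
  const PollRec *last = NULL;
  int n_poll = 0;

  for (rec = records; rec != NULL; rec = rec->next)
    {
      /* Too low a priority for this iteration: not polled at all. */
      if (rec->priority > max_priority)
        continue;

      if (last != NULL && last->fd->fd == rec->fd->fd)
        {
          /* Same descriptor as the previous entry: merge the events. */
          if (n_poll - 1 < n_fds)
            fds[n_poll - 1].events |= rec->fd->events;
        }
      else
        {
          if (n_poll < n_fds)
            {
              fds[n_poll].fd = rec->fd->fd;
              fds[n_poll].events = rec->fd->events;
              fds[n_poll].revents = 0;
            }
          n_poll++;
        }
      last = rec;
    }

  return n_poll;
}
```

<p>With records for file descriptors 19 (priority 0), 30 and 34 (both
assumed to be at priority -60) and a maximum priority of -60, the
resulting array only holds entries for 30 and 34.</p>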

<p>GDB tells me that the maximum priority in this iteration is
-60. From an earlier GDB output, we also know that there&apos;s a
source for file descriptor 19 with priority 0.</p>

<pre><code class="c">(gdb) print *pollrec
$44 = {fd = 0x7369c8, prev = 0x0, next = 0x6f701560, priority = 0}
(gdb) print *pollrec->fd
$45 = {fd = 19, events = 1, revents = 0}</code></pre>

<p>Since 19 is lower than 30 and 34, we know that this record comes
before theirs in the linked list (and, as it happens, it&apos;s the
first one in the list too). But we know that, because its priority is
0, it is too low to be added to the file descriptor array to be
polled. Let&apos;s look at the loop again.</p>

<pre><code class="c">  pollrec = context->poll_records;
  i = 0;
  while (pollrec && i < n_fds)
    {
      while (pollrec && pollrec->fd->fd == fds[i].fd)
        {
          if (pollrec->priority <= max_priority)
            {
              pollrec->fd->revents =
                fds[i].revents & (pollrec->fd->events | G_IO_ERR | G_IO_HUP | G_IO_NVAL);
            }
          pollrec = pollrec->next;
        }

      i++;
    }</code></pre>

<p>The first polling record was skipped during the update of
the <code>GPollFD</code> array, so the condition <code>pollrec
&amp;&amp; pollrec-&gt;fd-&gt;fd == fds[i].fd</code> is never going to
be satisfied, because 19 is not in the array. The
innermost <code>while()</code> is not entered, and as such
the <code>pollrec</code> list pointer never moves forward to the next
record. So no polling record is updated here, even though we have
fresh <code>revents</code> information from the polling results.</p>
<p>What happens next should be easy to see. The <code>check()</code>
method of every polled source is called with
outdated <code>revents</code>. In the case of the source
for file descriptor 30, we wrongly tell it there&apos;s a
<code>G_IO_IN</code> condition, so it asks the main loop to
dispatch it, triggering a <code>wl_connection_read()</code> call on a
socket with no incoming data. For the source with file descriptor 34,
we tell it that there&apos;s no incoming data and
its <code>dispatch()</code> method is not invoked, even though on the
other side of the socket we have a client waiting for data to come and
blocking in the meantime. This explains what we see in
the <code>strace</code> output above. If the source with file
descriptor 19 continues to be ready and with its priority unchanged,
then this situation repeats in every further iteration of the main
loop, leading to a hang in the web process, which waits forever
for the UI process to read its socket pipe.</p>
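<p>The stall can be reproduced outside GLib with a small model of the
loop. The types, the <code>IO_*</code> constants and
the <code>update_revents_buggy()</code> name below are hypothetical
stand-ins, and the priority of -60 for the sources on file descriptors
30 and 34 is an assumption made for the sake of the example.</p>

```c
#include <stddef.h>

/* Hypothetical stand-ins for G_IO_IN and for the combined
 * G_IO_ERR | G_IO_HUP | G_IO_NVAL mask. */
#define IO_IN       1
#define IO_ERR_MASK (8 | 16 | 32)

/* Simplified, hypothetical stand-ins for GPollFD and GPollRec. */
typedef struct {
  int fd;
  unsigned short events;
  unsigned short revents;
} PollFD;

typedef struct PollRec PollRec;
struct PollRec {
  PollFD *fd;
  PollRec *next;
  int priority;
};

/* Same shape as the update loop shown above, on the simplified types. */
void
update_revents_buggy (PollRec *records, int max_priority,
                      const PollFD *fds, int n_fds)
{
  PollRec *pollrec = records;
  int i = 0;

  while (pollrec != NULL && i < n_fds)
    {
      /* A record whose fd is missing from fds never matches here, so
       * pollrec stays put while i runs to the end of the array. */
      while (pollrec != NULL && pollrec->fd->fd == fds[i].fd)
        {
          if (pollrec->priority <= max_priority)
            pollrec->fd->revents = (unsigned short)
              (fds[i].revents & (pollrec->fd->events | IO_ERR_MASK));
          pollrec = pollrec->next;
        }
      i++;
    }
}
```

<p>Feeding it a list with records for 19, 30 and 34 and an array that
only contains 30 and 34 leaves every <code>revents</code> field
exactly as it was before the call, stale <code>IO_IN</code> on 30
included.</p>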

<h3 id="the-bug-explained">The bug &ndash; explained</h3>
<p>I have been using GLib for a very long time, and I have only fixed
  a couple of minor bugs in it over the years. Very few actually,
  which is why it was very difficult for me to come to accept that I
  had found a bug in one of the most reliable and complex parts of the
  library. Impostor syndrome is a thing and it really gets in the way.</p>

<p>But in a nutshell, the bug in the GLib main loop is that the very
clever linear update of records is missing something very important:
it should skip to the first matching polling record before attempting
to update its <code>revents</code>. Without this, in the presence of a
file descriptor source with the lowest file descriptor identifier and
also a lower priority than the cutoff priority in the current main
loop iteration, <code>revents</code> in the polling records are not
updated and therefore the wrong sources can be dispatched. The
simplest patch to avoid this would look as follows.</p>

<pre><code class="diff">   i = 0;
   while (pollrec && i < n_fds)
     {
+      while (pollrec && pollrec->fd->fd != fds[i].fd)
+        pollrec = pollrec->next;
+
       while (pollrec && pollrec->fd->fd == fds[i].fd)
         {
           if (pollrec->priority <= max_priority)</code></pre>

<p>Once we find the first matching record, let&apos;s update all consecutive
records that also match and need an update, then let&apos;s skip to the
next record, rinse and repeat. With this two-line patch, the web
process was finally unlocked, the EGL display initialized properly,
the web extension and the web page were loaded, CI tests started
passing again, and this exhausted developer could finally put his mind
to rest.</p>
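<p>For illustration, here is the patched loop on a simplified model.
As before, the types, the <code>IO_*</code> constants and
the <code>update_revents_fixed()</code> name are hypothetical, and this
is a sketch of the idea rather than the code that was merged.</p>

```c
#include <stddef.h>

/* Hypothetical stand-ins for G_IO_IN and for the combined
 * G_IO_ERR | G_IO_HUP | G_IO_NVAL mask. */
#define IO_IN       1
#define IO_ERR_MASK (8 | 16 | 32)

/* Simplified, hypothetical stand-ins for GPollFD and GPollRec. */
typedef struct {
  int fd;
  unsigned short events;
  unsigned short revents;
} PollFD;

typedef struct PollRec PollRec;
struct PollRec {
  PollFD *fd;
  PollRec *next;
  int priority;
};

void
update_revents_fixed (PollRec *records, int max_priority,
                      const PollFD *fds, int n_fds)
{
  PollRec *pollrec = records;
  int i = 0;

  while (pollrec != NULL && i < n_fds)
    {
      /* Skip records that were left out of fds, for example because
       * their priority was too low for this iteration. */
      while (pollrec != NULL && pollrec->fd->fd != fds[i].fd)
        pollrec = pollrec->next;

      /* Update every consecutive record that shares this fd. */
      while (pollrec != NULL && pollrec->fd->fd == fds[i].fd)
        {
          if (pollrec->priority <= max_priority)
            pollrec->fd->revents = (unsigned short)
              (fds[i].revents & (pollrec->fd->events | IO_ERR_MASK));
          pollrec = pollrec->next;
        }
      i++;
    }
}
```

<p>The loop is still linear, since <code>pollrec</code> only ever moves
forward. With records for 19, 30 and 34 and an array holding only 30
and 34, the record for 30 now gets its stale <code>IO_IN</code>
cleared and the record for 34 receives the fresh one.</p>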
<p>A <a href="https://gitlab.gnome.org/GNOME/glib/-/commit/96ccf06d3da104d8706f0faec4e86d313e7cdbd9">complete
patch</a>, including improvements to the code comments around this
fascinating part of GLib and also a minimal test case reproducing the
bug, has already been <a href="https://gitlab.gnome.org/GNOME/glib/-/merge_requests/1713">reviewed</a> by the GLib maintainers and merged to
both stable and development branches. I expect that at
least <em>some</em> GLib sources will start being called in a
  different (but correct) order from now on, so keep an eye on your
GLib sources. :-)</p>
<h3 id="standing-on-the-shoulders-of-giants">Standing on the shoulders of giants</h3>
<p>At this point I should acknowledge that without the support from my
colleagues in the WebKit team in Igalia, getting to the bottom of this
problem would have probably been much harder and perhaps my sanity
would have been at stake. I want to
thank <a href="https://perezdecastro.org/">Adri&aacute;n</a>
and <a href="https://blogs.igalia.com/zdobersek/">&Zcaron;an</a> for
their input on Wayland, debugging techniques, and for allowing me to
bounce ideas and findings back and forth as I went deeper into this
rabbit hole, helping me to step out of dead-ends, reminding me to use
tools out of my everyday box, and ultimately, to be brave enough to
doubt GLib&apos;s correctness, something that much more often than not I
take for granted.</p>
<p>Thanks also to <a href="https://tecnocode.co.uk/">Philip</a>
and <a href="https://coaxion.net/blog/">Sebastian</a> for their
feedback and prompt code review!</p>
]]></description>
    </item>

    <item>
      <title>Thu 2016/Dec/15</title>
      <link>https://blogs.igalia.com/csaavedra/news-2016-12.html#D15</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2016-12.html#D15</guid>
      <pubDate>Thu, 15 Dec 2016 19:13:00 +0200</pubDate>
      <description><![CDATA[
	    <p>
		<a href="https://www.igalia.com">Igalia</a> is
		hiring. We're currently interested in <a
		href="https://www.igalia.com/nc/about-us/form/multimedia-developer/">Multimedia</a>
		and <a
		href="https://www.igalia.com/nc/about-us/form/chromium-developer/">Chromium</a>
		developers. Check the announcements for details on the
		positions and our company.
	    </p>
]]></description>
    </item>

    <item>
      <title>Mon 2016/Feb/08</title>
      <link>https://blogs.igalia.com/csaavedra/news-2016-02.html#D08</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2016-02.html#D08</guid>
      <pubDate>Mon, 08 Feb 2016 11:01:00 +0200</pubDate>
      <description><![CDATA[
	  <p>
	    About a year ago, Igalia was approached by the people
	    working on printing-related technologies in HP to see
	    whether we could give them a hand in their ongoing effort
	    to improve the printing experience on the web. They had
	    been working for a while on extensions for popular web
	    browsers that would allow users, for example, to strip a
	    web page of cruft and ads and format its relevant
	    contents in a way that would be pleasant to read in
	    print. While these extensions were working fine, they were
	    interested in exploring the possibility of adding this
	    feature to popular browsers, so that users wouldn't need
	    to be bothered with installing extensions to have an
	    improved printing experience.
	  </p>

	  <p>
	    That's how Alex, Martin, and I spent a few months
	    exploring the Chromium project and its printing
	    architecture. Soon enough we found out that the Chromium
	    developers had been working already on a feature that
	    would allow pages to be stripped of cruft and presented
	    in a sort of <em>reader mode</em>, at least in mobile
	    versions of the browser.  This is achieved through a
	    module called <a
	    href="https://github.com/chromium/dom-distiller">dom
	    distiller</a>, which basically has the ability to traverse
	    the DOM tree of a web page and return a clean DOM tree
	    with only the important contents of the page. This module
	    is based on the algorithms and heuristics in a project
	    called boilerpipe with some of it also coming from the now
	    popular Readability. Our goal, then, was to integrate the
	    DOM distiller with the modules in Chromium that take care
	    of generating the document that is then sent to both the
	    print preview and the printing service, as well as making
	    this feature available in the printing UI.
	  </p>

	  <p>
	    After a couple of months of work and thanks to the kind
	    code reviews of the folks at Google, we got the feature
	    landed in Chromium's repository. For a while, though, it
	    remained hidden behind a runtime flag, as the Chromium
	    team needed to make sure that things would work well
	    enough on all fronts before making it available to all
	    users. Fast-forward to last week, when I found out by
	    chance that the runtime flag has been flipped and the
	    <em>Simplify page</em> printing option has been available
	    in Chromium and Chrome for a while now, and it has even
	    reached the stable releases. The <em>reader mode</em>
	    feature in Chromium seems to remain hidden behind a
	    runtime flag, I think, which is interesting considering
	    that this was the original motivation behind the dom
	    distiller.
	  </p>

	  <p>
	    As a side note, it is worth mentioning that the
	    collaboration with HP was pretty neat and it's a good
	    example of the ways in which Igalia can help organizations
	    to improve the web experience of users. From the standards
	    that define the web to the browsers that people use in
	    their everyday life, there are plenty of areas in which
	    work needs to be done to make the web a more pleasant
	    place, for web developers and users alike. If your
	    organization relies on the web to reach its users, or to
	    enable them to make use of your technologies, chances are
	    that there are areas in which their experience can be
	    improved and that's one of the things we love doing.
	  </p>
]]></description>
    </item>

    <item>
      <title>Thu 2016/Feb/04</title>
      <link>https://blogs.igalia.com/csaavedra/news-2016-02.html#D04</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2016-02.html#D04</guid>
      <pubDate>Thu, 04 Feb 2016 14:53:00 +0200</pubDate>
      <description><![CDATA[
	  <p>
	    We've opened <a
	      href="https://www.igalia.com/nc/igalia-247/news/item/join-us-we-are-hiring/">a
	      few positions for developers in the fields of
	      multimedia, networking, and compilers</a>. I could say a
	      lot about why working at <a
	      href="https://www.igalia.com">Igalia</a> is way different
	      from working at your average tech company or start-up, but
	      I think the way it's summarized in the announcements is
	      pretty good. Have a look at them if you are curious and
	      don't hesitate to apply!
	  </p>
	  <ul>
	    <li><a href="https://www.igalia.com/about-us/form/multimedia-developer">Multimedia developer</a>.
	    </li>
	    <li><a href="https://www.igalia.com/about-us/form/networking-developer">Networking developer</a>.
	    </li>
	    <li><a href="https://www.igalia.com/about-us/form/networking-senior-developer">Senior networking developer</a>.
	    </li>
	    <li><a href="https://www.igalia.com/nc/about-us/form/compilers-programming-languages-developer">Compilers & programming languages developer</a>.
	    </li>
	</ul>
]]></description>
    </item>

    <item>
      <title>Fri 2015/Jul/10</title>
      <link>https://blogs.igalia.com/csaavedra/news-2015-07.html#D10</link>
      <guid>https://blogs.igalia.com/csaavedra/news-2015-07.html#D10</guid>
      <pubDate>Fri, 10 Jul 2015 10:26:00 +0300</pubDate>
      <description><![CDATA[
	  <p>
	    It's summer! That means that, if you are a student,
	    you could be one of our summer interns in Igalia this
	    season. We have two positions available: the first
	    related to WebKit work and the second to web
	    development. Both positions can be filled in either of
	    our locations in Galicia or you can work remotely from
	    wherever you prefer (plenty of us work remotely, so
	    you'll have to communicate with some of us via jabber
	    and email anyway).
	  </p>

	  <p>
	    Have a look at the <a
	      href="https://www.igalia.com/nc/igalia-247/news/item/announcing-igalias-summer-intern-positions/">announcement
	      on our web page</a> for more details, and don't hesitate
	      to contact me if you have any questions about the
	      internships!
	  </p>
]]></description>
    </item>
  </channel>
</rss>
