Bless the uncertainty – my (mostly technical) ramblings and hack-logs

W3C, Encrypted Media Extensions and DRM: A rant

The recent decision of the Director of the World Wide Web Consortium to overrule members’ objections to publishing a DRM standard for the Web, together with a series of rant I have read around about the issue; compelled me to write my own rant. What follows is, as usual, my own personal view on the subject and does not represent that of any of my partners or contractors.

The argumnent that the Web would become irrelevant if it didn’t support reproducing DRM-protected content, therefore some sort of DRM-supporting technology has to be standarized as part of the W3C, is two-fold flawed. Yet this is one of the core arguments used to defend supporting Encrypted Media Extension (EME), a technology that enables DRM, as a W3C standard.

On one hand, this is like saying that the Web is mostly useful for reproducing DRM-video from the big content distributors, downplaying the rest of the features that make the Web so important and empowering. It is actually the case that the usefulness of the open Web is very hard to replicate with (a collection of) non-Web applications; while reproducing DRM-protected video would be a tiny feature of the Web, in comparison.

Notice also, that this is problematic only for DRM-protected content, which accounts for only a fraction of the video content available on the Internet. Otherwise non-DRM video can perfectly be reproduced today with HTML5. While it is true that there is high user demand for content distributed by those same companies that push for DRM on the Web; there isn’t any requirement to use DRM to deliver that content other than supporting a very specific business practice that is only important for one party, the distributors, and ncessarily results in more control and freedom-limiting to users.

Is it the W3C’s mission to protect business models and practices (of dubidous effectiveness and well-known privacy and/or security issues)? No, it isn’t.

On the other hand, it is a fallacy that if a given DRM-enabling technology is not part of a W3C standard, it will necessarily bring fragmentation to the ecosystem because each player will (re)invent their own DRM technology. True, this is what has hapended with Flash and Silverlight, but arguably because there wasn’t a W3C standard. These technologies were invented many years before consuming video on the Web was popular or even possible, at a time when the value of avoiding fragmentation was not enough or not evident to implementors. These technologies existed before the big streaming corps of today existed, and were created for other purposes at a time when browsers were competing in features at the expense of interoperability (remember IE?).

The Industry players (content distributors, web implementors, etc) are free to agree on a common solution outside the umbrella of the W3C, because it is in their best interest to avoid fragmentation. It saves costs and makes for a better user experience. It is what industries do when they see opportunity for cutting down costs. Just don’t utilize the W3C (or better, the W3C should not let itself be utilized) to standarize a techonology that undermines the openness of the Web.

Browsers vendors are and will always be competing with each other. The W3C’s best position might just be to leave the market do what it is (supposedly) good at; and limit itself to advise against DRM, or failing that, at most provide guidelines on how to limit the damage should Web DRM happens in other forum (e.g, implement sandboxing, UX for user consent, make it opt-in, protect researchers, etc).

The reality is that DRM content distributors need the Web more than the Web needs to support DRM-content. It is just much cheaper for them if the W3C, a respected institution, solves the fragmentation for them.

By enshrining EME in its standards, the W3C is unnecessarily supporting a set of (corporate) interests at the expense of its users’ interests, giving undeserved legitimacy to DRM, based on an argument of pragmatism that doesn’t exist.

Going even further, it is arguable that the Web should support every feature that there is out there, if it bears a risk for its users in terms of privacy, security and freedom of research. Today, Web users don’t expect to do absolutely everything in a Web browser, even if that is desirable in the future. Most borwsers out there support configuring 3er party URL schemes that would launch a native application to handle the content pointed to by such URL. This might just be an acceptable trade-off for the Web ecosystem for specific types of content.

Also, the majority of content distributors have their own native players. In most cases, they promote the use of their native apps to users over the Web version of their players, because the experience is usually better and they can exert more control.

It might be that we have all been sold a straw man.

Example: Run a headless OpenGL (ES) compute shader via DRM render-nodes

It has been a long time indeed since my last entry here. But I have actually been quite busy on a new adventure: graphics driver development.

Two years ago I started contributing to Mesa, mostly to the Intel i965 backend, as a member of the Igalia Graphics Team. During this time I have been building my knowledge around GPU driver development, the different parts of the Linux graphics stack, its different projects, tools, and the developer communities around them. It is still the beginning of the journey, but I’m already very proud of the multiple contributions our team has made.

But today I want to start a series of articles discussing basic examples of how to do cool things with the Khronos APIs and the Linux graphics stack. It will be a collection of short and concise programs that achieve a concrete goal, pretty much what I wish existed when I looked myself up for them in the public domain.

If you, like me, are the kind of person that learns by doing, growing existing examples; then I hope you will find this series interesting, and also encouraging to write your own examples.

Before finishing this sort of introduction and before we enter into the matter, I want to leave my stance on what I consider a good (minimal) example program, because I think it will be useful for this and future examples, to help the reader understand what to expect and how are they different from similar code found online. That said, not all examples in the series will be minimal, though.

A minimal example should:

provide as minimum boilerplate as possible. E.g, wrapping the example in a C++ class is unnecessary noise unless your are demoing OOP.
be contained in a single source code file if possible. Following the code flow across multiple files adds mental overhead.
have none or as minimum dependencies as possible. The reader should not need to install stuff that are not strictly necessary to try the example.
not favor function over readability. It doesn’t matter if it is not a complete or well-behaving program if that adds stuff not related with the one thing being showcased.

Ok, now we are ready to move to the first example in this series: the simplest way to embed and run an OpenGL-ES compute shader in your C program on Linux, without the need of any GUI window or connection to the X (or Wayland) server. Actually, it isn’t even necessary to be running any window system at all. Also, the program doesn’t need any special privilege, other than access to the unprivileged DRI/DRM infrastructure. In modern distros, this access is typically granted by being member of the video group.

The code is fairly short, so I’m inline-ing it below. It can also be checked-up on my gpu-playground repository. See below for a quick explanation of its most interesting parts.

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES3/gl31.h>
#include <assert.h>
#include <fcntl.h>
#include <gbm.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* a dummy compute shader that does nothing */
#define COMPUTE_SHADER_SRC "          \
#version 310 es\n                                                       \
                                                                        \
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;       \
                                                                        \
void main(void) {                                                       \
   /* awesome compute code here */                                      \
}                                                                       \
"

int32_t
main (int32_t argc, char* argv[])
{
   bool res;

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);
   assert (fd > 0);

   struct gbm_device *gbm = gbm_create_device (fd);
   assert (gbm != NULL);

   /* setup EGL from the GBM device */
   EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);
   assert (egl_dpy != NULL);

   res = eglInitialize (egl_dpy, NULL, NULL);
   assert (res);

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_create_context") != NULL);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

   static const EGLint config_attribs[] = {
      EGL_RENDERABLE_TYPE, EGL_OPENGL_ES3_BIT_KHR,
      EGL_NONE
   };
   EGLConfig cfg;
   EGLint count;

   res = eglChooseConfig (egl_dpy, config_attribs, &cfg, 1, &count);
   assert (res);

   res = eglBindAPI (EGL_OPENGL_ES_API);
   assert (res);

   static const EGLint attribs[] = {
      EGL_CONTEXT_CLIENT_VERSION, 3,
      EGL_NONE
   };
   EGLContext core_ctx = eglCreateContext (egl_dpy,
                                           cfg,
                                           EGL_NO_CONTEXT,
                                           attribs);
   assert (core_ctx != EGL_NO_CONTEXT);

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);
   assert (res);

   /* setup a compute shader */
   GLuint compute_shader = glCreateShader (GL_COMPUTE_SHADER);
   assert (glGetError () == GL_NO_ERROR);

   const char *shader_source = COMPUTE_SHADER_SRC;
   glShaderSource (compute_shader, 1, &shader_source, NULL);
   assert (glGetError () == GL_NO_ERROR);

   glCompileShader (compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   GLuint shader_program = glCreateProgram ();

   glAttachShader (shader_program, compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   glLinkProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   glDeleteShader (compute_shader);

   glUseProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   /* dispatch computation */
   glDispatchCompute (1, 1, 1);
   assert (glGetError () == GL_NO_ERROR);

   printf ("Compute shader dispatched and finished successfully\n");

   /* free stuff */
   glDeleteProgram (shader_program);
   eglDestroyContext (egl_dpy, core_ctx);
   eglTerminate (egl_dpy);
   gbm_device_destroy (gbm);
   close (fd);

   return 0;
}

You have probably noticed that the program can be divided in 4 main parts:

Creating a GBM device from a render-node
Setting up a (surfaceless) EGL/OpenGL-ES context
Creating a compute shader program
Dispatching the compute shader

The first two parts are the most relevant to the purpose of this article, since they allow our program to run without requiring a window system. The rest is standard OpenGL code to setup and execute a compute shader, which is out of scope for this example.

Creating a GBM device from a render-node

What’s a render-node anyway? Render-nodes are the “new” DRM interface to access the unprivileged functions of a DRI-capable GPU. It is actually not new, only that this API was previously part of the single (privileged) interface exposed at /dev/dri/cardX. During Linux kernel 3.X series, the DRM driver started exposing the unprivileged part of its user-space API via the render-node interface, as a separate device file (/dev/dri/renderDXX). If you want to know more about render-nodes, there is a section about it in the Linux Kernel documentation and also a brief explanation on Wikipedia.

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);

The first step is to open the render-node file for reading and writing. On my system, it is exposed as /dev/dri/renderD128. It may be different on other systems, and any serious code would want to detect and select the appropiate render node file first. There can be more than one, one for each DRI-capable GPU in your system.

It is the render-node interface that ultimately allows us to use the GPU for computing only, from an unprivileged program. However, to be able to plug into this interface, we need to wrap it up with a Generic Buffer Manager (GBM) device, since this is the interface that Mesa (the EGL/OpenGL driver we are using) understands.

   struct gbm_device *gbm = gbm_create_device (fd);

That’s it. We now have a GBM device that is able to send commands to a GPU via its render-node interface. Next step is to setup an EGL and OpenGL context.

Setting up a (surfaceless) EGL/OpenGL-ES context

EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);

This is the most interesting line of code in this section. We get access to an EGL display using the previously created GBM device as the platform-specific handle. Notice the EGL_PLATFORM_GBM_MESA identifier, which is the Mesa specific enum to interface a GBM device.

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

Here we check that we got a valid EGL display that is able to create a surfaceless context.

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);

After setting up the EGL context and binding the OpenGL API we want (ES 3.1 in this case), we have the last line of code that requires attention. When activating the EGL/OpenGL-ES context, we want to make sure we specify EGL_NO_SURFACE, because that’s what we want right?

And that’s it. We now have an OpenGL context and we can start making OpenGL calls normally. Setting up a compute shader and dispatching it should be uncontroversial, so we leave it out of this article for simplicity.

What next?

Ok, with around 100 lines of code we were able to dispatch a (useless) compute shader program. Here are some basic ideas on how to grow this example to do something useful:

Dynamically detect and select among the different render-node interfaces available.
Add some input and output buffers to get data in and out of the compute shader execution.
Do some cool processing of the input data in the shader, then use the result back in the C program.
Add proper error checking, obviously.

In future entries, I would like to explore also minimal examples on how to do this using the Vulkan and OpenCL APIs. And also implement some cool routine in the shader that does something interesting.

Conclusion

With this example I tried to demonstrate how easy is to exploit the capabilities of modern GPUs for general purpose computing. Whether for off-screen image processing, computer vision, crypto, physics simulation, machine learning, etc; there are potentially many cases where off-loading work from the CPU to the GPU results in greater performance and/or energy efficiency.

Using a fairly modern Linux graphics stack, Mesa, and a DRI-capable GPU such as an Intel integrated graphics card; any C program or library can embed a compute shader today, and use the GPU as another computing unit.

Does your application have routines that could potentially be moved to an GLSL compute program? I would love to hear about it.

Drawing Web content with OpenGL (ES 3.0) instanced rendering

This is a follow up article about my ongoing research on Web content rendering using aggressive batching and merging of draw operations, together with OpenGL (ES 3.0) instanced rendering.

In a previous post, I discussed how relying on the Web engine’s layer tree to figure out non-overlapping content (layers) of a Web page, would (theoretically) allow an OpenGL based rasterizer to ignore the order of the drawing operations. This would allow the rasterizer to group together drawing of similar geometry and submit them efficiently to the GPU using instanced rendering.

I also presented some basic examples and comparisons of this technique with Skia, a popular 2D rasterizer, giving some hints on how much we can accelerate rendering if the overhead of the OpenGL API calls is reduced by using the instanced rendering technique.

However, this idea remained to be validated for real cases and in real hardware, specially because of the complexity and pressure imposed on shader programs, which now become responsible for de-referencing the attributes of each batched geometry and render them correctly.

Also, there are potential API changes in the rasterizer that could make this technique impractical to implement in any existing Web engine without significant changes in the rendering process.

To try keep this article short and focused, today I want to talk only about my latest experiments rendering some fairly complex Web elements using this technique; and leave the discussion about performance to future entries.

Everything is a rectangle

As mentioned in my previous article, almost everything in a Web page can be rendered with a rectangle primitive.

Web pages are mostly character glyphs, which today’s rasterizers normally draw by texture mapping a pre-rendered image of the glyph onto a rectangular area. Then you have boxes, images, shadows, lines, etc; which can all be drawn with a rectangle with the correct layout, transformation and/or texturing.

Primitives that are not rectangles are mostly seen in the element’s border specification, where you have borders with radius, and different styles: double, dotted, grooved, etc. There is a rich set of primitives coming from the combination of features in the borders spec alone.

There is also the Canvas 2D and SVG APIs, which are created specifically for arbitrary 2D content. The technique I’m discussing here purposely ignores these APIs and focuses on accelerating the rest.

In practice, however, these non-rectangular geometries account for just a tiny fraction of the typical rendering of a Web page, which allows me to effectively call them “exceptions”.

The approach I’m currently following assumes everything in a Web page is a rectangle, and all non-rectangular geometry is treated as exceptions and handled differently on shader code.

This means I no longer need to ignore the ordering problem since I always batch a rectangle for every single draw operation, and then render all rectangles in order. This introduces a dramatic change compared to the previous approach I discussed. Now I can (partially) implement this technique without changing the API of existing rasterizers. I say “partially” because to take full advantage of the performance gain, some API changes would be desired.

Drawing non-rectangular geometry using rectangles

So, how do we deal with these exceptions? Remember that we want to draw only with rectangles so that no operation could ever break our batch, if we want to take full advantage of the instanced rendering acceleration.

There are 3 ways of rendering non-rectangular geometry using rectangles:

1. Using a geometry shader:
This is the most elegant solution, and looks like it was designed for this case. But since it isn’t yet widely deployed, I will not make much emphasis on it here. But we need to follow its evolution closely.
2. Degenerating rectangles:
This is basically to turn a rectangle into a triangle by degenerating one of its vertices. Then, with a set of degenerated rectangles one could draw any arbitrary geometry as we do today with triangles.
3. Drawing geometry in the fragment shader:
This sounds like a bad idea, and it is definitely a bad idea! However, given the small and limited amount of cases that we need to consider, it can be feasible.

I’m currently experimenting with 3). You might ask why?, it looks like the worse option. The reason is that going for 2), degenerating rectangles, seems overkill at this point, lacking a deeper understanding of exactly what non-rectangle geometry we will ever need. Implementing a generic rectangle degeneration just for a few tiny set of cases would have been initially a bad choice and a waste of time.

So I decided to explore first the option of drawing these exceptions in the fragment shader and see how far I could go in terms of shader code complexity and performance (un)loss.

Next, I will show some examples of simple Web features rendered this way.

Experiments

The setup:

While my previous screen-casts were ran in my working laptop with a powerful Haswell GPU, one of my goals then was to focus on mobile devices. Hence, I started developing on an Arndale board I happen to have around. Details of the exact setup is out of the scope now, but I will just mention that the board is running a Linaro distribution with the official Mali T604 drivers by ARM.

Following is a video I ensambled to show the different examples running on the Arndale board (and my laptop at the same time). This time I had to record using an external camera instead of screen-casting to avoid interference with the performance, so please bear with my camera-on-hand video recording skills.

This video file is also available on Vimeo.

I won’t talk about performance now, since I plan to cover that in future deliveries. Enough to be said that the performance is pretty good, comparable to my laptop in most of the examples. Also, there are a lot of simple known optimizations that I have not done because I’m focusing on validating the method first.

One important thing to note is that when drawing is done in a fragment shader, you cannot benefit from multi-sampling anti-aliasing (MSAA), since sampling occurs at an earlier stage. Hence, you have to implement anti-aliasing your self. In this case, I implemented a simple distance-to-edge linear anti-aliasing, and to my surprise, the end result is much better than the MSAA with 8 samples I was trying on my Haswell laptop before, and it is also faster.

On a related note, I have found out that MSAA does not give me much when rendering character glyphs (the majority of content) since they come already anti-aliased by FreeType2. And MSAA will slow down the rendering of the entire scene for every single frame.

I continue to dump the code from this research into a personal repository on GitHub. Go take a look if you are interested in the prototyping of these experiments.

Conclusions and next steps

There is one important conclusion coming out from these experiments: The fact that the rasterizer is stateless makes it very inefficient to modify a single element in a scene.

By stateless I mean they do not keep semantic information about the elements being drawn. For example, lets say I draw a rectangle in one frame, and in the next frame I want to draw the same rectangle somewhere else on the canvas. I already have a batch with all the elements of the scene happily stored in a vertex buffer object on GPU memory, and the rectangle in question is there somewhere. If I could keep the offset where that rectangle is in the batch, I could modify its attributes without having to drop and re-submit the entire buffer.

The solution: Moving to a scene graph. Web engines already implement a scene graph but at a higher level. Here I’m talking about a scene graph in the rasterizer itself, where nodes keep the offset of their attributes in the batch (layout, transformation, color, etc); and when you modify any of these attributes, only the deltas are uploaded to the GPU, rather than the whole batch.

I believe a scene graph approach has the potential to open a whole new set of opportunities for acceleration, specially for transitions and animations, and scrolling.

And that’s exciting!

Apart from this, I also want to:

Benchmark! set up a platform for reliable benchmarking and perf comparison with Skia/Cairo.
Take a subset of this technique and test it in Skia, behind current API.
Validate the case of drawing drop shadows and multi-step gradient backgrounds.
Test in other different OpenGL ES 3.0 implementations (and more devices!).

Let us not forget the fight we are fighting: Web applications must be as fast as native. I truly think we can do it.

A possibly faster approach to OpenGL rasterization of 2D Web content

Even thought it has been a while since my last entry on this blog, I have been quite busy. During most of last year I brought my modest contributions into an awesome startup that you have probably heard of by now, helping them integrate GNOME technologies into their products. I was lucky to join their team at an early stage of the development and participate in key discussions.

But more on that project on future entries.

Today I want to talk about things that keep me busy these days, and are of course related to Web engines. Specifically, I want to talk about 2D rasterization and the process of putting pixels on the screen as fast as possible (aka, the 60 frames-per-second holy grail). I want to discuss an idea that has the potential to significantly increase the performance of page rendering by utilizing modern GPU capabilities, OpenGL, and a bit of help from Web engines’ magic.

This is a technical article, so if you are not very familiar with 2D rasterization, OpenGL or how Web engines draw stuff, I recommended you to take some time off and read about it. It is truly a world of wonders (and sometimes pain).

Instanced rendering

The core of the idea is based on instanced rendering. It is a fairly well known technique introduced by OpenGL 3.1 and OpenGL-ES 3.0 as extension GL_EXT_draw_instanced.

To draw geometry with OpenGL, one normally submits a primitive to the rendering pipeline. The primitive consists of a collection of vertices, and a number of attributes per each vertex. Traditionally, you could only submit one primitive at a time.

With instanced rendering, it is possible to send several “instances” of the same primitive (the same collection of vertices and attributes) on a single call. This dramatically reduces the overhead of pipeline state changes and gives the GPU driver a better chance at optimizing rendering of instances of a particular geometry.

Hence, it is generally a common practice for OpenGL applications to group rendering of similar geometry into batches, and submit them to the pipeline all at once as instances. This technique is commonly known as batching and merging.

Skia, the 2D rasterizer used by the Chromium and Android projects, and Cairo, a popular 2D rasterizer backing many projects such as GNOME and previous versions of Mozilla Firefox; both to some extent have support for some sort of instanced rendering in their respective GL backends.

Telling instances apart

Ok, it is possible to draw a bunch of primitives at once, but how can we make them look different? A way of customizing individual instances is necessary, otherwise they will all render on top of the previous one. Not very useful.

There are two ways of submitting per-instance information: one is by adding a “divisor” to the buffers containing vertex or attribute information (VBOs), which will tell the pipeline to use the divided chunks as per-instance information instead of per-vertex. glVertexAttribDivisor is used in this case.

The other way is to upload the per-instance information to a buffer texture (or any texture for that matter) and fetch the information of the corresponding vertex by sampling, using a new variable gl_InstanceID available in shader code, as the key. This variable will increase for each instance of the geometry being rendered (as oppose to per vertex, for which you have gl_VertexID).

So, a quick recap so far. We are able to draw several instances of the same geometry at once, very efficiently, and are able to upload arbitrary data to customize each of these instances at will.

But wait, there are caveats.

The ordering problem

So, lets say we can now group together all drawing operations that involve the same primitive (rectangle, line, circle, etc). What happens if we draw (say) a filled rectangle, then a circle on top, and then another rectangle on top of the circle?

Following the simple grouping rule, what will happen is that the two rectangles will be grouped together and drawn first in one call, then the circle. This will not render the expected result, since the circle will end up laying on top of the rectangle that was drawn after it.

This problem is commonly known as “ordering”, and it clearly breaks our otherwise super-performing batching and merging case.

So, in scenes that involve lots of geometry overlapping, the grouping is limited to contents that do not overlap, if we wanted to preserve the right order of operations.

In practice, it means that we first need to separate the content in layers, then group the same primitives within a single layer, and finally submit the batches from each layer in the right order.

But guess what? Browser engines already do exactly that. Engines build a layer tree (among several other trees) with the information contained in the HTML and CSS content (layout, styling, transformations, etc), where the content is separated in render nodes whose content do not normally overlap. The actual process is much more complicated than that, but this simplification is enough to illustrate the idea.

Now, what if?

First, for the sake of presenting an idea, lets ignore the 2D context of a canvas element by now. More on that later. Lets focus on most of the web sites out there.

If we look at the number of primitives typically used by the rendering of a page, they boil down to a handful. Essentially, they are:

A rectangle: for almost all HTML elements, which are boxes. And character glyphs! which are normally rendered ahead of time and cached in a texture layout, then texture-mapped onto a rectangle. And images!, which are also texture-mapped onto rectangles.
A thin line: for thin (<=1 pixel) borders, inset/outset effect, hr, etc. Thicker borders can be drawn as thin rectangles.
A round corner: the quarter of a circle, filled or stroke, used to implement rounded rectangles (hello border-radius).
A circle: for bulleted listings, radio-buttons, etc. Argueably, these can be rendered using unicode characters, so no need for specific geometry.

Lets stay with these. There are other cases that I will purposely ignore, like one seen in a rounded rectangle with different thickness in two consecutive borders.

Then we have, for each of these primitives, an evolutionary-like variety of background styles (imaged, colored, repeated, gradient, etc); transformations (rotation, translation, scaling, etc); border styles (again imaged, colored, with different thickness, etc), shadow and blurring effects, and so on.

With a working texture cache, we have a potentially good chance at aggressively grouping together drawing of these primitives, like rectangles for example, for all text glyphs, boxes and images.

So, what if we could submit to a smart shader all the information that describes and tells apart these grouped instances? Is it possible to efficiently pack and then re-interpret in a shader all the styling and transformation complexities of today's CSS-styled HTML elements?

A new approach

Existing 2D rasterizers used in Web engines (at least Skia and Cairo, whose source code is available to me) are general purpose drawing libraries. That means they should render deterministically for any kind of application, not only Web engines. Specifically, they need to avoid the ordering problem explained above, where the result of a set of overlapped drawing operations is different if you change their order.

There are several reasons why modern Web engines use general purpose 2D rasterizers (as opposed to rasterizers written specifically for the needs of Web content rendering). One clear reason is that they existed before (in the case of Cairo at least) as a generic 2D graphics library, and was later used for Web rendering. Other reason is that the implementation of the Canvas 2D spec requires a general purpose 2D API, because that's what it is. And there is a clear benefit in reusing your beautifully optimized Canvas 2D implementation to draw the rest of the Web contents. Also, these libraries evolved from a pixmap (image) backed rendering target, into libraries exploiting the hardware-acceleration of GPU cards. Both libraries now feature an OpenGL(ES) backend that is somehow forced to comply with the previously existing API and behaviors.

But that is sub-optimal for Web engines that simply want to draw non-overlapping content into layers, then draw the layers in order. And even though batching and merging do occur in the GL backends today, it is apparently far from optimal as we will see later.

So, if we completely ignore the ordering problem for the case of Web engines drawing already layered nodes onto an OpenGL based render target, we might be able to aggressively group together potentially all the operations that share the same primitive.

This is of course if, as mentioned above, we are able to describe the particularities of each instance of these primitives, hand them down to a smart shader for rendering, then do all that efficiently so that the performance gained in batching is not lost by uploading tons of instance information to the GPU or running heavy shader code.

It is unclear (to me) whether this is at all possible. That's is why this approach is just an idea yet lacking validation for the real world. But it is a research that could potentially boost performance of Web content rendering.

It shares some similarities with (and was partially inspired by) the way Android does font rendering.

A proof of concept

So, I was set up to write a proof of concept trying to validate or discard the idea as quickly as possible. The purpose is to write the minimum code that would allow meaningful comparison between this approach and exiting rasterizers (Skia being my first target for this), for specific use cases that are relevant to generic Web content rendering (not Canvas 2D).

My proof of concept is being developed at: https://github.com/elima/glr

So far, it just provides a few primitives: rectangle and rounded corner, allowing for 3 basic drawing operations: rectangle (filled or stroked), rounded rectangle (only filled) and character glyph (not text, just single characters).

Then each element drawn can be transformed (scaled, rotated and/or translated), laid-out on the canvas (top, left, width and height), and has a color or texture.

Anti-aliasing is achieved by multisampling with 8 samples per pixel. Character glyphs are not anti-aliased, that was too complex to put in a proof of concept and it is a problem already solved by others anyway. I used the simplest possible path to put a pre-cached glyph on the screen, and for that wrote a super naive texture cache, and used FreeType2 for rasterizing the glyphs.
The idea of including chars was to explore if text glyphs, which accounts for most of typical Web page's content, could be batched together with all others drawings that use a rectangle primitive.

Note should be taken that this proof-of-concept is not intended to become a new library. It is just a vehicle to validate an idea by providing the minimum implementation needed to test its limits. Eventually, all this would have to be implemented in existing libraries. I just happen to be very fluent at glib and C :), as to prototype fast.

Comparisons

Before we jump into FPS excitements, lets clarify that any comparison here should be taken with a grain of salt. 2D rasterizers are complex libraries that do a lot of non-trivial things like anti-aliasing, sub-pixel alignment, color space conversion, adaptation to the specifics of the underlying hardware/driver/GL-version combos, to name just a few.

Thus, any comparison should be put in the context of what code paths are being selected, what rendering operations are being grouped, and when and why they aren't; how many GL operations are submitted to the pipeline to render the same scene, etc.

I have included 3 initial examples that try to illustrate how batching and merging of "compatible" draws (sharing the same underlying primitive) improves performance when ordering is ignored, while at the same time each element can have its own color, layout and transformation. For each example, I have written a Skia counterpart that tries to render exactly the same, to the extent possible, for the sake of comparing.

The data below corresponds to runs in my laptop, which is a Thinkpad T440p running Debian GNU/Linux, has an integrated Intel(tm) GPU (4th gen), and the OpenGL driver is provided by Mesa 10.2.1.

I used apitrace to look at what GL commands are actually sent to the driver.

Lets start with the RectsAndText example. It basically draws a lot of alternating filled rectangles and character glyphs, each with its own color, transformation and layout. In the screencast below, both examples (Skia and glr) are running at the same time. This of course does not reflect real performance since both compete for GPU resources, but I decided to record it this way because the improvement is much better noticed. The frames-per-second decrease proportionally for both examples when run at the same time, so it remains relevant for comparison.

The window in the left corresponds to the Skia example, and the right to glr. The same goes for all screencasts below.

This video file is also available for download.

The difference is considerable. Skia performs at an average of 6-7 FPS while the new approach gives around 40 FPS. That’s a 5x-6x improvement, not bad. Also, notice that CPU usage is considerably higher in the case of Skia.

The interesting thing here is that in the case of glr, all elements are batched together (both rectangles and chars), so only one drawing operation is actually submitted to the pipeline, as you can see in the available apitrace dump. A trace for the corresponding Skia example is also available.

apitrace output for RectAndText glr example

apitrace output for RectAndText Skia example

The next example is Rects, which is similar but renders only rectangles, alternating between filled and stroked. The interesting bit is that in the case of glr, each style of rectangles is drawn onto one different layer, each layer operating on its own separate thread; demonstrating that parallelism is now possible during batching.

This video file is available for download.

In this example, the performance difference is even higher. glr is around 8x faster. Again, apitrace traces for the glr example and the Skia version are available. This time glr submits a total of 2 instanced drawing operations, one for filled rects and one for stroked.

apitrace output for Rects glr example

apitrace output for Rects Skia example

The last example draws several layers of non-overlapping rounded rectangles. As with previous examples, every element is given a unique layout, color and transformation. This example tries to illustrate that because batching operates only at layer level, the more layers you have the less you benefit from this technique. In this particular example, the gap is reduced considerably. In fact it looks like Skia is faster by a few FPS, but it is actually not true. When both examples are run together, Skia is faster, but if run separately, glr example is faster (though not much). I’m still figuring this out.

This video file is available for download.

And the traces for the glr example and the Skia example.

apitrace output for Layers glr example

apitrace output for Layers Skia example

If you are curious about the implementation, take a look at where most of the magic happens: GlrContext, GlrCanvas and GlrBatch objects, and the vertex shader. The rest of the code is mostly API and glue to provide a coherent way to use this approach. Specifically, an abstract concept of "layer" is introduced. The workflow goes this way:

For initialization, a context, a rendering target and a canvas object are created. This is similar to how other 2D libraries work.
In the rendering loop and for each frame, the canvas is first notified that a new frame will be rendered.
Then any number of layer objects are created and attached to the canvas. The drawing API works against a layer (instead of a canvas), and will group and batch all the drawing operations in internal commands. When drawing to a layer finishes, the layer is notified that it is ready.
Finally, the canvas is requested to finish the frame, right before swapping buffers. This call will wait for all the attached layers to finish (blocking if needed). Once all complete, the canvas will take the batched commands from each layer, in the order they were attached, and submit them to the pipeline for rendering.

One thing to remark is that layers are self-contained stateful objects, and can survive frames without needing to redraw.

Other benefits

One by-product derived from the fact that layers cache drawing operations in internal commands (which in turn use locally allocated buffers), is that layers now become data-parallel. This is a term rarely used in the context of OpenGL because as you probably know, the way its API is designed makes it a giant state machine, making any parallelization unpractical.

With this approach, layers can be drawn in separated threads (or fully moved to OpenCL), which can bring extra performance if there are several complex layers that need drawing at the same time.

Another potential extra benefit comes from the fact that the canvas renders to a target that is actually a framebuffer backed up by a multisample texture. This means we can use any previously rendered frame as a texture, the same way it currently works in both Chromium and Webkit, where layers are texture-mapped then composited into the final scene.

So, we have the flexibility that, if a particular layer is too complex or slow to draw, we can attach it alone to a canvas, render it, and use the texture as with the current model. But, if we are short on texture memory, it is possible to keep commands batched in layers and render them on every frame. This is kind of similar to what Chromium does, recording draw operations into an SkPicture and then re-playing back when needed.

Future work

This is an approach that needs validation for a number of real world use cases before it can be even considered for testing on a Web engine. It is key to explore how complex information (for example, multi-step gradient backgrounds, or complex border styling with rounded rectangles) can be passed to the shaders and rendered correctly and efficiently. Also, there are shadows and blurring effects, all parametrized to cover the most creative use cases, that also need verification against this model.

Basically, we need to understand the limits of the approach by trying to implement modern W3C specs, selecting the most complex features first.

Other important priorities are:

Understand how much workload can be imposed on shaders side before the gained performance starts to degrade.
Test on OpenGL-ES and constrained GPU on embedded ARM, to detect the minimum requirements.
Figure out how to implement a mid-frame flushing mechanism when texture cache exhausts or command buffers get too large. This is not trivial, since to flush a layer (that is possibly running in a separate thread) it has to be blocked, then the canvas has to wait for all layers below it to finish and then execute their commands, then signal the blocked layer to continue.
Try how scrolling would behave if previously batched layers are drawn for every frame, instead of using current scrolling techniques that rely on rolling big textures, or moving several tiles up and down. These techniques impose either great pressure on texture memory, or a lot of complexity on tile management (or both), specially in the context of new super-high resolution screens.

Conclusions and final words

I have tried to detail an idea that although not new, I believe has not been explored in full in the context of Web engines. It relies on two essential hypothesis:

That it is possible to batch not only geometry, but the complex attributes of arbitrarily styled HTML elements, and render that geometry as instances using shader code.
It is safe to ignore ordering of draw operations during rasterization phase, and leverage on Web engine’s layer tree to solve overlapping.

Modern GPUs and OpenGL APIs have great potential for optimizing 2D rasterization, but as it happens most of the times, there is no one solution to fit all. Instead, each particular application and use case requires a different set of strategies and trade-offs for optimum performance.

This approach, even if valid for a sufficient number of use cases is unlikely to go faster than existing approaches for all tests cases. Even less replace these approaches. This is pretty clear in the case of canvas 2D for example, which will continue to require a general purpose rasterizer. But if there is a sufficient number of use cases that would benefit from this approach to some degree, then maintaining one code path that enables it will already be a win.

Finally, I want to thank Samsung SRA for partially sponsoring the time I dedicated to pursue this idea, and also Igalia and igalians which are always there to back me up and help me move forward.

Now, is there anyone interested in helping me explore this idea further?

Introducing gocl, a gobject wrapper to OpenCL

For the past few months I have been working on this project to bring OpenCL closer to GNOME technologies, and today I’m glad to make the first public announcement. For the uninformed reader, OpenCL is a framework and language for writing programs that execute across heterogeneous HW pieces like CPUs, GPUs, DSPs, etc. While not applicable to any piece of software, OpenCL can unleash unparalleled performance and power efficiency on specific heavy algorithms like media decoding, cryptography, computer vision, big data indexing and processing, physics simulation, graphics, image compositing, among others.

Gocl is a GLib/GObject based library that aims at simplifying the use of OpenCL in GNOME software. It is intended to be a lightweight wrapper that adapts OpenCL programming patterns and boilerplate, and expose a simpler API that is known and comfortable to GNOME developers. Examples of such adaptations are the integration with GLib’s main loop, exposing non-blocking APIs, GError based error reporting and full gobject-introspection support. It will also be including convenient API to simplify code for the most common use patterns.

Gocl started as part of the work and research we do at Igalia on HW acceleration, that I decided to take a bit of, clean it up and release it in a way that can be useful to others. OpenCL is gaining relevance and popularity since the number of implementations and supported chips have grown significantly in recent years. Soon we are going to see OpenCL running anywhere and GNOME technologies should be ready to take advantage of it.

Full gtk-doc documentation is available, and source code is hosted at my GitHub account.

The API is very simple and limited at this stage, and should be considered very unstable. Although I’m not currently working on it full time, I do have kind of a roadmap for the API and features that I will prioritize:

Completing the missing asynchronous API
Adding API to query available OpencL extensions
Provide API to expose cl_khr_gl_sharing extension, for object sharing with OpenGL

You are welcome to suggest/request features that you would like to see in Gocl, as well as propose changes on the API. The GitHub issue tracking at project’s page is available for that, and also to report bugs.

So, do you know of a specific piece of software in GNOME that could potentially benefit from OpenCL? I would love to hear about it.

At Igalia, as part of our strong commitment to make the Web better and faster, we are already looking into ways of applying OpenCL to WebKit and its related technologies, and I’m personally interested on that line of work.

SHA-512 hashing support in glib

Always feels good to close old bugs, even if done unintentionally. In one of the projects I’m working on, I ran into the lack of SHA-512 support in glib and decided to step in. It turned out that such support was requested in a bug reported 3 years ago.

Whatever the reasons, the good news is that the next release of glib will ship with support for SHA-512 hashing in GChecksum. The implementation strictly follows the FIPS-180-2 standard.

Thanks to Emmanuele Bassi for reviewing my patch, Julian Andres Klode for merging it, and Igalia for sponsoring my dedication.

Going to JSConf.eu 2011

Back in 2009 I had the chance to attend the european edition of the Javascript Conference for the first time. It was a nice and intense learning experience (it runs only for two fully packed days). This event is the counterpart of the US edition however it gathers a wide and heterogeneous Javascript community from around the world and not just Europe.

And guess what, after missing my ticket last year, I’m attending again this year’s edition sponsored by igalia. I will be giving a talk titled “Javascript, the GNOME way” in which I will discuss the relationship between GNOME 3 and Javascript, the technologies behind it and how to get started writing JS in the “GNOME way”. My goals with this presentation are 1) to communicate to the wider Javascript audience about the awesomeness of using the GNOME libraries in JS and 2) to try bridging the two communities in subjects that matters to both, to ultimately foster collaboration and alignment between them.

I’m sure this year edition will be as cool as the others, and I look forward to absorb again all that knowledge, ideas, enthusiasm and yeah, the Berlin’s autumn breeze too.

See you there.

FileTea, low friction anonymous file sharing

Let me present you FileTea, a project enabling anonymous file sharing for the Web. It is designed to be simple and easy to use, to run in (modern) browsers without additional plugins, and to avoid the hassle of user registration.

Works like this:

Mary wants to share a file with Jane. She opens a browser and navigates to a FileTea service.
Mary drag-and-drops the file (or files) she wants to send into the page and copies the short url generated for each file.
Mary sends the url to Jane by instant messaging, e-mail, SMS or just posts it somewhere.
Jane receives the url, opens it in her browser and downloads the file.

A reference deployment is running at http://filetea.me. It is alpha version and is kindly hosted by Igalia (thanks!). Please, feel free to try it (and provide feedback!).

FileTea is free software and released under the terms of the GNU AGPL license. That means that you can install it in your own server and host it yourself for your organization, your business or your friends. The source code is hosted at Gitorious.

We are not alone

There are similar services running out there like http://min.us, http://ge.tt, http://imm.io, and maybe others I’m not aware of. But FileTea is different in three important aspects:

FileTea is free software (including server side), meaning development is open to the community, and you can run your own instance and make it better.
FileTea does not store any file in the server. It just synchronizes and bridges an upload from the seeder with a download to the leacher.
FileTea sets no limit to the size of shared files.

Features, my friend

Currently, FileTea has the bare minimum features to allow file-sharing, but I have a long list of possibly cool stuff to add as I find free time to work on them. Some hot topics are:

Global and per-transfer bandwidth limitation (up and down).
Proper thumbnails for the shared files.
Bulk sharing: the ability to share a group of files under a single url and download a zip or tar ball with all files together.

If you have more ideas, I would be happy to hear them.

So that’s it, happy sharing!

See you at Desktop Summit 2011

Although my talk “The Web jumps into D-Bus” was not accepted for this year edition of the Desktop Summit, I’m still attending the event thank to my employer Igalia which apart from sponsoring the conference, is sponsoring my attendance together with a bunch of other igalians as well. Like two years ago, this edition is special because we have the GUADEC and Akademy events co-located under the free Desktop Summit umbrella. This is a great opportunity for communication, coordination, sharing and synergy among the different free desktop environments, their projects, hackers and communities.

As usual, I will be hanging around closer to where GNOME folks gather, but this year that could not always be true since I have a special interest on Freedesktop.org and Web technologies, and that means I might be seen anywhere around . Apart from that, of course I will be hopping on the security and introspection-related sessions as usual; and I also want to get closer to the GNU community that usually gathers during these days. I hope there will be interesting discussions around the Freedombox project.

So, if you are in Berlin next week and want to chat about EventDance, the free desktop, free software, technology or life in general, find me there and be welcomed.

EventDance 0.1.4 released

During Easter holidays, I finally managed to find time to close EventDance 0.1.3 development cycle and release 0.1.4. This milestone took more than expected for several reasons, mainly due to some last minute API changes I had to introduce and a couple of features I couldn’t resist to implement earlier. The result is a long changelog that I will try to summarize:

Basic API for asymmetric (public-key) cryptography
EvdPkiPubkey and EvdPkiPrivkey classes provide abstraction for PKI public and private key respectively. They basically are asynchronous, GIO-friendly wrappers for libgcrypt PK functions. There is also API for asynchronous key-pair generation. By now, only encryption/decryption using RSA algorithm is supported.
Basic API for symmetric cryptography
EvdTlsCipher also provides an asynchronous, GIO-friendly wrapper for libcrypt symmetric crypto API, adding some nice features like data auto-padding and key aligning built right in. Not all algorithms supported by libgcrypt are available but only the most popular (e.g, AES 128/192/256, ARCFOUR).
SNI and lazy certificate selection for TLS credentials
Server Name Indication is a SSL/TLS extension that permits a client to request the domain name before the certificate is committed to the server. This feature is available in GnuTLS and is now exported to EvdTlsSession. Also, EvdTlsCredentials added a callback to select the certificate to send to the peer during the TLS handshake. The combination of these two features is critical to implement an SSL/TLS capable reverse Web proxy. I’m seriously considering to include one such proxy inside EventDance, that would export a D-Bus API over the system bus to allow applications to easily add/remove virtual hosts and server backends on-the-fly.
Websockets mechanism into EvdWebTransport
Now the web transport negotiates mechanism with the browser during handshake and uses websockets if supported, otherwise falls back to long-polling. Only version 76 (hybi-00) of the spec is implemented so far.
EvdDBusBridge
A component to connect a web page to a D-Bus message bus running in the server, allowing client-side Web applications to proxy/export objects and acquire bus names. Check my previous post introducing this feature for details
EvdJsonrpc
An asynchronous, GIO-friendly implementation of the JSON-RPC protocol version 1.0, specifically designed to work well with EventDance transports.
EvdDaemon
An abstraction for any program that runs as a service daemon. The purpose is that if you are implementing a daemon, you just use an EvdDaemon instance and automagically get an event-loop (GMainLoop), pid-file management, syslog-based logging, daemonizing (console detaching) and clean program termination. The pid-file and syslog functionalities are still on the way though.
EvdDBusDaemon
A component that launches a custom D-Bus message bus and tracks its execution. This is useful when an application needs to use a custom message bus instead of the well-known ones; for security or sandboxing reasons.

Also, as usual, lots of bugfixes and random improvements. A dependency on json-glib was added too.

Now and for the next weeks, I’m running a documentation and annotations sprint, something I have delayed too much already. I will also write a couple of basic tutorials on how to build and use EventDance. Stay tuned.