Example: Run a headless OpenGL (ES) compute shader via DRM render-nodes

It has been a long time indeed since my last entry here. But I have actually been quite busy on a new adventure: graphics driver development.

Two years ago I started contributing to Mesa, mostly to the Intel i965 backend, as a member of the Igalia Graphics Team. During this time I have been building my knowledge around GPU driver development, the different parts of the Linux graphics stack, its different projects, tools, and the developer communities around them. It is still the beginning of the journey, but I’m already very proud of the multiple contributions our team has made.

But today I want to start a series of articles discussing basic examples of how to do cool things with the Khronos APIs and the Linux graphics stack. It will be a collection of short and concise programs that achieve a concrete goal; pretty much what I wish had existed when I went looking for such examples in the public domain.

If you, like me, are the kind of person who learns by doing and by growing existing examples, then I hope you will find this series interesting, and perhaps encouraging enough to write your own examples.

Before wrapping up this introduction and getting into the matter, I want to state what I consider a good (minimal) example program. I think this will be useful for this and future examples, to help the reader understand what to expect and how they differ from similar code found online. That said, not all examples in the series will be minimal.

A minimal example should:

  • provide as little boilerplate as possible. E.g., wrapping the example in a C++ class is unnecessary noise unless you are demoing OOP.
  • be contained in a single source code file if possible. Following the code flow across multiple files adds mental overhead.
  • have no dependencies, or as few as possible. The reader should not need to install anything that is not strictly necessary to try the example.
  • not favor function over readability. It doesn’t matter if it is not a complete or well-behaved program, if being one would mean adding stuff unrelated to the one thing being showcased.

Ok, now we are ready to move to the first example in this series: the simplest way to embed and run an OpenGL-ES compute shader in your C program on Linux, without the need for any GUI window or connection to the X (or Wayland) server. Actually, it isn’t even necessary to have any window system running at all. Also, the program doesn’t need any special privileges, other than access to the unprivileged DRI/DRM infrastructure. In modern distros, this access is typically granted by being a member of the video group.

The code is fairly short, so I’m inlining it below. It can also be checked out in my gpu-playground repository. See below for a quick explanation of its most interesting parts.

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES3/gl31.h>
#include <assert.h>
#include <fcntl.h>
#include <gbm.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* a dummy compute shader that does nothing */
#define COMPUTE_SHADER_SRC "          \
#version 310 es\n                                                       \
                                                                        \
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;       \
                                                                        \
void main(void) {                                                       \
   /* awesome compute code here */                                      \
}                                                                       \
"

int32_t
main (int32_t argc, char* argv[])
{
   bool res;

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);
   assert (fd >= 0);

   struct gbm_device *gbm = gbm_create_device (fd);
   assert (gbm != NULL);

   /* setup EGL from the GBM device */
   EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);
   assert (egl_dpy != NULL);

   res = eglInitialize (egl_dpy, NULL, NULL);
   assert (res);

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_create_context") != NULL);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

   static const EGLint config_attribs[] = {
      EGL_RENDERABLE_TYPE, EGL_OPENGL_ES3_BIT_KHR,
      EGL_NONE
   };
   EGLConfig cfg;
   EGLint count;

   res = eglChooseConfig (egl_dpy, config_attribs, &cfg, 1, &count);
   assert (res);

   res = eglBindAPI (EGL_OPENGL_ES_API);
   assert (res);

   static const EGLint attribs[] = {
      EGL_CONTEXT_CLIENT_VERSION, 3,
      EGL_NONE
   };
   EGLContext core_ctx = eglCreateContext (egl_dpy,
                                           cfg,
                                           EGL_NO_CONTEXT,
                                           attribs);
   assert (core_ctx != EGL_NO_CONTEXT);

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);
   assert (res);

   /* setup a compute shader */
   GLuint compute_shader = glCreateShader (GL_COMPUTE_SHADER);
   assert (glGetError () == GL_NO_ERROR);

   const char *shader_source = COMPUTE_SHADER_SRC;
   glShaderSource (compute_shader, 1, &shader_source, NULL);
   assert (glGetError () == GL_NO_ERROR);

   glCompileShader (compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   GLuint shader_program = glCreateProgram ();

   glAttachShader (shader_program, compute_shader);
   assert (glGetError () == GL_NO_ERROR);

   glLinkProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   glDeleteShader (compute_shader);

   glUseProgram (shader_program);
   assert (glGetError () == GL_NO_ERROR);

   /* dispatch computation */
   glDispatchCompute (1, 1, 1);
   assert (glGetError () == GL_NO_ERROR);

   printf ("Compute shader dispatched and finished successfully\n");

   /* free stuff */
   glDeleteProgram (shader_program);
   eglDestroyContext (egl_dpy, core_ctx);
   eglTerminate (egl_dpy);
   gbm_device_destroy (gbm);
   close (fd);

   return 0;
}

You have probably noticed that the program can be divided into 4 main parts:

  1. Creating a GBM device from a render-node
  2. Setting up a (surfaceless) EGL/OpenGL-ES context
  3. Creating a compute shader program
  4. Dispatching the compute shader

The first two parts are the most relevant to the purpose of this article, since they allow our program to run without requiring a window system. The rest is standard OpenGL code to setup and execute a compute shader, which is out of scope for this example.

Creating a GBM device from a render-node

What’s a render-node anyway? Render-nodes are the “new” DRM interface for accessing the unprivileged functions of a DRI-capable GPU. The functionality is actually not new; this API was previously part of the single (privileged) interface exposed at /dev/dri/cardX. During the Linux kernel 3.x series, DRM drivers started exposing the unprivileged part of their user-space API via the render-node interface, as a separate device file (/dev/dri/renderDXX). If you want to know more about render-nodes, there is a section about them in the Linux Kernel documentation and also a brief explanation on Wikipedia.

   int32_t fd = open ("/dev/dri/renderD128", O_RDWR);

The first step is to open the render-node file for reading and writing. On my system, it is exposed as /dev/dri/renderD128. It may be different on other systems, and any serious code would want to detect and select the appropriate render-node file first. There can be more than one: one for each DRI-capable GPU in your system.
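
Just as an idea of how that detection could be done (this is only a sketch, not part of the minimal example above; it adds a dependency on libdrm and reuses the headers already included in the listing), the available render-nodes could be enumerated with drmGetDevices2 and the first one opened:

   #include <xf86drm.h>

   /* Sketch: open the first available render-node found via libdrm */
   static int
   open_first_render_node (void)
   {
      drmDevicePtr devices[8];
      int fd = -1;

      int num = drmGetDevices2 (0, devices, 8);
      if (num <= 0)
         return -1;

      for (int i = 0; i < num; i++) {
         if (devices[i]->available_nodes & (1 << DRM_NODE_RENDER)) {
            fd = open (devices[i]->nodes[DRM_NODE_RENDER], O_RDWR);
            if (fd >= 0)
               break;
         }
      }

      drmFreeDevices (devices, num);
      return fd;
   }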

It is the render-node interface that ultimately allows us to use the GPU for computing only, from an unprivileged program. However, to be able to plug into this interface, we need to wrap it up with a Generic Buffer Manager (GBM) device, since this is the interface that Mesa (the EGL/OpenGL driver we are using) understands.

   struct gbm_device *gbm = gbm_create_device (fd);

That’s it. We now have a GBM device that is able to send commands to a GPU via its render-node interface. The next step is to set up an EGL and OpenGL context.

Setting up a (surfaceless) EGL/OpenGL-ES context

   EGLDisplay egl_dpy = eglGetPlatformDisplay (EGL_PLATFORM_GBM_MESA, gbm, NULL);

This is the most interesting line of code in this section. We get access to an EGL display using the previously created GBM device as the platform-specific handle. Notice the EGL_PLATFORM_GBM_MESA identifier, which is the Mesa-specific enum for interfacing with a GBM device.

   const char *egl_extension_st = eglQueryString (egl_dpy, EGL_EXTENSIONS);
   assert (strstr (egl_extension_st, "EGL_KHR_surfaceless_context") != NULL);

Here we check that we got a valid EGL display that is able to create a surfaceless context.

   res = eglMakeCurrent (egl_dpy, EGL_NO_SURFACE, EGL_NO_SURFACE, core_ctx);

After setting up the EGL context and binding the OpenGL API we want (ES 3.1 in this case), we reach the last line of code that requires attention. When making the EGL/OpenGL-ES context current, we make sure to specify EGL_NO_SURFACE for both the draw and read surfaces; after all, rendering without a surface is exactly what we are after, and it is what the EGL_KHR_surfaceless_context extension we checked for earlier allows.

And that’s it. We now have an OpenGL context and we can start making OpenGL calls normally. Setting up the compute shader and dispatching it is standard OpenGL code, so I leave its explanation out of this article for simplicity.
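
If you are curious about which version the driver actually gave you, an optional sanity check could be as simple as:

   /* optional: print the OpenGL ES version we ended up with */
   printf ("GL_VERSION: %s\n", (const char *) glGetString (GL_VERSION));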

What next?

Ok, with around 100 lines of code we were able to dispatch a (useless) compute shader program. Here are some basic ideas on how to grow this example to do something useful:

  • Dynamically detect and select among the different render-node interfaces available.
  • Add some input and output buffers to get data in and out of the compute shader execution (see the sketch after this list).
  • Do some cool processing of the input data in the shader, then use the result back in the C program.
  • Add proper error checking, obviously.
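
As a hint for the second and third items, below is a rough sketch (buffer size, binding point and variable names are arbitrary choices, and error checking is omitted) of how data could be exchanged with the compute shader through a shader storage buffer object, assuming the shader declares a matching std430 buffer block at binding 0:

   /* Sketch: exchange data with the compute shader via an SSBO.
    * Assumes the ES 3.1 context created above is current and the shader
    * declares: layout (std430, binding = 0) buffer Data { float v[]; }; */
   float data[64] = { 0 };   /* input data would go here */
   GLuint ssbo;

   glGenBuffers (1, &ssbo);
   glBindBuffer (GL_SHADER_STORAGE_BUFFER, ssbo);
   glBufferData (GL_SHADER_STORAGE_BUFFER, sizeof (data), data, GL_DYNAMIC_COPY);
   glBindBufferBase (GL_SHADER_STORAGE_BUFFER, 0, ssbo);

   glDispatchCompute (1, 1, 1);

   /* make the shader writes visible before mapping the buffer on the CPU */
   glMemoryBarrier (GL_BUFFER_UPDATE_BARRIER_BIT);
   float *result = glMapBufferRange (GL_SHADER_STORAGE_BUFFER, 0,
                                     sizeof (data), GL_MAP_READ_BIT);
   printf ("result[0] = %f\n", result[0]);
   glUnmapBuffer (GL_SHADER_STORAGE_BUFFER);
   glDeleteBuffers (1, &ssbo);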

In future entries, I would also like to explore minimal examples of how to do this using the Vulkan and OpenCL APIs, and to implement some routine in the shader that does something actually interesting.

Conclusion

With this example I tried to demonstrate how easy it is to exploit the capabilities of modern GPUs for general-purpose computing. Whether for off-screen image processing, computer vision, crypto, physics simulation, machine learning, etc.; there are potentially many cases where off-loading work from the CPU to the GPU results in greater performance and/or energy efficiency.

Using a fairly modern Linux graphics stack, Mesa, and a DRI-capable GPU such as an Intel integrated graphics card, any C program or library can embed a compute shader today and use the GPU as just another computing unit.

Does your application have routines that could potentially be moved to a GLSL compute program? I would love to hear about it.

Drawing Web content with OpenGL (ES 3.0) instanced rendering

This is a follow up article about my ongoing research on Web content rendering using aggressive batching and merging of draw operations, together with OpenGL (ES 3.0) instanced rendering.

In a previous post, I discussed how relying on the Web engine’s layer tree to figure out the non-overlapping content (layers) of a Web page would (theoretically) allow an OpenGL-based rasterizer to ignore the order of the drawing operations. This would let the rasterizer group together draws of similar geometry and submit them efficiently to the GPU using instanced rendering.

I also presented some basic examples and comparisons of this technique with Skia, a popular 2D rasterizer, giving some hints on how much we can accelerate rendering if the overhead of the OpenGL API calls is reduced by using the instanced rendering technique.

However, this idea remained to be validated for real cases and on real hardware, especially because of the complexity and pressure imposed on shader programs, which now become responsible for de-referencing the attributes of each batched geometry and rendering it correctly.

Also, this technique could require API changes in the rasterizer that would make it impractical to implement in an existing Web engine without significant changes to the rendering process.

To try to keep this article short and focused, today I want to talk only about my latest experiments rendering some fairly complex Web elements using this technique, and leave the discussion about performance to future entries.

Everything is a rectangle

As mentioned in my previous article, almost everything in a Web page can be rendered with a rectangle primitive.

Web pages are mostly made of character glyphs, which today’s rasterizers normally draw by texture-mapping a pre-rendered image of the glyph onto a rectangular area. Then you have boxes, images, shadows, lines, etc.; which can all be drawn as a rectangle with the correct layout, transformation and/or texturing.

Primitives that are not rectangles are mostly seen in the element’s border specification, where you have borders with radius, and different styles: double, dotted, grooved, etc. There is a rich set of primitives coming from the combination of features in the borders spec alone.

There are also the Canvas 2D and SVG APIs, which were created specifically for arbitrary 2D content. The technique I’m discussing here purposely ignores these APIs and focuses on accelerating the rest.

In practice, however, these non-rectangular geometries account for just a tiny fraction of the typical rendering of a Web page, which allows me to effectively call them “exceptions”.

The approach I’m currently following assumes everything in a Web page is a rectangle, and all non-rectangular geometry is treated as exceptions and handled differently on shader code.

This means I no longer need to ignore the ordering problem since I always batch a rectangle for every single draw operation, and then render all rectangles in order. This introduces a dramatic change compared to the previous approach I discussed. Now I can (partially) implement this technique without changing the API of existing rasterizers. I say “partially” because to take full advantage of the performance gain, some API changes would be desired.
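
To give a rough idea of what rendering such a batch looks like on the OpenGL ES 3.0 side, here is a simplified sketch (the helper name, attribute indices and buffer layout are hypothetical; the real code carries many more attributes per rectangle). Per-rectangle attributes are uploaded once and the whole batch is drawn with a single instanced call, leaving it to the vertex shader to expand a unit quad using the per-instance data:

   /* Sketch: draw a batch of rectangles with one instanced call.
    * "rects" holds (x, y, w, h) and "colors" holds (r, g, b, a),
    * 4 floats per rectangle in both cases. */
   static void
   draw_rect_batch (GLuint instance_vbo, const float *rects,
                    const float *colors, int num_rects)
   {
      size_t attr_size = (size_t) num_rects * 4 * sizeof (float);

      /* upload per-instance attributes: rects first, then colors */
      glBindBuffer (GL_ARRAY_BUFFER, instance_vbo);
      glBufferData (GL_ARRAY_BUFFER, 2 * attr_size, NULL, GL_DYNAMIC_DRAW);
      glBufferSubData (GL_ARRAY_BUFFER, 0, attr_size, rects);
      glBufferSubData (GL_ARRAY_BUFFER, attr_size, attr_size, colors);

      /* attribute 1: per-instance rect, attribute 2: per-instance color */
      glVertexAttribPointer (1, 4, GL_FLOAT, GL_FALSE, 0, (const void *) 0);
      glVertexAttribDivisor (1, 1);   /* advance once per instance, not per vertex */
      glEnableVertexAttribArray (1);

      glVertexAttribPointer (2, 4, GL_FLOAT, GL_FALSE, 0, (const void *) attr_size);
      glVertexAttribDivisor (2, 1);
      glEnableVertexAttribArray (2);

      /* attribute 0 is assumed to hold the 4 vertices of a unit quad,
       * which the vertex shader positions using the per-instance rect */
      glDrawArraysInstanced (GL_TRIANGLE_STRIP, 0, 4, num_rects);
   }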

Drawing non-rectangular geometry using rectangles

So, how do we deal with these exceptions? Remember that we want to draw only with rectangles so that no operation could ever break our batch, if we want to take full advantage of the instanced rendering acceleration.

There are 3 ways of rendering non-rectangular geometry using rectangles:

  • 1. Using a geometry shader:

    This is the most elegant solution, and it looks like it was designed for this case. But since it isn’t yet widely deployed, I will not put much emphasis on it here, though we need to follow its evolution closely.

  • 2. Degenerating rectangles:

    This basically means turning a rectangle into a triangle by degenerating one of its vertices. Then, with a set of degenerate rectangles, one could draw any arbitrary geometry as we do today with triangles.

  • 3. Drawing geometry in the fragment shader:

    This sounds like a bad idea, and it is definitely a bad idea! However, given the small and limited set of cases that we need to consider, it can be feasible.

I’m currently experimenting with 3). You might ask why, since it looks like the worst option. The reason is that going for 2), degenerating rectangles, seems overkill at this point, lacking a deeper understanding of exactly what non-rectangular geometry we will ever need. Implementing a generic rectangle degeneration just for a tiny set of cases would have been a bad initial choice and a waste of time.

So I decided to first explore the option of drawing these exceptions in the fragment shader, and see how far I could go in terms of shader code complexity and performance loss (or lack thereof).
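
To make this concrete, here is a simplified GLSL (ES 3.00) fragment shader sketch of the idea, filling a rounded-corner rectangle from a signed distance to its edge; the varying and uniform names are hypothetical, and the real shaders need to handle several more cases:

   #version 300 es
   precision mediump float;

   in vec2 v_pos;              /* fragment position in the rect's local space */
   uniform vec2 u_half_size;   /* half width/height of the rectangle */
   uniform float u_radius;     /* border radius, in the same units */
   uniform vec4 u_color;
   out vec4 frag_color;

   void main (void) {
      /* signed distance to the rounded-rectangle edge (negative = inside) */
      vec2 q = abs (v_pos) - u_half_size + vec2 (u_radius);
      float dist = length (max (q, vec2 (0.0))) + min (max (q.x, q.y), 0.0) - u_radius;

      /* simple distance-to-edge linear anti-aliasing over a 1-unit band */
      float alpha = clamp (0.5 - dist, 0.0, 1.0);
      frag_color = vec4 (u_color.rgb, u_color.a * alpha);
   }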

Next, I will show some examples of simple Web features rendered this way.

Experiments

The setup:

While my previous screen-casts were run on my work laptop with a powerful Haswell GPU, one of my goals back then was to focus on mobile devices. Hence, I started developing on an Arndale board I happen to have around. The details of the exact setup are out of scope for now; I will just mention that the board is running a Linaro distribution with the official Mali T604 drivers by ARM.

My Arndale board

Following is a video I assembled to show the different examples running on the Arndale board (and on my laptop at the same time). This time I had to record using an external camera instead of screen-casting to avoid interfering with the performance, so please bear with my camera-in-hand video recording skills.



This video file is also available on Vimeo.

I won’t talk about performance now, since I plan to cover that in future entries. Suffice it to say that the performance is pretty good, comparable to my laptop’s in most of the examples. Also, there are a lot of simple, known optimizations that I have not done yet because I’m focusing on validating the method first.

One important thing to note is that when drawing is done in a fragment shader, you cannot benefit from multi-sample anti-aliasing (MSAA), since coverage sampling occurs at an earlier stage, based on the primitive’s edges. Hence, you have to implement anti-aliasing yourself. In this case, I implemented a simple distance-to-edge linear anti-aliasing, and to my surprise, the end result is much better than the 8-sample MSAA I was trying on my Haswell laptop before, and it is also faster.

On a related note, I have found that MSAA does not gain me much when rendering character glyphs (the majority of the content), since they already come anti-aliased by FreeType2, while MSAA slows down the rendering of the entire scene for every single frame.

I continue to dump the code from this research into a personal repository on GitHub. Go take a look if you are interested in the prototyping of these experiments.

Conclusions and next steps

There is one important conclusion coming out of these experiments: the fact that rasterizers are stateless makes it very inefficient to modify a single element of a scene.

By stateless I mean they do not keep semantic information about the elements being drawn. For example, let’s say I draw a rectangle in one frame, and in the next frame I want to draw the same rectangle somewhere else on the canvas. I already have a batch with all the elements of the scene happily stored in a vertex buffer object in GPU memory, and the rectangle in question is there somewhere. If I could keep the offset where that rectangle is in the batch, I could modify its attributes without having to drop and re-submit the entire buffer.

The solution: Moving to a scene graph. Web engines already implement a scene graph but at a higher level. Here I’m talking about a scene graph in the rasterizer itself, where nodes keep the offset of their attributes in the batch (layout, transformation, color, etc); and when you modify any of these attributes, only the deltas are uploaded to the GPU, rather than the whole batch.
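
In code, the idea could look roughly like this (a hypothetical node type with a minimal attribute layout; the real one would be richer): each node remembers the byte offset of its attributes inside the batch VBO, and modifying it only re-uploads that small region with glBufferSubData:

   /* Sketch: a scene-graph node that knows where its attributes
    * live inside the batch VBO */
   struct node {
      GLintptr offset;    /* byte offset of this node's attributes in the VBO */
      float attrs[8];     /* e.g. layout (x, y, w, h) and color (r, g, b, a) */
   };

   static void
   node_update (GLuint batch_vbo, struct node *n)
   {
      /* re-upload only this node's attributes, not the whole batch */
      glBindBuffer (GL_ARRAY_BUFFER, batch_vbo);
      glBufferSubData (GL_ARRAY_BUFFER, n->offset, sizeof (n->attrs), n->attrs);
   }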

I believe a scene graph approach has the potential to open a whole new set of opportunities for acceleration, especially for transitions, animations, and scrolling.

And that’s exciting!

Apart from this, I also want to:

  • Benchmark! Set up a platform for reliable benchmarking and performance comparison with Skia/Cairo.
  • Take a subset of this technique and test it in Skia, behind the current API.
  • Validate the case of drawing drop shadows and multi-step gradient backgrounds.
  • Test in other different OpenGL ES 3.0 implementations (and more devices!).

Let us not forget the fight we are fighting: Web applications must be as fast as native. I truly think we can do it.