Story and status of ARB_gpu_shader_fp64 on Intel GPUs

In case you haven’t heard yet, with the recently announced Mesa 12.0 release, Intel gen8+ GPUs expose OpenGL 4.3, which is quite a big leap from the previous OpenGL 3.3!

OpenGL 4.3
The Mesa i965 Intel driver now exposes OpenGL 4.3 on Broadwell and later!

Although this might surprise some, the i965 driver had been exposing many of the OpenGL 4.x extensions for quite some time, even while it only advertised OpenGL 3.3. However, one OpenGL 4.0 extension in particular was still missing and prevented the driver from exposing a higher version: ARB_gpu_shader_fp64 (fp64 for short). There was a good reason for this: it is a very large feature that has been in the works for quite some time, first at Intel and later at Igalia. We first started to work on this as far back as November 2015 and by that time Intel had already been working on it for months.

I won’t cover here what made this such a large effort because there would be a lot of stuff to cover and I don’t feel like spending weeks writing a series of posts on the subject :). Hopefully I will get a chance to talk about all that at XDC in September, so instead I’ll focus on explaining why we only have this working in gen8+ at the moment and the status of gen7 hardware.

The plan for ARB_gpu_shader_fp64 was always to focus on gen8+ hardware (Broadwell and later) first because it has better support for the feature. I must add that it also has fewer hardware bugs, although we only found out about that later ;). So the plan was to do gen8+ and then extend the implementation to cover the quirks required by gen7 hardware (IvyBridge, Haswell, ValleyView).

At this point I should explain that Intel GPUs have two code generation backends: scalar and vector. The main difference between them is that the vector backend (also known as align16) operates on vectors (surprise, right?) and has native support for things like swizzles and writemasks, while the scalar backend (known as align1) operates on scalars, which means that, for example, a vec4 GLSL operation is broken up into 4 separate instructions, each one operating on a single component. You might think that this makes the scalar backend slower, but that would not be accurate. In fact it is usually faster because it allows the GPU to exploit SIMD better than the vector backend.

The thing is that different hardware generations use one backend or the other for different shader stages. For example, gen8+ used to run Vertex, Fragment and Compute shaders through the scalar backend and Geometry and Tessellation shaders via the vector backend, whereas Haswell and IvyBridge use the vector backend also for Vertex shaders.

Because you can use 64-bit floating point in any shader stage, the original plan was to implement fp64 support on both backends. Implementing fp64 requires a lot of changes throughout the driver compiler backends, which makes the task anything but trivial, but the vector backend is particularly difficult to implement because the hardware only supports 32-bit swizzles. This restriction means that a hardware swizzle such as XYZW only selects components XY in a dvecN and therefore, there is no direct mechanism to access components ZW. As a consequence, dealing with anything bigger than a dvec2 requires more creative solutions, which then need to face some other hardware limitations and bugs, etc, which eventually makes the vector backend require a significantly larger development effort than the scalar backend.
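To make the swizzle restriction more concrete, here is a small C model of it. It is only a sketch and not driver code: it assumes a dvec4 is laid out as eight consecutive 32-bit slots and that a swizzle channel can only name slots 0 to 3 (X, Y, Z, W). The names swizzle_for_double and read_double are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Toy model: a 32-bit swizzle channel can only name one of four
 * 32-bit slots of a register. */
enum { SWZ_X = 0, SWZ_Y = 1, SWZ_Z = 2, SWZ_W = 3 };

/* Returns 1 and fills lo/hi with the two 32-bit channels that cover
 * 64-bit component `comp` of a dvecN, or returns 0 when that component
 * lies beyond what a 32-bit swizzle can address. */
int swizzle_for_double(int comp, int *lo, int *hi)
{
    int first_slot = comp * 2;      /* each double spans two 32-bit slots */
    if (first_slot + 1 > SWZ_W)
        return 0;                   /* components Z and W of the dvec */
    *lo = first_slot;
    *hi = first_slot + 1;
    return 1;
}

/* Assembles a double from the two 32-bit slots selected above. */
double read_double(const uint32_t *slots, int lo, int hi)
{
    uint32_t halves[2] = { slots[lo], slots[hi] };
    double value;
    memcpy(&value, halves, sizeof value);
    return value;
}
```

In this model the 32-bit channels XY map to the first double and ZW to the second, so components 2 and 3 of a dvec4 cannot be reached with a swizzle alone, which is why anything bigger than a dvec2 needs extra work.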

Thankfully, gen8+ hardware supports scalar Geometry and Tessellation shaders and Intel’s Kenneth Graunke had been working on enabling that for a while. When we realized that the vector fp64 backend was going to require much more effort than what we had initially thought, he gave a final push to the full scalar gen8+ implementation, which in turn allowed us to have a full fp64 implementation for this hardware and expose OpenGL 4.0, and soon after, OpenGL 4.3.

That does not mean that we don’t care about gen7 though. As I said above, the plan has always been to bring fp64 and OpenGL4 to gen7 as well. In fact, we have been hard at work on that since even before we started sending the gen8+ implementation for review and we have made some good progress.

Besides addressing the quirks of fp64 for IvyBridge and Haswell (yes, they have different implementation requirements) we also need to implement the full fp64 vector backend support from scratch, which as I said, is not a trivial undertaking. Because Haswell seems to require fewer changes we have started with that and I am happy to report that we have a working version already. In fact, we have already sent a small set of patches for review that implement Haswell’s requirements for the scalar backend and as I write this I am cleaning up an initial implementation of the vector backend in preparation for review (currently at about 100 patches, but I hope to trim it down a bit before we start the review process). IvyBridge and ValleyView will come next.

The initial implementation for the vector backend has room for improvement since the focus was on getting it working first so we can expose OpenGL4 in gen7 as soon as possible. The good thing is that it is more or less clear how we can improve the implementation going forward (you can see an excellent post by Curro on that topic here).

You might also be wondering about OpenGL 4.1’s ARB_vertex_attrib_64bit, after all, that kind of goes hand in hand with ARB_gpu_shader_fp64 and we implemented the extension for gen8+ too. There is good news here too, as my colleague Juan Suárez has already implemented this for Haswell and I would expect it to mostly work on IvyBridge as is or with minor tweaks. With that we should be able to expose at least OpenGL 4.2 on all gen7 hardware once we are done.

So far, implementing ARB_gpu_shader_fp64 has been quite the ride and I have learned a lot of interesting stuff about how the i965 driver and Intel GPUs operate in the process. Hopefully, I’ll get to talk about all this in more detail at XDC later this year. If you are planning to attend and you are interested in discussing this or other Mesa stuff with me, please find me there, I’ll be looking forward to it.

Finally, I’d like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time; my igalian friends Samuel Iglesias, who has been hard at work with me on the fp64 implementation, and Juan Suárez and Andrés Gómez, who have done a lot of work to improve the fp64 test suite in Piglit; and all the friends at Intel who have been helping us in the process, especially Connor Abbott, Francisco Jerez, Jason Ekstrand and Kenneth Graunke.

Implementing ARB_shader_storage_buffer_object

In my previous post I introduced ARB_shader_storage_buffer_object, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I’ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.

Following the trail of UBOs

As I explained in my previous post, SSBOs are similar to UBOs, but they are read-write. Because there is a lot of code already in place in Mesa’s GLSL compiler to deal with UBOs, it made sense to try and reuse all the data structures and code we had for UBOs and specialize the behavior for SSBOs where that was needed. That allows us to build on code paths that are already working well and reuse most of the code.

That path, however, had some issues that bit me a bit further down the road. When it comes to representing these operations in the IR, my first idea was to follow the trail of UBO loads as well, which are represented as ir_expression nodes. There is a fundamental difference between the two though: UBO loads are constant operations because uniform buffers are read-only. This means that a UBO load operation with the same parameters will always return the same value. This has implications related to certain optimization passes that work based on the assumption that other ir_expression operations share this feature. SSBO loads are not like this: since the shader storage buffer is read-write, two identical SSBO load operations in the same shader may not return the same result if the underlying buffer storage has been altered in between by SSBO write operations within the same or other threads. This forced me to alter a number of optimization passes in Mesa to deal with this situation (mostly disabling them for the cases of SSBO loads and stores).

The situation was worse with SSBO stores. These just did not fit into ir_expression nodes: they did not return a value and had side-effects (memory writes) so we had to come up with a different way to represent them. My initial implementation created a new IR node for these, ir_ssbo_store. That worked well enough, but it left us with an implementation of loads and stores that was a bit inconsistent since both operations used very different IR constructs.

These issues were made clear during the review process, where it was suggested that we used GLSL IR intrinsics to represent load and store operations instead. This has the benefit that we can make the implementation more consistent, having both loads and stores represented with the same IR construct and follow a similar treatment in both the GLSL compiler and the i965 backend. It would also remove the need to disable or alter certain optimization passes to be SSBO friendly.

Read/Write coherence

One of the issues we detected early in development was that our reads and writes did not seem to work very well together: sometimes a read after a write would fail to see the last value written to a buffer variable. The problem here also stemmed from following the implementation trail of the UBO path. In the Intel hardware, there are various interfaces to access memory, like the Sampling Engine and the Data Port. The former is a read-only interface and is used, for example, for texture and UBO reads. The Data Port allows for read-write access. Although both interfaces give access to the same memory region, there is something to consider here: if you mix reads through the Sampling Engine and writes through the Data Port you can run into cache coherence issues, because the caches in use by the Sampling Engine and the Data Port functions are different. Initially, we implemented SSBO load operations like UBO loads, so we used the Sampling Engine, and ended up running into this problem. The solution, of course, was to rewrite SSBO loads to go through the Data Port as well.

Parallel reads and writes

GPUs are highly parallel hardware and this has some implications for driver developers. Take a statement like this in a fragment shader program:

float cx = 1.0;

This is a simple assignment of the value 1.0 to variable cx that is supposed to happen for each fragment produced. In Intel hardware running in SIMD16 mode, we process 16 fragments simultaneously in the same GPU thread, which means that this instruction is actually 16 elements wide. That is, we are doing 16 assignments of the value 1.0 simultaneously, and each one is stored at a different offset into the GPU register used to hold the value of cx.

If cx was a buffer variable in a SSBO, it would also mean that the assignment above should translate to 16 memory writes to the same offset into the buffer. That may seem a bit absurd: why would we want to write 16 times if we are always assigning the same value? Well, because things can get more complex, like this:

float cx = gl_FragCoord.x;

Now we are no longer assigning the same value for all fragments: each of the 16 values assigned with this instruction could be different. If cx was a buffer variable inside an SSBO, then we could potentially be writing 16 different values to it. It is still a bit silly, since only one of the values (the one we write last) would prevail.

Okay, but what if we do something like this?:

int index = int(mod(gl_FragCoord.x, 8));
cx[index] = 1;

Now, depending on the value we are reading for each fragment, we are writing to a separate offset into the SSBO. We still have a single assignment in the GLSL program, but that translates to 16 different writes, and in this case the order may not be relevant, but we want all of them to happen to achieve correct behavior.
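A CPU-side sketch can help visualize this. The C function below emulates that single GLSL assignment for one SIMD16 thread as a loop over 16 lanes. It is only an illustration under my own assumptions: simd16_store_one is a hypothetical name, frag_x holds the integer part of gl_FragCoord.x for each lane, and real hardware performs these writes in parallel rather than in a loop.

```c
#define SIMD_WIDTH 16
#define CX_SIZE 8

/* Emulates `cx[int(mod(gl_FragCoord.x, 8))] = 1;` for one SIMD16 thread.
 * frag_x[lane] is the (non-negative) integer part of gl_FragCoord.x
 * for the fragment processed by that lane. */
void simd16_store_one(int cx[CX_SIZE], const int frag_x[SIMD_WIDTH])
{
    for (int lane = 0; lane < SIMD_WIDTH; lane++) {
        int index = frag_x[lane] % CX_SIZE;  /* per-lane offset */
        cx[index] = 1;                       /* one write per lane */
    }
}
```

If the 16 lanes cover 16 consecutive X coordinates, every entry of cx gets written (each one twice). If all the lanes share the same X coordinate, only one entry is touched. A test that only exercises the second case would not notice an implementation that writes a single lane.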

The bottom line is that when we implement SSBO load and store operations, we need to understand the parallel environment in which we are running and work with test scenarios that allow us to verify correct behavior in these situations. For example, if we only test scenarios with assignments that give the same value to all the fragments/vertices involved in the parallel instructions (i.e. assignments of values that do not depend on properties of the current fragment or vertex), we could easily overlook fundamental defects in the implementation.

Dealing with helper invocations

From Section 7.1 of the GLSL spec version 4.5:

“Fragment shader helper invocations execute the same shader code
as non-helper invocations, but will not have side effects that
modify the framebuffer or other shader-accessible memory.”

To understand what this means I have to introduce the concept of helper invocations: certain operations in the fragment shader need to evaluate derivatives (explicitly or implicitly) and for that to work well we need to make sure that we compute values for adjacent fragments that may not be inside the primitive that we are rendering. The fragment shader executions for these added fragments are called helper invocations, meaning that they are only needed to help in computations for other fragments that are part of the primitive we are rendering.

How does this affect SSBOs? Because helper invocations are not part of the primitive, they cannot have side-effects: after they have served their purpose it should be as if they had never been produced, so in the case of SSBOs we have to be careful not to do memory writes for helper fragments. Notice also that in a SIMD16 execution we can have both proper and helper fragments mixed in the group of 16 fragments we are handling in parallel.

Of course, the hardware knows if a fragment is part of a helper invocation or not and it tells us about this through a pixel mask register that is delivered with all executions of a fragment shader thread. This register has a bitmask stating which pixels are proper and which are helper. The Intel hardware also provides developers with various kinds of messages that we can use, via the Data Port interface, to write to memory. However, the tricky thing is that not all of them incorporate pixel mask information, so for use cases where you need to disable writes from helper fragments you need to be careful with the write message you use and select one that accepts this sort of information.
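The effect we want from such a message can be sketched on the CPU as a masked store. This is only a model under my own assumptions, not the hardware interface: masked_store is a hypothetical helper and bit n of pixel_mask is taken to mean that lane n holds a proper (non-helper) fragment.

```c
#include <stdint.h>

#define SIMD_WIDTH 16

/* Stores values[lane] at buffer[offsets[lane]] for every lane whose bit
 * is set in pixel_mask. Lanes with a clear bit are treated as helper
 * invocations and skipped. Returns the number of writes performed. */
int masked_store(float *buffer, const int offsets[SIMD_WIDTH],
                 const float values[SIMD_WIDTH], uint16_t pixel_mask)
{
    int writes = 0;
    for (int lane = 0; lane < SIMD_WIDTH; lane++) {
        if (!(pixel_mask & (1u << lane)))
            continue;               /* helper invocation: no side effects */
        buffer[offsets[lane]] = values[lane];
        writes++;
    }
    return writes;
}
```

With a mask of 0x00FF, for example, only the first 8 lanes would reach memory, while the other 8 would still have executed the shader code to provide derivatives.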

Vector alignments

Another interesting thing we had to deal with is address alignment. UBOs work with layout std140. In this setup, elements in the UBO definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense. However, as I explained in my previous post, SSBOs also introduce a packed layout mode known as std430.

Intel hardware provides a number of messages that we can use through the Data Port interface to write to memory. Each message has different characteristics that make it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages have the capacity to write data in chunks of 16 bytes (that is, they write vec4 elements, or OWORDS in the language of the technical docs). One could think that these messages are great when you work with vector data types. However, they also introduce the problem of dealing with partial writes: what happens when you only write to one element of a vector? Or to a buffer variable that is smaller than the size of a vector? What if you write columns in a row_major matrix?

In these scenarios, using these messages introduces the need to mask the writes because you need to disable the channels in the vec4 element that you don’t want to write. Of course, the hardware provides means to do this: we only need to set the writemask of the destination register of the message instruction to select the right channels. Consider this example:

struct TB {
    float a, b, c, d;
};

layout(std140, binding=0) buffer Fragments {
   TB s[3];
   int index;
};

void main()
{
   s[0].d = -1.0;
}

In this case, we could use a 16-byte write message that takes 0 as offset (i.e. it writes at the beginning of the buffer, where s[0] is stored) and then set the writemask on that instruction to WRITEMASK_W so that only the fourth data element is actually written. This way we only write one 4-byte data element (-1) at byte offset 12 (s[0].d). Easy, right? However, how do we know, in general, which writemask we need to use? In std140 layout mode this is easy: since each element in the SSBO is aligned to a 16-byte boundary, we take the byte offset at which we are writing and compute it modulo 16, which gives the byte offset into the 16-byte chunk that we are writing into (the quotient of the division by 16 is the index of that chunk in units of vec4). Then we only have to divide that remainder by 4 to get the component slot we need to write to (a number between 0 and 3).
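The arithmetic just described fits in a few lines of C. This is a sketch for illustration only: the MASK_* values and the helper names are hypothetical and not taken from the driver, and it assumes the byte offset is known when the code is generated.

```c
/* One bit per vec4 channel, X to W (illustrative values). */
enum { MASK_X = 1 << 0, MASK_Y = 1 << 1, MASK_Z = 1 << 2, MASK_W = 1 << 3 };

/* Byte offset of the 16-byte chunk (vec4) that contains byte_offset. */
unsigned chunk_offset(unsigned byte_offset)
{
    return (byte_offset / 16) * 16;
}

/* Component slot (0..3) inside that chunk: remainder modulo 16,
 * divided by the 4 bytes each component takes. */
unsigned component_slot(unsigned byte_offset)
{
    return (byte_offset % 16) / 4;
}

/* Writemask that enables only the component being written. */
unsigned writemask_for(unsigned byte_offset)
{
    return 1u << component_slot(byte_offset);
}
```

For s[0].d, at byte offset 12, this gives chunk offset 0, component slot 3 and a writemask equal to MASK_W, matching the example above.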

However, there is a restriction: we can only set the writemask of a register at compile/link time, so what happens when we have something like this?:

s[i].d = -1.0;

The problem with this is that we cannot evaluate the value of i at compile/link time, which inevitably makes our solution invalid for this. In other words, if we cannot evaluate the actual value of the offset at which we are writing at compile/link time, we cannot use the writemask to select the channels we want to use when we don’t want to write a vec4 worth of data and we have to use a different type of message.

That said, in the case of std140 layout mode, since each data element in the SSBO is aligned to a 16-byte boundary you may realize that the actual value of i is irrelevant for the purpose of the modulo operation discussed above and we can still manage to make things work by completely ignoring it for the purpose of computing the writemask, but in std430 that trick won’t work at all, and even in std140 we would still have row_major matrix writes to deal with.

Also, we may need to tweak the message depending on whether we are running on the vertex shader or the fragment shader because not all message types have appropriate SIMD modes (SIMD4x2, SIMD8, SIMD16, etc.) for both, or because different hardware generations may not have all the message types or support all the SIMD modes we need.

The point of this is that selecting the right message to use can be tricky. There are multiple things and corner cases to consider, and you do not want to end up with an implementation that requires using many different messages depending on various circumstances, because of the complexity that it would add to the implementation and maintenance of the code.

Closing notes

This post did not cover all the intricacies of the implementation of ARB_shader_storage_buffer_object. I did not discuss things like the optional unsized array or the compiler details of std430, for example, but hopefully I managed to give an idea of the kind of problems one would have to deal with when coding driver support for this or other similar features.

Bringing ARB_shader_storage_buffer_object to Mesa and i965

In the last weeks I have been working together with my colleague Samuel on bringing support for ARB_shader_storage_buffer_object, an OpenGL 4.3 feature, to Mesa and the Intel i965 driver, so I figured I would write a bit on what this brings to OpenGL/GLSL users. If you are interested, read on.

Introducing Shader Storage Buffer Objects

This extension introduces the concept of shader storage buffer objects (SSBOs), which is a new type of OpenGL buffer. SSBOs allow GL clients to create buffers that shaders can then map to variables (known as buffer variables) via interface blocks. If you are familiar with Uniform Buffer Objects (UBOs), SSBOs are pretty similar but:

  • They are read/write, unlike UBOs, which are read-only.
  • They allow a number of atomic operations on them.
  • They allow an optional unsized array at the bottom of their definitions.

Since SSBOs are read/write, they create a bidirectional channel of communication between the GPU and CPU spaces: the GL application can set the value of shader variables by writing to a regular OpenGL buffer, but the shader can also update the values stored in that buffer by assigning values to them in the shader code, making the changes visible to the GL application. This is a major difference with UBOs.

In a parallel environment such as a GPU where we can have multiple shader instances running simultaneously (processing multiple vertices or fragments from a specific rendering call) we should be careful when we use SSBOs. Since all these instances will be simultaneously accessing the same buffer there are implications to consider relative to the order of reads and writes. The spec does not make many guarantees about the order in which these take place, other than ensuring that the order of reads and writes within a specific execution of a shader is preserved. Thus, it is up to the graphics developer to ensure, for example, that each execution of a fragment or vertex shader writes to a different offset into the underlying buffer, or that writes to the same offset always write the same value. Otherwise the results would be undefined, since they would depend on the order in which writes and reads from different instances happen in a particular execution.

The spec also allows using memoryBarrier() from shader code and glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) from a GL application to add sync points. These ensure that all memory accesses to buffer variables issued before the barrier are completely executed before moving on.

Another tool for developers to deal with concurrent accesses is atomic operations. The spec introduces a number of new atomic memory functions for use with buffer variables: atomicAdd, atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor, atomicExchange (atomic assignment to a buffer variable), atomicCompSwap (atomic conditional assignment to a buffer variable).

The optional unsized array at the bottom of an SSBO definition can be used to push a dynamic number of entries to the underlying buffer storage, up to the total size of the buffer allocated by the GL application.

Using shader storage buffer objects (GLSL)

Okay, so how do we use SSBOs? We will introduce this through an example: we will use a buffer to record information about the fragments processed by the fragment shader. Specifically, we will group fragments according to their X coordinate (by computing an index from the coordinate using a modulo operation). We will then record how many fragments are assigned to a particular index, the first fragment to be assigned to a given index, the last fragment assigned to a given index, the total number of fragments processed and the complete list of fragments processed.

To store all this information we will use the SSBO definition below:

layout(std140, binding=0) buffer SSBOBlock {
   vec4 first[8];     // first fragment coordinates assigned to index
   vec4 last[8];      // last fragment coordinates assigned to index
   int counter[8];    // number of fragments assigned to index
   int total;         // number of fragments processed
   vec4 fragments[];  // coordinates of all fragments processed
};

Notice the use of the keyword buffer to tell the compiler that this is a shader storage buffer object. Also notice that we have included an unsized array called fragments[]. There can only be one of these in an SSBO definition and, if there is one, it has to be the last field defined.

In this case we are using std140 layout mode, which imposes certain alignment rules for the buffer variables within the SSBO, like in the case of UBOs. These alignment rules may help the driver implement read/write operations more efficiently since the underlying GPU hardware can usually read and write faster from and to aligned addresses. The downside of std140 is that because of these alignment rules we also waste some memory and we need to know the alignment rules on the GL side if we want to read/write from/to the buffer. Specifically for SSBOs, the specification introduces a new layout mode: std430, which removes these alignment restrictions, allowing for a more efficient memory usage implementation, but possibly at the expense of some performance impact.
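To see what these rules mean for the block above, the following C sketch computes the byte offset of each field under both layouts. It is a hand-written calculation for this particular block and not a general layout engine. It relies on the standard rules that an int array has a 16-byte stride in std140 and a 4-byte stride in std430, and the names block_offsets and ssbo_block_offsets are made up for the example.

```c
#include <stddef.h>

/* Rounds offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

typedef struct {
    size_t first, last, counter, total, fragments;
} block_offsets;

/* Byte offsets of the SSBOBlock fields. int_array_stride is 16 for
 * std140 (int array elements are padded to the size of a vec4) and
 * 4 for std430 (int array elements are tightly packed). */
block_offsets ssbo_block_offsets(size_t int_array_stride)
{
    block_offsets o;
    size_t offset = 0;

    o.first = offset;    offset += 8 * 16;               /* vec4 first[8]  */
    o.last = offset;     offset += 8 * 16;               /* vec4 last[8]   */
    o.counter = offset;  offset += 8 * int_array_stride; /* int counter[8] */
    o.total = offset;    offset += 4;                    /* int total      */
    o.fragments = align_up(offset, 16);                  /* vec4 array     */
    return o;
}
```

With a stride of 16 (std140) the unsized array starts at byte 400, while with a stride of 4 (std430) it starts at byte 304, so for this block the packed layout saves 96 bytes.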

The binding keyword, just like in the case of UBOs, is used to select the buffer that we will be reading from and writing to when accessing these variables from the shader code. It is the application’s responsibility to bind the right buffer to the binding point we specify in the shader code.

So with that done, the shader can read from and write to these variables as we see fit, but we should be aware of the fact that multiple instances of the shader could be reading from and writing to them simultaneously. Let’s look at the fragment shader that stores the information we want into the SSBO:

void main() {
   int index = int(mod(gl_FragCoord.x, 8));

   int i = atomicAdd(counter[index], 1);
   if (i == 0)
      first[index] = gl_FragCoord;
   else
      last[index] = gl_FragCoord;

   i = atomicAdd(total, 1);
   fragments[i] = gl_FragCoord;
}

The first line computes an index into our integer array buffer variable by using gl_FragCoord. Notice that different fragments could get the same index. Next we increment counter[index] by one. Since we know that different fragments can get to do this at the same time we use an atomic operation to make sure that we don’t lose any increments.

Notice that if two fragments can write to the same index, reading the value of counter[index] after the atomicAdd can lead to different results. For example, if two fragments have already executed the atomicAdd, and assuming that counter[index] is initialized to 0, then both would read counter[index] == 2, however, if only one of the fragments has executed the atomic operation by the time it reads counter[index] it would read a value of 1, while the other fragment would read a value of 2 when it reaches that point in the shader execution. Since our shader intends to record the coordinates of the first fragment that writes to counter[index], that won’t work for us. Instead, we use the return value of the atomic operation (which returns the value that the buffer variable had right before changing it) and we write to first[index] only when that value was 0. Because we use the atomic operation to read the previous value of counter[index], only one fragment will read a value of 0, and that will be the fragment that first executed the atomic operation.

If this is not the first fragment assigned to that index, we write to last[index] instead. Again, multiple fragments assigned to the same index could do this simultaneously, but that is okay here, because we only care about the last write. Also notice that it is possible that different executions of the same rendering command produce different values of first[] and last[].

The remaining instructions unconditionally push the fragment coordinates to the unsized array. We keep the last index into the unsized array fragments[] we have written to in the buffer variable total. Each fragment will atomically increase total before writing to the unsized array. Notice that, once again, we have to be careful when reading the value of total to make sure that each fragment reads a different value and we never have two fragments write to the same entry.
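The logic of the shader can be mirrored on the CPU with C11 atomics, which is handy for checking the bookkeeping without a GPU. This is a sketch with several assumptions: the fragments are fed one at a time instead of running in parallel, the unsized array is given a fixed capacity of 64 entries, and the type and function names (vec4, ssbo_block, record_fragment) are made up for the example.

```c
#include <stdatomic.h>

typedef struct { float x, y, z, w; } vec4;

typedef struct {
    vec4 first[8];
    vec4 last[8];
    atomic_int counter[8];
    atomic_int total;
    vec4 fragments[64];     /* stand-in for the unsized array */
} ssbo_block;

/* Mirrors the fragment shader's main() for a single fragment.
 * The caller must not record more than 64 fragments. */
void record_fragment(ssbo_block *b, vec4 frag_coord)
{
    /* Same result as int(mod(x, 8)) for non-negative x. */
    int index = (int)frag_coord.x % 8;

    /* Like GLSL atomicAdd, atomic_fetch_add returns the value the
     * variable held right before the addition. */
    int i = atomic_fetch_add(&b->counter[index], 1);
    if (i == 0)
        b->first[index] = frag_coord;
    else
        b->last[index] = frag_coord;

    i = atomic_fetch_add(&b->total, 1);
    b->fragments[i] = frag_coord;
}
```

Because only one caller can ever get 0 back from the first atomic_fetch_add, exactly one fragment per index writes to first[]. In a real parallel run, which fragment that is depends on scheduling, so a sequential emulation like this one always picks the first fragment it is given.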

Using shader storage buffer objects (GL)

On the side of the GL application, we need to create the buffer, bind it to the appropriate binding point and initialize it. We do this as usual, only that we use the new GL_SHADER_STORAGE_BUFFER target:

typedef struct {
   float first[8*4];      // vec4[8]
   float last[8*4];       // vec4[8]
   int counter[8*4];      // int[8] padded as per std140
   int total;             // int
   int pad[3];            // padding: as per std140 rules
   char fragments[1024];  // up to 1024 bytes of unsized array
} SSBO;

SSBO data;

memset(&data, 0, sizeof(SSBO));

GLuint buf;
glGenBuffers(1, &buf);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf);
// allocate the buffer storage and upload the zero-initialized data
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(SSBO), &data, GL_DYNAMIC_DRAW);

The code creates a buffer, binds it to binding point 0 of GL_SHADER_STORAGE_BUFFER (the same we have bound our shader to) and initializes the buffer data to 0. Notice that because we are using std140 we have to be aware of the alignment rules at work. We could have used std430 instead to avoid this.

Since we have 1024 bytes for the fragments[] unsized array and we are pushing a vec4 (16 bytes) worth of data to it with every fragment we process then we have enough room for 64 fragments. It is the developer’s responsibility to ensure that this limit is not surpassed, otherwise we would write beyond the allocated space for our buffer and the results would be undefined.

The next step is to do some rendering so we get our shaders to work. That would trigger the execution of our fragment shader for each fragment produced, which will generate writes into our buffer for each buffer variable the shader code writes to. After rendering, we can map the buffer and read its contents from the GL application as usual:

SSBO *ptr = (SSBO *) glMapNamedBuffer(buf, GL_READ_ONLY);

/* List of fragments recorded in the unsized array */
printf("%d fragments recorded:\n", ptr->total);
float *coords = (float *) ptr->fragments;
for (int i = 0; i < ptr->total; i++, coords += 4) {
   printf("Fragment %d: (%.1f, %.1f, %.1f, %.1f)\n",
          i, coords[0], coords[1], coords[2], coords[3]);
}

/* First fragment for each index used */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 0)
      printf("First fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->first[i*4], ptr->first[i*4+1],
             ptr->first[i*4+2], ptr->first[i*4+3]);
}

/* Last fragment for each index used */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 1)
      printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->last[i*4], ptr->last[i*4+1],
             ptr->last[i*4+2], ptr->last[i*4+3]);
   else if (ptr->counter[i*4] == 1)
      printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->first[i*4], ptr->first[i*4+1],
             ptr->first[i*4+2], ptr->first[i*4+3]);
}

/* Fragment counts for each index */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 0)
      printf("Fragment count at index %d: %d\n", i, ptr->counter[i*4]);
}

I get this result for an execution where I am drawing a handful of points:

4 fragments recorded:
Fragment 0: (199.5, 150.5, 0.5, 1.0)
Fragment 1: (39.5, 150.5, 0.5, 1.0)
Fragment 2: (79.5, 150.5, 0.5, 1.0)
Fragment 3: (139.5, 150.5, 0.5, 1.0)

First fragment for index 3: (139.5, 150.5, 0.5, 1.0)
First fragment for index 7: (39.5, 150.5, 0.5, 1.0)

Last fragment for index 3: (139.5, 150.5, 0.5, 1.0)
Last fragment for index 7: (79.5, 150.5, 0.5, 1.0)

Fragment count at index 3: 1
Fragment count at index 7: 3

It recorded 4 fragments that the shader mapped to indices 3 and 7. Multiple fragments were assigned to index 7 but we could handle that gracefully by using the corresponding atomic functions. Different executions of the same program will produce the same 4 fragments and map them to the same indices, but the first and last fragments recorded for index 7 can change between executions.

Also notice that the first fragment we recorded in the unsized array (fragments[0]) is not the first fragment recorded for index 7 (fragments[1]). That means that the execution of fragments[0] got first to the unsized array addition code, but the execution of fragments[1] beat it in the race to execute the code that handled the assignment to the first/last arrays, making clear that we cannot make any assumptions regarding the execution order of reads and writes coming from different instances of the same shader execution.

So that’s it, the patches are now in the mesa-dev mailing list undergoing review and will hopefully land soon, so look forward to it! Also, if you have any interesting uses for this new feature, let me know in the comments.

Free access to Valve-produced games on Steam for Mesa contributors

Just like they did for Debian developers before, it is Valve’s way of saying thanks and giving something back to the community. This is great news for all Mesa contributors, now we can play some great Valve games for free and we can also have an easier time looking into bug reports for them, which also works great for Valve, closing a perfect circle 🙂

An introduction to Mesa’s GLSL compiler (II)


My previous post served as an initial look into Mesa’s GLSL compiler, where we discussed the Mesa IR, which is a core aspect of the compiler. In this post I’ll introduce another relevant aspect: IR lowering.

IR lowering

There are multiple lowering passes implemented in Mesa (check src/glsl/lower_*.cpp for a complete list) but they all share a common denominator: their purpose is to re-write certain constructs in the IR so they better fit the underlying GPU hardware.

In this post we will look into the lower_instructions.cpp lowering pass, which rewrites expression operations that may not be supported directly by GPU hardware with different implementations.

The lowering process involves traversing the IR, identifying the instructions we want to lower and modifying the IR accordingly, which fits well into the visitor pattern strategy discussed in my previous post. In this case, expression lowering is handled by the lower_instructions_visitor class, which implements the lowering pass in the visit_leave() method for ir_expression nodes.

The hierarchical visitor class, which serves as the base class for most visitors in Mesa, defines visit() methods for leaf nodes in the IR tree, and visit_leave()/visit_enter() methods for non-leaf nodes. This way, when traversing intermediary nodes in the IR we can decide to take action as soon as we enter them or when we are about to leave them.
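To make the enter/leave distinction concrete, here is a minimal sketch of a hierarchical visitor over a toy tree. The Node, Visitor and TraceVisitor types are made up for illustration and are not Mesa's classes; the sketch only shows the order in which the callbacks fire.

```cpp
#include <string>
#include <vector>

// Toy IR node: leaves have no children. Names are illustrative only.
struct Node {
   std::string name;
   std::vector<Node> children;
};

// Minimal hierarchical visitor: visit() for leaves, visit_enter() and
// visit_leave() for nodes that have children.
struct Visitor {
   virtual ~Visitor() = default;
   virtual void visit(const Node &) {}
   virtual void visit_enter(const Node &) {}
   virtual void visit_leave(const Node &) {}

   void run(const Node &n)
   {
      if (n.children.empty()) {
         visit(n);
         return;
      }
      visit_enter(n);
      for (const Node &c : n.children)
         run(c);
      visit_leave(n);
   }
};

// Records the order in which the callbacks are invoked.
struct TraceVisitor : Visitor {
   std::string trace;
   void visit(const Node &n) override { trace += n.name + " "; }
   void visit_enter(const Node &n) override { trace += "enter:" + n.name + " "; }
   void visit_leave(const Node &n) override { trace += "leave:" + n.name + " "; }
};
```

Doing the work in visit_leave(), as the lowering visitor does, means the operands of a node have already been visited by the time the node itself is processed.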

In the case of our lower_instructions_visitor class, the visit_leave() method implementation is a large switch() statement with all the operators that it can lower.

The code in this file lowers common scenarios that are expected to be useful for most GPU drivers, but individual drivers can still select which of these lowering passes they want to use. For this purpose, hardware drivers create instances of the lower_instructions class passing the list of lowering passes to enable. For example, the Intel i965 driver does:

const int bitfield_insert = brw->gen >= 7
                            ? BITFIELD_INSERT_TO_BFM_BFI
                            : 0;
lower_instructions(shader->base.ir,
                   MOD_TO_FLOOR |
                   DIV_TO_MUL_RCP |
                   SUB_TO_ADD_NEG |
                   EXP_TO_EXP2 |
                   LOG_TO_LOG2 |
                   bitfield_insert |
                   LDEXP_TO_ARITH);

Notice how in the case of Intel GPUs, one of the lowering passes is conditionally selected depending on the hardware involved. In this case, brw->gen >= 7 selects GPU generations since IvyBridge.
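The selection mechanism is a plain bitmask, which can be sketched as follows. The flag values below are hypothetical (Mesa defines its own), and select_lowerings() and the two-argument lowering() helper are names invented for this example.

```cpp
#include <cstdint>

// Hypothetical flag values for illustration; the values Mesa actually
// uses may differ.
enum lowering_flag : std::uint32_t {
   SUB_TO_ADD_NEG             = 1u << 0,
   DIV_TO_MUL_RCP             = 1u << 1,
   MOD_TO_FLOOR               = 1u << 2,
   BITFIELD_INSERT_TO_BFM_BFI = 1u << 3,
};

// Build the mask a driver would pass, enabling one pass only on newer
// hardware generations (mirrors the brw->gen >= 7 test above).
inline std::uint32_t select_lowerings(int gen)
{
   const std::uint32_t bitfield_insert =
      gen >= 7 ? BITFIELD_INSERT_TO_BFM_BFI : 0u;
   return MOD_TO_FLOOR | DIV_TO_MUL_RCP | SUB_TO_ADD_NEG | bitfield_insert;
}

// True when the given pass is enabled in the mask.
inline bool lowering(std::uint32_t mask, lowering_flag flag)
{
   return (mask & flag) != 0;
}
```

With this layout, a visitor can cheaply ask whether a given pass is enabled before rewriting a node.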

Let’s have a look at the implementation of some of these lowering passes. For example, SUB_TO_ADD_NEG is a very simple one that transforms subtractions into negative additions:

void
lower_instructions_visitor::sub_to_add_neg(ir_expression *ir)
{
   ir->operation = ir_binop_add;
   ir->operands[1] =
      new(ir) ir_expression(ir_unop_neg, ir->operands[1]->type,
                            ir->operands[1], NULL);
   this->progress = true;
}

As we can see, the lowering pass simply changes the operator used by the ir_expression node and negates the second operand using the unary negate operator (ir_unop_neg), thus converting the original a = b - c into a = b + (-c).
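The same in-place rewrite can be tried outside of Mesa with a toy expression tree. This is only a sketch: Expr, Op and the helper functions are invented for the example and are much simpler than ir_expression.

```cpp
#include <memory>
#include <utility>

// Toy expression node, loosely modelled on ir_expression. All names
// here are illustrative and not Mesa's actual classes.
enum class Op { Const, Neg, Add, Sub };

struct Expr {
   Op op;
   double value = 0.0;                  // used when op == Op::Const
   std::unique_ptr<Expr> operands[2];
};

std::unique_ptr<Expr> constant(double v)
{
   auto e = std::make_unique<Expr>();
   e->op = Op::Const;
   e->value = v;
   return e;
}

std::unique_ptr<Expr> binary(Op op, std::unique_ptr<Expr> a,
                             std::unique_ptr<Expr> b)
{
   auto e = std::make_unique<Expr>();
   e->op = op;
   e->operands[0] = std::move(a);
   e->operands[1] = std::move(b);
   return e;
}

// Rewrite a - b in place as a + (-b): change the operator and wrap the
// second operand in a negation node.
void sub_to_add_neg(Expr &e)
{
   if (e.op != Op::Sub)
      return;
   auto neg = std::make_unique<Expr>();
   neg->op = Op::Neg;
   neg->operands[0] = std::move(e.operands[1]);
   e.op = Op::Add;
   e.operands[1] = std::move(neg);
}

double eval(const Expr &e)
{
   switch (e.op) {
   case Op::Const: return e.value;
   case Op::Neg:   return -eval(*e.operands[0]);
   case Op::Add:   return eval(*e.operands[0]) + eval(*e.operands[1]);
   case Op::Sub:   return eval(*e.operands[0]) - eval(*e.operands[1]);
   }
   return 0.0;
}
```

Evaluating the tree before and after the rewrite gives the same value; only the shape of the tree changes.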

Of course, if a driver does not have native support for the subtraction operation, it could still do this when it processes the IR to produce native code, but this way Mesa is saving driver developers that work. Also, some lowering passes may enable optimization passes after the lowering that drivers might miss otherwise.

Let’s see a more complex example: MOD_TO_FLOOR. In this case the lowering pass provides an implementation of ir_binop_mod (modulo) for GPUs that don’t have a native modulo operation.

The modulo operation takes two operands (op0, op1) and implements the C equivalent of ‘op0 % op1’, that is, it computes the remainder of the division of op0 by op1. To achieve this the lowering pass rewrites the modulo operation as mod(op0, op1) = op0 - op1 * floor(op0 / op1), which requires only multiplication, division, floor and subtraction. This is the implementation:

ir_variable *x = new(ir) ir_variable(ir->operands[0]->type, "mod_x",
                                      ir_var_temporary);
ir_variable *y = new(ir) ir_variable(ir->operands[1]->type, "mod_y",
                                      ir_var_temporary);
this->base_ir->insert_before(x);
this->base_ir->insert_before(y);

ir_assignment *const assign_x =
   new(ir) ir_assignment(new(ir) ir_dereference_variable(x),
                         ir->operands[0], NULL);
ir_assignment *const assign_y =
   new(ir) ir_assignment(new(ir) ir_dereference_variable(y),
                         ir->operands[1], NULL);

this->base_ir->insert_before(assign_x);
this->base_ir->insert_before(assign_y);

ir_expression *const div_expr =
   new(ir) ir_expression(ir_binop_div, x->type,
                         new(ir) ir_dereference_variable(x),
                         new(ir) ir_dereference_variable(y));

/* Don't generate new IR that would need to be lowered in an additional
 * pass.
 */
if (lowering(DIV_TO_MUL_RCP) && (ir->type->is_float() ||
                                 ir->type->is_double()))
   div_to_mul_rcp(div_expr);

ir_expression *const floor_expr =
   new(ir) ir_expression(ir_unop_floor, x->type, div_expr);

if (lowering(DOPS_TO_DFRAC) && ir->type->is_double())
   dfloor_to_dfrac(floor_expr);

ir_expression *const mul_expr =
   new(ir) ir_expression(ir_binop_mul,
                         new(ir) ir_dereference_variable(y),
                         floor_expr);

ir->operation = ir_binop_sub;
ir->operands[0] = new(ir) ir_dereference_variable(x);
ir->operands[1] = mul_expr;
this->progress = true;

Notice how the first thing this does is to assign the operands to a variable. The reason for this is a bit tricky: since we are going to implement ir_binop_mod as op0 - op1 * floor(op0 / op1), we will need to refer to the IR nodes op0 and op1 twice in the tree. However, we can’t just do that directly, for that would mean that we have the same node (that is, the same pointer) linked from two different places in the IR expression tree. That is, we want to have this tree:

                             sub
                           /     \
                        op0       mult
                                 /    \
                              op1     floor
                                      /   \
                                   op0     op1

Instead of this other tree:

                             sub
                           /     \
                           |      mult
                           |     /   \
                           |   floor  |
                           |     |    |
                           |    div   |
                           |   /   \  |
                            op0     op1

This second version of the tree is problematic. For example, let’s say that a hypothetical optimization pass detects that op1 is a constant integer with value 1, and realizes that in this case div(op0, op1) == op0. When doing that optimization, our div subtree is removed, and with that, op1 could be removed too (and possibly freed), leaving the other reference to that operand in the IR pointing to an invalid memory location… we have just corrupted our IR:

                             sub
                           /     \
                           |      mult
                           |     /    \
                           |   floor   op1 [invalid pointer reference]
                           |     |
                           |    /
                           |   /
                            op0

Instead, what we want to do here is to clone the nodes each time we need a new reference to them in the IR. All IR nodes have a clone() method for this purpose. However, in this particular case, cloning the nodes creates a new problem: op0 and op1 are ir_expression nodes so, for example, op0 could be the expression a + b * c, so cloning the expression would produce suboptimal code where the expression gets replicated. This, at best, will lead to slower compilation times due to optimization passes needing to detect and fix that, and at worst, it would go undetected by the optimizer and lead to worse performance where we compute the value of the expression multiple times:

                              sub
                           /        \
                         add         mult
                        /   \       /    \
                      a     mult  op1     floor
                            /   \          |
                           b     c        div
                                         /   \
                                      add     op1
                                     /   \
                                    a    mult
                                        /    \
                                        b     c

The solution to this problem is to assign the expression to a variable, then dereference that variable (i.e., read its value) wherever we need. Thus, the implementation defines two variables (x, y), assigns op0 and op1 to them and creates new dereference nodes wherever we need to access the value of the op0 and op1 expressions:

                       =               =
                     /   \           /   \
                    x     op0       y     op1

                             sub
                           /     \
                         *x       mult
                                 /    \
                               *y     floor
                                      /   \
                                    *x     *y

In the diagram above, each variable dereference is marked with an ‘*’, and each one is a new IR node (so both appearances of ‘*x’ refer to different IR nodes, both representing two different reads of the same variable). With this solution we only evaluate the op0 and op1 expressions once (when they get assigned to the corresponding variables) and we never refer to the same IR node twice from different places (since each variable dereference is a new IR node).
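The cost difference between cloning and using temporaries can be illustrated with a small sketch in which an "expression" is just a callable that we can count evaluations of. This is a stand-in for an IR subtree, not Mesa's representation, and the function names are invented for the example.

```cpp
#include <cmath>
#include <functional>

// An "expression" here is a callable that yields a value; a stand-in
// for an IR subtree such as a + b * c.
using Expr = std::function<double()>;

// Cloning approach: every use of op0/op1 re-evaluates the expression.
double mod_by_cloning(const Expr &op0, const Expr &op1)
{
   return op0() - op1() * std::floor(op0() / op1());
}

// Temporary-variable approach: evaluate each operand once, then read
// the stored values wherever they are needed.
double mod_by_temporaries(const Expr &op0, const Expr &op1)
{
   const double x = op0();
   const double y = op1();
   return x - y * std::floor(x / y);
}
```

If op0 increments a counter each time it runs, the cloning version bumps it twice per call while the temporaries version bumps it once, which is the saving the lowering pass is after.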

Now that we know why we assign these two variables, let’s continue looking at the code of the lowering pass.

In the next step we implement op0 / op1 using an ir_binop_div expression. To speed up compilation, if the driver has the DIV_TO_MUL_RCP lowering pass enabled, which transforms a / b into a * (1 / b) (where 1 / b could be a native instruction), we immediately execute the lowering pass for that expression. If we didn’t do this here, the resulting IR would contain a division operation that might have to be lowered in a later pass, making the compilation process slower.

The next step uses an ir_unop_floor expression to compute floor(op0 / op1), and again, tests if this operation should be lowered too, which might be the case if the type of the operands is a 64-bit double instead of a regular 32-bit float, since GPUs may only have a native floor instruction for 32-bit floats.

Next, we multiply the result by op1 to get op1 * floor(op0 / op1).

Now we only need to subtract this from op0, which would be the root IR node for this expression. Since we want the new IR subtree hanging from this root node to replace the old implementation, we directly edit the IR node we are lowering: we replace the ir_binop_mod operator with ir_binop_sub, put a dereference of x (which holds op0) in the first operand and link the expression holding op1 * floor(op0 / op1) in the second operand, effectively attaching our new implementation in place of the old version. This is what the original and lowered IRs look like:

Original IR:

[prev inst] -> mod -> [next inst]
              /   \            
           op0     op1         

Lowered IR:

[prev inst] -> var x -> var y ->   =   ->   =   ->   sub   -> [next inst]
                                  / \      / \      /   \
                                 x  op0   y  op1  *x     mult
                                                        /    \
                                                      *y      floor
                                                              /   \
                                                            *x     *y
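As a numeric sanity check of the identity behind this lowering, the formula can be written directly in C++. The function name mod_floor is made up for this sketch.

```cpp
#include <cmath>

// mod(op0, op1) = op0 - op1 * floor(op0 / op1)
double mod_floor(double op0, double op1)
{
   return op0 - op1 * std::floor(op0 / op1);
}
```

For example, mod_floor(7.5, 2.0) is 7.5 - 2.0 * floor(3.75) = 1.5, the same as std::fmod(7.5, 2.0). Note that the floor-based definition and C's truncating remainder disagree when the operands have different signs: mod_floor(-1.0, 3.0) is 2.0, whereas std::fmod(-1.0, 3.0) is -1.0.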

Finally, we set this->progress to true to let the compiler know that we have modified the IR and that, as a consequence, we have introduced new nodes that may be subject to further lowering passes, so it can run a new pass. For example, the subtraction we just added may be lowered again to a negative addition as we have seen before.
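The role of the progress flag can be sketched as a loop that keeps running a pass until nothing changes. The flat list of operations below is a deliberately crude stand-in for the IR, and lower_once() and lower_to_fixed_point() are hypothetical names, not Mesa functions.

```cpp
#include <vector>

// Toy "IR": a flat list of operations. Illustrative only.
enum class Op { Mod, Sub, Add, Neg, Mul, Floor, Div };

// One lowering pass over the list. Returns true if anything changed,
// which plays the role of the progress flag.
bool lower_once(std::vector<Op> &ir)
{
   std::vector<Op> out;
   bool progress = false;
   for (Op op : ir) {
      if (op == Op::Mod) {
         // mod becomes sub, mul, floor, div
         out.insert(out.end(), {Op::Sub, Op::Mul, Op::Floor, Op::Div});
         progress = true;
      } else if (op == Op::Sub) {
         // sub becomes add, neg
         out.insert(out.end(), {Op::Add, Op::Neg});
         progress = true;
      } else {
         out.push_back(op);
      }
   }
   ir = out;
   return progress;
}

// Repeat until a pass makes no further progress; returns how many
// passes actually changed something.
int lower_to_fixed_point(std::vector<Op> &ir)
{
   int passes = 0;
   while (lower_once(ir))
      ++passes;
   return passes;
}
```

Starting from a single Mod, the first pass introduces a Sub, and only the second pass turns that Sub into Add and Neg, which is why the caller needs to know that progress was made.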

Coming up next

Now that we have learnt about lowering passes we can also discuss optimization passes, which are very similar since they are also based on the visitor implementation in Mesa and also transform the Mesa IR in a similar way.

An introduction to Mesa’s GLSL compiler (I)


In my last post I explained that modern 3D pipelines are programmable and how this has impacted graphics drivers. In the following posts we will go deeper into this aspect by looking at different parts of Mesa’s GLSL compiler. Specifically, this post will cover the GLSL parser, the Mesa IR and built-in variables and functions.

The GLSL parser

The job of the parser is to process the shader source code string provided via glShaderSource and transform it into a suitable binary representation that is stored in RAM and can be efficiently processed by other parts of the compiler in later stages.

The parser consists of a set of Lex/Yacc rules to process the incoming shader source. The lexer (glsl_parser.ll) takes care of tokenizing the source code and the parser (glsl_parser.yy) adds meaning to the stream of tokens identified in the lexer stage.

Similarly, just like in C or C++, GLSL includes a pre-processor that goes through the shader source code before the main parser kicks in. Mesa’s implementation of the GLSL pre-processor lives in src/glsl/glcpp and is also based on Lex/Yacc rules.

The output of the parser is an Abstract Syntax Tree (AST) that lives in RAM, which is a binary representation of the shader source code. The nodes that make up this tree are defined in src/glsl/ast.h.

For someone familiar with all the Lex/Yacc stuff, the parser implementation in Mesa should feel familiar enough.

The next step takes care of converting from the AST to a different representation that is better suited for the kind of operations that drivers will have to do with it. This new representation, called the IR (Intermediate Representation), is usually referenced in Mesa as Mesa IR, GLSL IR or simply HIR.

The AST to Mesa IR conversion is driven by the code in src/glsl/ast_to_hir.cpp.

Mesa IR

The Mesa IR is the main data structure used in the compiler. Most of the work that the compiler does can be summarized as:

  • Optimizations in the IR
  • Modifications in the IR for better/easier integration with GPU hardware
  • Linking multiple shaders (multiple IR instances) into a single program.
  • Generating native assembly code for the target GPU from the IR

As we can see, the Mesa IR is at the core of all the work that the compiler has to do, so understanding how it is set up is necessary to work in this part of Mesa.

The nodes in the Mesa IR tree are defined in src/glsl/ir.h. Let’s have a look at the most important ones:

At the top of the class hierarchy for the IR nodes we have exec_node, which is Mesa’s way of linking independent instructions together in a list to make a program. This means that each instruction has previous and next pointers to the instructions that are before and after it respectively. So ir_instruction, the base class for all nodes in the tree, inherits from exec_node.
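The idea of instructions linked through previous and next pointers can be sketched with a toy node type. Mesa's exec_node is more elaborate; the Instruction struct and describe() helper below are invented for illustration only.

```cpp
#include <string>
#include <utility>

// Toy list node with previous/next links. This only illustrates the
// idea and is not how exec_node is actually implemented.
struct Instruction {
   std::string name;
   Instruction *prev = nullptr;
   Instruction *next = nullptr;

   explicit Instruction(std::string n) : name(std::move(n)) {}

   // Link `other` into the list immediately before this instruction.
   void insert_before(Instruction *other)
   {
      other->prev = prev;
      other->next = this;
      if (prev)
         prev->next = other;
      prev = other;
   }
};

// Walk back to the head and render the list as "a -> b -> c".
std::string describe(const Instruction *node)
{
   while (node->prev)
      node = node->prev;
   std::string out;
   for (; node; node = node->next) {
      out += node->name;
      if (node->next)
         out += " -> ";
   }
   return out;
}
```

Inserting two nodes before the same instruction, one after the other, leaves them in insertion order ahead of it, which is handy for passes that need to emit declarations and assignments before the instruction they are rewriting.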

Another important node is ir_rvalue, which is the base class used to represent expressions. Generally, anything that can go on the right side of an assignment is an ir_rvalue. Subclasses of ir_rvalue include ir_expression, used to represent all kinds of unary, binary or ternary operations (supported operators are defined in the ir_expression_operation enumeration), ir_texture, which is used to represent texture operations like a texture lookup, ir_swizzle, which is used for swizzling values in vectors, all the ir_dereference nodes, used to access the values stored in variables, arrays, structs, etc. and ir_constant, used to represent constants of all basic types (bool, float, integer, etc).

We also have ir_variable, which represents variables in the shader code. Notice that the definition of ir_variable is quite large… in fact, this is by far the node with the most impact on the memory footprint of the compiler when compiling shaders in large games/applications. Also notice that the IR differentiates between variables and variable dereferences (the act of reading a variable’s value), which are represented as an ir_rvalue.

Similarly, the IR also defines nodes for other language constructs like ir_loop, ir_if, ir_assignment, etc.

Debugging the IR is not easy, since the representation of a shader program in IR nodes can be quite complex to traverse and inspect with a debugger. To help with this Mesa provides means to print the IR to a human-readable text format. We can enable this by using the environment variable MESA_GLSL=dump. This will instruct Mesa to print both the original shader source code and its IR representation. For example:

$ MESA_GLSL=dump ./test_program

GLSL source for vertex shader 1:
#version 140
#extension GL_ARB_explicit_attrib_location : enable

layout(location = 0) in vec3 inVertexPosition;
layout(location = 1) in vec3 inVertexColor;

uniform mat4 MVP;
smooth out vec3 out0;

void main()
{
  gl_Position = MVP * vec4(inVertexPosition, 1);
  out0 = inVertexColor;
}

GLSL IR for shader 1:
(declare (sys ) int gl_InstanceID)
(declare (sys ) int gl_VertexID)
(declare (shader_out ) (array float 0) gl_ClipDistance)
(declare (shader_out ) float gl_PointSize)
(declare (shader_out ) vec4 gl_Position)
(declare (uniform ) (array vec4 56) gl_CurrentAttribFragMESA)
(declare (uniform ) (array vec4 33) gl_CurrentAttribVertMESA)
(declare (uniform ) gl_DepthRangeParameters gl_DepthRange)
(declare (uniform ) int gl_NumSamples)
(declare () int gl_MaxVaryingComponents)
(declare () int gl_MaxClipDistances)
(declare () int gl_MaxFragmentUniformComponents)
(declare () int gl_MaxVaryingFloats)
(declare () int gl_MaxVertexUniformComponents)
(declare () int gl_MaxDrawBuffers)
(declare () int gl_MaxTextureImageUnits)
(declare () int gl_MaxCombinedTextureImageUnits)
(declare () int gl_MaxVertexTextureImageUnits)
(declare () int gl_MaxVertexAttribs)
(declare (shader_in ) vec3 inVertexPosition)
(declare (shader_in ) vec3 inVertexColor)
(declare (uniform ) mat4 MVP)
(declare (shader_out smooth) vec3 out0)
(function main
  (signature void
    (parameters
    )
    (
      (declare (temporary ) vec4 vec_ctor)
      (assign  (w) (var_ref vec_ctor)  (constant float (1.000000)) )
      (assign  (xyz) (var_ref vec_ctor)  (var_ref inVertexPosition) )
      (assign  (xyzw) (var_ref gl_Position)
            (expression vec4 * (var_ref MVP) (var_ref vec_ctor) ) )
      (assign  (xyz) (var_ref out0)  (var_ref inVertexColor) )
    ))

)

Notice, however, that the IR representation we get is not the one that is produced by the parser. As we will see later, that initial IR will be modified in multiple ways by Mesa, for example by applying different kinds of optimizations, so the IR that we see is the result after all these processing passes over the original IR. Mesa refers to this post-processed version of the IR as LIR (low-level IR) and to the initial version of the IR as produced by the parser as HIR (high-level IR). If we want to print the HIR (or any intermediary version of the IR as it transforms into the final LIR), we can edit the compiler and add calls to _mesa_print_ir as needed.

Traversing the Mesa IR

We mentioned before that some of the compiler’s work (a big part, in fact) has to do with optimizations and modifications of the IR. This means that the compiler needs to traverse the IR tree and identify subtrees that are relevant to these kinds of operations. To achieve this, Mesa uses the visitor design pattern.

Basically, the idea is that we have a visitor object that can traverse the IR tree and we can define the behavior we want to execute when it finds specific nodes.

For instance, there is a very simple example of this in src/glsl/linker.cpp: find_deref_visitor, which detects if a variable is ever read. This involves traversing the IR, identifying ir_dereference_variable nodes (the ones where a variable’s value is accessed) and checking if the name of that variable matches the one we are looking for. Here is the visitor class definition:

/**
 * Visitor that determines whether or not a variable is ever read.
 */
class find_deref_visitor : public ir_hierarchical_visitor {
public:
   find_deref_visitor(const char *name)
      : name(name), found(false)
   {
      /* empty */
   }

   virtual ir_visitor_status visit(ir_dereference_variable *ir)
   {
      if (strcmp(this->name, ir->var->name) == 0) {
         this->found = true;
         return visit_stop;
      }

      return visit_continue;
   }

   bool variable_found() const
   {
      return this->found;
   }

private:
   const char *name;       /**< Find reads of a variable with this name. */
   bool found;             /**< Was a read of the variable found? */
};

And this is how we get to use this, for example to check if the shader code ever reads gl_Vertex:

find_deref_visitor find("gl_Vertex");
find.run(sh->ir);
if (find.variable_found()) {
   /* ... */
}

Most optimization and lowering passes in Mesa are implemented as visitors and follow a similar idea. We will look at examples of these in a later post.

Built-in variables and functions

GLSL defines a set of built-in variables (with ‘gl_’ prefix) for each shader stage which Mesa injects into the shader code automatically. If you look at the example where we used MESA_GLSL=dump to obtain the generated Mesa IR you can see some of these variables.

Mesa implements support for built-in variables in _mesa_glsl_initialize_variables(), defined in src/glsl/builtin_variables.cpp.

Notice that some of these variables are common to all shader stages, while some are specific to particular stages or available only in specific versions of GLSL.

Depending on the type of variable, Mesa or the hardware driver may be able to provide the value immediately (for example for variables holding constant values like gl_MaxVertexAttribs or gl_MaxDrawBuffers). Otherwise, the driver will probably have to fetch (or generate) the value for the variable from the hardware at program run-time by generating native code that is added to the user program. For example, a geometry shader that uses gl_PrimitiveID will need that variable updated for each primitive processed by the Geometry Shader unit in a draw call. To achieve this, a driver might have to generate native code that fetches the current primitive ID value from the hardware and stores it in the register that provides the storage for the gl_PrimitiveID variable before the user code is executed.

The GLSL language also defines a number of built-in functions that must be provided by implementors, like texture(), mix(), or dot(), to name a few examples. The entry point in Mesa’s GLSL compiler for built-in functions is src/glsl/builtin_functions.cpp.

The method builtin_builder::create_builtins() takes care of registering built-in functions, and just like with built-in variables, not all functions are always available: some functions may only be available in certain shading units, others may only be available in certain GLSL versions, etc. For that purpose, each built-in function is registered with a predicate that can be used to test if that function is at all available in a specific scenario.

Built-in functions are registered by calling the add_function() method, which registers all versions of a specific function. For example mix() for float, vec2, vec3, vec4, etc. Each of these versions has its own availability predicate. For instance, mix() is always available for float arguments, but using it with integers requires GLSL 1.30 and the EXT_shader_integer_mix extension.
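A rough sketch of how per-overload availability predicates could work is shown below. The State struct, the predicate functions and the tiny registry are all hypothetical and far simpler than what builtin_functions.cpp does; they only mirror the mix() example from the text.

```cpp
#include <string>
#include <vector>

// Hypothetical compile state; the fields are illustrative.
struct State {
   int glsl_version;       // e.g. 130 for GLSL 1.30
   bool ext_integer_mix;   // stands in for EXT_shader_integer_mix
};

using Predicate = bool (*)(const State &);

bool always_available(const State &) { return true; }

bool v130_and_integer_mix(const State &s)
{
   return s.glsl_version >= 130 && s.ext_integer_mix;
}

// One registered overload: a type name plus its availability predicate.
struct Signature {
   std::string type;
   Predicate avail;
};

// All overloads registered under the name "mix" in this toy registry.
const std::vector<Signature> &mix_signatures()
{
   static const std::vector<Signature> sigs = {
      {"float", always_available},
      {"vec4", always_available},
      {"int", v130_and_integer_mix},
   };
   return sigs;
}

// True if an overload for `type` exists and is available in `state`.
bool mix_available(const std::string &type, const State &state)
{
   for (const Signature &sig : mix_signatures())
      if (sig.type == type)
         return sig.avail(state);
   return false;
}
```

Keeping the predicate next to each signature lets the compiler answer "is this overload usable here?" without hard-coding version checks at every call site.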

Besides the availability predicate, add_function() also takes an ir_function_signature, which tells Mesa about the specific signature of the function being registered. Notice that when Mesa creates signatures for the functions it also defines the function body. For example, the following code snippet defines the signature for modf():

ir_function_signature *
builtin_builder::_modf(builtin_available_predicate avail,
                       const glsl_type *type)
{
   ir_variable *x = in_var(type, "x");
   ir_variable *i = out_var(type, "i");
   MAKE_SIG(type, avail, 2, x, i);

   ir_variable *t = body.make_temp(type, "t");
   body.emit(assign(t, expr(ir_unop_trunc, x)));
   body.emit(assign(i, t));
   body.emit(ret(sub(x, t)));

   return sig;
}

GLSL’s modf() splits a number into its integer and fractional parts. It assigns the integer part to an output parameter and the function return value is the fractional part.

This signature we see above defines input parameter ‘x’ of type ‘type’ (the number we want to split), an output parameter ‘i’ of the same type (which will hold the integer part of ‘x’) and a return type ‘type’.

The function implementation is based on the existence of the unary operator ir_unop_trunc, which can take a number and extract its integer part. Then it computes the fractional part by subtracting that from the original number.
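The same trunc-then-subtract structure can be written in a few lines of C++ for scalar values. The name modf_sketch is made up for this example; it is not a Mesa or GLSL function.

```cpp
#include <cmath>

// Returns the fractional part of x and stores the integer part in i,
// mirroring the trunc-then-subtract structure of the IR above.
double modf_sketch(double x, double &i)
{
   const double t = std::trunc(x);
   i = t;
   return x - t;
}
```

For x = 3.75 this stores 3.0 in i and returns 0.75; for x = -3.75 it stores -3.0 and returns -0.75, since truncation rounds towards zero.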

When the modf() built-in function is used, the call will be expanded to include this IR code, which will later be transformed into native code for the GPU by the corresponding hardware driver. In this case, it means that the hardware driver is expected to provide an implementation of the ir_unop_trunc operator, which in the case of the Intel i965 driver is implemented as a single hardware instruction (see brw_vec4_visitor.cpp or brw_fs_visitor.cpp in src/mesa/drivers/dri/i965).

In some cases, the implementation of a built-in function can’t be defined at the IR level. In this case the implementation simply emits an ad-hoc IR node that drivers can identify and expand appropriately. An example of this is EmitVertex() in a geometry shader. This is not really a function call in the traditional sense, but a way to signal the driver that we have defined all the attributes of a vertex and it is time to “push” that vertex into the current primitive. The meaning of “pushing the vertex” is something that can’t be defined at the IR level because it will be different for each driver/hardware. Because of that, the built-in function simply injects an IR node ir_emit_vertex that drivers can identify and implement properly when the time comes. In the case of the Intel code, pushing a vertex involves a number of steps that are very intertwined with the hardware, but it basically amounts to generating native code that implements the behavior that the hardware expects for that to happen. If you are curious, the implementation of this in the i965 driver code can be found in brw_vec4_gs_visitor.cpp, in the visit() method that takes an ir_emit_vertex IR node as parameter.

Coming up next

In this post we discussed the parser, which is the entry point for the compiler, and introduced the Mesa IR, the main data structure. In following posts we will delve deeper into the GLSL compiler implementation. Specifically, we will look into the lowering and optimization passes as well as the linking process and the hooks for hardware drivers that deal with native code generation.

A brief overview of the 3D pipeline


In the previous post I discussed the Mesa development environment and gave a few tips for newcomers, but before we start hacking on the code we should have a look at what modern GPUs look like, since that has a definite impact on the design and implementation of driver code. Let’s get to it.

Fixed Function vs Programmable hardware

Before the advent of shading languages like GLSL we did not have the option to program the 3D hardware at will. Instead, the hardware would have specific units dedicated to implement certain operations (like vertex transformations) that could only be used through specific APIs, like those exposed by OpenGL. These units are usually labeled as Fixed Function, to differentiate them from modern GPUs that also expose fully programmable units.

What we have now in modern GPUs is a fully programmable pipeline, where graphics developers can code graphics algorithms of various sorts in high level programming languages like GLSL. These programs are then compiled and loaded into the GPU to execute specific tasks. This gives graphics developers a huge amount of freedom and power, since they are no longer limited to preset APIs exposing fixed functionality (like the old OpenGL lighting models for example).

Modern graphics drivers

But of course all this flexibility and power that graphics developers enjoy today come at the expense of significantly more complex hardware and drivers, since the drivers are responsible for exposing all that flexibility to the developers while ensuring that we still obtain the best performance out of the hardware in each scenario.

Rather than merely acting as a bridge between a fixed API like OpenGL and fixed function hardware, drivers also need to handle general purpose graphics programs written in high-level languages. This is a big change. In the case of OpenGL, this means that the driver needs to provide an implementation of the GLSL language, so suddenly, the driver is required to incorporate a full compiler and deal with all sorts of problems that belong to the realm of compilers, like choosing an intermediary representation for the program code (IR), performing optimization passes and generating native code for the GPU.

Overview of a modern 3D pipeline

I have mentioned that modern GPUs expose fully programmable hardware units. These are called shading units, and the idea is that these units are connected in a pipeline so that the output of a shading unit becomes the input of the next. In this model, the application developer pushes vertices to one end of the pipeline and usually obtains rendered pixels on the other side. In between these two ends there are a number of units making this transition possible and a number of these will be programmable, which means that the graphics developer can control how these vertices are transformed into pixels at different stages.

The image below shows a simplified example of a 3D graphics pipeline, in this case as exposed by the OpenGL 4.3 specification. Let’s have a quick look at some of its main parts:

The OpenGL 4.3 3D pipeline (image via

Vertex Shader (VS)

This programmable shading unit takes vertices as input and produces vertices as output. Its main job is to transform these vertices in any way the graphics developer sees fit. Typically, this is where we would do transforms like vertex projection, rotation, translation and, generally, compute per-vertex attributes that we want to provide to later stages in the pipeline.

The vertex shader processes vertex data as provided by APIs like glDrawArrays or glDrawElements and outputs shaded vertices that will be assembled into primitives as indicated by the OpenGL draw command (GL_TRIANGLES, GL_LINES, etc).

Geometry Shader

Geometry shaders are similar to vertex shaders, but instead of operating on individual vertices, they operate on a geometry level (that is, a line, a triangle, etc), so they can take the output of the vertex shader as their input.

The geometry shader unit is programmable and can be used to add or remove vertices from a primitive, clip primitives, spawn entirely new primitives or modify the geometry of a primitive (like transforming triangles into quads or points into triangles, etc). Geometry shaders can also be used to implement basic tessellation even if dedicated tessellation units present in modern hardware are a better fit for this job.

In GLSL, some operations like layered rendering (which allows rendering to multiple textures in the same program) are only accessible through geometry shaders, although this is now also possible in vertex shaders via a particular extension.

The output of a geometry shader is also a set of primitives.

Rasterization

So far all the stages we discussed manipulated vertices and geometry. At some point, however, we need to render pixels. For this, primitives need to be rasterized, which is the process by which they are broken into individual fragments that are then colored by a fragment shader and eventually turn into pixels in a frame buffer. Rasterization is handled by the rasterizer fixed function unit.

The rasterization process also assigns depth information to these fragments. This information is necessary when we have a 3D scene where multiple polygons overlap on the screen and we need to decide which polygon’s fragments should be rendered and which should be discarded because they are hidden by other polygons.

Finally, the rasterization also interpolates per-vertex attributes in order to compute the corresponding fragment values. For example, let’s say that we have a line primitive where each vertex has a different color attribute, one red and one green. For each fragment in the line the rasterizer will compute interpolated color values by combining red and green depending on how close or far the fragments are to each vertex. With this, we will obtain red fragments on the side of the red vertex that will smoothly transition to green as we move closer to the green vertex.
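The red-to-green example above can be sketched with a few lines of shell arithmetic. This is only an illustration under simplifying assumptions: the line is assumed to cover exactly five fragments, colors are 8-bit values, and the blend is plain linear interpolation in integer math. The fragment count and variable names are made up for the example and do not come from any driver code.

```shell
# Hypothetical line primitive covered by 5 fragments (indices 0..4).
# Vertex 0 is red (255,0,0) and vertex 1 is green (0,255,0).
steps=4
i=0
while [ "$i" -le "$steps" ]; do
    # The weight of each vertex color depends on how far the fragment is from it.
    red=$(( 255 * (steps - i) / steps ))
    green=$(( 255 * i / steps ))
    echo "fragment $i: r=$red g=$green b=0"
    i=$(( i + 1 ))
done
```

The first fragment comes out fully red, the last one fully green, and the ones in between get a mix of both.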

In summary, the inputs of the rasterizer are the primitives coming from a vertex, tessellation or geometry shader and the outputs are the fragments that build the primitive’s surface as projected on the screen, including color, depth and other interpolated per-vertex attributes.

Fragment Shader (FS)

The programmable fragment shader unit takes the fragments produced by the rasterization process and executes an algorithm provided by a graphics developer to compute the final color, depth and stencil values for each fragment. This unit can be used to achieve numerous visual effects, including all kinds of post-processing filters. It is also usually where we sample textures to color polygon surfaces, etc.

This covers some of the most important elements in the 3D graphics pipeline and should be sufficient, for now, to understand some of the basics of a driver. Notice, however, that I have not covered things like transform feedback, tessellation or compute shaders. I hope I can get to cover some of these in future posts.

But before we are done with the overview of the 3D pipeline we should cover another topic that is fundamental to how the hardware works: parallelization.

Parallelization

Graphics processing is a very resource-demanding task. We are continuously updating and redrawing our graphics 30 or 60 times per second. For a full HD resolution of 1920×1080 that means that we need to redraw over 2 million pixels in each go (124,416,000 pixels per second if we are doing 60 FPS). That’s a lot.
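For reference, the numbers quoted above can be reproduced with shell arithmetic. This is just a quick sketch that assumes every pixel is drawn exactly once per frame, which ignores things like overlapping polygons; the variable names are made up for the example.

```shell
# Full HD resolution at 60 frames per second.
width=1920
height=1080
fps=60

pixels_per_frame=$(( width * height ))
pixels_per_second=$(( pixels_per_frame * fps ))

echo "pixels per frame:  $pixels_per_frame"
echo "pixels per second: $pixels_per_second"
```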

To cope with this, the architecture of GPUs is massively parallel, which means that the pipeline can process many vertices/pixels simultaneously. For example, in the case of the Intel Haswell GPUs, programmable units like the VS and GS have multiple Execution Units (EUs), each with its own set of ALUs, etc., that can spawn up to 70 threads each (for GS and VS) while the fragment shader can spawn up to 102 threads. But that is not the only source of parallelism: each thread may handle multiple objects (vertices or pixels depending on the case) at the same time. For example, a VS thread in Intel hardware can shade two vertices simultaneously, while an FS thread can shade up to 8 (SIMD8) or 16 (SIMD16) pixels in one go.

Some of these means of parallelism are relatively transparent to the driver developer and some are not. For example, SIMD8 vs SIMD16 or single vertex shading vs double vertex shading require specific configuration and driver code that is aligned with the selected configuration. Threads are more transparent, but in certain situations the driver developer may need to be careful when writing code that can require a sync between all running threads, which would obviously hurt performance, or at least be careful to do that kind of thing when it would hurt performance the least.

Coming up next

So that was a very brief introduction to what modern 3D pipelines look like. There is still plenty of stuff I have not covered but I think we can go through a lot of that in later posts as we dig deeper into the driver code. My next post will discuss how Mesa models several of the programmable pipeline stages I have introduced here, so stay tuned!

Setting up a development environment for Mesa


In my previous post I provided an overview of the Mesa source tree and identified some of its main modules.

Since we are on that subject I thought it would make sense to give a few tips on how to set up the development environment for Mesa too, so here I go.

Development environment

Mesa is mostly written in a combination of C and C++, uses autotools for its build system and Git for version control, so it should be a fairly familiar environment for many people. I am not going to explain how to build autotools projects here, as there is plenty of documentation available on that subject. Instead, I will focus on the specifics of Mesa.

First we need to check out the source code. If you do not have a developer account then do an anonymous checkout:

# git clone git://

If you do have a developer account do this instead:

# git clone git+ssh://

Next, we will have to deal with dependencies. This should not be too hard though. Mesa is fairly low in the software stack so it does not have many and the ones it has seem to have a fairly stable API and don’t change too often, so typically, you should be able to build Mesa if you have a recent distribution and you keep it up to date. For reference, as of now I can build Mesa on my Ubuntu 14.04 without any problems.

In any case, the actual dependencies you will need to get may vary depending on the drivers you want to build, the target platform and the features you want to enable. For example, the R300 Gallium driver requires LLVM, but the Intel i965 driver doesn’t.

Notice, however, that if you are hacking on features that require specific builds of the XServer, Wayland/Weston or similar stuff the required setup will be more complex, since you would probably need to include these other projects into the mix, together with their respective dependencies.

Configuring the source tree

Here I will mention some of the Mesa-specific options that I have found to be most useful in my time with Mesa:

--enable-debug: This is necessary, at least, to get assertions to work, and you want this while you are developing. Mesa and the drivers have assertions in many places to make sure that new code does not break certain assumptions or violate hardware constraints, so you really want to make sure that you have these activated when you are developing. It also adds “-g -O0” to enable debug support.

--with-dri-drivers: This is the list of classic Mesa DRI drivers you want to build. If you know you will only hack on the i965 driver, for example, then building other drivers will only slow down your builds.

--with-gallium-drivers: This is the list of Gallium drivers you want to build. Again, if you are hacking on the classic DRI i965 driver you are probably not interested in building any Gallium drivers.

Notice that if you are working on the Mesa framework layer, that is, the bits shared by all drivers, instead of the internals of a specific driver, you will probably want to include more drivers in the build to make sure that they keep building after your changes.

--with-egl-platforms: This is a list of supported platforms. Same as with the options above, you probably only want to build Mesa for the platform or platforms you are working on.

Besides using a combination of these options, you probably also want to set your CFLAGS and CXXFLAGS (remember that Mesa uses both C and C++). I for one like to pass “-g3”, for example.
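Putting these options together, a configuration for someone who only hacks on the i965 driver could look roughly like the sketch below. The install prefix is a hypothetical location, and both the empty Gallium driver list and the x11,drm platform list are assumptions about a typical setup rather than a recommendation. The script only prints the command so it is harmless to try.

```shell
# Hypothetical install prefix, kept outside the system directories.
PREFIX="$HOME/mesa-install"

# Assumed setup: only the classic i965 DRI driver, no Gallium drivers,
# and only the X11 and DRM EGL platforms. Adjust to what you work on.
CONFIGURE_ARGS="--prefix=$PREFIX --enable-debug --with-dri-drivers=i965 --with-gallium-drivers= --with-egl-platforms=x11,drm"

# Print the command instead of running it.
echo "CFLAGS=-g3 CXXFLAGS=-g3 ./autogen.sh $CONFIGURE_ARGS"
```

The printed command is meant to be run from the top of the Mesa source tree, followed by the usual make.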

Using your built version of Mesa

Once you have built Mesa you can type ‘make install’ to install the libraries and drivers. Probably, you have configured autotools (via the --prefix option) to do this to a safe location that does not conflict with your distribution installation of Mesa and now your problem is to tell your OpenGL programs that they should use this version of Mesa instead of the one provided by your distro.

You will have to adjust a couple of environment variables for this:

LIBGL_DRIVERS_PATH: Set this to the path where your built drivers have been installed. This will tell Mesa’s loader to look for the drivers here.

LD_LIBRARY_PATH: Set this to the path where your Mesa libraries have been installed. This will make it so that OpenGL programs load your recently built libraries rather than your system’s.
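As a minimal sketch, the two variables could be set like this before launching an OpenGL program from the same shell. The prefix is hypothetical, and the lib and lib/dri subdirectories are an assumption about where a default install puts things, so check your own install tree.

```shell
# Hypothetical prefix; use whatever you passed to --prefix when configuring.
PREFIX="$HOME/mesa-install"

# Assumption: libraries are in $PREFIX/lib and DRI drivers in $PREFIX/lib/dri.
# Any existing LD_LIBRARY_PATH is kept, but our build is searched first.
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LIBGL_DRIVERS_PATH="$PREFIX/lib/dri"

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo "LIBGL_DRIVERS_PATH=$LIBGL_DRIVERS_PATH"
```

Running a program with LIBGL_DEBUG=verbose afterwards is a simple way to confirm that the driver is really being opened from the new location.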

For more tips I’d suggest reading this short thread on the Mesa mailing list, which has some Mesa developers discussing their development environment setup.

Coming up next

In the next post I will provide an introduction to modern 3D graphics hardware. After all, the job of the graphics driver is all about programming the hardware, so having a basic understanding of how it works is a requirement if we want to do any meaningful driver development.

An eagle eye view into the Mesa source tree


My last post introduced Mesa’s loader as the module that takes care of auto-selecting the right driver for our hardware. If the loader fails to find a suitable hardware driver it will fall back to a software driver, but we can also force this situation ourselves, which may come in handy in some scenarios. We also took a quick look at the glxinfo tool that we can use to query the capabilities and features exposed by the selected driver.

The topic of today focuses on providing a quick overview of the Mesa source code tree, which will help us identify the parts of the code that are relevant to our interests depending on the driver and/or the feature we intend to work on.

Browsing the source code

First off, there is already some documentation on this topic available on the Mesa 3D website that is a good place to start. Since that already gives some insight into what goes into each part of the repository, I’ll focus on complementing that information with a little more detail for some of the most important parts I have interacted with so far:

  • In src/egl/ we have the implementation of the EGL standard. If you are working on EGL-specific features, tracking down an EGL-specific problem or you are simply curious about how EGL links into the GL implementation, this is the place you want to visit. This includes the EGL implementations for the X11, DRM and Wayland platforms.
  • In src/glx/ we have the OpenGL bits relating specifically to X11 platforms, known as GLX. So if you are working on the GLX layer, this is the place to go. Here there is all the stuff that takes care of interacting with the XServer, the client-side DRI implementation, etc.
  • src/glsl/ contains a critical aspect of Mesa: the GLSL compiler used by all Mesa drivers. It includes a GLSL parser, the definition of the Mesa IR, also referred to as GLSL IR, used to represent shader programs internally, the shader linker and various optimization passes that operate on the Mesa IR. The resulting Mesa IR produced by the GLSL compiler is then consumed by the various drivers which transform it into native GPU code that can be loaded and run in the hardware.
  • src/mesa/main/ contains the core Mesa elements. This includes hardware-independent views of core objects like textures, buffers, vertex array objects, the OpenGL context, etc as well as basic infrastructure, like linked lists.
  • src/mesa/drivers/ contains the actual classic drivers (not Gallium). DRI drivers in particular go into src/mesa/drivers/dri. For example the Intel i965 driver goes into src/mesa/drivers/dri/i965. The code here is, for the most part, very specific to the underlying hardware platforms.
  • src/mesa/swrast*/ and src/mesa/tnl*/ provide software implementations for things like rasterization or vertex transforms. Used by some software drivers and also by some hardware drivers to implement certain features for which they don’t have hardware support or for which hardware support is not yet available in the driver. For example, the i965 driver implements operations on the accumulation and selection buffers in software via these modules.
  • src/mesa/vbo/ is another important module. Across its various versions, OpenGL has specified many ways in which a program can tell OpenGL about its vertex data, from using functions of the glVertex*() family inside glBegin()/glEnd() blocks, to things like vertex arrays, vertex array objects, display lists, etc. The drivers, however, do not need to deal with all this: Mesa makes it so that they always receive their vertex data as a collection of vertex arrays, significantly reducing complexity on the side of the driver implementer. This is the module that takes care of managing all this, so no matter what type of drawing your GL program is doing or how it specifies its vertex data, it will always go through this module before it reaches the driver.
  • src/loader/, as we have seen in my previous post, contains the Mesa driver loader, which provides the logic necessary to decide which Mesa driver is the right one to use for a specific hardware so that Mesa can auto-select the right driver when loaded.
  • src/gallium/ contains the Gallium3D framework implementation. If, like me, you only work on a classic driver, you don’t need to care about the contents of this at all. If you are working on Gallium drivers however, this is the place where you will find the various Gallium drivers in development (inside src/gallium/drivers/), like the various Gallium ATI/AMD drivers, Nouveau or the LLVM based software driver (llvmpipe) and the Gallium state trackers.

So with this in mind, one should have enough information to know where to start looking for something specific:

  • If we are interested in how vertex data provided to OpenGL is manipulated and uploaded to the GPU, the vbo module is probably the right place to look.
  • If we are looking to work on a specific aspect of a concrete hardware driver, we should go to the corresponding directory in src/mesa/drivers/ if it is a classic driver, or src/gallium/drivers if it is a Gallium driver.
  • If we want to know about how Mesa, the framework, abstracts various OpenGL concepts like textures, vertex array objects, shader programs, etc. we should look into src/mesa/main/.
  • If we are interested in the platform specific support, be it EGL or GLX, we want to look into src/egl or src/glx.
  • If we are interested in the GLSL implementation, which involves anything from the compiler to the intermediate representation (IR) and the various optimization passes, we need to look into src/glsl/.

Coming up next

So now that we have an eagle-eye view of the contents of the Mesa repository let’s see how we can prepare a development environment so we can start hacking on some stuff. I’ll cover this in my next post.

Driver loading and querying in Mesa


In my previous post I explained that Mesa is a framework for OpenGL driver development. As such, it provides code that can be reused by multiple driver implementations. This code is, of course, hardware agnostic, but frees driver developers from doing a significant part of the work. The framework also provides hooks for developers to add the bits of code that deal with the actual hardware. This design allows multiple drivers to co-exist and share a significant amount of code.

I also explained that among the various drivers that Mesa provides, we can find both hardware drivers that take advantage of a specific GPU and software drivers that are implemented entirely in software (so they work on the CPU and do not depend on a specific GPU). The latter are obviously slower, but as I discussed, they may come in handy in some scenarios.

Driver selection

So, Mesa provides multiple drivers, but how does it select the one that fits the requirements of a specific system?

You have probably noticed that Mesa is deployed in multiple packages. In my Ubuntu system, the one that deploys the DRI drivers is libgl1-mesa-dri:amd64. If you check its contents you will see that this package installs OpenGL drivers for various GPUs:

# dpkg -L libgl1-mesa-dri:amd64 

Since I have a relatively recent Intel GPU, the driver I need is the i965 one. So how do we tell Mesa that this is the one we need? Well, the answer is that we don’t: Mesa is smart enough to know which driver is the right one for our GPU, and selects it automatically when the OpenGL library is loaded. The part of Mesa that takes care of this is called the ‘loader’.

You can, however, point Mesa to look for suitable drivers in a specific directory other than the default, or force it to use a software driver using various environment variables.

What driver is Mesa actually loading?

If you want to know exactly what driver Mesa is loading, you can instruct it to dump this (and other) information to stderr via the LIBGL_DEBUG environment variable:

# LIBGL_DEBUG=verbose glxgears 
libGL: screen 0 does not appear to be DRI3 capable
libGL: pci id for fd 4: 8086:0126, driver i965
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/tls/
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/

So we see that Mesa checks the existing hardware and realizes that the i965 driver is the one to use, so it first attempts to load the TLS version of that driver and, since I don’t have the TLS version, falls back to the normal version, which I do have.
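If you only care about which driver was picked, the relevant line can be filtered out of that output. The sketch below works on the sample line shown above rather than on live output, and the sed expressions are just one possible way to pull the fields out, assuming the line keeps this exact format.

```shell
# Sample line copied from the output above. On a real system you would feed
# the stderr of "LIBGL_DEBUG=verbose glxgears" to the same sed commands.
log_line='libGL: pci id for fd 4: 8086:0126, driver i965'

# Extract the driver name that follows ", driver ".
driver=$(printf '%s\n' "$log_line" | sed -n 's/.*, driver \([A-Za-z0-9_]*\).*/\1/p')

# Extract the vendor:device PCI id (four hex digits each).
pci_id=$(printf '%s\n' "$log_line" | sed -n 's/.*: \([0-9a-f]\{4\}:[0-9a-f]\{4\}\),.*/\1/p')

echo "driver=$driver pci_id=$pci_id"
```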

The code in src/loader/loader.c (loader_get_driver_for_fd) is the one responsible for detecting the right driver to use (i965 in my case). This receives a device fd as input parameter that is acquired previously by calling DRI2Connect() as part of the DRI bring up process. Then the actual driver file is loaded in glx/dri_common.c (driOpenDriver).

We can also obtain a more descriptive indication of the driver we are loading by using the glxinfo program that comes with the mesa-utils package:

# glxinfo | grep -i "opengl renderer"
OpenGL renderer string: Mesa DRI Intel(R) Sandybridge Mobile 

This tells me that I am using the Intel hardware driver, and it also shares information related to the specific Intel GPU I have (SandyBridge).

Forcing a software driver

I have mentioned that having software drivers available comes in handy at times, but how do we tell the loader to use them? Mesa provides an environment variable that we can set for this purpose, so switching between a hardware driver and a software one is very easy to do:

# LIBGL_DEBUG=verbose LIBGL_ALWAYS_SOFTWARE=1 glxgears
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/tls/
libGL: OpenDriver: trying /usr/lib/x86_64-linux-gnu/dri/

As we can see, setting LIBGL_ALWAYS_SOFTWARE will make the loader select a software driver (swrast).

If I force a software driver and call glxinfo like I did before, this is what I get:

# LIBGL_ALWAYS_SOFTWARE=1 glxinfo | grep -i "opengl renderer"
OpenGL renderer string: Software Rasterizer

So it is clear that I am using a software driver in this case.

Querying the driver for OpenGL features

The glxinfo program also comes in handy to obtain information about the specific OpenGL features implemented by the driver. If you want to check if the Mesa driver for your hardware implements a specific OpenGL extension you can inspect the output of glxinfo and look for that extension:

# glxinfo | grep GL_ARB_texture_multisample

You can also ask glxinfo to include hardware limits for certain OpenGL features by including the -l switch. For example:

# glxinfo -l | grep GL_MAX_TEXTURE_SIZE
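One caveat with a plain grep is that it also matches extension names that merely contain the one you are looking for. A small helper that matches whole names only can be sketched as follows. The has_extension function and the sample extension list are made up for this example; the helper assumes the comma-separated list format that glxinfo prints.

```shell
# Hypothetical sample of a comma-separated extension list.
sample_extensions='GL_ARB_texture_multisample, GL_ARB_texture_rg, GL_ARB_timer_query'

# Reads an extension list on stdin, puts each name on its own line and
# succeeds only if one line is exactly the requested name.
has_extension() {
    tr ', ' '\n\n' | grep -qx "$1"
}

if printf '%s\n' "$sample_extensions" | has_extension GL_ARB_texture_multisample; then
    echo "GL_ARB_texture_multisample: supported"
fi
```

With a real driver the same helper can be fed directly from glxinfo, for example: glxinfo | has_extension GL_ARB_texture_multisample.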

Coming up next

In my next posts I will cover the directory structure of the Mesa repository, identifying its main modules, which should give Mesa newcomers some guidance as to where they should look when they need to find the code that deals with something specific. We will then discuss how modern 3D hardware has changed the way GPU drivers are developed and explain how a modern 3D graphics pipeline works, which should pave the way to start looking into the real guts of Mesa: the implementation of shaders.