Bringing ARB_shader_storage_buffer_object to Mesa and i965

In the last weeks I have been working together with my colleague Samuel on bringing support for ARB_shader_storage_buffer_object, an OpenGL 4.3 feature, to Mesa and the Intel i965 driver, so I figured I would write a bit on what this brings to OpenGL/GLSL users. If you are interested, read on.

Introducing Shader Storage Buffer Objects

This extension introduces the concept of shader storage buffer objects (SSBOs), which is a new type of OpenGL buffer. SSBOs allow GL clients to create buffers that shaders can then map to variables (known as buffer variables) via interface blocks. If you are familiar with Uniform Buffer Objects (UBOs), SSBOs are pretty similar but:

  • They are read/write, unlike UBOs, which are read-only.
  • They allow a number of atomic operations on them.
  • They allow an optional unsized array at the bottom of their definitions.

Since SSBOs are read/write, they create a bidirectional channel of communication between the GPU and CPU spaces: the GL application can set the value of shader variables by writing to a regular OpenGL buffer, but the shader can also update the values stored in that buffer by assigning values to them in the shader code, making the changes visible to the GL application. This is a major difference with UBOs.

In a parallel environment such as a GPU where we can have multiple shader instances running simultaneously (processing multiple vertices or fragments from a specific rendering call) we should be careful when we use SSBOs. Since all these instances will be simultaneously accessing the same buffer there are implications to consider relative to the order of reads and writes. The spec does not make many guarantees about the order in which these take place, other than ensuring that the order of reads and writes within a specific execution of a shader is preserved. Thus, it is up to the graphics developer to ensure, for example, that each execution of a fragment or vertex shader writes to a different offset into the underlying buffer, or that writes to the same offset always write the same value. Otherwise the results would be undefined, since they would depend on the order in which writes and reads from different instances happen in a particular execution.

The spec also allows to use glMemoryBarrier() from shader code and glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) from a GL application to add sync points. These ensure that all memory accesses to buffer variables issued before the barrier are completely executed before moving on.

Another tool for developers to deal with concurrent accesses is atomic operations. The spec introduces a number of new atomic memory functions for use with buffer variables: atomicAdd, atomicMin, atomicMax, atomicAnd, atomicOr, atomicXor, atomicExchange (atomic assignment to a buffer variable), atomicCompSwap (atomic conditional assignment to a buffer variable).

The optional unsized array at the bottom of an SSBO definition can be used to push a dynamic number of entries to the underlying buffer storage, up to the total size of the buffer allocated by the GL application.

Using shader storage buffer objects (GLSL)

Okay, so how do we use SSBOs? We will introduce this through an example: we will use a buffer to record information about the fragments processed by the fragment shader. Specifically, we will group fragments according to their X coordinate (by computing an index from the coordinate using a modulo operation). We will then record how many fragments are assigned to a particular index, the first fragment to be assigned to a given index, the last fragment assigned to a given index, the total number of fragments processed and the complete list of fragments processed.

To store all this information we will use the SSBO definition below:

layout(std140, binding=0) buffer SSBOBlock {
   vec4 first[8];     // first fragment coordinates assigned to index
   vec4 last[8];      // last fragment coordinates assigned to index
   int counter[8];    // number of fragments assigned to index
   int total;         // number of fragments processed
   vec4 fragments[];  // coordinates of all fragments processed
};

Notice the use of the keyword buffer to tell the compiler that this is a shader storage buffer object. Also notice that we have included an unsized array called fragments[], there can only be one of these in an SSBO definition, and in case there is one, it has to be the last field defined.

In this case we are using std140 layout mode, which imposes certain alignment rules for the buffer variables within the SSBO, like in the case of UBOs. These alignment rules may help the driver implement read/write operations more efficiently since the underlying GPU hardware can usually read and write faster from and to aligned addresses. The downside of std140 is that because of these alignment rules we also waste some memory and we need to know the alignment rules on the GL side if we want to read/write from/to the buffer. Specifically for SSBOs, the specification introduces a new layout mode: std430, which removes these alignment restrictions, allowing for a more efficient memory usage implementation, but possibly at the expense of some performance impact.

The binding keyword, just like in the case of UBOs, is used to select the buffer that we will be reading from and writing to when accessing these variables from the shader code. It is the application’s responsibility to bound the right buffer to the binding point we specify in the shader code.

So with that done, the shader can read from and write to these variables as we see fit, but we should be aware of the fact that multiple instances of the shader could be reading from and writing to them simultaneously. Let’s look at the fragment shader that stores the information we want into the SSBO:

void main() {
   int index = int(mod(gl_FragCoord.x, 8));

   int i = atomicAdd(counter[index], 1);
   if (i == 0)
      first[index] = gl_FragCoord;
   else
      last[index] = gl_FragCoord;

   i = atomicAdd(total, 1);
   fragments[i] = gl_FragCoord;
}

The first line computes an index into our integer array buffer variable by using gl_FragCoord. Notice that different fragments could get the same index. Next we increase in one unit counter[index]. Since we know that different fragments can get to do this at the same time we use an atomic operation to make sure that we don’t lose any increments.

Notice that if two fragments can write to the same index, reading the value of counter[index] after the atomicAdd can lead to different results. For example, if two fragments have already executed the atomicAdd, and assuming that counter[index] is initialized to 0, then both would read counter[index] == 2, however, if only one of the fragments has executed the atomic operation by the time it reads counter[index] it would read a value of 1, while the other fragment would read a value of 2 when it reaches that point in the shader execution. Since our shader intends to record the coordinates of the first fragment that writes to counter[index], that won’t work for us. Instead, we use the return value of the atomic operation (which returns the value that the buffer variable had right before changing it) and we write to first[index] only when that value was 0. Because we use the atomic operation to read the previous value of counter[index], only one fragment will read a value of 0, and that will be the fragment that first executed the atomic operation.

If this is not the first fragment assigned to that index, we write to last[index] instead. Again, multiple fragments assigned to the same index could do this simultaneously, but that is okay here, because we only care about the the last write. Also notice that it is possible that different executions of the same rendering command produce different values of first[] and last[].

The remaining instructions unconditionally push the fragment coordinates to the unsized array. We keep the last index into the unsized array fragments[] we have written to in the buffer variable total. Each fragment will atomically increase total before writing to the unsized array. Notice that, once again, we have to be careful when reading the value of total to make sure that each fragment reads a different value and we never have two fragments write to the same entry.

Using shader storage buffer objects (GL)

On the side of the GL application, we need to create the buffer, bind it to the appropriate binding point and initialize it. We do this as usual, only that we use the new GL_SHADER_STORAGE_BUFFER target:

typedef struct {
   float first[8*4];      // vec4[8]
   float last[8*4];       // vec4[8]
   int counter[8*4];      // int[8] padded as per std140
   int total;             // int
   int pad[3];            // padding: as per std140 rules
   char fragments[1024];  // up to 1024 bytes of unsized array
} SSBO;

SSBO data;

(...)

memset(&data, 0, sizeof(SSBO));

GLuint buf;
glGenBuffers(1, &buf);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buf);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(SSBO), &data, GL_DYNAMIC_DRAW);

The code creates a buffer, binds it to binding point 0 of GL_SHADER_STORAGE_BUFFER (the same we have bound our shader to) and initializes the buffer data to 0. Notice that because we are using std140 we have to be aware of the alignment rules at work. We could have used std430 instead to avoid this.

Since we have 1024 bytes for the fragments[] unsized array and we are pushing a vec4 (16 bytes) worth of data to it with every fragment we process then we have enough room for 64 fragments. It is the developer’s responsibility to ensure that this limit is not surpassed, otherwise we would write beyond the allocated space for our buffer and the results would be undefined.

The next step is to do some rendering so we get our shaders to work. That would trigger the execution of our fragment shader for each fragment produced, which will generate writes into our buffer for each buffer variable the shader code writes to. After rendering, we can map the buffer and read its contents from the GL application as usual:

SSBO *ptr = (SSBO *) glMapNamedBuffer(buf, GL_READ_ONLY);

/* List of fragments recorded in the unsized array */
printf("%d fragments recorded:\n", ptr->total);
float *coords = (float *) ptr->fragments;
for (int i = 0; i < ptr->total; i++, coords +=4) {
   printf("Fragment %d: (%.1f, %.1f, %.1f, %.1f)\n",
          i, coords[0], coords[1], coords[2], coords[3]);
}

/* First fragment for each index used */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 0)
      printf("First fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->first[i*4], ptr->first[i*4+1],
             ptr->first[i*4+2], ptr->first[i*4+3]);
}

/* Last fragment for each index used */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 1)
      printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->last[i*4], ptr->last[i*4+1],
             ptr->last[i*4+2], ptr->last[i*4+3]);
   else if (ptr->counter[i*4] == 1)
      printf("Last fragment for index %d: (%.1f, %.1f, %.1f, %.1f)\n",
             i, ptr->first[i*4], ptr->first[i*4+1],
             ptr->first[i*4+2], ptr->first[i*4+3]);
}

/* Fragment counts for each index */
for (int i = 0; i < 8; i++) {
   if (ptr->counter[i*4] > 0)
      printf("Fragment count at index %d: %d\n", i, ptr->counter[i*4]);
}
glUnmapNamedBuffer(buf);

I get this result for an execution where I am drawing a handful of points:

4 fragments recorded:
Fragment 0: (199.5, 150.5, 0.5, 1.0)
Fragment 1: (39.5, 150.5, 0.5, 1.0)
Fragment 2: (79.5, 150.5, 0.5, 1.0)
Fragment 3: (139.5, 150.5, 0.5, 1.0)

First fragment for index 3: (139.5, 150.5, 0.5, 1.0)
First fragment for index 7: (39.5, 150.5, 0.5, 1.0)

Last fragment for index 3: (139.5, 150.5, 0.5, 1.0)
Last fragment for index 7: (79.5, 150.5, 0.5, 1.0)

Fragment count at index 3: 1
Fragment count at index 7: 3

It recorded 4 fragments that the shader mapped to indices 3 and 7. Multiple fragments where assigned to index 7 but we could handle that gracefully by using the corresponding atomic functions. Different executions of the same program will produce the same 4 fragments and map them to the same indices, but the first and last fragments recorded for index 7 can change between executions.

Also notice that the first fragment we recorded in the unsized array (fragments[0]) is not the first fragment recorded for index 7 (fragments[1]). That means that the execution of fragments[0] got first to the unsized array addition code, but the execution of fragments[1] beat it in the race to execute the code that handled the assignment to the first/last arrays, making clear that we cannot make any assumptions regarding the execution order of reads and writes coming from different instances of the same shader execution.

So that’s it, the patches are now in the mesa-dev mailing list undergoing review and will hopefully land soon, so look forward to it! Also, if you have any interesting uses for this new feature, let me know in the comments.

2 comments

  1. Thank you Igor for your work on free and opensource graphic drivers. I was really excited to see some development on OpenGL 4.3 in Mesa!