Implementing ARB_shader_storage_buffer

In my previous post I introduced ARB_shader_storage_buffer, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I’ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.

Following the trail of UBOs

As I explained in my previous post, SSBOs are similar to UBOs, but they are read-write. Because there is a lot of code already in place in Mesa’s GLSL compiler to deal with UBOs, it made sense to try and reuse all the data structures and code we had for UBOs and specialize the behavior for SSBOs where that was needed, that allows us to build on code paths that are already working well and reuse most of the code.

That path, however, had some issues that bit me a bit further down the road. When it comes to representing these operations in the IR, my first idea was to follow the trail of UBO loads as well, which are represented as ir_expression nodes. There is a fundamental difference between the two though: UBO loads are constant operations because uniform buffers are read-only. This means that a UBO load operation with the same parameters will always return the same value. This has implications related to certain optimization passes that work based on the assumption that other ir_expression operations share this feature. SSBO loads are not like this: since the shader storage buffer is read-write, two identical SSBO load operations in the same shader may not return the same result if the underlying buffer storage has been altered in between by SSBO write operations within the same or other threads. This forced me to alter a number of optimization passes in Mesa to deal with this situation (mostly disabling them for the cases of SSBO loads and stores).

The situation was worse with SSBO stores. These just did not fit into ir_expression nodes: they did not return a value and had side-effects (memory writes) so we had to come up with a different way to represent them. My initial implementation created a new IR node for these, ir_ssbo_store. That worked well enough, but it left us with an implementation of loads and stores that was a bit inconsistent since both operations used very different IR constructs.

These issues were made clear during the review process, where it was suggested that we used GLSL IR intrinsics to represent load and store operations instead. This has the benefit that we can make the implementation more consistent, having both loads and stores represented with the same IR construct and follow a similar treatment in both the GLSL compiler and the i965 backend. It would also remove the need to disable or alter certain optimization passes to be SSBO friendly.

Read/Write coherence

One of the issues we detected early in development was that our reads and writes did not seem to work very well together: some times a read after a write would fail to see the last value written to a buffer variable. The problem here also spawned from following the implementation trail of the UBO path. In the Intel hardware, there are various interfaces to access memory, like the Sampling Engine and the Data Port. The former is a read-only interface and is used, for example, for texture and UBO reads. The Data Port allows for read-write access. Although both interfaces give access to the same memory region, there is something to consider here: if you mix reads through the Sampling Engine and writes through the Data Port you can run into cache coherence issues, this is because the caches in use by the Sampling Engine and the Data Port functions are different. Initially, we implemented SSBO load operations like UBO loads, so we used the Sampling Engine, and ended up running into this problem. The solution, of course, was to rewrite SSBO loads to go though the Data Port as well.

Parallel reads and writes

GPUs are highly parallel hardware and this has some implications for driver developers. Take a sentence like this in a fragment shader program:

float cx = 1.0;

This is a simple assignment of the value 1.0 to variable cx that is supposed to happen for each fragment produced. In Intel hardware running in SIMD16 mode, we process 16 fragments simultaneously in the same GPU thread, this means that this instruction is actually 16 elements wide. That is, we are doing 16 assignments of the value 1.0 simultaneously, each one is stored at a different offset into the GPU register used to hold the value of cx.

If cx was a buffer variable in a SSBO, it would also mean that the assignment above should translate to 16 memory writes to the same offset into the buffer. That may seem a bit absurd: why would we want to write 16 times if we are always assigning the same value? Well, because things can get more complex, like this:

float cx = gl_FragCoord.x;

Now we are no longer assigning the same value for all fragments, each of the 16 values assigned with this instruction could be different. If cx was a buffer variable inside a SSBO, then we could be potentially writing 16 different values to it. It is still a bit silly, since only one of the values (the one we write last), would prevail.

Okay, but what if we do something like this?:

int index = int(mod(gl_FragCoord.x, 8));
cx[index] = 1;

Now, depending on the value we are reading for each fragment, we are writing to a separate offset into the SSBO. We still have a single assignment in the GLSL program, but that translates to 16 different writes, and in this case the order may not be relevant, but we want all of them to happen to achieve correct behavior.

The bottom line is that when we implement SSBO load and store operations, we need to understand the parallel environment in which we are running and work with test scenarios that allow us to verify correct behavior in these situations. For example, if we only test scenarios with assignments that give the same value to all the fragments/vertices involved in the parallel instructions (i.e. assignments of values that do not depend on properties of the current fragment or vertex), we could easily overlook fundamental defects in the implementation.

Dealing with helper invocations

From Section 7.1 of the GLSL spec version 4.5:

“Fragment shader helper invocations execute the same shader code
as non-helper invocations, but will not have side effects that
modify the framebuffer or other shader-accessible memory.”

To understand what this means I have to introduce the concept of helper invocations: certain operations in the fragment shader need to evaluate derivatives (explicitly or implicitly) and for that to work well we need to make sure that we compute values for adjacent fragments that may not be inside the primitive that we are rendering. The fragment shader executions for these added fragments are called helper invocations, meaning that they are only needed to help in computations for other fragments that are part of the primitive we are rendering.

How does this affect SSBOs? Because helper invocations are not part of the primitive, they cannot have side-effects, after they had served their purpose it should be as if they had never been produced, so in the case of SSBOs we have to be careful not to do memory writes for helper fragments. Notice also, that in a SIMD16 execution, we can have both proper and helper fragments mixed in the group of 16 fragments we are handling in parallel.

Of course, the hardware knows if a fragment is part of a helper invocation or not and it tells us about this through a pixel mask register that is delivered with all executions of a fragment shader thread, this register has a bitmask stating which pixels are proper and which are helper. The Intel hardware also provides developers with various kinds of messages that we can use, via the Data Port interface, to write to memory, however, the tricky thing is that not all of them incorporate pixel mask information, so for use cases where you need to disable writes from helper fragments you need to be careful with the write message you use and select one that accepts this sort of information.

Vector alignments

Another interesting thing we had to deal with are address alignments. UBOs work with layout std140. In this setup, elements in the UBO definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense, however, as I explained in my previous post, SSBOs also introduce a packed layout mode known as std430.

Intel hardware provides a number of messages that we can use through the Data Port interface to write to memory. Each message has different characteristics that makes it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages have the capacity to write data in chunks of 16-bytes (that is, they write vec4 elements, or OWORDS in the language of the technical docs). One could think that these messages are great when you work with vector data types, however, they also introduce the problem of dealing with partial writes: what happens when you only write to an element of a vector? or to a buffer variable that is smaller than the size of a vector? what if you write columns in a row_major matrix? etc

In these scenarios, using these messages introduces the need to mask the writes because you need to disable the channels in the vec4 element that you don’t want to write. Of course, the hardware provides means to do this, we only need to set the writemask of the destination register of the message instruction to select the right channels. Consider this example:

struct TB {
    float a, b, c, d;
};

layout(std140, binding=0) buffer Fragments {
   TB s[3];
   int index;
};

void main()
{
   s[0].d = -1.0;
}

In this case, we could use a 16-byte write message that takes 0 as offset (i.e writes at the beginning of the buffer, where s[0] is stored) and then set the writemask on that instruction to WRITEMASK_W so that only the fourth data element is actually written, this way we only write one data element of 4 bytes (-1) at offset 12 bytes (s[0].d). Easy, right? However, how do we know, in general, the writemask that we need to use? In std140 layout mode this is easy: since each element in the SSBO is aligned to a 16-byte boundary, we simply need to take the byte offset at which we are writing, divide it by 16 (to convert it to units of vec4) and the modulo of that operation is the byte offset into the chunk of 16-bytes that we are writing into, then we only have to divide that by 4 to get the component slot we need to write to (a number between 0 and 3).

However, there is a restriction: we can only set the writemask of a register at compile/link time, so what happens when we have something like this?:

s[i].d = -1.0;

The problem with this is that we cannot evaluate the value of i at compile/link time, which inevitably makes our solution invalid for this. In other words, if we cannot evaluate the actual value of the offset at which we are writing at compile/link time, we cannot use the writemask to select the channels we want to use when we don’t want to write a vec4 worth of data and we have to use a different type of message.

That said, in the case of std140 layout mode, since each data element in the SSBO is aligned to a 16-byte boundary you may realize that the actual value of i is irrelevant for the purpose of the modulo operation discussed above and we can still manage to make things work by completely ignoring it for the purpose of computing the writemask, but in std430 that trick won’t work at all, and even in std140 we would still have row_major matrix writes to deal with.

Also, we may need to tweak the message depending on whether we are running on the vertex shader or the fragment shader because not all message types have appropriate SIMD modes (SIMD4x2, SIMD8, SIMD16, etc) for both, or because different hardware generations may not have all the message types or support all the SIMD modes we need need, etc

The point of this is that selecting the right message to use can be tricky, there are multiple things and corner cases to consider and you do not want to end up with an implementation that requires using many different messages depending on various circumstances because of the increasing complexity that it would add to the implementation and maintenance of the code.

Closing notes

This post did not cover all the intricacies of the implementation of ARB_shader_storage_buffer_object, I did not discuss things like the optional unsized array or the compiler details of std430 for example, but, hopefully, I managed to give an idea of the kind of problems one would have to deal with when coding driver support for this or other similar features.