{"id":496,"date":"2015-07-08T09:02:47","date_gmt":"2015-07-08T07:02:47","guid":{"rendered":"http:\/\/blogs.igalia.com\/itoral\/?p=496"},"modified":"2015-07-08T09:02:47","modified_gmt":"2015-07-08T07:02:47","slug":"implementing-arb_shader_storage_buffer","status":"publish","type":"post","link":"https:\/\/blogs.igalia.com\/itoral\/2015\/07\/08\/implementing-arb_shader_storage_buffer\/","title":{"rendered":"Implementing ARB_shader_storage_buffer"},"content":{"rendered":"<p>In <a target=\"_blank\" href=\"http:\/\/blogs.igalia.com\/itoral\/2015\/05\/20\/bringing-arb_shader_storage_buffer_object-to-mesa-and-i965\/\">my previous post<\/a> I introduced <em>ARB_shader_storage_buffer<\/em>, an <em>OpenGL 4.3<\/em> feature that is coming soon to <em>Mesa<\/em> and the <em>Intel i965 driver<\/em>. While that post focused on explaining the features introduced by the extension, in this post I&#8217;ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. Be warned that some parts of this post will be specific to Intel hardware.<\/p>\n<p><strong>Following the trail of UBOs<\/strong><\/p>\n<p>As I explained in my previous post, <em>SSBOs<\/em> are similar to <em>UBOs<\/em>, but they are read-write. Because there is a lot of code already in place in Mesa&#8217;s GLSL compiler to deal with <em>UBOs<\/em>, it made sense to try and reuse all the data structures and code we had for <em>UBOs<\/em> and specialize the behavior for <em>SSBOs<\/em> where that was needed, that allows us to build on code paths that are already working well and reuse most of the code.<\/p>\n<p>That path, however, had some issues that bit me a bit further down the road. When it comes to representing these operations in the IR, my first idea was to follow the trail of <em>UBO<\/em> loads as well, which are represented as <em>ir_expression<\/em> nodes. 
There is a fundamental difference between the two though: <em>UBO<\/em> loads are constant operations because uniform buffers are read-only. This means that a <em>UBO<\/em> load operation with the same parameters will always return the same value. This has implications for certain optimization passes that work on the assumption that other <em>ir_expression<\/em> operations share this property. <em>SSBO<\/em> loads are not like this: since the shader storage buffer is read-write, two identical <em>SSBO<\/em> load operations in the same shader may not return the same result if the underlying buffer storage has been altered in between by <em>SSBO<\/em> write operations within the same or other threads. This forced me to alter a number of optimization passes in Mesa to deal with this situation (mostly disabling them for the cases of <em>SSBO<\/em> loads and stores).<\/p>\n<p>The situation was worse with <em>SSBO<\/em> stores. These just did not fit into <em>ir_expression<\/em> nodes: they did not return a value and had side-effects (memory writes), so we had to come up with a different way to represent them. My initial implementation created a new IR node for these, <em>ir_ssbo_store<\/em>. That worked well enough, but it left us with an implementation of loads and stores that was a bit inconsistent, since the two operations used very different IR constructs.<\/p>\n<p>These issues were made clear during the review process, where it was suggested that we use <em>GLSL IR intrinsics<\/em> to represent load and store operations instead. This has the benefit of making the implementation more consistent, with both loads and stores represented with the same IR construct and following a similar treatment in both the GLSL compiler and the i965 backend. 
It would also remove the need to disable or alter certain optimization passes to make them <em>SSBO<\/em> friendly.<\/p>\n<p><strong>Read\/Write coherence<\/strong><\/p>\n<p>One of the issues we detected early in development was that our reads and writes did not seem to work very well together: sometimes a read after a write would fail to see the last value written to a buffer variable. The problem here also stemmed from following the implementation trail of the <em>UBO<\/em> path. In the Intel hardware, there are various interfaces to access memory, like the Sampling Engine and the Data Port. The former is a read-only interface and is used, for example, for texture and <em>UBO<\/em> reads. The Data Port allows for read-write access. Although both interfaces give access to the same memory region, there is something to consider here: if you mix reads through the <em>Sampling Engine<\/em> and writes through the <em>Data Port<\/em> you can run into cache coherence issues, because the caches used by the Sampling Engine and the Data Port functions are different. Initially, we implemented <em>SSBO<\/em> load operations like <em>UBO<\/em> loads, so we used the <em>Sampling Engine<\/em>, and ended up running into this problem. The solution, of course, was to rewrite <em>SSBO<\/em> loads to go through the <em>Data Port<\/em> as well.<\/p>\n<p><strong>Parallel reads and writes<\/strong><\/p>\n<p>GPUs are highly parallel hardware and this has some implications for driver developers. Take a statement like this in a fragment shader program:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nfloat cx = 1.0;\r\n<\/pre>\n<p>This is a simple assignment of the value <em>1.0<\/em> to variable <em>cx<\/em> that is supposed to happen for each fragment produced. In Intel hardware running in <em>SIMD16<\/em> mode, we process 16 fragments simultaneously in the same GPU thread, which means that this instruction is actually 16 elements wide. 
That is, we are doing 16 assignments of the value <em>1.0<\/em> simultaneously, each one stored at a different offset into the GPU register used to hold the value of <em>cx<\/em>.<\/p>\n<p>If <em>cx<\/em> were a <em>buffer variable<\/em> in an <em>SSBO<\/em>, it would also mean that the assignment above should translate to 16 memory writes to the same offset into the buffer. That may seem a bit absurd: why would we want to write 16 times if we are always assigning the same value? Well, because things can get more complex, like this:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nfloat cx = gl_FragCoord.x;\r\n<\/pre>\n<p>Now we are no longer assigning the same value for all fragments; each of the 16 values assigned with this instruction could be different. If cx were a buffer variable inside an <em>SSBO<\/em>, then we could potentially be writing 16 different values to it. It is still a bit silly, since only one of the values (the one we write last) would prevail.<\/p>\n<p>Okay, but what if we do something like this?:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nint index = int(mod(gl_FragCoord.x, 8));\r\ncx&#x5B;index] = 1;\r\n<\/pre>\n<p>Now, depending on the value we read for each fragment, we are writing to a separate offset into the <em>SSBO<\/em> (cx now being an array in the buffer). We still have a single assignment in the GLSL program, but it translates to 16 different writes; in this case the order may not be relevant, but we want all of them to happen to achieve correct behavior.<\/p>\n<p>The bottom line is that when we implement <em>SSBO<\/em> load and store operations, we need to understand the parallel environment in which we are running and work with test scenarios that allow us to verify correct behavior in these situations. For example, if we only test scenarios with assignments that give the same value to all the fragments\/vertices involved in the parallel instructions (i.e. 
assignments of values that do not depend on properties of the current fragment or vertex), we could easily overlook fundamental defects in the implementation.<\/p>\n<p><strong>Dealing with helper invocations<\/strong><\/p>\n<p>From Section 7.1 of the GLSL spec version 4.5:<\/p>\n<blockquote><p>&#8220;Fragment shader helper invocations execute the same shader code<br \/>\n as non-helper invocations, but will not have side effects that<br \/>\n modify the framebuffer or other shader-accessible memory.&#8221;\n<\/p><\/blockquote>\n<p>To understand what this means I have to introduce the concept of <a href=\"https:\/\/www.opengl.org\/sdk\/docs\/man4\/html\/gl_HelperInvocation.xhtml\" target=\"_blank\">helper invocations<\/a>: certain operations in the fragment shader need to evaluate derivatives (explicitly or implicitly), and for that to work well we need to make sure that we compute values for adjacent fragments that may not be inside the primitive that we are rendering. The fragment shader executions for these added fragments are called <em>helper invocations<\/em>, meaning that they are only needed to help in computations for other fragments that are part of the primitive we are rendering.<\/p>\n<p>How does this affect <em>SSBOs<\/em>? Because helper invocations are not part of the primitive, they cannot have side-effects: once they have served their purpose it should be as if they had never been produced, so in the case of <em>SSBOs<\/em> we have to be careful not to do memory writes for helper fragments. Notice also that in a <em>SIMD16<\/em> execution, we can have both proper and helper fragments mixed in the group of 16 fragments we are handling in parallel.<\/p>\n<p>Of course, the hardware knows whether a fragment is part of a helper invocation or not, and it tells us about this through a <em>pixel mask register<\/em> that is delivered with all executions of a fragment shader thread; this register has a bitmask stating which pixels are proper and which are helpers. 
The Intel hardware also provides developers with various kinds of messages that we can use, via the Data Port interface, to write to memory. However, the tricky thing is that not all of them incorporate pixel mask information, so for use cases where you need to disable writes from helper fragments you have to be careful with the write message you use and select one that accepts this sort of information.<\/p>\n<p><strong>Vector alignments<\/strong><\/p>\n<p>Another interesting thing we had to deal with is address alignments. <em>UBOs<\/em> work with layout <em>std140<\/em>. In this setup, elements in the <em>UBO<\/em> definition are aligned to 16-byte boundaries (the size of a vec4). It turns out that GPUs can usually optimize reads and writes to multiples of 16 bytes, so this makes sense. However, as I explained in my previous post, <em>SSBOs<\/em> also introduce a packed layout mode known as <em>std430<\/em>.<\/p>\n<p>Intel hardware provides a number of messages that we can use through the <em>Data Port<\/em> interface to write to memory. Each message has different characteristics that make it more suitable for certain scenarios, like the pixel mask I discussed before. For example, some of these messages have the capacity to write data in chunks of 16 bytes (that is, they write vec4 elements, or <em>OWORDS<\/em> in the language of the technical docs). One could think that these messages are great when you work with vector data types; however, they also introduce the problem of dealing with partial writes: what happens when you write to only one element of a vector? Or to a buffer variable that is smaller than the size of a vector? What if you write columns in a row_major matrix? And so on.<\/p>\n<p>In these scenarios, using these messages introduces the need to mask the writes, because you need to disable the channels in the vec4 element that you don&#8217;t want to write. 
Of course, the hardware provides means to do this: we only need to set the writemask of the destination register of the message instruction to select the right channels. Consider this example:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nstruct TB {\r\n    float a, b, c, d;\r\n};\r\n\r\nlayout(std140, binding=0) buffer Fragments {\r\n   TB s&#x5B;3];\r\n   int index;\r\n};\r\n\r\nvoid main()\r\n{\r\n   s&#x5B;0].d = -1.0;\r\n}\r\n<\/pre>\n<p>In this case, we could use a 16-byte write message that takes 0 as offset (i.e. writes at the beginning of the buffer, where <em>s[0]<\/em> is stored) and then set the writemask on that instruction to <em>WRITEMASK_W<\/em> so that only the fourth data element is actually written. This way we write only one data element of 4 bytes (<em>-1<\/em>) at offset 12 bytes (<em>s[0].d<\/em>). Easy, right? However, how do we know, in general, the writemask that we need to use? In <em>std140<\/em> layout mode this is easy: since each element in the <em>SSBO<\/em> is aligned to a 16-byte boundary, we simply take the byte offset at which we are writing modulo 16, which gives us the byte offset into the 16-byte chunk we are writing into, and then divide that by 4 to get the component slot we need to write to (a number between 0 and 3).<\/p>\n<p>However, there is a restriction: we can only set the writemask of a register at compile\/link time, so what happens when we have something like this?:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\ns&#x5B;i].d = -1.0;\r\n<\/pre>\n<p>The problem is that we cannot evaluate the value of <em>i<\/em> at compile\/link time, which inevitably makes this approach invalid here. 
In other words, if we cannot evaluate the actual value of the offset at which we are writing at compile\/link time, we cannot use the writemask to select the channels to write when we want to write less than a full vec4 worth of data, and we have to use a different type of message.<\/p>\n<p>That said, in the case of <em>std140<\/em> layout mode, since each data element in the <em>SSBO<\/em> is aligned to a 16-byte boundary, you may realize that the actual value of <em>i<\/em> is irrelevant for the purpose of the modulo operation discussed above, so we can still manage to make things work by completely ignoring it when computing the writemask. In <em>std430<\/em> that trick won&#8217;t work at all, and even in <em>std140<\/em> we would still have <em>row_major matrix<\/em> writes to deal with.<\/p>\n<p>Also, we may need to tweak the message depending on whether we are running in the vertex shader or the fragment shader, because not all message types have appropriate <em>SIMD<\/em> modes (<em>SIMD4x2, SIMD8, SIMD16,<\/em> etc.) for both, or because different hardware generations may not have all the message types or support all the <em>SIMD<\/em> modes we need.<\/p>\n<p>The point of all this is that selecting the right message to use can be tricky: there are multiple things and corner cases to consider, and you do not want to end up with an implementation that requires many different messages depending on various circumstances, because of the complexity that adds to implementing and maintaining the code.<\/p>\n<p><strong>Closing notes<\/strong><\/p>\n<p>This post did not cover all the intricacies of the implementation of <em>ARB_shader_storage_buffer_object<\/em>; I did not discuss things like the optional unsized array or the compiler details of <em>std430<\/em>, for example. Hopefully, though, I managed to give an idea of the kind of problems one has to deal with when coding driver support for this or other 
similar features.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In my previous post I introduced ARB_shader_storage_buffer, an OpenGL 4.3 feature that is coming soon to Mesa and the Intel i965 driver. While that post focused on explaining the features introduced by the extension, in this post I&#8217;ll dive into some of the implementation aspects, for those who are curious about this kind of stuff. &hellip; <a href=\"https:\/\/blogs.igalia.com\/itoral\/2015\/07\/08\/implementing-arb_shader_storage_buffer\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Implementing ARB_shader_storage_buffer&#8221;<\/span><\/a><\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-496","post","type-post","status-publish","format-standard","hentry","category-graphics"],"_links":{"self":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts\/496","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/comments?post=496"}],"version-history":[{"count":6,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts\/496\/revisions"}],"predecessor-version":[{"id":502,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts\/496\/revisions\/502"}],"wp:attachment":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/media?parent=496"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/categories?post=496"},{"taxonomy":"post_tag","embeddable":true,"
href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/tags?post=496"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}