{"id":454,"date":"2015-03-06T15:06:10","date_gmt":"2015-03-06T14:06:10","guid":{"rendered":"http:\/\/blogs.igalia.com\/itoral\/?p=454"},"modified":"2015-03-06T15:06:10","modified_gmt":"2015-03-06T14:06:10","slug":"an-introduction-to-mesas-glsl-compiler-ii","status":"publish","type":"post","link":"https:\/\/blogs.igalia.com\/itoral\/2015\/03\/06\/an-introduction-to-mesas-glsl-compiler-ii\/","title":{"rendered":"An introduction to Mesa\u2019s GLSL compiler (II)"},"content":{"rendered":"<p><strong>Recap<\/strong><\/p>\n<p><a href=\"http:\/\/blogs.igalia.com\/itoral\/2015\/03\/03\/an-introduction-to-mesas-glsl-compiler-i\/\" title=\"An introduction to Mesa\u2019s GLSL compiler (I)\" target=\"_blank\">My previous post<\/a> served as an initial look into Mesa&#8217;s GLSL compiler, where we discussed the Mesa IR, which is a core aspect of the compiler. In this post I&#8217;ll introduce another relevant aspect: IR lowering.<\/p>\n<p><strong>IR lowering<\/strong><\/p>\n<p>There are multiple lowering passes implemented in Mesa (check <em>src\/glsl\/lower_*.cpp<\/em> for a complete list) but they all share a common denominator: their purpose is to rewrite certain constructs in the IR so that they better fit the underlying GPU hardware.<\/p>\n<p>In this post we will look into the <em>lower_instructions.cpp<\/em> lowering pass, which rewrites expression operations that the GPU hardware may not support directly into alternative implementations built from operations it does support.<\/p>\n<p>The lowering process involves traversing the IR, identifying the instructions we want to lower and modifying the IR accordingly, which fits well into the visitor pattern strategy discussed in my previous post. 
In this case, expression lowering is handled by the <em>lower_instructions_visitor<\/em> class, which implements the lowering pass in the <em>visit_leave()<\/em> method for ir_expression nodes.<\/p>\n<p>The hierarchical visitor class, which serves as the base class for most visitors in Mesa, defines <em>visit()<\/em> methods for leaf nodes in the IR tree, and <em>visit_leave()\/visit_enter()<\/em> methods for non-leaf nodes. This way, when traversing intermediate nodes in the IR we can decide to take action as soon as we enter them or when we are about to leave them.<\/p>\n<p>In the case of our <em>lower_instructions_visitor<\/em> class, the <em>visit_leave()<\/em> method implementation is a large <em>switch()<\/em> statement with all the operators that it can lower.<\/p>\n<p>The code in this file lowers common scenarios that are expected to be useful for most GPU drivers, but individual drivers can still select which of these lowering passes they want to use. For this purpose, hardware drivers call <em>lower_instructions()<\/em>, passing a bitmask with the lowering passes to enable. For example, the Intel i965 driver does:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nconst int bitfield_insert = brw-&gt;gen &gt;= 7\r\n                            ? BITFIELD_INSERT_TO_BFM_BFI\r\n                            : 0;\r\nlower_instructions(shader-&gt;base.ir,\r\n                   MOD_TO_FLOOR |\r\n                   DIV_TO_MUL_RCP |\r\n                   SUB_TO_ADD_NEG |\r\n                   EXP_TO_EXP2 |\r\n                   LOG_TO_LOG2 |\r\n                   bitfield_insert |\r\n                   LDEXP_TO_ARITH);\r\n<\/pre>\n<p>Notice how in the case of Intel GPUs, one of the lowering passes is conditionally selected depending on the hardware involved. In this case, <em>brw->gen >= 7<\/em> selects GPU generations from <em>IvyBridge<\/em> onwards.<\/p>\n<p>Let&#8217;s have a look at the implementation of some of these lowering passes. 
For example, <em>SUB_TO_ADD_NEG<\/em> is a very simple one that transforms subtractions into additions of a negated operand:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nvoid\r\nlower_instructions_visitor::sub_to_add_neg(ir_expression *ir)\r\n{\r\n   ir-&gt;operation = ir_binop_add;\r\n   ir-&gt;operands&#x5B;1] =\r\n      new(ir) ir_expression(ir_unop_neg, ir-&gt;operands&#x5B;1]-&gt;type,\r\n                            ir-&gt;operands&#x5B;1], NULL);\r\n   this-&gt;progress = true;\r\n}\r\n<\/pre>\n<p>As we can see, the lowering pass simply changes the operator used by the <em>ir_expression<\/em> node, and negates the second operand using the unary negate operator (ir_unop_neg), thus converting the original <em>a = b &#8211; c<\/em> into <em>a = b + (-c)<\/em>.<\/p>\n<p>Of course, if a driver does not have native support for the subtraction operation, it could still do this when it processes the IR to produce native code, but this way Mesa is saving driver developers that work. Also, some lowering passes may enable optimization passes after the lowering that drivers might otherwise miss.<\/p>\n<p>Let&#8217;s see a more complex example: <em>MOD_TO_FLOOR<\/em>. In this case the lowering pass provides an implementation of <em>ir_binop_mod<\/em> (modulo) for GPUs that don&#8217;t have a native modulo operation.<\/p>\n<p>The modulo operation takes two operands (op0, op1) and implements the equivalent of C&#8217;s &#8216;<em>op0 % op1<\/em>&#8217;, that is, it computes the remainder of the division of op0 by op1. To achieve this the lowering pass rewrites the modulo operation as <em>mod(op0, op1) = op0 &#8211; op1 * floor(op0 \/ op1)<\/em>, which requires only multiplication, division, subtraction and a floor operation. 
This is the implementation:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\nir_variable *x = new(ir) ir_variable(ir-&gt;operands&#x5B;0]-&gt;type, &quot;mod_x&quot;,\r\n                                     ir_var_temporary);\r\nir_variable *y = new(ir) ir_variable(ir-&gt;operands&#x5B;1]-&gt;type, &quot;mod_y&quot;,\r\n                                     ir_var_temporary);\r\nthis-&gt;base_ir-&gt;insert_before(x);\r\nthis-&gt;base_ir-&gt;insert_before(y);\r\n\r\nir_assignment *const assign_x =\r\n   new(ir) ir_assignment(new(ir) ir_dereference_variable(x),\r\n                         ir-&gt;operands&#x5B;0], NULL);\r\nir_assignment *const assign_y =\r\n   new(ir) ir_assignment(new(ir) ir_dereference_variable(y),\r\n                         ir-&gt;operands&#x5B;1], NULL);\r\n\r\nthis-&gt;base_ir-&gt;insert_before(assign_x);\r\nthis-&gt;base_ir-&gt;insert_before(assign_y);\r\n\r\nir_expression *const div_expr =\r\n   new(ir) ir_expression(ir_binop_div, x-&gt;type,\r\n                         new(ir) ir_dereference_variable(x),\r\n                         new(ir) ir_dereference_variable(y));\r\n\r\n\/* Don't generate new IR that would need to be lowered in an additional\r\n * pass.\r\n *\/\r\nif (lowering(DIV_TO_MUL_RCP) &amp;&amp; (ir-&gt;type-&gt;is_float() ||\r\n    ir-&gt;type-&gt;is_double()))\r\n   div_to_mul_rcp(div_expr);\r\n\r\nir_expression *const floor_expr =\r\n   new(ir) ir_expression(ir_unop_floor, x-&gt;type, div_expr);\r\n\r\nif (lowering(DOPS_TO_DFRAC) &amp;&amp; ir-&gt;type-&gt;is_double())\r\n   dfloor_to_dfrac(floor_expr);\r\n\r\nir_expression *const mul_expr =\r\n   new(ir) ir_expression(ir_binop_mul,\r\n                         new(ir) ir_dereference_variable(y),\r\n                         floor_expr);\r\n\r\nir-&gt;operation = ir_binop_sub;\r\nir-&gt;operands&#x5B;0] = new(ir) ir_dereference_variable(x);\r\nir-&gt;operands&#x5B;1] = mul_expr;\r\nthis-&gt;progress = true;\r\n<\/pre>\n<p>Notice how the first thing this 
does is to assign the operands to variables. The reason for this is a bit tricky: since we are going to implement <em>ir_binop_mod<\/em> as <em>op0 &#8211; op1 * floor(op0 \/ op1)<\/em>, we will need to refer to the IR nodes op0 and op1 twice in the tree. However, we can&#8217;t just do that directly, for that would mean that we have the same node (that is, the same pointer) linked from two different places in the IR expression tree. What we want is this tree:<\/p>\n<pre>\r\n                             sub\r\n                           \/     \\\r\n                        op0       mult\r\n                                 \/    \\\r\n                              op1     floor\r\n                                        |\r\n                                       div\r\n                                      \/   \\\r\n                                   op0     op1\r\n<\/pre>\n<p>Instead of this other tree:<\/p>\n<pre>\r\n\r\n                             sub\r\n                           \/     \\\r\n                           |      mult\r\n                           |     \/   \\\r\n                           |   floor  |\r\n                           |     |    |\r\n                           |    div   |\r\n                           |   \/   \\  |\r\n                            op0     op1   \r\n<\/pre>\n<p>This second version of the tree is problematic. For example, let&#8217;s say that a hypothetical optimization pass detects that <em>op1<\/em> is a constant integer with value <em>1<\/em>, and realizes that in this case <em>div(op0\/op1) == op0<\/em>. 
When doing that optimization, our <em>div<\/em> subtree is removed, and with that, <em>op1<\/em> could be removed too (and possibly freed), leaving the other reference to that operand in the IR pointing to an invalid memory location&#8230; we have just corrupted our IR:<\/p>\n<pre>\r\n                             sub\r\n                           \/     \\\r\n                           |      mult\r\n                           |     \/    \\\r\n                           |   floor   op1 [invalid pointer reference]\r\n                           |     |\r\n                           |    \/\r\n                           |   \/\r\n                            op0 \r\n<\/pre>\n<p>Instead, what we want to do here is to clone the nodes each time we need a new reference to them in the IR. All IR nodes have a <em>clone()<\/em> method for this purpose. However, in this particular case, cloning the nodes creates a new problem: <em>op0<\/em> and <em>op1<\/em> are <em>ir_expression<\/em> nodes, so <em>op0<\/em> could be, for example, the expression <em>a + b * c<\/em>; cloning it would then produce suboptimal code where the expression gets replicated. 
This, at best, will lead to slower compilation times due to optimization passes needing to detect and fix that, and at worst, it would go undetected by the optimizer and lead to worse performance where we compute the value of the expression multiple times:<\/p>\n<pre>\r\n                              sub\r\n                           \/        \\\r\n                         add         mult\r\n                        \/   \\       \/    \\\r\n                      a     mult  op1     floor\r\n                            \/   \\          |\r\n                           b     c        div\r\n                                         \/   \\\r\n                                      add     op1\r\n                                     \/   \\\r\n                                    a    mult\r\n                                        \/    \\\r\n                                        b     c\r\n<\/pre>\n<p>The solution to this problem is to assign the expression to a variable, then dereference that variable (i.e., read its value) wherever we need it. 
Thus, the implementation defines two variables <em>(x, y)<\/em>, assigns <em>op0<\/em> and <em>op1<\/em> to them and creates new dereference nodes wherever we need to access the value of the <em>op0<\/em> and <em>op1<\/em> expressions:<\/p>\n<pre>\r\n                       =               =\r\n                     \/   \\           \/   \\\r\n                    x     op0       y     op1\r\n\r\n\r\n                             sub\r\n                           \/     \\\r\n                         *x       mult\r\n                                 \/    \\\r\n                               *y     floor\r\n                                        |\r\n                                       div\r\n                                      \/   \\\r\n                                    *x     *y\r\n<\/pre>\n<p>In the diagram above, each variable dereference is marked with an <em>&#8216;*&#8217;<\/em>, and each one is a new IR node (so the two appearances of <em>&#8216;*x&#8217;<\/em> are different IR nodes, each representing a separate read of the same variable). With this solution we only evaluate the <em>op0<\/em> and <em>op1<\/em> expressions once (when they get assigned to the corresponding variables) and we never refer to the same IR node twice from different places (since each variable dereference is a new IR node).<\/p>\n<p>Now that we know why we assign these two variables, let&#8217;s continue looking at the code of the lowering pass.<\/p>\n<p>In the next step we implement <em>op0 \/ op1<\/em> using an ir_binop_div expression. To speed up compilation, if the driver has the <em>DIV_TO_MUL_RCP<\/em> lowering pass enabled, which transforms <em>a \/ b<\/em> into <em>a * (1 \/ b)<\/em> (where the reciprocal <em>1 \/ b<\/em> could be a native instruction), we immediately execute the lowering pass for that expression. 
If we didn&#8217;t do this here, the resulting IR would contain a division operation that might have to be lowered in a later pass, making the compilation process slower.<\/p>\n<p>The next step uses an <em>ir_unop_floor<\/em> expression to compute <em>floor(op0\/op1)<\/em>, and again, tests if this operation should be lowered too, which might be the case if the type of the operands is a 64-bit double instead of a regular 32-bit float, since GPUs may only have a native floor instruction for 32-bit floats.<\/p>\n<p>Next, we multiply the result by <em>op1<\/em> to get <em>op1 * floor(op0 \/ op1)<\/em>.<\/p>\n<p>Now we only need to subtract this from op0, which would be the root IR node for this expression. Since we want the new IR subtree hanging from this root node to replace the old implementation, we directly edit the IR node we are lowering to replace the <em>ir_binop_mod<\/em> operator with <em>ir_binop_sub<\/em>, make a dereference to <em>op0<\/em> (through the variable <em>x<\/em>) in the first operand and link the expression holding <em>op1 * floor(op0 \/ op1)<\/em> in the second operand, effectively attaching our new implementation in place of the old version. 
This is what the original and lowered IRs look like:<\/p>\n<p>Original IR:<\/p>\n<pre>\r\n[prev inst] -> mod -> [next inst]\r\n              \/   \\            \r\n           op0     op1         \r\n<\/pre>\n<p>Lowered IR:<\/p>\n<pre>\r\n[prev inst] -> var x -> var y ->   =   ->   =   ->   sub   -> [next inst]\r\n                                  \/ \\      \/ \\      \/   \\\r\n                                 x  op0   y  op1  *x     mult\r\n                                                        \/    \\\r\n                                                      *y      floor\r\n                                                                |\r\n                                                               div\r\n                                                              \/   \\\r\n                                                            *x     *y\r\n<\/pre>\n<p>Finally, we set <em>this->progress<\/em> to true to let the compiler know that we have modified the IR and that, as a consequence, we have introduced new nodes that may be subject to further lowering passes, so it can run a new pass. For example, the subtraction we just added may be lowered again to a negative addition as we have seen before.<\/p>\n<p><strong>Coming up next<\/strong><\/p>\n<p>Now that we have learnt about lowering passes we can also discuss optimization passes, which are very similar: they are also based on Mesa&#8217;s visitor implementation and transform the <em>Mesa IR<\/em> in much the same way.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recap My previous post served as an initial look into Mesa&#8217;s GLSL compiler, where we discussed the Mesa IR, which is a core aspect of the compiler. In this post I&#8217;ll introduce another relevant aspect: IR lowering. 
IR lowering There are multiple lowering passes implemented in Mesa (check src\/glsl\/lower_*.cpp for a complete list) but they &hellip; <a href=\"https:\/\/blogs.igalia.com\/itoral\/2015\/03\/06\/an-introduction-to-mesas-glsl-compiler-ii\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;An introduction to Mesa\u2019s GLSL compiler (II)&#8221;<\/span><\/a><\/p>\n","protected":false},"author":16,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7],"tags":[],"class_list":["post-454","post","type-post","status-publish","format-standard","hentry","category-graphics"],"_links":{"self":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts\/454","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/comments?post=454"}],"version-history":[{"count":7,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts\/454\/revisions"}],"predecessor-version":[{"id":461,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/posts\/454\/revisions\/461"}],"wp:attachment":[{"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/media?parent=454"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/categories?post=454"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.igalia.com\/itoral\/wp-json\/wp\/v2\/tags?post=454"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}