An introduction to Mesa’s GLSL compiler (II)

Recap

My previous post served as an initial look into Mesa’s GLSL compiler, where we discussed the Mesa IR, which is a core aspect of the compiler. In this post I’ll introduce another relevant aspect: IR lowering.

IR lowering

There are multiple lowering passes implemented in Mesa (check src/glsl/lower_*.cpp for a complete list) but they all share a common denominator: their purpose is to re-write certain constructs in the IR so they fit better the underlying GPU hardware.

In this post we will look into the lower_instructions.cpp lowering pass, which rewrites expression operations that may not be supported directly by GPU hardware with different implementations.

The lowering process involves traversing the IR, identifying the instructions we want to lower and modifying the IR accordingly, which fits well into the visitor pattern strategy discussed in my previous post. In this case, expression lowering is handled by the lower_instructions_visitor class, which implements the lowering pass in the visit_leave() method for ir_expression nodes.

The hierarchical visitor class, which serves as the base class for most visitors in Mesa, defines visit() methods for leaf nodes in the IR tree, and visit_leave()/visit_enter() methods for non-leaf nodes. This way, when traversing intermediary nodes in the IR we can decide to take action as soon as we enter them or when we are about to leave them.

In the case of our lower_instructions_visitor class, the visit_leave() method implementation is a large switch() statement with all the operators that it can lower.

The code in this file lowers common scenarios that are expected to be useful for most GPU drivers, but individual drivers can still select which of these lowering passes they want to use. For this purpose, hardware drivers create instances of the lower_instructions class passing the list of lowering passes to enable. For example, the Intel i965 driver does:

const int bitfield_insert = brw->gen >= 7
                            ? BITFIELD_INSERT_TO_BFM_BFI
                            : 0;
lower_instructions(shader->base.ir,
                   MOD_TO_FLOOR |
                   DIV_TO_MUL_RCP |
                   SUB_TO_ADD_NEG |
                   EXP_TO_EXP2 |
                   LOG_TO_LOG2 |
                   bitfield_insert |
                   LDEXP_TO_ARITH);

Notice how in the case of Intel GPUs, one of the lowering passes is conditionally selected depending on the hardware involved. In this case, brw->gen >= 7 selects GPU generations since IvyBridge.

Let’s have a look at the implementation of some of these lowering passes. For example, SUB_TO_ADD_NEG is a very simple one that transforms subtractions into negative additions:

void
lower_instructions_visitor::sub_to_add_neg(ir_expression *ir)
{
   ir->operation = ir_binop_add;
   ir->operands[1] =
      new(ir) ir_expression(ir_unop_neg, ir->operands[1]->type,
                            ir->operands[1], NULL);
   this->progress = true;
}

As we can see, the lowering pass simply changes the operator used by the ir_expression node, and negates the second operand using the unary negate operator (ir_unop_neg), thus, converting the original a = b – c into a = b + (-c).

Of course, if a driver does not have native support for the subtraction operation, it could still do this when it processes the IR to produce native code, but this way Mesa is saving driver developers that work. Also, some lowering passes may enable optimization passes after the lowering that drivers might miss otherwise.

Let’s see a more complex example: MOD_TO_FLOOR. In this case the lowering pass provides an implementation of ir_binop_mod (modulo) for GPUs that don’t have a native modulo operation.

The modulo operation takes two operands (op0, op1) and implements the C equivalent of the ‘op0 % op1‘, that is, it computes the remainder of the division of op0 by op1. To achieve this the lowering pass breaks the modulo operation as mod(op0, op1) = op0 – op1 * floor(op0 / op1), which requires only multiplication, division and subtraction. This is the implementation:

ir_variable *x = new(ir) ir_variable(ir->operands[0]->type, "mod_x",
                                     ir_var_temporary);
ir_variable *y = new(ir) ir_variable(ir->operands[1]->type, "mod_y",
                                     ir_var_temporary);
this->base_ir->insert_before(x);
this->base_ir->insert_before(y);

ir_assignment *const assign_x =
   new(ir) ir_assignment(new(ir) ir_dereference_variable(x),
                         ir->operands[0], NULL);
ir_assignment *const assign_y =
   new(ir) ir_assignment(new(ir) ir_dereference_variable(y),
                         ir->operands[1], NULL);

this->base_ir->insert_before(assign_x);
this->base_ir->insert_before(assign_y);

ir_expression *const div_expr =
   new(ir) ir_expression(ir_binop_div, x->type,
                         new(ir) ir_dereference_variable(x),
                         new(ir) ir_dereference_variable(y));

/* Don't generate new IR that would need to be lowered in an additional
 * pass.
 */
if (lowering(DIV_TO_MUL_RCP) && (ir->type->is_float() ||
    ir->type->is_double()))
   div_to_mul_rcp(div_expr);

ir_expression *const floor_expr =
   new(ir) ir_expression(ir_unop_floor, x->type, div_expr);

if (lowering(DOPS_TO_DFRAC) && ir->type->is_double())
   dfloor_to_dfrac(floor_expr);

ir_expression *const mul_expr =
   new(ir) ir_expression(ir_binop_mul,
                         new(ir) ir_dereference_variable(y),
                         floor_expr);

ir->operation = ir_binop_sub;
ir->operands[0] = new(ir) ir_dereference_variable(x);
ir->operands[1] = mul_expr;
this->progress = true;

Notice how the first thing this does is to assign the operands to a variable. The reason for this is a bit tricky: since we are going to implement ir_binop_mod as op0 – op1 * floor(op0 / op1), we will need to refer to the IR nodes op0 and op1 twice in the tree. However, we can’t just do that directly, for that would mean that we have the same node (that is, the same pointer) linked from two different places in the IR expression tree. That is, we want to have this tree:

                             sub
                           /     \
                        op0       mult
                                 /    \
                              op1     floor
                                        |
                                       div
                                      /   \
                                   op0     op1

Instead of this other tree:


                             sub
                           /     \
                           |      mult
                           |     /   \
                           |   floor  |
                           |     |    |
                           |    div   |
                           |   /   \  |
                            op0     op1   

This second version of the tree is problematic. For example, let’s say that a hypothetical optimization pass detects that op1 is a constant integer with value 1, and realizes that in this case div(op0/op1) == op0. When doing that optimization, our div subtree is removed, and with that, op1 could be removed too (and possibily freed), leaving the other reference to that operand in the IR pointing to an invalid memory location… we have just corrupted our IR:

                             sub
                           /     \
                           |      mult
                           |     /    \
                           |   floor   op1 [invalid pointer reference]
                           |     |
                           |    /
                           |   /
                            op0 

Instead, what we want to do here is to clone the nodes each time we need a new reference to them in the IR. All IR nodes have a clone() method for this purpose. However, in this particular case, cloning the nodes creates a new problem: op0 and op1 are ir_expression nodes so, for example, op0 could be the expression a + b * c, so cloning the expression would produce suboptimal code where the expression gets replicated. This, at best, will lead to slower
compilation times due to optimization passes needing to detect and fix that, and at worse, that would go undetected by the optimizer and lead to worse performance where we compute the value of the expression multiple times:

                              sub
                           /        \
                         add         mult
                        /   \       /    \
                      a     mult  op1     floor
                            /   \          |
                           b     c        div
                                         /   \
                                      add     op1
                                     /   \
                                    a    mult
                                        /    \
                                        b     c

The solution to this problem is to assign the expression to a variable, then dereference that variable (i.e., read its value) wherever we need. Thus, the implementation defines two variables (x, y), assigns op0 and op1 to them and creates new dereference nodes wherever we need to access the value of the op0 and op1 expressions:

                       =               =
                     /   \           /   \
                    x     op0       y     op1


                             sub
                           /     \
                         *x       mult
                                 /    \
                               *y     floor
                                        |
                                       div
                                      /   \
                                    *x     *y

In the diagram above, each variable dereference is marked with an ‘*’, and each one is a new IR node (so both appearances of ‘*x’ refer to different IR nodes, both representing two different reads of the same variable). With this solution we only evaluate the op0 and op1 expressions once (when they get assigned to the corresponding variables) and we never refer to the same IR node twice from different places (since each variable dereference is a new IR node).

Now that we know why we assign these two variables, let’s continue looking at the code of the lowering pass:

In the next step we implement op0 / op1 using a ir_binop_div expression. To speed up compilation, if the driver has the DIV_TO_MUL_RCP lowering pass enabled, which transforms a / b into a * 1 / b (where 1 / b could be a native instruction), we immediately execute the lowering pass for that expression. If we didn’t do this here, the resulting IR would contain a division operation that might have to be lowered in a later pass, making the compilation process slower.

The next step uses a ir_unop_floor expression to compute floor(op0/op1), and again, tests if this operation should be lowered too, which might be the case if the type of the operands is a 64bit double instead of a regular 32bit float, since GPUs may only have a native floor instruction for 32bit floats.

Next, we multiply the result by op1 to get op1 * floor(op0 / op1).

Now we only need to subtract this from op0, which would be the root IR node for this expression. Since we want the new IR subtree spawning from this root node to replace the old implementation, we directly edit the IR node we are lowering to replace the ir_binop_mod operator with ir_binop_sub, make a dereference to op1 in the first operand and link the expression holding op1 * floor(op0 / op1) in the second operand, effectively attaching our new implementation in place of the old version. This is how the original and lowered IRs look like:

Original IR:

[prev inst] -> mod -> [next inst]
              /   \            
           op0     op1         

Lowered IR:

[prev inst] -> var x -> var y ->   =   ->   =   ->   sub   -> [next inst]
                                  / \      / \      /   \
                                 x  op0   y  op1  *x     mult
                                                        /    \
                                                      *y      floor
                                                                |
                                                               div
                                                              /   \
                                                            *x     *y

Finally, we return true to let the compiler know that we have optimized the IR and that as a consequence we have introduced new nodes that may be subject to further lowering passes, so it can run a new pass. For example, the subtraction we just added may be lowered again to a negative addition as we have seen before.

Coming up next

Now that we learnt about lowering passes we can also discuss optimization passes, which are very similar since they are also based on the visitor implementation in Mesa and also transform the Mesa IR in a similar way.

One comment

  1. hello,
    thanks for your useful articles. I was confused about Linux graphic stack and now ,thanks to your articles, I understand Linux graphic stack.
    now here is my problem: I am a college student and I have a GPU that implemented on a FPGA. I have to write a GPU driver for that. I’ll be very grateful if you introduce some resource to start .(I am beginner in driver)
    thanks in advance