Optimizing shader assembly instruction on Mesa using shader-db (II)

On my previous post I mentioned that I have been working on optimizing the shader instruction count for specific shaders guided by shader-db, and showed one specific example. In this post I will show another one, slightly more complex on the triaging and solution.

Some of the shaders with a worse instruction count can be foundn at shader-db/shaders/dolphin. Again I analyzed it in order to get the more simpler shader possible with the same issue:

   #version 130

   in vec2 myData;

   void main()
   {
      gl_Position = vec4(myData, 3.0, 4.0);
   }

Some comments:

  • It also happens with uniforms (so you can replace “in vec2” for “uniform vec2”)
  • It doesn’t happens if you use directly the input. You need to do this kind of “input and const” combination.

So as in my previous post I executed the compilation, using the option optimizer. In the case of IR I got the following files:

  • VS-0001-00-start
  • VS-0001-01-01-opt_reduce_swizzle
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate
  • VS-0001-02-07-opt_register_coalesce

Being this the desired outcome (so the content of VS-0001-02-02-dead_code_eliminate):

0: mov m3.z:F, 3.000000F
1: mov m3.w:F, 4.000000F
2: mov m3.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: vs_urb_write (null):UD

Unsurprisingly it is mostly movs. In opposite to the shader I mentioned on my previous post, where on both cases the same optimizations were applied, in this case the NIR path doesn’t apply the last optimization (the second register coalesce). So this time I will focus on the starting point and the state just after the dead code eliminate pass.

So on IR, the starting point (VS-0001-00-start) is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf1.0.zw:F, vgrf2.xxxy:F
3: mov vgrf1.0.xy:F, attr17.xyxx:F
4: mov vgrf0.0:F, vgrf1.xyzw:F
5: mov m2:D, 0D
6: mov m3:F, vgrf0.xyzw:F
7: vs_urb_write (null):UD

and the state after the dead code eliminate is the following one:

0: mov vgrf1.0.z:F, 3.000000F
1: mov vgrf1.0.w:F, 4.000000F
2: mov vgrf1.0.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: mov m3:F, vgrf1.xyzw:F
5: vs_urb_write (null):UD

On NIR, the starting point is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf0.0.xy:F, attr17.xyyy:F
3: mov vgrf1.0.xy:D, vgrf0.xyzw:D
4: mov vgrf1.0.zw:D, vgrf2.xxxy:D
5: mov m2:D, 0D
6: mov m3:F, vgrf1.xyzw:F
7: vs_urb_write (null):UD

and the state after the dead code eliminate is the following one:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov m3.xy:D, attr17.xyyy:D
3: mov m3.zw:D, vgrf2.xxxy:D
4: mov m2:D, 0D
5: vs_urb_write (null):UD

The first difference we can see is that although the instructions are basically the same at the starting point, the order is not the same. In fact if we check the different intermediate steps (I will not show them here to avoid a post too long), although the optimizations are the same, how and which get optimized are somewhat different. One could conclude that the problem is this order, but if we take a look to the final step on the NIR assembly shader, there isn’t anything clearly indicating that that shader can’t be simplified. Specifically instruction #3 could go away if instruction #0 and #1 writes directly to m3 instead of vgrf2, that is what the IR path does. So it seems that the problem is on the register coalesce optimization.

As I mentioned, there is a slight order difference between NIR and IR. That leads that on the NIR case, between instruction #3 and #0/#1 there is another instruction, that is in a different place on IR. So my first thought was that the optimization was only checking against the immediate previous instruction. Once I started to look to the code it showed that I was wrong. For each instruction, there was a loop checking for all the previous instructions. What I noticed is that on that loop, all the checks that rejected one previous instruction was a break. So I initially thought that perhaps one of those breaks was in fact a continue. This seemed to be confirmed when I did the quick hack of replace everything for continues. It proved wrong as soon as I saw all the piglit regressions I had in hand. So after that I did the proper, and do a proper debug. So using gdb, the condition it was stopping the optimization to check previous instructions was the following one:

/* If somebody else writes our destination here, we can't coalesce
* before that.
*/
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written))
break;

Probably the code is hard to understand out of context, but the comment is clear. When we coalesce two instructions, that is possible when the previous one writes to a register we are reading on the current instruction. But obviously, that can’t be done if there is a instruction in the middle that writes on the same register. And it is true that is what is happening here. If you look at the final state of the NIR path, we want to coalesce instruction #3 with instruction #1 and #0, but instruction #2 is writing on m3 too.

So, that’s over? Not exactly. Remember that IR was able to simplify this, and it can’t be only because the order was different. If you take a deeper look to those instructions, there are some x, y, z, w after the register names. Those report in which channels those instructions are writing. As I mentioned on my previous post, this work is about providing a NIR to vec4 pass. So those registers are vectors. Instruction #3 can be read as “move the content of components x and y from register vgrf2 to components z and w of register m3”.  And instruction #2 can be read as “move the content of components x and y from register attr17 to components x and y of register m3”. So although we are writing to the same destination, we are writing to different components, meaning that it would be safe to do the coalescing. We just need to be sure that there isn’t any component overlap between current instruction and the previous one we are checking against. Fortunately the registers already save in which registers they are writing on in a variable called “writemask”. So we only need to change that code for the following one:

/* If somebody else writes the same channels of our destination here,
* we can't coalesce before that.
*/
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written) &&
(inst->dst.writemask & scan_inst->dst.writemask) != 0) {
break;
}

The patch with this change was sent to the mesa list (here), approved and pushed to master.

Final words

So again, a problem what was easier to write the solution that to write the patch. But in any case, it showed a significant improvement. Using shader-db tool comparing before and after the patch:

total instructions in shared programs: 1781593 -> 1734957 (-2.62%)
instructions in affected programs:     1238390 -> 1191754 (-3.77%)
helped:                                12782
HURT:                                  0

Optimizing shader assembly instruction on Mesa using shader-db

Lately I have been working on Mesa. Specifically I have been working with my fellow igalians Eduardo Lima and Antía Puentes to provide a NIR to vec4 pass to the i965 backend. I will not go too much in the details, but in summary, NIR is a new intermediate representation for Mesa. Intermediate as being in the middle of the OpenGL GLSL language used for shaders, and the final GPU machine instructions for each specific Mesa backends. NIR is intended to replace the previous GLSL IR, and in some places it is already done. If you are interested on the details, take a look to the NIR announcement and NIR documentation page.

Although the bug is still open, Mesa master already has the functionality for this pass, and in fact, is the default now. This new NIR pass provides the same functionality that the one available with the old GLSL IR pass (from now, just IR). This was properly tested with piglit. But although the total instruction count in general have improved, we are getting worse instruction count compiling some specific known shaders if we use NIR. So the next step would be improve this. This is an ongoing effort, like these patches from Jason Ekstrand, but I would like to share some of my experience so far.

In order to guide this work, we have been using shader-db. shader-db is a shader database, with a executable to compile those shaders, and a tool to compare two executions of that compilation. Usually it is used to verify that the optimization that you are implementing is really improving the instruction count, or even to justify your change. Several Mesa commits include the before and after shader-db statistics. But in this case, we have been using it as a guide of what we could improve. We compile all the shaders using IR and using NIR (using the environment variable INTEL_USE_NIR), and check in which shaders there are a instruction count regression.

Case 1: subtraction needs an extra mov.

Ok, so one of the shaders with a worse instruction count is humus-celshading/4.shader_test. After some analysis of what was the problem, I got a simpler shader with the same problem:

in vec4 inData;

void main(){
gl_Position = gl_Vertex - inData;
}

This simple shader needs one extra instructions using NIR. So yes, a simple subtraction is getting worse. FWIW, this is the desired final shader assembly:

0: add m3:F, attr0.xyzw:F, -attr18.xyzw:F
1: mov m2:D, 0D
2: vs_urb_write (null):UD

Note that there isn’t an assembly subtraction instruction, but it is represented as negating the second parameter and use an add (this seems captain obvious information here, but will be relevant later).

So at this point one option would be start to look at the backend (remember, i965) code for vec4, specifically the optimizations, and check if we see something. Those optimization are called at brw_vec4.cpp. Those optimizations are in general common to any compiler, like dead code elimination, copy propagation, register coalescing, etc. And usually they are executed several times in several passes, and some of those are simplifications to be used by other optimizations (for example, if your copy propagation optimization pass works, then it is common that your dead code elimination pass will get an instruction out). So with all those optimizations and passes, how do you find the problem? Although it is a good idea read the code for those optimizations to know how they work, it is usually not enough to know where the problem is. So this is again a debug problem, and as usually, you want to know what it is happening step by step.

For this I executed again the compilation, with the following environment variable:

INTEL_USE_NIR=1 INTEL_DEBUG=optimizer ./run subtraction.shader_test

This option prints out to a file the shader assembly compiled at each optimization pass (if applied). So for example, I get the following files for both cases:

  • VS-0001-00-start
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate

So in order to get the final shader assemlby, it was executed a copy propagation, a register coalesce, and a dead code eliminate. BTW, I found that environment variable while looking at the code. It is not listed on the mesa envvar page, something I assume is a bug.

So I started to look at the differences between the different steps. Taking into account that on both cases, the same optimizations were executed, and in the same order, I started looking for differences between one and the other at any step. And I found one difference on the copy propagation.

So let’s see the starting point using IR:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, vgrf2.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, -attr18.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the starting point using NIR:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, vgrf0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, attr0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

Although it is true that the starting point for NIR already have one extra instruction compared with IR, that extra one gets optimized on following steps. What caught my attention was the difference between what happens with the instruction #1 on the IR case, compared with the equivalent instruction #2 on the NIR case (the add). On the IR case, copy propagation is able to propagate attr18 from the previous instruction. So is easy to see that this could be simplified on following optimization steps. But that doesn’t happen on the NIR case. On NIR, instruction #2 after the copy propagation remains the same.

So I started to take a look to the implementation of the copy propagation optimization code (here). Without entering into details, this pass analyses each instruction, comparing them with the previous ones in order to know if it can do a copy propagation. So I looked why with that specific instruction the pass concludes that it can’t be done. At this point you could use gdb, but I used some extra printfs (sometimes they are useful too). So I find the check that rejected that instruction:

bool has_source_modifiers = value.negate || value.abs;

<skip>

if (has_source_modifiers && value.type != inst->src[arg].type)
    return false;

That means that if the source of the previous instruction you are checking against is negated (or has an abs), and the types are different, you can’t do the propagation. This makes sense, because negation is different on different types. If we go back to check the shader assembly output, we find that it is true that the types (those F, D and UD just after the registers) are different between the IR and the NIR case. Why we didn’t worry before? Why this was not failing on any piglit test? Well, because if you take a look more carefully, the instructions that had different types are the movs. In both cases, the type is correct on the add instruction. And in a mov, the type is somewhat irrelevant. You are just moving raw data from one place to the other. It is important on the ALU operation. But in any case, it is true that the type is wrong on those registers (compared with the original GLSL code), and as we are seeing, is causing some problems on the optimization passes. So next step: check where those types are filled.

Searching a little on the code, and using gdb this time, this is done on the function nir_setup_uniforms at brw_vec4_nir.cpp,while creating a source register variable. But curiously it is using the type that came from NIR:

src_reg src = src_reg(ATTR, var->data.location + i, var->type);

and printing out the content of var->type with gdb, it properly shows the type used at the GLSL code. If we go deeper to the src_reg constructor:

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    init();

    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix()))
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
    else
        this->swizzle = BRW_SWIZZLE_XYZW;
}

We see that the type is only used to fill the swizzle. But if we compare this to the equivalent code for a destination register:

dst_reg::dst_reg(register_file file, int reg, const glsl_type *type,
unsigned writemask)
{
    init();

    this->file = file;
    this->reg = reg;
    this->type = brw_type_for_base_type(type);
    this->writemask = writemask;
}

dst_reg is is also filling internally the type, something src_reg is not doing. So at this point the patch is straightforward. It is just fill src_reg->type using the constructor type parameter. The patch was approved and is already on master.

Coming up next

At the end, I didn’t need to improve at all any of the optimization passes, as the bug was elsewhere, but the debugging steps for this work are still the same. In fact it was the usual bug that was harder to find (for simplicity I summarized the triaging) that to solve. For the next blog post, I will explain how I worked on another instruction count regression, somewhat more complex, that needed a change on one of the optimization passes.