Archive for the ‘OpenGL’ Category

Optimizing shader assembly instruction on Mesa using shader-db (II)

Friday, September 18th, 2015

In my previous post I mentioned that I have been working on optimizing the shader instruction count for specific shaders, guided by shader-db, and showed one specific example. In this post I will show another one, slightly more complex in both the triaging and the solution.

Some of the shaders with a worse instruction count can be found at shader-db/shaders/dolphin. Again I analyzed them in order to get the simplest possible shader with the same issue:

   #version 130

   in vec2 myData;

   void main()
   {
      gl_Position = vec4(myData, 3.0, 4.0);
   }

Some comments:

  • It also happens with uniforms (so you can replace “in vec2” with “uniform vec2”)
  • It doesn’t happen if you use the input directly; you need this kind of “input plus constant” combination.

So, as in my previous post, I executed the compilation using the optimizer debug option. In the IR case I got the following files:

  • VS-0001-00-start
  • VS-0001-01-01-opt_reduce_swizzle
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate
  • VS-0001-02-07-opt_register_coalesce

This being the desired outcome (that is, the content of VS-0001-02-02-dead_code_eliminate):

0: mov m3.z:F, 3.000000F
1: mov m3.w:F, 4.000000F
2: mov m3.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: vs_urb_write (null):UD

Unsurprisingly it is mostly movs. In contrast to the shader I mentioned in my previous post, where in both cases the same optimizations were applied, in this case the NIR path doesn’t apply the last optimization (the second register coalesce). So this time I will focus on the starting point and the state just after the dead code elimination pass.

So on IR, the starting point (VS-0001-00-start) is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf1.0.zw:F, vgrf2.xxxy:F
3: mov vgrf1.0.xy:F, attr17.xyxx:F
4: mov vgrf0.0:F, vgrf1.xyzw:F
5: mov m2:D, 0D
6: mov m3:F, vgrf0.xyzw:F
7: vs_urb_write (null):UD

and the state after dead code elimination is the following:

0: mov vgrf1.0.z:F, 3.000000F
1: mov vgrf1.0.w:F, 4.000000F
2: mov vgrf1.0.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: mov m3:F, vgrf1.xyzw:F
5: vs_urb_write (null):UD

On NIR, the starting point is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf0.0.xy:F, attr17.xyyy:F
3: mov vgrf1.0.xy:D, vgrf0.xyzw:D
4: mov vgrf1.0.zw:D, vgrf2.xxxy:D
5: mov m2:D, 0D
6: mov m3:F, vgrf1.xyzw:F
7: vs_urb_write (null):UD

and the state after dead code elimination is the following:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov m3.xy:D, attr17.xyyy:D
3: mov m3.zw:D, vgrf2.xxxy:D
4: mov m2:D, 0D
5: vs_urb_write (null):UD

The first difference we can see is that although the instructions are basically the same at the starting point, the order is not. In fact, if we check the different intermediate steps (I will not show them here to avoid making the post too long), although the optimizations are the same, how and which instructions get optimized are somewhat different. One could conclude that the problem is this ordering, but if we look at the final step of the NIR assembly shader, there isn’t anything clearly indicating that the shader can’t be simplified further. Specifically, instruction #3 could go away if instructions #0 and #1 wrote directly to m3 instead of vgrf2, which is what the IR path does. So it seems that the problem is in the register coalesce optimization.

As I mentioned, there is a slight order difference between NIR and IR. That means that in the NIR case there is another instruction between instruction #3 and instructions #0/#1, an instruction that sits in a different place in IR. So my first thought was that the optimization was only checking against the immediately previous instruction. Once I started to look at the code, it turned out I was wrong: for each instruction, there is a loop checking all the previous instructions. What I noticed is that in that loop, every check that rejected a previous instruction was a break. So I initially thought that perhaps one of those breaks should in fact be a continue. This seemed to be confirmed when I did the quick hack of replacing them all with continues. It proved wrong as soon as I saw all the piglit regressions I had in hand. So after that I did things properly and debugged it. Using gdb, the condition that was stopping the optimization from checking earlier instructions was the following one:

/* If somebody else writes our destination here, we can't coalesce
 * before that.
 */
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written))
   break;

Probably the code is hard to understand out of context, but the comment is clear. We can coalesce two instructions when the previous one writes to a register we are reading in the current instruction. But obviously, that can’t be done if there is an instruction in the middle that writes to the same register. And that is indeed what is happening here: if you look at the final state of the NIR path, we want to coalesce instruction #3 with instructions #1 and #0, but instruction #2 is writing to m3 too.

So, is that the end of it? Not exactly. Remember that IR was able to simplify this, and it can’t be only because the order was different. If you take a deeper look at those instructions, there are some x, y, z, w suffixes after the register names. Those report which channels those instructions write to. As I mentioned in my previous post, this work is about providing a NIR to vec4 pass, so those registers are vectors. Instruction #3 can be read as “move the content of components x and y of register vgrf2 to components z and w of register m3”. And instruction #2 can be read as “move the content of components x and y of register attr17 to components x and y of register m3”. So although we are writing to the same destination, we are writing to different components, meaning that it would be safe to do the coalescing. We just need to be sure that there isn’t any component overlap between the current instruction and the previous one we are checking against. Fortunately the registers already record which channels they write to in a variable called “writemask”. So we only need to change that code to the following:

/* If somebody else writes the same channels of our destination here,
 * we can't coalesce before that.
 */
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written) &&
    (inst->dst.writemask & scan_inst->dst.writemask) != 0) {
   break;
}
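
As a side note, the writemask is just a four-bit channel mask, so the overlap test above boils down to a single bitwise AND. Here is a minimal standalone illustration (not Mesa code; the WRITEMASK_* constants are written out by hand using the usual x=1, y=2, z=4, w=8 encoding), using the same example as above, where instruction #2 writes m3.xy and instruction #3 writes m3.zw:

#include <cassert>

/* Channel bits, one per component: x=1, y=2, z=4, w=8. */
enum {
   WRITEMASK_X = 1 << 0,
   WRITEMASK_Y = 1 << 1,
   WRITEMASK_Z = 1 << 2,
   WRITEMASK_W = 1 << 3,
};

int main()
{
   unsigned inst_wm      = WRITEMASK_Z | WRITEMASK_W; /* instruction #3 writes m3.zw */
   unsigned scan_inst_wm = WRITEMASK_X | WRITEMASK_Y; /* instruction #2 writes m3.xy */

   /* No shared channel bits, so the intervening write does not block coalescing. */
   assert((inst_wm & scan_inst_wm) == 0);
   return 0;
}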

The patch with this change was sent to the Mesa list (here), approved, and pushed to master.

Final words

So again, this was a problem where writing the fix was easier than finding it. But in any case, it showed a significant improvement. Using the shader-db tool to compare before and after the patch:

total instructions in shared programs: 1781593 -> 1734957 (-2.62%)
instructions in affected programs:     1238390 -> 1191754 (-3.77%)
helped:                                12782
HURT:                                  0

Optimizing shader assembly instruction on Mesa using shader-db

Monday, September 14th, 2015

Lately I have been working on Mesa. Specifically, I have been working with my fellow Igalians Eduardo Lima and Antía Puentes to provide a NIR to vec4 pass for the i965 backend. I will not go too much into the details, but in summary, NIR is a new intermediate representation for Mesa. Intermediate as in sitting between the OpenGL GLSL language used for shaders and the final GPU machine instructions for each specific Mesa backend. NIR is intended to replace the previous GLSL IR, and in some places it already has. If you are interested in the details, take a look at the NIR announcement and the NIR documentation page.

Although the bug is still open, Mesa master already has the functionality for this pass, and in fact, it is the default now. This new NIR pass provides the same functionality as the one available with the old GLSL IR pass (from now on, just IR). This was properly tested with piglit. But although the total instruction count has in general improved, we are getting a worse instruction count when compiling some specific known shaders with NIR. So the next step is to improve this. This is an ongoing effort, like these patches from Jason Ekstrand, but I would like to share some of my experience so far.

In order to guide this work, we have been using shader-db. shader-db is a shader database, with an executable to compile those shaders, and a tool to compare two executions of that compilation. Usually it is used to verify that the optimization you are implementing really improves the instruction count, or even to justify your change. Several Mesa commits include the before and after shader-db statistics. But in this case, we have been using it as a guide for what we could improve. We compile all the shaders using IR and using NIR (via the environment variable INTEL_USE_NIR), and check in which shaders there is an instruction count regression.

Case 1: subtraction needs an extra mov.

OK, so one of the shaders with a worse instruction count is humus-celshading/4.shader_test. After some analysis of what the problem was, I got a simpler shader with the same problem:

in vec4 inData;

void main(){
    gl_Position = gl_Vertex - inData;
}

This simple shader needs one extra instruction using NIR. So yes, a simple subtraction is getting worse. FWIW, this is the desired final shader assembly:

0: add m3:F, attr0.xyzw:F, -attr18.xyzw:F
1: mov m2:D, 0D
2: vs_urb_write (null):UD

Note that there isn’t an assembly subtraction instruction; it is represented as negating the second operand and using an add (this seems like captain-obvious information here, but it will be relevant later).

So at this point one option would be to start looking at the backend (remember, i965) code for vec4, specifically the optimizations, and check if we see something. Those optimizations are called from brw_vec4.cpp. They are in general common to any compiler: dead code elimination, copy propagation, register coalescing, etc. And usually they are executed several times in several passes, and some of them produce simplifications to be used by other optimizations (for example, if your copy propagation pass works, it is common that your dead code elimination pass will be able to remove an instruction). So with all those optimizations and passes, how do you find the problem? Although it is a good idea to read the code for those optimizations to know how they work, it is usually not enough to know where the problem is. So this is again a debugging problem, and as usual, you want to know what is happening step by step.
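
As a rough sketch of how those passes are driven (this is not the actual Mesa source, just the general shape of the loop, using the pass names that appear in the dump files shown below):

#include <cstdio>

/* Stubs standing in for the real vec4 passes; each returns true when it
 * managed to change the instruction list. */
static bool opt_reduce_swizzle()    { return false; }
static bool opt_copy_propagation()  { return false; }
static bool opt_register_coalesce() { return false; }
static bool dead_code_eliminate()   { return false; }

int main()
{
   /* Run the passes over and over until a full round makes no progress. */
   bool progress;
   do {
      progress = false;
      progress = opt_reduce_swizzle()    || progress;
      progress = opt_copy_propagation()  || progress;
      progress = opt_register_coalesce() || progress;
      progress = dead_code_eliminate()   || progress;
   } while (progress);

   std::puts("no further progress, done optimizing");
   return 0;
}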

For this I executed the compilation again, with the following environment variables:

INTEL_USE_NIR=1 INTEL_DEBUG=optimizer ./run subtraction.shader_test

This option prints out to a file the shader assembly after each optimization pass (if it was applied). So, for example, I get the following files in both cases:

  • VS-0001-00-start
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate

So in order to get the final shader assembly, a copy propagation, a register coalesce and a dead code elimination pass were executed. BTW, I found that environment variable while looking at the code; it is not listed on the Mesa envvar page, something I assume is a bug.

So I started to look at the differences between the different steps. Taking into account that in both cases the same optimizations were executed, and in the same order, I started looking for differences between one and the other at each step. And I found one difference in the copy propagation.

So let’s see the starting point using IR:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, vgrf2.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, -attr18.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the starting point using NIR:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, vgrf0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, attr0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

Although it is true that the starting point for NIR already has one extra instruction compared with IR, that extra one gets optimized away in the following steps. What caught my attention was the difference between what happens with instruction #1 in the IR case, compared with the equivalent instruction #2 in the NIR case (the add). In the IR case, copy propagation is able to propagate attr18 from the previous instruction, so it is easy to see that this could be simplified in the following optimization steps. But that doesn’t happen in the NIR case: on NIR, instruction #2 remains the same after the copy propagation.

So I started to take a look at the implementation of the copy propagation optimization code (here). Without entering into details, this pass analyses each instruction, comparing it with the previous ones in order to know if it can do a copy propagation. So I looked into why, for that specific instruction, the pass concludes that it can’t be done. At this point you could use gdb, but I used some extra printfs (sometimes they are useful too). And I found the check that rejected that instruction:

bool has_source_modifiers = value.negate || value.abs;


if (has_source_modifiers && value.type != inst->src[arg].type)
    return false;

That means that if the source of the previous instruction you are checking against is negated (or has an abs), and the types are different, you can’t do the propagation. This makes sense, because negation is different for different types. If we go back and check the shader assembly output, we find that it is true that the types (those F, D and UD just after the registers) are different between the IR and the NIR case. Why didn’t we worry about this before? Why was this not failing on any piglit test? Well, because if you take a more careful look, the instructions that have different types are the movs. In both cases, the type is correct on the add instruction. And in a mov, the type is somewhat irrelevant: you are just moving raw data from one place to the other. It matters for the ALU operations. But in any case, it is true that the type is wrong on those registers (compared with the original GLSL code), and as we are seeing, it is causing some problems in the optimization passes. So next step: check where those types are filled.
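
To illustrate why negation depends on the type, here is a small standalone example (not Mesa code): negating the same 32 bits as a float (a sign-bit flip) and as an integer (two's complement) gives different bit patterns, so propagating a negated source across a type change would change the result.

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
   float f = 3.0f;
   uint32_t bits;
   std::memcpy(&bits, &f, sizeof(bits));                  /* 0x40400000 */

   uint32_t neg_as_float = bits ^ 0x80000000u;            /* flip the sign bit */
   uint32_t neg_as_int =
      static_cast<uint32_t>(-static_cast<int32_t>(bits)); /* two's complement */

   /* The two bit patterns differ, so "-src" means different things
    * depending on the type used to interpret the register. */
   std::printf("negated as float: 0x%08x\n", neg_as_float);
   std::printf("negated as int:   0x%08x\n", neg_as_int);
   return 0;
}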

Searching a little in the code, and using gdb this time, I found that this is done in the function nir_setup_uniforms at brw_vec4_nir.cpp, while creating a source register variable. But curiously it is using the type that comes from the NIR variable:

src_reg src = src_reg(ATTR, var->data.location + i, var->type);

and printing out the content of var->type with gdb, it properly shows the type used in the GLSL code. If we go deeper into the src_reg constructor:

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix()))
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
    else
        this->swizzle = BRW_SWIZZLE_XYZW;
}

We see that the type is only used to fill the swizzle. But if we compare this with the equivalent code for a destination register:

dst_reg::dst_reg(register_file file, int reg, const glsl_type *type,
                 unsigned writemask)
{
    this->file = file;
    this->reg = reg;
    this->type = brw_type_for_base_type(type);
    this->writemask = writemask;
}

we see that dst_reg is also filling the type internally, something src_reg is not doing. So at this point the patch is straightforward: just fill src_reg->type using the constructor’s type parameter. The patch was approved and is already on master.
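
For reference, this is roughly what the change amounts to, reusing the brw_type_for_base_type() helper that the dst_reg constructor already uses (a sketch of the idea; the actual patch on the Mesa list may differ in its details):

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix())) {
        this->type = brw_type_for_base_type(type);  /* new: keep the type around */
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
    } else {
        this->swizzle = BRW_SWIZZLE_XYZW;
    }
}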

Coming up next

In the end, I didn’t need to improve any of the optimization passes at all, as the bug was elsewhere, but the debugging steps for this kind of work are still the same. In fact it was the usual bug that is harder to find (for simplicity I summarized the triaging) than to solve. In the next blog post, I will explain how I worked on another instruction count regression, somewhat more complex, that did need a change in one of the optimization passes.


Update: now the Clutter accessibility library is called Cally

Wednesday, May 20th, 2009

Just a little update. The name I had chosen for the library was a registered trademark, so the library was renamed to Cally.

The library was updated accordingly. You can get the source code here:

git clone

Random thoughts about a11y on clutter

Tuesday, January 6th, 2009

Clutter and a11y support are important topics related to GNOME, so it is normal that the question of “a11y and Clutter” arises from time to time. Some time ago this appeared on the desktop-devel list, and Emmanuele Bassi closed the thread saying that there is a plan for it in the future [1].

And now this topic has returned, this time on the Clutter mailing list [2] [3].

Some time ago I was playing with these two topics, and I made some coding tests too. As these coding tests are still quite basic, here [4] is a link to a PDF with my first thoughts after this implementation, so anyone who wants to can read about it.

Reviving GLSL

Friday, August 10th, 2007

The other day, looking at Planet GNOME, I found one post more interesting than the others. It was written by MDK, and it is titled Vector drawing: OpenGL shaders and cairo.

That post continues, in some way, a post written by Tim Janik on his blog about OpenGL for Gdk/Gtk+, which summarizes some of the problems of using OpenGL with Gdk/Gtk+.

MDK made some tests to compare one with the other. I was a little curious and checked the code. The shader version was written in Cg, and as it had been a long time since I wrote anything related to shaders, I decided to spend the little free time I had this afternoon making a GLSL version of this test.

I ran the same tests, except for the environment:

  • AMD 1800+ 784MB RAM
  • Geforce 5900FX 128 MB RAM
  • modified source code (you don’t require the Cg Toolkit; instead you require GLEW)

And the visual results:

Cairo bezier

Cg bezier

GLSL bezier

But the most curious part was the timings (sorry, no chart this time):

  • cairo: 0.18494 s
  • cg: 0.01927 s
  • GLSL: 0.00155 s

GLSL better than Cg? I suppose that is the easy answer, but:

  • First of all: it is too big a difference, it doesn’t make sense
  • AFAIK the GLSL driver compiler is almost the Cg compiler (they use a unified architecture); you only need to look at the extension EXT_Cg_shader (use Cg code to define a shader object)
  • OpenGL works asynchronously
  • The profile: the Cg implementation uses a concrete profile (CG_PROFILE_ARBVP1); GLSL doesn’t have this concept, it always tries to find the best vertex/fragment program extension to use
  • So: the time to draw just a few curves is not an accurate measure (we knew that before), it is only a guide

Conclusion: it is good to revive GLSL from time to time, and I need to do it more often. Thanks to MDK for giving us some code to play with!

NVPerfKit 2.1 supports linux

Friday, October 20th, 2006

When you are working on a graphical application that uses OpenGL, debugging is very complex. You can’t use the common debuggers, such as gdb, because the graphical part is executed on the “server side”, on the card, and is therefore managed by the OpenGL drivers, asynchronously with respect to your program.

But, as with other programs, debugging and profiling are a very important part of the development process, and ever more important as time passes, since graphics cards are becoming more and more complex and doing more and more work, not only graphical. For example, Havok created a physics effects engine implemented on the graphics card, called Havok FX.

Recently, NVIDIA released NVPerfKit 2.1, a package of performance tools to help debug and profile Direct3D and OpenGL applications, that for the first time supports Linux!! That is good news. It isn’t free, and includes gDEbugger as a 30-day trial version, but, well, this is better than nothing 😐

When I was developing my Master’s Thesis (Proyecto de Fin de Carrera), I really missed some kind of tool to debug and profile my program, but either those tools weren’t released yet or my card didn’t support them … so I had to use the “universal flag method” 😉 to debug the application, and an ad-hoc system to profile it. But well, at this moment there are some NVPerfKit tools that don’t support my card (well, really, my card doesn’t support those tools; for example, it doesn’t support GPU counters :( ), so I would need to do that anyway.

Someday I will write a post about this project. I hope that someday I will get it to compile on Linux, and migrate my “Hair Editor” to GTK, but “when I have some time” ….

A 3D view for baobab widget

Thursday, August 3rd, 2006

While some Igalians (*) were making a very useful view for Baobab, I was spending my time creating a 3D view of this widget. To do that I used the great GtkGLExt library, which allows adding OpenGL drawing to any GTK widget.

While the base 2D view is very intuitive and gives you a visual map of your directories, with tooltips, etc., my 3D view is a very beautiful :) view of the widget, but it is almost useless, as it is hard to navigate and to select directories. But, as I said, it looks great, and it was fun to create, although I became a leech on the Igalians’ laptops. For example, Mario’s, forcing him to work on this widget when I kidnapped his laptop … 😉

At the end of the post you can see a screenshot of an alpha release of this widget. The next step is to create the blocks more consistently, as at the moment it has some visual “leaks”, and to integrate it with the original code base.

This is a headache, as to avoid adding a new dependency to Baobab, I was working in a “parallel experimental” branch. “One day” a full integration may be possible, but for the moment we have two separate lines of work.
3D view of Baobab!!

[*] :: Alex, Miguel, Mario, Henrique