ARB_gl_spirv and ARB_spirv_extensions support for i965 landed on Mesa master

And something more visible thanks to that: now the Intel Mesa driver exposes OpenGL 4.6 support, the most recent version of OpenGL.

As you may recall, the i965 Intel driver became 4.6 conformant last year. You can find more details about that, and what being conformant means, in this blog post by Iago. In that blog post Iago mentioned that the driver was passing the conformance tests with an early version of the ARB_gl_spirv support, which we kept improving and iterating on during this time so it could be included in Mesa master. At the same time, the CTS tests were only testing the specifics of the extensions, and we wanted more detailed testing, so we were also adding tests to the piglit test suite, written manually for ARB_gl_spirv or translated from existing GLSL tests.

Why did it take so long?

Perhaps some of you will wonder why it took so much time. There were several reasons, but the main one was the need to add a lot of linking-related code at the NIR level. In a previous blog post (Introducing Mesa intermediate representations on Intel drivers with a practical example) I mentioned that there are several intermediate languages in Mesa.

So, for the case of the Intel driver, for GLSL, we had a chain like this:

GLSL -> AST -> Mesa IR -> NIR -> Intel backend IR

Now ARB_gl_spirv introduces the possibility of using SPIR-V instead of GLSL. Thanks to the Vulkan support in Mesa, there is a SPIR-V to NIR pass, so the chain for that case would be something like this:

SPIR-V -> NIR -> Intel backend IR

So, at first sight this seems like it should be simpler, as there is one intermediate step fewer. But there is a problem. On Vulkan there is no need for really complex shader linking: it basically amounts to gathering some info from the NIR shader. But OpenGL requires more. Even though the extension doesn't require error validation, the spec states that queries and other introspection utilities should work naturally, as long as they don't require names (names are considered debug info on SPIR-V). So for example, on Vulkan you can't ask the shader how big an SSBO is in order to allocate the space needed for it; it is assumed that you know it. But in OpenGL you can, and as you could do that using just the SSBO binding, ARB_gl_spirv requires that you still support it.
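To give an idea of what that means in practice, here is a minimal sketch (not Mesa code, and the prog variable is illustrative) of the kind of name-less introspection that still has to work on a program linked from SPIR-V:

GLint num_ssbos = 0;
glGetProgramInterfaceiv(prog, GL_SHADER_STORAGE_BLOCK,
                        GL_ACTIVE_RESOURCES, &num_ssbos);

for (GLuint i = 0; i < (GLuint) num_ssbos; i++) {
   const GLenum props[] = { GL_BUFFER_BINDING, GL_BUFFER_DATA_SIZE };
   GLint values[2];
   glGetProgramResourceiv(prog, GL_SHADER_STORAGE_BLOCK, i,
                          2, props, 2, NULL, values);
   /* values[0] is the binding point and values[1] the size needed to
    * allocate the SSBO; no name was involved in the query. */
}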

And here lies the main issue. In Mesa most of the linking is done at the Mesa IR level, with some help while doing the AST to Mesa IR pass. So the new intermediate language chain lacked linking support.

The first approach was trying to convert just the needed NIR stuff back to Mesa IR and then call the Mesa IR linker, but that only worked in some limited cases. Additionally, for this extension the linker rules change significantly. As mentioned, on SPIR-V names are optional, so everything needs to work without names. In fact, the current support just ignores names, for simplicity, and to ensure that everything works without them. So we ended up writing a new linker, based on NIR, and based on the needs of ARB_gl_spirv.

The other main reason for the delay was the significant changes to the SPIR-V to NIR pass, and to NIR in general, to support SSBO/UBO (derefs were added), and also the native support of transform feedback, as Vulkan added a new extension that we wanted to use. Here I would like to thank Jason Ekstrand for his support and patience ensuring that all those changes were compatible with our ARB_gl_spirv work.

So how about removing Mesa IR linking?

So, now that it is done, perhaps some people will wonder if this work could be used to remove Mesa IR from the GLSL intermediate language chain, especially taking into account that there were already some NIR-based linking utilities. My opinion is no 😉

There are several reasons. The main one is that, as mentioned, the ARB_gl_spirv linking rules are different, specifically in the lack of names, while the GLSL rules are heavily based on names. During the review it was suggested that this could be abstracted somehow in order to reuse code. But that doesn't solve the issue that the current Mesa IR linker supports linking several different versions of GLSL, and, more importantly, performs the validation and error checking that is needed for GLSL but not for ARB_gl_spirv (so it is not done there). And that would be a really big amount of work to redo. In my opinion, redoing it on NIR would not bring a significant advantage. So the current NIR linker is just the implementation of a linker focused on the new ARB_gl_spirv rules, and it would just share utility NIR-based linking methods, or specific subfeatures, as far as possible (like the mentioned transform feedback support).

Final words

If you are interested in more details about the work done to implement ARB_gl_spirv/ARB_spirv_extensions support, you can check the presentation I gave at FOSDEM 2018 (slides, video) and the update I gave the same year at XDC (slides, video).

And finally I would like to thank all the people involved. First, thanks to Nicolai Hähnle for starting the work that we used as a basis.

The igalians who worked on it at some point were Eduardo Lima, Alejandro Piñeiro, Antía Puentes and Neil Roberts, with some help from Iago Toral and Samuel Iglesias at the beginning. And finally, thanks to Arcady Goldmints-Orlov, who dealt with the review feedback for the two ~80-patch merge requests (the Mesa one and the Piglit one) that I created, when I was needed elsewhere.

And thanks a lot to all the reviewers, especially Timothy Arceri, Jason Ekstrand and Caio Marcelo.

Finally, big thanks to Intel for sponsoring our work on Mesa, Piglit, and CTS, and also to Igalia, for having me working on this wonderful project.

Bringing VK_KHR_16bit_storage to Intel GPUs

Just yesterday, Vulkan 1.0.54 was released. Among other things, it includes the specification for a new extension, VK_KHR_16bit_storage. And just yesterday, we sent to mesa-dev the implementation of this extension for Intel gen8+ GPUs, which is the outcome of the effort of the igalians José María Casanova, Andrés Gómez, Eduardo Lima, and myself.

In short, this extension allows the use of 16-bit types (half floats, 16-bit ints, and 16-bit uints) in shader input and output interfaces, push constant blocks, and buffers (shader storage buffer objects). The only operations that you can do with those 16-bit variables in the shader are 16-bit to 32-bit and 32-bit to 16-bit conversions. So no arithmetic (adds, muls, etc) operations for now. The value of this feature at this point is to reduce the memory bandwidth when feeding and getting the data from a shader. It will also be the basis for future extensions defining 16-bit arithmetic operations.
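From the application point of view, the feature is advertised through a feature struct. This is a sketch (not part of our series; the physical_device variable is illustrative) of how an application would check for it, relying on VK_KHR_get_physical_device_properties2:

VkPhysicalDevice16BitStorageFeaturesKHR storage16 = {};
storage16.sType =
   VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES_KHR;

VkPhysicalDeviceFeatures2KHR features2 = {};
features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2_KHR;
features2.pNext = &storage16;

vkGetPhysicalDeviceFeatures2KHR(physical_device, &features2);

if (storage16.storageBuffer16BitAccess) {
   /* 16-bit types can be used inside SSBOs on this device */
}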

Taking into account that the series is still in the review process, I will not go too deep into the technical details of the implementation. In general, most of the changes were related to the assumption that all we had were 32 or 64-bit types, so we just needed to update some conditions to take into account the 16-bit types supported by the HW. In any case, I think that I can list three issues that required some extra work on our side:

  • One of the subfeatures we needed to support is being able to define 16-bit input vertex attributes. A really good read about how this is implemented and supported on Intel HW is Ben Widawsky's post "GEN Graphics And The URB", which explains in detail how this is done for 32-bit vertex inputs. We used this post as another source of documentation when we implemented the support for 64-bit vertex attributes last year (I briefly mentioned it in my previous blog post). In the 64-bit case, when feeding the shader with the data, you can configure how the 64-bit data is passed to the shader: there is a surface format that does an implicit conversion to 32-bit, and another that passes it without any conversion (the PASSTHRU format). You use one or the other depending on the type of the variable in the shader. But for the 16-bit case there is just one surface format, and as the FormatConversion section of the reference manual points out, this surface format does an implicit 16-bit to 32-bit conversion. In order to work around that, we needed to change the surface format on the fly, using a 32-bit format, and then reorder the data when it arrives at the shader.

  • Most of the surface reads/writes used in the Intel driver are untyped surface read/write messages. Unfortunately, those are 32-bit wide messages, so we needed to implement support for, and use, a different kind of message: byte scattered read/write messages. The reference manual already warns that it is likely to be better to use a different message (for performance reasons). In any case, using this message is only really needed for variables of one or three components. Eduardo already has a patch that uses 32-bit untyped read/write messages when possible.

  • For a render target write message (so, for example, the output of a fragment shader), we enabled the 16-bit payload using the data format bit (Data Format on the Message Descriptor Definition of Send Messages). But this bit is not available on Broadwell, and doesn't support unsigned ints on Cherryview/Braswell. So for those cases, as a workaround, we needed to use the 32-bit payload, doing an extra conversion from 16-bit to 32-bit before the HW deals with the surface format conversion when writing 32-bit values to a 16-bit format surface.

So the next steps are getting the series reviewed, updating the patches accordingly, and landing it on master. In parallel we are working on optimizations and other improvements we listed while working on the extension (like the already mentioned patch from Eduardo).

Finally, I would like to thank Intel for sponsoring this work and for their support. Also, thanks to Iago Toral and Samuel Iglesias for sharing their experience developing the 64-bit support on both OpenGL and Vulkan, which helped us to implement this extension.

One year of Mesa

Changes

During the last year and something, my work at Igalia was focused on the Intel i965 driver for Mesa, the open source OpenGL implementation. Iago and Samuel had been working on it for some time; then Edu and Antía joined, and then I joined myself, as the fifth igalian.

One of the reasons I will always remember working on this project is that I have also been a father for a year and something, so it will always be easy to relate one to the other. Changes! A project change plus a total life change.

In fact, parenthood affected a little how I joined the project. We (as Igalia) decided around April that I would be a good candidate to join it. But as my leave of absence was estimated to start around the last weeks of May, and it would be strange to start a project and suddenly disappear, we decided to postpone the joining until just after the parental leave. So I spent some time closing and transferring the stuff I was doing at Igalia at that moment, and starting to get into the specifics of the Intel driver, and then at the end of May my parental leave started. Pedro is here!

 

Pedro’s day zero
Parenthood

Mesa can be a challenging project, but nothing compared to parenting. A lot of changes. In Spain the usual parental leave for the father is 15 days, where the only flexibility is that you need to take 2 just after the child is born, and choose when to take the other 13. But those 13 need to be taken in a row. So I decided to take the 15 days in a row. Igalia gives you 15 extra days; in fact, the equivalent of 15 days, as they give you flexibility in how to take them. So instead of taking them as 15 full days, I decided to take 30 part-time days. Additionally, Pedro's (maternal) grandmother came to live with us for a month. That allowed us to go back to work step by step. So it was something like:

  • Pedro was born; I am on full parental leave, plus grandmother living at our house helping.
  • I start to work part-time.
  • Grandmother goes back to Brazil.
  • I start to work full-time.
  • Pedro's mother goes back to work part-time.

And probably the most complicated moment was when Pedro's mother went back to work. More or less at that time we switched the usual (paternal) grandparents' weekend visits to in-week visits. So what is my usual timetable? Three days per week I work 8:00-14:30 at home, and I take care of Pedro in the afternoon, when his mother goes to work. Two days per week it is 8:00-12:30 at home, and 15:00-20:00 at my parents' home. And that is more or less it, as no day is the same as the other 😉 And now I go more to the parks!

Pedro playing in a park

There were other changes. For example, I switched from being one of the usual suspects at the Igalia office to working mostly from home. In any case the experience is totally worth it, and it is incredible how fast the time has passed. Pedro is one year old already!

Pedro destroying his birthday cake
Mesa tasks

I almost forgot that this blog post started out talking about my work on Mesa. So many photos! During this year, I have participated (although not alone, especially on the spec implementations) in the following tasks (except for the last one, in chronological order):

As for which task I liked the most, I think it was the work done on optimizing the NIR to vec4 pass. And let's not forget that the work done by Igalia to implement internalformat_query2 and vertex_attrib_64bit, plus the work done by Iago and Samuel on fp64, helped to get Mesa 12.0 exposing OpenGL 4.2 support on Broadwell and later.

What about accessibility?

Working full time on the same project for a full year, plus learning how to parent, meant that I didn't have much spare time for accessibility development. Having said that, I have been reviewing ATK patches, doing the ATK releases regularly and keeping an eye on the GNOME accessibility mailing lists, so I'm still around. And we have Joanmarie Diggs, also a fellow igalian, still rocking at improving Orca, WebKitGTK and WAI-ARIA.

Thanks

Finally, I'd like to thank both Intel and Igalia for supporting my work on Mesa and i965 all this time, especially for allowing me such a flexible timetable, where what matters is what you deliver and not when you do the work, letting me enjoy parenthood. I also want to thank the igalian colleagues I have been working with during this year, Iago, Samuel, Antía, Edu, Andrés, and Juan, and all those at Intel who have been helping and reviewing my work, like Jason Ekstrand, Kenneth Graunke, Matt Turner and Francisco Jerez.

 

Introducing Mesa intermediate representations on Intel drivers with a practical example

Introduction

The recent big news about Igalia's work on Mesa was that our effort to get the ARB_gpu_shader_fp64 and ARB_vertex_attrib_64bit extensions implemented for Intel Gen8+ allowed exposing OpenGL 4.2 on those GPUs. But I will let other igalians talk in detail about them (no pressure ;)).

In a previous blog post I mentioned that NIR was intended to replace GLSL IR. Although that was true in the context I was talking about, that comment could be somewhat misleading, so I will try to clarify it.

Intermediate representations on Intel Mesa drivers

So first, let’s list the intermediate representations that you would find when working on Mesa Intel drivers:

  • AST (Abstract Syntax Tree): calling it an intermediate language is somewhat an abuse of language. This is the tree representation of your GLSL shader just after parsing it with Flex/Bison.
  • Mesa IR (Intermediate Representation): also called HIR and GLSL IR. A real intermediate language. It is converted from the AST. Here you have optimizations, linking support, etc.
  • NIR (New Intermediate Representation): a new intermediate language added recently.

So the first question would be: why three? Having the AST and another intermediate representation is easy to explain: the AST is a raw tree representation, not useful for generating code. But why both Mesa IR and NIR?

The current Mesa IR was created some years ago. The design decisions behind it had their advantages, but also some disadvantages. For example, its tree-like structure makes it complex to traverse, which makes it difficult to navigate, implement optimizations, etc. Another disadvantage is that it is not in SSA form, again making some optimizations hard to write. You can see a summary of all this in Ian Romanick's presentation "Three Years Experience with a Tree-like Shader IR".

So at some point an effort was started to "flatten" Mesa IR and add SSA support. But the conclusion was that the effort to modify Mesa IR was so big that it was worth just starting from scratch using the lessons learned, as explained in Connor Abbott's email "A new IR for Mesa", in which he proposed this new IR.

Some time later, NIR was ready for production, and as I mentioned in the blog post that I'm clarifying right now, some parts of the Mesa Intel driver were reimplemented in order to use NIR instead of Mesa IR. So Mesa IR was being replaced there. Where exactly? In the parts where the final assembly code is generated. And now that this is finished (at least on the i965 driver), we can say that Mesa IR is not used to generate code at all.

So right now there is an AST -> Mesa IR -> NIR chain. What is the plan now? Write an AST -> NIR pass and completely remove Mesa IR? This same question was asked (among other things) in January 2016, in the mesa-dev thread "Nir, SCons, and Gallium". And the answer was "no", as you can see in two of Ian Romanick's replies (here and here). The summary is that Mesa IR has several GLSL specifics that aren't appropriate for NIR's level. In that sense, NIR is a step below Mesa IR, closer to the GPU needs. It is also worth mentioning that since then, Vulkan support was added to Mesa. In order to support SPIR-V (Vulkan's shading language, which is an intermediate representation itself), a SPIR-V -> NIR pass was created. In that sense, for OpenGL there is an OpenGL-specific intermediate representation, Mesa IR, and for Vulkan there is a Vulkan-specific intermediate representation, SPIR-V, and both are translated to the same common representation, NIR.
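Summarizing, the two chains that feed NIR look like this:

GLSL -> AST -> Mesa IR -> NIR
SPIR-V -> NIR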

Practical example:

So, what does all that mean in practice? Do I need to deal with all those intermediate representations? Well, as with anything in life, it depends. If, for example, you want to provide support for a new GLSL feature, you would need to touch all three: you would need to modify the flex/bison files, so the AST would need to be updated, and then Mesa IR, NIR, and the passes that transform one into the other. But if you want to support a GLSL feature that is already supported, just on new hw, most likely you will not need changes on any of them, but just on the code that generates the final assembly using NIR.

And what about other features like warnings and errors? Right now most of them are detected at the AST level, and some at the Mesa IR level. NIR doesn't trigger any error/warning yet. It contains several asserts, but basically because it assumes that at that point the representation of the shader should already be correct, so if something is wrong, it means that the developer working on NIR did something wrong, as opposed to the developer who wrote the GLSL shader.

So let's go with a practical example I was working on: uninitialized variable warnings ("undefined values" in the bug tracking the issue). In short, this is about warning the developer about things like this:

out vec4 color;

void main()
{
   vec4 myTemp;

   color = myTemp;
}

What would be the color on screen? Who knows. So although this is a feature you can live without, it is a nice-to-have.

On the original bug, it was mentioned that this kind of error is easy to detect, as NIR has ssa_undef values, so we just need to check if they are used. And in fact, when I started to work on it, I quickly found how to raise a warning. For example, the method nir_print.c:print_ssa_def, used for debugging, is easy to modify in order to point out that an undef is being used. But trying to raise the warning there has some problems:

  • As mentioned, NIR doesn't have any warning/error triggering mechanism implemented, so you would need to add one.
  • You want to include this warning in the OpenGL InfoLog, but as mentioned NIR is used for both OpenGL and Vulkan, and right now it doesn't maintain any info about the origin.
  • And, perhaps more importantly, at that point you lack the context data that tells you which source code line you are working on.

In fact, the last bullet point also applies to Mesa IR. There are some warnings raised at the Mesa IR level, but they are "line-less". I'm not sure about other developers, but for me, this kind of warning without a reference to the source line number would be annoying to use. Does that mean that the only option would be the totally raw AST tree?

Fortunately, Mesa IR was already recording whether a variable had been statically assigned or not (to check some other possible errors). This is computed during the AST to Mesa IR pass, and in fact the documentation mentions that this value is only valid at that moment. So we would be somewhere between AST and Mesa IR. So when to raise the warning? The straightforward solution would be when a variable is used, just before/after the error "variableX undeclared" is raised. But that is not so easy. For example:

float myFloat1;
float myFloat2;

myFloat1 = myFloat2;

How many warnings should we raise? Just one, for myFloat2. But technically we are also using myFloat1, and it is uninitialized. So we need to differentiate both cases. The AST being so raw, at that point we don't have that information, and in fact it is also impossible to go up to the parent expression in order to compute it. So we needed to add an attribute to the AST node, which I called is_lhs (as in "is left-hand side"). That attribute is set while the parent expressions are being transformed.
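To give an idea of the final shape of the check, this is a simplified sketch (not the exact Mesa code; the variable names are illustrative) of what ends up happening in the AST to Mesa IR pass when a variable is used:

/* Warn only when an unassigned variable is read, not when it is the
 * left-hand side of an assignment.
 */
if (!var->data.assigned && !ast->is_lhs) {
   _mesa_glsl_warning(&loc, state,
                      "`%s' used uninitialized", var->name);
}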

If you are paying attention, you can probably start to see the collateral effect of this. The AST being so raw, there are several corner cases where that attribute needs to be manually set. In fact the first commit of the series already covers several corner cases. And in spite of this, once the code reached master, there were two cases of false positives that needed extra checks (for built-in variables and for inout/out function parameters).

After those two false positives, it was clear that the handling of this warning was spread all along the code that does the AST to Mesa IR pass, so it seemed easy to break. So I decided to also send some unit tests to verify that it keeps working: first the make check test that exercises the warning, and then the unit tests themselves. 30 unit tests (initially 28, but the reviewer asked for two more). Not a bad number for a warning.

Final words

At this point, one could wonder if it still makes sense to have this warning in the AST to Mesa IR pass, and whether it would have been better to do it as initially proposed, on NIR. But although it is true that "just detecting it" would be easier on NIR, without dealing with so many corner cases, I still think that adding support for raising warnings/errors compatible with the OpenGL InfoLog, and somehow carrying over the original source code line number, would mean too many changes to both Mesa IR and NIR. More changes than dealing with those corner cases in the AST to Mesa IR pass. And in any case, if in the future the situation changes and it makes sense to move the warning to NIR, we would have the unit tests to help ensure that we don't introduce regressions.

In relation to the intermediate representations, just note that I'm focusing on the Intel driver. Gallium drivers use another intermediate representation, called TGSI. As far as I know, those drivers have an AST -> Mesa IR -> TGSI chain, and right now there is a work-in-progress AST -> Mesa IR -> NIR -> TGSI chain that will be used in some specific cases. But all this is beyond my knowledge, so you would need to investigate if you are interested.

Appendix, extra documentation:

If you want more details about Mesa IR, you can:

  • Read Mesa IR README
  • Read past blog posts from fellow igalian Iago Toral (when NIR was not available yet): post 1 and post 2

If you want extra information about Mesa NIR, you can read:

  • The NIR announcement email "A new IR for Mesa"
  • The NIR documentation page

Optimizing shader assembly instructions on Mesa using shader-db (II)

In my previous post I mentioned that I have been working on optimizing the shader instruction count for specific shaders guided by shader-db, and showed one specific example. In this post I will show another one, slightly more complex in both the triaging and the solution.

Some of the shaders with a worse instruction count can be found at shader-db/shaders/dolphin. Again I analyzed them in order to get the simplest shader possible with the same issue:

   #version 130

   in vec2 myData;

   void main()
   {
      gl_Position = vec4(myData, 3.0, 4.0);
   }

Some comments:

  • It also happens with uniforms (so you can replace "in vec2" with "uniform vec2").
  • It doesn't happen if you use the input directly. You need this kind of "input plus constants" combination.

So, as in my previous post, I executed the compilation using the optimizer debug option (INTEL_DEBUG=optimizer). In the IR case I got the following files:

  • VS-0001-00-start
  • VS-0001-01-01-opt_reduce_swizzle
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate
  • VS-0001-02-07-opt_register_coalesce

This being the desired outcome (i.e. the content of VS-0001-02-02-dead_code_eliminate):

0: mov m3.z:F, 3.000000F
1: mov m3.w:F, 4.000000F
2: mov m3.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: vs_urb_write (null):UD

Unsurprisingly it is mostly movs. As opposed to the shader I mentioned in my previous post, where in both cases the same optimizations were applied, in this case the NIR path doesn't apply the last optimization (the second register coalesce). So this time I will focus on the starting point and on the state just after the dead code eliminate pass.

So on IR, the starting point (VS-0001-00-start) is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf1.0.zw:F, vgrf2.xxxy:F
3: mov vgrf1.0.xy:F, attr17.xyxx:F
4: mov vgrf0.0:F, vgrf1.xyzw:F
5: mov m2:D, 0D
6: mov m3:F, vgrf0.xyzw:F
7: vs_urb_write (null):UD

and the state after the dead code eliminate is the following one:

0: mov vgrf1.0.z:F, 3.000000F
1: mov vgrf1.0.w:F, 4.000000F
2: mov vgrf1.0.xy:F, attr17.xyyy:F
3: mov m2:D, 0D
4: mov m3:F, vgrf1.xyzw:F
5: vs_urb_write (null):UD

On NIR, the starting point is:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov vgrf0.0.xy:F, attr17.xyyy:F
3: mov vgrf1.0.xy:D, vgrf0.xyzw:D
4: mov vgrf1.0.zw:D, vgrf2.xxxy:D
5: mov m2:D, 0D
6: mov m3:F, vgrf1.xyzw:F
7: vs_urb_write (null):UD

and the state after the dead code eliminate is the following one:

0: mov vgrf2.0.x:F, 3.000000F
1: mov vgrf2.0.y:F, 4.000000F
2: mov m3.xy:D, attr17.xyyy:D
3: mov m3.zw:D, vgrf2.xxxy:D
4: mov m2:D, 0D
5: vs_urb_write (null):UD

The first difference we can see is that although the instructions are basically the same at the starting point, the order is not. In fact, if we check the different intermediate steps (I will not show them here to avoid making the post too long), although the optimizations are the same, how and what gets optimized is somewhat different. One could conclude that the problem is this order, but if we take a look at the final step of the NIR path, there isn't anything clearly indicating that the shader can't be simplified further. Specifically, instruction #3 could go away if instructions #0 and #1 wrote directly to m3 instead of vgrf2, which is what the IR path does. So it seems that the problem is in the register coalesce optimization.

As I mentioned, there is a slight order difference between NIR and IR. That means that in the NIR case, between instruction #3 and instructions #0/#1 there is another instruction, which sits somewhere else on the IR path. So my first thought was that the optimization was only checking against the immediately previous instruction. Once I started to look at the code it showed that I was wrong: for each instruction, there is a loop checking all the previous instructions. What I noticed is that in that loop, every check that rejected a previous instruction was a break. So I initially thought that perhaps one of those breaks should in fact be a continue. This seemed to be confirmed when I did the quick hack of replacing them all with continues. It proved wrong as soon as I saw all the piglit regressions I had in hand. So after that I did the proper thing and debugged it step by step. Using gdb, the condition that was stopping the optimization from checking earlier instructions was the following one:

/* If somebody else writes our destination here, we can't coalesce
 * before that.
 */
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written))
   break;

The code is probably hard to understand out of context, but the comment is clear. Coalescing two instructions is possible when the previous one writes to a register that the current instruction reads. But obviously, that can't be done if there is an instruction in the middle that writes to the same register. And that is exactly what is happening here: if you look at the final state of the NIR path, we want to coalesce instruction #3 with instructions #0 and #1, but instruction #2 is writing to m3 too.

So, is that it? Not exactly. Remember that IR was able to simplify this, and it can't be only because the order was different. If you take a deeper look at those instructions, there are some x, y, z, w suffixes after the register names. Those report which channels the instructions are writing to. As I mentioned in my previous post, this work is about providing a NIR to vec4 pass, so those registers are vectors. Instruction #3 can be read as "move the content of components x and y of register vgrf2 to components z and w of register m3". And instruction #2 can be read as "move the content of components x and y of register attr17 to components x and y of register m3". So although we are writing to the same destination, we are writing to different components, meaning that it would be safe to do the coalescing. We just need to be sure that there isn't any component overlap between the current instruction and the previous one we are checking against. Fortunately the registers already record which channels they write to, in a variable called "writemask". So we only need to change that code to the following:

/* If somebody else writes the same channels of our destination here,
 * we can't coalesce before that.
 */
if (inst->dst.in_range(scan_inst->dst, scan_inst->regs_written) &&
    (inst->dst.writemask & scan_inst->dst.writemask) != 0) {
   break;
}

The patch with this change was sent to the mesa list (here), approved and pushed to master.

Final words

So again, a problem where the solution was easier to write than to find. But in any case, it showed a significant improvement. Using the shader-db tool to compare before and after the patch:

total instructions in shared programs: 1781593 -> 1734957 (-2.62%)
instructions in affected programs:     1238390 -> 1191754 (-3.77%)
helped:                                12782
HURT:                                  0

Optimizing shader assembly instructions on Mesa using shader-db

Lately I have been working on Mesa. Specifically, I have been working with my fellow igalians Eduardo Lima and Antía Puentes to provide a NIR to vec4 pass for the i965 backend. I will not go too much into the details, but in summary, NIR is a new intermediate representation for Mesa. Intermediate as in sitting between the OpenGL GLSL language used for shaders and the final GPU machine instructions of each specific Mesa backend. NIR is intended to replace the previous GLSL IR, and in some places this is already done. If you are interested in the details, take a look at the NIR announcement and the NIR documentation page.

Although the bug is still open, Mesa master already has the functionality for this pass, and in fact it is the default now. This new NIR pass provides the same functionality as the one available with the old GLSL IR pass (from now on, just IR). This was properly tested with piglit. But although the total instruction count has in general improved, we get a worse instruction count on some specific known shaders if we use NIR. So the next step is to improve this. It is an ongoing effort, like these patches from Jason Ekstrand, but I would like to share some of my experience so far.

In order to guide this work, we have been using shader-db. shader-db is a shader database, with an executable to compile those shaders, and a tool to compare two runs of that compilation. Usually it is used to verify that the optimization you are implementing really improves the instruction count, or even to justify your change; several Mesa commits include before and after shader-db statistics. But in this case we have been using it as a guide to what we could improve. We compile all the shaders using IR and using NIR (via the environment variable INTEL_USE_NIR), and check which shaders have an instruction count regression.
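The workflow is something like this (a sketch; the output file names are made up, and report.py is the comparison tool that comes with shader-db):

INTEL_USE_NIR=0 ./run shaders > result-ir.txt
INTEL_USE_NIR=1 ./run shaders > result-nir.txt
./report.py result-ir.txt result-nir.txt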

Case 1: subtraction needs an extra mov.

Ok, so one of the shaders with a worse instruction count is humus-celshading/4.shader_test. After some analysis of the problem, I reduced it to a simpler shader with the same issue:

in vec4 inData;

void main()
{
   gl_Position = gl_Vertex - inData;
}

This simple shader needs one extra instruction when using NIR. So yes, a simple subtraction is getting worse. FWIW, this is the desired final shader assembly:

0: add m3:F, attr0.xyzw:F, -attr18.xyzw:F
1: mov m2:D, 0D
2: vs_urb_write (null):UD

Note that there isn't an assembly subtraction instruction: it is represented as negating the second operand and using an add (this seems like captain-obvious information here, but it will be relevant later).

So at this point one option would be to start looking at the backend (remember, i965) code for vec4, specifically the optimizations, and check if we see something. Those optimizations are called from brw_vec4.cpp. They are in general common to any compiler: dead code elimination, copy propagation, register coalescing, etc. And usually they are executed several times in several passes, and some of them are simplifications to be used by other optimizations (for example, if your copy propagation pass works, then it is common that your dead code elimination pass will remove an instruction). So with all those optimizations and passes, how do you find the problem? Although it is a good idea to read the code of those optimizations to know how they work, it is usually not enough to know where the problem is. So this is again a debugging problem, and as usual, you want to see what is happening step by step.

For this I executed the compilation again, with the following environment variables:

INTEL_USE_NIR=1 INTEL_DEBUG=optimizer ./run subtraction.shader_test

This option prints out to a file the shader assembly as compiled at each optimization pass (when the pass makes changes). So, for example, I got the following files for both cases:

  • VS-0001-00-start
  • VS-0001-01-04-opt_copy_propagation
  • VS-0001-01-07-opt_register_coalesce
  • VS-0001-02-02-dead_code_eliminate

So in order to get the final shader assembly, a copy propagation, a register coalesce, and a dead code eliminate pass were executed. BTW, I found that environment variable while looking at the code. It is not listed on the Mesa envvar page, something I assume is a bug.

So I started to look at the differences between the steps. Taking into account that in both cases the same optimizations were executed, and in the same order, I looked for differences between one and the other at each step. And I found one difference in the copy propagation.

So let’s see the starting point using IR:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, vgrf2.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf2.0:F, -attr18.xyzw:F
1: add vgrf0.0:F, attr0.xyzw:F, -attr18.xyzw:F
2: mov m2:D, 0D
3: mov m3:F, vgrf0.xyzw:F
4: vs_urb_write (null):UD

And the starting point using NIR:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, vgrf0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

And the outcome of the copy propagation:

0: mov vgrf0.0:UD, attr0.xyzw:UD
1: mov vgrf1.0:UD, attr18.xyzw:UD
2: add vgrf2.0:F, attr0.xyzw:F, -vgrf1.xyzw:F
3: mov m2:D, 0D
4: mov m3:F, vgrf2.xyzw:F
5: vs_urb_write (null):UD

Although it is true that the starting point for NIR already has one extra instruction compared with IR, that extra one gets optimized away in the following steps. What caught my attention was the difference between what happens with instruction #1 in the IR case, compared with the equivalent instruction #2 in the NIR case (the add). In the IR case, copy propagation is able to propagate attr18 from the previous instruction, so it is easy to see that this can be simplified in the following optimization steps. But that doesn't happen in the NIR case: on NIR, instruction #2 remains the same after the copy propagation.

So I started to take a look at the implementation of the copy propagation optimization code (here). Without going into details, this pass analyses each instruction, comparing it with the previous ones in order to know if it can do a copy propagation. So I looked into why, for that specific instruction, the pass concludes that it can't be done. At this point you could use gdb, but I used some extra printfs (sometimes they are useful too). So I found the check that rejected that instruction:

bool has_source_modifiers = value.negate || value.abs;

<skip>

if (has_source_modifiers && value.type != inst->src[arg].type)
    return false;

That means that if the source of the previous instruction you are checking against is negated (or has an abs), and the types are different, you can't do the propagation. This makes sense, because negation works differently on different types. If we go back to the shader assembly output, we find that it is true that the types (those F, D and UD just after the registers) are different between the IR and the NIR case. Why didn't we worry about this before? Why wasn't this failing any piglit test? Well, if you look more carefully, the instructions that had different types are the movs. In both cases, the type is correct on the add instruction. And in a mov, the type is somewhat irrelevant: you are just moving raw data from one place to the other; it only matters for the ALU operation. But in any case, the type is indeed wrong on those registers (compared with the original GLSL code), and as we are seeing, it is causing problems in the optimization passes. So next step: check where those types are filled in.

Searching a little in the code, and using gdb this time, I found that this is done in the function nir_setup_uniforms at brw_vec4_nir.cpp, while creating a source register variable. And curiously, it is passing along the type that came from NIR:

src_reg src = src_reg(ATTR, var->data.location + i, var->type);

and printing out the content of var->type with gdb, it properly shows the type used in the GLSL code. If we go deeper, into the src_reg constructor:

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    init();

    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix()))
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
    else
        this->swizzle = BRW_SWIZZLE_XYZW;
}

We see that the type is only used to fill the swizzle. But if we compare this to the equivalent code for a destination register:

dst_reg::dst_reg(register_file file, int reg, const glsl_type *type,
unsigned writemask)
{
    init();

    this->file = file;
    this->reg = reg;
    this->type = brw_type_for_base_type(type);
    this->writemask = writemask;
}

dst_reg is also filling in the type internally, something src_reg is not doing. So at this point the patch is straightforward: just fill src_reg->type using the constructor's type parameter. The patch was approved and is already on master.
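For reference, the change is essentially one line in that constructor, something like this (a sketch of the idea; the NULL check is there because the constructor accepts a NULL type):

src_reg::src_reg(register_file file, int reg, const glsl_type *type)
{
    init();

    this->file = file;
    this->reg = reg;
    if (type && (type->is_scalar() || type->is_vector() || type->is_matrix()))
        this->swizzle = brw_swizzle_for_size(type->vector_elements);
    else
        this->swizzle = BRW_SWIZZLE_XYZW;
    /* New: fill in the type, as dst_reg already does */
    if (type)
        this->type = brw_type_for_base_type(type);
}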

Coming up next

In the end, I didn't need to improve any of the optimization passes at all, as the bug was elsewhere, but the debugging steps for this work are still the same. In fact it was the usual bug that is harder to find (for simplicity I summarized the triaging) than to solve. For the next blog post, I will explain how I worked on another instruction count regression, somewhat more complex, that did need a change in one of the optimization passes.

 

GNOME 3.12.1 out: PDF accessibility progress

Welcome to a new "GNOME 3.12 is out" blog post, somewhat late because I wanted to focus on 3.12.1 instead of the usual 3.12.0, and because I was away for several days due to the Easter holidays.

Flowers and a mill at Keukenhof

As I said just after 3.10, Antía worked hard on adding keyboard navigation support to Evince, and Adrián provided an implementation of the tagged PDF specification for Poppler. The plan for the 3.12 cycle was to build upon their work in order to improve Evince’s accessibility support.

Thanks to the tagged-PDF implementation for Poppler, we were able to start experimenting with tagged PDF documents in Evince, and playing with all the cool things that tagged PDFs bring to the table. Finally we have information available about whether we are in a paragraph, where a list starts, the different levels of headings, and pretty much anything else one can put in an element tag. But while Adrián and Carlos García kept working on getting their patches pushed upstream (more than 15 patches were pushed during this cycle), Joanmarie Diggs and I realized that "only" a bare/plain implementation of this specification would be a hard animal to tame in order to be used by assistive technologies: additional parsing and structuring will be needed in Poppler to properly implement ATK support in Evince.

Additionally, having working keyboard support in Evince made it finally possible to test real-world document accessibility with Orca (as opposed to just Accerciser). But in doing so, we found that the existing ATK support was incomplete or wrong in several places. So even the most basic PDF documents, those that should also be accessible without tagged PDF, were not properly accessible. Taking all this into account, we decided to focus on fixing the bugs in Evince's core accessibility support, as doing so would make all PDFs more accessible, but at the same time to continue working on the tagged-PDF support in order to develop a concrete list of the improvements we will need in Poppler.

So the main tasks on Evince during this cycle were:

  • Reimplement AtkText
  • Expose all document pages to the accessibility tools, not only the current one
  • Implement AtkDocument
  • Fix caret navigation and hyperlink management

As a result of these changes:

  • Several accessibility-triggered crashes have been eliminated
  • Orca’s SayAll feature now works with Evince
  • Prosody when reading documents with Orca has been improved
  • The caret can be positioned and text selected via AT-SPI2

Some of this work was not quite in time for the 3.12.0 release, but has been included in 3.12.1. In addition, we are continuing to work on accessibility-related bug fixes which we anticipate will be included in 3.12.2.

As for what’s next: We encourage Orca users to give Evince a try and help us identify the bugs that remain in Evince’s core accessibility support. Anything that they find will be added to our high-priority TODO list. In the meantime, we will continue to work on enhancing Poppler’s tagged-PDF support and then exposing that structural information through Evince to assistive technologies.

Finally, I would like to thank the GNOME Foundation and the Friends of GNOME supporters for their contributions towards making a more accessible GNOME, as this work would not be possible without them.

Sponsored by GNOME Foundation

GNOME Accessibility Update: 3.10 Release, Montreal Summit and Plans for 3.12

3.10 is out, what’s new about accessibility?

As you probably already know, GNOME 3.10 was released several weeks ago, with lots of new accessibility goodness:

  • Magnifier focus and caret tracking: Finally, the focus and caret tracking feature of GNOME Shell’s magnifier has landed. Now the magnified view automatically follows the writing caret and changes in focus so you can always see where you are without having to move the mouse. You can read more about this work in this post written by Magdalen Berns, the GSoC student that implemented this feature.
  • GNOME Shell improvements: One of the new features of GNOME 3.10 is the new System Status Menu. This menu includes several new visual elements which were reviewed and enhanced in order to ensure they would be fully keyboard navigable and accessible through accessibility tools like Orca. Keyboard navigability was also added to the calendar pop-down in the shell panel, though admittedly there is some room for improvement which we hope to address in GNOME 3.12.
  • PDF accessibility: Evince keyboard support has landed. Now users can press F7 to activate a caret for navigation and selection within the document being read. This new support was also made to work with Orca, so that PDF content can be accessed by users who are blind directly in Evince. Support for tagged PDFs is currently being added to Poppler and will be used to further improve accessibility support in Evince. This work is being done by Igalia, having been funded by the Friends of GNOME accessibility campaign. You can read more about this work on Antía Puentes’s “Accessibility in Evince” and Adrián Pérez’s “Tagged-PDF: Coming to a Poppler near you” blog posts.
  • A new global keyboard shortcut for Orca: Now the screen reader can be easily turned on/off at any time by just pressing Super+Alt+S. This might seem like a small change, but it is in fact a really big step that allows more distros to be more accessible out of the box.
  • ATK deprecations (a lot): While this does not directly affect the user experience, over time it will make developers' lives easier, and will also lead to cleaner and more easily maintainable code. The first one is the simplification of what used to be extremely confusing and hard-to-implement methods to get a substring from a text-related object. We had been talking about this problem for a long time, and finally agreed upon the new API at this year's GUADEC. Mario Sánchez then added the new method to ATK and AT-SPI2 and also implemented it in WebKitGTK (see the usage sketch after this list). The other major change is related to focus handling. One signal and six methods were deprecated, simplifying the situation *a lot* in that regard.
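For illustration, this is roughly how the new, simplified text API is used (a sketch; the text and offset variables are illustrative):

/* Fetch the line of text at "offset" with a single call, instead of
 * the old confusing at/after/before-offset method family.
 */
gint start = 0, end = 0;
gchar *line = atk_text_get_string_at_offset (text, offset,
                                             ATK_TEXT_GRANULARITY_LINE,
                                             &start, &end);
/* ... use the string ... */
g_free (line);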

What’s next?

Captain Obvious to the rescue: 3.12. Although 3.10 was better than 3.8, our plan is to make 3.12 even better. Like a lot of other teams, we started the new cycle listing, analyzing and prioritizing everything we need to do, using the Montreal Summit as a kickoff for 3.12 and making the most of being able to talk face-to-face with other GNOME developers. There are always changes to keep up with: new applications, new widgets, and new deprecations. But right now, the most important change in progress is Wayland. A lot of work was done for 3.10, so that we have the possibility of running GNOME 3 in a Wayland session. It is still not production-ready (in my humble opinion, it is alpha status), but the plan is to fill the gaps for 3.12, and that includes accessibility.

But Wayland is not the only topic for 3.12. During the weekly Accessibility Team meeting after the summit, we discussed all the improvements planned for 3.12:

  • Complete Wayland support
  • Create a new asynchronous API for AT-SPI2
  • Add configuration UI for some already-implemented magnifier features (focus/caret tracking and tinting)
  • Homogenize keyboard navigation within GNOME (to be proposed)
  • Update ATK implementations (e.g. GTK, Clutter) for deprecations and new API
  • Implement tagged PDF support in Evince

Finally, I would like to thank my employer Igalia for its continued support of my work on GNOME accessibility as part of my job duties, and the GNOME Foundation for sponsoring my trip to the Montreal Summit.

Sponsored by GNOME Foundation
Igalia: Free Software Engineering

Going back from FOSDEM 2013

Last week planet GNOME was full of "Going to" posts. This is my traditional "Coming from" post. Most of my event-related posts are written after the event itself. I blame the need to write slides. Anyhow, I'm back from FOSDEM 2013.

This is the fourth time that I've attended FOSDEM, making it my second most-visited Free Software event (GUADEC being the first), and the third time that I have given a presentation there.

This year, instead of my usual "GNOME Accessibility State of the Union" talk, I spoke about what is arguably one of the biggest changes in accessibility for free desktop environments: "How GNOME Obsoleted its "Enable Accessibility" Setting" (aka "Accessibility Always On"). The turnout was great in spite of it being the first session on Sunday morning. I don't see the slides uploaded on the FOSDEM page, so I've provided them here for those of you unable to attend.

Additionally, I attended several interesting talks (the number of interesting talks at FOSDEM is overwhelming), met people that I only see at this kind of event, and also participated in a small (somewhat informal) release-team meeting.

Finally I would like to thank Igalia for sponsoring my trip this year.