We, at Igalia, have been involved in enabling ARB_gpu_shader_fp64 extension to different Intel generations: first Broadwell and later, then Haswell. Now IvyBridge support is finished and landed Mesa’s master branch.

This feature was the last one to expose OpenGL 4.0 in Intel IvyBridge with the open-source Mesa driver. This is a big achievement for an old hardware generation (IvyBridge was released in 2012), which allows users to run OpenGL 4.0 games/apps on GNU/Linux without overriding the supported version with a Mesa-specific environment variable.

More goods news… ARB_vertex_attrib64 support has landed too, meaning that we are exposing OpenGL 4.2 on Intel Ivybrige!

Technical details

Diving a little bit into technical details (skip this if you are not interested on those)…

This work is standing on top of the shoulders of Intel Haswell support for ARB_gpu_shader_fp64. The latter introduced support of double floating-point (DF) data types on both scalar and vec4 backends which is, in general, very similar to Ivybridge. If you are interested in the technical details about adding ARB_gpu_shader_fp64 to Intel GPUs, see Iago’s talk at last XDC (slides here).

Ivybridge was the first Intel generation that supported double floating-point data types natively. The most important difference bettwen Ivybridge and Haswell is that both execution size and regioning parameters (stride and width) are in terms of 32-bits, so we need to double both regioning parameters and execution size at DF instructions’ emission.

But this is not the only annoyance, there are others quite relevant like:

We emit an scalar DF instruction with a maximum execution size of 4 (doubled later to 8) to avoid hitting the gen7’s decompression instruction bug -present also in Haswell- that makes the hardware to read 2 consecutive GRFs regardless the vertical stride. This is specially annoying when reading DF scalars, because the stride is zero -we just want to read data from one GRF- and this bug would make us to read next GRF too; furthermore the hardware applies the same channel enable signals to both halves of the compressed instruction which will be just wrong under non-uniform control flow if force_writemask_all is disabled. However, there is also a physical limitation related to this when using Align16 access mode: SIMD8 is not allowed for DF operations. Also, in order to make DF instructions work under non-uniform control flow, we use NibCtrl to choose the proper flags of the execution mask.
32-bit data types to double (and vice-versa) conversions are quite special. Each 32-bit source data should be aligned 64-bits, so we need to apply an stride on the original data in order to keep this aligment. This is because the FPU internals cannot do the conversion if the data is not aligned to the size of the bigger one. A similar thing happens on converting doubles to 32-bit data types: the output elements need to be 64-bit aligned too.
When splitting each DF instruction in two (or more) instructions with an exec_size of 4 as maximum, sometimes it is not so trivial to do and need temporary registers to save the intermediate results before merge them to the real destination.

Due to these things, we needed to improve the d2x lowering pass (now called lower_conversions) which fixes the aforementioned conversions from double floating-point data to 32-bit data types, add some specific fixes in the generator, add code in the validator to detect invalid cases, among other things.

In summary, although Ivybridge is very similar to Haswell, the 64-bit floating point support is quite special and it needs specific code.

Acknowledgements

I would like to thanks Matt Turner and Francisco Jerez from Intel for their insightful reviews, sometimes spotting problems that we did not foresee and their contributions to the patch series. Also, I would like to thanks Juan for his contributions to make this support happen and Igalia for allowing me to work on this amazing open-source project.

Igalia