Playing with the rpi4 CPU/GPU frequencies

In recent days I have been testing how modifying the default CPU and GPU frequencies on the rpi4 increases the performance of our reference Vulkan applications. By default Raspbian uses 1500MHz and 500MHz respectively. But with a good heat dissipation (a good fan, rpi400 heat spreader, etc) you can play a little with those values.

One of the tools we usually use to check performance changes are gfxreconstruct. This tools allows you to record all the Vulkan calls during a execution of an aplication, and then you can replay the captured file. So we have traces of several applications, and we use them to test any hypothetical performance improvement, or to verify that some change doesn’t cause a performance drop.

So, let’s see what we got if we increase the CPU/GPU frequency, focused on the Unreal Engine 4 demos, that are the more shader intensive:

Unreal Engine 4 demos FPS chart

So as expected, with higher clock speed we see a good boost in performance of ~10FPS for several of these demos.

Some could wonder why the increase on the CPU frequency got so little impact. As I mentioned, we didn’t get those values from the real application, but from gfxreconstruct traces. Those are only capturing the Vulkan calls. So on those replays there are not tasks like collision detection, user input, etc that are usually handled on the CPU. Also as mentioned, all the Unreal Engine 4 demos uses really complex shaders, so the “bottleneck” there is the GPU.

Let’s move now from the cold numbers, and test the real applications. Let’s start with the Unreal Engine 4 SunTemple demo, using the default CPU/GPU frequencies (1500/500):

Even if it runs fairly smooth most of the time at ~24 FPS, there are some places where it dips below 18 FPS. Let’s see now increasing the CPU/GPU frequencies to 1800/750:

Now the demo runs at ~34 FPS most of the time. The worse dip is ~24 FPS. It is a lot smoother than before.

Here is another example with the Unreal Engine 4 Shooter demo, already increasing the CPU/GPU frequencies:

Here the FPS never dips below 34FPS, staying at ~40FPS most of time.

It has been around 1 year and a half since we announced a Vulkan 1.0 driver for Raspberry Pi 4, and since then we have made significant performance improvements, mostly around our compiler stack, that have notably improved some of these demos. In some cases (like the Unreal Engine 4 Shooter demo) we got a 50%-60% improvement (if you want more details about the compiler work, you can read the details here).

In this post we can see how after this and taking advantage of increasing the CPU and GPU frequencies, we can really start to get reasonable framerates in more demanding demos. Even if this is still at low resolutions (for this post all the demos were running at 640×480), it is still great to see this on a Raspberry Pi.

v3dv status update 2022-05-16

We haven’t posted updates to the work done on the V3DV driver since
we announced the driver becoming Vulkan 1.1 Conformant.

But after reaching that milestone, we’ve been very busy working on more improvements, so let’s summarize the work done since then.

Multisync support

As mentioned on past posts, for the Vulkan driver we tried to focus as much as possible on the userspace part. So we tried to re-use the already existing kernel interface that we had for V3D, used by the OpenGL driver, without modifying/extending it.

This worked fine in general, except for synchronization. The V3D kernel interface only supported one synchronization object per submission. This didn’t properly map with Vulkan synchronization, which is more detailed and complex, and allowed defining several semaphores/fences. We initially handled the situation with workarounds, and left some optional features as unsupported.

After our 1.1 conformance work, our colleage Melissa Wen started to work on adding support for multiple semaphores on the V3D kernel side. Then she also implemented the changes on V3DV to use this new feature. If you want more technical info, she wrote a very detailed explanation on her blog (part1 and part2).

For now the driver has two codepaths that are used depending on if the kernel supports this new feature or not. That also means that, depending on the kernel, the V3DV driver could expose a slightly different set of supported features.

More common code – Migration to the common synchronization framework

For a while, Mesa developers have been doing a great effort to refactor and move common functionality to a single place, so it can be used by all drivers, reducing the amount of code each driver needs to maintain.

During these months we have been porting V3DV to some of that infrastructure, from small bits (common VkShaderModule to NIR code), to a really big one: common synchronization framework.

As mentioned, the Vulkan synchronization model is really detailed and powerful. But that also means it is complex. V3DV support for Vulkan synchronization included heavy use of threads. For example, V3DV needed to rely on a CPU wait (polling with threads) to implement vkCmdWaitEvents, as the GPU lacked a mechanism for this.

This was common to several drivers. So at some point there were multiple versions of complex synchronization code, one per driver. But, some months ago, Jason Ekstrand refactored Anvil support and collaborated with other driver developers to create a common framework. Obviously each driver would have their own needs, but the framework provides enough hooks for that.

After some gitlab and IRC chats, Jason provided a Merge Request with the port of V3DV to this new common framework, that we iterated and tested through the review process.

Also, with this port we got timelime semaphore support for free. Thanks to this change, we got ~1.2k less total lines of code (and have more features!).

Again, we want to thank Jason Ekstrand for all his help.

Support for more extensions:

Since 1.1 got announced the following extension got implemented and exposed:

  • VK_EXT_debug_utils
  • VK_KHR_timeline_semaphore
  • VK_KHR_create_renderpass2
  • VK_EXT_4444_formats
  • VK_KHR_driver_properties
  • VK_KHR_16_bit_storage and VK_KHR_8bit_storage
  • VK_KHR_imageless_framebuffer
  • VK_KHR_depth_stencil_resolve
  • VK_EXT_image_drm_format_modifier
  • VK_EXT_line_rasterization
  • VK_EXT_inline_uniform_block
  • VK_EXT_separate_stencil_usage
  • VK_KHR_separate_depth_stencil_layouts
  • VK_KHR_pipeline_executable_properties
  • VK_KHR_shader_float_controls
  • VK_KHR_spirv_1_4

If you want more details about VK_KHR_pipeline_executable_properties, Iago wrote recently a blog post about it (here)

Android support

Android support for V3DV was added thanks to the work of Roman Stratiienko, who implemented this and submitted Mesa patches. We also want to thank the Android RPi team, and the Lineage RPi maintainer (Konsta) who also created and tested an initial version of that support, which was used as the baseline for the code that Roman submitted. I didn’t test it myself (it’s in my personal TO-DO list), but LineageOS images for the RPi4 are already available.

Performance

In addition to new functionality, we also have been working on improving performance. Most of the focus was done on the V3D shader compiler, as improvements to it would be shared among the OpenGL and Vulkan drivers.

But one of the features specific to the Vulkan driver (pending to be ported to OpenGL), is that we have implemented double buffer mode, only available if MSAA is not enabled. This mode would split the tile buffer size in half, so the driver could start processing the next tile while the current one is being stored in memory.

In theory this could improve performance by reducing tile store overhead, so it would be more benefitial when vertex/geometry shaders aren’t too expensive. However, it comes at the cost of reducing tile size, which also causes some overhead on its own.

Testing shows that this helps in some cases (i.e the Vulkan Quake ports) but hurts in others (i.e. Unreal Engine 4), so for the time being we don’t enable this by default. It can be enabled selectively by adding V3D_DEBUG=db to the environment variables. The idea for the future would be to implement a heuristic that would decide when to activate this mode.

FOSDEM 2022

If you are interested in watching an overview of the improvements and changes to the driver during the last year, we made a presention in FOSDEM 2022:
“v3dv: Status Update for Open Source Vulkan Driver for Raspberry Pi
4”

Improving v3dv pipeline caching

Iago wrote recently a blog post about performance improvements on the v3d compiler, and introduced our plans to improve the pipeline caching (specifically the compiled shaders) (full blog here). We merged some improvements recently, so let’s talk about that work.

Pipeline cache improvements

While analysing the perfomance of RBDOOM-3-BFG, we noticed that some significant CPU time was spent every frame for linking shaders.

RBDOOM-3-BFG screenshot

After some investigation, we found that the game was calling ClearAttachment twice every frame. The implementation of those ClearAttachments was relying on a full job with a graphics pipeline. On v3dv by default any pipeline is created with a pipeline cache (provided by the user, or a default pipeline). On v3dv (and in general any Vulkan driver) the main cached data are the compiled shaders, so the main objective of the pipeline cache is avoiding full shader re-compilation on compatible pipelines that are used really often. Why was that time spent on linking shaders?

The issue was that for each pipeline lookup on the pipeline cache we were doing two cache lookups. The first one against a cache with the shaders in NIR, that is the main intermediate representation for shaders in Mesa (more info about intermediate representation here). And then we used those shaders to fill up the key for a second cache lookup, that if succesful, will return the compiled shader on Broadcom (QPU) assembly format.

The reason of this two-step lookup is that to compile a shader we call the common (for both OpenGL and Vulkan) Broadcom compiler, and we use some data structures that contain info that will affect the compilation (like if blending is enabled). When we implemented the pipeline cache support, for simplicity, we used the same data structures as part of the cache key. But as such keys were filled with info coming from the NIR shaders, those needed to be linked together on the case of the graphics pipelines.

When we analyzed how to improve it, we realized that in order to identify the compiled shader, we don’t really need the NIR shaders, as the info derived from them are implicit to the SPIR-V shaders provided to create the pipeline. Those NIR shaders are only really needed to compile the shader. The improvement here was using a different data structure as part of the cache key, and replace the two-cache-lookup with a one-cache-lookup. We needed to do some additional changes, as there were parts of the code that assumed that the NIR shaders would be available, but now if possible we are skipping getting them.

Numbers

Let’s start showing the improvement with a synthetic test. We took one CTS test using a complex shader, and forced the pipeline to be re-recreated 1000 times. So we get the following times:

  • Before this work, disabling pipeline cache: 125,79 seconds
  • Before this work, enabling pipeline cache (default): 11,41 seconds
  • With this work, enabling pipeline chache: 0,87 seconds

That’s a clear improvement. So how about real applications? As mentioned we started this work after analyze RBDOOM-3-BFG use of ClearAttachment. That game got an improvement of ~1fps. But when we tested other games we didn’t get any improvement on that case. This is because using a full job for ClearAttachment isn’t the preferred option (it is in fact the slowest path), and because the other games are GPU limited.

But we got improvements on other cases. As mentioned the advantage of creating the pipeline with a hot cache is that we avoid the shader recompilation, so it is far faster. And there are some apps that create pipelines at runtime. We found that this happens with the Unreal Engine 4 demos, specifically the Shooter Game. So, for example, we start the demo like this:

UE4 demo screenshot, before shooting gun

and then we decide to shot for the first time:

UE4 demo screenshot, while shooting gun

in order to render the shooting effect, new pipelines are created, and several shaders are compiled. Due all that extra work we experiment a noticiable FPS hiccup. We can visualize it with this graph:

On that graph we can see how we go from ~25fps to ~2fps. That is the shooting moment.

For this cases, the ideal would be to start the game with a pipeline cache loaded with the outcome of previous executions of the game. So adding a hack to simulate that situation, and focusing on the relevant stat in this case, that is the minimum FPS, we get the following:

  • Cold cache: ~2fps
  • Hot cache, before this work: ~6fps
  • Hot cache, with this work: ~8fps

on-disk-cache

As mentioned, the ideal would be that the applications used a pipeline cache when creating the pipelines, and that they stored the content on disk, and loaded it on following executions. Among other things, because the application would have more control about what to store and load at each moment (like for example: store/load the pipeline cache for a given level of a game).

The reality is that not all the applications do that. As mentioned, the numbers of the previous situation was simulating that ideal situation, but the application was not doing that. In order to mitigate that, we added support for on-disk-cache. Mesa provides a framework to store and load shaders on disk, so basically for any pipeline cacke lookup, now we have an extra fallback. In this case, for the UE4 Shooter demo, for a hot on-disk-cache we get a minimum of ~7fps, that as expected, is better that the situation before our work, but worse that the ideal case of the application handling the store/load of the pipeline cache on disk.

Some conclusions

So some conclusions from this work, that applies to any Vulkan driver, not only v3dv in particular:

  • If possible, pre-create all the pipelines that you would use during your application runtime. Creating pipelines during runtime, even if cached, could lead to performance hiccups.
  • If that is not possible (for example if the variable combination is too high), use pipeline cache, and store/load them on-disk. This would reduce the loading time, and make the runtime performance hiccups less noticiable. Note that having support for general on-disk-cache doesn’t mean that v3dv will do this for you.

v3dv status update 2021-02-03

So some months have passed since our last update, when we announced that v3dv became Vulkan 1.0 conformant. The main reason for not publishing so many posts is that we saw the 1.0 checkpoint as a good moment to hold on adding new big features, and focus on improving the codebase (refactor, clean-ups, etc.) and the already existing features. For the latter we did a lot of work on performance. That alone would deserve a specific blog post, so in this one I will summarize the other stuff we did.

New features

Even if we didn’t focus on adding new features, we were still able to add some:

  • The following optional 1.0 features were enabled: logicOp, althaToOne, independentBlend, drawIndirectFirstInstance, and shaderStorageImageExtendedFormats.
  • Added support for timestamp queries.
  • Added implementation for VK_KHR_maintenance1, VK_EXT_private_data, and VK_KHR_display extensions
  • Added support for Wayland WSI.

Here I would like to highlight that we started to get feature contributions out of the initial core of developers that created the driver. VK_KHR_display was submitted by Steven Houston, and Wayland WSI support was submitted by Ella-0. Thanks a lot for it, really appreciated! We hope that this would begin a trend of having more things implemented by the rpi/mesa community as a whole.

Bugfixing and vulkan tools

Even if the driver got conformant, we were still testing the driver with several demos and applications, and provided fixes. As a example, we got Sascha Willem’s oit (Order Independent Transparency) working:

Sascha Willem’s oit demo on the rpi4

Among those applications that we were testing, we can highlight renderdoc and gfxreconstruct. The former is a frame-capture based graphics debugger and the latter is a tool that allows to capture and replay several frames. Both tools are heavily used when debugging and testing vulkan applications. We tested that they work on the rpi4 (fixing some bugs while doing it), and also started to use them to help/guide the performance work we are doing.

Fosdem 2021

If you are interested on an overview of the development of the driver during the last year, we are going to present “Overview of the Open Source Vulkan Driver for Raspberry Pi 4” on FOSDEM this weekend (presentation details here).

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31
v3dv status update 2020-09-07
Vulkan update: we’re conformant!

v3dv status update 2020-09-07

So here a new update of the evolution of the Vulkan driver for the rpi4 (broadcom GPU).

Features

Since my last update we finished the support for two features. Robust buffer access and multisampling.

Robust buffer access is a feature that allows to specify that accesses to buffers are bounds-checked against the range of the buffer descriptor. Usually this is used as a debug tool during development, but disabled on release (this is explained with more detail on this ARM guide). So sorry, no screenshot here.

On my last update I mentioned that we have started the support for multisampling, enough to get some demos working. Since then we were able to finish the rest of the mulsisampling support, and even implemented the optional feature sample rate shading. So now the following Sascha Willems’s demo is working:

Sascha Willems deferred multisampling demo run on rpi4

Bugfixing

Taking into account that most of the features towards support Vulkan core 1.0 are implemented now, a lot of the effort since the last update was done on bugfixing, focusing on the specifics of the driver. Our main reference for this is Vulkan CTS, the official Khronos testsuite for Vulkan and OpenGL.

As usual, here some screenshots from the nice Sascha Willems’s demos, showing demos that were failing when I wrote the last update, and are working now thanks of the bugfixing work.

Sascha Willems hdr demo run on rpi4

Sascha Willems gltf skinning demo run on rpi4

Next

At this point there are no full features pending to implement to fulfill the support for Vulkan core 1.0. So our focus would be on getting to pass all the Vulkan CTS tests.

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working
v3dv status update 2020-07-31

v3dv status update 2020-07-31

Iago talked recently about the work done testing and supporting well known applications, like the Vulkan ports of the Quake1, Quake 2 and Quake3. Let’s go back here to the lastest news on feature and bugfixing work.

Pipeline cache

Pipeline cache objects allow the result of pipeline construction to be reused. Usually (and specifically on our implementation) that means caching compiled shaders. Reuse can be achieved between pipelines creation during the same application run by passing the same pipeline cache object when creating multiple pipelines. Reuse across runs of an application is achieved by retrieving pipeline cache contents in one run of an application, saving the contents, and using them to preinitialize a pipeline cache on a subsequent run.

Note that it may happens that a pipeline cache would not improve the performance of an application once it starts to render. This is because application developers are encouraged to create all the pipelines in advance, to avoid any hiccup during rendering. On that situation pipeline cache would help to reduce load times. In any case, that is not always avoidable. In that case the pipeline cache would allow to reduce the hiccup, as a cache hit is far faster than a shader recompilation.

One specific detail about our implementation is that internally we keep a default pipeline cache, used if the user doesn’t provide a pipeline cache when creating a pipeline, and also to cache the custom shaders we use for internal operations. This allowed to simplify our code, discarding some custom caches that were already implemented.

Uniform/storage texel buffer

Uniform texel buffers define a tightly-packed 1-dimensional linear array of texels, with texels going through format conversion when read in a shader in the same way as they are for an image. They are mostly equivalent to OpenGL buffer texture, so you can see them as textures backed up by a VkBuffer (through a VkBufferView). With uniform texel buffers you can only do a formatted load.

Storage texel buffers are the equivalent concept, but applied to images instead of textures. Unlike uniform texel buffers, they can also be written to in the same way as for storage images.

Multisampling

Multisampling is a technique that allows to reduce aliasing artifacts on images, by by sampling pixel coverage at multiple subpixel locations and then averaging subpixel samples to produce a final color value for each pixel. We have already started working on this feature, and included some patches on the development branch, but it is still a work in progress. Having said so, it is enough to get Sascha Willems’s basic multisampling demo working:

Sascha Willems multisampling demo run on rpi4

Bugfixing

Again, in addition to work on specific features, we also spent some time fixing specific driver bugs, using failing Vulkan CTS tests as reference. So let’s take a look of some screenshots of Sascha Willem’s demos that are now working:

Sascha Willems deferred demo run on rpi4

Sascha Willems texture array demo run on rpi4

Sascha Willems Compute N-Body demo run on rpi4

Next

We plan to work on supporting the following features next:

  • Robust access
  • Multisample (finish it)

Previous updates

Just in case you missed any of the updates of the vulkan driver so far:

Vulkan raspberry pi first triangle
Vulkan update now with added source code
v3dv status update 2020-07-01
V3DV Vulkan driver update: VkQuake1-3 now working

v3dv status update 2020-07-01

About three weeks ago there was a big announcement about the update of the status of the Vulkan effort for the Raspberry Pi 4. Now the source code is public. Taking into account the interest that it got, and that now the driver is more usable, we will try to post status updates more regularly. Let’s talk about what’s happened since then.

Input Attachments

Input attachment is one of the main sub-features for Vulkan multipass, and we’ve gained support since the announcement. On Vulkan the support for multipass is more tightly supported by the API. Renderpasses can have multiple subpasses. These can have dependencies between each other, and each subpass define a subset of “attachments”. One attachment that is easy to understand is the color attachment: This is where a given subpass writes a given color. Another, input attachment, is an attachment that was updated in a previous subpass (for example, it was the color attachment on such previous subpass), and you get as a input on following subpasses. From the shader POV, you interact with it as a texture, with some restrictions. One important restriction is that you can only read the input attachment at the current pixel location. The main reason for this restriction is because on tile-based GPUs (like rpi4) all primitives are batched on tiles and fragment processing is rendered one tile at a time. In general, if you can live with those restrictions, Vulkan multipass and input attachment will provide better performance than traditional multipass solutions.

If you are interested in reading more details on this, you can check out ARM’s very nice presentation “Vulkan Multipass mobile deferred done right”, or Sascha Willems’ post “Vulkan input attachments and sub passes”. The latter also includes information about how to use them and code snippets of one of his demos. For reference, this is how the input attachment demos looks on the rpi4:

Sascha Willems inputattachment demos run on rpi4

Compute Shader

Given that this was one of the most requested features after the last update, we expect that this will be likely be the most popular news from this post: Compute shaders are now supported.

Compute shaders give applications the ability to perform non-graphics related tasks on the GPU, outside the normal rendering pipeline. For example they don’t have vertices as input, or fragments as output. They can still be used for massivelly parallel GPGPU algorithms. For example, this demo from Sascha Willems uses a compute shader to simulate cloth:

Sascha Willems Compute Cloth demos run on rpi4

Storage Image

Storage Image is another recent addition. It is a descriptor type that represents an image view, and supports unfiltered loads, stores, and atomics in a shader. It is really similar in most other ways to the well-known OpenGL concept of texture. They are really common with compute shaders. Compute shaders will not render (they can’t) directly any image, and it is likely that if they need an image, they will update it. In fact the two Sascha Willem demos using storage images also require compute shader support:

Sascha Willems compute shader demos run on rpi4

Sascha Willems compute raytracing demo run on rpi4

Performance

Right now our main focus for the driver is working on features, targetting a compliant Vulkan 1.0 driver. Having said so, now that we both support a good range of features and can run non-basic applications, we have devoted some time to analyze if there were clear points where we could improve the performance. Among these we implemented:
1. A buffer object (BO) cache: internally we are allocating and freeing really often buffer objects for basically the same tasks, so there are a constant need of buffers of the same size. Such allocation/free require a DRM call, so we implemented a BO cache (based on the existing for the OpenGL driver) so freed BOs would be added to a cache, and reused if a new BO is allocated with the same size.
2. New code paths for buffer to image copies.

Bugfixing!!

In addition to work on specific features, we also spent some time fixing specific driver bugs, using failing Vulkan CTS tests as reference. Thanks to that work, the Sascha Willems’ radial blur demo is now properly rendering, even though we didn’t focus specifically on working on that demo:

Sascha Willems radial blur demo run on rpi4

Next?

Now that the driver supports a good range of features and we are able to test more applications and run more Vulkan CTS Tests with all the needed features implemented, we plan to focus some efforts towards bugfixing for a while.

We also plan to start to work on implementing the support for Pipeline Cache, which allows the result of pipeline construction to be reused between pipelines and between runs of an application.

v3dv: quick guide to build and run some demos

Just today it has published a status update of the Vulkan effort for the Raspberry Pi 4, including that we are moving the development of the driver to an open repository. As it is really likely that some people would be interested on testing it, even if it is not complete at all, here you can find a quick guide to compile it, and get some demos running.

Dependencies

So let’s start installing some dependencies. My personal recipe, that I use every time I configure a new machine to work on mesa is the following one (sorry if some extra unneeded dependencies slipped):

sudo apt-get install libxcb-randr0-dev libxrandr-dev \
        libxcb-xinerama0-dev libxinerama-dev libxcursor-dev \
        libxcb-cursor-dev libxkbcommon-dev xutils-dev \
        xutils-dev libpthread-stubs0-dev libpciaccess-dev \
        libffi-dev x11proto-xext-dev libxcb1-dev libxcb-*dev \
        bison flex libssl-dev libgnutls28-dev x11proto-dri2-dev \
        x11proto-dri3-dev libx11-dev libxcb-glx0-dev \
        libx11-xcb-dev libxext-dev libxdamage-dev libxfixes-dev \
        libva-dev x11proto-randr-dev x11proto-present-dev \
        libclc-dev libelf-dev git build-essential mesa-utils \
        libvulkan-dev ninja-build libvulkan1 python-mako \
        libdrm-dev libxshmfence-dev libxxf86vm-dev \
        python3-mako

Most Raspian libraries are recent enough, but they have been updating some of then during the past months, so just in case, don’t forget to update:

$ sudo apt-get update
$ sudo apt-get upgrade

Additionally, you woud need to install meson. Mesa has just recently bumped up the version needed for meson, so Raspbian version is not enough. There is the option to build meson from the tarball (meson-0.52.0 here), but by far, the easier way to get a recent meson version is using pip3:

$ pip3 install meson

2020-07-04 update

It seems that some people had problems if they have installed meson with apt-get on their system, as when building it would try the older meson version first. For those people, they were able to fix that doing this:

$ sudo apt-get remove meson
$ pip3 install --user meson

Download and build v3dv

This is the simpler recipe to build v3dv:

$ git clone https://gitlab.freedesktop.org/apinheiro/mesa.git mesa
$ cd mesa
$ git checkout wip/igalia/v3dv
$ meson --prefix /home/pi/local-install --libdir lib -Dplatforms=x11,drm -Dvulkan-drivers=broadcom -Ddri-drivers= -Dgallium-drivers=v3d,kmsro,vc4 -Dbuildtype=debug _build
$ ninja -C _build
$ ninja -C _build install

2020-11-25 update

Now v3dv is merged on Mesa upstream, so in order to clone the repository now you just need to do this:

$ git clone https://gitlab.freedesktop.org/mesa/mesa.git

This builds and install a debug version of v3dv on a local directory. You could set a release build, or any other directory. The recipe is also building the OpenGL driver, just in case anyone want to compare, but if you are only interested on the vulkan driver, that is not mandatory.

Run some Vulkan demos

Now, the easiest way to ensure that a vulkan program founds the drivers is setting the following envvar:

export VK_ICD_FILENAMES=/home/pi/local-install/share/vulkan/icd.d/broadcom_icd.armv7l.json

That envvar is used by the Vulkan loader (installed as one of the dependencies listed before) to know which library load. This also means that you don’t need to use LD_PRELOAD, LD_LIBRARY_PATH or similar

So what Vulkan programs are working? For example several of the Sascha Willem Vulkan demos. To make things easier to everybody, here another quick recipe of how to get them build:

$ sudo apt-get install libassimp-dev
$ git clone --recursive https://github.com/SaschaWillems/Vulkan.git  sascha-willems
$ cd sascha-willems
$ mkdir build; cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug  ..
$ make

Update 2020-08-03: When the post was originally written, some demos didn’t need to ask for extra assets. Recently the fonts were moved there, so you would need to gather the assests always:

$ cd ..
$ python3 download_assets.py

So in order to see a really familiar demo:

$ cd build/bin
$ ./gears

And one slightly more complex:

$./scenerendering

As mentioned, not all the demos works. But a list of some that we tested and seem to work:
* distancefieldfonts
* descriptorsets
* dynamicuniformbuffer
* gears
* gltfscene
* imgui
* indirectdraw
* occlusionquery
* parallaxmapping
* pbrbasic
* pbribl
* pbrtexture
* pushconstants
* scenerendering
* shadowmapping
* shadowmappingcascade
* specializationconstants
* sphericalenvmapping
* stencilbuffer
* textoverlay
* texture
* texture3d
* texturecubemap
* triangle
* vulkanscene

Update : rpiMike on the comments, and some people privately, have pointed some errors on the post. Thanks! And sorry for the inconvenience.

Update 2 : Mike Hooper pointed more issues on gitlab

ARB_gl_spirv and ARB_spirv_extension support for i965 landed Mesa master

And something more visible thanks to that: now the Intel Mesa driver exposes OpenGL 4.6 support, the most recent version of OpenGL.

As perhaps you could recall, the i965 Intel driver became 4.6 conformant last year. You have more details about that, and what being conformant means in this Iago blog post. On that blog post Iago mentioned that it was passing with an early version of the ARB_gl_spirv support, that we were improving and interating during this time so it could be included on Mesa master. At the same time, the CTS tests were only testing the specifics of the extensions, and we wanted a more detailed testing, so we also were adding more tests on the piglit test suite, written manually for ARB_gl_spirv or translated from existing GLSL tests.

Why did it take so long?

Perhaps some would wonder why it took so much time. There were several reasons, but the main one, was related to the need to add a lot of code related to linking on NIR. On a previous blog post ( Introducing Mesa intermediate representations on Intel drivers with a practical example) I mentioned that there were several intermediate languages on Mesa.

So, for the case of the Intel driver, for GLSL, we had a chain like this:

GLSL -> AST -> Mesa IR -> NIR -> Intel backend IR

Now ARB_gl_spirv introduces the possibility to use SPIR-V instead of GLSL. Thanks to the Vulkan support on Mesa, there is a SPIR-V to NIR pass, so the chain for that case would be something like this:

SPIR-V -> NIR -> Intel backend IR

So, at first sight, this seems like it should be more simple, as there is one intermediate step less. But there is a problem. On Vulkan there is no need of a really complex shader linking. Basically gathering some info from the NIR shader. But OpenGL requires more. Even though the extension doesn’t required error validation, the spec points that queries and other introspection utilities should work naturally, as long as they don’t require names (names are considered debug info on SPIR-V). So for example, on Vulkan you can’t ask the shader about how big an SSBO is in order to allocate the space needed for it. It is assumed that you know that. But in OpenGL you can, and as you could do that using just the SSBO binding, ARB_gl_spirv requires that you still support that.

And here resided the main issue. On Mesa most of the linking is done at the Mesa IR level, with some help when doing the AST to Mesa IR pass. So the new intermediate language chain lacked it.

The first approach was trying to convert just the needed NIR stuff back to Mesa IR and then call the Mesa IR linker, but that was working only on some limited cases. Additionally, for this extension the linker rules change significantly. As mentioned, on SPIR-V names are optional. So everything needs to work without names. In fact, the current support just ignores names, for simplicity, and to ensure that everything works without it. So we ended up writing a new linker, based on NIR, and based on ARB_gl_spirv needs.

The other main reason for the delay was the significant changes on the SPIR-V to NIR pass, and NIR in general, to support SSBO/UBO (derefs were added), and also the native support of transform feedback, as Vulkan added a new extension, that we wanted to use. Here I would like to thank Jason Ekstrand for his support and patience ensuring that all those changes were compatible with our ARB_gl_spirv work.

So how about removing Mesa IR linking?

So, now that it is done, perhaps some people would wonder if this work could be used to remove Mesa IR on the GLSL intermediate language chain. Specially taking into account that there were already some NIR based linking utilities. My opinion is that no 😉

There are several reasons. The main one is that as mentioned ARB_gl_spirv linking rules are different, specifically on the lack of names. GLSL rules are heavily based on names. During the review it was suggested that need could be abstracted somehow, and reuse code. But, that doesn’t solve the issue that current Mesa IR supports the linkage of several different versions of GLSL, and more important, the validation and error checking done there, that is needed for GLSL, but not for ARB_gl_spirv (so it is not done). And that is a really big amount of work needed. In my opinion, redoing that work on NIR would not bring a significant advantage. So the current NIR linker would be just the implementation of a linker focused on the ARB_gl_spirv new rules, and would just share utility NIR based linking methods, or specific subfeatures, as far as possible (like the mentioned transform feedback support).

Final words

If you are interested on more details about the work done to implement ARB_gl_spirv/ARB_spirv_extensions support, you can check the presentation I gave on FOSDEM 2018 (slides, video) and the update I gave the same year on XDC (slides, video).

And finally I would like to thank all the people involved. First, thanks to Nicolai Hähnle for starting the work, that we used as basis.

The Igalia team that worked on it at some point were: Eduardo Lima, Alejandro Piñeiro, Antía Puentes, Neil Roberts, some help from Iago Toral and Samuel Iglesias at the beginning and finally thanks to Arcady Goldmints-Orlov, that dealt with handling the review feedback for the two ~80 patches MR (Mesa and Piglit test suite MR) that I created, when I became needed elsewhere.

And thanks a lot to the all the the reviewers, specially Timothy Arceri, Jason Ekstrand and Caio Marcelo.

Finally, big thanks to Intel for sponsoring our work on Mesa, Piglit, and CTS, and also to Igalia, for having me working on this wonderful project.

Bringing VK_KHR_16bit_storage to Intel GPUs

Just yesterday, Vulkan 1.0.54 was released. Among other things, it includes the specification for a new extension, VK_KHR_16bit_storage. And just yesterday, we sent to mesa-dev the implementation of this extension for Intel gen8+ GPUs, that is the outcome of the effort from the igalians José María Casanova, Andrés Gómez, Eduardo Lima, and myself.

In short, this extension allows the use of 16-bit types (half floats, 16-bit ints, and 16-bit uints) in shader input and output interfaces, push constant blocks, and buffers (shader storage buffer objects). The only operation that you can do with those 16-bit variables in the shader are 16-bit to 32-bit, and 32-bit to 16-bit conversions. So no arithmetic (adds, muls, etc) operations for now. The value of this feature at this point is to reduce the memory bandwidth when feeding and getting the data from a shader. It will also be the basis for future extensions defining 16-bit arithmetic operations.

Taking into account that the series is still in the review process, I will not go too deep into the technical details of the implementation. In general, most of the changes were related with the assumption that all we had was 32 or 64 bit types, so we just needed to update some conditions to take into account 16-bit types supported by the HW. In any case, I think that I can list three issues that required some extra work from our side:

  • One of the subfeatures we needed to support is being able to define 16-bit input vertex attributes. A really good reading about how this is implemented and supported on Intel HW is Ben Widawsky’s post “GEN Graphics And The URB”. This post explains in detail how this is done for 32-bit vertex inputs. We used this post as another source of documentation when we implemented the support for 64-bit vertex attributes last year (I briefly mentioned it on my previous blog post). In the case of 64-bit, when feeding the shader with the data, you can configure how the 64-bit data is passed to the shader. There is a surface format that do an implicit conversion to 32-bit, and another that pass it without any conversion (PASSTHRU format). You use one or the other depending on the type of your variable at the shader. But, for the case of 16-bit, there is just one surface format. And as the section FormatConversion at the reference manual points, this surface format do an implicit 16-bit to 32-bit conversion. In order to workaround it, we needed to change the surface format on the fly, using a 32-bit format one, and then reorder the data when it arrived to the shader.

  • Most of the surface read/writes used on intel driver are untyped surface readwrite message. Unfourtunately, those are 32-bit width messages. So we needed to implement the support, and use, a different kind of message, byte scattered read/write messages. The reference already warns that it is likely that it would be better to use a different message (for performance reasons). In any case, using this message is only really needed when using variable of one and three components. Eduardo already have a patch that uses 32-bit untyped read/write messages when possible.

  • For a render target write message (so for example, the output of a fragment shader), we enabled the 16-bit payload using the data format bit (Data Format on the Message Descriptor Definition of Send Messages). But this bit is not available on Broadwell, and doesn’t support unsigned ints on Cherryview/Braswell. So for those cases as workaround we needed to use the 32-bit payload, doing an extra conversion from 16-bit to 32-bits before the HW deals with the surface format conversion when writing 32-bit values to a 16-bit format surface.

So the next steps now is getting it reviewed, update the patchs accordingly and land it on master. In parallel we are working on optimizations and other improvements we listed while we were working on the extension (as the already mentioned Eduardo’s patch).

Finally, I would like to thanks Intel for sponsoring this work and for their support. Also, thanks to Iago Toral and Samuel Iglesias for sharing with us their experience while developing the 64-bit support on both OpenGL and Vulkan that helped us to implement this extension.