Iago wrote recently a blog post about performance improvements on the v3d compiler, and introduced our plans to improve the pipeline caching (specifically the compiled shaders) (full blog here). We merged some improvements recently, so let’s talk about that work.

Pipeline cache improvements

While analysing the perfomance of RBDOOM-3-BFG, we noticed that some significant CPU time was spent every frame for linking shaders.

After some investigation, we found that the game was calling ClearAttachment twice every frame. The implementation of those ClearAttachments was relying on a full job with a graphics pipeline. On v3dv by default any pipeline is created with a pipeline cache (provided by the user, or a default pipeline). On v3dv (and in general any Vulkan driver) the main cached data are the compiled shaders, so the main objective of the pipeline cache is avoiding full shader re-compilation on compatible pipelines that are used really often. Why was that time spent on linking shaders?

The issue was that for each pipeline lookup on the pipeline cache we were doing two cache lookups. The first one against a cache with the shaders in NIR, that is the main intermediate representation for shaders in Mesa (more info about intermediate representation here). And then we used those shaders to fill up the key for a second cache lookup, that if succesful, will return the compiled shader on Broadcom (QPU) assembly format.

The reason of this two-step lookup is that to compile a shader we call the common (for both OpenGL and Vulkan) Broadcom compiler, and we use some data structures that contain info that will affect the compilation (like if blending is enabled). When we implemented the pipeline cache support, for simplicity, we used the same data structures as part of the cache key. But as such keys were filled with info coming from the NIR shaders, those needed to be linked together on the case of the graphics pipelines.

When we analyzed how to improve it, we realized that in order to identify the compiled shader, we don’t really need the NIR shaders, as the info derived from them are implicit to the SPIR-V shaders provided to create the pipeline. Those NIR shaders are only really needed to compile the shader. The improvement here was using a different data structure as part of the cache key, and replace the two-cache-lookup with a one-cache-lookup. We needed to do some additional changes, as there were parts of the code that assumed that the NIR shaders would be available, but now if possible we are skipping getting them.

Numbers

Let’s start showing the improvement with a synthetic test. We took one CTS test using a complex shader, and forced the pipeline to be re-recreated 1000 times. So we get the following times:

Before this work, disabling pipeline cache: 125,79 seconds
Before this work, enabling pipeline cache (default): 11,41 seconds
With this work, enabling pipeline chache: 0,87 seconds

That’s a clear improvement. So how about real applications? As mentioned we started this work after analyze RBDOOM-3-BFG use of ClearAttachment. That game got an improvement of ~1fps. But when we tested other games we didn’t get any improvement on that case. This is because using a full job for ClearAttachment isn’t the preferred option (it is in fact the slowest path), and because the other games are GPU limited.

But we got improvements on other cases. As mentioned the advantage of creating the pipeline with a hot cache is that we avoid the shader recompilation, so it is far faster. And there are some apps that create pipelines at runtime. We found that this happens with the Unreal Engine 4 demos, specifically the Shooter Game. So, for example, we start the demo like this:

and then we decide to shot for the first time:

in order to render the shooting effect, new pipelines are created, and several shaders are compiled. Due all that extra work we experiment a noticiable FPS hiccup. We can visualize it with this graph:

On that graph we can see how we go from ~25fps to ~2fps. That is the shooting moment.

For this cases, the ideal would be to start the game with a pipeline cache loaded with the outcome of previous executions of the game. So adding a hack to simulate that situation, and focusing on the relevant stat in this case, that is the minimum FPS, we get the following:

Cold cache: ~2fps
Hot cache, before this work: ~6fps
Hot cache, with this work: ~8fps

on-disk-cache

As mentioned, the ideal would be that the applications used a pipeline cache when creating the pipelines, and that they stored the content on disk, and loaded it on following executions. Among other things, because the application would have more control about what to store and load at each moment (like for example: store/load the pipeline cache for a given level of a game).

The reality is that not all the applications do that. As mentioned, the numbers of the previous situation was simulating that ideal situation, but the application was not doing that. In order to mitigate that, we added support for on-disk-cache. Mesa provides a framework to store and load shaders on disk, so basically for any pipeline cacke lookup, now we have an extra fallback. In this case, for the UE4 Shooter demo, for a hot on-disk-cache we get a minimum of ~7fps, that as expected, is better that the situation before our work, but worse that the ideal case of the application handling the store/load of the pipeline cache on disk.

Some conclusions

So some conclusions from this work, that applies to any Vulkan driver, not only v3dv in particular:

If possible, pre-create all the pipelines that you would use during your application runtime. Creating pipelines during runtime, even if cached, could lead to performance hiccups.
If that is not possible (for example if the variable combination is too high), use pipeline cache, and store/load them on-disk. This would reduce the loading time, and make the runtime performance hiccups less noticiable. Note that having support for general on-disk-cache doesn’t mean that v3dv will do this for you.

Improving v3dv pipeline caching

Pipeline cache improvements

Numbers

on-disk-cache

Some conclusions

Leave a Reply Cancel reply