We have been making good progress so far and at this point we are getting close to having a complete Vulkan 1.0 implementation. I believe the main pending features for that are pipeline caches, which Alejandro is currently working on, texel buffers, multisampling support and robust buffer access, so in the last few weeks I decided to take a break from feature development and try to get some Vulkan games running with our driver and use them to guide some inital performance work.
I decided to work with all 3 VkQuake games since they run on Linux, the source code is available (which makes things a lot easier to debug) and seemed to be using a subset of the Vulkan API we already supported. For vkQuake we needed compute shaders and input attachments that we implemented recently, and for vkQuake3 we needed a couple of optional Vulkan features which I implemented recently to get it running without having to modify the game code. So all these games are now running on the Raspberry Pi4 with the V3DV driver. At the same time, our friend Salva from Pi Labs has also been testing the PPSSPP emulator using Vulkan and reporting that some games seem to be working already, which has been great to hear.
I was particularly interested in getting vkQuake3 to work because the project includes both the Vulkan and the original OpenGL renderers, which was great to compare performance between both APIs. VkQuake3 comes with a GL1 and a GL2 renderer, with the GL1 render being the fastest of the two by a large margin (apparently the GL2 renderer has additional rendering features that make it much slower). I think the Vulkan renderer is based on the GL1 renderer (although I have not actually checked) so I figured it would make the most reasonable comparison, and in our tests we found the Vulkan version to be up to 60% faster. Of course, it could be argued that GL1 is a pretty old API and that the difference with a more modern GL or GLES renderer might be less significant, but it is still a good sign.
To finish the post, here are some pics of the games:
vkQuake3 OpenGL 1 renderer
vkQuake3 OpenGL 1 renderer
vkQuake3 Vulkan renderer
vkQuake3 Vulkan renderer
A couple of final notes:
* Note that the Vulkan renderer for vkQuake3 is much darker, but that is just how the renderer operates and not a driver issue, we observed the same behavior on Intel GPUs.
* A note for those interested in trying vkQuake3, we noticed that exterior levels have broken sky rendering, I hope we will get to fix that soon.
So continuing with the news, here is a fairly recent one: as the tile states, I am happy to announce that the Raspberry Pi 4 is now an OpenGL ES 3.1 conformant product!. This means that the Mesa V3D driver has successfully passed a whole lot of tests designed to validate the OpenGL ES 3.1 feature set, which should be a good sign of driver quality and correctness.
It should be noted that the Raspberry Pi 4 shipped with a V3D driver exposing OpenGL ES 3.0, so this also means that on top of all the bugfixes that we implemented for conformance, the driver has also gained new functionality! Particularly, we merged Eric’s previous work to enable Compute Shaders.
All this work has been in Mesa master since December (I believe there is only one fix missing waiting for us to address review feedback), and will hopefully make it to Raspberry Pi 4 users soon.
I actually landed this in Mesa back in December but never got to announce it anywhere. The implementation passes all the tests available in the Khronos Conformance Tests Suite (CTS). If you give this a try and find any bugs, please report them here with the V3D tag.
This is also the first large feature I land in V3D! Hopefully there will be more coming in the future.
Yeah… this blog post is well overdue, but better late than never! So yes, I am currently working on progressing the Raspberry Pi 4 Mesa driver stack, together with my Igalian colleagues Piñeiro and Chema, continuing the fantastic work started by Eric Anholt on the Mesa V3D driver.
The Raspberry Pi 4 sports a Video Core VI GPU that is capable of OpenGL ES 3.2, so it is a big update from the Raspberry Pi 3, which could only do OpenGL ES 2.0. Another big change with the Raspberry Pi 4 is that the Mesa v3d driver is the driver used by default with Raspbian. Because both GPUs are quite different, Eric had to write an all new driver for the Raspberry Pi 4, and that is why there are two drivers in Mesa: the VC4 driver is for the Raspberry Pi 3, while the V3D driver targets the Raspberry Pi 4.
As for what we have been working on exactly, I wrote a long post on the Raspberry Pi blog some months ago with a lot of the details, but for those looking for the quick summary:
Shader compiler optimizations.
Significant Transform Feedback fixes and improvements.
Implemented OpenGL Logic Operations.
A bunch of bugfixes for Piglit test failures.
Set up a Continuous Integration system to identify regressions.
Rebased and merge Eric’s work on Compute Shaders.
Many bug fixes targeting the Khronos OpenGL ES Conformance Test Suite (CTS).
So that’s it for the late news. I hope to do a better job keeping this blog updated with the news this year, and to start with that I will be writing a couple of additional posts to highlight a few significant development milestones we achieved recently, so stay tuned for more!
The last time I talked about my driver work was to announce the implementation of the shaderInt16 feature for the Anvil Vulkan driver back in May, and since then I have been working on VK_KHR_shader_float16_int8, a new Vulkan extension recently announced by the Khronos group, for which I have just posted initial patches in mesa-dev supporting Broadwell and later Intel platforms.
As you probably guessed by the name, this extension enables Vulkan to consume SPIR-V shaders that use of Float16 and Int8 types in arithmetic operations, extending the functionality included with VK_KHR_16bit_storage and VK_KHR_8bit_storage, which was limited to load/store operations. In theory, applications that do not need the range and precision of regular 32-bit floating point and integers, can use these new types to improve performance by increasing ALU throughput and reducing register pressure, which in some platforms can also lead to improved parallelism.
In the case of the Intel platforms initial testing done by Intel suggests that better ALU throughput is expected when issuing half-float instructions. Lower register pressure is also expected, at least for SIMD16 fragment and compute shaders, where we can pack all 16-channels worth of half-float data into a single GPU register, which could significantly improve performance for shaders that would otherwise need to spill registers to memory.
Another neat thing is that while VK_KHR_shader_float16_int8 is a Vulkan extension, its implementation is mostly API agnostic, so most of the work we did here should also help us have a proper mediump implementation for GLSL ES shaders in the future.
There are a few caveats to consider as well though: on some hardware platforms smaller bit-sizes have certain hardware restrictions that may lead to emitting worse shader code in some scenarios, and generally, Mesa’s compiler infrastructure (and the Intel compiler backend in particular) have a long history of being 32-bit only, so there are parts of the compiler stack that still work better for 32-bit code.
Because VK_KHR_shader_float16_int8 is a brand new feature, we don’t really have any real world use cases yet. This is on top of the fact that Mesa’s compiler backends have been mostly (or exclusively) 32-bit aware until now (and more recently 64-bit too), so going forward I would expect a lot of focus on making our compiler be as robust (and optimal) for 16-bit code as it is for 32-bit code.
While we are already aware of a few areas where we can do better and I am currently working on addressing a few of these, one of the major limiting factors we have at the moment is the fact that the only source of 16-bit shaders available to us is the Khronos CTS, which due to its particular motivation, is very different from real world shader workloads and it is not a valid source material to drive compiler optimization work. Unfortunately, it might take some time until we start seeing applications using these new features, so in the meantime we will need to find other ways to drive further work in this area, and I think our best option here might be GLSL ES’s mediump and lowp qualifiers.
GLSL ES mediump and lowp qualifiers have been around for a long time but they are only defined as hints to the shader compiler that lower precision is acceptable and we have never really used them to emit half-float code. Thankfully, Topi Pohjolainen from Intel has been working on this for a while, which would open up a much better scenario for improving our 16-bit compiler paths, so this is something I am really looking forward to.
Finally, as I say above, we could could definitely use more testing and feedback from real world use cases, so if you decide to use this feature in your next project and you hit any bugs, please be sure to file them in Bugzilla so we can continue to improve our implementation.
The Vulkan specification includes a number of optional features that drivers may or may not support, as described in chapter 30.1 Features. Application developers can query the driver for supported features via vkGetPhysicalDeviceFeatures() and then activate the subset they need in the pEnabledFeatures field of the VkDeviceCreateInfo structure passed at device creation time.
In the last few weeks I have been spending some time, together with my colleague Chema, adding support for one of these features in Anvil, the Intel Vulkan driver in Mesa, called shaderInt16, which we landed in Mesa master last week. This is an optional feature available since Vulkan 1.0. From the spec:
shaderInt16 specifies whether 16-bit integers (signed and unsigned) are supported in shader code. If this feature is not enabled, 16-bit integer types must not be used in shader code. This also specifies whether shader modules can declare the Int16 capability.
It is probably relevant to highlight that this Vulkan capability also requires the SPIR-V Int16 capability, which basically means that the driver’s SPIR-V compiler backend can actually digest SPIR-V shaders that declare and use 16-bit integers, and which is really the core of the functionality exposed by the Vulkan feature.
Ideally, shaderInt16 would increase the overall throughput of integer operations in shaders, leading to better performance when you don’t need a full 32-bit range. It may also provide better overall register usage since you need less register space to store your integer data during shader execution. It is important to remark, however, that not all hardware platforms (Intel or otherwise) may have native support for all possible types of 16-bit operations, and thus, some of them might still need to run in 32-bit (which requires injecting type conversion instructions in the shader code). For Intel platforms, this is the case for operations associated with integer division.
From the point of view of the driver, this is the first time that we generally exercise lower bit-size data types in the driver compiler backend, so if you find any bugs in the implementation, please file bug reports in bugzilla!
Speaking of shaderInt16, I think it is worth mentioning its interactions with other Vulkan functionality that we implemented in the past: the Vulkan 1.0 VK_KHR_16bit_storage extension (which has been promoted to core in Vulkan 1.1). From the spec:
The VK_KHR_16bit_storage extension allows use of 16-bit types in shader input and output interfaces, and push constant blocks. This extension introduces several new optional features which map to SPIR-V capabilities and allow access to 16-bit data in Block-decorated objects in the Uniform and the StorageBuffer storage classes, and objects in the PushConstant storage class. This extension allows 16-bit variables to be declared and used as user-defined shader inputs and outputs but does not change location assignment and component assignment rules.
While the shaderInt16 capability provides the means to operate with 16-bit integers inside a shader, the VK_KHR_16bit_storage extension provides developers with the means to also feed shaders with 16-bit integer (and also floating point) input data, such as Uniform/Storage Buffer Objects or Push Constants, from the applications side, plus, it also gives the opportunity for linked shader stages in a graphics pipeline to consume 16-bit shader inputs and produce 16-bit shader outputs.
VK_KHR_16bit_storage and shaderInt16 should be seen as two parts of a whole, each one addressing one part of a larger problem: VK_KHR_16bit_storage can help reduce memory bandwith for Uniform and Storage Buffer data accessed from the shaders and / or optimize Push Constant space, of which there are only a few bytes available, making it a precious shader resource, but without shaderInt16, shaders that are fed 16-bit input data are still required to convert this data to 32-bit internally for operation (and then back again to 16-bit for output if needed). Likewise, shaders that use shaderInt16 without VK_KHR_16bit_storage can only operate with 16-bit data that is generated inside the shader, which largely limits its usage. Both together, however, give you the complete functionality.
We are very happy to continue expanding the feature set supported in Anvil and we look forward to seeing application developers making good use of shaderInt16 in Vulkan to improve shader performance. As noted above, this is the first time that we fully enable the shader compiler backend to do general purpose operations on lower bit-size data types and there might be things that we can still improve or optimize. If you hit any issues with the implementation, please contact us and / or file bug reports so we can continue to improve the implementation.
For some time now I have been working on a personal project to render the well known Sponza model provided by Crytek using Vulkan. Here is a picture of the current (still a work-in-progress) result:
This screenshot was captured on my Intel Kabylake laptop, running on the Intel Mesa Vulkan driver (Anvil).
The following list includes the main features implemented in the demo:
Forward and deferred rendering paths
Shadow mapping with Percentage-Closer Filtering
Screen Space Ambient Occlusion (only on the deferred path)
Screen Space Reflections (only on the deferred path)
I have been thinking about writing post about this for some time, but given that there are multiple features involved I wasn’t sure how to scope it. Eventually I decided to write a “frame analysis” post where I describe, step by step, all the render passes involved in the production of the single frame capture showed at the top of the post. I always enjoyed reading this kind of articles so I figured it would be fun to write one myself and I hope others find it informative, if not entertaining.
To avoid making the post too dense I won’t go into too much detail while describing each render pass, so don’t expect me to go into the nitty-gritty of how I implemented Screen Space Ambient Occlussion for example. Instead I intend to give a high-level overview of how the various features implemented in the demo work together to create the final result. I will provide screenshots so that readers can appreciate the outputs of each step and verify how detail and quality build up over time as we include more features in the pipeline. Those who are more interested in the programming details of particular features can always have a look at the Vulkan source code (link available at the bottom of the article), look for specific tutorials available on the Internet or wait for me to write feature-specifc posts (I don’t make any promises though!).
If you’re interested in going through with this then grab a cup of coffe and get ready, it is going to be a long ride!
Step 0: Culling
This is the only step in this discussion that runs on the CPU, and while optional from the point of view of the result (it doesn’t affect the actual result of the rendering), it is relevant from a performance point of view. Prior to rendering anything, in every frame, we usually want to cull meshes that are not visible to the camera. This can greatly help performance, even on a relatively simple scene such as this. This is of course more noticeable when the camera is looking in a direction in which a significant amount of geometry is not visible to it, but in general, there are always parts of the scene that are not visible to the camera, so culling is usually going to give you a performance bonus.
In large, complex scenes with tons of objects we probably want to use more sophisticated culling methods such as Quadtrees, but in this case, since the number of meshes is not too high (the Sponza model is slightly shy of 400 meshes), we just go though all of them and cull them individually against the camera’s frustum, which determines the area of the 3D space that is visible to the camera.
The way culling works is simple: for each mesh we compute an axis-aligned bounding box and we test that box for intersection with the camera’s frustum. If we can determine that the box never intersects, then the mesh enclosed within it is not visible and we flag it as such. Later on, at rendering time (or rather, at command recording time, since the demo has been written in Vulkan) we just skip the meshes that have been flagged.
The algorithm is not perfect, since it is possible that an axis-aligned bounding box for a particular mesh is visible to the camera and yet no part of the mesh itself is visible, but it should not affect a lot of meshes and trying to improve this would incur in additional checks that could undermine the efficiency of the process anyway.
Since in this particular demo we only have static geometry we only need to run the culling pass when the camera moves around, since otherwise the list of visible meshes doesn’t change. If dynamic geometry were present, we would need to at least cull dynamic geometry on every frame even if the camera stayed static, since dynamic elements may step in (or out of) the viewing frustum at any moment.
Step 1: Depth pre-pass
This is an optional stage, but it can help performance significantly in many cases. The idea is the following: our GPU performance is usually going to be limited by the fragment shader, and very specially so as we target higher resolutions. In this context, without a depth pre-pass, we are very likely going to execute the fragment shader for fragments that will not end up in the screen because they are occluded by fragments produced by other geometry in the scene that will be rasterized to the same XY screen-space coordinates but with a smaller Z coordinate (closer to the camera). This wastes precious GPU resources.
One way to improve the situation is to sort our geometry by distance from the camera and render front to back. With this we can get fragments that are rasterized from background geometry quickly discarded by early depth tests before the fragment shader runs for them. Unfortunately, although this will certainly help (assuming we can spare the extra CPU work to keep our geometry sorted for every frame), it won’t eliminate all the instances of the problem in the general case.
Also, some times things are more complicated, as the shading cost of different pieces of geometry can be very different and we should also take this into account. For example, we can have a very large piece of geometry for which some pixels are very close to the camera while some others are very far away and that has a very expensive shader. If our renderer is doing front-to-back rendering without any other considerations it will likely render this geometry early (since parts of it are very close to the camera), which means that it will shade all or most of its very expensive fragments. However, if the renderer accounts for the relative cost of the shader execution it would probably postpone rendering it as much as possible, so by the time it actually renders it, it takes advantage of early fragment depth tests to avoid as many of its expensive fragment shader executions as possible.
Using a depth-prepass ensures that we only run our fragment shader for visible fragments, and only those, no matter the situation. The downside is that we have to execute a separate rendering pass where we render our geometry to the depth buffer so that we can identify the visible fragments. This pass is usually very fast though, since we don’t even need a fragment shader and we are only writing to a depth texture. The exception to this rule is geometry that has opacity information, such as opacity textures, in which case we need to run a cheap fragment shader to identify transparent pixels and discard them so they don’t hit the depth buffer. In the Sponza model we need to do that for the flowers or the vines on the columns for example.
The picture shows the output of the depth pre-pass. Darker colors mean smaller distance from the camera. That’s why the picture gets brighter as we move further away.
Now, the remaining passes will be able to use this information to limit their shading to fragments that, for a given XY screen-space position, match exactly the Z value stored in the depth buffer, effectively selecting only the fragments that will be visible in the screen. We do this by configuring the depth test to do an EQUAL test instead of the usual LESS test, which is what we use in the depth-prepass.
In this particular demo, running on my Intel GPU, the depth pre-pass is by far the cheapest of all the GPU passes and it definitely pays off in terms of overall performance output.
Step 2: Shadow map
In this demo we have single source of light produced by a directional light that simulates the sun. You can probably guess the direction of the light by checking out the picture at the top of this post and looking at the direction projected shadows.
I already covered how shadow mapping works in previous series of posts, so if you’re interested in the programming details I encourage you to read that. Anyway, the basic idea is that we want to capture the scene from the point of view of the light source (to be more precise, we want to capture the objects in the scene that can potentially produce shadows that are visible to our camera).
With that information, we will be able to inform out lighting pass so it can tell if a particular fragment is in the shadows (not visible from our light’s perspective) or in the light (visible from our light’s perspective) and shade it accordingly.
From a technical point of view, recording a shadow map is exactly the same as the depth-prepass: we basically do a depth-only rendering and capture the result in a depth texture. The main differences here are that we need to render from the point of view of the light instead of our camera’s and that this being a directional light, we need to use an orthographic projection and adjust it properly so we capture all relevant shadow casters around the camera.
In the image above we can see the shadow map generated for this frame. Again, the brighter the color, the further away the fragment is from the light source. The bright white area outside the atrium building represents the part of the scene that is empty and thus ends with the maximum depth, which is what we use to clear the shadow map before rendering to it.
In this case, we are using a 4096×4096 texture to store the shadow map image, much larger than our rendering target. This is because shadow mapping from directional lights needs a lot of precision to produce good results, otherwise we end up with very pixelated / blocky shadows, more artifacts and even missing shadows for small geometry. To illustrate this better here is the same rendering of the Sponza model from the top of this post, but using a 1024×1024 shadow map (floor reflections are disabled, but that is irrelevant to shadow mapping):
You can see how in the 1024×1024 version there are some missing shadows for the vines on the columns and generally blurrier shadows (when not also slightly distorted) everywhere else.
Step 3: GBuffer
In deferred rendering we capture various attributes of the fragments produced by rasterizing our geometry and write them to separate textures that we will use to inform the lighting pass later on (and possibly other passes).
What we do here is to render our geometry normally, like we did in our depth-prepass, but this time, as we explained before, we configure the depth test to only pass fragments that match the contents of the depth-buffer that we produced in the depth-prepass, so we only process fragments that we now will be visible on the screen.
Deferred rendering uses multiple render targets to capture each of these attributes to a different texture for each rasterized fragment that passes the depth test. In this particular demo our GBuffer captures:
Position of the fragment from the point of view of the light (for shadow mapping)
It is important to be very careful when defining what we store in the GBuffer: since we are rendering to multiple screen-sized textures, this pass has serious bandwidth requirements and therefore, we should use texture formats that give us the range and precision we need with the smallest pixel size requirements and avoid storing information that we can get or compute efficiently through other means. This is particularly relevant for integrated GPUs that don’t have dedicated video memory (such as my Intel GPU).
In the demo, I do lighting in view-space (that is the coordinate space used takes the camera as its origin), so I need to work with positions and vectors in this coordinate space. One of the parameters we need for lighting is surface normals, which are conveniently stored in the GBuffer, but we will also need to know the view-space position of the fragments in the screen. To avoid storing the latter in the GBuffer we take advantage of the fact that we can reconstruct the view-space position of any fragment on the screen from its depth (which is stored in the depth buffer we rendered during the depth-prepass) and the camera’s projection matrix. I might cover the process in more detail in another post, for now, what is important to remember is that we don’t need to worry about storing fragment positions in the GBuffer and that saves us some bandwidth, helping performance.
Let’s have a look at the various GBuffer textures we produce in this stage:
Here we see the normalized normal vectors for each fragment in view-space. This means they are expressed in a coordinate space in which our camera is at the origin and the positive Z direction is opposite to the camera’s view vector. Therefore, we see that surfaces pointing to the right of our camera are red (positive X), those pointing up are green (positive Y) and those pointing opposite to the camera’s view direction are blue (positive Z).
It should be mentioned that some of these surfaces use normal maps for bump mapping. These normal maps are textures that provide per-fragment normal information instead of the usual vertex normals that come with the polygon meshes. This means that instead of computing per-fragment normals as a simple interpolation of the per-vertex normals across the polygon faces, which gives us a rather flat result, we use a texture to adjust the normal for each fragment in the surface, which enables the lighting pass to render more nuanced surfaces that seem to have a lot more volume and detail than they would have otherwise.
For comparison, here is the GBuffer normal texture without bump mapping enabled. The difference in surface detail should be obvious. Just look at the lion figure at the far end or the columns and and you will immediately notice the addditional detail added with bump mapping to the surface descriptions:
To make the impact of the bump mapping more obvious, here is a different shot of the final rendering focusing on the columns of the upper floor of the atrium, with and without bump mapping:
All the extra detail in the columns is the sole result of the bump mapping technique.
Here we have the diffuse color of each fragment in the scene. This is basically how our scene would look like if we didn’t implement a lighting pass that considers how the light source interacts with the scene.
Naturally, we will use this information in the lighting pass to modulate the color output based on the light interaction with each fragment.
This is similar to the diffuse texture, but here we are storing the color (and strength) used to compute specular reflections.
Similarly to normal textures, we use specular maps to obtain per-fragment specular colors and intensities. This allows us to simulate combinations of more complex materials in the same mesh by specifying different specular properties for each fragment.
For example, if we look at the cloths that hang from the upper floor of the atrium, we see that they are mostly black, meaning that they barely produce any specular reflection, as it is to be expected from textile materials. However, we also see that these same cloths have an embroidery that has specular reflection (showing up as a light gray color), which means these details in the texture have stronger specular reflections than its surrounding textile material:
The image shows visible specular reflections in the yellow embroidery decorations of the cloth (on the bottom-left) that are not present in the textile segment (the blue region of the cloth).
Fragment positions from Light
Finally, we store fragment positions in the coordinate space of the light source so we can implement shadows in the lighting pass. This image may be less intuitive to interpret, since it is encoding space positions from the point of view of the sun rather than physical properties of the fragments. We will need to retrieve this information for each fragment during the lighting pass so that we can tell, together with the shadow map, which fragments are visible from the light source (and therefore are directly lit by the sun) and which are not (and therefore are in the shadows). Again, more detail on how that process works, step by step and including Vulkan source code in my series of posts on that topic.
Step 4: Screen Space Ambient Occlusion
With the information stored in the GBuffer we can now also run a screen-space ambient occlusion pass that we will use to improve our lighting pass later on.
The idea here, as I discussed in my lighting and shadows series, the Phong lighting model simplifies ambient lighting by making it constant across the scene. As a consequence of this, lighting in areas that are not directly lit by a light source look rather flat, as we can see in this image:
Screen-space Ambient Occlusion is a technique that gathers information about the amount of ambient light occlusion produced by nearby geometry as a way to better estimate the ambient light term of the lighting equations. We can then use that information in our lighting pass to modulate ambient light accordingly, which can greatly improve the sense of depth and volume in the scene, specially in areas that are not directly lit:
Comparing the images above should illustrate the benefits of the SSAO technique. For example, look at the folds in the blue curtains on the right side of the images, without SSAO, we barely see them because the lighting is too flat across all the pixels in the curtain. Similarly, thanks to SSAO we can create shadowed areas from ambient light alone, as we can see behind the cloths that hang from the upper floor of the atrium or behind the vines on the columns.
To produce this result, the output of the SSAO pass is a texture with ambient light intensity information that looks like this (after some blur post-processing to eliminate noise artifacts):
In that image, white tones represent strong light intensity and black tones represent low light intensity produced by occlusion from nearby geometry. In our lighting pass we will source from this texture to obtain per-fragment ambient occlusion information and modulate the ambient term accordingly, bringing the additional volume showcased in the image above to the final rendering.
Step 6: Lighting pass
Finally, we get to the lighting pass. Most of what we showcased above was preparation work for this.
The lighting pass mostly goes as I described in my lighting and shadows series, only that since we are doing deferred rendering we get our per-fragment lighting inputs by reading from the GBuffer textures instead of getting them from the vertex shader.
Basically, the process involves retrieving diffuse, ambient and specular color information from the GBuffer and use it as input for the lighting equations to produce the final color for each fragment. We also sample from the shadow map to decide which pixels are in the shadows, in which case we remove their diffuse and specular components, making them darker and producing shadows in the image as a result.
We also use the SSAO output to improve the ambient light term as described before, multipliying the ambient term of each fragment by the SSAO value we computed for it, reducing the strength of the ambient light for pixels that are surrounded by nearby geometry.
The lighting pass is also where we put bump mapping to use. Bump mapping provides more detailed information about surface normals, which the lighting pass uses to simulate more complex lighting interactions with mesh surfaces, producing significantly enhanced results, as I showcased earlier in this post.
After combining all this information, the lighting pass produces an output like this. Compare it with the GBuffer diffuse texture to see all the stuff that this pass is putting together:
Step 7: Tone mapping
After the lighting pass we run a number of post-processing passes, of which tone mapping is the first one. The idea behind tone mapping is this: normally, shader color outputs are limited to the range [0, 1], which puts a hard cap on our lighting calculations. Specifically, it means that when our light contributions to a particular pixel go beyond 1.0 in any color component, they get clamped, which can distort the resulting color in unrealistic ways, specially when this happens during intermediate lighting calculations (since the deviation from the physically correct color is then used as input to more computations, which then build on that error).
To work around this we do our lighting calculations in High Dynamic Range (HDR) which allows us to produce color values with components larger than 1.0, and then we run a tone mapping pass to re-map the result to the [0, 1] range when we are done with the lighting calculations and we are ready for display.
The nice thing about tone mapping is that it gives the developer control over how that mapping happens, allowing us to decide if we are interested in preserving more detail in the darker or brighter areas of the scene.
In this particular demo, I used HDR rendering to ramp up the intensity of the sun light beyond what I could have represented otherwise. Without tone mapping this would lead to unrealistic lighting in areas with strong light reflections, since would exceed the 1.0 per-color-component cap and lead to pure white colors as result, losing the color detail from the original textures. This effect can be observed in the following pictures if you look at the lit area of the floor. Notice how the tone-mapped picture better retains the detail of the floor texture while in the non tone-mapped version the floor seems to be over-exposed to light and large parts of it just become white as a result (shadow mapping has been disabled to better showcase the effects of tone-mapping on the floor):
Step 8: Screen Space Reflections (SSR)
The material used to render the floor is reflective, which means that we can see the reflections of the surrounding environment on it.
There are various ways to capture reflections, each with their own set of pros and cons. When I implemented my OpenGL terrain rendering demo I implemented water reflections using “Planar Reflections”, which produce very accurate results at the expense of requiring to re-render the scene with the camera facing in the same direction as the reflection. Although this can be done at a lower resolution, it is still quite expensive and cumbersome to setup (for example, you would need to run an additional culling pass), and you also need to consider that we need to do this for each planar surface you want to apply reflections on, so it doesn’t scale very well. In this demo, although it is not visible in the reference screenshot, I am capturing reflections from the floor sections of both stories of the atrium, so the Planar Reflections approach might have required me to render twice when fragments of both sections are visible (admittedly, not very often, but not impossible with the free camera).
So in this particular case I decided to experiment with a different technique that has become quite popular, despite its many shortcomings, because it is a lot faster: Screen Space Reflections.
As all screen-space techniques, the technique uses information already present in the screen to capture the reflection information, so we don’t have to render again from a different perspective. This leads to a number of limitations that can produce fairly visible artifacts, specially when there is dynamic geometry involved. Nevertheless, in my particular case I don’t have any dynamic geometry, at least not yet, so while the artifacts are there they are not quite as distracting. I won’t go into the details of the artifacts introduced with SSR here, but for those interested, here is a good discussion.
I should mention that my take on this is fairly basic and doesn’t implement relevant features such as the Hierarchical Z Buffer optimization (HZB) discussed here.
The technique has 3 steps: capturing reflections, applying roughness material properties and alpha blending:
I only implemented support for SSR in the deferred path, since like in the case of SSAO (and more generally all screen-space algorithms), deferred rendering is the best match since we are already capturing screen-space information in the GBuffer.
The first stage for this requires to have means to identify fragments that need reflection information. In our case, the floor fragments. What I did for this is to capture the reflectiveness of the material of each fragment in the screen during the GBuffer pass. This is a single floating-point component (in the 0-1 range). A value of 0 means that the material is not reflective and the SSR pass will just ignore it. A value of 1 means that the fragment is 100% reflective, so its color value will be solely the reflection color. Values in between allow us to control the strength of the reflection for each fragment with a reflective material in the scene.
One small note on the GBuffer storage: because this is a single floating-point value, we don’t necessarily need an extra attachment in the GBuffer (which would have some performance penalty), instead we can just put this in the alpha component of the diffuse color, since we were not using it (the Intel Mesa driver doesn’t support rendering to RGB textures yet, so since we are limited to RGBA we might as well put it to good use).
Besides capturing which fragments are reflective, we can also store another piece of information relevant to the reflection computations: the material’s roughness. This is another scalar value indicating how much blurring we want to apply to the resulting reflection: smooth metal-like surfaces can have very sharp reflections but with rougher materials that have not smooth surfaces we may want the reflections to look a bit blurry, to better represent these imperfections.
Besides the reflection and roughness information, to capture screen-space reflections we will need access to the output of the previous pass (tone mapping) from which we will retrieve the color information of our reflection points, the normals that we stored in the GBuffer (to compute reflection directions for each fragment in the floor sections) and the depth buffer (from the depth-prepass), so we can check for reflection collisions.
The technique goes like this: for each fragment that is reflective, we compute the direction of the reflection using its normal (from the GBuffer) and the view vector (from the camera and the fragment position). Once we have this direction, we execute a ray marching from the fragment position, in the direction of the reflection. For each point we generate, we take the screen-space X and Y coordinates and use them to retrieve the Z-buffer depth for that pixel in the scene. If the depth buffer value is smaller than our sample’s it means that we have moved past foreground geometry and we stop the process. If we got to this point, then we can do a binary search to pin-point the exact location where the collision with the foreground geometry happens, which will give us the screen-space X and Y coordinates of the reflection point. Once we have that we only need to sample the original scene (the output from the tone mapping pass) at that location to retrieve the reflection color.
As discussed earlier, the technique has numerous caveats, which we need to address in one way or another and maybe adapt to the characteristics of different scenes so we can obtain the best results in each case.
The output of this pass is a color texture where we store the reflection colors for each fragment that has a reflective material:
Naturally, the image above only shows reflection data for the pixels in the floor, since those are the only ones with a reflective material attached. It is immediately obvious that some pixels lack reflection color though, this is due to the various limitations of the screen-space technique that are discussed in the blog post I linked above.
Because the reflections will be alpha-blended with the original image, we use the reflectiveness that we stored in the GBuffer as the base for the alpha component of the reflection color as well (there are other aspects that can contribute to the alpha component too, but I won’t go into that here), so the image above, although not visible in the screenshot, has a valid alpha channel.
Considering material roughness
Once we have captured the reflection image, the next step is to apply the material roughness settings. We can accomplish this with a simple box filter based on the roughness of each fragment: the larger the roughness, the larger the box filter we apply and the blurrier the reflection we get as a result. Because we store roughness for each fragment in the GBuffer, we can have multiple reflective materials with different roughness settings if we want. In this case, we just have one material for the floor though.
Finally, we use alpha blending to incorporate the reflection onto the original image (the output from the tone mapping) ot incorporate the reflections to the final rendering:
Step 9: Anti-aliasing (FXAA)
So far we have been neglecting anti-aliasing. Because we are doing deferred rendering Multi-Sample Anti-Aliasing (MSAA) is not an option: MSAA happens at rasterization time, which in a deferred renderer occurs before our lighting pass (specifically, when we generate the GBuffer), so it cannot account for the important effects that the lighting pass has on the resulting image, and therefore, on the eventual aliasing that we need to correct. This is why deferred renderers usually do anti-aliasing via post-processing.
In this demo I have implemented a well-known anti-aliasing post-processing pass known as Fast Approximate Anti Aliasing (FXAA). The technique attempts to identify strong contrast across neighboring pixels in the image to identify edges and then smooth them out using linear filtering. Here is the final result which matches the one I included as reference at the top of this post:
The image above shows the results of the anti-aliasing pass. Compare that with the output of the SSR pass. You can see how this pass has effectively removed the jaggies observed in the cloths hanging from the upper floor for example.
Unlike MSAA, which acts on geometry edges only, FXAA works on all pixels, so it can also smooth out edges produced by shaders or textures. Whether that is something we want to do or not may depend on the scene. Here we can see this happening on the foreground column on the left, where some of the imperfections of the stone are slightly smoothed out by the FXAA pass.
Conclusions and source code
So that’s all, congratulations if you managed to read this far! In the past I have found articles that did frame analysis like this quite interesting so it’s been fun writing one myself and I only hope that this was interesting to someone else.
This demo has been implemented in Vulkan and includes a number of configurable parameters that can be used to tweak performance and quality. The work-in-progress source code is available here, but beware that I have only tested this on Intel, since that is the only hardware I have available, so you may find issues if you run this on other GPUs. If that happens, let me know in the comments and I might be able to provide fixes at some point.
For some time now I have been working on and off on a personal project with no other purpose than toying a bit with Vulkan and some rendering and shading techniques. Although I’ll probably write about that at some point, in this post I want to focus on Vulkan’s specialization constants and how they can provide a very visible performance boost when they are used properly, as I had the chance to verify while working on this project.
The concept behind specialization constants is very simple: they allow applications to set the value of a shader constant at run-time. At first sight, this might not look like much, but it can have very important implications for certain shaders. To showcase this, let’s take the following snippet from a fragment shader as a case study:
That is a snippet taken from a Screen Space Ambient Occlusion shader that I implemented in my project, a popular techinique used in a lot of games, so it represents a real case scenario. As we can see, the process involves a set of vector samples passed to the shader as a UBO that are processed for each fragment in a loop. We have made the maximum number of samples that the shader can use large enough to accomodate a high-quality scenario, but the actual number of samples used in a particular execution will be taken from a push constant uniform, so the application has the option to choose the quality / performance balance it wants to use.
While the code snippet may look trivial enough, let’s see how it interacts with the shader compiler:
The first obvious issue we find with this implementation is that it is preventing loop unrolling to happen because the actual number of samples to use is unknown at shader compile time. At most, the compiler could guess that it can’t be more than 64, but that number of iterations would still be too large for Mesa to unroll the loop in any case. If the application is configured to only use 24 or 32 samples (the value of our push constant uniform at run-time) then that number of iterations would be small enough that Mesa would unroll the loop if that number was known at shader compile time, so in that scenario we would be losing the optimization just because we are using a push constant uniform instead of a constant for the sake of flexibility.
The second issue, which might be less immediately obvious and yet is the most significant one, is the fact that if the shader compiler can tell that the size of the samples array is small enough, then it can promote the UBO array to a push constant. This means that each access to S.samples[i] turns from an expensive memory fetch to a direct register access for each sample. To put this in perspective, if we are rendering to a full HD target using 24 samples per fragment, it means that we would be saving ourselves from doing 1920x1080x24 memory reads per frame for a very visible performance gain. But again, we would be loosing this optimization because we decided to use a push constant uniform.
Vulkan’s specialization constants allow us to get back these performance optmizations without sacrificing the flexibility we implemented in the shader. To do this, the API provides mechanisms to specify the values of the constants at run-time, but before the shader is compiled.
Continuing with the shader snippet we showed above, here is how it can be rewritten to take advantage of specialization constants:
We are now informing the shader that we have a specialization constant NUM_SAMPLES, which represents the actual number of samples to use. By default (if the application doesn’t say otherwise), the specialization constant’s value is 64. However, now that we have a specialization constant in place, we can have the application set its value at run-time, like this:
The application code above sets up specialization constant information for shader consumption at run-time. This is done via an array of VkSpecializationMapEntry entries, each one determining where to fetch the constant value to use for each specialization constant declared in the shader for which we want to override its default value. In our case, we have a single specialization constant (with id 0), and we are taking its value (of integer type) from offset 0 of a buffer. In our case we only have one specialization constant, so our buffer is just the address of the variable holding the constant’s value (config.ssao.num_samples). When we create the Vulkan pipeline, we pass this specialization information using the pSpecializationInfo field of VkPipelineShaderStageCreateInfo. At that point, the driver will override the default value of the specialization constant with the value provided here before the shader code is optimized and native GPU code is generated, which allows the driver compiler backend to generate optimal code.
It is important to remark that specialization takes place when we create the pipeline, since that is the only moment at which Vulkan drivers compile shaders. This makes specialization constants particularly useful when we know the value we want to use ahead of starting the rendering loop, for example when we are applying quality settings to shaders. However, If the value of the constant changes frequently, specialization constants are not useful, since they require expensive shader re-compiles every time we want to change their value, and we want to avoid that as much as possible in our rendering loop. Nevertheless, it it is possible to compile the same shader with different constant values in different pipelines, so even if a value changes often, so long as we have a finite number of combinations, we can generate optimized pipelines for each one ahead of the start of the redendering loop and just swap pipelines as needed while rendering.
Specialization constants are a straight forward yet powerful way to gain control over how shader compilers optimize your code. In my particular pet project, applying specialization constants in a small number of shaders allowed me to benefit from loop unrolling and, most importantly, UBO promotion to push constants in the SSAO pass, obtaining performance improvements that ranged from 10% up to 20% depending on the configuration.
Finally, although the above covered specialization constants from the point of view of Vulkan, this is really a feature of the SPIR-V language, so it is also available in OpenGL with the GL_ARB_gl_spirv extension, which is core since OpenGL 4.6.
At Igalia we are very proud of being a part of this: on the driver side, we have contributed the implementation of VK_KHR_16bit_storage, numerous bugfixes for issues raised by the Khronos Conformance Test Suite (CTS) and code reviews for some of the new Vulkan 1.1 features developed by Intel. On the CTS side, we have worked with other Khronos members in reviewing and testing additions to the test suite, identifying and providing fixes for issues in the tests as well as developing new tests.
Finally, I’d like to highlight the strong industry adoption of Vulkan: as stated in the Khronos press release, various other hardware vendors have already implemented conformant Vulkan 1.1 drivers, we are also seeing major 3D engines adopting and supporting Vulkan and AAA games that have already shipped with Vulkan-powered graphics. There is no doubt that this is only the beginning and that we will be seeing a lot more of Vulkan in the coming years, so look forward to it!
Vulkan and the Vulkan logo are registered trademarks of the Khronos Group Inc.
Being conformant on day one, which the Intel Mesa Vulkan driver also obtained back in the day, is a significant achievement. Besides Intel Mesa, only NVIDIA managed to do this, which I think speaks of the amount of work and effort that one needs to put to achieve it. The days where Linux implementations lagged behind are long gone, we should all celebrate this and acknowledge the important efforts that companies like Intel have put into making this a reality.
Over the last 8-9 months or so, I have been working together with some of my Igalian friends to keep the Intel drivers (for both OpenGL and Vulkan) conformant, so I am very proud that we have reached this milestone. Kudos to all my work mates who have worked with me on this, to our friends at Intel, who have been providing reviews for our patches, feedback and additional driver fixes, and to many other members in the Mesa community who have contributed to make this possible in one way or another.
Of course, OpenGL 4.6 conformance requires that we have an implementation of GL_ARB_gl_spirv, which allows OpenGL applications to consume SPIR-V shaders. If you have been following Igalia’s work, you have already seen some of my colleagues sending patches for this over the last months, but the feature is not completely upstreamed yet. We are working hard on this, but the scope of the implementation that we want for upstream is rather ambitious, since it involves to (finally) have a full shader linker in NIR. Getting that to be as complete as the current GLSL linker and in a shape that is good enough for review and upstreaming is going to take some time, but it is surely a worthwhile effort that will pay off in the future, so please look forward to it and be patient with us as we upstream more of it in the coming months.
It is also important to remark that OpenGL 4.6 conformance doesn’t just validate new features in OpenGL 4.6, it is a full conformance program for OpenGL drivers that includes OpenGL 4.6 functionality, and as such, it is a super set of the OpenGL 4.5 conformance. The OpenGL 4.6 CTS does, in fact, incorporate a whole lot of bugfixes and expanded coverage for OpenGL features that were already present in OpenGL 4.5 and prior.
What is the conformance process and why is it important?
It is a well known issue with standards that different implementations are not always consistent. This can happen for a number of reasons. For example, implementations have bugs which can make something work on one platform but not on another (which will then require applications to implement work arounds). Another reason for this is that some times implementators just have different interpretations of the standard.
The Khronos conformance program is intended to ensure that products that implement Khronos standards (such as OpenGL or Vulkan drivers) do what they are supposed to do and they do it consistently across implementations from the same or different vendors. This is achieved by producing an extensive test suite, the Conformance Test Suite (or CTS for short), which aims to verify that the semantics of the standard are properly implemented by as many vendors as possible.
Why is CTS different to other test suites available?
One reason is that CTS development includes Khronos members who are involved in the definition of the API specifications. This means there are people well versed in the spec language who can review and provide feedback to test developers to ensure that the tests are good.
Another reason is that before new tests go in, it is required that there are at least a number of implementation (from different vendors) that pass them. This means that various vendors have implemented the related specifications and these different implementations agree on the expected result, which is usually a good sign that the tests and the implementations are good (although this is not always enough!).
How does CTS and the Khronos conformance process help API implementators and users?
First, it makes it so that existing and new functionality covered in the API specifications is tested before granting the conformance status. This means that implementations have to run all these tests and pass them, producing the same results as other implementations, so as far as the test coverage goes, the implementations are correct and consistent, which is the whole point of this process: it wont’ matter if you’re running your application on Intel, NVIDIA, AMD or a different GPU vendor, if your application is correct, it should run the
same no matter the driver you are running on.
Now, this doesn’t mean that your application will run smoothly on all conformant platforms out of the box. Application developers still need to be aware that certain aspects or features in the specifications are optional, or that different hardware implementations may have different limits for certain things. Writing software that can run on multiple platforms is always a challenge and some of that will always need to be addressed on the application side, but at least the conformance process attempts to ensure that for applications that do their part of the work, things will work as intended.
There is another interesting implication of conformance that has to do with correct API specification. Designing APIs that can work across hardware from different vendors is a challenging process. With the CTS, Khronos has an opportunity to validate the specifications against actual implementations. In other words, the CTS allows Khronos to verify that vendors can implement the specifications as intended and revisit the specification if they can’t before releasing them. This ensures that API specifications are reasonable and a good match for existing hardware implementations.
Another benefit of CTS is that vendors working on any API implementation will always need some level of testing to verify their code during development. Without CTS, they would have to write their own tests (which would be biased towards their own interpretations of the spec anyway), but with CTS, they can leave that to Khronos and focus on the implementation instead, cutting down development times and sharing testing code with other vendors.
What about Piglit or other testing frameworks?
CTS doesn’t make Piglit obsolete or redundant at all. While CTS coverage will improve over the years it is nearly impossible to have 100% coverage, so having other testing frameworks around that can provide extra coverage is always good.
My experience working on the Mesa drivers is that it is surprisingly easy to break stuff, specially on OpenGL which has a lot of legacy stuff in it. I have seen way too many instances of patches that seemed correct and in fact fixed actual problems only to later see Piglit, CTS and/or dEQP report regressions on existing tests. The (not so surprising) thing is that many times, the regressions were reported on just one of these testing frameworks, which means they all provide some level of coverage that is not included in the others.
It is for this reason that the continuous integration system for Mesa provided by Intel runs all of these testing frameworks (and then some others). You just never get enough testing. And even then, some regressions slip into releases despite all the testing!
Also, for the case of Piglit in particular, I have to say that it is very easy to write new tests, specially shader runner tests, which is always a bonus. Writing tests for CTS or dEQP, for example, requires more work in general.
So what have I been doing exactly?
For the last 9 months or so, I have been working on ensuring that the Intel Mesa drivers for both Vulkan and OpenGL are conformant. If you have followed any of my work in Mesa over the last year or so, you have probably already guessed this, since most of the patches I have been sending to Mesa reference the conformance tests they fix.
To be more thorough, my work included:
Reviewing and testing patches submitted for inclusion in CTS that either fixed test bugs, extended coverage for existing features or added new tests for new API specifications. CTS is a fairly active project with numerous changes submitted for review pretty much every day, for OpenGL, OpenGL ES and Vulkan, so staying on top of things requires a significant dedication.
Ensuring that the Intel Mesa drivers passed all the CTS tests for both Vulkan and OpenGL. This requires to run the conformance tests, identifying test failures, identifying the cause for the failures and providing proper fixes. The fixes would go to CTS when the cause for the issue was a bogus test, to the driver, when it was a bug in our implementation or the fact that the driver was simply missing some functionality, or they could even go to the OpenGL or Vulkan specs, when the source of the problem was incomplete, ambiguous or incorrect spec language that was used to drive the test development. I have found instances of all these situations.
Where can I get the CTS code?
Good news, it is open source and available at GitHub.
This is a very important and welcomed change by Khronos. When I started helping Intel with OpenGL conformance, specifically for OpenGL 4.5, the CTS code was only available to specific Khronos members. Since then, Khronos has done a significant effort in working towards having an open source testing framework where anyone can contribute, so kudos to Khronos for doing this!
Going open source not only leverages larger collaboration and further development of the CTS. It also puts in the hands of API users a handful of small test samples that people can use to learn how some of the new Vulkan and OpenGL APIs released to the public are to be used, which is always nice.
What is next?
As I said above, CTS development is always ongoing, there is always testing coverage to expand for existing features, bugfixes to provide for existing tests, more tests that need to be adapted or changed to match corrections in in the spec language, new extensions and APIs that need testing coverage, etc.
And on the driver side, there are always new features to implement that come with their potential bugs that need to be fixed, occasional regressions that need to be addressed promptly, new bugs uncovered by new tests that need fixes, etc
So the fun never really stops 🙂
In this post, besides bringing the good news to everyone, I hope that I have made a good case for why the Khronos CTS is important for the industry and why we should care about it. I hope that I also managed to give a sense for the enormous amount of work that goes into making all of this possible, both on the side of Khronos and the side of the driver developer teams. I think all this effort means better drivers for everyone and I hope that we all, as users, come to appreciate it for that.
Finally, big thanks to Intel for sponsoring our work on Mesa and CTS, and also to Igalia, for having me work on this wonderful project.
OpenGL® and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc. in the United States and/or other countries worldwide. Additional license details are available on the SGI website.