<p>Danylo’s blog — Danylo Piliaiev</p>
<p><strong>Command stream editing as an effective method to debug driver issues</strong> — 2023-03-21</p>
<aside class="sidebar__right sticky">
<nav class="toc">
<header>
<h4 class="nav__title"><i class="fas fa-file-alt"></i> Table of Contents </h4>
</header>
<ol id="markdown-toc">
<li><a href="#how-the-tool-is-used" id="markdown-toc-how-the-tool-is-used">How the tool is used</a></li>
</ol>
</nav>
</aside>
<p>In previous posts, <a href="/dpiliaiev/google-flight-recorder/">“Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”</a> and <a href="/dpiliaiev/debugging-unrecoverable-hangs/">“Debugging Unrecoverable GPU Hangs”</a>, I demonstrated a few tricks for identifying the location of a GPU fault.</p>
<p>But what’s the next step once you’ve roughly pinpointed the issue? What if the problem is only sporadically reproducible and the only way to ensure consistent results is by replaying a trace of raw GPU commands? How can you precisely determine the cause and find a proper fix?</p>
<p>Sometimes you may have an inkling of what’s causing the problem, and you can simply modify the driver’s code to see if it resolves the issue. However, there are instances where the root cause remains elusive, or where you only want to change a specific value without affecting the same register before and after it.</p>
<p>The optimal approach in these situations is to directly modify the commands sent to the GPU. The ability to arbitrarily edit the command stream was always an obvious idea and has crossed my mind numerous times (and not only mine – proprietary driver developers seem to employ similar techniques). Finally, the stars aligned: my frustration with a recent bug, the kernel’s new support for user-space-defined GPU addresses for buffer objects, the tool I wrote to replay command stream traces not so long ago, and the realization that implementing a command stream editor was not as complicated as initially thought.</p>
<p>The end result is a tool for Adreno GPUs (with msm kernel driver) to decompile, edit, and compile back command streams: <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19444">“freedreno,turnip: Add tooling to edit command streams and use them in ‘replay’”</a>.</p>
<p>The primary advantage of this command stream editing tool lies in its ability to rapidly iterate over hypotheses. Another highly valuable feature (which I have plans for) would be the automatic bisection of the command stream, which would be particularly beneficial in instances where only the bug reporter has the necessary hardware to reproduce the issue at hand.</p>
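<p>The planned automatic bisection could, for instance, be a plain binary search over how many commands of the stream are replayed. A minimal sketch (hypothetical, not part of the actual tool):</p>

```python
# Binary-search for the shortest hanging prefix of a command stream.
# Assumes hangs(0) is False and hangs(n_commands) is True.
def bisect_cmdstream(n_commands, hangs):
    """`hangs(k)` replays the first k commands and reports whether the GPU hangs."""
    lo, hi = 0, n_commands  # invariant: first `lo` commands are fine, first `hi` hang
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs(mid):
            hi = mid
        else:
            lo = mid
    return hi  # smallest prefix length that still hangs

# Toy run: pretend the 117th command is the culprit in a 207-dword stream.
print(bisect_cmdstream(207, lambda k: k >= 117))  # → 117
```

<p>Each probe would correspond to one replay on the reporter’s hardware, so a 200-command stream needs only about eight reboots instead of two hundred.</p>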
<h2 id="how-the-tool-is-used">How the tool is used</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Decompile one command stream from the trace
./rddecompiler -s 0 gpu_trace.rd > generate_rd.c
# Compile the executable which would output the command stream
meson setup . build
ninja -C build
# Override the command stream with the commands from the generator
./replay gpu_trace.rd --override=0 --generator=./build/generate_rd
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Reading dEQP-VK.renderpass.suballocation.formats.r5g6b5_unorm_pack16.clear.clear.rd...
gpuid: 660
Uploading iova 0x100000000 size = 0x82000
Uploading iova 0x100089000 size = 0x4000
cmdstream 0: 207 dwords
generating cmdstream './generate_rd --vastart=21441282048 --vasize=33554432 gpu_trace.rd'
Uploading iova 0x4fff00000 size = 0x1d4
override cmdstream: 117 dwords
skipped cmdstream 1: 248 dwords
skipped cmdstream 2: 223 dwords
</code></pre></div></div>
<p>The decompiled code isn’t pretty:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].TL = { X = 0 | Y = 0 } */
pkt4(cs, REG_A6XX_GRAS_SC_SCREEN_SCISSOR_TL(0), (2), 0);
/* pkt4: GRAS_SC_SCREEN_SCISSOR[0].BR = { X = 32767 | Y = 32767 } */
pkt(cs, 2147450879);
/* pkt4: VFD_INDEX_OFFSET = 0 */
pkt4(cs, REG_A6XX_VFD_INDEX_OFFSET, (2), 0);
/* pkt4: VFD_INSTANCE_START_OFFSET = 0 */
pkt(cs, 0);
/* pkt4: SP_FS_OUTPUT[0].REG = { REGID = r0.x } */
pkt4(cs, REG_A6XX_SP_FS_OUTPUT_REG(0), (1), 0);
/* pkt4: SP_TP_RAS_MSAA_CNTL = { SAMPLES = MSAA_FOUR } */
pkt4(cs, REG_A6XX_SP_TP_RAS_MSAA_CNTL, (2), 2);
/* pkt4: SP_TP_DEST_MSAA_CNTL = { SAMPLES = MSAA_FOUR } */
pkt(cs, 2);
/* pkt4: GRAS_RAS_MSAA_CNTL = { SAMPLES = MSAA_FOUR } */
pkt4(cs, REG_A6XX_GRAS_RAS_MSAA_CNTL, (2), 2);
</code></pre></div></div>
<p>Shader assembly is editable:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>const char *source = R"(
shps #l37
getone #l37
cov.u32f32 r1.w, c504.z
cov.u32f32 r2.x, c504.w
cov.u32f32 r1.y, c504.x
....
end
)";
upload_shader(&ctx, 0x100200d80, source);
emit_shader_iova(&ctx, cs, 0x100200d80);
</code></pre></div></div>
<p>However, not everything is currently editable, such as descriptors. Despite these limitations, the existing functionality is sufficient for the majority of cases.</p>
<p><strong>Turnips in the wild (Part 3)</strong> — 2023-01-05</p>
<p>This is the third part of my “Turnips in the wild” blog post series, where I describe how I found and fixed graphical issues in the Mesa Turnip Vulkan driver for Adreno GPUs.
If you missed the first two parts, you can find them here:</p>
<!--more-->
<ul>
<li><a href="/dpiliaiev/turnips-in-the-wild-part-1/">Turnips in the wild (Part 1)</a></li>
<li><a href="/dpiliaiev/turnips-in-the-wild-part-2/">Turnips in the wild (Part 2)</a></li>
</ul>
<aside class="sidebar__right sticky">
<nav class="toc">
<header>
<h4 class="nav__title"><i class="fas fa-file-alt"></i> Table of Contents </h4>
</header>
<ol id="markdown-toc">
<li><a href="#psychonauts-2" id="markdown-toc-psychonauts-2">Psychonauts 2</a> <ol>
<li><a href="#step-1---toggling-driver-options" id="markdown-toc-step-1---toggling-driver-options">Step 1 - Toggling driver options</a></li>
<li><a href="#step-2---finding-the-draw-call-and-staring-at-it-intensively" id="markdown-toc-step-2---finding-the-draw-call-and-staring-at-it-intensively">Step 2 - Finding the Draw Call and staring at it intensively</a></li>
<li><a href="#step-3---bisecting-the-shader-until-nothing-is-left" id="markdown-toc-step-3---bisecting-the-shader-until-nothing-is-left">Step 3 - Bisecting the shader until nothing is left</a></li>
<li><a href="#step-4---going-deeper" id="markdown-toc-step-4---going-deeper">Step 4 - Going deeper</a> <ol>
<li><a href="#loading-varyings" id="markdown-toc-loading-varyings">Loading varyings</a></li>
<li><a href="#packing-varyings" id="markdown-toc-packing-varyings">Packing varyings</a></li>
<li><a href="#interpolating-varyings" id="markdown-toc-interpolating-varyings">Interpolating varyings</a></li>
</ol>
</li>
</ol>
</li>
<li><a href="#injustice-2" id="markdown-toc-injustice-2">Injustice 2</a></li>
<li><a href="#monster-hunter-world" id="markdown-toc-monster-hunter-world">Monster Hunter: World</a></li>
</ol>
</nav>
</aside>
<h2 id="psychonauts-2">Psychonauts 2</h2>
<p>A few months ago it was reported that “Psychonauts 2” has rendering artifacts in the main menu, though I only recently got my hands on it.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/psychonauts_2_reported_misrendering.jpg" alt="Screenshot of the main menu with the rendering artifacts" /><figcaption>
Notice the mark on the top right, the game was running directly on Qualcomm board via FEX-Emu
</figcaption></figure>
<h3 id="step-1---toggling-driver-options">Step 1 - Toggling driver options</h3>
<p>Forcing direct rendering, forcing tiled rendering, disabling UBWC compression, forcing synchronizations everywhere, and so on, nothing helped or changed the outcome.</p>
<h3 id="step-2---finding-the-draw-call-and-staring-at-it-intensively">Step 2 - Finding the Draw Call and staring at it intensively</h3>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/psychonauts_2_finding_the_draw_call.jpg" alt="Screenshot of the RenderDoc with problematic draw call being selected" /><figcaption>
The first draw call with visible corruption
</figcaption></figure>
<p>When looking around the draw call, everything looks good except for a single input image:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/psychonauts_2_input_image_corruption.jpg" alt="Input image of the draw call with corruption already present" /><figcaption>
One of the input images
</figcaption></figure>
<p>Ughhh, it seems like this image comes from the previous frame, that’s bad. Inter-frame issues are hard to debug since there is no tooling to inspect two frames together…</p>
<p>* Looks around nervously *</p>
<p>Ok, let’s forget about it, maybe it doesn’t matter. Then next step would be looking at the pixel values in the corrupted region:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/psychonauts_2_bad_pixel_value.png" alt="" /><figcaption>
color = (35.0625, 18.15625, 2.2382, 0.00)
</figcaption></figure>
<p>Now, let’s see whether RenderDoc’s built-in shader debugger would give us the same value as GPU or not.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/psychonauts_2_bad_pixel_renderdoc_interpreter.png" alt="" /><figcaption>
color = (0.0335, 0.0459, 0.0226, 0.50)
</figcaption></figure>
<p>(Or not)</p>
<p>After looking at the same pixel on RADV, RenderDoc seems right. So the issue is somewhere in how the driver compiled the shader.</p>
<h3 id="step-3---bisecting-the-shader-until-nothing-is-left">Step 3 - Bisecting the shader until nothing is left</h3>
<p>A good start is to print the shader values with <code class="language-plaintext highlighter-rouge">debugPrintfEXT</code>, much nicer than looking at color values. Adding the <code class="language-plaintext highlighter-rouge">debugPrintfEXT</code> aaaand, the issue goes away, great, not like I wanted to debug it or anything.</p>
<p>Adding a printf changes the shader which affects the compilation process, so the changed result is not unexpected, though it’s much better when it works. So now we are stuck with observing pixel colors.</p>
<p>Bisecting a shader isn’t hard, especially if there is a reference GPU with the same capture opened to compare the changes against. You delete pieces of the shader until the results are the same; once they match, you take one step back and start removing other expressions, and repeat until nothing more can be removed.</p>
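<p>The elimination loop just described is essentially delta debugging. A toy sketch of it (the helper names are hypothetical; in reality the predicate is “compile, replay, and compare the pixels by eye”):</p>

```python
# Greedy statement elimination: drop a statement if the misrendering is still
# reproduced without it, otherwise keep it and move on.
def bisect_shader(statements, still_misrenders):
    """`still_misrenders(stmts)` builds a shader from `stmts`, runs the capture
    and returns True if the corruption is still visible."""
    i = 0
    while i < len(statements):
        candidate = statements[:i] + statements[i + 1:]
        if still_misrenders(candidate):
            statements = candidate  # the removed line was irrelevant to the bug
        else:
            i += 1                  # this line is needed to reproduce; keep it
    return statements

# Toy check: pretend the bug only needs the two "varying" lines to reproduce.
src = ["a = tex()", "b = varying0", "c = b * 2", "d = varying1"]
needed = ("b = varying0", "d = varying1")
print(bisect_shader(src, lambda s: all(l in s for l in needed)))
# → ['b = varying0', 'd = varying1']
```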
<details>
<summary>
First iteration of the elimination, most of the shader is gone (click me)
</summary>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>float _258 = 1.0 / gl_FragCoord.w;
vec4 _273 = vec4(_222, _222, _222, 1.0) * _258;
vec4 _280 = View.View_SVPositionToTranslatedWorld * vec4(gl_FragCoord.xyz, 1.0);
vec3 _284 = _280.xyz / vec3(_280.w);
vec3 _287 = _284 - View.View_PreViewTranslation;
vec3 _289 = normalize(-_284);
vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y) * Material.Material_ScalarExpressions[0].x;
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec2 _312 = (_309.xy * vec2(2.0)) - vec2(1.0);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, cross(in_var_TEXCOORD11_centroid.xyz, in_var_TEXCOORD10_centroid.xyz) * in_var_TEXCOORD11_centroid.w, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_312, sqrt(clamp(1.0 - dot(_312, _312), 0.0, 1.0)), 1.0).xyz * View.View_NormalOverrideParameter.w) + View.View_NormalOverrideParameter.xyz)) * ((View.View_CullingSign * View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID * 37u) + 4u].w) * float(gl_FrontFacing ? (-1) : 1));
vec2 _405 = in_var_TEXCOORD4.xy * vec2(1.0, 0.5);
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), _405 + vec2(0.0, 0.5));
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
float _447 = _331.y;
vec3 _531 = (((max(0.0, dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_447, _331.zx, 1.0))))) * View.View_IndirectLightingColorScale);
bool _1313 = TranslucentBasePass.TranslucentBasePass_Shared_Fog_ApplyVolumetricFog > 0.0;
vec4 _1364;
vec4 _1322 = View.View_WorldToClip * vec4(_287, 1.0);
float _1323 = _1322.w;
vec4 _1352;
if (_1313)
{
_1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(((_1322.xy / vec2(_1323)).xy * vec2(0.5, -0.5)) + vec2(0.5), (log2((_1323 * View.View_VolumetricFogGridZParams.x) + View.View_VolumetricFogGridZParams.y) * View.View_VolumetricFogGridZParams.z) * View.View_VolumetricFogInvGridSize.z), 0.0);
}
else
{
_1352 = vec4(0.0, 0.0, 0.0, 1.0);
}
_1364 = vec4(_1352.xyz + (in_var_TEXCOORD7.xyz * _1352.w), _1352.w * in_var_TEXCOORD7.w);
out_var_SV_Target0 = vec4(_531.x, _1364.w, 0, 1);
</code></pre></div> </div>
</details>
<p><br />
After that it became harder to reduce the code.</p>
<details>
<summary>
More elimination (click me)
</summary>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vec2 _303 = vec2(in_var_TEXCOORD0[0].x, in_var_TEXCOORD0[0].y);
vec4 _309 = texture(sampler2D(Material_Texture2D_0, Material_Texture2D_0Sampler), _303, View.View_MaterialTextureMipBias);
vec3 _331 = normalize(mat3(in_var_TEXCOORD10_centroid.xyz, in_var_TEXCOORD11_centroid.www, in_var_TEXCOORD11_centroid.xyz) * normalize((vec4(_309.xy, 1.0, 1.0).xyz))) * ((View_PrimitiveSceneData._m0[(in_var_PRIMITIVE_ID)].w) );
vec4 _412 = texture(sampler2D(LightmapResourceCluster_LightMapTexture, LightmapResourceCluster_LightMapSampler), in_var_TEXCOORD4.xy);
uint _418 = in_var_LIGHTMAP_ID; // <<<<<<-
vec3 _531 = (((dot((_412 * View_LightmapSceneData_1._m0[_418 + 5u]), vec4(_331.x, 1,1, 1.0)))) * View.View_IndirectLightingColorScale);
vec4 _1352 = textureLod(sampler3D(TranslucentBasePass_Shared_Fog_IntegratedLightScattering, View_SharedBilinearClampedSampler), vec3(vec2(0.5), View.View_VolumetricFogInvGridSize.z), 0.0);
out_var_SV_Target0 = vec4(_531.x, in_var_TEXCOORD7.w, 0, 1);
</code></pre></div> </div>
</details>
<p><br /></p>
<p>And finally, the end result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vec3 a = in_var_TEXCOORD10_centroid.xyz + in_var_TEXCOORD11_centroid.xyz;
float b = a.x + a.y + a.z + in_var_TEXCOORD11_centroid.w + in_var_TEXCOORD0[0].x + in_var_TEXCOORD0[0].y + in_var_PRIMITIVE_ID.x;
float c = b + in_var_TEXCOORD4.x + in_var_TEXCOORD4.y + in_var_LIGHTMAP_ID;
out_var_SV_Target0 = vec4(c, in_var_TEXCOORD7.w, 0, 1);
</code></pre></div></div>
<p>Nothing is left but the loading of varyings and the simplest operations on them, in order to prevent their elimination by the compiler.</p>
<p><code class="language-plaintext highlighter-rouge">in_var_TEXCOORD7.w</code> values are several orders of magnitude different from the expected ones, and if any varying is removed the issue goes away. Seems like an issue with the loading of varyings.</p>
<p>I created a simple standalone reproducer in <code class="language-plaintext highlighter-rouge">vkrunner</code> to isolate this case and make my life easier, but the same fragment shader passed without any trouble. This should have pointed me to undefined behavior somewhere.</p>
<h3 id="step-4---going-deeper">Step 4 - Going deeper</h3>
<p>Anyway, one major difference is the vertex shader; changing it does “fix” the issue. However, changing it also changed the varyings layout, and without changing the layout the issue remains. Thus the vertex shader is an unlikely culprit here.</p>
<p>Let’s take a look at the fragment shader assembly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bary.f r0.z, 0, r0.x
bary.f r0.w, 3, r0.x
bary.f r1.x, 1, r0.x
bary.f r1.z, 4, r0.x
bary.f r1.y, 2, r0.x
bary.f r1.w, 5, r0.x
bary.f r2.x, 6, r0.x
bary.f r2.y, 7, r0.x
flat.b r2.z, 11, 16
bary.f r2.w, 8, r0.x
bary.f r3.x, 9, r0.x
flat.b r3.y, 12, 17
bary.f r3.z, 10, r0.x
bary.f (ei)r1.x, 16, r0.x
....
</code></pre></div></div>
<h4 id="loading-varyings">Loading varyings</h4>
<p><code class="language-plaintext highlighter-rouge">bary.f</code> loads interpolated varying, <code class="language-plaintext highlighter-rouge">flat.b</code> loads it without interpolation. <code class="language-plaintext highlighter-rouge">bary.f (ei)r1.x, 16, r0.x</code> is what loads the problematic varying, though it doesn’t look suspicious at all. Looking through the state which defines how varyings are passed between VS and FS also doesn’t yield anything useful.</p>
<p>Ok, but what does the second operand of <code class="language-plaintext highlighter-rouge">flat.b r2.z, 11, 16</code> mean (the command format is <code class="language-plaintext highlighter-rouge">flat.b dst, src1, src2</code>)? The first one is the location from which the varying is loaded, and looking through Turnip’s code the second one should be equal to the first, otherwise “some bad things may happen”. Forced the sources to be equal - nothing changed… What did I expect? The standalone reproducer with the same assembly works fine.</p>
<p>The same description which promised bad things to happen also said that using 2 immediate sources for <code class="language-plaintext highlighter-rouge">flat.b</code> isn’t really expected. Let’s revert the change and get something like <code class="language-plaintext highlighter-rouge">flat.b r2.z, 11, r0.x</code>, nothing is changed, again.</p>
<h4 id="packing-varyings">Packing varyings</h4>
<p>What else happens with these varyings? They are being packed to remove their unused components! So let’s stop packing them. Aha! Now it works correctly!</p>
<p>Looking several times through the code, nothing is wrong. Changing the order of varyings helps, aligning them helps, aligning only flat varyings also helps. But the code is entirely correct.</p>
<p>Though one thing changed: while shuffling the varyings order I noticed that the resulting misrendering changed, so it’s likely not the order but the location which is cursed.</p>
<h4 id="interpolating-varyings">Interpolating varyings</h4>
<p>What’s left? The way varying interpolation is specified. The code emits interpolation state only for used varyings, but looking closer, the “used varyings” part isn’t that obviously defined. Emitting the whole interpolation state fixes the issue!</p>
<p>The culprit is found: stale varying interpolation data is being read. The resulting fix can be found in <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20533">tu: Fix varyings interpolation reading stale values + cosmetic changes</a>.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/psychonauts_2_correct_rendering.jpg" alt="Correctly rendered draw call after the changes" /><figcaption>
Correctly rendered draw call after the changes
</figcaption></figure>
<h2 id="injustice-2">Injustice 2</h2>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/injustice_2_reported_misrendering.jpg" alt="" /><figcaption>
</figcaption></figure>
<p>Another corruption in the main menu.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/injustice_2_bad_drawcall_on_turnip.jpg" alt="" /><figcaption>
Bad draw call on Turnip
</figcaption></figure>
<p>How it should look:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/injustice_2_bad_drawcall_on_radv.jpg" alt="" /><figcaption>
The same draw call on RADV
</figcaption></figure>
<p>The draw call inputs and state look good enough. So it’s time to bisect the shader.</p>
<p>Here is the output of the reduced shader on Turnip:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/injustice_2_reduced_bad_shader_on_turnip.png" alt="" /><figcaption>
</figcaption></figure>
<p>Enabling the display of NaNs and Infs shows that there are NaNs in the output on Turnip (NaNs have green color here):</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/injustice_2_reduced_bad_shader_nans_on_turnip.png" alt="" /><figcaption>
</figcaption></figure>
<p>While the correct rendering on RADV is:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/injustice_2_reduced_bad_shader_on_radv.png" alt="" /><figcaption>
</figcaption></figure>
<p>Carefully reducing the shader further resulted in the following fragment which reproduces the issue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r12 = uintBitsToFloat(uvec4(texelFetch(t34, _1195 + 0).x, texelFetch(t34, _1195 + 1).x, texelFetch(t34, _1195 + 2).x, texelFetch(t34, _1195 + 3).x));
....
vec4 _1268 = r12;
_1268.w = uintBitsToFloat(floatBitsToUint(r12.w) & 65535u);
_1275.w = unpackHalf2x16(floatBitsToUint(r12.w)).x;
</code></pre></div></div>
<p>On Turnip this <code class="language-plaintext highlighter-rouge">_1275.w</code> is <code class="language-plaintext highlighter-rouge">NaN</code>, while on RADV it is a proper number. Looking at assembly, the calculation of <code class="language-plaintext highlighter-rouge">_1275.w</code> from the above is translated into:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>isaml.base0 (u16)(x)hr2.z, r0.w, r0.y, s#0, t#12
(sy)cov.f16f32 r1.z, hr2.z
</code></pre></div></div>
<p>In GLSL there is a read of <code class="language-plaintext highlighter-rouge">uint32</code>, stripping it of the high 16 bits, then converting the lower 16 bits to a half float.</p>
<p>In assembly the “read and strip the high 16 bits” part is done in a single command <code class="language-plaintext highlighter-rouge">isaml</code>, where the stripping is done via <code class="language-plaintext highlighter-rouge">(u16)</code> conversion.</p>
<p>At this point I wrote a simple reproducer to speed up iteration on the issue:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>result = uint(unpackHalf2x16(texelFetch(t34, 0).x & 65535u).x);
</code></pre></div></div>
<p>After testing different values I confirmed that the <code class="language-plaintext highlighter-rouge">(u16)</code> conversion doesn’t strip the higher 16 bits, but clamps the value to a 16-bit unsigned integer. Running the reproducer on the proprietary driver showed that it doesn’t fold the <code class="language-plaintext highlighter-rouge">u32 -> u16</code> conversion into <code class="language-plaintext highlighter-rouge">isaml</code>.</p>
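<p>The difference between stripping and clamping is easy to demonstrate on the CPU. A small Python model (not driver code; <code class="language-plaintext highlighter-rouge">half_to_float</code> stands in for <code class="language-plaintext highlighter-rouge">unpackHalf2x16</code> applied to the low word, and the texel value is made up):</p>

```python
import struct

def half_to_float(h16):
    # Decode a 16-bit IEEE half to a Python float, like unpackHalf2x16(...).x
    return struct.unpack('<e', struct.pack('<H', h16))[0]

raw = 0x12345C00  # example u32 texel: low half 0x5C00 == 256.0, garbage above

masked  = raw & 0xFFFF      # correct: strip the high 16 bits
clamped = min(raw, 0xFFFF)  # what the (u16) conversion actually did: clamp

print(half_to_float(masked))   # → 256.0 — the expected value
print(half_to_float(clamped))  # → nan — 0xFFFF decodes to a half-float NaN
```

<p>Any texel above 65535 clamps to 0xFFFF, which has an all-ones exponent and a non-zero mantissa, i.e. a NaN — exactly the green pixels in the capture.</p>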
<p>Knowing that, the fix is easy: <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20396">ir3: Do 16b tex dst folding only for floats</a></p>
<h2 id="monster-hunter-world">Monster Hunter: World</h2>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/monster_hunter_world_main_menu_misrendering.jpg" alt="" /><figcaption>
</figcaption></figure>
<p>Main menu, again =) Before we even got here, two other issues had to be fixed, including one which seems to be a HW bug that the proprietary driver is not aware of.</p>
<p>In this case of misrendering the culprit is a compute shader.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/monster_hunter_world_bad_dispatch_on_turnip.png" alt="" /><figcaption>
</figcaption></figure>
<p>How it should look:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/monster_hunter_world_bad_dispatch_on_radv.png" alt="" /><figcaption>
</figcaption></figure>
<p>Compute shaders are generally easier to deal with since much less state is involved.</p>
<p>None of the debug options helped and shader printf didn’t work at that time for some reason. So I decided to look at the shader assembly, trying to spot something funny.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ldl.u32 r6.w, l[r6.z-4016], 1
ldl.u32 r7.x, l[r6.z-4012], 1
ldl.u32 r7.y, l[r6.z-4032], 1
ldl.u32 r7.z, l[r6.z-4028], 1
ldl.u32 r0.z, l[r6.z-4024], 1
ldl.u32 r2.z, l[r6.z-4020], 1
</code></pre></div></div>
<p>Negative offsets into shared memory are not suspicious at all. Were they always there? How does it look right before being passed into our backend compiler?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vec1 32 ssa_206 = intrinsic load_shared (ssa_138) (base=4176, align_mul=4, align_offset=0)
vec1 32 ssa_207 = intrinsic load_shared (ssa_138) (base=4180, align_mul=4, align_offset=0)
vec1 32 ssa_208 = intrinsic load_shared (ssa_138) (base=4160, align_mul=4, align_offset=0)
vec1 32 ssa_209 = intrinsic load_shared (ssa_138) (base=4164, align_mul=4, align_offset=0)
vec1 32 ssa_210 = intrinsic load_shared (ssa_138) (base=4168, align_mul=4, align_offset=0)
vec1 32 ssa_211 = intrinsic load_shared (ssa_138) (base=4172, align_mul=4, align_offset=0)
vec1 32 ssa_212 = intrinsic load_shared (ssa_138) (base=4192, align_mul=4, align_offset=0)
</code></pre></div></div>
<p>Nope, no negative offsets, just a number of offsets close to <code class="language-plaintext highlighter-rouge">4096</code>. Looks like offsets got wrapped around!</p>
<p>Looking at <code class="language-plaintext highlighter-rouge">ldl</code> definition it has 13 bits for the offset:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><pattern pos="0" >1</pattern>
<field low="1" high="13" name="OFF" type="offset"/> <--- This is the offset field
<field low="14" high="21" name="SRC" type="#reg-gpr"/>
<pattern pos="22" >x</pattern>
<pattern pos="23" >1</pattern>
</code></pre></div></div>
<p>With <code class="language-plaintext highlighter-rouge">offset</code> type being a signed integer (so the one bit is for the sign). Which leaves us with 12 bits, meaning the upper bound of <code class="language-plaintext highlighter-rouge">4095</code>. Case closed!</p>
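<p>The wrap-around is simple to model: interpret the immediate as a 13-bit two’s-complement field. A quick illustrative Python check reproduces exactly the negative offsets seen in the assembly above:</p>

```python
def encode_ldl_offset(off, bits=13):
    # Wrap an offset into a signed two's-complement field of the given width,
    # mimicking what happens when the immediate is blindly truncated.
    mask = (1 << bits) - 1
    off &= mask
    return off - (1 << bits) if off >= (1 << (bits - 1)) else off

# The NIR base offsets from the shader above wrap around to the negative
# values seen in the ldl instructions:
print([encode_ldl_offset(o) for o in (4176, 4180, 4160, 4164, 4168, 4172)])
# → [-4016, -4012, -4032, -4028, -4024, -4020]
```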
<p>I know that there is an upper bound set on the offset during optimizations, but where and how is it set?</p>
<p>The upper bound is set via <code class="language-plaintext highlighter-rouge">nir_opt_offsets_options::shared_max</code> and is equal to <code class="language-plaintext highlighter-rouge">(1 << 13) - 1</code>, which we saw is incorrect. Who set it?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Subject: [PATCH] ir3: Limit the maximum imm offset in nir_opt_offset for
shared vars
STL/LDL have 13 bits to store imm offset.
Fixes crash in CS compilation in Monster Hunter World.
Fixes: b024102d7c2959451bfef323432beaa4dca4dd88
("freedreno/ir3: Use nir_opt_offset for removing constant adds for shared vars.")
Signed-off-by: Danylo Piliaiev
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14968>
---
src/freedreno/ir3/ir3_nir.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
@@ -124,7 +124,7 @@ ir3_optimize_loop(struct ir3_compiler *compiler, nir_shader *s)
*/
.uniform_max = (1 << 9) - 1,
- .shared_max = ~0,
+ .shared_max = (1 << 13) - 1,
</code></pre></div></div>
<p>Weeeell, totally unexpected, it was me! Fixing the same game, maybe even the same shader…</p>
<p>Let’s set the <code class="language-plaintext highlighter-rouge">shared_max</code> to a correct value . . . . . Nothing changed, not even the assembly. The same incorrect offset is still there.</p>
<p>After a bit of wandering around the optimization pass, it was found that in one case the upper bound is not enforced correctly. Fixing it fixed the rendering.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnips-in-the-wild-part-3/monster_hunter_world_correct_rendering.jpg" alt="" /><figcaption>
</figcaption></figure>
<p>The final changes were:</p>
<ul>
<li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20100">ir3: Reduce the maximum allowed imm offset for shared var load/store</a></li>
<li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20099">nir/nir_opt_offsets: Prevent offsets going above max</a></li>
</ul>
<p><strong>Debugging Unrecoverable GPU Hangs</strong> — 2022-11-11</p>
<aside class="sidebar__right sticky">
<nav class="toc">
<header>
<h4 class="nav__title"><i class="fas fa-file-alt"></i> Table of Contents </h4>
</header>
<ol id="markdown-toc">
<li><a href="#breadcrumbs" id="markdown-toc-breadcrumbs">Breadcrumbs</a></li>
<li><a href="#turnips-take-on-the-breadcrumbs" id="markdown-toc-turnips-take-on-the-breadcrumbs">Turnip’s take on the breadcrumbs</a></li>
<li><a href="#limitations" id="markdown-toc-limitations">Limitations</a></li>
</ol>
</nav>
</aside>
<p>I already talked about debugging hangs in <a href="/dpiliaiev/google-flight-recorder/">“Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”</a>, now I want to talk about the most nasty kind of GPU hangs - the ones which cannot be recovered from, where your computer becomes completely unresponsive and you cannot even ssh into it.</p>
<p>How would one debug this? There is no data to get after the hang and it’s incredibly frustrating to even try different debug options and hypotheses; if you are wrong - you get to reboot the machine!</p>
<p>If you are a hardware manufacturer creating a driver for your own GPU, you could just run the workload in your fancy hardware simulator, wait for a few hours for the result and call it a day. But what if you don’t have access to a simulator, or to some debug side channel?</p>
<p>There are a few things one could try:</p>
<ul>
<li>Try all debug options you have, try to disable compute dispatches, some draw calls, blits, and so on. The downside is that you have to reboot every time you hit the issue.</li>
<li>Eyeball the GPU packets until they start eyeballing you.</li>
<li>Breadcrumbs!</li>
</ul>
<h2 id="breadcrumbs">Breadcrumbs</h2>
<p>Today we will talk about the breadcrumbs. Unfortunately, Graphics Flight Recorder, which I already wrote about, isn’t of much help here. The hang is unrecoverable so GFR isn’t able to gather the results and write them to the disk.</p>
<p>But the idea of breadcrumbs is still useful! What if, instead of gathering results post factum, we stream them to some other machine? This would allow us to get the results even if our target becomes unresponsive. Though, the requirement to get the results ASAP considerably changes the workflow.</p>
<p>What if we write breadcrumbs on GPU after each command and spin a thread on CPU reading it in a busy loop?</p>
<figure class="">
<img src="/dpiliaiev/assets/images/debugging-unrecoverable-hangs/breadcrumbs_without_cpu_gpu_sync.svg" alt="" /><figcaption>
</figcaption></figure>
<p>In practice, the number of breadcrumbs between the one sent over the network and the one currently executed is just too big to be practical.</p>
<p>So we have to make the GPU and CPU run in lockstep:</p>
<ul>
<li>GPU writes a breadcrumb and immediately waits for this value to be acknowledged;</li>
<li>CPU in a busy loop checks the last written breadcrumb value, sends it over socket, and writes it back to the fixed address;</li>
<li>GPU sees a new value and continues execution.</li>
</ul>
<figure class="">
<img src="/dpiliaiev/assets/images/debugging-unrecoverable-hangs/breadcrumbs_with_cpu_gpu_sync.svg" alt="" /><figcaption>
</figcaption></figure>
<p>This way the most recent breadcrumb gets immediately sent over the network. In practice, some breadcrumbs are still lost between the last one sent over the network and the one where the GPU hangs. But the difference is only a few of them.</p>
<p>With lockstep execution we can narrow down the hanging command even further. For this we have to wait for a certain time after each breadcrumb before proceeding to the next one. I chose to simply prompt the user for explicit keyboard input on each breadcrumb.</p>
<p>I ended up with the following workflow:</p>
<ul>
<li>Run with breadcrumbs enabled but without requiring explicit user input;</li>
<li>On another machine receive the stream of breadcrumbs via network;</li>
<li>Note the last received breadcrumb, the hanging command would be nearby;</li>
<li>Reboot the target;</li>
<li>Enable breadcrumbs starting from a few breadcrumbs before the last one received, require explicit ack from the user on each breadcrumb;</li>
<li>In a few steps GPU should hang;</li>
<li>Now that we know the closest breadcrumb to the real hang location - we can get the command stream and see what happens right after the breadcrumb;</li>
<li>Knowing which command(s) is causing our hang - it’s now possible to test various changes in the driver.</li>
</ul>
<p>Could Graphics Flight Recorder do this?</p>
<p>In theory yes, but it would require an additional VK extension to be able to wait for a value in memory. However, it would still have a crucial limitation.</p>
<p>Even with Vulkan being really close to the hardware there are still many cases where one Vulkan command is translated into many GPU commands under the hood. Things like image copies, blits, renderpass boundaries. For unrecoverable hangs we want to narrow down the hanging GPU command as much as possible, so it makes more sense to implement such functionality in a driver.</p>
<h2 id="turnips-take-on-the-breadcrumbs">Turnip’s take on the breadcrumbs</h2>
<p>I recently <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15452">implemented it</a> in Turnip (open-source Vulkan driver for Qualcomm’s GPUs) and have used it a few times with good results.</p>
<p>The current implementation in Turnip is rather spartan and provides the minimal set of instruments needed to achieve the workflow described above. It looks like this:</p>
<p>1) Launch workload with <code class="language-plaintext highlighter-rouge">TU_BREADCRUMBS</code> envvar (<code class="language-plaintext highlighter-rouge">TU_BREADCRUMBS=$IP:$PORT,break=$BREAKPOINT:$BREAKPOINT_HITS</code>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TU_BREADCRUMBS=$IP:$PORT,break=-1:0 ./some_workload
</code></pre></div></div>
<p>2) Receive breadcrumbs on another machine via this bash spaghetti:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'
Received packet from 10.42.0.19:49116 -> 10.42.0.1:11113 (local)
1:0
7:0
8:0
9:0
10:0
11:0
12:0
13:0
14:0
[...]
10:3
11:3
12:3
13:3
14:3
15:3
16:3
17:3
18:3
</code></pre></div></div>
<p>Each line is a breadcrumb № and how many times it was repeated (either because the command buffer is reusable or because the breadcrumb is in a command stream which repeats for each tile).</p>
<p>3) Increase hang timeout:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>echo -n 120000 > /sys/kernel/debug/dri/0/hangcheck_period_ms
</code></pre></div></div>
<p>4) Launch workload and break on the last known executed breadcrumb:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TU_BREADCRUMBS=$IP:$PORT,break=15:3 ./some_workload
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GPU is on breadcrumb 18, continue?y
GPU is on breadcrumb 19, continue?y
GPU is on breadcrumb 20, continue?y
GPU is on breadcrumb 21, continue?y
...
</code></pre></div></div>
<p>5) Continue until GPU hangs.</p>
<p>:tada: Now you know the two breadcrumbs between which there is a command that causes the hang.</p>
<h2 id="limitations">Limitations</h2>
<ul>
<li>If the hang was caused by a lack of synchronization, or a lack of cache flushing - it may not be reproducible with breadcrumbs enabled.</li>
<li>While breadcrumbs could help to narrow down the hang to a single command - a single command may access many different states and have many steps under the hood (e.g. draw or blit commands).</li>
<li>The breadcrumbs method described here does affect performance, especially if tiled rendering is used, since breadcrumbs are repeated for each tile.</li>
</ul>Danylo PiliaievDebugging ordinary GPU hangs is too easy? Try hangs that bring the whole machine down!:tada: Turnip now exposes Vulkan 1.3 :tada:2022-09-21T00:00:00+02:002022-09-21T00:00:00+02:00https://blogs.igalia.com/dpiliaiev/turnip-vulkan-1-3<figure class="">
<img src="/dpiliaiev/assets/images/turnip-vulkan-1-3/rb3_board.jpg" alt="Photo of the RB3 development board with Adreno 630 GPU" /><figcaption>
RB3 development board with Adreno 630 GPU
</figcaption></figure>
<p>It is a major milestone for a driver that is created without any hardware documentation.</p>
<p>The last major roadblocks were <a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_dynamic_rendering.html#_description">VK_KHR_dynamic_rendering</a> and, to a much lesser extent, <a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_inline_uniform_block.html#_description">VK_EXT_inline_uniform_block</a>. Huge props to <a href="https://gitlab.freedesktop.org/cwabbott0">Connor Abbott</a> for implementing them both!</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-vulkan-1-3/mesamatrix.jpg" alt="Screenshot of mesamatrix.net showing that Turnip has 100% of features required for Vulkan 1.3" /><figcaption>
</figcaption></figure>
<p><code class="language-plaintext highlighter-rouge">VK_KHR_dynamic_rendering</code> was an especially nasty extension to implement on tiling GPUs because dynamic rendering allows splitting a render pass between several command buffers.</p>
<p>For desktop GPUs there are no issues with this. They can just record and execute commands in the same order they are submitted, without any additional post-processing. Desktop GPUs don’t have render passes internally; to them a render pass is just a sequence of commands.</p>
<p>On the other hand, tiling GPUs have the internal concept of a render pass: they do binning of the whole render pass geometry first, load part of the framebuffer into the tile memory, execute all render pass commands, store framebuffer contents into the main memory, then repeat <code class="language-plaintext highlighter-rouge">load_framebuffer</code> -> <code class="language-plaintext highlighter-rouge">execute_renderpass</code> -> <code class="language-plaintext highlighter-rouge">store_framebuffer</code> for all tiles. In Turnip the required glue code is created at the end of a render pass, while the whole render pass contents (when the render pass is split across several command buffers) are known only at submit time. Therefore we have to stitch the final render pass together right there.</p>
<h2 id="whats-next">What’s next?</h2>
<p>Implementing Vulkan 1.3 was necessary to support the latest DXVK (Direct3D 9-11 translation layer). <code class="language-plaintext highlighter-rouge">VK_KHR_dynamic_rendering</code> itself was also necessary for the latest VKD3D (Direct3D 12 translation layer).</p>
<p>For now my plan is:</p>
<ul>
<li>Continue implementing new extensions for DXVK, VKD3D, and Zink as they come out.</li>
<li>Focus more on performance.</li>
<li>Improvements to driver debug tooling so it works better with internal and external debugging utilities.</li>
</ul>Danylo PiliaievTurnip, open-source Vulkan driver for Adreno GPUs, has reached a major milestone and now supports all necessary features for Vulkan 1.3July 2022 Turnip Status Update2022-07-25T00:00:00+02:002022-07-25T00:00:00+02:00https://blogs.igalia.com/dpiliaiev/turnip-july-2022-update<p>There is steady progress being made since <a href="/dpiliaiev/turnip-1-1-conformance/">:tada: Turnip is Vulkan 1.1 Conformant :tada:</a>. We now support GL 4.6 via Zink, have implemented a lot of extensions, and are close to Vulkan 1.3 conformance.</p>
<p>Support of real-world games is also looking good, here is a video of Adreno 660 rendering “The Witcher 3”, “The Talos Principle”, and “OMD2”:</p>
<iframe src="https://www.youtube.com/embed/oVFWy25uiXA" frameborder="0" allowfullscreen=""></iframe>
<p>All of them have reasonable frame rates. However, there was a bit of “cheating” involved. Only “The Talos Principle” was fully running on the development board (via <a href="https://github.com/ptitSeb/box64">box64</a>); the other two games were only rendered in real time on the Adreno GPU but ran on an x86-64 laptop, with their VK commands being streamed to the dev board. You can read about this method in my post <a href="/dpiliaiev/gfxreconstruct-test-mobile-gpus/">“Testing Vulkan drivers with games that cannot run on the target device”</a>.</p>
<p>The video was captured directly on the device via OBS with <a href="https://github.com/nowrep/obs-vkcapture">obs-vkcapture</a>, which worked surprisingly well after fighting a bunch of issues caused by the lack of a binary package for it and a somewhat dated Ubuntu installation.</p>
<h2 id="zink-gl-over-vulkan">Zink (GL over Vulkan)</h2>
<p>A number of extensions were implemented that are required for Zink to support higher GL versions. As of now Turnip supports OpenGL 4.6 via Zink, and while not yet conformant - only a handful of GL CTS tests are failing. For perspective, Freedreno (our GL driver for Adreno) supports only OpenGL 3.3.</p>
<p>For Zink adventures and profound post titles check out Mike Blumenkrantz’s awesome blog <a href="https://www.supergoodcode.com/">supergoodcode.com</a>.</p>
<p>If you are interested in the Zink-over-Turnip bring-up in particular, you should read:</p>
<ul>
<li><a href="https://www.supergoodcode.com/depth/">supergoodcode.com/depth</a></li>
<li><a href="https://www.supergoodcode.com/returning/">supergoodcode.com/returning</a></li>
<li><a href="https://www.supergoodcode.com/the-doctor-is-in/">supergoodcode.com/the-doctor-is-in</a></li>
</ul>
<h2 id="low-resolution-z-improvements">Low Resolution Z improvements</h2>
<p>A major improvement for low resolution Z optimization (LRZ) was recently made in Turnip, read about it in the previous post of mine: <a href="/dpiliaiev/adreno-lrz/">LRZ on Adreno GPUs</a></p>
<h2 id="extensions">Extensions</h2>
<p>Anyway, since the last update Turnip supports many more extensions (in no particular order):</p>
<ul>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_swapchain_mutable_format.html#_description">VK_KHR_swapchain_mutable_format</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_synchronization2.html#_description">VK_KHR_synchronization2</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_format_feature_flags2.html#_description">VK_KHR_format_feature_flags2</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_shader_non_semantic_info.html#_description">VK_KHR_shader_non_semantic_info</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_zero_initialize_workgroup_memory.html#_description">VK_KHR_zero_initialize_workgroup_memory</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_copy_commands2.html#_description">VK_KHR_copy_commands2</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_maintenance4.html#_description">VK_KHR_maintenance4</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_shader_module_identifier.html#_description">VK_EXT_shader_module_identifier</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_border_color_swizzle.html#_description">VK_EXT_border_color_swizzle</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_color_write_enable.html#_description">VK_EXT_color_write_enable</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_2d_view_of_3d.html#_description">VK_EXT_image_2d_view_of_3d</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_pipeline_creation_cache_control.html#_description">VK_EXT_pipeline_creation_cache_control</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_pipeline_creation_feedback.html#_description">VK_EXT_pipeline_creation_feedback</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_view_min_lod.html#_description">VK_EXT_image_view_min_lod</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_primitives_generated_query.html#_description">VK_EXT_primitives_generated_query</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_debug_utils.html#_description">VK_EXT_debug_utils</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_texel_buffer_alignment.html#_description">VK_EXT_texel_buffer_alignment</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_display_control.html#_description">VK_EXT_display_control</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_depth_clip_control.html#_description">VK_EXT_depth_clip_control</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_physical_device_drm.html#_description">VK_EXT_physical_device_drm</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_robustness.html#_description">VK_EXT_image_robustness</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_primitive_topology_list_restart.html#_description">VK_EXT_primitive_topology_list_restart</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_subgroup_size_control.html#_description">VK_EXT_subgroup_size_control</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_AMD_buffer_marker.html#_description">VK_AMD_buffer_marker</a></li>
<li><a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_ARM_rasterization_order_attachment_access.html#_description">VK_ARM_rasterization_order_attachment_access</a></li>
</ul>
<h2 id="what-about-vulkan-conformance">What about Vulkan conformance?</h2>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-july-2022-update/mesamatrix.jpg" alt="Screenshot of a mesamatrix.net website which shows how many extensions left for Turnip to implement to be Vulkan 1.3 conformant" /><figcaption>
From <a href="https://mesamatrix.net/#Vulkan1.3">mesamatrix.net/#Vulkan1.3</a>
</figcaption></figure>
<p>For Vulkan 1.3 conformance there are only a few extensions left to implement. The only major ones are <a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_KHR_dynamic_rendering.html#_description">VK_KHR_dynamic_rendering</a> and <a href="https://www.khronos.org/registry/vulkan/specs/1.3-extensions/man/html/VK_EXT_inline_uniform_block.html#_description">VK_EXT_inline_uniform_block</a>. <code class="language-plaintext highlighter-rouge">VK_KHR_dynamic_rendering</code> is currently being reviewed, and the foundation for <code class="language-plaintext highlighter-rouge">VK_EXT_inline_uniform_block</code> was recently merged.</p>
<p>That’s all for today!</p>Danylo PiliaievA lot of new extensions, GL 4.6 via Zink, major LRZ rework, and coming VK 1.3 conformance.Low-resolution-Z on Adreno GPUs2022-07-13T00:00:00+02:002022-07-13T00:00:00+02:00https://blogs.igalia.com/dpiliaiev/adreno-lrz<aside class="sidebar__right sticky">
<nav class="toc">
<header>
<h4 class="nav__title"><i class="fas fa-file-alt"></i> Table of Contents </h4>
</header>
<ol id="markdown-toc">
<li><a href="#what-is-lrz" id="markdown-toc-what-is-lrz">What is LRZ?</a></li>
<li><a href="#do-not-change-the-direction-of-depth-comparisons" id="markdown-toc-do-not-change-the-direction-of-depth-comparisons">Do not change the direction of depth comparisons</a></li>
<li><a href="#simple-rules-for-fragment-shader" id="markdown-toc-simple-rules-for-fragment-shader">Simple rules for fragment shader</a> <ol>
<li><a href="#do-not-write-depth" id="markdown-toc-do-not-write-depth">Do not write depth</a></li>
<li><a href="#do-not-use-blendinglogic-opscolorwritemask" id="markdown-toc-do-not-use-blendinglogic-opscolorwritemask">Do not use Blending/Logic OPs/colorWriteMask</a></li>
<li><a href="#do-not-have-side-effects-in-fragment-shaders" id="markdown-toc-do-not-have-side-effects-in-fragment-shaders">Do not have side-effects in fragment shaders</a></li>
<li><a href="#do-not-discard-fragments" id="markdown-toc-do-not-discard-fragments">Do not discard fragments</a></li>
</ol>
</li>
<li><a href="#lrz-in-secondary-command-buffers-and-dynamic-rendering" id="markdown-toc-lrz-in-secondary-command-buffers-and-dynamic-rendering">LRZ in secondary command buffers and dynamic rendering</a></li>
<li><a href="#reusing-lrz-between-renderpasses" id="markdown-toc-reusing-lrz-between-renderpasses">Reusing LRZ between renderpasses</a></li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ol>
</nav>
</aside>
<h2 id="what-is-lrz">What is LRZ?</h2>
<p>Citing official Adreno documentation:</p>
<blockquote>
<p>[A Low Resolution Z (LRZ)] pass is also referred to as draw order independent depth rejection. During the binning pass, a low resolution Z-buffer is constructed, and can reject LRZ-tile wide contributions to boost binning performance. This LRZ is then used during the rendering pass to reject pixels efficiently before testing against the full resolution Z-buffer.</p>
</blockquote>
<p>My colleague Samuel Iglesias did the initial reverse-engineering of this feature; for its in-depth overview you could read his great blog post <a href="https://blogs.igalia.com/siglesias/2021/04/19/low-resolution-z-buffer-support-on-turnip/">Low Resolution Z Buffer support on Turnip</a>.</p>
<p>Here are a few excerpts from this post describing <strong>what is LRZ</strong>?</p>
<blockquote>
<p>To understand better how LRZ works, we need to talk a bit about tiled-based rendering. This is a way of rendering based on subdividing the framebuffer in tiles and rendering each tile separately.</p>
</blockquote>
<blockquote>
<p>The binning pass processes the geometry of the scene and records in a table on which tiles a primitive will be rendered. By doing this, the HW only needs to render the primitives that affect a specific tile when is processed.</p>
</blockquote>
<blockquote>
<p>The rendering pass gets the rasterized primitives and executes all the fragment related processes of the pipeline. Once it finishes, the resolve pass starts.</p>
</blockquote>
<blockquote>
<p>Where is LRZ used then? Well, in both binning and rendering passes. In the binning pass, it is possible to store the depth value of each vertex of the geometries of the scene in a buffer as the HW has that data available. That is the depth buffer used internally for LRZ. It has lower resolution as too much detail is not needed, which helps to save bandwidth while transferring its contents to system memory.</p>
</blockquote>
<blockquote>
<p>Thanks to LRZ, the rendering pass is only executed on the fragments that are going to be visible at the end.</p>
</blockquote>
<blockquote>
<p>LRZ brings a couple of things on the table that makes it interesting. One is that applications don’t need to reorder their primitives before submission to be more efficient, that is done by the HW with LRZ automatically.</p>
</blockquote>
<p>Now, a year later, I returned to this feature to make some important improvements, for nitty-gritty details you could dive into <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/16251">Mesa MR#16251 “tu: Overhaul LRZ, implement on-GPU dir tracking and LRZ fast-clear”</a>. There I implemented on-GPU LRZ direction tracking, LRZ reuse between renderpasses, and fast-clear of LRZ.</p>
<p>In this post I want to give practical advice, based on things I learnt while reverse-engineering this feature, on how to help the driver enable LRZ. Some of it could be self-evident, some is already written in the <a href="https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/overview.html#lrz">official docs</a>, and some cannot be found there. It should be applicable to Vulkan, GLES, and likely Direct3D.</p>
<h2 id="do-not-change-the-direction-of-depth-comparisons">Do not change the direction of depth comparisons</h2>
<p>Or rather, when <strong>writing</strong> depth - do <strong>not</strong> change the direction of depth comparisons. If depth comparison direction is changed while writing into depth buffer - LRZ would have to be disabled.</p>
<p>Why? Because if depth comparison direction is <code class="language-plaintext highlighter-rouge">GREATER</code> - LRZ stores the <strong>lowest</strong> depth value of the block of pixels, if direction is <code class="language-plaintext highlighter-rouge">LESS</code> - it stores the <strong>highest</strong> value of the block. So if direction is changed the LRZ value becomes wrong for the new direction.</p>
<p>A few examples:</p>
<ul>
<li>:thumbsup: Going from <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_GREATER</code> -> <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_GREATER_OR_EQUAL</code> is good;</li>
<li>:x: Going from <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_GREATER</code> -> <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_LESS</code> is bad;</li>
<li>:neutral_face: From <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_GREATER</code> with depth write -> <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_LESS</code> without depth write is ok;
<ul>
<li>LRZ would just be temporarily disabled for VK_COMPARE_OP_LESS draw calls.</li>
</ul>
</li>
</ul>
<p>The rules could be summarized as:</p>
<ul>
<li>Changing depth write direction disables LRZ;</li>
<li>For calls with a different direction but without depth write, LRZ is temporarily disabled;</li>
<li><code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_GREATER</code> and <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_GREATER_OR_EQUAL</code> have same direction;</li>
<li><code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_LESS</code> and <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_LESS_OR_EQUAL</code> have same direction;</li>
<li><code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_EQUAL</code> and <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_NEVER</code> don’t have a direction, LRZ is temporarily disabled;
<ul>
<li>Surprise, your <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_EQUAL</code> compares don’t benefit from LRZ;</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_ALWAYS</code> and <code class="language-plaintext highlighter-rouge">VK_COMPARE_OP_NOT_EQUAL</code> either temporarily or completely disable LRZ, depending on whether depth is being written.</li>
</ul>
<h2 id="simple-rules-for-fragment-shader">Simple rules for fragment shader</h2>
<h3 id="do-not-write-depth">Do not write depth</h3>
<p>This obviously makes the resulting depth value unpredictable, so LRZ has to be completely disabled.</p>
<p>Note that the output values of manually written depth could be bounded by a conservative depth modifier; for GLSL this is achieved by the GL_ARB_conservative_depth extension, like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>layout (depth_greater) out float gl_FragDepth;
</code></pre></div></div>
<p>However, Turnip at the moment does not consider this hint, and it is unknown if Qualcomm’s proprietary driver does.</p>
<h3 id="do-not-use-blendinglogic-opscolorwritemask">Do not use Blending/Logic OPs/colorWriteMask</h3>
<p>All of them make a new fragment value depend on the old fragment value. LRZ is <strong>temporarily</strong> disabled in this case.</p>
<h3 id="do-not-have-side-effects-in-fragment-shaders">Do not have side-effects in fragment shaders</h3>
<p>Writing to SSBOs, images, … from a fragment shader forces late Z, so it is incompatible with LRZ. At the moment Turnip completely disables LRZ when a shader has such side-effects.</p>
<h3 id="do-not-discard-fragments">Do not discard fragments</h3>
<p>Discarding fragments moves the decision whether a fragment contributes to the depth buffer to the time of fragment shader execution. LRZ is <strong>temporarily</strong> disabled in this case.</p>
<h2 id="lrz-in-secondary-command-buffers-and-dynamic-rendering">LRZ in secondary command buffers and dynamic rendering</h2>
<p><strong>TLDR</strong>: Since Snapdragon 865 (Adreno 650), LRZ is supported in secondary command buffers.</p>
<p><strong>TLDR</strong>: LRZ would work with <code class="language-plaintext highlighter-rouge">VK_KHR_dynamic_rendering</code>, but you may want to avoid this extension because it isn’t nice to tilers.</p>
<hr />
<p>Official docs state that LRZ is disabled with “Use of secondary command buffers (Vulkan)”, and on another page that “Snapdragon 865 and newer will not disable LRZ based on this criteria”.</p>
<p>Why?</p>
<p>Because up to Snapdragon 865, tracking of the direction is done on the CPU, meaning that the LRZ direction is kept in an internal renderpass object, updated and checked without any GPU involvement.</p>
<p>But starting from Snapdragon 865 the direction could be tracked on the GPU, which allows the driver not to know the previous LRZ direction during command buffer construction. Therefore secondary command buffers can now use LRZ!</p>
<hr />
<p>Recently Vulkan 1.3 came out and mandated the support of <code class="language-plaintext highlighter-rouge">VK_KHR_dynamic_rendering</code>. It gets rid of complicated <code class="language-plaintext highlighter-rouge">VkRenderpass</code> and <code class="language-plaintext highlighter-rouge">VkFramebuffer</code> setup, but much more exciting is a simpler way for parallel renderpasses construction (with <code class="language-plaintext highlighter-rouge">VK_RENDERING_SUSPENDING_BIT</code> / <code class="language-plaintext highlighter-rouge">VK_RENDERING_RESUMING_BIT</code> flags).</p>
<p><code class="language-plaintext highlighter-rouge">VK_KHR_dynamic_rendering</code> poses a similar challenge for LRZ as secondary command buffers and has the same solution.</p>
<h2 id="reusing-lrz-between-renderpasses">Reusing LRZ between renderpasses</h2>
<p><strong>TLDR</strong>: Since Snapdragon 865 (Adreno 650), LRZ works if you store depth in one renderpass and load it later, given the depth image isn’t changed in between.</p>
<hr />
<p>Another major improvement brought by Snapdragon 865 is the possibility to reuse LRZ state between renderpasses.</p>
<p>The on-GPU direction tracking is part of the equation here; another part is tracking which depth view is being used. A depth image has a single LRZ buffer which corresponds to a <code class="language-plaintext highlighter-rouge">single array layer</code> + <code class="language-plaintext highlighter-rouge">single mip level</code> of the image. So if a view with a different array layer or mip level is used - the LRZ state can’t be reused and will be invalidated.</p>
<p>With the above knowledge here are the conditions when LRZ state could be reused:</p>
<ul>
<li>Depth attachment was stored (<code class="language-plaintext highlighter-rouge">STORE_OP_STORE</code>) at the end of some past renderpass;</li>
<li>The same depth attachment with the same depth view settings is being loaded (not cleared) in the current renderpass;</li>
<li>There were no changes in the underlying depth image, meaning there was no <code class="language-plaintext highlighter-rouge">vkCmdBlitImage*</code>, <code class="language-plaintext highlighter-rouge">vkCmdCopyBufferToImage*</code>, or <code class="language-plaintext highlighter-rouge">vkCmdCopyImage*</code>. Otherwise LRZ state would be invalidated;</li>
</ul>
<p>Misc notes:</p>
<ul>
<li>LRZ state is saved per depth image, so you don’t lose the state if you have several renderpasses with different depth attachments;</li>
<li><code class="language-plaintext highlighter-rouge">vkCmdClearAttachments</code> + <code class="language-plaintext highlighter-rouge">LOAD_OP_LOAD</code> is just equal to <code class="language-plaintext highlighter-rouge">LOAD_OP_CLEAR</code>.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>While there are many rules listed above - it all boils down to keeping things simple in the main renderpass(es) and not being too clever.</p>Danylo PiliaievLow-resolution-Z (LRZ) is an important optimization but it's easy to accidentally disable it if you don't know the limitations.Graphics Flight Recorder - unknown but handy tool to debug GPU hangs2022-01-04T00:00:00+01:002022-01-04T00:00:00+01:00https://blogs.igalia.com/dpiliaiev/google-flight-recorder<p>It appears that Google created a handy tool that helps finding the command which causes a GPU hang/crash. It is called <a href="https://github.com/googlestadia/gfr">Graphics Flight Recorder (GFR)</a> and was open-sourced a year ago but didn’t receive any attention. From the readme:</p>
<blockquote>
<p>The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.</p>
</blockquote>
<p>It requires <code class="language-plaintext highlighter-rouge">VK_AMD_buffer_marker</code> support; however, this extension is rather trivial to implement - I had only <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13553">to copy-paste the code</a> from our <code class="language-plaintext highlighter-rouge">vkCmdSetEvent</code> implementation and that was it. Note, at the moment of writing, <a href="https://github.com/googlestadia/gfr/issues/5">GFR unconditionally uses</a> <code class="language-plaintext highlighter-rouge">VK_AMD_device_coherent_memory</code>, which could be manually patched out for it to run on other GPUs.</p>
<p>GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
- # Command:
id: 6/9
markerValue: 0x000A0006
name: vkCmdBindPipeline
state: [SUBMITTED_EXECUTION_COMPLETE]
parameters:
- # parameter:
name: commandBuffer
value: 0x000000558CFD2A10
- # parameter:
name: pipelineBindPoint
value: 1
- # parameter:
name: pipeline
value: 0x000000558D3D6750
- # Command:
id: 6/9
message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
- # Command:
id: 7/9
markerValue: 0x000A0007
name: vkCmdDispatch
state: [SUBMITTED_EXECUTION_INCOMPLETE]
parameters:
- # parameter:
name: commandBuffer
value: 0x000000558CFD2A10
- # parameter:
name: groupCountX
value: 5
- # parameter:
name: groupCountY
value: 1
- # parameter:
name: groupCountZ
value: 1
internalState:
pipeline:
vkHandle: 0x000000558D3D6750
bindPoint: compute
shaderInfos:
- # shaderInfo:
stage: cs
module: (0x000000558F82B2A0)
entry: "main"
descriptorSets:
- # descriptorSet:
index: 0
set: 0x000000558E498728
- # Command:
id: 8/9
markerValue: 0x000A0008
name: vkCmdPipelineBarrier
state: [SUBMITTED_EXECUTION_NOT_STARTED]
...
</code></pre></div></div>
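<p>The structure above can be scanned mechanically. Here is a toy Python sketch (not GFR code - it only assumes the simplified YAML-like layout shown in this post) that pulls out the first command whose state is not complete:</p>

```python
import re

def first_incomplete_command(log_text):
    """Find the first command whose state is not
    SUBMITTED_EXECUTION_COMPLETE in a GFR-style log.

    Toy parser: assumes 'name:' lines belong to commands, unlike
    real GFR output where parameters also have 'name:' lines.
    """
    name = None
    for line in log_text.splitlines():
        m = re.match(r"\s*name:\s*(\S+)", line)
        if m:
            name = m.group(1)
        m = re.match(r"\s*state:\s*\[(\w+)\]", line)
        if m and name and m.group(1) != "SUBMITTED_EXECUTION_COMPLETE":
            return name, m.group(1)
    return None

log = """\
- # Command:
  name: vkCmdBindPipeline
  state: [SUBMITTED_EXECUTION_COMPLETE]
- # Command:
  name: vkCmdDispatch
  state: [SUBMITTED_EXECUTION_INCOMPLETE]
- # Command:
  name: vkCmdPipelineBarrier
  state: [SUBMITTED_EXECUTION_NOT_STARTED]
"""
print(first_incomplete_command(log))
# ('vkCmdDispatch', 'SUBMITTED_EXECUTION_INCOMPLETE')
```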
<p>After confirming that the corresponding <code class="language-plaintext highlighter-rouge">vkCmdDispatch</code> is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader this is relatively easy to do, since all you need is to save the decompiled shader and the buffers it uses. Luckily, in both cases these Amber tests reproduced the hangs.</p>
<p>With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14044">MR#14044</a> for “Alien: Isolation” and <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14110">MR#14110</a> for “Digital Combat Simulator”.</p>
<p>Unfortunately this tool is not a panacea:</p>
<ul>
<li>It likely would fail to help with unrecoverable hangs where it would be impossible to read the completion tags back.</li>
<li>Or when the mere addition of the tags “fixes” the issue, which can happen with synchronization bugs.</li>
<li>If draw/dispatch calls run in parallel on the GPU, writing the tags may force them to execute sequentially, or the tags may end up imprecise.</li>
</ul>
<p>Anyway, it’s easy to use so you should give it a try.</p>Danylo Piliaiev:tada: Turnip is Vulkan 1.1 Conformant :tada:2021-12-03T00:00:00+01:00https://blogs.igalia.com/dpiliaiev/turnip-1-1-conformance<p><a href="https://www.khronos.org/conformance/adopters/conformant-products/vulkan#submission_598">Khronos submission</a> indicating Vulkan 1.1 conformance for Turnip on Adreno 618 GPU.</p>
<p>It is a great feat, especially for a driver which is created without hardware documentation. And we <a href="https://vulkan.gpuinfo.org/displayreport.php?id=13100#device">support features far from the bare minimum required for conformance</a>.</p>
<p>But first of all, I want to thank and congratulate everyone working on the driver: Connor Abbott, Rob Clark, Emma Anholt, Jonathan Marek, Hyunjun Ko, Samuel Iglesias. And special thanks to Samuel Iglesias and Ricardo Garcia for tirelessly improving Khronos Vulkan Conformance Tests.</p>
<hr />
<p>At the start of the year, when I started working on Turnip, I looked at the <a href="https://gitlab.freedesktop.org/mesa/mesa/-/blob/5331b1d9456e674751ffe0d68c08e0c6d3ea0d17/.gitlab-ci/deqp-freedreno-a630-fails.txt#L14-59">list of failing tests</a> and thought “It wouldn’t take a lot to fix them!”, right, sure… And so I started fixing issues alongside of looking for missing features.</p>
<p>In June there were <a href="https://gitlab.freedesktop.org/mesa/mesa/-/blob/ea5707c52f5e45f16ef554ee3d9fe2d3f318dae5/src/freedreno/ci/deqp-freedreno-a630-fails.txt#L12-120">even more failures</a> than there were in January - how could that be? Of course we were adding new features, which accounted for some of them. However, even this list was likely not exhaustive, because in GitLab CI, instead of running the whole Vulkan CTS suite, we ran 1/3 of it. We didn’t have enough devices to run the whole suite fast enough to make it usable in CI, so I just ran it locally from time to time.</p>
<p>1/3 of the tests doesn’t sound bad, and for the most part it’s good enough, since we have a huge number of tests looking like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_clear_texture_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_copy_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_load_format_list
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture
dEQP-VK.image.mutable.2d_array.b8g8r8a8_unorm_r32_sfloat_copy_texture_format_list
...
</code></pre></div></div>
<p>Every format, every operation, etc. Tens of thousands of them.</p>
<p>Unfortunately the selection of tests for a fractional run is as straightforward as possible - just every third test. This bites us when there are single unique tests, like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_stencil_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_depth_no_attachment
dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil_no_attachment
...
</code></pre></div></div>
<p>Most of them test something unique and have a much higher probability of triggering a special path in the driver compared to the countless image tests. And they fell through the cracks. I even had to fix one test twice because the CI didn’t run it.</p>
<p>A possible solution is to skip tests only when there is a large swath of them and run smaller groups as-is. But it’s likely more productive to just throw more hardware at the issue =).</p>
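<p>The group-aware selection idea can be sketched in a few lines of Python (a hypothetical scheme, not what our CI actually did; the group-size threshold is made up):</p>

```python
from collections import defaultdict

def select_tests(tests, fraction=3, small_group=10):
    """Keep every test from small groups, but only every
    `fraction`-th test from large ones (toy selection scheme)."""
    groups = defaultdict(list)
    for t in tests:
        # Group by everything up to the last dot-separated component.
        groups[t.rsplit(".", 1)[0]].append(t)
    selected = []
    for members in groups.values():
        if len(members) <= small_group:
            selected.extend(members)
        else:
            selected.extend(members[::fraction])
    return selected

image_tests = [f"dEQP-VK.image.mutable.case_{i}" for i in range(30)]
unique_tests = [
    "dEQP-VK.fragment_operations.early_fragment.no_early_fragment_tests_depth",
    "dEQP-VK.fragment_operations.early_fragment.early_fragment_tests_stencil",
]
picked = select_tests(image_tests + unique_tests)
# All unique tests survive; only 10 of the 30 image tests are kept.
```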
<h2 id="not-enough-hardware-in-ci">Not enough hardware in CI</h2>
<p>Another trouble is that we had only one 6xx sub-generation present in CI - Adreno 630. We distinguish four sub-generations. Not only do they have different capabilities, there are also differences in the existing ones, causing the same test to pass in CI while being broken on another, newer GPU. Presently in CI we test only Adreno 618 and 630, which are “Gen 1” GPUs, and we claimed conformance only for Adreno 618.</p>
<p>Yet another issue is that we can render in either tiling or bypass (sysmem) mode. That’s because there are a few features we can support only when there is no tiling and we render directly into sysmem, and sometimes rendering directly into sysmem is just faster. At the moment we use tiling rendering by default unless we hit an edge case, so by default CTS exercises only the tiling path.</p>
<p>We are forcing sysmem mode for a subset of tests on CI; however, it’s not enough, because the difference between modes is relevant for more than just a few tests. Thus ideally we should run twice as many tests, and even better thrice as many, to account for the tiling mode without a binning vertex shader.</p>
<p>That issue became apparent when I implemented a magical eight-ball to choose between tiling and bypass modes depending on run-time information in order to squeeze out more performance (it’s still work-in-progress). The basic idea is that it is faster to render a single draw call, or a few small ones, directly into system memory instead of loading the framebuffer into tile memory and storing it back afterwards. But almost every single CTS test does exactly this! They do a single draw call or a few of them per render pass, which causes all tests to run in bypass mode. Fun!</p>
<p>Now we would be forced to deal with this issue, since with the magic eight-ball games would run partly in tiling mode and partly in bypass, making both modes equally important for real-world workloads.</p>
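<p>The magic eight-ball above can be illustrated with a simple cost comparison. This is purely a toy model with made-up constants - the real heuristic in the driver works on run-time information and hardware-specific costs:</p>

```python
def choose_render_mode(num_draws, avg_draw_pixels, fb_pixels,
                       load_store_cost_per_pixel=2.0):
    """Pick sysmem (bypass) vs tiling for a render pass.

    Toy cost model: tiling pays a fixed load/store of the whole
    framebuffer through tile memory; sysmem pays for the raw
    bandwidth of the draws themselves. All constants are made up.
    """
    tiling_overhead = fb_pixels * load_store_cost_per_pixel
    draw_work = num_draws * avg_draw_pixels
    return "sysmem" if draw_work < tiling_overhead else "tiling"

# A render pass with a couple of small draws: skip the tile pass.
print(choose_render_mode(num_draws=2, avg_draw_pixels=10_000,
                         fb_pixels=1920 * 1080))          # sysmem
# A heavy pass with lots of overdraw: tiling wins.
print(choose_render_mode(num_draws=500, avg_draw_pixels=50_000,
                         fb_pixels=1920 * 1080))          # tiling
```

<p>Note that a typical CTS test lands squarely in the first branch, which is exactly why the eight-ball pushes nearly every test into bypass mode.</p>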
<h2 id="does-conformance-matter-does-it-reflect-anything-real-world">Does conformance matter? Does it reflect anything real-world?</h2>
<p>Unfortunately, no test suite can wholly reflect what game developers do in their games. However, the number of tests keeps growing, and new tests are contributed based on issues found in games and other applications.</p>
<p>When I ran my stash of D3D11 game traces through DXVK on Turnip for the first time, I found a bunch of new crashes and hangs - but it took fixing just a few of them for the majority of games to render correctly. This shows that the Khronos Vulkan Conformance Tests are doing their job, and we at Igalia are striving to make them even better.</p>Danylo PiliaievThe Khronos Group has granted Vulkan 1.1 conformance to the open source Adreno GPU driverTesting Vulkan drivers with games that cannot run on the target device2021-08-12T00:00:00+02:00https://blogs.igalia.com/dpiliaiev/gfxreconstruct-test-mobile-gpus<!--more-->
<p>Here I’m playing “Spelunky 2” on my laptop and simultaneously replaying the same Vulkan calls on an ARM board with Adreno GPU running the open source Turnip Vulkan driver. Hint: it’s an x64 Windows game that doesn’t run on ARM.</p>
<video autoplay="" loop="" controls="" style="display:block; width:100%; height:auto;">
<source src="../assets/video/gfxreconstruct-test-mobile-gpus/spelunky_2_remote_vk_replay.webm" type="video/webm" />
</video>
<p class="text-center">The bottom right is the game I’m playing on my laptop, the top left is GFXReconstruct immediately replaying Vulkan calls from the game on ARM board.</p>
<p>How is it done? And why would it be useful for debugging? Read below!</p>
<hr />
<p>:exclamation::exclamation::exclamation: In 2022 <code class="language-plaintext highlighter-rouge">VK_LAYER_LUNARG_device_simulation</code> is deprecated, use <a href="https://github.com/KhronosGroup/Vulkan-Profiles">Vulkan-Profiles</a> :exclamation::exclamation::exclamation:</p>
<p>Latest <code class="language-plaintext highlighter-rouge">vulkaninfo -j</code> now outputs json that is compatible with Vulkan-Profiles. You’d need to obtain profiles for both host and remote GPUs, then you have to intersect their capabilities. It could be done like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python ./combine_profiles.py \
-profile_path2 /path/to/VP_VULKANINFO_Turnip_Adreno_TM_660_22_1_99.json \
-profile2 "VP_VULKANINFO_Turnip_Adreno_TM_660_22_1_99" \
-profile_path /path/to/VP_VULKANINFO_AMD_RADV_RENOIR_22_1_2.json \
-profile "VP_VULKANINFO_AMD_RADV_RENOIR_22_1_2" \
-output_path /path/to/VP_VULKANINFO_Turnip_Radeon_intersection_v2.json
-registry /path/to/Vulkan-Headers/registry/vk.xml
-mode intersection
</code></pre></div></div>
<p>Then used by the layer:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_profiles
export VK_KHRONOS_PROFILES_PROFILE_FILE="/path/to/VP_VULKANINFO_Turnip_Radeon_intersection_v2.json"
export VK_KHRONOS_PROFILES_PROFILE_VALIDATION=true
export VK_KHRONOS_PROFILES_SIMULATE_CAPABILITIES=SIMULATE_FEATURES_BIT,SIMULATE_PROPERTIES_BIT,SIMULATE_EXTENSIONS_BIT,SIMULATE_FORMATS_BIT
</code></pre></div></div>
<hr />
<p>Debugging issues a driver faces with real-world applications requires the ability to capture and replay graphics API calls. However, for mobile GPUs it becomes even more challenging, since for a Vulkan driver the main “source” of real-world workloads is x86-64 apps that run via Wine + DXVK - mainly games which were made for desktop x86-64 Windows and do not run on ARM. Efforts are being made to run these apps on ARM but they are still work-in-progress. And we want to test the drivers NOW.</p>
<p>The obvious solution would be to run those applications on an x86-64 machine capturing all Vulkan calls. Then replaying those calls on a second machine where we cannot run the app. This way it would be possible to test the driver even without running the application directly on it.</p>
<p>The main trouble is that Vulkan calls made on one GPU + Driver combo are not generally compatible with another GPU + Driver combo, sometimes even within one GPU vendor. There are different memory capabilities (<a href="https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/VkPhysicalDeviceMemoryProperties.html">VkPhysicalDeviceMemoryProperties</a>), different memory requirements for buffers and images, different extensions available, and different optional features supported. It is easier with OpenGL, but there are also some incompatibilities there.</p>
<p>There are two open-source vendor-agnostic tools for capturing Vulkan calls: <a href="https://github.com/baldurk/renderdoc">RenderDoc</a> (captures single frame) and <a href="https://github.com/LunarG/gfxreconstruct">GFXReconstruct</a> (captures multiple frames). RenderDoc at the moment isn’t suitable for the task of capturing applications on desktop GPUs and replaying on mobile because it doesn’t translate memory type and requirements (<a href="https://github.com/baldurk/renderdoc/issues/814">see issue #814</a>). GFXReconstruct on the other hand has the necessary features for this.</p>
<p>I’ll show a couple of tricks with GFXReconstruct I’m using to test things on Turnip.</p>
<hr />
<h3 id="capturing-with-gfxreconstruct">Capturing with GFXReconstruct</h3>
<p>At this point you either have the application itself or, if it doesn’t use Vulkan, a trace of its calls that can be translated to Vulkan. There is a <a href="https://github.com/LunarG/gfxreconstruct/blob/dev/USAGE_desktop.md">detailed instruction</a> on how to use GFXReconstruct to capture a trace on a desktop OS. However, there is no clear instruction for how to do this on Android (see <a href="https://github.com/LunarG/gfxreconstruct/issues/534">issue #534</a>); fortunately, there is <a href="https://developer.android.com/ndk/guides/graphics/validation-layer#layers-local-storag">one</a> in Android’s documentation:</p>
<details>
<summary style="cursor: pointer">
Android how-to (click me)
</summary>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>For Android 9 you should copy layers to the application which will be traced
For Android 10+ it's easier to copy them to com.lunarg.gfxreconstruct.replay
You should have userdebug build of Android or probably rooted Android
# Push GFXReconstruct layer to the device
adb push libVkLayer_gfxreconstruct.so /sdcard/
# Since there is no APK for the capture layer,
# copy the layer to e.g. folder of com.lunarg.gfxreconstruct.replay
adb shell run-as com.lunarg.gfxreconstruct.replay cp /sdcard/libVkLayer_gfxreconstruct.so .
# Enable layers
adb shell settings put global enable_gpu_debug_layers 1
# Specify target application
adb shell settings put global gpu_debug_app <package_name>
# Specify layer list (from top to bottom)
adb shell settings put global gpu_debug_layers VK_LAYER_LUNARG_gfxreconstruct
# Specify packages to search for layers
adb shell settings put global gpu_debug_layer_app com.lunarg.gfxreconstruct.replay
</code></pre></div> </div>
<p>If the target application doesn’t have rights to write into external storage - you should change where the capture file is created:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adb shell "setprop debug.gfxrecon.capture_file '/data/data/<target_app_folder>/files/'"
</code></pre></div> </div>
</details>
<p><br />
However, when you try to replay the trace on another GPU - most likely it will result in an error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[gfxrecon] FATAL - API call vkCreateDevice returned error value VK_ERROR_EXTENSION_NOT_PRESENT that does not match the result from the capture file: VK_SUCCESS. Replay cannot continue.
Replay has encountered a fatal error and cannot continue: the specified extension does not exist
</code></pre></div></div>
<p>Or other errors/crashes. Fortunately, we can limit the capabilities of the desktop GPU with <a href="https://github.com/LunarG/VulkanTools/blob/master/layersvt/device_simulation.md">VK_LAYER_LUNARG_device_simulation</a>.</p>
<p>When simulating another GPU, VK_LAYER_LUNARG_device_simulation should be told to intersect the capabilities of both GPUs, making the capture compatible with both of them. This can be achieved with recently added environment variables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">whitelist</code> name is rather confusing because it essentially means “intersection”.</p>
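<p>What that “whitelist”/intersection means is easiest to see on the extension list. A hypothetical illustration (the layer derives this from the json capability files, not from hard-coded lists like these):</p>

```python
def intersect_extensions(host_exts, target_exts):
    """Report only extensions both GPUs support, preserving host order."""
    target = set(target_exts)
    return [e for e in host_exts if e in target]

host = ["VK_KHR_swapchain", "VK_EXT_descriptor_indexing", "VK_KHR_ray_query"]
adreno = ["VK_KHR_swapchain", "VK_EXT_descriptor_indexing"]
print(intersect_extensions(host, adreno))
# ['VK_KHR_swapchain', 'VK_EXT_descriptor_indexing']
```

<p>A capture made against the intersected capabilities can then be replayed on either GPU, since neither side ever sees an extension the other lacks.</p>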
<p>One would also need a json file which describes the target GPU’s capabilities; this is obtained by running:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vulkaninfo -j &> <device_name>.json
</code></pre></div></div>
<p>The final command to capture a trace would be:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_gfxreconstruct:VK_LAYER_LUNARG_device_simulation \
VK_DEVSIM_FILENAME=<device_name>.json \
VK_DEVSIM_MODIFY_EXTENSION_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_LIST=whitelist \
VK_DEVSIM_MODIFY_FORMAT_PROPERTIES=whitelist \
<the_app>
</code></pre></div></div>
<h3 id="replaying-with-gfxreconstruct">Replaying with GFXReconstruct</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gfxrecon-replay -m rebind --skip-failed-allocations <trace_name>.gfxr
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">-m</code> Enable memory translation for replay on GPUs with memory types that are not compatible with the capture GPU’s
<ul>
<li><code class="language-plaintext highlighter-rouge">rebind</code> Change memory allocation behavior based on resource usage and replay memory properties. Resources may be bound to different allocations with different offsets.</li>
</ul>
</li>
<li><code class="language-plaintext highlighter-rouge">--skip-failed-allocations</code> skip vkAllocateMemory, vkAllocateCommandBuffers, and vkAllocateDescriptorSets calls that failed during capture</li>
</ul>
<p>Without these options replay would fail.</p>
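<p>Conceptually, <code class="language-plaintext highlighter-rouge">-m rebind</code> has to map every capture-time memory type onto a compatible type of the replay device. A simplified sketch of that matching (GFXReconstruct’s real logic also handles offsets, sizes, and fallbacks; the flag names here are stand-ins for the Vulkan property bits):</p>

```python
def find_compatible_memory_type(required_flags, replay_types):
    """Return the index of the first replay memory type whose
    property flags cover everything the capture-time type had."""
    for i, flags in enumerate(replay_types):
        if required_flags <= flags:  # set inclusion: all required props present
            return i
    return None  # no compatible type -> allocation must be handled differently

DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT = "dl", "hv", "hc"
capture_type = {HOST_VISIBLE, HOST_COHERENT}
replay_types = [{DEVICE_LOCAL},
                {DEVICE_LOCAL, HOST_VISIBLE, HOST_COHERENT}]
print(find_compatible_memory_type(capture_type, replay_types))  # 1
```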
<p>Now you could easily test any app/game on your ARM board, if you have enough RAM =) I even successfully ran a capture of “Metro Exodus” on Turnip.</p>
<h4 id="but-what-if-i-want-to-test-something-that-requires-interactivity">But what if I want to test something that requires interactivity?</h4>
<p>Or maybe you don’t want to save a huge trace on disk - it could grow to tens of gigabytes if the application runs for a considerable amount of time.</p>
<p>During recording, GFXReconstruct just appends calls to a file; there are no additional post-processing steps. Given that, the next logical step is to skip writing to disk entirely and send the Vulkan calls over the network!</p>
<p>This would allow us to interact with the application and immediately see the results on another device with different GPU. And so I <a href="https://github.com/werman/gfxreconstruct/commits/feature/network-realtime-replay">hacked together</a> a crude support of over-the-network replay.</p>
<p>The only difference with ordinary tracing is that now instead of file we have to specify a network address of the target device:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>VK_LAYER_PATH=<path/to/device-simulation-layer>:<path/to/gfxreconstruct-layer> \
...
GFXRECON_CAPTURE_FILE="<ip>:<port>" \
<the_app>
</code></pre></div></div>
<p>And on the target device:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>while true; do gfxrecon-replay -m rebind --sfa ":<port>"; done
</code></pre></div></div>
<p>Why <code class="language-plaintext highlighter-rouge">while true</code>? It is common for DXVK to call <code class="language-plaintext highlighter-rouge">vkCreateInstance</code> several times, leading to the creation of several traces. When replaying over the network we therefore want <code class="language-plaintext highlighter-rouge">gfxrecon-replay</code> to immediately restart when one trace ends, so it is ready for the next one.</p>
<p>You may want to bring the FPS down to match the capabilities of the lower-power GPU in order to prevent constant hiccups. This can be done either with <a href="https://gitlab.com/torkel104/libstrangle"><code class="language-plaintext highlighter-rouge">libstrangle</code></a> or with <a href="https://github.com/flightlessmango/MangoHud"><code class="language-plaintext highlighter-rouge">mangohud</code></a>:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">stranglevk -f 10</code></li>
<li><code class="language-plaintext highlighter-rouge">MANGOHUD_CONFIG=fps_limit=10 mangohud</code></li>
</ul>
<p>You have seen the result at the start of the post.</p>Danylo PiliaievUsing GFXReconstruct to test any Vulkan x86-64 game on a sufficiently capable mobile GPU.Turnips in the wild (Part 2)2021-05-06T00:00:00+02:00https://blogs.igalia.com/dpiliaiev/turnips-in-the-wild-part-2<p>In <a href="/dpiliaiev/turnips-in-the-wild-part-1/">Turnips in the wild (Part 1)</a> we walked through two issues, one in TauCeti Benchmark and the other in Genshin Impact. Today, I have an update about the one I didn’t plan to fix, and a showcase of two remaining issues I met in Genshin Impact.</p>
<!--more-->
<aside class="sidebar__right sticky">
<nav class="toc">
<header>
<h4 class="nav__title"><i class="fas fa-file-alt"></i> Table of Contents </h4>
</header>
<ol id="markdown-toc">
<li><a href="#genshin-impact" id="markdown-toc-genshin-impact">Genshin Impact</a> <ol>
<li><a href="#gameplay--disco-water" id="markdown-toc-gameplay--disco-water">Gameplay – Disco Water</a></li>
<li><a href="#login-screen" id="markdown-toc-login-screen">Login Screen</a></li>
<li><a href="#gameplay--where-did-the-trees-go" id="markdown-toc-gameplay--where-did-the-trees-go">Gameplay – Where did the trees go?</a></li>
</ol>
</li>
</ol>
</nav>
</aside>
<h2 id="genshin-impact">Genshin Impact</h2>
<h3 id="gameplay--disco-water">Gameplay – Disco Water</h3>
<p>In the previous post I said that I’m not planning to fix the broken water effect since it relied on undefined behavior.</p>
<p>As a refresher, the issue was caused by the fragment shader not writing anything into the second attachment, which is undefined behavior in Vulkan:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UNASSIGNED-CoreValidation-Shader-InputNotProduced(WARN / SPEC): Validation Warning:
Attachment 1 not written by fragment shader; undefined values will be written to attachment
</code></pre></div></div>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_disco_water.jpg" alt="Screenshot of the gameplay with body of water that has large colorful artifacts" /><figcaption>
</figcaption></figure>
<p>However, I was notified that the same issue had been fixed in the OpenGL driver for Adreno (Freedreno) and that the fix is rather easy. Even though for Vulkan it is clearly undefined behavior, with other APIs it might not be so clear. Thus, given that we want to support translation from other APIs, that there are already apps which rely on this behavior, and that it would even be a bit more performant - I made a fix for it.</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_disco_water_fixed.jpg" alt="Screenshot of the gameplay with body of water without artifacts" /><figcaption>
</figcaption></figure>
<p>The issue was fixed by <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/10489">“tu: do not corrupt unwritten render targets (!10489)”</a></p>
<h3 id="login-screen">Login Screen</h3>
<p>The login screen welcomes us with not-so-healthy colors:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_first_screen_bad_colors.jpg" alt="Screenshot of a login screen in Genshin Impact which has wrong colors - columns and road are blue and white" /><figcaption>
</figcaption></figure>
<p>And with a few failures to allocate registers in the logs. The failure to allocate registers isn’t good and may cause some important shader not to run, but let’s hope it’s not that. Thus, again, we should take a closer look at the frame.</p>
<p>Once the frame is loaded I’m staring at an empty image at the end of the frame… Not a great start.</p>
<p>Such things mostly happen due to a GPU hang. Since I’m inspecting frames on Linux I took a look at <code class="language-plaintext highlighter-rouge">dmesg</code> and confirmed the hang:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> [drm:a6xx_irq [msm]] *ERROR* gpu fault ring 0 fence ...
</code></pre></div></div>
<p>Fortunately, after walking through draw calls, I found that the mis-rendering happens before the call which hangs. Let’s look at it:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_first_screen_call_before_issue.jpg" alt="Screenshot of a correct draw call right before the wrong one being inspected in RenderDoc" /><figcaption>
Draw call right before
</figcaption></figure>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_first_screen_call_with_issue.jpg" alt="Screenshot of a draw call, that draws the wrong colors, being inspected in RenderDoc" /><figcaption>
Draw call with the issue
</figcaption></figure>
<p>It looks like some fullscreen effect. As in the previous case - the inputs are fine, the only image input is a depth buffer. Also, there are always uniforms passed to the shaders, but when there is only a single problematic draw call - they are rarely an issue (also they are easily comparable with the proprietary driver if I spot some nonsensical values among them).</p>
<p>Now it’s time to look at the shader: ~150 assembly instructions, nothing fancy, nothing obvious, and a lonely <code class="language-plaintext highlighter-rouge">kill</code> near the top. Before going into the most “fun” part, it’s a good idea to make sure that the issue is 99% in the shader. RenderDoc has a cool feature which allows you to debug a shader (its SPIRV code) at a certain fragment (or vertex, or CS invocation); it does the evaluation on the CPU, so I can use it as a kind of reference implementation. In our case the output between RenderDoc and the actual shader evaluation on the GPU is different:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_first_screen_renderdoc_shader_evaluation.jpg" alt="Screenshot of the color value calculated on CPU by RenderDoc" /><figcaption>
Evaluation on CPU: color = vec4(0.17134, 0.40289, 0.69859, 0.00124)
</figcaption></figure>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_first_screen_actual_shader_values.jpg" alt="Screenshot of the color value calculated on GPU" /><figcaption>
On GPU: color = vec4(3.1875, 4.25, 5.625, 0.00061)
</figcaption></figure>
<p>Knowing the above, there is only one thing left to do - reduce the shader until we find the problematic instruction(s). Fortunately, there is a proprietary driver which renders the scene correctly; therefore, instead of relying on intuition, luck, and persistence, we can quickly bisect to the issue by editing the shader and comparing the result with the reference driver. Actually, it’s possible to do this with shader debugging in RenderDoc, but I had problems with it at that moment and it’s not that easy to do.</p>
<p>The process goes like this:</p>
<ol>
<li>Decompile SPIRV into GLSL and check that it compiles back (sometimes it requires some editing)</li>
<li>Remove half of the code, write the most promising temporary variable as a color, and take a look at results</li>
<li>Copy the edited code to RenderDoc instance which runs on proprietary driver</li>
<li>Compare the results</li>
<li>If there is a difference, restore the deleted code; now we know the issue is probably in it. Bisect further by returning to step 2.</li>
</ol>
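<p>The loop above is essentially a binary search with the proprietary driver acting as the oracle. A hypothetical sketch of the same idea (the <code class="language-plaintext highlighter-rouge">differs_from_reference</code> oracle stands in for the manual RenderDoc comparison, and a single bad instruction is assumed for simplicity):</p>

```python
def bisect_shader(instructions, differs_from_reference):
    """Narrow a list of shader 'instructions' down to the smallest
    span that still reproduces the difference from the reference
    driver. differs_from_reference(code) is the manual compare step:
    run the truncated shader on both drivers and compare outputs."""
    lo, hi = 0, len(instructions)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Keep only the first half of the remaining code and compare.
        if differs_from_reference(instructions[:mid]):
            hi = mid  # the problem is already present in the prefix
        else:
            lo = mid  # the problem lives in the removed half
    return instructions[lo:hi]
```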
<p>This way I bisected to this fragment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_243 = clamp(_243, 0.0, 1.0);
_279 = clamp(_279, 0.0, 1.0);
float _290;
if (_72.x) {
_290 = _279;
} else {
_290 = _243;
}
color0 = vec4(_290);
return;
</code></pre></div></div>
<p>Writing <code class="language-plaintext highlighter-rouge">_279</code> or <code class="language-plaintext highlighter-rouge">_243</code> to <code class="language-plaintext highlighter-rouge">color0</code> produced reasonable results, but writing <code class="language-plaintext highlighter-rouge">_290</code> produced nonsense. The only difference is the presence of the condition. Now, having a minimal change which reproduces the issue, it’s possible to compare the native assembly.</p>
<p>Bad:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mad.f32 r0.z, c0.y, r0.x, c6.w
sqrt r0.y, r0.y
mul.f r0.x, r1.y, c1.z
(ss)(nop2) mad.f32 r1.z, c6.x, r0.y, c6.y
(nop3) cmps.f.ge r0.y, r0.x, r1.w
(sat)(nop3) sel.b32 r0.w, r0.z, r0.y, r1.z
</code></pre></div></div>
<p>Good:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(sat)mad.f32 r0.z, c0.y, r0.x, c6.w
sqrt r0.y, r0.y
(ss)(sat)mad.f32 r1.z, c6.x, r0.y, c6.y
(nop2) mul.f r0.y, r1.y, c1.z
add.f r0.x, r0.z, r1.z
(nop3) cmps.f.ge r0.w, r0.y, r1.w
cov.u r1.w, r0.w
(rpt2)nop
(nop3) add.f r0.w, r0.x, r1.w
</code></pre></div></div>
<p>Running them in my head, I reasoned that they should produce the same results, so something was not working as expected. After a few more changes to the GLSL, it became apparent that something was wrong with <code class="language-plaintext highlighter-rouge">clamp(x, 0, 1)</code>, which is translated into a <code class="language-plaintext highlighter-rouge">(sat)</code> modifier on instructions. A bit more digging, and I found out that the hardware doesn’t understand the saturation modifier when it is placed on a <code class="language-plaintext highlighter-rouge">sel</code> instruction (<code class="language-plaintext highlighter-rouge">sel</code> is a selection between two values based on a third).</p>
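<p>A minimal model of the bug, with the caveat that the hardware behavior here is inferred from the observed outputs rather than documented: saturating the result of a select is mathematically equivalent to saturating both operands first, so the compiler’s folding is valid in theory; the hardware, however, silently drops the modifier on <code class="language-plaintext highlighter-rouge">sel</code>:</p>

```python
def sat(x):
    """The (sat) instruction modifier: clamp the result to [0, 1]."""
    return min(max(x, 0.0), 1.0)

def sel_expected(cond, a, b):
    # What the GLSL asks for: both values are clamped, then selected.
    # Equivalent to sat(sel(...)), which is what the compiler emitted.
    return sat(b) if cond else sat(a)

def sel_observed(cond, a, b):
    # Observed hardware behavior: (sat) on sel is silently ignored,
    # so out-of-range values escape the clamp.
    return b if cond else a

print(sel_expected(True, 0.5, 3.1875))  # 1.0
print(sel_observed(True, 0.5, 3.1875))  # 3.1875 -- the nonsense we saw
```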
<p>Disallowing the compiler from placing saturation on <code class="language-plaintext highlighter-rouge">sel</code> instructions resolved the bug:</p>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_first_screen_good.jpg" alt="" /><figcaption>
Login screen after the fix
</figcaption></figure>
<p>The issue was fixed by <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9666">“ir3: disallow .sat on SEL instructions (!9666)”</a>.</p>
<h3 id="gameplay--where-did-the-trees-go">Gameplay – Where did the trees go?</h3>
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_black_trees.jpg" alt="Screenshot of the gameplay with trees and grass being almost black" /><figcaption>
</figcaption></figure>
<p>The trees and grass seem to be rendered incorrectly. After looking through the trace and not finding where they were actually drawn, I studied the trace on the proprietary driver and found them. However, there weren’t any such draw calls with Turnip!</p>
<p>The answer was simple: the shaders failed to compile due to the register allocation failure I mentioned earlier… The general solution would be to implement register spilling. In this case, however, there is a pending merge request that implements a new register allocator, which would later help us implement register spilling. With it, the shaders can now be compiled!</p>
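<p>Register spilling, in a nutshell: when more values are live at once than the register file can hold, the allocator stores some of them to memory and reloads them later, trading bandwidth for compilability. A toy sketch of the pressure calculation (hypothetical, not the ir3 allocator):</p>

```python
def max_pressure(live_ranges):
    """live_ranges: list of (start, end) instruction indices.
    Returns the peak number of simultaneously-live values."""
    events = []
    for start, end in live_ranges:
        events.append((start, 1))   # value becomes live
        events.append((end, -1))    # value dies
    pressure = peak = 0
    for _, delta in sorted(events):
        pressure += delta
        peak = max(peak, pressure)
    return peak

def values_to_spill(live_ranges, num_registers):
    # If peak pressure exceeds the register file, at least this many
    # values must live in memory instead; without spilling, compilation
    # simply fails, which is exactly what hid the trees.
    return max(0, max_pressure(live_ranges) - num_registers)
```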
<figure class="">
<img src="/dpiliaiev/assets/images/turnip-real-world-rendering/genshin_impact_black_trees_fixed.jpg" alt="Screenshot of the gameplay with trees and grass being rendered correctly" /><figcaption>
</figcaption></figure>
<p>More Turnip adventures to come!</p>Danylo PiliaievHow well does the open source Vulkan driver for Adreno GPUs work on real-world tasks? Featuring Genshin Impact.