Graphics Flight Recorder - unknown but handy tool to debug GPU hangs
It appears that Google created a handy tool that helps finding the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:
The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.
It requires VK_AMD_buffer_marker
support; however, this extension is rather trivial to implement - I had only to copy-paste the code from our vkCmdSetEvent
implementation and that was it. Note, at the moment of writing, GFR unconditionally usesVK_AMD_device_coherent_memory
, which could be manually patched out for it to run on other GPUs.
GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:
...
- # Command:
id: 6/9
markerValue: 0x000A0006
name: vkCmdBindPipeline
state: [SUBMITTED_EXECUTION_COMPLETE]
parameters:
- # parameter:
name: commandBuffer
value: 0x000000558CFD2A10
- # parameter:
name: pipelineBindPoint
value: 1
- # parameter:
name: pipeline
value: 0x000000558D3D6750
- # Command:
id: 6/9
message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
- # Command:
id: 7/9
markerValue: 0x000A0007
name: vkCmdDispatch
state: [SUBMITTED_EXECUTION_INCOMPLETE]
parameters:
- # parameter:
name: commandBuffer
value: 0x000000558CFD2A10
- # parameter:
name: groupCountX
value: 5
- # parameter:
name: groupCountY
value: 1
- # parameter:
name: groupCountZ
value: 1
internalState:
pipeline:
vkHandle: 0x000000558D3D6750
bindPoint: compute
shaderInfos:
- # shaderInfo:
stage: cs
module: (0x000000558F82B2A0)
entry: "main"
descriptorSets:
- # descriptorSet:
index: 0
set: 0x000000558E498728
- # Command:
id: 8/9
markerValue: 0x000A0008
name: vkCmdPipelineBarrier
state: [SUBMITTED_EXECUTION_NOT_STARTED]
...
After confirming that corresponding vkCmdDispatch
is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader, this is relatively easy to do since all you need is to save the decompiled shader and buffers being used by it. Luckily in both cases these Amber tests reproduced the hangs.
With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.
Unfortunately this tool is not a panacea:
- It likely would fail to help with unrecoverable hangs where it would be impossible to read the completion tags back.
- Or when the mere addition of the tags could “fix” the issue which may happen with synchronization issues.
- If draw/dispatch calls run in parallel on the GPU, writing tags may force them to execute sequentially or to be imprecise.
Anyway, it’s easy to use so you should give it a try.
Comments