It appears that Google created a handy tool that helps find the command that causes a GPU hang or crash. It is called Graphics Flight Recorder (GFR); it was open-sourced a year ago but didn’t receive any attention. From the readme:
The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.
GFR requires VK_AMD_buffer_marker support; however, this extension is rather trivial to implement: I only had to copy-paste the code from our vkCmdSetEvent implementation and that was it. Note that, at the time of writing, GFR unconditionally uses VK_AMD_device_coherent_memory, which can be manually patched out for it to run on other GPUs.
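To make the completion-tag idea from the readme more concrete, here is a small CPU-side model of it. It is only a sketch and not GFR's actual code: the two-counter `MarkerBuffer` layout and the `classify` and `simulate` helpers are assumptions made up for illustration. On real hardware the tags would be written by the GPU (for example with vkCmdWriteBufferMarkerAMD from VK_AMD_buffer_marker) into memory the CPU can read back after a hang.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # State names match the ones that appear in GFR's log output.
    NOT_STARTED = "SUBMITTED_EXECUTION_NOT_STARTED"
    INCOMPLETE = "SUBMITTED_EXECUTION_INCOMPLETE"
    COMPLETE = "SUBMITTED_EXECUTION_COMPLETE"


@dataclass
class MarkerBuffer:
    """Hypothetical stand-in for the memory the GPU writes tags into."""
    top: int = 0     # id of the last command that started executing
    bottom: int = 0  # id of the last command that finished executing


def classify(command_id, markers):
    """Derive a command's state from the tags read back after a hang."""
    if markers.bottom >= command_id:
        return State.COMPLETE
    if markers.top >= command_id:
        return State.INCOMPLETE
    return State.NOT_STARTED


def simulate(commands, hang_at=None):
    """Pretend to run commands in order, hanging inside command `hang_at` (1-based)."""
    markers = MarkerBuffer()
    for command_id, _name in enumerate(commands, start=1):
        markers.top = command_id       # tag written before the command runs
        if command_id == hang_at:
            break                      # hang: the "finished" tag is never written
        markers.bottom = command_id    # tag written after the command completes
    return [(name, classify(i, markers))
            for i, name in enumerate(commands, start=1)]


report = simulate(
    ["vkCmdBindPipeline", "vkCmdDispatch", "vkCmdPipelineBarrier"],
    hang_at=2,
)
for name, state in report:
    print(name, state.value)
```

With the hang placed in the second command, the first command is reported as complete, the dispatch as incomplete, and the barrier as not started.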
GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:
...
- # Command:
  id: 6/9
  markerValue: 0x000A0006
  name: vkCmdBindPipeline
  state: [SUBMITTED_EXECUTION_COMPLETE]
  parameters:
    - # parameter:
      name: commandBuffer
      value: 0x000000558CFD2A10
    - # parameter:
      name: pipelineBindPoint
      value: 1
    - # parameter:
      name: pipeline
      value: 0x000000558D3D6750
- # Command:
  id: 6/9
  message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
- # Command:
  id: 7/9
  markerValue: 0x000A0007
  name: vkCmdDispatch
  state: [SUBMITTED_EXECUTION_INCOMPLETE]
  parameters:
    - # parameter:
      name: commandBuffer
      value: 0x000000558CFD2A10
    - # parameter:
      name: groupCountX
      value: 5
    - # parameter:
      name: groupCountY
      value: 1
    - # parameter:
      name: groupCountZ
      value: 1
  internalState:
    pipeline:
      vkHandle: 0x000000558D3D6750
      bindPoint: compute
      shaderInfos:
        - # shaderInfo:
          stage: cs
          module: (0x000000558F82B2A0)
          entry: "main"
    descriptorSets:
      - # descriptorSet:
        index: 0
        set: 0x000000558E498728
- # Command:
  id: 8/9
  markerValue: 0x000A0008
  name: vkCmdPipelineBarrier
  state: [SUBMITTED_EXECUTION_NOT_STARTED]
...
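In a long log it can be handy to pull out the first incomplete command automatically. The following is a minimal sketch that assumes every command entry has a `name:` field immediately followed by a `state:` field, as in the excerpt above; `first_incomplete_command` is a hypothetical helper, not something GFR ships.

```python
import re

# A command name (vkCmd...) directly followed by its bracketed state.
# Parameter entries also carry "name:", but they are followed by "value:"
# and do not start with vkCmd, so they are skipped.
COMMAND_RE = re.compile(r"name:\s*(vkCmd\w+)\s+state:\s*\[(\w+)\]")


def first_incomplete_command(log_text):
    """Return the name of the first command the GPU started but never finished."""
    for name, state in COMMAND_RE.findall(log_text):
        if state == "SUBMITTED_EXECUTION_INCOMPLETE":
            return name
    return None


sample = """
- # Command:
  id: 6/9
  markerValue: 0x000A0006
  name: vkCmdBindPipeline
  state: [SUBMITTED_EXECUTION_COMPLETE]
- # Command:
  id: 7/9
  markerValue: 0x000A0007
  name: vkCmdDispatch
  state: [SUBMITTED_EXECUTION_INCOMPLETE]
- # Command:
  id: 8/9
  markerValue: 0x000A0008
  name: vkCmdPipelineBarrier
  state: [SUBMITTED_EXECUTION_NOT_STARTED]
"""

print(first_incomplete_command(sample))  # prints "vkCmdDispatch"
```

The regular expression works on the whole text rather than line by line, so it does not depend on how the log is indented.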
After confirming that the corresponding vkCmdDispatch was indeed the call that hangs, in both cases I wrote an Amber test that fully simulated the call. For a compute shader this is relatively easy to do, since all you need is to save the decompiled shader and the buffers it uses. Luckily, in both cases these Amber tests reproduced the hangs.
Unfortunately this tool is not a panacea:
- It would likely fail to help with unrecoverable hangs, where it is impossible to read the completion tags back.
- The mere addition of the tags could “fix” the issue, which may happen with synchronization bugs.
- If draw/dispatch calls run in parallel on the GPU, writing tags may force them to execute sequentially, or the tags themselves may be imprecise.
Anyway, it’s easy to use, so you should give it a try.