Graphics Flight Recorder - unknown but handy tool to debug GPU hangs

3 minute read

It appears that Google created a handy tool that helps finding the command which causes a GPU hang/crash. It is called Graphics Flight Recorder (GFR) and was open-sourced a year ago but didn’t receive any attention. From the readme:

The Graphics Flight Recorder (GFR) is a Vulkan layer to help trackdown and identify the cause of GPU hangs and crashes. It works by instrumenting command buffers with completion tags. When an error is detected a log file containing incomplete command buffers is written. Often the last complete or incomplete commands are responsible for the crash.

It requires VK_AMD_buffer_marker support; however, this extension is rather trivial to implement - I had only to copy-paste the code from our vkCmdSetEvent implementation and that was it. Note, at the moment of writing, GFR unconditionally usesVK_AMD_device_coherent_memory, which could be manually patched out for it to run on other GPUs.

GFR already helped me to fix hangs in “Alien: Isolation” and “Digital Combat Simulator”. In both cases the hang was in a compute shader and the output from GFR looked like:

...
- # Command:
        id: 6/9
        markerValue: 0x000A0006
        name: vkCmdBindPipeline
        state: [SUBMITTED_EXECUTION_COMPLETE]
        parameters:
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: pipelineBindPoint
            value: 1
          - # parameter:
            name: pipeline
            value: 0x000000558D3D6750
      - # Command:
        id: 6/9
        message: '>>>>>>>>>>>>>> LAST COMPLETE COMMAND <<<<<<<<<<<<<<'
      - # Command:
        id: 7/9
        markerValue: 0x000A0007
        name: vkCmdDispatch
        state: [SUBMITTED_EXECUTION_INCOMPLETE]
        parameters:
          - # parameter:
            name: commandBuffer
            value: 0x000000558CFD2A10
          - # parameter:
            name: groupCountX
            value: 5
          - # parameter:
            name: groupCountY
            value: 1
          - # parameter:
            name: groupCountZ
            value: 1
        internalState:
          pipeline:
            vkHandle: 0x000000558D3D6750
            bindPoint: compute
            shaderInfos:
              - # shaderInfo:
                stage: cs
                module: (0x000000558F82B2A0)
                entry: "main"
          descriptorSets:
            - # descriptorSet:
              index: 0
              set: 0x000000558E498728
      - # Command:
        id: 8/9
        markerValue: 0x000A0008
        name: vkCmdPipelineBarrier
        state: [SUBMITTED_EXECUTION_NOT_STARTED]
...

After confirming that corresponding vkCmdDispatch is indeed the call which hangs, in both cases I made an Amber test which fully simulated the call. For a compute shader, this is relatively easy to do since all you need is to save the decompiled shader and buffers being used by it. Luckily in both cases these Amber tests reproduced the hangs.

With standalone reproducers, the problems were much easier to debug, and fixes were made shortly: MR#14044 for “Alien: Isolation” and MR#14110 for “Digital Combat Simulator”.

Unfortunately this tool is not a panacea:

  • It likely would fail to help with unrecoverable hangs where it would be impossible to read the completion tags back.
  • Or when the mere addition of the tags could “fix” the issue which may happen with synchronization issues.
  • If draw/dispatch calls run in parallel on the GPU, writing tags may force them to execute sequentially or to be imprecise.

Anyway, it’s easy to use so you should give it a try.