Debugging Unrecoverable GPU Hangs

8 minute read

I already talked about debugging hangs in “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”, now I want to talk about the most nasty kind of GPU hangs - the ones which cannot be recovered from, where your computer becomes completely unresponsive and you cannot even ssh into it.

How would one debug this? There is no data to get after the hang and it’s incredibly frustrating to even try different debug options and hypothesis, if you are wrong - you get to reboot the machine!

If you are a hardware manufacturer creating a driver for your own GPU, you could just run the workload in your fancy hardware simulator, wait for a few hours for the result and call it a day. But what if you don’t have access to a simulator, or to some debug side channel?

There are a few things one could try:

  • Try all debug options you have, try to disable compute dispatches, some draw calls, blits, and so on. The downside is that you have to reboot every time you hit the issue.
  • Eyeball the GPU packets until they start eyeballing you.
  • Breadcrumbs!

Today we will talk about the breadcrumbs. Unfortunately, Graphics Flight Recorder, which I already wrote about, isn’t of much help here. The hang is unrecoverable so GFR isn’t able to gather the results and write them to the disk.

But the idea of breadcrumbs is still useful! What if instead of gathering result post factum, we stream the results to some other machine. This would allow us to get the results even if our target becomes unresponsive. Though, the requirement to get the results ASAP considerably changes the workflow.

What if we write breadcrumbs on GPU after each command and spin a thread on CPU reading it in a busy loop?

In practice the the amount of breadcrumbs between the one sent over network and the one currently executed is just too big to be practical.

So we have to make GPU and CPU running in a lockstep.

  • GPU writes a breadcrumb and immediately waits for this value to be acknowledged;
  • CPU in a busy loop checks the last written breadcrumb value, sends it over socket, and writes it back to the fixed address;
  • GPU sees a new value and continues execution.

This way the most recent breadcrumb gets immediately sent over the network. In practice, some breadcrumbs are still lost between the last sent over the network and the one where GPU hangs. But the difference is only a few of them.

With the lockstep execution we could narrow the hanging command even further. For this we have to wait for a certain time after each breadcrumb before proceeding to the next one. I chose to just promt user for explicit keyboard input for each breadcrumb.

I ended up with the following workflow:

  • Run with breadcrumbs enabled but without requiring explicit user input;
  • On another machine receive the stream of breadcrumbs via network;
  • Note the last received breadcrumb, the hanging command would be nearby;
  • Reboot the target;
  • Enable breadcrumbs starting from a few breadcrumbs before the last one received, require explicit ack from the user on each breadcrumb;
  • In a few steps GPU should hang;
  • Now that we know the closest breadcrumb to the real hang location - we can get the command stream and see what happens right after the breadcrumb;
  • Knowing which command(s) is causing our hang - it’s now possible to test various changes in the driver.

Could Graphics Flight Recorder do this?

In theory yes, but it would require additional VK extension to be able to wait for value in memory. However, it still would have a crucial limitation.

Even with Vulkan being really close to the hardware there are still many cases where one Vulkan command is translated into many GPU commands under the hood. Things like image copies, blits, renderpass boundaries. For unrecoverable hangs we want to narrow down the hanging GPU command as much as possible, so it makes more sense to implement such functionality in a driver.

Turnip’s take on the breadcrumbs

I recently implemented it in Turnip (open-source Vulkan driver for Qualcomm’s GPUs) and used a few times with good results.

Current implementation in Turnip is rather spartan and gives the minimal amount of instruments to achieve the workflow described above. While looks like this:

1) Launch workload with TU_BREADCRUMBS envvar (TU_BREADCRUMBS=$IP:$PORT,break=$BREAKPOINT:$BREAKPOINT_HITS):

TU_BREADCRUMBS=$IP:$PORT,break=-1:0 ./some_workload

2) Receive breadcrumbs on another machine via this bash spaghetti:

> nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

Received packet from 10.42.0.19:49116 -> 10.42.0.1:11113 (local)
1:0
7:0
8:0
9:0
10:0
11:0
12:0
13:0
14:0
[...]
10:3
11:3
12:3
13:3
14:3
15:3
16:3
17:3
18:3

Each line is a breadcrumb № and how many times it was repeated (either if command buffer is reusable or if breadcrumb is in a command stream which repeats for each tile).

3) Increase hang timeout:

echo -n 120000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

4) Launch workload and break on the last known executed breadcrumb:

TU_BREADCRUMBS=$IP:$PORT,break=15:3 ./some_workload
GPU is on breadcrumb 18, continue?y
GPU is on breadcrumb 19, continue?y
GPU is on breadcrumb 20, continue?y
GPU is on breadcrumb 21, continue?y
...

5) Continue until GPU hangs.

:tada: Now you know two breadcrumbs between which there is a command which causes the hang.

Limitations

  • If the hang was caused by a lack of synchronization, or a lack of cache flushing - it may not be reproducible with breadcrumbs enabled.
  • While breadcrumbs could help to narrow down the hang to a single command - a single command may access many different states and have many steps under the hood (e.g. draw or blit commands).
  • The method for breadcrumbs described here do affect performance, especially if tiling rendering is used, since breadcrumbs are repeated for each tile.

Comments