Debugging Unrecoverable Hangs of GPU

I already talked about debugging hangs in “Graphics Flight Recorder - unknown but handy tool to debug GPU hangs”. Now I want to talk about the nastiest kind of GPU hangs: the ones which cannot be recovered from, where your computer becomes completely unresponsive and you cannot even ssh into it.

How would one debug this? There is no data to get after the hang, and it’s incredibly frustrating to even try different debug options and hypotheses: if you are wrong, you get to reboot the machine!

If you are a hardware manufacturer creating a driver for your own GPU, you could just run the hanging workload in a hardware simulator, wait a few hours for the result, and call it a day. But what if you don’t have access to simulators or to some debug side channel?

There are a few things to try:

  • Eyeball the workload until.
  • Try all debug options you have, try to disable different types of calls like all compute calls, or some draw calls, and so on. The downside is that you have to reboot every time you hit the issue.
  • Breadcrumbs!

Today we will talk about the breadcrumbs. Unfortunately, GFR, which I already wrote about, isn’t of much help here. The hang is unrecoverable, so GFR isn’t able to gather the results.

But the idea of breadcrumbs is still useful! What if, instead of gathering results post factum, we stream them to some other machine? This would allow us to get results even if our target becomes unresponsive. The need to get results ASAP considerably changes the breadcrumbs workflow.

What if we write breadcrumbs on GPU after each command and spin a thread on CPU reading it in a busy loop?

In practice, the number of breadcrumbs between the last one sent over the network and the one currently executing is just too big for this to be practical.

So we have to make the GPU and CPU run in lockstep.

  • GPU writes a breadcrumb and immediately waits for this value to be written to another fixed address;
  • CPU in a busy loop checks the last written breadcrumb value, sends it over socket, and writes it back;
  • GPU sees the new value and continues execution.

This way the most recent breadcrumb gets immediately sent over the network. In practice, some breadcrumbs are still lost between the last one sent over the network and the one where the GPU hangs. But the difference is only a few of them.
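The handshake above can be sketched as a small simulation. This is only an illustration of the protocol, not driver code: two Python threads stand in for the GPU and the CPU, and the names `SharedMemory`, `gpu`, `cpu`, `run` and the `send` callback are all made up for this example.

```python
import threading
import time

class SharedMemory:
    """Two cells standing in for the GPU-visible buffer (hypothetical layout)."""
    def __init__(self):
        self.written = 0  # last breadcrumb written by the "GPU"
        self.acked = 0    # value the "CPU" writes back to release the "GPU"

def gpu(mem, count):
    # Stand-in for the command stream: write a breadcrumb, then stall
    # until the same value shows up in the ack cell.
    for crumb in range(1, count + 1):
        mem.written = crumb
        while mem.acked != crumb:
            time.sleep(0)  # yield; real hardware would wait on the memory value

def cpu(mem, count, send):
    # Busy loop: spot a new breadcrumb, ship it out, then write it back.
    last = 0
    while last < count:
        crumb = mem.written
        if crumb != last:
            send(crumb)        # e.g. a UDP sendto() to the other machine
            mem.acked = crumb  # only now may the "GPU" move on
            last = crumb
        else:
            time.sleep(0)

def run(count, send):
    mem = SharedMemory()
    workers = [threading.Thread(target=gpu, args=(mem, count)),
               threading.Thread(target=cpu, args=(mem, count, send))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Calling `run(5, print)` prints the breadcrumbs 1 through 5 in order. In this sketch each value is handed to `send` before it is written back, so the simulated GPU never advances past a breadcrumb that has not been passed on.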

With the lockstep execution we could narrow down the hanging command even further. For this we have to wait for a certain time after each breadcrumb before proceeding to the next one. I chose to just ask the user for explicit keyboard input for each breadcrumb.

In the end the workflow looks like this:

  • Run with breadcrumbs enabled but without requiring explicit user input;
  • On another machine receive the stream of breadcrumbs;
  • Note the last received breadcrumb;
  • Reboot the target;
  • Run the workload enabling breadcrumbs starting from the last one we previously received, requiring explicit ack from user;
  • In a few steps the target hangs;
  • Now that we know the breadcrumb closest to the hang location, we can get the command stream and see what happens right after the breadcrumb;
  • Given the potential command it’s now possible to test various changes in the driver.
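The second run in this workflow needs a rule for when to start asking for acks. Below is a rough sketch of one way the CPU side could decide, assuming a breakpoint is a breadcrumb number plus a zero-based repeat count. That semantics, and the names `make_gate`, `bp`, `bp_hits` and `ask`, are assumptions made for illustration and not taken from any driver.

```python
from collections import Counter

def make_gate(bp, bp_hits, ask=input):
    """Return a function to call once per breadcrumb on the CPU side.

    bp / bp_hits: breadcrumb number and zero-based repeat count at which
    single-stepping starts (hypothetical semantics).
    """
    hits = Counter()
    stepping = False

    def gate(crumb):
        nonlocal stepping
        seen = hits[crumb]  # how many times this breadcrumb was seen before
        hits[crumb] += 1
        if crumb == bp and seen == bp_hits:
            stepping = True  # from here on, every breadcrumb needs an ack
        if stepping:
            ask(f"GPU is on breadcrumb {crumb}:{seen}, press Enter to continue ")
        return stepping

    return gate
```

Before the breakpoint the gate returns immediately, so the workload runs at close to full lockstep speed. After it, each breadcrumb blocks on `ask` until the user confirms.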

Could Graphics Flight Recorder do this?

In theory yes, but it would require an additional Vulkan extension to be able to wait for a value in memory. However, it would still have a crucial limitation.

Even with Vulkan being really close to the hardware there are still many cases where one Vulkan command is translated into many GPU commands under the hood. Things like image copies, blits, renderpass boundaries. For unrecoverable hangs we want to narrow down the hanging GPU command as much as possible, so it makes more sense to implement such functionality in a driver.

Turnip’s take on the breadcrumbs

I recently implemented it in Turnip (open-source Vulkan driver for Qualcomm’s GPUs) and used it a few times with good results.

The current implementation in Turnip is rather spartan and provides only the minimal set of tools needed for the workflow described above:

1) Launch workload with TU_BREADCRUMBS envvar:

TU_BREADCRUMBS=$IP:$PORT,break=$BREAKPOINT:$BREAKPOINT_HITS

2) Receive breadcrumbs on another machine via this bash spaghetti:

> nc -lvup $PORT | stdbuf -o0 xxd -pc -c 4 | awk -Wposix '{printf("%u:%u\n", "0x" $0, a[$0]++)}'

Received packet from 10.42.0.19:49116 -> 10.42.0.1:11113 (local)
1:0
7:0
8:0
9:0
10:0
11:0
12:0
13:0
14:0
[...]
10:3
11:3
12:3
13:3
14:3
15:3
16:3
17:3
18:3

Each line is a breadcrumb № and how many times it was repeated. A breadcrumb repeats either if the command buffer is reusable or if the breadcrumb is in a part of the command stream that is repeated for each tile.
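If the shell pipeline is hard to read, roughly the same receiver can be sketched in Python. It assumes what the pipeline implies: each breadcrumb arrives over UDP as a 4-byte value that reads as a big-endian unsigned integer. The names `decode` and `listen` are made up for this example.

```python
import socket
import struct
from collections import Counter

def decode(payload, hits):
    """Yield 'breadcrumb:hits' lines for each 4-byte big-endian value."""
    usable = len(payload) // 4 * 4  # ignore a trailing partial value
    for (crumb,) in struct.iter_unpack("!I", payload[:usable]):
        yield f"{crumb}:{hits[crumb]}"  # hit count is zero-based, as in the awk script
        hits[crumb] += 1

def listen(port):
    """Print breadcrumbs received on a UDP port until interrupted."""
    hits = Counter()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        while True:
            payload, _ = sock.recvfrom(4096)
            for line in decode(payload, hits):
                print(line, flush=True)  # flush so nothing is stuck in a buffer
```

Running `listen(11113)` on the receiving machine would print lines in the same `number:hits` form as above. The last line printed before the target goes silent is the one to note.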

3) Increase hang timeout:

echo -n 120000 > /sys/kernel/debug/dri/0/hangcheck_period_ms

4) Launch workload and break on the last known executed breadcrumb:


Possible future improvements:

  • Easier matching of breadcrumb to a specific command in a command stream;
  • Print surrounding commands when requiring explicit user ack.