Bugaroos

Innocuous

Who else out there knows the feeling of opening up a bug ticket, thinking Oh, I fixed that, and then closing the tab?

Don’t lie, I know you do it too.

Such was the case when I (repeatedly) pulled up this ticket about a rendering regression in Dirt Rally.

dirt-rally.png

To me, this is a normal, accurate photo of a car driving along a dirt track in the woods. A very bright track, sure, but it’s plainly visible, just like the spectators lining the sides of the track.

The reporter insisted there was a bug in action, however, so I strapped on my snorkel and oven mitt and got to work.

For the first time.

Last week.

Bug #1

We all remember our first bug. It’s usually “the code didn’t compile the first time I tried”. The worst bug there is.

The first bug I found in pursuit of this so-called rendering anomaly was not a simple compile error. No, unfortunately the code built and ran fine thanks to Mesa’s incredible and rock-stable CI systems (this was last week), which meant I had to continue on this futile quest to find out whatever was wrong.

Next, I looked at the ticket again and downloaded the trace, which was unfortunately provided to prevent me from claiming that I didn’t have the game or couldn’t find it or couldn’t launch it or was too lazy. Another unlucky roll of the dice.

After sitting through thirty minutes of loading and shader compiling in the trace, I was treated to another picturesque view of a realistic racing environment:

dirt-fail.png

B-e-a-u-tiful.

If this is a bug, I don’t want to know what it’s supposed to look like.

Nevertheless, I struggled gamely onward. The first tool in any driver developer’s kit is naturally RenderDoc, which I definitely know how to use and am an expert in wielding for all manner of purposes, the least of which is fixing trivial “bugs” like this one. So I fired up RenderDoc—again, without even the slightest of troubles since I use it all the time and it works great for doing all the things I want it to do—got my package of frame-related data, and then loaded it up in QRenderDoc.

And, of course, QRenderDoc crashed because that was what I wanted it to do. It’s what I always want it to do, really.

But, regrettably, the legendary Baldur Karlsson is on my team and can fix bugs faster than I can report them, so I was unable to continue claiming I couldn’t proceed due to this minor inconvenience for long. I did my usual work in RenderDoc, which I won’t even describe because it took such a short amount of time due to my staggering expertise in the area and definitely didn’t result in finding any more RenderDoc bugs, and the result was clear to me: there were no zink bugs anywhere.

I did, however, notice that descriptor caching was performing suboptimally: only the first UBO/SSBO descriptor was being used to identify the cache entry. I fixed that in less time than it took to describe this entire non-bug nothingburger.
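For the curious, the shape of that fix is simple enough to sketch. This is a standalone illustration with made-up types and a generic FNV-1a hash, not the actual zink code; the point is just that the cache key has to fold in every bound descriptor instead of only the first one:

#include <stdint.h>
#include <stddef.h>

/* hypothetical stand-in for the driver's real binding state */
struct descriptor_binding {
   uint32_t buffer_id;
   uint32_t offset;
   uint32_t range;
};

/* FNV-1a over a run of bytes; any decent hash works here */
static uint32_t
hash_bytes(uint32_t hash, const void *data, size_t size)
{
   const uint8_t *bytes = data;
   for (size_t i = 0; i < size; i++)
      hash = (hash ^ bytes[i]) * 0x01000193u;
   return hash;
}

/* the bug was the moral equivalent of hashing only bindings[0];
 * the fix folds every UBO/SSBO binding into the key */
static uint32_t
descriptor_cache_key(const struct descriptor_binding *bindings, unsigned count)
{
   uint32_t hash = 0x811c9dc5u; /* FNV-1a offset basis */
   for (unsigned i = 0; i < count; i++)
      hash = hash_bytes(hash, &bindings[i], sizeof(bindings[i]));
   return hash;
}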

Bug #2

I want you to look at the original image above, then look at this one here:

dirt-fail2.png

To me, these are indistinguishable. If you disagree, you are nitpicking and biased.

But now I was getting pile-on bug reports: another nebulous claim that if zink ran out of BAR memory while executing this trace, “bugs” occurred.

My skepticism was at a regular-time average after the original report yielded nothing but fixes for RenderDoc, so I closed the tab.

I did some other stuff for most of the week that I can’t remember because it was all so important I stored the memories somewhere safe—so safe that I can’t access them now or accidentally doodle all over them, but I was definitely very, very busy with legitimate work—but then today I decided to be charitable and give the issue a once-over in case there happened to be some merit to the claims.

Seriously though, who runs out of BAR memory? Doesn’t everyone have a GPU with 16GB of BAR these days?

I threw in a quick hack to disable BAR access in zink’s allocator and loaded up the trace again.
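(For the curious, “disable BAR access” amounts to something like the following. This is a sketch of the idea rather than the exact patch; the heap enums are real zink enums, but the helper itself is invented.)

/* hypothetical debug helper: force every BAR request down to plain
 * device-local so the no-BAR paths get exercised */
static enum zink_heap
debug_demote_bar(enum zink_heap heap)
{
   if (heap == ZINK_HEAP_DEVICE_LOCAL_VISIBLE)
      return ZINK_HEAP_DEVICE_LOCAL;
   return heap;
}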

dirt-fail3.png

As we can see here, the weather is bad, so there are fewer spectators out to spectate, but everything is totally fine and bug-free. Just as expected. Very realistic. Maybe it was a little weird that the spectators were appearing and disappearing with each frame, but then I remembered this was the Road To The Haunted House track, and naturally such a track is haunted by the ghosts of players who didn’t place first.

At a conceptual level, I knew it was nonsense to be considering such things. I’m an adult now, and there’s no scientific proof that racing games can be haunted. Despite this, I couldn’t help but feel a chill run down my fingers as I realized I’d discovered a bug. Not specifically because of anything I was doing, but I’d found one. Entirely independently. Just by thinking about it.

Indeed, by looking at the innermost allocation function of zink’s suballocator, there’s this bit of vaguely interesting code:

VkResult ret = VKSCR(AllocateMemory)(screen->dev, &mai, NULL, &bo->mem);
if (!zink_screen_handle_vkresult(screen, ret)) {
   /* BAR allocation failed: retry the same allocation as plain device-local */
   if (heap == ZINK_HEAP_DEVICE_LOCAL_VISIBLE) {
      heap = ZINK_HEAP_DEVICE_LOCAL;
      mesa_loge("zink: %p couldn't allocate memory! from BAR heap: retrying as device-local", bo);
      goto demote;
   }
   /* any other heap: no fallback available */
   mesa_loge("zink: couldn't allocate memory! from heap %u", heap);
   goto fail;
}

This is the fallback path for demoting BAR allocations to device-local. It works great. Except there’s an obvious bug for anyone thinking about it.

I’ll give you a few moments since it was actually a little tricky to spot.

By now though, you’ve all figured it out. Obviously, by performing the demotion here, in the innermost allocation function, the entire BO is demoted to device-local. That includes suballocator slabs: a demoted slab is really device-local memory, but it still gets cached under the BAR heap, so later “BAR” suballocations can silently hand back device-local memory, breaking the whole thing.

Terrifying, just like the not-at-all-haunted race track.

I migrated the demotion code out a couple layers so the resource manager could handle demotion, and the footgun was removed before it could do any real harm.
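In sketch form, the reworked flow looks something like this. The names are approximate and the low-level body is elided, but the point stands: the innermost allocator now simply fails, and the layer that actually tracks the resource’s heap owns the retry.

/* approximate shape of the fix, not the actual zink code */

/* innermost allocator: just try the requested heap, return NULL on failure */
static struct zink_bo *
try_allocate(struct zink_screen *screen, enum zink_heap heap, uint64_t size);

/* resource manager: demotion happens here, where the heap is tracked,
 * so suballocator slabs are always cached under their true heap */
struct zink_bo *
resource_allocate(struct zink_screen *screen, enum zink_heap heap, uint64_t size)
{
   struct zink_bo *bo = try_allocate(screen, heap, size);
   if (!bo && heap == ZINK_HEAP_DEVICE_LOCAL_VISIBLE)
      bo = try_allocate(screen, ZINK_HEAP_DEVICE_LOCAL, size);
   return bo;
}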

Bugs #3-5

Just to pass the time while I waited to see if an unrelated CI pipeline would fail, I decided to try running CTS with BAR allocations disabled. As everyone knows, if CTS fails, there’s a real bug somewhere, so it would be easy to determine whether there might be any bugs that were totally unrelated to the reports in this absolute lunatic’s ticket.

Imagine my surprise when I did, in fact, find failing test cases. Not anything related to the trace or scenario in the ticket, of course, since that’s not a test case, it’s a game, and my parents always told me to be responsible and stop playing games.

It turns out there were some corner case bugs that nothing had been hitting until now. The first one of these was in the second-worst place I could think of: threaded context. Debugging problems here is always nightmarish for varying reasons, though mostly because of threads.

The latest problem in question was that I was receiving a buffer_map call on a device-local buffer with the unsynchronized flag set, meaning that I was unable to use my command buffer. This was fine in the case where I was discarding any pre-existing data on the buffer, but if I did have data I wanted to save, I was screwed.

Well, not me specifically, but my performance was definitely not going to be in a good place since I’d have to force a thread sync and…

Anyway, this was bad, and the cause was that the is_resource_busy call from TC was never passing along the unsynchronized flag, which prevented the driver from accurately determining whether the resource was actually busy.
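The driver-side consequence is easier to see in code. Here’s an approximate sketch of a gallium-style is_resource_busy callback showing why the flag matters; the gpu_done and used_in_current_cmdbuf helpers are invented for illustration, while PIPE_MAP_UNSYNCHRONIZED and the gallium types are real:

static bool
driver_is_resource_busy(struct pipe_screen *pscreen,
                        struct pipe_resource *pres,
                        unsigned usage)
{
   /* already-submitted GPU work: busy until the fence signals */
   if (!gpu_done(pres))
      return true;

   /* work recorded in the current, unsubmitted command buffer is the
    * tricky case: normally the driver could flush and wait, but
    * PIPE_MAP_UNSYNCHRONIZED forbids exactly that, so without seeing
    * the flag the driver can't give an accurate answer */
   if ((usage & PIPE_MAP_UNSYNCHRONIZED) && used_in_current_cmdbuf(pres))
      return true;

   return false;
}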

With that fixed, I definitely didn’t find any more issues whatsoever.

Except that I did.

They required a complex understanding of what some would call “basic arithmetic”, however, so I’m not going to get into them. Suffice it to say that buffer copies were very subtly broken in a couple of ways.

In Summary

I win.

I didn’t explicitly set out to prove this bug report wrong, but it clearly was, and there are absolutely no bugs anywhere so don’t even bother trying to find them.

And if you do find them, don’t bother reporting them because it’s not like anyone’s gonna fix them.

And if someone does fix them, it’s not like they’re gonna spend hours pointlessly blogging about it.

And if someone does spend hours pointlessly blogging about it, it’s not like anyone will read it anyway.

Written on June 16, 2022