New Month, New Post
I’m going to kick off this post and month by saying that in my defense, I was going to write this post weeks ago, but then I didn’t, and then I got sidetracked, but I had the screenshots open the whole time so it’s not like I forgot, but then I did forget for a little while, and then my session died because the man the myth the legend the as-seen-on-the-web-with-a-different-meaning Adam “ajax” Jackson pranked me with a GLX patch, but I started a new session, and gimp recovered my screenshots, and I remembered I needed to post, and I got distracted even more, and now it’s like three whole weeks later and here we are at the post I was going to write last month but didn’t get around to but now it’s totally been gotten to.
There’s a new Mesa release in the pipeline, and its name is 22.2. Cool stuff happening in that release: Zink passes GL 4.6 conformance tests on Lavapipe and ANV, would totally pass on RADV if Mesa could even compile shaders for sparse texture tests, and is reasonably close to passing on NVIDIA’s beta driver except for the cases which don’t pass because of unfixed NVIDIA driver bugs. Also Kopper is way better. And some other stuff that I could handwave about but I’m too tired and already losing focus.
Recap over, now let’s get down to some technical mumbo-jumbo.
They exist in Vulkan, and I hate them, but I’m not talking specifically about VkRenderPass. Instead, I’m referring to the abstract concept of “render passes” which includes the better-in-every-way dynamic rendering variant.
Shut up, subpasses, nobody was talking to you.
Render passes control how rendering works. There’s load operations which determine how data is retrieved from the framebuffer (I also hate framebuffers) attachments, there’s store operations which determine how data is stored back to the attachments (I hate this part too), and then there’s “dependencies” (better believe I hate these) which manage synchronization between operations, and input attachments (everyone hates these) which enable reading attachment data in shaders, and then also render pass instances have to be started and stopped any time any attachments or framebuffer geometry changes (this sucks), and to top it all off, transfer operations can’t be executed while render passes are active (mega sucks).
Also there’s nested render passes, but I’m literally fearing for my life even mentioning them where other driver developers can see, so let’s move on.
Today’s suck is focused on render passes and transfer operations: why can’t they just get along?
Here’s a sample command stream from a bug I recently solved:
Notice how the command stream continually starts and stops render passes to execute transfer operations. If you’re on a desktop, you’re like whatever just give me frames, but if you’re on a mobile device with a tiling GPU, this is probably sending you sprinting for your Tik-Tok so you don’t have to look at the scary perf demons anymore.
Until recently, I would’ve been in the former camp, but guess what: as of Mesa 22.2, Zink on Turnip is Officially Supported. Not only that, it has only 14 failures for all of GL 4.6 conformance, so it can render stuff pretty accurately too. Given this as well as the rapid response from Turnip developers, it seems only right that some changes be made to improve Zink on tiling GPUs so that performance isn’t an unmitigated disaster (really hate when this happens).
When running on a tiling GPU, it’s important to avoid
VK_ATTACHMENT_LOAD_OP_LOAD at all costs. These GPUs incur a penalty whenever they load attachment data, so ideally it’s best to either
VK_ATTACHMENT_LOAD_OP_CLEAR or, for attachments without valid data,
VK_ATTACHMENT_LOAD_OP_DONT_CARE. Failure to abide by this guideline will result in performance going straight to the dumpster (again, really hate this).
With this in mind, it may seem obvious, but the above command stream is terrible for tiling GPUs. Each render pass contains only a single draw before terminating to execute a transfer operation. Sure, an argument could be made that this is running a desktop GL application on mobile and thus any performance issues are expected, but I hate it when people argument.
Stop argumenting and get back to writing software.
Problem Space is a cool term used by real experts, so I can only imagine how much SEO using it twice gets me. In this Problem Space, none of the problem space problems that are spacing are new to people who have been in the problem space for longer than me, which is basically everyone since I only got to this problem space a couple years ago. Looking at Freedreno, there’s already a mechanism in place that I can copy to solve one problem Zink has, namely that of never using
Delving deeper into the problem space, use of
VK_ATTACHMENT_LOAD_OP_DONT_CARE is predicated upon the driver being able to determine that a framebuffer attachment does not have valid data when beginning a render pass instance. Thus, the first step is to track that status on the image. It turns out there’s not many places where this needs to be managed. On write-map, import, and write operations, mark the image as valid, and if the image is explicitly invalidated, mark it as invalid. This flag can then be leveraged to skip loading attachment data where possible.
Problem spaced. Or at least one of them.
Historically, Zink has always been a driver of order. It gets the GL commands as Gallium callbacks, it records the corresponding Vulkan command, and then it sends the command buffer to the GPU. It’s simple, yet effective.
But what if there was a better way? What if Zink could see all these operations that were splitting the render passes and determine whether they needed to occur “now” or could instead occur “then”?
This is entirely possible, and it’s similar to the tracking for invalidation. Think of it like a sieve:
- Create two command buffers
Ais the main command buffer
Bis for “unordered” commands
- When recording commands, implement an algorithm to determine whether a given command can be reordered to occur earlier in the command stream
- If yes, record command into
- If no, record command into
- If yes, record command into
- At submission time, execute
It sounds simple, but the details are a bit more complex. To handle this in Zink, I added more tracking data to each buffer and image resource such that:
- At the start of every command buffer, all transfer operations can be promoted to occur on cmdbuf B
- If a transfer operation is promoted to cmdbuf
B, the resources involved are tagged with the appropriate unordered read/write access flag(s)
- Any time a resource is used for a draw or dispatch operation, it gets flagged with read or write access that unsets the corresponding unordered access flag(s)
- When evaluating transfer operations, operations can be promoted to
Bif one of the following is true:
- The resources involved are already used only in
- There is no read access occurring on
Afor the resource being written to AND ONE OF
- There is no write access for the resources involved
- The only write access for the resources involved is on
- The resources involved are already used only in
By following this set of rules, transfer operations can be effectively filtered out of render passes on cmdbuf
A and into the dumpster of cmdbuf
B where all the other transfer operations live. Using a bit more creativity, the above command stream can be rewritten to look like this:
It’s not perfect, but it’s quite effective in not turning the driver code to spaghetti while yielding some nice performance boosts.
It’s also shipping along with the other changes in Mesa 22.2.
While both of these changesets solved real problems, there’s still work to be done in the future. Command stream reordering currently disallows moving draw and dispatch operations from A to B, but in some cases, this may actually be optimal. Finding a method to manage this could yield further gains in performance by avoiding more