<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.supergoodcode.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.supergoodcode.com/" rel="alternate" type="text/html" /><updated>2026-01-23T18:26:39+00:00</updated><id>https://www.supergoodcode.com/feed.xml</id><title type="html">Mike Blumenkrantz</title><subtitle>Super. Good. Code.</subtitle><entry><title type="html">Unpopular Opinion</title><link href="https://www.supergoodcode.com/unpopular-opinion/" rel="alternate" type="text/html" title="Unpopular Opinion" /><published>2026-01-23T00:00:00+00:00</published><updated>2026-01-23T00:00:00+00:00</updated><id>https://www.supergoodcode.com/unpopular-opinion</id><content type="html" xml:base="https://www.supergoodcode.com/unpopular-opinion/"><![CDATA[<h1 id="a-big-day-for-graphics">A Big Day For Graphics</h1>

<p>Today is a big day for graphics. We got shiny new extensions and a new RM2026 profile, huzzah.</p>

<p><a href="https://docs.vulkan.org/features/latest/features/proposals/VK_EXT_descriptor_heap.html">VK_EXT_descriptor_heap</a> is huge. I mean in terms of surface area, the sheer girth of the spec, and the number of years it’s been under development. Seriously, check out that contributor list. Is it the longest ever? I’m not about to do comparisons, but it might be.</p>

<p>So this is a big deal, and everyone is out in the streets (I assume to celebrate such a monumental leap forward), and I’m not.</p>

<p>All hats off. Person to person, let’s talk.</p>

<h1 id="power-overwhelming">Power Overwhelming</h1>

<p>It’s true that descriptor heap is incredibly powerful. It perfectly exemplifies everything that Vulkan is: low-level, verbose, flexible. vkd3d-proton will make good use of it (eventually), as this more closely relates to the DX12 mechanics it translates. Game engines will finally have something that allows them to footgun as hard as they deserve. This functionality even maps more closely to certain types of hardware, as described by a great <a href="https://gfxstrand.net/faith/blog/2022/08/descriptors-are-hard/">gfxstrand blog post</a>.</p>

<p>There is, to my knowledge, just about nothing you can’t do with <code class="language-plaintext highlighter-rouge">VK_EXT_descriptor_heap</code>. It’s really, really good, and I’m proud of what the Vulkan WG has accomplished here.</p>

<p>But I don’t like it.</p>

<h1 id="what-is-this-incredibly-hot-take">What Is This Incredibly Hot Take?</h1>

<p>It’s a risky position; I don’t want anyone’s takeaway to be “Mike shoots down new descriptor extension as worst idea in history”. We’re all smart people, and we can comprehend nuance, like the difference between rb and ab in EGL patch review (protip: if anyone ever gives you an rb, they’re fucking lying because nobody can fully comprehend that code).</p>

<p>In short, I don’t expect zink to ever move to descriptor heap. If it does, it’ll be years from now as a result of taking on some other even more amazing extension which depends on heaps. Why is this, I’m sure you ask. Well, there are a few reasons:</p>

<h2 id="code-complexity">Code Complexity</h2>
<p>Like all things Vulkan, “getting it right” with descriptors meant creating an API so verbose that I could write novels with fewer characters than some of the struct names. Everything is brand new, with no sharing/reuse of any existing code. As anyone who has ever stepped into an unfamiliar bit of code and thought “this is garbage, I should rewrite it all” knows too well, existing code is always the worst code–but it’s also the code that works and is tied into all the other existing code. Pretty soon, attempting to parachute in a new descriptor API becomes rewriting literally everything because it’s all incompatible. Great for those with time and resources to spare, not so great for everyone else.</p>

<p>Gone are image views, which is cool and good, except that everything else in Vulkan still uses them, meaning now all image descriptors need an extra pile of code to initialize the new structs which are used only for heaps. Hope none of that was shared between rendering and descriptor use, because now there will be rendering use and descriptor use and they are completely separate. Do I hate image views? Undoubtedly, and I like this direction, but hit me up in a few more years when I can delete them everywhere.</p>

<p>Shader interfaces are going to be the source of most pain. Sure, it’s very possible to keep existing shader infrastructure and use the mapping API with its glorious nested structs. But now you have an extra 1000 lines of mapping API structs to juggle on top. Alternatively, you can get AI to rewrite all your shaders to use the new SPIR-V extension and have direct heap access.</p>

<h2 id="performance">Performance</h2>
<p>Descriptor heap maps closer to hardware, which should enable users to get more performant execution by eliminating indirection with direct heap access. This is great. Full stop.</p>

<p>…Unless you’re like zink, where the only way to avoid shredding 47 CPUs every time you change descriptors is to use a “sliding” offset for descriptors and update it each draw (i.e., <code class="language-plaintext highlighter-rouge">VK_DESCRIPTOR_MAPPING_SOURCE_HEAP_WITH_PUSH_INDEX_EXT</code>). Then you can’t use direct heap access. Which means you’re still indirecting your descriptor access (which has always been the purported perf pain point of 1.0 descriptors and EXT_descriptor_buffer). You do not pass Go, you do not collect $200. All you do is write a ton of new code.</p>
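<p>As a rough illustration of that sliding-offset pattern (plain C with entirely invented names; none of this is the actual <code class="language-plaintext highlighter-rouge">VK_EXT_descriptor_heap</code> API), the idea is to append each draw’s descriptors at an advancing cursor and push only the base index:</p>

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: descriptors live in one big heap, each draw's
 * descriptors are copied to a sliding cursor, and the shader receives
 * only a base index (e.g. via push data). Nothing here is the real
 * VK_EXT_descriptor_heap interface. */

#define HEAP_CAPACITY 4096

struct fake_descriptor { uint64_t payload; };

struct sliding_heap {
   struct fake_descriptor heap[HEAP_CAPACITY];
   uint32_t cursor; /* next free slot */
};

/* Copy this draw's descriptors at the cursor and return the base index
 * to push. Wraps naively when full; a real driver would have to fence
 * against in-flight GPU work before reusing slots. */
static uint32_t
push_draw_descriptors(struct sliding_heap *sh,
                      const struct fake_descriptor *descs, uint32_t count)
{
   if (sh->cursor + count > HEAP_CAPACITY)
      sh->cursor = 0;
   uint32_t base = sh->cursor;
   memcpy(&sh->heap[base], descs, count * sizeof(*descs));
   sh->cursor += count;
   return base; /* shader reads heap[base + binding]: still an indirection */
}
```

<p>The trailing comment is the point: the shader still indexes through a pushed base, which is exactly the indirection that direct heap access was supposed to eliminate.</p>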

<h2 id="opinionated-development">Opinionated Development</h2>
<p>There’s a tremendous piece of exposition outlining the reasons why EXT_descriptor_heap exists in <a href="https://docs.vulkan.org/features/latest/features/proposals/VK_EXT_descriptor_heap.html#_problem_statement">the proposal</a>. None of these items are incorrect. I’ve even contributed to this document. If I were writing an engine from scratch, I would certainly expect to use heaps for portability reasons (i.e., in theory, it should eventually be available on all hardware).</p>

<p>But as flexible and powerful as descriptor heap is, there are some annoying cases where it passes the buck to the user. Specifically, I’m talking about management of the sampler heap. 1.0 descriptors and descriptor buffer just handwave away the exact hardware details, but with VK_EXT_descriptor_heap, you are now the captain of your own destiny and also the manager of exactly how the hardware is allocating its samplers. So if you’re on NVIDIA, where you have exactly 4096 available samplers as a hardware limit, you now have to juggle that limit yourself instead of letting the driver handle it for you.</p>
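<p>For concreteness, here is a minimal sketch (plain C, hypothetical names) of the kind of bookkeeping that now lands on the user: a fixed pool of sampler slots capped at the hardware limit, allocated and recycled by hand:</p>

```c
#include <stdint.h>

/* Illustrative sketch of "managing the sampler heap yourself": a fixed
 * pool of slots, sized to a hardware limit like the 4096 samplers
 * mentioned above, handed out and recycled by the application instead
 * of the driver. All names here are hypothetical. */

#define MAX_SAMPLERS 4096

struct sampler_pool {
   uint32_t free_list[MAX_SAMPLERS];
   uint32_t free_count;
};

static void
sampler_pool_init(struct sampler_pool *p)
{
   p->free_count = MAX_SAMPLERS;
   /* Fill so that allocation hands out slot 0 first. */
   for (uint32_t i = 0; i < MAX_SAMPLERS; i++)
      p->free_list[i] = MAX_SAMPLERS - 1 - i;
}

/* Returns a slot index, or UINT32_MAX when the hardware limit is hit;
 * that failure case is the thing the driver used to hide from you. */
static uint32_t
sampler_pool_alloc(struct sampler_pool *p)
{
   if (p->free_count == 0)
      return UINT32_MAX;
   return p->free_list[--p->free_count];
}

static void
sampler_pool_free(struct sampler_pool *p, uint32_t slot)
{
   p->free_list[p->free_count++] = slot;
}
```

<p>This is roughly the sort of accounting a driver would previously do for you internally; now exhausting the pool is your problem to detect and handle.</p>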

<p>This also applies to border colors, which has its own <a href="https://docs.vulkan.org/features/latest/features/proposals/VK_EXT_descriptor_heap.html#_why_is_there_an_explicit_custom_border_color_registration">note in the proposal</a>. At an objective, high-view level, it’s awesome to have such fine-grained control over the hardware. Then again, it’s one more thing the driver is no longer managing.</p>

<h1 id="i-dont-have-a-better-solution">I Don’t Have A Better Solution</h1>
<p>That’s certainly the takeaway here. I’m not saying go back to 1.0 descriptors. Nobody should do that. I’m not saying stick with descriptor buffers either. Descriptor heap has been under development since before I could legally drive, and I’m certainly not smarter than everyone (or anyone, most likely) who worked on it.</p>

<p>Maybe this is the best we’ll get. Maybe the future of descriptors really is micromanaging every byte of device memory and material stored within because we haven’t read every blog post in existence and don’t trust driver developers to make our shit run good. Maybe OpenGL, with its drivers that “just worked” under the hood (with the caveat that you, the developer, can’t be an idiot), wasn’t what we all wanted.</p>

<p>Maybe I was wrong, and we do need like five trillion more blog posts about Vulkan descriptor models. Because releasing a new descriptor extension is definitely how you get more of those blog posts.</p>

<p><a href="https://knowyourmeme.com/memes/im-tired-boss">I’m tired, boss.</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[A Big Day For Graphics]]></summary></entry><entry><title type="html">2026 Status</title><link href="https://www.supergoodcode.com/2026-status/" rel="alternate" type="text/html" title="2026 Status" /><published>2026-01-14T00:00:00+00:00</published><updated>2026-01-14T00:00:00+00:00</updated><id>https://www.supergoodcode.com/2026-status</id><content type="html" xml:base="https://www.supergoodcode.com/2026-status/"><![CDATA[<h1 id="not-a-real-post">Not A Real Post</h1>

<p>Still digging myself out of a backlog (and remembering how to computer), so probably no real post this week. I do have some exciting news for the blog though.</p>

<p>Now that various public announcements have been made, I can finally reveal that the reason I’ve been less active in Mesa of late is that I’ve been hard at work on Steam Frame. There’s a lot of very cool tech involved, and I’m planning to do some rundowns on the software-related projects I’ve been tackling.</p>

<p>Temper your expectations: I won’t be discussing anything hardware-related, and there will likely be no mentions of any specific game performance/issues.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Not A Real Post]]></summary></entry><entry><title type="html">Hibernate On</title><link href="https://www.supergoodcode.com/hibernate-on/" rel="alternate" type="text/html" title="Hibernate On" /><published>2025-10-31T00:00:00+00:00</published><updated>2025-10-31T00:00:00+00:00</updated><id>https://www.supergoodcode.com/hibernate-on</id><content type="html" xml:base="https://www.supergoodcode.com/hibernate-on/"><![CDATA[<h1 id="take-a-break">Take A Break</h1>

<p>We’ve reached Q4 of another year, and after the mad scramble that has been crunch-time over the past few weeks, it’s time for SGC to once again retire into a deep, restful sleep.</p>

<p>2025 saw a lot of ground covered:</p>
<ul>
  <li>NVK-Zink synergy</li>
  <li>Continued Rusticl improvements</li>
  <li>Viewperf perf and general CPU overhead reduction</li>
  <li>Tiler GPU perf</li>
  <li>Mesh shaders</li>
  <li>apitrace perf</li>
  <li>More GL extensions released than any other year this decade</li>
</ul>

<p>It’s been a real roller coaster ride of a year as always, but I can say authoritatively that fans of the blog, you need to take care of yourselves. You need to use this break time wisely. Rest. Recover. Train your bodies. Travel and broaden your horizons. Invest in night classes to expand your minds.</p>

<p>You are not prepared for the insanity that will be this blog in 2026.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Take A Break]]></summary></entry><entry><title type="html">Apitrace Goes Vroom</title><link href="https://www.supergoodcode.com/apitrace-goes-vroom/" rel="alternate" type="text/html" title="Apitrace Goes Vroom" /><published>2025-10-27T00:00:00+00:00</published><updated>2025-10-27T00:00:00+00:00</updated><id>https://www.supergoodcode.com/apitrace-goes-vroom</id><content type="html" xml:base="https://www.supergoodcode.com/apitrace-goes-vroom/"><![CDATA[<h1 id="first-time">First Time</h1>

<p>Today marks the first post of a type that I’ve wanted to have for a long while: a guest post. There are lots of graphics developers who work on cool stuff and don’t want to waste time setting up blogs, but with enough cajoling they will write a single blog post. If you’re out there thinking you just did some awesome work and you want the world to know the grimy, gory details, let me know.</p>

<p>The first <del>victim</del>recipient of this honor is an individual famous for small and extremely sane endeavors such as descriptor buffers in Lavapipe, ray tracing in Lavapipe, and sparse support in Lavapipe. Also wrangling ray tracing for RADV.</p>

<p>Below is the debut blog post by none other than Konstantin Seurer.</p>

<h1 id="what-is-apitrace">What is apitrace?</h1>

<p>Apitrace is a powerful tool for capturing and replaying traces of GL and DX applications. The problem is that it is not really suitable for performance testing. This blog post is about implementing a faster method for replaying traces.</p>

<p>About six weeks ago, Mike asked me if I wanted to work on this.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[6:58:58 pm] &lt;zmike&gt; on the topic of traces
[6:59:08 pm] &lt;zmike&gt; I have a longer-term project that could use your expertise
[6:59:19 pm] &lt;zmike&gt; it's low work but high complexity
[7:00:12 pm] &lt;zmike&gt; specifically I would like apitrace to be able to competently output C code from traces and to have this functionality merged upstream
</code></pre></div></div>

<blockquote>
  <p>low work</p>
</blockquote>

<p>Sure. <a href="https://www.supergoodcode.com/assets/glreplay/clueless.png"><img src="https://www.supergoodcode.com/assets/glreplay/clueless.png" alt="clueless.png" /></a></p>

<h1 id="the-state-of-glretrace">The state of <code class="language-plaintext highlighter-rouge">glretrace</code></h1>

<p>The first obvious step was measuring how <code class="language-plaintext highlighter-rouge">glretrace</code> currently performs. Mike kindly provided a couple of traces from his personal collection, and I immediately timed a trace of the only relevant OpenGL game:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ time ./glretrace -b minecraft-perf.trace
/Users/Cortex/Downloads/graalvm-jdk-23.0.1+11.1/bin/java
Rendered 1261 frames in 10.4269 secs, average of 120.937 fps

real    0m10.554s
user    0m12.938s
sys     0m2.712s
</code></pre></div></div>

<p>This looks fine, but I have no idea how fast the application is supposed to run. Running the same trace with <code class="language-plaintext highlighter-rouge">perf</code> reveals that there is room for improvement.</p>

<p><a href="https://www.supergoodcode.com/assets/glreplay/trace_parse_time.png"><img src="https://www.supergoodcode.com/assets/glreplay/trace_parse_time.png" alt="trace_parse_time.png" /></a></p>

<p>2/3 of frametime is spent parsing the trace.</p>

<h1 id="implementation">Implementation</h1>

<p>An apitrace trace stores API call information in an object-oriented style. This makes basic codegen really easy because the objects map directly to the generated C/C++ code. However, not all API calls are made equal, and the countless special cases that I needed to handle are what made this project take so long.</p>

<p><code class="language-plaintext highlighter-rouge">glretrace</code> has custom implementations for WSI API calls, and it would be a shame not to use them. The easiest way of doing that is generating a shared library instead of an executable and having <code class="language-plaintext highlighter-rouge">glretrace</code> load it. The shared library can then provide a bunch of callbacks for the call sequences we can do codegen for and <code class="language-plaintext highlighter-rouge">Call</code> objects for everything else.</p>
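<p>Conceptually (with hypothetical names; this is not apitrace’s actual interface), the generated library just needs to export a table of compiled call sequences for the replayer to resolve and invoke:</p>

```c
#include <stdint.h>

/* Hypothetical sketch of the shared-library handoff: the generated code
 * exports a table of per-sequence callbacks; the replayer loads the .so,
 * resolves the table, and calls each sequence in order, falling back to
 * interpreting Call objects for anything the codegen skipped. */

typedef void (*replay_fn)(void);

struct replay_exports {
   uint32_t num_sequences;
   const replay_fn *sequences; /* one callback per compiled call sequence */
};

/* What a generated library might define. */
static void sequence_0(void) { /* ... GL calls for this sequence ... */ }

static const replay_fn generated_sequences[] = { sequence_0 };

/* The one symbol the replayer would resolve, e.g. with dlsym(). */
struct replay_exports
replay_get_exports(void)
{
   struct replay_exports e = { 1, generated_sequences };
   return e;
}
```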

<p>Besides WSI, there are also arguments and return values that need special treatment. OpenGL allows the application to create all kinds of objects that are represented using IDs. Those IDs are assigned by the driver, and they can be different during replay. <code class="language-plaintext highlighter-rouge">glretrace</code> remaps them using <code class="language-plaintext highlighter-rouge">std::map</code>s which have non-trivial overhead. I initially did that as well for the codegen to get things up and running, but it is actually possible to emit global variables and have most of the remapping logic run during codegen.</p>
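<p>The difference is easy to sketch (hypothetical names, not the actual codegen output): instead of a map lookup on every replayed call, each trace-time ID gets a global variable that is assigned once and then read directly:</p>

```c
#include <stdint.h>

/* Hypothetical codegen output: the trace used texture ID 5, so the
 * generator emits one global holding whatever name the driver assigns
 * at replay time. Remapping becomes a plain variable read. */

static uint32_t _tex_5;

/* Stand-in for glGenTextures() assigning a driver-chosen name. */
static uint32_t
fake_gen_texture(void)
{
   static uint32_t next = 100;
   return next++;
}

/* Generated replay code for the trace's gen/bind calls. */
static void replay_gen_textures(void) { _tex_5 = fake_gen_texture(); }
static uint32_t replay_bind_texture(void) { return _tex_5; /* no std::map */ }
```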

<h1 id="data-streaming">Data streaming</h1>

<p>With the main replay overhead taken care of, a major portion of replay time is now spent loading texture and buffer data. In large traces there can be &gt;10GiB of data, so loading everything upfront is not an option. I decided to create one thread for reading the data file and <code class="language-plaintext highlighter-rouge">nproc</code> decompression threads. The read thread waits once enough data has been loaded, which limits memory usage. Multiple decompression threads are needed because decompression is slower than reading the compressed data.</p>
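<p>The core of that scheme is a bounded producer/consumer queue. A minimal sketch (pthreads, hypothetical names, no real codec) might look like:</p>

```c
#include <pthread.h>
#include <stdbool.h>

/* Minimal sketch of the streaming scheme: one reader thread pushes
 * compressed chunks into a bounded queue, blocking when too much is in
 * flight (this is what caps memory), and several worker threads pop and
 * decompress them. The real implementation reads the trace data file
 * and uses an actual codec. */

#define QUEUE_CAP 4

struct chunk_queue {
   int items[QUEUE_CAP];
   int head, tail, count;
   bool done;
   pthread_mutex_t lock;
   pthread_cond_t not_full, not_empty;
};

static void queue_init(struct chunk_queue *q) {
   q->head = q->tail = q->count = 0;
   q->done = false;
   pthread_mutex_init(&q->lock, NULL);
   pthread_cond_init(&q->not_full, NULL);
   pthread_cond_init(&q->not_empty, NULL);
}

/* Reader side: blocks when QUEUE_CAP chunks are already waiting. */
static void queue_push(struct chunk_queue *q, int chunk) {
   pthread_mutex_lock(&q->lock);
   while (q->count == QUEUE_CAP)
      pthread_cond_wait(&q->not_full, &q->lock);
   q->items[q->tail] = chunk;
   q->tail = (q->tail + 1) % QUEUE_CAP;
   q->count++;
   pthread_cond_signal(&q->not_empty);
   pthread_mutex_unlock(&q->lock);
}

/* Reader side: signal end of stream so workers can exit. */
static void queue_finish(struct chunk_queue *q) {
   pthread_mutex_lock(&q->lock);
   q->done = true;
   pthread_cond_broadcast(&q->not_empty);
   pthread_mutex_unlock(&q->lock);
}

/* Worker side: returns false once the stream is drained. */
static bool queue_pop(struct chunk_queue *q, int *chunk) {
   pthread_mutex_lock(&q->lock);
   while (q->count == 0 && !q->done)
      pthread_cond_wait(&q->not_empty, &q->lock);
   if (q->count == 0) {
      pthread_mutex_unlock(&q->lock);
      return false;
   }
   *chunk = q->items[q->head];
   q->head = (q->head + 1) % QUEUE_CAP;
   q->count--;
   pthread_cond_signal(&q->not_full);
   pthread_mutex_unlock(&q->lock);
   return true;
}
```

<p>The reader calls <code class="language-plaintext highlighter-rouge">queue_push()</code> and stalls once <code class="language-plaintext highlighter-rouge">QUEUE_CAP</code> chunks are in flight; each decompression thread loops on <code class="language-plaintext highlighter-rouge">queue_pop()</code> until the stream is drained.</p>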

<h1 id="codegen-in-action">Codegen in action</h1>

<p>The results speak for themselves:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./glretrace --generate-c minecraft-perf minecraft-perf.trace
/Users/Cortex/Downloads/graalvm-jdk-23.0.1+11.1/bin/java
Rendered 0 frames in 79.4072 secs, average of 0 fps
$ cd minecraft-perf
$ echo "Invoke the superior build tool"
$ meson build --buildtype release
$ ninja -Cbuild
$ time ../glretrace build/minecraft-perf.so
info: Opening 'minecraft-perf.so'... (0.00668795 secs)
warning: Waited 0.0461142 secs for data (sequence = 19)
Rendered 1261 frames in 5.19587 secs, average of 242.693 fps

real    0m5.415s
user    0m5.429s
sys     0m4.983s
</code></pre></div></div>

<p>Nice.</p>

<p>Looking at <code class="language-plaintext highlighter-rouge">perf</code>, most CPU time is now spent in driver code or in streaming binary data for stuff like textures on a separate thread.</p>

<p><a href="https://www.supergoodcode.com/assets/glreplay/result_perf.png"><img src="https://www.supergoodcode.com/assets/glreplay/result_perf.png" alt="result_perf.png" /></a></p>

<p>If you are interested in trying this out yourself, feel free to build the <a href="https://github.com/apitrace/apitrace/pull/965">upstream PR</a> and report on <del>bugs</del> unintended features. It would also be nice to have DX support in the future, but that will be something for the dxvk developers unless I need something to procrastinate from doing RT work.</p>

<p>- Konstantin</p>]]></content><author><name></name></author><summary type="html"><![CDATA[First Time]]></summary></entry><entry><title type="html">Mesh Shaders In The Current Year</title><link href="https://www.supergoodcode.com/mesh-shaders-in-the-current-year/" rel="alternate" type="text/html" title="Mesh Shaders In The Current Year" /><published>2025-10-09T00:00:00+00:00</published><updated>2025-10-09T00:00:00+00:00</updated><id>https://www.supergoodcode.com/mesh-shaders-in-the-current-year</id><content type="html" xml:base="https://www.supergoodcode.com/mesh-shaders-in-the-current-year/"><![CDATA[<h1 id="it-happened">It Happened.</h1>

<p>Just a quick post to confirm that the OpenGL/ES Working Group has signed off on the release of <a href="https://github.com/KhronosGroup/OpenGL-Registry/pull/640">GL_EXT_mesh_shader</a>.</p>

<h1 id="credits">Credits</h1>
<p>This is a monumental release, the largest extension shipped for GL this decade, and the culmination of many, many months of work by AMD. In particular we all need to thank Qiang Yu (AMD), who spearheaded this initiative and did the vast majority of the work both in writing the specification and doing the core mesa implementation. Shihao Wang (AMD) took on the difficult task of writing actual CTS cases (not mandatory for EXT extensions in GL, so this is a huge benefit to the ecosystem).</p>

<p>Big thanks to both of you, and everyone else behind the scenes at AMD, for making this happen.</p>

<p>Also we have to thank the <a href="https://github.com/MCRcortex/nvidium">nvidium</a> project and its author, Cortex, for single-handedly pushing the industry forward through the power of Minecraft modding. Stay sane out there.</p>

<h1 id="support">Support</h1>
<p>Minecraft mod support is already underway, so expect that to happen “soon”.</p>

<p>The bones of this extension have already been merged into mesa over the past couple months. I opened an <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37788">MR to enable zink support</a> this morning since I have already merged the implementation.</p>

<p>Currently, I’m planning to merge the zink MR either just before the branch point next week or once RadeonSI merges its support, whichever comes first. This is out of respect: Qiang Yu did a huge lift for everyone here, and ideally AMD’s driver should be the first to advertise the extension, to reflect that. But the branch point is coming up in a week, and SGC will be going into hibernation at the end of the month until 2026, so this offer does have an expiration date.</p>

<p>In any case, we’re done here.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[It Happened.]]></summary></entry><entry><title type="html">Now We CAD</title><link href="https://www.supergoodcode.com/now-we-cad/" rel="alternate" type="text/html" title="Now We CAD" /><published>2025-09-16T00:00:00+00:00</published><updated>2025-09-16T00:00:00+00:00</updated><id>https://www.supergoodcode.com/now-we-cad</id><content type="html" xml:base="https://www.supergoodcode.com/now-we-cad/"><![CDATA[<h1 id="perf-must-increase">Perf Must Increase.</h1>

<p>After my last post, I’m sure everyone was speculating about the forthcoming zink takeover of the CAD industry. Or maybe just wondering why I’m bothering with this at all. Well, the answer is simple: CAD performance is all performance. If I can improve FPS in viewperf, I’m decreasing CPU utilization in all apps, which is generally useful.</p>

<p>As in the previous post, the catia section of viewperf was improved to a whopping 34fps against the reference driver (radeonsi) by eliminating a few hundred thousand atomic operations per frame. An interesting observation here is that while eliminating atomic operations in radeonsi does improve FPS there by ~5% (105fps), there is no bottlenecking, so this does not “unlock” further optimizations in the same way that it does for zink. I speculate this is because zink has radv underneath, which affects memory access across ccx in ways that do not affect radeonsi.</p>

<p>In short: a rising tide lifts all ships in the harbor, but since zink was effectively a sunken ship, it is rising much more than the others.</p>

<h1 id="even-more-improvements">Even More Improvements</h1>

<p>Since that previous post, I and others have been working quietly in the background on other improvements, all of which have landed in mesa main already:</p>

<p><a href="https://www.supergoodcode.com/assets/catia-quietly.png"><img src="https://www.supergoodcode.com/assets/catia-quietly.png" alt="catia-quietly.png" /></a></p>

<p>A nice 35% improvement, largely from three MRs:</p>
<ul>
  <li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37318">zink draw optimizations</a></li>
  <li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37277">zink vbo binding optimizations</a></li>
  <li><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36988">radv dynamic state optimizations</a></li>
</ul>

<p>That’s right. In my quest to maximize perf, I have roped in veteran radv developer and part-time vacation enthusiast, Samuel Pitoiset. Because radv is slow. <a href="https://github.com/zmike/vkoverhead">vkoverhead</a> exists to target noticeably slow cases, and by harnessing the forbidden power of rewriting the whole driver, it was possible for a lone Frenchman to significantly reduce bottlenecking during draw emission.</p>

<h1 id="this-isnt-even-my-final-form">This Isn’t Even My Final Form</h1>

<p>Obviously. I’m not about to say that I’ll only stop when I reach performance parity, but the FPS can still go up.</p>

<p>At this point, however, it’s becoming less useful (in zink) to look at flamegraphs. There’s only so much optimization that can be done once the code has been simplified to a certain extent, and eventually those optimizations will lead to obfuscated code which is harder to maintain.</p>

<p>Thus, it’s time to step back and look architecturally. What is the app doing? How does that reach the driver? Can it be improved?</p>

<p><code class="language-plaintext highlighter-rouge">GALLIUM_TRACE</code> is a great tool for this, as it logs the API stream as it reaches the backend driver, and there are parser tools to convert the output XML to something readable. Let’s take a look at a small cross-section of the trace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10043], [is_user_buffer = 0, buffer_offset = 7440, buffer.resource = resource_10043]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10044], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10045], [is_user_buffer = 0, buffer_offset = 7632, buffer.resource = resource_10045]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10046], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10047], [is_user_buffer = 0, buffer_offset = 7680, buffer.resource = resource_10047]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10048], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10049], [is_user_buffer = 0, buffer_offset = 7656, buffer.resource = resource_10049]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10050], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10051], [is_user_buffer = 0, buffer_offset = 7752, buffer.resource = resource_10051]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10052], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10053], [is_user_buffer = 0, buffer_offset = 7800, buffer.resource = resource_10053]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10054], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10055], [is_user_buffer = 0, buffer_offset = 7968, buffer.resource = resource_10055]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10056], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10057], [is_user_buffer = 0, buffer_offset = 7968, buffer.resource = resource_10057]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10058], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10059], [is_user_buffer = 0, buffer_offset = 8136, buffer.resource = resource_10059]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10060], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10061], [is_user_buffer = 0, buffer_offset = 8280, buffer.resource = resource_10061]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10062], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10063], [is_user_buffer = 0, buffer_offset = 8040, buffer.resource = resource_10063]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10064], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
pipe_context::set_vertex_buffers(pipe = context_2, num_buffers = 2, buffers = [[is_user_buffer = 0, buffer_offset = 0, buffer.resource = resource_10065], [is_user_buffer = 0, buffer_offset = 7608, buffer.resource = resource_10065]])
pipe_context::draw_vbo(pipe = context_2, info = [index_size = 2, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 0, max_index = 1257, primitive_restart = 0, restart_index = 0, index.resource = resource_10066], drawid_offset = 0, indirect = NULL, draws = [[start = 0, count = 1257, index_bias = 0]], num_draws = 1)
</code></pre></div></div>

<p>As expected, a huge chunk of the runtime is just <code class="language-plaintext highlighter-rouge">set_vertex_buffers</code> -&gt; <code class="language-plaintext highlighter-rouge">draw_vbo</code>. Architecturally, this leads to a lot of unavoidably wasted cycles in drivers:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">set_vertex_buffers</code> “binds” vertex buffers to the context and flags state updates</li>
  <li><code class="language-plaintext highlighter-rouge">draw_vbo</code> checks all of the driver’s update-able states, updates the flagged ones, and then emits draws</li>
</ul>

<p>But in the scenario where the driver can know ahead of time exactly what states will be updated, couldn’t that yield an improvement? For example, bundling these two calls into a single draw call would eliminate:</p>
<ul>
  <li>“binding” of vertex buffers</li>
  <li>vbo state update flagging</li>
  <li>draw-time validation</li>
  <li>calling multiple driver entrypoints</li>
</ul>
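<p>A minimal sketch of the two-call pattern and the merged entrypoint that eliminates it (illustrative names and flags, not the actual gallium interfaces):</p>

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative dirty-state machinery; real gallium state flags differ. */
enum { DIRTY_VERTEX_BUFFERS = 1u << 0 };

struct ctx {
   uint32_t dirty;
   int vbuf_updates; /* counts draw-time state emission work */
};

/* "binds" vertex buffers to the context and flags state updates */
static void set_vertex_buffers(struct ctx *c)
{
   c->dirty |= DIRTY_VERTEX_BUFFERS;
}

/* checks the flagged states, updates them, then emits the draw */
static void draw_vbo(struct ctx *c)
{
   if (c->dirty & DIRTY_VERTEX_BUFFERS) {
      c->vbuf_updates++;
      c->dirty &= ~DIRTY_VERTEX_BUFFERS;
   }
   /* ...emit draw... */
}

/* merged entrypoint: the buffers arrive with the draw, so there is no
   binding, no flagging, and no separate draw-time validation pass */
static void draw_vbo_buffers(struct ctx *c)
{
   c->vbuf_updates++;
   /* ...emit draw... */
}
```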

<p>In theory, it seems like this should be pretty good. And now that vertex buffer lifetimes have been reworked to use explicit ownership rather than garbage collection, it’s actually possible to do this. The optimal site for the optimization would be in threaded-context, where similar types of draw merging are already occurring.</p>

<p><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37420">The result</a> looks something like this in a comparable trace:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1141, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 163536, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 191032, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771328, buffer.resource = resource_29602]], draws = [[start = 1141, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1146, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 218528, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 246144, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771360, buffer.resource = resource_29602]], draws = [[start = 1146, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1151, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 273760, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 301496, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771392, buffer.resource = resource_29602]], draws = [[start = 1151, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1156, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 329232, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 357088, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771424, buffer.resource = resource_29602]], draws = [[start = 1156, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1161, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 384944, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 412920, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771456, buffer.resource = resource_29602]], draws = [[start = 1161, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1166, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 440896, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 468992, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771488, buffer.resource = resource_29602]], draws = [[start = 1166, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1171, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 497088, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 525304, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771520, buffer.resource = resource_29602]], draws = [[start = 1171, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1176, max_index = 11, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 553520, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 582000, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771552, buffer.resource = resource_29602]], draws = [[start = 1176, count = 11]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1187, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 610480, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 639080, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771584, buffer.resource = resource_29602]], draws = [[start = 1187, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1192, max_index = 6, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 667680, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 696424, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771616, buffer.resource = resource_29602]], draws = [[start = 1192, count = 6]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1198, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 725168, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 754032, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771648, buffer.resource = resource_29602]], draws = [[start = 1198, count = 5]], num_draws = 1)
pipe_context::draw_vbo_buffers(pipe = pipe_2, info = [index_size = 0, has_user_indices = 0, mode = 5, start_instance = 0, instance_count = 1, min_index = 1203, max_index = 5, primitive_restart = 0, restart_index = 0, index.resource = NULL], buffer_count = 3, buffers = [[is_user_buffer = 0, buffer_offset = 782896, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 811880, buffer.resource = resource_30210], [is_user_buffer = 0, buffer_offset = 771680, buffer.resource = resource_29602]], draws = [[start = 1203, count = 5]], num_draws = 1)
</code></pre></div></div>

<p>It’s more compact, which is nice, but how does the perf look?</p>

<p><a href="https://www.supergoodcode.com/assets/catia-vroom.png"><img src="https://www.supergoodcode.com/assets/catia-vroom.png" alt="catia-vroom.png" /></a></p>

<p>About another 40% improvement, now over 60fps: nearly double the endpoint of the last post. Huge.</p>

<p><em>And</em> this is driving ecosystem improvements which will affect other apps and games which don’t even use zink.</p>

<p>Stay winning, Open Source graphics.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Perf Must Increase.]]></summary></entry><entry><title type="html">Big Lifts</title><link href="https://www.supergoodcode.com/big-lifts/" rel="alternate" type="text/html" title="Big Lifts" /><published>2025-09-09T00:00:00+00:00</published><updated>2025-09-09T00:00:00+00:00</updated><id>https://www.supergoodcode.com/big-lifts</id><content type="html" xml:base="https://www.supergoodcode.com/big-lifts/"><![CDATA[<h1 id="new-record">New Record</h1>

<p>For months now I’ve been writing <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33813">increasingly</a> <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33946">unhinged</a> patchsets. Sometimes it might even seem like there <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34045">is no real point</a> to what I’m doing. Or that I’m just <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34054">churning code</a> to have something to do.</p>

<p>But I’m here today to tell you that finally, the long journey is over.</p>

<p>We have reached the promised land of perf.</p>

<h1 id="huge">Huge.</h1>

<p>Many months ago, I began examining viewperf, AKA the final frontier of driver performance. <em>What makes this the final frontier?</em> some of you might be asking.</p>

<p>Imagine an application which does 10,000 individual draws per frame, each with their own vertex buffer bindings. That’s a lot of draws.</p>

<p>Now imagine an application which does <strong>ten times that many draws per frame</strong>. This is viewperf, which represents common use cases of CAD-adjacent technologies. Where other applications hammer on the GPU, viewperf hammers on the CPU. It’s what separates the real developers from average, sane people.</p>

<p>So all those months ago, I ran viewperf on zink, and I ended up here:</p>

<p><a href="https://www.supergoodcode.com/assets/catia-before.png"><img src="https://www.supergoodcode.com/assets/catia-before.png" alt="catia-before.png" /></a></p>

<p>18fps. This is on a Threadripper 5975WX with RADV; not the most modern or powerful CPU, but it’s still pretty quick.</p>

<p>Then I loaded up radeonsi and got 100fps. Brutal.</p>

<h1 id="plumbing-the-abyss">Plumbing The Abyss</h1>

<p>Examining this was where I entered into realms of insanity not known to mere mortals. <code class="language-plaintext highlighter-rouge">perf</code> started to fail and give confusing results, other profilers just drew a circle around the driver and pointed to the whole thing as the problem area, and some tools just gave up entirely. No changes affected the performance in any way. This is when the savvy hacker begins profiling by elimination: delete as much code as possible and try to force changes.</p>

<p>Thus, I deleted a lot of code to see what would pop out, and eventually I discovered the horrifying truth: I was being bottlenecked by the sheer number of atomic operations occurring.</p>

<p>Like I said before, viewperf does upwards of 100,000 draw calls per frame. That means 100,000 vertex buffer binds (times two, because there are two vertex buffers), 100,000 index buffer binds, and a few shader changes sprinkled in. The way mesa/gallium works, every single vertex buffer and index buffer sent to the driver incurs multiple atomic operations along the way for refcounting: gallium uses refcounting rather than an ownership model because refcounting is much easier to manage. That adds up to upwards of 300,000 atomic operations per frame.</p>
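<p>The shape of the problem, in a simplified sketch (the real helper in gallium is <code class="language-plaintext highlighter-rouge">pipe_resource_reference</code>; this is an illustrative reduction, not the actual code):</p>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Simplified sketch of gallium-style refcounting: every buffer bind
   pays at least one atomic increment on the way in and one atomic
   decrement on the way out. */
struct resource {
   atomic_int refcount;
};

static void resource_reference(struct resource **dst, struct resource *src)
{
   if (*dst == src)
      return;
   if (src)
      atomic_fetch_add(&src->refcount, 1); /* one atomic per bind */
   if (*dst && atomic_fetch_sub(&(*dst)->refcount, 1) == 1) {
      /* last reference gone: destroy(*dst) */
   }
   *dst = src;
}
```

<p>Multiply those two atomics by every vertex buffer, index buffer, and draw, and the cacheline ping-pong becomes the frame time.</p>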

<p>Hackily deleting all the refcounting made the FPS go brrrrr; unfortunately, it was a long road to legitimately get there. A very, very long road. Six months, in fact. But all the unhinged MRs above landed, reducing the surface area of the refcounting to just buffers, which put me in a position to do <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36296">this pro gamer move</a> and remove all the refcounting from buffers too.</p>

<p>This works, roughly speaking, by enforcing ownership on the buffers and then releasing them when they are no longer used. Sounds simple, but plumbing it through all the gallium drivers without breaking everything was less so. Let’s see where moving to that model gets the numbers:</p>
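<p>Roughly, the handoff looks like this (illustrative names, not the real interface): the state tracker transfers its reference along with the bind, so the driver can retire the previous buffer with a plain store instead of atomic refcount traffic.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of an explicit-ownership bind. */
struct buffer {
   int released;
};

struct vb_slot {
   struct buffer *buf;
};

static void bind_owned(struct vb_slot *slot, struct buffer *buf)
{
   if (slot->buf)
      slot->buf->released = 1; /* sole owner: release when no longer used */
   slot->buf = buf;            /* take ownership of the caller's reference */
}
```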

<p><a href="https://www.supergoodcode.com/assets/catia-during.png"><img src="https://www.supergoodcode.com/assets/catia-during.png" alt="catia-during.png" /></a></p>

<p>One more frame. Tremendous.</p>

<p>But wait, there’s more. The other part of that MR further deletes all the refcounting in zink for buffers, fully removing the atomics. And…</p>

<p><a href="https://www.supergoodcode.com/assets/catia-after.png"><img src="https://www.supergoodcode.com/assets/catia-after.png" alt="catia-after.png" /></a></p>

<p>Blammo, that doubles the perf and manages to eliminate the bottleneck, which sets the stage for further improvements. The gap is still large, but it’s about to close real fast.</p>

<p>Shout out to Marek for heroically undertaking the review of this leviathan.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[New Record]]></summary></entry><entry><title type="html">Mesh Shader Progress</title><link href="https://www.supergoodcode.com/mesh-shader-progress/" rel="alternate" type="text/html" title="Mesh Shader Progress" /><published>2025-09-05T00:00:00+00:00</published><updated>2025-09-05T00:00:00+00:00</updated><id>https://www.supergoodcode.com/mesh-shader-progress</id><content type="html" xml:base="https://www.supergoodcode.com/mesh-shader-progress/"><![CDATA[<h1 id="vkcts-tests-27890--glcts-tests-227--percentage-of-vulkan-drivers-with-mesh-bugs-100">VKCTS Tests: 27,890 | GLCTS Tests: 227 | Percentage of Vulkan Drivers With Mesh Bugs: 100%</h1>

<p><a href="https://www.supergoodcode.com/assets/meshcts.png"><img src="https://www.supergoodcode.com/assets/meshcts.png" alt="meshcts.png" /></a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[VKCTS Tests: 27,890 | GLCTS Tests: 227 | Percentage of Vulkan Drivers With Mesh Bugs: 100%]]></summary></entry><entry><title type="html">Tiler Improvements</title><link href="https://www.supergoodcode.com/tiler-improvements/" rel="alternate" type="text/html" title="Tiler Improvements" /><published>2025-08-29T00:00:00+00:00</published><updated>2025-08-29T00:00:00+00:00</updated><id>https://www.supergoodcode.com/tiler-improvements</id><content type="html" xml:base="https://www.supergoodcode.com/tiler-improvements/"><![CDATA[<h1 id="super-late-code">Super Late Code</h1>

<p>Meant to blog about this last quarter, but somehow another two months went by and here we are.</p>

<p>A while back, I did some work to improve zink performance on tiling GPUs. Namely this entailed adding renderpass tracking into threaded-context, and also implementing command stream reordering, and inlining swapchain resolves, and framebuffer discards, and actually maybe it’s more than just “some” work. All of this amounted to improved performance by reducing memory bandwidth.</p>

<p>How much improved performance? All of it.</p>

<p>And then, around two months ago, a colleague told me he was no longer going to use zink on his tiling GPU.</p>

<h1 id="devastated">Devastated</h1>

<p>Some of you noticed that the blog has gone quiet in recent times. I’m going to take this opportunity to foist all the blame onto that colleague: to preserve his identity, let’s just call him Gabe.</p>

<p>Gabe came to me a few months ago and told me zink was too slow. Vulkan was better. Faster. More “reliable”.</p>

<p>I said there’s no way that could be true; I’ve put way more bugs into Vulkan than I have into zink.</p>

<p>Unblinking, he stared at me across the digital divide. I task-switched to important whitespace cleanups.</p>

<p>Time passed, and I pulled myself together. I compiled some app traces. Analyzed them. Did some deep thinking. There was one place where zink indeed could be less performant than this “Vulkan” thing. The final frontier of driver performance. Some call it graphics heaven.</p>

<p>I call it hell.</p>

<h1 id="web-browsers">Web Browsers</h1>

<p>Chrome is the web browser, and, statistically, everyone uses it. It ships on desktops and phones, embeds in apps, and even allows you to read this blog. Haters will say <em>No I uSe FiReFoX</em>, but they may as well be Netscape users in the year 2000.</p>

<p>In the past, Chrome defaulted to using GL, which made testing easy. Now, however, <code class="language-plaintext highlighter-rouge">--disable-features=Vulkan</code> is needed to return to the comfort of an API so reliable it no longer receives versioned updates. Looking at an apitrace of Chrome, I saw a disturbing rendering pattern that went something like this:</p>

<ul>
  <li>draw some element on a page using multisampled FBO1</li>
  <li>resolve FBO1 to texture1</li>
  <li>composite texture1 onto larger FBO2/texture2</li>
  <li>composite texture2 onto even larger, multisampled FBO3</li>
  <li>resolve FBO3 to swapchain</li>
  <li>present</li>
</ul>

<p>In this case, zink would correctly inline the FBO3/swapchain resolve at the end, but the intermediate multisampled rendering on <code class="language-plaintext highlighter-rouge">FBO1</code> would pay the full performance penalty of storing the multisampled image data and then loading it again for the separate resolve operation.</p>
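<p>Back-of-the-envelope, the cost is easy to see: with a separate resolve, the full multisampled image is stored out of tile memory and then loaded back in, whereas with the resolve inlined only single-sampled data ever leaves the tile. A rough bandwidth model (illustrative numbers only):</p>

```c
#include <assert.h>

/* store the multisampled data at renderpass end, then load it all
   back in for the separate resolve pass */
static long msaa_store_load_bytes(long w, long h, long samples, long bpp)
{
   return 2 * w * h * samples * bpp;
}

/* resolve inlined into the renderpass: only the single-sampled result
   leaves tile memory */
static long inlined_resolve_bytes(long w, long h, long bpp)
{
   return w * h * bpp;
}
```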

<p>I’d like to say it was simple to inline this intermediate resolve. That I just slapped <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35477">a single MR</a> into mesa and it magically worked. Unfortunately, nothing is ever that simple. There were <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35777">minor</a> <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36069">fixups</a> <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36521">all over</a> <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/36576">the place</a>. And this brought me to the real insanity.</p>

<p>Chrome has bugs too.</p>

<h1 id="literal-hell">Literal Hell</h1>

<p>Let’s take a concrete example: launch Chrome with <code class="language-plaintext highlighter-rouge">--disable-features=Vulkan</code> and check out this tiny SVG: <a href="https://www.supergoodcode.com/assets/chromebug.html">chromebug.html</a></p>

<p>This is most likely what you see:</p>

<p><a href="https://www.supergoodcode.com/assets/chromebug-good.png"><img src="https://www.supergoodcode.com/assets/chromebug-good.png" alt="chromebug-good.png" /></a></p>

<p>The reason you see this is because you are on a big, strong desktop GPU which doesn’t give a shit about load/store ops or uninitialized GPU memory. You’re driving a giant industrial bulldozer on your morning commute: traffic no longer exists and stop signals are fully optional. On a wimpy tiling GPU, however, things are different.</p>

<p>Using a recent version of zink, even on a desktop GPU, you can run the same Chrome browser using <code class="language-plaintext highlighter-rouge">ZINK_DEBUG=rp,rploads</code> to enable the same codepaths used by tilers and also clear all uninitialized memory to red. Now load the same SVG, and you’ll see this:</p>

<p><a href="https://www.supergoodcode.com/assets/chromebug-bad.png"><img src="https://www.supergoodcode.com/assets/chromebug-bad.png" alt="chromebug-bad.png" /></a></p>

<p>It took nearly a week of pair debugging and a new zink debug mode to prune down test cases and figure out what was happening. All around the composited SVG texture, memory is uninitialized.</p>

<p>But this only shows up on tiling GPUs. And only if the driver is doing near-lethal amounts of very legal renderpass optimizations.</p>

<p>This fast is too fast.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Super Late Code]]></summary></entry><entry><title type="html">Behind Schedule</title><link href="https://www.supergoodcode.com/behind-schedule/" rel="alternate" type="text/html" title="Behind Schedule" /><published>2025-07-01T00:00:00+00:00</published><updated>2025-07-01T00:00:00+00:00</updated><id>https://www.supergoodcode.com/behind-schedule</id><content type="html" xml:base="https://www.supergoodcode.com/behind-schedule/"><![CDATA[<h1 id="timelines">Timelines</h1>

<p>It’s hot out. I know this because Big Triangle allowed me a peek through my three-sided window for good behavior, and all the pixels were red. Sure am glad I’m inside.</p>

<p>Today’s a new day in a new month, which means it’s time to talk about new GL stuff. I’m allowed to do that once in a while, even though GL stuff is never actually new. In this post we’re going to be looking at <a href="https://registry.khronos.org/OpenGL/extensions/NV/NV_timeline_semaphore.txt">GL_NV_timeline_semaphore</a>, an extension everyone has definitely heard of.</p>

<p>Mesa has supported <a href="https://registry.khronos.org/OpenGL/extensions/EXT/EXT_external_objects.txt">GL_EXT_external_objects</a> for a long while, and it’s no exaggeration to say that this is the reference implementation: there are no proprietary drivers I’m aware of that can pass the super-strict piglit tests we’ve accumulated over the years. Yes, that includes Green Triangle. Also Red Triangle, but we knew that already; it’s in the name.</p>

<p><a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/35866">This MR</a> adds support for importing Vulkan timeline semaphores into GL and using them, which further improves interop-reliant workflows by eliminating binary semaphore requirements. Zink supports it anywhere that additionally supports <a href="https://registry.khronos.org/vulkan/specs/latest/man/html/VK_KHR_timeline_semaphore.html">VK_KHR_timeline_semaphore</a>, which is to say that any platform capable of supporting the base external objects spec will also support this.</p>
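<p>For anyone unfamiliar, a timeline semaphore is just a monotonically increasing 64-bit payload: signals raise the value, and a wait completes once the value reaches its target, so one semaphore can order an arbitrary number of submissions. A toy model of those semantics (not the actual GL or Vulkan entrypoints) looks like this:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of timeline semaphore semantics, not a real API. */
struct timeline {
   uint64_t value;
};

/* signals raise the payload; it never moves backwards */
static void timeline_signal(struct timeline *t, uint64_t v)
{
   if (v > t->value)
      t->value = v;
}

/* a wait completes once the payload reaches the target value */
static int timeline_wait_ready(const struct timeline *t, uint64_t v)
{
   return t->value >= v;
}
```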

<p>For testing, we get to have even more fun with the industry-standard <a href="https://gitlab.freedesktop.org/mesa/piglit/-/merge_requests/1022">ping-pong test</a> originally contributed by @gfxstrand. This verifies that timeline operations function as expected on every side of the API divide.</p>

<p>Next up: more optimizations. How fast is too fast?</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Timelines]]></summary></entry></feed>