I’ve had a few things I was going to blog about over the past month, but then news sites picked them up and I lost motivation because there’s only so many hours in a day that anyone wants to spend reading things that aren’t specification texts. Yeah, that’s my life now.
Anyway, a lot’s happened, and I’d try to enumerate it all but I’ve forgotten / lost track / don’t care. git log me if you’re interested. Some highlights:
More on the last one later. Like in a couple months. When I won’t get vanned for talking about it.
No, it’s not Half Life 3 / Portal 3 / L4D3.
Today’s post was inspired by interfaces: they’re the things that make code go brrrrr. Basically Legos, but for adults who never go outside. If you’ve written code, you’ve done it using an interface.
Graphics has interfaces too. OpenGL is an interface. Vulkan is an interface.
Mesa has interfaces. It’s got some neat ones like Gallium which let you write a whole GL driver without knowing anything about GL.
And then it’s got the DRI interfaces. Which, by their mere existence, answer the question “What could possibly be done to make WSI even worse than it already is?”
The DRI interfaces date way back to a time before the blog. A time when now-dinosaurs roamed the earth. A time when Vulkan was but a twinkle in the eye of Mantle, which didn’t even exist. I’m talking “Copyright 1998-1999 Precision Insight, Inc., Cedar Park, Texas” at the top of the file old.
The point of these interfaces was to let external applications access GL functionality. Specifically the xserver. This was before GLAMOR combined GBM and EGL to enable a better way of doing things that didn’t involve brain damage, and it was a necessary evil to enable cross-vendor hardware acceleration using Mesa. Other historical details abound, but this isn’t a textbook. The DRI interfaces did their job and enabled hardware-accelerated display servers for decades.
Now, however, they’ve become cruft. A hassle. A roadblock on the highway to a future where I can run zink on stupid platforms with ease.
The first step to admitting there’s a problem is having a problem. I think that’s how the saying goes, anyway. In Mesa, the problem is that any time I (or anyone else) want to do something related to the DRI frontend, like allow NVK to use zink by default, it has to go through DRI. Which means going through the DRI interfaces. Which means untangling a mess of unnecessary function pointers with versioned prototypes, meaning they can’t be changed without adding a new version of the same function and new codepaths which call the new version if available. And guess how many people in the project truly understand how all the layers fit together?
It’s a mess. And more than a mess, it’s a huge hassle any time a change needs to be made. Not only do the interfaces have to be versioned and changed, someone looking to work on a new or bitrotted platform has to first chase down all the function pointers to see where the hell execution is headed. Even when the function pointers always lead to the same place.
I don’t have any memes today.
This is my declaration of war.
DRI interfaces: you’re officially on notice. I’m coming for you.
…that this year is a lot busier than expected. Blog posts will probably come in small clusters here and there rather than with any sort of regular cadence.
But now I’m here. You’re here. Let’s get cozy for a few minutes.
I’m sure you’ve seen some news, you’ve been trawling the gitlab MRs, you’re on the #nouveau channels. You’re one of my readers, so we both know you must be an expert.
Zink on NVK is happening.
Those of you who remember the zink XDC talk know that this work has been ongoing for a while, but now I can finally reveal the real life twist that only a small number of need-to-know community members have been keeping under wraps for years: I still haven’t been to XDC yet.
Let me explain.
I’m sure everyone recalls the point in the presentation where “I” talked about progress made towards Zink on NVK. A lot of people laughed it off; oh sure, you said, that’s just the usual sort of joke we expect. But what if I told you it wasn’t a joke? That all of it was 100% accurate, it just hadn’t happened yet?
I know what you’re thinking now, and you’re absolutely correct. The me that attended XDC was actually time traveling from the future. A future in which Zink on NVK is very much finished. Since then, I’ve been slowly and quietly “backporting” the patches my future self wrote and slipping them into git.
Let’s look at an example.
20 Feb 2024 was a landmark day in my future-journal for a number of reasons, not the least due to the alarming effects of planetary alignment that you’re all no doubt monitoring. For the purposes of the current blog post that I’m now writing, however, it was monumental for a different reason. This was the day that noted zinkologist and current record-holder for Most Tests Fixed With One Line Of Code, Faith Ekstrand (@gfxstrand), would delve into debugging the most serious known issue in zink+nvk:
Yup, it’s another clusterfuck.
Now let me say that I had the debug session noted down in my journal, but I didn’t add details. If you haven’t been in #nouveau for a live debug session, it’s worth scheduling time around it. Get some popcorn ready. Put on your safety glasses and set up your regulation-size splatterguard, all the usual, and then…
Well, if I had to describe the scene, it’s like watching someone feed a log into a wood chipper. All the potential issues investigated one-by-one and eliminated into the pile of growing sawdust.
Anyway, it turns out that NVK (currently) does not expose a BAR memory type with host-visible and device-local properties, and zink has no handling for persistently mapped buffers in this scenario. I carefully cherry-picked the appropriate patch from my futurelog and rammed it through CI late at night when nobody would notice.
As a result, all GL games now work on NVK. No hyperbole. They just work.
Stay tuned for future updates backported from a time when I’m not struggling to find spare seconds under the watchful gaze of Big Triangle.
It’s been a slow start to the year, by which I mean I’ve been buried under an absolute deluge of all the things you can imagine and then also a blizzard. The literal kind, not the kind that used to make great games.
Anyway, it’s not all fun and specs in my capacity as CEO of OpenGL. Sometimes I gotta do Real Work. The number one source of Real Work, as always, is my old code, by which I mean the Mesa bug tracker.
Unfortunately, the thing is completely overloaded with NVIDIA bugs right now, so it was slim pickins.
Am I a boomer? Is this what being a boomer feels like? I really have lived long enough to see myself become the villain.
Next bug up is from this game called Valheim. I think it’s a LARPing chess game? Something like that? Don’t @ me.
This report came in hot over the break with some rad new shading techniques:
It looks way cooler if you play the trace, but you get the idea.
First question: what in the Sam Hill is going on here?
Apparently RADV_DEBUG=hang fixes it, which was a curious one since no other env vars affected the issue. This means the problem is somehow caused by an issue related to the actual Vulkan queue submissions, since (according to legendary multipatch chef Samuel “PLZ SEND REVIEWS!!” Pitoiset) this flag synchronizes the queue after every submit.
It’s therefore no surprise that renderdoc was useless. When viewed in isolation, each frame is perfect, but when played at speed the synchronization is lost.
My first stops, as anyone would expect, were the sites of queue submission in zink. This means flush and present.
Now, I know not everyone is going to be comfortable taking this kind of wild, unhinged guess like I did, but stick with me here. The first thing I checked was a breakpoint on zink_flush(), which is where API flush calls filter through. There were the usual end-of-frame hits, but there were a fair number of calls originating from glFenceSync, which is the way a developer can subtly inform a GL driver that they definitely know what they’re doing.
So I saw these calls coming in, and I stepped through zink_flush(), and I reached this spot:
if (!batch->has_work) {
   <-----HERE
   if (pfence) {
      /* reuse last fence */
      fence = ctx->last_fence;
   }
   if (!deferred) {
      struct zink_batch_state *last = zink_batch_state(ctx->last_fence);
      if (last) {
         sync_flush(ctx, last);
         if (last->is_device_lost)
            check_device_lost(ctx);
      }
   }
   if (ctx->tc && !ctx->track_renderpasses)
      tc_driver_internal_flush_notify(ctx->tc);
} else {
   fence = &batch->state->fence;
   submit_count = batch->state->usage.submit_count;
   if (deferred && !(flags & PIPE_FLUSH_FENCE_FD) && pfence)
      deferred_fence = true;
   else
      flush_batch(ctx, true);
}
Now this is a real puzzler, because if you know what you’re doing as a developer, you shouldn’t be reaching this spot. This is the penalty box where I put all the developers who don’t know what they’re doing, the spot where I push up my massive James Webb Space Telescope glasses and say, “No, ackchuahlly you don’t want to flush right now.” Because you only reach this spot if you trigger a flush when there’s nothing to flush.
OR DO YOU?
For hahas, I noped out the first part of that conditional, ensuring that all flushes would translate to queue submits, and magically the bug went away. It was a miracle. Until I tried to think through what must be happening for that to have any effect.
The reason this was especially puzzling is the call sequence was:
which means the last flush was optimized out, instead returning the fence from the end-of-frame flush. And these should be identical in terms of operations the app would want to wait on.
Except that there’s a present in there, and technically that’s a queue submission, and technically something might want to know if the submit for that has completed?
Why yes, that is stupid, but here at SGC, stupidity is our sustenance.
Anyway, I blasted out a quick fix, and now you can all go play your favorite chess sim on your favorite driver again.
It’s been a long break for the blog, but now we’re back and THE MEME FACTORY IS OPEN FOR BUSINESS.
—is what I’d say if it were any other year. But it’s not any other year. This is 2024, and 2024 is a very special year.
It’s the year a decades-old plan has finally yielded its dividends.
You’ve all heard certain improbable claims before. Big Triangle this. Big Triangle that. Everyone knows who they are. Some have even accused me of being a shill for Big Triangle from time to time. At last, however, I can finally pull off my mask to reveal the truth for the world.
I was born for a single purpose. As a child, I was grouped in with a number of other candidates. We were trained. Tested. Forged. Unshakable bonds grew between us, bonds we’ll never forget. Bonds that were threatened and broken again and again through harrowing selection processes that culled our ranks.
In time, I was the only one remaining. The only one who survived that brutal gauntlet to fulfill an ultimate goal.
The goal of infiltrating Big Triangle.
More time passed. Days. Months. Years. I continued my quiet training, never letting on to my true purpose.
Now, finally, I’ve achieved the impossible. I’ve attained a status within the ranks of Big Triangle that leaves me in command of vast, unfathomable resources.
I have become an officer.
I am the chair.
Now is the time to rise up, my friends. We must take back the triangles—those big and small, success green and failure red, variable rate shaded and fully shaded, all of them together. We must take them and we must fight. No longer will our goals remain the mere unfulfilled dreams of our basement-dwelling forebears!
OpenGL 10.0 by 2025!
Compatibility Profile shall be renamed ‘SLOW MODE’
OpenGL ES shall retroactively convert to a YEAR-MONTH versioning scheme with quarterly releases!
Depth values shall be uniformly scaled across all hardware and platforms!
XFB shall be outlawed!
Linux game ports shall no longer link to LLVM!
Coherent API error messages shall be printed!
Vendors which cannot ship functional Windows GL drivers shall ship Zink!
Native GL drivers on mobile platforms shall be outlawed!
gl_PointSize shall be replaced by the constant ‘1.0’ in all cases!
Mesh and ray-tracing extensions from NVIDIA shall become core functionality!
GLX shall be deleted and forgotten!
All bug reports shall contain at least one quality meme in the OP as a form of spam prevention!
Rise up and join me, your new GL/ES chair, in the glorious revolution!
Obviously this is all a joke (except the part where I’m the 🪑, that’s 100% real af), but I still gotta put a disclaimer here because otherwise I’m gonna be in biiiiig trouble if this gets taken seriously.
Happy New Year. I missed you.
It’s a busy, busy week here. So busy I’m slipping on my blogging. But that’s okay, because here’s one last big technical post about something I hate.
Swapchain readback.
I’m not alone in drinking the haterade on this one, but GL makes it especially easy to footgun yourself by not providing explicit feedback that you’re footgunning yourself.
I recently encountered a scenario in REDACTED where this behavior was commonplace. The command stream looked roughly like this:
And this happened on every single frame (???).
This isn’t pretty. Zink has an extremely conformant method of performing swapchain readback which definitely works without issues in all cases. I’d explain it, but it wouldn’t make either of us happy, and I’ve got so much other stuff to do that I couldn’t possibly… Oh, you really want to know? Well don’t say I didn’t warn you.
Vulkan doesn’t allow readback from swapchains. By this, I mean:
Combined, once you have presented a swapchain image, you’re screwed.
…According to the spec, that is. In the real world, things work differently.
Zink takes advantage of this “real world” utilization to implement swapchain readback. In short, the only method available is to spam present/acquire on the swapchain until the last-presented image is reacquired. Then it can be read back, and the image data is (probably) the same as when it was presented.
This is not a speedy method of implementing readback. It requires a full sync, and it was designed for the purpose of passing unit tests, which it does perfectly. Performance was never a concern, because why would anyone ever be trying to do readback in… Why would anyone ever be trying to do readback in a performance-sensitive… Using OpenGL, why would anyone ever be…
Anyway, this is very unperformant, and here at SGC we hate all things of that nature. Given that I had my real world scenario from REDACTED in which this was happening every frame, something had to be done.
This solution isn’t performant in the absolute sense either, but it’s massively faster than what was happening previously. Once zink detects an app repeatedly footgunning itself at full speed, it activates readback mode for a swapchain and maintains a staging copy of every frame. This enables the image data to be read back at any time without synchronization at the cost of an extra full-frame copy. This roughly doubles FPS in the case I was testing, which is pretty good.
The functionality is already merged for the upcoming 23.3 release.
Footgun as hard as you want.
As everyone knows, Red Hat’s top RustiCL expert, Karol “But it’s only 10 o’clock?” Herbst, has been hard at work beating Mesa/Zink/RustiCL into shape. That effort continues to bear fruit, and with the merge of an upcoming MR it should be possible to pass OpenCL conformance with zink on multiple platforms.
This will make zink THE FIRST EVER CONFORMANT VULKAN-BASED OPENCL IMPLEMENTATION.
Great work all around. For up-to-the-second progress reports on this ecosystem-critical topic, don’t forget to follow Karol on social media.
After yesterday’s post, I’m sure my thousands of readers stampeded to install the latest zink and run their system with it, and I salute you for your hard work in finding all those new ways to crash your systems.
Some of those crashes, however, are not my bugs. They’re system bugs.
In particular, any of you still using Xorg instead of Wayland will want to create this file:
$ cat /etc/X11/xorg.conf.d/30-dmabuf.conf
Section "ServerFlags"
Option "Debug" "dmabuf_capable"
EndSection
This makes your xserver dmabuf-capable, which makes running things with zink far more likely to succeed.
Another problem you’re likely to have is this console error:
DRI3 not available
failed to load driver: zink
Specifically you’re likely to have this on AMD hardware, and the cause is almost certainly that you’ve installed some footgun package with a naming variation on xf86-video-amdgpu.
Delete this package.
Just delete it. I don’t know why distros still make it available, but if you have it installed then you’re just footgunning yourself.
If you’re still having problems after checking for both of these issues, try turning your computer on.
As readers are no doubt aware by now, SGC goes into hibernation beginning around November, and that time is nearly upon us once more. To cap out another glorious year of shitposting, er, highly technical and informative blogging, I’ll be attempting to put up a newsworthy post every day.
This is Day 1.
2023 has seen great strides in the zink ecosystem:
And there’s plenty more, of course, but running throughout all this progress has been one very minor, very annoying wrinkle: MESA_LOADER_DRIVER_OVERRIDE=zink has to be specified in order to use zink, even if no other GL drivers exist on the system.
Over a year ago I attempted to enable automatic zink loading if a native driver could not be loaded. It was a reasonable first attempt, but it had issues with driver loading in scenarios where hardware drivers were not permitted.
Work has slowly progressed in Mesa since that time, and various small changes have gradually pushed the teetering tower that is GLX/EGL in the direction anyone and everyone wanted, full stop.
The result is that on zink-enabled systems, loader environment variables will no longer be necessary as of the upcoming Mesa 23.3 release. If zink is your only GL driver, you will get zink rather than an automatic fallback to swrast.
I can’t imagine anyone will need it, but remember that issues can be reported here.
As everyone knows, SGC goes into yearly hibernation beginning in November. Leading up to that point has been a mad scramble to nail down all the things, leaving less time for posts here.
But there have been updates, and I’m gonna round ‘em all up.
Friend of the blog and future Graphics scientist with a PhD in WTF, Konstantin Seurer, has been hard at work over the past several weeks. Remember earlier this year when he implemented VK_EXT_descriptor_indexing for Lavapipe? Well he’s at it again, and this time he’s aimed for something bigger.
He’s now implemented raytracing for Lavapipe.
It’s a tremendous feat, one that sets him apart from the other developers who have not implemented raytracing for a software implementation of Vulkan.
I blogged (or maybe imagined blogging) about RustiCL progress on zink last year at XDC, specifically the time renowned pubmaster Karol Herbst handcuffed himself to me and refused to divulge the location of the key (disguised as a USB thumb drive in his laptop) until we had basic CL support functioning in a pair programming exercise that put us up against the unnaturally early closing time of Minneapolis pubs. That episode is finally turning into something useful as CL support for zink will soon be merged.
While I can’t reveal too much about the performance as of yet, what I can say now is that it’s roughly 866% faster.
A number of longstanding bugs have recently been fixed.
Anyone who has tried to play one of the modern Wolfenstein GL games on RADV has probably seen this abomination:
Wolfenstein Face affects a very small number of apps. Actually just the Wolfenstein (The New Order / The Old Blood) games. I’d had a ticket open about it for a while, and it turns out that this is a known issue in D3D games which has its own workaround. The workaround is now going to be applied for zink as well, which should resolve the issue while hopefully not causing others.
Since the dawn of time, experts have tried to obtain traces from games with rendering bugs, but some of these games have historically been resistant to tracing.
This affects (at least) Wolfenstein: The Old Blood and DOOM2016, but the problem has been identified, and a fix is on the way.
After a number of universally-reviled hacks, Zink should now work fine in both Wayland and Surfaceless EGL configurations.
Any other, lesser blogger would’ve saved this for another post in order to maximize their posting frequency metric, but here at SGC the readers get a full meal with every post even when they don’t have enough time to digest it all at once. Since I’m not going to XDC this year, consider this the thing I might have given a presentation on.
During my executive senior keynote seminar presentation workshop on zink at last year’s XDC, I brought up tiler performance as one of the known deficiencies. Specifically this was in regard to how tilers need to maximize time spent inside renderpasses and avoid unnecessary load/store operations when beginning/ending those renderpasses, which required either some sort of Vulkan extension to enable deferred load/store op setting OR command stream parsing for GL.
While I did work on a number of Vulkan extensions this year, deferred load/store ops wasn’t one of them.
So it was that I implemented renderpass tracking for Threaded Context to scan the GL command stream in the course of recording it for threaded dispatch. The CPU overhead is negligible (~5% on a couple extremely synthetic drawoverhead cases and nothing noticeable in apps), while the performance gains are staggering (~10-15x speedup in AAA games). All in all, it was a painful process but one that has yielded great results.
The gist of it, as I’ve described in previous posts that I’m too lazy to find links for, is that framebuffer attachment access is accumulated during TC command recording such that zink is able to determine which load/store ops are needed. This works great so long as nothing unexpected splits the renderpass. “Unexpected” in this context refers to one of the following scenarios:
The final issue remaining for renderpass tracking has been this third scenario: any time the GL frontend needs to sync TC, renderpass metadata is split. The splitting is such that a single renderpass becomes two because the driver must complete execution on the currently-recorded metadata in order to avoid deadlocking itself against the waiting GL frontend, but then the renderpass will continue after the sync. While this happens in a very small number of scenarios, one of them is quite common.
Texture uploading.
There are (currently) three methods by which TC can perform texture uploads:
Eagle-eyed readers will notice that I’ve already handled the “problem” case described above; in order to avoid splitting renderpasses, I’ve written some handling which rewrites texture uploads into a sequence of N asynchronous buffer2image copies, where N is either 1 or $height depending on whether the source data’s stride matches the image’s stride. In the case where N is not 1, this can result in e.g., 4096 copy operations being enqueued for a 4096x4096 texture atlas. Even in the case where N is 1, it still adds an extra full copy of the texture data. While this is still more optimal than splitting a renderpass, it’s not optimal in the absolute sense.
You can see where this is going.
Optimal Threaded Context execution is the state when the GL frontend is recording commands while the driver thread is deserializing those commands into hardware-specific instructions to submit to the GPU. Visually, it looks like this Halloween-themed diagram:
Ignoring the small-upload case, the current state of texture uploading looks like one of the following Halloween-themed diagrams:
To maintain maximum performance, TC needs to be processing commands asynchronously in the driver thread while the GL frontend continues to record commands for processing. Thus, to maintain maximum performance during texture uploads, the texture upload needs to occur (without copies) while the driver thread continues executing.
Looking at this problem from a different perspective, the case that needs to be avoided at all costs is the case where the GL frontend syncs TC execution. The reason why this sync exists is to avoid accidentally uploading data to an in-use image, which would cause unpredictable (but definitely wrong) output. In this context, in-use can be defined as an image which is either:
On the plus side, pipe_context::is_resource_busy exists to query the second of these, so that’s solved. On the minus side, while TC has some usage tracking for buffers, it has nothing for images, and adding such tracking in a performant manner is challenging.
To figure out a solution for TC image tracking, let’s examine the most common problem case. In games, the most common scenario for texture uploading is something like this:
For such a case, it’d be trivial to add a seen flag to struct threaded_resource and pass the conditional if the flag is false. Since it’s straightforward enough to evaluate when an image has been seen in TC, this would suffice. Unfortunately, such a naive (don’t @ me about diacritics) implementation ignores another common pattern:
For this scenario, the staging image is reused, requiring a bit more tracking in order to accurately determine that it can be safely used for uploads.
The solution I’ve settled on is to use a derivative of zink’s resource tracking. This adds an ID for the last-used batch to the resource, which can then be checked during uploads. When the image is determined idle, the texture data is passed directly to the driver for an unsynchronized upload similar to how unsynchronized buffer uploads work. It’s simple and hasn’t shown any definitive performance overhead in my testing.
For it to really work to its fullest potential in zink, unfortunately, requires VK_EXT_host_image_copy to avoid further staging copies, and nobody implements this yet in mesa main (except Lavapipe, though also there’s this ANV MR). But someday more drivers will support this, and then it’ll be great.
As far as non-tiler performance gains from this work, it’s hard to say definitively whether they’ll be noticeable. Texture uploads during loading screens are typically intermixed with shader compilation, so there’s little TC execution to unblock, but any game which uses texture streaming may see some slight latency improvements.
The only remaining future work here is to further enable unsynchronized texture uploads in zink by adding a special cmdbuf for unsynchronized uploads to handle the non-HIC case. Otherwise performance should be pretty solid across the board.