Dumber

It’s Not Maintenance5

I just got back from lunch and have to work off some cals, and that means it’s time for another big lift on the blog. Today’s topic: how dumb can a driver’s compiler stack be?

As I outlined in the previous post, zink’s compiler stack is about to get a whole lot dumber for various reasons. But what exactly does that look like?

Lesser bloggers would simply link to the MR and say “figure it out”.

Here at SGC, however, I put in the extra effort so my readers can comprehend all the stringy awfulness that goes into each individual graphical sausage that this triangle factory is pumping out.

Let’s get at it.

Step 1: How Much Work Is This, Exactly?

The key point of using the theoretical new NIR linker (that definitely exists and will be merged in finite time) is that it requires drivers to accept lowered i/o. This means, effectively, that zink must begin consuming lowered i/o as soon as it receives shaders. Naturally the first step to that was evaluating all the shader passes which operate on specific i/o variables using derefs (AKA “explicit” i/o):

lower_64bit_vertex_attribs
lower_64bit_vars
split_blocks
lower_bindless_io
rewrite_read_as_0

The first four are called from zink_shader_create, the first time zink sees new shaders, while the last one is called zink_compiler_assign_io. As shaders won’t have derefs again until just before they go through NTV, they’ll all have to be…

What’s that you say, creator of the patented Delete The Code methodology and planar YUV expert, Faith Ekstrand? I can just delete some of this code?

That sounds like a pretty smart idea. Looking at the list again, and then cross-referencing against all the features lowered i/o provides, and then pushing up my glasses so I can read the very real documentation that nir has, let’s see where that leads:

~~lower_64bit_vertex_attribs~~
- nir_lower_io_lower_64bit_to_32 is available during i/o lowering, so this can all be deleted
lower_64bit_vars
- the drivers that need this won’t magically grow 64bit support for shader-local variables, but the parts that operate specifically on i/o can be deleted
~~split_blocks~~
- this was written to improve XFB inlining and reduce the number of “extra” outputs needed to generate XFB data, but if new, location-based variables are being generated anyway this is no longer needed
lower_bindless_io
- this converts bindless texture i/o (because obviously you can pass bindless texture handles between shader stages) to something that is spirv-legal
- needs to be updated
rewrite_read_as_0
- this handles uninitialized reads from mismatched shader interfaces (i.e., the consumer reads more components than the producer writes)
- needs to be updated

Not actually that much work, huzzah.

Step 2: New Variables

As in the flowchart, this process involves taking explicit i/o, converting to lowered i/o, then converting back to explicit. Explicit i/o is characterized by using derefs to explicit variables for access, which means variables are needed. A work-intensive benefit to this means simpler variables: since lowered i/o is characterized by location-based access to components, the subsequent conversion back to explicit i/o can use entirely new variables, and since these variables are location-based, there’s no need to retain any* of the gross struct/array typing that GLSL yields.
* except where arrays are indirectly accessed

For those of you who are truly in the know, this means goku in his SSJB form

struct TestStruct {
   dmat2x3 a[2];
   mat2x3 b[2];
   dvec2 c;
};
layout (location = 0, xfb_offset = 0) flat out TestStruct goku;

gets blasted into a series of smaller and more vincible variables:

decl_var shader_out INTERP_MODE_FLAT dvec3 goku#0 (VARYING_SLOT_VAR2.xyz, 2, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#1 (VARYING_SLOT_VAR4.xyz, 4, 0)
decl_var shader_out INTERP_MODE_FLAT dvec3 goku#2 (VARYING_SLOT_VAR6.xyz, 6, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#3 (VARYING_SLOT_VAR8.xyz, 8, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#4 (VARYING_SLOT_VAR9.xyz, 9, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#5 (VARYING_SLOT_VAR10.xyz, 10, 0)
decl_var shader_out INTERP_MODE_FLAT vec3 goku#6 (VARYING_SLOT_VAR11.xyz, 11, 0)
decl_var shader_out INTERP_MODE_FLAT dvec2 goku#7 (VARYING_SLOT_VAR12.xy, 12, 0)

Beautiful and easy to parse. There’s only one snag: I gotta do this manually.

Long-time fans of the blog will recall some wild ravings in the past where I described a pass I wrote to handle a similar issue. lower_64bit_vars is that pass, and it both splits variables containing 64bit types into 32bit types and then rewrites all access to them to use those new types.

And now I have to do basically the same thing. Again. But in a different enough way that none of the code is reusable.

The process for doing this variable rewrite is split in three:

scan the existing variables and make a table linking them to all the locations/components they consume
scan the shader for access to these variables and determine which ones have indirect access
scan the shader for access to these variables again, this time creating new vector-based variables which consume at most a single location*
- except for variables accessed indirectly, which need to retain arrayness for indirect access

But then there’s also the bonus step (everyone loves bonuses!) of scanning all the new variables and comparing them against the original variables to ensure they have the same number of per-location components (i.e., if the original variable consumes all components for a given location, the new one must too) in order to maintain shader interface compatibility, and for all the locations where a mismatch is detected, single-component variables have to be inserted, and they have to have associated access added too so various optimizers don’t delete them again, and it’s obviously one of the first things anyone embarking on this idiotic journey would consider and not a last-second thing that someone would only realize after running a series of esoteric piglit tests and seeing bizarre failures.

Variables. Done.

Step 3: Explicit Again

The next step is where things get really stupid, because this is where things need to happen so that the shader goes back to having all the derefs and explicit variable access it used to have before some idiot went and deleted them.

I called this the add_derefs pass because I’m a creative type. An auteur.

For this, all the i/o variables need to be iterated through, and for each variable, scan the shader for access, where “access” means the location and component are consumed by the variable. And also its fbfetch-edness matches. Then take this lowered load/store access, krangle in whatever possibly-indirect derefs the variable needs to mimic the lowered operation, and write in a new explicit load/store access.

Except also I forgot to mention that i/o lowering needs to lower interpolation instructions, which are also (currently) in explicit deref format. And these explicit interpolation instructions get converted to lowered ones, and then sometimes a load_deref becomes load_barycentric_centroid. And you know (lol) it wouldn’t be a real adventure (lol) if a zink change didn’t uncover (lol) some incredibly obscure and opaque (lol) llvmpipe bug! So then there’s the usual spelunking through there, and whispered backchannel discussions and cursing with Dave, and OF FUCKING COURSE IT’S TGSI AGAIN but we got it done.

Also it’s possible there might be a future where llvmpipe doesn’t use TGSI but don’t quote me (I’ll deny it to my grave) and if anyone asks you didn’t hear it from me.

Step 3a: Done

You’d think by the way I just went off on my usual TGSI rant that I was done exploring this section, but think again because none of us asked what gl_ClipDistance or gl_CullDistance thought about any of this.

Well I asked, and they’re not happy.

Clip/cull distance are ~~stupid~~weird ones because they’re array[8] variables that consume two locations. And that means all the calculations/heuristics for accessing arrays that work for every other array are broken for these.

But it’s fine, because this is zink and the whole thing is just a jenga tower of hacks all the way down anyway.

Step 4: Done

I’ll be completely and brutally honest with you, this all worked perfectly the first time I ran it.

On NVK, that is, which, as I mentioned in my historic XDC keynote, has been relying on the now-merged NIR 2.0 since last year. Truly a driver living in the future.

Other drivers, however, required considerably more work to make CI explode. Sorry, I meant not explode. Obviously. Totally a joke. The absolute state of CI is 100% not the fault of this lowered i/o conversion.

Anyway, the clear choice once parity was achieved was to then start deleting code.

Remember all that gross NTV code I linked in the previous post? Gone.

More stupid XFB code that’s been jenga-ing around for years? Gone.

Obscure ticket from years ago? Fixed incidentally

 src/compiler/nir/nir_passthrough_gs.c                |    2 +-
 src/gallium/auxiliary/nir/nir_to_tgsi_info.c         |    4 +
 src/gallium/drivers/zink/nir_to_spirv/nir_to_spirv.c |  412 +------------------------------------
 src/gallium/drivers/zink/zink_compiler.c             | 1081 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
 src/gallium/drivers/zink/zink_compiler.h             |    3 +-
 src/gallium/drivers/zink/zink_draw.cpp               |    2 +-
 src/gallium/drivers/zink/zink_program.c              |    8 +-
 src/gallium/drivers/zink/zink_types.h                |    6 +-
 8 files changed, 736 insertions(+), 782 deletions(-)

And as seen by the statistics here another bonus ticket fixed too through the magic of code deletion.

Step 5: The Struggle Continues

I didn’t even get to mention the great things that happened related to maintenance5 yet. Be sure to read again next week when I inevitably shovel more garbage onto the internet in the form of an unfortunately large blog post riddled with memes that obfuscate the truly interesting parts.

Written on August 11, 2023