How To Bug

This Is A Serious Blog

I posted some fun fluff pieces last week to kick off the new year, but now it’s time to get down to brass tacks.

Everyone knows adding features is just flipping on the enable button. Now it’s time to see some real work.

If you don’t like real work, stop reading. Stop right now. Now.

Alright, now that all the haters are gone, let’s put on our bisecting snorkels and dive in.

Regressions Suck

The dream of 2022 was that I’d come back and everything would work exactly how I left it. All the same tests would pass, all the perf would be there, and my driver would compile.

I got two of those things, which isn’t too bad.

After spending a while bisecting and debugging last week, I attributed a number of regressions to RADV problems which probably only affect me, since there are no Vulkan CTS cases for them (yet). But today I came to the last of the problem cases: dEQP-GLES31.functional.tessellation_geometry_interaction.feedback.tessellation_output_quads_geometry_output_points.

There’s nothing too remarkable about the test. It’s XFB, so, according to Jason Ekstrand, future Head of Graphic Wows at Pixar, it’s terrible.

What is remarkable, however, is that the test passes fine when run in isolation.

Here We Go Again

Anyone who’s anyone knows what comes next.

  • You find the caselist of the other 499 tests that were run in this block
  • You run the caselist
  • You find out that the test still fails in that caselist
  • You tip your head back to stare at the ceiling and groan

Then it’s another X minutes (where X is usually between 5 and 180, depending on test runtimes) of slowly paring down the caselist to the sequence that actually triggers the failure. For those not in the know, this type of failure points to a pathological driver bug: state left behind by earlier tests changes the results of a later one, so the same test behaves differently depending on what ran before it.

There is, to my knowledge, no ‘automatic’ way to determine exactly which tests from a caselist are required to trigger this type of failure. It would be great if there were; it would save me (and probably others who are similarly unaware) considerable time doing this type of caselist fuzzing.
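For what it’s worth, the manual process is basically a hand-driven greedy reduction, and if I were to script it, it might look roughly like the sketch below. To be clear, this is only a sketch: the binary path, the caselist.txt input file, the use of --deqp-caselist-file/--deqp-log-filename, and the crude “any Fail in the log means the target failed” check are all assumptions on my part, not something battle-tested.

#!/usr/bin/env python3
# Sketch of scripting the caselist paring: greedily drop tests whose removal
# still leaves the target test failing. One dEQP run per candidate, so still slow.
import os
import subprocess
import tempfile

DEQP = "./deqp-gles31"  # hypothetical path to the dEQP binary
TARGET = ("dEQP-GLES31.functional.tessellation_geometry_interaction."
          "feedback.tessellation_output_quads_geometry_output_points")

def target_fails(preceding):
    """Run the preceding tests followed by the target; crude check for a Fail."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(preceding + [TARGET]) + "\n")
        caselist = f.name
    log = caselist + ".qpa"
    try:
        subprocess.run([DEQP,
                        f"--deqp-caselist-file={caselist}",
                        f"--deqp-log-filename={log}"],
                       check=False, capture_output=True)
        # Assumes the preceding tests all pass, so any Fail belongs to the target.
        with open(log) as results:
            return 'StatusCode="Fail"' in results.read()
    finally:
        os.unlink(caselist)
        if os.path.exists(log):
            os.unlink(log)

def pare_down(cases):
    """Drop one test at a time, keeping it only if it's needed to trigger the bug."""
    i = 0
    while i < len(cases):
        trial = cases[:i] + cases[i + 1:]
        if target_fails(trial):
            cases = trial   # not needed for the failure; drop it
        else:
            i += 1          # part of the triggering sequence; keep it
    return cases

if __name__ == "__main__":
    # caselist.txt: the original ~500-test block, one case name per line
    with open("caselist.txt") as f:
        cases = [c.strip() for c in f if c.strip() and c.strip() != TARGET]
    for case in pare_down(cases) + [TARGET]:
        print(case)

A smarter chunked, ddmin-style reduction would cut the run count way down, but even the dumb version beats staring at the ceiling between runs.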

Finally, I was left with this shortened caselist:

dEQP-GLES31.functional.shaders.builtin_constants.tessellation_shader.max_tess_evaluation_texture_image_units
dEQP-GLES31.functional.tessellation_geometry_interaction.feedback.tessellation_output_quads_geometry_output_points
dEQP-GLES31.functional.ubo.random.all_per_block_buffers.25

What Now?

Ideally, I’d be able to use something like gfxreconstruct for this: record two captures, one of the test failing in the caselist and one of it passing in isolation, and then compare them.

Here’s an excerpt from that attempt:

"[790]vkCreateShaderModule": {
    "return": "VK_SUCCESS",
    "device": "0x0x4",
    "pCreateInfo": {
        "sType": "VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO",
        "pNext": null,
        "flags": 0,
        "codeSize": Unhandled VkFormatFeatureFlagBits2KHR,
        "pCode": "0x0x285c8e0"
    },
    "pAllocator": null,
    "[out]pShaderModule": "0x0xe0"
},

Why is it trying to print an enum value for codeSize, you might ask?

I’m not the only one to ask, and it’s still an unresolved mystery.

I was successful in doing the comparison with gfxreconstruct, but it yielded nothing of interest.

Puzzled, I decided to try the test out on lavapipe. Would it pass?

No.

It similarly fails on llvmpipe and IRIS.

But my lavapipe testing revealed an important clue. Since lavapipe has no synchronization issues, I could be certain this was a zink bug. Furthermore, the test failed both when the sequencing bug was exhibiting (in the caselist) and when it wasn’t (in isolation), meaning I could actually see the “passing” values in addition to the failing ones for comparison.

Here’s the failing error output:

Verifying feedback results.
Element at index 0 (tessellation invocation 0) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 1 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 2 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.166663, 0.5, 1)
Element at index 3 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 4 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 5 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0.5, 1)
Element at index 6 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Element at index 7 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0.5, 1)
Omitted 24 error(s).

And here’s the passing error output:

Verifying feedback results.
Element at index 3 (tessellation invocation 1) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 4 (tessellation invocation 2) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 5 (tessellation invocation 3) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 6 (tessellation invocation 4) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 7 (tessellation invocation 5) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 8 (tessellation invocation 6) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.4, -0.433337, 0, 1)
Element at index 9 (tessellation invocation 7) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Element at index 10 (tessellation invocation 8) expected vertex in range: ( [-0.4, 0.4], [-0.4, 0.4], 0.0, 1.0 ) got: (-0.133337, -0.433337, 0, 1)
Omitted 18 error(s).

This might not look like much, but to any zinkologist there’s an immediate red flag: the Z component of each vertex is 0.5 in the failing case.

What does this remind us of?

Naturally it reminds us of nir_lower_clip_halfz, the compiler pass which converts OpenGL’s Z range ([-1, 1]) to Vulkan’s ([0, 1]) by remapping gl_Position’s Z to (z + w) * 0.5. This pass is run on the last vertex stage, but if it gets run more than once, a Z of -1 becomes 0 the first time through and that 0 becomes 0.5 the second.
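To make the arithmetic concrete, here’s a little sketch (plain Python, nothing driver-specific; the vertex values are lifted from the logs above) showing what happens when that transform gets applied twice:

# The halfz lowering remaps gl_Position's Z as z' = (z + w) * 0.5,
# which is the same FAdd + FMul-by-0.5 visible in the SPIR-V diff below.
def lower_clip_halfz(pos):
    x, y, z, w = pos
    return (x, y, (z + w) * 0.5, w)

# a vertex with a GL clip-space Z of -1, like the ones in the logs above
v = (-0.133337, -0.433337, -1.0, 1.0)

once = lower_clip_halfz(v)       # Z becomes 0.0: the "passing" output above
twice = lower_clip_halfz(once)   # Z becomes 0.5: the failing output above
print(once, twice)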

Thus, it looks like the pass is being run twice in this test. How can this be verified?

ZINK_DEBUG=spirv will export all the SPIR-V shaders used by an app. Dumping the shaders from a passing run and a failing run and diffing them should therefore confirm whether the conversion pass is being run an extra time. The verdict?

@@ -1,7 +1,7 @@
 ; SPIR-V
 ; Version: 1.5
 ; Generator: Khronos; 0
-; Bound: 23
+; Bound: 38
 ; Schema: 0
                OpCapability TransformFeedback
                OpCapability Shader
@@ -36,13 +36,28 @@
 %_ptr_Output_v4float = OpTypePointer Output %v4float
 %gl_Position = OpVariable %_ptr_Output_v4float Output
      %v4uint = OpTypeVector %uint 4
+%uint_1056964608 = OpConstant %uint 1056964608
        %main = OpFunction %void None %3
          %18 = OpLabel
                OpBranch %17
          %17 = OpLabel
          %19 = OpLoad %v4float %a_position
          %21 = OpBitcast %v4uint %19
-         %22 = OpBitcast %v4float %21
-               OpStore %gl_Position %22
+         %22 = OpCompositeExtract %uint %21 3
+         %23 = OpCompositeExtract %uint %21 3
+         %24 = OpCompositeExtract %uint %21 2
+         %25 = OpBitcast %float %24
+         %26 = OpBitcast %float %23
+         %27 = OpFAdd %float %25 %26
+         %28 = OpBitcast %uint %27
+         %30 = OpBitcast %float %28
+         %31 = OpBitcast %float %uint_1056964608
+         %32 = OpFMul %float %30 %31
+         %33 = OpBitcast %uint %32
+         %34 = OpCompositeExtract %uint %21 1
+         %35 = OpCompositeExtract %uint %21 0
+         %36 = OpCompositeConstruct %v4uint %35 %34 %33 %22
+         %37 = OpBitcast %v4float %36
+               OpStore %gl_Position %37
                OpReturn
                OpFunctionEnd

Sure enough, the extra instructions are exactly the halfz transform: add Z and W, then multiply by 0.5 (1056964608 is the integer bit pattern of 0.5f). And, as is the rule for such things, the fix was a simple one-liner to unset values in the vertex shader key.

It wasn’t technically a regression, but it manifested as one, and fixing it yielded another dozen or so fixes for cases affected by the same issue.

Blammo.

Written on January 10, 2022