Breaking

Stop The Optimizing

I had planned to write more posts about some optimizations and whatever other cool stuff I’ve been working on.

I had planned to make more zink-wip snapshots.

I did shower; stop spamming frog emotes at me.

But I also encountered a bug so bizarre, so infruating, so esoteric, that I need to take a bit of a victory lap now that I’ve successfully corralled it. So let’s get into a real, vintage SGC blog post like we used to have back when SGC was a good blog and dive into what it took to fix a bug that took me four full days to resolve.

The Problem

In the course of writing a suballocator, I ran zero tests, as is my way. When the coding fugue ended, I stumbled weakly to my local CI script and managed to hit the Enter key before collapsing into a restless sleep where I was chased by angry triangles. Things were different when I awoke; namely I now had a lot of failing tests.

But I fixed them, because that’s what driver developers do.

All except one, which I assumed was a flake after running it a few times and seeing no failures.

This was how I got to know the horror of dEQP-GLES3.functional.vertex_array_objects.all_attributes.

The test itself is awful to debug. It generates GL_MAX_VERTEX_ATTRIBS vertex attributes to use with the maximum number of vertex buffers and does a series of draws, verifying the results. Normal enough.

Except the attributes are completely randomized, even whether they’re enabled, so no two runs are the same.

And the bisect hit just right.

The Basics Of Problem Solving

When a new bug is found with a driver, the first question is usually “Did this used to work?” followed quickly by “When did it start?” if the first answer was yes. The reason for this is that determining exactly when a problem began and what caused it to manifest gives the developer some vague starting point for determining what is happening to cause a bug.

So it was that I embarked on a bisect to figure out why dEQP-GLES3.functional.vertex_array_objects.all_attributes was suddenly failing. But obviously I couldn’t just bisect for this test. No, no, that would be far too easy.

This test only fails if run in conjunction with a series of other tests. Thus my deqp caselist file:

dEQP-GLES3.functional.vertex_array_objects.all_attributes
dEQP-GLES3.functional.vertex_arrays.single_attribute.first.byte.first6_offset1_stride2_quads256
dEQP-GLES3.functional.vertex_arrays.single_attribute.output_types.int.components2_ivec4_quads256
dEQP-GLES3.functional.vertex_arrays.single_attribute.usages.static_copy.stride32_fixed_quads1

The problem test always runs last, so something was clearly going on over time that was causing the failure. Armed with this knowledge, and so sure that this would end up being some trivial one-liner that I could fix in a few minutes, I set up my startpoint and endpoint for the bisect and went to work.

Zink By Bisection

Generally speaking, I assume every bug I find is going to be a zink bug. Just by the numbers, that’s usually the case, but then also it’s just always the case. It was therefore no surprise that my bisect landed on a certain commit:

commit 6b13e7cede95504ce8309744d8b9d83c7dbab7c9
Author: Mike Blumenkrantz <michael.blumenkrantz@gmail.com>
Date:   Mon May 17 08:44:02 2021 -0400

    try better map flags

diff --git a/src/gallium/drivers/zink/zink_resource.c b/src/gallium/drivers/zink/zink_resource.c
index 55f37380d9f..121f6f0076e 100644
--- a/src/gallium/drivers/zink/zink_resource.c
+++ b/src/gallium/drivers/zink/zink_resource.c
@@ -1201,7 +1201,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
          /* At this point, the buffer is always idle (we checked it above). */
          usage |= PIPE_MAP_UNSYNCHRONIZED;
       }
-   } else if ((usage & PIPE_MAP_READ) && !(usage & PIPE_MAP_PERSISTENT)) {
+   } else if (((usage & PIPE_MAP_READ) && !(usage & PIPE_MAP_PERSISTENT)) || !res->obj->host_visible) {
       assert(!(usage & (TC_TRANSFER_MAP_THREADED_UNSYNC | PIPE_MAP_THREAD_SAFE)));
       if (usage & PIPE_MAP_DONTBLOCK) {
          /* sparse/device-local will always need to wait since it has to copy */
@@ -1209,7 +1209,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
             return NULL;
          if (!zink_resource_usage_check_completion(ctx, res, ZINK_RESOURCE_ACCESS_WRITE))
             return NULL;
-      } else if (!res->obj->host_visible) {
+      } else if (!res->obj->host_visible || res->base.b.usage != PIPE_USAGE_STAGING) {
          trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
          if (!trans->staging_res)
             return NULL;
@@ -1218,8 +1218,12 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
          zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, box->width);
          res = staging_res;
          zink_fence_wait(&ctx->base);
-      } else
-         zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_WRITE);
+      } else {
+         if (!(usage & PIPE_MAP_WRITE))
+            zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_WRITE);
+         else
+            zink_resource_usage_wait(ctx, res, ZINK_RESOURCE_ACCESS_RW);
+      }
    }
 
    if (!ptr) {

As clearly explained by my laconic commit log, this patch aims to improve non-persistent buffer mappings by forcing non-staging resources to use a snooped staging resource. For more details on why this is desirable, check out this encyclopedia of wisdom on the topic, written by RADV co-founder and Commander Of The Rays, Bas Nieuwenhuizen.

But somehow this small patch was breaking the test, so I set out to investigate.

Isolation

Once a problem area is identified, it’s usually helpful to try and isolate the exact hunks of a patch which cause the problem. In this case, I had three distinct and only vaguely-related hunks, so it was an ideal case for this strategy. The middle hunk ended up being the culprit:

@@ -1209,7 +1209,7 @@ buffer_transfer_map(struct zink_context *ctx, struct zink_resource *res, unsigne
             return NULL;
          if (!zink_resource_usage_check_completion(ctx, res, ZINK_RESOURCE_ACCESS_WRITE))
             return NULL;
-      } else if (!res->obj->host_visible) {
+      } else if (!res->obj->host_visible || res->base.b.usage != PIPE_USAGE_STAGING) {
          trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
          if (!trans->staging_res)
             return NULL;

It seemed a bit odd to me, but nothing that stood out as impossible; perhaps there was some manner of issue with buffer copying offsets for setting up the staging resource, or some synchronization issue, or whatever. There were options, and I now knew that the problem was caused by setting up a staging buffer. Further prinfs revealed that this conditional was only hit for read access, so it was now narrowed down even further.

Initial Testing

Was it a buffer offset problem with copying the data for the staging resource?

Well.

No.

As interesting as it would’ve been for that to have been the case, there’s zero chance that this one test case was invoking a magical offset that wasn’t also triggered in other cases. If the general buffer copying code here was broken, it was probably broken everywhere in zink, so there would’ve been many, many more failures. There was only this one case, however, and deeper investigation confirmed this, as I directly mapped both buffers and compared the data ranges, which matched.

Synchronization it was, then, and I can hear the disembodied voice of Dave Airlie now shouting “Barriers!” before vanishing off into the ether.

First, I tried adding more GPU stalls. Like, lots more. Like, so many that the test took minutes to complete. There was no change, however. Just for hahas, I even added some usleep calls around.

Still nothing.

At this point I was seriously stumped. By now I’d fully instrumented all of the buffer access codepaths with asserts to verify the mapped contents matched the real buffer contents in all cases, and none of the asserts were ever hit.

But if it wasn’t actually an issue with synchronizing the staging buffer, what could it be?

I decided to check the test with with ANV at this point, being the case that I always run CTS against lavapipe to avoid killing my session in case I’ve foolished and added some code which triggers hangs, and…

And the test passed with ANV.

This was a real thinker, so I went to get a second opinion from Bas and RADV. RADV told me that ANV didn’t know what it was talking about, and the test was definitely failing, so I went with that answer because it seemed more sane.

As a final idea, I did the truly unthinkable: I threw in a malloc call, allocated some host memory, and copied the map contents directly into that buffer.

And leaked it.

Yes, I know, I know, We Don’t Do That, but it was just this one time. Just a little bit. Just to see if I could valg—Of course valgrind crashes when running anything in lavapipe due to unimplemented instructions, so why did I bother?

Getting Deeper

There comes a time when saying We Need To Go Deeper isn’t just a meme. That time was now, and I was about to get, as they say in technical terms when such depth is approached, deep as fuck.

Continuing to experiment with my memory leaking, the conditional block in question had by now degenerated into spaghetti:

         trans->staging_res = pipe_buffer_create(&screen->base, PIPE_BIND_LINEAR, PIPE_USAGE_STAGING, box->x + box->width);
         if (!trans->staging_res)
            return NULL;
         struct zink_resource *staging_res = zink_resource(trans->staging_res);
         trans->offset = staging_res->obj->offset;
         uint8_t *p = map_resource(screen, res);
         trans->data = malloc(box->x + box->width);
         trans->data2 = malloc(box->x + box->width);
         memset(trans->data, 0, box->x + box->width);
         memset(trans->data2, 0, box->x + box->width);
         memcpy(trans->data, p, box->x + box->width);
         memcpy(trans->data2, p, box->x + box->width);
         printf("SIZE NEEDED %u\n", box->x + box->width);
      for (unsigned i = 0; i < box->x + box->width; i++) {
         uint8_t *map = res->obj->map;
         assert(trans->data[i] == trans->data2[i]);
         assert(map[i] == trans->data2[i]);
         printf("MAP[%u] = %u\n", i, trans->data2[i]);
      }
         //zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, MIN2(box->width + 4, res->base.b.width0-box->x));
         //zink_copy_buffer(ctx, NULL, staging_res, res, box->x, box->x, box->width);
         ptr = trans->data;

         res = staging_res;
         zink_fence_wait(&ctx->base);

Obviously I’m gonna double buffer my memory leak so I can verify that it’s not secretly being modified on unmap (it wasn’t), and then also verify that the data matches before returning the pointer. And print it all, of course, because if you can actually read your terminal when you reach this sort of depth in the course of a debugging session, probably you’re doing it wrong.

But the time had come to start applying hacks elsewhere: namely the test itself. Being a random test case made it impossible to figure out what was going on between runs, but I’d determined one thing of interest: no matter what, unless I returned the direct mapping for the buffer, the test failed.

Let’s see what Mr. Crowbar had to say about that though when I applied him to the CTS case:

diff --git a/modules/gles3/functional/es3fVertexArrayObjectTests.cpp b/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
index 82578b1ce..e231c4b1a 100644
--- a/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
+++ b/modules/gles3/functional/es3fVertexArrayObjectTests.cpp
@@ -765,14 +765,14 @@ void MultiVertexArrayObjectTest::init (void)
 		m_spec.buffers.push_back(shortCoordBuffer48);
 
 		m_spec.state.attributes.push_back(Attribute());
-		m_spec.state.attributes[attribNdx].enabled		= (m_random.getInt(0, 4) == 0) ? GL_FALSE : GL_TRUE;
-		m_spec.state.attributes[attribNdx].size			= m_random.getInt(2,4);
-		m_spec.state.attributes[attribNdx].stride		= 2*m_random.getInt(1, 3);
+		m_spec.state.attributes[attribNdx].enabled		= GL_TRUE;
+		m_spec.state.attributes[attribNdx].size			= (attribNdx % 2) + 2;
+		m_spec.state.attributes[attribNdx].stride	= 2 * ((attribNdx % 2) + 1);
 		m_spec.state.attributes[attribNdx].type			= GL_SHORT;
-		m_spec.state.attributes[attribNdx].integer		= m_random.getBool();
-		m_spec.state.attributes[attribNdx].divisor		= m_random.getInt(0, 1);
-		m_spec.state.attributes[attribNdx].offset		= 2*m_random.getInt(0, 2);
-		m_spec.state.attributes[attribNdx].normalized	= m_random.getBool();
+		m_spec.state.attributes[attribNdx].integer		= attribNdx % 3 == 1;
+		m_spec.state.attributes[attribNdx].divisor		= 0;
+		m_spec.state.attributes[attribNdx].offset		= attribNdx % 5;
+		m_spec.state.attributes[attribNdx].normalized	= attribNdx % 3 == 1;
 		m_spec.state.attributes[attribNdx].bufferNdx	= attribNdx+1;
 
 		if (attribNdx == 0)
@@ -783,14 +783,14 @@ void MultiVertexArrayObjectTest::init (void)
 		}
 
 		m_spec.vao.attributes.push_back(Attribute());
-		m_spec.vao.attributes[attribNdx].enabled		= (m_random.getInt(0, 4) == 0) ? GL_FALSE : GL_TRUE;
-		m_spec.vao.attributes[attribNdx].size			= m_random.getInt(2,4);
-		m_spec.vao.attributes[attribNdx].stride			= 2*m_random.getInt(1, 3);
+		m_spec.vao.attributes[attribNdx].enabled		= GL_TRUE;
+		m_spec.vao.attributes[attribNdx].size			= (attribNdx % 2) + 2;
+		m_spec.vao.attributes[attribNdx].stride			= 2 * ((attribNdx % 2) + 1);
 		m_spec.vao.attributes[attribNdx].type			= GL_SHORT;
-		m_spec.vao.attributes[attribNdx].integer		= m_random.getBool();
-		m_spec.vao.attributes[attribNdx].divisor		= m_random.getInt(0, 1);
-		m_spec.vao.attributes[attribNdx].offset			= 2*m_random.getInt(0, 2);
-		m_spec.vao.attributes[attribNdx].normalized		= m_random.getBool();
+		m_spec.vao.attributes[attribNdx].integer		= attribNdx % 3 == 1;
+		m_spec.vao.attributes[attribNdx].divisor		= 0;
+		m_spec.vao.attributes[attribNdx].offset			= attribNdx % 5;
+		m_spec.vao.attributes[attribNdx].normalized		= attribNdx % 3 == 1;
 		m_spec.vao.attributes[attribNdx].bufferNdx		= attribCount - attribNdx;
 
 		if (attribNdx == 0)

Now I had a consistently failing test (as long as I ran it with the other test cases so it didn’t feel too lonely and accidentally pass) with consistent data, and I was dumping it all to logs that I could compare if I returned the direct pointer for the map to legitimately pass the test.

Naturally the output data that I was printing matched. It’d be pretty weird if it didn’t considering all the asserts that I had in the code, right? Hah, yeah, that’d be… That’d be pretty weird, all right…

The Forgotten Depths

By this point I had determined that it was a specific range of buffer mappings causing the problem, specifically those sized between 50 and 100 bytes. I also knew that these buffers were being mapped by u_vbuf, also known colloquially as the hinterlands of Gallium, an obscure component used to handle translating unsupported vertex buffer formats.

Veteran Mesa developers reading along are going full sensiblechuckle.gif right now, but I’ll request that we continue our no spoiler policy.

If the buffer contents were the same as the mapped contents but the test was still failing, then there had to be a reason for that. I fumbled my way over to the vertex attribute translator and fingerpainted in a printf to dump the translated vertex attributes. This enabled me to diff between a good run and a bad run.

It was then that I made a bewildering discovery.

Any time I had a 96-byte buffer map, the attributes starting at offset 92 didn’t match in cases when the test failed.

This was another thinker, so I decided to enhance my memory leaks a bit to copy more buffer since this was all 4096-aligned and it wasn’t like I was going to be copying out of bounds. This was when things started to get really weird.

Returning a copy of the resquested 96 bytes of the buffer failed the test, but returning 100 bytes passed it.

Uh-oh.

Now that I took a closer look at those vertex attribs, I realized that the ones which were failing were the ones that were read from bytes 96 and 97 of the buffer. The buffer which only had 96 bytes mapped, meaning that only the range of [0..95] was valid…

At Last

Resolution. What I had tripped over was a buffer overrun, one that was undetectable through normal means because of reasons like:

this is a GPU buffer, so tools which would normally catch buffer overruns wouldn’t detect it
this is u_vbuf, which is code that’s generally known to work pretty well given that it’s 10+ years old and is widely used and tested
RadeonSI is likely the only other driver which uses the same sorts of buffer mapping optimizations, and it doesn’t use u_vbuf

Iteration on various fixes finally yielded a patch that was upstreamable; the crux of the problem here was that the stride of vertex attributes was being used to calculate the size of the region to map, but the stride only determines the number of bytes between elements, not their size. For example, if the stride was 4 bytes but the element was 8 bytes, the overrun would be 4 bytes for the last element. The solution was to calculate the offset of the last element being mapped, then add the size of the element using the attribute’s format block size, which guarantees that the last attribute won’t be truncated.

Fuck that bug.

Written on May 24, 2021