Overhead

A Quintessential Metric

There’s been a lot of talk about driver overhead in the Mesa community as of late, in large part begun by Marek Olšák and his daredevil stunts driving RadeonSI through flaming hoops while juggling chainsaws.

While zink isn’t quite at that level yet (and neither am I), there’s still some progress being made that I’d like to dig into a bit.

What Is Overhead?

As in all software, overhead is the performance penalty that is incurred as compared to a baseline measurement. In Mesa, a lot of people know of driver overhead as “Gallium sucks” and/or “A Gallium-based driver is slow” due to the fact that Gallium does incur some amount of overhead as compared to the old-style immediate mode DRI drivers.

While it’s true that there is an amount of performance lost by using Gallium in this sense, it’s also true that the performance gained is much greater. The reason for this is that Gallium is able to batch commands and state changes for every driver using it, allowing redundant calls to avoid triggering any work in the GPU.
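To make the idea of redundant calls never reaching the driver concrete, here is a minimal sketch in Python. It is not Gallium's code: the `StateFilter` class, its `bind` method, and the slot names are all hypothetical, and the real state tracker is considerably more involved.

```python
class StateFilter:
    """Hypothetical filter that drops state changes which alter nothing."""

    def __init__(self):
        self.bound = {}    # slot name -> value currently bound
        self.emitted = []  # changes that would actually reach the driver

    def bind(self, slot, value):
        # If the same value is already bound to this slot, the call is
        # redundant: return early so the driver never sees it.
        if slot in self.bound and self.bound[slot] == value:
            return False
        self.bound[slot] = value
        self.emitted.append((slot, value))
        return True
```

Binding the same texture to the same slot twice in a row would only record one entry in `emitted`, which is the work the driver (and eventually the GPU) is spared.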

It also makes for an easier time profiling and improving upon the CPU usage that’s required to handle the state changes emitted by Gallium. Instead of having a ton of core Mesa callbacks which need to be handled, each one potentially leading to a no-op that can be analyzed and deferred by the driver, Gallium provides a more cohesive API where each driver hook is a necessary change that must be handled. Because of this, the job of optimizing for those changes is simplified.

How Can Overhead Be Measured?

Other than the obvious method of running apps on a driver and checking the fps counter, piglit provides a facility for this: the drawoverhead test. This test has over a hundred subtests which perform sequences of draw operations with various state changes, each with its own result relative to a baseline, enabling a developer to easily profile and optimize a given codepath.
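The "relative to a baseline" part is simple arithmetic: each subtest's draw rate is divided by the rate of the first subtest. A minimal sketch, assuming the rates are already available as a list of thousands of draws per second (the function name is my own, not something from piglit):

```python
def relative_to_baseline(draws_per_sec):
    """Express each rate as a percentage of the first (baseline) rate."""
    baseline = draws_per_sec[0]
    return [round(100.0 * rate / baseline, 1) for rate in draws_per_sec]
```

Note that the test presumably computes its percentages from unrounded rates, so recomputing them from the rounded draws/s column can be off by 0.1 here and there.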

How Is Zink Doing Here?

To answer this, let’s look at some preliminary results from zink in master, the code which will soon be shipping in Mesa 21.0.0. All numbers here are, in contrast to my usual benchmarking, done on an AMD 5700XT GPU. More on this later.

ZINK: MASTER

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  818, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  686, 83.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  411, 50.3%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  232, 28.4%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  258, 31.5%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             87, 10.7%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             162, 19.9%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 150, 18.3%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                120, 14.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     192, 23.5%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    146, 17.9%

After this point, the test aborts because shader images are not yet implemented, but it’s enough for a baseline.

These numbers are…not great. Primarily, at least to start, I’ll be focusing on the first row where zink is performing 818,000 draws per second.

Let’s check out some performance from zink-wip (20201230 snapshot), specifically with GALLIUM_THREAD=0 set to disable threaded context. This means I’m adding in descriptor caching and unlimited command buffer counts (vs forcing a stall after every submit from the 4th one onwards to reset a batch):

ZINK: WIP (CACHED, NO THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                  766, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                  633, 82.6%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                  407, 53.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                  500, 65.3%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                  449, 58.6%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,             85, 11.2%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             235, 30.7%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 159, 20.8%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                128, 16.7%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     179, 23.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    139, 18.2%

This is actually worse for a lot of cases!

But why is that?

It turns out that in the base draw case, threaded context is really necessary for caching and extra command buffers to pay off. There are sizable gains in the baseline texture cases (roughly +75% to +115%) and in the vertex attribute change case (+45%), but fundamentally the overhead for the driver seems higher.

What happens if threading is enabled though?

ZINK: WIP (CACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5206, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5149, 98.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5187, 99.6%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5210, 100.1%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 4684, 90.0%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            137, 2.6%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             252, 4.8%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 243, 4.7%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                222, 4.3%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     213, 4.1%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    208, 4.0%

blink.gif

Indeed, threading yields almost 7x the throughput for the baseline cases. It turns out that synchronously performing expensive tasks like computing hash values for descriptor sets is bad. Who could have guessed.
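The general shape of moving expensive work off the calling thread can be sketched as below. This is only an illustration of the pattern, assuming a single worker draining a FIFO queue; the `ThreadedContext` class and `driver_fn` callback are hypothetical names, not Gallium's threaded context implementation.

```python
import queue
import threading


class ThreadedContext:
    """Hypothetical wrapper that runs driver work on a worker thread."""

    def __init__(self, driver_fn):
        self._calls = queue.Queue()
        self._driver_fn = driver_fn
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def draw(self, *args):
        # Only enqueues the call; the caller does not wait for the
        # expensive part (e.g. hashing) to finish.
        self._calls.put(args)

    def flush(self):
        # Block until every queued call has been processed.
        self._calls.join()

    def _run(self):
        while True:
            args = self._calls.get()
            try:
                self._driver_fn(*args)
            finally:
                self._calls.task_done()
```

The application thread only pays for the `put`, so as long as the worker keeps up, the cost of each draw as seen by the caller stays small.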

State Changes

Looking at the other values, however, is a bit more pertinent for the purpose of this post. Overhead is incurred when state changes are triggered by descriptors being changed, and this is much closer to a real world scenario (i.e., gaming) than simply running draw calls with no changes. Caching yields roughly a 50% performance improvement for this case.
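As a rough picture of what descriptor caching means here, consider the sketch below. It assumes a descriptor set can be identified by the tuple of resources bound to it; the `DescriptorSetCache` class and the integer "set" handles are stand-ins of my own and do not reflect zink's actual data structures.

```python
class DescriptorSetCache:
    """Hypothetical cache mapping a tuple of bindings to a descriptor set."""

    def __init__(self):
        self._sets = {}
        self.allocations = 0  # how many sets had to be created from scratch

    def get(self, bindings):
        # The dict lookup hashes the whole tuple of bindings. This hashing
        # is the cost that has to be paid on every lookup, hit or miss.
        key = tuple(bindings)
        cached = self._sets.get(key)
        if cached is not None:
            return cached
        # Cache miss: "allocate and update" a new set (here just a counter).
        self.allocations += 1
        self._sets[key] = self.allocations
        return self.allocations
```

Drawing repeatedly with the same bindings reuses one set, while changing a single texture produces a new key and therefore a new set. The tradeoff is that every lookup pays for the hash, even when nothing changed.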

Further Improvements

As I’d mentioned previously, I’m doing some work now on descriptor management with an aim to further lower this overhead. Let’s see what that looks like.

ZINK: TEST (UNCACHED, THREAD)

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 5426, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 5423, 99.9%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 5432, 100.1%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 5246, 96.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 5177, 95.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            153, 2.8%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,             229, 4.2%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 247, 4.6%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                228, 4.2%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     237, 4.4%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    223, 4.1%

While there’s a small (~4%) improvement for the baseline numbers, what’s much more interesting is the values where descriptor states are changed. They are, in fact, about as good or even slightly better than the caching version of descriptor management.
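To put numbers on that comparison, the per-row change between the two threaded runs can be computed directly from the tables. A small sketch, with the draw rates copied from the cached and uncached threaded tables above (the helper name and dictionary keys are my own):

```python
def percent_change(before, after):
    """Percentage change going from `before` to `after`, to one decimal."""
    return round(100.0 * (after - before) / before, 1)


# Thousands of draws/s: (cached + thread, uncached + thread)
rows = {
    "baseline (1 VBO)":      (5206, 5426),
    "shader program change": (137, 153),
    "vertex attrib change":  (252, 229),
    "1 texture change":      (243, 247),
    "8 textures change":     (222, 228),
    "1 TBO change":          (213, 237),
    "8 TBOs change":         (208, 223),
}

for name, (cached, uncached) in rows.items():
    print(f"{name}: {percent_change(cached, uncached):+.1f}%")
```

The baseline row works out to about +4.2%. The vertex attrib change row is the one exception that moves in the other direction.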

This is huge. Specifically it’s huge because it means that I can likely port over some of the techniques used in this approach to the cached version in order to drive further reductions in overhead.

Closing Remarks

Before I go, let’s check out some numbers from a real driver. Specifically, RadeonSI: the pinnacle of Gallium-based drivers.

RADEONSI

   #, Test name                                              ,    Thousands draws/s, Difference vs the 1st
   1, DrawElements ( 1 VBO| 0 UBO|  0    ) w/ no state change,                 6221, 100.0%
   2, DrawElements ( 4 VBO| 0 UBO|  0    ) w/ no state change,                 6261, 100.7%
   3, DrawElements (16 VBO| 0 UBO|  0    ) w/ no state change,                 6236, 100.2%
   4, DrawElements ( 1 VBO| 0 UBO| 16 Tex) w/ no state change,                 6263, 100.7%
   5, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ no state change,                 6243, 100.4%
   6, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ shader program change,            217, 3.5%
   7, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ vertex attrib change,            1467, 23.6%
   8, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 1 texture change,                 374, 6.0%
   9, DrawElements ( 1 VBO| 8 UBO|  8 Tex) w/ 8 textures change,                218, 3.5%
  10, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 1 TBO change,                     680, 10.9%
  11, DrawElements ( 1 VBO| 8 UBO|  8 TBO) w/ 8 TBOs change,                    318, 5.1%

Yikes. Especially intimidating here is the relative performance for vertex attribute changes, where RadeonSI is able to retain almost 25% of its baseline performance while zink doesn’t even manage 5%.

Hopefully these figures get closer to each other in the future, but this just shows that there’s still a long way to go.

Written on January 14, 2021