Need Another Hit
Ever stop in the middle of writing some code when it hits you? That nagging feeling. It’s in the back of your mind, relentlessly gnawing away. Making you question your decisions. Making you wonder what you’re doing with your life. You’re plugging away happily, when suddenly you remember: *I haven’t optimized any code in hours.*
It’s no way to work.
It’s no way to live.
Let’s get back to our roots, blog.
For this edition of the blog, we’re hopping off the usual tracks and onto a different set entirely: Vulkan driver optimization. I can already hear what you’re thinking.
Vulkan drivers are already fast. Go back to doing something useful, like making GL work.
First: no they’re not.
Second: I’m doing the opposite of that.
Third: shut up, it’s my blog.
How does one optimize Vulkan drivers? As we all know, any great optimization needs a test case. In the case of Vulkan, everyone who’s anyone (me) uses Zink as their test case. The reasons for this are many and varied because I say they are, but the important one to keep in mind is, as always, performance.
Suppose you are a large gaming-oriented company that sells hardware. Your hardware runs on a battery. The battery has a finite charge. Every bit of power drained from the battery powers the device so that users can play games.
Wouldn’t it be great if that battery lasted longer?
There are a number of ways to achieve this goal. Because this is my blog, I’m going to talk about the one that doesn’t involve underclocking or reducing graphical settings.
Obviously I’m talking about optimization, the process by which a smart and talented reader of StackOverflow copies and pastes code in exactly the right sequence such that, upon resolving all compilation errors, execution of a program utilizes fewer resources.
Now because everyone reading this is a GPU driver developer, we all know that optimization comes in two forms: optimizing for CPU and optimizing for GPU. But GPU optimization is easy. Anyone can do that. You just strap on your RadeonGPUProfiler or Nsight or whatever profiler matches your hardware, capture a frame, and the tool tells you exactly where the GPU is hurting.
So we’re done with GPU optimization, but we’re not optimized enough yet. The battery still doesn’t last forever, and the users are still complaining on Reddit that they can’t even finish a casual playthrough of Elden Ring or a boss fight in Monster Hunter: Rise without needing to plug in.
This brings us to “CPU optimization”, the process by which we use more esoteric tools like perf or dtrace or custom instrumentation to generate possibly-useful traces of where the CPU may or may not be executing optimally because it’s a filthy liar that doesn’t even know what line of code it’s executing half the time. But still, we need test cases, and unlike GPU profiling, CPU profiling typically isn’t useful with only a single frame’s worth of sample data.
Enter drawoverhead, a piglit benchmark which provides a view of how various GL operations impact CPU utilization by executing millions of draw calls per second to provide copious samples for profiling.
Why Not drawoverhead?
This is where the blog is going to take a turn for the bizarre. Some people, it seems, don’t want to use Zink for benchmarking and profiling.
I’m shocked, hurt, appalled, and also it’s me who doesn’t want to use Zink for benchmarking and profiling so it’s a very confusing time.
The problem with using Zink for optimizing CPU usage is that Zink keeps getting in the way. I want to profile only the Vulkan driver, but instead I’ve got all this Mesa oozing and spurting onto my CPU samples. It’s gross, it’s an untenable situation, and I’ve already taken steps to resolve it.
Behold the future: vkoverhead.
With one simple clone, build, and execute, it’s now possible to see how much the Vulkan driver you’re using sucks at any given task.
Want to see how costly it is to bind a different pipeline? Got it covered.
Changing vertex buffers? Blam, your performance is garbage.
Starting and stopping renderpasses? Take your entire computer and throw it in the dumpster because that’s where your performance just went.
The obvious problem with this is that somebody has to actually dig into the vkoverhead results for each driver and figure out what can be made better. I’ll write another post about this since it’s a separate topic.
Instead, what I want to do today is use vkoverhead to delve into one of the latest and greatest myths of modern Vulkan:
Is the use of fast-linked Graphics Pipeline Libraries worse than, equivalent to, or better than VK_EXT_vertex_input_dynamic_state?
I say this is one of the great myths because, having spoken to Very Knowledgeable Khronos Insiders as well as Experienced Application Developers, I’ve been told repeatedly that VK_EXT_vertex_input_dynamic_state is just a convenience feature that should not be used or relied upon, and proper use of GPL with fast-linking provides the same functionality and performance with broader adoption. But is this really true?
Well, now that the tools exist, it’s possible to say definitively that this sort of wishful thinking does not reflect reality. Let’s check out the numbers. As of the latest 1.1 vkoverhead release, the following cases are available:
```
$ ./vkoverhead -list
  0, draw
  1, draw_multi
  2, draw_vertex
  3, draw_multi_vertex
  4, draw_index_change
  5, draw_index_offset_change
  6, draw_rp_begin_end
  7, draw_rp_begin_end_dynrender
  8, draw_rp_begin_end_dontcare
  9, draw_rp_begin_end_dontcare_dynrender
 10, draw_multirt
 11, draw_multirt_dynrender
 12, draw_multirt_begin_end
 13, draw_multirt_begin_end_dynrender
 14, draw_multirt_begin_end_dontcare
 15, draw_multirt_begin_end_dontcare_dynrender
 16, draw_vbo_change
 17, draw_1vattrib_change
 18, draw_16vattrib
 19, draw_16vattrib_16vbo_change
 20, draw_16vattrib_change
 21, draw_16vattrib_change_dynamic
 22, draw_16vattrib_change_gpl
 23, draw_16vattrib_change_gpl_hashncache
 24, draw_1ubo_change
 25, draw_12ubo_change
 26, draw_1sampler_change
 27, draw_16sampler_change
 28, draw_1texelbuffer_change
 29, draw_16texelbuffer_change
 30, draw_1ssbo_change
 31, draw_8ssbo_change
 32, draw_1image_change
 33, draw_16image_change
 34, draw_1imagebuffer_change
 35, draw_16imagebuffer_change
 36, submit_noop
 37, submit_50noop
 38, submit_1cmdbuf
 39, submit_50cmdbuf
 40, submit_50cmdbuf_50submit
 41, descriptor_noop
 42, descriptor_1ubo
 43, descriptor_template_1ubo
 44, descriptor_12ubo
 45, descriptor_template_12ubo
 46, descriptor_1sampler
 47, descriptor_template_1sampler
 48, descriptor_16sampler
 49, descriptor_template_16sampler
 50, descriptor_1texelbuffer
 51, descriptor_template_1texelbuffer
 52, descriptor_16texelbuffer
 53, descriptor_template_16texelbuffer
 54, descriptor_1ssbo
 55, descriptor_template_1ssbo
 56, descriptor_8ssbo
 57, descriptor_template_8ssbo
 58, descriptor_1image
 59, descriptor_template_1image
 60, descriptor_16image
 61, descriptor_template_16image
 62, descriptor_1imagebuffer
 63, descriptor_template_1imagebuffer
 64, descriptor_16imagebuffer
 65, descriptor_template_16imagebuffer
 66, misc_resolve
 67, misc_resolve_mutable
```
The interesting cases for this scenario are:
```
 21, draw_16vattrib_change_dynamic
 22, draw_16vattrib_change_gpl
 23, draw_16vattrib_change_gpl_hashncache
```
- Case 21 is changing 16 vertex attributes between draws using VK_EXT_vertex_input_dynamic_state
- Case 22 is using fast-linking GPL to compile and bind a new pipeline from precompiled partial pipelines between draws
- Case 23 is using fully precompiled GPL pipelines with hash-n-cache to swap pipelines between draws
Running all of these tests on NVIDIA’s driver (the only hardware driver to fully support both extensions) on an AMD Ryzen 9 5900X with a 3060 Ti yields the following:
| Case | Draws per second |
|------|------------------|
Staggeringly, it turns out that GPL is worse in every scenario. Even the speed of the typical Vulkan hash-n-cache usage can’t make up for the fact that VK_EXT_vertex_input_dynamic_state genuinely is that much faster. And assuming the driver isn’t doing low-GPU-performance heroics, that means everyone drinking the Kool-Aid about not using or not implementing VK_EXT_vertex_input_dynamic_state should be reconsidering their beverage of choice.
This isn’t to say Graphics Pipeline Library is bad or should not be used.
GPL is one of the best extensions Vulkan has to offer, and it definitely provides a huge number of features that every application developer should be examining to see how they might improve performance.
But it’s not better than VK_EXT_vertex_input_dynamic_state.
vkoverhead is already in a state where at least one major GPU vendor is utilizing it to drive down CPU usage. If you’re a GPU driver engineer, or perhaps someone who does benchmarking for a popular tech news site, you should check it out.
Some key takeaways:
- Raw numbers can be compared between different GPUs and GPU drivers so long as the rest of the system stays the same
- This is how I know that AMDPRO currently performs better than RADV
- If the rest of the system is not the same between drivers/GPUs, the percentage differences can still be compared
Stay tuned for an upcoming post where I teach a course in making your own spaghetti. Also some other things that give Zink a 25%+ performance boost.