Fastlink

Fast-linking: This Is Your Howto

I previously wrote a post talking about some optimization work that’s been done with RADV’s VK_EXT_graphics_pipeline_library implementation to improve fast-link performance. As promised, that wasn’t the end of the story. Today’s post will be a bit different, however, as I’ll be assuming all the graphics experts in the audience are already well-versed in all the topics I’m covering.

Also I’m assuming you’re all driver developers interested in improving your VK_EXT_graphics_pipeline_library fast-link performance.

The one exception is that today I’ll be using a specific definition for fast when it comes to fast-linking: to be fast, a driver should be able to fast-link in under 0.01ms. In an extremely CPU-intensive application, this should allow for even the explodiest of pipeline explosions (100+ fast-links in a single frame) to avoid any sort of hitching/stuttering.

Which drivers have what it takes to be fast?

Testing

To begin evaluating fast-link performance, it’s important to have test cases. Benchmarks. The sort that can be easily run, easily profiled, easily understood.

vkoverhead is the premier tool for evaluating CPU overhead in Vulkan drivers, and thanks to Valve, it now has legal support for GPL fast-link using real pipelines from Dota2. That’s right. Acing this synthetic benchmark will have real world implications.

For anyone interested in running these cases, it’s as simple as building and then running:

./vkoverhead -start 135

These benchmark cases will call vkCreateGraphicsPipelines in a tight loop to perform a fast-link on GPL-created pipeline libraries, fast-linking thousands of times per second for easy profiling. The number of iterations per second, in thousands, is then printed.

vkoverhead works with any Vulkan driver on any platform (including Windows!), which means it’s possible to use it to profile and optimize any driver.

Optimization

vkoverhead currently has two cases for GPL fast-link. As they are both extracted directly from Dota2, they have a number of properties in common:

similar descriptor layouts/requirements
same composition of libraries (all four GPL stages created separately)

Each case tests the following:

depthonly is a pipeline containing only a vertex shader, forcing the driver to use its own fragment shader
slow is a pipeline that happens to be slow to create on many drivers

Various tools are available on different platforms for profiling, and I’m not going to go into details here. What I’m going to do instead is look into strategies for optimizing drivers. Strategies that I (and others) have employed in real drivers. Strategies that you, if you aren’t shipping a fast-linking implementation of GPL, might be interested in.

First Strategy: Move NO-OP Fragment Shader To Device

The depthonly case explicitly tests whether drivers are creating a new fragment shader for every pipeline that lacks one. Drivers should not do this.

Instead, create a single fragment shader on the device object and reuse it like these drivers do:

In addition to being significantly faster, this also saves some memory.

Second Strategy: Avoid Copying Shader IR

Regular, optimized pipeline creation typically involves running optimization passes across the shader binaries, possibly even the entire pipeline, to ensure that various speedups can be found. Many drivers copy the internal shader IR in the course of pipeline creation to handle shader variants.

Don’t copy shader IR when trying to fast-link a pipeline.

Copying IR is very expensive, especially in larger shaders. Instead, either precompile unoptimized shader binaries in their corresponding GPL stage or refcount IR structures that must exist during execution. Examples:

Third Strategy: Avoid Compiling Shaders

This one seems obvious, but it has to be stated.

Do not compile shaders when attempting to achieve fast-link speed.

If you are compiling shaders, this is a very easy place to start optimizing.

Fourth Strategy: Avoid Caching Fast-link Pipelines

There’s no reason to cache a fast-linked pipeline. The amount of time saved by retrieving a cached pipeline should be outweighed by the amount of time required to:

compute a key/hash for a given pipeline
access the cache

I say should because ideally a driver should be so fast at combining a GPL pipeline that even a cache hit is only comparable performance, if not slower outright. Skip all aspects of caching for these pipelines.

RADV recently optimized this away

Fifth Strategy: Misc Profiling

If a driver is still slow after checking for the above items, it’s time to try profiling. It’s surprising what slowdowns drivers will hit. The classics I’ve seen are large memset calls and avoidable allocations.

Some examples:

RADV dynamic state initialization had a big memset
another big RADV memset
Lavapipe dead code deletion removing a small allocation
Lavapipe pipeline layout reuse to avoid allocations
Lavapipe deferred CSO creation*
Lavapipe is special because it is both a CPU-based driver using LLVM, meaning the time spent out of LLVM will never exceed the time spent inside LLVM. It also executes its command buffers in a thread. Thus, it can “cheat” by deferring the final creation of its shader CSOs until pipeline bind time.

A Mystery Solved

In my previous post, I alluded to a driver that was shipping a GPL implementation that advertised fast-link but wasn’t actually fast. I saw a lot of guesses. Nobody got it right.

It was Lavapipe (me) all along.

As hinted at above, however, this is no longer the case. In fact, after going through the listed strategies, Lavapipe now has the fastest GPL linking in the world.

Obviously it would have to if I’m writing a blog post about optimizing fast-linking, right?

Fast-linking: Initial Comparisons

How fast is Lavapipe’s linking, you might ask?

To answer this, let’s first apply a small patch to bump up Lavapipe’s descriptor limits so it can handle the beefy Dota2 pipelines. With that done, here’s a look at comparisons to other, more legitimate drivers, all running on the same system.

NVIDIA is the gold standard for GPL fast-linking considering how long they’ve been shipping it. They’re pretty fast.

$ VK_ICD_FILENAMES=nvidia_icd.json ./vkoverhead -start 135 -duration 5
vkoverhead running on NVIDIA GeForce RTX 2070:
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      444,          100.0%
 136, misc_compile_fastlink_slow,                           243,          100.0%

RADV (with pending MRs applied) has gotten incredibly fast over the past week-ish.

$ RADV_PERFTEST=gpl ./vkoverhead -start 135 -duration 5
vkoverhead running on AMD Radeon RX 5700 XT (RADV NAVI10):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      579,          100.0%
 136, misc_compile_fastlink_slow,                           537,          100.0%

Lavapipe (with pending MRs applied) blows them both out of the water.

$ VK_ICD_FILENAMES=lvp_icd.x86_64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 15.0.6, 256 bits):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                     1485,         100.0%
 136, misc_compile_fastlink_slow,                          1464,         100.0%

Even if the NVIDIA+RADV numbers are added together, it’s still not close.

Fast-linking: More Comparisons

If I switch over to a different machine, Intel’s ANV driver has a MR for GPL open, and it’s seeing some movement. Here’s a head-to-head with the champion.

$ ./vkoverhead -start 135 -duration 5
vkoverhead running on Intel(R) Iris(R) Plus Graphics (ICL GT2):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      384,          100.0%
 136, misc_compile_fastlink_slow,                           276,          100.0%

$ VK_ICD_FILENAMES=lvp_icd.x86_64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 15.0.6, 256 bits):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                     1785,         100.0%
 136, misc_compile_fastlink_slow,                          1779,         100.0%

On yet another machine, here’s Turnip, which advertises the fast-link feature. This driver requires a small patch to modify MAX_SETS=5 since this is hardcoded at 4. I’ve also pinned execution here to the big cores for consistency.

# turnip ooms itself with -duration
$ ./vkoverhead -start 135
vkoverhead running on Turnip Adreno (TM) 618:
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                       73,           100.0%
 136, misc_compile_fastlink_slow,                            23,           100.0%

$ VK_ICD_FILENAMES=lvp_icd.aarch64.json ./vkoverhead -start 135 -duration 5
vkoverhead running on llvmpipe (LLVM 14.0.6, 128 bits):
	* misc numbers are reported as thousands of operations per second
	* percentages for misc cases should be ignored
 135, misc_compile_fastlink_depthonly,                      690,          100.0%
 136, misc_compile_fastlink_slow,                           699,          100.0%

More Analysis

We’ve seen that Lavapipe is unequivocally the champion of fast-linking in every head-to-head, but what does this actually look like in timings?

Here’s a chart that shows the breakdown in milliseconds.

Driver	` misc_compile_fastlink_depthonly `	` misc_compile_fastlink_slow `
NVIDIA	0.002ms	0.004ms
RADV	0.0017ms	0.0019ms
Lavapipe	0.0007ms	0.0007ms

ANV	0.0026ms	0.0036ms
Lavapipe	0.00056ms	0.00056ms

Turnip	0.0137ms	0.0435ms
Lavapipe	0.001ms	0.001ms

As we can see, all of these drivers are “fast”. A single fast-link pipeline isn’t likely to cause any of them to drop a frame.

The driver I’ve got my eye on, however, is Turnip, which is the only one of the tested group that doesn’t quite hit that 0.01ms target. A little bit of profiling might show some easy gains here.

Even More Analysis

For another view of these drivers, let’s examine the relative performance. Since GPL fast-linking is inherently a CPU task that has no relation to the GPU, it stands to reason that a CPU-based driver should be able to optimize for it the best given that there’s already all manner of hackery going on to defer and delay execution. Indeed, reality confirms this, and looking at any profile of Lavapipe for the benchmark cases reveals that the only remaining bottleneck is the speed of malloc, which is to say the speed with which the returned pipeline object can be allocated.

Thus, ignoring potential micro-optimizations of pipeline struct size, it can be said that Lavapipe has effectively reached the maximum speed of the system for fast-linking. From there, we can say that any other driver running on the same system is utilizing some fraction of this power.

Therefore, every other driver’s fast-link performance can be visualized in units of Lavapipe (lvps) to determine how much gain is possible if things like refactoring time and feasibility are ignored.

Driver	`misc_compile_fastlink_depthonly`	`misc_compile_fastlink_slow`
NVIDIA	0.299lvps	0.166lvps
RADV	0.390lvps	0.367lvps
ANV	0.215lvps	0.155lvps
Turnip	0.106lvps	0.033lvps

The great thing about lvps is that these are comparable units.

At last, we finally have a way to evaluate all these drivers in a head-to-head across different systems.

The results are a bit surprising to me:

RADV, with the insane heroics of Samuel “Gotta Go Fast” Pitoiset, has gone from roughly zero lvps last week to first place this week
NVIDIA’s fast-linking, while quite fast, is closer in performance to an unoptimized, unlanded ANV MR than it is to RADV
Turnip is a mobile driver that both 1) has a GPL implementation 2) is kinda fast at an objective level?

Key Takeaways

Aside from the strategies outlined above, the key takeaway for me is that there shouldn’t be any hardware limitation to implementing fast-linking. It’s a CPU-based architectural problem, and with enough elbow grease, any driver can aspire to reach nonzero lvps in vkoverhead’s benchmark cases.

Written on February 1, 2023