vulkanradixsort
June 29, 2026 · View on GitHub
Reduce-then-scan GPU radix sort, implemented as a single-file header-only Vulkan library. No additional dependencies.
Note: As of June 2026 (CUDA 13.2, CUB v3.2.0 Onesweep), CUB is faster by 1.85× on keys-only and 1.25× on key-value at N = . Still a practical choice for Vulkan-based workflows.
Requirements
VulkanSDK >= 1.4.328.1— download from https://vulkan.lunarg.com/ (push descriptor requires >= 1.4; >= 1.4.328.1 for macOS)cmake >= 3.24- Vulkan 1.3+ device with
pushDescriptorandsynchronization2enabled (see Usage)
slangc v2026.11 is downloaded automatically at configure time. To use the Vulkan SDK's slangc instead:
cmake -B build -DVRDX_SLANGC_FROM_SDK=ON
Benchmark
Build
$ cmake . -B build # slangc downloaded automatically (v2026.11)
$ cmake . -B build -DVRDX_SLANGC_FROM_SDK=ON # use slangc from Vulkan SDK instead
$ cmake --build build --config Release -j
Run
$ ./build/Release/bench.exe <type> [-o output.csv] [--validation] [--no-verify] # Windows
$ ./build/bench <type> [-o output.csv] [--validation] [--no-verify] # Linux
type:cpu,vulkan,cuda,fuchsia--validation: enable Vulkan validation layers (disabled by default to avoid benchmark overhead)--no-verify: skip correctness check and proceed directly to benchmarking- Sweeps N from to (128 steps), 1 warmup + 10 timed runs each
- Outputs median GPU and CPU throughput to CSV
Plot results:
$ python tools/plot.py vulkan.csv cuda.csv --output results.png
Results
Test environment: Windows, NVIDIA GeForce RTX 5080, CUDA 13.2, CUB v3.2.0 (Onesweep default).
Median throughput at N = . Ratios relative to this library (> 1× means the competitor is faster).
| Sort type | This library (Vulkan) | Fuchsia (Vulkan) | CUB Onesweep (CUDA) |
|---|---|---|---|
| 32-bit keys only | 12.07 GItems/s | 15.59 GItems/s (1.29×) | 22.36 GItems/s (1.85×) |
| 32-bit key-value | 9.35 GItems/s | 5.32 GItems/s (0.57×) | 11.67 GItems/s (1.25×) |
Fuchsia radix sort is faster on keys-only, but 1.76× slower on key-value. Fuchsia sorts key-value pairs as a single 64-bit key, doubling memory traffic per pass, while this library sorts the two buffers independently.

Integration
Integrate via CMake or by copying include/vk_radix_sort.h into your project.
CMake
-
Add subdirectory:
add_subdirectory(path/to/vulkan_radix_sort) -
Link to
vk_radix_sort:# With Volk target_link_libraries(my_project PRIVATE volk::volk_headers vk_radix_sort) # With standard Vulkan loader target_link_libraries(my_project PRIVATE Vulkan::Vulkan vk_radix_sort)
Copy header
Copy include/vk_radix_sort.h into your project and include it directly.
Usage
-
In exactly one source file, define
VRDX_IMPLEMENTATIONbefore includingvk_radix_sort.h. Includevolk.hfirst if using Volk.// With Volk #include "volk.h" #define VRDX_IMPLEMENTATION #include "vk_radix_sort.h" // With standard Vulkan headers #define VRDX_IMPLEMENTATION #include "vk_radix_sort.h" -
Enable the required features when creating the
VkDevice:VkPhysicalDeviceVulkan13Features features13 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_3_FEATURES}; features13.synchronization2 = VK_TRUE; VkPhysicalDeviceVulkan14Features features14 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_4_FEATURES}; features14.pNext = &features13; features14.pushDescriptor = VK_TRUE; VkDeviceCreateInfo deviceInfo = {VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO}; deviceInfo.pNext = &features14; // ... vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device); -
Create
VkBufferfor keys and values withVK_BUFFER_USAGE_STORAGE_BUFFER_BIT. -
Create
VrdxSorter:VrdxSorter sorter = VK_NULL_HANDLE; VrdxSorterCreateInfo sorterInfo = {}; sorterInfo.physicalDevice = physicalDevice; sorterInfo.device = device; sorterInfo.pipelineCache = pipelineCache; VkResult result = vrdxCreateSorter(&sorterInfo, &sorter); if (result != VK_SUCCESS) { /* handle error */ } -
Allocate a temporary storage buffer:
VrdxSorterStorageRequirements requirements; vrdxGetSorterStorageRequirements(sorter, elementCount, &requirements); // keys only vrdxGetSorterKeyValueStorageRequirements(sorter, elementCount, &requirements); // key-value VkBufferCreateInfo bufferInfo = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO}; bufferInfo.size = requirements.size; bufferInfo.usage = requirements.usage; // ... -
Record sort commands.
Buffer offsets must be multiples of
minStorageBufferOffsetAlignment(usually16).The sort rebinds pipeline, layout, and push constants — previously bound state is lost.
Add execution barriers around the sort. Use global memory barriers rather than per-resource barriers (Vulkan synchronization examples):
- Before the sort:
COMPUTE_SHADERstage (+TRANSFERfor indirect),SHADER_READaccess (+TRANSFER_READfor indirect). - After the sort:
COMPUTE_SHADERstage,SHADER_WRITEaccess.
VkQueryPool queryPool; // VK_NULL_HANDLE, or a timestamp query pool with at least 15 entries. // Sort keys only vrdxCmdSort(commandBuffer, sorter, elementCount, keysBuffer, 0, storageBuffer, 0, queryPool, 0); // Sort keys with values vrdxCmdSortKeyValue(commandBuffer, sorter, elementCount, keysBuffer, 0, valuesBuffer, 0, storageBuffer, 0, queryPool, 0); // Sort with indirect element count (read from GPU buffer) // Actual count in indirectBuffer must not exceed maxElementCount. vrdxCmdSortKeyValueIndirect(commandBuffer, sorter, maxElementCount, indirectBuffer, 0, keysBuffer, 0, valuesBuffer, 0, storageBuffer, 0, queryPool, 0); - Before the sort:
Development Guide
After modifying shaders, run the cmake build. It compiles shaders with slangc into src/generated/*.h, then assembles include/vk_radix_sort.h from src/vk_radix_sort.h.in. Each generated file includes the slangc version as a comment:
// Generated by slangc 2026.11
To bump the slangc version, update SLANG_VERSION in cmake/Slangc.cmake and rebuild.
To bump the library version, update VERSION in the project() call in CMakeLists.txt, rebuild, and commit the regenerated include/vk_radix_sort.h.
Contributing
When modifying shaders or src/vk_radix_sort.h.in, commit the regenerated include/vk_radix_sort.h as well. Use the default build (without -DVRDX_SLANGC_FROM_SDK=ON) to ensure the output is reproducible.
TODO
- Compare with VkRadixSort.
- Find optimal
WORKGROUP_SIZEandPARTITION_DIVISIONfor different devices.
References
- https://github.com/b0nes164/GPUSorting — CUDA kernel references for understanding the algorithm.
Troubleshooting
NVIDIA GPU (Windows): slow performance after a few seconds.
- Cause: NVIDIA driver downgrades GPU/memory clock under sustained load. Check with the Performance Overlay (Alt+R).
- Fix: set performance mode to maximum in NVIDIA Control Panel.
