vulkanradixsort

June 29, 2026 · View on GitHub

Reduce-then-scan GPU radix sort, implemented as a single-file header-only Vulkan library. No additional dependencies.

Note: As of June 2026 (CUDA 13.2, CUB v3.2.0 Onesweep), CUB is faster by 1.85× on keys-only and 1.25× on key-value at N = $2^{25}$ . Still a practical choice for Vulkan-based workflows.

Requirements

VulkanSDK >= 1.4.328.1 — download from https://vulkan.lunarg.com/ (push descriptor requires >= 1.4; >= 1.4.328.1 for macOS)
cmake >= 3.24
Vulkan 1.3+ device with pushDescriptor and synchronization2 enabled (see Usage)

slangc v2026.11 is downloaded automatically at configure time. To use the Vulkan SDK's slangc instead:

cmake -B build -DVRDX_SLANGC_FROM_SDK=ON

Benchmark

Build

$ cmake . -B build                            # slangc downloaded automatically (v2026.11)
$ cmake . -B build -DVRDX_SLANGC_FROM_SDK=ON  # use slangc from Vulkan SDK instead
$ cmake --build build --config Release -j

Run

$ ./build/Release/bench.exe <type> [-o output.csv] [--validation] [--no-verify]  # Windows
$ ./build/bench <type> [-o output.csv] [--validation] [--no-verify]              # Linux

type: cpu, vulkan, cuda, fuchsia
--validation: enable Vulkan validation layers (disabled by default to avoid benchmark overhead)
--no-verify: skip correctness check and proceed directly to benchmarking
Sweeps N from $2^{18}$ to $2^{25}$ (128 steps), 1 warmup + 10 timed runs each
Outputs median GPU and CPU throughput to CSV

Plot results:

$ python tools/plot.py vulkan.csv cuda.csv --output results.png

Results

Test environment: Windows, NVIDIA GeForce RTX 5080, CUDA 13.2, CUB v3.2.0 (Onesweep default).

Median throughput at N = $2^{25}$ . Ratios relative to this library (> 1× means the competitor is faster).

Sort type	This library (Vulkan)	Fuchsia (Vulkan)	CUB Onesweep (CUDA)
32-bit keys only	12.07 GItems/s	15.59 GItems/s (1.29×)	22.36 GItems/s (1.85×)
32-bit key-value	9.35 GItems/s	5.32 GItems/s (0.57×)	11.67 GItems/s (1.25×)

Fuchsia radix sort is faster on keys-only, but 1.76× slower on key-value. Fuchsia sorts key-value pairs as a single 64-bit key, doubling memory traffic per pass, while this library sorts the two buffers independently.

Benchmark Result

Integration

Integrate via CMake or by copying include/vk_radix_sort.h into your project.

CMake

Add subdirectory:

add_subdirectory(path/to/vulkan_radix_sort)

Link to vk_radix_sort:

# With Volk
target_link_libraries(my_project PRIVATE volk::volk_headers vk_radix_sort)

# With standard Vulkan loader
target_link_libraries(my_project PRIVATE Vulkan::Vulkan vk_radix_sort)

Copy header

Copy include/vk_radix_sort.h into your project and include it directly.

Usage

In exactly one source file, define VRDX_IMPLEMENTATION before including vk_radix_sort.h. Include volk.h first if using Volk.

// With Volk
#include "volk.h"
#define VRDX_IMPLEMENTATION
#include "vk_radix_sort.h"

// With standard Vulkan headers
#define VRDX_IMPLEMENTATION
#include "vk_radix_sort.h"

Enable the required features when creating the VkDevice:

VkPhysicalDeviceVulkan13Features features13 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_3_FEATURES};
features13.synchronization2 = VK_TRUE;

VkPhysicalDeviceVulkan14Features features14 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_4_FEATURES};
features14.pNext = &features13;
features14.pushDescriptor = VK_TRUE;

VkDeviceCreateInfo deviceInfo = {VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
deviceInfo.pNext = &features14;
// ...
vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

Create VkBuffer for keys and values with VK_BUFFER_USAGE_STORAGE_BUFFER_BIT.

Create VrdxSorter:

VrdxSorter sorter = VK_NULL_HANDLE;
VrdxSorterCreateInfo sorterInfo = {};
sorterInfo.physicalDevice = physicalDevice;
sorterInfo.device = device;
sorterInfo.pipelineCache = pipelineCache;
VkResult result = vrdxCreateSorter(&sorterInfo, &sorter);
if (result != VK_SUCCESS) { /* handle error */ }

Allocate a temporary storage buffer:

VrdxSorterStorageRequirements requirements;
vrdxGetSorterStorageRequirements(sorter, elementCount, &requirements);         // keys only
vrdxGetSorterKeyValueStorageRequirements(sorter, elementCount, &requirements); // key-value

VkBufferCreateInfo bufferInfo = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
bufferInfo.size = requirements.size;
bufferInfo.usage = requirements.usage;
// ...

Record sort commands.

Buffer offsets must be multiples of minStorageBufferOffsetAlignment (usually 16).

The sort rebinds pipeline, layout, and push constants — previously bound state is lost.

Add execution barriers around the sort. Use global memory barriers rather than per-resource barriers (Vulkan synchronization examples):

Before the sort: COMPUTE_SHADER stage (+ TRANSFER for indirect), SHADER_READ access (+ TRANSFER_READ for indirect).
After the sort: COMPUTE_SHADER stage, SHADER_WRITE access.

VkQueryPool queryPool;  // VK_NULL_HANDLE, or a timestamp query pool with at least 15 entries.

// Sort keys only
vrdxCmdSort(commandBuffer, sorter, elementCount,
            keysBuffer, 0,
            storageBuffer, 0,
            queryPool, 0);

// Sort keys with values
vrdxCmdSortKeyValue(commandBuffer, sorter, elementCount,
                    keysBuffer, 0,
                    valuesBuffer, 0,
                    storageBuffer, 0,
                    queryPool, 0);

// Sort with indirect element count (read from GPU buffer)
// Actual count in indirectBuffer must not exceed maxElementCount.
vrdxCmdSortKeyValueIndirect(commandBuffer, sorter, maxElementCount,
                            indirectBuffer, 0,
                            keysBuffer, 0,
                            valuesBuffer, 0,
                            storageBuffer, 0,
                            queryPool, 0);

Development Guide

After modifying shaders, run the cmake build. It compiles shaders with slangc into src/generated/*.h, then assembles include/vk_radix_sort.h from src/vk_radix_sort.h.in. Each generated file includes the slangc version as a comment:

// Generated by slangc 2026.11

To bump the slangc version, update SLANG_VERSION in cmake/Slangc.cmake and rebuild.

To bump the library version, update VERSION in the project() call in CMakeLists.txt, rebuild, and commit the regenerated include/vk_radix_sort.h.

Contributing

When modifying shaders or src/vk_radix_sort.h.in, commit the regenerated include/vk_radix_sort.h as well. Use the default build (without -DVRDX_SLANGC_FROM_SDK=ON) to ensure the output is reproducible.

TODO

Compare with VkRadixSort.
Find optimal WORKGROUP_SIZE and PARTITION_DIVISION for different devices.

References

https://github.com/b0nes164/GPUSorting — CUDA kernel references for understanding the algorithm.

Troubleshooting

NVIDIA GPU (Windows): slow performance after a few seconds.

Cause: NVIDIA driver downgrades GPU/memory clock under sustained load. Check with the Performance Overlay (Alt+R).
Fix: set performance mode to maximum in NVIDIA Control Panel.