vulkanradixsort

June 29, 2026 · View on GitHub

Reduce-then-scan GPU radix sort, implemented as a single-file header-only Vulkan library. No additional dependencies.

Note: As of June 2026 (CUDA 13.2, CUB v3.2.0 Onesweep), CUB is faster by 1.85× on keys-only and 1.25× on key-value at N = 2252^{25}. Still a practical choice for Vulkan-based workflows.

Requirements

  • VulkanSDK >= 1.4.328.1 — download from https://vulkan.lunarg.com/ (push descriptor requires >= 1.4; >= 1.4.328.1 for macOS)
  • cmake >= 3.24
  • Vulkan 1.3+ device with pushDescriptor and synchronization2 enabled (see Usage)

slangc v2026.11 is downloaded automatically at configure time. To use the Vulkan SDK's slangc instead:

cmake -B build -DVRDX_SLANGC_FROM_SDK=ON

Benchmark

Build

$ cmake . -B build                            # slangc downloaded automatically (v2026.11)
$ cmake . -B build -DVRDX_SLANGC_FROM_SDK=ON  # use slangc from Vulkan SDK instead
$ cmake --build build --config Release -j

Run

$ ./build/Release/bench.exe <type> [-o output.csv] [--validation] [--no-verify]  # Windows
$ ./build/bench <type> [-o output.csv] [--validation] [--no-verify]              # Linux
  • type: cpu, vulkan, cuda, fuchsia
  • --validation: enable Vulkan validation layers (disabled by default to avoid benchmark overhead)
  • --no-verify: skip correctness check and proceed directly to benchmarking
  • Sweeps N from 2182^{18} to 2252^{25} (128 steps), 1 warmup + 10 timed runs each
  • Outputs median GPU and CPU throughput to CSV

Plot results:

$ python tools/plot.py vulkan.csv cuda.csv --output results.png

Results

Test environment: Windows, NVIDIA GeForce RTX 5080, CUDA 13.2, CUB v3.2.0 (Onesweep default).

Median throughput at N = 2252^{25}. Ratios relative to this library (> 1× means the competitor is faster).

Sort typeThis library (Vulkan)Fuchsia (Vulkan)CUB Onesweep (CUDA)
32-bit keys only12.07 GItems/s15.59 GItems/s (1.29×)22.36 GItems/s (1.85×)
32-bit key-value9.35 GItems/s5.32 GItems/s (0.57×)11.67 GItems/s (1.25×)

Fuchsia radix sort is faster on keys-only, but 1.76× slower on key-value. Fuchsia sorts key-value pairs as a single 64-bit key, doubling memory traffic per pass, while this library sorts the two buffers independently.

Benchmark Result

Integration

Integrate via CMake or by copying include/vk_radix_sort.h into your project.

CMake

  1. Add subdirectory:

    add_subdirectory(path/to/vulkan_radix_sort)
    
  2. Link to vk_radix_sort:

    # With Volk
    target_link_libraries(my_project PRIVATE volk::volk_headers vk_radix_sort)
    
    # With standard Vulkan loader
    target_link_libraries(my_project PRIVATE Vulkan::Vulkan vk_radix_sort)
    

Copy header

Copy include/vk_radix_sort.h into your project and include it directly.

Usage

  1. In exactly one source file, define VRDX_IMPLEMENTATION before including vk_radix_sort.h. Include volk.h first if using Volk.

    // With Volk
    #include "volk.h"
    #define VRDX_IMPLEMENTATION
    #include "vk_radix_sort.h"
    
    // With standard Vulkan headers
    #define VRDX_IMPLEMENTATION
    #include "vk_radix_sort.h"
    
  2. Enable the required features when creating the VkDevice:

    VkPhysicalDeviceVulkan13Features features13 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_3_FEATURES};
    features13.synchronization2 = VK_TRUE;
    
    VkPhysicalDeviceVulkan14Features features14 = {VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_1_4_FEATURES};
    features14.pNext = &features13;
    features14.pushDescriptor = VK_TRUE;
    
    VkDeviceCreateInfo deviceInfo = {VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
    deviceInfo.pNext = &features14;
    // ...
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);
    
  3. Create VkBuffer for keys and values with VK_BUFFER_USAGE_STORAGE_BUFFER_BIT.

  4. Create VrdxSorter:

    VrdxSorter sorter = VK_NULL_HANDLE;
    VrdxSorterCreateInfo sorterInfo = {};
    sorterInfo.physicalDevice = physicalDevice;
    sorterInfo.device = device;
    sorterInfo.pipelineCache = pipelineCache;
    VkResult result = vrdxCreateSorter(&sorterInfo, &sorter);
    if (result != VK_SUCCESS) { /* handle error */ }
    
  5. Allocate a temporary storage buffer:

    VrdxSorterStorageRequirements requirements;
    vrdxGetSorterStorageRequirements(sorter, elementCount, &requirements);         // keys only
    vrdxGetSorterKeyValueStorageRequirements(sorter, elementCount, &requirements); // key-value
    
    VkBufferCreateInfo bufferInfo = {VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO};
    bufferInfo.size = requirements.size;
    bufferInfo.usage = requirements.usage;
    // ...
    
  6. Record sort commands.

    Buffer offsets must be multiples of minStorageBufferOffsetAlignment (usually 16).

    The sort rebinds pipeline, layout, and push constants — previously bound state is lost.

    Add execution barriers around the sort. Use global memory barriers rather than per-resource barriers (Vulkan synchronization examples):

    • Before the sort: COMPUTE_SHADER stage (+ TRANSFER for indirect), SHADER_READ access (+ TRANSFER_READ for indirect).
    • After the sort: COMPUTE_SHADER stage, SHADER_WRITE access.
    VkQueryPool queryPool;  // VK_NULL_HANDLE, or a timestamp query pool with at least 15 entries.
    
    // Sort keys only
    vrdxCmdSort(commandBuffer, sorter, elementCount,
                keysBuffer, 0,
                storageBuffer, 0,
                queryPool, 0);
    
    // Sort keys with values
    vrdxCmdSortKeyValue(commandBuffer, sorter, elementCount,
                        keysBuffer, 0,
                        valuesBuffer, 0,
                        storageBuffer, 0,
                        queryPool, 0);
    
    // Sort with indirect element count (read from GPU buffer)
    // Actual count in indirectBuffer must not exceed maxElementCount.
    vrdxCmdSortKeyValueIndirect(commandBuffer, sorter, maxElementCount,
                                indirectBuffer, 0,
                                keysBuffer, 0,
                                valuesBuffer, 0,
                                storageBuffer, 0,
                                queryPool, 0);
    

Development Guide

After modifying shaders, run the cmake build. It compiles shaders with slangc into src/generated/*.h, then assembles include/vk_radix_sort.h from src/vk_radix_sort.h.in. Each generated file includes the slangc version as a comment:

// Generated by slangc 2026.11

To bump the slangc version, update SLANG_VERSION in cmake/Slangc.cmake and rebuild.

To bump the library version, update VERSION in the project() call in CMakeLists.txt, rebuild, and commit the regenerated include/vk_radix_sort.h.

Contributing

When modifying shaders or src/vk_radix_sort.h.in, commit the regenerated include/vk_radix_sort.h as well. Use the default build (without -DVRDX_SLANGC_FROM_SDK=ON) to ensure the output is reproducible.

TODO

  • Compare with VkRadixSort.
  • Find optimal WORKGROUP_SIZE and PARTITION_DIVISION for different devices.

References

Troubleshooting

NVIDIA GPU (Windows): slow performance after a few seconds.

  • Cause: NVIDIA driver downgrades GPU/memory clock under sustained load. Check with the Performance Overlay (Alt+R).
  • Fix: set performance mode to maximum in NVIDIA Control Panel.