CudaSift

March 10, 2026

Branch: AdaLovelace — Optimized for NVIDIA Ada Lovelace architecture (RTX 4060 Ti, sm_89)

A high-performance CUDA implementation of the Scale Invariant Feature Transform (SIFT) algorithm. This implementation runs the complete SIFT pipeline on the GPU. On an RTX 4060 Ti, feature extraction takes under a millisecond at resolutions up to 1280x960 and about 3.5 ms at 4K UHD (see Performance Benchmarks).

Based on the original work by Mårten Björkman (Celebrandil), with Ada Lovelace architecture optimizations.


Hardware Target

| Spec | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 4060 Ti |
| Architecture | Ada Lovelace (sm_89) |
| CUDA Cores | 4352 |
| VRAM | 8 GB GDDR6 |
| Memory Bandwidth | 288 GB/s |
| FP32 Performance | ~22.1 TFLOPS |
| L2 Cache | 32 MB |
| Driver | 595.71 |

Performance Benchmarks

SIFT Extraction (5 octaves, threshold=3.0)

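| Resolution | Size | Features | Extract (ms) | Match (ms) | Total (ms) | FPS |
|---|---|---|---|---|---|---|
| VGA | 640x480 | 653 | 0.77 | 0.12 | 1.04 | 965 |
| 720p | 1280x720 | 1155 | 0.91 | 0.20 | 1.47 | 681 |
| SXGA | 1280x960 | 1326 | 0.99 | 0.21 | 1.66 | 601 |
| 1080p | 1920x1080 | 1911 | 1.38 | 0.34 | 2.49 | 402 |
| 1440p | 2560x1440 | 2244 | 1.85 | 0.38 | 3.56 | 281 |
| 4K UHD | 3840x2160 | 2829 | 3.53 | 0.51 | 6.95 | 144 |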

Benchmarked on RTX 4060 Ti (Driver 595.71, CUDA 13.1). Compute Capability 8.9, 34 SMs, 8187 MB VRAM, 128-bit bus, 32768 KB L2 cache.
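
The FPS column is consistent with 1000 divided by the Total (ms) column, and Total is larger than Extract + Match at every resolution. The sketch below spells out that arithmetic. Both the formula and the reading of the remainder as other per-frame overhead (for example image upload) are assumptions; the benchmark output does not state them, and the function names are hypothetical.

```cpp
#include <cassert>

// Frames per second implied by a per-frame time in milliseconds.
// Assumption: FPS = 1000 / total_ms (not stated by the benchmark itself).
double FramesPerSecond(double totalMs) {
  assert(totalMs > 0.0);
  return 1000.0 / totalMs;
}

// Time in Total that is not covered by the Extract and Match columns.
// Assumption: this remainder is other per-frame overhead such as upload.
double RemainderMs(double totalMs, double extractMs, double matchMs) {
  return totalMs - extractMs - matchMs;
}
```

For the 1080p row, `FramesPerSecond(2.49)` is about 401.6, which rounds to the listed 402.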

Feature Matching (FindMaxCorr10 kernel)

| Features | Match Time (ms) |
|---|---|
| 1911 (self-match) | 0.33 |

Octave Comparison (1080p, threshold=3.0)

| Octaves | Features | Extract (ms) |
|---|---|---|
| 3 | 1741 | 1.24 |
| 4 | 1877 | 1.35 |
| 5 | 1911 | 1.60 |
| 6 | 1920 | 1.80 |
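
The feature count flattens out as octaves are added (1911 at 5 octaves, 1920 at 6). One plausible reason is that each extra octave works on a much smaller image. The sketch below lists per-octave image sizes under the assumption that every octave halves both dimensions with integer division; `OctaveSize` and `PyramidSizes` are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <vector>

struct OctaveSize {
  int width;
  int height;
};

// Per-octave image sizes, assuming each octave halves width and height
// (integer division). Stops early if a dimension would reach zero.
std::vector<OctaveSize> PyramidSizes(int width, int height, int numOctaves) {
  assert(width > 0 && height > 0 && numOctaves > 0);
  std::vector<OctaveSize> sizes;
  for (int o = 0; o < numOctaves; ++o) {
    sizes.push_back({width, height});
    width /= 2;
    height /= 2;
    if (width == 0 || height == 0) break;
  }
  return sizes;
}
```

For a 1920x1080 input and 5 octaves this yields 1920x1080, 960x540, 480x270, 240x135 and 120x67, so the last octave holds well under 1% of the pixels of the first.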

Threshold Comparison (1080p, 5 octaves)

| Threshold | Features | Extract (ms) |
|---|---|---|
| 1.0 | 7081 | 2.07 |
| 2.0 | 3700 | 1.79 |
| 3.0 | 1911 | 1.59 |
| 5.0 | 542 | 1.32 |
| 10.0 | 6 | 1.35 |

Cross-Architecture Comparison

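| Arch | GPU | Extract 1280x960 (ms) | Extract 1920x1080 (ms) | Match (ms) | GFLOPS | BW (GB/s) |
|---|---|---|---|---|---|---|
| Pascal | GTX 1080 Ti | 1.20* | 1.70* | 2.20* | 11340 | 484 |
| Turing | RTX 2080 Ti | 0.42* | 0.56* | 0.30* | 11750 | 616 |
| Ada | RTX 4060 Ti | 0.99 | 1.38 | 0.33 | 22060 | 288 |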

* Values from original CudaSift benchmarks. Ada values measured with Driver 595.71, CUDA 13.1.

Architecture Overview

Input Image (Host -> Device)
         |
         v
+--------------------------------------------------+
|          Gaussian Scale Space                     |
|  Octave 0 (full) -> Octave 1 (1/2) -> ... -> N   |
|         |                                         |
|         v                                         |
|  LaplaceMulti: DoG computation                    |
|  (5 scales + 3 border per octave)                 |
+--------------------------------------------------+
         |
         v
+--------------------------------------------------+
|          Keypoint Detection                       |
|  FindPointsMulti:                                 |
|    - 3D extrema detection (26 neighbors)          |
|    - Edge response rejection                      |
|    - Sub-pixel localization (Taylor expansion)    |
+--------------------------------------------------+
         |
         v
+--------------------------------------------------+
|        Orientation Assignment                     |
|  ComputeOrientations:                             |
|    - 32-bin gradient histogram                    |
|    - Gaussian-weighted 11x11 window               |
|    - Secondary peak -> duplicate feature          |
+--------------------------------------------------+
         |
         v
+--------------------------------------------------+
|       Descriptor Computation                      |
|  ExtractSiftDescriptors:                          |
|    - 4x4 spatial bins x 8 orientations            |
|    - 128-D vector per feature                     |
|    - Two-pass normalization (clip + renorm)       |
+--------------------------------------------------+
         |
         v
+--------------------------------------------------+
|          Feature Matching                         |
|  FindMaxCorr10 (brute-force):                     |
|    - 32x32 feature block tiling                   |
|    - float4 vectorized loads                      |
|    - Warp shuffle reductions                      |
|    - Best + second-best tracking (ambiguity)      |
|                                                   |
|  FindHomography (RANSAC):                         |
|    - 4-point DLT on GPU                           |
|    - Parallel hypothesis testing                  |
|    - Iterative refinement (CPU, Cholesky)         |
+--------------------------------------------------+
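
The "3D extrema detection (26 neighbors)" step in the diagram can be illustrated on the CPU. The sketch below is a simplified stand-in, not the FindPointsMulti kernel: it assumes a DoG stack stored as `dog[scale][y * width + x]`, applies the threshold to the absolute DoG value, and omits edge rejection and sub-pixel refinement. The function name and data layout are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns true if the DoG sample at (s, x, y) passes the threshold and is
// strictly greater or strictly smaller than all 26 neighbors: 8 in its own
// scale plus 9 in each of the two adjacent scales.
// Assumed layout: dog[scale][y * width + x].
bool IsExtremum26(const std::vector<std::vector<float>>& dog, int width,
                  int height, int s, int x, int y, float thresh) {
  assert(s >= 1 && s + 1 < static_cast<int>(dog.size()));
  assert(x >= 1 && x + 1 < width && y >= 1 && y + 1 < height);
  const float v = dog[s][y * width + x];
  if (std::fabs(v) < thresh) return false;  // weak response, reject early
  bool isMax = true;
  bool isMin = true;
  for (int ds = -1; ds <= 1; ++ds) {
    for (int dy = -1; dy <= 1; ++dy) {
      for (int dx = -1; dx <= 1; ++dx) {
        if (ds == 0 && dy == 0 && dx == 0) continue;  // skip the center
        const float n = dog[s + ds][(y + dy) * width + (x + dx)];
        if (n >= v) isMax = false;
        if (n <= v) isMin = false;
      }
    }
  }
  return isMax || isMin;
}
```

Checking the threshold first means most pixels are rejected with a single comparison before the 26-neighbor loop runs.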

CUDA Kernel Configuration

| Kernel | Block Size | Shared Mem | Description |
|---|---|---|---|
| ScaleDown | 68x1 | 2 KB | 2x downsampling with 5-tap Gaussian |
| LaplaceMulti | 136x1 | 4 KB | Multi-scale DoG computation |
| FindPointsMulti | 32x1 | 1 KB | 3D extrema detection + sub-pixel |
| ComputeOrientations | 121x1 | 0.5 KB | Gradient histogram, peak detection |
| ExtractSiftDescriptors | 16x8 | 0.7 KB | 128-D descriptor with trilinear interp |
| FindMaxCorr10 | 32x8 | 32 KB | Tiled brute-force matching |
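
The ExtractSiftDescriptors stage is described in the overview as using a two-pass normalization (clip + renorm). A CPU sketch of that idea follows. The clip value of 0.2 is the conventional choice from Lowe's SIFT formulation and is an assumption here, as is the function name; the kernel's actual constant is not given in this document.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

// Two-pass descriptor normalization: scale to unit length, clip large
// entries, then scale to unit length again. Descriptor entries are
// histogram values and therefore non-negative, so only an upper clip is used.
// Assumption: clip = 0.2 (conventional SIFT value).
void NormalizeDescriptor(std::array<float, 128>& d, float clip = 0.2f) {
  assert(clip > 0.0f);
  auto normalize = [&d]() {
    float sumSq = 0.0f;
    for (float v : d) sumSq += v * v;
    if (sumSq <= 0.0f) return;  // leave an all-zero descriptor untouched
    const float inv = 1.0f / std::sqrt(sumSq);
    for (float& v : d) v *= inv;
  };
  normalize();                                  // pass 1: unit length
  for (float& v : d) v = std::min(v, clip);     // limit dominant bins
  normalize();                                  // pass 2: unit length again
}
```

Clipping reduces the influence of a few very strong gradient bins, which makes the descriptor less sensitive to non-linear lighting changes.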

Building

Prerequisites

  • CUDA Toolkit 11.8+ (the first release with sm_89 support; 12.x recommended for Ada Lovelace)
  • OpenCV 4.x
  • CMake 3.18+
  • C++17 compatible compiler

Quick Build (Windows)

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release

Quick Build (Linux)

mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

Using the Build Script

bash scripts/build.sh Release

CMake Options

| Option | Default | Description |
|---|---|---|
| BUILD_TESTS | ON | Build test and benchmark programs |
| BUILD_EXAMPLES | ON | Build example programs |
| USE_MANAGED_MEM | OFF | Use CUDA managed memory |
| VERBOSE_OUTPUT | ON | Enable verbose timing output |

Usage

Main Demo

# Default (img1.png & img2.png)
./cudasift

# Specify GPU device and image set
./cudasift 0 1   # device 0, PGM image set

Feature Extraction Demo

./demo_extract [image_path] [gpu_id] [threshold] [num_octaves]
./demo_extract data/img1.png 0 3.0 5

Output: data/keypoints.png with detected keypoints drawn.

Feature Matching Demo

./demo_match [img1] [img2] [gpu_id]
./demo_match data/img1.png data/img2.png 0

Output: data/matches.png with match lines between images.

Real-time Video Demo

./demo_video [source] [gpu_id] [threshold]
./demo_video 0           # Webcam
./demo_video video.mp4   # Video file

Keys: q quit, +/- adjust threshold.

Performance Benchmark

./benchmark [gpu_id] [num_runs] [threshold]
./benchmark 0 200 3.0

Outputs performance tables at multiple resolutions with extraction, matching, and upload times.

Running Tests

# Individual tests
./test_extract     # Feature extraction correctness
./test_match       # Matching and quality tests
./test_homography  # Geometric verification tests

# All tests + benchmark
bash scripts/run_benchmark.sh

Test Results (RTX 4060 Ti)

| Test Suite | Passed | Total | Rate |
|---|---|---|---|
| test_extract | 10 | 10 | 100% |
| test_match | 11 | 11 | 100% |
| test_homography | 8 | 8 | 100% |
| Total | 29 | 29 | 100% |

Test Coverage

| Test | What It Verifies |
|---|---|
| BasicExtraction | Features detected, valid positions/scales |
| DifferentThresholds | Higher threshold = fewer features |
| DifferentOctaves | More octaves = more features |
| Reproducibility | Identical results across runs |
| ScaleUp | 2x upsampling detects more features |
| SelfMatch | Self-matching gives perfect scores |
| CrossMatch | Cross-image matching produces valid results |
| Homography | RANSAC + refinement finds inliers |
| Translation | Recovers known translation |
| Rotation | Handles 10 degree rotation |
| Scale | Handles 80% scale change |
| PGMImages | Stereo pair matching |

API Reference

Core Functions

// Initialize CUDA device
void InitCuda(int devNum = 0);

// Allocate/free temporary GPU memory for extraction
float *AllocSiftTempMemory(int width, int height, int numOctaves, bool scaleUp = false);
void FreeSiftTempMemory(float *memoryTmp);

// Extract SIFT features from a GPU image
void ExtractSift(SiftData &siftData, CudaImage &img, int numOctaves,
                 double initBlur, float thresh, float lowestScale = 0.0f,
                 bool scaleUp = false, float *tempMemory = 0);

// Initialize/free SIFT data container
void InitSiftData(SiftData &data, int num = 1024, bool host = false, bool dev = true);
void FreeSiftData(SiftData &data);

// Match two sets of SIFT features on GPU
double MatchSiftData(SiftData &data1, SiftData &data2);

// Find homography using RANSAC
double FindHomography(SiftData &data, float *homography, int *numMatches,
                      int numLoops = 1000, float minScore = 0.85f,
                      float maxAmbiguity = 0.95f, float thresh = 5.0f);
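
FindHomography writes its result into a caller-provided `float *homography`. The sketch below shows one way such a matrix could be applied to a point. It assumes the buffer holds 9 floats in row-major order (`h[0..2]` is the first row); that layout is not stated in this document, and `ApplyHomography` is a hypothetical helper, not part of the API.

```cpp
#include <cassert>
#include <cmath>

// Maps (x, y) through a 3x3 homography stored as 9 floats.
// Assumption: row-major storage, so h[0..2] is the first row.
// Returns false when the point maps to infinity (zero denominator).
bool ApplyHomography(const float* h, float x, float y, float& xOut,
                     float& yOut) {
  assert(h != nullptr);
  const float den = h[6] * x + h[7] * y + h[8];
  if (std::fabs(den) < 1e-12f) return false;
  xOut = (h[0] * x + h[1] * y + h[2]) / den;
  yOut = (h[3] * x + h[4] * y + h[5]) / den;
  return true;
}
```

With a pure translation such as `{1, 0, 5, 0, 1, -3, 0, 0, 1}`, the point (10, 20) maps to (15, 17).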

Data Structures

struct SiftPoint {
  float xpos, ypos;       // Sub-pixel position
  float scale;            // Feature scale (sigma)
  float sharpness;        // DoG response value
  float edgeness;         // Edge response ratio
  float orientation;      // Dominant orientation (degrees)
  float score;            // Match correlation score
  float ambiguity;        // Second-best / best ratio
  int match;              // Index of best match
  float match_xpos, match_ypos;  // Matched point position
  float match_error;      // Reprojection error
  float subsampling;      // Octave subsampling factor
  float data[128];        // 128-D descriptor vector
};

struct SiftData {
  int numPts;             // Number of detected features
  int maxPts;             // Allocated capacity
  SiftPoint *h_data;      // Host pointer
  SiftPoint *d_data;      // Device pointer
};
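
The `score` and `ambiguity` fields line up with the `minScore` and `maxAmbiguity` parameters of FindHomography. The sketch below filters matches on the host using the same two criteria. It uses a trimmed stand-in struct so that it compiles on its own; the strict comparisons and the function name are assumptions, not the library's documented behavior.

```cpp
#include <cassert>
#include <vector>

// Trimmed stand-in for SiftPoint holding only the fields used here.
struct MatchFields {
  float score;      // match correlation score
  float ambiguity;  // second-best / best ratio
};

// Returns the indices of points whose match is both strong (high score)
// and distinctive (low ambiguity).
// Assumption: strict comparisons against both limits.
std::vector<int> SelectConfidentMatches(const std::vector<MatchFields>& pts,
                                        float minScore = 0.85f,
                                        float maxAmbiguity = 0.95f) {
  assert(maxAmbiguity > 0.0f);
  std::vector<int> keep;
  for (int i = 0; i < static_cast<int>(pts.size()); ++i) {
    if (pts[i].score > minScore && pts[i].ambiguity < maxAmbiguity) {
      keep.push_back(i);
    }
  }
  return keep;
}
```

A low ambiguity means the best match scored clearly higher than the runner-up, which is why both conditions are checked rather than the score alone.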

File Structure

CudaSift/
|-- CMakeLists.txt          # Modern CMake build (sm_89)
|-- README.md               # This file
|-- LICENSE                  # MIT License
|
|-- cudaSift.h              # Public API header
|-- cudaSiftH.cu            # Host-side SIFT pipeline
|-- cudaSiftH.h             # Host function declarations
|-- cudaSiftD.cu            # Device kernels (DoG, keypoints, descriptors)
|-- cudaSiftD.h             # Kernel constants and block sizes
|-- cudaImage.cu            # GPU image container
|-- cudaImage.h             # Image class declaration
|-- cudautils.h             # CUDA utilities (error checking, timers, shuffle)
|-- matching.cu             # Matching kernels + RANSAC homography
|-- geomFuncs.cpp           # CPU homography refinement
|-- mainSift.cpp            # Main demo program
|
|-- examples/
|   |-- demo_extract.cpp    # Single-image extraction demo
|   |-- demo_match.cpp      # Two-image matching demo
|   +-- demo_video.cpp      # Real-time video demo
|
|-- tests/
|   |-- benchmark.cpp       # Multi-resolution performance benchmark
|   |-- test_extract.cpp    # Extraction correctness tests
|   |-- test_match.cpp      # Matching quality tests
|   +-- test_homography.cpp # Geometric verification tests
|
|-- scripts/
|   |-- build.sh            # Build script
|   +-- run_benchmark.sh    # Run all tests + benchmark
|
|-- data/
|   |-- img1.png            # Test image 1 (1280x960)
|   |-- img2.png            # Test image 2 (1280x960)
|   |-- left.pgm            # Stereo left image
|   +-- righ.pgm            # Stereo right image
|
+-- match.pdf               # Matching kernel optimization notes

Ada Lovelace Optimizations

This branch includes the following optimizations for the Ada Lovelace architecture:

  1. sm_89 Compute Target -- Native code generation for RTX 40-series GPUs
  2. Fast Math -- --use_fast_math for all CUDA kernels (intrinsic sin/cos/exp/sqrt)
  3. Large L2 Cache -- RTX 4060 Ti has 32 MB L2 cache, benefiting texture lookups and DoG pyramid reads
  4. Warp Synchronization -- All warp-level operations use __shfl_sync with full mask
  5. Optimized Block Sizes -- Tuned for the RTX 4060 Ti's 34 SMs (128 CUDA cores each) and Ada Lovelace occupancy characteristics
  6. C++17 / CUDA C++17 -- Modern language standard for both host and device code
  7. Static Library -- Core SIFT compiled as static library for faster linking
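
The warp-level reductions mentioned in item 4 follow a halving pattern that can be emulated on the CPU. The sketch below is not CUDA code and not taken from the kernels: it mimics what 32 lanes would compute if each ran `val += __shfl_down_sync(0xffffffff, val, offset)` for offsets 16, 8, 4, 2, 1. Which shuffle variant the kernels actually use is an assumption.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;

// CPU emulation of a warp-wide sum. At each step, lane i adds the value
// held by lane i + offset. A lane whose source is out of range receives
// its own value, as __shfl_down_sync does. Lane 0 ends up with the total.
float WarpReduceSumEmulated(std::vector<float> lanes) {
  assert(static_cast<int>(lanes.size()) == kWarpSize);
  for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
    const std::vector<float> prev = lanes;  // all lanes read before any write
    for (int lane = 0; lane < kWarpSize; ++lane) {
      const int src = lane + offset;
      lanes[lane] = prev[lane] + (src < kWarpSize ? prev[src] : prev[lane]);
    }
  }
  return lanes[0];
}
```

The sum completes in five steps instead of 31 sequential additions, and on the GPU it needs no shared memory because values move directly between registers.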

Algorithm Parameters

| Parameter | Default | Description |
|---|---|---|
| numOctaves | 5 | Number of octaves in scale space |
| initBlur | 1.0 | Initial Gaussian blur sigma |
| thresh | 3.0 | DoG threshold for keypoint detection |
| lowestScale | 0.0 | Minimum scale for features |
| scaleUp | false | 2x upsample input for fine features |
| maxPts | 32768 | Maximum number of features |
| minScore | 0.85 | Minimum match score for RANSAC |
| maxAmbiguity | 0.95 | Maximum ambiguity ratio for RANSAC |

License

MIT License -- see LICENSE for details.