rf-detr.cpp

May 29, 2026

Brought to you by the LocalAI team, the creators of LocalAI: the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.

A C++ inference engine for Roboflow RF-DETR, built on ggml. Supports the full RF-DETR family: 5 detection variants (Nano/Small/Base/Medium/Large) and 3 segmentation variants (SegNano/SegSmall/SegMedium), with F32 / F16 / Q8_0 / Q4_K quantizations published as GGUFs on HuggingFace.

Status: end-to-end detection and segmentation work on real model weights. C++ F16 is about 9% faster than PyTorch CPU on every COCO image we tested, matches F32 accuracy (max |Δscore| ≤ 0.006), and is 1.86x smaller. Detection match vs PyTorch is 54/55 at IoU ≥ 0.95 across 7 COCO val2017 images. Mask IoU is 0.9924 mean across segmentation variants.

Examples

Detection (rfdetr-base, F16):

[Images: Bus + pedestrians detection | Kitchen scene detection]

Segmentation (rfdetr-seg-nano, F16) with per-class mask overlay:

[Images: Street scene segmentation | Cats + remotes segmentation]

All outputs above were produced by rfdetr-cli detect --annotated <path>.png; the renderer draws per-class colored boxes with class name + score labels, and for segmentation models overlays the per-detection mask in the same class color.

Quickstart: prebuilt models

All 32 GGUF models (8 variants x 4 quantizations) are published on HuggingFace. Pull one and run detection in three commands:

# `--recursive` is mandatory: third_party/ggml is a submodule.
# If you've already cloned without it: git submodule update --init --recursive
git clone --recursive https://github.com/mudler/rf-detr.cpp && cd rf-detr.cpp

cmake -B build -DRFDETR_BUILD_CLI=ON && cmake --build build -j

# F16 is the default we recommend: fastest on CPU, matches F32 accuracy, 1.86x smaller.
mkdir -p models
hf download mudler/rfdetr-cpp-base rfdetr-base-f16.gguf --local-dir models/

# Detect
./build/bin/rfdetr-cli detect \
    --model models/rfdetr-base-f16.gguf \
    --input my_image.jpg \
    --output detections.json \
    --threshold 0.5 --threads 8

Available pre-built repositories

| Variant | HuggingFace | F32 | F16 | Q8_0 | Q4_K |
|---|---|---|---|---|---|
| Nano | mudler/rfdetr-cpp-nano | 113 MB | 61 MB | 36 MB | 30 MB |
| Small | mudler/rfdetr-cpp-small | 119 MB | 64 MB | 38 MB | 31 MB |
| Base | mudler/rfdetr-cpp-base | 119 MB | 64 MB | 38 MB | 31 MB |
| Medium | mudler/rfdetr-cpp-medium | 125 MB | 67 MB | 40 MB | 32 MB |
| Large | mudler/rfdetr-cpp-large | 126 MB | 68 MB | 41 MB | 33 MB |
| Seg-Nano | mudler/rfdetr-cpp-seg-nano | 127 MB | 68 MB | 40 MB | 32 MB |
| Seg-Small | mudler/rfdetr-cpp-seg-small | 128 MB | 68 MB | 40 MB | 32 MB |
| Seg-Medium | mudler/rfdetr-cpp-seg-medium | 134 MB | 72 MB | 42 MB | 34 MB |
| Seg-Large | mudler/rfdetr-cpp-seg-large | 134 MB | 72 MB | 43 MB | 34 MB |
| Seg-XLarge | mudler/rfdetr-cpp-seg-xlarge | 141 MB | 76 MB | 45 MB | 36 MB |
| Seg-2XLarge | mudler/rfdetr-cpp-seg-2xlarge | 143 MB | 78 MB | 48 MB | 38 MB |

Use F16 by default. It matches F32 accuracy, is 1.86x smaller, and is the fastest variant on CPU on every model we measured. See Benchmarks for the full numbers.

Quickstart: segmentation with mask output

hf download mudler/rfdetr-cpp-seg-nano rfdetr-seg-nano-f16.gguf --local-dir models/

mkdir -p /tmp/seg_masks
./build/bin/rfdetr-cli detect \
    --model models/rfdetr-seg-nano-f16.gguf \
    --input /tmp/coco_sample.jpg \
    --threshold 0.5 --threads 8 \
    --masks  /tmp/seg_masks \
    --output /tmp/seg.json

ls /tmp/seg_masks/
# det_000_class1_score93.png   <- person silhouette
# det_001_class51_score84.png  <- bowl silhouette
# ...

The --masks <dir> flag writes one PNG per detection (binary mask at the original image resolution). Mask quality matches PyTorch at IoU 0.997 and 99.98% pixel agreement on Seg-Nano F32; the remaining differences are sub-pixel boundary FP rounding.
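If you post-process the mask directory from C++, the file names in the listing above can be decoded back into their fields. This is a minimal sketch that assumes the naming pattern is exactly det_<index>_class<id>_score<percent>.png, as in the two sample names, and that the score suffix is a percentage; neither is a documented guarantee. MaskFileInfo and parse_mask_filename are hypothetical names, not part of rf-detr.cpp.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Fields recovered from a mask file name (assumed pattern, see lead-in).
struct MaskFileInfo {
    int index;          // running detection index (det_000 -> 0)
    int class_id;       // class id (class1 -> 1)
    int score_percent;  // score suffix, assumed to be a percentage (score93 -> 93)
};

// Returns true and fills *out when name matches
// det_<index>_class<id>_score<percent>.png with nothing left over.
inline bool parse_mask_filename(const std::string& name, MaskFileInfo* out) {
    assert(out != nullptr);
    int idx = 0, cls = 0, pct = 0, consumed = 0;
    // %n records how many characters were consumed; it stays 0 if the
    // trailing ".png" literal does not match.
    const int fields = std::sscanf(name.c_str(), "det_%d_class%d_score%d.png%n",
                                   &idx, &cls, &pct, &consumed);
    if (fields != 3 || consumed != static_cast<int>(name.size())) return false;
    out->index = idx;
    out->class_id = cls;
    out->score_percent = pct;
    return true;
}
```

Calling parse_mask_filename("det_000_class1_score93.png", &info) fills info with index 0, class 1 and score 93; names with another extension or prefix are rejected.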

Quickstart: convert from upstream

To roll your own (different variant, custom checkpoint, different quant):

# One-time: convert upstream RF-DETR .pth to GGUF (requires .venv with rfdetr).
python3 -m venv .venv && .venv/bin/pip install rfdetr

# F16: fastest on CPU, 1.86x smaller than F32, matches F32 accuracy.
.venv/bin/python scripts/convert_rfdetr_to_gguf.py \
    --variant base --dtype f16 \
    --output models/rfdetr-base-f16.gguf

# Pick a variant (nano|small|base|medium|large|seg-nano|seg-small|seg-medium|seg-large|seg-xlarge|seg-2xlarge)
.venv/bin/python scripts/convert_rfdetr_to_gguf.py \
    --variant nano --dtype f16 \
    --output models/rfdetr-nano-f16.gguf

# Re-quantize an existing F32 GGUF to any ggml type (incl. K-quants) without re-converting
./build/bin/rfdetr-cli quantize \
    models/rfdetr-base-f32.gguf models/rfdetr-base-q6_K.gguf q6_K
# Supported: f32 | f16 | q4_0 | q4_1 | q5_0 | q5_1 | q8_0 | q4_K | q5_K | q6_K

# Convert all detection variants in one shot
scripts/convert_all_variants.sh

# Build the full matrix (5 detection + 3 seg, 4 quants each, = 32 models)
scripts/build_all_quants.sh

Quickstart: fine-tuning

rf-detr.cpp is inference-only. To fine-tune RF-DETR on a custom dataset, train with the upstream rfdetr Python library, then convert the resulting checkpoint to GGUF:

.venv/bin/python scripts/convert_rfdetr_to_gguf.py \
    --checkpoint runs/my_train/checkpoint_best_total.pth \
    --variant base --dtype f16 \
    --output models/my_finetune-f16.gguf

The converter reads the head size directly from the checkpoint tensor and resizes the classification head before loading, so arbitrary num_classes values are handled automatically. See docs/finetuning.md for the end-to-end walkthrough (dataset prep, train, convert, quantize, serve), plus a smoke test using a synthetic 5-class checkpoint at scripts/build_custom_checkpoint.py.

Benchmarks

End-to-end CPU inference on an AMD Ryzen 9 9950X3D (single batch, --threads 8). C++ F16 is faster than PyTorch on every image, with a model file that is 1.86x smaller:

[Figure: Latency comparison: PyTorch vs rf-detr.cpp F32 vs F16 vs Q8_0 across 7 COCO images]

| Impl | Median ms/image | Model size | vs PyTorch | Detection match (IoU ≥ 0.95) |
|---|---|---|---|---|
| Python rfdetr (PyTorch + oneDNN) | 149.5 | 120 MB | 1.00x (ref) | reference |
| C++ rf-detr.cpp F32 (T=8) | 142.5 | 120 MB | 1.05x | 54/55, max \|Δscore\| 0.045 |
| C++ rf-detr.cpp F16 (T=8) | 136.9 | 64 MB | 1.09x | 54/55, max \|Δscore\| 0.044 |
| C++ rf-detr.cpp Q8_0 (T=8) | 147.6 | 39 MB | 1.01x | 54/55, max \|Δscore\| 0.046 |

Numbers are medians (median-of-medians across 7 diverse COCO val2017 images, 3 passes of 20 iterations each, 5 warmup, 8 s cooldown between cells; see --rigorous mode in scripts/bench_community.py). Build uses -march=native plus ggml's tinyBLAS SGEMM (GGML_LLAMAFILE=ON) plus OpenMP plus a persistent ggml graph allocator.
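The median-of-medians aggregation can be sketched as follows. The authoritative procedure is the --rigorous mode in scripts/bench_community.py; this version assumes one median per image over all of its timed iterations, then a median across images, and does not model the passes, warmup or cooldown. The function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample; averages the two middle values for even sizes.
inline double median(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2 == 1) ? v[mid] : 0.5 * (v[mid - 1] + v[mid]);
}

// timings_ms[i] holds the per-iteration latencies measured for image i.
// Returns the median of the per-image medians.
inline double median_of_medians(const std::vector<std::vector<double>>& timings_ms) {
    assert(!timings_ms.empty());
    std::vector<double> per_image;
    per_image.reserve(timings_ms.size());
    for (const std::vector<double>& t : timings_ms) per_image.push_back(median(t));
    return median(per_image);
}
```

Taking medians at both levels keeps a single slow iteration, or a single unusually heavy image, from dominating the reported figure.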

See BENCHMARK.md for the per-image breakdown, F16 fast-path explanation, thread-scaling sweep, methodology, and reproduction recipe.

Variants comparison

All 5 detection variants share the DINOv2-small backbone; they differ in input resolution and decoder layer count. C++ F16 is faster than PyTorch on Nano through Medium; on Large the two are within run-to-run variance:

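| Variant | Resolution | Dec layers | C++ F16 median ms @ T=8 | PyTorch median ms |
|---|---|---|---|---|
| Nano | 384 | 2 | 61.5 | 88.4 |
| Small | 512 | 3 | 116.0 | 120.5 |
| Base | 560 | 3 | 136.9 | 149.5 |
| Medium | 576 | 4 | 149.6 | 182.8 |
| Large | 704 | 4 | 237.8 | 228.7* |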

* Large is the one variant where PyTorch is competitive at T=8 (within run-to-run variance).
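To read the table as speedups, divide the PyTorch median by the C++ F16 median. The sketch below hard-codes the five rows from the table; VariantTiming, kTimings and the helper functions are names made up for this example.

```cpp
#include <cassert>
#include <cstdio>

struct VariantTiming {
    const char* name;
    double cpp_f16_ms;  // C++ F16 median ms at T=8
    double pytorch_ms;  // PyTorch median ms
};

// Values copied from the variants table.
static const VariantTiming kTimings[] = {
    {"Nano", 61.5, 88.4},     {"Small", 116.0, 120.5}, {"Base", 136.9, 149.5},
    {"Medium", 149.6, 182.8}, {"Large", 237.8, 228.7},
};

// A result above 1 means the C++ F16 build is faster than PyTorch.
inline double speedup(const VariantTiming& v) {
    assert(v.cpp_f16_ms > 0.0);
    return v.pytorch_ms / v.cpp_f16_ms;
}

// Prints one "name speedup" line per variant.
inline void print_speedups() {
    for (const VariantTiming& v : kTimings)
        std::printf("%-6s %.2fx\n", v.name, speedup(v));
}
```

With these inputs the ratios come out at about 1.44x (Nano), 1.04x (Small), 1.09x (Base), 1.22x (Medium) and 0.96x (Large).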

[Figure: Variants overview]

Quantization tradeoffs

K-quants (Q4_K / Q5_K / Q6_K) produced via the C++ quantizer beat legacy block quants (Q4_0 / Q5_0) at the same target bit-width. The full matrix:

[Figure: Quant tradeoffs]

| Variant | Recall@0.5 | Recall@0.95 | Max \|Δscore\| | Notes |
|---|---|---|---|---|
| F32 | 1.000 | 0.989 | 0.008 | Reference |
| F16 | 1.000 | 0.989 | 0.008 | Matches F32, fastest variant |
| Q8_0 | 1.000 | 0.989 | 0.009 | 3.10x compression, no accuracy loss |
| Q6_K | 1.000 | 0.989 | 0.011 | 3.40x compression, about 10% slower than Q8_0 |
| Q5_K | 0.953 | 0.879 | 0.014 | Mild accuracy loss; still usable |
| Q4_K | 0.953 | 0.879 | 0.020 | Halves Δscore vs legacy Q4_0 at same size |
| Q4_0 (legacy) | 0.891 | 0.727 | 0.226 | Steep accuracy drop; not recommended |
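One plausible reading of the Recall@IoU columns is the fraction of reference detections that a quantized model reproduces with box IoU at or above the threshold. The sketch below implements that reading with greedy one-to-one matching. The Box layout, the matching rule and ignoring class labels are all assumptions made for illustration; the project's own evaluation scripts may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Axis-aligned box in pixels: (x1, y1) top-left, (x2, y2) bottom-right.
// This layout is an assumption for the sketch, not the rfdetr.h struct.
struct Box {
    float x1, y1, x2, y2;
};

// Intersection-over-union of two boxes; 0 when they do not overlap.
inline float box_iou(const Box& a, const Box& b) {
    const float iw = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
    const float ih = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
    if (iw <= 0.0f || ih <= 0.0f) return 0.0f;
    const float inter = iw * ih;
    const float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    return inter / (area_a + area_b - inter);
}

// Fraction of reference boxes matched by a distinct candidate box with
// IoU >= thr. Greedy: each reference takes its best unused candidate.
inline double recall_at(const std::vector<Box>& reference,
                        const std::vector<Box>& candidate, float thr) {
    assert(!reference.empty() && thr > 0.0f && thr <= 1.0f);
    std::vector<bool> used(candidate.size(), false);
    std::size_t matched = 0;
    for (const Box& r : reference) {
        std::size_t best = candidate.size();  // sentinel: no match yet
        float best_iou = thr;
        for (std::size_t j = 0; j < candidate.size(); ++j) {
            if (used[j]) continue;
            const float iou = box_iou(r, candidate[j]);
            if (iou >= best_iou) { best_iou = iou; best = j; }
        }
        if (best != candidate.size()) { used[best] = true; ++matched; }
    }
    return static_cast<double>(matched) / static_cast<double>(reference.size());
}
```

Under this definition a box that shifts by a pixel or two still counts at threshold 0.5 but can drop out at 0.95, which is why the two recall columns diverge for the lower-bit quantizations.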

Recommendation (numbers are for rfdetr-base):

  1. F16: production default. Fastest, matches F32, 1.86x smaller than F32.
  2. Q8_0: when disk size matters. 3.10x compression, no accuracy loss, about 7% latency tax vs F16.
  3. Q6_K: when you need slightly smaller than Q8_0 with near-identical accuracy.
  4. Q4_K: last resort for ≤32 MB deployments. Real but not catastrophic accuracy loss.
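The preference order above can be turned into a small size-budget lookup. This is only a sketch for rfdetr-base: the F16, Q8_0 and Q4_K sizes come from the model table, while the Q6_K size is an estimate (119 MB F32 divided by the quoted 3.40x), not a published figure. QuantOption, kBaseQuants and pick_quant are hypothetical names.

```cpp
#include <cassert>

struct QuantOption {
    const char* name;
    double size_mb;
};

// rfdetr-base sizes, listed in the recommended preference order.
// The Q6_K entry is an estimate, see the note above the code.
static const QuantOption kBaseQuants[] = {
    {"F16", 64.0}, {"Q8_0", 38.0}, {"Q6_K", 35.0}, {"Q4_K", 31.0},
};

// Returns the most preferred option that fits the budget,
// or nullptr when even the smallest file is too large.
inline const QuantOption* pick_quant(double budget_mb) {
    assert(budget_mb > 0.0);
    for (const QuantOption& q : kBaseQuants)
        if (q.size_mb <= budget_mb) return &q;
    return nullptr;
}
```

For example, a 40 MB budget selects Q8_0 and a 32 MB budget falls through to Q4_K.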

See BENCHMARK.md for mask quality across all 12 seg cells (mask IoU stays ≥ 0.99 across F32/F16/Q8_0 on every segmentation variant).
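For readers who want to reproduce a mask comparison on their own outputs, mask IoU and pixel agreement between two binary masks can be computed as below. This is a generic sketch, not the project's evaluation code: it assumes both masks are already decoded into flat byte buffers of the same size, with any non-zero byte treated as foreground.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MaskAgreement {
    double iou;              // |A and B| / |A or B|; 1 when both masks are empty
    double pixel_agreement;  // fraction of pixels with the same on/off state
};

// Compares two binary masks of equal size.
inline MaskAgreement compare_masks(const std::vector<std::uint8_t>& a,
                                   const std::vector<std::uint8_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    std::size_t inter = 0, uni = 0, same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const bool fa = a[i] != 0;
        const bool fb = b[i] != 0;
        if (fa && fb) ++inter;
        if (fa || fb) ++uni;
        if (fa == fb) ++same;
    }
    MaskAgreement r;
    r.iou = (uni == 0) ? 1.0 : static_cast<double>(inter) / static_cast<double>(uni);
    r.pixel_agreement = static_cast<double>(same) / static_cast<double>(a.size());
    return r;
}
```

Pixel agreement is dominated by background on small objects, so it tends to sit much closer to 1 than IoU does; reporting both, as the README does, gives a fuller picture.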

Embedding via the C API

rf-detr.cpp exposes a flat C ABI in include/rfdetr.h for dlopen and purego.RegisterLibFunc consumers, intended for embedding in Go, Python, or any host language that can call C. It follows the same pattern LocalAI uses for its other ggml backends:

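#include <stdio.h>
#include "rfdetr.h"

int main(void) {
    /* Load the model. */
    rfdetr_init_params p = {
        .model_path = "models/rfdetr-base-f16.gguf",
        .n_threads  = 8,
    };
    rfdetr_context* ctx = NULL;
    rfdetr_init(&p, &ctx);

    /* Run detection on one image. Real code should also check the
       return values documented in include/rfdetr.h. */
    rfdetr_detect_params dp = {
        .image_path = "my_image.jpg",
        .threshold  = 0.5f,
    };
    rfdetr_detection dets[100];  /* caller-owned output buffer */
    int n = 0;                   /* receives the number of detections */
    rfdetr_detect(ctx, &dp, dets, 100, &n);

    for (int i = 0; i < n; i++) {
        printf("class=%d score=%.3f bbox=[%.1f,%.1f,%.1f,%.1f]\n",
               dets[i].class_id, dets[i].score,
               dets[i].bbox[0], dets[i].bbox[1], dets[i].bbox[2], dets[i].bbox[3]);
    }

    rfdetr_free(ctx);
    return 0;
}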

Build the shared library with cmake -DRFDETR_SHARED=ON. For segmentation models, detection structs additionally carry a mask field (binary uint8 buffer, owned by the context until the next detect call).
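For hosts that load the library at runtime rather than linking it, a loader might first confirm that the entry points resolve. The sketch below uses only POSIX dlopen/dlsym and the three symbol names from the C example above; it does not call them, because their exact signatures live in include/rfdetr.h. open_rfdetr is a hypothetical helper, and older glibc versions may need -ldl at link time.

```cpp
#include <cassert>
#include <cstdio>
#include <dlfcn.h>

// Symbol names taken from the C example; only their presence is checked.
static const char* const kRequiredSymbols[] = {
    "rfdetr_init", "rfdetr_detect", "rfdetr_free",
};

// Opens the shared library at path and verifies the entry points exist.
// Returns the dlopen handle on success (the caller must dlclose it),
// or nullptr if the library or any symbol is missing.
inline void* open_rfdetr(const char* path) {
    assert(path != nullptr);
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        const char* err = dlerror();
        std::fprintf(stderr, "dlopen failed: %s\n", err ? err : "unknown error");
        return nullptr;
    }
    for (const char* name : kRequiredSymbols) {
        if (dlsym(handle, name) == nullptr) {
            std::fprintf(stderr, "missing symbol: %s\n", name);
            dlclose(handle);
            return nullptr;
        }
    }
    return handle;
}
```

A host would call open_rfdetr with the path to the built librfdetr.so, then cast each dlsym result to the function-pointer type declared in the header before calling it.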

Why rf-detr.cpp

The upstream Roboflow RF-DETR runtime is Python + PyTorch + Transformers + Supervision. rf-detr.cpp provides:

  • A native CPU runtime with no Python at inference time. The CLI is a single binary that takes a GGUF file and an image.
  • Faster than PyTorch CPU on Nano through Medium (about 1.04x to 1.44x with F16 at T=8); Large is within run-to-run variance of PyTorch.
  • Quantization down to about 30 MB (Q4_K) with measured accuracy tradeoffs.
  • GPU offload via ggml backends (CUDA / Metal / Vulkan / HIP). Build with -DRFDETR_GGML_CUDA=ON (or _METAL / _VULKAN / _HIPBLAS) and inference runs on the GPU: weights are realized in VRAM and the compute graph runs on the device, with the deformable-attention sampler automatically falling back to CPU via the ggml scheduler. Validated on an NVIDIA GB10: 23.6 ms/image (F16) vs 274 ms on the same box's CPU — an 11.6x speedup.
  • A flat C ABI (include/rfdetr.h) for embedding via dlopen, purego, or cgo.
  • End-to-end parity validation against the upstream PyTorch reference, per-module and end-to-end (see tests/test_parity_*.cpp).

Build

git clone --recursive https://github.com/mudler/rf-detr.cpp
cd rf-detr.cpp
cmake -B build -DRFDETR_BUILD_TESTS=ON -DRFDETR_BUILD_CLI=ON
cmake --build build -j
ctest --test-dir build --output-on-failure

The build applies two patches to third_party/ggml at configure time (stored in third_party/ggml-patches/). These are local performance and debug-instrumentation improvements not yet upstreamed. Re-running CMake is a no-op once they're in place. Run scripts/apply_ggml_patches.sh manually to inspect the patch flow.

CMake options

| Option | Default | Purpose |
|---|---|---|
| RFDETR_BUILD_CLI | ON | Build the rfdetr-cli binary |
| RFDETR_BUILD_TESTS | OFF | Build the ctest test suite (24 tests) |
| RFDETR_SHARED | OFF | Build librfdetr.so (shared library for embedding) |
| GGML_NATIVE | ON | Compile ggml with -march=native |
| GGML_LLAMAFILE | ON | Enable ggml's tinyBLAS SGEMM (closes most of the PyTorch gap) |
| RFDETR_GGML_CUDA / _METAL / _VULKAN / _HIPBLAS | OFF | Offload inference to GPU. Weights go to VRAM; the deformable-attention sampler falls back to CPU via the ggml scheduler. One backend per build. |

GPU offload

rf-detr.cpp can offload inference to a GPU via ggml's backends. Build with one of:

cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_CUDA=ON     # NVIDIA (CUDA)
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_HIPBLAS=ON  # AMD (ROCm/HIP)
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_METAL=ON    # Apple (Metal)
cmake -B build -DRFDETR_BUILD_CLI=ON -DRFDETR_GGML_VULKAN=ON   # cross-vendor (Vulkan)

When a device is present, model weights are realized in VRAM and the compute graph runs on the GPU. The one op without a GPU kernel, the deformable-attention bilinear sampler, is automatically run on CPU by the ggml scheduler, which inserts the device↔host copies. If no device is found at runtime, it falls back cleanly to CPU.

Validated on an NVIDIA GB10 (Grace Blackwell, CUDA 13, compute capability 12.1): rfdetr-base F16 runs at 23.6 ms/image on the GPU vs 274 ms on the same box's 20-core ARM CPU (8 threads) — an 11.6x speedup. Detections match the CPU baseline within the standard tolerance (score ≤ 0.05, bbox ≤ 2 px); the 3 deformable-attention ops are confirmed running on CPU via the scheduler. See BENCHMARK.md for details.
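The tolerance quoted above (score ≤ 0.05, bbox ≤ 2 px) can be expressed as a per-detection check. This is a sketch under assumptions: the Det struct only mirrors the fields printed in the C API example, detections are assumed to be already paired between the CPU and GPU runs, and requiring equal class ids is an added assumption rather than something the README states.

```cpp
#include <cassert>
#include <cmath>

// Minimal stand-in for one detection; not the rfdetr.h struct.
struct Det {
    int class_id;
    float score;
    float bbox[4];  // four box coordinates in pixels
};

// True when b reproduces a within tolerance: same class,
// |score difference| <= score_tol, every coordinate within bbox_tol_px.
inline bool within_tolerance(const Det& a, const Det& b,
                             float score_tol = 0.05f, float bbox_tol_px = 2.0f) {
    assert(score_tol >= 0.0f && bbox_tol_px >= 0.0f);
    if (a.class_id != b.class_id) return false;
    if (std::fabs(a.score - b.score) > score_tol) return false;
    for (int i = 0; i < 4; ++i)
        if (std::fabs(a.bbox[i] - b.bbox[i]) > bbox_tol_px) return false;
    return true;
}
```

A GPU run would pass if every paired detection satisfies within_tolerance against its CPU counterpart.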

Tests

ctest --test-dir build --output-on-failure   # 24 ctest targets

Tests cover per-module parity vs the upstream torch reference (backbone, projector, two-stage, decoder, heads, segmentation), end-to-end detection parity, quantization sanity (F16/Q8_0/Q4_K load correctly), and per-variant load checks. The parity tests use precomputed baseline tensor bundles stored as GGUFs; regenerate them with scripts/gen_torch_baseline.py if you change the architecture.

Documentation

Citation

If you use rf-detr.cpp in a publication, please cite both this work and the upstream RF-DETR paper:

@misc{rfdetrcpp2026,
  author       = {Di Giacinto, Ettore and Palethorpe, Richard},
  title        = {rf-detr.cpp: C++/ggml inference engine for RF-DETR},
  year         = {2026},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/mudler/rf-detr.cpp}},
}

@software{rfdetr2025,
  author    = {Robicheaux, Peter and Popov, Matvei and Madan, Anish and Robinson, Isaac and Nelson, Joseph and Galuba, Wojciech and Wood, James and Kakanos, Sergei and Nemcek, Matthew and Hoshmand, Onur and Ramirez Castro, Carlos},
  title     = {RF-DETR},
  publisher = {GitHub},
  year      = {2025},
  url       = {https://github.com/roboflow/rf-detr},
}

The upstream RF-DETR builds on LW-DETR, DINOv2, and Deformable DETR; cite those too if relevant to your work:

@article{chen2024lwdetr,
  title   = {{LW-DETR}: A Transformer Replacement to {YOLO} for Real-Time Detection},
  author  = {Chen, Qiang and Su, Xiangbo and Zhang, Xinyu and Wang, Jian and Chen, Jiahui and Shen, Yunpeng and Han, Chuchu and Chen, Ziliang and Xu, Weixiang and Li, Fanrong and Zhang, Shan and Wang, Kun and Liu, Yong and Han, Jingdong and Ma, Zhaoxiang and Zhang, Erjin},
  journal = {arXiv preprint arXiv:2406.03459},
  year    = {2024},
}

@article{oquab2023dinov2,
  title   = {{DINOv2}: Learning Robust Visual Features without Supervision},
  author  = {Oquab, Maxime and Darcet, Timothée and Moutakanni, Théo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},
  journal = {arXiv preprint arXiv:2304.07193},
  year    = {2023},
}

@article{zhu2020deformabledetr,
  title   = {{Deformable DETR}: Deformable Transformers for End-to-End Object Detection},
  author  = {Zhu, Xizhou and Su, Weijie and Lu, Lewei and Li, Bin and Wang, Xiaogang and Dai, Jifeng},
  journal = {arXiv preprint arXiv:2010.04159},
  year    = {2020},
}

Author

Ettore Di Giacinto (@mudler), maintainer of LocalAI. PRs welcome; see issues for the current roadmap (GPU backend validation, end-to-end seg quant comparison, etc.).

License

Apache-2.0; see LICENSE. Copyright © 2026 Ettore Di Giacinto.

The model weights remain under their upstream license: RF-DETR is Apache-2.0 (roboflow/rf-detr).