YOLOv8-TensorRT

June 10, 2026

YOLOv8 inference accelerated with TensorRT — detection, segmentation, pose, oriented boxes and classification, from Python and C++.


Take a trained ultralytics YOLOv8 model, export it to ONNX, build a TensorRT engine, and run it from Python or a small C++ binary — for any of the five tasks. The Python and C++ sides share the same engines and class files; the build adapts itself to whatever TensorRT and OpenCV you have.

Highlights

  • One shared C++ core (libyolov8_core): RAII-managed TensorRT/CUDA resources, exceptions instead of assert, and a single trt_compat layer that is the only place branching on the TensorRT version.
  • Version-agnostic build: auto-detects TensorRT (8 ↔ 10/11, including enterprise headers) and OpenCV (≥4.7 switches to class-aware NMS); see docs/Build.md. Verified on TensorRT 8.6 / 10.8 / 10.16 / 11.0 and OpenCV 4.5 / 4.6 / 4.11.
  • C++14 fallback: std::filesystem on C++17, otherwise a vendored ghc::filesystem (-DCMAKE_CXX_STANDARD=14).
  • One Python entry point: infer.py --task {det,seg,pose,obb,cls} --backend {torch,cudart,pycuda} replaces ten per-task scripts; the cudart/pycuda backends now run on TensorRT 10.
  • Unit tests (pytest + ctest), a --profile per-layer report, and benchmark.py.

Supported tasks

| Task | infer.py --task | C++ binary | Export (ONNX) |
| --- | --- | --- | --- |
| Detection | det | yolov8_detect (raw) · yolov8_detect_e2e (End2End) | export-det.py or ultralytics |
| Segmentation | seg | yolov8_seg · yolov8_seg_simple | export-seg.py or ultralytics |
| Pose | pose | yolov8_pose | ultralytics |
| Oriented boxes | obb | yolov8_obb | ultralytics |
| Classification | cls | yolov8_cls | ultralytics |

Engine layouts. export-det.py produces an End2End detection engine with NMS built in (outputs num_dets, bboxes, scores, labels); export-seg.py produces a segmentation engine (outputs outputs, proto); the native ultralytics export keeps the model's raw output (e.g. [1, 84, anchors]). Match the engine to its consumer: infer.py --task det and yolov8_detect_e2e need the End2End engine, infer.py --task seg needs the export-seg.py engine, while yolov8_detect and the pose/obb/cls paths take the raw ultralytics export.
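To make the End2End layout concrete, here is a minimal sketch of reading its four outputs for one image. The shapes are an assumption, not something this README states: `num_dets` as [B, 1], `bboxes` as [B, topk, 4], `scores` and `labels` as [B, topk], with rows past `num_dets` treated as padding. The function name `decode_end2end` is hypothetical, and plain nested lists stand in for the real output buffers.

```python
def decode_end2end(num_dets, bboxes, scores, labels, image_index=0):
    """Keep only the valid detections of one image from End2End outputs.

    Assumed shapes (not confirmed by the README): num_dets [B, 1],
    bboxes [B, topk, 4], scores [B, topk], labels [B, topk]. Rows at or
    beyond num_dets are treated as padding and dropped.
    """
    n = int(num_dets[image_index][0])
    return [
        {
            "box": list(bboxes[image_index][k]),
            "score": float(scores[image_index][k]),
            "label": int(labels[image_index][k]),
        }
        for k in range(n)
    ]


if __name__ == "__main__":
    # One image, topk = 3, two valid detections followed by one padding row.
    detections = decode_end2end(
        num_dets=[[2]],
        bboxes=[[[10.0, 20.0, 110.0, 220.0], [50.0, 60.0, 150.0, 160.0], [0.0, 0.0, 0.0, 0.0]]],
        scores=[[0.91, 0.47, 0.0]],
        labels=[[0, 5, 0]],
    )
    for det in detections:
        print(det)
```

A raw ultralytics engine needs more work on the consumer side (confidence filtering and NMS over the [1, 84, anchors] tensor), which is why the engine has to match the program that reads it.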

Layout

csrc/
├── core/        # libyolov8_core: engine, trt_compat, RAII, pre/post-process, profiler
├── apps/        # one thin executable per task (detect / segment / pose / obb / cls ...)
├── deepstream/  # DeepStream bbox parser plugin (optional)
└── tests/       # C++ unit tests (ctest)
models/          # Python: engine builder, backends, compat, labels, per-task handlers
data/labels/     # class names shared by Python and C++ (coco / imagenet / dota)
infer.py  build.py  export-det.py  export-seg.py  benchmark.py

Setup

The repo is small; for the lightest checkout use a shallow clone: git clone --depth 1 <url> (latest code only, no history).

CUDA (≥ 11.4) and TensorRT (≥ 8.4) must already be installed system-wide — nvidia-smi and trtexec --version should both work. Then install the Python deps:

pip install -r requirements.txt
pip install ultralytics            # ONNX export
pip install cuda-python            # optional: infer.py --backend cudart
pip install pycuda                 # optional: infer.py --backend pycuda
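The "should both work" check can be scripted. This is a small sketch using only the standard library; it tests whether the two tools are on PATH, which is an assumption about your setup, since trtexec is often installed under the TensorRT bin/ directory without being on PATH. The function name `missing_tools` is illustrative.

```python
import shutil

# The two commands the setup text says should work.
REQUIRED_TOOLS = ("nvidia-smi", "trtexec")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the function can be exercised without a GPU box.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("nvidia-smi and trtexec are both on PATH")
```

If trtexec is reported missing but TensorRT is installed, calling it by its full path works just as well.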

Workflow

.pt → export ONNX → build engine → infer.

1. Export ONNX

(ultralytics downloads pretrained weights such as yolov8s.pt automatically on first use.)

End2End (NMS built in — detection / segmentation):

python export-det.py --weights yolov8s.pt --sim --input-shape 1 3 640 640 \
    --iou-thres 0.65 --conf-thres 0.25 --topk 100 --device cuda:0
python export-seg.py --weights yolov8s-seg.pt --sim --device cuda:0

Raw export (pose / obb / cls, and detection/segmentation without built-in NMS) uses ultralytics:

yolo export model=yolov8s-pose.pt format=onnx opset=11 simplify

2. Build the engine

python build.py --weights yolov8s.onnx --fp16 --device cuda:0
# or
/path/to/tensorrt/bin/trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s.engine --fp16
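Steps 1 and 2 can be chained from a script. The sketch below only assembles the two command lines shown above, using flags that appear in this README. It assumes that export-det.py writes the ONNX file next to the weights with an .onnx suffix, which the `build.py --weights yolov8s.onnx` example suggests but does not state. The function name `pipeline_commands` is hypothetical.

```python
import shlex
import sys
from pathlib import Path


def pipeline_commands(weights, device="cuda:0", fp16=True):
    """Build the export and engine-build commands for a detection model.

    Assumption: the exporter writes <weights stem>.onnx beside the weights.
    """
    onnx = Path(weights).with_suffix(".onnx")
    export = [sys.executable, "export-det.py", "--weights", str(weights),
              "--sim", "--device", device]
    build = [sys.executable, "build.py", "--weights", str(onnx),
             "--device", device]
    if fp16:
        build.append("--fp16")
    return [export, build]


if __name__ == "__main__":
    # Dry run: print the commands. To execute them in order, pass each list
    # to subprocess.run(cmd, check=True).
    for cmd in pipeline_commands("yolov8s.pt"):
        print(shlex.join(cmd))
```

Using argument lists rather than one shell string avoids quoting problems when the weights path contains spaces.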

3. Inference — Python

python infer.py --task det  --backend torch  --engine yolov8s.engine     --imgs data --out-dir output
python infer.py --task seg  --backend cudart --engine yolov8s-seg.engine  --imgs data --conf-thres 0.25 --iou-thres 0.65
python infer.py --task pose --backend pycuda --engine yolov8s-pose.engine --imgs data --show

| flag | meaning |
| --- | --- |
| --task | det / seg / pose / obb / cls |
| --backend | torch (PyTorch), cudart (cuda-python), pycuda |
| --engine --imgs | engine file; image file or directory |
| --show / --out-dir | display in a window, or save to a directory |
| --conf-thres --iou-thres | thresholds (seg / pose / obb) |
| --device | torch device, e.g. cuda:0 (torch backend) |
| --batch | images per engine call (dynamic-batch engines) |

Batched inference. Images are run in one engine call per batch and decoded per image. A fixed-batch engine (e.g. exported with batch=2) is driven at its own batch size; a dynamic-batch engine follows --batch N. With a single image or a batch-1 engine the behaviour is unchanged.
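The batching rule above can be sketched as a function that turns an image count into per-call slices. This illustrates the rule as described, not the code in infer.py: `plan_batches` and its parameters are hypothetical names, and how the repository handles a short final batch on a fixed-batch engine (for example by padding) is not stated here, so the sketch just reports the shorter slice.

```python
def plan_batches(num_images, engine_batch=None, requested_batch=1):
    """Return (start, stop) index pairs, one per engine call.

    engine_batch: the fixed batch size baked into the engine, or None for a
                  dynamic-batch engine.
    requested_batch: the --batch value; only used for dynamic-batch engines.
    """
    if engine_batch is not None:
        size = engine_batch              # fixed-batch engine dictates the size
    else:
        size = max(1, requested_batch)   # dynamic-batch engine follows --batch
    return [(start, min(start + size, num_images))
            for start in range(0, num_images, size)]


if __name__ == "__main__":
    print(plan_batches(5, engine_batch=2))                     # fixed batch of 2
    print(plan_batches(5, engine_batch=None, requested_batch=4))
    print(plan_batches(1))                                     # single image
```

Each slice is one engine call; the results are then decoded image by image, as the paragraph above says.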

4. Inference — C++

cmake -S . -B build -DTensorRT_ROOT=/path/to/TensorRT
cmake --build build -j
export LD_LIBRARY_PATH=/path/to/TensorRT/lib:$LD_LIBRARY_PATH
./build/bin/yolov8_detect yolov8s.engine data/bus.jpg --out-dir output   # --show / --profile / --labels

Build details, multiple TensorRT/OpenCV versions, cuDNN for TensorRT 8 and the C++14 fallback are in docs/Build.md. Class names live in data/labels/*.txt (override with --labels).
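Because the label files are shared by Python and C++, reading one is simple. The sketch below assumes one class name per line, which is the usual layout for such files but is not spelled out in this README. The file name in the usage line and the function names `parse_labels` / `load_labels` are illustrative.

```python
from pathlib import Path


def parse_labels(text):
    """Split label-file text into class names, skipping blank lines.

    Assumption: one class name per line; the line index is the class id.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_labels(path):
    """Read a label file such as one under data/labels/ and return the names."""
    return parse_labels(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    names = parse_labels("person\nbicycle\n\ncar\n")
    print(names)
    # With a real checkout (file name is an assumption):
    # names = load_labels("data/labels/coco.txt")
```

A class id from the engine output then maps to its name with `names[class_id]`.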

Performance

benchmark.py over the cudart backend (host-to-host: H2D + execute + D2H), yolov8n FP16, 640×640 (cls 224×224), on an RTX 3080 Ti Laptop / CUDA 12.8 / TensorRT 10.16:

| Task | latency (mean) | throughput |
| --- | --- | --- |
| Detection | 2.46 ms | 406 qps |
| Segmentation | 3.43 ms | 292 qps |
| Pose | 2.28 ms | 439 qps |
| Oriented boxes | 1.97 ms | 507 qps |
| Classification | 0.33 ms | 3033 qps |

python benchmark.py --engine yolov8s.engine --runs 200          # latency / throughput
./build/bin/yolov8_detect yolov8s.engine data/bus.jpg --profile  # per-layer C++ timing
python trt-profile.py --engine yolov8s.engine --device cuda:0    # Python layer profile
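For readers who want to time their own inference call the same way, here is a minimal sketch of a host-side latency and throughput measurement. It is not benchmark.py: `measure` and `throughput_from_latency` are hypothetical names, and `run_once` stands for whatever callable performs one H2D + execute + D2H round trip. For a single stream of back-to-back calls, throughput is 1000 divided by the mean latency in milliseconds, which agrees with the table above to within rounding of the published latencies (for example 1000 / 2.46 ≈ 406.5 qps).

```python
import time


def throughput_from_latency(mean_ms):
    """Queries per second for back-to-back calls with the given mean latency."""
    return 1000.0 / mean_ms


def measure(run_once, runs=200, warmup=10):
    """Time `run_once` and return (mean latency in ms, throughput in qps).

    Warm-up iterations are excluded so one-off initialisation cost does not
    skew the mean.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(runs):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0, runs / elapsed


if __name__ == "__main__":
    # Placeholder workload; replace with a real inference call.
    mean_ms, qps = measure(lambda: sum(range(1000)), runs=200)
    print(f"latency (mean): {mean_ms:.3f} ms, throughput: {qps:.0f} qps")
```

When timing GPU work, `run_once` has to block until the result is back on the host. Otherwise only the launch cost is measured.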

Development

pre-commit install                 # ruff + clang-format + mdformat run on every commit
python -m pytest tests/            # Python unit tests
cmake -S . -B build -DBUILD_TESTS=ON && ctest --test-dir build   # C++ unit tests

Troubleshooting

  • libnvinfer.so: cannot open shared object file at runtime — add the TensorRT lib/ (and /usr/local/cuda/lib64) to LD_LIBRARY_PATH.
  • Engine fails to deserialize — a .engine is tied to the exact TensorRT version that built it; rebuild it with the same TensorRT you link/run against.
  • TensorRT 8 link error undefined reference to cudnn* — TensorRT 8 plugins need cuDNN 8; pass -DCUDNN_ROOT=<dir> (e.g. a conda cudnn=8 env) and put it on LD_LIBRARY_PATH. TensorRT 10+ dropped this dependency.
  • ONNX export gives a tiny / empty engine — on PyTorch 2.x pass dynamo=False to torch.onnx.export (already set in export-det.py / export-seg.py).
  • TensorRT 11 trtexec says "model not found" — pass an absolute --onnx= path.
  • --show does nothing on a headless box — drop --show and use --out-dir to save annotated images.
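For the first item in the list above, a quick way to see whether any directory on LD_LIBRARY_PATH actually contains the TensorRT runtime is sketched below. It uses only the standard library, and the function name `dirs_with_library` is illustrative. An empty result does not prove loading will fail, since the dynamic loader also consults the system library cache and default directories.

```python
import os
import tempfile
from pathlib import Path


def dirs_with_library(search_path, prefix="libnvinfer.so"):
    """Return the entries of a PATH-style string that contain the library.

    Matches any file whose name starts with `prefix`, so versioned names such
    as libnvinfer.so.10 are found too. Missing directories are skipped.
    """
    hits = []
    for entry in search_path.split(os.pathsep):
        directory = Path(entry)
        if entry and directory.is_dir() and any(
            item.name.startswith(prefix) for item in directory.iterdir()
        ):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    found = dirs_with_library(os.environ.get("LD_LIBRARY_PATH", ""))
    print("libnvinfer found in:", found if found else "no LD_LIBRARY_PATH entry")

    # Self-contained demonstration with a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "libnvinfer.so.10").touch()
        print(dirs_with_library(tmp) == [tmp])
```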

More questions (batch, INT8, export tweaks, custom models, …) are answered in docs/FAQ.md.

Deployment

  • DeepStream — bbox parser plugin in csrc/deepstream; build with -DBUILD_DEEPSTREAM=ON (needs the DeepStream SDK).
  • Jetson — build the same targets on-device with -DTensorRT_ROOT pointing at the aarch64 TensorRT; no separate sources (see docs/Build.md).

Acknowledgments

Bundled third-party code (ghc::filesystem, TensorRT samples) is credited in ACKNOWLEDGMENTS.md. Licensed under MIT.
