YOLOv8-TensorRT
June 10, 2026
YOLOv8 inference accelerated with TensorRT — detection, segmentation, pose, oriented boxes and classification, from Python and C++.
English | 简体中文
Take a trained ultralytics YOLOv8 model, export it to ONNX, build a TensorRT engine, and run it from Python or a small C++ binary — for any of the five tasks. The Python and C++ sides share the same engines and class files; the build adapts itself to whatever TensorRT and OpenCV you have.
Highlights
- One shared C++ core (libyolov8_core): RAII-managed TensorRT/CUDA resources, exceptions instead of assert, and a single trt_compat layer that is the only place branching on the TensorRT version.
- Version-agnostic build: auto-detects TensorRT (8 ↔ 10/11, including enterprise headers) and OpenCV (≥4.7 switches to class-aware NMS); see docs/Build.md. Verified on TensorRT 8.6 / 10.8 / 10.16 / 11.0 and OpenCV 4.5 / 4.6 / 4.11.
- C++14 fallback: std::filesystem on C++17, otherwise a vendored ghc::filesystem (-DCMAKE_CXX_STANDARD=14).
- One Python entry point: infer.py --task {det,seg,pose,obb,cls} --backend {torch,cudart,pycuda} replaces ten per-task scripts; the cudart/pycuda backends now run on TensorRT 10.
- Unit tests (pytest + ctest), a --profile per-layer report, and benchmark.py.
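The idea of keeping version branching in one place can be illustrated on the Python side as well. The sketch below is not the repository's compat code; `io_tensor_names` is a hypothetical helper. It relies only on two facts about the TensorRT Python API: engines from TensorRT 8.5 onward expose `num_io_tensors` / `get_tensor_name`, while older releases expose `num_bindings` / `get_binding_name` (removed in TensorRT 10).

```python
def io_tensor_names(engine):
    """Return the I/O tensor names of a TensorRT engine, old or new API.

    Hypothetical helper: the rest of the code calls this one function and
    never checks the TensorRT version itself.
    """
    if hasattr(engine, "num_io_tensors"):
        # Name-based tensor API (TensorRT 8.5+, the only one left in 10.x).
        return [engine.get_tensor_name(i) for i in range(engine.num_io_tensors)]
    # Legacy binding API (removed in TensorRT 10).
    return [engine.get_binding_name(i) for i in range(engine.num_bindings)]
```

Feature detection with `hasattr` avoids parsing version strings, which is one reasonable way to keep such a shim small.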
Supported tasks
| Task | infer.py --task | C++ binary | Export (ONNX) |
|---|---|---|---|
| Detection | det | yolov8_detect (raw) · yolov8_detect_e2e (End2End) | export-det.py or ultralytics |
| Segmentation | seg | yolov8_seg · yolov8_seg_simple | export-seg.py or ultralytics |
| Pose | pose | yolov8_pose | ultralytics |
| Oriented boxes | obb | yolov8_obb | ultralytics |
| Classification | cls | yolov8_cls | ultralytics |
Engine layouts. export-det.py produces an End2End detection engine with NMS built in (outputs: num_dets, bboxes, scores, labels); export-seg.py produces a segmentation engine (outputs: outputs, proto); the native ultralytics export keeps the model's raw output (e.g. [1, 84, anchors]). Match the engine to its consumer: infer.py --task det and yolov8_detect_e2e need the End2End engine, infer.py --task seg needs the export-seg.py engine, while yolov8_detect and the pose/obb/cls paths take the raw ultralytics export.
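A mismatch between engine and consumer can be caught early by looking at the output tensor names. The following is a minimal sketch, assuming the names listed above are exactly what the export scripts emit; `classify_layout` and its return labels are hypothetical and not part of the repository.

```python
# Output names taken from the "Engine layouts" paragraph.
END2END_DET_OUTPUTS = {"num_dets", "bboxes", "scores", "labels"}
SEG_OUTPUTS = {"outputs", "proto"}


def classify_layout(output_names):
    """Guess which exporter produced an engine from its output tensor names."""
    names = set(output_names)
    if names == END2END_DET_OUTPUTS:
        return "end2end-det"   # export-det.py: NMS built in
    if names == SEG_OUTPUTS:
        return "seg"           # export-seg.py
    return "raw"               # anything else: treat as a raw ultralytics export


print(classify_layout(["num_dets", "bboxes", "scores", "labels"]))
```

A consumer could call this right after deserializing the engine and fail with a clear message instead of decoding the wrong tensors.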
Layout
csrc/
├── core/ # libyolov8_core: engine, trt_compat, RAII, pre/post-process, profiler
├── apps/ # one thin executable per task (detect / segment / pose / obb / cls ...)
├── deepstream/ # DeepStream bbox parser plugin (optional)
└── tests/ # C++ unit tests (ctest)
models/ # Python: engine builder, backends, compat, labels, per-task handlers
data/labels/ # class names shared by Python and C++ (coco / imagenet / dota)
infer.py build.py export-det.py export-seg.py benchmark.py
Setup
The repo is small; for the lightest checkout use a shallow clone: git clone --depth 1 <url> (latest code only, no history).
CUDA (≥ 11.4) and TensorRT (≥ 8.4) must already be installed system-wide — nvidia-smi and trtexec --version should both work. Then install the Python deps:
pip install -r requirements.txt
pip install ultralytics # ONNX export
pip install cuda-python # optional: infer.py --backend cudart
pip install pycuda # optional: infer.py --backend pycuda
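The prerequisite check described above (nvidia-smi and trtexec both runnable) can be scripted. This is a small sketch using only the standard library; `missing_tools` is a hypothetical helper. Note that trtexec often lives in the TensorRT bin/ directory rather than on PATH, so a miss here may only mean PATH needs extending.

```python
import shutil


def missing_tools(tools=("nvidia-smi", "trtexec"), which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not found on PATH:", ", ".join(missing))
    else:
        print("CUDA driver tools and trtexec are reachable")
```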
Workflow
.pt → export ONNX → build engine → infer.
1. Export ONNX
(ultralytics downloads pretrained weights such as yolov8s.pt automatically on first use.)
End2End (NMS built in — detection / segmentation):
python export-det.py --weights yolov8s.pt --sim --input-shape 1 3 640 640 \
--iou-thres 0.65 --conf-thres 0.25 --topk 100 --device cuda:0
python export-seg.py --weights yolov8s-seg.pt --sim --device cuda:0
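To make the three export flags concrete, here is a plain greedy, class-agnostic NMS in pure Python. It is only an illustration of what --conf-thres, --iou-thres and --topk control; it is not the NMS implementation that the exported engine embeds, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thres=0.25, iou_thres=0.65, topk=100):
    """Return indices of kept boxes, highest score first."""
    # 1. Drop low-confidence candidates, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # 2. Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
            if len(keep) == topk:  # 3. Cap the number of detections.
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))  # [0, 2]
```

In the example, box 1 overlaps box 0 with IoU 81/119 ≈ 0.68, above the 0.65 threshold, and box 3 falls below the confidence threshold.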
Raw export (pose / obb / cls, and detection/segmentation without built-in NMS) uses ultralytics:
yolo export model=yolov8s-pose.pt format=onnx opset=11 simplify
2. Build the engine
python build.py --weights yolov8s.onnx --fp16 --device cuda:0
# or
/path/to/tensorrt/bin/trtexec --onnx=yolov8s.onnx --saveEngine=yolov8s.engine --fp16
3. Inference — Python
python infer.py --task det --backend torch --engine yolov8s.engine --imgs data --out-dir output
python infer.py --task seg --backend cudart --engine yolov8s-seg.engine --imgs data --conf-thres 0.25 --iou-thres 0.65
python infer.py --task pose --backend pycuda --engine yolov8s-pose.engine --imgs data --show
| flag | meaning |
|---|---|
| --task | det / seg / pose / obb / cls |
| --backend | torch (PyTorch), cudart (cuda-python), pycuda |
| --engine --imgs | engine file; image file or directory |
| --show / --out-dir | display in a window, or save to a directory |
| --conf-thres --iou-thres | thresholds (seg / pose / obb) |
| --device | torch device, e.g. cuda:0 (torch backend) |
| --batch | images per engine call (dynamic-batch engines) |
Batched inference. Images are run in one engine call per batch and decoded per image. A fixed-batch engine (e.g. exported with batch=2) is driven at its own batch size; a dynamic-batch engine follows --batch N. With a single image or a batch-1 engine the behaviour is unchanged.
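The batching rule above can be sketched as a small planning function. This is an illustration, not infer.py's code: `plan_batches` is hypothetical, and padding the last partial batch of a fixed-batch engine is an assumption, since the text does not say how a short final batch is handled.

```python
def plan_batches(num_images, requested_batch=1, engine_batch=None):
    """Split `num_images` into engine calls.

    engine_batch: the engine's fixed batch size, or None for a dynamic-batch
    engine (which then follows `requested_batch`, i.e. --batch N).
    Returns a list of (start_index, real_images, padded_slots) tuples.
    """
    size = engine_batch if engine_batch is not None else requested_batch
    if size < 1:
        raise ValueError("batch size must be >= 1")
    plan = []
    for start in range(0, num_images, size):
        real = min(size, num_images - start)
        # Assumption: a fixed-batch engine always runs a full batch, so the
        # tail is padded; a dynamic engine simply runs a smaller batch.
        padding = size - real if engine_batch is not None else 0
        plan.append((start, real, padding))
    return plan


print(plan_batches(5, requested_batch=4))                  # dynamic engine
print(plan_batches(5, requested_batch=4, engine_batch=2))  # fixed batch of 2
```

Results from padded slots would be discarded when decoding per image.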
4. Inference — C++
cmake -S . -B build -DTensorRT_ROOT=/path/to/TensorRT
cmake --build build -j
export LD_LIBRARY_PATH=/path/to/TensorRT/lib:$LD_LIBRARY_PATH
./build/bin/yolov8_detect yolov8s.engine data/bus.jpg --out-dir output # --show / --profile / --labels
Build details, multiple TensorRT/OpenCV versions, cuDNN for TensorRT 8 and the C++14 fallback are in docs/Build.md. Class names live in data/labels/*.txt (override with --labels).
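Because the label files are shared by Python and C++, a loader only needs to read plain text. The sketch below assumes one class name per line with the line index as the class id; that file format is an assumption, and `load_labels` is a hypothetical helper rather than the repository's loader.

```python
from pathlib import Path


def load_labels(path):
    """Read class names from a text file, one name per line (assumed format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines so a trailing newline does not create an empty class.
    return [line.strip() for line in lines if line.strip()]
```

With such a list, `names[class_id]` maps a predicted class index to its display name.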
Performance
benchmark.py over the cudart backend (host-to-host: H2D + execute + D2H), yolov8n FP16, 640×640 (cls 224×224), on an RTX 3080 Ti Laptop / CUDA 12.8 / TensorRT 10.16:
| Task | latency (mean) | throughput |
|---|---|---|
| Detection | 2.46 ms | 406 qps |
| Segmentation | 3.43 ms | 292 qps |
| Pose | 2.28 ms | 439 qps |
| Oriented boxes | 1.97 ms | 507 qps |
| Classification | 0.33 ms | 3033 qps |
python benchmark.py --engine yolov8s.engine --runs 200 # latency / throughput
./build/bin/yolov8_detect yolov8s.engine data/bus.jpg --profile # per-layer C++ timing
python trt-profile.py --engine yolov8s.engine --device cuda:0 # Python layer profile
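The two columns of the table are roughly reciprocal for single-stream runs (for example, 1000 / 2.46 ms is about 406 qps). A minimal timing harness that reports both numbers could look like the sketch below; it is not benchmark.py's implementation, and `benchmark` and its parameters are hypothetical. For GPU work, the timed callable must include synchronization (as a host-to-host measurement does), otherwise only the launch cost is measured.

```python
import statistics
import time


def benchmark(fn, runs=200, warmup=10):
    """Time `fn` and return mean latency (ms) and throughput (calls/s)."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    mean_ms = statistics.fmean(samples_ms)
    return {"runs": runs, "mean_ms": mean_ms, "qps": 1e3 / mean_ms}


if __name__ == "__main__":
    # Stand-in workload; replace with one engine call (H2D + execute + D2H).
    print(benchmark(lambda: sum(range(10_000)), runs=50))
```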
Development
pre-commit install # ruff + clang-format + mdformat run on every commit
python -m pytest tests/ # Python unit tests
cmake -S . -B build -DBUILD_TESTS=ON && ctest --test-dir build # C++ unit tests
Troubleshooting
- "libnvinfer.so: cannot open shared object file" at runtime: add the TensorRT lib/ (and /usr/local/cuda/lib64) to LD_LIBRARY_PATH.
- Engine fails to deserialize: a .engine is tied to the exact TensorRT version that built it; rebuild it with the same TensorRT you link/run against.
- TensorRT 8 link error "undefined reference to cudnn*": TensorRT 8 plugins need cuDNN 8; pass -DCUDNN_ROOT=<dir> (e.g. a conda cudnn=8 env) and put it on LD_LIBRARY_PATH. TensorRT 10+ dropped this dependency.
- ONNX export gives a tiny / empty engine: on PyTorch 2.x pass dynamo=False to torch.onnx.export (already set in export-det.py / export-seg.py).
- TensorRT 11 trtexec says "model not found": pass an absolute --onnx= path.
- --show does nothing on a headless box: drop --show and use --out-dir to save annotated images.
More questions (batch, INT8, export tweaks, custom models, …) are answered in docs/FAQ.md.
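For the libnvinfer.so error in particular, it can help to see which LD_LIBRARY_PATH entries actually contain the library. The sketch below uses only the standard library; `dirs_with_libnvinfer` is a hypothetical helper. An empty result is not conclusive, because the dynamic loader also consults the ldconfig cache and any rpath baked into the binary.

```python
import os
from pathlib import Path


def dirs_with_libnvinfer(ld_library_path=None):
    """List LD_LIBRARY_PATH entries that contain a libnvinfer.so* file."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty segments such as a trailing ':'
        directory = Path(entry)
        if directory.is_dir() and any(directory.glob("libnvinfer.so*")):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    print(dirs_with_libnvinfer() or "libnvinfer.so not found via LD_LIBRARY_PATH")
```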
Deployment
- DeepStream: bbox parser plugin in csrc/deepstream; build with -DBUILD_DEEPSTREAM=ON (needs the DeepStream SDK).
- Jetson: build the same targets on-device with -DTensorRT_ROOT pointing at the aarch64 TensorRT; no separate sources (see docs/Build.md).
Acknowledgments
Bundled third-party code (ghc::filesystem, TensorRT samples) is credited in ACKNOWLEDGMENTS.md. Licensed under MIT.