OmniDrive Deployment with TensorRT-LLM and TensorRT-Edge-LLM

March 3, 2026 · View on GitHub

This document demonstrates the deployment of OmniDrive using TensorRT, TensorRT-LLM, and TensorRT-Edge-LLM. In this deployment demo, we use EVA-base as the vision backbone and TinyLlama as the LLM head. We provide an overview of the overall deployment strategy together with an analysis of the results.

Table of Contents

  1. Deployment strategy
  2. Deployment Guide
  3. Results analysis
  4. References

Deployment strategy

OmniDrive employs EVA as the vision backbone, StreamPETR for both the bounding-box detection head and the map head, and an LLM as the planning head. For deployment, we use EVA-base as the backbone and TinyLlama as the LLM head.

To enhance inference efficiency, engines are built separately for the vision component (the EVA backbone and the StreamPETR necks) and the LLM component (TinyLlama). The pipelines for deploying the two components are listed below, followed by a minimal sketch of the vision pipeline:

  • The vision component:
    1. Export the ONNX model
    2. Build engines with TensorRT
  • The LLM component:
    • If using TensorRT-LLM:
      1. Convert the checkpoints to the Hugging Face safetensors format with TensorRT-LLM
      2. Build engines with trtllm-build
    • If using TensorRT-Edge-LLM:
      1. Export the ONNX model
      2. Build engines with TensorRT scripts
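As a concrete illustration of the vision pipeline, here is a minimal Python sketch of the two steps using the TensorRT 10 Python API. The stub module, input shape, and tensor names are placeholders for illustration, not the repository's actual code:

```python
import tensorrt as trt
import torch

class VisionStub(torch.nn.Module):
    """Stand-in for the assembled EVA backbone + StreamPETR necks."""
    def forward(self, images):
        return images.mean(dim=(1, 3, 4))  # dummy feature reduction

# Step 1: export the vision component to ONNX.
dummy_images = torch.randn(1, 6, 3, 320, 800)  # (batch, cameras, C, H, W) -- assumed shape
torch.onnx.export(
    VisionStub().eval(), (dummy_images,), "omnidrive_vision.onnx",
    input_names=["images"], output_names=["vision_features"], opset_version=17,
)

# Step 2: build a TensorRT engine from the ONNX model.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # networks are explicit-batch by default in TensorRT 10
parser = trt.OnnxParser(network, logger)
with open("omnidrive_vision.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # omit for a pure-FP32 engine
with open("omnidrive_vision.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```

Splitting the deployment this way lets the vision stack benefit from TensorRT's graph optimizations while the autoregressive LLM is served by the dedicated LLM runtimes.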

For TensorRT-LLM, we use TensorRT 10.4 and TensorRT-LLM 0.13 to deploy OmniDrive on A100 GPUs on x86_64 Linux platforms; for TensorRT-Edge-LLM, we use TensorRT 10.14 and TensorRT-Edge-LLM 0.4.0 (latest) to deploy the model on the NVIDIA DRIVE AGX Thor platform. Note that TensorRT-Edge-LLM also supports x86_64 deployment.

Please refer to this config for model details.

Deployment Guide

Please refer to the TensorRT-LLM deployment guide and the TensorRT-Edge-LLM deployment guide for detailed instructions on how to set up the environment, build engines, and run engine inference with TensorRT-LLM and TensorRT-Edge-LLM.
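For orientation before diving into those guides, below is a minimal sketch of running a built vision engine from Python with the TensorRT 10 runtime API. The engine file and tensor names carry over from the export sketch above, and static FP16/FP32 I/O shapes are assumed for brevity:

```python
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
with open("omnidrive_vision.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one CUDA buffer per I/O tensor and register its address.
buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    shape = tuple(engine.get_tensor_shape(name))
    dtype = torch.float16 if engine.get_tensor_dtype(name) == trt.DataType.HALF else torch.float32
    buffers[name] = torch.empty(shape, dtype=dtype, device="cuda")
    context.set_tensor_address(name, buffers[name].data_ptr())

buffers["images"].normal_()  # dummy input; replace with a real camera batch
stream = torch.cuda.current_stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()
vision_features = buffers["vision_features"]  # engine output
```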

Results analysis

Accuracy

The table below compares the accuracy of the PyTorch model and the engines.

| Vision | LLM | BBox mAP | Planning L2 1s | Planning L2 2s | Planning L2 3s |
| --- | --- | --- | --- | --- | --- |
| PyTorch | PyTorch | 0.354 | 0.151 | 0.314 | 0.585 |
| FP32 engine | FP16 engine | 0.354 | 0.150 | 0.312 | 0.581 |
| FP32 engine | FP16 activation INT4 weight | 0.354 | 0.157 | 0.323 | 0.604 |
| Mixed-precision engine | FP16 engine | 0.306 | 0.166 | 0.337 | 0.615 |
| Mixed-precision engine | FP16 activation INT4 weight | 0.306 | 0.171 | 0.349 | 0.634 |
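For reference, the Planning L2 columns are presumably the standard open-loop planning metric: the mean L2 distance (in meters) between the predicted and ground-truth ego trajectories up to each time horizon. A minimal numpy sketch of that computation (the function and the 0.5 s waypoint spacing are our assumptions, not code from the repository):

```python
import numpy as np

def planning_l2(pred_traj, gt_traj, horizon_s, dt=0.5):
    """Mean L2 distance (m) between predicted and ground-truth ego
    waypoints up to horizon_s seconds, sampled every dt seconds."""
    n = int(horizon_s / dt)  # number of waypoints inside the horizon
    return float(np.linalg.norm(pred_traj[:n] - gt_traj[:n], axis=-1).mean())

# Report L2 at the 1 s / 2 s / 3 s horizons, matching the table above.
pred = np.random.rand(6, 2)  # (T, 2) BEV waypoints -- stand-in data
gt = np.random.rand(6, 2)
for h in (1.0, 2.0, 3.0):
    print(f"L2@{h:.0f}s = {planning_l2(pred, gt, h):.3f} m")
```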

Inference latencies

Here is the runtime latency analysis for the engines. All latency numbers in the following tables are in milliseconds (ms).

On an A100 with TensorRT-LLM:

| Vision | PyTorch | FP32 vision engine | Mixed-precision vision engine |
| --- | --- | --- | --- |
| Latency (ms) | 280.02 | 75.66 | 25.92 |

| LLM | PyTorch | FP16 LLM engine | FP16 activation INT4 weight |
| --- | --- | --- | --- |
| Time To First Token (TTFT) (ms) | 107.82 | 10.02 | 11.30 |
| Time Per Output Token (TPOT) (ms) | 27.49 | 2.80 | 2.52 |
| Overall Latency (ms) | 2256.50 | 256.20 | 232.60 |

On NVIDIA DRIVE AGX Thor with TensorRT-Edge-LLM:

| Vision | FP32 vision engine | Mixed-precision vision engine |
| --- | --- | --- |
| Latency (ms) | 316.90 | 177.68 |

| LLM | FP16 LLM engine | NVFP4 LLM engine |
| --- | --- | --- |
| Time To First Token (TTFT) (ms) | 16.55 | 9.85 |
| Time Per Output Token (TPOT) (ms) | 9.04 | 4.45 |
| Overall Latency (ms) | 794.25 | 392.55 |
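As a sanity check on these numbers: if overall LLM latency is modeled as TTFT plus one TPOT per additional output token (an assumption about the decomposition, not a statement of how the measurements were taken), every engine implies roughly the same generated-sequence length:

```python
# Implied output-token count N, assuming overall ≈ TTFT + N * TPOT.
rows = {
    "A100 FP16 engine":  (256.20, 10.02, 2.80),
    "A100 INT4 weight":  (232.60, 11.30, 2.52),
    "Thor FP16 engine":  (794.25, 16.55, 9.04),
    "Thor NVFP4 engine": (392.55,  9.85, 4.45),
}
for name, (overall, ttft, tpot) in rows.items():
    print(f"{name}: N ≈ {(overall - ttft) / tpot:.1f} tokens")
# Both A100 engines imply N ≈ 88 and both Thor engines imply N ≈ 86,
# so the reported latencies are internally consistent.
```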

References

  1. EVA paper
  2. StreamPETR paper
  3. TinyLlama paper
  4. TensorRT repo
  5. TensorRT-LLM repo
  6. TensorRT-Edge-LLM repo