OmniDrive Deployment with TensorRT-LLM and TensorRT-Edge-LLM

March 3, 2026 · View on GitHub

This document demonstrates the deployment of OmniDrive using TensorRT, TensorRT-LLM, and TensorRT-Edge-LLM. In this deployment demo, we use EVA-base as the vision backbone and TinyLlama as the LLM head. We provide an overview of the overall deployment strategy together with an analysis of the results.

Table of Contents

  1. Deployment strategy
  2. Deployment Guide
  3. Results analysis
  4. References

Deployment strategy

OmniDrive employs EVA as the vision backbone, StreamPETR for both the bounding-box detection head and the map head, and an LLM as the planning head. For deployment, we use EVA-base as the backbone and TinyLlama as the LLM head.

To enhance inference efficiency, engines are built separately for the vision component (the EVA backbone and the StreamPETR necks) and the LLM component (TinyLlama). The pipelines for deploying the two components are listed below, followed by a minimal sketch of the vision pipeline:

  • The vision component:
    1. Export the ONNX model
    2. Build engines with TensorRT
  • The LLM component:
    • If using TensorRT-LLM:
      1. Convert the checkpoints to the Hugging Face safetensors format with TensorRT-LLM
      2. Build engines with trtllm-build
    • If using TensorRT-Edge-LLM:
      1. Export the ONNX model
      2. Build engines with TensorRT scripts
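As a concrete illustration of the vision pipeline, here is a minimal Python sketch of the two steps using the TensorRT 10 Python API. The stub module, input shape, and tensor names are placeholders for illustration, not the repository's actual code:

```python
import tensorrt as trt
import torch

class VisionStub(torch.nn.Module):
    """Stand-in for the assembled EVA backbone + StreamPETR necks."""
    def forward(self, images):
        return images.mean(dim=(1, 3, 4))  # dummy feature reduction

# Step 1: export the vision component to ONNX.
dummy_images = torch.randn(1, 6, 3, 320, 800)  # (batch, cameras, C, H, W) -- assumed shape
torch.onnx.export(
    VisionStub().eval(), (dummy_images,), "omnidrive_vision.onnx",
    input_names=["images"], output_names=["vision_features"], opset_version=17,
)

# Step 2: build a TensorRT engine from the ONNX model.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # networks are explicit-batch by default in TensorRT 10
parser = trt.OnnxParser(network, logger)
with open("omnidrive_vision.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # omit for a pure-FP32 engine
with open("omnidrive_vision.engine", "wb") as f:
    f.write(builder.build_serialized_network(network, config))
```

Splitting the deployment this way lets the vision stack benefit from TensorRT's graph optimizations while the autoregressive LLM is served by the dedicated LLM runtimes.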

For TensorRT-LLM, we use TensorRT 10.4 and TensorRT-LLM 0.13 to deploy OmniDrive on A100 GPUs on x86_64 Linux platforms; for TensorRT-Edge-LLM, we use TensorRT 10.14 and TensorRT-Edge-LLM 0.4.0 (latest) to deploy the model on the NVIDIA DRIVE AGX Thor platform. Note that TensorRT-Edge-LLM also supports x86_64 deployment.

Please refer to this config for model details.

Deployment Guide

Please refer to the TensorRT-LLM deployment guide and the TensorRT-Edge-LLM deployment guide for detailed instructions on how to set up the environment, build engines, and run engine inference with TensorRT-LLM and TensorRT-Edge-LLM.
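For orientation before diving into those guides, below is a minimal sketch of running a built vision engine from Python with the TensorRT 10 runtime API. The engine file and tensor names carry over from the export sketch above, and static FP16/FP32 I/O shapes are assumed for brevity:

```python
import tensorrt as trt
import torch

logger = trt.Logger(trt.Logger.WARNING)
with open("omnidrive_vision.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one CUDA buffer per I/O tensor and register its address.
buffers = {}
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    shape = tuple(engine.get_tensor_shape(name))
    dtype = torch.float16 if engine.get_tensor_dtype(name) == trt.DataType.HALF else torch.float32
    buffers[name] = torch.empty(shape, dtype=dtype, device="cuda")
    context.set_tensor_address(name, buffers[name].data_ptr())

buffers["images"].normal_()  # dummy input; replace with a real camera batch
stream = torch.cuda.current_stream()
context.execute_async_v3(stream.cuda_stream)
stream.synchronize()
vision_features = buffers["vision_features"]  # engine output
```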

Results analysis

Accuracy

The table below compares the accuracy of the PyTorch model and the engines.

| Vision | LLM | BBox mAP | Planning L2 1s | Planning L2 2s | Planning L2 3s |
| --- | --- | --- | --- | --- | --- |
| PyTorch | PyTorch | 0.354 | 0.151 | 0.314 | 0.585 |
| FP32 engine | FP16 engine | 0.354 | 0.150 | 0.312 | 0.581 |
| FP32 engine | FP16 activation INT4 weight | 0.354 | 0.157 | 0.323 | 0.604 |
| Mixed-precision engine | FP16 engine | 0.306 | 0.166 | 0.337 | 0.615 |
| Mixed-precision engine | FP16 activation INT4 weight | 0.306 | 0.171 | 0.349 | 0.634 |
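For reference, the Planning L2 columns are presumably the standard open-loop planning metric: the mean L2 distance (in meters) between the predicted and ground-truth ego trajectories up to each time horizon. A minimal numpy sketch of that computation (the function and the 0.5 s waypoint spacing are our assumptions, not code from the repository):

```python
import numpy as np

def planning_l2(pred_traj, gt_traj, horizon_s, dt=0.5):
    """Mean L2 distance (m) between predicted and ground-truth ego
    waypoints up to horizon_s seconds, sampled every dt seconds."""
    n = int(horizon_s / dt)  # number of waypoints inside the horizon
    return float(np.linalg.norm(pred_traj[:n] - gt_traj[:n], axis=-1).mean())

# Report L2 at the 1 s / 2 s / 3 s horizons, matching the table above.
pred = np.random.rand(6, 2)  # (T, 2) BEV waypoints -- stand-in data
gt = np.random.rand(6, 2)
for h in (1.0, 2.0, 3.0):
    print(f"L2@{h:.0f}s = {planning_l2(pred, gt, h):.3f} m")
```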

Inference latencies

Here is the runtime latency analysis for the engines. All latency numbers in the following tables are in milliseconds (ms).

On an A100 with TensorRT-LLM:

| Vision | PyTorch | FP32 vision engine | Mixed-precision vision engine |
| --- | --- | --- | --- |
| Latency (ms) | 280.02 | 75.66 | 25.92 |

| LLM | PyTorch | FP16 LLM engine | FP16 activation INT4 weight |
| --- | --- | --- | --- |
| Time To First Token (TTFT) (ms) | 107.82 | 10.02 | 11.30 |
| Time Per Output Token (TPOT) (ms) | 27.49 | 2.80 | 2.52 |
| Overall Latency (ms) | 2256.50 | 256.20 | 232.60 |

On NVIDIA DRIVE AGX Thor with TensorRT-Edge-LLM:

| Vision | FP32 vision engine | Mixed-precision vision engine |
| --- | --- | --- |
| Latency (ms) | 316.90 | 177.68 |

| LLM | FP16 LLM engine | NVFP4 LLM engine |
| --- | --- | --- |
| Time To First Token (TTFT) (ms) | 16.55 | 9.85 |
| Time Per Output Token (TPOT) (ms) | 9.04 | 4.45 |
| Overall Latency (ms) | 794.25 | 392.55 |
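As a sanity check on these numbers: if overall LLM latency is modeled as TTFT plus one TPOT per additional output token (an assumption about the decomposition, not a statement of how the measurements were taken), every engine implies roughly the same generated-sequence length:

```python
# Implied output-token count N, assuming overall ≈ TTFT + N * TPOT.
rows = {
    "A100 FP16 engine":  (256.20, 10.02, 2.80),
    "A100 INT4 weight":  (232.60, 11.30, 2.52),
    "Thor FP16 engine":  (794.25, 16.55, 9.04),
    "Thor NVFP4 engine": (392.55,  9.85, 4.45),
}
for name, (overall, ttft, tpot) in rows.items():
    print(f"{name}: N ≈ {(overall - ttft) / tpot:.1f} tokens")
# Both A100 engines imply N ≈ 88 and both Thor engines imply N ≈ 86,
# so the reported latencies are internally consistent.
```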

References

  1. EVA paper
  2. StreamPETR paper
  3. TinyLlama paper
  4. TensorRT repo
  5. TensorRT-LLM repo
  6. TensorRT-Edge-LLM repo