Optimization Guide: Achieving >30 FPS on Jetson

February 5, 2026 · View on GitHub

This guide explains how to achieve optimal performance with Depth Anything 3 on NVIDIA Jetson platforms.

Quick Reference by Platform

Use this table to find the recommended configuration for your Jetson:

Platform	VRAM	Recommended Model	Resolution	Expected FPS	Memory Usage
Orin Nano 4GB	4GB shared	DA3-Small	308x308	40-45	~1.2GB
Orin Nano 8GB	8GB shared	DA3-Small	308x308	45-50	~1.2GB
Orin NX 8GB	8GB shared	DA3-Small	308x308	50-55	~1.2GB
Jetson Orin NX 16GB*	16GB shared	DA3-Small	518x518	43+ (validated)	~1.8GB
AGX Orin 32GB	32GB shared	DA3-Base	518x518	25-35	~2.5GB
AGX Orin 64GB	64GB shared	DA3-Base/Large	518x518	20-35	~2.5-4GB
Xavier NX	8GB shared	DA3-Small	308x308	15-25*	~1.2GB

*Xavier NX requires JetPack 5.x with TensorRT 8.5+ (limited DA3 support)

*Validated on Seeed reComputer J4012

Key Notes:

FPS values are TensorRT processing capacity. Real-world FPS may be limited by camera input rate (~24 FPS for USB cameras)
Use ./run.sh for one-click deployment with automatic configuration
All platforms use FP16 precision for optimal speed/accuracy balance

Model Selection Guide

Model	Parameters	Best For	Min VRAM
DA3-Small	~24M	Real-time robotics, obstacle avoidance	4GB
DA3-Base	~97M	Balanced quality/speed, general use	8GB
DA3-Large	~335M	High-quality depth, slower inference	16GB

TensorRT Status (2026-02-05)

TensorRT acceleration validated on Jetson Orin NX 16GB (Seeed reComputer J4012).

Component	Previous (L4T r36.2.0)	Current (L4T r36.4.0)
TensorRT	8.6.2 (incompatible)	10.3 (validated)
CUDA	12.2	12.6
cuDNN	8.9	9.3

Root Cause (Resolved): TensorRT 8.6 could not compile DINOv2's Einsum operations. TensorRT 10.3 has enhanced ViT/MHA support.

Validated Performance (2026-01-31):

Platform: Jetson Orin NX 16GB
Model: DA3-SMALL at 518x518 FP16
Throughput: 35.3 FPS
GPU Latency: 26.4ms median (25.5ms min)
Engine Size: 58MB
Speedup: 6.8x over PyTorch baseline

To enable TensorRT:

# Rebuild Docker image with new base
docker compose build depth-anything-3-jetson

# Run with auto TensorRT engine build
DA3_TENSORRT_AUTO=true docker compose up depth-anything-3-jetson

Current Architecture (2026-02-04) - Optimized

Shared Memory IPC (/dev/shm/da3) achieves 23+ FPS, limited only by camera input rate.

Architecture	TRT Inference	IPC Overhead	Total	FPS
Native (target)	~26ms	0ms	~26ms	~38
Host-Container File IPC (old)	~50ms	~40ms	~90ms	~11
Host-Container Shared Memory (current)	~15ms	~8ms	~23ms	43+ capacity

Optimization Complete: TensorRT runs on host, ROS2 in container. Communication via /dev/shm/da3/ using numpy.memmap reduces IPC overhead to ~8ms. Processing capacity is 43+ FPS; actual output limited by camera input (~24 FPS).

To use optimized mode:

# run.sh automatically uses shared memory TRT service
./run.sh

Validated Performance on Jetson Orin NX 16GB

PyTorch Baseline

Measured on Jetson Orin NX 16GB (JetPack 6.0, L4T r36.2.0, CUDA 12.2):

Model	Backend	Resolution	FPS	Inference Time
DA3-SMALL	PyTorch FP32	518x518	~5.2	~193ms

TensorRT 10.3 (Validated 2026-01-31)

Measured on Jetson Orin NX 16GB (L4T r36.4.0, TensorRT 10.3):

Model	Backend	Resolution	FPS	GPU Latency	Engine Size	Speedup
DA3-SMALL	TensorRT FP16	518x518	35.3	26.4ms median (25.5ms min)	58MB	6.8x

Performance Targets (Future - TensorRT)

Input: 1080p camera (1920x1080) at 30 FPS
Output: 1080p depth + confidence maps
Target FPS: >30 FPS sustained
Platform: NVIDIA Jetson Orin AGX 64GB

Quick Start

Option 1: PyTorch FP32 (Development/Baseline Only) - ~5 FPS

WARNING: NOT for production use. PyTorch mode is provided only for development testing and as a performance baseline. For production deployment, use Option 2 (TensorRT).

Works out of the box, no TensorRT engine build required:

# Configure your webcam for 1080p MJPEG
ros2 run v4l2_camera v4l2_camera_node --ros-args \
  -p image_size:="[1920,1080]" \
  -p pixel_format:="MJPEG" \
  -r __ns:=/camera &

# Launch optimized node
ros2 launch depth_anything_3_ros2 depth_anything_3.launch.py \
  image_topic:=/camera/image_raw \
  model_name:=depth-anything/DA3-SMALL \
  backend:=pytorch \
  model_input_height:=384 \
  model_input_width:=384

Option 2: TensorRT FP16 (Recommended) - >30 FPS Target

Requires Docker image rebuild and one-time model conversion:

# Step 1: Build TensorRT engine with auto-detection (recommended)
# This auto-detects your Jetson platform and uses optimal settings
python3 scripts/build_tensorrt_engine.py --auto

# Or specify model and precision manually:
python3 scripts/build_tensorrt_engine.py \
  --model da3-small \
  --precision fp16 \
  --resolution 308

# Step 2: Launch with TensorRT backend
ros2 launch depth_anything_3_ros2 depth_anything_3.launch.py \
  image_topic:=/camera/image_raw \
  backend:=tensorrt_native \
  trt_model_path:=/root/.cache/tensorrt/da3-small_fp16_308x308_*.engine

Option 3: Docker Deployment (Recommended)

Build with L4T r36.4.0 base and run with automatic TensorRT engine building:

# Build the Jetson image
docker compose build depth-anything-3-jetson

# Run with auto TensorRT engine building on first start
DA3_TENSORRT_AUTO=true docker compose up depth-anything-3-jetson

# Or build engine at image build time (slower build, faster first run)
docker compose build depth-anything-3-jetson \
  --build-arg BUILD_TENSORRT=true \
  --build-arg TENSORRT_MODEL=da3-small

Implementation Details

Key Optimizations Implemented

Model Input Resolution: Platform-Aware
- Orin Nano/NX 8GB: 308x308 (optimal for memory constraints)
- Orin NX 16GB / AGX Orin: 518x518 (higher quality)
- Reduces inference time significantly vs larger resolutions
TensorRT FP16 Quantization (Recommended)
- 2-3x faster inference vs PyTorch
- Excellent accuracy (no calibration required)
- Alternative: TensorRT INT8 (3-4x speedup, requires calibration dataset)
GPU-Accelerated Upsampling
- Upsamples 384x384 depth → 1080p on GPU
- Bilinear mode: ~4ms (fast, smooth)
- Bicubic mode: ~6ms (higher quality)
- All operations stay on GPU (no CPU bottleneck)
Async Colorization
- Colorization runs in background thread
- Off critical path (doesn't block depth processing)
- Saves ~15-20ms per frame
Subscriber Checks
- Only colorizes if someone is subscribed to colored topic
- Saves processing when visualization not needed
DA3-SMALL Model
- Faster than DA3-BASE (~1.25x speedup)
- Good accuracy for most use cases
- Can switch to DA3-BASE if quality is critical

Performance Breakdown (Expected on Jetson Orin AGX)

TensorRT FP16 Pipeline (>30 FPS):

1080p camera capture          ~5ms
GPU resize (1080p→518x518)    ~3ms
TensorRT FP16 inference       ~20ms
GPU upsample (518→1080p)      ~4ms
Publishing depth+confidence   ~2ms
────────────────────────────────────
Total:                        ~34ms = 29.4 FPS

With optimizations:

Async colorization: +0ms (off critical path)
Subscriber checks: Skip work when not needed
Expected real-world: 32-36 FPS

PyTorch FP16 Pipeline (~25 FPS):

1080p camera capture          ~5ms
GPU resize (1080p→384x384)    ~3ms
PyTorch FP16 inference        ~30ms
GPU upsample (384→1080p)      ~4ms
Publishing depth+confidence   ~2ms
────────────────────────────────────
Total:                        ~44ms = 22.7 FPS

With optimizations: 24-28 FPS

Step-by-Step Setup

1. Install Dependencies

# Install torch2trt for TensorRT conversion
pip3 install torch2trt

# Verify CUDA and TensorRT are available
python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
python3 -c "import torch2trt; print('torch2trt available')"

2. Build TensorRT Engine

# Create models directory
mkdir -p models/tensorrt models/onnx

# Auto-detect platform and build optimal engine (recommended)
python3 scripts/build_tensorrt_engine.py --auto

# Or build with specific settings:
# For Orin Nano/NX 8GB (use 308x308)
python3 scripts/build_tensorrt_engine.py \
  --model da3-small \
  --precision fp16 \
  --resolution 308

# For AGX Orin (use 518x518)
python3 scripts/build_tensorrt_engine.py \
  --model da3-small \
  --precision fp16 \
  --resolution 518

# List available models
python3 scripts/build_tensorrt_engine.py --list-models

Expected output:

Detected Platform: Jetson AGX Orin

Recommended settings for AGX_ORIN_64GB:
  Precision: fp16
  Resolution: 518x518
  Workspace: 8192 MB

Downloading ONNX model: Depth Anything 3 Small
Building TensorRT engine...
Engine built successfully: models/tensorrt/da3-small_fp16_518x518_AGX_ORIN_64GB.engine

3. Configure Your Camera

For Anker PowerConf C200 webcam:

# Check available formats
v4l2-ctl --list-formats-ext -d /dev/video0

# Launch camera at 1080p with MJPEG encoding
ros2 run v4l2_camera v4l2_camera_node --ros-args \
  -p video_device:="/dev/video0" \
  -p image_size:="[1920,1080]" \
  -p pixel_format:="MJPEG" \
  -p camera_frame_id:="camera_optical_frame" \
  -r __ns:=/camera

4. Launch Optimized Node

# TensorRT FP16 (>30 FPS)
ros2 launch depth_anything_3_ros2 depth_anything_3.launch.py \
  image_topic:=/camera/image_raw \
  backend:=tensorrt_native \
  trt_model_path:=/root/.cache/tensorrt/da3-small_fp16_518x518_AGX_ORIN_64GB.engine \
  output_height:=1080 \
  output_width:=1920 \
  log_inference_time:=true

5. Monitor Performance

Watch the console output for performance metrics (logged every 5 seconds):

[depth_anything_3_optimized]: Performance - FPS: 33.45, Inference: 18.2ms, Total: 29.9ms, Frames: 167
[depth_anything_3_optimized]: GPU Memory - Allocated: 2458.3MB, Reserved: 2560.0MB, Free: 61541.7MB

Configuration Options

Backend Selection

Backend	Speed	Quality	Setup
`pytorch`	Baseline	Best	No conversion needed
`tensorrt_native` (FP16)	2-3x faster	Excellent	One-time engine build
`tensorrt_native` (INT8)	3-4x faster	Very Good	Requires calibration dataset

Model Selection

Model	Speed	Quality	FPS (TRT FP16 @ 518)
DA3-SMALL	Fastest	Good	30-35 FPS
DA3-BASE	Medium	Better	25-30 FPS
DA3-LARGE	Slow	Best	15-20 FPS

Input Resolution Trade-offs

Resolution	Platform	Inference Time (TRT FP16)	Recommendation
308x308	Orin Nano 4GB/8GB	~15ms	Recommended for memory-constrained
308x308	Orin NX 8GB	~12ms	Good balance
518x518	Orin NX 16GB	~25ms	Recommended for 16GB+
518x518	AGX Orin 32GB/64GB	~20ms	Recommended for AGX

Upsampling Mode

Mode	Speed	Quality	Use Case
`bilinear`	Fastest (~4ms)	Good	Recommended for >30 FPS
`bicubic`	Medium (~6ms)	Better	Balance quality/speed
`nearest`	Fastest (~2ms)	Blocky	Not recommended

Troubleshooting

Issue: FPS below 30

Check 1: Verify backend

# Should see "Backend: tensorrt_int8" in console output
# If seeing "Backend: pytorch", TensorRT model not loaded

Check 2: Verify model input size

# Should see "input_size=(384, 384)" in console
# If seeing 518x518, inference will be slower

Check 3: Disable colorization temporarily

ros2 launch depth_anything_3_ros2 depth_anything_3.launch.py \
  ... \
  publish_colored:=false

Check 4: Check GPU utilization

# Run in another terminal
watch -n 1 nvidia-smi

# GPU utilization should be 80-95%
# If low, check for CPU bottlenecks

Issue: TensorRT engine build fails

# Check TensorRT and pycuda installation
python3 -c "import tensorrt; print(f'TensorRT {tensorrt.__version__}')"
python3 -c "import pycuda.driver; print('pycuda OK')"

# Verify trtexec is available
which trtexec || ls /usr/src/tensorrt/bin/trtexec

# Verify TensorRT libraries
ls /usr/lib/aarch64-linux-gnu/libnvinfer*

# Try building with verbose output
python3 scripts/build_tensorrt_engine.py --auto --verbose

Issue: Out of memory

# Use smaller model
model_name:=depth-anything/DA3-SMALL

# Or reduce output resolution
output_height:=720
output_width:=1280

Advanced Optimization (Experimental)

CUDA Streams

Enable pipeline parallelism (experimental):

ros2 launch depth_anything_3_ros2 depth_anything_3.launch.py \
  ... \
  use_cuda_streams:=true

Expected: Additional 5-10% speedup

Lower Camera Resolution

If 1080p output not required:

# Camera at 720p
v4l2_camera ... -p image_size:="[1280,720]"

# Output at 720p
output_height:=720
output_width:=1280

Expected: 40-45 FPS (720p output)

Benchmark Results

Measured Results (PyTorch - Current)

Tested on Jetson Orin NX 16GB (JetPack 6.0, L4T r36.2.0):

Configuration	Model Input	Backend	FPS	Inference Time	Notes
Current Baseline	518x518	PyTorch FP32	~5.2	~193ms	Functional

Validated Results (TensorRT 10.3)

Measured on Jetson Orin NX 16GB (L4T r36.4.0, TensorRT 10.3, 2026-01-31):

Configuration	Model Input	Backend	FPS	GPU Latency	Quality
Baseline	518x518	PyTorch FP32	5.2	~193ms	Excellent
TensorRT FP16	518x518	TensorRT FP16	35.3	26.4ms median	Excellent

Key Technical Details:

Dockerfile base: dustynv/ros:humble-pytorch-l4t-r36.4.0
TRT 10.x syntax: --memPoolSize=workspace:2048MiB (not deprecated --workspace)
ONNX input shape: 5D pixel_values:1x1x3x518x518
Engine size: 58MB

Platform-Specific Performance Projections

Based on validated Orin NX 16GB results, projected performance for other platforms:

Platform	Model	Resolution	Precision	Projected FPS
Orin Nano 4GB	da3-small	308	FP16	~40-45
Orin Nano 8GB	da3-small	308	FP16	~45-50
Orin NX 8GB	da3-small	308	FP16	~50-55
Orin NX 16GB	da3-small	518	FP16	35.3 (validated)
AGX Orin 32GB	da3-small	518	FP16	~45-55
AGX Orin 64GB	da3-small	518	FP16	~50-60

Notes:

Projections based on proportional compute capacity. Only Orin NX 16GB has validated measurements.
Real-world FPS limited by camera input (~24 FPS for USB). See Quick Reference for recommended configurations.
For DA3-Base/Large projections, expect ~50% and ~25% of DA3-Small FPS respectively.

Quality Comparison

FP16 vs INT8 Quantization:

FP16: No accuracy loss, recommended default
INT8: ~3-5% accuracy reduction, requires calibration dataset
Recommendation: Use FP16 unless maximum speed is critical and you have calibration data

308x308 vs 518x518 Input:

When upsampled to 1080p, both produce good results
518x518 better for fine details and edges
308x308 recommended for memory-constrained devices (Orin Nano)

Summary

To achieve >30 FPS with 1080p depth + confidence on Jetson:

Quick Start (Docker):

# Build Jetson image
docker compose build depth-anything-3-jetson

# Run with auto TensorRT engine building
DA3_TENSORRT_AUTO=true docker compose up depth-anything-3-jetson

Manual Setup:

Run python3 scripts/build_tensorrt_engine.py --auto (auto-detects platform)
Launch with backend:=tensorrt_native
Configure camera for 1080p MJPEG

Platform-specific settings are automatically selected:

Orin Nano/NX 8GB: 308x308 FP16
Orin NX 16GB / AGX Orin: 518x518 FP16

Expected performance: 30-50 FPS depending on platform.

For questions or issues, please open a GitHub issue.