# Tracking Meets Large Multimodal Models for Driving Scenario Understanding
April 7, 2025
Enhancing autonomous driving with tracking-powered multimodal understanding!

## Overview

This repository presents an approach that integrates 3D object tracking into Large Multimodal Models (LMMs) to enhance spatiotemporal understanding in autonomous driving. By leveraging tracking information, we significantly improve perception, planning, and prediction compared to baseline models.

Key Benefits:
- Vision + Tracking: We enhance VQA in autonomous driving by integrating tracking-based embeddings.
- 3D Object Tracking: We use 3DMOTFormer for robust multi-object tracking, improving contextual understanding.
- Multimodal Fusion: Images and tracking features are jointly processed to enhance reasoning and predictions (see the sketch after this list).
- Self-Supervised Pretraining: We pretrain the tracking encoder in a self-supervised manner, which boosts the model's comprehension of track inputs.
- Benchmark Success: We achieve a 9.5% accuracy gain and a 7.04-point ChatGPT-score improvement on DriveLM-nuScenes, and a 3.7% final-score increase on DriveLM-CARLA.
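
As a concrete illustration of the fusion step, below is a minimal sketch of how per-object track states could be encoded and concatenated with visual tokens before being passed to the LMM. The class name `TrackEncoder`, the 9-dimensional track state, and the toy embedding width are illustrative assumptions, not the released implementation (the actual model builds on LLaMA Adapter; see Setup below).

```python
import torch
import torch.nn as nn

class TrackEncoder(nn.Module):
    """Hypothetical encoder mapping per-object track states to token embeddings for the LMM."""
    def __init__(self, track_dim: int = 9, embed_dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(track_dim, embed_dim)  # lift raw track states (position, size, yaw, velocity, ...)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tracks: torch.Tensor) -> torch.Tensor:
        # tracks: (batch, num_objects, track_dim) -> (batch, num_objects, embed_dim)
        return self.encoder(self.proj(tracks))

def fuse_tokens(visual_tokens: torch.Tensor, track_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate tracking tokens after the visual tokens so the LMM can attend to both."""
    return torch.cat([visual_tokens, track_tokens], dim=1)

# Toy usage with small dimensions; a real setup would use the LLM hidden size.
encoder = TrackEncoder(track_dim=9, embed_dim=256)
visual_tokens = torch.randn(2, 196, 256)        # stand-in for projected image tokens
track_tokens = encoder(torch.randn(2, 12, 9))   # 12 tracked objects, 9-dim state each
print(fuse_tokens(visual_tokens, track_tokens).shape)  # torch.Size([2, 208, 256])
```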
## Data Preparation

- VQA Datasets: Obtain the datasets by following the instructions from DriveLM.
- Tracking Data:
  - Step 1: Generate 3D object and ego-vehicle tracks using 3DMOTFormer.
  - Step 2: Process these tracks to map key object and ego-vehicle trajectories to each question (a rough sketch of this mapping is shown after this list).
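
The exact preprocessing code is not reproduced here; the snippet below is only a rough sketch of Step 2, assuming the 3DMOTFormer outputs have been grouped into a per-frame dictionary of the form `{frame_token: {track_id: {"center": [x, y, z]}}}`. The function names, schema, and nearest-center matching rule are illustrative assumptions.

```python
import numpy as np

def nearest_track(tracks_at_t, query_xyz):
    """Return the id of the track whose center at this frame is closest to the queried object position."""
    ids = list(tracks_at_t)
    centers = np.array([tracks_at_t[i]["center"] for i in ids])
    return ids[int(np.argmin(np.linalg.norm(centers - np.asarray(query_xyz), axis=1)))]

def trajectory_for_question(tracks_by_frame, frame_token, query_xyz, horizon=6):
    """Collect the matched track's centers over the next `horizon` frames for one VQA question."""
    frames = list(tracks_by_frame)                      # frame tokens in temporal order
    start = frames.index(frame_token)
    track_id = nearest_track(tracks_by_frame[frame_token], query_xyz)
    return [tracks_by_frame[f].get(track_id, {}).get("center")
            for f in frames[start:start + horizon]]
```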
## Results

### DriveLM-nuScenes

### DriveLM-CARLA
## Setup & Fine-Tuning

To set up and fine-tune the model, refer to `llama_adapter_v2_multimodal7b/README.md` in this repository.
## Inference

Before running inference, extract the adapter weights using `save_weights.py`. Inside this script, set the path to the trained weights and the output path accordingly.
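
`save_weights.py` in this repository is the reference for this step; the snippet below is only a generic sketch of the idea, and the paths as well as the `adapter`/`lora`/`gate` name filter are assumptions rather than the script's exact logic.

```python
import torch

# Illustrative sketch only: paths and the name filter are assumptions, not save_weights.py itself.
trained_ckpt_path = "/path/to/trained/checkpoint.pth"
output_path = "/path/to/adapter_only.pth"

ckpt = torch.load(trained_ckpt_path, map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under a "model" key

adapter_state = {k: v for k, v in state.items()
                 if any(tag in k for tag in ("adapter", "lora", "gate"))}
torch.save({"model": adapter_state}, output_path)
print(f"Saved {len(adapter_state)} adapter tensors to {output_path}")
```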
Run the following command to perform inference on test data:
```bash
cd llama_adapter_v2_multimodal7b/
python demo.py --llama_dir /path/to/llama_model_weights \
    --checkpoint /path/to/pre-trained/checkpoint.pth \
    --data ../test_llama.json \
    --output ../output.json \
    --batch_size 4 \
    --num_processes 8
```
## Evaluation

To evaluate the model's performance:

1. Set up the evaluation package using the instructions in the DriveLM Challenge ReadMe.
2. Run the evaluation script:
```bash
python evaluation/evaluation.py --root_path1 ./output.json --root_path2 ./test_eval.json
```
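
As an optional sanity check before evaluating, you can verify that the prediction file covers every ground-truth entry. The snippet below assumes both files are JSON lists of records carrying an `id` field, which may differ from the actual DriveLM schema.

```python
import json

# Assumption: both files are JSON lists of dicts with an "id" field.
with open("output.json") as f:
    pred_ids = {item["id"] for item in json.load(f)}
with open("test_eval.json") as f:
    gt_ids = {item["id"] for item in json.load(f)}

missing = gt_ids - pred_ids
print(f"{len(pred_ids)} predictions, {len(gt_ids)} ground-truth entries, {len(missing)} missing")
```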
## TODO List

- Release pretrained weights
- Release finetuned checkpoint
- Release nuScenes train and test VQA with tracks
## Acknowledgments
We sincerely appreciate the contributions and resources from the following projects:
- DriveLM: benchmark datasets & evaluation.
- LLaMA Adapter: Large Multimodal Model foundation.
- 3DMOTFormer: 3D multi-object tracking.
- nuScenes Dataset: real-world autonomous driving dataset.
If you like this project, please drop a star on GitHub!
