🚗💡 Tracking Meets Large Multimodal Models for Driving Scenario Understanding

April 7, 2025

📌 Enhancing autonomous driving with tracking-powered multimodal understanding! arXiv

✨ Overview

This repository presents an approach that integrates 3D object tracking into Large Multimodal Models (LMMs) to enhance spatiotemporal understanding in autonomous driving. 🚘⚡ By leveraging tracking information, we significantly improve perception, planning, and prediction compared to baseline models.

[Figure: model overview]

🔹 Key Benefits:

  • 📸 Vision + Tracking: We enhance VQA in autonomous driving by integrating tracking-based embeddings.
  • 🚀 3D Object Tracking: We use 3DMOTFormer for robust multi-object tracking, improving contextual understanding.
  • 🔗 Multimodal Fusion: Images and tracking features are jointly processed to enhance reasoning and prediction (see the sketch after this list).
  • 🧠 Self-Supervised Pretraining: Pretraining the tracking encoder boosts the model's comprehension of driving scenes.
  • 🏆 Benchmark Success: We achieve a 9.5% accuracy gain and a 7.04-point ChatGPT-score improvement on DriveLM-nuScenes, and a 3.7% final-score increase on DriveLM-CARLA. 📊🔥


📂 Data Preparation

🔹 VQA Datasets: Obtain the datasets following the instructions from DriveLM.

🔹 Tracking Data:

  • 📌 Step 1: Generate 3D object and ego-vehicle tracks using 3DMOTFormer.
  • 📌 Step 2: Process these tracks to map key object and ego-vehicle trajectories to each question (a rough sketch of this step follows below).
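The processing scripts in this repository are the source of truth for the data format; as a rough illustration, Step 2 can be pictured as joining per-sample tracks onto each VQA entry. All field names below (`sample_token`, `objects`, `ego`) are hypothetical placeholders, not the repository's actual schema.

```python
# Hypothetical sketch of Step 2: attach 3DMOTFormer tracks to VQA questions.
# Field names are assumptions; consult the repo's processing scripts for the
# real schema.
import json

def attach_tracks(vqa_path: str, tracks_path: str, out_path: str) -> None:
    with open(vqa_path) as f:
        questions = json.load(f)   # list of {"question", "answer", "sample_token", ...}
    with open(tracks_path) as f:
        tracks = json.load(f)      # {sample_token: {"objects": [...], "ego": [...]}}

    for q in questions:
        t = tracks.get(q["sample_token"], {})
        q["object_tracks"] = t.get("objects", [])  # key-object trajectories
        q["ego_track"] = t.get("ego", [])          # ego-vehicle trajectory

    with open(out_path, "w") as f:
        json.dump(questions, f, indent=2)

if __name__ == "__main__":
    attach_tracks("train_vqa.json", "tracks.json", "train_vqa_with_tracks.json")
```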

๐Ÿ† Results

🚘 DriveLM-nuScenes

[Table: results on DriveLM-nuScenes]

๐ŸŒ DriveLM-CARLA

[Table: results on DriveLM-CARLA]


โš™๏ธ Setup & Fine-Tuning

💡 To set up and fine-tune the model, refer to `llama_adapter_v2_multimodal7b/README.md` in this repository.


🚀 Inference

🔧 Before running inference, extract the adapter weights using `save_weights.py`. Inside this script, set the trained-weights path and the output path accordingly.
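The script itself defines the exact behavior; for orientation, adapter-weight extraction typically follows the pattern sketched below. The paths and the `"adapter"` key filter are assumptions for illustration only.

```python
# Illustrative sketch of adapter-weight extraction (see save_weights.py for
# the repo's actual logic; paths and the key filter here are assumptions).
import torch

TRAINED_CKPT = "/path/to/trained/checkpoint.pth"  # your trained weights path
OUTPUT_PATH = "/path/to/adapter_weights.pth"      # your output path

ckpt = torch.load(TRAINED_CKPT, map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

# Keep only the trainable adapter parameters; drop the frozen LLaMA weights.
adapter_state = {k: v for k, v in state.items() if "adapter" in k}

torch.save({"model": adapter_state}, OUTPUT_PATH)
print(f"Saved {len(adapter_state)} adapter tensors to {OUTPUT_PATH}")
```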

Run the following command to perform inference on test data:

```bash
cd llama_adapter_v2_multimodal7b/
python demo.py --llama_dir /path/to/llama_model_weights \
               --checkpoint /path/to/pre-trained/checkpoint.pth \
               --data ../test_llama.json \
               --output ../output.json \
               --batch_size 4 \
               --num_processes 8
```

📊 Evaluation

๐Ÿ” To evaluate the model's performance:

1๏ธโƒฃ Set up the evaluation package using instructions in DriveLM Challenge ReadMe.

2๏ธโƒฃ Run the evaluation script:

```bash
python evaluation/evaluation.py --root_path1 ./output.json --root_path2 ./test_eval.json
```

🚀 TODO List

  • 📢 Release pretrained weights
  • 🎯 Release fine-tuned checkpoint
  • 📊 Release nuScenes train and test VQA with tracks

๐Ÿ™ Acknowledgments

We sincerely appreciate the contributions and resources from the following projects:

  • 🚗 DriveLM – Benchmark datasets & evaluation.
  • 🦙 LLaMA Adapter – Large Multimodal Model foundation.
  • 🎯 3DMOTFormer – 3D multi-object tracking.
  • 🌍 nuScenes Dataset – Real-world autonomous driving dataset.

🚀 If you like this project, drop a ⭐ on GitHub! 💙