
April 13, 2026

MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving

🎉🎉 CVPR 2026 🎉🎉

arXiv

Lingjun Zhang1*, Yujian Yuan1,2*, Changjie Wu1†, Xinyuan Chang1, Xin Cai3, Shuang Zeng1,4, Linzhe Shi1, Sijin Wang1, Hang Zhang1, Mu Xu1

1Amap, Alibaba Group, 2The Hong Kong University of Science and Technology, 3The Chinese University of Hong Kong, 4Xi'an Jiaotong University

(*) Equal contribution. (†) Project leader.


Comparison of different reasoning methods. Text reasoning struggles with spatial misalignment, while image reasoning suffers from unguided image prediction. Our proposed progressive multimodal reasoning performs aligned, smooth reasoning.

MindDriver: the proposed multimodal reasoning framework, which enables a VLM to imitate human-like progressive thinking for autonomous driving. MindDriver reasons in three stages: semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning.

🗓️ Release Plan

  • 2026/02: ✅ MindDriver paper.
  • 2026/04: ✅ MindDriver annotation and training code.
  • 2026/06: MindDriver checkpoints.

🛠️ Installation

Create the required environment through the following steps:

git clone https://github.com/hotdogcheesewhite/MindDriver.git && cd MindDriver

conda create -n MindDriver python=3.10 -y && conda activate MindDriver

# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation

cd .. && pip install -r requirements.txt
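After installation, it can help to confirm that the pinned PyTorch packages were actually picked up. The snippet below is a minimal sketch, not part of the repository: `check_versions` is a hypothetical helper that compares installed versions against the pins from the pip command above, using only the standard library.

```python
from importlib import metadata

# Pinned versions taken from the pip command above.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_versions(expected):
    """Return {package: problem} for every package that is missing or mismatched."""
    problems = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # CUDA wheels report versions such as "2.5.1+cu124"; compare the public part only.
        if installed.split("+")[0] != wanted:
            problems[name] = f"found {installed}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(EXPECTED)
    print(issues or "all pinned packages match")
```

An empty result means all three packages match the pinned versions.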

📦 Data Preparation

1. Download nuScenes

Download the complete dataset from nuScenes and extract it to ./LLaMA-Factory/data/nuscenes

Alternatively, create a symbolic link to an existing copy:

ln -s /path/to/your/nuscenes LLaMA-Factory/data

We use pre-cached data from the nuScenes dataset, which can be downloaded from Google Drive. Place the file cached_nuscenes_info.pkl in the directory ./create_data, and place the metrics folder in the directory ./tools/data.
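Before running the later steps, the expected layout can be verified with a short script. This is a sketch under assumptions: `missing_paths` is a hypothetical helper, and the sub-folder names are those of a standard nuScenes trainval download; whether MindDriver needs all of them (for example `sweeps`) is not stated here.

```python
from pathlib import Path

# Standard top-level folders of a nuScenes trainval download (assumed to be required).
NUSCENES_DIRS = ("maps", "samples", "sweeps", "v1.0-trainval")


def missing_paths(repo_root):
    """List expected data paths that do not exist under repo_root."""
    root = Path(repo_root)
    expected = [root / "LLaMA-Factory" / "data" / "nuscenes" / d for d in NUSCENES_DIRS]
    expected.append(root / "create_data" / "cached_nuscenes_info.pkl")
    expected.append(root / "tools" / "data" / "metrics")
    # Path.exists() follows symbolic links, so a linked dataset also passes.
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    for path in missing_paths("."):
        print("missing:", path)
```

Run it from the MindDriver root; no output means every expected path was found.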

2. Extract visual tokens

Extract the visual tokens of the front-view images in the fine-tuning data; they serve as supervision targets for the MLLM:

python MoVQGAN/sft_data.py
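For readers unfamiliar with discrete visual tokens: a VQ-style tokenizer such as MoVQGAN encodes an image into a grid of feature vectors and replaces each vector with the index of its nearest codebook entry. The toy sketch below illustrates only that lookup step with a hand-written codebook; it is not the code in MoVQGAN/sft_data.py, and `quantize` is a hypothetical name.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for vec in features:
        # Squared Euclidean distance to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Toy example: a 3-entry codebook of 2-D vectors (real codebooks are learned and far larger).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
features = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]
print(quantize(features, codebook))  # [0, 1, 2]
```

The resulting index sequence is what a language model can predict token by token, and a decoder can map the indices back to pixels.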

3. Construct data

Construct the fine-tuning data in the LLaMA-Factory format by running the following scripts in order:

python create_data/sft_data.py --split train # Change to "val" for constructing the validation set
python gen_data/for_api_data.py
python gen_data/api_call_mutil.py
python gen_data/check.py
python gen_data/convert_to_qwen_img.py

nuScenes Raw Images
    │
    ▼
[Step 1] MoVQGAN/sft_data.py
    │ Encode CAM_FRONT images → discrete visual tokens
    │ Output: gt_indices_sft.json
    ▼
[Step 2] create_data/sft_data.py
    │ Build prompt + image paths + ground-truth trajectory
    │ Output: {split}_api_split.json (training data template)
    ▼
[Step 3] gen_data/api_call_mutil.py  ──── OR ────  create_data/singleinfer.py (local)
    │ Send images + prompt → LLM → get reasoning text
    │ Output: result_qwen_2_5_72b.json
    ▼
[Step 4] gen_data/check.py
    │ Quality-check each reasoning result via API
    │ Output: check_right_or_wrong.json
    ▼
[Step 5] gen_data/convert_to_qwen_img.py
    │ Merge: API results + check results + MoVQGAN tokens + CAN bus data
    │ Output: {split}_final.json (ready for training)
    ▼
[Step 6] LLaMA-Factory (configs/sft.yaml)
    │ Input: {split}_final.json + Qwen2.5-VL checkpoint
    │ Output: Fine-tuned model weights
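
The data-preparation scripts can also be chained from a single driver so that a failure stops the pipeline early. The following is a sketch under assumptions: `build_pipeline`, `run_pipeline`, and the `--run` switch are hypothetical and not part of the repository; only the script paths and the `--split` argument of create_data/sft_data.py are taken from the commands above.

```python
import subprocess
import sys


def build_pipeline(split="train"):
    """Return the data-preparation commands in the order listed above."""
    return [
        ["python", "MoVQGAN/sft_data.py"],
        ["python", "create_data/sft_data.py", "--split", split],
        ["python", "gen_data/for_api_data.py"],
        ["python", "gen_data/api_call_mutil.py"],
        ["python", "gen_data/check.py"],
        ["python", "gen_data/convert_to_qwen_img.py"],
    ]


def run_pipeline(split="train", dry_run=True):
    """Print each command; execute it when dry_run is False, stopping on the first failure."""
    for cmd in build_pipeline(split):
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit


if __name__ == "__main__":
    # Hypothetical switch: pass --run to execute instead of only printing the commands.
    run_pipeline(split="train", dry_run="--run" not in sys.argv)
```

By default the driver only prints the commands, which makes it safe to inspect the order before spending API calls in the reasoning and checking steps.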

Follow the LLaMA-Factory tutorial to register the generated datasets in the file ./LLaMA-Factory/data/dataset_info.json.
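
As an illustration of that registration step, the sketch below adds one entry to dataset_info.json. It rests on assumptions: `register_dataset` is a hypothetical helper, and the `columns` and `tags` values are modeled on LLaMA-Factory's multimodal sharegpt demo entry. The actual field names in the generated {split}_final.json files may differ, so adjust the entry to match them.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, file_name):
    """Add (or overwrite) one multimodal sharegpt-style entry in dataset_info.json."""
    path = Path(info_path)
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        # Assumed column and tag names; check them against the generated JSON.
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
    path.write_text(json.dumps(info, indent=2, ensure_ascii=False), encoding="utf-8")
    return info[name]


# Example call (dataset name as used by the inference command in this README):
# register_dataset("LLaMA-Factory/data/dataset_info.json",
#                  "val_cot_motion", "val_cot_motion.json")
```

Existing entries in the file are preserved; only the named entry is added or replaced.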

🚀 Training

Enter the working directory of LLaMA-Factory:

cd LLaMA-Factory

The SFT stage trains the model to achieve two-stage alignment:

llamafactory-cli train ../configs/sft.yaml

🎯 Inference

Run the following command in the LLaMA-Factory directory to run inference on the test dataset:

python scripts/vllm_infer.py \
--model_name_or_path saves/qwen25_vl-3b/sft \
--dataset val_cot_motion \
--template qwen2_vl \
--cutoff_len 32768 \
--max_new_tokens 2048 \
--max_samples 100000 \
--image_resolution 524288 \
--save_name results.jsonl \
--temperature 0.1 \
--top_p 0.1 \
--top_k 10

📈 Evaluation

First, under the MindDriver directory, match the predicted results with the tokens to facilitate the evaluation:

cd ..

python tools/match.py \
--pred_trajs_path ./LLaMA-Factory/results.jsonl \
--token_traj_path ./LLaMA-Factory/data/val_cot_motion.json

Then compute the L2 error and collision rate metrics for end-to-end trajectory planning:

# Use "--metric stp3" to switch to the ST-P3 calculation method
python tools/evaluation/evaluation.py \
--metric uniad \
--result_file ./LLaMA-Factory/eval_traj.json
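
The two metric options differ in how the L2 error is aggregated over time. The sketch below follows the conventions as they are commonly described: the UniAD style reports the error at each horizon, while the ST-P3 style averages all errors up to that horizon. It assumes 2 Hz waypoints over 3 s, uses hypothetical function names, and omits the collision rate; tools/evaluation/evaluation.py may differ in detail.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance between predicted and ground-truth (x, y) points."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizons(pred, gt, metric="uniad", hz=2, horizons=(1, 2, 3)):
    """L2 at 1 s, 2 s, 3 s: 'uniad' reads the error at the horizon, 'stp3' averages up to it."""
    errs = l2_errors(pred, gt)
    out = {}
    for sec in horizons:
        n = sec * hz  # number of waypoints up to this horizon
        if metric == "uniad":
            out[sec] = errs[n - 1]
        elif metric == "stp3":
            out[sec] = sum(errs[:n]) / n
        else:
            raise ValueError(f"unknown metric: {metric}")
    return out


# Toy 3 s trajectory at 2 Hz whose lateral error grows by 0.1 m per waypoint.
gt = [(0.0, float(i)) for i in range(1, 7)]
pred = [(0.1 * i, float(i)) for i in range(1, 7)]
print(l2_at_horizons(pred, gt, "uniad"))
print(l2_at_horizons(pred, gt, "stp3"))
```

Because the ST-P3 style averages over earlier, usually smaller errors, it yields lower numbers than the UniAD style on the same predictions, so results from the two settings should not be compared directly.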

👀 Visualization

Use the following command under the MindDriver directory to visualize the trajectory:

python tools/visualization/visualize_planning.py \
--pred-trajs-path ./LLaMA-Factory/results.jsonl \
--tokens-path ./LLaMA-Factory/eval_traj.json \
--output-path ./vis_traj

Use the following command under the MindDriver directory to restore the visual tokens to the pixel space and visualize the CoT:

python ./MoVQGAN/vis.py \
--input_json ./LLaMA-Factory/eval_traj.json \
--output_dir ./vis_cot

🙏 Acknowledgement

Our work is primarily based on the following codebases: FSDrive, LLaMA-Factory, MoVQGAN, GPT-Driver, Agent-Driver. We are sincerely grateful for their work.