April 13, 2026
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
CVPR 2026
Lingjun Zhang1*, Yujian Yuan1,2*, Changjie Wu1†, Xinyuan Chang1, Xin Cai3, Shuang Zeng1,4, Linzhe Shi1, Sijin Wang1, Hang Zhang1, Mu Xu1
1Amap, Alibaba Group, 2The Hong Kong University of Science and Technology, 3The Chinese University of Hong Kong, 4Xi'an Jiaotong University
(*) Equal contribution. (†) Project leader.
Comparison of different reasoning methods. Text reasoning struggles with space misalignment, while image reasoning suffers from guideless image prediction. Our proposed progressive multimodal reasoning conducts aligned smooth reasoning.
MindDriver: The proposed multimodal reasoning framework that enables VLM to imitate human-like progressive thinking for autonomous driving. MindDriver presents semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning.
Release Plan
2026/02: ✅ MindDriver paper.
2026/04: ✅ MindDriver annotation and training code.
2026/06: MindDriver checkpoints.
Installation
Create the required environment through the following steps:
git clone https://github.com/hotdogcheesewhite/MindDriver.git && cd MindDriver
conda create -n MindDriver python=3.10 -y && conda activate MindDriver
# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation
cd .. && pip install -r requirements.txt
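After installation, a quick check that the pinned PyTorch packages were actually picked up can save debugging time later. The following is a minimal sketch, not part of the repository: `check_versions` is a hypothetical helper, and the expected versions are copied from the pip command above.

```python
from importlib import metadata

# Versions copied from the pip command above; the helper itself is hypothetical.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for name, want in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # CUDA wheels report versions such as "2.5.1+cu124",
        # so only the part before "+" is compared.
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    problems = check_versions()
    for name, (want, found) in problems.items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
    if not problems:
        print("All pinned packages match.")
```

Run it inside the activated MindDriver environment; an empty result means the three packages match the pinned versions.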
Data Preparation
1. Download nuScenes
Download the complete dataset from nuScenes and extract it to ./LLaMA-Factory/data/nuscenes
Or create a symbolic link:
ln -s /path/to/your/nuscenes LLaMA-Factory/data
We use pre-cached data from the nuScenes dataset, which can be downloaded from Google Drive. Place the file cached_nuscenes_info.pkl in the directory ./create_data and the metrics folder in the directory ./tools/data.
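Before moving on, it can help to confirm that the dataset directory (or the symbolic link to it) has the expected layout. The sketch below is not part of the repository: `missing_nuscenes_dirs` is a hypothetical helper, and the metadata folder name `v1.0-trainval` is an assumption that depends on which nuScenes split you downloaded.

```python
from pathlib import Path

# Top-level folders of an extracted nuScenes release.
# "v1.0-trainval" is an assumption; the mini split uses "v1.0-mini" instead.
REQUIRED_DIRS = ("maps", "samples", "sweeps", "v1.0-trainval")


def missing_nuscenes_dirs(root, required=REQUIRED_DIRS):
    """Return the required sub-directories that are absent under root."""
    root = Path(root)
    # is_dir() follows symbolic links, so a linked dataset is handled too.
    return [name for name in required if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_nuscenes_dirs("LLaMA-Factory/data/nuscenes")
    print("missing:", missing if missing else "nothing")
```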
2. Extract visual tokens
Extract the visual tokens of the front-view images in the fine-tuning data; they serve as supervision targets for the MLLM:
python MoVQGAN/sft_data.py
3. Construct data
Construct the fine-tuning data in the LLaMA-Factory format:
python create_data/sft_data.py --split train # Change to "val" for constructing the validation set
python gen_data/for_api_data.py
python gen_data/api_call_mutil.py
python gen_data/check.py
python gen_data/convert_to_qwen_img.py
NuScenes Raw Images
    │
    ▼
[Step 1] MoVQGAN/sft_data.py
    │  Encode CAM_FRONT images → discrete visual tokens
    │  Output: gt_indices_sft.json
    ▼
[Step 2] create_data/sft_data.py
    │  Build prompt + image paths + ground-truth trajectory
    │  Output: {split}_api_split.json (training data template)
    ▼
[Step 3] gen_data/api_call_mutil.py ──── OR ──── create_data/singleinfer.py (local)
    │  Send images + prompt → LLM → get reasoning text
    │  Output: result_qwen_2_5_72b.json
    ▼
[Step 4] gen_data/check.py
    │  Quality-check each reasoning result via API
    │  Output: check_right_or_wrong.json
    ▼
[Step 5] gen_data/convert_to_qwen_img.py
    │  Merge: API results + check results + MoVQGAN tokens + CAN bus data
    │  Output: {split}_final.json (ready for training)
    ▼
[Step 6] LLaMA-Factory (configs/sft.yaml)
    │  Input: {split}_final.json + Qwen2.5-VL checkpoint
    │  Output: Fine-tuned model weights
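The data-preparation steps in the diagram (Steps 1 to 5) can be chained so that a failure stops the run early. This is a minimal sketch under stated assumptions, not a script shipped with the repository: `build_commands` and `run_pipeline` are hypothetical names, the script paths are the ones listed above, and it assumes the scripts take no arguments beyond `--split`.

```python
import subprocess
import sys


def build_commands(split="train"):
    """Return the data-preparation commands in the order shown in the diagram."""
    return [
        [sys.executable, "MoVQGAN/sft_data.py"],
        [sys.executable, "create_data/sft_data.py", "--split", split],
        [sys.executable, "gen_data/for_api_data.py"],
        [sys.executable, "gen_data/api_call_mutil.py"],
        [sys.executable, "gen_data/check.py"],
        [sys.executable, "gen_data/convert_to_qwen_img.py"],
    ]


def run_pipeline(split="train", dry_run=False):
    """Run each step in order; with dry_run=True the commands are only printed."""
    for cmd in build_commands(split):
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError as soon as a step fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline("train", dry_run=True)
```

Run it from the MindDriver root directory, and switch `dry_run` to `False` once the printed commands look right.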
Follow the LLaMA-Factory tutorial and add the dataset information to the file ./LLaMA-Factory/data/dataset_info.json.
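As an illustration of what registering a dataset involves, the sketch below adds one entry to dataset_info.json. It is an assumption-laden example: `register_dataset` is a hypothetical helper, and the column and tag layout mirrors LLaMA-Factory's multimodal ShareGPT-style demo entry, which may not match the fields that the generated {split}_final.json actually uses. Check the generated file and the LLaMA-Factory data documentation before relying on it.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, file_name):
    """Add or overwrite one dataset entry in a dataset_info.json file."""
    info_path = Path(info_path)
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        info = {}
    # Assumed layout: ShareGPT-style messages plus an image list.
    # Adjust the column names to the keys in your generated JSON.
    info[name] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
    text = json.dumps(info, indent=2, ensure_ascii=False) + "\n"
    info_path.write_text(text, encoding="utf-8")
    return info[name]
```

For example, `register_dataset("LLaMA-Factory/data/dataset_info.json", "val_cot_motion", "val_cot_motion.json")` would register the validation set under the name used by the inference command.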
Training
Enter the working directory of LLaMA-Factory:
cd LLaMA-Factory
The SFT stage trains the model to achieve two-stage alignment:
llamafactory-cli train ../configs/sft.yaml
Inference
Run the following command in the LLaMA-Factory directory to run inference on the test dataset:
python scripts/vllm_infer.py \
--model_name_or_path saves/qwen25_vl-3b/sft \
--dataset val_cot_motion \
--template qwen2_vl \
--cutoff_len 32768 \
--max_new_tokens 2048 \
--max_samples 100000 \
--image_resolution 524288 \
--save_name results.jsonl \
--temperature 0.1 \
--top_p 0.1 \
--top_k 10
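The command above writes one JSON object per line to results.jsonl. The sketch below loads that file for a quick inspection; `load_predictions` is a hypothetical helper, and the assumption that each record carries a `predict` key (alongside `prompt` and `label`) should be checked against your LLaMA-Factory version.

```python
import json


def load_predictions(path):
    """Read a JSON Lines file and return its records as a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def first_prediction(records, key="predict"):
    """Return the model output of the first record, or None if unavailable."""
    # "predict" is an assumed key name; adjust it if your file differs.
    return records[0].get(key) if records else None
```

For example, `first_prediction(load_predictions("results.jsonl"))` shows the first generated answer, which is a cheap way to spot formatting problems before running the evaluation.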
Evaluation
First, from the MindDriver directory, match the predicted results with their sample tokens so that they can be evaluated:
cd ..
python tools/match.py \
--pred_trajs_path ./LLaMA-Factory/results.jsonl \
--token_traj_path ./LLaMA-Factory/data/val_cot_motion.json
Then compute the L2 error and collision rate for end-to-end trajectory planning:
# Change "uniad" to "stp3" to use the ST-P3 calculation method
python tools/evaluation/evaluation.py \
--metric uniad \
--result_file ./LLaMA-Factory/eval_traj.json
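The two metric options differ in how the L2 error is aggregated over time. As commonly described, the UniAD protocol reports the error at each horizon's final waypoint, while the ST-P3 protocol averages the errors of all waypoints up to that horizon. The sketch below illustrates the difference for a single trajectory; it is not the repository's evaluation.py, the function names are hypothetical, and the 2 Hz sampling with horizons of 1, 2, and 3 seconds is an assumption based on the usual nuScenes planning setup.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance between two lists of (x, y) points."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizons(pred, gt, metric="uniad", hz=2, horizons=(1, 2, 3)):
    """Return {horizon_seconds: L2 error} under the chosen aggregation."""
    errs = l2_errors(pred, gt)
    out = {}
    for h in horizons:
        n = h * hz  # number of waypoints up to horizon h
        if metric == "uniad":
            out[h] = errs[n - 1]        # error at the horizon's last waypoint
        elif metric == "stp3":
            out[h] = sum(errs[:n]) / n  # mean error over all waypoints so far
        else:
            raise ValueError(f"unknown metric: {metric}")
    return out
```

Because the ST-P3 variant averages in the smaller early errors, it usually yields lower numbers than the UniAD variant for the same predictions, so results under the two settings should not be compared directly.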
Visualization
Use the following command under the MindDriver directory to visualize the trajectory:
python tools/visualization/visualize_planning.py \
--pred-trajs-path ./LLaMA-Factory/results.jsonl \
--tokens-path ./LLaMA-Factory/eval_traj.json \
--output-path ./vis_traj
Use the following command under the MindDriver directory to decode the visual tokens back to pixel space and visualize the CoT:
python ./MoVQGAN/vis.py \
--input_json ./LLaMA-Factory/eval_traj.json \
--output_dir ./vis_cot
Acknowledgement
Our work is primarily based on the following codebases: FSDrive, LLaMA-Factory, MoVQGAN, GPT-Driver, Agent-Driver. We are sincerely grateful for their work.