README.md
September 28, 2025
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
NeurIPS 2025 Spotlight
Shuang Zeng1,2, Xinyuan Chang1, Mengwei Xie1, Xinran Liu1, Yifan Bai2,3, Zheng Pan1, Mu Xu1, Xing Wei2
1Amap, Alibaba Group, 2Xi'an Jiaotong University, 3DAMO Academy, Alibaba Group
FutureSightDrive (FSDrive): The proposed spatio-temporal CoT enables an end-to-end autonomous driving VLA to think visually about trajectory planning and to unify visual generation and understanding with minimal data, advancing autonomous driving towards visual reasoning for the first time.
https://github.com/user-attachments/assets/a99a14a3-a892-4cbe-ac1f-66b777d9081b
Table of Contents
- Installation
- Data Preparation
- Training
- Infer
- Evaluation
- Visualization
- Citing
- Acknowledgement
Installation
Create the required environment through the following steps:
git clone https://github.com/MIV-XJTU/FSDrive.git && cd FSDrive
conda create -n FSDrive python=3.10 -y && conda activate FSDrive
# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation
cd .. && pip install -r requirements.txt
Data Preparation
1. Download nuScenes
Download the complete dataset from nuScenes and extract it to ./LLaMA-Factory/data/nuscenes
Or create a symbolic link:
ln -s /path/to/your/nuscenes LLaMA-Factory/data
We use pre-cached data from the nuScenes dataset, which can be downloaded from Google Drive. Place the file cached_nuscenes_info.pkl in the directory ./create_data and the metrics folder in the directory ./tools/data.
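Before moving on, it can help to confirm that the files are where the later steps expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the paths are simply the ones named in the steps above, taken relative to the FSDrive root.

```python
from pathlib import Path

# Paths named in the data preparation steps, relative to the FSDrive root.
EXPECTED_PATHS = (
    "LLaMA-Factory/data/nuscenes",
    "create_data/cached_nuscenes_info.pkl",
    "tools/data/metrics",
)


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

Run it from the FSDrive root; an empty result means all three locations exist. It does not validate the contents of the dataset itself.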
2. Extract visual tokens
Extract the front-view visual tokens separately for the pre-training and fine-tuning data; these tokens are used to supervise the MLLM:
python MoVQGAN/pretrain_data.py
python MoVQGAN/sft_data.py
3. Construct data
Construct the pre-training and fine-tuning data in the LLaMA-Factory format:
python create_data/pretrain_data.py
python create_data/sft_data.py --split train # Change to "val" for constructing the validation set
Follow the LLaMA-Factory tutorial and add the dataset information to the file ./LLaMA-Factory/data/dataset_info.json.
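As an illustration of the registration step, the sketch below adds one entry to dataset_info.json. The helper `register_dataset` is hypothetical. `file_name` and `formatting` are keys used by LLaMA-Factory, but the `"sharegpt"` value is an assumption about what the create_data scripts emit; check the generated JSON and the LLaMA-Factory data documentation before relying on it.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, file_name, formatting="sharegpt"):
    """Add or overwrite one entry in a LLaMA-Factory dataset_info.json file."""
    path = Path(info_path)
    # Start from the existing registry if there is one, so other entries are kept.
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # ASSUMPTION: "sharegpt" may not match the generated data; adjust it, and add
    # a "columns" mapping if the field names differ from LLaMA-Factory's defaults.
    info[name] = {"file_name": file_name, "formatting": formatting}
    path.write_text(
        json.dumps(info, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return info[name]
```

For example, `register_dataset("LLaMA-Factory/data/dataset_info.json", "val_cot_motion", "val_cot_motion.json")` would register the validation set under the name passed to `--dataset` in the inference step.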
Training
Enter the working directory of LLaMA-Factory:
cd LLaMA-Factory
1. Pre-train
First, pre-train the VLM to activate its visual generation capabilities:
llamafactory-cli train ../configs/pretrain.yaml
2. SFT
Then, based on the pre-trained parameters, fine-tune the VLM to think visually about trajectory planning:
llamafactory-cli train ../configs/sft.yaml
Infer
Run the following command in the LLaMA-Factory directory to run inference on the test dataset:
python scripts/vllm_infer.py \
--model_name_or_path saves/qwen2_vl-2b/sft \
--dataset val_cot_motion \
--template qwen2_vl \
--cutoff_len 32768 \
--max_new_tokens 2048 \
--max_samples 100000 \
--image_resolution 524288 \
--save_name results.jsonl \
--temperature 0.1 \
--top_p 0.1 \
--top_k 10
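A quick sanity check on results.jsonl before evaluation can catch empty or truncated generations. The sketch below is not part of the repository: `load_jsonl` and `summarize` are hypothetical helpers, and the `"predict"` field name is an assumption about the records written by vllm_infer.py.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records, key="predict"):
    """Count all records and those with a non-empty value under key.

    ASSUMPTION: each record stores the model output under "predict".
    """
    non_empty = sum(1 for r in records if str(r.get(key, "")).strip())
    return {"total": len(records), "non_empty": non_empty}


if __name__ == "__main__":
    print(summarize(load_jsonl("results.jsonl")))
```

If `total` and `non_empty` differ, some samples produced no output and are worth inspecting before computing metrics.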
Evaluation
First, from the FSDrive directory, match the predicted results with the tokens to prepare them for evaluation:
cd ..
python tools/match.py \
--pred_trajs_path ./LLaMA-Factory/results.jsonl \
--token_traj_path ./LLaMA-Factory/data/val_cot_motion.json
Then evaluate the L2 error and collision rate metrics for end-to-end trajectory planning:
# Set --metric to "stp3" to use the ST-P3 calculation method instead
python tools/evaluation/evaluation.py \
--metric uniad \
--result_file ./LLaMA-Factory/eval_traj.json
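To make the difference between the two `--metric` options concrete, here is a minimal sketch of the L2 computation. It is not the repository's evaluation code. It assumes the conventions commonly described for these protocols: UniAD-style L2 is the error at the final waypoint of each horizon, and ST-P3-style L2 is the mean error over all waypoints up to that horizon. The 2 Hz sampling rate and the 1 s, 2 s and 3 s horizons are also assumptions, and the function names are hypothetical.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance between two lists of (x, y) points."""
    if len(pred) != len(gt):
        raise ValueError("trajectories must have the same number of waypoints")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizons(errors, hz=2, horizons=(1, 2, 3), average=False):
    """Map each horizon (in seconds) to an L2 value.

    average=False: error at the last waypoint of the horizon (UniAD-style, assumed).
    average=True: mean error over waypoints up to the horizon (ST-P3-style, assumed).
    """
    out = {}
    for h in horizons:
        n = h * hz  # number of waypoints covered by this horizon
        out[h] = sum(errors[:n]) / n if average else errors[n - 1]
    return out


if __name__ == "__main__":
    # Toy trajectories: the prediction drifts sideways by 0.5 m per waypoint.
    gt = [(float(i), 0.0) for i in range(1, 7)]
    pred = [(float(i), 0.5 * i) for i in range(1, 7)]
    errs = l2_errors(pred, gt)
    print(l2_at_horizons(errs))                # {1: 1.0, 2: 2.0, 3: 3.0}
    print(l2_at_horizons(errs, average=True))  # {1: 0.75, 2: 1.25, 3: 1.75}
```

Because the averaged variant includes the easier early waypoints, it reports lower numbers than the per-timestep variant for the same predictions, so results from the two settings should not be compared directly.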
Visualization
Use the following command under the FSDrive directory to visualize the trajectory:
python tools/visualization/visualize_planning.py \
--pred-trajs-path ./LLaMA-Factory/results.jsonl \
--tokens-path ./LLaMA-Factory/eval_traj.json \
--output-path ./vis_traj
Use the following command under the FSDrive directory to restore the visual tokens to the pixel space and visualize the CoT:
python ./MoVQGAN/vis.py \
--input_json ./LLaMA-Factory/eval_traj.json \
--output_dir ./vis_cot
Citing
If you find FSDrive useful in your research or applications, please consider giving us a star and citing it with the following BibTeX entry:
@article{zeng2025futuresightdrive,
title={FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving},
author={Zeng, Shuang and Chang, Xinyuan and Xie, Mengwei and Liu, Xinran and Bai, Yifan and Pan, Zheng and Xu, Mu and Wei, Xing},
journal={arXiv preprint arXiv:2505.17685},
year={2025}
}
Acknowledgement
Our work is primarily based on the following codebases: LLaMA-Factory, MoVQGAN, GPT-Driver, Agent-Driver. We are sincerely grateful for their work.