September 28, 2025

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

🎉🎉NeurIPS 2025 spotlight🎉🎉

Shuang Zeng¹,², Xinyuan Chang¹, Mengwei Xie¹, Xinran Liu¹, Yifan Bai²,³, Zheng Pan¹, Mu Xu¹, Xing Wei²

¹Amap, Alibaba Group, ²Xi’an Jiaotong University, ³DAMO Academy, Alibaba Group

FutureSightDrive (FSDrive): The proposed spatio-temporal CoT enables end-to-end autonomous driving VLA models to think visually about trajectory planning, and it unifies visual generation and understanding with minimal data, advancing autonomous driving towards visual reasoning for the first time.

https://github.com/user-attachments/assets/a99a14a3-a892-4cbe-ac1f-66b777d9081b

🛠️ Installation
📦 Data Preparation
🚀 Training
🎯 Infer
📈 Evaluation
👀 Visualization
📜 Citing
🙏 Acknowledgement

🛠️ Installation

Create the required environment through the following steps:

git clone https://github.com/MIV-XJTU/FSDrive.git && cd FSDrive

conda create -n FSDrive python=3.10 -y && conda activate FSDrive

# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

cd LLaMA-Factory && pip install -e ".[metrics,deepspeed,liger-kernel,bitsandbytes]" --no-build-isolation

cd .. && pip install -r requirements.txt
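
After installation, it can help to confirm that the pinned PyTorch packages are the ones actually installed. The following is a minimal sketch, not part of the repository: the `check_pins` helper is hypothetical, and the pins are copied from the `pip install` command above.

```python
from importlib import metadata

# Pins copied from the pip command in the installation steps.
EXPECTED = {"torch": "2.5.1", "torchvision": "0.20.1", "torchaudio": "2.5.1"}


def check_pins(expected):
    """Return {package: problem description} for every pin that is not met."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # CUDA wheels report versions such as "2.5.1+cu124";
        # compare only the part before the "+" local-version suffix.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    for name, problem in check_pins(EXPECTED).items():
        print(f"{name}: {problem}")
```

An empty result means all three pins are satisfied in the active environment.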

📦 Data Preparation

1. Download nuScenes

Download the complete dataset from nuScenes and extract it to ./LLaMA-Factory/data/nuscenes.

Alternatively, create a symbolic link:

ln -s /path/to/your/nuscenes LLaMA-Factory/data

We use pre-cached data from the nuScenes dataset, which can be downloaded from Google Drive. Place the file cached_nuscenes_info.pkl in the directory ./create_data and the metrics folder in the directory ./tools/data.
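
The layout described above can be checked before running the later steps. This is a minimal sketch under the assumption that the three paths named in this section are the only prerequisites; `missing_paths` is a hypothetical helper, not a script shipped with FSDrive.

```python
from pathlib import Path

# Paths taken from the data preparation instructions, relative to the FSDrive root.
REQUIRED = (
    "LLaMA-Factory/data/nuscenes",
    "create_data/cached_nuscenes_info.pkl",
    "tools/data/metrics",
)


def missing_paths(repo_root, required=REQUIRED):
    """Return the required relative paths that do not exist under repo_root.

    Path.exists() follows symbolic links, so a linked nuScenes directory
    counts as present as long as the link target exists.
    """
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("."):
        print(f"missing: {rel}")
```

Run it from the FSDrive root; no output means every expected path was found.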

2. Extract visual tokens

Extract the front-view visual tokens separately for the pre-training data and the fine-tuning data; these tokens are used to supervise the MLLM:

python MoVQGAN/pretrain_data.py
python MoVQGAN/sft_data.py

3. Construct data

Construct the pre-training and fine-tuning data in the LLaMA-Factory format:

python create_data/pretrain_data.py
python create_data/sft_data.py --split train # Change to "val" for constructing the validation set

Follow the LLaMA-Factory tutorial and add the dataset information in the file ./LLaMA-Factory/data/dataset_info.json.
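
As an illustration of the registration step, the sketch below adds one entry to dataset_info.json. It is an assumption-laden example: `register_dataset` is a hypothetical helper, and the entry fields mirror the ShareGPT-style multimodal demo entry that ships with LLaMA-Factory. The files produced by the create_data scripts may need different columns or tags, so check them against the LLaMA-Factory tutorial. The dataset name `val_cot_motion` is the one used in the inference command of this README.

```python
import json
from pathlib import Path


def register_dataset(info_path, name, file_name):
    """Add or overwrite one dataset entry in a dataset_info.json file."""
    info_path = Path(info_path)
    info = {}
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    # Assumed layout: ShareGPT-style messages plus an image list per sample.
    info[name] = {
        "file_name": file_name,
        "formatting": "sharegpt",
        "columns": {"messages": "messages", "images": "images"},
        "tags": {
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
        },
    }
    info_path.write_text(
        json.dumps(info, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return info[name]


if __name__ == "__main__":
    register_dataset(
        "LLaMA-Factory/data/dataset_info.json",
        "val_cot_motion",
        "val_cot_motion.json",
    )
```

Existing entries in the file are preserved; only the named entry is added or replaced.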

🚀 Training

Enter the working directory of LLaMA-Factory:

cd LLaMA-Factory

1. Pre-train

First, pre-train the VLM to activate its visual generation capabilities:

llamafactory-cli train ../configs/pretrain.yaml

2. SFT

Then, based on the pre-trained parameters, fine-tune the VLM to think visually about trajectory planning:

llamafactory-cli train ../configs/sft.yaml

🎯 Infer

Run the following command in the LLaMA-Factory directory to run inference on the test dataset:

python scripts/vllm_infer.py \
--model_name_or_path saves/qwen2_vl-2b/sft \
--dataset val_cot_motion \
--template qwen2_vl \
--cutoff_len 32768 \
--max_new_tokens 2048 \
--max_samples 100000 \
--image_resolution 524288 \
--save_name results.jsonl \
--temperature 0.1 \
--top_p 0.1 \
--top_k 10
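
Before moving on to evaluation, the output file can be inspected quickly. The sketch below assumes results.jsonl holds one JSON object per line and that the generated text is stored under a `predict` key; that key name is an assumption about the script's output format and should be checked against an actual file. `load_predictions` and `count_empty` are hypothetical helpers.

```python
import json


def load_predictions(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_empty(records, key="predict"):
    """Count records whose value under `key` is missing or blank (key name assumed)."""
    return sum(1 for r in records if not str(r.get(key, "")).strip())


if __name__ == "__main__":
    preds = load_predictions("results.jsonl")
    print(f"{len(preds)} records, {count_empty(preds)} with empty predictions")
```

A large number of empty predictions usually points to a truncated or failed inference run rather than a model problem.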

📈 Evaluation

First, from the FSDrive directory, match the predicted results with their tokens so that they can be evaluated:

cd ..

python tools/match.py \
--pred_trajs_path ./LLaMA-Factory/results.jsonl \
--token_traj_path ./LLaMA-Factory/data/val_cot_motion.json

Then evaluate the L2 error and collision rate metrics for end-to-end trajectory planning:

# Use "--metric stp3" instead to apply the ST-P3 calculation method
python tools/evaluation/evaluation.py \
--metric uniad \
--result_file ./LLaMA-Factory/eval_traj.json
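
To illustrate how the two calculation methods can differ, here is a minimal sketch of the L2 metric only. It is not the code in tools/evaluation/evaluation.py. It assumes the convention commonly described for these protocols: the UniAD-style value is the error at the waypoint of each horizon, while the ST-P3-style value is the mean error over all waypoints up to that horizon. It also assumes 2 Hz waypoints over 3 seconds; the function names are hypothetical.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance between two lists of (x, y) points."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizons(pred, gt, mode="uniad", hz=2, horizons=(1, 2, 3)):
    """Return {horizon in seconds: L2 error} under the assumed convention."""
    errs = l2_errors(pred, gt)
    out = {}
    for h in horizons:
        n = h * hz  # number of waypoints up to and including this horizon
        if mode == "uniad":
            out[h] = errs[n - 1]  # error at the horizon waypoint only
        elif mode == "stp3":
            out[h] = sum(errs[:n]) / n  # mean error over waypoints up to the horizon
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


if __name__ == "__main__":
    # Toy trajectories: the prediction drifts sideways by 0.1 m per waypoint.
    gt = [(0.0, float(i)) for i in range(1, 7)]
    pred = [(0.1 * i, float(i)) for i in range(1, 7)]
    print(l2_at_horizons(pred, gt, mode="uniad"))
    print(l2_at_horizons(pred, gt, mode="stp3"))
```

Because the averaged variant includes the earlier, usually smaller, errors, it tends to produce lower numbers than the per-waypoint variant for the same predictions, so results from the two methods should not be compared directly.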

👀 Visualization

Use the following command under the FSDrive directory to visualize the trajectory:

python tools/visualization/visualize_planning.py \
--pred-trajs-path ./LLaMA-Factory/results.jsonl \
--tokens-path ./LLaMA-Factory/eval_traj.json \
--output-path ./vis_traj

Use the following command under the FSDrive directory to restore the visual tokens to the pixel space and visualize the CoT:

python ./MoVQGAN/vis.py \
--input_json ./LLaMA-Factory/eval_traj.json \
--output_dir ./vis_cot

📜 Citing

If you find FSDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:

@article{zeng2025futuresightdrive,
  title={FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving},
  author={Zeng, Shuang and Chang, Xinyuan and Xie, Mengwei and Liu, Xinran and Bai, Yifan and Pan, Zheng and Xu, Mu and Wei, Xing},
  journal={arXiv preprint arXiv:2505.17685},
  year={2025}
}

🙏 Acknowledgement

Our work is primarily based on the following codebases: LLaMA-Factory, MoVQGAN, GPT-Driver, and Agent-Driver. We are sincerely grateful for their work.