Vega: Learning to Drive with Natural Language Instructions

March 27, 2026

arXiv Project Page

Sicheng Zuo*, Yuxuan Li*, Wenzhao Zheng*†, Zheng Zhu, Jie Zhou, Jiwen Lu

* Equal contributions. † Project leader.

Vega is a vision-language-world-action model that can follow natural language instructions to generate diverse driving actions and future images. Compared to traditional driving models, which can only predict a single expert trajectory or follow a limited set of navigation commands, Vega can generate multiple planning trajectories and future images that follow diverse user instructions. In the training stage, Vega leverages world modeling to enhance the model's planning capabilities in complex driving scenarios. As a result, our model achieves top performance on both NAVSIM v1 and v2 closed-loop planning benchmarks.

✨ News

  • [2026/03/26] Vega: Paper, training and inference code are released.

🚀 TODO

  • Release pretrained model weights.
  • Release instruction annotations (navtrain & navtest).
  • Release instruction generation and data processing scripts.

📦 Installation

We tested the code with CUDA 12.1, Python 3.10, and PyTorch 2.7.1.

git clone https://github.com/wzzheng/Vega.git
cd Vega

conda create -n vega python=3.10
conda activate vega

pip install -r requirements.txt
# Install flash-attn from https://github.com/Dao-AILab/flash-attention/releases

Replace every /path/to placeholder in the code with a real path. For example, replace /path/to/Vega with the actual path to your Vega workspace.
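
One possible way to locate and rewrite the placeholders is sketched below. It is not part of the repository: the helper name replace_placeholder and the example target /data/workspace/Vega are assumptions for illustration, and the sketch relies on GNU grep and GNU sed. Reviewing the grep output before rewriting anything is advisable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Vega repository).
# Usage: replace_placeholder <root_dir> <placeholder> <real_path>
# Assumes neither path contains '|', '&', or regex metacharacters.
replace_placeholder() {
  local root="$1" placeholder="$2" real_path="$3"
  # List every file under root that contains the placeholder (fixed string),
  # skipping Git metadata, then rewrite each file in place.
  grep -rlF --exclude-dir=.git -- "$placeholder" "$root" | while IFS= read -r file; do
    # '|' is the sed delimiter because the paths themselves contain '/'.
    sed -i "s|${placeholder}|${real_path}|g" "$file"
  done
}

# Inspect the placeholders first (run from the repository root):
#   grep -rnF --exclude-dir=.git "/path/to" .
# Then rewrite one placeholder at a time, for example:
#   replace_placeholder . /path/to/Vega /data/workspace/Vega
```

Rewriting one placeholder at a time keeps different /path/to/... entries (workspace, dataset, checkpoints) from being collapsed into the same directory.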

🤗 Pretrained Models & Datasets

Our pretrained models will be available on the Hugging Face Hub soon:

Version | Hugging Face Model | Action Planning | Image Generation | #Params
Vega | [Coming Soon] | | | 14B

Our instruction annotations will also be available on the Hugging Face Hub:

Version | Hugging Face Dataset | Rule-based Instructions | VLM Instructions | #Samples
navtrain | [Coming Soon] | | | 85109
navtest | [Coming Soon] | | | 12146

🌟 Data Preparation

Our dataset is based on NAVSIM. After installing the navsim-devkit and downloading its dataset, download the instruction annotations or run the data processing scripts (Coming Soon).

💡 Inference

Note: Inference requires the instruction dataset and model weights, which have not been released yet (see TODO). The scripts below are provided for code review and reference.

To visualize action planning and future image generation, run inference_action_image.ipynb.

You can also run action planning on the whole navtest dataset with infer.sh:

bash infer.sh

🏋️‍♂️ Training & Finetuning

Note: Training requires the instruction dataset, which has not been released yet (see TODO). The scripts below are provided for code review and reference.

To train Vega from scratch, first download the ByteDance-Seed/BAGEL-7B-MoT base model.
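
If the huggingface_hub command-line tool is installed, the base model could be fetched roughly as sketched below. The function name download_bagel and the target directory are assumptions; the repository does not prescribe a download location, so point it at whatever path your training configuration expects.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Vega repository).
# Assumes `pip install -U huggingface_hub` has provided the huggingface-cli command.
# Usage: download_bagel <target_dir>
download_bagel() {
  local target_dir="$1"
  mkdir -p "$target_dir"
  # Download all files of the model repository into target_dir.
  huggingface-cli download ByteDance-Seed/BAGEL-7B-MoT --local-dir "$target_dir"
}

# Example (the path is a placeholder, as elsewhere in this README):
#   download_bagel /path/to/BAGEL-7B-MoT
```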

To finetune from a pretrained checkpoint, set the --resume-from argument to the folder that contains the safetensors file.

bash train.sh
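
Because --resume-from expects a folder rather than a file, it can help to list the folders that actually contain safetensors files before editing the training script. The sketch below assumes checkpoints live somewhere under a single output directory; the function name list_checkpoint_dirs and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Vega repository).
# Prints each directory under <root_dir> that holds at least one .safetensors file.
# Usage: list_checkpoint_dirs <root_dir>
list_checkpoint_dirs() {
  local root="$1"
  # dirname maps every matching file to its folder; sort -u removes duplicates.
  find "$root" -type f -name '*.safetensors' -exec dirname {} \; | sort -u
}

# Example (placeholder path):
#   list_checkpoint_dirs /path/to/checkpoints
# One of the printed folders can then be given to --resume-from.
```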

🧪 Performance

Our model performs competitively on both NAVSIM benchmarks. On NAVSIM v2, it scores 86.9 EPDMS without any additional performance-enhancing techniques, which is comparable to the state of the art. With the best-of-N strategy used in prior work, it achieves top performance on NAVSIM v2. These results suggest that Vega has learned robust instruction-following capabilities and benefits from future image prediction training. On NAVSIM v1, our model achieves 87.9 PDMS, matching multi-modal BEV methods, and improves to 89.8 with the best-of-N strategy.

Acknowledgements

Our code is based on the following brilliant repositories:

  • Bagel
  • NAVSIM

Many thanks to these authors!

Citation

If you find this project helpful, please consider citing the following paper:

@article{zuo2026vega,
  title={Vega: Learning to Drive with Natural Language Instructions}, 
  author={Zuo, Sicheng and Li, Yuxuan and Zheng, Wenzhao and Zhu, Zheng and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2603.25741},
  year={2026}
}