DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving (ICLR 2026)
February 11, 2026
[arXiv] [Model Weights]
Yingyan Li*, Shuyao Shang*, Weisong Liu*, Bing Zhan*, Haochen Wang*, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan†, Zhaoxiang Zhang†
This paper presents DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This generates dense, self-supervised signals, compelling the model to learn the underlying dynamics of the driving environment, addressing the "supervision deficit" in VLA models and amplifying data scaling laws.
Due to company policy, only the reviewed part of our codebase is available. Please contact us if you have any questions.
Project Structure
DriveVLA-W0/
├── assets/                          # Project assets (images, docs, etc.)
├── configs/                         # Model configuration files and normalization stats
│   ├── fast/                        # Fast action tokenizer configs
│   ├── normalizer_navsim_test/      # NAVSIM testset normalization config
│   ├── normalizer_navsim_trainval/  # NAVSIM train/val normalization config
│   └── normalizer_nuplan/           # NuPlan dataset normalization config
├── data/                            # Data pipelines and config
│   ├── navsim/                      # NAVSIM-related data
│   └── others/                      # Other datasets
├── inference/                       # Inference scripts
│   ├── navsim/                      # NAVSIM PDMS evaluation
│   ├── qwen/                        # Qwen model inference
│   └── vla/                         # Emu model inference
├── models/                          # Model definitions
│   ├── policy_head/                 # Policy head implementations
│   └── tokenizer/                   # Tokenizer implementations
├── scripts/                         # Training and deployment scripts
├── tools/                           # Utility scripts
│   ├── action_tokenizer/            # Action tokenizer tools
│   └── pickle_gen/                  # Data preprocessing & pickle generation
├── utils/                           # Utility code
│   └── datasets.py                  # Dataset definitions
└── requirements.txt                 # Python dependencies
Quick Start
5-Minute Example
- Download Pretrained Models
pip install huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
mkdir pretrained_models
bash scripts/misc/download_emu3_pretrain.sh
- Set Up Environment
conda create -n drivevla python=3.10
conda activate drivevla
pip install -r requirements.txt
- Download Model Weights
Download Emu3_Flow_Matching_Action_Expert_PDMS_87.2 and navsim_emu_vla_256_144_test_pre_1s.pkl from Hugging Face.
- Run Inference
# Run inference using pretrained model (update paths as needed)
bash inference/vla/infer_navsim_flow_matching_PDMS_87.2.sh
Data Preparation
NAVSIM Dataset
DriveVLA-W0 uses the NAVSIM (v1.1) dataset for training and evaluation. Steps required:
- Obtain NAVSIM Dataset
  - Visit the official NAVSIM repo
  - Download the train and test data splits
  - The data includes sensor information, scenario metadata, and labels
- Data Preprocessing
# Generate NAVSIM pickle files
python tools/pickle_gen/pickle_generation_navsim_pre_1s.py
# Extract VQ indices
bash scripts/tokenizer/extract_vq_emu3_navsim.sh
- Data Format
  - Preprocessed data is saved in data/navsim/processed_data/
  - Contains scenario files, metadata, and extracted features
Dataset Size
- Training: ~100,000 driving frames
- Validation: ~10,000 frames
- Test: NAVSIM test set
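Before launching training or inference, it can help to confirm that a generated pickle file loads and has the expected top-level shape. The sketch below is a generic inspector built only on the standard library. The helper name `summarize_pickle` is hypothetical, and the internal layout of the NAVSIM pickle files is not documented here, so it reports only the top-level type, length, and a few keys.

```python
from __future__ import annotations

import pickle
from pathlib import Path
from typing import Any


def summarize_pickle(path: str | Path, max_keys: int = 5) -> dict[str, Any]:
    """Load a pickle file and describe its top-level structure.

    Hypothetical helper: it makes no assumption about the schema
    used by the preprocessing scripts.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)  # only unpickle files from sources you trust

    summary: dict[str, Any] = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        # Show a handful of keys rather than the whole mapping.
        summary["sample_keys"] = list(obj)[:max_keys]
    elif isinstance(obj, (list, tuple)) and obj:
        summary["first_item_type"] = type(obj[0]).__name__
    return summary
```

For example, `summarize_pickle("navsim_emu_vla_256_144_test_pre_1s.pkl")` would print a short description of the downloaded test pickle, assuming the file sits in the current directory.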
Hardware Requirements
Training Resource Consumption
8x L20 GPUs (40GB memory each), ~16 hours
Install
CUDA Installation
If your system does not already have CUDA 12.4+, please install it first:
# Download CUDA 12.8.1 (recommended version)
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda_12.8.1_570.124.06_linux.run
# Install CUDA toolkit
bash cuda_12.8.1_570.124.06_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-12.8
# Add to your ~/.bashrc or shell profile
export CUDA_HOME=/usr/local/cuda-12.8
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
Conda Environment Setup
# Create Conda environment
conda create -n drivevla python=3.10
conda activate drivevla
# Install PyTorch (CUDA 12.4)
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
# Install core dependencies
pip install -r requirements.txt
pip install "transformers[torch]"
# Install training-related dependencies
pip install deepspeed # Distributed training
pip install scipy # Scientific computing
pip install tensorboard==2.14.0 # Visualization
pip install wandb # Experiment tracking
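After installation, a quick sanity check can confirm that the CUDA toolkit and the PyTorch build are visible from the new environment. The following is a minimal sketch, not part of the repository. The function name `check_environment` is hypothetical, and the check degrades gracefully when PyTorch is not installed.

```python
import importlib.util
import os
import shutil


def check_environment() -> dict:
    """Collect basic facts about the CUDA / PyTorch setup (hypothetical helper)."""
    report = {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),          # set in ~/.bashrc above
        "nvcc_on_path": shutil.which("nvcc") is not None,  # CUDA toolkit binary
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["torch_cuda_build"] = torch.version.cuda  # CUDA version the wheel targets
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

With the install commands above, `torch_cuda_build` should report a 12.4 build; `cuda_available` being `False` usually points to a driver or `LD_LIBRARY_PATH` problem rather than a PyTorch issue.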
Testing
First, download the model checkpoints from Hugging Face.
Then, run the following testing script to produce the output actions (as JSON files):
bash inference/vla/infer_navsim_flow_matching_PDMS_87.2.sh
Finally, run the script below to compute the PDMS metrics from the generated JSON files. This step requires the conda environment above and a working NAVSIM repository checkout:
bash inference/vla/eval_navsim_metric_from_json.sh
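For orientation, the sketch below shows how the sub-metrics combine into a single score, assuming the NAVSIM v1 definition of PDMS: NC and DAC act as multiplicative penalties, and EP, TTC, and comfort are averaged with weights 5, 5, and 2. It is not the evaluation script above; the function names and the per-scenario dictionary layout are hypothetical, and the JSON format produced by the inference script is not assumed.

```python
from statistics import mean


def pdm_score(nc: float, dac: float, ttc: float, comfort: float, ep: float) -> float:
    """Per-scenario PDM score under the assumed NAVSIM v1 weighting."""
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # A collision (nc = 0) or leaving the drivable area (dac = 0) zeroes the score.
    return nc * dac * weighted


def mean_pdms(scenarios: list[dict]) -> float:
    """Average the per-scenario scores; each dict holds the five sub-metrics."""
    return mean(pdm_score(**s) for s in scenarios)
```

Because the score is computed per scenario and then averaged, plugging the averaged sub-metrics from a results table into `pdm_score` will not reproduce the reported PDMS exactly.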
Training
For training, please refer to Training.md.
Configuration Overview
Configuration Files
The project uses JSON-formatted configuration files located in configs/:
configs/
├── moe_fast_video.json              # MoE model fast inference config
├── moe_fast_video_pretrain.json     # MoE model pretraining config
├── normalizer_navsim_test/          # NAVSIM test set normalization parameters
├── normalizer_navsim_trainval/      # NAVSIM train+val normalization parameters
└── normalizer_nuplan/               # NuPlan normalization parameters
Normalization Statistics
Normalization parameters are automatically computed from the corresponding datasets:
- normalizer_navsim_trainval/: computed on NAVSIM training set
- normalizer_navsim_test/: computed on NAVSIM test set
- normalizer_nuplan/: computed on NuPlan dataset
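To illustrate what such statistics are used for, here is a minimal sketch of fitting and applying a per-dimension normalizer to action vectors. It assumes plain mean/standard-deviation (z-score) normalization; the statistic actually stored in the `normalizer_*` directories and its file format are not documented here, and the function names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_normalizer(actions: list[list[float]]) -> dict[str, list[float]]:
    """Compute per-dimension mean and std from rows of equal length."""
    columns = list(zip(*actions))
    return {
        "mean": [fmean(c) for c in columns],
        # Fall back to 1.0 when a dimension is constant to avoid division by zero.
        "std": [pstdev(c) or 1.0 for c in columns],
    }


def normalize(row: list[float], stats: dict[str, list[float]]) -> list[float]:
    """Map a raw action vector to normalized space."""
    return [(v - m) / s for v, m, s in zip(row, stats["mean"], stats["std"])]


def denormalize(row: list[float], stats: dict[str, list[float]]) -> list[float]:
    """Map a normalized prediction back to raw action units."""
    return [v * s + m for v, m, s in zip(row, stats["mean"], stats["std"])]
```

In this scheme, the same statistics must be used for `normalize` during training and `denormalize` at inference; otherwise predicted actions come back in the wrong units.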
NAVSIM v1/v2 Benchmark SOTA
Here is a comparison with state-of-the-art methods on the NAVSIM test set, as presented in the paper. Our model, DriveVLA-W0, establishes a new state-of-the-art.
| Method | Reference | Sensors | NC ↑ | DAC ↑ | TTC ↑ | C. ↑ | EP ↑ | PDMS ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | | | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
| BEV-based Methods | | | | | | | | |
| LAW | ICLR'25 | 1x Cam | 96.4 | 95.4 | 88.7 | 99.9 | 81.7 | 84.6 |
| Hydra-MDP | arXiv'24 | 3x Cam + L | 98.3 | 96.0 | 94.6 | 100.0 | 78.7 | 86.5 |
| DiffusionDrive | CVPR'25 | 3x Cam + L | 98.2 | 96.2 | 94.7 | 100.0 | 82.2 | 88.1 |
| WoTE | ICCV'25 | 3x Cam + L | 98.5 | 96.8 | 94.4 | 99.9 | 81.9 | 88.3 |
| VLA-based Methods | | | | | | | | |
| AutoVLA | NeurIPS'25 | 3x Cam | 98.4 | 95.6 | 98.0 | 99.9 | 81.9 | 89.1 |
| ReCogDrive | arXiv'25 | 3x Cam | 98.2 | 97.8 | 95.2 | 99.8 | 83.5 | 89.6 |
| DriveVLA-W0* | Ours | 1x Cam | 98.7 | 99.1 | 95.3 | 99.3 | 83.3 | 90.2 |
| AutoVLA† | NeurIPS'25 | 3x Cam | 99.1 | 97.1 | 97.1 | 100.0 | 87.6 | 92.1 |
| DriveVLA-W0† | Ours | 1x Cam | 99.3 | 97.4 | 97.0 | 99.9 | 88.3 | 93.0 |
Star
If you find our work useful for your research, please consider giving this repository a star.
Citation
If you find this work useful for your research, please consider citing our paper:
@article{li2025drivevla,
title={DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving},
author={Li, Yingyan and Shang, Shuyao and Liu, Weisong and Zhan, Bing and Wang, Haochen and Wang, Yuqi and Chen, Yuntao and Wang, Xiaoman and An, Yasong and Tang, Chufeng and others},
journal={arXiv preprint arXiv:2510.12796},
year={2025}
}
Acknowledgements
We would like to acknowledge the following related works:
LAW (ICLR 2025): Using latent world models for self-supervised feature learning in end-to-end autonomous driving.
WoTE (ICCV 2025): Using BEV world models for online trajectory evaluation in end-to-end autonomous driving.
UniVLA: World modeling in the broader field of robotics.