A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens

May 8, 2026

CVPR 2026 Highlight  Paper  Models

DeltaTok compresses the frame-to-frame change in vision foundation model features into a single delta token, enabling DeltaWorld to efficiently generate diverse plausible futures.

Model Zoo

All models operate at 512x512 resolution with a frozen DINOv3 ViT-B backbone. The released DeltaTok and DeltaWorld are trained on Kinetics-700, while the paper uses a larger dataset. See Training & Evaluation and Example Training Resources for reproduction.

Task Heads

Evaluation heads for downstream tasks:

| Task | Dataset | Metric | Download |
|---|---|---|---|
| Segmentation | VSPW | mIoU: 58.4 | Download |
| Segmentation | Cityscapes | mIoU: 70.5 | Download |
| Depth | KITTI | RMSE: 2.79 | Download |
| RGB | ImageNet | visualization only | Download |

DeltaTok (Tokenizer) Download

ViT-B encoder and decoder trained on Kinetics-700. Reconstruction quality is measured by applying downstream task heads to the reconstructed features.

| Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|---|---|---|---|
| Short (1 frame) | 58.6 | 69.6 | 2.78 |
| Mid (3 frames) | 58.5 | 67.9 | 2.86 |

DeltaWorld (Predictor) Download

ViT-B predictor trained on Kinetics-700. Prediction quality is measured by applying downstream task heads to the predicted features. Cells report the best-of-20 result, with the mean result in parentheses. "best" selects, out of the 20 samples, the one with the lowest DINOv3-feature loss to the ground truth; "mean" averages the DINOv3 features across all samples before evaluation.

| Method | Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|---|---|---|---|---|
| Copy last (lower bound) | Short (1 frame) | 51.2 | 53.5 | 3.76 |
| DeltaWorld | Short (1 frame) | 56.3 (54.2) | 66.2 (64.2) | 2.95 (3.32) |
| Copy last (lower bound) | Mid (3 frames) | 44.3 | 39.6 | 4.86 |
| DeltaWorld | Mid (3 frames) | 51.5 (46.6) | 55.3 (49.5) | 3.71 (4.74) |
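
The two aggregation modes reported in the table can be illustrated with a small sketch. This is not the repository's evaluation code: the function names are hypothetical, features are flattened to plain lists, and mean squared error is only an assumed stand-in for the DINOv3-feature loss, whose exact form is not specified here.

```python
def feature_loss(pred, target):
    """Mean squared error between two flat feature vectors (assumed stand-in loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def best_sample(samples, target):
    """Return the sampled features with the lowest loss to the ground truth."""
    return min(samples, key=lambda s: feature_loss(s, target))


def mean_sample(samples):
    """Average the features element-wise across all samples."""
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]


# Toy example: 3 sampled futures (the README uses 20), each a 2-dim feature vector.
samples = [[0.0, 0.0], [1.0, 2.0], [5.0, 4.0]]
target = [1.0, 1.5]

best = best_sample(samples, target)  # [1.0, 2.0]
mean = mean_sample(samples)          # [2.0, 2.0]
```

In both modes the resulting features would then be passed to the downstream task head. Note that "best" needs the ground truth to pick a sample, while "mean" does not.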

Setup

Requires Miniconda (or Anaconda), a Weights & Biases account for logging, and a Hugging Face account. Accept the license at facebook/dinov3-vitb16-pretrain-lvd1689m so the gated DINOv3 ViT-B backbone downloads automatically on first run.

conda create -n deltatok python=3.14.2
conda activate deltatok
pip install -r requirements.txt
wandb login
hf auth login
cp .env.example .env

Data Preparation

Prepare Kinetics-700 to train from scratch. For evaluation metrics and visualizations, prepare whichever of VSPW, Cityscapes, and KITTI you want to evaluate on. For each dataset you prepare, set the corresponding *_ROOT path in .env to the absolute path of the downloaded dataset directory.
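
Because the *_ROOT entries must be absolute paths, a quick sanity check before launching a run can save a failed job. The sketch below is a hypothetical helper, not part of the repository. It parses simple KEY=VALUE lines and reports *_ROOT entries that are set but relative. The variable names in the example are illustrative only; use the names that appear in .env.example.

```python
import os


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip()] = value.strip().strip('"').strip("'")
    return entries


def relative_roots(entries):
    """Return the *_ROOT keys that are set but are not absolute paths."""
    return sorted(
        key for key, value in entries.items()
        if key.endswith("_ROOT") and value and not os.path.isabs(value)
    )


# Hypothetical .env contents; the variable names are assumptions.
example = """
# dataset locations
KINETICS_ROOT=/data/kinetics
VSPW_ROOT=VSPW
CITYSCAPES_ROOT=
"""
print(relative_roots(parse_env(example)))  # ['VSPW_ROOT'] on a POSIX system
```

Empty entries are skipped on purpose, since only the datasets you actually prepare need a path.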

Kinetics-700 (training, ~1.2 TB)

mkdir -p kinetics/train
wget -i https://s3.amazonaws.com/kinetics/700_2020/train/k700_2020_train_path.txt -P k700_tars/
for f in k700_tars/*.tar.gz; do tar -xzf "$f" -C kinetics/train; done

Pre-extracted frames (as a directory of frame folders or zip archives) are also supported for faster data loading. See datasets/kinetics.py for details.

VSPW (evaluation, ~43 GB)

pip install gdown
gdown "https://drive.google.com/file/d/14yHWsGneoa1pVdULFk7cah3t-THl7yEz/view?usp=sharing" --fuzzy
tar -xf VSPW_dataset.tar  # extracts to VSPW/

If gdown fails due to rate limiting, download VSPW_dataset.tar manually from the Google Drive link.

Cityscapes (evaluation, ~325 GB)

Requires registration at the Cityscapes website. On headless servers, set the CITYSCAPES_USERNAME and CITYSCAPES_PASSWORD environment variables; otherwise csDownload prompts for credentials interactively.

pip install cityscapesscripts
mkdir -p cityscapes
csDownload -d cityscapes gtFine_trainvaltest.zip leftImg8bit_sequence_trainvaltest.zip
cd cityscapes && unzip -q gtFine_trainvaltest.zip && unzip -q leftImg8bit_sequence_trainvaltest.zip && cd ..

KITTI (evaluation, ~44 GB)

wget https://s3.eu-central-1.amazonaws.com/avg-kitti/data_depth_annotated.zip
unzip data_depth_annotated.zip -d kitti && rm data_depth_annotated.zip
for drive in 2011_09_26_drive_{0002,0009,0013,0020,0023,0027,0029,0036,0046,0048,0052,0056,0059,0064,0084,0086,0093,0096,0101,0106,0117} 2011_09_28_drive_0002 2011_09_29_drive_0071 2011_09_30_drive_{0016,0018,0027} 2011_10_03_drive_{0027,0047}; do
  wget -P kitti "https://s3.eu-central-1.amazonaws.com/avg-kitti/raw_data/${drive}/${drive}_sync.zip"
  unzip -o -d kitti "kitti/${drive}_sync.zip" && rm "kitti/${drive}_sync.zip"
done

Training & Evaluation

Training and evaluation use Lightning CLI. To get evaluation metrics and visualizations on a dataset, download the pre-trained task head for that dataset and set the corresponding *_HEAD_PATH in .env to the absolute path of the downloaded file.

The effective batch size should be 1024 for both DeltaTok and DeltaWorld. It's the product of four parameters:

--data.batch_size × --trainer.devices × --trainer.num_nodes × --trainer.accumulate_grad_batches

The default config reaches this on a single node with 8 GPUs at per-GPU batch size 128 and no gradient accumulation; adjust any of the four parameters to fit your hardware. See Example Training Resources for the configurations we used for each stage.
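
As a worked example of the product above, the sketch below computes the effective batch size and the gradient accumulation needed to reach 1024 on smaller hardware. The helper names are hypothetical and the 4-GPU machine is an illustrative assumption; only the four parameters and the target of 1024 come from this README.

```python
TARGET_EFFECTIVE_BATCH = 1024  # target stated above for DeltaTok and DeltaWorld


def effective_batch_size(batch_size, devices, num_nodes=1, accumulate_grad_batches=1):
    """Product of the four parameters listed above."""
    return batch_size * devices * num_nodes * accumulate_grad_batches


def required_accumulation(batch_size, devices, num_nodes=1, target=TARGET_EFFECTIVE_BATCH):
    """Accumulation steps needed to reach `target`; raises if it does not divide evenly."""
    per_step = batch_size * devices * num_nodes
    if target % per_step != 0:
        raise ValueError(f"{per_step} samples per step does not divide {target}")
    return target // per_step


# Default config: one node, 8 GPUs, 128 per GPU, no accumulation.
print(effective_batch_size(128, 8))  # 1024
# Hypothetical smaller machine: one node, 4 GPUs, 64 per GPU.
print(required_accumulation(64, 4))  # 4
```

For the hypothetical 4-GPU machine, this corresponds to passing --data.batch_size=64 --trainer.devices=4 --trainer.accumulate_grad_batches=4.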

Training DeltaTok (Tokenizer)

Stage 1: Pre-train at 256px

python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --data.frame_size=256 \
  --trainer.max_steps=1000000

Stage 2: High-resolution fine-tune at 512px

--model.ckpt_path loads model weights only; optimizer state and step counter reset.

python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-4 \
  --trainer.max_steps=500000 \
  --model.ckpt_path=path/to/stage1/last.ckpt

Stage 3-4: LR cooldowns

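--ckpt_path resumes the full training state (model weights, optimizer state, step counter), so --trainer.max_steps below is a cumulative step count rather than the number of steps in the stage.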

# Stage 3
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-5 \
  --trainer.max_steps=550000 \
  --ckpt_path=path/to/stage2/last.ckpt

# Stage 4
python main.py fit -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-6 \
  --trainer.max_steps=600000 \
  --ckpt_path=path/to/stage3/last.ckpt
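
The --trainer.max_steps values used across the four stages follow from whether a stage resets or resumes the step counter. The sketch below reproduces them from the per-stage step counts listed under Example Training Resources. The helper is hypothetical and only illustrates the arithmetic.

```python
def max_steps_schedule(stages):
    """Compute the --trainer.max_steps value for each stage.

    Each stage is (name, steps_in_stage, resumes_step_counter). A stage that
    resumes the full training state keeps counting from the previous stage's
    final step; otherwise the counter restarts at zero.
    """
    schedule = {}
    current = 0
    for name, steps, resumes in stages:
        start = current if resumes else 0
        current = start + steps
        schedule[name] = current
    return schedule


deltatok_stages = [
    ("stage1", 1_000_000, False),  # trained from scratch
    ("stage2", 500_000, False),    # --model.ckpt_path: weights only, counter resets
    ("stage3", 50_000, True),      # --ckpt_path: resumes at step 500k
    ("stage4", 50_000, True),      # --ckpt_path: resumes at step 550k
]
print(max_steps_schedule(deltatok_stages))
# {'stage1': 1000000, 'stage2': 500000, 'stage3': 550000, 'stage4': 600000}
```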

Training DeltaWorld (Predictor)

Requires a DeltaTok checkpoint: either the released one (pytorch_model.bin) or one from your own training (last.ckpt).

python main.py fit -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin \
  --trainer.max_steps=300000
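
The two accepted checkpoint types differ in layout. A Lightning last.ckpt stores the weights under a "state_dict" key alongside optimizer and loop state, whereas a released weights file is assumed here to be a plain state dict. How the repository handles this internally is not shown in this README, so the following is only a sketch of the idea, with a hypothetical function name and toy dictionaries standing in for the result of torch.load.

```python
def extract_state_dict(checkpoint):
    """Return model weights from either checkpoint layout.

    Lightning checkpoints nest the weights under "state_dict"; a plain
    weights file is assumed to be the state dict itself.
    """
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint


# Toy stand-ins for loaded checkpoint objects (real ones hold tensors).
lightning_ckpt = {"state_dict": {"encoder.weight": [0.1]}, "global_step": 600000}
plain_weights = {"encoder.weight": [0.1]}

assert extract_state_dict(lightning_ckpt) == extract_state_dict(plain_weights)
```

The sketch does not handle differences in key prefixes between the two files, which may or may not exist in practice.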

LR cooldown

python main.py fit -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.lr=1e-5 \
  --trainer.max_steps=305000 \
  --ckpt_path=path/to/deltaworld/last.ckpt

Evaluation

DeltaTok

python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin

DeltaWorld

Requires both DeltaTok and DeltaWorld checkpoints.

python main.py validate -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltaworld-kinetics/pytorch_model.bin \
  --model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin

Example Training Resources

Training times and memory are measured on NVIDIA H200 GPUs. The configurations below are examples; any setup that reaches the target effective batch size works.

DeltaTok

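| Stage | Resolution | LR | Steps | GPUs | Batch/GPU | GPU Memory | Time |
|---|---|---|---|---|---|---|---|
| 1. Pre-train | 256 | 1e-3 | 1M | 8 | 128 | 65 GB | 82h |
| 2. Hi-res fine-tune | 512 | 1e-4 | 500k | 16 | 64 | 109 GB | 89h |
| 3. LR cooldown | 512 | 1e-5 | 50k | 16 | 64 | 109 GB | 9h |
| 4. LR cooldown | 512 | 1e-6 | 50k | 16 | 64 | 109 GB | 9h |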

DeltaWorld

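| Stage | Resolution | LR | Steps | GPUs | Batch/GPU | GPU Memory | Time |
|---|---|---|---|---|---|---|---|
| 1. Train | 512 | 1e-4 | 300k | 32 | 32 | 58 GB | 32h |
| 2. LR cooldown | 512 | 1e-5 | 5k | 32 | 32 | 58 GB | <1h |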
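
For budgeting, the wall-clock times above can be converted to approximate GPU-hours by multiplying each stage's time by its GPU count. The sketch below does this arithmetic with the numbers from the two tables. The helper name is hypothetical, and the DeltaWorld cooldown, listed as "<1h", is counted as a full hour, so that total is an upper bound.

```python
# (stage, GPUs, wall-clock hours) copied from the tables above.
# The DeltaWorld cooldown is listed as "<1h"; 1 hour is used as an upper bound.
DELTATOK = [
    ("pre-train", 8, 82),
    ("hi-res fine-tune", 16, 89),
    ("cooldown 1e-5", 16, 9),
    ("cooldown 1e-6", 16, 9),
]
DELTAWORLD = [("train", 32, 32), ("cooldown", 32, 1)]


def gpu_hours(stages):
    """Sum GPUs x hours over all stages."""
    return sum(gpus * hours for _, gpus, hours in stages)


print(gpu_hours(DELTATOK))    # 2368
print(gpu_hours(DELTAWORLD))  # 1056
```

These figures apply to H200 GPUs with the listed configurations; other hardware or batch splits will differ.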

Citation

@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgements

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.