VFMF: World Modeling by Forecasting Vision Foundation Model Features

March 20, 2026

Gabrijel Boduljak | Yushi Lan | Christian Rupprecht | Andrea Vedaldi

VGG, University of Oxford

Abstract

Many recent methods forecast the world by generating stochastic videos. While these excel at visual realism, pixel prediction is computationally expensive and requires translating RGB into actionable signals for decision-making. An alternative uses vision foundation model (VFM) features as world representations, performing deterministic regression to predict future states. These features directly translate into useful signals like semantic segmentation and depth while remaining efficient. However, deterministic regression averages over multiple plausible futures, failing to capture uncertainty and reducing accuracy. To address this limitation, we introduce a generative forecaster using autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. This latent space preserves information more effectively than PCA-based alternatives for both forecasting and other applications like image generation. Our latent predictions decode easily into multiple interpretable modalities: semantic segmentation, depth, surface normals, and RGB. With matched architecture and compute, our method produces sharper, more accurate predictions than regression across all modalities and improves appearance prediction. Our results suggest that stochastic conditional generation of VFM features offers a promising, scalable foundation for future world models.

Method

An overview of our method, VFMF. Given RGB context frames $\mathbf{I}_1,\dots,\mathbf{I}_t$, we extract DINO features $\mathbf{f}_1,\dots,\mathbf{f}_t$ and predict the next state feature $\mathbf{f}_{t+1}$. Context features are compressed with a VAE along the channel dimension to produce context latents $\mathbf{z}_1,\dots,\mathbf{z}_t$. Those context latents are concatenated with noisy future latents $\mathbf{z}_{t+1}$ and passed to a conditional denoiser that denoises only the future latents $\mathbf{z}_{t+1}$ while leaving the context latents unchanged. This process repeats autoregressively, with a window of fixed length. Specifically, each time a new latent $\mathbf{z}_{t+1}$ is generated, it is appended to the context while the oldest context latent is popped. The denoised future latents are decoded back to DINO feature space by the VAE decoder. Finally, the reconstructed features can be routed to task-specific modality decoders for downstream tasks or interpretation.
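
The autoregressive loop described above can be illustrated with a minimal, framework-free sketch. It is not the repository's implementation: the function names `sample_future_latent` and `rollout`, the `velocity` callable standing in for the trained conditional denoiser, the Euler solver, the step count, and the convention that noise sits at flow time 0 are all assumptions made for illustration. Latents are plain lists of floats rather than tensors.

```python
import random
from collections import deque
from typing import Callable, List, Optional, Sequence

Latent = List[float]  # stand-in for one latent z_t (a tensor in practice)

# Hypothetical denoiser interface: it sees the noisy future latent, the flow
# time t in [0, 1), and the context latents, and returns a velocity for the
# future latent only. The context latents are never modified.
Velocity = Callable[[Latent, float, Sequence[Latent]], Latent]


def sample_future_latent(
    context: Sequence[Latent],
    velocity: Velocity,
    dim: int,
    num_steps: int = 10,
    rng: Optional[random.Random] = None,
) -> Latent:
    """Draw one future latent by Euler-integrating a velocity field from noise."""
    rng = rng or random.Random()
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # noisy future latent
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(z, t, context)
        z = [z_i + dt * v_i for z_i, v_i in zip(z, v)]
    return z


def rollout(
    context: Sequence[Latent],
    sample_next: Callable[[Sequence[Latent]], Latent],
    num_future: int,
) -> List[Latent]:
    """Generate num_future latents with a fixed-length sliding context window."""
    if not context:
        raise ValueError("at least one context latent is required")
    # maxlen makes append() drop the oldest latent automatically.
    window = deque(context, maxlen=len(context))
    generated: List[Latent] = []
    for _ in range(num_future):
        z_next = sample_next(list(window))
        window.append(z_next)
        generated.append(z_next)
    return generated
```

Because every call to the sampler starts from fresh noise, repeating the rollout with a stochastic model would yield different futures. In the method described above, the generated latents would then go through the VAE decoder and the modality decoders.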

Instructions

Inference

  1. Clone this repository.
  2. Set up an environment matching the specification in environment.yml.
  3. Download the checkpoints.
  4. Open a demo notebook. Examples are world-model/cityscapes_demo.ipynb and world-model/kubric_demo.ipynb.
  5. Fix the paths in the first notebook cell:

     REPO_PATH = "{absolute path to the repository}"
     CKPTS_PATH = "{absolute path to the checkpoints folder}"
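
Before running the rest of a notebook, it can help to confirm that both paths are absolute and point to existing directories. The helper below is a hypothetical convenience that is not part of the repository; only the variable names `REPO_PATH` and `CKPTS_PATH` come from the notebooks.

```python
from pathlib import Path
from typing import List


def check_paths(repo_path: str, ckpts_path: str) -> List[str]:
    """Return a list of human-readable problems; empty if both paths look fine."""
    problems: List[str] = []
    for name, value in (("REPO_PATH", repo_path), ("CKPTS_PATH", ckpts_path)):
        path = Path(value)
        if not path.is_absolute():
            problems.append(f"{name} is not an absolute path: {value}")
        elif not path.is_dir():
            problems.append(f"{name} is not an existing directory: {value}")
    return problems
```

Calling `check_paths(REPO_PATH, CKPTS_PATH)` right after the first cell and printing the result would surface a mistyped path early.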

We have released Kubric and Cityscapes checkpoints and demo inference notebooks.

Dataset preparation

ImageNet

Prepare the dataset following the instructions from ReDi.

Cityscapes

Prepare the dataset following the instructions from DINO-Foresight.

VAE

Training

ImageNet

  1. Make sure you can run the inference notebooks (this ensures you have the environment set up correctly).
  2. Open the VAE training script train_imagenet_raw_convnext_base_beta=0.01.sh.
  3. Adjust CONFIG_PATH, CONFIG_NAME and WANDB_DIR.
  4. Open the config file vae/configs/default_raw_ilsvrc_isanbard.yaml. Adjust data.dataset_root, model.feature_stats and experiment_path.
  5. Execute train_imagenet_raw_convnext_base_beta=0.01.sh.

Cityscapes

  1. Make sure you can run the inference notebooks (this ensures you have the environment set up correctly).
  2. Open the VAE training script train_cityscapes_raw_convnext_base_beta=0.01.sh.
  3. Adjust CONFIG_PATH, CONFIG_NAME and WANDB_DIR.
  4. Open the config file vae/configs/default_raw_cityscapes.yaml. Adjust data.dataset_root, model.feature_stats and experiment_path.
  5. Execute train_cityscapes_raw_convnext_base_beta=0.01.sh.
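
The dotted names in step 4 most likely refer to nested keys in the YAML file, so `data.dataset_root` would be the `dataset_root` entry under `data`. This is an assumption about the config layout and has not been checked against the files. The sketch below shows that mapping on a plain dictionary; `set_dotted` is a hypothetical helper and the values in braces are placeholders to fill in.

```python
from typing import Any, Dict


def set_dotted(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested entry such as 'data.dataset_root', creating parents as needed."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Placeholder values only; replace them with paths valid on your machine.
config: Dict[str, Any] = {}
set_dotted(config, "data.dataset_root", "{dataset root}")
set_dotted(config, "model.feature_stats", "{feature statistics}")
set_dotted(config, "experiment_path", "{experiment path}")
```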

More code and instructions will be released soon.

Acknowledgements

This repository is based on: