VFMF: Dense Forecasting by Generating Foundation Model Features

July 25, 2026

ICML 2026

Gabrijel Boduljak | Yushi Lan | Christian Rupprecht | Andrea Vedaldi

Abstract

Many recent methods forecast the world by generating stochastic videos. While these excel at visual realism, pixel prediction is computationally expensive and requires translating RGB into actionable signals for decision-making. An alternative uses vision foundation model (VFM) features as world representations, performing deterministic regression to predict future states. These features directly translate into useful signals like semantic segmentation and depth while remaining efficient. However, deterministic regression averages over multiple plausible futures, failing to capture uncertainty and reducing accuracy. To address this limitation, we introduce a generative forecaster using autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. This latent space preserves information more effectively than PCA-based alternatives for both forecasting and other applications like image generation. Our latent predictions decode easily into multiple interpretable modalities: semantic segmentation, depth, surface normals, and RGB. With matched architecture and compute, our method produces sharper, more accurate predictions than regression across all modalities and improves appearance prediction. Our results suggest that stochastic conditional generation of VFM features offers a promising, scalable foundation for future world models.

Method

An overview of our method, VFMF. Given RGB context frames $`\mathbf{I}_1,\dots,\mathbf{I}_t`$, we extract DINO features $`\mathbf{f}_1,\dots,\mathbf{f}_t`$ and predict the next-state feature $`\mathbf{f}_{t+1}`$. Context features are compressed with a VAE along the channel dimension to produce context latents $`\mathbf{z}_1,\dots,\mathbf{z}_t`$. These context latents are concatenated with noisy future latents $`\mathbf{z}_{t+1}`$ and passed to a conditional denoiser, which denoises only the future latents while leaving the context latents unchanged. This process repeats autoregressively over a fixed-length window: each time a new latent $`\mathbf{z}_{t+1}`$ is generated, it is appended to the context and the oldest context latent is dropped. The denoised future latents are decoded back to DINO feature space by the VAE decoder. Finally, the reconstructed features can be routed to task-specific modality decoders for downstream tasks or interpretation.
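
The sliding-window rollout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's implementation: latents are plain floats rather than tensors, the VAE encoder and decoder are omitted, and `euler_sample`, `rollout`, `toy_velocity`, and `make_toy_sampler` are hypothetical names. The toy velocity field stands in for the learned conditional denoiser and simply steers the noisy latent toward a linear extrapolation of the context.

```python
from collections import deque
import random


def euler_sample(velocity, context, noise, num_steps=8):
    """Integrate dz/dt = velocity(z, t, context) from t=0 (noise) to t=1.

    Only the future latent z is updated; the context is read, never modified.
    """
    z = noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity(z, t, context)
    return z


def rollout(context_latents, sample_next, num_future, window):
    """Generate num_future latents autoregressively with a fixed-length window."""
    if len(context_latents) < window:
        raise ValueError("need at least `window` context latents")
    # A deque with maxlen drops the oldest latent when a new one is appended.
    context = deque(context_latents[-window:], maxlen=window)
    futures = []
    for _ in range(num_future):
        z_next = sample_next(tuple(context))
        futures.append(z_next)
        context.append(z_next)
    return futures


def toy_velocity(z, t, context):
    # Toy stand-in for a learned network (assumption): point from z toward
    # a target extrapolated linearly from the last two context latents.
    target = 2 * context[-1] - context[-2]
    return (target - z) / (1.0 - t)


def make_toy_sampler(seed=0):
    rng = random.Random(seed)

    def sample_next(context):
        # Each future latent starts from fresh Gaussian noise.
        return euler_sample(toy_velocity, context, noise=rng.gauss(0.0, 1.0))

    return sample_next


futures = rollout([0.0, 1.0, 2.0, 3.0], make_toy_sampler(), num_future=3, window=4)
```

With a real model, `sample_next` would draw a different plausible future on each call, so repeated rollouts from the same context yield a set of diverse forecasts rather than a single averaged one.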

Instructions

Inference

Clone this repository.
Set up an environment matching the specification in environment.yml.
Download the checkpoints.
Open a demo notebook. Examples are world-model/cityscapes_demo.ipynb and world-model/kubric_demo.ipynb.
Set the paths in the first notebook cell:

REPO_PATH = "{absolute path to the repository}" 
CKPTS_PATH = "{absolute path to the checkpoints folder}"
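
Before running a notebook, it can help to confirm that both paths are set correctly. The helper below is a small sketch, not part of the repository: `check_paths` is a hypothetical function, and it assumes only what is stated above, namely that both paths are absolute directories and that the demo notebooks live under world-model/ in the repository.

```python
from pathlib import Path

# Notebook locations as listed in the inference steps above.
DEMO_NOTEBOOKS = (
    "world-model/cityscapes_demo.ipynb",
    "world-model/kubric_demo.ipynb",
)


def check_paths(repo_path, ckpts_path, notebooks=DEMO_NOTEBOOKS):
    """Return a list of problems found; an empty list means the paths look fine."""
    problems = []
    repo = Path(repo_path)
    ckpts = Path(ckpts_path)
    for label, path in (("REPO_PATH", repo), ("CKPTS_PATH", ckpts)):
        if not path.is_absolute():
            problems.append(f"{label} is not an absolute path: {path}")
        elif not path.is_dir():
            problems.append(f"{label} is not an existing directory: {path}")
    # Only look for notebooks if the repository directory itself exists.
    if repo.is_dir():
        for notebook in notebooks:
            if not (repo / notebook).is_file():
                problems.append(f"missing demo notebook: {repo / notebook}")
    return problems
```

Calling `check_paths(REPO_PATH, CKPTS_PATH)` in the first cell and printing the result gives a quick sanity check before any model is loaded.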

We released Kubric and Cityscapes checkpoints and demo inference notebooks.

VAE

Training

ImageNet

Make sure you can run the inference notebooks (this confirms that your environment is set up correctly).
Open the VAE training script train_imagenet_raw_convnext_base_beta=0.01.sh.
Adjust CONFIG_PATH, CONFIG_NAME and WANDB_DIR.
Open the config file vae/configs/default_raw_ilsvrc_isanbard.yaml. Adjust data.dataset_root, model.feature_stats and experiment_path.
Execute train_imagenet_raw_convnext_base_beta=0.01.sh.

CityScapes

Make sure you can run the inference notebooks (this confirms that your environment is set up correctly).
Open the VAE training script train_cityscapes_raw_convnext_base_beta=0.01.sh.
Adjust CONFIG_PATH, CONFIG_NAME and WANDB_DIR.
Open the config file vae/configs/default_raw_cityscapes.yaml. Adjust data.dataset_root, model.feature_stats and experiment_path.
Execute train_cityscapes_raw_convnext_base_beta=0.01.sh.
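
Both training scripts carry beta=0.01 in their names. Assuming this denotes the KL weight of a β-VAE objective (an assumption; the config files are the authority on what is actually optimized), the loss would have the shape sketched below. The function names are hypothetical, and inputs are flat lists of floats rather than feature tensors.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent elements."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def beta_vae_loss(x, x_hat, mu, logvar, beta=0.01):
    """Mean squared reconstruction error plus a beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return recon + beta * gaussian_kl(mu, logvar)
```

A small beta such as 0.01 weights reconstruction much more heavily than the KL term, which would favor preserving feature information over matching the prior closely.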

More code and instructions will be released soon.

Acknowledgements

This repository is based on:
