November 7, 2025

Generating Multimodal Driving Scenes via Next-Scene Prediction

Yanhao Wu¹,², Haoyang Zhang², Tianwei Lin², Lichao Huang²,

Shujie Luo², Rui Wu², Congpei Qiu¹, Wei Ke¹, Tong Zhang³,⁴

¹ Xi'an Jiaotong University, ² Horizon Robotics, ³ EPFL, ⁴ University of Chinese Academy of Sciences

Accepted to CVPR 2025

🌟 What is UMGen?

UMGen generates multimodal driving scenes, where each scene integrates ego-vehicle actions, maps, traffic agents, and images.

🎬 Autoregressive Scene Generation

All visual elements in the video are generated by UMGen.

https://github.com/user-attachments/assets/afe62434-1a9e-44dc-b1bd-b67d48e1b693

🤖 User-Specified Scenario Generation

UMGen also supports user-specified scenario generation.
In this video, we control an agent to simulate a cut-in maneuver.

https://github.com/user-attachments/assets/a3224d85-08df-4e36-a47d-f3e88f2b7ad6

📎 More Information

For more videos and details, please refer to our and

🚀 Quick Start

Set up a new virtual environment

conda create -n UMGen python=3.8 -y
conda activate UMGen

Install dependency packages

UMGen_path="path/to/UMGen"
cd ${UMGen_path}
pip3 install --upgrade pip
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip3 install -r requirements.txt
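
After installation, it can help to confirm that the pinned versions were actually resolved. The following is a minimal sketch and not part of the UMGen repository: the helper names `base_version` and `check_pins` are hypothetical, and only Python's standard `importlib.metadata` module (available from Python 3.8) is used. It compares the public part of each version and ignores a local suffix such as `+cu116`.

```python
from importlib import metadata

# Versions pinned in the install command above (local "+cu116" suffix dropped).
PINS = {"torch": "1.13.1", "torchvision": "0.14.1", "torchaudio": "0.13.1"}


def base_version(version):
    """Return the public part of a version string, e.g. '1.13.1+cu116' -> '1.13.1'."""
    return version.split("+", 1)[0]


def check_pins(pins):
    """Return {package: (expected, found)} for each missing or mismatched package.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = (expected, None)
            continue
        if base_version(found) != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    issues = check_pins(PINS)
    for name, (expected, found) in issues.items():
        print(f"{name}: expected {expected}, found {found}")
    if not issues:
        print("All pinned packages match.")
```

Run it inside the activated UMGen environment. An empty result from `check_pins(PINS)` means the three packages match the pinned versions.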

Prepare the data

Download the tokenized data and pretrained weights from https://drive.google.com/drive/folders/1rJEVxWNk4MH_FPdqUMgdjV_PHwKJMS-3?usp=sharing

The directory structure should be:

UMGen/
├── data/
│   ├── controlled_scenes/
│   │   └── XX
│   ├── tokenized_origin_scenes/
│   │   └── XX
│   └── weights/
│       ├── image_var.tar
│       ├── map_vae.ckpt
│       └── UMGen_Large.pt
└── projects/
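
Before running inference, the layout above can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the helper `missing_paths` is hypothetical and not part of the repository, and the scene folders shown as `XX` are not checked because their names are not specified here.

```python
from pathlib import Path

# Relative paths taken from the directory layout above.
EXPECTED = [
    "data/controlled_scenes",
    "data/tokenized_origin_scenes",
    "data/weights/image_var.tar",
    "data/weights/map_vae.ckpt",
    "data/weights/UMGen_Large.pt",
    "projects",
]


def missing_paths(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # Assumes the current working directory is the UMGen root.
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```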

⚙️ Inference Usage

🎛️ Infer Future Frames Freely

Generate future frames automatically without any external control signals.

python projects/tools/evaluate.py --infer_task video --set_num_new_frames 30

🕹️ Infer Future Frames with Control

Generate future frames under specific control constraints, such as predefined trajectories or controlled object behavior.

python projects/tools/evaluate.py --infer_task control --set_num_new_frames 30
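
The two inference commands differ only in the `--infer_task` value, so they can be scripted from Python when several runs are needed. The sketch below is an illustration, not part of the repository: `build_command` and `run_inference` are hypothetical helpers, and restricting tasks to `video` and `control` is an assumption based only on the commands shown in this README.

```python
import subprocess
import sys

SCRIPT = "projects/tools/evaluate.py"  # script path from the commands above
TASKS = ("video", "control")  # the two --infer_task values shown in this README


def build_command(task, num_new_frames=30):
    """Build the argument list for one evaluate.py run."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if num_new_frames <= 0:
        raise ValueError("num_new_frames must be positive")
    return [
        sys.executable,
        SCRIPT,
        "--infer_task",
        task,
        "--set_num_new_frames",
        str(num_new_frames),
    ]


def run_inference(task, num_new_frames=30):
    """Run one inference job; assumes the working directory is the UMGen root.

    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    subprocess.run(build_command(task, num_new_frames), check=True)
```

For example, `run_inference("control", 30)` is equivalent to the second command above when called from the UMGen root inside the activated environment.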

🧩 To-Do List

Release more tokenized scene data
Release the code for obtaining scene tokens using the VAE models
Release the diffusion code to enhance the videos

📬 Contact

For questions or collaborations, feel free to contact me :) 📧 wuyanhao@stu.xjtu.edu.cn