June 1, 2026
[ICML 2026] Rethinking Video Generation Model for the Embodied World
Yufan Deng,
Zilin Pan,
Hongyu Zhang,
Xiaojie Li,
Ruoqing Hu,
Yufei Ding,
Yiming Zou,
Yan Zeng,
Daquan Zhou
Overview
This repository is the official implementation of our work, consisting of (i) RBench, a fine-grained benchmark tailored for robotics video generation, and (ii) RoVid-X, a million-scale dataset for training robotics video models. We reveal the limitations of current video foundation models and identify potential directions for improvement, offering new perspectives for researchers exploring the embodied domain using video world models. Our goal is to establish a solid foundation for the rigorous assessment and scalable training of video generation models in the field of physical AI, accelerating the progress of embodied AI toward general intelligence.
News
- [2026.06.01] We evaluated NVIDIA Cosmos 3, including Cosmos3-Nano and Cosmos3-Super, on RBench. Cosmos3-Nano takes the open-source Top-1 spot on the RBench Leaderboard, demonstrating strong physical simulation and world generation capabilities for embodied AI.
- [2026.05.21] We have released RoVid-X on Hugging Face. See this repository for more details.
- [2026.01.27] We are going through the open-source approval process. Once the internal review is complete, we will release the RoVid-X robotic video dataset and open-source RBench on Hugging Face.
- [2026.01.22] Our research paper is now available, and the project page is live.
Demo
https://github.com/user-attachments/assets/3d00cf52-3631-41c2-9eca-b580404e710f
Todo List
- Embodied Execution Evaluation: Measure the action execution success rate of generated videos using an Inverse Dynamics Model (IDM).
Installation
Environment
```bash
# 0. Clone the repo
git clone https://github.com/DAGroup-PKU/ReVidgen.git
cd ReVidgen

# 1. Environment for RBench VLM evaluation
conda create -n rbench_vlm python=3.10.18 -y
conda activate rbench_vlm
pip install torch==2.5.1 torchvision==0.20.1
pip install -r requirements_vlm.txt

# 2. Environment for RBench low-level operators
conda create -n rbench_ops python=3.10.18 -y
conda activate rbench_ops
pip install torch==2.5.1 torchvision==0.20.1
cd pkgs/Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt

# Install Grounded-SAM-2 module
cd ../Grounded-SAM-2
pip install -e .

# Install Q-Align module
cd ../Q-Align
pip install -e .
cd ..
pip install -r requirements.txt
```
Download Checkpoints
Please download the checkpoint files from RBench and organize them under the following directory before running the evaluation:
```
ReVidgen/
├── checkpoints/
│   ├── BERT
│   │   └── google-bert
│   │       └── bert-base-uncased
│   │           ├── LICENSE
│   │           └── ...
│   ├── GroundingDino
│   │   └── groundingdino_swinb_cogcoor.pth
│   ├── q-future
│   │   └── one-align
│   │       ├── README.md
│   │       └── ...
│   ├── SAM
│   │   └── sam2.1_hiera_large.pt
│   └── Cotracker
│       └── scaled_offline.pth
│
├── eval/
│   ├── 4_embodiments/
│   ├── 5_tasks/
│   └── ...
│
├── pkgs/
│   ├── Grounded-Segment-Anything/
│   └── ...
└── ...
```
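Before launching an evaluation, it can help to confirm that the checkpoint layout matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `find_missing_checkpoints` and the constant `EXPECTED_CHECKPOINTS` are hypothetical, and the paths are copied from the tree. Entries shown as `...` are not checked because their contents are not listed.

```python
from pathlib import Path

# Relative paths copied from the directory tree in this README.
# Entries elided as "..." in the tree are intentionally not checked.
EXPECTED_CHECKPOINTS = [
    "checkpoints/BERT/google-bert/bert-base-uncased",
    "checkpoints/GroundingDino/groundingdino_swinb_cogcoor.pth",
    "checkpoints/q-future/one-align",
    "checkpoints/SAM/sam2.1_hiera_large.pt",
    "checkpoints/Cotracker/scaled_offline.pth",
]


def find_missing_checkpoints(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).exists()]


if __name__ == "__main__":
    # Hypothetical usage: run from the ReVidgen/ repository root.
    missing = find_missing_checkpoints(".")
    if missing:
        print("Missing checkpoint paths:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint paths are present.")
```

The check only tests for existence; it does not verify file sizes or hashes.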
RBench Results
RBench evaluates mainstream video generation models and shows a strong alignment with human evaluations, achieving a Spearman correlation of 0.96.
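For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation computed on ranks, so it measures whether two scoring methods order items the same way. The sketch below is a plain-Python illustration only: the score lists are invented for the example and are not RBench results, and this README does not specify the granularity at which the reported 0.96 was computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is all ties")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five models, invented for illustration only.
benchmark_scores = [0.61, 0.48, 0.55, 0.30, 0.72]
human_scores = [3.9, 3.1, 3.0, 2.2, 4.5]
rho = spearman(benchmark_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = 0.90"
```

In this toy data the two rankings differ only by one swapped pair, which is why the value is high but below 1.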
RBench Results Across Tasks and Embodiments
Evaluations across task-oriented and embodiment-specific dimensions for 25 models spanning open-source, commercial, and robotics-specific families.
Dataset
https://github.com/user-attachments/assets/c46d5b18-4e20-4b78-9060-2e7c1a6effc8
We present RoVid-X, a large-scale robotic video dataset for real-world robotic interactions, providing RGB videos, depth videos, and optical flow videos to facilitate the training of embodied video models.
Usage
Download RBench Validation Set
```bash
# If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
# pip install -U "huggingface_hub[cli]"
huggingface-cli download DAGroup-PKU/RBench
```
Video Generation Format
Generated videos should be organized following the directory structure below.
```
ReVidgen/
└── data/
    └── {model_name}/
        └── {task_name/embodiment_name}/
            └── videos/
                ├── 0001.mp4
                ├── 0002.mp4
                ├── 0003.mp4
                └── ...
```
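A quick way to sanity-check this layout before running the evaluation scripts is to list what is actually on disk. The sketch below is illustrative and not part of the repository: `collect_generated_videos` is a hypothetical helper, and it assumes that `{task_name/embodiment_name}` denotes a single directory level named after either a task or an embodiment.

```python
from pathlib import Path


def collect_generated_videos(data_root):
    """Map (model_name, task_or_embodiment_name) to the sorted .mp4 names found.

    Assumes the layout data/{model_name}/{task_or_embodiment}/videos/*.mp4.
    """
    root = Path(data_root)
    if not root.is_dir():
        return {}
    found = {}
    for videos_dir in sorted(root.glob("*/*/videos")):
        if not videos_dir.is_dir():
            continue
        model_name = videos_dir.parent.parent.name
        group_name = videos_dir.parent.name
        # Only .mp4 files are counted; other files are ignored.
        found[(model_name, group_name)] = sorted(
            p.name for p in videos_dir.glob("*.mp4")
        )
    return found


if __name__ == "__main__":
    # Hypothetical usage: run from the ReVidgen/ repository root.
    for (model, group), names in collect_generated_videos("data").items():
        print(f"{model}/{group}: {len(names)} video(s)")
```

An empty result, or a group with zero videos, usually means a directory is misnamed or nested one level too deep.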
Quick Start
> **Note:** To enable GPT-based evaluation, please prepare your API key in advance and set the `API_KEY` field in the following evaluation scripts accordingly.
```bash
# Run embodiment-oriented evaluation
bash scripts/rbench_eval_4embodiments.sh

# Run task-oriented evaluation
bash scripts/rbench_eval_5tasks.sh
```
Ethics Concerns
The videos used in these demos are sourced from public domains or generated by models, and are intended solely to showcase the capabilities of this research.
- The service is a research preview. If you find any potential violations, please contact us at dengyufan10@stu.pku.edu.cn.
Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil:.
BibTeX
```bibtex
@article{deng2026rethinking,
  title={Rethinking Video Generation Model for the Embodied World},
  author={Deng, Yufan and Pan, Zilin and Zhang, Hongyu and Li, Xiaojie and Hu, Ruoqing and Ding, Yufei and Zou, Yiming and Zeng, Yan and Zhou, Daquan},
  journal={arXiv preprint arXiv:2601.15282},
  year={2026}
}
```