README.md

June 1, 2026

[ICML 2026 🔥🔥🔥] Rethinking Video Generation Model for the Embodied World


Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu,
Yufei Ding, Yiming Zou, Yan Zeng, Daquan Zhou

📣 Overview

This repository is the official implementation of our work, consisting of (i) RBench, a fine-grained benchmark tailored for robotics video generation, and (ii) RoVid-X, a million-scale dataset for training robotics video models. We reveal the limitations of current video foundation models and potential directions for improvement, offering new perspectives for researchers exploring the embodied domain using video world models. Our goal is to establish a solid foundation for the rigorous assessment and scalable training of video generation models in the field of physical AI, accelerating the progress of embodied AI toward general intelligence.

🔥 News

  • [2026.06.01] πŸ”₯πŸ”₯πŸ”₯ We evaluated NVIDIA Cosmos 3, including Cosmos3-Nano and Cosmos3-Super, on RBench. Cosmos3-Nano takes the open-source Top-1 spot on the RBench Leaderboard, demonstrating strong physical simulation and world generation capabilities for embodied AI.
  • [2026.05.21] πŸ”₯πŸ”₯πŸ”₯ We have released RoVid-X on Hugging Face. See this repository for more details.
  • [2026.01.27] πŸ”₯ We are actively applying for the open-source process. Once the internal review is approved, we will release the RoVid-X robotic video dataset on Hugging Face and open-source the RBench on Hugging Face.
  • [2026.01.22] πŸ”₯ Our Research Paper is now available. The Project Page is created.

🎥 Demo

https://github.com/user-attachments/assets/3d00cf52-3631-41c2-9eca-b580404e710f

📑 Todo List

  • Embodied Execution Evaluation: Measure the action execution success rate of generated videos using Inverse Dynamics Model (IDM).

⚙️ Installation

Environment

# 0. Clone the repo
git clone https://github.com/DAGroup-PKU/ReVidgen.git
cd ReVidgen

# 1. Environment for RBench VLM evaluation
conda create -n rbench_vlm python=3.10.18 -y
conda activate rbench_vlm

pip install torch==2.5.1 torchvision==0.20.1
pip install -r requirements_vlm.txt

# 2. Environment for RBench low-level operators
conda create -n rbench_ops python=3.10.18 -y
conda activate rbench_ops

pip install torch==2.5.1 torchvision==0.20.1

cd pkgs/Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt

# Install Grounded-SAM-2 module
cd ../Grounded-SAM-2
pip install -e .

# Install Q-Align module
cd ../Q-Align
pip install -e .

cd ..
pip install -r requirements.txt
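After installing, it can help to confirm that the active conda environment really has the pinned `torch` and `torchvision` versions. The following is a minimal sketch, not part of the repository: `check_pins` is a hypothetical helper, and it treats a local build tag such as `2.5.1+cu121` as matching `2.5.1`.

```python
from importlib import metadata

# Versions pinned by the install commands above.
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1"}


def check_pins(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that does not match.

    `found` is None when the package is not installed in this environment.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Assumption: a local build tag such as "2.5.1+cu121" counts as a match.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    issues = check_pins()
    for name, (expected, found) in issues.items():
        print(f"{name}: expected {expected}, found {found}")
    if not issues:
        print("All pinned packages match.")
```

Run it once inside `rbench_vlm` and once inside `rbench_ops`, since the two environments are installed separately.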

Download Checkpoints

Please download the checkpoint files from RBench and organize them under the following directory before running the evaluation:

ReVidgen/
├── checkpoints/
│   ├── BERT
│   │   └── google-bert
│   │       └── bert-base-uncased
│   │           ├── LICENSE
│   │           └── ...
│   ├── GroundingDino
│   │   └── groundingdino_swinb_cogcoor.pth
│   ├── q-future
│   │   └── one-align
│   │       ├── README.md
│   │       └── ...
│   ├── SAM
│   │   └── sam2.1_hiera_large.pt
│   └── Cotracker
│       └── scaled_offline.pth
│
├── eval/
│   ├── 4_embodiments/
│   ├── 5_tasks/
│   └── ...
│
├── pkgs/
│   ├── Grounded-Segment-Anything/
│   └── ...
└── ...
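A missing or misplaced checkpoint tends to surface only partway through an evaluation run, so a quick pre-flight check may save time. The sketch below is an illustration rather than a script shipped with the repository. `missing_checkpoints` is a hypothetical helper, and its path list is copied from the tree above; entries elided there with `...` are not checked.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to the repository root.
EXPECTED = [
    "checkpoints/BERT/google-bert/bert-base-uncased",
    "checkpoints/GroundingDino/groundingdino_swinb_cogcoor.pth",
    "checkpoints/q-future/one-align",
    "checkpoints/SAM/sam2.1_hiera_large.pt",
    "checkpoints/Cotracker/scaled_offline.pth",
]


def missing_checkpoints(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint paths:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected checkpoint paths are present.")
```

Run it from the `ReVidgen/` root, or pass the root path explicitly to `missing_checkpoints`.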

📈 RBench Results

RBench evaluates mainstream video generation models and shows a strong alignment with human evaluations, achieving a Spearman correlation of 0.96.
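For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation computed on ranks, so it measures how well two score lists agree on ordering. The sketch below is a generic pure-Python implementation, not the evaluation code used for the paper. The scores in it are made up for illustration and are not results from RBench.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores, for illustration only (not from the paper).
benchmark_scores = [0.61, 0.48, 0.72, 0.35, 0.55]
human_scores = [3.9, 3.1, 4.4, 2.2, 3.0]
print(round(spearman(benchmark_scores, human_scores), 2))  # prints 0.9
```

In this toy example only two adjacent models swap places between the two rankings, which yields 0.9. A value near 1 therefore means the benchmark orders models almost exactly as human raters do.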

📊 RBench Results Across Tasks and Embodiments

Evaluations across task-oriented and embodiment-specific dimensions for 25 models spanning open-source, commercial, and robotics-specific families.

📦 Dataset

https://github.com/user-attachments/assets/c46d5b18-4e20-4b78-9060-2e7c1a6effc8

We present RoVid-X, a large-scale robotic video dataset for real-world robotic interactions, providing RGB videos, depth videos, and optical flow videos to facilitate the training of embodied video models.

🔧 Usage

📥 Download RBench Validation Set


# If you are in mainland China, run this first: export HF_ENDPOINT=https://hf-mirror.com
# pip install -U "huggingface_hub[cli]"
huggingface-cli download DAGroup-PKU/RBench

🎬 Video Generation Format

Generated videos should be organized following the directory structure below.

ReVidgen/
└── data/
    └── {model_name}/
        └── {task_name/embodiment_name}/
            └── videos/
                ├── 0001.mp4
                ├── 0002.mp4
                ├── 0003.mp4
                └── ...
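Before launching an evaluation, a short script can confirm that the generated videos follow this layout. The sketch below is an illustration, not a utility from the repository. `collect_videos` is a hypothetical helper, and it assumes that `{task_name/embodiment_name}` denotes a single folder level named after either a task or an embodiment.

```python
from pathlib import Path


def collect_videos(data_root, model_name):
    """Map each group folder under data/{model_name}/ to its sorted .mp4 names.

    A group with no videos/ subfolder maps to None so it can be reported.
    """
    model_dir = Path(data_root) / model_name
    if not model_dir.is_dir():
        return {}
    found = {}
    for group_dir in sorted(p for p in model_dir.iterdir() if p.is_dir()):
        videos_dir = group_dir / "videos"
        if videos_dir.is_dir():
            found[group_dir.name] = sorted(v.name for v in videos_dir.glob("*.mp4"))
        else:
            found[group_dir.name] = None
    return found


if __name__ == "__main__":
    # "my_model" is a placeholder; replace it with your own {model_name}.
    layout = collect_videos("data", "my_model")
    for group, videos in layout.items():
        if videos is None:
            print(f"{group}: missing videos/ folder")
        else:
            print(f"{group}: {len(videos)} video(s)")
```

An empty result means `data/{model_name}/` was not found or contains no group folders.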

🤗 Quick Start

> **Note:** To enable GPT-based evaluation, please prepare your API key in advance and set the `API_KEY` field in the following evaluation scripts accordingly.

# Run embodiment-oriented evaluation
bash scripts/rbench_eval_4embodiments.sh

# Run task-oriented evaluation
bash scripts/rbench_eval_5tasks.sh

📧 Ethics Concerns

The videos used in these demos are sourced from public domains or generated by models, and are intended solely to showcase the capabilities of this research.

✏️ Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and citing our work :pencil:.

BibTeX

@article{deng2026rethinking,
  title={Rethinking Video Generation Model for the Embodied World},
  author={Deng, Yufan and Pan, Zilin and Zhang, Hongyu and Li, Xiaojie and Hu, Ruoqing and Ding, Yufei and Zou, Yiming and Zeng, Yan and Zhou, Daquan},
  journal={arXiv preprint arXiv:2601.15282},
  year={2026}
}