🧠 Geometry-aware 4D Video Generation for Robot Manipulation

January 10, 2026

4DGen teaser

We propose a 4D video generation model that enforces geometric consistency across multiple camera views to predict spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate applications to robot manipulation by extracting gripper poses from generated videos using an off-the-shelf pose tracking algorithm. We show that the model generalizes to novel viewpoints and enables robots to leverage multi-view information for planning.

4DGen real video

🔗 Project Links

📄 Paper: arXiv
🌐 Project Page: Website
📦 Dataset: Stanford Mirror
🤗 Hugging Face: Dataset · Checkpoints

👥 Authors

Zeyi Liu¹ · Shuang Li¹ · Eric Cousineau² · Siyuan Feng² · Benjamin Burchfiel² · Shuran Song¹

¹ Stanford University
² Toyota Research Institute

🧩 Overview

Robotic manipulation requires understanding how 3D geometry evolves over time under agent actions. However, most video generation models are trained on single-view RGB videos, which limits their ability to reason about geometry and cross-view consistency.

This project introduces a geometry-aware 4D video generation pipeline that:

Models multi-view RGB-D observations across time
Enforces cross-view geometric consistency via pointmaps
Learns temporally coherent latent dynamics suitable for manipulation

The resulting models serve as strong foundations for world modeling, policy learning, and planning in robotics.
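To make the pointmap idea concrete, here is a minimal sketch of how a depth image can be unprojected into a pointmap (a per-pixel map of 3D coordinates) in a frame shared by all cameras. It assumes a pinhole camera with intrinsics `K` and a 4×4 camera-to-world pose; the function name `depth_to_pointmap` and these conventions are illustrative assumptions, not taken from the 4DGen codebase.

```python
import numpy as np


def depth_to_pointmap(depth, K, cam_to_world):
    """Unproject an (H, W) depth image into an (H, W, 3) world-frame pointmap.

    Hypothetical helper: assumes a pinhole model and metric depth along +z.
    """
    H, W = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Back-project each pixel into camera coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1)
    # Rigid transform into the world frame shared by all views.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t
```

When two views observe the same surface point, their pointmaps expressed in this shared frame should agree at the corresponding pixels, which is the kind of cross-view agreement a pointmap-based consistency objective can exploit.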

📦 Dataset

We release a multi-view, multi-task robotic manipulation dataset covering both simulated and real-world tasks.

Tasks

Simulation tasks (LBM):

StoreCerealBoxUnderShelf
PutSpatulaOnTableFromUtensilCrock
PlaceAppleFromBowlIntoBin

Real-world robot manipulation tasks:

BimanualAddOrangeSlicesToBowl
BimanualChopCucumber
BimanualCupOnSaucer
BimanualTwistCapOffBottle

Key Properties

Simulation: 50 demonstrations per task
Real world: 10 demonstrations per task
16 RGB-D camera views per timestep, sampled from the upper hemisphere
Synchronized robot actions and observations
Simulation data collected in the Large Behavior Model (LBM) environment
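As an illustration of what sampling camera views from the upper hemisphere can look like, the sketch below draws 16 camera positions above the workspace and builds a look-at pose for each. The radius, the elevation limits, the OpenCV-style axis convention, and both function names are assumptions made for this example; the dataset's actual sampling procedure may differ.

```python
import numpy as np


def sample_hemisphere_positions(n_views=16, radius=1.5, seed=0):
    """Sample camera positions on the upper hemisphere (z > 0) around the origin."""
    rng = np.random.default_rng(seed)
    # Sampling z uniformly gives area-uniform points on a spherical band.
    # The band is clipped (assumed limits) to avoid the horizon and the pole.
    z = rng.uniform(0.1, 0.95, size=n_views)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n_views)
    r_xy = np.sqrt(1.0 - z**2)
    unit = np.stack([r_xy * np.cos(phi), r_xy * np.sin(phi), z], axis=-1)
    return radius * unit


def look_at_pose(position, target=(0.0, 0.0, 0.0), world_up=(0.0, 0.0, 1.0)):
    """Build a 4x4 camera-to-world pose (x right, y down, z forward)."""
    position = np.asarray(position, dtype=float)
    forward = np.asarray(target, dtype=float) - position
    forward = forward / np.linalg.norm(forward)
    # Well-defined here because sampled cameras never sit exactly on the pole.
    right = np.cross(forward, np.asarray(world_up, dtype=float))
    right = right / np.linalg.norm(right)
    down = np.cross(forward, right)
    pose = np.eye(4)
    pose[:3, :3] = np.stack([right, down, forward], axis=1)
    pose[:3, 3] = position
    return pose


positions = sample_hemisphere_positions()
poses = np.stack([look_at_pose(p) for p in positions])  # shape (16, 4, 4)
```

Each pose in `poses` places a camera on the hemisphere with its optical axis pointing at the workspace origin.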

📥 Download links:

Dataset: https://real.stanford.edu/4dgen/data/
Hugging Face mirror: https://huggingface.co/datasets/Zeyi/4dgen-dataset
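One possible way to fetch the Hugging Face mirror programmatically is the `snapshot_download` function from the `huggingface_hub` package. The sketch below assumes that package is installed; the target directory `data/4dgen` and the wrapper name are placeholders, not paths the repository prescribes.

```python
def download_4dgen_dataset(local_dir="data/4dgen", allow_patterns=None):
    """Download the dataset mirror from the Hugging Face Hub.

    local_dir is a hypothetical destination; allow_patterns can restrict
    the download to a subset of files (for example, a single task folder).
    """
    # Imported lazily so the helper can be defined without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Zeyi/4dgen-dataset",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=allow_patterns,
    )
```

Calling `download_4dgen_dataset()` returns the local path of the downloaded snapshot. The full dataset may be large, so passing `allow_patterns` to fetch one task at a time can be a sensible first step.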

🧠 Pre-trained Models

We provide multiple checkpoints to support different stages of the pipeline:

Stable Video Diffusion (SVD) backbones
Task-specific VAEs for RGB and pointmap latents
4D video generation models fine-tuned on manipulation data

📦 Checkpoints:

SVD / base models: https://real.stanford.edu/4dgen/checkpoints/
Fine-tuned VAEs: https://real.stanford.edu/4dgen/checkpoints/VAE/
4D generation outputs: https://real.stanford.edu/4dgen/checkpoints/outputs/

⚙️ Installation

We recommend using conda or mamba.

cd 4dgen
conda env create -f environment.yml
conda activate video_policy
conda install pytorch3d

Tested on:

Ubuntu 22.04
CUDA 12.2
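After activating the environment, a quick sanity check can confirm that the key packages are importable and that a GPU is visible. The helper below is a hypothetical convenience, not a script shipped with the repository; it only assumes that PyTorch and PyTorch3D are the packages the pipeline depends on, as the install steps suggest.

```python
import importlib.util


def check_environment(packages=("torch", "pytorch3d")):
    """Report which packages are importable and whether CUDA is usable."""
    status = {name: importlib.util.find_spec(name) is not None for name in packages}
    cuda_available = False
    if status.get("torch"):
        # Only import torch when it is actually installed.
        import torch

        cuda_available = torch.cuda.is_available()
    status["cuda_available"] = cuda_available
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `pytorch3d` is reported as missing, the `conda install pytorch3d` step likely did not complete in the active environment.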

🔧 Training

1️⃣ Fine-tune the VAE

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_autoencoder_workspace

2️⃣ Train the 4D Video Generation Model

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_svd_lightning_workspace

Notes:

Tested on 4 × NVIDIA A6000 (48 GB)
Batch size: 1
Training time: ~2 days

🔍 Inference

Run the provided evaluation example:

python notebooks/eval.py

This script demonstrates loading a trained checkpoint and generating multi-view 4D predictions.

📚 Citation

@article{liu2025geometry,
  title={Geometry-aware 4D Video Generation for Robot Manipulation},
  author={Liu, Zeyi and Li, Shuang and Cousineau, Eric and Feng, Siyuan and Burchfiel, Benjamin and Song, Shuran},
  journal={arXiv preprint arXiv:2507.01099},
  year={2025}
}

📄 License

This project is released for research use. Please see the repository for license details.

💬 Questions or issues? Feel free to open a GitHub issue or reach out via the project page.
