Geometry-aware 4D Video Generation for Robot Manipulation
January 10, 2026
We propose a 4D video generation model that enforces geometric consistency across multiple camera views to predict spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate applications to robot manipulation by extracting gripper poses from generated videos using an off-the-shelf pose tracking algorithm. We show that the model generalizes to novel viewpoints and enables robots to leverage multi-view information for planning.
Project Links
| Paper | Project Page | Dataset | Hugging Face |
|---|---|---|---|
| arXiv | Website | Stanford Mirror | Dataset · Checkpoints |
Authors
Zeyi Liu¹ · Shuang Li¹ · Eric Cousineau² · Siyuan Feng² · Benjamin Burchfiel² · Shuran Song¹
¹ Stanford University
² Toyota Research Institute
Overview
Robotic manipulation requires understanding how 3D geometry evolves over time under agent actions. However, most video generation models are trained with single-view RGB videos, limiting their ability to reason about geometry and cross-view consistency.
This project introduces a geometry-aware 4D video generation pipeline that:
- Models multi-view RGB-D observations across time
- Enforces cross-view geometric consistency via pointmaps
- Learns temporally coherent latent dynamics suitable for manipulation
The resulting models are intended as foundations for world modeling, policy learning, and planning in robotics.
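To make the pointmap idea concrete, here is a minimal sketch (not the project's implementation) of how a depth pixel can be lifted to a 3D point in a shared world frame. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy` and a camera-to-world pose given as a rotation matrix plus a translation; all function names are illustrative. When two views observe the same surface point, their world-frame points should coincide, which is the kind of agreement a cross-view consistency term can measure.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def to_world(point_cam, rotation, translation):
    """Apply a camera-to-world rigid transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def consistency_error(p_a, p_b):
    """Euclidean distance between two world-frame points that should coincide."""
    return math.dist(p_a, p_b)

# Hypothetical setup: two cameras sharing intrinsics, camera B shifted 1 m along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)

# The same surface point as seen from each camera (pixel coordinates and depth).
world_a = to_world(unproject(75.0, 50.0, 2.0, **intrinsics), identity, (0.0, 0.0, 0.0))
world_b = to_world(unproject(25.0, 50.0, 2.0, **intrinsics), identity, (1.0, 0.0, 0.0))

print(consistency_error(world_a, world_b))
```

In practice this would be vectorized over whole images; the scalar version is kept here only to show the geometry.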
Dataset
We release a multi-view, multi-task robotic manipulation dataset collected in simulation and in the real world.
Tasks
Simulation tasks (LBM):
- StoreCerealBoxUnderShelf
- PutSpatulaOnTableFromUtensilCrock
- PlaceAppleFromBowlIntoBin
Real-world robot manipulation tasks:
- BimanualAddOrangeSlicesToBowl
- BimanualChopCucumber
- BimanualCupOnSaucer
- BimanualTwistCapOffBottle
Key Properties
- Simulation: 50 demonstrations per task
- Real world: 10 demonstrations per task
- 16 RGB-D camera views per timestep, sampled from the upper hemisphere
- Synchronized robot actions and observations
- Simulation data collected in the Large Behavior Model (LBM) environment
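As an illustration of what "16 views sampled from the upper hemisphere" can look like, the sketch below draws camera positions uniformly over the upper half of a sphere. The sampling scheme, the radius, and the seed are assumptions made for this example; the dataset's actual camera placement procedure is not described here.

```python
import math
import random

def sample_upper_hemisphere(num_views=16, radius=1.0, seed=0):
    """Sample camera positions uniformly (by area) on the upper hemisphere of a sphere."""
    rng = random.Random(seed)
    positions = []
    for _ in range(num_views):
        # Sampling the height uniformly gives a uniform distribution over the sphere's area.
        z = rng.uniform(0.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r_xy = math.sqrt(1.0 - z * z)
        positions.append((
            radius * r_xy * math.cos(phi),
            radius * r_xy * math.sin(phi),
            radius * z,
        ))
    return positions

camera_positions = sample_upper_hemisphere()
for x, y, z in camera_positions[:3]:
    print(f"({x:+.3f}, {y:+.3f}, {z:+.3f})")
```

Each camera would additionally need an orientation, for example one that looks at the workspace center; that step is omitted.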
Download links:
- Dataset: https://real.stanford.edu/4dgen/data/
- Hugging Face mirror: https://huggingface.co/datasets/Zeyi/4dgen-dataset
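One way to fetch the Hugging Face mirror programmatically is `snapshot_download` from the `huggingface_hub` package. This is a sketch under the assumption that `huggingface_hub` is installed; the `local_dir` value is a hypothetical target folder, not a path the project prescribes.

```python
DATASET_REPO_ID = "Zeyi/4dgen-dataset"  # repo id taken from the mirror URL above

def download_dataset(local_dir="data/4dgen"):
    """Download the dataset mirror and return the local path (local_dir is a hypothetical default)."""
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPO_ID,
        repo_type="dataset",  # required because this is a dataset repo, not a model repo
        local_dir=local_dir,
    )

# Example (requires network access):
# path = download_dataset()
```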
Pre-trained Models
We provide multiple checkpoints to support different stages of the pipeline:
- Stable Video Diffusion (SVD) backbones
- Task-specific VAEs for RGB and pointmap latents
- 4D video generation models fine-tuned on manipulation data
Checkpoints:
- SVD / base models: https://real.stanford.edu/4dgen/checkpoints/
- Fine-tuned VAEs: https://real.stanford.edu/4dgen/checkpoints/VAE/
- 4D generation outputs: https://real.stanford.edu/4dgen/checkpoints/outputs/
Installation
We recommend using conda or mamba.
cd 4dgen
conda env create -f environment.yml
conda activate video_policy
conda install pytorch3d
Tested on:
- Ubuntu 22.04
- CUDA 12.2
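After installation, a quick check that the key packages can be found in the active environment can save a failed training launch. The sketch below uses only the standard library; the package list is an assumption (`pytorch3d` comes from the install step above, and `torch` is assumed because PyTorch3D depends on it).

```python
import importlib.util

REQUIRED_PACKAGES = ("torch", "pytorch3d")  # assumed list, adjust to match environment.yml

def check_environment(packages=REQUIRED_PACKAGES):
    """Return {package: bool} indicating whether each package is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

status = check_environment()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```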
Training
1. Fine-tune the VAE
CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_autoencoder_workspace
2. Train the 4D Video Generation Model
CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_svd_lightning_workspace
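The two training commands above can also be chained from Python, which is convenient for unattended runs. This is a minimal sketch: the helper names are hypothetical, it only reproduces the two command lines and environment variables shown above, and it does not handle how the second stage locates the fine-tuned VAE checkpoint.

```python
import os
import subprocess
import sys

STAGE_CONFIGS = (
    "finetune_autoencoder_workspace",    # stage 1: fine-tune the VAE
    "finetune_svd_lightning_workspace",  # stage 2: train the 4D video generation model
)

def build_command(config_name):
    """Build the argv list for one training stage."""
    return [sys.executable, "scripts/train.py", f"--config-name={config_name}"]

def build_env(gpu_ids):
    """Copy the current environment and add the variables used in the commands above."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    env["HYDRA_FULL_ERROR"] = "1"
    return env

def run_all_stages(gpu_ids):
    """Run both stages in order, stopping at the first failure."""
    for config_name in STAGE_CONFIGS:
        subprocess.run(build_command(config_name), env=build_env(gpu_ids), check=True)

# Example (run from the repository root): run_all_stages("0,1,2,3")
```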
Notes:
- Tested on 4 × NVIDIA A6000 (48 GB)
- Batch size: 1
- Training time: ~2 days
Inference
Run the provided evaluation example:
python notebooks/eval.py
This script demonstrates loading a trained checkpoint and generating multi-view 4D predictions.
Qualitative Results
We show representative qualitative results illustrating multi-view RGB-D video generation.
Generated RGB-D Videos
Task 1
Task 2
Task 3
Citation
If you find this project useful, please consider citing:
@article{liu2025geometry,
title={Geometry-aware 4D Video Generation for Robot Manipulation},
author={Liu, Zeyi and Li, Shuang and Cousineau, Eric and Feng, Siyuan and Burchfiel, Benjamin and Song, Shuran},
journal={arXiv preprint arXiv:2507.01099},
year={2025}
}
License
This project is released for research use. Please see the repository for license details.
Questions or issues? Feel free to open a GitHub issue or reach out via the project page.