🧠 Geometry-aware 4D Video Generation for Robot Manipulation

January 10, 2026

4DGen teaser

We propose a 4D video generation model that enforces geometric consistency across multiple camera views to predict spatio-temporally aligned RGB-D videos from a single RGB-D image per view. We further demonstrate applications to robot manipulation by extracting gripper poses from generated videos using an off-the-shelf pose tracking algorithm. We show that the model generalizes to novel viewpoints and enables robots to leverage multi-view information for planning.

4DGen real video


📄 Paper: arXiv
🌐 Project Page: Website
📦 Dataset: Stanford Mirror
🤗 Hugging Face: Dataset · Checkpoints

👥 Authors

Zeyi Liu¹ · Shuang Li¹ · Eric Cousineau² · Siyuan Feng² · Benjamin Burchfiel² · Shuran Song¹

¹ Stanford University
² Toyota Research Institute


🧩 Overview

Robotic manipulation requires understanding how 3D geometry evolves over time under agent actions. However, most video generation models are trained with single-view RGB videos, limiting their ability to reason about geometry and cross-view consistency.

This project introduces a geometry-aware 4D video generation pipeline that:

  • Models multi-view RGB-D observations across time
  • Enforces cross-view geometric consistency via pointmaps
  • Learns temporally coherent latent dynamics suitable for manipulation
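
To make the pointmap idea concrete: in the sense commonly used in the literature, a pointmap stores, for every pixel, the 3D point that pixel observes, expressed in a shared reference frame, so that pixels from different views looking at the same surface land on the same coordinates. The sketch below unprojects a depth image under an assumed pinhole camera model. The function name `depth_to_pointmap`, the `(fx, fy, cx, cy)` intrinsics tuple, and the 4×4 camera-to-reference matrix convention are illustrative assumptions, not the repository's API.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_pointmap(
    depth: Sequence[Sequence[float]],
    intrinsics: Tuple[float, float, float, float],
    cam_to_ref: Sequence[Sequence[float]],
) -> List[List[Point]]:
    """Unproject an H x W depth image into an H x W grid of 3D points.

    intrinsics is (fx, fy, cx, cy) for an assumed pinhole camera.
    cam_to_ref is a 4x4 row-major rigid transform from this camera's
    frame to the shared reference frame (hypothetical convention).
    """
    fx, fy, cx, cy = intrinsics
    pointmap: List[List[Point]] = []
    for v, row in enumerate(depth):
        out_row: List[Point] = []
        for u, z in enumerate(row):
            # Back-project pixel (u, v) with depth z into the camera frame.
            p_cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Move the point into the shared reference frame.
            x, y, z_ref = (
                sum(cam_to_ref[i][j] * p_cam[j] for j in range(4))
                for i in range(3)
            )
            out_row.append((x, y, z_ref))
        pointmap.append(out_row)
    return pointmap


# Toy example: a 2 x 2 depth image from a camera shifted 1 m along x.
depth = [[1.0, 1.0], [2.0, 2.0]]
intrinsics = (1.0, 1.0, 0.0, 0.0)
cam_to_ref = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
pointmap = depth_to_pointmap(depth, intrinsics, cam_to_ref)
```

Two views are geometrically consistent when their pointmaps agree wherever they observe the same surface, which is the property the cross-view constraint targets.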

The resulting models are intended as a foundation for world modeling, policy learning, and planning in robotics.


📦 Dataset

We release a multi-view, multi-task robotic manipulation dataset covering both simulation and real-world tasks.

Tasks

Simulation tasks (LBM):

  • StoreCerealBoxUnderShelf
  • PutSpatulaOnTableFromUtensilCrock
  • PlaceAppleFromBowlIntoBin

Real-world robot manipulation tasks:

  • BimanualAddOrangeSlicesToBowl
  • BimanualChopCucumber
  • BimanualCupOnSaucer
  • BimanualTwistCapOffBottle

Key Properties

  • Simulation: 50 demonstrations per task
  • Real world: 10 demonstrations per task
  • 16 RGB-D camera views per timestep, sampled from the upper hemisphere
  • Synchronized robot actions and observations
  • Simulation data collected in the Large Behavior Model (LBM) environment
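
The list above does not spell out how the 16 viewpoints are drawn from the upper hemisphere. As one plausible sketch (an assumption, not the authors' procedure), the snippet below samples camera positions uniformly by area on the upper half of a sphere centred on the workspace. The function name `sample_upper_hemisphere`, the radius, and the seed are illustrative.

```python
import math
import random
from typing import List, Tuple


def sample_upper_hemisphere(
    num_views: int = 16, radius: float = 1.0, seed: int = 0
) -> List[Tuple[float, float, float]]:
    """Sample camera positions uniformly (by area) on the upper hemisphere."""
    rng = random.Random(seed)
    positions = []
    for _ in range(num_views):
        # A uniformly distributed height gives area-uniform samples on a sphere.
        z = rng.uniform(0.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r_xy = math.sqrt(1.0 - z * z)
        positions.append(
            (radius * r_xy * math.cos(phi), radius * r_xy * math.sin(phi), radius * z)
        )
    return positions


# 16 hypothetical camera positions, 1.5 m from the workspace centre.
cameras = sample_upper_hemisphere(num_views=16, radius=1.5)
```

Each position would still need an orientation (for example, a look-at rotation towards the workspace centre) before it can be used as a camera pose.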

📥 Download links:


🧠 Pre-trained Models

We provide multiple checkpoints to support different stages of the pipeline:

  • Stable Video Diffusion (SVD) backbones
  • Task-specific VAEs for RGB and pointmap latents
  • 4D video generation models fine-tuned on manipulation data

📦 Checkpoints:


⚙️ Installation

We recommend using conda or mamba.

cd 4dgen
conda env create -f environment.yml
conda activate video_policy
conda install pytorch3d

Tested on:

  • Ubuntu 22.04
  • CUDA 12.2

🔧 Training

1️⃣ Fine-tune the VAE

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_autoencoder_workspace

2️⃣ Train the 4D Video Generation Model

CUDA_VISIBLE_DEVICES=<GPU_IDS> \
HYDRA_FULL_ERROR=1 \
python scripts/train.py --config-name=finetune_svd_lightning_workspace
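
If you would rather drive both stages from a single Python script, the two shell commands above can be reproduced with `subprocess`. This is a minimal sketch under stated assumptions: `build_train_command` and `run_stages` are hypothetical helpers, not part of the repository, and the script is assumed to be run from the repository root so that `scripts/train.py` resolves.

```python
import os
import subprocess
import sys
from typing import Dict, List, Sequence, Tuple


def build_train_command(
    config_name: str, gpu_ids: Sequence[int]
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) equivalent to the shell command for one stage."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    env["HYDRA_FULL_ERROR"] = "1"  # ask Hydra for complete stack traces
    argv = [sys.executable, "scripts/train.py", f"--config-name={config_name}"]
    return argv, env


def run_stages(gpu_ids: Sequence[int]) -> None:
    """Run VAE fine-tuning, then video model training; stop on first failure."""
    for config_name in (
        "finetune_autoencoder_workspace",
        "finetune_svd_lightning_workspace",
    ):
        argv, env = build_train_command(config_name, gpu_ids)
        subprocess.run(argv, env=env, check=True)


# Inspect the first stage's command without launching anything.
argv, env = build_train_command("finetune_autoencoder_workspace", [0, 1, 2, 3])
print(" ".join(argv[1:]))
```

Calling `run_stages([0, 1, 2, 3])` would then launch the two stages back to back. Whether the second stage picks up the fine-tuned VAE automatically depends on the Hydra configs, so check the checkpoint paths there first.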

Notes:

  • Tested on 4 × NVIDIA A6000 (48 GB)
  • Batch size: 1
  • Training time: ~2 days

🔍 Inference

Run the provided evaluation example:

python notebooks/eval.py

This script demonstrates loading a trained checkpoint and generating multi-view 4D predictions.

🎥 Qualitative Results

We show representative qualitative results illustrating multi-view RGB-D video generation.

Generated RGB-D Videos

Task 1

Task 1 RGB Task 1 depth

Task 2

Task 2 RGB Task 2 depth

Task 3

Task 3 RGB Task 3 depth

📚 Citation

If you find this project useful, please consider citing:

@article{liu2025geometry,
  title={Geometry-aware 4D Video Generation for Robot Manipulation},
  author={Liu, Zeyi and Li, Shuang and Cousineau, Eric and Feng, Siyuan and Burchfiel, Benjamin and Song, Shuran},
  journal={arXiv preprint arXiv:2507.01099},
  year={2025}
}

📄 License

This project is released for research use. Please see the repository for license details.


💬 Questions or issues? Feel free to open a GitHub issue or reach out via the project page.