June 9, 2026

DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Teli Ma¹,² · Jia Zheng¹,² · Zifan Wang¹,² · Chunli Jiang¹ · Andy Cui¹ · Junwei Liang²,³,* · Shuo Yang¹,*

¹Mondo Robotics · ²HKUST(GZ) · ³HKUST · *Corresponding author

DiT4DiT is a Vision-Action-Model (VAM) framework that combines video generation transformers with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop manipulation and whole-body control. Notably, DiT4DiT is the first efficient VAM to achieve real-time whole-body control of humanoid robots.

News

[2026-06-09] We release teleoperation, training, and deployment code for the real Unitree G1.
[2026-04-15] Initial release of DiT4DiT with training, evaluation, and deployment code.
[2026-03-11] We release the arXiv paper.

Whole-Body Control (all 1x speed & autonomous)

Demos: Shelf Organization · Relocate Chair · Assembly Line Work

Tabletop Manipulation (all 1x speed, 1 policy for all tasks)

Demos: Stack Cups · Drawer Interaction · Pick and Place · Arrange Flower · Move Spoon · Insert Plate · Box Packing · Twist Cap

Table of Contents

- News
- TODOs
- Project Structure
- Installation
- Quick Start
  - Simulation
  - Real Robot
- Acknowledgements
- License

TODOs

~~Release teleoperation, training and deployment code for Unitree G1 tabletop tasks.~~
~~Release teleoperation, training and deployment code for Unitree G1 whole-body control tasks.~~

Project Structure

DiT4DiT/
├── DiT4DiT/                    # Core package
│   ├── config/                 # Configurations
│   │   ├── deepseeds/          # DeepSpeed configs
│   │   ├── robocasa/           # RoboCasa experiment configs
│   │   └── real_robot/         # Real robot configs
│   ├── dataloader/             # Dataset loading (LeRobot)
│   ├── model/                  # Model architecture
│   │   ├── framework/          # DiT4DiT framework
│   │   └── modules/            # Backbone & action model
│   └── training/               # Training scripts & utilities
├── deployment/                 # WebSocket-based model server
├── docs/                       # Documentation
├── examples/
│   ├── Robocasa_tabletop/      # RoboCasa simulation example
│   │   ├── train_files/        # Training scripts
│   │   └── eval_files/         # Evaluation & simulation
│   └── Real_G1/                # Real Unitree G1 example
│       ├── train_files/        # Training scripts
│       └── eval_files/         # Evaluation
└── requirements.txt

Installation

Prerequisites

Python >= 3.10
CUDA 12.4+ (the PyTorch install command below uses the CUDA 12.8 wheels)
More than 8 GPUs recommended for training

Setup

# Clone the repository
git clone https://github.com/Mondo-Robotics/DiT4DiT.git
cd DiT4DiT

# Create conda environment
conda create -n dit4dit python=3.10 -y
conda activate dit4dit

# Install PyTorch (CUDA 12.8 recommended)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
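
After the setup steps, it can help to confirm that the environment matches the prerequisites. The following is a minimal sketch, not part of the repository: the helper name `check_environment` is hypothetical, and it only uses standard-library calls plus `torch.cuda.is_available()`, `torch.cuda.device_count()`, and `torch.version.cuda` when PyTorch is importable.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Collect basic facts about the interpreter and the PyTorch install.

    Hypothetical helper for illustration; DiT4DiT does not ship it.
    """
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,   # filled in only if torch is importable
        "gpu_count": None,        # filled in only if torch is importable
        "torch_cuda_version": None,
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without torch

        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
        report["torch_cuda_version"] = torch.version.cuda
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

With the install command above, `torch_cuda_version` would be expected to report a 12.8 build, and `gpu_count` shows how many GPUs are visible to the process.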

Download Pretrained Backbone

Download the Cosmos-Predict2.5-2B model from Hugging Face:

huggingface-cli download nvidia/Cosmos-Predict2.5-2B --revision diffusers/base/post-trained --local-dir /path/to/Cosmos-Predict2.5-2B
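
The same download can be scripted from Python. This is a sketch under the assumption that the `huggingface_hub` package (which provides `huggingface-cli`) is installed; the function name `download_backbone` is hypothetical, while the repository ID and revision are copied from the command above.

```python
from pathlib import Path

# Values taken from the huggingface-cli command above.
REPO_ID = "nvidia/Cosmos-Predict2.5-2B"
REVISION = "diffusers/base/post-trained"


def download_backbone(local_dir):
    """Download the backbone snapshot into local_dir and return its path.

    Hypothetical helper for illustration; requires network access.
    """
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    return snapshot_download(
        repo_id=REPO_ID,
        revision=REVISION,
        local_dir=str(target),
    )
```

Calling `download_backbone("/path/to/Cosmos-Predict2.5-2B")` would then mirror the CLI invocation, with the path replaced by the actual target directory.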

Model Zoo

We release pretrained checkpoints to facilitate reproduction.

Available Checkpoints

Model	Description	Dataset	Success Rate (%)	Link
DiT4DiT-LIBERO	DiT4DiT for LIBERO benchmark	LIBERO	98.6	🤗 Hugging Face
DiT4DiT-RoboCasa-GR1	DiT4DiT for RoboCasa-GR1 tabletop tasks	RoboCasa-GR1	56.7	🤗 Hugging Face

Note: More checkpoints will be released soon. Stay tuned!

Quick Start

Simulation

LIBERO: See the full training and evaluation guide here.
RoboCasa-GR1 Tabletop: See the full training and evaluation guide here.

Real Robot

Unitree G1 (Decoupled WholeBodyControl): For real-robot data collection, replay, and closed-loop deployment on a Unitree G1, see the full guide here.

Results

LIBERO Benchmark

Task Suite	Success Rate (%)
LIBERO-Spatial	98.6
LIBERO-Object	100.0
LIBERO-Goal	99.2
LIBERO-10	96.6
Average	98.6

RoboCasa-GR1 Benchmark

The following results are obtained using the default training parameters described in Configure Training. We report five independent evaluation runs of the same checkpoint to demonstrate reproducibility. All values are success rates in percent. The average success rate stays between 56.3% and 57.4% across the five runs.

Task	Run 1	Run 2	Run 3	Run 4	Run 5
BottleToCabinetClose	50.0	72.0	68.0	64.0	70.0
CanToDrawerClose	80.0	80.0	82.0	76.0	70.0
CupToDrawerClose	50.0	34.0	50.0	44.0	60.0
MilkToMicrowaveClose	58.0	60.0	38.0	68.0	60.0
PotatoToMicrowaveClose	40.0	40.0	36.0	38.0	48.0
WineToCabinetClose	60.0	48.0	60.0	54.0	68.0
FromCuttingboardToBasket	54.0	48.0	46.0	64.0	54.0
FromCuttingboardToCardboardbox	50.0	60.0	48.0	58.0	52.0
FromCuttingboardToPan	80.0	74.0	78.0	72.0	72.0
FromCuttingboardToPot	52.0	46.0	66.0	64.0	54.0
FromCuttingboardToTieredbasket	44.0	54.0	50.0	50.0	46.0
FromPlacematToBasket	58.0	40.0	44.0	54.0	54.0
FromPlacematToBowl	64.0	66.0	72.0	60.0	56.0
FromPlacematToPlate	66.0	62.0	64.0	54.0	58.0
FromPlacematToTieredshelf	44.0	48.0	40.0	32.0	44.0
FromPlateToBowl	64.0	74.0	54.0	72.0	52.0
FromPlateToCardboardbox	50.0	54.0	52.0	56.0	52.0
FromPlateToPan	58.0	68.0	70.0	68.0	56.0
FromPlateToPlate	62.0	64.0	72.0	76.0	74.0
FromTrayToCardboardbox	52.0	50.0	60.0	58.0	58.0
FromTrayToPlate	64.0	64.0	58.0	60.0	50.0
FromTrayToPot	68.0	70.0	66.0	64.0	74.0
FromTrayToTieredbasket	50.0	46.0	50.0	42.0	50.0
FromTrayToTieredshelf	42.0	36.0	28.0	30.0	44.0
Average	56.7	56.6	56.3	57.4	57.3
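
To quantify the run-to-run variation, the per-run averages and individual task rows can be summarized with the standard library. The sketch below copies the "Average" row and two task rows from the table above; the helper name `summarize` is hypothetical and the choice of sample standard deviation is an assumption, not something the authors report.

```python
from statistics import mean, stdev

# "Average" row of the table above, one value per evaluation run.
run_averages = [56.7, 56.6, 56.3, 57.4, 57.3]

# Two task rows copied from the table, to show per-task spread.
task_runs = {
    "CanToDrawerClose": [80.0, 80.0, 82.0, 76.0, 70.0],
    "FromTrayToTieredshelf": [42.0, 36.0, 28.0, 30.0, 44.0],
}


def summarize(values):
    """Return (mean, sample standard deviation, min, max) for one row."""
    return mean(values), stdev(values), min(values), max(values)


overall = summarize(run_averages)
print(f"average over runs: {overall[0]:.2f} +/- {overall[1]:.2f} "
      f"(min {overall[2]:.1f}, max {overall[3]:.1f})")

for task, values in task_runs.items():
    m, s, lo, hi = summarize(values)
    print(f"{task}: {m:.1f} +/- {s:.1f} (range {lo:.0f}-{hi:.0f})")
```

The mean of the five run averages is 56.86, while individual tasks vary considerably more between runs than the overall average does.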

Citation

If you find this work useful, please consider citing:

@article{ma2026dit4dit,
  title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},
  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},
  journal={arXiv preprint arXiv:2603.10448},
  year={2026}
}

License

This project is licensed under the MIT License; see the LICENSE file for details.

Acknowledgements

This project builds upon:

StarVLA
Cosmos-Predict2.5 by NVIDIA
GR00T by NVIDIA
RoboCasa
LeRobot by Hugging Face
GR00T-WholeBodyControl by NVIDIA