May 9, 2026
Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control
Teli Ma1,2 Jia Zheng1,2 Zifan Wang1,2 Chunli Jiang1 Andy Cui1 Junwei Liang2,3,* Shuo Yang1,*
1Mondo Robotics 2HKUST(GZ) 3HKUST *Corresponding author
DiT4DiT is a Vision-Action Model (VAM) framework that combines video generation transformers with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop manipulation and whole-body control. Notably, DiT4DiT is the first efficient VAM to achieve real-time whole-body control of humanoid robots.
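To make "flow-matching-based action prediction" concrete, here is a minimal sketch of how a flow-matching sampler turns noise into an action by Euler integration of a velocity field. It is not the DiT4DiT implementation: `toy_velocity`, `sample_action`, and the 3-dimensional `target` are hypothetical names chosen for illustration, and the closed-form velocity stands in for what would be a learned network conditioned on video features.

```python
import random


def toy_velocity(x, t, target):
    """Ideal velocity for the straight path x_t = (1 - t) * x0 + t * target.

    Assumption: a real model would predict this with a neural network.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never hits zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # explicit Euler step
    return x


target = [0.2, -0.5, 0.1]  # hypothetical 3-DoF action
action = sample_action(target)
print(action)
```

With this idealized velocity field the Euler integration lands on the target action after the final step; with a learned field, the number of integration steps trades accuracy against latency.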
News
- [2026-04-15] Initial release of DiT4DiT with training, evaluation, and deployment code.
- [2026-03-11] We release the arXiv paper.
Whole-Body Control (all 1x speed & autonomous)
Demo videos: Shelf Organization, Relocate Chair, Assembly Line Work.
Tabletop Manipulation (all 1x speed, 1 policy for all tasks)
Demo videos: Stack Cups, Drawer Interaction, Pick and Place, Arrange Flower, Move Spoon, Insert Plate, Box Packing, Twist Cap.
TODOs
- Release teleoperation, training and deployment code for Unitree G1 tabletop tasks.
- Release teleoperation, training and deployment code for Unitree G1 whole-body control tasks.
Project Structure
DiT4DiT/
├── DiT4DiT/ # Core package
│ ├── config/ # Configurations
│ │ ├── deepseeds/ # DeepSpeed configs
│ │ ├── robocasa/ # RoboCasa experiment configs
│ │ └── real_robot/ # Real robot configs
│ ├── dataloader/ # Dataset loading (LeRobot)
│ ├── model/ # Model architecture
│ │ ├── framework/ # DiT4DiT framework
│ │ └── modules/ # Backbone & action model
│ └── training/ # Training scripts & utilities
├── deployment/ # WebSocket-based model server
├── docs/ # Documentation
├── examples/
│ ├── Robocasa_tabletop/ # RoboCasa simulation example
│ │ ├── train_files/ # Training scripts
│ │ └── eval_files/ # Evaluation & simulation
│ └── Real_G1/ # Real Unitree G1 example
│ ├── train_files/ # Training scripts
│ └── eval_files/ # Evaluation
└── requirements.txt
Installation
Prerequisites
- Python >= 3.10
- CUDA 12.4+
- >8x GPUs recommended for training
Setup
# Clone the repository
git clone https://github.com/Mondo-Robotics/DiT4DiT.git
cd DiT4DiT
# Create conda environment
conda create -n dit4dit python=3.10 -y
conda activate dit4dit
# Install PyTorch (CUDA 12.8 recommended)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
# Install dependencies
pip install -r requirements.txt
# Install the package
pip install -e .
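After the setup steps, a quick sanity check can confirm that the interpreter and PyTorch match the prerequisites. The following is a minimal sketch, not part of the repository; `check_environment` is a hypothetical helper name, and it relies only on the standard library plus PyTorch if it is installed.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Return a small report on the Python and PyTorch setup."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": None,  # stays None when torch is missing
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `cuda_available` prints `False`, the installed PyTorch wheel may not match the local CUDA driver.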
Download Pretrained Backbone
Download the Cosmos-Predict2.5-2B model from Hugging Face:
huggingface-cli download nvidia/Cosmos-Predict2.5-2B --revision diffusers/base/post-trained --local-dir /path/to/Cosmos-Predict2.5-2B
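The same download can be scripted from Python with `huggingface_hub.snapshot_download`, which is convenient inside a setup script. This is a sketch under the assumption that `huggingface_hub` is installed; `download_backbone` is a hypothetical wrapper, while the repository ID and revision are the ones used in the command above.

```python
REPO_ID = "nvidia/Cosmos-Predict2.5-2B"
REVISION = "diffusers/base/post-trained"


def download_backbone(local_dir):
    """Download the Cosmos-Predict2.5-2B backbone into local_dir."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    # Returns the path of the folder that holds the downloaded files.
    return snapshot_download(
        repo_id=REPO_ID,
        revision=REVISION,
        local_dir=local_dir,
    )
```

Call it as `download_backbone("/path/to/Cosmos-Predict2.5-2B")`. If the repository requires authentication, log in with the Hugging Face CLI first.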
Model Zoo
We release pretrained checkpoints to facilitate reproduction.
Available Checkpoints
| Model | Description | Dataset | Success Rate (%) | Link |
|---|---|---|---|---|
| DiT4DiT-LIBERO | DiT4DiT for LIBERO benchmark | LIBERO | 98.6 | 🤗 Hugging Face |
| DiT4DiT-RoboCasa-GR1 | DiT4DiT for RoboCasa-GR1 tabletop tasks | RoboCasa-GR1 | 56.7 | 🤗 Hugging Face |
Note: More checkpoints will be released soon. Stay tuned!
Quick Start
Simulation
- LIBERO: See the full training and evaluation guide here.
- RoboCasa-GR1 Tabletop: See the full training and evaluation guide here.
Real Robot
Coming soon.
Results
LIBERO Benchmark
| Task Suite | Success Rate (%) |
|---|---|
| LIBERO-Spatial | 98.6 |
| LIBERO-Object | 100.0 |
| LIBERO-Goal | 99.2 |
| LIBERO-10 | 96.6 |
| Average | 98.6 |
RoboCasa-GR1 Benchmark
The following results are obtained using the default training parameters described in Configure Training. We report five independent evaluation runs of the same checkpoint to demonstrate reproducibility. All values are success rates in percent. The average success rate ranges from 56.3% to 57.4% across the five runs.
| Task | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| BottleToCabinetClose | 50.0 | 72.0 | 68.0 | 64.0 | 70.0 |
| CanToDrawerClose | 80.0 | 80.0 | 82.0 | 76.0 | 70.0 |
| CupToDrawerClose | 50.0 | 34.0 | 50.0 | 44.0 | 60.0 |
| MilkToMicrowaveClose | 58.0 | 60.0 | 38.0 | 68.0 | 60.0 |
| PotatoToMicrowaveClose | 40.0 | 40.0 | 36.0 | 38.0 | 48.0 |
| WineToCabinetClose | 60.0 | 48.0 | 60.0 | 54.0 | 68.0 |
| FromCuttingboardToBasket | 54.0 | 48.0 | 46.0 | 64.0 | 54.0 |
| FromCuttingboardToCardboardbox | 50.0 | 60.0 | 48.0 | 58.0 | 52.0 |
| FromCuttingboardToPan | 80.0 | 74.0 | 78.0 | 72.0 | 72.0 |
| FromCuttingboardToPot | 52.0 | 46.0 | 66.0 | 64.0 | 54.0 |
| FromCuttingboardToTieredbasket | 44.0 | 54.0 | 50.0 | 50.0 | 46.0 |
| FromPlacematToBasket | 58.0 | 40.0 | 44.0 | 54.0 | 54.0 |
| FromPlacematToBowl | 64.0 | 66.0 | 72.0 | 60.0 | 56.0 |
| FromPlacematToPlate | 66.0 | 62.0 | 64.0 | 54.0 | 58.0 |
| FromPlacematToTieredshelf | 44.0 | 48.0 | 40.0 | 32.0 | 44.0 |
| FromPlateToBowl | 64.0 | 74.0 | 54.0 | 72.0 | 52.0 |
| FromPlateToCardboardbox | 50.0 | 54.0 | 52.0 | 56.0 | 52.0 |
| FromPlateToPan | 58.0 | 68.0 | 70.0 | 68.0 | 56.0 |
| FromPlateToPlate | 62.0 | 64.0 | 72.0 | 76.0 | 74.0 |
| FromTrayToCardboardbox | 52.0 | 50.0 | 60.0 | 58.0 | 58.0 |
| FromTrayToPlate | 64.0 | 64.0 | 58.0 | 60.0 | 50.0 |
| FromTrayToPot | 68.0 | 70.0 | 66.0 | 64.0 | 74.0 |
| FromTrayToTieredbasket | 50.0 | 46.0 | 50.0 | 42.0 | 50.0 |
| FromTrayToTieredshelf | 42.0 | 36.0 | 28.0 | 30.0 | 44.0 |
| Average | 56.7 | 56.6 | 56.3 | 57.4 | 57.3 |
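The run-to-run variation can be summarized directly from the "Average" row. The snippet below is a small illustrative sketch using only the Python standard library; the variable names are ours, and the numbers are copied from the table above.

```python
from statistics import mean, stdev

# Per-run average success rates (%) copied from the "Average" row.
run_averages = [56.7, 56.6, 56.3, 57.4, 57.3]

overall_mean = mean(run_averages)
run_to_run_std = stdev(run_averages)  # sample standard deviation (n - 1)

print(f"mean over runs: {overall_mean:.2f}%")
print(f"std over runs:  {run_to_run_std:.2f}%")
print(f"min / max:      {min(run_averages):.1f}% / {max(run_averages):.1f}%")
```

The five runs average about 56.9% with a spread of roughly half a percentage point, which is small relative to the per-task variation visible in the table.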
Citation
If you find this work useful, please consider citing:
@article{ma2026dit4dit,
title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},
author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},
journal={arXiv preprint arXiv:2603.10448},
year={2026}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
This project builds upon:
- StarVLA
- Cosmos-Predict2.5 by NVIDIA
- GR00T by NVIDIA
- RoboCasa
- LeRobot by Hugging Face










