May 9, 2026

DiT4DiT

Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Teli Ma1,2    Jia Zheng1,2    Zifan Wang1,2    Chunli Jiang1    Andy Cui1    Junwei Liang2,3,*    Shuo Yang1,*

1Mondo Robotics    2HKUST(GZ)    3HKUST    *Corresponding author


DiT4DiT is a Vision-Action-Model (VAM) framework that combines video generation transformers with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop manipulation and whole-body control. Notably, DiT4DiT is the first efficient VAM to achieve real-time whole-body control of humanoid robots.

News

  • [2026-04-15] Initial release of DiT4DiT with training, evaluation, and deployment code.
  • [2026-03-11] We release the arXiv paper.

Whole-Body Control (all 1x speed & autonomous)

Shelf Organization
Relocate Chair
Assembly Line Work

Tabletop Manipulation (all 1x speed, 1 policy for all tasks)

Stack Cups
Drawer Interaction
Pick and Place
Arrange Flower
Move Spoon
Insert Plate
Box Packing
Twist Cap

TODOs

  • Release teleoperation, training and deployment code for Unitree G1 tabletop tasks.
  • Release teleoperation, training and deployment code for Unitree G1 whole-body control tasks.

Project Structure

DiT4DiT/
├── DiT4DiT/                    # Core package
│   ├── config/                 # Configurations
│   │   ├── deepseeds/          # DeepSpeed configs
│   │   ├── robocasa/           # RoboCasa experiment configs
│   │   └── real_robot/         # Real robot configs
│   ├── dataloader/             # Dataset loading (LeRobot)
│   ├── model/                  # Model architecture
│   │   ├── framework/          # DiT4DiT framework
│   │   └── modules/            # Backbone & action model
│   └── training/               # Training scripts & utilities
├── deployment/                 # WebSocket-based model server
├── docs/                       # Documentation
├── examples/
│   ├── Robocasa_tabletop/      # RoboCasa simulation example
│   │   ├── train_files/        # Training scripts
│   │   └── eval_files/         # Evaluation & simulation
│   └── Real_G1/                # Real Unitree G1 example
│       ├── train_files/        # Training scripts
│       └── eval_files/         # Evaluation
└── requirements.txt
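
After cloning, a quick way to confirm that a checkout matches the tree above is to test for the listed directories. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory list is copied from the tree, so it may drift if the layout changes.

```python
from pathlib import Path

# Directories copied from the project tree above (assumed current).
EXPECTED_DIRS = [
    "DiT4DiT/config",
    "DiT4DiT/dataloader",
    "DiT4DiT/model/framework",
    "DiT4DiT/model/modules",
    "DiT4DiT/training",
    "deployment",
    "docs",
    "examples",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    print("layout OK" if not missing else f"missing: {missing}")
```

Run it from the repository root; an empty result means every listed directory is present.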

Installation

Prerequisites

  • Python >= 3.10
  • CUDA 12.4+
  • >8x GPUs recommended for training

Setup

# Clone the repository
git clone https://github.com/Mondo-Robotics/DiT4DiT.git
cd DiT4DiT

# Create conda environment
conda create -n dit4dit python=3.10 -y
conda activate dit4dit

# Install PyTorch (CUDA 12.8 recommended)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
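
Once the steps above finish, it can help to confirm that the environment meets the prerequisites. The sketch below is an illustration under stated assumptions rather than a script shipped with the repository: `python_ok` and `torch_summary` are hypothetical helper names, and the check only reports what PyTorch sees without validating exact versions.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version from the Prerequisites list


def python_ok(version_info=sys.version_info):
    """Return True when the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def torch_summary():
    """Return (torch version, CUDA build version, visible GPU count).

    Returns None when PyTorch is not installed in this environment.
    """
    try:
        import torch
    except ImportError:
        return None
    gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return torch.__version__, torch.version.cuda, gpus


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("torch (version, CUDA, GPUs):", torch_summary())
```

With the install command above, the reported PyTorch version should start with 2.7.0 and the CUDA build should be 12.8; a GPU count of 0 usually points to a driver or visibility problem.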

Download Pretrained Backbone

Download the Cosmos-Predict2.5-2B model from Hugging Face:

huggingface-cli download nvidia/Cosmos-Predict2.5-2B --revision diffusers/base/post-trained --local-dir /path/to/Cosmos-Predict2.5-2B
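
For scripted setups, the same download can likely be done from Python with `snapshot_download` from `huggingface_hub`, the library that provides `huggingface-cli`. This is a sketch: the wrapper name `download_backbone` is hypothetical, and the repository ID and revision are copied from the command above.

```python
REPO_ID = "nvidia/Cosmos-Predict2.5-2B"
REVISION = "diffusers/base/post-trained"


def download_backbone(local_dir):
    """Download the backbone snapshot into local_dir and return its path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        revision=REVISION,
        local_dir=local_dir,
    )


if __name__ == "__main__":
    import sys

    # Usage: python download_backbone.py /path/to/Cosmos-Predict2.5-2B
    if len(sys.argv) == 2:
        print(download_backbone(sys.argv[1]))
```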

Model Zoo

We release pretrained checkpoints to facilitate reproduction.

Available Checkpoints

| Model | Description | Dataset | Success Rate (%) | Link |
|---|---|---|---|---|
| DiT4DiT-LIBERO | DiT4DiT for LIBERO benchmark | LIBERO | 98.6 | 🤗 Hugging Face |
| DiT4DiT-RoboCasa-GR1 | DiT4DiT for RoboCasa-GR1 tabletop tasks | RoboCasa-GR1 | 56.7 | 🤗 Hugging Face |

Note: More checkpoints will be released soon. Stay tuned!

Quick Start

Simulation

  • LIBERO: See the full training and evaluation guide here.
  • RoboCasa-GR1 Tabletop: See the full training and evaluation guide here.

Real Robot

Coming soon.

Results

LIBERO Benchmark

| Task Suite | Success Rate (%) |
|---|---|
| LIBERO-Spatial | 98.6 |
| LIBERO-Object | 100.0 |
| LIBERO-Goal | 99.2 |
| LIBERO-10 | 96.6 |
| Average | 98.6 |
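
The Average row appears to be the unweighted mean of the four suite scores. A short check, assuming equal weighting across suites (the variable names are illustrative):

```python
from statistics import fmean

# Per-suite success rates (%) copied from the table above.
libero_scores = {
    "LIBERO-Spatial": 98.6,
    "LIBERO-Object": 100.0,
    "LIBERO-Goal": 99.2,
    "LIBERO-10": 96.6,
}

# Unweighted mean across the four suites.
libero_average = fmean(libero_scores.values())
print(f"{libero_average:.1f}")
```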

RoboCasa-GR1 Benchmark

The results below were obtained with the default training parameters described in Configure Training. We report five independent evaluation runs of the same checkpoint to demonstrate reproducibility. The average success rate stays above 56% in every run, ranging from 56.3% to 57.4%.

| Task | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| BottleToCabinetClose | 50.0 | 72.0 | 68.0 | 64.0 | 70.0 |
| CanToDrawerClose | 80.0 | 80.0 | 82.0 | 76.0 | 70.0 |
| CupToDrawerClose | 50.0 | 34.0 | 50.0 | 44.0 | 60.0 |
| MilkToMicrowaveClose | 58.0 | 60.0 | 38.0 | 68.0 | 60.0 |
| PotatoToMicrowaveClose | 40.0 | 40.0 | 36.0 | 38.0 | 48.0 |
| WineToCabinetClose | 60.0 | 48.0 | 60.0 | 54.0 | 68.0 |
| FromCuttingboardToBasket | 54.0 | 48.0 | 46.0 | 64.0 | 54.0 |
| FromCuttingboardToCardboardbox | 50.0 | 60.0 | 48.0 | 58.0 | 52.0 |
| FromCuttingboardToPan | 80.0 | 74.0 | 78.0 | 72.0 | 72.0 |
| FromCuttingboardToPot | 52.0 | 46.0 | 66.0 | 64.0 | 54.0 |
| FromCuttingboardToTieredbasket | 44.0 | 54.0 | 50.0 | 50.0 | 46.0 |
| FromPlacematToBasket | 58.0 | 40.0 | 44.0 | 54.0 | 54.0 |
| FromPlacematToBowl | 64.0 | 66.0 | 72.0 | 60.0 | 56.0 |
| FromPlacematToPlate | 66.0 | 62.0 | 64.0 | 54.0 | 58.0 |
| FromPlacematToTieredshelf | 44.0 | 48.0 | 40.0 | 32.0 | 44.0 |
| FromPlateToBowl | 64.0 | 74.0 | 54.0 | 72.0 | 52.0 |
| FromPlateToCardboardbox | 50.0 | 54.0 | 52.0 | 56.0 | 52.0 |
| FromPlateToPan | 58.0 | 68.0 | 70.0 | 68.0 | 56.0 |
| FromPlateToPlate | 62.0 | 64.0 | 72.0 | 76.0 | 74.0 |
| FromTrayToCardboardbox | 52.0 | 50.0 | 60.0 | 58.0 | 58.0 |
| FromTrayToPlate | 64.0 | 64.0 | 58.0 | 60.0 | 50.0 |
| FromTrayToPot | 68.0 | 70.0 | 66.0 | 64.0 | 74.0 |
| FromTrayToTieredbasket | 50.0 | 46.0 | 50.0 | 42.0 | 50.0 |
| FromTrayToTieredshelf | 42.0 | 36.0 | 28.0 | 30.0 | 44.0 |
| Average | 56.7 | 56.6 | 56.3 | 57.4 | 57.3 |
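
To put a number on the run-to-run variation, the five per-run averages can be summarized by their mean and sample standard deviation. This is a small sketch using only the Average row above; the variable names are illustrative, and per-task variation is larger than this summary suggests.

```python
from statistics import mean, stdev

# Per-run average success rates (%) from the Average row above.
run_averages = [56.7, 56.6, 56.3, 57.4, 57.3]

run_mean = mean(run_averages)   # mean of the five run averages
run_std = stdev(run_averages)   # sample standard deviation (n - 1)
run_min = min(run_averages)     # worst run

print(f"mean={run_mean:.2f} std={run_std:.2f} min={run_min:.1f}")
```

The mean across runs is about 56.9% with a standard deviation of under half a percentage point, and the worst run is still above 56%.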

Citation

If you find this work useful, please consider citing:

@article{ma2026dit4dit,
  title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},
  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},
  journal={arXiv preprint arXiv:2603.10448},
  year={2026}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project builds upon: