MoDEDiffusionPolicy
May 16, 2025 · View on GitHub
MoDE_Diffusion_Policy
Moritz Reuss1, Jyothish Pari2, Pulkit Agrawal2, Rudolf Lioutikov1
1Intuitive Robots Lab, KIT 2MIT CSAIL
Installation
To begin, clone this repository locally
git clone --recurse-submodules git@github.com:intuitive-robots/MoDE_Diffusion_Policy.git
export mode_ROOT=$(pwd)/MoDE_Diffusion_Policy
Install requirements (Note we provided a changed verison of pyhash, given numerous problems when installing it manually)
cd $mode_ROOT
conda create -n mode_env python=3.9
conda activate mode_env
cd calvin_env/tacto
pip install -e .
cd ..
pip install -e .
cd ..
pip install setuptools==57.5.0
cd pyhash-0.9.3
python setup.py build
python setup.py install
cd ..
cd LIBERO
pip install -r requirements.txt
pip install -e .
pip install numpy~=1.23
cd ..
Next we can install the rest of the missing packages
pip install -r requirements.txt
Download
CALVIN Dataset
If you want to train on the CALVIN dataset, choose a split with:
cd $mode_ROOT/dataset
sh download_data.sh D | ABCD
Training
To train the mode model with the maximum amount of available GPUS, run:
python mode/training.py
For replication of the orginial training results I recommend to use 4 GPUs with a batch_size of 128 and train them for 15k steps for CALVIN and LIBERO results, when startin from pretrained weights. See configs for details.
Preprocessing with CALVIN
Since MoDE uses action chunking, it needs to load multiple (~10) episode_{}.npz files for each inference. In combination with batching, this results in a large disk bandwidth needed for each iteration (usually ~2000MB/iteration).
This has the potential of significantly reducing your GPU utilization rate during training depending on your hardware.
Therefore, you can use the script extract_by_key.py to extract the data into a single file, avoiding opening too many episode files when using the CALVIN dataset.
Usage example:
python preprocess/extract_by_key.py -i /YOUR/PATH/TO/CALVIN/ \
--in_task all
python preprocess/extract_by_key.py -i /hkfs/work/workspace/scratch/ft4740-play3/data --in_task all
Params:
Run this command to see more detailed information:
python preprocess/extract_by_key.py -h
Important params:
--in_root:/YOUR/PATH/TO/CALVIN/, e.g/data3/geyuan/datasets/CALVIN/--extract_key: A key ofdict(episode_xxx.npz), default is 'rel_actions', the saved file name depends on this (i.eep_{extract_key}.npy) Optional params:--in_task: default is 'all', meaning all task folders (e.gtask_ABCD_D/) of CALVIN--in_split: default is 'all', meaning bothtraining/andvalidation/--out_dir: optional, default is 'None', and will be converted to{in_root}/{in_task}/{in_split}/extracted/--force: whether to overwrite existing extracted data Thanks to @ygtxr1997 for debugging the GPU utilization and providing a merge request.
After that modify the following line to add the storage path for your CALVIN dataset: L9 in config_calvin.yaml
Evaluation
Download the pretrained models from Hugging Face:
Results on Calvin ABC→D
| Method | Active Params (Million) | PrT | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Diff-P-CNN | 321 | × | 63.5% | 35.3% | 19.4% | 10.7% | 6.4% | 1.35±0.05 |
| Diff-P-T | 194 | × | 62.2% | 30.9% | 13.2% | 5.0% | 1.6% | 1.13±0.02 |
| RoboFlamingo | 1000 | ✓ | 82.4% | 61.9% | 46.6% | 33.1% | 23.5% | 2.47±0.00 |
| SuSIE | 860+ | ✓ | 87.0% | 69.0% | 49.0% | 38.0% | 26.0% | 2.69±0.00 |
| GR-1 | 130 | ✓ | 85.4% | 71.2% | 59.6% | 49.7% | 40.1% | 3.06±0.00 |
| MoDE | 307 | × | 91.5% | 79.2% | 67.3% | 55.8% | 45.3% | 3.39±0.03 |
| MoDE | 436 | ✓ | 96.2% | 88.9% | 81.1% | 71.8% | 63.5% | 4.01±0.04 |
Results on Calvin ABCD→D
| Method | Active Params (Million) | PrT | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| Diff-P-CNN | 321 | × | 86.3% | 72.7% | 60.1% | 51.2% | 41.7% | 3.16±0.06 |
| Diff-P-T | 194 | × | 78.3% | 53.9% | 33.8% | 20.4% | 11.3% | 1.98±0.09 |
| RoboFlamingo | 1000 | ✓ | 96.4% | 89.6% | 82.4% | 74.0% | 66.0% | 4.09±0.00 |
| GR-1 | 130 | ✓ | 94.9% | 89.6% | 84.4% | 78.9% | 73.1% | 4.21±0.00 |
| MoDE | 277 | × | 96.6% | 90.6% | 86.6% | 80.9% | 75.5% | 4.30±0.02 |
| MoDE | 436 | ✓ | 97.1% | 92.5% | 87.9% | 83.5% | 77.9% | 4.39±0.04 |
We also provide the pretrained checkpoint after pretraining MoDE for 300k steps on a small OXE subset:
We used the following split for training:
| Dataset | Weight |
|---|---|
| BC-Z | 0.258768 |
| LIBERO-10 | 0.043649 |
| BRIDGE | 0.188043 |
| CMU Play-Fusion | 0.101486 |
| Google Fractal | 0.162878 |
| DOBB-E | 0.245176 |
| Total | 1.000000 |
The model was pretrained for 300k steps with full pretraining details provided here; MoDE Pretraining Report.
Acknowledgements
This work is only possible becauase of the code from the following open-source projects and datasets:
CALVIN
Original: https://github.com/mees/calvin License: MIT
OpenAI CLIP
Original: https://github.com/openai/CLIP License: MIT
BESO
Original: https://github.com/intuitive-robots/beso License: MIT
HULC
Original: https://github.com/lukashermann/hulc License: MIT
MDT
Original: https://github.com/intuitive-robots/mdt_policy License: https://github.com/intuitive-robots/mdt_policy/blob/main/LICENSE
Citation
If you found the code usefull, please cite our work:
@inproceedings{
reuss2025efficient,
title={Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning},
author={Moritz Reuss and Jyothish Pari and Pulkit Agrawal and Rudolf Lioutikov},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=nDmwloEl3N}
}