Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling
December 3, 2025
Yueru Jia1,2*, Jiaming Liu1,2*, Shengbang Liu3*, Rui Zhou4, Wanhe Yu1, Yuyang Yan1, Xiaowei Chi5,
Yandong Guo2, Boxin Shi1, Shanghang Zhang1
1Peking University, 2AI2Robotics, 3Sun Yat-sen University, 4Wuhan University, 5HKUST

This repo is the official implementation of Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling.
Pipeline

News
- [2025.12.2] Code is coming soon! Stay tuned for updates.
Overview
Video2Act is a dual-system video diffusion policy framework that incorporates robotic spatio-motional modeling for advanced manipulation tasks. The framework integrates:
- Spatial and Motion Processing: spatial and motion feature extraction from a video diffusion model (VDM)
- Policy Learning: a diffusion-based transformer architecture for action generation
- RoboTwin Integration: evaluation on RoboTwin
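As a rough illustration of what diffusion-based action generation means in general, the toy sketch below denoises a random action chunk into a clean one, conditioned on a feature vector. It is not the Video2Act implementation: every name in it (`sample_actions`, `oracle_noise_predictor`, `target_actions`), the linear noise schedule, and the deterministic DDIM-style update are assumptions chosen for illustration, and the "denoiser" is a closed-form stand-in for what would be a learned transformer conditioned on VDM features.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (illustrative values)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def target_actions(features, horizon):
    """Hypothetical clean action chunk implied by the conditioning features (toy rule)."""
    return [[f * (k + 1) for f in features] for k in range(horizon)]


def oracle_noise_predictor(x_t, alpha_bar, features):
    """Stand-in for a learned denoiser: returns the exact noise that separates
    x_t from the toy target, i.e. eps = (x_t - sqrt(abar) * a0) / sqrt(1 - abar)."""
    a0 = target_actions(features, len(x_t))
    r, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[(x - r * a) / s for x, a in zip(x_row, a_row)]
            for x_row, a_row in zip(x_t, a0)]


def sample_actions(features, horizon, num_steps=50, seed=0):
    """Deterministic DDIM-style reverse process over an action chunk.
    In this toy, the action dimension equals len(features)."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    # Start from Gaussian noise with shape (horizon, action_dim).
    x = [[rng.gauss(0.0, 1.0) for _ in features] for _ in range(horizon)]
    for t in reversed(range(num_steps)):
        ab_t = alpha_bars[t]
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0  # no noise left after the last step
        eps = oracle_noise_predictor(x, ab_t, features)
        r_t, s_t = math.sqrt(ab_t), math.sqrt(1.0 - ab_t)
        r_p, s_p = math.sqrt(ab_prev), math.sqrt(1.0 - ab_prev)
        new_x = []
        for x_row, e_row in zip(x, eps):
            # Predict the clean action, then re-noise it to the previous noise level.
            x0_row = [(xv - s_t * ev) / r_t for xv, ev in zip(x_row, e_row)]
            new_x.append([r_p * x0 + s_p * ev for x0, ev in zip(x0_row, e_row)])
        x = new_x
    return x


if __name__ == "__main__":
    features = [0.1, -0.2, 0.3]  # placeholder conditioning vector
    for step, action in enumerate(sample_actions(features, horizon=4)):
        print(step, [round(a, 4) for a in action])
```

Because the stand-in predictor is exact, the loop recovers the toy target actions up to floating-point error. A real policy would replace `oracle_noise_predictor` with a trained network, and its outputs would only approximate the demonstrated actions.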
Getting Started
Code is coming soon!
We are currently finalizing the codebase and will release:
- Complete training and evaluation code
- RoboTwin evaluation tools
- Detailed documentation and tutorials
Citation
If you find this work useful, please consider citing:
@misc{jia2025video2actdualsystemvideodiffusion,
title={Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling},
author={Yueru Jia and Jiaming Liu and Shengbang Liu and Rui Zhou and Wanhe Yu and Yuyang Yan and Xiaowei Chi and Yandong Guo and Boxin Shi and Shanghang Zhang},
year={2025},
eprint={2512.03044},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2512.03044},
}
License
This project will be released under the MIT License.