Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling

December 3, 2025

Yueru Jia1,2*, Jiaming Liu1,2*, Shengbang Liu3*, Rui Zhou4, Wanhe Yu1, Yuyang Yan1, Xiaowei Chi5,
Yandong Guo2, Boxin Shi1, Shanghang Zhang1

1Peking University, 2AI2Robotics, 3Sun Yat-sen University, 4Wuhan University, 5HKUST

Overview

This repo is the official implementation of Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling.

Pipeline

[Figure: Video2Act pipeline]

News

  • [2025.12.2] Code is coming soon! Stay tuned for updates.

Overview

Video2Act is a dual-system video diffusion policy framework that applies robotic spatio-motional modeling to advanced manipulation tasks. The framework integrates:

  • Spatial and Motion Processing: extraction of spatial and motion features from a video diffusion model (VDM)
  • Policy Learning: a diffusion-based transformer architecture for action generation
  • RoboTwin Integration: evaluation on RoboTwin
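
Until the official code is released, the sketch below illustrates only the general idea of generating a short action sequence by diffusion sampling conditioned on visual features. It is not the authors' implementation. `toy_features` (simple frame statistics standing in for VDM features), `toy_noise_predictor` (a fixed random projection standing in for a trained diffusion transformer), and all shapes and hyperparameters (8-step horizon, 7-dimensional actions, 50 denoising steps) are assumptions made for illustration. The sampling loop uses the standard DDPM reverse update.

```python
import numpy as np


def toy_features(frames):
    """Hypothetical stand-in for VDM feature extraction.

    frames: float array of shape (T, H, W, C) with T >= 2.
    Returns a conditioning vector of shape (2 * C,).
    """
    # "Spatial" feature: per-channel mean of the most recent frame.
    spatial = frames[-1].mean(axis=(0, 1))
    # "Motion" feature: per-channel mean of frame-to-frame differences.
    motion = np.diff(frames, axis=0).mean(axis=(0, 1, 2))
    return np.concatenate([spatial, motion])


def toy_noise_predictor(x, t, cond, weights):
    """Hypothetical placeholder for a trained noise-prediction network.

    x: noisy actions (horizon, action_dim); cond: (D,); weights: (D, action_dim).
    """
    bias = cond @ weights  # project the conditioning vector to action space
    return np.tanh(x + bias + 0.01 * t)


def sample_actions(cond, horizon=8, action_dim=7, num_steps=50, seed=0):
    """DDPM-style reverse sampling of an action sequence, conditioned on cond."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # Fixed random weights: an untrained stand-in, so outputs are not meaningful.
    weights = 0.1 * rng.standard_normal((cond.shape[0], action_dim))

    x = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, t, cond, weights)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    frames = np.random.default_rng(1).random((4, 16, 16, 3))  # dummy video clip
    actions = sample_actions(toy_features(frames))
    print(actions.shape)
```

Because the noise predictor here is untrained, the sampled values carry no meaning; the example only shows how conditioning features and the denoising loop fit together.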

Getting Started

Code is coming soon!

We are currently finalizing the codebase and will release:

  • Complete training and evaluation code
  • RoboTwin evaluation tools
  • Detailed documentation and tutorials

Citation

If you find this work useful, please consider citing:

@misc{jia2025video2actdualsystemvideodiffusion,
      title={Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling}, 
      author={Yueru Jia and Jiaming Liu and Shengbang Liu and Rui Zhou and Wanhe Yu and Yuyang Yan and Xiaowei Chi and Yandong Guo and Boxin Shi and Shanghang Zhang},
      year={2025},
      eprint={2512.03044},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.03044}, 
}

License

This project will be released under the MIT License.