Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

March 5, 2026

Hao Luo1,3, Ye Wang2,3, Wanpeng Zhang1,3, Haoqi Yuan1,3, Yicheng Feng1,3, Haiweng Xu3,
Sipeng Zheng3, Zongqing Lu1,3†

1Peking University    2Renmin University of China    3BeingBeyond

JALA is a Transformer-based vision-language-action (VLA) pretraining framework that turns large-scale human manipulation videos into action-centric supervision without pixel-level reconstruction. Through Joint Alignment, it bridges lab-annotated motion data and in-the-wild diversity.

News

  • [2026-02-28]: JALA accepted to CVPR 2026. Project page is live.

Citation

If you find our work useful, please consider citing us and giving our repository a star! 🌟🌟🌟

@inproceedings{luo2026jointalignedlatentaction,
  title={Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild},
  author={Hao Luo and Ye Wang and Wanpeng Zhang and Haoqi Yuan and Yicheng Feng and Haiweng Xu and Sipeng Zheng and Zongqing Lu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}