Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild

March 5, 2026

Hao Luo¹,³, Ye Wang²,³, Wanpeng Zhang¹,³, Haoqi Yuan¹,³, Yicheng Feng¹,³, Haiweng Xu³,
Sipeng Zheng³, Zongqing Lu¹,³†

¹Peking University ²Renmin University of China ³BeingBeyond

JALA is a Transformer-based VLA pretraining framework that turns large-scale human manipulation videos into action-centric supervision without pixel-level reconstruction, bridging lab-annotated motion data and in-the-wild diversity via Joint Alignment.

News

[2026-02-28]: JALA accepted to CVPR 2026. Project page is live.

Citation

If you find our work useful, please consider citing us and giving our repository a star! 🌟🌟🌟

@inproceedings{luo2026jointalignedlatentaction,
  title={Joint-Aligned Latent Action: Towards Scalable VLA Pretraining in the Wild},
  author={Hao Luo and Ye Wang and Wanpeng Zhang and Haoqi Yuan and Yicheng Feng and Haiweng Xu and Sipeng Zheng and Zongqing Lu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}