September 22, 2025
Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Xiaoyu Yue1,2 · ZiDong Wang3 · Yuqinng Wang4 · Wenlong Zhang1 · Xihui Liu4 · Wanli Ouyang1,3 · Lei Bai1 · Luping Zhou2
1SH AI Lab 2USYD 3CUHK 4HKU
[arXiv] [Model (Coming Soon)] [Codes (Coming Soon)]
Highlights
In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain.
- We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency.

- We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR).

- We conduct comprehensive experiments to validate the design of each component of ST-AR. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
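
The percentages above are most naturally read as relative reductions in FID, where lower is better. The sketch below shows that reading under stated assumptions: the function name `relative_fid_improvement` is hypothetical, and the FID values in the usage line are placeholders chosen to illustrate the arithmetic, not results from the paper.

```python
def relative_fid_improvement(baseline_fid: float, new_fid: float) -> float:
    """Return the relative FID reduction as a fraction (0.42 means 42%).

    Assumes "improvement" means (baseline - new) / baseline, since lower
    FID is better. This definition is an assumption, not taken from the paper.
    """
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return (baseline_fid - new_fid) / baseline_fid


# Placeholder values for illustration only (not reported numbers).
improvement = relative_fid_improvement(baseline_fid=10.0, new_fid=5.8)
print(f"{improvement:.0%}")
```

Under this reading, a 42% improvement means the new FID is roughly 0.58 times the baseline FID.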

🚨 News
- ST-AR has been accepted to NeurIPS 2025! 🍺