
September 22, 2025

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Xiaoyu Yue¹,² · ZiDong Wang³ · Yuqing Wang⁴ · Wenlong Zhang¹ · Xihui Liu⁴ · Wanli Ouyang¹,³ · Lei Bai¹ · Luping Zhou²

¹SH AI Lab · ²USYD · ³CUHK · ⁴HKU

[arXiv] [Model (Coming Soon)] [Codes (Coming Soon)]

Highlights

In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain.

We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency.
We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR).
We conduct comprehensive experiments to validate the design of each component of ST-AR. Specifically, ST-AR brings an approximately 42% FID improvement for LlamaGen-L and a 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.

🚨 News

ST-AR has been accepted to NeurIPS 2025! 🍺