September 22, 2025

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Xiaoyu Yue1,2 · ZiDong Wang3 · Yuqing Wang4 · Wenlong Zhang1 · Xihui Liu4 · Wanli Ouyang1,3 · Lei Bai1 · Luping Zhou2

1SH AI Lab   2USYD   3CUHK   4HKU

[arXiv]  [Model (Coming Soon)]  [Codes (Coming Soon)] 


Highlights

In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain.

  • We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency.
  • We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR).
  • We conduct comprehensive experiments to validate the design of each component of ST-AR. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
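
For readers unfamiliar with how such percentages are usually derived, the sketch below computes a relative FID improvement. It assumes the reported figures denote the relative reduction in FID with respect to the baseline model (lower FID is better); the helper name `relative_fid_improvement` and the numeric values are hypothetical placeholders for illustration, not results from the paper.

```python
def relative_fid_improvement(baseline_fid: float, new_fid: float) -> float:
    """Return the relative reduction in FID as a percentage.

    Assumes lower FID is better, so a positive result means the new
    model improves on the baseline.
    """
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return 100.0 * (baseline_fid - new_fid) / baseline_fid


# Hypothetical placeholder values, NOT numbers reported for ST-AR or LlamaGen.
example = relative_fid_improvement(baseline_fid=10.0, new_fid=5.8)
print(f"{example:.1f}% relative FID improvement")
```

Under this reading, a drop from a baseline FID of 10.0 to 5.8 would count as a 42% improvement.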

🚨 News

  • ST-AR has been accepted to NeurIPS 2025! 🍺

Setup, Training, and Sampling (Coming Soon)