February 19, 2025
Scale-wise Text-conditioned AutoRegressive image generation
Important: The weights and code for STAR are now available in a new repository.
News
- [2025-02] We have released the official codebase and weights on Hugging Face!
- [2024-06] The STAR technical report is released.
Introduction
STAR, the first scale-wise text-to-image model based on VAR, supports resolutions from 256×256 to 1024×1024.
By incorporating text conditioning, normalized 2D RoPE, and causal-driven stable sampling, STAR outperforms existing models in fidelity, consistency, and quality, and generates faster, taking 2.21s for 1024×1024 images on an A100.
Detailed Introduction & Architecture
Unlike VAR, which focuses on toy category-based auto-regressive generation of 256×256 images, STAR explores the potential of this scale-wise auto-regressive paradigm in real-world scenarios, aiming to make AR as effective as diffusion models. To achieve this, we:
+ replace the single category token with a text encoder and cross-attention for detailed text guidance;
+ introduce cross-scale normalized RoPE to stabilize structural learning and reduce training costs, unleashing the power of high-resolution training;
+ propose a new sampling method to overcome the intrinsic simultaneous sampling issue in AR models.

While these approaches have been (partially) explored in diffusion models, we are the first to validate and apply them in auto-regressive image generation, resulting in high-resolution, text-conditioned synthesis with performance comparable to Stable Diffusion 2.
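To make the idea of cross-scale normalized 2D RoPE more concrete, here is a minimal sketch in plain Python. It is not taken from the STAR codebase. The function names, the shared coordinate range `ref=16.0`, the half-cell offset, and the choice to spend half of the channels on the row axis and half on the column axis are all assumptions made for illustration. The point it shows is that token maps of different sizes are mapped onto one common coordinate range before the rotary angles are computed, so tokens at the same relative location in different scales receive similar rotations.

```python
import math


def normalized_grid(h, w, ref=16.0):
    """(y, x) positions for an h-by-w token map, rescaled into [0, ref).

    Hypothetical helper: every scale is mapped to the same coordinate range,
    using cell centers, so positions are comparable across scales.
    """
    return [((i + 0.5) * ref / h, (j + 0.5) * ref / w)
            for i in range(h) for j in range(w)]


def rope_angles(pos, dim, base=10000.0):
    """Standard RoPE angles for one axis; `dim` is the channel count for that axis."""
    return [pos * base ** (-2 * k / dim) for k in range(dim // 2)]


def rotate_pairs(vec, angles):
    """Rotate consecutive channel pairs of `vec` by the given angles."""
    out = []
    for k, a in enumerate(angles):
        x0, x1 = vec[2 * k], vec[2 * k + 1]
        out.append(x0 * math.cos(a) - x1 * math.sin(a))
        out.append(x0 * math.sin(a) + x1 * math.cos(a))
    return out


def apply_rope_2d(vec, y, x):
    """Apply 2D RoPE: first half of the channels encodes y, second half encodes x.

    Assumes len(vec) is divisible by 4.
    """
    half = len(vec) // 2
    return (rotate_pairs(vec[:half], rope_angles(y, half))
            + rotate_pairs(vec[half:], rope_angles(x, half)))


if __name__ == "__main__":
    query = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
    for side in (1, 2, 4):  # three hypothetical scales
        y, x = normalized_grid(side, side)[0]  # top-left token of each scale
        print(side, (y, x), apply_rope_2d(query, y, x)[:2])
```

With these assumptions, the single token of a 1×1 map sits at the center of the shared range, while the top-left token of a 2×2 map sits a quarter of the way in along each axis, regardless of how many tokens the final scale contains.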
framework of STAR
Quantitative Performance
Per-category FID on MJHQ-30K
Efficiency & CLIP-Score of 1024×1024 generation
Qualitative Performance
Reproduction
See the repo for details.
Citation
Thanks to the developers of Visual Autoregressive Modeling for their excellent work. Our code is adapted from VAR. If our work assists your research, feel free to give us a star ⭐ or cite us using:
@article{ma2024star,
title={STAR: Scale-wise Text-conditioned AutoRegressive image generation},
author={Xiaoxiao Ma and Mohan Zhou and Tao Liang and Yalong Bai and Tiejun Zhao and Biye Li and Huaian Chen and Yi Jin},
journal={arXiv preprint arXiv:2406.10797},
year={2024}
}