Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

June 11, 2025 · View on GitHub

Paper | Page

Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, and Lizhuang Ma

Overview

This repository contains the implementation for IAR pipeline, including training, sampling, and evaluation components.

Prerequisites

Before getting started, ensure you have:

Python ≥ 3.7
PyTorch ≥ 2.1
Access to ImageNet or similar dataset for training

Setup

Download pretrained weights of VQGAN from LlamaGen: https://huggingface.co/FoundationVision/LlamaGen/resolve/main/vq_ds16_c2i.pt

Training Pipeline

0. Data Preparation

First extract codes from your training images:

torchrun \
--nnodes=1 --nproc_per_node=8 --node_rank=0 \
--master_port=12345 \
autoregressive/train/extract_codes_c2i.py \
--vq-ckpt ./pretrained_models/vq_ds16_c2i.pt \
--data-path /path/to/imagenet/train \
--code-path /path/to/output/codes \
--ten-crop \
--crop-range 1.1 \
--image-size 384

1. Preprocess VQ Checkpoint

Reorder the codebook with balanced k-means algorithm. It will output the reordered codebook for inference and a mapping function for training.

python balance_k_means.py

2. Model Training

Train the autoregressive model:

PYTHONPATH=$PYTHONPATH:./ torchrun \
--nnodes=1 --nproc_per_node=8 --node_rank=0 \
--master_port=12345 \
autoregressive/train/train_c2i.py \
--results-dir ./results \
--code-path /path/to/output/codes \
--image-size 384 \
--gpt-model GPT-B

Sampling

Generate Images

Generate new images using a trained model:

PYTHONPATH=$PYTHONPATH:./ torchrun \
--nnodes=1 --nproc_per_node=8 --node_rank=0 \
--master_port=12345 \
autoregressive/sample/sample_c2i_ddp.py \
--vq-ckpt ./pretrained_models/vq_ds16_c2i-reorder-kmeans+nearest-cluster_size=128.pt \
--gpt-ckpt results/your_model_directory/checkpoints \
--gpt-model GPT-B \
--image-size 384 \
--image-size-eval 256 \
--cfg-scale 2.25 \
--num-fid-samples=50000 \
--sample-dir=samples

Evaluation

Before evaluation, install required packages as specified in evaluations/README.md.

Evaluate generated samples:

python3 evaluations/c2i/evaluator.py \
evaluations/VIRTUAL_imagenet256_labeled.npz \
samples/your_generated_samples.npz

Citation

If you find this code helpful for your research, please cite:

@inproceedings{hu2025improving,
     title={Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction},
     author={Hu, Teng and Zhang, Jiangning and Yi, Ran and Weng, Jieyu and Wang, Yabiao and Zeng, Xianfang and Xue, Zhucun and Ma, Lizhuang},
     booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
     year={2025}
}