February 28, 2025
# FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Siyu Jiao¹, Gengwei Zhang², Yinlong Qian³, Jiancheng Huang³, Yao Zhao¹, Humphrey Shi⁴, Lin Ma³, Yunchao Wei¹, Zequn Jie³

¹ BJTU, ² UTS, ³ Meituan, ⁴ Georgia Tech
## Introduction
This work challenges the residual-prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR performs autoregressive learning with ground-truth prediction, enabling each step to independently produce a plausible image. This simple, intuitive approach learns visual distributions quickly and makes the generation process more flexible and adaptable. Although trained solely on low-resolution images (≤ 256px), FlexVAR can:

1. Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images.
2. Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion.
3. Adapt to various autoregressive steps, allowing faster inference with fewer steps or enhanced image quality with more steps.

Our 1.0B model outperforms its VAR counterpart on the ImageNet 256x256 benchmark. Moreover, when the generation process is zero-shot transferred to 13 steps, performance further improves to 2.08 FID, outperforming the state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and the popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512x512 benchmark in a zero-shot manner, FlexVAR achieves results competitive with the VAR 2.3B model, a fully supervised model trained at 512x512 resolution.
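The shift away from residual prediction can be sketched with a toy example: in VAR, each scale's training target is the residual still missing from the running reconstruction, so intermediate steps are not images on their own; in FlexVAR, each scale's target is the ground-truth image itself at that scale, so any prefix of steps already decodes to a plausible image. A minimal numpy sketch, where a nearest-neighbor `resize` stands in for the actual multi-scale tokenizer:

```python
import numpy as np

def resize(img, size):
    """Nearest-neighbor resize; illustrative stand-in for the multi-scale tokenizer."""
    idx = np.arange(size) * img.shape[0] // size
    return img[np.ix_(idx, idx)]

rng = np.random.default_rng(0)
img = rng.random((8, 8))          # toy single-channel "image"
scales = [1, 2, 4, 8]             # coarse-to-fine autoregressive steps

# FlexVAR-style targets: the ground-truth image at every scale,
# so stopping after any step still yields an image.
flex_targets = [resize(img, s) for s in scales]

# VAR-style residual targets: each step predicts only what the running
# reconstruction is still missing at the next scale.
var_targets, recon = [], np.zeros((1, 1))
for s in scales:
    gt = resize(img, s)
    residual = gt - resize(recon, s)
    var_targets.append(residual)
    recon = resize(recon, s) + residual  # only complete after all steps
```

Note how `flex_targets[0]` is already a (1, 1) thumbnail of the image, while the VAR residuals only sum to the image at the final scale.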
## Installation
- Install `torch>=2.0.0`.
- Install other pip packages via `pip3 install -r requirements.txt`.
- Prepare the ImageNet dataset. Assume ImageNet is in `/path/to/imagenet`; it should look like this:

  ```
  /path/to/imagenet/
    train/
      n01440764/
        many_images.JPEG ...
      n01443537/
        many_images.JPEG ...
    val/
      n01440764/
        ILSVRC2012_val_00000293.JPEG ...
      n01443537/
        ILSVRC2012_val_00000236.JPEG ...
  ```

  NOTE: The arg `--data_path=/path/to/imagenet` should be passed to the training script.
- (Optional) Install and compile `flash-attn` and `xformers` for faster attention computation.
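Before launching training, it can save time to verify that the directory skeleton matches what `--data_path` expects. A small sketch (the `check_imagenet_layout` helper is illustrative, not part of this repo):

```python
import os
import tempfile

def check_imagenet_layout(root):
    """Sanity-check the folder layout expected by --data_path before training.
    (Illustrative helper, not part of this repo.)"""
    problems = []
    for split in ("train", "val"):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            problems.append(f"missing {split}/ directory")
            continue
        has_classes = any(os.path.isdir(os.path.join(split_dir, d))
                          for d in os.listdir(split_dir))
        if not has_classes:
            problems.append(f"{split}/ has no class subfolders (e.g. n01440764)")
    return problems

# demo on a temporary skeleton
root = tempfile.mkdtemp()
missing = check_imagenet_layout(root)          # both splits absent
os.makedirs(os.path.join(root, "train", "n01440764"))
os.makedirs(os.path.join(root, "val", "n01440764"))
ok = check_imagenet_layout(root)               # layout now valid
```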
## VAR zoo
- **VQVAE Tokenizer**

  You need to download `FlexVAE.pth` first.
- **FlexVAR-Transformer**

  | Model | FID | IS | Step | Weights |
  |-------|-----|-----|------|---------|
  | d16 | 3.05 | 291.3 | 10 | FlexVARd16-epo179.pth |
  | d20 | 2.41 | 299.3 | 10 | FlexVARd20-epo249.pth |
  | d24 | 2.21 | 299.1 | 10 | FlexVARd24-epo349.pth |
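For scripting, the checkpoint table above can be kept as a small lookup; FID improves with transformer depth, so capping the depth trades quality for memory. The `pick_checkpoint` helper below is illustrative, not part of this repo:

```python
# The zoo above as a lookup table (numbers copied from the README).
FLEXVAR_ZOO = {
    16: {"fid": 3.05, "is": 291.3, "steps": 10, "weights": "FlexVARd16-epo179.pth"},
    20: {"fid": 2.41, "is": 299.3, "steps": 10, "weights": "FlexVARd20-epo249.pth"},
    24: {"fid": 2.21, "is": 299.1, "steps": 10, "weights": "FlexVARd24-epo349.pth"},
}

def pick_checkpoint(max_depth=None):
    """Return (depth, weights file) of the lowest-FID entry, optionally
    capped at a transformer depth for memory-constrained setups."""
    ok = {d: m for d, m in FLEXVAR_ZOO.items()
          if max_depth is None or d <= max_depth}
    best = min(ok, key=lambda d: ok[d]["fid"])
    return best, ok[best]["weights"]
```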
## Evaluation
- **256x256 (default)**

  For FID evaluation, use `var.autoregressive_infer_cfg` to sample 50,000 images (50 per class) and save them as PNG (not JPEG) files in a folder. Pack them into a `.npz` file, then use OpenAI's FID evaluation toolkit and the 256x256 reference ground-truth `.npz` file to evaluate FID, IS, precision, and recall. See Evaluation for details.

  For example, to evaluate our pre-trained `FlexVARd24-epo349.pth` model:

  ```shell
  # 1. Download FlexVARd24-epo349.pth.
  # 2. Put it at `pretrained/FlexVARd24-epo349.pth`.
  # 3. Run the evaluation:
  args_infer_patch_nums="1_2_3_4_5_7_10_13_16"
  torchrun --nnodes=1 --nproc_per_node=2 --node_rank=0 eval_c2i.py \
    --batch_size 16 --cfg 2.5 --top_k 900 \
    --maxpn 16 --infer_patch_nums $args_infer_patch_nums \
    --depth 24
  ```
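OpenAI's evaluation toolkit consumes the samples as a single array stored inside the `.npz`. A minimal numpy sketch of the packing step (`pack_samples` is an illustrative helper; decoding the 50,000 PNGs, e.g. with PIL, is elided, and the expected array key/shape should be checked against the toolkit's docs):

```python
import os
import tempfile
import numpy as np

def pack_samples(images, out_path):
    """Stack decoded uint8 RGB samples (each HxWx3) into one .npz file.
    (Illustrative helper, not part of this repo.)"""
    arr = np.stack(images).astype(np.uint8)
    assert arr.ndim == 4 and arr.shape[-1] == 3, "expected (N, H, W, 3) RGB"
    np.savez(out_path, arr)  # np.savez stores a positional array under 'arr_0'

# demo with random stand-in "samples"
out = os.path.join(tempfile.mkdtemp(), "samples.npz")
demo = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
        for _ in range(4)]
pack_samples(demo, out)
loaded = np.load(out)["arr_0"]
```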
- **Zero-shot transfer with 13 steps**

  Modify `args_infer_patch_nums` to a 13-step schedule (adding scales between 8 and 14):

  ```shell
  args_infer_patch_nums="1_2_3_4_5_6_7_8_9_10_12_14_16"
  torchrun --nnodes=1 --nproc_per_node=2 --node_rank=0 eval_c2i.py \
    --batch_size 16 --cfg 2.5 --top_k 900 \
    --maxpn 16 --infer_patch_nums $args_infer_patch_nums \
    --depth 20
  ```
- **Zero-shot transfer to 512x512**

  Use the 512x512 reference ground-truth `.npz` file:

  ```shell
  args_infer_patch_nums="1_2_3_4_5_6_7_8_9_10_12_14_16_23_32"
  torchrun --nnodes=1 --nproc_per_node=2 --node_rank=0 eval_c2i.py \
    --batch_size 16 --cfg 3.0 --top_k 900 \
    --maxpn 16 --infer_patch_nums $args_infer_patch_nums \
    --depth 20
  ```
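The three evaluation modes differ only in the `args_infer_patch_nums` string. Assuming each scale `p` contributes a p×p token map (as in VAR-style next-scale prediction), a small helper can compare the schedules' step counts and token budgets:

```python
def schedule_stats(patch_nums):
    """Steps, total token budget, and final token map of an
    `args_infer_patch_nums` schedule, assuming each scale p
    contributes a p x p token map (as in VAR)."""
    pns = [int(p) for p in patch_nums.split("_")]
    return {"steps": len(pns),
            "tokens": sum(p * p for p in pns),
            "final": f"{pns[-1]}x{pns[-1]}"}

for name, sched in [
    ("256px default  ", "1_2_3_4_5_7_10_13_16"),
    ("256px 13-step  ", "1_2_3_4_5_6_7_8_9_10_12_14_16"),
    ("512px zero-shot", "1_2_3_4_5_6_7_8_9_10_12_14_16_23_32"),
]:
    print(name, schedule_stats(sched))
```

The 512px schedule simply appends two larger scales (23 and 32) to the 13-step schedule, which is what makes the zero-shot resolution transfer possible without retraining.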