README.md

July 24, 2025

This is the official implementation of the ICCV 2025 paper "Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation".

Status Checklist

Code & checkpoint upload completed
FlexAttention finetuning option enabled
Infinity-8B checkpoint finetuning enabled

We introduce a method for fine-tuning visual autoregressive (VAR) models tailored for subject-driven generation tasks. Our approach efficiently customizes VAR models, enabling high-quality personalized image generation.

Hardware Requirements

Our experiments were conducted on an NVIDIA A6000 GPU. Please ensure your hardware meets the following minimum specification:

GPU Memory: ≥ 40GB
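As a quick sanity check before training, the requirement above can be compared against what `nvidia-smi` reports. The snippet below is only a sketch: `meets_vram_requirement` is a hypothetical helper that is not part of this repository, and the 40000 MiB default is an assumed reading of "≥ 40GB" that you may want to adjust.

```shell
# Hypothetical helper (not part of the repository): succeed if a GPU's total
# memory, given in MiB, is at least the required minimum.
# The 40000 MiB default is an assumed interpretation of the "≥ 40GB" requirement.
meets_vram_requirement() {
  [ "$1" -ge "${2:-40000}" ]
}

# Example usage on a machine with the NVIDIA driver installed
# (queries the first GPU only):
#   total_mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
#   meets_vram_requirement "$total_mib" && echo "GPU memory OK"
```

The threshold is passed as an optional second argument so the same check can be reused if you tune memory usage (for example with a smaller batch size).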

Installation

Option 1: Manual Setup

Clone the repository and install dependencies:

git clone https://github.com/jiwoogit/ARBooth.git
cd ARBooth
pip install -r requirements.txt

Option 2: Docker Setup

Use our pre-configured Docker image:

docker pull wldn0202/arbooth:latest

Docker Hub repository: wldn0202/arbooth

Pretrained Checkpoints

Please download the official pretrained VAR checkpoints from Infinity's repository and organize them as follows:

weights/
├── infinity_2b_reg.pth
└── infinity_vae_d32_reg.pth
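To confirm the layout above before launching a script, a small check along these lines may help. It is a sketch only: `check_weights` is a hypothetical helper, not something shipped with the repository, and it verifies only that the two listed files exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of the repository): verify that both
# pretrained files listed above are present in the given directory.
# Defaults to ./weights; returns non-zero if any file is missing.
check_weights() {
  cw_dir="${1:-weights}"
  cw_missing=0
  for cw_file in infinity_2b_reg.pth infinity_vae_d32_reg.pth; do
    if [ ! -f "$cw_dir/$cw_file" ]; then
      echo "missing: $cw_dir/$cw_file" >&2
      cw_missing=1
    fi
  done
  return "$cw_missing"
}

# Example usage from the repository root:
#   check_weights weights && echo "All pretrained checkpoints found"
```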

You can download our fine-tuned checkpoints from Hugging Face (wldn0202/ARBooth).

Data Preprocessing

We adopt the preprocessing pipeline of DreamMatcher. Please follow their instructions for detailed steps, or refer to the inputs directory.

Training

Customize training parameters by modifying exp_name and cls_name in the provided script:

bash scripts/train_arbooth.sh

All training results and logs will be saved under the LOCAL_OUT directory.

For detailed configuration options and parameters for fine-tuning, please refer to infinity/utils/arg_util.py.

Evaluation

We evaluate performance using four metrics: DINO, CLIP, PRES, and DIV. Update the paths in scripts/eval_arbooth.sh to match your training setup:

bash scripts/eval_arbooth.sh

Inference

Generate images using your custom prompts with the fine-tuned checkpoints:

bash scripts/infer_arbooth.sh

Fine-tuning Tips

Iteration Settings:
- For the 2-batch configuration, we recommend 500 iterations
- For the 1-batch configuration, we recommend 100-150 iterations
- Adjust these values based on your specific input data and requirements
Class Prompt Selection:
- The choice of class prompt (e.g., "dog", "cat") significantly impacts the final generation quality
- Use general, broad category nouns for optimal results

Acknowledgements

This repository is built upon the following projects:

We sincerely appreciate their invaluable contributions.

Citation

If our paper or repository assists your research, kindly cite us:

@article{chung2025fine,
  title={Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation},
  author={Chung, Jiwoo and Hyun, Sangeek and Kim, Hyunjun and Koh, Eunseo and Lee, MinKyu and Heo, Jae-Pil},
  journal={arXiv preprint arXiv:2504.02612},
  year={2025}
}

Contact

For any questions, please reach out to:

Jiwoo Chung (jiwoo.jg@gmail.com)