VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation

May 20, 2025

This repository provides the official implementation of VTBench, a benchmark designed to evaluate the performance of visual tokenizers (VTs) in the context of autoregressive (AR) image generation. VTBench enables fine-grained analysis across three core tasks: image reconstruction, detail preservation, and text preservation, isolating the tokenizer's impact from the downstream generation model.

Our goal is to encourage the development of strong, general-purpose open-source visual tokenizers that can be reliably reused across autoregressive image generation and broader multimodal tasks.

🔥 News

๐Ÿ” Why VTBench?

Recent AR models such as GPT-4o demonstrate impressive image generation quality, which we hypothesize is made possible by a highly capable visual tokenizer. However, most existing VTs significantly lag behind continuous VAEs, leading to:

  • Poor reconstruction fidelity
  • Loss of structural and semantic detail
  • Failure to preserve symbolic information (e.g., text in multilingual images)

VTBench isolates and evaluates VT quality, independent of the downstream model, using standardized tasks and metrics.

Figure: Comparison of Different Models and Visual Tokenizers

✨ Features

  • Evaluation on three tasks:
    1. Image Reconstruction (ImageNet, High-Res, Varying-Res)
    2. Detail Preservation (patterns, fine textures)
    3. Text Preservation (posters, academic abstracts, multilingual scripts)
  • Supports VTs from models such as FlowMo, MaskBiT, OpenMagViT2, VAR, and BSQ-ViT (see the Model Zoo for the full list).
  • Includes baselines from continuous VAEs (e.g., SD3.5L, FLUX.1) and GPT-4o.
  • Metrics: PSNR, SSIM, LPIPS, FID, CER, WER
  • ✅ Automatic download of all datasets and models: no manual setup required.
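
To illustrate the simplest metric in the list above, PSNR follows directly from the mean squared error between an original image and its reconstruction. The sketch below is not VTBench's implementation: the function name `psnr`, the flattened-buffer input, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math
from typing import Sequence


def psnr(original: Sequence[int], reconstructed: Sequence[int],
         max_value: float = 255.0) -> float:
    """PSNR in dB between two equally sized, flattened pixel buffers."""
    if len(original) != len(reconstructed):
        raise ValueError("buffers must have the same length")
    if len(original) == 0:
        raise ValueError("buffers must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Usage with raw 8-bit buffers (hypothetical pixel values).
original = bytes([10, 20, 30, 40])
reconstructed = bytes([12, 18, 30, 41])
print(f"PSNR: {psnr(original, reconstructed):.2f} dB")
```

Higher values mean the reconstruction is closer to the original; identical buffers give infinity.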

Figure: Overview of VTBench

📑 Open-Source Plan

  • Huggingface Space Demo
  • VTBench arXiv Paper
  • Evaluation Code
  • Inference Code on Supported VTs
  • VTBench Dataset

🚀 Getting Started

1. Clone the repo

git clone https://github.com/huawei-lin/VTBench.git
cd VTBench

2. Install dependencies

conda create -n vtbench python=3.10
conda activate vtbench
pip install -r requirements.txt

3. Select a VT and Run Evaluation

✅ No Manual Downloads Needed
All datasets and models are automatically downloaded during runtime from Hugging Face. You can directly run experiments without manually downloading any files.

📦 Model Zoo

| Code Name | Display Name |
| --- | --- |
| bsqvit | BSQ-VIT |
| chameleon | Chameleon |
| FLUX.1-dev | FLUX.1-dev |
| flowmo_hi | FlowMo Hi |
| flowmo_lo | FlowMo Lo |
| gpt4o | GPT-4o |
| infinity_d32 | Infinity-d32 |
| infinity_d64 | Infinity-d64 |
| janus_pro_1b | Janus Pro 1B/7B |
| llamagen-ds8 | LlamaGen ds8 |
| llamagen-ds16 | LlamaGen ds16 |
| llamagen-ds16-t2i | LlamaGen ds16 T2I |
| maskbit_16bit | MaskBiT 16bit |
| maskbit_18bit | MaskBiT 18bit |
| open_magvit2 | OpenMagViT |
| SD3.5L | SD3.5L |
| titok_b64 | Titok-b64 |
| titok_bl128 | Titok-bl128 |
| titok_bl64 | Titok-bl64 |
| titok_l32 | Titok-l32 |
| titok_s128 | Titok-s128 |
| titok_sl256 | Titok-sl256 |
| var_256 | VAR-256 |
| var_512 | VAR-512 |

📚 Dataset

VTBench datasets are available on Hugging Face: https://huggingface.co/datasets/huaweilin/VTBench

| Dataset Name | Split Name |
| --- | --- |
| task1-imagenet | val |
| task1-high-resolution | test |
| task1-varying-resolution | test |
| task2-detail-preservation | test |
| task3-movie-posters | test |
| task3-arxiv-abstracts | test |
| task3-multilingual | Chinese, Hindi, Japanese, Korean |
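
The evaluation script downloads data on its own, but it can be handy to inspect a subset directly or to catch a mistyped dataset/split pair before launching a run. The sketch below is illustrative only: `VTBENCH_SPLITS`, `check_split`, and `load_vtbench` are hypothetical helpers, and the assumption that each dataset name is exposed as a configuration of `huaweilin/VTBench` on the Hugging Face Hub is not confirmed by this README.

```python
from typing import Dict, Tuple

# Dataset names and splits copied from the table above.
VTBENCH_SPLITS: Dict[str, Tuple[str, ...]] = {
    "task1-imagenet": ("val",),
    "task1-high-resolution": ("test",),
    "task1-varying-resolution": ("test",),
    "task2-detail-preservation": ("test",),
    "task3-movie-posters": ("test",),
    "task3-arxiv-abstracts": ("test",),
    "task3-multilingual": ("Chinese", "Hindi", "Japanese", "Korean"),
}


def check_split(dataset_name: str, split_name: str) -> None:
    """Raise ValueError if the dataset/split pair is not listed in the table."""
    if dataset_name not in VTBENCH_SPLITS:
        raise ValueError(
            f"unknown dataset {dataset_name!r}; expected one of {sorted(VTBENCH_SPLITS)}"
        )
    if split_name not in VTBENCH_SPLITS[dataset_name]:
        raise ValueError(
            f"unknown split {split_name!r} for {dataset_name!r}; "
            f"expected one of {VTBENCH_SPLITS[dataset_name]}"
        )


def load_vtbench(dataset_name: str, split_name: str):
    """Load one subset from the Hub (assumes dataset names are Hub configurations)."""
    check_split(dataset_name, split_name)
    from datasets import load_dataset  # requires `pip install datasets`

    return load_dataset("huaweilin/VTBench", dataset_name, split=split_name)
```

For example, `check_split("task3-multilingual", "Korean")` passes silently, while `check_split("task1-imagenet", "test")` raises a `ValueError`.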

Run an experiment:

accelerate launch --num_processes=1 main.py \
    --model_name chameleon \
    --dataset_name task3-movie-posters \
    --split_name test \
    --output_dir results \
    --batch_size 4

The script will create the following directory:

results/
├── original_images/
├── reconstructed_images/
└── results/
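
To evaluate several tokenizers or datasets in one go, the launch command above can be generated programmatically. This is a minimal sketch, not part of the repository: the helper `build_command` and the per-run output layout `results/<model>/<dataset>` are assumptions for illustration, and only the flags themselves come from the command shown above.

```python
import shlex
from itertools import product
from typing import List


def build_command(model_name: str, dataset_name: str, split_name: str,
                  output_dir: str = "results", batch_size: int = 4) -> List[str]:
    """Assemble the `accelerate launch` invocation as an argument list."""
    return [
        "accelerate", "launch", "--num_processes=1", "main.py",
        "--model_name", model_name,
        "--dataset_name", dataset_name,
        "--split_name", split_name,
        "--output_dir", output_dir,
        "--batch_size", str(batch_size),
    ]


# Code names and dataset/split pairs taken from the Model Zoo and Dataset tables.
models = ["chameleon", "var_256"]
datasets = [("task3-movie-posters", "test"), ("task1-imagenet", "val")]

commands = [
    build_command(m, d, s, output_dir=f"results/{m}/{d}")  # hypothetical layout
    for m, (d, s) in product(models, datasets)
]

for cmd in commands:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping each run in its own output directory avoids mixing reconstructions from different tokenizers when the evaluation step is run afterwards.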

📊 Evaluate results

python ./evaluations/evaluate_images.py \
    --original_dir results/original_images \
    --reconstructed_dir results/reconstructed_images/ \
    --metrics fid ssim psnr lpips cer wer \
    --batch_size 16 \
    --num_workers 8

โ„น๏ธ Note: cer and wer are only available in text-based reconstruction tasks.

GPT-4o Support

To use GPT-4o for generation, set your OpenAI API key in the environment first:

export OPENAI_API_KEY="<your_openai_key>"  # replace the placeholder with your actual key

๐Ÿ› ๏ธ Automation

We provide automation scripts in the examples/ directory. Simply run:

bash ./examples/run.sh

For SLURM users, adapt examples/submit.sh to your cluster and uncomment the SLURM section in examples/run.sh.

Citation

If you find this project useful, please consider citing:

@article{vtbench,
  author       = {Huawei Lin and
                  Tong Geng and
                  Zhaozhuo Xu and
                  Weijie Zhao},
  title        = {VTBench: Evaluating Visual Tokenizers for Autoregressive Image Generation},
  journal      = {arXiv preprint arXiv:2502.01634},
  year         = {2025}
}