
February 19, 2025

OPEN-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation

Zhuoyan Luo*, Fengyuan Shi*, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
ARC Lab Tencent PCG, Tsinghua University, Nanjing University

This is the official repository for Open-MAGVIT2, an open-source project that re-implements Google's MAGVIT-v2 tokenizer and aims to democratize autoregressive visual generation with a super-large vocabulary (i.e., $2^{18}$).

Highlights

  • 🚀 Super-large Codebook: Re-implements the advanced Lookup-Free Quantizer proposed by MAGVIT-v2, and achieves a super-large codebook (i.e., $2^{18}$) with strong performance (1.17 rFID).
  • 💡 Auto-Regressive Innovation: Introduces asymmetric token factorization and the next sub-token prediction paradigm, enabling efficient generation with a super-large vocabulary and enhanced sub-token interactions.
  • 🚀 Scalability: Validates the scalability of plain auto-regressive models across various parameter sizes (300M to 1.5B).
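As a rough illustration of why a $2^{18}$ vocabulary needs no stored codebook and how it can be split into sub-tokens, the shell sketch below packs 18 binary latent signs into one index and then factorizes that index into two unequal parts. The bit pattern, the most-significant-bit-first ordering, and the 6-bit/12-bit split are assumptions chosen for illustration; this README does not state them.

```shell
# Sketch only: bit pattern, bit order, and the 6/12 split are illustrative assumptions.
bits=18
vocab=$(( 1 << bits ))            # 2^18 = 262144 possible token indices

# Lookup-free quantization keeps only the sign of each latent dimension
# (1 = positive, 0 = otherwise), so the token index is just the packed bits.
token=0
for bit in 1 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1; do
  token=$(( (token << 1) | bit ))
done

# Asymmetric factorization: split the index into a short and a long sub-token.
hi_bits=6
lo_bits=$(( bits - hi_bits ))
hi=$(( token >> lo_bits ))
lo=$(( token & ((1 << lo_bits) - 1) ))
recombined=$(( (hi << lo_bits) | lo ))

# Under this split, two sub-vocabularies stand in for the full vocabulary.
factorized=$(( (1 << hi_bits) + (1 << lo_bits) ))

echo "vocab=$vocab token=$token hi=$hi lo=$lo recombined=$recombined"
echo "classes to predict: full=$vocab factorized=$factorized"
```

The point of the split is visible in the last line: predicting the two sub-tokens involves far fewer output classes than predicting one index over the whole vocabulary, while `recombined` shows that no information is lost.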

This repository provides the scripts and checkpoints to replicate our results.

🎤 Features

  • A series of visual tokenizers: (1) image tokenizers for class-conditional image generation ($8\times$ and $16\times$ downsampling rates with a $2^{18}$ codebook size), (2) image tokenizers for text-conditional image generation ($2^{14}$ and $2^{18}$ codebook sizes with a $16\times$ downsampling rate), and (3) a video tokenizer ($2^{18}$ codebook size with a $4 \times 8 \times 8$ downsampling rate).
  • A family of autoregressive models ranging from 300M to 1.5B parameters for class-conditional image generation.

🤗 Open-MAGVIT2 is still under active development. Stay tuned for updates!


🔥 Quick Start

Class Conditional Image Generation

Stage I: Training of Visual Tokenizer

🚀 Training Scripts

bash scripts/train_tokenizer/Open-MAGVIT2/run_128_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
bash scripts/train_tokenizer/Open-MAGVIT2/run_256_L.sh MASTER_ADDR MASTER_PORT NODE_RANK

🚀 Evaluation Scripts

bash scripts/evaluation/evaluation_128.sh
bash scripts/evaluation/evaluation_256.sh
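The training commands above take three positional arguments. As a hedged example, a single-machine launch might look like the following. The address, port, and rank values are placeholders, and how the script consumes them (presumably as the usual distributed-training rendezvous settings) is an assumption that this README does not confirm.

```shell
# Placeholder values for a single-node run; adjust for a real cluster.
MASTER_ADDR=127.0.0.1   # assumed: address of the rank-0 node
MASTER_PORT=29500       # assumed: any free TCP port on that node
NODE_RANK=0             # assumed: index of this machine (0 on a single node)

script=scripts/train_tokenizer/Open-MAGVIT2/run_128_L.sh
cmd="bash $script $MASTER_ADDR $MASTER_PORT $NODE_RANK"
echo "$cmd"

# Launch only when run from the repository root, where the script exists.
if [ -f "$script" ]; then
  $cmd
fi
```

On a multi-machine setup, the same command would presumably be run once per machine with a different NODE_RANK on each.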
🍺 Performance and Models

Tokenizer

| Method | Token Type | #Tokens | Train Data | Codebook Size | rFID | PSNR | Codebook Utilization | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-MAGVIT2-20240617 | 2D | $16 \times 16$ | $256 \times 256$ ImageNet | 262144 | 1.53 | 21.53 | 100% | - |
| Open-MAGVIT2-20240617 | 2D | $16 \times 16$ | $128 \times 128$ ImageNet | 262144 | 1.56 | 24.45 | 100% | - |
| Open-MAGVIT2 | 2D | $16 \times 16$ | $256 \times 256$ ImageNet | 262144 | 1.17 | 21.90 | 100% | IN256_Large |
| Open-MAGVIT2 | 2D | $16 \times 16$ | $128 \times 128$ ImageNet | 262144 | 1.18 | 25.08 | 100% | IN128_Large |
| Open-MAGVIT2* | 2D | $32 \times 32$ | $128 \times 128$ ImageNet | 262144 | 0.34 | 26.19 | 100% | same as above |

(*) denotes results obtained by direct inference with the model trained at $128 \times 128$ resolution, without fine-tuning.
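The token counts in the table follow from dividing the input resolution by the tokenizer's downsampling rate. The sketch below reproduces that arithmetic. The helper name `side_tokens` is made up for this example, and the third case (the $8\times$ tokenizer applied to $256 \times 256$ inputs) is an inference from the starred row's $32 \times 32$ token count, not something the table states.

```shell
# side_tokens RESOLUTION RATE -> number of tokens along one image side
side_tokens() {
  echo $(( $1 / $2 ))
}

# Each entry is "input resolution" followed by "downsampling rate".
for cfg in "256 16" "128 8" "256 8"; do
  set -- $cfg
  side=$(side_tokens "$1" "$2")
  echo "${1}x${1} input, ${2}x downsampling -> ${side}x${side} = $(( side * side )) tokens"
done
```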

Stage II: Training of Auto-Regressive Models

🚀 Training Scripts

See scripts/train_autogressive/run.sh for the available model configurations.

bash scripts/train_autogressive/run.sh MASTER_ADDR MASTER_PORT NODE_RANK

🚀 Sample Scripts

See scripts/train_autogressive/run.sh for the sampling hyper-parameters used at each model scale.

Run the script that matches your hardware (NPU or GPU):

bash scripts/evaluation/sample_npu.sh Your_Total_Rank
bash scripts/evaluation/sample_gpu.sh Your_Total_Rank
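🍺 Performance and Models

| Method | Params | #Tokens | FID | IS | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| Open-MAGVIT2 | 343M | $16 \times 16$ | 3.08 | 258.26 | AR_256_B |
| Open-MAGVIT2 | 804M | $16 \times 16$ | 2.51 | 271.70 | AR_256_L |
| Open-MAGVIT2 | 1.5B | $16 \times 16$ | 2.33 | 271.77 | AR_256_XL |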

Text-conditional Image Generation

Stage I: Training of Visual Tokenizer

Data Preparation

We use LAION-COCO, CC12M, CC3M, LAION-HD, LAION-Aesthetic-umap, LAION-Aesthetic-v2, and JourneyDB for pretraining.

🚀 Training Scripts

bash scripts/train_tokenizer/Open-MAGVIT2/pretrain_256.sh MASTER_ADDR MASTER_PORT NODE_RANK

🚀 Evaluation Scripts

bash scripts/evaluation/evaluation_256.sh
bash scripts/evaluation/evaluation_original.sh
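🍺 Performance comparison and Models

| Method | Quantizer Type | Training Data | Ratio | Resolution | Codebook Size | Checkpoint | rFID (COCO) | PSNR (COCO) | SSIM (COCO) | rFID (In1k) | PSNR (In1k) | SSIM (In1k) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LlamaGen | VQ | 70M | 16 | $256 \times 256$ | 16384 | - | 8.40 | 20.28 | 0.55 | 2.47 | 20.65 | 0.54 |
| Show-o | LFQ | 35M | 16 | $256 \times 256$ | 8192 | - | 9.26 | 20.90 | 0.59 | 3.50 | 21.34 | 0.59 |
| Cosmos | FSQ | - | 16 | $256 \times 256$ | 64000 | - | 11.97 | 19.22 | 0.48 | 4.57 | 19.93 | 0.49 |
| Open-MAGVIT2 | LFQ | 100M | 16 | $256 \times 256$ | 16384 | Pretrain_256_16384 | 7.93 | 22.21 | 0.62 | 2.55 | 22.21 | 0.62 |
| Open-MAGVIT2 | LFQ | 100M | 16 | $256 \times 256$ | 262144 | Pretrain_256_262144 | 6.76 | 22.31 | 0.65 | 1.67 | 22.70 | 0.64 |
| Cosmos | FSQ | - | 16 | Original | 64000 | - | 7.51 | 20.45 | 0.52 | 1.93 | 20.56 | 0.51 |
| Open-MAGVIT2 | LFQ | 100M | 16 | Original | 16384 | Pretrain_256_16384 | 6.65 | 21.61 | 0.57 | 1.39 | 21.74 | 0.56 |
| Open-MAGVIT2 | LFQ | 100M | 16 | Original | 262144 | Pretrain_256_262144 | 5.10 | 22.18 | 0.60 | 0.78 | 22.24 | 0.59 |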
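One way to read the codebook sizes in the table is as bits per token: an index into a codebook of size K needs ceil(log2 K) bits. The sketch below computes this for the listed sizes and for a $256 \times 256$ input at ratio 16. It is plain arithmetic on the table's numbers, and the helper name `bits_for` is made up for this example.

```shell
# bits_for SIZE -> smallest number of bits that can index SIZE codebook entries
bits_for() {
  size=$1
  bits=0
  while [ $(( 1 << bits )) -lt "$size" ]; do
    bits=$(( bits + 1 ))
  done
  echo "$bits"
}

# A 256x256 input at downsampling ratio 16 yields a 16x16 token grid.
tokens=$(( (256 / 16) * (256 / 16) ))

for size in 8192 16384 64000 262144; do
  b=$(bits_for "$size")
  echo "codebook=$size bits_per_token=$b bits_per_image=$(( tokens * b ))"
done
```

At a fixed ratio every method emits the same number of tokens per image, so a larger codebook is what lets each token carry more information.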

Video Generation

Stage I: Training of Video Tokenizer

🚀 Training Scripts

bash scripts/train_tokenizer/Open-MAGVIT2/run_video.sh MASTER_ADDR MASTER_PORT NODE_RANK

🚀 Evaluation Scripts

bash scripts/evaluation/evaluation_video.sh
🍺 Performance comparison and Models

| Method | Token Type | Tokens | Ratio | Train Resolution | Codebook Size | rFVD | Checkpoints |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TATS | 2D | $4 \times 16 \times 16$ | 8 | $128 \times 128$ | 16384 | 162 | - |
| MAGVIT | 2D | $4 \times 16 \times 16$ | 8 | $128 \times 128$ | 1024 | 25 | - |
| SweetTokenizer | 1D | 256 + 1024 | - | $256 \times 256$ | 10481 + 11139 | 44 | - |
| LARP-L | 1D | 1024 | - | $128 \times 128$ | 8192 | 24 | - |
| LARP-L-Long | 1D | 1024 | - | $128 \times 128$ | 8192 | 24 | - |
| SweetTokenizer | 1D | 5120 | - | $256 \times 256$ | 10481 + 11139 | 18 | - |
| Open-MAGVIT2 | 2D | $5 \times 16 \times 16$ | 8 | $128 \times 128$ | 262144 | 16 | Video_128_262144 |