Open-MAGVIT2.md
February 19, 2025 ยท View on GitHub
OPEN-MAGVIT2: An Open-source Project Toward Democratizing Auto-Regressive Visual Generation
Zhuoyan Luo*, Fengyuan Shi*, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
ARC Lab Tencent PCG, Tsinghua University, Nanjing University
Highlights
- ๐ Super-large Codebook: Re-implements the advanced Lookup-Free Quantizer proposed by MAGVITv2, and achieves a super-large codebook (i.e., ) with strong performance (1.17rFID).
- ๐ก Auto-Regressive Innovation: Introduces asymmetric token factorization and the next sub-token prediction paradigm, enabling efficient generation with a super-large vocabulary and enhanced sub-token interactions.
- ๐ Scalability: Validates the scalability of plain auto-regressive models across various parameter sizes (300M to 1.5B).
This repository provides the scripts and checkpoints to replicate our results.
๐ค Features
- A series of visual tokenizers: (1) image tokenizer for class-conditional image generation (8 and 16 downsampling rate with codebook size), (2) text-conditional image generation ( and codebook size with 16 downsampling rate), (3) video tokenizer ( codebook size with 4 8 downsampling rate).
- A family of the autoregressive model ranging from 300M to 1.5B for class-conditional image generation.
๐ค Open-MAGVIT2 is still under active development. Stay tuned for the update!
๐ฅ Quick Start
Class Conditional Image Generation
Stage I: Training of Visual Tokenizer
๐ Training Scripts
bash scripts/train_tokenizer/Open-MAGVIT2/run_128_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
bash scripts/train_tokenizer/Open-MAGVIT/run_256_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
๐ Evaluation Scripts
bash scripts/evaluation/evaluation_128.sh
bash scripts/evaluation/evaluation_256.sh
๐บ Performance and Models
Tokenizer
| Method | Token Type | #Tokens | Train Data | Codebook Size | rFID | PSNR | Codebook Utilization | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| Open-MAGVIT2-20240617 | 2D | 16 16 | 256 256 ImageNet | 262144 | 1.53 | 21.53 | 100% | - |
| Open-MAGVIT2-20240617 | 2D | 16 16 | 128 128 ImageNet | 262144 | 1.56 | 24.45 | 100% | - |
| Open-MAGVIT2 | 2D | 16 16 | 256 256 ImageNet | 262144 | 1.17 | 21.90 | 100% | IN256_Large |
| Open-MAGVIT2 | 2D | 16 16 | 128 128 ImageNet | 262144 | 1.18 | 25.08 | 100% | IN128_Large |
| Open-MAGVIT2* | 2D | 32 32 | 128 128 ImageNet | 262144 | 0.34 | 26.19 | 100% | above |
(*) denotes that the results are from the direct inference using the model trained with $128 \times 128$ resolution without fine-tuning.
Stage II: Training of Auto-Regressive Models
๐ Training Scripts
Please see in scripts/train_autogressive/run.sh for different model configurations.
bash scripts/train_autogressive/run.sh MASTER_ADDR MASTER_PORT NODE_RANK
๐ Sample Scripts
Please see in scripts/train_autogressive/run.sh for different sampling hyper-parameters for different scale of models.
bash scripts/evaluation/sample_npu.sh or scripts/evaluation/sample_gpu.sh Your_Total_Rank
๐บ Performance and Models
| Method | Params | #Tokens | FID | IS | Checkpoint |
|---|---|---|---|---|---|
| Open-MAGVIT2 | 343M | 16 16 | 3.08 | 258.26 | AR_256_B |
| Open-MAGVIT2 | 804M | 16 16 | 2.51 | 271.70 | AR_256_L |
| Open-MAGVIT2 | 1.5B | 16 16 | 2.33 | 271.77 | AR_256_XL |
Text-conditional Image Generation
Stage I: Training of Visual Tokenizer
Data Preparation
We use LAION-COCO, CC12M, CC3M, LAION-HD, LAION-Aesthetic-umap, LAION-Aesthetic-v2 and JourneyDB for Pretraining.
๐ Training Scripts
bash scripts/train_tokenizer/Open-MAGVIT2/pretrain_256.sh MASTER_ADDR MASTER_PORT NODE_RANK
๐ Evaluation Scripts
bash scripts/evaluation/evaluation_256.sh
bash scripts/evaluation/evaluation_original.sh
๐บ Performance comparison and Models
| Method | Quantizer Type | Training Data | Ratio | Resolution | Codebook Size | Checkpoint | rFID(COCO) | PSNR(COCO) | SSIM(COCO) | rFID(In1k) | PSNR(In1k) | SSIM(In1k) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LlamaGen | VQ | 70M | 16 | 256 256 | 16384 | - | 8.40 | 20.28 | 0.55 | 2.47 | 20.65 | 0.54 |
| Show-o | LFQ | 35M | 16 | 256 256 | 8192 | - | 9.26 | 20.90 | 0.59 | 3.50 | 21.34 | 0.59 |
| Cosmos | FSQ | - | 16 | 256 256 | 64000 | - | 11.97 | 19.22 | 0.48 | 4.57 | 19.93 | 0.49 |
| Open-MAGVIT2 | LFQ | 100M | 16 | 256 256 | 16384 | Pretrain_256_16384 | 7.93 | 22.21 | 0.62 | 2.55 | 22.21 | 0.62 |
| Open-MAGVIT2 | LFQ | 100M | 16 | 256 256 | 262144 | Pretrain_256_262144 | 6.76 | 22.31 | 0.65 | 1.67 | 22.70 | 0.64 |
| Cosmos | FSQ | - | 16 | Original | 64000 | - | 7.51 | 20.45 | 0.52 | 1.93 | 20.56 | 0.51 |
| Open-MAGVIT2 | LFQ | 100M | 16 | Original | 16384 | Pretrain_256_16384 | 6.65 | 21.61 | 0.57 | 1.39 | 21.74 | 0.56 |
| Open-MAGVIT2 | LFQ | 100M | 16 | Original | 262144 | Pretrain_256_262144 | 5.10 | 22.18 | 0.60 | 0.78 | 22.24 | 0.59 |
Video Generation
Stage I: Training of Video Tokenizer
๐ Training Scripts
bash scripts/train_tokenizer/Open-MAGVIT2/run_video.sh MASTER_ADDR MASTER_PORT NODE_RANK
๐ Evaluation Scripts
bash scripts/evaluation/evaluation_video.sh
๐บ Performance comparison and Models
| Method | Token Type | Tokens | Ratio | Train Resolution | Codebook Size | rFVD | Checkpoints |
|---|---|---|---|---|---|---|---|
| TATS | 2D | 4 16 16 | 8 | 128 128 | 16384 | 162 | - |
| MAGVIT | 2D | 4 16 16 | 8 | 128 128 | 1024 | 25 | - |
| SweetTokenizer | 1D | 256 + 1024 | - | 256 256 | 10481 + 11139 | 44 | - |
| LARP-L | 1D | 1024 | - | 128 128 | 8192 | 24 | - |
| LARP-L-Long | 1D | 1024 | - | 128 128 | 8192 | 24 | - |
| SweetTokenizer | 1D | 5120 | - | 256 256 | 10481 + 11139 | 18 | - |
| Open-MAGVIT2 | 2D | 5 16 16 | 8 | 128 128 | 262144 | 16 | Video_128_262144 |