DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

May 14, 2026

If our project helps you, please give us a star ⭐ and cite our paper!

Paper PDF Project Page

🌈 Introduction

This repo implements DualToken, a method that unifies the representations for visual understanding and generation within a single tokenizer. Directly combining reconstruction and semantic objectives in a single tokenizer creates conflicts that degrade both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high-level and low-level features, turning their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks.
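The idea of separate codebooks can be illustrated with a toy sketch. This is not the repository's implementation: the function names, the 2-D feature vectors, and the codebook sizes are all hypothetical, and plain nearest-neighbour lookup stands in for whatever quantizer the actual model uses. It only shows that low-level and high-level features are mapped to token ids by two independent codebooks.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def dual_quantize(low_feats, high_feats, pixel_codebook, semantic_codebook):
    """Quantize low- and high-level features with two separate codebooks.

    Hypothetical helper: each feature list gets its own codebook, so the
    perceptual and semantic token ids never compete for the same entries.
    """
    pixel_ids = [nearest_code(f, pixel_codebook) for f in low_feats]
    semantic_ids = [nearest_code(f, semantic_codebook) for f in high_feats]
    return pixel_ids, semantic_ids


# Toy data (illustrative values only).
pixel_codebook = [[0.0, 0.0], [1.0, 1.0]]
semantic_codebook = [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
low_feats = [[0.1, -0.1], [0.9, 1.2]]    # stand-in for shallow-layer features
high_feats = [[0.8, 0.1], [-0.7, 0.2]]   # stand-in for deep-layer features

pixel_ids, semantic_ids = dual_quantize(
    low_feats, high_feats, pixel_codebook, semantic_codebook
)
print(pixel_ids, semantic_ids)
```

Each image position ends up with one id from each vocabulary, which is the sense in which the two kinds of information are disentangled rather than forced into a single code.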

teaser

Built upon DualToken, we construct a unified MLLM that demonstrates remarkable effectiveness in downstream understanding and generation tasks. The code and weights of our unified MLLM will be released soon.

teaser

📰 News

  • [2026/04/21] 🌟 We have released the inference and training code of our tokenizer. More versions will follow. Please stay tuned!
  • [2025/03/18] 🌟 We have released the technical report of DualToken. See here!

🔧 Requirements and Installation

  • Python ≥ 3.11
  • PyTorch ≥ 2.4.1
  • transformers == 4.44.0
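A quick way to confirm that an environment meets these requirements is sketched below. It is not part of the repository: the helper names `version_tuple` and `check` are hypothetical, and the script assumes the packages are installed under the distribution names `torch` and `transformers`. It relies only on the standard library.

```python
import sys
from importlib import metadata


def version_tuple(version):
    """Turn '2.4.1+cu121' into (2, 4, 1), ignoring local and pre-release suffixes."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def check(package, required, exact=False):
    """Report whether an installed package satisfies the listed requirement."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    have, want = version_tuple(installed), version_tuple(required)
    ok = (have == want) if exact else (have >= want)
    relation = "==" if exact else ">="
    status = "OK" if ok else "MISMATCH"
    return f"{package} {installed} (need {relation} {required}): {status}"


python_ok = sys.version_info >= (3, 11)
print(f"Python {sys.version.split()[0]} (need >= 3.11): {'OK' if python_ok else 'MISMATCH'}")
print(check("torch", "2.4.1"))
print(check("transformers", "4.44.0", exact=True))
```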

🚀 Training

To train a tokenizer from scratch, run:

torchrun --nproc_per_node 8 -m main \
    --sem_weight 1 \
    --stage 1 \
    --name siglip-384-rvq8 \
    --model "model_config_siglip_384_rvq8" \
    --save-frequency 1 \
    --train-data="$YOUR_DATA_PATH/cc12/cc12m-train-{0000..2175}.tar" \
    --train-num-samples 10000000 \
    --dataset-type "webdataset" \
    --warmup=10000 \
    --batch-size=32 \
    --lr=7.2e-5 \
    --beta1=0.5 \
    --beta2=0.9 \
    --wd=0.0001 \
    --epochs=20 \
    --gan_start_epoch=0 \
    --restart_gan=20 \
    --workers=1

Alternatively, run the provided training script directly:

bash run.sh
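For orientation, the rough scale of the torchrun command above can be worked out from its flags. The sketch below is an estimate, not something taken from the training code: it assumes a single node, that `--batch-size` is per process (as in OpenCLIP, which this project builds on), no gradient accumulation, and that a partial final batch is dropped.

```python
# Values copied from the torchrun command; the interpretation is an assumption.
nproc_per_node = 8              # --nproc_per_node
batch_size = 32                 # --batch-size, assumed to be per process
samples_per_epoch = 10_000_000  # --train-num-samples
epochs = 20                     # --epochs
warmup_steps = 10_000           # --warmup

global_batch = nproc_per_node * batch_size           # samples per optimizer step
steps_per_epoch = samples_per_epoch // global_batch  # partial batch dropped
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

# The shard pattern cc12m-train-{0000..2175}.tar expands to this many files.
num_shards = 2175 - 0 + 1

print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
print(f"warmup fraction:   {warmup_fraction:.2%}")
print(f"number of shards:  {num_shards}")
```

Under these assumptions the global batch is 256 samples and warmup covers a little over 1% of training. Changing the number of GPUs changes the global batch, so the learning rate may need to be revisited.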

Inference

python inference.py

🙇 Acknowledgement

DualToken is built upon the awesome works VILA-U, OpenCLIP, and LLaVA.

๐Ÿ“ Citation

@article{song2025dualtoken,
  title={Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies},
  author={Song, Wei and Wang, Yuran and Song, Zijia and Li, Yadong and Zhou, Zenan and Chen, Long and Xu, Jianhua and Wang, Jiaqi and Yu, Kaicheng},
  journal={arXiv preprint arXiv:2503.14324},
  year={2025}
}

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.