DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
May 14, 2026
If our project helps you, please give us a star ⭐ and cite our paper!
Introduction
This repo implements DualToken, a method that unifies representations for visual understanding and generation within a single tokenizer. Training one tokenizer with both reconstruction and semantic objectives creates a conflict that degrades both reconstruction quality and semantic performance. Instead of forcing a single codebook to handle both semantic and perceptual information, DualToken disentangles them by introducing separate codebooks for high-level and low-level features, turning their inherent conflict into a synergistic relationship. As a result, DualToken achieves state-of-the-art performance in both reconstruction and semantic tasks.
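
To make the idea of separate codebooks concrete, here is a toy sketch in plain Python. It is not the DualToken implementation: the function names, the 2-D codebooks, and the feature values are all hypothetical, and it uses simple nearest-neighbour lookup only to show that low-level and high-level features are each assigned a token from their own vocabulary.

```python
# Toy illustration only: NOT the DualToken implementation.
# Two independent codebooks quantize two different feature vectors
# (one low-level, one high-level) that describe the same image patch.

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def dual_quantize(low_feat, high_feat, pixel_codebook, semantic_codebook):
    """Quantize each feature with its own codebook and return both token ids."""
    return (nearest_code(low_feat, pixel_codebook),
            nearest_code(high_feat, semantic_codebook))


# Hypothetical 2-D codebooks with three entries each.
pixel_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
semantic_codebook = [[5.0, 5.0], [-5.0, 5.0], [0.0, -5.0]]

low_feat = [0.9, 0.1]    # stand-in for a low-level (perceptual) feature
high_feat = [0.5, -4.0]  # stand-in for a high-level (semantic) feature

pixel_id, semantic_id = dual_quantize(
    low_feat, high_feat, pixel_codebook, semantic_codebook
)
print(pixel_id, semantic_id)
```

Because each feature is matched only against its own codebook, neither vocabulary has to trade off perceptual detail against semantic content.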

Built upon DualToken, we construct a unified MLLM that demonstrates remarkable effectiveness in downstream understanding and generation tasks. The code and weights of our unified MLLM will be released soon.

News
- [2026/04/21] We have released the inference and training code of our tokenizer. More versions will follow. Please stay tuned!
- [2025/03/18] We have released the technical report of DualToken (arXiv:2503.14324).
Requirements and Installation
- Python ≥ 3.11
- PyTorch ≥ 2.4.1
- transformers == 4.44.0
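
A quick way to confirm that an environment matches the list above is a small check script. The sketch below is not part of the repository. It assumes the PyPI distribution names `torch` and `transformers`, and its helper names (`parse_version`, `satisfies`, `check_environment`) are made up for illustration. Version parsing is deliberately simple and ignores pre-release and local suffixes such as `+cu121`.

```python
import sys
from importlib import metadata

# Versions taken from the requirements list; transformers is pinned exactly.
REQUIREMENTS = {"torch": ("2.4.1", ">="), "transformers": ("4.44.0", "==")}


def parse_version(text):
    """Turn '2.4.1+cu121' into (2, 4, 1); suffixes are ignored, short versions padded."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def satisfies(installed, required, op):
    """Compare two version strings with '==' or '>='."""
    a, b = parse_version(installed), parse_version(required)
    return a == b if op == "==" else a >= b


def check_environment():
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if sys.version_info < (3, 11):
        problems.append(f"Python {sys.version.split()[0]} found, need >= 3.11")
    for package, (required, op) in REQUIREMENTS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not satisfies(installed, required, op):
            problems.append(f"{package} {installed} found, need {op} {required}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment OK" if not issues else "\n".join(issues))
```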
Training
To train a tokenizer from scratch, run:
torchrun --nproc_per_node 8 -m main \
--sem_weight 1 \
--stage 1 \
--name siglip-384-rvq8 \
--model "model_config_siglip_384_rvq8" \
--save-frequency 1 \
--train-data="$YOUR_DATA_PATH/cc12/cc12m-train-{0000..2175}.tar" \
--train-num-samples 10000000 \
--dataset-type "webdataset" \
--warmup=10000 \
--batch-size=32 \
--lr=7.2e-5 \
--beta1=0.5 \
--beta2=0.9 \
--wd=0.0001 \
--epochs=20 \
--gan_start_epoch=0 \
--restart_gan=20 \
--workers=1
Alternatively, run the provided training script:
bash run.sh
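
For planning a run, it can help to work out what the flags in the `torchrun` command above imply. The sketch below does that arithmetic under three assumptions that the README does not state: `--batch-size` is per process (the OpenCLIP convention), `--train-num-samples` counts samples per epoch, and training runs on a single node. The step count is rounded up here, and the actual data loader may round differently. The function name `training_plan` is hypothetical.

```python
import math


def training_plan(nproc_per_node=8, per_process_batch=32,
                  samples_per_epoch=10_000_000, epochs=20,
                  warmup_steps=10_000, first_shard=0, last_shard=2175):
    """Derive batch and step counts from the flags in the training command.

    ASSUMPTIONS: batch size is per process, sample count is per epoch,
    and a single node is used.
    """
    global_batch = nproc_per_node * per_process_batch
    steps_per_epoch = math.ceil(samples_per_epoch / global_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
        # The brace pattern {0000..2175} is inclusive on both ends.
        "num_shards": last_shard - first_shard + 1,
    }


for name, value in training_plan().items():
    print(f"{name}: {value}")
```

Under these assumptions the 10,000 warmup steps cover a little over 1% of the run. Changing the number of GPUs or the batch size changes the global batch, so the learning rate and warmup may need to be revisited.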
Inference
python inference.py
Acknowledgement
DualToken is built upon the awesome works VILA-U, OpenCLIP, and LLaVA.
Citation
@article{song2025dualtoken,
title={Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies},
author={Song, Wei and Wang, Yuran and Song, Zijia and Li, Yadong and Zhou, Zenan and Chen, Long and Xu, Jianhua and Wang, Jiaqi and Yu, Kaicheng},
journal={arXiv preprint arXiv:2503.14324},
year={2025}
}
LICENSE
This project is licensed under the MIT License - see the LICENSE file for details.