April 1, 2026
Multimodal Dataset Distillation via Phased Teacher Models
Shengbin Guo* Hang Zhao* Senqiao Yang Chenyang Jiang
Yuhang Cheng Xiangru Peng Rui Shao Zhuotao Tian†
* equal contribution † corresponding author
📖 OpenReview | 🤗 Hugging Face | 📑 arXiv
TODO list:
- Upload basic code.
- Upload buffers and models.
- Update reproduction guidance (README).
- Submit paper to arXiv.
- Add ST method.
TL;DR
We introduce PTM-ST, a novel phased distillation framework that achieves superior performance in Multimodal Dataset Distillation by leveraging stage-aware teacher modeling and shortcut-based trajectories to efficiently compress massive image-text data.
Getting Started
Environment
We tested the code on a single RTX 3090 GPU with CUDA driver version 12.1. Run the following commands to set up the environment:
git clone https://github.com/Previsior/PTM-ST.git
cd PTM-ST
conda create -n ptm-st python=3.8 -y
conda activate ptm-st
pip install -r requirements.txt
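After installation, a quick sanity check can confirm that the main libraries are importable. The snippet below is only a sketch: `check_environment` is a hypothetical helper that is not part of this repository, and the package list is an assumption based on the libraries named in this README (`torch` is assumed as their common dependency).

```python
import importlib.util
import sys


def check_environment(packages=("torch", "timm", "transformers", "clip")):
    """Hypothetical helper: report the Python version and which packages can be found.

    The default package names are an assumption, not read from requirements.txt.
    """
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

Any entry printed as `False` suggests the corresponding package was not installed into the active `ptm-st` environment.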
Models and Datasets
Pretrained models: All pretrained weights/tokenizers used by the code are downloaded automatically on first use and stored in the default cache directory of the corresponding library (transformers, timm, or clip).
Reference links for the pretrained models currently used/supported by the code:
- Default image encoder nfnet_l0 (timm): https://huggingface.co/timm/nfnet_l0.ra2_in1k
- Default text encoder bert-base-uncased: https://huggingface.co/google-bert/bert-base-uncased
- Optional text encoder distilbert-base-uncased: https://huggingface.co/distilbert/distilbert-base-uncased
- Optional text encoder BAAI/bge-base-en-v1.5: https://huggingface.co/BAAI/bge-base-en-v1.5
- Optional CLIP model ViT-B/32: https://huggingface.co/openai/clip-vit-base-patch32
Datasets: Run the command below to download the Flickr30K, COCO, and LLaVA-CC3M datasets.
bash sh/prepare_data.sh
Alternatively, download them manually and place them in the following layout:
./distill_utils/data/
├── Flickr30k/
│   ├── flickr30k-images/
│   │   ├── 1234.jpg
│   │   └── ......
│   ├── results_20130124.token
│   └── readme.txt
├── COCO/
│   ├── train2014/
│   ├── val2014/
│   └── test2014/
└── CC3M/
    ├── GCC_train_002820774.jpg
    ├── GCC_train_002805422.jpg
    └── ...
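If you place the datasets manually, a small check like the one below can catch a misplaced folder before a long run starts. This is a minimal sketch: `missing_dirs` is a hypothetical helper, not a script shipped with the repository, and it only looks for the directories shown in the tree above.

```python
from pathlib import Path

# Sub-directories taken from the layout above; the helper itself is hypothetical.
EXPECTED_DIRS = [
    "Flickr30k/flickr30k-images",
    "COCO/train2014",
    "COCO/val2014",
    "COCO/test2014",
    "CC3M",
]


def missing_dirs(root="./distill_utils/data"):
    """Return the expected sub-directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    if missing:
        print("Missing directories:", missing)
    else:
        print("All expected directories found.")
```

The check covers directories only; it does not verify that the image files or `results_20130124.token` are complete.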
Annotations: Download the annotation files:
hf download previsor/PTM-ST --repo-type dataset --include "Annotation/*" --local-dir ./data
Generate Expert Trajectories
You can generate expert trajectories by running sh/buffer.sh. Alternatively, download our pre-generated trajectories for faster reproduction:
hf download previsor/PTM-ST --repo-type dataset --include "flickr/nfnet_bert/InfoNCE/normal/*" --local-dir ./buffer
hf download previsor/PTM-ST --repo-type dataset --include "coco/nfnet_bert/InfoNCE/normal/*" --local-dir ./buffer
hf download previsor/PTM-ST --repo-type dataset --include "cc3m/nfnet_bert/InfoNCE/normal/*" --local-dir ./buffer
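The three commands differ only in the dataset prefix, so they can be generated in a loop. The sketch below rebuilds exactly the commands listed above; `build_download_cmd` and `download_all` are hypothetical helper names, and the dry-run default only prints the commands instead of executing them.

```python
import shlex
import subprocess

# Dataset prefixes as they appear in the download commands above.
DATASETS = ("flickr", "coco", "cc3m")


def build_download_cmd(dataset, local_dir="./buffer"):
    """Return the `hf download` argument list for one dataset's trajectories."""
    pattern = "{}/nfnet_bert/InfoNCE/normal/*".format(dataset)
    return ["hf", "download", "previsor/PTM-ST", "--repo-type", "dataset",
            "--include", pattern, "--local-dir", local_dir]


def download_all(dry_run=True):
    """Print each command (dry run) or execute it with subprocess."""
    for dataset in DATASETS:
        cmd = build_download_cmd(dataset)
        if dry_run:
            print(shlex.join(cmd))  # shell-quoted, safe to copy and paste
        else:
            subprocess.run(cmd, check=True)  # requires the `hf` CLI on PATH


if __name__ == "__main__":
    download_all(dry_run=True)
```

Passing the arguments as a list keeps the `*` in the include pattern from being expanded by the shell.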
Generate Convex Trajectories for ST
After preparing the normal trajectory, use convexify.py to generate the interpolated trajectory.
Example usage:
python convexify.py --start 0 --end 10 --normal_dir ./buffer/flickr/nfnet_bert/InfoNCE/normal
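To give an intuition for what an interpolated trajectory can look like, the sketch below builds a straight-line (convex) path between two checkpoints, using θ_k = (1 − α_k)·θ_start + α_k·θ_end with α_k = k / num_steps. This is an assumption made for illustration: the README does not state how convexify.py computes its output, and `interpolate_checkpoints` is a hypothetical function operating on plain lists of floats rather than on the saved buffers.

```python
def interpolate_checkpoints(theta_start, theta_end, num_steps):
    """Return num_steps + 1 points on the straight line from theta_start to theta_end.

    Illustrative only; not taken from convexify.py.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    if len(theta_start) != len(theta_end):
        raise ValueError("checkpoints must have the same number of parameters")
    trajectory = []
    for k in range(num_steps + 1):
        alpha = k / num_steps  # 0.0 at the start checkpoint, 1.0 at the end checkpoint
        trajectory.append([(1 - alpha) * a + alpha * b
                           for a, b in zip(theta_start, theta_end)])
    return trajectory


if __name__ == "__main__":
    for point in interpolate_checkpoints([0.0, 2.0], [1.0, 0.0], num_steps=4):
        print(point)
```

Every intermediate point is a convex combination of the two endpoints, so the path replaces the original optimization steps between them with a direct shortcut.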
Distillation
You can distill multimodal datasets by running sh/distill.sh.
The script records the parameter settings used for each dataset and each number of distilled pairs. For example, the Flickr 500-pair setting:
export CUDA_VISIBLE_DEVICES=$1
python distill_ptm-st.py --dataset=flickr \
--buffer_path './buffer/flickr/nfnet_bert/InfoNCE/convexified_0_6' './buffer/flickr/nfnet_bert/InfoNCE/convexified_0_8' \
--min_start_epoch 0 1 --max_start_epoch 2 3 \
--lr_img 1000 --lr_txt 1000 --lr_lr 0.01 \
--lr_teacher_img 0.1 --lr_teacher_txt 0.1 \
--lr_sim 10.0 --sim_type full \
--num_queries 200 299 --name ${EXP_NAME} \
--Iteration 2000 2000 --subset_num 2 \
--mini_batch_size 40 --ema_decay 0.99 --image_trainable
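Several flags in the command above take two values each (`--buffer_path`, `--min_start_epoch`, `--max_start_epoch`, `--num_queries`, `--Iteration`), alongside `--subset_num 2`. One plausible reading is that the i-th value of each flag configures the i-th phase. The sketch below parses such flags with `argparse` and groups them per phase. It is an assumption about the intended semantics, not code from distill_ptm-st.py, and `build_parser` and `group_by_phase` are hypothetical names.

```python
import argparse


def build_parser():
    # Only the multi-valued flags from the command above are modelled here.
    parser = argparse.ArgumentParser()
    parser.add_argument("--buffer_path", nargs="+", required=True)
    parser.add_argument("--min_start_epoch", nargs="+", type=int, required=True)
    parser.add_argument("--max_start_epoch", nargs="+", type=int, required=True)
    parser.add_argument("--num_queries", nargs="+", type=int, required=True)
    parser.add_argument("--Iteration", nargs="+", type=int, required=True)
    parser.add_argument("--subset_num", type=int, required=True)
    return parser


def group_by_phase(args):
    """Pair the i-th value of every multi-valued flag into one dict per phase.

    Assumes --subset_num equals the number of phases (not confirmed by the README).
    """
    fields = ["buffer_path", "min_start_epoch", "max_start_epoch",
              "num_queries", "Iteration"]
    for name in fields:
        if len(getattr(args, name)) != args.subset_num:
            raise ValueError("--{} needs {} values".format(name, args.subset_num))
    return [{name: getattr(args, name)[i] for name in fields}
            for i in range(args.subset_num)]


if __name__ == "__main__":
    argv = [
        "--buffer_path",
        "./buffer/flickr/nfnet_bert/InfoNCE/convexified_0_6",
        "./buffer/flickr/nfnet_bert/InfoNCE/convexified_0_8",
        "--min_start_epoch", "0", "1", "--max_start_epoch", "2", "3",
        "--num_queries", "200", "299", "--Iteration", "2000", "2000",
        "--subset_num", "2",
    ]
    for phase in group_by_phase(build_parser().parse_args(argv)):
        print(phase)
```

Checking the value counts up front makes a mismatched flag fail immediately rather than partway through a long distillation run.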
Citation
If you find this code useful in your research, please consider citing our work:
@article{guo2026multimodal,
title={Multimodal Dataset Distillation via Phased Teacher Models},
author={Guo, Shengbin and Zhao, Hang and Yang, Senqiao and Jiang, Chenyang and Cheng, Yuhang and Peng, Xiangru and Shao, Rui and Tian, Zhuotao},
journal={arXiv preprint arXiv:2603.25388},
year={2026}
}