Multimodal Distribution Matching for Vision-Language Dataset Distillation (CVPR 2026)

May 25, 2026

Official implementation of Multimodal Distribution Matching for Vision-Language Dataset Distillation, a method that condenses a large vision-language dataset into a much smaller synthetic set while preserving downstream performance.

Paper

About

Experts: buffers.

.
├── distill_mdm.py       # Main dataset distillation entry point
├── eval.py              # Retrieval evaluation on distilled checkpoints
├── src/                 
│   ├── clustering_utils.py
│   ├── epoch.py
│   ├── geo_utils.py
│   ├── model.py
│   ├── model_utils.py
│   ├── networks.py
│   ├── reparam_module.py
│   ├── similarity_mining.py
│   ├── utils.py
│   └── vl_distill_utils.py
├── utils/               
├── sh/
│   ├── distill.sh       
│   └── eval.sh           

Dataset

data/
├── datasets/
│   ├── Flickr30k/
│   ├── Flickr8k/
│   └── COCO/
└── annotations/
    ├── flickr30k/
    ├── flickr8k/
    └── coco/

By default, distill_mdm.py expects image roots such as ./data/datasets/Flickr30k/ and the annotation root ./data/annotations/ when the flickr, flickr8k, or coco dataset option is selected.
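Before launching a run, it can help to confirm that the directory layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_data_dirs` and the `EXPECTED_DIRS` list are hypothetical and simply mirror the tree shown in this section. You likely only need the directories for the dataset you plan to use.

```python
from pathlib import Path

# Sub-directories copied from the layout shown above.
# EXPECTED_DIRS and missing_data_dirs are illustrative names, not repo code.
EXPECTED_DIRS = [
    "datasets/Flickr30k",
    "datasets/Flickr8k",
    "datasets/COCO",
    "annotations/flickr30k",
    "annotations/flickr8k",
    "annotations/coco",
]


def missing_data_dirs(data_root="./data"):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Data layout looks complete.")
```

If a directory is reported as missing for a dataset you do not intend to use, it can be ignored.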

Training

export CKPT_PATH=/path/to/distilled.pt
./sh/distill.sh <gpu_id> [run_name]

Citation

If you use this code in your research, please cite:

@inproceedings{jeong2026mdm,
  title={Multimodal Distribution Matching for Vision-Language Dataset Distillation},
  author={Jeong, Jongoh and Kwon, Hoyong and Kim, Minseok and Yoon, Kuk-Jin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}