Influence-Guided Diffusion for Dataset Distillation

February 12, 2025

This is the official implementation for the ICLR 2025 paper "Influence-Guided Diffusion for Dataset Distillation".

Abstract

Dataset distillation aims to streamline the training process by creating a compact yet effective dataset that retains the essential information of a much larger original dataset.

Motivated by the remarkable capabilities of diffusion generative models in learning target dataset distributions and controllably sampling high-quality data tailored to user needs, we propose framing dataset distillation as a controlled diffusion generation task aimed at generating data specifically optimized for effective training.

By establishing a correlation between the overarching objective of dataset distillation and the trajectory influence function (TracIn), we introduce the Influence-Guided Diffusion (IGD) sampling framework to generate training-effective data without the need to retrain diffusion models.

An influence guidance function is designed by leveraging TracIn as an indicator to steer the diffusion process toward producing data with high training impact, complemented by a deviation guidance function for diversity enhancement.

Extensive experiments demonstrate that our IGD method achieves state-of-the-art performance in distilling ImageNet datasets.

Implementation

Getting Started

First, create the Conda virtual environment:

conda env create -f environment.yaml

Then, activate the Conda environment:

source activate diff

Before starting, ensure that your ImageNet-1K dataset is located at:

../imagenet/
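A quick sanity check before launching the scripts can save a failed run. The Python sketch below only confirms that the dataset root exists and lists its immediate subdirectories. It assumes nothing about the layout inside ../imagenet/, which is defined by the scripts in this repository, and the helper name `list_subdirs` is hypothetical rather than part of the codebase.

```python
from pathlib import Path


def list_subdirs(root):
    """Return the sorted names of the immediate subdirectories of `root`.

    Raises FileNotFoundError if `root` is not an existing directory.
    """
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset root not found: {root}")
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # "../imagenet/" is the location stated above; adjust it if yours differs.
    try:
        print(list_subdirs("../imagenet/"))
    except FileNotFoundError as err:
        print(err)
```

If the call raises or prints an unexpected listing, fix the dataset path before moving on to the next step.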

Training a Surrogate Model for Influence Computation

Before running distillation, you need to train a surrogate model on the original dataset by executing:

bash train_ckpts.sh

This script trains a ConvNet-6 model on your target dataset (specified by "spec") for 50 epochs and saves the trained model in ./ckpts/.
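For intuition about why a surrogate model is needed, the Python sketch below illustrates the general idea of checkpoint-based TracIn: the influence of one sample on another is estimated by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients. This is a toy illustration on a linear model with squared loss, not the repository's code. The function names, checkpoint weights, learning rates, and samples are all made up for the example; the actual surrogate is the ConvNet-6 trained above, and the exact guidance formulation is given in the paper.

```python
def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w, for a linear model."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def tracin_score(checkpoints, learning_rates, z_a, z_b):
    """Checkpoint-based TracIn estimate.

    Computes sum_i lr_i * <grad L(w_i, z_a), grad L(w_i, z_b)>,
    where each z is a (features, target) pair.
    """
    score = 0.0
    for w, lr in zip(checkpoints, learning_rates):
        g_a = grad_squared_loss(w, *z_a)
        g_b = grad_squared_loss(w, *z_b)
        score += lr * sum(a * b for a, b in zip(g_a, g_b))
    return score


# Toy weights standing in for saved surrogate checkpoints (made-up values).
checkpoints = [[0.0, 0.0], [0.5, 0.5]]
learning_rates = [0.1, 0.1]
candidate = ([1.0, 0.0], 1.0)  # (features, target)
reference = ([1.0, 1.0], 2.0)

print(tracin_score(checkpoints, learning_rates, candidate, reference))
```

A positive score means that a gradient step on the candidate would also reduce the loss on the reference sample, which is the sense in which a generated sample can be called training-effective.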

Downloading Pre-Trained DiT Models

Download the pre-trained DiT models by running:

python download.py

Influence-Guided Sampling with DiT

To generate an IPC50 (50 images per class) surrogate dataset for ImageWoof using a pre-trained DiT model with our IGD sampling method, run:

bash sample_mp.sh

To reproduce our results obtained with the Minimax fine-tuning approach, follow these steps:

1. Access the official Minimax repository and fine-tune a DiT model as per their instructions.
2. After obtaining the Minimax fine-tuned checkpoint, modify its path in the following script and run:

bash sample_mp_minimax.sh

Training Models on the Generated Data for Validation

Run the following script to train a ResNetAP-10 model on the generated dataset using five random seeds:

bash train.sh
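Results from several seeds are usually reported as a mean with a spread. The Python sketch below shows one way to summarize per-seed accuracies once you have collected them from the training output. The helper names are hypothetical, and the use of the sample standard deviation is an assumption; check the paper for the convention used in its tables.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("Need at least two runs to compute a standard deviation.")
    return mean(accuracies), stdev(accuracies)


def format_summary(accuracies):
    """Format results as 'mean ± std' with one decimal place."""
    m, s = summarize_runs(accuracies)
    return f"{m:.1f} ± {s:.1f}"
```

Call `format_summary` with the list of five top-1 accuracies, one per seed, to get a single line suitable for comparison with the reported numbers.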

Hyperparameter Setup

Use the following hyperparameters to reproduce the results reported in Tables 1 & 2 of our paper:

Citation

If you find our work useful for your research, please cite:

@inproceedings{
chen2025influenceguided,
title={Influence-Guided Diffusion for Dataset Distillation},
author={Mingyang Chen and Jiawei Du and Bo Huang and Yi Wang and Xiaobo Zhang and Wei Wang},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}

Acknowledgements

This project is primarily developed based on the following works: