README.md

May 14, 2025

pipeline

Project Page | Paper | Distilled Dataset

This repository contains the code and implementation for the paper "BACON: Bayesian Optimal Condensation Framework for Dataset Distillation".

๐Ÿ” Overview

Figure 1: Comparison of our method with previous methods: (a) Existing DD methods typically align gradients and distributions, but lack theoretical guarantees. (b) BACON models DD as a Bayesian optimization problem, generating synthetic images by assessing likelihood and prior probabilities, thereby improving accuracy and reducing training costs.

Abstract: BACON (BAyesian Optimal CONdensation Framework) introduces a Bayesian framework to Dataset Distillation (DD), offering a principled probabilistic approach that addresses the lack of theoretical grounding in existing methods. By formulating DD as a Bayesian optimization problem, BACON derives a numerically tractable lower bound on expected risk, facilitating efficient data synthesis. Evaluated on multiple image classification benchmarks, BACON consistently outperforms state-of-the-art methods, achieving significant accuracy gains while reducing both synthesis and training costs.

Contributions

Figure 2: Overview of BACON: BACON formulates DD as Bayesian risk minimization over embeddings (*), and derives a tractable lower bound for optimization (I), guided by prior and likelihood from the original data (II). Monte Carlo sampling accelerates optimization (III), and a loss is constructed under two assumptions (IV) to update synthetic data via gradient descent (V).
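As a generic illustration of the quantities named in the caption (this is not the paper's exact derivation), Bayes' rule splits the log-posterior of a synthetic embedding into a likelihood term and a prior term, and an expectation can be approximated by a Monte Carlo average. The symbols $z$ (embedding of original data), $\tilde{z}$ (embedding of synthetic data), $f$, and $K$ are notation assumed here for the sketch.

```latex
\begin{aligned}
\log p(\tilde{z} \mid z) &= \log p(z \mid \tilde{z}) + \log p(\tilde{z}) - \log p(z), \\
\mathbb{E}_{z \sim p(z)}\big[f(z)\big] &\approx \frac{1}{K} \sum_{k=1}^{K} f(z_k), \qquad z_k \sim p(z).
\end{aligned}
```

Since $\log p(z)$ does not depend on $\tilde{z}$, maximizing the posterior over $\tilde{z}$ only requires the likelihood and prior terms.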

  • First Bayesian DD Framework: We are the first to introduce a Bayesian framework for dataset distillation, formulating it as a Bayesian optimization problem that minimizes the expected risk. We derive a theoretical lower bound on the expected risk over the joint distribution of latent variables, providing new insights into the fundamental limits of optimal condensation.

  • Efficient Distillation Algorithm: We propose the BACON framework (Bayesian optimal CONdensation), an efficient method that minimizes the expected risk for dataset distillation. By incorporating key assumptions such as a Gaussian prior and a total variance constraint, BACON derives loss terms to effectively guide the distillation process.

  • Superior Empirical Performance: Extensive experiments across multiple image classification datasets show that BACON consistently outperforms the compared dataset distillation methods in both accuracy and efficiency.
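To make the idea of minimizing a loss built from a likelihood and a Gaussian prior more concrete, here is a deliberately tiny toy, not BACON's actual loss or code. It fits a single scalar "synthetic" value `s` to scalar data by stochastic gradient descent on a negative log-posterior with a Gaussian likelihood and a Gaussian prior, using a minibatch as the Monte Carlo estimate. The function names and the parameters `sigma`, `tau`, `lr`, and `batch` are all hypothetical.

```python
import random


def neg_log_posterior_grad(s, samples, sigma=1.0, tau=1.0):
    """Gradient w.r.t. s of mean((s - x)^2) / (2 sigma^2) + s^2 / (2 tau^2)."""
    likelihood_grad = sum((s - x) / sigma**2 for x in samples) / len(samples)
    prior_grad = s / tau**2  # Gaussian prior centred at zero (toy assumption)
    return likelihood_grad + prior_grad


def distill_scalar(data, steps=200, lr=0.1, batch=32, seed=0):
    """Toy MAP estimate of one scalar from a list of data, via minibatch SGD."""
    rng = random.Random(seed)
    s = 0.0
    for _ in range(steps):
        minibatch = rng.sample(data, batch)  # Monte Carlo estimate of the data term
        s -= lr * neg_log_posterior_grad(s, minibatch)
    return s


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(4.0, 1.0) for _ in range(1000)]
    print("data mean:", sum(data) / len(data))
    print("toy MAP estimate:", distill_scalar(data))
```

With `sigma == tau`, the exact minimizer is half the data mean, so the prior visibly pulls the estimate toward zero; the minibatch estimate fluctuates slightly around that value.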

Experimental Results

We present results for several representative methods, including DM, IDM, and our proposed BACON. In total, we compare 18 existing methods across 4 datasets. Additional methods such as MTT, DataDAM, and IID are also evaluated, with further details available in the full paper.

All distilled datasets are publicly accessible at Distilled Dataset.

Comparison with State-of-the-Art Methods

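  • IPC-50

    | Method | SVHN | CIFAR-10 | CIFAR-100 | TinyImageNet |
    |---|---|---|---|---|
    | DM | 82.6 | 63.0 | 43.6 | - |
    | IDM | 84.1 | 67.5 | 50.0 | - |
    | BACON | 89.1 | 70.06 | 52.29 | - |

  • IPC-10

    | Method | SVHN | CIFAR-10 | CIFAR-100 | TinyImageNet |
    |---|---|---|---|---|
    | DM | 72.8 | 48.9 | 29.7 | 12.9 |
    | IDM | 81.0 | 58.6 | 45.1 | 21.9 |
    | BACON | 84.64 | 62.06 | 46.15 | 25.0 |

  • IPC-1

    | Method | SVHN | CIFAR-10 | CIFAR-100 | TinyImageNet |
    |---|---|---|---|---|
    | DM | 21.6 | 26.0 | 11.4 | 3.9 |
    | IDM | 65.3 | 45.2 | 20.1 | 10.1 |
    | BACON | 69.44 | 45.62 | 23.68 | 10.2 |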
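To read the margins off the tables above more easily, the following sketch recomputes BACON's accuracy gain over a chosen baseline. The numbers are copied from the tables; the names `RESULTS` and `gain_over` are hypothetical helpers for this README, not part of the repository.

```python
# Accuracies copied from the tables above; None marks an entry reported as "-".
DATASETS = ("SVHN", "CIFAR-10", "CIFAR-100", "TinyImageNet")
RESULTS = {
    50: {
        "DM": (82.6, 63.0, 43.6, None),
        "IDM": (84.1, 67.5, 50.0, None),
        "BACON": (89.1, 70.06, 52.29, None),
    },
    10: {
        "DM": (72.8, 48.9, 29.7, 12.9),
        "IDM": (81.0, 58.6, 45.1, 21.9),
        "BACON": (84.64, 62.06, 46.15, 25.0),
    },
    1: {
        "DM": (21.6, 26.0, 11.4, 3.9),
        "IDM": (65.3, 45.2, 20.1, 10.1),
        "BACON": (69.44, 45.62, 23.68, 10.2),
    },
}


def gain_over(baseline, ipc):
    """Return {dataset: BACON accuracy minus baseline accuracy} for one IPC setting."""
    rows = RESULTS[ipc]
    return {
        name: round(ours - base, 2)
        for name, base, ours in zip(DATASETS, rows[baseline], rows["BACON"])
        if base is not None and ours is not None
    }


if __name__ == "__main__":
    for ipc in (50, 10, 1):
        print(f"IPC-{ipc} gain over IDM:", gain_over("IDM", ipc))
```

For example, at IPC-50 the gain over IDM on CIFAR-10 is 70.06 - 67.5 = 2.56 points.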

Visualizations

image samples image samples

Getting Started

Step 1

  • Run the following command to clone the repository.
    git clone https://github.com/zhouzhengqd/BACON.git
    

Step 2

  • Download Datasets (SVHN, CIFAR-10, CIFAR-100, Tiny-ImageNet).

Step 3

  • Run the following commands to create and activate the conda environment.
    cd BACON
    cd Code
    conda env create -f environment.yml
    conda activate bacon
    

๐Ÿ“ Directory Structure

  • BACON
    • Code
      • data
        • datasets
      • checkpoints
      • result
      • Files for BACON
      • environment.yml
      • ...
      • ...
      • ...

๐Ÿ› ๏ธ Command for Reproducing Experiment Results and Evaluation

  • For example, to validate on CIFAR-10, run the commands below. For other datasets, follow the "Command.txt" file.
  • BACON CIFAR-10 IPC-50
      python3 -u BACON_cifar10.py --dataset CIFAR10 --model ConvNet --ipc 50 --dsa_strategy color_crop_cutout_flip_scale_rotate --init real --lr_img 0.2 --num_exp 5 --num_eval 5 --net_train_real --eval_interval 500 --outer_loop 1 --mismatch_lambda 0 --net_decay --embed_last 1000 --syn_ce --ce_weight 0.1 --train_net_num 1 --aug
    
  • BACON CIFAR-10 IPC-10
      python3 -u BACON_cifar10.py --dataset CIFAR10 --model ConvNet --ipc 10 --dsa_strategy color_crop_cutout_flip_scale_rotate --init real --lr_img 0.2 --num_exp 5 --num_eval 5 --net_train_real --eval_interval 100 --outer_loop 1 --mismatch_lambda 0 --net_decay --embed_last 1000 --syn_ce --ce_weight 0.5 --train_net_num 1 --aug
    
  • BACON CIFAR-10 IPC-1
      python3 -u BACON_cifar10.py --dataset CIFAR10 --model ConvNet --ipc 1 --dsa_strategy color_crop_cutout_flip_scale_rotate --init real --lr_img 0.2 --num_exp 5 --num_eval 5 --net_train_real --eval_interval 100 --outer_loop 1 --mismatch_lambda 0 --net_decay --embed_last 1000 --syn_ce --ce_weight 0.5 --train_net_num 1 --batch_real 5000 --net_generate_interval 5 --aug
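The three CIFAR-10 commands above differ only in `--ipc`, `--eval_interval`, `--ce_weight`, and two extra flags for IPC-1. A small helper can rebuild them from those differences, which may be convenient when scripting all three runs. This is only a sketch: `TEMPLATE`, `SETTINGS`, and `build_command` are hypothetical names, and every flag is copied from the commands above rather than from the script's argument parser.

```python
import shlex

# Flags shared by the three CIFAR-10 commands listed above.
TEMPLATE = (
    "python3 -u BACON_cifar10.py --dataset CIFAR10 --model ConvNet --ipc {ipc} "
    "--dsa_strategy color_crop_cutout_flip_scale_rotate --init real --lr_img 0.2 "
    "--num_exp 5 --num_eval 5 --net_train_real --eval_interval {eval_interval} "
    "--outer_loop 1 --mismatch_lambda 0 --net_decay --embed_last 1000 --syn_ce "
    "--ce_weight {ce_weight} --train_net_num 1"
)

# Per-IPC differences, as they appear in the commands above.
SETTINGS = {
    50: {"eval_interval": "500", "ce_weight": "0.1", "extra": ""},
    10: {"eval_interval": "100", "ce_weight": "0.5", "extra": ""},
    1: {
        "eval_interval": "100",
        "ce_weight": "0.5",
        "extra": "--batch_real 5000 --net_generate_interval 5",
    },
}


def build_command(ipc):
    """Return the argument list for one of the CIFAR-10 runs listed above."""
    setting = SETTINGS[ipc]
    base = TEMPLATE.format(
        ipc=ipc,
        eval_interval=setting["eval_interval"],
        ce_weight=setting["ce_weight"],
    )
    return shlex.split(base) + shlex.split(setting["extra"]) + ["--aug"]


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    for ipc in (50, 10, 1):
        print(shlex.join(build_command(ipc)))
```

To actually launch a run, the returned list could be passed to `subprocess.run(cmd, check=True)` from the directory that contains `BACON_cifar10.py`, presumably `BACON/Code` given Step 3.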
    

๐Ÿ™ Acknowledge

We gratefully acknowledge the contributors of DC-bench and IDM, as our code builds upon their work (DC-bench and IDM).