CondenseLM

April 24, 2026

Official code for the EMNLP 2025 paper "CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching".

Setup

pip install -r requirements.txt

Add your OpenAI API key to experiments/openai.env:

OPENAI_API_KEY=your_key_here
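
Before launching a long run, it can help to confirm that the key file is readable and no longer holds the placeholder. The following is a minimal sketch, not part of the repository: `parse_env_text` and `has_real_api_key` are hypothetical helper names, and the parser assumes the file contains plain `KEY=VALUE` lines as shown above.

```python
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_real_api_key(path="experiments/openai.env"):
    """Return True if the file exists and OPENAI_API_KEY is not the placeholder."""
    env_path = Path(path)
    if not env_path.is_file():
        return False
    key = parse_env_text(env_path.read_text(encoding="utf-8")).get("OPENAI_API_KEY", "")
    return bool(key) and key != "your_key_here"


if __name__ == "__main__":
    print("API key configured:", has_real_api_key())
```

This only checks that a non-placeholder value is present; it does not verify that the key is valid with OpenAI.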

Usage

# Step 1: Train the representability reward model
bash run_train_rep_model.sh

# Step 2: Run the condensation pipeline
bash run_pipeline.sh

# Step 3: Evaluate the condensed dataset
bash run_eval.sh

Each script contains one block per dataset; uncomment the block for the dataset you want before running. Supported datasets: SST-2, MNLI, AG News, IMDB.

Pipeline configurations are in pipeline_configs/.

Output Structure

experiments/pipeline/sst2_DPC20_s10_20250320/
├── config.json
├── selected_indices/
├── iter_1/
│   ├── scores/          # coverage reward scores
│   ├── selected_data/   # source examples for LLM
│   ├── distilled_data/  # LLM-generated candidates + best candidate
│   └── models/          # coverage reward model checkpoint
├── iter_2/
├── iter_3/
├── iter_4/
├── result/
│   ├── DPC_5/           # 5 samples per class (iter 1 only)
│   ├── DPC_10/          # 10 samples per class (iter 1 + 2)
│   ├── DPC_15/          # 15 samples per class (iter 1 + 2 + 3)
│   └── DPC_20/          # 20 samples per class (all iterations)
└── exp_result/          # evaluation results from run_eval.sh
    ├── DPC_5/           # evaluation results for DPC 5
    ├── DPC_10/          # evaluation results for DPC 10
    └── DPC_20/          # evaluation results for DPC 20

Results accumulate across iterations: each result/DPC_k folder holds the samples from every iteration up to that point. A run with --total_dpc 20 therefore also produces the DPC 5, 10, and 15 sets in result/, so you can evaluate those smaller settings without running a separate pipeline.
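
The relationship between iterations and DPC folders can be sketched in a few lines. This is an illustration under stated assumptions rather than code from the repository: the 5 samples per class per iteration is read off the directory layout above, and `dpc_checkpoints` and `result_dir` are hypothetical helper names.

```python
from pathlib import Path

# Assumption: each iteration adds 5 samples per class, as the
# DPC_5 / DPC_10 / DPC_15 / DPC_20 layout above suggests.
SAMPLES_PER_ITER = 5


def dpc_checkpoints(total_dpc, per_iter=SAMPLES_PER_ITER):
    """List (iteration, cumulative DPC) pairs for a run with the given total."""
    if total_dpc <= 0 or total_dpc % per_iter != 0:
        raise ValueError(f"total_dpc must be a positive multiple of {per_iter}")
    return [(i, i * per_iter) for i in range(1, total_dpc // per_iter + 1)]


def result_dir(run_dir, dpc):
    """Path of the accumulated result folder for one DPC setting."""
    return Path(run_dir) / "result" / f"DPC_{dpc}"


if __name__ == "__main__":
    run = "experiments/pipeline/sst2_DPC20_s10_20250320"
    for iteration, dpc in dpc_checkpoints(20):
        print(f"after iter_{iteration}: {result_dir(run, dpc)}")
```

For a run with a total DPC of 20, this lists four checkpoints, one per iteration, each pointing at the matching folder under result/.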

Citation

@inproceedings{shen2025condenselm,
    title={CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching},
    author={Shen, Cheng and Ong, Yew-Soon and Zhou, Joey Tianyi},
    booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
    pages={1237--1252},
    year={2025}
}