CondenseLM
April 24, 2026
Official code for the EMNLP 2025 paper "CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching".
Setup
pip install -r requirements.txt
Add your OpenAI API key to experiments/openai.env:
OPENAI_API_KEY=your_key_here
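Before launching a long run, it can help to confirm that the key file is in place and no longer holds the placeholder. The following is a minimal sketch that uses only the standard library. The helper names `load_env_file` and `key_is_configured` are hypothetical and not part of this repository, and the scripts here may read the file in a different way.

```python
from pathlib import Path


def load_env_file(path):
    """Parse simple KEY=VALUE lines from an env file into a dict.

    Hypothetical helper for illustration; blank lines, comment lines,
    and lines without '=' are skipped.
    """
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def key_is_configured(path="experiments/openai.env"):
    """Return True if the file exists and holds a non-placeholder key."""
    env_path = Path(path)
    if not env_path.is_file():
        return False
    key = load_env_file(env_path).get("OPENAI_API_KEY", "")
    return bool(key) and key != "your_key_here"
```

Calling `key_is_configured()` from the repository root returns `False` until the placeholder in experiments/openai.env has been replaced.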
Usage
# Step 1: Train the representability reward model
bash run_train_rep_model.sh
# Step 2: Run the condensation pipeline
bash run_pipeline.sh
# Step 3: Evaluate the condensed dataset
bash run_eval.sh
Uncomment the desired dataset block in each script. Supported datasets: SST-2, MNLI, AG News, IMDB.
Pipeline configurations are in pipeline_configs/.
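If you prefer to launch the three steps from Python, the sketch below runs the scripts listed above in order and stops at the first one that fails. The function `run_steps` and its `runner` parameter are hypothetical names introduced for illustration; the repository itself only ships the shell scripts.

```python
import subprocess

# Script names as listed in the Usage section, in execution order.
STEPS = ["run_train_rep_model.sh", "run_pipeline.sh", "run_eval.sh"]


def run_steps(scripts=STEPS, runner=subprocess.run):
    """Run each script with bash, in order.

    With check=True, subprocess.run raises CalledProcessError on a
    non-zero exit code, so later steps are skipped after a failure.
    The runner argument exists only so the call can be stubbed out.
    """
    for script in scripts:
        runner(["bash", script], check=True)
```

Call `run_steps()` from the repository root after uncommenting the desired dataset block in each script.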
Output Structure
experiments/pipeline/sst2_DPC20_s10_20250320/
├── config.json
├── selected_indices/
├── iter_1/
│   ├── scores/          # coverage reward scores
│   ├── selected_data/   # source examples for LLM
│   ├── distilled_data/  # LLM-generated candidates + best candidate
│   └── models/          # coverage reward model checkpoint
├── iter_2/
├── iter_3/
├── iter_4/
├── result/
│   ├── DPC_5/           # 5 samples per class (iter 1 only)
│   ├── DPC_10/          # 10 samples per class (iter 1 + 2)
│   ├── DPC_15/          # 15 samples per class (iter 1 + 2 + 3)
│   └── DPC_20/          # 20 samples per class (all iterations)
└── exp_result/          # evaluation results from run_eval.sh
    ├── DPC_5/           # evaluation results for DPC 5
    ├── DPC_10/          # evaluation results for DPC 10
    └── DPC_20/          # evaluation results for DPC 20
Results are accumulated across iterations. If you run with --total_dpc 20, you can evaluate smaller DPC settings (e.g., DPC 5, 10, 15) directly from the result/ folder without running a separate pipeline.
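As a rough illustration of how the accumulated results map to folders, the sketch below lists the DPC settings available from a single run and builds the matching result/ path. It assumes each iteration adds 5 samples per class, which is inferred from the example tree above and may not hold for other configurations. The helper names `available_dpc_settings` and `result_dir` are hypothetical.

```python
from pathlib import Path


def available_dpc_settings(total_dpc, per_iteration=5):
    """List the cumulative DPC values produced by one run.

    per_iteration=5 is an assumption taken from the example tree
    (DPC_5, DPC_10, DPC_15, DPC_20 for a total of 20).
    """
    if total_dpc <= 0 or per_iteration <= 0:
        raise ValueError("total_dpc and per_iteration must be positive")
    if total_dpc % per_iteration != 0:
        raise ValueError("total_dpc must be a multiple of per_iteration")
    return list(range(per_iteration, total_dpc + 1, per_iteration))


def result_dir(run_dir, dpc):
    """Build the path of the result/ subfolder for a given DPC value."""
    return Path(run_dir) / "result" / f"DPC_{dpc}"
```

For the example run above, `available_dpc_settings(20)` yields 5, 10, 15, and 20, and `result_dir("experiments/pipeline/sst2_DPC20_s10_20250320", 10)` points at the DPC_10 folder.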
Citation
@inproceedings{shen2025condenselm,
title={CondenseLM: LLMs-driven Text Dataset Condensation via Reward Matching},
author={Shen, Cheng and Ong, Yew-Soon and Zhou, Joey Tianyi},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={1237--1252},
year={2025}
}