Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining
August 22, 2025
This is the repository for the ACL 2025 paper "Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration".

Updates
- [5 May, 2025]: Our paper has been accepted to ACL 2025, and our code is released.
- [21 October, 2024]: We release the labeled SlimPajama datasets.
- [14 October, 2024]: We release our 1.3B model checkpoints and BERT Topic Classifier.
Release plan
TODOs:
- Model Checkpoints
- BERT Topic Model Checkpoint
- Labeled SlimPajama-670B datasets
- Code for methods ......