Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining

August 22, 2025

This is the repository for the ACL 2025 paper "Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration".

Figure: Illustration of the multi-actor collaborative framework.

Updates

Release plan

TODOs:

  • Model Checkpoints
  • BERT Topic Model Checkpoint
  • Labeled SlimPajama-670B datasets
  • Code for methods ......