CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

November 4, 2025

News

2025.9.19: 🔥 CPPO has been accepted to NeurIPS'25!
2025.6.03: 🔥 We release the verl version of CPPO.

Abstract

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs because it must sample multiple completions for each question. Our experiments and theoretical analysis reveal two things: the number of completions affects model accuracy but increases training time multiplicatively, and not all completions contribute equally to policy training; their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy that maximizes GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing accuracy compared to the original GRPO.
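The pruning step described above can be illustrated with a minimal sketch. It assumes the usual GRPO-style advantage (each reward standardized within its group) and keeps only the k completions with the largest absolute advantage. The function names, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; they are not taken from the CPPO code base.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Standardize each reward within its group (GRPO-style advantage).

    Assumption: population standard deviation plus a small eps for stability.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prune_completions(rewards, pruning_rate):
    """Return (kept_indices, kept_advantages) for one group of completions.

    Keeps the k = G * (1 - P) completions with the largest |advantage|,
    so only those would take part in the gradient computation.
    """
    advantages = group_relative_advantages(rewards)
    group_size = len(rewards)
    k = max(1, round(group_size * (1 - pruning_rate)))
    # Rank completion indices by absolute advantage, largest first.
    ranked = sorted(range(group_size), key=lambda i: abs(advantages[i]), reverse=True)
    kept = sorted(ranked[:k])
    return kept, [advantages[i] for i in kept]


# Hypothetical group of G = 8 completions with binary correctness rewards.
rewards = [1, 0, 0, 0, 1, 0, 0, 0]
kept, kept_advantages = prune_completions(rewards, pruning_rate=0.75)
print(kept)  # [0, 4]
```

In this toy group the two correct completions sit farthest from the group mean, so they are the ones retained at a pruning rate of 75%.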

Main Results

GSM8K

Method	Group Size (G)	Pruning Rate (P)	k	Accuracy	Training Time	Speedup
Qwen2.5-1.5B-Instruct	-	-	-	55.72%	-	-
GRPO	16	0.00%	16	77.05%	23393s	1.00×
CPPO	16	50.00%	8	77.67%	12930s	1.81×
CPPO	16	75.00%	4	78.81%	7159s	3.27×
CPPO	16	87.50%	2	80.41%	4781s	4.89×
CPPO	16	93.75%	1	78.20%	2813s	8.32×
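The k and speedup columns follow from simple arithmetic on the other columns. The sketch below recomputes them from the GSM8K rows, assuming k = G × (1 − P) (a relation read off the table, not stated explicitly) and speedup = GRPO training time / CPPO training time. The helper names are hypothetical.

```python
def retained_completions(group_size, pruning_rate):
    """Number of completions kept per question: k = G * (1 - P)."""
    return round(group_size * (1 - pruning_rate))


def speedup(baseline_seconds, seconds):
    """Training-time ratio relative to the GRPO baseline."""
    return baseline_seconds / seconds


GRPO_SECONDS = 23393  # GRPO training time on GSM8K, from the table
CPPO_ROWS = [(0.50, 12930), (0.75, 7159), (0.875, 4781), (0.9375, 2813)]

for pruning_rate, seconds in CPPO_ROWS:
    k = retained_completions(16, pruning_rate)
    ratio = speedup(GRPO_SECONDS, seconds)
    print(f"P={pruning_rate:.2%}  k={k}  speedup={ratio:.2f}x")
# P=50.00%  k=8  speedup=1.81x
# P=75.00%  k=4  speedup=3.27x
# P=87.50%  k=2  speedup=4.89x
# P=93.75%  k=1  speedup=8.32x
```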

Math & Out-of-Distribution tasks

Method	Group Size (G)	Pruning Rate (P)	k	Accuracy	Training Time	Speedup	AMC 2023	AIME 2024
Qwen2.5-7B-Instruct	-	-	-	55.20%	-	-	25.62%	5.00%
GRPO	16	0.00%	16	75.20%	33902s	1.00×	46.88%	5.83%
CPPO	16	50.00%	8	75.20%	20550s	1.65×	53.12%	10.00%
CPPO	16	75.00%	4	77.20%	12959s	2.62×	49.38%	6.67%
CPPO	16	87.50%	2	75.20%	9657s	3.51×	46.25%	8.33%
CPPO	16	93.75%	1	72.80%	8375s	4.05×	45.00%	5.83%

To Reproduce

1. Prepare the environment:

conda create -n cppo python=3.11
conda activate cppo
pip install vllm==0.7.2
pip install setuptools
pip install flash-attn --no-build-isolation
pip install -e ".[dev]"
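After the install steps, a quick sanity check can confirm that the pinned vLLM version is the one actually present in the `cppo` environment. This is a minimal sketch using only the standard library. The helper names are hypothetical, and the distribution names (`vllm`, `flash-attn`, `setuptools`) are assumed to match the names used in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package to (expected, found, ok); expected=None accepts any version."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        ok = found is not None and (expected is None or found == expected)
        report[package] = (expected, found, ok)
    return report


if __name__ == "__main__":
    # Only vllm is pinned in the install steps; the others just need to exist.
    pins = {"vllm": "0.7.2", "flash-attn": None, "setuptools": None}
    for package, (expected, found, ok) in check_pins(pins).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{package}: expected={expected or 'any'} found={found} [{status}]")
```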

2. GSM8K:

Training

GRPO

sh scripts/GRPO_gsm.sh

CPPO

sh scripts/CPPO_gsm.sh

Evaluation

Qwen2.5-1.5B-Instruct

sh scripts/Eval_qwen2.5-1.5b.sh

CPPO-1.5B-n-16-0.875

sh scripts/Eval_gsm.sh

You can download the checkpoint from Hugging Face 🤗.

3. Math:

Training

GRPO

sh scripts/GRPO_math.sh

CPPO

sh scripts/CPPO_math.sh

Evaluation

Qwen2.5-7B-Instruct

sh scripts/Eval_qwen2.5-7b.sh

CPPO-7B-n-16-0.75

sh scripts/Eval_math.sh

You can download the checkpoint from Hugging Face 🤗.

Affiliation

Shanghai Innovation Institute
Xiamen University
Rakuten
East China Normal University

Acknowledgments

We are very grateful to the Open R1 team for creating an awesome repo.

Citation

@article{lin2025cppo,
  title={CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models},
  author={Lin, Zhihang and Lin, Mingbao and Xie, Yuan and Ji, Rongrong},
  journal={arXiv preprint arXiv:2503.22342},
  year={2025}
}
