Minimal GRPO implementation

January 22, 2025

Goal: a working toy implementation of RL training for llama-3.2-3b with GRPO (Group Relative Policy Optimization), built to understand the algorithm and its hyperparameters. Everything runs locally on a single node.
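For orientation, here is the GRPO objective as formulated in the DeepSeekMath paper (listed under References), using outcome supervision, where each completion's scalar reward is normalized within its group of G samples. This is a transcription to aid reading, not something taken from train.py; the only notational change is using ρ for the probability ratio so it does not clash with the reward r_i.

```latex
% Objective, maximized over theta
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)}
\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\Big( \min\big( \rho_{i,t}\, \hat{A}_{i,t},\;
\mathrm{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_{i,t} \big)
- \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \right]

% Probability ratio and group-relative advantage (same for every token t of o_i)
\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}

% Per-token KL estimate (non-negative)
\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] =
\frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}
- \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1
```

The main hyperparameters to play with are the group size G, the clip range ε and the KL coefficient β.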

Setup

Create conda env

conda create --name grpo python=3.12 -y
conda activate grpo

Install dependencies

pip install -r requirements.txt
pip install flash-attn --no-build-isolation
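After installing, a short Python check can confirm that the key packages are importable from the active `grpo` environment. This is a minimal sketch: the module list is an assumption, since requirements.txt is not reproduced here, and `missing_modules` is a hypothetical helper. Note that flash-attn is imported as `flash_attn`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in the active env."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: edit it to match what requirements.txt actually installs.
    required = ["torch", "transformers", "flash_attn"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules found.")
```

`find_spec` only checks that a package can be located without importing it, so it will not catch a flash-attn build that fails at import time against your CUDA setup.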

Play with the source in train.py

python train.py
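The contents of train.py are not shown here, so the following is only a hypothetical sketch of the core quantity a GRPO training step computes for one prompt: group-normalized advantages and the clipped surrogate with a per-token KL penalty, following the DeepSeekMath formulation. It uses plain Python floats for readability; a real trainer would compute the same terms on PyTorch tensors so gradients flow through the current-policy log-probabilities. The function names, the `eps` stabilizer and the default `clip_eps=0.2` and `beta=0.04` are illustrative assumptions, not values read from train.py.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: normalize each reward within its group.

    eps is an assumed stabilizer so a group with identical rewards
    (std == 0) yields zero advantages instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is an equally valid choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """Negated GRPO objective for the G completions of a single prompt.

    logp_new, logp_old, logp_ref: one list per completion holding the
    per-token log-probabilities under the current policy, the policy that
    sampled the completions, and the frozen reference policy.
    rewards: one scalar reward per completion.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new_seq, old_seq, ref_seq, adv in zip(logp_new, logp_old, logp_ref, advantages):
        seq_sum = 0.0
        for lp_new, lp_old, lp_ref in zip(new_seq, old_seq, ref_seq):
            ratio = math.exp(lp_new - lp_old)  # rho_{i,t}
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            surrogate = min(ratio * adv, clipped * adv)
            # Per-token KL estimate: pi_ref/pi - log(pi_ref/pi) - 1 (never negative)
            log_ratio_ref = lp_ref - lp_new
            kl = math.exp(log_ratio_ref) - log_ratio_ref - 1.0
            seq_sum += surrogate - beta * kl
        total += seq_sum / len(new_seq)  # 1/|o_i| length normalization
    # Average over the group and negate, so minimizing the loss maximizes the objective
    return -total / len(rewards)


if __name__ == "__main__":
    # Toy group of G = 4 single-token completions; two were rewarded.
    rewards = [1.0, 0.0, 0.0, 1.0]
    logp = [[-0.7], [-1.2], [-0.9], [-0.4]]
    print(group_advantages(rewards))
    # With new == old == ref, every ratio is 1 and every KL term is 0,
    # so the loss reduces to minus the mean advantage, which is about 0.
    print(grpo_loss(logp, logp, logp, rewards))
```

Because advantages are computed relative to the group mean, GRPO needs no learned value function: completions that beat their siblings are pushed up and the rest are pushed down, while the clip range and β limit how far each update can move the policy away from the sampling policy and the reference model.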

Inspiration

OpenRLHF
Spinning Up in Deep RL

References

DeepSeek-R1 tech report
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models