RLFromScratch

August 28, 2025 · View on GitHub

This repo implements Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) from scratch in PyTorch, without relying on off-the-shelf libraries like TRL or VERL.

GRPO paper: arXiv:2402.03300
DPO paper: arXiv:2305.18290

Why this repo

To open the black box: we unpack the training details—masking, KL penalties, scheduling, and evaluation—so you can see exactly how these algorithms work in practice.

Quick results

GRPO on Llama-3.2-1B-Instruct (GSM8K): ~10% → ~23% accuracy in 1 epoch.
DPO on Llama-3.2-1B using Tiny-Safe-Pair (safe-pair-data): ~50% → ~60% preference accuracy in 3 epochs.

Both evaluation pipelines are included.

The scripts default to multi-GPU training with PyTorch DDP, and can be easily adapted to a single GPU by adjusting the launch command and disabling distributed initialization. The evaluation is preformed using a single GPU.

Training:

torchrun --standalone --nproc_per_node=8 dpo/grpo_train_from_scratch.py

Evaluation:
```
python dpo/grpo_evaluation.py
```

Algorithm Resources

I’ve written down explanation of the two algorithms in the following blogs:

DPO
GRPO

Why this repo

Quick results

Training setup

Algorithm Resources