RLFromScratch

August 28, 2025

This repo implements Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) from scratch in PyTorch, without relying on off-the-shelf libraries like TRL or VERL.

Why this repo

To open the black box: we unpack the training details—masking, KL penalties, scheduling, and evaluation—so you can see exactly how these algorithms work in practice.
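As a rough illustration of three of the details named above, here is a minimal sketch of how GRPO is commonly described: rewards are normalized within each group of completions sampled for the same prompt, a per-token KL estimate keeps the policy close to a reference model, and a mask restricts averaging to completion tokens. This is not taken from the repo's code. The function names, the `eps` value, and the use of plain Python lists instead of PyTorch tensors are all assumptions made for readability.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's sampled completions within the group.

    Hypothetical helper: advantage_i = (r_i - mean) / (std + eps).
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation over the group
    return [(r - mean) / (std + eps) for r in rewards]


def kl_penalty_per_token(policy_logps, ref_logps):
    """Per-token estimate exp(d) - d - 1, with d = log pi_ref - log pi_theta.

    The estimate is zero when both models agree on a token and positive otherwise.
    """
    return [math.exp(r - p) - (r - p) - 1.0 for p, r in zip(policy_logps, ref_logps)]


def masked_mean(values, mask):
    """Average only over positions where mask is 1 (completion tokens, not prompt or padding)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / max(len(kept), 1)


# Four completions sampled for one prompt, scored 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Three tokens: the first is a prompt token and is masked out of the average.
kl = kl_penalty_per_token([-1.0, -2.0, -0.5], [-1.0, -1.5, -0.5])
mean_kl = masked_mean(kl, [0, 1, 1])
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal, which is one reason the reward design and group size matter in practice.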

Quick results

  • GRPO on Llama-3.2-1B-Instruct (GSM8K): ~10% → ~23% accuracy in 1 epoch.
  • DPO on Llama-3.2-1B using Tiny-Safe-Pair (safe-pair-data): ~50% → ~60% preference accuracy in 3 epochs.

Both evaluation pipelines are included.
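For readers unfamiliar with the DPO metric, the sketch below shows one common way to define the loss and "preference accuracy": a pair counts as correct when the policy's implicit reward margin for the chosen response over the rejected one is positive. This is an assumption about the metric, not a description of the repo's evaluation script. The function names and the `beta` default are hypothetical, and the inputs are summed sequence log-probabilities given as plain floats.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))].

    Hypothetical helper; each argument is the summed log-probability of a full response.
    """
    return beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))


def dpo_loss(margins):
    """Mean of -log(sigmoid(margin)) over a batch of preference pairs."""
    return -sum(log_sigmoid(m) for m in margins) / len(margins)


def preference_accuracy(margins):
    """Fraction of pairs whose margin is strictly positive."""
    return sum(m > 0 for m in margins) / len(margins)


# Two pairs: the policy prefers the chosen response in the first pair only.
margins = [
    dpo_margin(-10.0, -12.0, -11.0, -11.0),
    dpo_margin(-9.0, -8.0, -8.5, -8.5),
]
loss = dpo_loss(margins)
accuracy = preference_accuracy(margins)
```

Under this definition a policy identical to its reference has a margin of exactly zero on every pair, so the loss starts at log 2 and improves only as the policy shifts probability toward the chosen responses.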

Training setup

The scripts default to multi-GPU training with PyTorch DDP. To run on a single GPU, adjust the launch command and disable distributed initialization. Evaluation is performed on a single GPU.

  • Training:

    torchrun --standalone --nproc_per_node=8 dpo/grpo_train_from_scratch.py
    
  • Evaluation:

    python dpo/grpo_evaluation.py
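
One way to make the same script work under both `torchrun` and a plain `python` launch is to read the environment variables that `torchrun` sets and to initialize the process group only when more than one process is running. The sketch below is an assumption about how such an adaptation could look, not the repo's actual code. `DistEnv`, `read_dist_env`, and `setup_model` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    """Hypothetical container for the launch environment."""
    rank: int
    local_rank: int
    world_size: int

    @property
    def is_distributed(self):
        return self.world_size > 1


def read_dist_env(environ=os.environ):
    """torchrun sets RANK, LOCAL_RANK and WORLD_SIZE; a plain `python` launch sets none."""
    return DistEnv(
        rank=int(environ.get("RANK", 0)),
        local_rank=int(environ.get("LOCAL_RANK", 0)),
        world_size=int(environ.get("WORLD_SIZE", 1)),
    )


def setup_model(model, env):
    """Move the model to this process's device and wrap it in DDP only when needed."""
    # Imported here so the environment parsing above can be used without torch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{env.local_rank}" if use_cuda else "cpu")
    if use_cuda:
        torch.cuda.set_device(device)
    model = model.to(device)
    if env.is_distributed:
        # With the default env:// init method, rank and world size come from the environment.
        dist.init_process_group(backend="nccl" if use_cuda else "gloo")
        model = DDP(model, device_ids=[env.local_rank] if use_cuda else None)
    return model, device


env = read_dist_env()
```

With this pattern, launching the script directly with `python` leaves `env.is_distributed` false, so the model is never wrapped in DDP and no process group is created.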
    

Algorithm Resources

I’ve written up explanations of the two algorithms in the following blog posts: