RLHF Book - Code Examples

May 15, 2026

Educational code examples accompanying RLHF Book by Nathan Lambert.

Join the Discord Community to ask questions, share runs, and compare notes on these examples.

I primarily run experiments on a DGX Spark. For setup advice, see my dgx-spark-setup guide.

Note: There's an open PR here exploring the idea of adding speedrun functionality to this repository — comment if you're interested in pushing this further or seeing it merged into main.

Reader Experiment Path

All commands below assume:

cd code/
uv sync

Start with one short run, confirm that the learning signal is visible, then sweep one variable at a time. If you do not want W&B logging, set WANDB_MODE=disabled or use the module-specific no-W&B option when available. If you are running these with a coding assistant, launch long training/eval commands in the background and monitor them; foreground runs can time out before they produce useful metrics.

| Chapter | Starting experiment | Command | What to inspect |
|---|---|---|---|
| Chapter 4: Instruction Tuning | SFT OLMo-2-1B base on No Robots | uv run python -m instruction_tuning.train --config instruction_tuning/configs/sft_olmo2_1b.yaml | Loss curve and the in-loop sample panels: the base model rambles at step 0; after a few hundred steps it answers and stops. |
| Chapter 5: Reward Models | Bradley-Terry RM on UltraFeedback | uv run python -m reward_models.train_preference_rm --samples 2000 --epochs 1 | Chosen/rejected reward margin, training loss, demo scoring |
| Chapter 5: Reward Models | ORM on GSM8K | uv run python -m reward_models.train_orm --samples 400 --epochs 2 | Whether correct final answers score above perturbed answers |
| Chapter 6: Policy Gradients | GRPO on spell_backward | uv run python -m policy_gradients.train --config policy_gradients/configs/grpo.yaml | avg_correctness, avg_format, avg_binary, and whether groups contain contrast |
| Chapter 8: Direct Alignment | DPO on UltraFeedback | uv run python -m direct_alignment.train --loss dpo --max_samples 1000 | accuracy, margins, chosen_rewards, rejected_rewards, sample generations |
| Chapter 9: Rejection Sampling | GSM8K reward selection versus random controls | uv run python -m rejection_sampling.train --config rejection_sampling/configs/top_per_prompt.yaml | Final exact-match accuracy against the matched random baseline |
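
If you drive these commands from a script or a coding assistant, the advice above about background launches and W&B can be sketched as follows. This is a minimal illustration, not part of the repository: the helper names launch_in_background and tail are hypothetical, and the only repo-specific detail reused here is the WANDB_MODE=disabled switch.

```python
import os
import subprocess
from pathlib import Path


def launch_in_background(cmd, log_path, disable_wandb=True):
    """Start cmd without blocking; stdout and stderr go to log_path.

    Hypothetical helper for illustration, not a function from this repo.
    """
    env = os.environ.copy()
    if disable_wandb:
        # Same effect as `export WANDB_MODE="disabled"` in the shell.
        env["WANDB_MODE"] = "disabled"
    with open(log_path, "w") as log_file:
        # The child process keeps its own handle on the log file.
        proc = subprocess.Popen(
            cmd, stdout=log_file, stderr=subprocess.STDOUT, env=env
        )
    return proc


def tail(log_path, n=5):
    """Return the last n lines written to the log so far."""
    return Path(log_path).read_text().splitlines()[-n:]


# Example (not executed here):
# proc = launch_in_background(
#     ["uv", "run", "python", "-m", "policy_gradients.train",
#      "--config", "policy_gradients/configs/grpo.yaml"],
#     "grpo.log",
# )
# print(tail("grpo.log"))   # poll while proc.poll() is None
```

Polling the log with tail while proc.poll() returns None lets you read metrics as they appear, without a foreground timeout cutting the run short.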

Good first sweeps:

  • Instruction tuning: keep sft_olmo2_1b.yaml fixed and vary lr (5e-6 vs 1e-5), num_epochs, or max_samples to see how quickly the base→assistant transition emerges.
  • Policy gradients: copy policy_gradients/configs/grpo.yaml and vary num_rollouts, temperature, format_weight, and data.size.
  • Direct alignment: hold the dataset fixed and compare dpo.yaml, ipo.yaml, and dpo_norm.yaml; read IPO through margins/accuracy, not raw loss scale.
  • Reward models: vary --samples, --lr, and --model-id before changing the model architecture.
  • Rejection sampling: keep generation/scoring settings identical while comparing top_* configs to their random_* controls.

The book chapters now include suggested exercises at the end of Chapters 4, 5, 6, 8, and 9.

Attribution

This code is built on the excellent work of community contributors:

Policy Gradients

  • Original Repository: zafstojano/policy-gradients
  • Author: Zafir Stojanovski (@zafstojano)
  • License: Apache 2.0

A clean, educational implementation of policy gradient methods for reinforcement learning. Implements REINFORCE, RLOO, PPO, GRPO, Dr. GRPO, GSPO, CISPO, SAPO, and DAPO with mathematical formulations matching the book's Chapter 6 (Policy Gradient Methods).

Reward Models (ORM/PRM)

  • Original Repository: myhott163com/RLHF_ORM_PRM
  • Author: @myhott163com
  • License: MIT

Minimal implementations of Outcome Reward Models (ORM) and Process Reward Models (PRM), demonstrating the concepts from Chapter 5 (Reward Models).


Installation

Requires Python 3.12+ and an up-to-date uv (uv self update). See #366 for troubleshooting uv compatibility.

Ubuntu/Debian users: install build tools first (needed to compile native dependencies):

sudo apt install -y build-essential python3-dev

Then install:

cd code/
uv sync

By default, Flash Attention is turned off to support a broad range of hardware, but for speedups you should consider installing it:

uv sync --extra flash

Note: If a pre-built wheel matches your CUDA version this installs in seconds. If not (e.g. CUDA 13), it falls back to a source build which needs a CUDA toolkit and can take several minutes. If the build fails, just use the base install — the code automatically falls back to PyTorch SDPA and all examples will work without it.

Platform notes

  • Standard x86_64 systems: Flash Attention provides a ~10-20% training speedup on Ampere/Ada GPUs (e.g. 3090, 4090). Pre-built wheels are available for CUDA 12.x (see the Flash Attention releases page). As of 11 Apr. 2026, CUDA 13 requires a source build, which tends to be a pain, so nothing in this repo is gated on Flash Attention.
  • DGX Spark / aarch64: Flash Attention is not available on ARM64/Blackwell. The code automatically falls back to PyTorch SDPA, which is actually faster on these systems due to native cuDNN optimizations.
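
The automatic fallback described above could look roughly like the sketch below. It is an assumption about the mechanism, not a copy of this repo's code: pick_attn_implementation is a hypothetical helper that checks whether the flash_attn package is importable and otherwise selects PyTorch SDPA.

```python
import importlib.util


def pick_attn_implementation(prefer_flash=True):
    """Return "flash_attention_2" if flash_attn is installed, else "sdpa".

    Hypothetical helper; the repo's own selection logic may differ.
    """
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


# Example (not executed here): both strings are accepted by the
# attn_implementation argument in Hugging Face Transformers.
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, attn_implementation=pick_attn_implementation()
# )
```

On DGX Spark / aarch64 you would call it with prefer_flash=False, or simply leave Flash Attention uninstalled, and get "sdpa".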

Instruction Tuning (SFT)

Supervised fine-tune a base language model on an instruction dataset so it answers questions and stops, instead of continuing the prompt as raw text. See instruction_tuning/README.md for the full walk-through.

# SFT OLMo-2-1B base on No Robots (Chapter 4)
uv run python -m instruction_tuning.train \
    --config instruction_tuning/configs/sft_olmo2_1b.yaml

Training Results

[Figure: Instruction Tuning Training Results]

The most informative signal is the in-loop sample panels: at step ~100 the model rambles past <|endoftext|> and invents follow-up questions; by step ~650 it produces a single, terminated assistant reply.
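
To turn that qualitative signal into a quick check on saved samples, something like the sketch below may help. It assumes the sample text is available as a plain string that still contains the <|endoftext|> marker; split_at_eos and stops_cleanly are hypothetical names, not functions from instruction_tuning.

```python
EOS = "<|endoftext|>"


def split_at_eos(text, eos=EOS):
    """Return (reply, trailing); trailing is None when no EOS was emitted."""
    reply, sep, trailing = text.partition(eos)
    return reply.strip(), (trailing.strip() if sep else None)


def stops_cleanly(text, eos=EOS):
    """True only when the sample emits EOS and nothing follows it."""
    _, trailing = split_at_eos(text, eos)
    return trailing == ""


def clean_stop_rate(samples, eos=EOS):
    """Fraction of samples that end with a single, terminated reply."""
    return sum(stops_cleanly(s, eos) for s in samples) / len(samples)
```

A rate near 0 early in training and near 1 later would match the rambling-to-terminated transition described above.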

Example Run

| Model | Dataset | Example Run |
|---|---|---|
| OLMo-2-0425-1B (base) | HuggingFaceH4/no_robots | wandb |

Policy Gradient Training

Train various policy gradient algorithms on procedural reasoning tasks:

# GRPO (Chapter 6)
uv run python -m policy_gradients.train --config policy_gradients/configs/grpo.yaml

# PPO with value function
uv run python -m policy_gradients.train --config policy_gradients/configs/ppo.yaml

# REINFORCE baseline
uv run python -m policy_gradients.train --config policy_gradients/configs/reinforce.yaml

# RLOO (Leave-One-Out)
uv run python -m policy_gradients.train --config policy_gradients/configs/rloo.yaml

Training Results

[Figure: Policy Gradient Training Results]

Available algorithms

| Algorithm | Config | Description | Example Run |
|---|---|---|---|
| REINFORCE | reinforce.yaml | Williams (1992) - vanilla policy gradient | wandb |
| RLOO | rloo.yaml | REINFORCE Leave-One-Out (Ahmadian et al., 2024) | wandb |
| PPO | ppo.yaml | Proximal Policy Optimization (Schulman et al., 2017) | wandb |
| GRPO | grpo.yaml | Group Relative Policy Optimization (Shao et al., 2024) | wandb |
| Dr. GRPO | drgrpo.yaml | Dr. GRPO (Liu et al., 2025) | wandb |
| GSPO | gspo.yaml | Group-Sequence Policy Optimization (Zheng et al., 2025) | wandb |
| CISPO | cispo.yaml | Clipped Importance Sampling PO (MiniMax, 2025) | wandb |
| SAPO | sapo.yaml | Soft Adaptive Policy Optimization (Qwen Team, 2025) | wandb |
| DAPO | dapo.yaml | Decoupled Clip and Dynamic sAmpling Policy Optimization (ByteDance, 2025) | wandb |
| MaxRL | maxrl.yaml | Maximum Likelihood Reinforcement Learning (Tajwar et al., 2026) | wandb |
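
Several of these algorithms differ mainly in how they turn a group of rollout rewards for one prompt into advantages. The sketch below shows the textbook forms for GRPO and RLOO, plus the "contrast" check mentioned in the experiment table. It is an illustration under assumptions: the function names are hypothetical, GRPO here uses the population standard deviation with a small eps, and the repo's own implementation may differ in those details.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out: each reward minus the mean of the other rewards."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def has_contrast(rewards):
    """A group with identical rewards gives zero advantage everywhere."""
    return max(rewards) > min(rewards)
```

If every rollout in a group gets the same reward, has_contrast is False and both functions return all zeros, so that prompt contributes no gradient signal. Dr. GRPO drops the division by the standard deviation, which in this sketch would mean returning r - mu directly.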

Reward Model Training

Note: Experimental - Reward model training needs tuning of hyperparameters, datasets, and models for cleaner training curves. Contributions welcome!

Train reward models on various datasets:

# Standard Preference RM (Chapter 5) - Bradley-Terry on UltraFeedback
uv run python -m reward_models.train_preference_rm

# Outcome Reward Model (Chapter 5) - trains on GSM8K
uv run python -m reward_models.train_orm

# Process Reward Model (Chapter 5) - trains on PRM800K
uv run python -m reward_models.train_prm

Preference RM (Bradley-Terry)

Standard preference-based reward model using the Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)). This is the approach used in InstructGPT, Llama 2, and most production RLHF systems. Trains on UltraFeedback preference data.
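
As a worked instance of that loss, the sketch below evaluates -log(sigmoid(r_chosen - r_rejected)) on scalar rewards in plain Python, along with the margin and accuracy you would watch during training. The function names are illustrative and not taken from reward_models; the actual training code presumably operates on batched tensors.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    return -log_sigmoid(r_chosen - r_rejected)


def batch_metrics(chosen, rejected):
    """Mean loss, mean reward margin, and pairwise accuracy for a batch."""
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(margins)
    return {
        "loss": sum(-log_sigmoid(m) for m in margins) / n,
        "margin": sum(margins) / n,
        "accuracy": sum(m > 0 for m in margins) / n,
    }
```

When the model cannot separate the pair (margin 0) the loss is log 2, about 0.693, so a training loss hovering there means the reward margin has not started to open up.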

ORM (Outcome Reward Model)

Binary classification on solution correctness. Fine-tunes Qwen3-0.6B on GSM8K, learning to distinguish correct from incorrect math solutions.
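
A minimal sketch of that binary objective, assuming the model emits one correctness logit per scored solution. Whether the real code applies the loss per token or per sequence is not stated here, and the function names below are hypothetical.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw logit; label is 1 (correct) or 0."""
    # Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def p_correct(logit):
    """Sigmoid of the logit: the predicted probability of correctness."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def ranks_correct_first(correct_logit, perturbed_logits):
    """True if the correct solution outscores every perturbed one."""
    return all(correct_logit > p for p in perturbed_logits)
```

ranks_correct_first mirrors the check suggested in the experiment table: a trained ORM should score the correct final answer above its perturbed variants.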

PRM (Process Reward Model)

Step-level classification on reasoning quality. Fine-tunes Qwen3-0.6B on PRM800K, learning to rate individual reasoning steps as {-1, 0, 1} (bad, neutral, good).
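
One common way to train on {-1, 0, 1} step labels is a three-way softmax cross-entropy per reasoning step. The sketch below assumes that formulation and a hypothetical label-to-class mapping; the repo's PRM head and loss may be set up differently.

```python
import math

# Assumed mapping from PRM800K-style ratings to class indices.
LABEL_TO_CLASS = {-1: 0, 0: 1, 1: 2}


def log_softmax(logits):
    """Stable log-softmax over a short list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def step_loss(step_logits, label):
    """Cross-entropy for one step; label is -1, 0, or 1."""
    return -log_softmax(step_logits)[LABEL_TO_CLASS[label]]


def solution_loss(per_step_logits, labels):
    """Mean step loss over all rated steps of one solution."""
    losses = [step_loss(l, y) for l, y in zip(per_step_logits, labels)]
    return sum(losses) / len(losses)
```

With uninformative logits the per-step loss is log 3, about 1.099, which gives a baseline for reading the PRM loss curve.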

Example Runs

| Model | Description | Example Run |
|---|---|---|
| Preference RM | Bradley-Terry on UltraFeedback | wandb |
| ORM | Outcome RM on GSM8K | wandb |
| PRM | Process RM on PRM800K | wandb |

Direct Alignment Training

Train direct alignment algorithms (DPO and variants) on preference data:

# DPO (Chapter 8)
uv run python -m direct_alignment.train --config direct_alignment/configs/dpo.yaml

# IPO - more robust to noisy labels
uv run python -m direct_alignment.train --config direct_alignment/configs/ipo.yaml

# SimPO - no reference model needed
uv run python -m direct_alignment.train --config direct_alignment/configs/simpo.yaml

# Quick test run (1k samples)
uv run python -m direct_alignment.train --loss dpo --max_samples 1000

Available algorithms

| Algorithm | Config | Description |
|---|---|---|
| DPO | dpo.yaml | Direct Preference Optimization (Rafailov et al., 2023) |
| cDPO | N/A (use --loss cdpo) | Conservative DPO with label smoothing |
| IPO | ipo.yaml | Identity Preference Optimization (Azar et al., 2023) |
| SimPO | simpo.yaml | Simple PO - length-normalized, no ref model (Meng et al., 2024) |
| ORPO | orpo.yaml | Odds Ratio PO - combines SFT + preference (Hong et al., 2024) |
| KTO | kto.yaml | Kahneman-Tversky Optimization (Ethayarajh et al., 2024) |
| APO-Zero | apo_zero.yaml | Anchored PO, chosen-up / rejected-down (D'Oosterlinck et al., 2024) |
| APO-Down | apo_down.yaml | Anchored PO, both-down with larger rejected drop (D'Oosterlinck et al., 2024) |
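
To make the differences between the first three rows concrete, the sketch below writes the DPO, cDPO, and IPO losses in their standard paper forms on scalar sequence log-probabilities. It is not the code in direct_alignment: the function names and default beta are assumptions, and details such as length averaging may differ in the repo.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_ratio_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """h = (policy - reference) log-prob on chosen minus the same on rejected."""
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)


def dpo_loss(h, beta=0.1):
    return -log_sigmoid(beta * h)


def cdpo_loss(h, beta=0.1, label_smoothing=0.1):
    """DPO with a probability label_smoothing that the label is flipped."""
    eps = label_smoothing
    return -(1 - eps) * log_sigmoid(beta * h) - eps * log_sigmoid(-beta * h)


def ipo_loss(h, beta=0.1):
    """Squared distance of h from the target margin 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2


def implicit_rewards(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit rewards, their margin, and whether chosen beats rejected."""
    chosen = beta * (pol_chosen - ref_chosen)
    rejected = beta * (pol_rejected - ref_rejected)
    return chosen, rejected, chosen - rejected, float(chosen > rejected)
```

At h = 0 with beta = 0.1, dpo_loss is log 2 (about 0.693) while ipo_loss is 25, which is why IPO runs are better compared through margins and accuracy than through raw loss values.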

Training Results

[Figure: Direct Alignment Training Results]

See Chapter 8 of RLHF Book for mathematical derivations.

Rejection Sampling

Run the rejection sampling pipeline from Chapter 9: generate multiple completions per prompt, score them with a reward model, select a subset, then run SFT on the selected prompt-completion pairs.

# Preprocess once (generate + score rollouts)
uv run python -m rejection_sampling.preprocess \
    --config rejection_sampling/configs/top_per_prompt.yaml

# Train each selection config on the cached rollouts
uv run python -m rejection_sampling.train \
    --config rejection_sampling/configs/top_per_prompt.yaml
uv run python -m rejection_sampling.train \
    --config rejection_sampling/configs/random_per_prompt.yaml
uv run python -m rejection_sampling.train \
    --config rejection_sampling/configs/top_k_overall.yaml
uv run python -m rejection_sampling.train \
    --config rejection_sampling/configs/random_k_overall.yaml
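
The four configs above differ only in the selection step. A rough sketch of what each strategy selects from the cached rollouts is shown below. The rollout record layout (prompt_id, completion, score) and the function signatures are assumptions for illustration, not the schema used by rejection_sampling.

```python
import random
from collections import defaultdict


def top_per_prompt(rollouts):
    """Keep the highest-scoring completion for each prompt (best-of-N)."""
    best = {}
    for r in rollouts:
        current = best.get(r["prompt_id"])
        if current is None or r["score"] > current["score"]:
            best[r["prompt_id"]] = r
    return list(best.values())


def random_per_prompt(rollouts, seed=0):
    """Control: one uniformly random completion per prompt."""
    rng = random.Random(seed)
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt_id"]].append(r)
    return [rng.choice(group) for group in by_prompt.values()]


def top_k_overall(rollouts, k):
    """Keep the k highest-scoring completions across the whole pool."""
    return sorted(rollouts, key=lambda r: r["score"], reverse=True)[:k]


def random_k_overall(rollouts, k, seed=0):
    """Control: k completions sampled uniformly from the whole pool."""
    return random.Random(seed).sample(rollouts, k)
```

Each random control returns the same number of examples as its top_* counterpart, so any accuracy gap after SFT can be attributed to the reward-based selection and not to dataset size. Note that top_k_overall may take several completions from one prompt and none from another.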

Training Results

[Figure: Rejection Sampling Results]

Example Runs

| Strategy | Description | Example Run |
|---|---|---|
| top_per_prompt | Best-of-N completion per prompt | wandb |
| random_per_prompt | Random per-prompt control | wandb |
| top_k_overall | Best K completions across the full pool | wandb |
| random_k_overall | Random flat-pool control | wandb |

On the reference 1k-train / 200-test GSM8K slice, top_k_overall beat its matched random baseline, while top_per_prompt and random_per_prompt were effectively tied.

Configuration

Weights & Biases Logging

Training runs are logged to Weights & Biases. Configure via environment variables:

# Required: Your wandb API key
export WANDB_API_KEY="your-key"

# Optional: Override project name (default: from config file)
export WANDB_PROJECT="rlhf-book"

# Official maintainers publishing reference runs can target the team project:
# export WANDB_ENTITY="rlhf-book"
# export WANDB_PROJECT="core"

# Optional: Override run name
export WANDB_RUN_NAME="grpo_experiment_1"

Official runs for this repo are logged to: wandb.ai/rlhf-book/core

For public reference links, set the project visibility to Public in W&B so no login is required to view training curves, configs, and metrics.

To disable wandb logging entirely, set wandb_project: null in your config or:

export WANDB_MODE="disabled"

Other environment variables

# HuggingFace access (for gated models)
export HF_TOKEN="your-token"

Memory requirements

| Training | Model | GPU Memory |
|---|---|---|
| Policy gradients | Qwen3-1.7B | ~16GB (single GPU) |
| Reward models | Qwen3-0.6B | ~8-16GB |
| Reward models | Qwen3-1.7B | ~16-20GB |

Linting

This project uses Ruff for linting and formatting. A CI check runs on every PR that touches code/.

# Check for lint errors
uvx ruff check .

# Check formatting
uvx ruff format --check .

# Auto-fix lint errors and formatting
uvx ruff check --fix .
uvx ruff format .

Configuration is in pyproject.toml (line length 100, Python 3.12 target).

Testing

The test suite intentionally starts with lightweight smoke coverage for imports and CLI entrypoints. It should not download datasets, load models, or require GPUs.

uv run --extra dev pytest

Book Chapters

These examples correspond to:

  • Chapter 4: Instruction Tuning (SFT)
  • Chapter 5: Reward Models (ORM, PRM, Preference RM)
  • Chapter 6: Policy Gradient Methods (REINFORCE, PPO, GRPO, etc.)
  • Chapter 8: Direct Alignment (DPO, IPO, SimPO, KTO, etc.)
  • Chapter 9: Rejection Sampling

See rlhfbook.com for the full text.

Citation

To cite this book, please use the following format:

@book{rlhf2025,
  author       = {Nathan Lambert},
  title        = {Reinforcement Learning from Human Feedback},
  year         = {2025},
  publisher    = {Online},
  url          = {https://rlhfbook.com},
}

License

  • Policy gradients code (policy_gradients/): Apache 2.0 (from zafstojano/policy-gradients)
  • Reward models code (reward_models/): MIT (from myhott163com/RLHF_ORM_PRM)
  • Direct alignment code and other adaptations: MIT