April 5, 2026
# R³L: Reflect-then-Retry Reinforcement Learning

*Language-Guided Exploration, Pivotal Credit, and Positive Amplification*

R³L is a reinforcement learning algorithm that improves LLM reasoning and agentic capabilities through language-guided exploration, pivotal credit assignment, and positive amplification. Built on Trinity-RFT.
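The language-guided exploration loop can be sketched as follows. This is a conceptual illustration only — `rollout`, `reflect`, and the policy/reflector interfaces are hypothetical names, not the actual R³L or Trinity-RFT API:

```python
def reflect_then_retry(task, policy, reflector, max_retries=2):
    """Conceptual sketch of language-guided exploration: roll out a
    trajectory; on failure, ask the model to reflect in natural language
    on what went wrong, then retry with that reflection as guidance.
    All names here are illustrative, not the real R3L API."""
    guidance = ""
    trajectories = []
    for attempt in range(max_retries + 1):
        traj = policy.rollout(task, guidance=guidance)  # one full episode
        trajectories.append(traj)
        if traj.success:
            break
        # Failed attempt: produce a language reflection and feed it
        # back into the next attempt's context.
        guidance = reflector.reflect(task, traj)
    return trajectories
```

Both the failed and the retried trajectories are kept, which is what makes downstream credit assignment over the retry possible.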
## Installation

```bash
conda create -n r3l python=3.10
conda activate r3l
pip install -e ".[dev]"
pip install flash-attn==2.8.1 --no-build-isolation
```
## Quick Start

```bash
# 1. Set environment variables
export TRINITY_MODEL_PATH=/path/to/Qwen2.5-1.5B-Instruct
export TRINITY_CHECKPOINT_ROOT_DIR=./checkpoints

# 2. Start a Ray cluster
ray start --head

# 3. Run training
# ALFWorld
trinity run --config examples/R3L/alfworld/opmd_R3L_1.5B.yaml
# WebShop
trinity run --config examples/R3L/webshop/opmd_R3L_1.5B.yaml
# ScienceWorld
trinity run --config examples/R3L/scienceworld/opmd_R3L_1.5B.yaml
# DAPO (math)
trinity run --config examples/R3L/dapo/opmd_R3L_1.5B.yaml
```
## Supported Environments

| Environment | Config | Notes |
|---|---|---|
| ALFWorld | examples/R3L/alfworld/opmd_R3L_*.yaml | Requires the alfworld package |
| WebShop | examples/R3L/webshop/opmd_R3L_*.yaml | Requires ~1TB memory |
| ScienceWorld | examples/R3L/scienceworld/opmd_R3L_*.yaml | Requires Java |
| DAPO | examples/R3L/dapo/opmd_R3L_*.yaml | Auto-downloaded from HuggingFace |

Configs are provided for 1.5B, 3B, and 7B models.
## Baselines
| Method | Config Pattern |
|---|---|
| GRPO | examples/R3L/<env>/grpo_*.yaml |
| OPMD | examples/R3L/<env>/opmd_*.yaml |
| RAFT | examples/R3L/<env>/RAFT_*.yaml |
| Reflect-GRPO | examples/R3L/<env>/reflect_grpo_*.yaml |
| Critique-GRPO | examples/R3L/<env>/critique_grpo_*.yaml |
## Ablation Variants
| Variant | Config Pattern |
|---|---|
| R³L (Full) | examples/R3L/<env>/opmd_R3L_*.yaml |
| w/o Pivotal Credit | examples/R3L/<env>/opmd_R3L_w_o_credit_*.yaml |
| w/o Positive Amplification | examples/R3L/<env>/opmd_R3L_w_o_reweight_*.yaml |
| w/o Reflect-Retry | examples/R3L/<env>/opmd_reweight_adv_*.yaml |
## Configuration

Key hyperparameters:

```yaml
algorithm:
  algorithm_type: opmd_reweight_adv  # R3L algorithm
  repeat_times: 8                    # Trajectories per task
model:
  model_path: ${oc.env:TRINITY_MODEL_PATH}
  max_response_tokens: 512
buffer:
  explorer_input:
    default_workflow_type: 'R3L_alfworld_workflow'
```
| Parameter | Default | Description |
|---|---|---|
| repeat_times | 8 | Group size (N) |
| α | 3.0 | Positive amplification factor |
| lr | 1e-6 | Learning rate |
| temperature | 1.0 | Training temperature |
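One plausible way α could enter the advantage computation is sketched below. This is an assumed form for illustration only — the actual implementation lives in trinity/algorithm/advantage_fn/opmd_advantage.py — but it shows the idea: baseline each trajectory's reward against its group of N rollouts, then upweight above-baseline trajectories by α:

```python
from statistics import mean

def amplified_advantages(group_rewards, alpha=3.0):
    """Illustrative sketch (assumed form, not the repo's code):
    group-mean baseline followed by positive amplification."""
    baseline = mean(group_rewards)  # mean over the N trajectories in the group
    advantages = [r - baseline for r in group_rewards]
    # Positive amplification: scale above-baseline advantages by alpha.
    return [alpha * a if a > 0 else a for a in advantages]
```

For example, `amplified_advantages([1.0, 0.0, 0.0, 0.0])` has baseline 0.25 and returns `[2.25, -0.25, -0.25, -0.25]` — the single successful trajectory's advantage is tripled while failures keep their raw negative advantage.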
## Project Structure

```
R3L/
├── trinity/common/workflows/envs/R3L/   # R3L workflows
│   ├── alfworld/
│   ├── webshop/
│   ├── scienceworld/
│   └── dapo/
├── trinity/algorithm/                   # Algorithm implementations
│   └── advantage_fn/opmd_advantage.py   # Positive Amplification
└── examples/R3L/                        # Training configs
```
## Environment Setup

### ALFWorld
**Step 1: Install alfworld**

```bash
pip install alfworld
```

**Step 2: Download data**

```bash
# Option 1: Auto-download to ~/.cache/alfworld/
alfworld-download

# Option 2: Specify a download path
alfworld-download --data-dir ./alf-data
```
**Step 3: Configure the data path**

Edit examples/R3L/alfworld/get_alfworld_data.py:

```python
# Line 11: set this to your actual data path
alfworld_data_root = "/your/local/path/alfworld/json_2.1.1"
```

Note: keep `json_2.1.1` at the end of the path.
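Since a wrong suffix is easy to miss, a quick sanity check before running the processing script (an illustrative snippet, not part of the repo):

```python
import os

# Same value you set in get_alfworld_data.py (placeholder path shown here).
alfworld_data_root = "/your/local/path/alfworld/json_2.1.1"

# The processing script expects the directory name json_2.1.1 at the end.
assert os.path.basename(alfworld_data_root) == "json_2.1.1", \
    "alfworld_data_root must end with json_2.1.1"
```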
**Step 4: Process the data**

```bash
cd examples/R3L/alfworld
python get_alfworld_data.py
```

Processed data will be saved to examples/R3L/alfworld/alfworld_data/.
**Step 5: Start training**

```bash
trinity run --config examples/R3L/alfworld/opmd_R3L_1.5B.yaml
```
### WebShop

Note: WebShop requires ~1TB memory. Skip this environment if resources are limited.
**Step 1: Clone the WebShop repository**

```bash
git clone https://github.com/princeton-nlp/webshop.git webshop
cd webshop
```
**Step 2: Install Java 17+**

```bash
# Using conda
conda install -c conda-forge openjdk=17
```
**Step 3: Run the setup script**

```bash
# Small dataset (recommended for testing)
./setup.sh -d small

# Full dataset
./setup.sh -d all
```

Follow WebShop's installation instructions. Some of its Python dependencies may conflict with R3L; install those individually if needed.
**Step 4: Process the data**

```bash
cd examples/R3L/webshop
python get_webshop_data.py
```
**Step 5: Configure the WebShop path**

Option A: set an environment variable:

```bash
export WEBSHOP_PATH=/path/to/webshop
```

Option B: modify the workflow files directly. Edit the path in all WebShop workflow files (trinity/common/workflows/envs/R3L/webshop/*.py):

```python
# Find this line and update the path
sys.path.append("/your/path/to/webshop")
```
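The two options can also be combined by preferring the environment variable and falling back to a hardcoded path — a sketch of what that resolution could look like inside a workflow file (the actual workflow files may resolve the path differently):

```python
import os
import sys

# Prefer the WEBSHOP_PATH environment variable (Option A); fall back to a
# hardcoded location (Option B) when it is unset.
webshop_path = os.environ.get("WEBSHOP_PATH", "/your/path/to/webshop")
if webshop_path not in sys.path:
    sys.path.append(webshop_path)
```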
**Step 6: Start training**

```bash
trinity run --config examples/R3L/webshop/opmd_R3L_1.5B.yaml
```
### ScienceWorld

**Step 1: Clone and install ScienceWorld**

```bash
git clone https://github.com/allenai/ScienceWorld.git
cd ScienceWorld
pip install .
```
**Step 2: Configure the jar path**

Edit examples/R3L/scienceworld/get_sciworld_data.py:

```python
# Lines 101-103: set the jar path to your ScienceWorld checkout
jar_path = "/your/path/ScienceWorld/scienceworld/scienceworld.jar"
```
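Before processing data, it is worth verifying that the jar actually exists at the configured location (an illustrative check, not part of get_sciworld_data.py; the path below is a placeholder):

```python
from pathlib import Path

jar_path = Path("/your/path/ScienceWorld/scienceworld/scienceworld.jar")

# Warn early instead of failing mid-way through data processing.
if not jar_path.is_file():
    print(f"scienceworld.jar not found at {jar_path}; "
          "check the path in get_sciworld_data.py")
```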
**Step 3: Process the data**

```bash
cd examples/R3L/scienceworld
python get_sciworld_data.py
```

**Step 4: Start training**

```bash
trinity run --config examples/R3L/scienceworld/opmd_R3L_1.5B.yaml
```
### DAPO

DAPO uses HuggingFace datasets that are downloaded automatically, so no manual setup is required.

```bash
trinity run --config examples/R3L/dapo/opmd_R3L_1.5B.yaml
```
## License

Apache License 2.0