README.md

July 15, 2026

Agentic RL on any harness, with any backend, on any benchmark.

rLLM is an open-source framework for training language agents with reinforcement learning. Bring any harness, run it in any sandbox, and switch training backends with one flag — the same agent code drives both eval and training.

Core features

Any harness. 10+ CLI harnesses (Claude Code, Codex, Terminus-2, mini-swe-agent, opencode, ...) plus Harbor-compatible task dirs. Or wrap your own agent — LangGraph, OpenAI Agents SDK, openai.OpenAI — with @rllm.rollout.
Any sandbox. Docker, Daytona, Modal, or local, with snapshot and warm-pool acceleration to keep rollouts cheap at training scale.
Multiple training backends, one API. verl (distributed multi-GPU), tinker (single-machine), fireworks (Fireworks platform). Switch with one flag.
60+ integrated benchmarks. Math, code, MCQ, QA, search, VLM, translation, agentic — Terminal-Bench 2.0, SWE-bench, SkillsBench, AIME, MATH-500, GPQA, and more. rllm eval <name> auto-pulls and runs.
Multiple training methods. GRPO, REINFORCE, RLOO, SFT, on-policy distillation, and more.
Battle-tested. State-of-the-art open-source results (DeepScaleR-1.5B, DeepCoder-14B, DeepSWE-32B, FinQA-4B). Adopted by academic labs and industry research teams (see Community Projects below).

Installation

rLLM requires Python >= 3.11. You can either install it directly with pip or build it from source.

uv pip install "rllm @ git+https://github.com/rllm-org/rllm.git"

This installs the dependencies for running the rllm CLI with the tinker backend (single-machine, Tinker API). For other backends:

# Distributed multi-GPU training (verl + vLLM/SGLang)
uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git"

# Fireworks training platform
uv pip install "rllm[fireworks] @ git+https://github.com/rllm-org/rllm.git"

For building from source or Docker, see the installation guide.

Quickstart

Option A: CLI (no code needed)

# 1. Configure your model provider
rllm model setup

# 2. Evaluate on a benchmark
rllm eval gsm8k

# 3. Train with RL
rllm train gsm8k

Option B: Python API

Define a rollout (your agent) and an evaluator (your reward function), then hand them to the trainer:

# my_flow.py
from openai import OpenAI
import rllm
from rllm.types import AgentConfig, Episode, Task, Trajectory

@rllm.rollout
def solve(task: Task, config: AgentConfig) -> Episode:
    client = OpenAI(base_url=config.base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model=config.model,
        messages=[{"role": "user", "content": task.instruction}],
    )
    answer = response.choices[0].message.content or ""
    return Episode(
        trajectories=[Trajectory(name="solver", steps=[])],
        artifacts={"answer": answer},
    )

# my_evaluator.py
import rllm
from rllm.eval.types import EvalOutput, Signal
from rllm.types import Episode

@rllm.evaluator
def score(task: dict, episode: Episode) -> EvalOutput:
    answer = str(episode.artifacts.get("answer", ""))
    is_correct = answer.strip() == task["ground_truth"].strip()
    reward = 1.0 if is_correct else 0.0
    return EvalOutput(reward=reward, is_correct=is_correct,
                      signals=[Signal(name="accuracy", value=reward)])

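# train.py
from rllm.trainer import AgentTrainer

from my_flow import solve        # the @rllm.rollout function defined above
from my_evaluator import score   # the @rllm.evaluator function defined above

# NOTE: `config` and `dataset` are not defined in this snippet. Supply your own
# training config and training dataset (the cookbooks show complete setups).
trainer = AgentTrainer(
    backend="tinker",
    agent_flow=solve,
    evaluator=score,
    config=config,
    train_dataset=dataset,
)
trainer.train()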

During training, config.base_url points to a gateway that transparently captures token IDs and logprobs — your agent code stays the same for eval and training.

See the cookbooks for complete working examples (single-turn VLM solver, multi-agent solver-judge, and more).
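The evaluator in Option B scores by exact string match after trimming whitespace, which is strict: formatting differences count as wrong answers. The standalone sketch below reproduces only that comparison so the behavior is easy to check by hand. The function name `exact_match_reward` is hypothetical and is not part of rLLM.

```python
def exact_match_reward(answer: str, ground_truth: str) -> float:
    """Return 1.0 when both strings are equal after trimming whitespace, else 0.0.

    Hypothetical helper that mirrors the comparison in the evaluator above;
    it is not an rLLM function.
    """
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


if __name__ == "__main__":
    # Surrounding whitespace is ignored.
    print(exact_match_reward(" 42\n", "42"))
    # Numerically equal, but not string-equal, so it scores as incorrect.
    print(exact_match_reward("42.0", "42"))
    # Extra text around the answer also fails the match.
    print(exact_match_reward("The answer is 42", "42"))
```

If your agent returns free-form text, you would likely want to extract or normalize the final answer before comparing, either in the rollout or in the evaluator.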

Architecture

rLLM follows a simple pipeline: run your agent → collect traces → compute rewards → update the model.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Your Agent  │───▶│    Traces     │───▶│   Rewards    │───▶│  RL Update   │
│  (any code)  │    │  (auto-logged)│    │ (your logic) │    │  (GRPO etc.) │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Your agent runs as-is. rLLM's model gateway captures each LLM call (token IDs + logprobs) through URL-routed sessions and structures the calls into Episodes (one task), each containing Trajectories (one agent run) made of Steps (one LLM call). A reward function scores the result, and the RL algorithm updates the model weights. The same agent code works for both eval and training.
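To make the Episode, Trajectory, Step nesting concrete, here is a minimal sketch using plain dataclasses. These are not rLLM's actual types: the `Sketch*` class names, the `task_id` field, and the per-step fields (`prompt_ids`, `response_ids`, `logprobs`) are assumptions chosen for illustration. Only the nesting itself, plus the `name`, `steps`, `trajectories`, and `artifacts` fields, comes from the text and the quickstart above.

```python
from dataclasses import dataclass, field


@dataclass
class SketchStep:
    """One LLM call (field names are illustrative assumptions)."""
    prompt_ids: list[int]
    response_ids: list[int]
    logprobs: list[float]


@dataclass
class SketchTrajectory:
    """One agent run: an ordered list of LLM calls."""
    name: str
    steps: list[SketchStep] = field(default_factory=list)


@dataclass
class SketchEpisode:
    """One task: every agent run made while attempting it."""
    task_id: str
    trajectories: list[SketchTrajectory] = field(default_factory=list)
    artifacts: dict[str, object] = field(default_factory=dict)


def count_sampled_tokens(episode: SketchEpisode) -> int:
    """Total number of sampled (response) tokens across all steps."""
    return sum(
        len(step.response_ids)
        for trajectory in episode.trajectories
        for step in trajectory.steps
    )


# A solver-judge episode: two agent runs for the same task.
episode = SketchEpisode(
    task_id="task-0",
    trajectories=[
        SketchTrajectory("solver", [SketchStep([1, 2], [3, 4, 5], [-0.1, -0.2, -0.3])]),
        SketchTrajectory("judge", [SketchStep([1, 2, 3], [6], [-0.05])]),
    ],
    artifacts={"answer": "42"},
)
print(count_sampled_tokens(episode))  # 3 solver tokens + 1 judge token
```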

Under the hood:

Workflow Engine runs N parallel agent instances to collect rollouts
Model Gateway routes requests and captures token IDs + logprobs
Transform Pipeline groups trajectories for advantage computation
Training Backend (verl, tinker, or fireworks) handles the policy update
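As an illustration of why trajectories are grouped before advantage computation, the sketch below shows a GRPO-style group-relative advantage: rewards from rollouts of the same task are normalized by that group's mean and standard deviation. This is a simplified standalone version of the general technique, not rLLM's transform pipeline. The function name, the `(task_id, reward)` input shape, and the `eps` value are assumptions, and the actual backends may normalize differently.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(
    rollouts: list[tuple[str, float]], eps: float = 1e-6
) -> list[float]:
    """Normalize each reward against the other rollouts of the same task.

    rollouts: (task_id, reward) pairs, one per rollout.
    Returns one advantage per rollout, in the input order.
    """
    # Group rewards by task so each task gets its own baseline.
    rewards_by_task: dict[str, list[float]] = defaultdict(list)
    for task_id, reward in rollouts:
        rewards_by_task[task_id].append(reward)

    # Per-group mean and population standard deviation.
    stats = {
        task_id: (mean(rewards), pstdev(rewards))
        for task_id, rewards in rewards_by_task.items()
    }

    # eps keeps the division defined when every rollout in a group ties.
    return [
        (reward - stats[task_id][0]) / (stats[task_id][1] + eps)
        for task_id, reward in rollouts
    ]


rollouts = [
    ("q1", 1.0), ("q1", 0.0), ("q1", 0.0), ("q1", 1.0),  # mixed outcomes
    ("q2", 1.0), ("q2", 1.0),                            # all rollouts agree
]
advantages = group_relative_advantages(rollouts)
for (task_id, reward), advantage in zip(rollouts, advantages):
    print(task_id, reward, round(advantage, 3))
```

In this sketch, a task where every rollout earns the same reward yields zero advantage for all of its rollouts, so it contributes no learning signal. That is one reason to sample several rollouts per task.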

Community Projects

Tongyi DeepResearch — Open-source AI researchers by Alibaba NLP
Terminal-Bench-RL — Training long-horizon terminal agents with RL
PettingLLMs — Multi-agent RL with on-policy training
SETA — Scaling environments for terminal agents
LLM-in-Sandbox — Building general agents by running LLMs in a sandbox
Vision-DeepResearch — The first long-horizon multimodal deep-research MLLM
OpenSearch-VL - An Open Recipe for Frontier Multimodal Search Agents
Cogito, Ergo Ludo — An agent that learns to play by reasoning and planning
Cut the Bill, Keep the Turns — Affordable multi-turn search RL
Experiential Reinforcement Learning — Experience-reflection-consolidation loop for RL with sparse rewards
V1: Unifying Generation and Self-Verification — Pairwise self-verification for parallel test-time scaling
TherapyGym - Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
SandMLE - Synthetic Sandbox for Training MLE Agents
AxPO - Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Remember When It Matters — Proactive Memory Agent for Long-Horizon Agents

Articles & Blog Posts

rLLM UI: Real-Time Observability Tool for Agent Training & Evaluation — Mar 2026
rLLM On-Policy Distillation: Training Smaller Students from Stronger Teachers — Mar 2026
Faster and Better: Open-Source Recipe for Deep Research Agents with Fully Async Training — Feb 2026
rLLM-FinQA: How a 4B Model Outperforms 235B and Rivals Gemini 2.5 Pro on Financial Analysis — Feb 2026
rLLM SDK: Training Any Agentic Program without Code Changes — Dec 2025
rLLM v0.2: RL Training for General Agentic Programs — Oct 2025
DeepSWE: Open-source SWE Agent via RL — Jul 2025
DeepCoder: 14B Coder at O3-mini Level — Apr 2025
DeepScaleR: 1.5B Surpasses O1-Preview — Feb 2025

Acknowledgements

This work is part of the Berkeley Sky Computing Lab. The rLLM team is generously supported by grants from Laude Institute, AWS, Hyperbolic, Fireworks AI, and Modal. Special thanks to Together AI for the research partnership and compute support.

Citation

@misc{rllm2025,
  title={rLLM: A Framework for Post-Training Language Agents},
  author={Sijun Tan and Michael Luo and Colin Cai and Tarun Venkat and Kyle Montgomery and Aaron Hao and Tianhao Wu and Arnav Balyan and Manan Roongta and Chenguang Wang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31}},
  note={Notion Blog},
}

You may also cite our prior work DeepScaleR, DeepCoder, and DeepSWE.