rLLM

June 12, 2026

Agentic RL on any harness, with any backend, on any benchmark.

rLLM is an open-source framework for training language agents with reinforcement learning. Bring any harness, run it in any sandbox, and switch training backends with one flag — the same agent code drives both eval and training.

Core features

  • Any harness. 10+ CLI harnesses (Claude Code, Codex, Terminus-2, mini-swe-agent, opencode, ...) plus Harbor-compatible task dirs. Or wrap your own agent — LangGraph, OpenAI Agents SDK, openai.OpenAI — with @rllm.rollout.
  • Any sandbox. Docker, Daytona, Modal, or local, with snapshot + warm-pool acceleration to keep rollouts cheap at training scale.
  • Multiple training backends, one API. verl (distributed multi-GPU), tinker (single-machine), fireworks (Fireworks platform). Switch with one flag.
  • 60+ integrated benchmarks. Math, code, MCQ, QA, search, VLM, translation, agentic — Terminal-Bench 2.0, SWE-bench, SkillsBench, AIME, MATH-500, GPQA, and more. rllm eval <name> auto-pulls and runs.
  • Multiple training methods. GRPO, REINFORCE, RLOO, SFT, on-policy distillation, and more.
  • Battle-tested. State-of-the-art open-source results (DeepScaleR-1.5B, DeepCoder-14B, DeepSWE-32B, FinQA-4B). Adopted by academic labs and industry research teams (see Community Projects below).

Read more on our documentation site.

Installation

rLLM requires Python >= 3.11. You can install it directly with pip or build it from source.

uv pip install "rllm @ git+https://github.com/rllm-org/rllm.git"

This installs the dependencies for running the rllm CLI with the tinker backend (single-machine, Tinker API). For other backends:

# Distributed multi-GPU training (verl + vLLM/SGLang)
uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git"

# Fireworks training platform
uv pip install "rllm[fireworks] @ git+https://github.com/rllm-org/rllm.git"

For building from source or Docker, see the installation guide.
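Because of the Python >= 3.11 requirement stated above, it can help to confirm the interpreter version before installing. The snippet below is a minimal sketch using only the standard library; the helper name meets_minimum is hypothetical and not part of rLLM.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the installation notes


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old for rLLM"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```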

Quickstart

Option A: CLI (no code needed)

# 1. Configure your model provider
rllm model setup

# 2. Evaluate on a benchmark
rllm eval gsm8k

# 3. Train with RL
rllm train gsm8k

Option B: Python API

Define a rollout (your agent) and an evaluator (your reward function), then hand them to the trainer:

# my_flow.py
from openai import OpenAI
import rllm
from rllm.types import AgentConfig, Episode, Task, Trajectory

@rllm.rollout
def solve(task: Task, config: AgentConfig) -> Episode:
    client = OpenAI(base_url=config.base_url, api_key="EMPTY")
    response = client.chat.completions.create(
        model=config.model,
        messages=[{"role": "user", "content": task.instruction}],
    )
    answer = response.choices[0].message.content or ""
    return Episode(
        trajectories=[Trajectory(name="solver", steps=[])],
        artifacts={"answer": answer},
    )

# my_evaluator.py
import rllm
from rllm.eval.types import EvalOutput, Signal
from rllm.types import Episode

@rllm.evaluator
def score(task: dict, episode: Episode) -> EvalOutput:
    answer = str(episode.artifacts.get("answer", ""))
    is_correct = answer.strip() == task["ground_truth"].strip()
    reward = 1.0 if is_correct else 0.0
    return EvalOutput(reward=reward, is_correct=is_correct,
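                      signals=[Signal(name="accuracy", value=reward)])

# train.py
from rllm.trainer import AgentTrainer

from my_evaluator import score
from my_flow import solve

# `config` (training configuration) and `dataset` (training data) are
# assumed to be defined beforehand; see the cookbooks for complete setups.
trainer = AgentTrainer(
    backend="tinker",
    agent_flow=solve,
    evaluator=score,
    config=config,
    train_dataset=dataset,
)
trainer.train()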

During training, config.base_url points to a gateway that transparently captures token IDs and logprobs — your agent code stays the same for eval and training.
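The scoring rule in my_evaluator.py is a strict exact match after stripping surrounding whitespace. The sketch below isolates that rule in plain Python, without rllm, so its behavior is easy to inspect; the function name exact_match_reward and the sample answers are illustrative assumptions.

```python
def exact_match_reward(answer: str, ground_truth: str) -> float:
    """Return 1.0 when the whitespace-stripped strings are identical, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


# (model answer, ground truth) pairs chosen to show where exact match is strict.
examples = [
    ("42", "42"),
    ("  42\n", "42"),            # surrounding whitespace is ignored
    ("42.0", "42"),              # numerically equal, but not the same string
    ("The answer is 42", "42"),  # extra text around the answer
]

rewards = [exact_match_reward(answer, truth) for answer, truth in examples]
print(rewards)  # [1.0, 1.0, 0.0, 0.0]
```

The last two cases receive no reward even though a human would accept them, so an evaluator for real tasks may want to extract or normalize the answer before comparing.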

See the cookbooks for complete working examples (single-turn VLM solver, multi-agent solver-judge, and more).

Architecture

rLLM follows a simple pipeline: run your agent → collect traces → compute rewards → update the model.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Your Agent  │───▶│    Traces     │───▶│   Rewards    │───▶│  RL Update   │
│  (any code)  │    │  (auto-logged)│    │ (your logic) │    │  (GRPO etc.) │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Your agent runs as-is: rLLM's model gateway captures each LLM call (token IDs + logprobs) via URL-routed sessions and structures the calls into Episodes (one task) containing Trajectories (one agent run) made of Steps (one LLM call). A reward function scores the result, and the RL algorithm updates the model weights. The same agent code works for both eval and training.
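The Episode, Trajectory, and Step nesting described above can be pictured with a few dataclasses. This is only an illustrative sketch: the class names carry a Sketch suffix to make clear they are not the real rllm.types definitions, and the field names other than name, steps, trajectories, and artifacts (which appear in the quickstart) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepSketch:
    """One LLM call, with the data the gateway is described as capturing."""
    prompt_token_ids: list[int]
    completion_token_ids: list[int]
    logprobs: list[float]  # one log-probability per completion token


@dataclass
class TrajectorySketch:
    """One agent run: an ordered list of LLM calls."""
    name: str
    steps: list[StepSketch] = field(default_factory=list)


@dataclass
class EpisodeSketch:
    """One task: every agent run made while solving it, plus outputs."""
    task_id: str  # hypothetical identifier
    trajectories: list[TrajectorySketch] = field(default_factory=list)
    artifacts: dict[str, Any] = field(default_factory=dict)

    def num_llm_calls(self) -> int:
        return sum(len(t.steps) for t in self.trajectories)


episode = EpisodeSketch(
    task_id="example-0",
    trajectories=[
        TrajectorySketch(name="solver", steps=[
            StepSketch([1, 2, 3], [4, 5], [-0.1, -0.7]),
            StepSketch([1, 2, 3, 4, 5], [6], [-0.3]),
        ]),
        TrajectorySketch(name="judge", steps=[
            StepSketch([7, 8], [9], [-0.2]),
        ]),
    ],
    artifacts={"answer": "42"},
)
print(episode.num_llm_calls())  # 3
```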

Under the hood:

  • Workflow Engine runs N parallel agent instances to collect rollouts
  • Model Gateway routes requests and captures token IDs + logprobs
  • Transform Pipeline groups trajectories for advantage computation
  • Training Backend (verl, tinker, or fireworks) handles the policy update
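To make "groups trajectories for advantage computation" concrete, here is a minimal sketch of a GRPO-style group-relative advantage: rollouts for the same task form a group, and each reward is normalized by that group's mean and standard deviation. This is the commonly described form of the technique, not a copy of rLLM's Transform Pipeline; the function name, the (task_id, reward) input shape, the use of population standard deviation, and the eps value are all assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def group_relative_advantages(rollouts, eps=1e-6):
    """Compute per-rollout advantages normalized within each task group.

    rollouts: list of (task_id, reward) pairs, one per trajectory.
    Returns a list of advantages in the same order as the input.
    """
    # Group rewards by the task they were sampled for.
    groups = defaultdict(list)
    for task_id, reward in rollouts:
        groups[task_id].append(reward)

    # Per-group mean and population standard deviation.
    stats = {tid: (fmean(rs), pstdev(rs)) for tid, rs in groups.items()}

    # Advantage = (reward - group mean) / (group std + eps).
    return [
        (reward - stats[tid][0]) / (stats[tid][1] + eps)
        for tid, reward in rollouts
    ]


rollouts = [
    ("task-a", 1.0), ("task-a", 0.0), ("task-a", 0.0), ("task-a", 1.0),
    ("task-b", 1.0), ("task-b", 1.0),  # all rollouts succeeded
]
advantages = group_relative_advantages(rollouts)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0, 0.0, 0.0]
```

A group whose rollouts all receive the same reward (task-b here) yields zero advantage for every member, so it contributes no learning signal under this scheme.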

Community Projects

Articles & Blog Posts

Acknowledgements

Our work is done as part of the Berkeley Sky Computing Lab. The rLLM team is generously supported by grants from Laude Institute, AWS, Hyperbolic, Fireworks AI, and Modal. Special thanks go to Together AI for the research partnership and compute support.

Citation

@misc{rllm2025,
  title={rLLM: A Framework for Post-Training Language Agents},
  author={Sijun Tan and Michael Luo and Colin Cai and Tarun Venkat and Kyle Montgomery and Aaron Hao and Tianhao Wu and Arnav Balyan and Manan Roongta and Chenguang Wang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31}},
  note={Notion Blog},
}

You may also cite our prior work DeepScaleR, DeepCoder, and DeepSWE.