Dressage: Scalable RL for Any Agent and Sandbox

June 23, 2026 ยท View on GitHub

Dressage is an agentic reinforcement learning training framework built on top of slime. It bridges the gap between policy rollouts, sandboxed tool execution, and training data conversion through a shared proxy and paddock layer.

Dressage lets you train diverse types of LLM agents that use real tools โ€” like code editors, shell commands, file I/O, retrieval APIs โ€” with full RL gradient flow. Both whitebox (Python tool loops) and blackbox (HTTP agents like opencode / openclaw) paradigms are supported through a unified interface.

Table of Contents

๐Ÿ“ข News

  • [2026/06] Dressage is now open source!

๐Ÿ’ก Technical Highlights

Dressage introduces several key innovations for agentic RL:

Any Agent, Any Sandbox

Dressage separates agent semantics from execution placement. Whitebox Python tool loops, blackbox HTTP agents such as opencode / openclaw, local bubblewrap pools, and remote E2B sandboxes all converge on the same proxy-to-training path.

Token-Wise Control

Dressage records training evidence at token granularity: token_id, logprob, loss_mask, token_version, and token_expert. TITO โ€” Token-In-Token-Out avoids retokenization drift by encoding only each turn's append delta and splicing token IDs incrementally.

The same token-wise model powers Token-Boundary Pause/Resume through GenerationController, version-aware masking through token_version, and MoE Routing Replay (R3) through token_expert. See Training Layer for the deep dive.

Segment-Aware Training

When trajectories split because of history compaction or tool-schema changes, Dressage expands every segment into a training sample and broadcasts the anchor segment's terminal advantage to its siblings. Shared rollout_id / parent_traj_id values keep segments in the same slime training step, while prompt-equal denominators prevent split-heavy trajectories from receiving extra gradient weight, please refer to Training Layer.

Slime-Native Integration

Dressage is built on slime, not maintained as a fork. Rollout generation, reward processing, sample conversion, and training behavior plug into upstream slime through dotted import-path hooks.

Production-Grade Rollout Safety

Atomic trajectory logging, HTTP error redaction, context overflow detection, session artifact archiving, and abort safety contracts keep failed rollouts diagnosable without corrupting training data.

๐Ÿ—๏ธ Architecture Overview

Dressage's architecture is built on an elegant separation of concerns across three orthogonal axes. The entire system sits on top of slime as a git submodule, extending it through dotted import-path customization hooks โ€” no upstream fork required.

Dressage architecture overview

Compact mental model:

  • Dressage owns the agentic RL bridge: rollout hooks create sessions, the proxy records token-level evidence, and the training layer converts trajectories into slime-ready samples.
  • Paddocks own interaction semantics: whitebox paddocks run Python tool loops; blackbox paddocks delegate to HTTP agents through BlackboxServer.
  • Sandboxes own placement and isolation: the same rollout path can target local bubblewrap slots or remote E2B sandboxes without changing the agent logic.
  • Slime owns the training substrate: Megatron training, Ray rollout orchestration, and SGLang inference stay upstream and are customized through import-path hooks.

The key insight: paddock mode (whitebox vs blackbox) and sandbox provider (local vs remote) are orthogonal axes โ€” mix and match freely. The sandbox layer now exposes provider-neutral SandboxSpec, SandboxLease, and service endpoints; paddocks create and terminate leases while blackbox rollouts resolve the blackbox service endpoint from the lease.

Both whitebox and blackbox agent scaffolds converge on the same post-rollout path:

proxy.finalize_session โ†’ trajectory/read โ†’ expand_segments_to_samples โ†’ list[Sample]

End-to-End Rollout Flow

1๏ธโƒฃ  Generate hook creates a paddock lease + proxy session
2๏ธโƒฃ  Agent runs (whitebox Python loop or blackbox HTTP agent)
3๏ธโƒฃ  Each LLM call goes through Proxy โ†’ SGLang; tokens recorded per step
4๏ธโƒฃ  Session finalizes into one or more trajectory segments
5๏ธโƒฃ  Segments expand to training Samples; reward + loss hooks prepare train_data

๐Ÿš€ Quick Start

Start Docker Environment

docker pull huang3eng/dressage:v0.1.0

docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it huang3eng/dressage:v0.1.0 /bin/bash

Prepare Model Checkpoint

hf download Qwen/Qwen3.5-4B \
  --local-dir /root/Qwen3.5-4B

cd /root/Dressage/slime
source scripts/models/qwen3.5-4B.sh

PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
  "${MODEL_ARGS[@]}" \
  --hf-checkpoint /root/Qwen3.5-4B \
  --save /root/Qwen3.5-4B_torch_dist

Run a Training Example

cd /root/Dressage
bash examples/scripts/run_blackbox_qwen3.5_4b_async_local.sh

For detailed setup instructions, all configuration options, and troubleshooting, see the Quick Start Guide. To prepare and run the bundled ALFWorld and HotpotQA whitebox recipes, see the Whitebox Agent Quick Start.

๐Ÿ“ Project Structure

dressage/
โ”œโ”€โ”€ proxy/                 # Inference proxy, trajectory recording, TITO
โ”‚   โ””โ”€โ”€ tito/              # Token-In-Token-Out incremental tokenizer
โ”œโ”€โ”€ paddock/               # Environment interaction (blackbox | whitebox)
โ”œโ”€โ”€ sandbox/               # Isolation backends (bwrap, E2B)
โ”‚   โ”œโ”€โ”€ local/bwrap/       # Ray-managed bubblewrap pool
โ”‚   โ””โ”€โ”€ remote/e2b/        # E2B provider
โ”œโ”€โ”€ rollout/               # Generate hooks, multi-segment, async rollout
โ”‚   โ””โ”€โ”€ generate/          # Blackbox dispatch + whitebox agent scaffolds
โ”œโ”€โ”€ training/              # Reward post-process, async train with pause
โ”œโ”€โ”€ reward/                # Sample-oriented reward registry
โ”œโ”€โ”€ recipes/               # Example agents (alfworld, hotpotqa)
โ””โ”€โ”€ config/                # Shared defaults and env vars

blackbox_server/
โ”œโ”€โ”€ api/                   # FastAPI route handlers
โ”œโ”€โ”€ adapters/              # Backend adapters (opencode, openclaw, claude_code)
โ”œโ”€โ”€ core/                  # Server logic, models, monitoring
โ”œโ”€โ”€ proxy/                 # In-process LLM proxy with session headers
โ”œโ”€โ”€ store/                 # In-memory session store
โ””โ”€โ”€ runtime/               # Path and runtime ID resolution

slime/                     # Git submodule โ€” upstream RL framework (v0.3.0)
examples/
โ”œโ”€โ”€ scripts/               # Training run scripts (sync, async, partial)
โ””โ”€โ”€ data/                  # Sample/demo prompt datasets (DAPO Math)
tests/                     # 30+ test modules
docs/                      # Protocol and implementation docs
docker/                    # Reproducible local environment

Core modules map onto the main runtime layers:

Inference Proxy: Token-Level Trajectory Recording

OpenAI-compatible HTTP service that sits between agent rollouts and the SGLang inference router. Every LLM call passes through it โ€” every token, logprob, and loss mask is captured for training. Agents never call SGLang directly. Supports per-step recording, auto-segmentation on history rewrites, preemptible generation via GenerationController, weight version tracking, and MoE routing replay. The central nervous system of trajectory recording.

Paddock: Unified Environment Interaction

Single-class abstraction that manages all environment interaction during rollouts โ€” sandbox lease creation, agent/tool calls, pause/resume lifecycle, and cleanup. The "what-to-do" layer. BlackboxPaddock drives HTTP agents (register_agent โ†’ call_agent โ†’ pause/resume), while WhiteboxPaddock exposes tool_call for Python agents (shell, file read/write). Factory pattern: create_paddock_from_env() wires everything from environment variables.

Sandbox Backends: Pluggable Isolation

Two pluggable isolation backends, swapped via a single environment variable. Local bubblewrap (local_bwrap): Ray-managed pool of bubblewrap-isolated slots with supervisor health monitoring. E2B (e2b): elastic cloud-native sandboxes from E2B templates/images. Two local pool modes: blackbox (full BlackboxServer service endpoint) or command_only (shell/file ops). Local blackbox clusters can also be managed with the dressage-local-blackbox-* / dressage-blackbox-* scripts.

BlackboxServer: Unified HTTP Adapter

Bundled HTTP adapter service that decouples the rollout manager from concrete agentic backends. Sits inside sandboxes, manages exactly one backend agent process and one active session at a time, and transparently proxies all LLM calls back through the Dressage inference proxy. Supports opencode, openclaw, and future backends through a pluggable adapter pattern. Features turn idempotency, register-and-rebind, and background health monitoring.

Rollout Hooks: Slime Integration

Bridges slime's RL training loop with Dressage's agentic capabilities through customizable hooks โ€” all specified as dotted import paths, no fork required. Two generate paradigms: whitebox_agent (Python tool loops) and blackbox_dispatch (BlackboxServer delegation). Three async scheduling modes: sync, fully async (background worker pipelines), and partial async (early return on global_batch_size). Includes pluggable reward registry, prompt-equal sample conversion, and trajectory-level logging.

Training Layer: Multi-Segment & TITO

Transforms proxy-recorded trajectories into slime-compatible training data. Multi-segment expansion trains on all segments (not just the last). TITO tokenization eliminates retokenization drift across turns. Prompt-equal gradient aggregation ensures fair scaling for GRPO. Partial rollout resume preserves in-flight generation across weight updates. Reward post-processing broadcasts anchor advantages to sibling segments.

Recipes: Example Agents

Complete whitebox agent implementations demonstrating the framework. ALFWorld: TextWorld navigation agent with env_step tool for household task completion. HotpotQA: multi-hop retrieval agent with local FAISS+BGE index for complex question answering. Both include training scripts, reward functions, and a build-your-own guide for custom agents.

For the end-to-end data preparation and launch commands for these recipes, see the Whitebox Agent Quick Start.

๐Ÿ“š Documentation

  • Proxy โ€” Inference proxy architecture, session model, trajectory build modes, routing replay
  • Paddock โ€” Paddock interface, blackbox vs whitebox interaction, factory pattern
  • Sandbox โ€” Sandbox providers, bwrap pools, E2B integration
  • BlackboxServer โ€” HTTP adapter protocol, backends, in-process LLM proxy, session states
  • Rollout โ€” Generate hooks, async modes, reward registry, slime wiring
  • Training โ€” Multi-segment, TITO, prompt-equal aggregation, partial rollout
  • Recipes โ€” ALFWorld and HotpotQA example agents, build-your-own guide
  • Whitebox Agent Quick Start โ€” Data preparation and launch commands for ALFWorld and HotpotQA whitebox agents
  • Quick Start โ€” Step-by-step setup, configuration reference, troubleshooting

๐Ÿณ Docker

A reproducible local environment is provided via Docker:

docker/build.sh    # Build the image
docker/run.sh      # Run with --gpus all --network host --ipc host --privileged

The image includes bubblewrap, opencode, openclaw, Dressage, and BlackboxServer. --privileged is required for bubblewrap inside containers. See docker/README.md for details.

๐Ÿ‘ฅ Team

Dressage is built by the Alibaba Accio team.

Core Contributors (alphabetical): Liangmeng Huang (huang3eng@gmail.com) Qingchuan Li (lqcustc@gmail.com) Hongwei Xue (xuehongwe@gmail.com) Shilin Yan (tattoo.ysl@gmail.com) For academic collaborations, citations, or technical inquiries, please contact us.

Contributors (alphabetical): Wenhui Chen, Hao Dong, Xueyuan Han, Jing He, Hongyu Li, Junbo Li, Zicheng Liu, Senyu Zhang, Guannan Zhang.

To cite Dressage:

@misc{dressage_github,
  author       = {Liangmeng Huang and Qingchuan Li and Hongwei Xue and Shilin Yan and {Dressage Contributors}},
  title        = {{Dressage}: Scalable {RL} for Any Agent and Any Sandbox},
  year         = {2026},
  howpublished = {\url{https://github.com/Accio-Lab/Dressage}}
}

๐Ÿค Contributing

Contributions are welcome! If you have suggestions for new features, performance tuning, or feedback on user experience, feel free to submit an Issue or PR.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Ensure tests pass (pytest)
  4. Commit your changes (git commit -m 'Add amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

Development notes:

  • When bumping the slime submodule, diff convert_samples.py against upstream slime/ray/rollout.py
  • TITO currently supports qwen3_5 only โ€” contributions for additional model templates welcome
  • The claude_code adapter is reserved (501) โ€” contributions welcome

๐Ÿ™ Acknowledgements

Dressage is built by the Alibaba Accio team. We gratefully acknowledge:

  • slime (THUDM, Tsinghua University) โ€” The upstream RL post-training framework that Dressage builds upon. slime provides the foundational Megatron training loop, Ray rollout management, and SGLang inference integration.
  • SGLang โ€” High-performance inference engine powering the generation backend.
  • Megatron-LM (NVIDIA) โ€” Distributed training infrastructure used via slime.
  • bubblewrap โ€” Unprivileged sandboxing for local agent isolation.

๐Ÿ“„ License

This project is licensed under the Apache License 2.0 โ€” see the LICENSE file for details.

The Docker setup can install opencode and openclaw using their public install scripts. Those tools are not source code distributed by this repository; review their upstream licenses before redistribution.