Inference Proxy
June 20, 2026
Token-Level Trajectory Recording for Agentic RL
📖 Overview
The Dressage Proxy is an OpenAI-compatible HTTP service that sits between agent rollouts and the SGLang inference router. Every LLM call passes through it, and it captures every token, logprob, and loss mask for training. Without the proxy, the training pipeline could not faithfully reconstruct the exact token sequences, probabilities, and decision boundaries that the agent produced during rollout.
Important
Agents never call SGLang directly. The proxy transparently forwards generation requests while building rich, training-ready trajectory data. This design ensures that token-level recording is always active, regardless of whether the agent is a Python whitebox loop or an external HTTP blackbox like opencode.
Agent (whitebox or blackbox)
│ POST /v1/chat/completions
│ headers/body ids: X-Session-Id, X-SMG-Routing-Key, X-Instance-Id, X-Turn-Id
▼
Dressage Proxy
│ forwards generation to SGLang
│ records tokens, logprobs, loss masks per step
│ tracks weight versions, MoE routing IDs
▼
SGLang Router → Policy Model
The proxy runs as a standalone FastAPI service (CLI: dressage-proxy) and is designed to handle concurrent sessions from multiple rollout workers. Each session represents one complete agent trajectory, and each call to /v1/chat/completions within a session appends a new step to the trajectory record.
✨ Key Features
- OpenAI-Compatible API: Drop-in replacement for /v1/chat/completions. Agents don't need any custom integrations; just point your base_url at the proxy. Supports streaming and non-streaming modes, tool calls, and all standard OpenAI chat completion parameters.
- Per-Step Recording: Every proxy call captures the full request messages, prompt/response token IDs, per-token logprobs, weight version stamps, and computed loss masks. These per-step records form the raw material for training data construction.
- TITO Support: When concat build mode is active, the proxy records incremental tokenization data in fields such as concat_token_ids, concat_response_logprobs, concat_response_mask, and concat_versions. These fields are later stitched together at finalize time, guaranteeing exact prefix consistency across arbitrarily long multi-turn trajectories. See TITO Deep Dive below.
- Auto Segmentation: The proxy automatically detects when an agent rewrites conversation history (compaction, summarization) or changes the available tool schema mid-trajectory. When this happens, it closes the current segment and starts a new one, preserving clean token boundaries for training. Each segment becomes an independent training sample.
- Preemptible Generation: The GenerationController can abort active SGLang generation at any token boundary in response to a weight update signal. Partial output is preserved in the step record, and generation continues after /v1/rollout/resume when the proxy was started with --dressage-partial-rollout. This enables continuous rollout without discarding in-flight computation.
- Weight Version Tracking: Every generated token is stamped with the model weight version that produced it. When a trajectory spans multiple weight updates (partial rollout), --record-token-versions stores the per-token versions and --mask-nonlast-version-tokens marks tokens from older versions for selective loss masking.
- Routing Replay (R3): For Mixture-of-Experts (MoE) models, the proxy captures routed expert IDs per generated token via --use-rollout-routing-replay. This data is stored as base64-encoded chunks and forwarded to training for faithful MoE routing replay.
- Configurable Parsers: Pluggable tool call and reasoning extraction backends (local, sglang_api, hybrid). Both parser backends default to sglang_api; local parses model output directly, and hybrid tries SGLang first with local fallback. Reasoning parsers extract <think> blocks for models like Qwen3.
- Version and Context Safety: Non-partial trajectories are rejected if the model weight version or rollout epoch changes mid-trajectory (trajectory_version_changed). Proxy-side context checks return stable context_overflow payloads and can clamp max_tokens to the remaining context window.
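The max_tokens clamp mentioned under Version and Context Safety comes down to simple arithmetic over the context window. The following is a minimal sketch of that idea, not the proxy's actual code: the function name clamp_max_tokens, the ContextOverflow exception, and the decision to raise when no room remains are all assumptions made for illustration.

```python
class ContextOverflow(Exception):
    """Hypothetical error standing in for a context_overflow payload."""


def clamp_max_tokens(prompt_tokens: int, requested_max_tokens: int, context_window: int) -> int:
    """Return a max_tokens value that fits in the remaining context window."""
    remaining = context_window - prompt_tokens
    if remaining <= 0:
        # Assumption: with no room left, the request is rejected rather than truncated.
        raise ContextOverflow(f"prompt uses {prompt_tokens} of {context_window} tokens")
    return min(requested_max_tokens, remaining)


# A 30000-token prompt in a 32768-token window leaves room for 2768 new tokens.
clamped = clamp_max_tokens(prompt_tokens=30000, requested_max_tokens=4096, context_window=32768)
```

A request that already fits is passed through unchanged, so the clamp only affects calls near the end of the window.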
🧱 Core Modules
The proxy codebase is organized into focused, single-responsibility modules:
| Module | Responsibility |
|---|---|
| server.py | FastAPI application: chat completions endpoint, session finalize, trajectory read. CLI entry point dressage-proxy. Handles request validation, header extraction, and response formatting. |
| session_manager.py | Per-session step management, turn tracking, and history-rewrite detection. Maintains the ordered list of StepRecord objects for each active session. Detects when conversation messages violate the append-only contract and triggers segment boundaries. |
| trajectory_store.py | Thread-safe in-memory segment store. Finalized segments are written here and can be read back by rollout code via /trajectory/read. Supports cleanup by session ID. |
| generation_controller.py | Preemptible SGLang generation for partial rollout. Wraps SGLang client calls with abort/resume capability. Manages generation state machine (idle → generating → paused → resumed). |
| sglang_client.py | Low-level SGLang router client with weight-version tracking. Sends generation requests, receives responses with token IDs and logprobs, records which weight version was active. |
| tool_call_parser.py | Model-specific tool call extraction from assistant responses. Supports multiple backend modes (local for direct parsing, sglang_api for SGLang-native, hybrid for fallback chain). Currently optimized for Qwen3.5 tool call format. |
| reasoning_parser.py | Reasoning-content parsing for models that produce structured thinking blocks (e.g., Qwen3's <think>...</think> format). Separates reasoning tokens from action tokens for selective loss masking. |
| proxy_client.py | Async HTTP client used by rollout code to interact with the proxy. Provides typed methods for chat_completions, finalize_session, and read_trajectory. |
| tool_call_ids.py | Deterministic tool call ID generation. Ensures that tool call IDs are reproducible across re-runs, which is important for trajectory consistency. |
| last_step/prompt_assistant_mask.py | Last-step trajectory build mode implementation. Constructs segments from the final assistant step, with loss masks marking assistant tokens as trainable. |
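To make the idea of deterministic tool call IDs from the table concrete, one plausible approach is to hash the call's position and content. This is a sketch under assumptions, not the scheme used in tool_call_ids.py: the function name, the inputs that feed the hash, and the call_ prefix with a 24-character digest are all hypothetical.

```python
import hashlib
import json


def deterministic_tool_call_id(session_id: str, step_index: int, call_index: int,
                               name: str, arguments: dict) -> str:
    """Derive a stable ID from the call's position and content (hypothetical scheme)."""
    # sort_keys makes the JSON encoding independent of dict insertion order.
    canonical = json.dumps(
        [session_id, step_index, call_index, name, arguments],
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"call_{digest[:24]}"


first = deterministic_tool_call_id("sess-001", 0, 0, "read_file", {"path": "a.py", "limit": 10})
again = deterministic_tool_call_id("sess-001", 0, 0, "read_file", {"limit": 10, "path": "a.py"})
other = deterministic_tool_call_id("sess-001", 0, 1, "read_file", {"path": "a.py", "limit": 10})
```

Because the ID depends only on its inputs, re-running the same rollout yields the same IDs, while a different call position yields a different one.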
📋 Session & Step Model
A session (session_id) represents one complete agent trajectory — from the first user message through the final assistant response. Each call to /v1/chat/completions within a session appends a new step to the trajectory. Steps are ordered and immutable once recorded.
What Each Step Records
Every step captures a comprehensive snapshot of one LLM interaction:
- Request messages: The full conversation history sent by the agent
- Prompt token IDs: Tokenized input with per-token logprobs (when available)
- Response token IDs: Generated output tokens with per-token logprobs
- Weight versions: Which model weight version generated each token
- Loss masks: Binary masks indicating which tokens are trainable
- TITO fields: Incremental tokenization data (concat_token_ids, concat_response_logprobs, concat_response_mask, etc.) when concat mode is active
- Segment markers: Whether this step triggered a segment boundary
- MoE routing data: Routed expert IDs per token (when R3 is enabled)
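The weight versions and loss masks recorded per step interact when --mask-nonlast-version-tokens is set. The sketch below shows one way that masking could work, assuming that "last version" means the version stamped on the final token and that older-version tokens simply get their mask set to 0. The function name is hypothetical and the real flag may behave differently.

```python
def mask_nonlast_version_tokens(loss_mask: list[int], versions: list[int]) -> list[int]:
    """Zero the loss mask for tokens produced by any version other than the last one."""
    if len(loss_mask) != len(versions):
        raise ValueError("loss_mask and versions must have the same length")
    if not versions:
        return []
    # Assumption: the version of the final token is the one to keep.
    last_version = versions[-1]
    return [m if v == last_version else 0 for m, v in zip(loss_mask, versions)]


# Two prompt tokens, then four generated tokens spanning a weight update (3 -> 4).
masked = mask_nonlast_version_tokens(
    loss_mask=[0, 0, 1, 1, 1, 1],
    versions=[3, 3, 3, 3, 4, 4],
)
```

Only the two tokens generated under version 4 stay trainable; tokens that were already masked out remain at 0.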
Runtime Identifiers
Every /v1/chat/completions request must carry identifiers for trajectory attribution. Headers are preferred, but the session_id, instance_id, and turn_id body fields are accepted as fallbacks. X-SMG-Routing-Key is also accepted as the session key when X-Session-Id is absent, and X-Turn-Id is optional.
| Header | Purpose | Example |
|---|---|---|
| X-Session-Id | Trajectory key; becomes parent_traj_id in training samples | sess-abc123 |
| X-SMG-Routing-Key | Alternate session key used by sticky routing proxies | sess-abc123 |
| X-Instance-Id | Prompt / task instance; used for prompt-equal gradient aggregation | inst-xyz789 |
| X-Turn-Id | Optional explicit turn identifier for idempotency tracking | turn-001 |
Tip
The X-Instance-Id header is critical for prompt-equal gradient scaling. All samples from the same prompt instance share a gradient denominator, ensuring fair contribution regardless of how many segments each trajectory produces.
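The identifier fallback rules above can be sketched as a small resolver. This is an illustration rather than the proxy's request handling: the function name resolve_ids is hypothetical, and the exact precedence (X-Session-Id, then X-SMG-Routing-Key, then the body field) is an assumption drawn from the statement that headers are preferred.

```python
from typing import Mapping, Optional


def resolve_ids(headers: Mapping[str, str], body: Mapping[str, object]) -> dict[str, Optional[object]]:
    """Pick session, instance, and turn identifiers from headers, then body fields."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    session_id = (
        h.get("x-session-id")
        or h.get("x-smg-routing-key")  # assumed to rank above the body field
        or body.get("session_id")
    )
    if not session_id:
        raise ValueError("no session identifier in headers or body")
    return {
        "session_id": session_id,
        "instance_id": h.get("x-instance-id") or body.get("instance_id"),
        "turn_id": h.get("x-turn-id") or body.get("turn_id"),  # optional
    }


ids = resolve_ids(
    headers={"X-SMG-Routing-Key": "sess-abc123", "X-Instance-Id": "inst-xyz789"},
    body={"turn_id": "turn-001"},
)
```

Here the routing key stands in for the missing X-Session-Id, and the turn ID comes from the body because no X-Turn-Id header was sent.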
🧬 Trajectory Build Modes
Dressage supports two modes for converting proxy-recorded steps into training-ready segments. The choice of build mode fundamentally affects how token sequences are constructed and how multi-turn context is handled.
Concat Mode (Default) — TITO-Powered
The default and recommended mode for long agentic trajectories. Segments are assembled by concatenating per-step TITO fragments across the full multi-turn context, guaranteeing exact prefix consistency.
Turn 1 → TITO fragment₁ (system + user₁)
Turn 2 → TITO fragment₂ (asst₁ + tool₁ + user₂)
Turn 3 → TITO fragment₃ (asst₂ + tool₂ + user₃)
↓
concat(fragment₁ + fragment₂ + fragment₃) → Segment
- With trajectory_build_model=qwen3_5, the proxy infers model_mask_type=qwen3_5, model_tool_call_type=qwen3_5, model_reasoning_type=qwen3, and tito_model=qwen3_5 in concat mode
- Best for long agentic trajectories (SWE tasks, coding agents, multi-step reasoning)
- Avoids retokenization drift, the #1 correctness challenge in agentic RL training
- Each fragment is independently tokenized, then IDs are concatenated (never re-tokenized as a whole)
Last-Step Mode
A simpler mode where each segment is built from the last assistant step's full message snapshot. The entire conversation is re-tokenized from scratch at finalize time.
Turn 1 → (context, not directly used)
Turn 2 → (context, not directly used)
Turn 3 → Full message list snapshot → tokenize → Segment
- Loss masks mark only assistant tokens as trainable
- Best for shorter trajectories where retokenization drift is negligible
- More general model support (no model-specific TITO template required)
- Not recommended for long multi-turn rollouts due to prefix inconsistency risk
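As a rough picture of how last-step mode turns a message snapshot into tokens plus a loss mask, consider the toy sketch below. It is not the implementation in last_step/prompt_assistant_mask.py: whitespace splitting stands in for the real tokenizer, the <role> markers stand in for chat-template control tokens, and the function name is hypothetical.

```python
def build_last_step_mask(messages: list[dict]) -> tuple[list[str], list[int]]:
    """Tokenize a full message snapshot and mark assistant tokens as trainable."""
    tokens: list[str] = []
    mask: list[int] = []
    for message in messages:
        role = message["role"]
        # The role marker stands in for template control tokens; it is never trainable.
        tokens.append(f"<{role}>")
        mask.append(0)
        for piece in message["content"].split():
            tokens.append(piece)
            mask.append(1 if role == "assistant" else 0)
    return tokens, mask


snapshot = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "hi there"},
    {"role": "assistant", "content": "hello"},
]
tokens, mask = build_last_step_mask(snapshot)
```

Because the whole snapshot is tokenized in one pass at finalize time, nothing ties these tokens to the IDs the model saw at rollout, which is the prefix inconsistency risk noted above.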
Configuration
dressage-proxy \
--tokenizer-path /path/to/Qwen3.5-4B \
--trajectory-build-mode concat \
--trajectory-build-model qwen3_5 \
--tito-model qwen3_5
🧬 TITO Deep Dive
TITO (Token-In-Token-Out) is the proxy's answer to the retokenization drift problem. In standard multi-turn LLM inference, re-encoding the full message list each turn can produce subtly different token IDs for the same prefix text — breaking the alignment between logprobs recorded at rollout time and the token sequences used during training.
The Problem
Turn 1: tokenize("system: ... user: Hello") → [101, 202, 303]
Turn 2: tokenize("system: ... user: Hello assistant: Hi user: How?") → [101, 202, 304, ...]
↑ DRIFT! 303 ≠ 304
How TITO Fixes It
Turn 1: encode("system: ... user: Hello") → fragment₁ = [101, 202, 303]
Turn 2: encode("assistant: Hi user: How?") → fragment₂ = [405, 506]
concat(fragment₁ + fragment₂) → [101, 202, 303, 405, 506] ✅ prefix intact
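The drift and the fix can be reproduced with a toy tokenizer. The sketch below uses a made-up six-entry vocabulary and greedy longest-match tokenization purely for illustration; real tokenizers drift for similar reasons (merges that span a turn boundary), but nothing here reflects the proxy's actual TITO tokenizer.

```python
# Toy vocabulary: multi-character entries mimic merged subword tokens.
VOCAB = {"a": 1, "b": 2, "c": 3, "ab": 4, "bc": 5, "abc": 6}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenizer over the toy vocabulary."""
    ids: list[int] = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


turn1 = tokenize("ab")               # IDs recorded at rollout time for turn 1
retokenized = tokenize("abc")        # full re-encode after "c" is appended
incremental = turn1 + tokenize("c")  # TITO-style: encode only the new text

drifted = retokenized[:len(turn1)] != turn1
prefix_intact = incremental[:len(turn1)] == turn1
```

Re-encoding the whole string merges across the boundary and replaces the recorded prefix, while concatenating independently encoded fragments leaves the turn 1 IDs untouched.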
The proxy stores TITO data in StepRecord fields:
- concat_token_ids: concatenated context and response token IDs for the step
- concat_response_logprobs: per-token logprobs, with context positions filled by 0.0
- concat_response_mask: loss mask, with context positions set to 0 and generated response positions set to 1
- concat_versions: token weight-version markers
- concat_context_token_count / concat_output_token_count: context and generated-token counts
- concat_logprobs_invalid / concat_incremental_tokenization_failed: safety flags for concat assembly
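The field layout described above suggests how per-step records could be stitched into a segment at finalize time. The following is a simplified sketch, assuming each step carries only its incremental context plus its response; ToyStep, make_step, and stitch are hypothetical names, and version markers and safety flags are left out.

```python
from dataclasses import dataclass


@dataclass
class ToyStep:
    """Reduced stand-in for a StepRecord, keeping three of the concat fields."""
    concat_token_ids: list[int]
    concat_response_logprobs: list[float]
    concat_response_mask: list[int]


def make_step(context_ids: list[int], response_ids: list[int],
              response_logprobs: list[float]) -> ToyStep:
    n_ctx = len(context_ids)
    return ToyStep(
        concat_token_ids=context_ids + response_ids,
        # Context positions carry placeholder logprobs and are masked out.
        concat_response_logprobs=[0.0] * n_ctx + response_logprobs,
        concat_response_mask=[0] * n_ctx + [1] * len(response_ids),
    )


def stitch(steps: list[ToyStep]) -> dict[str, list]:
    """Concatenate per-step fields in order; token IDs are never re-encoded."""
    segment: dict[str, list] = {"token_ids": [], "logprobs": [], "loss_mask": []}
    for step in steps:
        segment["token_ids"] += step.concat_token_ids
        segment["logprobs"] += step.concat_response_logprobs
        segment["loss_mask"] += step.concat_response_mask
    return segment


step1 = make_step([101, 202, 303], [405], [-0.1])
step2 = make_step([506], [607, 708], [-0.2, -0.3])
segment = stitch([step1, step2])
```

The three output lists stay aligned position by position, so each trainable token keeps the logprob recorded for it at rollout time.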
Append-Only Contract
TITO depends on an append-only contract on conversation history. If the agent rewrites history, changes the existing message prefix, changes tool schemas, or concat tokenization fails, the proxy triggers a segment boundary — closing the current segment and starting a fresh one with TITO state reset.
Note
On TITO failure (e.g., template rendering error), the proxy marks concat_incremental_tokenization_failed=True on the step and starts a new segment. This is a safe fallback — no data is lost, just split into separate segments.
✂️ Segment Boundaries
The proxy automatically splits one session into multiple segments when it detects events that would break token-level consistency. Understanding segment boundaries is important because each segment becomes an independent training sample.
| Trigger | Detection | What Happens |
|---|---|---|
| History Rewrite | Agent sends messages that don't extend the previous conversation | Current segment finalizes; new segment starts with fresh state |
| Tool Schema Change | Available tools change between turns | Segment boundary; new tool context starts clean |
| Concat Prefix Mismatch | The existing message prefix changes in concat mode | Current segment finalizes; new segment starts with fresh state |
| TITO Fallback | Incremental tokenization fails (template error, encoding mismatch) | Marks failure flag; starts new segment with reset TITO state |
Note
Each segment becomes an independent training sample, but all segments from one session share the same parent_traj_id and rollout_id, ensuring they are grouped together during training.
DRESSAGE_PROXY_MAX_STEPS_PER_SESSION is a separate guard: once a proxy session already has that many steps, the next generation request returns HTTP 400 before generation. It does not finalize the session automatically.
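Two of the triggers in the table, history rewrite and tool schema change, can be detected by comparing the incoming request against what the session has already recorded. The sketch below illustrates that comparison only: detect_boundary is a hypothetical function, the returned labels are made up, and checking tools before messages is an arbitrary choice.

```python
import json
from typing import Optional


def detect_boundary(prev_messages: list[dict], prev_tools: list[dict],
                    messages: list[dict], tools: list[dict]) -> Optional[str]:
    """Return a trigger label if the new request breaks the append-only contract."""
    # Canonical JSON makes the tool comparison insensitive to key order.
    if json.dumps(prev_tools, sort_keys=True) != json.dumps(tools, sort_keys=True):
        return "tool_schema_change"
    # Append-only: the new message list must start with the recorded one.
    if messages[:len(prev_messages)] != prev_messages:
        return "history_rewrite"
    return None


recorded = [
    {"role": "user", "content": "Fix the bug"},
    {"role": "assistant", "content": "Looking at the file."},
]
tools = [{"name": "read_file", "parameters": {"type": "object"}}]

extended = recorded + [{"role": "user", "content": "Any luck?"}]
compacted = [{"role": "user", "content": "Summary: bug being fixed"}]

ok = detect_boundary(recorded, tools, extended, tools)
rewrite = detect_boundary(recorded, tools, compacted, tools)
schema = detect_boundary(recorded, tools, extended, tools + [{"name": "write_file"}])
```

A request that only appends messages yields no boundary, while compaction or a changed tool list would close the current segment.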
🌐 HTTP Endpoints
The proxy exposes these endpoints for agent interaction and rollout management:
| Endpoint | Method | Purpose | Details |
|---|---|---|---|
| /v1/models | GET | Model listing | OpenAI-compatible model list passthrough. |
| /v1/chat/completions | POST | Agent inference | OpenAI-compatible. Records step data. Requires session headers. |
| /session/finalize | POST | Finalize session | Closes all open segments, writes to trajectory store. |
| /trajectory/read | POST | Read segments | Returns finalized segments by session ID or trajectory ID. |
| /trajectory/stats | GET | Store stats | Reports in-memory trajectory store statistics. |
| /v1/rollout/pause | POST | Pause generation | Signals GenerationController to abort at next token boundary. |
| /v1/rollout/resume | POST | Resume generation | Re-enables generation after weight update completes. |
| /v1/rollout/pause_state | GET | Pause state | Reports GenerationController pause/resume state. |
| /health | GET | Health check | Returns active session, trajectory store, rollout pause, and proxy config state. |
Preemptible Generation Flow
The GenerationController enables safe interruption of active generation for weight updates during partial rollout. This is critical for continuous training where rollout and training overlap.
1️⃣ Weight update signal arrives
2️⃣ POST /v1/rollout/pause → GenerationController.abort()
3️⃣ Active SGLang request aborts at next token boundary
4️⃣ Partial output preserved in current StepRecord
5️⃣ Weight update completes
6️⃣ POST /v1/rollout/resume → GenerationController.resume()
7️⃣ Next chat_completions call picks up where generation left off
Tip
The pause/resume mechanism is designed to be atomic: once a pause takes effect, no further tokens are generated until resume, so tokens produced after a weight update come from the new weights. The GenerationController state machine enforces clean transitions between the idle → generating → paused → resumed states.
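The flow above can be mimicked with a toy controller that checks a pause flag at every token boundary. This is a synchronous sketch for intuition only: ToyGenerationController is not the real GenerationController, its methods and the per-token version list are assumptions, and an iterator of integers stands in for SGLang output.

```python
from typing import Iterator


class ToyGenerationController:
    """Toy stand-in for a preemptible generation loop (not the real class)."""

    def __init__(self, weight_version: int) -> None:
        self.weight_version = weight_version
        self.paused = False

    def pause(self) -> None:
        # Takes effect at the next token boundary, i.e. the next step() call.
        self.paused = True

    def resume(self, new_weight_version: int) -> None:
        self.weight_version = new_weight_version
        self.paused = False

    def step(self, source: Iterator[int], tokens: list[int], versions: list[int]) -> str:
        """Generate at most one token; return 'token', 'paused', or 'done'."""
        if self.paused:
            return "paused"
        token = next(source, None)
        if token is None:
            return "done"
        tokens.append(token)
        versions.append(self.weight_version)  # stamp the token with its weight version
        return "token"

    def run(self, source: Iterator[int], tokens: list[int], versions: list[int]) -> str:
        """Generate until the source is exhausted or a pause is observed."""
        status = "token"
        while status == "token":
            status = self.step(source, tokens, versions)
        return status


controller = ToyGenerationController(weight_version=1)
source = iter([11, 12, 13, 14])
tokens: list[int] = []
versions: list[int] = []

controller.step(source, tokens, versions)
controller.step(source, tokens, versions)
controller.pause()                                         # weight update signal arrives
first_status = controller.run(source, tokens, versions)   # stops; partial output is kept
controller.resume(new_weight_version=2)                    # weight update completed
final_status = controller.run(source, tokens, versions)   # picks up where it left off
```

The two tokens generated before the pause keep version 1 and the two generated after resume carry version 2, which is the per-token stamping that partial rollout relies on.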
🚀 Usage
Starting the Proxy
# With current startup and parser controls
dressage-proxy \
--tokenizer-path /path/to/Qwen3.5-4B \
--sglang-router-url http://<sglang-router-host>:<port> \
--trajectory-build-model qwen3_5 \
--context-window 32768 \
--no-dynamic-max-tokens \
--rollout-temperature 1.0 \
--record-token-versions \
--mask-nonlast-version-tokens \
--dressage-partial-rollout \
--tool-call-parse-backend sglang_api \
--reasoning-parse-backend sglang_api \
--model-tool-call-type qwen3_5 \
--model-reasoning-type qwen3
Using the Proxy Client
import asyncio

from dressage.proxy.proxy_client import ProxyClient


async def main() -> None:
    client = ProxyClient(proxy_url="http://localhost:8800")

    # Send a chat completion; the IDs attribute this step to a trajectory
    response = await client.chat_completions(
        {"model": "proxy-model", "messages": [{"role": "user", "content": "Hello!"}]},
        session_id="sess-001",
        instance_id="inst-001",
        turn_id="turn-001",
    )

    # Finalize the session so its segments are written to the trajectory store
    await client.finalize_session("sess-001", instance_id="inst-001")

    # Read the finalized segments back
    payload = await client.read_trajectory(session_id="sess-001", drain=True)
    segments = payload["data"]


asyncio.run(main())
🔀 Routing Replay (R3)
For Mixture-of-Experts (MoE) models, the proxy can capture routed expert IDs per generated token, enabling faithful routing replay during training. Without R3, the training forward pass recomputes expert routing on its own, and that routing can diverge from what the model used at rollout time.
Proxy (--use-rollout-routing-replay)
│
├── Requests routed expert IDs from SGLang for each generated token
├── Encodes expert ID arrays as base64 chunks for efficient transfer
├── Stores in trajectory segment metadata
└── rollout.artifacts.samples.extract_routed_experts → training data
Data Formats
R3 stores routed expert IDs as base64-encoded int32 payloads. Dressage supports three record shapes:
| Field | Description |
|---|---|
| routed_experts | Direct payload for a single uninterrupted generation. |
| routed_experts_chunks | Chunked payload for partial or resumed generation. |
| routed_experts_parts | Multi-step wrapper for concat segments; each part may contain direct data or chunks. |
Enable R3 by setting --use-rollout-routing-replay on the proxy.
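The three record shapes can be decoded with a few lines of standard-library code. The sketch below rests on assumptions the text does not confirm: little-endian byte order, a flat list of IDs with no per-token or per-layer shape, chunks and parts stored as plain lists, and the helper names themselves.

```python
import base64
import struct


def encode_ids(ids: list[int]) -> str:
    """Pack IDs as little-endian int32 (assumed byte order) and base64-encode them."""
    return base64.b64encode(struct.pack(f"<{len(ids)}i", *ids)).decode("ascii")


def decode_ids(payload: str) -> list[int]:
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


def decode_record(record: dict) -> list[int]:
    """Flatten any of the three record shapes into one list of expert IDs."""
    if "routed_experts" in record:
        return decode_ids(record["routed_experts"])
    if "routed_experts_chunks" in record:
        return [i for chunk in record["routed_experts_chunks"] for i in decode_ids(chunk)]
    if "routed_experts_parts" in record:
        # Each part is itself a direct or chunked record.
        return [i for part in record["routed_experts_parts"] for i in decode_record(part)]
    return []


direct = {"routed_experts": encode_ids([3, 7, 1])}
chunked = {"routed_experts_chunks": [encode_ids([3, 7]), encode_ids([1, 5])]}
parts = {"routed_experts_parts": [direct, chunked]}

decoded_direct = decode_record(direct)
decoded_chunked = decode_record(chunked)
decoded_parts = decode_record(parts)
```

Chunked payloads decode to the same flat sequence an uninterrupted generation would have produced, which is what lets resumed generations be replayed like any other.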
🔧 Configurable Parsers
The proxy supports pluggable backends for tool call and reasoning extraction, accommodating different model architectures and SGLang configurations:
| Parser Type | Backend | Description |
|---|---|---|
| Tool Call | local | Direct model output parsing using model-specific regex/heuristics |
| Tool Call | sglang_api | Delegates to SGLang's built-in tool call extraction |
| Tool Call | hybrid | Tries sglang_api first, falls back to local on failure |
| Reasoning | local | Parses <think>...</think> blocks from model output |
| Reasoning | sglang_api | Delegates reasoning extraction to SGLang |
| Reasoning | hybrid | SGLang-first with local fallback |
Both --tool-call-parse-backend and --reasoning-parse-backend default to sglang_api.
dressage-proxy \
--tokenizer-path /path/to/Qwen3.5-4B \
--tool-call-parse-backend sglang_api \
--reasoning-parse-backend sglang_api \
--model-tool-call-type qwen3_5 \
--model-reasoning-type qwen3
Tip
The hybrid backend is recommended for production. It leverages SGLang's optimized parsing when available, with graceful fallback to local parsing when SGLang doesn't support the model's format.
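The hybrid fallback chain can be illustrated with a local <think> parser wrapped by a remote-first dispatcher. This sketch is not the proxy's parser code: parse_reasoning_local and parse_hybrid are hypothetical names, the remote parser is any callable passed in by the caller, and treating every exception as a reason to fall back is an assumption.

```python
import re
from typing import Callable

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_reasoning_local(text: str) -> tuple[str, str]:
    """Split model output into (reasoning, content) using <think>...</think>."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    content = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, content


def parse_hybrid(text: str,
                 remote_parse: Callable[[str], tuple[str, str]],
                 local_parse: Callable[[str], tuple[str, str]] = parse_reasoning_local,
                 ) -> tuple[str, str]:
    """Try the remote parser first; fall back to local parsing on failure."""
    try:
        return remote_parse(text)
    except Exception:
        # Assumption: any remote failure triggers the local fallback.
        return local_parse(text)


def failing_remote(text: str) -> tuple[str, str]:
    raise RuntimeError("remote parser unavailable")


result = parse_hybrid("<think>check the tests first</think>Running pytest now.", failing_remote)
```

With a working remote parser the local path is never exercised, so the fallback adds cost only when the first attempt fails.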
📊 Data Flow
The complete data flow from agent request to stored trajectory:
┌─────────────┐ ┌──────────────────────────┐ ┌──────────────┐
│ Agent │────▶│ Proxy │────▶│ SGLang │
│ │ │ │ │ Router │
│ whitebox │◀────│ records per-step: │◀────│ │
│ or blackbox │ │ • token IDs + logprobs │ │ policy │
└─────────────┘ │ • loss masks │ │ model │
│ • weight versions │ └──────────────┘
│ • TITO fragments │
│ • MoE routing IDs │
└──────────┬───────────────┘
│ finalize
┌──────────▼───────────────┐
│ Trajectory Store │
│ │
│ segments[] │
│ ├── tokens[] │
│ ├── logprobs[] │
│ ├── loss_mask[] │
│ ├── weight_vers[] │
│ └── experts[] │ ← MoE routing (optional)
└──────────────────────────┘
📁 Package Structure
dressage/proxy/
├── server.py # FastAPI app, CLI entry point
├── session_manager.py # Per-session step tracking
├── trajectory_store.py # In-memory segment storage
├── generation_controller.py # Preemptible generation
├── sglang_client.py # SGLang router client
├── tool_call_parser.py # Tool call extraction
├── reasoning_parser.py # Reasoning content parsing
├── proxy_client.py # Async client for rollout code
├── tool_call_ids.py # Deterministic ID generation
├── last_step/ # Last-step build mode
│ └── prompt_assistant_mask.py # Assistant loss mask builder
└── tito/ # TITO tokenizer
├── tito_tokenizer.py # Qwen35TITOTokenizer
├── template_utils.py # Fixed-template rendering
└── templates/
└── qwen3_5_fixed.jinja # Pinned chat template
🔗 Integration Points
| Component | Relationship |
|---|---|
| Paddock | Paddock coordinates proxy sessions — each rollout creates a session via proxy client |
| Sandbox | BlackboxServer's in-process LLM proxy forwards all agent calls through the Dressage proxy |
| BlackboxServer | Injects session/turn headers on every LLM call, routing through proxy |
| Rollout | Generate hooks use ProxyClient to manage sessions and read trajectories |
| Training | Training layer consumes proxy-produced segments for TITO tokenization and multi-segment expansion |