Background: Environment Tuning
April 14, 2026
T3RL implements the Environment Tuning paradigm from *Don't Just Fine-tune the Agent, Tune the Environment* (ICLR 2026). The core idea: train LLMs to perform multi-turn tool use by interacting with realistic execution environments, rather than imitating static trajectories.
Paper Techniques
The paper introduces three key techniques:
- Four-stage curriculum — Progresses from format/syntax regulation → basic calling conventions → complex multi-tool scenarios → realistic evaluation alignment.
- Actionable environment augmentation — Converts vague execution failures into corrective hints (e.g., missing parameters, tool dependency ordering), turning failures into learning signals.
- Fine-grained progress reward — Per-turn dense feedback, `count(correct_turns) / total_turns`, avoiding sparse binary rewards and enabling steady improvement.
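The progress reward above can be sketched in a few lines. This is an illustrative reconstruction from the formula, not T3RL's actual implementation; the `is_correct` predicate is a placeholder for however a turn is judged against the reference trajectory.

```python
# Hypothetical sketch of the fine-grained progress reward:
# score each turn against a reference and return the fraction of
# correct turns, instead of a sparse 0/1 trajectory-level reward.
from typing import Callable, Sequence


def progress_reward(
    turns: Sequence[str],
    reference: Sequence[str],
    is_correct: Callable[[str, str], bool] = lambda t, r: t == r,  # placeholder check
) -> float:
    """Return count(correct_turns) / total_turns for one trajectory."""
    if not turns:
        return 0.0
    correct = sum(is_correct(t, r) for t, r in zip(turns, reference))
    return correct / len(turns)


# A trajectory with 3 of 4 correct turns earns 0.75 rather than a flat failure.
print(progress_reward(["a", "b", "x", "d"], ["a", "b", "c", "d"]))  # → 0.75
```

A dense per-turn signal like this gives the policy gradient something to climb even when the full task is not yet solved, which is the point of avoiding binary rewards.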
How T3RL Differs from the Paper
T3RL currently uses a two-stage curriculum instead of four:
| Stage | Data | Notes |
|---|---|---|
| Stage 1 | bfcl_train_base.jsonl (base split) | CP=1, builds core tool-calling capability |
| Stage 2 | bfcl_train.jsonl (full data) | CP=4, longer context, harder multi-turn scenarios |
This simplification works because Qwen3-4B-Instruct already produces well-formed tool calls out of the box, so the explicit format-regulation stages (Stages 1–2 in the paper) that weaker backbones (Qwen2.5, Llama 3.1) require are unnecessary. T3RL also obtains a good training signal on this backbone without environment augmentation.
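The two-stage schedule in the table above could be written down as a small config, roughly as follows. The field and variable names here are illustrative assumptions, not T3RL's actual API; only the data files and CP values come from the table.

```python
# Illustrative two-stage curriculum descriptor (names are assumptions).
from dataclasses import dataclass


@dataclass
class CurriculumStage:
    data_path: str         # training split for this stage
    context_parallel: int  # CP degree; higher CP supports longer contexts


# Two-stage curriculum as described in the table above
STAGES = [
    CurriculumStage("bfcl_train_base.jsonl", context_parallel=1),
    CurriculumStage("bfcl_train.jsonl", context_parallel=4),
]

for i, stage in enumerate(STAGES, 1):
    print(f"Stage {i}: {stage.data_path} (CP={stage.context_parallel})")
```

Running stages in this order mirrors the table: core tool-calling capability is built on the base split first, then harder multi-turn scenarios are introduced with a longer context budget.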
The paper's remaining techniques are on the Roadmap.
Trained Models & Training Curves
We release the checkpoints from both stages, trained from Qwen3-4B-Instruct-2507 on 8× H200 GPUs:
| Stage | Model | Description |
|---|---|---|
| Stage 1 | Qwen3-4B-EnvTuning-Base | After base-split training (CP=1) |
| Stage 2 | Qwen3-4B-EnvTuning | Final model after full-data training (CP=4) |
Full training curves (reward, accuracy, KL, etc.) for both stages are available on Weights & Biases.