Background: Environment Tuning

April 14, 2026 · View on GitHub

T3RL implements the Environment Tuning paradigm from *Don't Just Fine-tune the Agent, Tune the Environment* (ICLR 2026). The core idea: train LLMs to perform multi-turn tool use by interacting with realistic execution environments, rather than imitating static trajectories.

Paper Techniques

The paper introduces three key techniques:

  1. Four-stage curriculum — Progresses from format/syntax regulation → basic calling conventions → complex multi-tool scenarios → realistic evaluation alignment.
  2. Actionable environment augmentation — Converts vague execution failures into corrective hints (e.g., missing parameters, tool dependency ordering), turning failures into learning signals (see the first sketch after this list).
  3. Fine-grained progress reward — Per-turn dense feedback: count(correct_turns) / total_turns, avoiding sparse binary rewards and enabling steady improvement (see the second sketch after this list).
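
To make technique 2 concrete, here is a minimal sketch of how a raw execution failure might be rewritten into a corrective hint. All names here (`augment_error` and its parameters) are hypothetical illustrations, not T3RL's or the paper's actual implementation:

```python
# Hypothetical sketch of actionable environment augmentation (technique 2).
# Names and error taxonomy are illustrative, not T3RL's actual API.

def augment_error(tool_name: str, error: Exception, provided_args: dict,
                  required_args: set[str]) -> str:
    """Turn a vague execution failure into a corrective hint the model can learn from."""
    missing = required_args - provided_args.keys()
    if missing:
        return (f"Call to '{tool_name}' failed: missing required parameter(s) "
                f"{sorted(missing)}. Re-issue the call with all required arguments.")
    if isinstance(error, KeyError):
        return (f"Call to '{tool_name}' referenced state that does not exist yet. "
                f"A prerequisite tool may need to run first (check dependency order).")
    # Fall back to the raw error so no signal is lost.
    return f"Call to '{tool_name}' failed: {error}"
```

The point of the augmentation is that the second element of each pair above is something the policy can act on in the next turn, whereas a bare stack trace usually is not.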
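And a sketch of the fine-grained progress reward from technique 3, directly implementing the count(correct_turns) / total_turns formula; the per-turn correctness judgments are assumed to come from whatever the environment verifies:

```python
def progress_reward(turn_correct: list[bool]) -> float:
    """Fine-grained progress reward: fraction of turns judged correct.

    Dense per-turn credit instead of a sparse 0/1 end-of-episode reward,
    so partially successful trajectories still receive learning signal.
    """
    if not turn_correct:
        return 0.0
    return sum(turn_correct) / len(turn_correct)

# Example: 3 of 4 turns correct -> reward 0.75 instead of a flat 0.
assert progress_reward([True, True, False, True]) == 0.75
```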

How T3RL Differs from the Paper

T3RL currently uses a two-stage curriculum instead of four:

| Stage | Data | Notes |
| --- | --- | --- |
| Stage 1 | `bfcl_train_base.jsonl` (base split) | CP=1, builds core tool-calling capability |
| Stage 2 | `bfcl_train.jsonl` (full data) | CP=4, longer context, harder multi-turn scenarios |
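
In configuration terms, the two stages might look like the sketch below. The field names are hypothetical, not T3RL's actual schema, and it reads CP as a context-parallel size, which the "longer context" note suggests:

```python
# Hypothetical sketch of the two-stage curriculum as a config.
# Field names are illustrative, not T3RL's actual schema.
CURRICULUM = [
    {
        "name": "stage1",
        "data": "bfcl_train_base.jsonl",  # base split
        "context_parallel_size": 1,       # CP=1
    },
    {
        "name": "stage2",
        "data": "bfcl_train.jsonl",       # full data
        "context_parallel_size": 4,       # CP=4: longer context, harder multi-turn scenarios
        "init_from": "stage1",            # resume from the Stage 1 checkpoint
    },
]
```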

This simplification works because Qwen3-4B-Instruct already produces well-formed tool calls out of the box, so the explicit format-regulation stages (Stages 1–2 in the paper) that weaker backbones (Qwen2.5, Llama 3.1) need are unnecessary. T3RL also obtains a good training signal on this backbone without environment augmentation.

The paper's remaining techniques are on the Roadmap.

Trained Models & Training Curves

We release checkpoints from both stages, trained from Qwen3-4B-Instruct-2507 on 8× H200 GPUs:

| Stage | Model | Description |
| --- | --- | --- |
| Stage 1 | Qwen3-4B-EnvTuning-Base | After base-split training (CP=1) |
| Stage 2 | Qwen3-4B-EnvTuning | Final model after full-data training (CP=4) |
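
Assuming the checkpoints are published in the standard Hugging Face format, loading the final model for tool-calling inference would look roughly like this. The `your-org/` repo path and the example tool schema are placeholders, and `tools=` support in `apply_chat_template` requires a recent transformers release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo path: substitute the actual Hub location of the release.
model_id = "your-org/Qwen3-4B-EnvTuning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Qwen3 chat templates accept tool schemas directly; this one is a toy example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True,
    tokenize=True, return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```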

Full training curves (reward, accuracy, KL, etc.) for both stages are available on Weights & Biases.