Background: Environment Tuning

April 14, 2026 · View on GitHub

T3RL implements the Environment Tuning paradigm from *Don't Just Fine-tune the Agent, Tune the Environment* (ICLR 2026). The core idea: train LLMs to perform multi-turn tool use by interacting with realistic execution environments, rather than imitating static trajectories.

Paper Techniques

The paper introduces three key techniques:

  1. Four-stage curriculum — Progresses from format/syntax regulation → basic calling conventions → complex multi-tool scenarios → realistic evaluation alignment.
  2. Actionable environment augmentation — Converts vague execution failures into corrective hints (e.g., missing parameters, tool dependency ordering), turning failures into learning signals (see the first sketch after this list).
  3. Fine-grained progress reward — Per-turn dense feedback: count(correct_turns) / total_turns, avoiding sparse binary rewards and enabling steady improvement (see the second sketch after this list).
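
To make technique 2 concrete, here is a minimal sketch of how a raw execution failure might be rewritten into a corrective hint. All names here (`augment_error` and its parameters) are hypothetical illustrations, not T3RL's or the paper's actual implementation:

```python
# Hypothetical sketch of actionable environment augmentation (technique 2).
# Names and error taxonomy are illustrative, not T3RL's actual API.

def augment_error(tool_name: str, error: Exception, provided_args: dict,
                  required_args: set[str]) -> str:
    """Turn a vague execution failure into a corrective hint the model can learn from."""
    missing = required_args - provided_args.keys()
    if missing:
        return (f"Call to '{tool_name}' failed: missing required parameter(s) "
                f"{sorted(missing)}. Re-issue the call with all required arguments.")
    if isinstance(error, KeyError):
        return (f"Call to '{tool_name}' referenced state that does not exist yet. "
                f"A prerequisite tool may need to run first (check dependency order).")
    # Fall back to the raw error so no signal is lost.
    return f"Call to '{tool_name}' failed: {error}"
```

The point of the augmentation is that the second element of each pair above is something the policy can act on in the next turn, whereas a bare stack trace usually is not.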
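And a sketch of the fine-grained progress reward from technique 3, directly implementing the count(correct_turns) / total_turns formula; the per-turn correctness judgments are assumed to come from whatever the environment verifies:

```python
def progress_reward(turn_correct: list[bool]) -> float:
    """Fine-grained progress reward: fraction of turns judged correct.

    Dense per-turn credit instead of a sparse 0/1 end-of-episode reward,
    so partially successful trajectories still receive learning signal.
    """
    if not turn_correct:
        return 0.0
    return sum(turn_correct) / len(turn_correct)

# Example: 3 of 4 turns correct -> reward 0.75 instead of a flat 0.
assert progress_reward([True, True, False, True]) == 0.75
```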

How T3RL Differs from the Paper

T3RL currently uses a two-stage curriculum instead of four:

| Stage | Data | Notes |
| --- | --- | --- |
| Stage 1 | `bfcl_train_base.jsonl` (base split) | CP=1, builds core tool-calling capability |
| Stage 2 | `bfcl_train.jsonl` (full data) | CP=4, longer context, harder multi-turn scenarios |
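
In configuration terms, the two stages might look like the sketch below. The field names are hypothetical, not T3RL's actual schema, and it reads CP as a context-parallel size, which the "longer context" note suggests:

```python
# Hypothetical sketch of the two-stage curriculum as a config.
# Field names are illustrative, not T3RL's actual schema.
CURRICULUM = [
    {
        "name": "stage1",
        "data": "bfcl_train_base.jsonl",  # base split
        "context_parallel_size": 1,       # CP=1
    },
    {
        "name": "stage2",
        "data": "bfcl_train.jsonl",       # full data
        "context_parallel_size": 4,       # CP=4: longer context, harder multi-turn scenarios
        "init_from": "stage1",            # resume from the Stage 1 checkpoint
    },
]
```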

This simplification works because Qwen3-4B-Instruct already produces well-formed tool calls out of the box, so the explicit format-regulation stages (Stages 1–2 in the paper) that weaker backbones (Qwen2.5, Llama 3.1) need are unnecessary. T3RL also obtains a good training signal on this backbone without environment augmentation.

The paper's remaining techniques are on the Roadmap.

Trained Models & Training Curves

We release checkpoints from both stages, trained from Qwen3-4B-Instruct-2507 on 8× H200 GPUs:

| Stage | Model | Description |
| --- | --- | --- |
| Stage 1 | Qwen3-4B-EnvTuning-Base | After base-split training (CP=1) |
| Stage 2 | Qwen3-4B-EnvTuning | Final model after full-data training (CP=4) |
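
Assuming the checkpoints are published in the standard Hugging Face format, loading the final model for tool-calling inference would look roughly like this. The `your-org/` repo path and the example tool schema are placeholders, and `tools=` support in `apply_chat_template` requires a recent transformers release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo path: substitute the actual Hub location of the release.
model_id = "your-org/Qwen3-4B-EnvTuning"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Qwen3 chat templates accept tool schemas directly; this one is a toy example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
inputs = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True,
    tokenize=True, return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```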

Full training curves (reward, accuracy, KL, etc.) for both stages are available on Weights & Biases.