Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

June 23, 2026

Code for Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning (MRT). MRT adds a dense progress reward, defined as the change in the likelihood of eventual success contributed by each reasoning episode, on top of outcome-reward RL. This trains LLMs to make steady progress and use test-time compute more efficiently.
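As a rough illustration of the progress reward described above, the sketch below computes per-episode progress as the difference between consecutive success-probability estimates. The function names, the `alpha` weight, and the choice to add the outcome reward to the final episode are assumptions made for this example; they are not taken from the training code in train/rl/.

```python
from typing import List


def progress_rewards(success_probs: List[float]) -> List[float]:
    """Per-episode progress: change in estimated success probability.

    success_probs[0] is the estimate before any episode; success_probs[j]
    is the estimate after the first j episodes (hypothetical inputs).
    """
    return [after - before
            for before, after in zip(success_probs, success_probs[1:])]


def shaped_rewards(success_probs: List[float],
                   outcome: float,
                   alpha: float = 1.0) -> List[float]:
    """Illustrative shaping: alpha * progress per episode, with the
    outcome reward added to the last episode (an assumption of this sketch)."""
    rewards = [alpha * p for p in progress_rewards(success_probs)]
    if rewards:
        rewards[-1] += outcome
    return rewards


# Example: the second episode makes things slightly worse, so its progress is negative.
probs = [0.2, 0.5, 0.4, 0.9]
per_episode = progress_rewards(probs)
```

Note that the per-episode values telescope: their sum equals the final estimate minus the initial one, so the dense reward only redistributes credit across episodes.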

v0.1 ships the full RL training + evaluation code so the open-ended result (Table 1, DeepSeek-R1-Distill-Qwen-1.5B) can be reproduced end to end, including a new on-policy variant.

✨ MRT-online (new in v0.1)

The paper applies the progress bonus as a single end-of-trace scalar computed over a stale, off-policy prefix, and leaves one problem open: how to do branched rollouts from a meta-prover policy efficiently. MRT-online does exactly that: it generates the thinking trace on-policy, segments it into episodes online, forks short forced-termination branches at each episode boundary, and assigns a dense per-episode progress reward. SGLang radix prefix caching keeps the branches cheap.
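The branching idea can be sketched in a few lines, under heavy simplification. Here `branch_is_correct` is a hypothetical stand-in for "force the model to terminate from this prefix and grade the answer"; in the real code that step is a policy rollout, and none of the names below come from rollout_mrt_online.py.

```python
from typing import Callable, List


def estimate_success(prefix: str,
                     branch_is_correct: Callable[[str], bool],
                     num_branches: int) -> float:
    """Fraction of forced-termination branches from `prefix` graded correct."""
    hits = sum(bool(branch_is_correct(prefix)) for _ in range(num_branches))
    return hits / num_branches


def per_episode_progress(prompt: str,
                         episodes: List[str],
                         branch_is_correct: Callable[[str], bool],
                         num_branches: int = 4) -> List[float]:
    """Estimate success at every episode boundary and return the differences."""
    prefix = prompt
    before = estimate_success(prefix, branch_is_correct, num_branches)
    rewards = []
    for episode in episodes:
        prefix += episode  # extend the shared prefix by one episode
        after = estimate_success(prefix, branch_is_correct, num_branches)
        rewards.append(after - before)
        before = after
    return rewards


# Toy grader: a branch "succeeds" once the prefix contains the word "answer".
toy_grader = lambda prefix: "answer" in prefix
rewards = per_episode_progress(
    "Q: ", ["try A. ", "answer found. ", "check. "], toy_grader, num_branches=2
)
```

Because every branch at a boundary shares the same growing prefix, a prefix cache can reuse the already-computed prefix state across branches, which is why this pattern pairs well with radix prefix caching.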

Empirically, MRT-online beats outcome-reward GRPO (gain over the base model of +0.62 vs +0.48) but trails the offline single-scalar MRT (+1.20), which is useful evidence that the off-policy scalar form is a good default. Code: train/rl/mrt_plugin/rollout_mrt_online.py; details in train/rl/REPRODUCTION.md.

Reproducing the open-ended RL result

Everything needed is in train/rl/ — see train/rl/REPRODUCTION.md for the full recipe (method, assumptions, run commands, infra notes). Headline (5-benchmark avg pass@1 gain over the base model, 248 training steps, eval n=64):

| | GRPO | MRT | MRT-online | paper (GRPO / MRT) |
|---|---|---|---|---|
| gain over base | +0.48 | +1.20 | +0.62 | +1.1 / +2.2 |
| ratio vs GRPO | 1.0× | 2.50× | 1.29× | 2–3× |

The MRT/GRPO ratio (2.50×) reproduces the paper's headline 2–3×. The training code is built on miles (Megatron-LM + SGLang), vendored at train/rl/miles/ in a separate commit with the MRT changes added on top — so git diff against that commit is exactly the MRT delta.
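The ratios quoted above follow directly from the reported gains. A small check, using only the headline numbers from this README (the function name is just for illustration):

```python
# Gains over the base model, as reported in the headline numbers above.
gains = {"GRPO": 0.48, "MRT": 1.20, "MRT-online": 0.62}


def ratios_vs_baseline(gains: dict, baseline: str = "GRPO") -> dict:
    """Divide each method's gain by the baseline's gain, rounded to 2 decimals."""
    base = gains[baseline]
    return {name: round(gain / base, 2) for name, gain in gains.items()}


ratios = ratios_vs_baseline(gains)
print(ratios)
```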

Repository layout

| path | what |
|---|---|
| train/rl/ | v0.1 RL training + eval (GRPO, MRT, MRT-online) on vendored miles |
| train/sft/ | STaR/SFT training (backtracking setting) |
| eval/ | evaluation scripts for the SFT / backtracking setting |
| analysis/ | R1 scaling-curve, regret, and extrapolation analyses (paper §5, §8.5.1, Appendices D–F) |
| src/ | shared utilities |

Citation

If you use our work or codebase in your research, please cite our paper:

@misc{qu2025optimizingtesttimecomputemeta,
      title={Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning},
      author={Yuxiao Qu and Matthew Y. R. Yang and Amrith Setlur and Lewis Tunstall and Edward Emanuel Beeching and Ruslan Salakhutdinov and Aviral Kumar},
      year={2025},
      eprint={2503.07572},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.07572},
}

Acknowledgements & licensing

This repo is MIT-licensed. train/rl/miles/ vendors radixark/miles (Apache-2.0); its license is preserved at train/rl/miles/LICENSE and noted in NOTICE. MRT builds on TRL and Open-R1; we thank those teams.