THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
February 26, 2026

This is the official implementation of our paper THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning.
News:
- Our paper has been selected for Hugging Face Daily Papers. Thanks to the community for the recognition and support!
- Our paper has been accepted to ICLR 2026.
TODO:
- Update arXiv preprint.
- Update inference code.
- Update TIRGen code.
- Update training code.
- Update the TIRGen dataset.
Overview
Large Language Models (LLMs) have advanced in mathematical reasoning but still struggle with precise computation and symbolic manipulation. THOR (Tool-Integrated Hierarchical Optimization via RL) addresses this by:
- TIRGen: an actor-critic pipeline that constructs high-quality tool-integrated reasoning data.
- Hierarchical RL: joint optimization of trajectory-level reasoning and step-level code generation.
- Self-Correction: tool feedback is used to fix reasoning errors during inference.
THOR achieves state-of-the-art performance on multiple mathematical benchmarks and shows consistent improvements on code generation tasks, generalizing well across both reasoning and non-reasoning models.
Key Contributions
- TIRGen Pipeline: generates policy-aligned tool-integrated reasoning data.
- Hierarchical RL: combines trajectory-level optimization with step-level correction.
- Self-Correction Inference: dynamically fixes reasoning errors during inference.
- Broad Generalization: effective across reasoning and non-reasoning models.
Method
Our method, THOR, enhances tool-integrated reasoning with a three-stage pipeline:
1. TIRGen: Tool-Integrated Data Construction
- Actor generates natural language reasoning steps.
- Critic evaluates whether parts of the reasoning can be executed as code.
- Identified steps are transformed into tool-augmented reasoning paths.
- Multi-stage filtering ensures policy alignment, code quality, and difficulty balance.
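
The actor/critic hand-off described above can be sketched roughly as follows. This is a minimal illustration, not the TIRGen implementation: `Step`, `build_tir_path`, `toy_critic`, and `toy_execute` are hypothetical names, the critic is a simple string rule standing in for an LLM, and the executor is a restricted `eval` standing in for a real code sandbox. The multi-stage filtering is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    text: str                          # natural-language reasoning from the actor
    code: Optional[str] = None         # code the critic attached, if any
    observation: Optional[str] = None  # tool output for that code


def build_tir_path(
    reasoning_steps: List[str],
    critic: Callable[[str], Optional[str]],
    execute: Callable[[str], str],
) -> List[Step]:
    """Turn plain reasoning steps into a tool-augmented path.

    The critic returns code for steps it judges executable, or None otherwise.
    """
    path = []
    for text in reasoning_steps:
        code = critic(text)
        if code is None:
            path.append(Step(text))
        else:
            path.append(Step(text, code, execute(code)))
    return path


def toy_critic(text: str) -> Optional[str]:
    # Hypothetical rule: steps of the form "compute <expr>" are executable.
    prefix = "compute "
    return text[len(prefix):] if text.startswith(prefix) else None


def toy_execute(code: str) -> str:
    # Stand-in for a sandbox: evaluate a bare arithmetic expression.
    return str(eval(code, {"__builtins__": {}}, {}))


path = build_tir_path(
    ["Let n be the product of the two factors.", "compute 12 * 34"],
    toy_critic,
    toy_execute,
)
```

Here the first step stays as plain text, while the second carries both the extracted code and its observed result.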

2. Hierarchical Reinforcement Learning
- Trajectory-level RL: Optimizes overall correctness of the final answer using GRPO.
- Step-level RL: Focuses on error-prone code generation steps, using execution results as fine-grained rewards.
- Joint optimization addresses sparse reward issues in long reasoning chains.
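
For readers unfamiliar with GRPO, the trajectory-level signal is a group-relative advantage: several responses are sampled for the same problem, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration; THOR's step-level reward shaping is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled solutions to one problem; reward 1.0 means the final answer was correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples receive a positive advantage and incorrect ones a negative advantage of equal magnitude. If every sample in a group gets the same reward, all advantages are zero, which is one way the sparse-reward problem mentioned above shows up in practice.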

3. Self-Correction During Inference
- During inference, if a tool call fails, the model backtracks to the reasoning step.
- It regenerates a new suffix and revised action, guided by tool feedback.
- This enables online error correction with minimal overhead.
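
A minimal sketch of this retry loop is shown below, assuming a generator that takes the kept reasoning prefix plus the latest tool feedback and returns a new suffix and code action. All names (`self_correct`, `toy_generate`, `run_tool`) and the `max_retries` parameter are hypothetical, and the tool is a restricted `eval` rather than a real sandbox.

```python
from typing import Callable, Optional, Tuple


def run_tool(code: str) -> Tuple[bool, str]:
    """Toy tool: evaluate an arithmetic expression, report failure as text."""
    try:
        return True, str(eval(code, {"__builtins__": {}}, {}))
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def self_correct(
    prefix: str,
    generate: Callable[[str, Optional[str]], Tuple[str, str]],
    tool: Callable[[str], Tuple[bool, str]] = run_tool,
    max_retries: int = 2,
) -> Tuple[str, str, bool, str]:
    """Keep the prefix; regenerate the suffix and code until the tool call succeeds."""
    feedback = None
    for _ in range(max_retries + 1):
        suffix, code = generate(prefix, feedback)
        ok, output = tool(code)
        if ok:
            break
        feedback = output  # the error message guides the next attempt
    return prefix + suffix, code, ok, output


def toy_generate(prefix: str, feedback: Optional[str]) -> Tuple[str, str]:
    # Stand-in for the model: the first attempt is buggy, the retry is fixed.
    if feedback is None:
        return " Divide 1 by (2 - 2).", "1 / (2 - 2)"
    return " Divide 1 by 2.", "1 / 2"


text, code, ok, output = self_correct("We need the ratio.", toy_generate)
```

Only the failed suffix is regenerated, so the cost of a correction is bounded by `max_retries` extra generations of that suffix rather than a full re-solve.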
Results
Comparison With State-of-the-Art Methods

Effectiveness of TIRGen

Ablation Study

Installation
Step 1. Install SandboxFusion
git clone https://github.com/bytedance/SandboxFusion
cd SandboxFusion
# install sandboxfusion to support code execution
conda create -n sandbox -y python=3.12
conda activate sandbox
poetry install
# to build the real docs, run `cd docs && npm ci && npm run build`
mkdir -p docs/build
make run-online
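
Once the server is running, a quick smoke test can confirm that code execution works before moving on. The snippet below is a sketch under assumptions: it assumes the sandbox listens on `http://localhost:8080` and exposes a `run_code` endpoint accepting a JSON body with `code` and `language` fields, as in SandboxFusion's own quick-start example. Adjust `SANDBOX_URL` if your deployment differs. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed default address of a locally started SandboxFusion server.
SANDBOX_URL = "http://localhost:8080/run_code"


def build_request(code: str, url: str = SANDBOX_URL) -> urllib.request.Request:
    """Build the POST request that asks the sandbox to run a Python snippet."""
    body = json.dumps({"code": code, "language": "python"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_in_sandbox(code: str, url: str = SANDBOX_URL, timeout: float = 30.0) -> dict:
    """Send the snippet to the sandbox and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(code, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the server up, try:
#   print(run_in_sandbox('print("hello from the sandbox")'))
```

If the call raises a connection error, the server is not reachable at the assumed address.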
Step 2. Install THOR environment
git clone https://github.com/JingMog/THOR
cd THOR
conda create -n THOR -y python=3.10
# activate the new environment so the requirements install into it
conda activate THOR
pip install -r requirements.txt
Usage
1. TIRGen: TIR data construction pipeline
cd TIRGen
# TIR dataset construction
bash construct_dataset_main.sh
# multi_stage_filter
bash filter.sh
2. TIR Inference
cd inference
bash submit_bon_policy.sh
3. Cold start
Our cold start is based on ms-swift; see the ms-swift repository for usage details.
cd swift
bash sft_demo.sh
4. RL training
# TODO
Acknowledgements
We thank the open-source communities behind Qwen, verl, and SandboxFusion.
Citation
If you find our work helpful, please consider giving us a star and citing our paper:
@article{THOR,
title={THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning},
author = {Chang, Qikai and Zhang, Zhenrong and Hu, Pengfei and Ma, Jiefeng and Pan, Yicheng and Zhang, Jianshu and Du, Jun and Liu, Quan and Gao, Jianqing},
journal={arXiv preprint arXiv:2509.13761},
year={2025}
}