🚀 THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning 🚀

February 26, 2026

[Figure: Pipeline]

This is the official implementation of our paper THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning.

🔥 News:

🎉🎉🎉 Our paper has been selected for the 🤗 Hugging Face Daily Papers! Thanks to the community for the recognition and support 🚀
🎉🎉🎉 Congratulations! Our paper has been accepted to ICLR 2026.

TODO:

🔍 Overview

Large Language Models (LLMs) have advanced in mathematical reasoning but still struggle with precise computation and symbolic manipulation. THOR (Tool-Integrated Hierarchical Optimization via RL) addresses this by:

TIRGen – an actor–critic pipeline to construct high-quality tool-integrated reasoning data.
Hierarchical RL – jointly optimizing trajectory-level reasoning and step-level code generation.
Self-Correction – leveraging tool feedback to fix reasoning errors during inference.

THOR achieves state-of-the-art performance on multiple mathematical benchmarks and shows consistent improvements on code generation tasks, generalizing well across both reasoning and non-reasoning models.

✨ Key Contributions

🛠 TIRGen Pipeline – Generates policy-aligned tool-integrated reasoning data.
🎯 Hierarchical RL – Combines trajectory-level optimization with step-level correction.
🔄 Self-Correction Inference – Dynamically fixes reasoning errors during inference.
📊 Broad Generalization – Effective across reasoning and non-reasoning models.

⚙️ Method

Our method, THOR, enhances tool-integrated reasoning with a three-stage pipeline:

1️⃣ TIRGen: Tool-Integrated Data Construction

Actor generates natural language reasoning steps.
Critic evaluates whether parts of the reasoning can be executed as code.
Identified steps are transformed into tool-augmented reasoning paths.
Multi-stage filtering ensures policy alignment, code quality, and difficulty balance.
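
The actor-critic loop above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `build_tir_sample`, `keep_sample`, the `Step` record, and the three callables (`actor_generate`, `critic_to_code`, `run_code`) are hypothetical names, and the filter shown is a toy stand-in for the multi-stage filtering.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    text: str                     # natural-language reasoning written by the actor
    code: Optional[str] = None    # code the critic substituted for part of the step
    output: Optional[str] = None  # tool output observed when that code was run


def build_tir_sample(
    problem: str,
    actor_generate: Callable[[str, List[Step]], Optional[str]],
    critic_to_code: Callable[[str], Optional[str]],
    run_code: Callable[[str], str],
    max_steps: int = 8,
) -> List[Step]:
    """Interleave actor reasoning with critic-proposed tool calls (hypothetical)."""
    steps: List[Step] = []
    for _ in range(max_steps):
        text = actor_generate(problem, steps)
        if text is None:  # the actor signals that it has finished
            break
        code = critic_to_code(text)  # None when nothing in the step is executable
        output = run_code(code) if code is not None else None
        steps.append(Step(text, code, output))
    return steps


def keep_sample(steps: List[Step], final_answer: str, reference: str) -> bool:
    """Toy filter: require a correct answer and at least one tool call."""
    uses_tool = any(s.code is not None for s in steps)
    return uses_tool and final_answer.strip() == reference.strip()


# Toy stand-ins so the sketch runs end to end without a model or sandbox.
def toy_actor(problem, steps):
    script = ["Let x be the unknown product.", "Compute 17 * 23.", None]
    return script[len(steps)] if len(steps) < len(script) else None


def toy_critic(text):
    return "print(17 * 23)" if text.startswith("Compute") else None


def toy_run(code):
    return "391"  # a real pipeline would send the code to the sandbox here


sample = build_tir_sample("What is 17 * 23?", toy_actor, toy_critic, toy_run)
```

In this toy run only the second step receives code and a tool output; the first stays as plain reasoning, which mirrors the idea that the critic converts only the executable parts.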

[Figure: TIRGen]

2️⃣ Hierarchical Reinforcement Learning

Trajectory-level RL: Optimizes overall correctness of the final answer using GRPO.
Step-level RL: Focuses on error-prone code generation steps, using execution results as fine-grained rewards.
Joint optimization addresses sparse reward issues in long reasoning chains.
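
To make the two reward signals concrete, here is a small sketch. The group-relative normalization is the standard GRPO advantage (each reward is compared against the other samples drawn for the same prompt). The `step_reward` function is an assumption for illustration only: it treats successful execution as a binary reward. How THOR weights and combines the two levels is not specified in this README.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward against its sampling group.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_reward(executed_ok: bool) -> float:
    """Assumed step-level reward: 1.0 if the generated code ran, else 0.0."""
    return 1.0 if executed_ok else 0.0


# Four sampled trajectories for one problem; two reached the correct answer.
trajectory_advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Execution outcomes for three code steps inside one trajectory.
step_rewards = [step_reward(ok) for ok in (True, False, True)]
```

When every trajectory in a group receives the same final reward, the trajectory-level advantages all collapse to zero, which is one way to see why a denser step-level signal helps on long reasoning chains.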

[Figure: THOR]

3️⃣ Self-Correction During Inference

During inference, if a tool call fails, the model backtracks to the reasoning step.
It regenerates a new suffix and revised action, guided by tool feedback.
This enables online error correction with minimal overhead.
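
The backtrack-and-retry behaviour described above could look roughly like this. It is a sketch under assumptions: `solve_step_with_correction`, `generate`, and `run_tool` are hypothetical names, and the toy tool runner uses `exec` locally only so the example is self-contained, whereas a real setup would send the code to the sandbox.

```python
import contextlib
import io
from typing import Callable, Tuple

ToolResult = Tuple[bool, str]  # (succeeded, stdout or error message)


def solve_step_with_correction(
    prefix: str,
    generate: Callable[[str, str], Tuple[str, str]],
    run_tool: Callable[[str], ToolResult],
    max_retries: int = 2,
) -> Tuple[str, str, ToolResult]:
    """Regenerate the reasoning suffix and code when a tool call fails."""
    feedback = ""
    for _ in range(max_retries + 1):
        # Backtrack to the shared prefix and sample a new (reasoning, code) pair,
        # conditioning on the tool feedback from the previous failed attempt.
        reasoning, code = generate(prefix, feedback)
        result = run_tool(code)
        if result[0]:
            break
        feedback = result[1]
    return reasoning, code, result


def toy_run_tool(code: str) -> ToolResult:
    """Toy local runner; stands in for a sandboxed execution service."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, buf.getvalue().strip()


def toy_generate(prefix: str, feedback: str) -> Tuple[str, str]:
    """Toy policy: the first attempt is buggy, the retry fixes it."""
    if not feedback:
        return "Divide 10 by the count.", "count = 0\nprint(10 / count)"
    return "The count is 4, not 0.", "count = 4\nprint(10 / count)"


reasoning, code, (ok, out) = solve_step_with_correction(
    "Problem: split 10 evenly.", toy_generate, toy_run_tool
)
```

Only the failing step is regenerated, and the prefix is reused unchanged, which is why the added cost is limited to the retried suffix.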

📥 Installation

Step1. Install SandboxFusion

git clone https://github.com/bytedance/SandboxFusion
cd SandboxFusion
# install sandboxfusion to support code execution
conda create -n sandbox -y python=3.12
conda activate sandbox
poetry install
# to build the real docs, run `cd docs && npm ci && npm run build`
mkdir -p docs/build
make run-online
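
Once the service is up, a quick smoke test can confirm that code execution works. The sketch below assumes the sandbox listens on `http://localhost:8080` and exposes a `/run_code` endpoint that accepts a JSON body with `code` and `language` fields; check both against the SandboxFusion documentation for your version. The helper names are hypothetical.

```python
import json
import urllib.request


def build_run_code_request(
    code: str, base_url: str = "http://localhost:8080"
) -> urllib.request.Request:
    """Build a POST request for the assumed /run_code endpoint."""
    payload = json.dumps({"code": code, "language": "python"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/run_code",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_in_sandbox(
    code: str, base_url: str = "http://localhost:8080", timeout: float = 30.0
) -> dict:
    """Send code to the sandbox and return the decoded JSON response."""
    request = build_run_code_request(code, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the sandbox running, calling `run_in_sandbox("print(1 + 1)")` should return a JSON object describing the run. The exact response fields are not covered here.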

Step2. Install THOR environment

git clone https://github.com/JingMog/THOR
cd THOR
conda create -n THOR -y python=3.10
# activate the new environment so the requirements are installed into it
conda activate THOR
pip install -r requirements.txt

🚀 Usage

1. TIRGen: TIR data construction pipeline

cd TIRGen
# TIR dataset construction
bash construct_dataset_main.sh

# multi_stage_filter
bash filter.sh

2. TIR Inference

cd inference
bash submit_bon_policy.sh

3. Cold start

Our cold-start stage is built on ms-swift; see the ms-swift documentation for usage details.

cd swift
bash sft_demo.sh

4. RL training

# TODO

🙌 Acknowledgements

We thank the open-source communities behind Qwen, verl, and SandboxFusion.

🖊️ Citation

If you find our work helpful, please consider giving us a ⭐ and citing our paper:

@article{THOR,
  title={THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning},
  author={Chang, Qikai and Zhang, Zhenrong and Hu, Pengfei and Ma, Jiefeng and Pan, Yicheng and Zhang, Jianshu and Du, Jun and Liu, Quan and Gao, Jianqing},
  journal={arXiv preprint arXiv:2509.13761},
  year={2025}
}
