Teaching Language Models to Reason with Tools

October 23, 2025

🤖 A post-training framework for enhancing mathematical reasoning in Large Language Models using strategic code interpreter integration.

📌 Overview

CoRT is a post-training framework that teaches Large Reasoning Models (LRMs) to effectively leverage Code Interpreters (CI) for enhanced mathematical reasoning. Our approach addresses the key challenge of integrating external computational tools with LRMs' internal reasoning processes through strategic hint engineering and efficient training methodologies.

[Figure: CoRT Framework]


📈 Model Performance Comparison

Below is a comprehensive performance comparison of different mathematical reasoning models across various benchmarks, including AIME24, AIME25, AMC23, MATH500, and Olympiad. Our CoRT-trained models (Prompt-Hint-1.5B-RL and Hint-Engineering-1.5B-RL) demonstrate strong performance among lightweight models while maintaining token efficiency.

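| Model | Tool-Use | AIME24 | AIME25 | AMC23 | MATH500 | Olympiad | Avg |
|---|---|---|---|---|---|---|---|
| SOTA Models | | | | | | | |
| o1 | ❌ | 74.3 | 79.2 | - | 96.4 | - | - |
| DeepSeek-R1 | ❌ | 79.8 | 70.0 | - | 97.3 | - | - |
| QwQ-32B | ❌ | 79.5 | 65.3 | 94.3 | 92.3 | 79.7 | 82.2 |
| Frontier Models (32B) | | | | | | | |
| DeepSeek-R1-32B | ❌ | 72.9 | 59.0 | 88.8 | 94.3 | 72.5 | 77.5 |
| START-32B | ✅ | 66.7 | 47.1 | 95.0 | 94.4 | - | - |
| STILL-3-TOOL-32B | ✅ | 76.7 | 64.4 | 91.3 | 96.6 | 75.9 | 81.0 |
| ReTool-R1-32B | ✅ | 72.5 | 54.3 | 92.9 | 94.3 | 69.2 | 76.6 |
| Prompt-Hint-SFT-32B | ✅ | 77.3 | 65.0 | 95.0 | 96.6 | 75.1 | 81.8 |
| Hint-Engineering-SFT-32B | ✅ | 72.1 | 60.2 | 91.3 | 94.4 | 71.2 | 77.8 |
| Hint-Engineering-RFT-32B | ✅ | 76.7 | 67.1 | 94.4 | 95.1 | 73.4 | 81.3 |
| Lightweight Models (1.5B) | | | | | | | |
| DeepSeek-R1-1.5B | ❌ | 28.8 | 21.8 | 62.9 | 83.9 | 43.3 | 48.1 |
| DeepScaleR-1.5B-Preview | ❌ | 40.0 | 30.0 | 73.6 | 87.8 | 50.0 | 56.3 |
| ToRL-1.5B | ✅ | 26.7 | 26.7 | 67.5 | 77.8 | 44.0 | 48.5 |
| Prompt-Hint-1.5B-SFT | ✅ | 30.6 | 25.0 | 63.1 | 83.3 | 50.4 | 50.5 |
| Prompt-Hint-1.5B-RL | ✅ | 43.1 | 30.2 | 73.8 | 87.3 | 57.1 | 58.3 |
| Hint-Engineering-1.5B-SFT | ✅ | 34.0 | 23.5 | 64.6 | 84.2 | 49.8 | 51.2 |
| Hint-Engineering-1.5B-RL | ✅ | 41.0 | 29.4 | 70.0 | 85.8 | 55.6 | 56.4 |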

Note:

  • Best results in each section are shown in bold.
  • Second-best results are underlined.
  • During inference: temperature = 0.6, top_p = 0.95.
  • Results for AIME24, AIME25, and AMC23 are averaged over 16 samples; others over 4 samples.
  • Max sequence length: 32,768 tokens. Max tool calls: 15.
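The sampling protocol in the notes can be read as an averaged accuracy: each problem is sampled n times, the fraction of correct samples is computed per problem, and those fractions are averaged over the benchmark. The following is a minimal sketch under that reading. The function name `avg_at_n` and the boolean grading matrix are illustrative assumptions, not part of the CoRT codebase.

```python
from statistics import mean


def avg_at_n(graded):
    """Mean accuracy in percent, averaged over samples and then over problems.

    graded[i][j] is True when sample j of problem i was judged correct.
    (Hypothetical input layout, chosen for illustration.)
    """
    per_problem = [mean(1.0 if ok else 0.0 for ok in samples) for samples in graded]
    return 100.0 * mean(per_problem)


# Toy example: 2 problems with 4 samples each (the notes use 16 or 4 samples).
graded = [
    [True, True, False, True],    # 3 of 4 correct
    [False, False, True, False],  # 1 of 4 correct
]
print(avg_at_n(graded))  # 50.0
```

Averaging over several samples reduces the variance that a temperature of 0.6 introduces, which matters most on small benchmarks such as AIME.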

As shown, Prompt-Hint-1.5B-RL achieves the highest average accuracy (58.3%) among the 1.5B models compared, which indicates that the CoRT framework improves mathematical reasoning even at a small model scale.
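The Avg column appears to be the unweighted mean of the five benchmark scores. This is an assumption, but it is consistent with the 1.5B rows, as the short check below shows. The numbers are copied from the table; the helper name `unweighted_mean` is illustrative.

```python
# (AIME24, AIME25, AMC23, MATH500, Olympiad) and the reported Avg,
# copied from the Lightweight Models (1.5B) rows of the table.
rows = {
    "DeepSeek-R1-1.5B": ((28.8, 21.8, 62.9, 83.9, 43.3), 48.1),
    "DeepScaleR-1.5B-Preview": ((40.0, 30.0, 73.6, 87.8, 50.0), 56.3),
    "ToRL-1.5B": ((26.7, 26.7, 67.5, 77.8, 44.0), 48.5),
    "Prompt-Hint-1.5B-SFT": ((30.6, 25.0, 63.1, 83.3, 50.4), 50.5),
    "Prompt-Hint-1.5B-RL": ((43.1, 30.2, 73.8, 87.3, 57.1), 58.3),
    "Hint-Engineering-1.5B-SFT": ((34.0, 23.5, 64.6, 84.2, 49.8), 51.2),
    "Hint-Engineering-1.5B-RL": ((41.0, 29.4, 70.0, 85.8, 55.6), 56.4),
}


def unweighted_mean(scores):
    """Plain arithmetic mean, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


for name, (scores, reported) in rows.items():
    print(f"{name}: computed {unweighted_mean(scores)}, reported {reported}")

# Model with the highest reported average among the 1.5B rows.
best = max(rows, key=lambda name: rows[name][1])
print(best)  # Prompt-Hint-1.5B-RL
```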

✨ Key Features

  • 🔍 Hint-Engineering: Strategic insertion of hints at appropriate positions to optimize LRM-CI interaction
  • 🚀 High Sample Efficiency: Achieves significant improvements with only 30 manually annotated high-quality samples
  • 🧾 Token Efficiency: Reduces token usage by 30–50% while maintaining competitive performance
  • 📦 Complete Training Pipeline: Supports SFT, RFT, and RL training stages

โš™๏ธ Installation

# Python 3.10 is recommended
# We recommend following https://github.com/agentica-project/rllm/tree/deepscaler for installation
# Note: We depend on vLLM version 0.6.3.post1

cd deepscaler
pip install -e ./verl
pip install -e .

Note: Our implementation is built upon the deepscaler LongCOT RL framework with modifications for LongTIR RL.
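Because the installation notes name specific versions, a quick sanity check after installing can catch mismatches early. The following is a minimal sketch using only the Python standard library. The function `check_environment` is a hypothetical helper, not part of the repository, and it assumes the package is registered under the distribution name `vllm`.

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 10)       # Python version recommended in the notes
REQUIRED_VLLM = "0.6.3.post1"   # vLLM version named in the notes


def check_environment(python_version=sys.version_info, get_version=metadata.version):
    """Return a list of warnings; an empty list means the setup matches the notes."""
    warnings = []
    if tuple(python_version[:2]) != REQUIRED_PYTHON:
        warnings.append(
            f"Python {python_version[0]}.{python_version[1]} found; 3.10 is recommended"
        )
    try:
        found = get_version("vllm")
        if found != REQUIRED_VLLM:
            warnings.append(f"vllm {found} found; {REQUIRED_VLLM} is expected")
    except metadata.PackageNotFoundError:
        warnings.append("vllm is not installed")
    return warnings


for line in check_environment():
    print("WARNING:", line)
```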


🧠 Open-Sourced Models

We open-source two 1.5B RL models trained with our CoRT framework:

📥 Prompt-Hint-1.5B-RL

🛠️ Hint-Engineering-1.5B-RL


🧪 Inference

To run inference with our models, use the following command:

TOKENIZERS_PARALLELISM=false VLLM_USE_V1=1 python -m infer.inference_vllm_dp_mj \
    --input_file <path_to_input_file_in_jsonl> \
    --start 0 \
    --end 0 \
    --output_dir <path_to_output_dir> \
    --model_name_or_path <local_path_to_our_1.5b_model> \
    --engine vllm \
    --temperature 0.6 \
    --top_p 0.95 \
    --n_sampling 16 \
    --stop_tokens_mode normal_code_block_end \
    --max_tokens_per_call 32768 \
    --max_model_len 32768 \
    --max_func_call 15 \
    --func_call_mode jupyter \
    --data_parallel_size 1 \
    --tensor_parallel_size 1
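The flags `--stop_tokens_mode normal_code_block_end`, `--func_call_mode jupyter`, and `--max_func_call 15` suggest an interleaved loop: the model generates until it closes a code block, the code is executed in a persistent session, the output is appended, and generation resumes until no more code is produced or the call budget is used up. The sketch below illustrates that general pattern only. It is not the repository's implementation. The `generate` callable, the `output` block format, and both function names are assumptions made for illustration.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence
CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"python\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def run_cell(code, namespace):
    """Execute code in a persistent namespace (Jupyter-like) and return its stdout."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:  # report the error to the model instead of crashing
        return f"{type(exc).__name__}: {exc}\n"
    return buffer.getvalue()


def tool_integrated_generate(generate, prompt, max_func_call=15):
    """Alternate between generation and code execution.

    `generate(text)` is a hypothetical callable returning the model's next chunk,
    which ends either at a closing code fence or at the final answer.
    """
    text, namespace, calls = prompt, {}, 0
    while True:
        chunk = generate(text)
        text += chunk
        blocks = CODE_BLOCK.findall(chunk)
        if not blocks or calls >= max_func_call:
            return text, calls
        calls += 1
        output = run_cell(blocks[-1], namespace)
        text += f"\n{FENCE}output\n{output}{FENCE}\n"


# Scripted stand-in for a model: one code call, then a final answer.
scripted = iter([
    f"Let me compute.\n{FENCE}python\nx = 6 * 7\nprint(x)\n{FENCE}",
    "The result is \\boxed{42}.",
])
final_text, calls = tool_integrated_generate(lambda _text: next(scripted), "What is 6 * 7?\n")
print(calls)  # 1
```

Keeping one namespace across calls mirrors notebook semantics: variables defined in an earlier code block remain available to later ones.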

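📄 Input Format

The input file must be in JSONL format: each line is a JSON object with a `prompt` field. Each prompt consists of a mathematical problem followed by a fixed instruction that asks the model to combine natural-language reasoning with Python programs and to put the final answer within \boxed{}, as in this example (pretty-printed here for readability; in the file it occupies a single line):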

{
    "prompt": "Every morning Aya goes for a $9$-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of $s$ kilometers per hour, the walk takes her 4 hours, including $t$ minutes spent in the coffee shop. When she walks $s+2$ kilometers per hour, the walk takes her 2 hours and 24 minutes, including $t$ minutes spent in the coffee shop. Suppose Aya walks at $s+\\frac{1}{2}$ kilometers per hour. Find the number of minutes the walk takes her, including the $t$ minutes spent in the coffee shop.\nPlease integrate natural language reasoning with python programs to solve the problem above, and put your final answer within \\boxed{}."
}
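Writing the file by hand makes it easy to get JSON escaping wrong, since LaTeX backslashes such as `\frac` and `\boxed` must be doubled inside JSON strings. The following is a minimal sketch that lets Python's `json` module handle the escaping. The helper `write_prompts` and the toy problem are illustrative assumptions; the instruction text is taken from the example above.

```python
import json
import os
import tempfile

INSTRUCTION = (
    "Please integrate natural language reasoning with python programs to solve "
    "the problem above, and put your final answer within \\boxed{}."
)


def write_prompts(problems, path):
    """Write one JSON object per line, each with a single `prompt` field."""
    with open(path, "w", encoding="utf-8") as f:
        for problem in problems:
            record = {"prompt": f"{problem}\n{INSTRUCTION}"}
            # json.dumps escapes backslashes and newlines, keeping one record per line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Toy problem (hypothetical); raw string so LaTeX backslashes stay literal.
problems = [r"Suppose $x + \frac{1}{2} = 2$. Find $x$."]
path = os.path.join(tempfile.mkdtemp(), "input.jsonl")
write_prompts(problems, path)

# Read the first record back to confirm it round-trips.
with open(path, encoding="utf-8") as f:
    first = json.loads(f.readline())
print(first["prompt"].endswith(INSTRUCTION))  # True
```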

🔧 Key Parameters

  • --input_file: Path to your JSONL input file
  • --model_name_or_path: Path to the downloaded model (either Prompt-Hint-1.5B-RL or Hint-Engineering-1.5B-RL)
  • --output_dir: Directory to save inference results
  • --n_sampling: Number of samples to generate per problem (default: 16)
  • --max_func_call: Maximum number of function calls allowed (default: 15)
  • --max_model_len: Maximum sequence length (default: 32768)
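When running the same configuration over several input files or models, it can help to assemble the command programmatically. The sketch below builds the argument list from the flags and values shown in the inference command earlier in this section. The helper `build_inference_command` and the example paths are hypothetical, and the environment variables from that command (`TOKENIZERS_PARALLELISM`, `VLLM_USE_V1`) would still need to be set separately.

```python
import shlex


def build_inference_command(input_file, output_dir, model_path,
                            n_sampling=16, max_func_call=15, max_model_len=32768):
    """Return the inference command as an argument list (nothing is executed)."""
    args = {
        "--input_file": input_file,
        "--start": 0,
        "--end": 0,
        "--output_dir": output_dir,
        "--model_name_or_path": model_path,
        "--engine": "vllm",
        "--temperature": 0.6,
        "--top_p": 0.95,
        "--n_sampling": n_sampling,
        "--stop_tokens_mode": "normal_code_block_end",
        "--max_tokens_per_call": 32768,
        "--max_model_len": max_model_len,
        "--max_func_call": max_func_call,
        "--func_call_mode": "jupyter",
        "--data_parallel_size": 1,
        "--tensor_parallel_size": 1,
    }
    cmd = ["python", "-m", "infer.inference_vllm_dp_mj"]
    for flag, value in args.items():
        cmd += [flag, str(value)]
    return cmd


# Placeholder paths, for illustration only.
cmd = build_inference_command("problems.jsonl", "outputs", "/models/Prompt-Hint-1.5B-RL")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, env=...)` with the two environment variables added to `env`.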

📊 Evaluation

To reproduce the reported performance of our two 1.5B RL models, run:

cd CORT
sh evaluation/eval_dp_8_tp_1_n_16_maxml_32k_maxfc_15_maxlpc_32k.sh <model_to_eval>

🧰 RL Training for Your Own LongTIR Models

# See `data/toy_train.reason_step.parquet` for how training samples are constructed
cd CORT
sh deepscaler/rl_scripts/launch_cort_rl.sh

📎 Citation

If you find our work useful for your research, please cite our paper:

@misc{li2025cortcodeintegratedreasoningthinking,
      title={CoRT: Code-integrated Reasoning within Thinking}, 
      author={Chengpeng Li and Zhengyang Tang and Ziniu Li and Mingfeng Xue and Keqin Bao and Tian Ding and Ruoyu Sun and Benyou Wang and Xiang Wang and Junyang Lin and Dayiheng Liu},
      year={2025},
      eprint={2506.09820},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.09820}, 
}

Our implementation builds upon open-source codebases, including the deepscaler framework and the verl package installed in the steps above.


๐Ÿ“ License

This project is released under the MIT License.


📬 Contact

For questions or suggestions, feel free to reach out to us at chengpengli@mail.ustc.edu.cn.

