October 23, 2025
RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
Kaiwen Zha*,1,
Zhengqi Gao*,1,
Maohao Shen1,
Zhang-Wei Hong2,
Duane S. Boning1,
Dina Katabi1
1MIT 2MIT-IBM Watson AI Lab
NeurIPS 2025
Abstract
RL Tango is a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems.
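As an illustration of what an outcome-level verification correctness reward could look like, here is a minimal sketch. It is not taken from this codebase: the function names, the exact-match answer check, and the 0/1 reward values are all assumptions made for illustration. The idea it shows is that the verifier is rewarded only for whether its final verdict on a solution agrees with the ground-truth outcome, so no step-level labels are needed.

```python
def answer_matches(predicted: str, reference: str) -> bool:
    """Hypothetical outcome check: exact match after trimming whitespace."""
    return predicted.strip() == reference.strip()


def verifier_outcome_reward(verdict_is_correct: bool, answer_is_correct: bool) -> float:
    """Hypothetical reward: 1.0 if the verifier's final verdict agrees with
    the ground-truth outcome of the generator's solution, else 0.0."""
    return 1.0 if verdict_is_correct == answer_is_correct else 0.0


# Example: the generator's answer is right, but the verifier judged it wrong.
answer_ok = answer_matches(" 42", "42")
reward = verifier_outcome_reward(verdict_is_correct=False, answer_is_correct=answer_ok)
```

In this sketch the verifier also earns the full reward for correctly flagging a wrong solution, since only agreement with the outcome matters.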
Core Files
verl/trainer/main_rl_tango.py: Main entry script for RL Tango training. It sets up the environment (e.g., Ray, reward manager, resource pools) and launches the training process.
verl/trainer/ppo/rl_tango_trainer.py: Core trainer class implementing the co-evolution of the generator and verifier models through interleaved RL training, reward computation, and coordinated optimization.
verl/workers/verifier_worker.py: Worker class for the verifier model. It handles verification input preparation and structured result extraction from verification outputs.
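To make the idea of interleaved training concrete, the following is a toy sketch of an alternating update schedule. It is not the logic in rl_tango_trainer.py: the function name, the per-round step counts, and the choice to update the generator first are hypothetical and only illustrate how two models can take turns being optimized.

```python
from typing import Iterator


def interleaved_schedule(num_rounds: int, generator_steps: int,
                         verifier_steps: int) -> Iterator[str]:
    """Yield which model to update at each step (hypothetical schedule).

    Each round runs `generator_steps` generator updates followed by
    `verifier_steps` verifier updates.
    """
    for _ in range(num_rounds):
        for _ in range(generator_steps):
            yield "generator"
        for _ in range(verifier_steps):
            yield "verifier"


# Two rounds, each with two generator updates and one verifier update.
schedule = list(interleaved_schedule(num_rounds=2, generator_steps=2, verifier_steps=1))
```

A real trainer would replace each yielded label with a rollout, reward computation, and policy update for the corresponding model.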
Installation
Environment Setup
git clone https://github.com/kaiwenzha/rl-tango.git
cd rl-tango
conda create -n tango python==3.10
conda activate tango
pip install -e '.[vllm]'
pip install ninja
pip install flash-attn --no-build-isolation
Download Models
mkdir base_models
huggingface-cli download Qwen/Qwen2.5-Math-7B \
--local-dir ./base_models/Qwen2.5-Math-7B --local-dir-use-symlinks False
huggingface-cli download Qwen/Qwen2.5-7B \
--local-dir ./base_models/Qwen2.5-7B --local-dir-use-symlinks False
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
--local-dir ./base_models/Llama-3.1-70B-Instruct --local-dir-use-symlinks False
Prepare Datasets
# training datasets
python data_preprocess/eurus2_sft.py
python data_preprocess/eurus2_rl.py
# evaluation datasets
mkdir -p ./data/StrategyQA
wget -P ./data/StrategyQA https://huggingface.co/datasets/voidful/StrategyQA/resolve/main/strategyqa_train.json
python data_preprocess/prepare_strategyqa.py
mkdir -p ./data/TableBench
wget -P ./data/TableBench https://huggingface.co/datasets/Multilingual-Multimodal-NLP/TableBench/resolve/main/TableBench.jsonl
python data_preprocess/prepare_eval_benchmarks.py
Running
We run our experiments on 4 nodes with 8×A100 GPUs each. Other compute resource configurations may also work, though we have not evaluated them.
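Once the Ray cluster is up (see the commands under SFT Data Generation), a quick sanity check can confirm that all GPUs have joined before launching a job. The helper below is a hypothetical sketch, not part of this repository. It takes the dictionary returned by `ray.cluster_resources()` and compares its "GPU" entry against the 4 × 8 = 32 GPUs of our setup.

```python
def check_gpu_count(resources: dict, nodes: int = 4, gpus_per_node: int = 8) -> bool:
    """Return True if the cluster reports at least nodes * gpus_per_node GPUs.

    `resources` is expected to look like the dict from ray.cluster_resources(),
    e.g. {"CPU": 256.0, "GPU": 32.0, ...}.
    """
    expected = nodes * gpus_per_node
    available = int(resources.get("GPU", 0))
    return available >= expected


# Usage on the master node (requires a running Ray cluster):
#   import ray
#   ray.init(address="auto")
#   print(check_gpu_count(ray.cluster_resources()))
```

If you use a different configuration, pass your own `nodes` and `gpus_per_node` values.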
SFT
SFT Data Generation
# export on all nodes before starting ray
export VLLM_ATTENTION_BACKEND=XFORMERS
# launch the master node of ray
ray start --head
# on each of the other 3 nodes, join the ray cluster
ray start --address ${MASTER_NODE_ADDRESS}:6379
# on the master node, run
bash scripts/run_sft_data_generation.sh
# split SFT data into train/test splits
python data_preprocess/split_parquet.py \
--input ./data/eurus2_sft_math/llama70b_sft_data_generation.parquet
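For readers who want to see what a train/test split involves, here is a minimal sketch of a seeded, shuffled split over row indices. It does not reproduce data_preprocess/split_parquet.py: the function name, the default test fraction, and the seed are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple


def split_indices(num_rows: int, test_fraction: float = 0.1,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) from a seeded shuffle (hypothetical)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    num_test = max(1, int(num_rows * test_fraction))
    return sorted(indices[num_test:]), sorted(indices[:num_test])


train_idx, test_idx = split_indices(num_rows=100, test_fraction=0.1, seed=0)
```

With pandas, the two index lists could then be applied to a DataFrame loaded via `pd.read_parquet`, using `df.iloc[train_idx]` and `df.iloc[test_idx]`, and each part written back with `to_parquet`.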
SFT Training
# on node i=0,1,2,3, run
bash scripts/run_sft_generator.sh --nnodes 4 --node_rank ${i} \
--master_addr ${MASTER_NODE_ADDRESS}
RL Tango
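# same cluster setup as for SFT data generation: export on all nodes,
# start the ray head on the master node, then join the other 3 nodes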
export VLLM_ATTENTION_BACKEND=XFORMERS
ray start --head
ray start --address ${MASTER_NODE_ADDRESS}:6379
# on the master node, run
bash scripts/run_rl_tango.sh <sft_model_checkpoint_path>
Acknowledgements
Our codebase is built on veRL. Special thanks to this great infrastructure.
Citation
If you find our work useful or relevant to your research, please consider citing our paper:
@article{zha2025rl,
title={RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning},
author={Zha, Kaiwen and Gao, Zhengqi and Shen, Maohao and Hong, Zhang-Wei and Boning, Duane S and Katabi, Dina},
journal={arXiv preprint arXiv:2505.15034},
year={2025}
}