Training Reasoning Model with Dynamic Advantage Estimation on Reinforcement Learning

March 23, 2025



ADORA (Advantage Dynamics via Online Rollout Adaptation) is a reinforcement learning (RL) framework that dynamically adjusts advantage values during training based on the model's rollout distribution. Through simple yet effective experiments, we show that it significantly improves long Chain-of-Thought (CoT) reasoning and reflective capabilities in Large Language Models (LLMs) and Vision-Language Models (VLMs).

For LLMs, our ADORA implementation in the Logic-RL framework reaches an AMC score of 40 in only 100 training steps, compared to the original paper's 39 at 1200 steps, while maintaining a comparable AIME score of 8. For VLMs, using only 2K samples from the Geometry3K dataset and starting from Qwen2.5-VL-7B-Instruct, we achieve 73.5% accuracy on MathVista with consistent response-length growth, setting a new SOTA for multimodal implementations of DeepSeek-R1-Zero.


Key Results

ADORA

Implementing ADORA in the Logic-RL experiments achieves AMC 40 and AIME 8, surpassing GRPO's 35 and 6, respectively.

adora-figure_00

All results in pass@1 accuracy

| Method | Step | 2 | 3 | 4 | 5 | 6 | 7 | 8 | K&K ID | K&K OOD | Avg | AIME | AMC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | 100 | 0.44 | 0.42 | 0.21 | 0.26 | 0.27 | 0.17 | 0.20 | 0.33 | 0.21 | 0.31 | 6 | 35 |
| GRPO | 200 | 0.39 | 0.23 | 0.22 | 0.18 | 0.21 | 0.20 | 0.20 | 0.26 | 0.20 | 0.23 | 5 | 34 |
| GRPO | 300 | 0.83 | 0.81 | 0.81 | 0.64 | 0.59 | 0.48 | 0.46 | 0.77 | 0.51 | 0.66 | 5 | 34 |
| GRPO | 350 | 0.74 | 0.78 | 0.78 | 0.70 | 0.61 | 0.47 | 0.41 | 0.75 | 0.50 | 0.64 | 6 | 34 |
| ADORA | 100 | 0.34 | 0.27 | 0.21 | 0.13 | 0.14 | 0.05 | 0.09 | 0.24 | 0.09 | 0.17 | 7 | 40 |
| ADORA | 200 | 0.79 | 0.62 | 0.67 | 0.51 | 0.36 | 0.26 | 0.24 | 0.65 | 0.29 | 0.49 | 8 | 36 |
| ADORA | 300 | 0.84 | 0.63 | 0.67 | 0.67 | 0.57 | 0.45 | 0.44 | 0.70 | 0.49 | 0.61 | 6 | 38 |
| ADORA | 350 | 0.84 | 0.74 | 0.73 | 0.67 | 0.50 | 0.38 | 0.42 | 0.74 | 0.44 | 0.61 | 8 | 35 |

ADORA-VL

Training dynamics comparison of GRPO vs. ADORA on Qwen2.5-VL-7B-Instruct (Geometry3K). GRPO exhibits stagnant response-length growth with KL/policy-loss outliers, while ADORA achieves sustained length expansion and stabilized optimization, at the cost of slight training-reward degradation. Benchmark results demonstrate ADORA's superior in-domain and out-of-domain task performance relative to GRPO.

adora-figure_01

Training data comparison of the evaluated approaches

| | MM-EUREKA-8B | MMR1-math-v0 | Vision-R1-7B | ADORA (ours) |
|---|---|---|---|---|
| Base model | InternVL2.5-8B-Instruct | Qwen2.5-VL-7B | Qwen2.5-VL-7B | Qwen2.5-VL-7B |
| Cold-start data | 54k (open-source) | None | 200k (Modality Bridging VLM CoT) | None |
| RL data | 9.3k (K-12 data) | 6k (open-source, carefully curated) | 10k (math data) | 2k (geometry3k train) |

All results in pass@1 accuracy

| | MathVista (Avg) | MathVista (ID) | MathVista (OOD) | MMStar |
|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 67.3 | 69.6 | 65.5 | 63.9 |
| MM-EUREKA-8B | 68.1 | 73.4 | 63.8 | 64.3 |
| MMR1-math-v0 | 70.2 | 72.3 | 68.5 | 64.9 |
| Vision-R1-7B (reported) | 73.5 | 81.9 | 66.8 | - |
| GRPO | 70.2 | 71.6 | 69.1 | 61.9 |
| ADORA | 73.5 | 76.1 | 71.4 | 63.8 |

Reproducing

To reproduce the LLM and VLM experiments in this article, refer to the tutorials in the ADORA and ADORA_VL folders. We ran the experiments on 8 × A800 GPUs; both experiments took approximately 1.5 days to complete.

One More Thing

ADORA can be implemented in verl or OpenRLHF by modifying only a single function: based on your specific training objective, you define a method that generates advantage weights from the results of actor rollouts. You can also choose to apply ADORA only at certain stages of RL training. Notably, ADORA is orthogonal to other techniques, integrating seamlessly with cold-start training and the recently proposed DAPO. We welcome feedback, improvements, and collaboration opportunities to further explore ADORA's potential.
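As a minimal sketch of that single-function change (function names and the weight rule below are illustrative assumptions, not the exact verl/OpenRLHF hook or ADORA's actual rule): compute GRPO-style group-normalized advantages for one prompt's rollouts, then scale each advantage by a weight derived from the rollout batch. Here the hypothetical rule keeps full weight for correct rollouts longer than the group's average length, in the spirit of encouraging long-CoT reasoning:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Standard GRPO advantage: group-normalized reward over one prompt's rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def rollout_weights(rewards, lengths):
    """Hypothetical weight rule (illustrative only): keep full weight for
    correct rollouts longer than the group-average length, down-weight the rest."""
    avg_len = mean(lengths)
    return [1.0 if (r > 0 and n > avg_len) else 0.5
            for r, n in zip(rewards, lengths)]

def adora_advantages(rewards, lengths):
    """The single-function change: advantages rescaled by weights computed
    from the current rollout batch."""
    weights = rollout_weights(rewards, lengths)
    return [w * a for w, a in zip(weights, grpo_advantages(rewards))]
```

The weight-generation function is the part you define for your own training objective; swapping it out (or returning all-ones weights during certain stages) recovers plain GRPO, which is how ADORA can be enabled for only part of training.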

Citation

If you find this blog or our code useful, we would appreciate it if you could cite our work:

@misc{gui2025adora,
  title={Training Reasoning Model with Dynamic Advantage Estimation on Reinforcement Learning},
  author={Lujun Gui and Qingnan Ren},
  year={2025},
  howpublished={\url{https://www.notion.so/Training-Reasoning-Model-with-Dynamic-Advantage-Estimation-on-Reinforcement-Learning-1a830cc0904681fa9df3e076b6557a3e}},
  note={Notion Blog},
}

Previous Work

Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

Acknowledgement

We thank the verl and OpenRLHF teams for their awesome open-source RL infrastructure.