
# [NeurIPS 2025 Spotlight] Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning



## Overview

This repository contains the official implementation of FAST-GRPO (Fast-Slow Thinking Group Relative Policy Optimization), which applies fast-slow thinking to both visual and textual reasoning to achieve strong performance.

## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [Core Components](#core-components)
- [Training](#training)
- [Model Zoo](#model-zoo)
- [Evaluation Results](#evaluation-results)
- [Citation](#citation)
- [License](#license)
- [Acknowledgments](#acknowledgments)

## Installation

### Setup Environment

```bash
# Clone the repository
git clone https://github.com/Mr-Loevan/FAST-GRPO.git
cd FAST-GRPO

# Create conda environment
conda create -n fast_grpo python=3.11
conda activate fast_grpo

# Install dependencies (refer to the EasyR1 installation guide)
pip install -r requirements.txt
pip install -e .
```

## Quick Start

```bash
# Run training with the default configuration
bash examples/train_fast_llm.sh
```

## Core Components

FAST-GRPO introduces three key innovations that work together to achieve fast-slow reasoning:

### 1. Thinking Reward Function

The thinking reward function (`examples/reward_function/thinking_reward.py`) implements an adaptive, difficulty-aware reward mechanism:

- **Adaptive Difficulty:** `difficulty = (1 - pass_rate) * normalized_complexity`
- **Differentiated Rewards** (see the sketch below):
  - Easy problems (difficulty below the 80th percentile) answered correctly: rewards concise solutions
  - Hard problems (difficulty above the 80th percentile) answered incorrectly: rewards exploration effort
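
A minimal sketch of how these two branches could combine. The function name, arguments, and length-based shaping below are illustrative assumptions; the authoritative logic lives in `examples/reward_function/thinking_reward.py`:

```python
# Illustrative difficulty-aware thinking reward (assumed shaping, not the repo's API).

def thinking_reward(correct: bool, response_len: int, pass_rate: float,
                    normalized_complexity: float, difficulty_p80: float,
                    max_len: int = 4096) -> float:
    # Adaptive difficulty: harder when few rollouts pass and the problem is complex.
    difficulty = (1.0 - pass_rate) * normalized_complexity

    if difficulty < difficulty_p80 and correct:
        # Easy problem answered correctly: shorter responses score closer to 1.
        return 1.0 - min(response_len / max_len, 1.0)
    if difficulty >= difficulty_p80 and not correct:
        # Hard problem answered incorrectly: longer reasoning traces score higher.
        return min(response_len / max_len, 1.0)
    return 0.0
```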
### 2. Dynamic KL Penalty

Implements group-based adaptive KL divergence control for stable training:

```yaml
# Configuration in config.yaml
algorithm:
  kl_penalty: low_var_kl
  kl_coef: 1.0e-2
  kl_type: "group_accuracy_based"
  kl_min_coef: 0.001  # β_min
  kl_max_coef: 0.01   # β_max
```
- **Group-based Adaptation:** adjusts the KL coefficient based on each group's performance, as sketched below
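
As a rough illustration, the coefficient could be interpolated between `kl_min_coef` and `kl_max_coef` from each group's rollout accuracy. The direction of the mapping below (higher accuracy means a stronger KL penalty) is our assumption, not a statement of the paper's exact rule:

```python
# Illustrative group-accuracy-based KL coefficient (assumed mapping).

def adaptive_kl_coef(group_accuracy: float,
                     kl_min_coef: float = 0.001,  # β_min from config.yaml
                     kl_max_coef: float = 0.01) -> float:  # β_max from config.yaml
    # Assumption: groups the policy already solves reliably get a stronger
    # KL penalty (stay near the reference model), while struggling groups
    # get a weaker penalty so they can explore.
    acc = min(max(group_accuracy, 0.0), 1.0)  # clamp to [0, 1]
    return kl_min_coef + acc * (kl_max_coef - kl_min_coef)
```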
### 3. Slow2Fast Sampling

Progressive curriculum learning that shifts training from harder samples (slow thinking) toward easier ones (fast thinking):

```yaml
# Configuration in config.yaml
algorithm:
  online_filtering: true
  filter_key: accuracy
  dynamic_filter_schedule:
    - epoch_ratio: 0.5
      filter_low: 0.3
      filter_high: 0.99
    - epoch_ratio: 1.0
      filter_low: 0.01
      filter_high: 0.7
```
- **Phase 1 (0–50% of training):** learn from medium-to-high difficulty samples for slow thinking
- **Phase 2 (50–100% of training):** include easy samples for fast thinking (see the sketch below)
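
A minimal sketch of how the schedule above could drive online filtering. The field names mirror the YAML config, but the selection logic is our simplification of EasyR1's actual filtering hooks:

```python
# Illustrative Slow2Fast online filtering (simplified from the YAML schedule).

SCHEDULE = [
    {"epoch_ratio": 0.5, "filter_low": 0.30, "filter_high": 0.99},  # slow-thinking phase
    {"epoch_ratio": 1.0, "filter_low": 0.01, "filter_high": 0.70},  # fast-thinking phase
]

def filter_bounds(progress: float) -> tuple[float, float]:
    """Return (filter_low, filter_high) for training progress in [0, 1]."""
    for phase in SCHEDULE:
        if progress <= phase["epoch_ratio"]:
            return phase["filter_low"], phase["filter_high"]
    return SCHEDULE[-1]["filter_low"], SCHEDULE[-1]["filter_high"]

def keep_prompt(group_accuracy: float, progress: float) -> bool:
    # Keep a prompt only if its rollout accuracy falls inside the current band.
    low, high = filter_bounds(progress)
    return low <= group_accuracy <= high
```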

## Training

### Run Training Example

```bash
# Use the provided script (recommended)
bash examples/train_fast_llm.sh
```

## Model Zoo

| Model | Base Model | Download |
| --- | --- | --- |
| FAST-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | ModelScope |
| FAST-3B | Qwen-2.5-VL-3B | ModelScope |
| FAST-7B | Qwen-2.5-VL-7B | ModelScope |
| FAST-8B-Preview | Qwen-3-VL-8B | ModelScope |
| FAST-8B | Qwen-3-VL-8B | Coming Soon |

Note: FAST-8B-Preview is trained on only 10k data points from ViRL39K. Training of FAST-8B is ongoing.
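
The checkpoints can be fetched with ModelScope's `snapshot_download`; the model ID below is a placeholder, so substitute the real ID from the corresponding ModelScope page:

```python
# Hypothetical download snippet; replace the model ID with the actual one.
from modelscope import snapshot_download

model_dir = snapshot_download("FAST-GRPO/FAST-7B")  # placeholder model ID
print(f"Checkpoint downloaded to: {model_dir}")
```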

## Evaluation Results

### Performance on Textual Reasoning Benchmarks

| Method | GSM8K (Acc) | GSM8K (Length) | MATH 500 (Acc) | MATH 500 (Length) | AIME 2024 (Acc) | AIME 2024 (Length) |
| --- | --- | --- | --- | --- | --- | --- |
| FAST-1.5B | 86.8 | 851 | 85.8 | 2645 | 34.1 | 78003 |

Note: Length denotes the number of generated tokens.

### Performance on Visual Reasoning Benchmarks

| Benchmark | Qwen3-VL-8B (Acc) | Qwen3-VL-8B (Length) | GRPO (Acc) | GRPO (Length) | FAST-8B-Preview (Acc) | FAST-8B-Preview (Length) |
| --- | --- | --- | --- | --- | --- | --- |
| MathVerse | 42.9 | 1768.2 | 81.2 | 1750.3 | 81.6 | 622.5 |
| MathVista | 68.3 | 804.2 | 72.0 | 894.0 | 73.5 | 371.9 |
| CLEVR | 89.0 | 304.2 | 88.0 | 592.1 | 91.0 | 204.1 |
| DynaMath | 62.7 | 1134.7 | 76.5 | 1235.1 | 77.6 | 495.5 |
| Geo3K | 58.6 | 1680.2 | 70.2 | 1973.8 | 70.7 | 639.0 |
| LogicVista | 49.4 | 2078.0 | 62.9 | 1890.4 | 60.9 | 713.9 |
| MathVision | 21.7 | 3007.5 | 45.5 | 3007.6 | 52.7 | 1245.7 |
| MMMU-Pro | 27.0 | 1722.3 | 51.9 | 1813.3 | 51.9 | 737.4 |
| MMK12 | 51.7 | 2096.4 | 75.5 | 2045.6 | 79.1 | 864.4 |
| WeMath | 64.0 | 1536.6 | 83.4 | 1476.1 | 82.1 | 468.4 |
| A-OKVQA | 63.6 | 394.3 | 86.3 | 384.6 | 87.8 | 158.7 |

Note: Length denotes the number of generated tokens. FAST-8B-Preview is trained on only 10k data points from ViRL39K. The evaluation is adapted from PAPO-Eval.

## Citation

If you find this work useful, please cite our paper:

```bibtex
@inproceedings{xiao2025fastslow,
  title={Fast-Slow Thinking {GRPO} for Large Vision-Language Model Reasoning},
  author={Wenyi Xiao and Leilei Gan},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=MI1uT5rReV}
}
```

## License

This project is licensed under the Apache 2.0 License.

## Acknowledgments

- The results reported in our paper were originally implemented with OpenRLHF.
- This repository provides a reimplementation on top of the EasyR1 framework.
- Thanks to the VeRL and EasyR1 teams for the base training framework.