
FlowRL

Matching Reward Distributions via Flow Balance

📄 arXiv Paper | 🤗 #1 Paper of the Day

𝕏 Post 1 | 𝕏 Post 2 | 𝕏 Post 3 | 𝕏 Post 4

FlowRL Overview


FlowRL Objective

$$\mathcal{L}_{\text{FlowRL}} = w \cdot \left( \log Z_{\phi}(x) + \frac{1}{|y|} \log \pi_{\theta}(y \mid x) - \beta \hat{r}(x, y) - \frac{1}{|y|} \log \pi_{\text{ref}}(y \mid x) \right)^2$$

FlowRL is a flow-balanced reinforcement learning method that matches full reward distributions instead of maximizing rewards, promoting diverse exploration and generalizable reasoning trajectories in LLMs.
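
To make the objective concrete, here is a minimal PyTorch sketch of the squared flow-balance loss above, assuming per-sequence log-probs have already been summed over tokens. The function and argument names are illustrative, not the repo's actual API: w is a per-sample weight, log_z stands in for the learned log Z_phi(x), and beta scales the shaped reward r_hat(x, y).

import torch

def flowrl_loss(log_z, logp_policy, logp_ref, reward, resp_len, beta=1.0, w=1.0):
    # Length-normalized flow-balance residual from the objective:
    # log Z_phi(x) + (1/|y|) log pi_theta(y|x) - beta * r_hat(x,y) - (1/|y|) log pi_ref(y|x)
    residual = log_z + logp_policy / resp_len - beta * reward - logp_ref / resp_len
    # Squared residual, weighted per sample, averaged over the batch.
    return (w * residual.pow(2)).mean()

# Example with a dummy batch of 4 responses:
loss = flowrl_loss(torch.zeros(4), torch.randn(4), torch.randn(4),
                   torch.rand(4), torch.full((4,), 100.0))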

Trained Models & Experiment Logs

| Base Model | Domain | WandB Logs | Hugging Face Model |
|-------------|--------|-------------|---------------------|
| Qwen2.5-7B | Math | 🔗 View Run | 🤗 Model |
| DeepSeek-7B | Code | 🔗 View Run | 🤗 Model |
| Qwen2.5-32B | Math | - | 🤗 Model |

Quick Start

There are three ways to use FlowRL:


โญ We recommend using Option 1 as the default choice. Since verl updates frequently, the newest versions may have unstable factors such as training and inference mismatches. Option 1 uses verl 0.4.0, which is stable and has been thoroughly tested with our paper results.


Option 1: Original Repository (Recommended)

For exact reproduction of the results in the paper, use the original repository with verl 0.4.0:

👉 Original Code: https://github.com/Xuekai-Zhu/FlowRL

Step 1: Installation

Install verl 0.4.0 before using FlowRL.

Step 2: Data Preparation

# Option A: Download our pre-processed datasets directly
bash preprocess/down_load_dataset.sh
# Move data to default directory
mv data/xuekai/flowrl-data-collection/math_data data/math_data
mv data/xuekai/flowrl-data-collection/code_data data/code_data
# Option B: Process data from original sources
# For detailed processing instructions, see data/README.md

Step 3: Model Preparation

For Math Tasks: Qwen/Qwen2.5-7B (default in script); Qwen/Qwen2.5-32B

For Code Tasks: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

# Download default model (Qwen2.5-7B for math)
bash preprocess/down_load_model.sh

# For other models, modify MODEL_NAME in the script before running

Step 4: Training Scripts

cd verl_FlowRL

# For 7B math training
bash command/training/math/flowrl_7B_math.sh

# For 32B math training
bash command/training/math/flowrl_32B_math.sh

# For 7B code training
bash command/training/code/flowrl_7B_code.sh

Option 2: FlowRL Recipe on the Latest verl

To run FlowRL using the latest verl framework:

Step 1: Prepare Data and Model

# Prepare dataset
bash recipe/flowrl/prepare/prepare_data.sh

# Prepare model
bash recipe/flowrl/prepare/prepare_model.sh

Step 2: Run Training

# Train FlowRL with Qwen2.5-7B
bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh

Option 3: Implement FlowRL Yourself

If you want to implement FlowRL in your own codebase, we provide a detailed implementation guide:

📖 FlowRL Implementation Guide

This guide walks you through the key components and steps needed to integrate FlowRL into your existing training pipeline.
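
As a rough orientation before you read the guide: a FlowRL training step needs a frozen reference policy, a small learned head that predicts log Z_phi(x) from the prompt, and the squared flow-balance loss from the objective above. The sketch below is a hypothetical integration; policy.log_prob, z_head, and the batch keys are placeholder names of ours, not verl's or the repo's actual interface.

import torch

def flowrl_step(batch, policy, ref_policy, z_head, optimizer, beta=1.0):
    # Summed per-sequence log-probs; the reference policy stays frozen.
    logp_policy = policy.log_prob(batch["prompt"], batch["response"])       # [B]
    with torch.no_grad():
        logp_ref = ref_policy.log_prob(batch["prompt"], batch["response"])  # [B]
    # The partition function log Z_phi(x) is predicted from a prompt representation.
    log_z = z_head(batch["prompt_repr"]).squeeze(-1)                        # [B]
    n = batch["response_len"].float()                                       # |y| per sample
    residual = log_z + logp_policy / n - beta * batch["reward"] - logp_ref / n
    loss = residual.pow(2).mean()  # optionally weighted per sample, as in the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

In trajectory-balance-style training, the policy parameters and the Z head are typically optimized jointly on this single squared loss, with only the reference model held fixed.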

Testing

After training your FlowRL models, you can evaluate them using the following commands:

cd verl_Test

# First merge the model
bash command/eval/merge_model.sh

# For math testing
bash command/eval/math/flowrl_math_test.sh

# For code testing
bash command/eval/code/flowrl_code_test.sh

Reference: for the model-merge script on verl v0.5.0.dev, see merge_model.sh

Citation

If you find this repository helpful, please consider citing our paper:

@article{zhu2025flowrl,
  title={FlowRL: Matching Reward Distributions for LLM Reasoning},
  author={Zhu, Xuekai and Cheng, Daixuan and Zhang, Dinghuai and Li, Hengli and Zhang, Kaiyan and Jiang, Che and Sun, Youbang and Hua, Ermo and Zuo, Yuxin and Lv, Xingtai and others},
  journal={arXiv preprint arXiv:2509.15207},
  year={2025}
}