
FlowRL

Matching Reward Distributions via Flow Balance

📄 arXiv Paper | 🤗 #1 Paper of the Day

𝕏 Post 1 | 𝕏 Post 2 | 𝕏 Post 3 | 𝕏 Post 4

FlowRL Overview


FlowRL Objective

$$\mathcal{L}_{\text{FlowRL}} = w \cdot \left( \log Z_{\phi}(x) + \frac{1}{|y|} \log \pi_{\theta}(y \mid x) - \beta \hat{r}(x, y) - \frac{1}{|y|} \log \pi_{\text{ref}}(y \mid x) \right)^2$$

FlowRL is a flow-balanced reinforcement learning method that matches full reward distributions instead of maximizing rewards, promoting diverse exploration and generalizable reasoning trajectories in LLMs.
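
To make the objective concrete, here is a minimal PyTorch sketch of the squared flow-balance loss above, assuming per-sequence log-probs have already been summed over tokens. The function and argument names are illustrative, not the repo's actual API: w is a per-sample weight, log_z stands in for the learned log Z_phi(x), and beta scales the shaped reward r_hat(x, y).

import torch

def flowrl_loss(log_z, logp_policy, logp_ref, reward, resp_len, beta=1.0, w=1.0):
    # Length-normalized flow-balance residual from the objective:
    # log Z_phi(x) + (1/|y|) log pi_theta(y|x) - beta * r_hat(x,y) - (1/|y|) log pi_ref(y|x)
    residual = log_z + logp_policy / resp_len - beta * reward - logp_ref / resp_len
    # Squared residual, weighted per sample, averaged over the batch.
    return (w * residual.pow(2)).mean()

# Example with a dummy batch of 4 responses:
loss = flowrl_loss(torch.zeros(4), torch.randn(4), torch.randn(4),
                   torch.rand(4), torch.full((4,), 100.0))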

Trained Models & Experiment Logs

| Base Model | Domain | WandB Logs | Hugging Face Model |
|-------------|--------|-------------|---------------------|
| Qwen2.5-7B | Math | 🔗 View Run | 🤗 Model |
| DeepSeek-7B | Code | 🔗 View Run | 🤗 Model |
| Qwen2.5-32B | Math | - | 🤗 Model |

Quick Start

There are three ways to use FlowRL:


โญ We recommend using Option 1 as the default choice. Since verl updates frequently, the newest versions may have unstable factors such as training and inference mismatches. Option 1 uses verl 0.4.0, which is stable and has been thoroughly tested with our paper results.


Option 1: Original Repository (Recommended)

For exact reproduction of the results in the paper, use the original repository with verl 0.4.0:

👉 Original Code: https://github.com/Xuekai-Zhu/FlowRL

Step 1: Installation

Install verl 0.4.0 before using FlowRL.

Step 2: Data Preparation

# Option A: Download our pre-processed datasets directly
bash preprocess/down_load_dataset.sh
# Move data to default directory
mv data/xuekai/flowrl-data-collection/math_data data/math_data
mv data/xuekai/flowrl-data-collection/code_data data/code_data
# Option B: Process data from original sources
# For detailed processing instructions, see data/README.md

Step 3: Model Preparation

For Math Tasks: Qwen/Qwen2.5-7B (default in script); Qwen/Qwen2.5-32B

For Code Tasks: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

# Download default model (Qwen2.5-7B for math)
bash preprocess/down_load_model.sh

# For other models, modify MODEL_NAME in the script before running

Step 4: Training Scripts

cd verl_FlowRL

# For 7B math training
bash command/training/math/flowrl_7B_math.sh

# For 32B math training
bash command/training/math/flowrl_32B_math.sh

# For 7B code training
bash command/training/code/flowrl_7B_code.sh

Option 2: FlowRL Recipe on the Latest verl

To run FlowRL using the latest verl framework:

Step 1: Prepare Data and Model

# Prepare dataset
bash recipe/flowrl/prepare/prepare_data.sh

# Prepare model
bash recipe/flowrl/prepare/prepare_model.sh

Step 2: Run Training

# Train FlowRL with Qwen2.5-7B
bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh

Option 3: Implement FlowRL Yourself

If you want to implement FlowRL in your own codebase, we provide a detailed implementation guide:

📖 FlowRL Implementation Guide

This guide walks you through the key components and steps needed to integrate FlowRL into your existing training pipeline.
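
As a rough orientation before you read the guide: a FlowRL training step needs a frozen reference policy, a small learned head that predicts log Z_phi(x) from the prompt, and the squared flow-balance loss from the objective above. The sketch below is a hypothetical integration; policy.log_prob, z_head, and the batch keys are placeholder names of ours, not verl's or the repo's actual interface.

import torch

def flowrl_step(batch, policy, ref_policy, z_head, optimizer, beta=1.0):
    # Summed per-sequence log-probs; the reference policy stays frozen.
    logp_policy = policy.log_prob(batch["prompt"], batch["response"])       # [B]
    with torch.no_grad():
        logp_ref = ref_policy.log_prob(batch["prompt"], batch["response"])  # [B]
    # The partition function log Z_phi(x) is predicted from a prompt representation.
    log_z = z_head(batch["prompt_repr"]).squeeze(-1)                        # [B]
    n = batch["response_len"].float()                                       # |y| per sample
    residual = log_z + logp_policy / n - beta * batch["reward"] - logp_ref / n
    loss = residual.pow(2).mean()  # optionally weighted per sample, as in the objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

In trajectory-balance-style training, the policy parameters and the Z head are typically optimized jointly on this single squared loss, with only the reference model held fixed.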

Testing

After training your FlowRL models, you can evaluate them using the following commands:

cd verl_Test

# First merge the model
bash command/eval/merge_model.sh

# For math testing
bash command/eval/math/flowrl_math_test.sh

# For code testing
bash command/eval/code/flowrl_code_test.sh

Reference: for the model-merge script on verl v0.5.0.dev, see merge_model.sh

Citation

If you find this repository helpful, please consider citing our paper:

@article{zhu2025flowrl,
  title={FlowRL: Matching Reward Distributions for LLM Reasoning},
  author={Zhu, Xuekai and Cheng, Daixuan and Zhang, Dinghuai and Li, Hengli and Zhang, Kaiyan and Jiang, Che and Sun, Youbang and Hua, Ermo and Zuo, Yuxin and Lv, Xingtai and others},
  journal={arXiv preprint arXiv:2509.15207},
  year={2025}
}