SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
June 24, 2025
[📖paper]   [🤗SophiaVL-R1-7B model]   [🤗Thinking Reward Model]
[🤗SophiaVL-R1-130k Dataset]   [🤗SophiaVL-R1-Thinking-156k Dataset]
Intro
We introduce SophiaVL-R1 to explore the R1 paradigm with thinking-level rewards in vision-language reasoning, motivated by the phenomenon of models reaching a correct answer through a flawed reasoning process ("wrong thinking, correct answer").
To achieve this, we train a Thinking Reward Model to yield a reward that measures the thinking process from various dimensions, using our curated SophiaVL-R1-Thinking-156k dataset.
We also introduce the Trust-GRPO algorithm, which assigns a trustworthiness weight to thinking rewards based on their reliability. This guides the model toward favorable reasoning policies in a trustworthy manner, without extra computational overhead for uncertainty estimation.
Our SophiaVL-R1-7B model achieves strong performance across multiple benchmarks (e.g., 61.3% on MMMU) and can be efficiently trained on 8 A100 GPUs in just 1,500 steps using our SophiaVL-R1-130k dataset.
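Conceptually, Trust-GRPO blends the thinking reward with the rule-based answer reward and then group-normalizes the result, as in GRPO. The sketch below is a simplified illustration of that idea only; the trust-weight computation and reward combination here are placeholders, not the paper's exact formulation:

```python
import statistics

def trust_grpo_rewards(answer_rewards, thinking_rewards, trust_weight):
    """Illustrative sketch: blend answer rewards with trust-weighted thinking
    rewards, then compute GRPO-style group-normalized advantages.
    NOTE: the actual Trust-GRPO trust weight is estimated from reward
    reliability; here it is just a fixed scalar for demonstration."""
    combined = [a + trust_weight * t for a, t in zip(answer_rewards, thinking_rewards)]
    mean = statistics.mean(combined)
    std = statistics.pstdev(combined) or 1.0  # avoid division by zero
    # Group-normalized advantages (zero-mean within the rollout group)
    return [(r - mean) / std for r in combined]
```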
Requirements
Software Requirements
- Python 3.9+
- transformers>=4.51.0
- flash-attn>=2.4.3
- vllm>=0.8.3
Start with the following commands:
git clone https://github.com/kxfan2002/SophiaVL-R1.git
cd SophiaVL-R1
conda create -n sophiavl python=3.10
conda activate sophiavl
pip install -r requirements.txt
Quick Start
Download the model
We recommend using huggingface-cli to download the model:
# install huggingface-cli
pip install -U huggingface_hub
huggingface-cli login
# download the trained thinking reward model
huggingface-cli download bunny127/SophiaVL-R1-Thinking-Reward-Model-3B --local-dir <local_dir>
Dataset
We provide the SophiaVL-R1-130k dataset for RL training and the SophiaVL-R1-Thinking-156k dataset for training the thinking reward model.
Download dataset:
# install huggingface-cli
pip install -U huggingface_hub
huggingface-cli login
huggingface-cli download bunny127/SophiaVL-R1-130k --repo-type dataset --local-dir <local_dir>
Our SophiaVL-R1-130k dataset encompasses a wide range of reasoning data.
Custom Dataset for Training
We support both text-only and image-text datasets, in either Parquet or JSON file format. To train on your own dataset, register it in verl/data/dataset_info.json in the following format:
"myDataset": {
    "file_path": "/path/to/your/dataset",
    "image_base_path": "/your/image/base/path",
    "columns": {
        "column_reponses_to_prompt": "prompt",
        "column_reponses_to_answer": "answer",
        "column_reponses_to_images": "images"
    }
},
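With the column mapping above, a JSON dataset file holds records carrying a prompt, an answer, and image paths. The round-trip below is illustrative only; the record contents and the exact file layout are assumptions, not taken from the repo:

```python
import json
import os
import tempfile

# Illustrative record matching the column mapping above; the question, answer,
# and image path are made up for demonstration.
records = [
    {
        "prompt": "Subtract 0 cyan cubes. How many objects are left?",
        "answer": "7",
        "images": ["Math/CLEVR-Math/example.png"],  # resolved against image_base_path
    }
]

# Round-trip through a JSON file, roughly as the trainer would read it
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(records, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)
os.unlink(path)
```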
Training
Training Scripts
To begin training, you first need to launch the Thinking Reward Model server using vllm:
bash scripts/train_scripts/thinking_reward_model.sh
This script launches our trained Thinking Reward Model and exposes it in the OpenAI API format.
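As a rough illustration, a client can then score a reasoning trace through that OpenAI-compatible endpoint. The URL, model name, and prompt wording below are assumptions for the sketch, not the repo's exact request format:

```python
import json
import urllib.request

def build_reward_request(thinking: str,
                         model: str = "SophiaVL-R1-Thinking-Reward-Model-3B") -> dict:
    """Build an OpenAI-style chat payload asking the reward model to score a
    reasoning trace. The prompt wording is a placeholder."""
    return {
        "model": model,
        "messages": [
            {"role": "user",
             "content": f"Score the following reasoning process:\n{thinking}"}
        ],
        "temperature": 0.0,
    }

def query_reward(payload: dict,
                 url: str = "http://localhost:8000/v1/chat/completions") -> str:
    """POST the payload to a vLLM OpenAI-compatible server and return the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Only builds the payload here; querying requires the server to be running.
    print(json.dumps(build_reward_request("<think>3 cubes remain...</think>"), indent=2))
```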
If you want to train your own reward model, you may refer to this issue: https://github.com/kxfan2002/SophiaVL-R1/issues/7
Next, set the following environment variables in scripts/train_scripts/run_train.sh so that the training script can access the reward model:
- OPENAI_API_KEY: key for the reward model API
- OPENAI_API_URL: URL for the reward model API
- REWARD_MODEL: model name of the reward model
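For example (placeholder values; adjust to your deployment — a locally launched vLLM server typically accepts a dummy API key):

```shell
# Placeholder values -- point these at your running reward-model server
export OPENAI_API_KEY="EMPTY"
export OPENAI_API_URL="http://localhost:8000/v1"
export REWARD_MODEL="SophiaVL-R1-Thinking-Reward-Model-3B"
```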
Modify your training parameters in scripts/train_scripts/fullsets.yaml.
Finally, start the training process with:
bash scripts/train_scripts/run_train.sh
Merge Checkpoint in HuggingFace Format
After training, checkpoints saved by EasyR1 need to be merged before inference. The following script converts them to HuggingFace format:
python3 scripts/model_merger.py --local_dir checkpoints/easy_r1/exp_name/global_step_1/actor
Inference
We provide a simple inference script for you to test the model. The full script is here. Have a try with your data!
# download the trained reasoning model for direct inference
huggingface-cli download bunny127/SophiaVL-R1-7B --local-dir <local_dir>
# Modify the fields below to point at your test data
MODEL_PATH = "bunny127/SophiaVL-R1-7B" # or your local path
image_path = "/path/to/dataset/Math/CLEVR-Math/images/CLEVR_train_036427.png" # your local image path
prompt = "Subtract 0 cyan cubes. How many objects are left?"
question_type = "numerical" # numerical, multiple_choice, free-form, OCR
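For reference, fields like those above are typically assembled into a Qwen2-VL-style multimodal chat message before generation. This helper is a hedged sketch; the repo's actual inference script may build the prompt differently:

```python
def build_messages(image_path: str, prompt: str) -> list:
    """Assemble a single-turn multimodal chat message in the Qwen2-VL format
    (an image entry followed by a text entry). Illustrative only."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

if __name__ == "__main__":
    msgs = build_messages(
        "path/to/image.png",
        "Subtract 0 cyan cubes. How many objects are left?",
    )
    # This list would then be passed through the model's chat template
    # (e.g., processor.apply_chat_template) before generation.
    print(msgs[0]["content"][1]["text"])
```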
Evaluation
We use VLMEvalKit for the evaluation of SophiaVL-R1. To register our model in VLMEvalKit, add a model entry in vlmeval/config.py:
"trained_model": partial(
    Qwen2VLChat,
    model_path="/path/to/model",
    min_pixels=1280 * 28 * 28,
    max_pixels=16384 * 28 * 28,
    use_custom_prompt=False,
),
We use the following system prompt for the evaluation of all models:
system_prompt="You FIRST think about the reasoning process as an internal monologue and then provide the final answer. Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc., or other natural language thought expressions. It's encouraged to include self-reflection or verification in the reasoning process. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE enclosed within <answer> </answer> tags, for example <think>your_thinking_process</think><answer>your_final_answer</answer>. If you use formulas, please use LaTeX format.",
Performance of SophiaVL-R1-7B
SophiaVL-R1-7B demonstrates strong performance across multiple MLLM benchmarks, including both mathematical reasoning and general capability tasks.
Training Curves
This figure shows the accuracy reward curves during training. It is evident that SophiaVL-R1, trained with thinking-level rewards and Trust-GRPO, achieves significantly better training performance.
More Reasoning Examples of SophiaVL-R1
Acknowledgements
We sincerely appreciate the contributions of the open-source community. This work is built upon EasyR1.
Citations
If you find our work helpful for your research, please consider citing it:
@article{fan2025sophiavl,
title={SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward},
author={Fan, Kaixuan and Feng, Kaituo and Lyu, Haoming and Zhou, Dongzhan and Yue, Xiangyu},
journal={arXiv preprint arXiv:2505.17018},
year={2025}
}