M-STAR

July 13, 2025

:star: Project Page   

:hugs: HF Repo    :page_with_curl: Paper

Welcome to the M-STAR (Multimodal Self-Evolving TrAining for Reasoning) project!

This is a preview release of M-STAR. We will continue to update it, so please stay tuned!

What is M-STAR :star:?

M-STAR is a framework for improving the multimodal reasoning ability of Large Multimodal Models (LMMs) via Self-Evolving Training.

  • M-STAR Resources:

    | Component | Description |
    | --- | --- |
    | M-STAR Model | A strong LMM for multimodal reasoning, scoring 59.5 on MathVista, based on MiniCPM-V-2.5 with 8B parameters. |
    | M-STAR PRM | A Multimodal Process Reward Model (MPRM) that evaluates the quality of multimodal reasoning data at the step level. |
    | M-STAR CoT Dataset | A collection of 100K generated multimodal reasoning data with CoT, where the queries are sourced from MathV360K. |
    | M-STAR MPRM Training Dataset | A set of 50K multimodal reasoning data designed for training MPRM. |

Performance

Main Results

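| Model | MathVista | FQA | GPS | MWP | TQA | VQA |
| --- | --- | --- | --- | --- | --- | --- |
| **Baselines** | | | | | | |
| MiniCPM-V-2.5 | 52.4 | 59.2 | 44.7 | 50.5 | 53.8 | 48.0 |
| + warmup | 52.6 | 58.4 | 47.1 | 57.0 | 53.8 | 45.8 |
| SFT | 54.8 | 58.7 | 50.5 | 56.5 | 55.7 | 50.8 |
| ReSTEM | 55.1 | 59.1 | 49.5 | 65.6 | 55.1 | 48.0 |
| Iterative RFT | 55.7 | 59.1 | 49.5 | 64.5 | 55.1 | 47.5 |
| **Static components only** | | | | | | |
| Cont. Self-Evolving | 57.2 | 57.6 | 56.3 | 65.1 | 57.0 | 49.7 |
| + PRM Re-Rank | 59.2 | 59.1 (↑0.7) | 61.1 (↑14) | 68.3 (↑11.3) | 55.1 (↑1.3) | 51.4 (↑5.6) |
| **Automatically tuning the temperature T** | | | | | | |
| M-STAR (Reward-Pass@2) | 59.5 (+6.9) | 59.5 (↑1.1) | 59.1 (↑12) | 65.6 (↑8.6) | 58.9 (↑5.1) | 54.2 (↑8.4) |
| **Reference** | | | | | | |
| GPT-4o | 63.8 | - | - | - | - | - |
| Gemini 1.5 Flash | 58.4 | - | - | - | - | - |
| GPT-4T 2024-04-09 | 58.1 | - | - | - | - | - |
| Pixtral 12B | 58.0 | - | - | - | - | - |
| InternLM-XComposer2-VL-7B | 57.6 | 55.0 | 63.0 | 73.7 | 56.3 | 39.7 |
| Math-LLaVA-13B | 46.6 | 37.2 | 57.7 | 56.5 | 51.3 | 33.5 |
| LLaVA-NeXT-34B | 46.5 | - | - | - | - | - |

| Model | MathVista | M3CoT | MMStar-R | MMBench-R | AI2D | Average |
| --- | --- | --- | --- | --- | --- | --- |
| MiniCPM-V-2.5 | 52.4 | 41.2 | 44.6 | 72.6 | 64.4 | 55.0 |
| + warmup | 52.6 | 47.8 | 45.1 | 76.9 | 65.9 | 57.7 |
| M-STAR | 59.5 (↑6.9) | 48.7 (↑0.9) | 50.7 (↑5.6) | 79.9 (↑3) | 69.1 (↑3.2) | 61.6 (↑3.9) |
| Phi-3.5-vision | 46.5 | 39.4 | 42.5 | 56.8 | 47.5 | 46.5 |
| + warmup | 49.3 | 46.5 | 44.2 | 70.9 | 65.5 | 55.3 |
| M-STAR | 54.5 (↑5.2) | 51.3 (↑4.8) | 48.8 (↑4.6) | 73.6 (↑2.7) | 67.9 (↑2.4) | 59.2 (↑3.9) |
| InternVL2-2B | 46.4 | 16.7 | 20.0 | 14.2 | 33.5 | 26.2 |
| + warmup | 47.6 | 45.6 | 41.8 | 68.8 | 60.0 | 52.8 |
| M-STAR | 50.3 (↑2.7) | 47.1 (↑1.5) | 42.0 (↑0.2) | 67.3 (↓1.5) | 59.7 (↓0.3) | 53.3 (↑0.5) |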
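
The arrows in the results above appear to report the change relative to the corresponding "+ warmup" row. That reading is an inference from the numbers, not something this README states. The short check below recomputes the deltas for the MiniCPM-V-2.5 group using scores copied from the table; the helper name `gains` is purely illustrative.

```python
# Scores copied from the MiniCPM-V-2.5 group of the multi-benchmark results.
benchmarks = ["MathVista", "M3CoT", "MMStar-R", "MMBench-R", "AI2D", "Average"]
warmup = [52.6, 47.8, 45.1, 76.9, 65.9, 57.7]   # "+ warmup" row
m_star = [59.5, 48.7, 50.7, 79.9, 69.1, 61.6]   # "M-STAR" row


def gains(after, before):
    """Per-benchmark change, rounded to one decimal place like the table."""
    return [round(a - b, 1) for a, b in zip(after, before)]


# Assumption: the arrows are measured against the warmup checkpoint.
deltas = dict(zip(benchmarks, gains(m_star, warmup)))
print(deltas)
```

The recomputed values match the arrows printed in the M-STAR row, which supports the warmup-baseline reading for this group.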

Effectiveness of Adaptively Adjusting Exploration

We evaluate adaptive exploration with the following metric:

  • Reward-Pass@2: The percentage of samples for which there exist correct responses among the top 2 responses ranked by the reward model. This metric directly reflects the exploitation efficacy of the reward model for the current policy. We choose Pass@2 since our training strategy involves selecting the top 2 responses using the reward model.
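
The Reward-Pass@2 definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code: the `(reward_score, is_correct)` tuple layout and the function names `top_k_by_reward` and `reward_pass_at_k` are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: each sampled response is (reward_score, is_correct).
Response = Tuple[float, bool]


def top_k_by_reward(responses: Sequence[Response], k: int = 2) -> List[Response]:
    """Return the k responses with the highest reward-model score."""
    return sorted(responses, key=lambda r: r[0], reverse=True)[:k]


def reward_pass_at_k(samples: Sequence[Sequence[Response]], k: int = 2) -> float:
    """Percentage of samples with at least one correct response in the top k."""
    if not samples:
        return 0.0
    hits = sum(any(ok for _, ok in top_k_by_reward(s, k)) for s in samples)
    return 100.0 * hits / len(samples)


# Toy data: four queries, each with responses already scored by a reward model.
samples = [
    [(0.9, False), (0.8, True), (0.1, True)],   # correct response ranked 2nd: hit
    [(0.7, False), (0.6, False), (0.2, True)],  # correct response ranked 3rd: miss
    [(0.5, True)],                              # single correct response: hit
    [(0.4, False), (0.3, False)],               # no correct response: miss
]
print(reward_pass_at_k(samples, k=2))
```

With `k=2` the metric counts a query as a hit only when the reward model places a correct response among the two it would select for training, which is why it reflects how well the reward model exploits the current policy's samples.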

"Static" refers to models trained without adaptive exploration, while "Dynamic" indicates those trained with this mechanism. All models shown were trained using the M-STAR framework with optimized components as explored in our paper.

:rocket: M-STAR Resources

| Resource | Link | License |
| --- | --- | --- |
| **M-STAR Datasets** | | |
| M-STAR CoT Dataset | Link | MIT License |
| M-STAR MPRM Training Dataset | Link | MIT License |
| **M-STAR Models** | | |
| M-STAR-8B-v1.0 | Link | MiniCPM Model License |
| M-STAR-PRM-8B-v1.0 | Link | MiniCPM Model License |

Citation

If you find the content of this project helpful, please cite our paper as follows:

@misc{liu2024divingselfevolvingtrainingmultimodal,
      title={Diving into Self-Evolving Training for Multimodal Reasoning}, 
      author={Wei Liu and Junlong Li and Xiwen Zhang and Fan Zhou and Yu Cheng and Junxian He},
      year={2024},
      eprint={2412.17451},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.17451}, 
}