PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning
December 7, 2025
This is the evaluation module for our work Perception-Aware Policy Optimization for Multimodal Reasoning.
- This module is also embedded into PAPO for convenient inference and evaluation
- Feel free to directly use PAPO for complete training-evaluation workflow!
Evaluation for PAPO
1. Env Setup
We follow the environment setup instructions from LLaMA-Factory:
```shell
cd PAPO-Eval
conda env create -f env.yml
conda activate papo_eval
pip install -e ".[all]"
```
2. Data Preprocessing
All evaluation data can be downloaded from: https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval
Prepare the evaluation dataset for PAPO evaluation:

- Set the specific dataset(s) you would like to use for evaluation:
  - `AUTO_UNZIP` (bool): Whether to automatically unzip images
    - If set to `true`, the downloaded image ZIP file will be automatically unzipped, and the ZIP file will be removed
    - If set to `false`, the downloaded ZIP file is left as-is
  - `SPLIT_NAME` (str): Which dataset to use for evaluation. Currently available datasets:
    - hiyouga/geometry3k: `SPLIT_NAME="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse"`
    - AI4Math/MathVista: `SPLIT_NAME="AI4Math_MathVista"`
    - We_Math/We_Math: `SPLIT_NAME="We_Math"`
    - FanqingM/MMK12: `SPLIT_NAME="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `SPLIT_NAME="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `SPLIT_NAME="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `SPLIT_NAME="MMMU_MMMU_Pro"`
- Run data preprocessing:

  ```shell
  cd PAPO-Eval
  bash papo_eval/preprocess/preprocess.sh
  ```
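Putting the options above together, a settings sketch for `preprocess.sh` could look like the following. The variable names `AUTO_UNZIP` and `SPLIT_NAME` come from the documentation above; the chosen values and the exact layout inside the script are illustrative assumptions:

```shell
# Illustrative settings for papo_eval/preprocess/preprocess.sh
# (values are examples, not defaults)
AUTO_UNZIP=true
SPLIT_NAME="AI4Math_MathVerse"

echo "Preprocessing split: ${SPLIT_NAME} (auto-unzip: ${AUTO_UNZIP})"
```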
3. Run Evaluation
3.1 Run Model Inference
- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_infer.sh`:
  - `DATASET` (str): The dataset you would like to run inference on
    - hiyouga/geometry3k: `DATASET="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `DATASET="AI4Math_MathVerse"`
    - AI4Math/MathVista: `DATASET="AI4Math_MathVista"`
    - We_Math/We_Math: `DATASET="We-Math_We-Math"`
    - FanqingM/MMK12: `DATASET="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `DATASET="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `DATASET="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `DATASET="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `DATASET="MMMU_MMMU_Pro"`
  - `MODEL` (str): The PAPO model you would like to run inference with
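As a sketch, the corresponding settings in `run_infer.sh` might look like the following. The model path is a placeholder, not an actual checkpoint name:

```shell
# Illustrative settings for papo_eval/run_infer.sh
DATASET="hiyouga_geometry3k"           # one of the DATASET values listed above
MODEL="/path/to/your/papo_checkpoint"  # placeholder; point this at your model

echo "Running inference on ${DATASET} with model ${MODEL}"
```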
- Run inference:

  ```shell
  cd PAPO-Eval
  bash papo_eval/run_infer.sh
  ```
- Inference outputs will be saved under `PAPO-Eval/infer_outputs`
  - The first and last output lines will also show the exact save path
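Inference outputs are JSONL files (one JSON record per line). A quick way to inspect them, shown here on a synthetic file since the exact output schema is an assumption (real files live under `PAPO-Eval/infer_outputs`):

```shell
# Create a tiny synthetic JSONL file standing in for an inference output;
# the field names below are illustrative, not the documented schema.
printf '%s\n' \
  '{"question": "Q1", "response": "A1"}' \
  '{"question": "Q2", "response": "A2"}' > sample_infer.jsonl

wc -l < sample_infer.jsonl                          # number of inference records
head -n 1 sample_infer.jsonl | python -m json.tool  # pretty-print the first record
```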
3.2 Run Evaluation On Model Inference
- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_eval.sh`:
  - `JSONL_PATH` (str): Path to the to-be-evaluated inference results
    - JSONL path: Give the JSONL path directly to evaluate the accuracy of a specific dataset's inference results
    - Model dir: Give only the model directory (without a JSONL path) to evaluate vision-dependent accuracy
  - `N_ROLLOUT` (int): Number of rollouts
    - We set `N_ROLLOUT=8` in our paper
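A sketch of the corresponding settings in `run_eval.sh`; the JSONL path is a placeholder:

```shell
# Illustrative settings for papo_eval/run_eval.sh
JSONL_PATH="/path/to/infer_outputs/results.jsonl"  # placeholder path
N_ROLLOUT=8                                        # value used in the paper

echo "Evaluating ${JSONL_PATH} with ${N_ROLLOUT} rollouts"
```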
- Run evaluation:

  ```shell
  cd PAPO-Eval
  bash papo_eval/run_eval.sh
  ```
- Detailed results will be saved to `./eval_results/<eval_output_name>.json`
  - Results will also be printed in the final section of the output, together with the exact save path of the evaluation results
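With multiple rollouts per question, a common way to aggregate accuracy is to score each rollout 0/1, average per question, then average over questions. This is an illustration of what `N_ROLLOUT` means, not necessarily the exact metric computed by `run_eval.sh`, and the numbers are made up:

```shell
# Each line: 0/1 correctness of N_ROLLOUT=8 rollouts for one question (made-up data)
printf '%s\n' "1 0 1 1 1 0 1 1" "1 1 1 1 1 1 1 1" |
  awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i  # correct rollouts for this question
         total += s / NF; n++ }                    # accumulate per-question accuracy
       END { printf "mean accuracy: %.3f\n", total / n }'
```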
🥰 Acknowledgement
Huge thanks to the LLaMA-Factory team for providing this awesome codebase!
- We adapted the foundational LLaMA-Factory codebase to implement model inference and evaluation for PAPO.
Citation
```bibtex
@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
  journal={arXiv preprint arXiv:2507.06448},
  year={2025}
}
```