PAPO: Perception-Aware Policy Optimization for Multimodal Reasoning
December 7, 2025
This is the evaluation module for our work Perception-Aware Policy Optimization for Multimodal Reasoning.
- This module is also embedded into PAPO for convenient inference and evaluation
- Feel free to directly use PAPO for complete training-evaluation workflow!
Evaluation for PAPO
1. Env Setup
We follow the environment setup instructions from LLaMA-Factory:
```shell
cd PAPO-Eval
conda env create -f env.yml
conda activate papo_eval
pip install -e ".[all]"
```
2. Data Preprocessing
All evaluation data can be downloaded from: https://huggingface.co/datasets/PAPO-Galaxy/PAPO_eval
Prepare the evaluation dataset for PAPO evaluation:

- Set the specific dataset(s) you would like to use for evaluation:
  - `AUTO_UNZIP` (bool): Whether to automatically unzip images
    - If set to `true`, the downloaded image ZIP file will be automatically unzipped, and the ZIP file will be removed
    - If set to `false`, the downloaded ZIP file is left as-is
  - `SPLIT_NAME` (str): Which dataset to use for evaluation. Currently available datasets:
    - hiyouga/geometry3k: `SPLIT_NAME="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse"`
    - AI4Math/MathVista: `SPLIT_NAME="AI4Math_MathVista"`
    - We_Math/We_Math: `SPLIT_NAME="We_Math"`
    - FanqingM/MMK12: `SPLIT_NAME="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `SPLIT_NAME="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `SPLIT_NAME="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `SPLIT_NAME="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `SPLIT_NAME="MMMU_MMMU_Pro"`
- Run data preprocessing:

  ```shell
  cd PAPO-Eval
  bash papo_eval/preprocess/preprocess.sh
  ```
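Putting the options above together, a settings sketch for `preprocess.sh` could look like the following. The variable names `AUTO_UNZIP` and `SPLIT_NAME` come from the documentation above; the chosen values and the exact layout inside the script are illustrative assumptions:

```shell
# Illustrative settings for papo_eval/preprocess/preprocess.sh
# (values are examples, not defaults)
AUTO_UNZIP=true
SPLIT_NAME="AI4Math_MathVerse"

echo "Preprocessing split: ${SPLIT_NAME} (auto-unzip: ${AUTO_UNZIP})"
```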
3. Run Evaluation
3.1 Run Model Inference
- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_infer.sh`:
  - `DATASET` (str): The dataset you would like to run inference on
    - hiyouga/geometry3k: `DATASET="hiyouga_geometry3k"`
    - AI4Math/MathVerse: `DATASET="AI4Math_MathVerse"`
    - AI4Math/MathVista: `DATASET="AI4Math_MathVista"`
    - We_Math/We_Math: `DATASET="We-Math_We-Math"`
    - FanqingM/MMK12: `DATASET="PAPO_MMK12"`
    - Vision-dependent subset of AI4Math/MathVerse: `DATASET="AI4Math_MathVerse_vision_dependent"`
    - BUAADreamer/clevr_count_70k: `DATASET="BUAADreamer_clevr_count_70k"`
    - lscpku/LogicVista: `DATASET="lscpku_LogicVista"`
    - MMMU/MMMU_Pro: `DATASET="MMMU_MMMU_Pro"`
  - `MODEL` (str): The PAPO model you would like to run inference with
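As a sketch, the corresponding settings in `run_infer.sh` might look like the following. The model path is a placeholder, not an actual checkpoint name:

```shell
# Illustrative settings for papo_eval/run_infer.sh
DATASET="hiyouga_geometry3k"           # one of the DATASET values listed above
MODEL="/path/to/your/papo_checkpoint"  # placeholder; point this at your model

echo "Running inference on ${DATASET} with model ${MODEL}"
```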
- Run inference:

  ```shell
  cd PAPO-Eval
  bash papo_eval/run_infer.sh
  ```
- Inference outputs will be saved under `PAPO-Eval/infer_outputs`
  - The first and last output lines will also show the exact save path
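Inference outputs are JSONL files (one JSON record per line). A quick way to inspect them, shown here on a synthetic file since the exact output schema is an assumption (real files live under `PAPO-Eval/infer_outputs`):

```shell
# Create a tiny synthetic JSONL file standing in for an inference output;
# the field names below are illustrative, not the documented schema.
printf '%s\n' \
  '{"question": "Q1", "response": "A1"}' \
  '{"question": "Q2", "response": "A2"}' > sample_infer.jsonl

wc -l < sample_infer.jsonl                          # number of inference records
head -n 1 sample_infer.jsonl | python -m json.tool  # pretty-print the first record
```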
3.2 Run Evaluation On Model Inference
- Please set the dataset and other eval parameters in `PAPO-Eval/papo_eval/run_eval.sh`:
  - `JSONL_PATH` (str): Path to the to-be-evaluated inference results
    - JSONL path: Give the JSONL path directly to evaluate the accuracy of a specific dataset's inference results
    - Model dir: Give only the model directory (without a JSONL path) to evaluate vision-dependent accuracy
  - `N_ROLLOUT` (int): Number of rollouts
    - We set `N_ROLLOUT=8` in our paper
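A sketch of the corresponding settings in `run_eval.sh`; the JSONL path is a placeholder:

```shell
# Illustrative settings for papo_eval/run_eval.sh
JSONL_PATH="/path/to/infer_outputs/results.jsonl"  # placeholder path
N_ROLLOUT=8                                        # value used in the paper

echo "Evaluating ${JSONL_PATH} with ${N_ROLLOUT} rollouts"
```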
- Run evaluation:

  ```shell
  cd PAPO-Eval
  bash papo_eval/run_eval.sh
  ```
- Detailed results will be saved to `./eval_results/<eval_output_name>.json`
  - Results will also be printed in the final section of the output, together with the exact save path of the evaluation results
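With multiple rollouts per question, a common way to aggregate accuracy is to score each rollout 0/1, average per question, then average over questions. This is an illustration of what `N_ROLLOUT` means, not necessarily the exact metric computed by `run_eval.sh`, and the numbers are made up:

```shell
# Each line: 0/1 correctness of N_ROLLOUT=8 rollouts for one question (made-up data)
printf '%s\n' "1 0 1 1 1 0 1 1" "1 1 1 1 1 1 1 1" |
  awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i  # correct rollouts for this question
         total += s / NF; n++ }                    # accumulate per-question accuracy
       END { printf "mean accuracy: %.3f\n", total / n }'
```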
🥰 Acknowledgement
Huge thanks to the LLaMA-Factory team for providing this awesome codebase!
- We adapted the foundational LLaMA-Factory codebase to implement model inference and evaluation for PAPO.
Citation
```bibtex
@article{wang2025perception,
  title={Perception-Aware Policy Optimization for Multimodal Reasoning},
  author={Wang, Zhenhailong and Guo, Xuehang and Stoica, Sofia and Xu, Haiyang and Wang, Hongru and Ha, Hyeonjeong and Chen, Xiusi and Chen, Yangyi and Yan, Ming and Huang, Fei and others},
  journal={arXiv preprint arXiv:2507.06448},
  year={2025}
}
```