Spotlight on Token Perception for Multimodal Reinforcement Learning

January 30, 2026



This repository contains the official implementation of our work, which introduces a new paradigm for training large vision-language models. We argue that the key to unlocking advanced multimodal reasoning lies in moving beyond uniform, coarse learning signals and instead putting a spotlight on token-level visual perception.

Our proposed algorithm, Visually-Perceptive Policy Optimization (VPPO), is the first to directly implement this principle. It intelligently focuses policy updates on the critical moments of visually-grounded reasoning, leading to state-of-the-art performance, superior training efficiency, and a more robust learning process.


πŸš€ News

  • [2026-01-30] πŸŽ‰ Our paper has been accepted to ICLR 2026!
  • [2025-11-11] The training script for VPPO-8B is now available! We've updated our codebase to fully support training based on Qwen3-VL-8B-Instruct. You can find the script in examples/configs/train_vppo_8b.sh.
  • [2025-11-07] We have released VPPO-8B, a new model that achieves excellent performance. It was trained using our VPPO algorithm, starting from the Qwen3-VL-8B-Instruct model. Compared to our previous training runs, we increased the max response length and the Entropy Penalty Coefficient. You can find more details and access the model on our Hugging Face Models collection.

🎯 The Core Insight: Not All Tokens Are Created Equal

Standard reinforcement learning methods for LVLMs suffer from a fundamental flaw: they treat every token in a generated response as equally important. A single reward is broadcast indiscriminately, rewarding generic phrases just as much as the critical step where the model perceives a key detail from the image.

Our analysis reveals two key truths about multimodal reasoning:

  1. Token visual dependency is sparse: Only a small fraction of tokens in a reasoning chain are highly dependent on the visual input. These are the pivotal moments of visually-grounded reasoning.
  2. Trajectory visual dependency is heterogeneous: Not all correct solutions are equal. Some are robustly grounded in visual evidence, while others are "lucky guesses" based on linguistic priors.
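One way to quantify token-level visual dependency (a minimal sketch of the general idea, not necessarily the paper's exact estimator) is to compare the policy's per-token log-probabilities when the image is present versus when it is removed from the context; tokens whose probability collapses without the image score high:

```python
import torch

def token_visual_dependency(logp_with_image: torch.Tensor,
                            logp_without_image: torch.Tensor) -> torch.Tensor:
    """Per-token visual dependency, measured as the drop in log-probability
    when the image is removed from the context. Shape: (seq_len,)."""
    return (logp_with_image - logp_without_image).clamp(min=0.0)

# Toy example: token 2's probability collapses without the image,
# so it is a pivotal, visually-grounded token.
with_img = torch.tensor([-0.10, -0.20, -0.30, -0.15])
without_img = torch.tensor([-0.12, -0.25, -2.90, -0.20])
dep = token_visual_dependency(with_img, without_img)
```

Under this measure, most tokens (generic connective phrases) receive near-zero scores, matching the sparsity observation above.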

Figure: token dependency distribution (left) and trajectory dependency distribution (right).

Left: most tokens have low visual dependency. Right: trajectories span a wide range of visual dependency. Standard RL treats them all the same.

This misalignment causes signal dilution, slowing down learning and preventing models from developing genuine multimodal perception and reasoning skills.

✨ Our Solution: Visually-Perceptive Policy Optimization (VPPO)

VPPO is a novel policy gradient algorithm designed to solve this problem by reshaping the learning signal at two levels of granularity:

  1. Macro-Level 🎯 Trajectory Advantage Shaping (TAS): We re-weight the advantage of each trajectory based on its average visual dependency. This prioritizes learning from robust, perception-grounded reasoning paths.

  2. Micro-Level πŸ”¦ Token Gradient Filtering (TGF): We construct a sparse gradient mask to focus policy updates exclusively on the top-k% most visually-dependent tokens. This puts a "spotlight" on what truly matters, reducing gradient variance and leading to more stable and effective training.
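The two mechanisms can be sketched together in a single loss function. This is an illustrative implementation under stated assumptions (the function signature, the mean-dependency weighting, and the per-trajectory top-k thresholding are our simplifications, not the repository's actual interface):

```python
import torch

def vppo_loss(logp: torch.Tensor,          # (B, T) new-policy token log-probs
              logp_old: torch.Tensor,      # (B, T) behavior-policy log-probs
              advantages: torch.Tensor,    # (B,) per-trajectory advantages
              visual_dep: torch.Tensor,    # (B, T) token visual-dependency scores
              mask: torch.Tensor,          # (B, T) 1 for response tokens, 0 for pad
              top_k_frac: float = 0.2) -> torch.Tensor:
    """Sketch of VPPO's two-level signal shaping:
    (1) Trajectory Advantage Shaping: scale each trajectory's advantage by
        its mean token visual dependency.
    (2) Token Gradient Filtering: keep gradients only on the top-k% most
        visually dependent tokens of each trajectory."""
    # Macro level: re-weight trajectory advantages by mean visual dependency.
    traj_dep = (visual_dep * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    shaped_adv = advantages * traj_dep                      # (B,)

    # Micro level: sparse mask over the top-k% most dependent tokens.
    k = max(1, int(top_k_frac * visual_dep.size(1)))
    thresh = visual_dep.topk(k, dim=-1).values[:, -1:]      # k-th value per row
    token_mask = (visual_dep >= thresh).float() * mask      # (B, T)

    # Importance-weighted policy-gradient term, focused on the masked tokens.
    ratio = torch.exp(logp - logp_old)
    pg = ratio * shaped_adv.unsqueeze(-1)                   # broadcast over T
    return -(pg * token_mask).sum() / token_mask.sum().clamp(min=1)

# Toy batch: 2 trajectories of 4 tokens each.
torch.manual_seed(0)
logp = torch.randn(2, 4)
loss = vppo_loss(logp, logp.detach(), torch.tensor([1.0, -0.5]),
                 torch.rand(2, 4), torch.ones(2, 4), top_k_frac=0.5)
```

Because gradients flow only through the masked tokens, updates concentrate on the visually pivotal steps, which is the "spotlight" the method's name refers to.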

πŸ† Performance Highlights

By focusing the learning signal, VPPO establishes a new state-of-the-art across 8 challenging multimodal reasoning benchmarks.

State-of-the-Art Accuracy

Faster Convergence & Superior Stability

VPPO not only achieves a higher final performance but gets there faster and more reliably.

🚀 Getting Started

1. Requirements

  • Python 3.10
  • PyTorch 2.8.0
  • CUDA 12.8

2. Environment Setup

# Create and activate a conda environment
conda create -n vppo python=3.10
conda activate vppo

# Clone the repository
git clone https://github.com/huaixuheqing/VPPO-RL
cd VPPO-RL

# Install dependencies
pip install -e .

3. Training with VPPO

The training pipeline is adapted from EasyR1. We provide example scripts for training 7B and 8B models with VPPO.

  • Hardware for Qwen2.5-VL-7B: 8 x H800 (80G) GPUs.
  • Hardware for Qwen2.5-VL-32B: 32 x H800 (80G) GPUs.
# To train the VPPO-7B model
bash examples/configs/train_vppo_7b.sh

# To train the VPPO-8B model
bash examples/configs/train_vppo_8b.sh

4. Evaluation

Our evaluation leverages the framework from PAPO-Eval. To replicate our results:

  1. Download our evaluation data from the VPPO-Eval Hugging Face dataset.
  2. Place the data folder from this dataset directly into your local PAPO-Eval repository.
  3. Run the evaluation scripts, selecting the desired benchmark name. A complete list of available benchmark names can be found in the data/dataset_info.json file.

All results in the paper are reported as average accuracy@8 with an inference temperature of 1.0.
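For reference, average accuracy@8 means sampling 8 responses per question at temperature 1.0, scoring each response, and averaging the per-question correctness rates. A minimal sketch (the list-of-lists input format is a hypothetical convenience, not the PAPO-Eval data layout):

```python
def avg_accuracy_at_k(correct: list[list[bool]]) -> float:
    """correct[i] holds the per-sample correctness of the k (=8) sampled
    responses for question i; accuracy@k averages the per-question rates."""
    per_question = [sum(c) / len(c) for c in correct]
    return sum(per_question) / len(per_question)

# Two questions, 8 samples each: 6/8 and 4/8 correct -> (0.75 + 0.50) / 2
score = avg_accuracy_at_k([
    [True] * 6 + [False] * 2,
    [True] * 4 + [False] * 4,
])
# score == 0.625
```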

πŸ“š Datasets

Training Data

Evaluation Data

  Benchmark    Hugging Face Link           Focus Domain
  Geo3k        hiyouga/geometry3k          Geometric Reasoning
  We-Math      We-Math/We-Math             Math Reasoning
  MMK12        FanqingM/MMK12              Math Reasoning
  MathVerse    AI4Math/MathVerse           Math Reasoning
  MathVision   MathLLMs/MathVision         Math Reasoning
  DynaMath     DynaMath/DynaMath_Sample    Math Reasoning
  LogicVista   lscpku/LogicVista           Logical Reasoning
  MMMU-Pro     MMMU/MMMU_Pro               Multi-discipline

Note: We filter instances from MathVerse, MathVision, and DynaMath to ensure verifiable, exact-match evaluation. All datasets, including these filtered subsets, are publicly available in our Hugging Face collection.

πŸ“œ Citation

If you find our work on token perception and the VPPO algorithm useful in your research, please cite our paper:

@article{huang2025spotlight,
  title={Spotlight on Token Perception for Multimodal Reinforcement Learning},
  author={Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu},
  journal={arXiv preprint arXiv:2510.09285},
  year={2025}
}

Acknowledgements

Our codebase is built upon the excellent work of EasyR1. We are grateful to the original authors for their valuable contributions.