Introducing Visual Perception Token into Multimodal Large Language Model

March 6, 2025

TL;DR (1) - Allow the MLLM to control its own vision perception process.

TL;DR (2) - Treat visual perception as a function calling process and control the perception process through Visual Perception Tokens. The MLLM outputs Visual Perception Tokens in the same manner as natural language tokens.

[Paper] [Models] [Datasets]

Graphical Abstract

The figure illustrates the two types of Visual Perception Tokens. The Region Selection Token carries explicit semantic information: it represents important regions as bounding boxes. The Vision Re-Encoding Token carries no explicit semantic information; instead, the Vision Projector extracts control information directly from its hidden state. Inference with Visual Perception Tokens proceeds in three stages. First, the MLLM generates Visual Perception Tokens based on the image and the given question. Next, the Vision Branches perform a second perception of the image, guided by the Visual Perception Tokens. Finally, the MLLM uses the Vision Features obtained from both perception stages to answer the question.
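The three inference stages described above can be sketched as plain control flow. This is only an illustrative sketch, not the repository's implementation: every name here (`PerceptionToken`, `second_perception`, `answer_with_vpt`, and the callables passed in) is hypothetical, images are toy nested lists, and the model components are stand-in functions supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Toy stand-ins; every name in this sketch is hypothetical.
Image = List[List[float]]          # rows of pixel intensities
Features = List[float]             # vision features fed to the language model
BBox = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive


@dataclass
class PerceptionToken:
    kind: str                                        # "region_selection" or "re_encoding"
    bbox: Optional[BBox] = None                      # set for region selection
    hidden_state: Optional[Sequence[float]] = None   # set for re-encoding


def crop(image: Image, bbox: BBox) -> Image:
    """Cut the selected region out of the image."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]


def second_perception(
    image: Image,
    token: PerceptionToken,
    encode: Callable[[Image], Features],
    encode_with_control: Callable[[Image, Sequence[float]], Features],
) -> Features:
    """Stage 2: perceive the image again, guided by the token."""
    if token.kind == "region_selection":
        if token.bbox is None:
            raise ValueError("region selection needs a bounding box")
        return encode(crop(image, token.bbox))
    if token.kind == "re_encoding":
        if token.hidden_state is None:
            raise ValueError("re-encoding needs a hidden state")
        return encode_with_control(image, token.hidden_state)
    raise ValueError(f"unknown token kind: {token.kind}")


def answer_with_vpt(
    image: Image,
    question: str,
    encode: Callable[[Image], Features],
    encode_with_control: Callable[[Image, Sequence[float]], Features],
    propose_token: Callable[[Features, str], PerceptionToken],
    generate_answer: Callable[[Features, str], object],
) -> object:
    first = encode(image)                            # initial perception
    token = propose_token(first, question)           # stage 1: emit a perception token
    second = second_perception(image, token, encode, encode_with_control)  # stage 2
    return generate_answer(first + second, question)  # stage 3: use both feature sets
```

The point of the sketch is the function-calling shape: the token produced in stage 1 selects which vision branch runs in stage 2, and the answer in stage 3 is conditioned on the features from both passes.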

Performance comparison of MLLMs with and without Visual Perception Tokens. Datasets marked with "*" are not used in the training process. A 2B model with Visual Perception Tokens can even outperform the 7B model without Visual Perception Tokens.

Examples collected from the test sets. The responses were generated by the 7B model and the 2B+VPT model.

Table of Contents

  1. Inventory
  2. Environment Setup
  3. Training and Evaluation

Inventory

Models

| Model | Stage | Base Model | Finetuned Modules | #Total Params | Download Link |
|---|---|---|---|---|---|
| Qwen2-VL-2b-VPT-Det-Alignment | After Alignment | Qwen/Qwen2-VL-2B-Instruct | Projector | 2.75B | 🤗 HuggingFace Model |
| Qwen2-VL-2b-VPT-Seg-Alignment | After Alignment | Qwen/Qwen2-VL-2B-Instruct | Projector | 2.76B | 🤗 HuggingFace Model |
| Qwen2-VL-2b-VPT-CLIP | After Instruction Tuning | Qwen/Qwen2-VL-2B-Instruct | All | 2.45B | 🤗 HuggingFace Model |
| Qwen2-VL-2b-VPT-Det | After Instruction Tuning | Qwen2-VL-2b-VPT-Det-Alignment | All | 2.75B | 🤗 HuggingFace Model |
| Qwen2-VL-2b-VPT-Det-NoPrompt | After Instruction Tuning | Qwen2-VL-2b-VPT-Det-Alignment | All | 2.75B | 🤗 HuggingFace Model |
| Qwen2-VL-2b-VPT-Seg | After Instruction Tuning | Qwen2-VL-2b-VPT-Seg-Alignment | All | 2.76B | 🤗 HuggingFace Model |
| Qwen2-VL-7b-VPT-CLIP | After Instruction Tuning | Qwen/Qwen2-VL-7B-Instruct | LoRA-r512 | 8.32B | 🤗 HuggingFace Model |

  • Our models fall into two types according to their training stage. When an additional vision encoder is required, an extra alignment step is performed first. The aligned model then serves as the starting point for the subsequent instruction tuning.
  • The 7B model is fine-tuned using LoRA. The released model has already been merged.
  • The NoPrompt model corresponds to the Free Choice model. It does not rely on specific prompts to trigger the Visual Perception Token; instead, the model decides on its own whether to use a Visual Perception Token and which type to apply.

Datasets

Our training and evaluation data are hosted on 🤗 VPT Datasets. The dataset is divided into multiple splits, each stored as a separate JSON file. For the evaluation datasets, we have annotated the reasoning process that uses the Visual Perception Token. During evaluation, adjustments are needed depending on the model being evaluated.

| Dataset | Stage | Compatible Model | Training/Evaluation | #Samples |
|---|---|---|---|---|
| MixVRT_CLIP_Full | Instruction Tuning | CLIP | Training | 829k |
| MixVRT_Detection_Full | Instruction Tuning | Det | Training | 829k |
| MixVRT_Seg_Full | Instruction Tuning | Seg | Training | 829k |
| CUB_Birds_action_test | Instruction Tuning | All | Evaluation | 0.5k |
| DocVQA_region_test | Instruction Tuning | All | Evaluation | 0.9k |
| DUDE_region_test | Instruction Tuning | All | Evaluation | 0.6k |
| Flickr30k_action_test | Instruction Tuning | All | Evaluation | 1.5k |
| LLaVA_COCO_free_action_test | Instruction Tuning | All | Evaluation | 1k |
| LLaVA_COCO_single_action_test | Instruction Tuning | All | Evaluation | 1k |
| OI_region_test | Instruction Tuning | All | Evaluation | 1k |
| POPE_action_test | Instruction Tuning | All | Evaluation | 3k |
| TextCap_region_test | Instruction Tuning | All | Evaluation | 8.5k |
| TextVQA_region_test | Instruction Tuning | All | Evaluation | 0.5k |
| VSR_region_test | Instruction Tuning | All | Evaluation | 0.4k |
| llava_alignment_detection_qwen_response_train | Alignment | Det | Training | 585k |
| llava_alignment_seg_qwen_response_train | Alignment | Seg | Training | 585k |
| llava_alignment_detection_qwen_response_eval | Alignment | Det | Evaluation | 5k |
| llava_alignment_seg_qwen_response_eval | Alignment | Seg | Evaluation | 5k |

Code Development Note

Our code is primarily based on Transformers and LLaMA-Factory. DevNote outlines the core modifications we made to the original Transformers and LLaMA-Factory libraries. It gives an overview of our code structure and a starting point for exploring it.

Environment Setup

Prepare Environment

The process of setting up our environment primarily involves installing modified versions of Transformers 4.45.2 and LLaMA-Factory 0.9.1.dev0. The following are the steps to create our environment.

# clone transformers 4.45.2
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout v4.45.2
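# copy modeling_qwen2_vl_vpt.py from this repository into the transformers checkout
# (adjust the source path below to where this repository is located)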
mkdir -p "src/transformers/models/qwen2_vl_vpt"
cp "VPT/transformers/src/transformers/models/qwen2_vl_vpt/modeling_qwen2_vl_vpt.py" "src/transformers/models/qwen2_vl_vpt/"
# edit env.yml so that the transformers and LLaMA-Factory paths point to your local copies
# create environment
cd /to/this/folder
conda env create -f env.yml

Prepare Dataset

Prepare the following datasets and adjust the image paths in the JSON files downloaded from 🤗 VPT Datasets.

  • COCO2017
  • CUB_200_2011
  • DocVQA
  • DUDE
  • Flickr30k
  • GQA
  • OCRVQA
  • OpenImage
  • TextVQA
  • VG
  • VSR
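Adjusting the image paths can be scripted. The sketch below is a minimal example under stated assumptions, not a tool shipped with this repository: it assumes each split file is a JSON list of records and that image paths live in a list under an `images` key (a common LLaMA-Factory layout). Inspect a downloaded file first and change `key` if the layout differs. The function names and the example roots are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def remap_image_paths(
    records: List[Dict[str, Any]],
    old_root: str,
    new_root: str,
    key: str = "images",  # assumed field name; verify against the downloaded JSON
) -> List[Dict[str, Any]]:
    """Return copies of the records with old_root replaced by new_root in image paths."""
    remapped = []
    for record in records:
        record = dict(record)  # shallow copy so the input list is left untouched
        if key in record:
            record[key] = [
                new_root + path[len(old_root):] if path.startswith(old_root) else path
                for path in record[key]
            ]
        remapped.append(record)
    return remapped


def remap_file(src: str, dst: str, old_root: str, new_root: str, key: str = "images") -> int:
    """Rewrite one split file and return the number of records processed."""
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    updated = remap_image_paths(records, old_root, new_root, key)
    Path(dst).write_text(json.dumps(updated, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(updated)
```

For example, `remap_file("DocVQA_region_test.json", "DocVQA_region_test.local.json", "/old/root/", "/data/")` would write a copy of that split with the prefix swapped; paths that do not start with the old prefix are left unchanged.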

Training and Evaluation

Training and evaluation are run through LLaMA-Factory. Adjust the paths in the configs/*.yaml files before running the code.

cd LLaMA-Factory
# Detection Projector Alignment
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Det-Alignment.yaml
# Segmentation Projector Alignment
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Seg-Alignment.yaml
# 2b Model + VPT-original vision encoder
llamafactory-cli train configs/Qwen2-VL-2b-VPT-CLIP.yaml
# 2b Model + VPT-DINO
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Det.yaml
# 2b Model + VPT-SAM
llamafactory-cli train configs/Qwen2-VL-2b-VPT-Seg.yaml
# 7b Model + VPT-original vision encoder
llamafactory-cli train configs/Qwen2-VL-7b-VPT-CLIP.yaml
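The Det and Seg instruction-tuning runs start from their aligned checkpoints (see the Base Model column in the Models table), so the alignment configs need to be trained first. The sketch below is a hypothetical helper, not part of this repository, that encodes this ordering and returns the `llamafactory-cli train` commands in a valid sequence. The `PREREQUISITE` mapping is inferred from the Models table and the function name is made up for illustration.

```python
from typing import Dict, List, Optional

# Config name -> config that must be trained first (None if it starts from a
# public base model). Inferred from the Models table; treat it as an assumption.
PREREQUISITE: Dict[str, Optional[str]] = {
    "Qwen2-VL-2b-VPT-Det-Alignment": None,
    "Qwen2-VL-2b-VPT-Seg-Alignment": None,
    "Qwen2-VL-2b-VPT-CLIP": None,
    "Qwen2-VL-2b-VPT-Det": "Qwen2-VL-2b-VPT-Det-Alignment",
    "Qwen2-VL-2b-VPT-Seg": "Qwen2-VL-2b-VPT-Seg-Alignment",
    "Qwen2-VL-7b-VPT-CLIP": None,
}


def training_plan(targets: List[str]) -> List[List[str]]:
    """Return the train commands for the targets, prerequisites first, no duplicates."""
    plan: List[List[str]] = []
    scheduled = set()

    def visit(name: str) -> None:
        if name in scheduled:
            return
        prerequisite = PREREQUISITE[name]  # KeyError for unknown config names
        if prerequisite is not None:
            visit(prerequisite)
        scheduled.add(name)
        plan.append(["llamafactory-cli", "train", f"configs/{name}.yaml"])

    for target in targets:
        visit(target)
    return plan
```

Each returned command is an argument list, so it can be passed to `subprocess.run(cmd, check=True)` from inside the LLaMA-Factory directory, or simply printed and run by hand.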

The evaluation folder contains the evaluation code. Run python evaluation.py to carry out the evaluation.

Citation

If you find our work useful, please cite using this BibTeX:

@misc{yu2025vpt,
      title={Introducing Visual Perception Token into Multimodal Large Language Model}, 
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2502.17425},
      archivePrefix={arXiv},
}