V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

April 4, 2026

The official implementation of the paper "V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding".

[🆕 Blog] [📜 Paper] [🤗 HF Models] [📖 HF Datasets]

📖 Summary

The main contributions of this work are as follows:

  • We construct mixed datasets for long-context training and evaluation of VLMs by augmenting existing multimodal instruction-tuning datasets. We also thoroughly investigate why current VLMs struggle with long-context multimodal inputs, revealing that directly applying LLM positional encoding to visual tokens is ineffective.
  • We propose Variable Visual Position Encoding (V2PE), a novel positional encoding strategy that employs variable and smaller increments for visual tokens, significantly enhancing VLMs' ability to understand and reason over long multimodal contexts.
  • We apply V2PE and the extended training data to the open-source VLM InternVL2-2B. The fine-tuned VLM performs exceptionally well on both general multimodal benchmarks and long-context multimodal tasks, with the capacity to handle sequences of up to 1M tokens.
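
To make the second contribution concrete, the sketch below assigns position indices to a mixed sequence of text and visual tokens. It assumes that each text token advances the index by 1 and each visual token by 1/stride, where stride mirrors the --rope_pos_id_stride option used in the evaluation commands. The function name `v2pe_position_ids` and this exact rule are assumptions made for illustration; this is not the repository's implementation.

```python
from typing import List


def v2pe_position_ids(is_visual: List[bool], stride: int = 64) -> List[float]:
    """Return one position index per token (illustrative sketch only).

    Assumption: a text token advances the index by 1, while a visual
    token advances it by the smaller increment 1 / stride.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    delta = 1.0 / stride
    positions: List[float] = []
    current = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            # The step taken depends on the type of the current token.
            current += delta if visual else 1.0
        positions.append(current)
    return positions


if __name__ == "__main__":
    # Two text tokens, eight visual tokens, two text tokens.
    mask = [False] * 2 + [True] * 8 + [False] * 2
    for stride in (1, 64):
        last = v2pe_position_ids(mask, stride=stride)[-1]
        print(f"stride={stride}: last position index = {last}")
```

With stride=1 the scheme reduces to ordinary consecutive position indices; larger strides make a long run of visual tokens occupy a much smaller range of positions.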

🛠️ Installation

See INSTALLATION.md

In addition, using this codebase requires executing the following steps:

  • Install other requirements:

    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
    

📦 Model Preparation

Our models are built on InternVL2-2B. Please download the model weights listed below and place them in the pretrained/ folder.

| model name | type | download | size |
| --- | --- | --- | --- |
| InternVL2-2B | VLM | 🤗 HF link | 4.4 GB |

cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B

The directory structure is:

pretrained
└── InternVL2-2B/

🔥 Supervised Fine-tuning

Prepare Training Datasets

  1. Download the training and validation datasets from HuggingFace.

  2. Organize the data as follows in dataset/:

`dataset/` Directory Structure
dataset
├── annotation
│   ├── long_mr_128k/
│   ├── long_mr_256k/
│   ├── long_mr_32k/
│   ├── long_vqa_32k/
│   ├── milebench_16k/
│   └── milebench_nh/
├── image
│   ├── long_mr
│   │   ├── train/
│   │   └── val/
│   ├── long_vqa
│   │   ├── image
│   │   │   ├── deepform
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── docvqa
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── infovqa
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── kleistercharity
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── svqa
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   └── visualmrc
│   │   │       ├── train/
│   │   │       └── val/
│   │   └── paste
│   │       ├── chartqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── clevr
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── dvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── gqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── ocrvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── okvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── tabfact
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── textcaps
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── textvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── vizwiz
│   │       │   ├── train/
│   │       │   └── val/
│   │       └── wikitablequestions
│   │           ├── train/
│   │           └── val/
│   └── milebench
│       ├── clevr
│       │   └── train/
│       ├── gpr
│       │   └── train/
│       ├── iedit
│       │   └── train/
│       ├── mmcoqa
│       │   └── train/
│       ├── mmqa
│       │   └── train/
│       ├── nh
│       │   └── train/
│       ├── objintercn
│       │   └── train/
│       ├── ocrvqa
│       │   └── train/
│       ├── percept
│       │   └── train/
│       ├── slidevqa
│       │   └── train/
│       ├── spotdiff
│       │   └── train/
│       ├── sta_charades
│       │   └── train/
│       ├── star
│       │   └── train/
│       ├── tqa
│       │   └── train/
│       └── webqa
│           └── train/
└── val
    ├── long_mr_128k/
    ├── long_mr_1m/
    ├── long_mr_256k/
    ├── long_mr_512k/
    ├── long_vqa_32k/
    ├── long_vqa_40k/
    ├── long_vqa_48k/
    ├── long_vqa_56k/
    └── long_vqa_64k/
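
Before launching training, it can help to confirm that the layout above is in place. The script below is a minimal sketch of such a check: the helper name `missing_dirs` and the decision to check only the top two directory levels are assumptions, while the directory names are copied from the tree above.

```python
from pathlib import Path
from typing import Iterable, List

# Directory names copied from the tree above (top two levels only).
EXPECTED_DIRS = [
    "annotation/long_mr_32k", "annotation/long_mr_128k", "annotation/long_mr_256k",
    "annotation/long_vqa_32k", "annotation/milebench_16k", "annotation/milebench_nh",
    "image/long_mr", "image/long_vqa", "image/milebench",
    "val/long_mr_128k", "val/long_mr_256k", "val/long_mr_512k", "val/long_mr_1m",
    "val/long_vqa_32k", "val/long_vqa_40k", "val/long_vqa_48k",
    "val/long_vqa_56k", "val/long_vqa_64k",
]


def missing_dirs(root: str, expected: Iterable[str] = EXPECTED_DIRS) -> List[str]:
    """Return the expected sub-directories that do not exist under root."""
    root_path = Path(root)
    return [rel for rel in expected if not (root_path / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("dataset")
    if missing:
        print("Missing directories:")
        for rel in missing:
            print(f"  dataset/{rel}")
    else:
        print("All expected directories are present.")
```

Run it from the repository root so that the relative path dataset/ resolves correctly.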

Start Training

We provide slurm scripts for multi-node multi-GPU training. You can use 32 GPUs to train this model, and it will take approximately 48 hours.

# using 32 GPUs
PARTITION='your partition' GPUS=32 sh shell/internlm2_2b/internvl_chat_v2_internlm2_2b_dynamic_res_v2pe_32k.sh

Training using ring-attention

When training on datasets of length 256k or longer, you may need to use ring attention to limit GPU memory usage. To enable ring attention, set the following two arguments in the training script:

  --chunk_num 8 \
  --attn_type 'ring' \

Here, chunk_num specifies the number of chunks each sample is split into; the chunks are distributed across chunk_num GPUs. Setting attn_type to 'ring' indicates that ring attention is used during training.
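
The effect of chunk_num can be pictured with the short sketch below, which splits one tokenized sample into chunk_num pieces, one per GPU. It assumes contiguous, near-equal chunks; the helper name `split_into_chunks` is hypothetical, and the training code may pad or arrange chunks differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(tokens: Sequence[T], chunk_num: int) -> List[List[T]]:
    """Split one sample into chunk_num contiguous, near-equal chunks.

    Assumption for illustration: chunk i would be handled by GPU i.
    """
    if chunk_num < 1:
        raise ValueError("chunk_num must be a positive integer")
    base, extra = divmod(len(tokens), chunk_num)
    chunks: List[List[T]] = []
    start = 0
    for rank in range(chunk_num):
        # The first `extra` chunks take one additional token each.
        size = base + (1 if rank < extra else 0)
        chunks.append(list(tokens[start:start + size]))
        start += size
    return chunks


if __name__ == "__main__":
    sample = list(range(20))  # stand-in for a tokenized sample
    for rank, chunk in enumerate(split_into_chunks(sample, chunk_num=8)):
        print(f"GPU {rank}: {chunk}")
```

Because each GPU holds only its own chunk, the per-GPU activation memory shrinks roughly in proportion to chunk_num, which is why the option helps at 256k and beyond.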

We provide an example training script that utilizes ring attention at: shell/internlm2_2b/internvl_chat_v2_internlm2_2b_dynamic_res_v2pe_256k.sh. You can run this script with the following command:

# using 32 GPUs
PARTITION='your partition' GPUS=32 sh shell/internlm2_2b/internvl_chat_v2_internlm2_2b_dynamic_res_v2pe_256k.sh

📊 Evaluation

Evaluation results in paper

General MLLM Benchmarks

[Figure: general MLLM benchmark results reported in the paper]

Long-Context MLLM Benchmarks

[Figure: long-context MLLM benchmark results reported in the paper]

Evaluation results of our released model

After organizing our codebase and training the released model, we re-ran the evaluation. The updated results for the released model are as follows:

General MLLM Benchmarks

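| Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU (val) | MMBench (EN) | SEED (I) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
| DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
| Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
| Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
| MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
| Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
| Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
| Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
| GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
| InternVL2-V2PE-32K | 2.0B | 76.4 | 83.9 | 73.2 | 55.9 | 94.9 | 88.8 | 36.6 | 73.5 | 71.2 | 72.5 |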

Long-Context MLLM Benchmarks

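| Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
| Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
| OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |
| LongLLaVA | 9B | - | - | - | 47.3 | 46.8 | - | - | 43.7 | 49.1 |
| LongLLaVA | 13B | - | - | - | 52.7 | 52.1 | - | - | 51.6 | 54.6 |
| VILA | 13B | 14.5 | 40.5 | 27.5 | - | - | - | - | - | - |
| Gemini-1.5 | - | 28.5 | 82.1 | 55.2 | 50.2 | 58.3 | 97.9 | 68.8 | 69.6 | - |
| GPT-4V | - | - | 84.1 | - | 45.6 | 58.9 | 99.4 | 68.0 | 59.9 | 43.5 |
| GPT-4o | - | - | - | - | 56.2 | 63.5 | - | - | 64.7 | - |
| Claude3-Opus | - | - | - | - | 37.4 | 48.1 | 85.3 | 56.9 | 59.7 | - |
| InternVL2-V2PE-32K | 2.0B | 78.1 | 85.7 | 81.8 | 65.5 | 56.4 | 97.2 | 72.5 | 50.7 | 65.6 |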

โ“ How to Evaluate

Preparing General MLLM Benchmarks

ChartQA test-human & test-augmented

Data Preparation
mkdir -p data/chartqa && cd data/chartqa

# download images from https://drive.google.com/file/d/1Lm_w6zeET1Hyl_9ks6w5nEsgpoyPHalV/view

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/train_augmented.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_human.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/chartqa/test_augmented.jsonl

cd ../..

DocVQA val & test

Data Preparation
mkdir -p data/docvqa && cd data/docvqa

# download images and annotations
wget https://datasets.cvc.uab.es/rrc/DocVQA/train.tar.gz --no-check-certificate # (optional)
wget https://datasets.cvc.uab.es/rrc/DocVQA/val.tar.gz --no-check-certificate
wget https://datasets.cvc.uab.es/rrc/DocVQA/test.tar.gz --no-check-certificate

# unzip files
tar -zxvf train.tar.gz
tar -zxvf val.tar.gz
tar -zxvf test.tar.gz

# download converted jsonl files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/train.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/val.jsonl
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/docvqa/test.jsonl
cd ../..

AI2D test

Data Preparation
mkdir -p data/ai2diagram && cd data/ai2diagram
# download converted files
wget https://huggingface.co/OpenGVLab/InternVL/raw/main/ai2d_test_vlmevalkit.jsonl -O test_vlmevalkit.jsonl
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/AI2D_TEST.zip && unzip AI2D_TEST.zip

# download images from Google drive (optional, provided by InternLM-XComposer)
# https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing
# images should be placed in `data/ai2diagram/ai2d/abc_images` and `data/ai2diagram/ai2d/images`
cd ../..

InfoVQA

Data Preparation

Please refer to https://rrc.cvc.uab.es/?ch=17 for details

ScienceQA test

Data Preparation
mkdir -p data/scienceqa/images && cd data/scienceqa/images

# download images
wget https://scienceqa.s3.us-west-1.amazonaws.com/images/test.zip && unzip test.zip

cd ..

# download original questions
wget https://github.com/lupantech/ScienceQA/raw/main/data/scienceqa/problems.json

# download converted files
wget https://ofasys-wlcb.oss-cn-wulanchabu.aliyuncs.com/Qwen-VL/evaluation/scienceqa/scienceqa_test_img.jsonl

cd ../..

POPE

Data Preparation
mkdir -p data/pope && cd data/pope

# make sure you have downloaded COCO images
ln -s ../coco/val2014 ./
wget https://github.com/OpenGVLab/InternVL/releases/download/data/llava_pope_test.jsonl

# download `coco` from POPE
mkdir -p coco && cd coco
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_adversarial.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_popular.json
wget https://github.com/AoiDragon/POPE/raw/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco/coco_pope_random.json
cd ../../..

MMMU

Data Preparation

The evaluation code will automatically download the dataset from huggingface.

MMBench dev & test

Data Preparation
mkdir -p data/mmbench && cd data/mmbench

# download the MMBench TSV files
wget http://opencompass.openxlab.space/utils/MMBench/CCBench_legacy.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_en_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_cn_20231003.tsv
wget https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_test_en_20231003.tsv

cd ../..

SEED

Data Preparation
mkdir -p data/SEED && cd data/SEED
# 1. Follow the official instructions [Data Preparation for SEED-Bench-1](https://github.com/AILab-CVC/SEED-Bench/blob/main/DATASET.md#data-preparation-for-seed-bench-1)
#    to download the images and the videos. Put images under `./data/SEED/SEED-Bench-image`.
# 2. Extract the video frame in the middle from the downloaded videos, and put them under `./data/SEED/SEED-Bench-image`.
#    LLaVA provided the script [`extract_video_frames.py`](../internvl_chat/tools/extract_video_frames.py) modified from the official one.

wget https://huggingface.co/OpenGVLab/InternVL/raw/main/seed.jsonl
cd ../..

Preparing Long-Context MLLM Benchmarks

MM-NIAH

Data Preparation
  1. Download the MM-NIAH dataset from HuggingFace and put the files in the dataset/benchmark/MM-NIAH folder.

  2. Extract the archives using the following commands:

    tar -xzvf dataset/benchmark/MM-NIAH/mm_niah_test/images.tar.gz -C dataset/benchmark/MM-NIAH/mm_niah_test/
    tar -xzvf dataset/benchmark/MM-NIAH/mm_niah_val/annotations.tar.gz -C dataset/benchmark/MM-NIAH/mm_niah_val/
    
  3. The directory structure should look like this:

    dataset
    └── benchmark
        └── MM-NIAH
            ├── mm_niah_test
            │   ├── annotations/
            │   └── images/
            └── mm_niah_val/
                ├── annotations/
                └── images/
    

Milebench

Data Preparation
  1. Download the MileBench dataset from Hugging Face.

  2. Extract the archives using the following command:

    for file in MileBench_part*.tar.gz
    do
    tar -xzvf "$file"
    done
    
  3. Put the extracted files in the dataset/benchmark/MileBench folder. The directory structure should look like this:

    dataset
    └── benchmark
        └── MileBench
            ├── ActionLocalization
            │   ├── images/
            │   └── ActionLocalization.json
            ├── ActionPrediction
            │   ├── images/
            │   └── ActionPrediction.json
            ├── ActionSequence
            │   ...
    

Evaluation Steps

Evaluating General MLLM Benchmarks

Evaluation

For all general MLLM benchmarks, you only need to run this one script to get all results.

# use STRIDE=64 as an example
STRIDE=64 sh scripts/evaluate_auto.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64

Evaluating Long-Context MLLM Benchmarks

Evaluation for milebench
# use STRIDE=64 as an example
STRIDE=64 sh scripts/evaluate_milebench.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64

Evaluation for mm_niah
# use STRIDE=64 as an example
STRIDE=64 sh scripts/evaluate_mmniah.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64

Evaluation for mm_niah-1M
# use STRIDE=64 as an example
STRIDE=64 sh scripts/evaluate_mmniah_long.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64

Evaluation for long-vqa
# use STRIDE=64 as an example
STRIDE=64 GROUP=32 GPUS_PER_TASK=1 sh scripts/evaluate_longvqa.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64
STRIDE=64 GROUP=40 GPUS_PER_TASK=2 sh scripts/evaluate_longvqa.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64
STRIDE=64 GROUP=48 GPUS_PER_TASK=2 sh scripts/evaluate_longvqa.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64
STRIDE=64 GROUP=56 GPUS_PER_TASK=4 sh scripts/evaluate_longvqa.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64
STRIDE=64 GROUP=64 GPUS_PER_TASK=4 sh scripts/evaluate_longvqa.sh <checkpoint> --rope_pos_id_version v2pe_fix --rope_pos_id_stride 64
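
The five long-vqa commands differ only in GROUP and GPUS_PER_TASK, so they can be generated from a small table. The helper below is a sketch that only builds the command strings; the name `longvqa_commands` is hypothetical, and the GROUP to GPUS_PER_TASK pairs are copied from the commands above.

```python
from typing import Dict, List

# GROUP -> GPUS_PER_TASK, copied from the long-vqa commands above.
# GROUP appears to match the long_vqa_32k ... long_vqa_64k validation sets.
LONGVQA_SETTINGS: Dict[int, int] = {32: 1, 40: 2, 48: 2, 56: 4, 64: 4}


def longvqa_commands(checkpoint: str, stride: int = 64) -> List[str]:
    """Build one evaluation command string per long-vqa group."""
    return [
        f"STRIDE={stride} GROUP={group} GPUS_PER_TASK={gpus} "
        f"sh scripts/evaluate_longvqa.sh {checkpoint} "
        f"--rope_pos_id_version v2pe_fix --rope_pos_id_stride {stride}"
        for group, gpus in LONGVQA_SETTINGS.items()
    ]


if __name__ == "__main__":
    # Replace <checkpoint> with the path to your checkpoint before running.
    for command in longvqa_commands("<checkpoint>"):
        print(command)
```

Printing the commands rather than executing them keeps the sketch free of side effects; paste the output into a shell or a job script once the checkpoint path is filled in.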

🎫 License

This project is released under the MIT License.

🖊️ Citation

If you find this work helpful in your research, please consider citing:

@misc{ge2024v2peimprovingmultimodallongcontext,
      title={V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding}, 
      author={Junqi Ge and Ziyi Chen and Jintao Lin and Jinguo Zhu and Xihui Liu and Jifeng Dai and Xizhou Zhu},
      year={2024},
      eprint={2412.09616},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09616}, 
}