FlexSelect: Flexible Token Selection for Efficient Long Video Understanding (NeurIPS 2025)
September 19, 2025 · View on GitHub
Created by Yunzhu Zhang*, Yu Lu*, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu
The official repository for paper "FlexSelect: Flexible Token Selection for Efficient Long Video Understanding".
Webpage | Paper | Huggingface
News
[2025-09-18]: Our paper FlexSelect has been accepted by NeurIPS 2025!
[2025-06-01]: Source code uploaded.
[2025-5-20]: Code repository created.
Introduction

We present FlexSelect, a flexible and efficient token selection method that leverages cross-modal attention scores in VideoLLMs to identify query-relevant visual tokens. Our approach combines (1) training-free, attention-based token ranking and (2) a lightweight selector for fast filtering.
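As a rough sketch of the training-free ranking step (tensor names, shapes, and the exact reduction here are illustrative assumptions, not the repository's API), each visual token is scored by cross-modal attention from a reference layer and only the top-ranked tokens are kept:

```python
import torch

def select_tokens(cross_attn: torch.Tensor, budget: int) -> torch.Tensor:
    """Rank visual tokens by cross-modal attention and keep the top `budget`.

    cross_attn: [num_heads, num_text_tokens, num_visual_tokens] attention
    weights taken from a single reference layer of the VideoLLM.
    """
    scores = cross_attn.mean(dim=(0, 1))        # one score per visual token
    keep = scores.topk(min(budget, scores.numel())).indices
    return keep.sort().values                   # restore temporal order

cross_attn = torch.rand(28, 16, 20_000)         # toy example
kept = select_tokens(cross_attn, budget=6720)
```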
Performance
We conduct experiments on three video LLMs (LLaVA-Video, Qwen2.5-VL, InternVL2.5) under four benchmarks: LongVideoBench, VideoMME, LVBench, and MLVU.
| Model | Size | VideoMME (Long) | VideoMME (Overall) | MLVU (M-Avg) | LongVB (Val) | LVBench (Test) |
|---|---|---|---|---|---|---|
| Proprietary Models | ||||||
| GPT-4o | - | 65.3 | 71.9 | 64.6 | 66.7 | 34.7 |
| Gemini-1.5-Pro | - | 67.4 | 75.0 | - | 64.0 | 33.1 |
| Open-Source VideoLLMs | ||||||
| mPLUG-Owl3 | 7B | 50.1 | 59.3 | 63.7 | 52.1 | 43.5 |
| Qwen2-VL | 7B | 53.8 | 63.3 | 66.9 | 55.6 | 42.4 |
| NVILA | 8B | 54.8 | 64.2 | 70.1 | 57.7 | - |
| VideoLLaMA3 | 7B | - | 66.2 | 73.0 | 59.8 | 45.3 |
| Aria | 8×3.5B | 58.8 | 67.6 | 70.6 | 65.3 | - |
| Oryx-1.5 | 34B | 59.3 | 67.3 | 72.3 | 62.0 | 30.8 |
| Video-XL-Pro | 3B | - | 60.0 | 70.6 | 56.7 | - |
| SF-LLaVA-1.5 | 7B | - | 63.9 | 71.5 | 62.5 | 45.3 |
| TPO | 7B | 55.4 | 65.6 | 71.1 | 60.1 | - |
| Quato | 7B | 55.7 | 65.9 | 71.9 | 59.0 | - |
| ViLAMP | 7B | 57.8 | 67.5 | 72.6 | 61.2 | 45.2 |
| LLaVA-Video | 7B | 52.9 | 64.4 | 68.6 | 58.2 | 43.1 |
| + FlexSelect | 7B | 59.8 (↑6.9) | 68.9 (↑4.5) | 73.2 (↑4.6) | 61.9 (↑3.7) | 52.9 (↑9.8) |
| + FlexSelect-Lite | 7B | 58.3 (↑5.4) | 68.3 (↑3.9) | 71.8 (↑3.2) | 60.7 (↑2.5) | 52.2 (↑9.1) |
| InternVL2.5 | 8B | 52.8 | 64.2 | 68.9 | 59.5 | 43.4 |
| + FlexSelect | 8B | 58.1 (↑5.3) | 67.0 (↑2.8) | 71.9 (↑3.0) | 60.1 (↑0.6) | 49.7 (↑6.3) |
| + FlexSelect-Lite | 8B | 57.9 (↑5.1) | 67.2 (↑3.0) | 71.9 (↑3.0) | 61.2 (↑1.7) | 49.9 (↑6.5) |
| Qwen2.5-VL | 7B | 55.6 | 65.4 | 70.2 | 59.5 | 45.3 |
| + FlexSelect | 7B | 59.3 (↑3.7) | 68.2 (↑2.8) | 72.5 (↑2.3) | 62.4 (↑2.9) | 51.2 (↑5.9) |
| + FlexSelect-Lite | 7B | 58.6 (↑3.0) | 67.4 (↑2.0) | 70.3 (↑0.1) | 61.9 (↑2.4) | 50.0 (↑4.7) |
| LLaVA-Video | 72B | 61.9 | 70.0 | 71.2 | 62.4 | 45.5 |
| + FlexSelect | 72B | 66.1 (↑4.2) | 73.1 (↑3.1) | 76.0 (↑4.8) | 66.9 (↑4.5) | 55.5 (↑10.0) |
| Qwen2.5-VL | 72B | 63.9 | 73.4 | 76.3 | 66.2 | 47.3 |
| + FlexSelect | 72B | 66.9 (↑3.0) | 74.4 (↑1.0) | 76.6 (↑0.3) | 66.4 (↑0.2) | 56.6 (↑9.3) |
Benchmark Data Preparation
All four benchmarks can be downloaded from the Hugging Face Hub: LongVideoBench, VideoMME, MLVU, and LVBench.
Prepare Data For VideoMME
- Download the videos.
huggingface-cli download --repo-type dataset --resume-download lmms-lab/Video-MME --local-dir lmms-lab/Video-MME --local-dir-use-symlinks False
- Unzip the videos
cd lmms-lab/Video-MME
unzip 'videos_chunked_*.zip' -d videos/
- Link the data into the eval directory
ln -s lmms-lab/Video-MME/videos flexselect/eval/data/videomme/data
ln -s lmms-lab/Video-MME/videomme/test-00000-of-00001.parquet flexselect/eval/data/videomme/test-00000-of-00001.parquet
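As a quick sanity check that the links resolve before a long evaluation run (a sketch assuming pandas and pyarrow are installed; the same pattern applies to the other benchmarks):

```python
import pandas as pd

# Should print the number of Video-MME questions if the symlinks are correct.
df = pd.read_parquet("flexselect/eval/data/videomme/test-00000-of-00001.parquet")
print(f"Loaded {len(df)} Video-MME questions")
```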
Prepare Data For MLVU
- Download the videos.
huggingface-cli download --repo-type dataset --resume-download sy1998/MLVU_dev --local-dir sy1998/MLVU_dev --local-dir-use-symlinks False
- Unzip the videos
cd sy1998/MLVU_dev
unzip 'video_part_*.zip' -d videos/
- Link the data into the eval directory
ln -s sy1998/MLVU_dev/videos flexselect/eval/data/mlvu_test/data
ln -s sy1998/MLVU_dev/mlvu/test-00000-of-00001.parquet flexselect/eval/data/mlvu_test/test-00000-of-00001.parquet
Prepare Data For LVBench
- Download the videos and files. Follow the instructions here for downloading videos: LVBench.
- Move or link the videos dir under flexselect/eval/data/lvbench.
- We reorganized the test file to support lmms-eval evaluation: flexselect/eval/data/lvbench/test.jsonl is the test file that we compiled, and it conforms to the lmms-eval supported format. You can also download it from here and move or link it under the data/lvbench/ dir.
Prepare Data For LongVideoBench
- Download the videos.
huggingface-cli download --repo-type dataset --resume-download longvideobench/LongVideoBench --local-dir longvideobench/LongVideoBench --local-dir-use-symlinks False
- Untar the videos
cd longvideobench/LongVideoBench
cat videos.tar.part.* > videos.tar
mkdir -p videos && tar -xvf videos.tar -C videos
- Link the data into the eval directory
ln -s longvideobench/LongVideoBench/videos flexselect/eval/data/longvideobench/data
ln -s longvideobench/LongVideoBench/test-00000-of-00001.parquet flexselect/eval/data/longvideobench/test-00000-of-00001.parquet
Pretrained Model
The pretrained models can be found in their respective repositories: LLaVA-Video-7B, LLaVA-Video-72B, InternVL2.5-8B, Qwen2.5VL-7B, and Qwen2.5VL-72B.
Evaluation
FlexSelect works in two modes: a training-free mode and a lightweight mode. We evaluate both using LMMS-Eval and follow its environment installation guideline. You can set up an environment by running:
sh setup.sh
You should download the token selector weights from Hugging Face into flexselect/eval/models:
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_llava_video --local-dir flexselect/eval/models/flexselect_llava_video
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_qwen2.5vl --local-dir flexselect/eval/models/flexselect_qwen2.5vl
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_internvl2.5 --local-dir flexselect/eval/models/flexselect_internvl2.5
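If you prefer Python over the CLI, the same weights can be fetched with huggingface_hub (equivalent to the commands above):

```python
from huggingface_hub import snapshot_download

# Same repos and target directories as the CLI commands above.
for repo in ("yunzhuyunzhu/flexselect_llava_video",
             "yunzhuyunzhu/flexselect_qwen2.5vl",
             "yunzhuyunzhu/flexselect_internvl2.5"):
    snapshot_download(repo_id=repo,
                      local_dir=f"flexselect/eval/models/{repo.split('/')[-1]}")
```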
Then you can reproduce our results:
cd flexselect/eval
sh scripts/eval_llavavideo.sh
sh scripts/eval_internvl2_5.sh
sh scripts/eval_qwenvl2_5.sh
Here are explanations of the parameters in our eval scripts:
| Parameter | Type | Options / Notes | Default |
|---|---|---|---|
| use_token_selector | boolean | true: enable FlexSelect token selection; false: disable (standard eval) | false |
| token_selector_path | string | "self": training-free mode; path to a token selector model: lightweight mode | "self" |
| token_selector_layer | integer | reference layer number (only effective in training-free mode) | -1 |
| drop_func_name | string | how the semantic relevance score is computed. "token_selection": average over the head and text dimensions; "token_selection_argmax": argmax over the head and text dimensions | "token_selection" |
| tkn_budget | integer | maximum number of selected visual tokens | 6720 |
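For intuition, here is our reading of the two drop_func_name options as a short sketch; the shapes and exact reductions are assumptions from the table above, not verified repository code:

```python
import torch

# Toy cross-modal attention from one reference layer:
# [num_heads, num_text_tokens, num_visual_tokens]
attn = torch.rand(28, 16, 20_000)

# "token_selection": average the attention over heads and text tokens.
token_selection = attn.mean(dim=(0, 1))

# "token_selection_argmax": take the maximum over heads and text tokens
# instead of the mean (our interpretation of "argmax" as a reduction).
token_selection_argmax = attn.amax(dim=(0, 1))

# Either score vector is then cut to the top `tkn_budget` visual tokens.
kept = token_selection.topk(6720).indices
```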
Here are explanations of some command-line choices:
1. Model Selection (--model)
Specify the evaluation model with the following options:
| Value | Model Evaluated |
|---|---|
| llava_vid | LLaVA-Video-7B |
| internvl2 | InternVL2.5 |
| qwen2_5_vl | Qwen2.5-VL |
2. Task Selection (--tasks)
| Value | Task Name | Notes |
|---|---|---|
| videomme | Video-MME | Standard video evaluation |
| mlvu_dev | MLVU | Multi-task long video understanding |
| lvbench | LVBench | Long-video benchmark |
| longvideobench_val_v | LongVideoBench | Default variant (e.g., for LLaVA) |
| longvideobench_val_v_sub | LongVideoBench | InternVL series only (uses subtitles) |
Token Selector Training
FlexSelect trains a 0.5B token selector for each of LLaVA-Video-7B, Qwen2.5VL-7B, and InternVL2.5-8B.
We follow the environment installation guidelines of the corresponding projects to construct the training environments:
- LLaVA-Video: https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file#2-install-the-inference-package
- Qwen2.5VL: https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-finetune/README.md
- InternVL2.5: https://internvl.readthedocs.io/en/latest/internvl2.5/finetune.html
You should first download the LLaVA-Video-178K dataset.
We randomly select 5% of its data for training.
You can download our sampled QA pairs from token_selector_train_data and put the data under the right directory.
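For reference, a purely illustrative sketch of how such a 5% subset could be drawn; the released token_selector_train_data file is the actual sample used in the paper, and the filenames and seed below are placeholders:

```python
import json
import random

# Placeholder filenames: the full LLaVA-Video-178K annotation file in,
# the sampled QA pairs out.
with open("llava_video_178k_full.json") as f:
    data = json.load(f)

random.seed(42)  # assumed seed, only for reproducibility of this sketch
subset = random.sample(data, k=int(0.05 * len(data)))

with open("llava_video_178k_5pct.json", "w") as f:
    json.dump(subset, f)
```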
# Train LLaVA-Video selector
mv output_rnd05.yaml flexselect/train/LLaVA-Video
cd flexselect/train/LLaVA-Video && sh scripts/train_selector.sh
# Train Qwen2.5-VL selector
mv llava_video_178k.json flexselect/train/Qwen2.5-VL/
cd flexselect/train/Qwen2.5-VL/qwen-vl-finetune && sh scripts/sft_7b.sh
# Train InternVL selector
mv llava_video_178k.jsonl flexselect/train/InternVL/internvl_chat
cd flexselect/train/InternVL/internvl_chat && sh shell/internvl2.5/2nd_finetune/internvl2_5_8b_dynamic_res_2nd_finetune_full.sh
Visualization Results
We identify the reference layer in a VideoLLM where cross-modal attention scores best reflect text-visual semantic relevance. If you are interested in the visualization results in the paper, you can reproduce them with the commands below after setting up the conda environment.
Recall@K Results
Download the needle data from needle data and unzip it.
python3 flexselect/visualization/LLaVA-NeXT/needle_llava.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/QwenVL/needle_qwen2_5vl.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/InternVL2/needle_internvl2_5.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
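For intuition, Recall@K here measures how many ground-truth needle tokens survive selection in the top K; a small self-contained sketch (function and variable names are illustrative, not the scripts' actual API):

```python
def recall_at_k(ranked_tokens, needle_tokens, k):
    """Fraction of needle (ground-truth) tokens among the top-K ranked tokens."""
    return len(set(ranked_tokens[:k]) & set(needle_tokens)) / len(needle_tokens)

# Two of the three needle tokens appear in the top 4 -> 0.67
print(recall_at_k([3, 7, 1, 9, 4], needle_tokens=[7, 9, 5], k=4))
```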
Attention Scores Heatmap
python3 flexselect/visualization/LLaVA-NeXT/draw_heatmap_llava.py
python3 flexselect/visualization/InternVL2/draw_heatmap_internvl.py
PCA Visualization
python3 flexselect/visualization/LLaVA-NeXT/pca_llava.py
Acknowledgement
This repository is built upon LMMS-Eval, LLaVA-Video, InternVL2.5, and Qwen2.5VL. Thanks to the authors of these well-organized codebases.
License
FlexSelect is released under the CC BY-NC-SA 4.0 license.
Citation
@misc{zhang2025flexselectflexibletokenselection,
title={FlexSelect: Flexible Token Selection for Efficient Long Video Understanding},
author={Yunzhu Zhang and Yu Lu and Tianyi Wang and Fengyun Rao and Yi Yang and Linchao Zhu},
year={2025},
eprint={2506.00993},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.00993},
}