FlexSelect: Flexible Token Selection for Efficient Long Video Understanding (NeurIPS 2025)


Created by Yunzhu Zhang*, Yu Lu*, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu

The official repository for the paper "FlexSelect: Flexible Token Selection for Efficient Long Video Understanding". Webpage | Paper | Huggingface

News

[2025-09-18]: 🎉 Our paper FlexSelect has been accepted by NeurIPS 2025!

[2025-06-01]: Source code uploaded.

[2025-05-20]: Code repository created.

Introduction

Framework

We present FlexSelect, a flexible and efficient token selection method that leverages cross-modal attention scores in VideoLLMs to identify query-relevant visual tokens. Our approach combines: (1) training-free attention-based token ranking, and (2) a lightweight selector for fast filtering.
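
As a rough illustration of the training-free mode, the sketch below ranks visual tokens by cross-modal attention at a single reference layer, averaging over heads and text positions and keeping the top tokens within a budget. Function names and tensor shapes are illustrative, not the repository's actual API:

```python
import torch

def rank_visual_tokens(attn, budget):
    """Training-free token ranking (illustrative sketch).

    attn: cross-modal attention from one reference layer, shaped
          [num_heads, num_text_tokens, num_visual_tokens], i.e. how much
          each text token attends to each visual token.
    budget: maximum number of visual tokens to keep (cf. tkn_budget below).
    """
    # Average over the head and text dimensions to get one relevance
    # score per visual token (the "token_selection" reduction).
    scores = attn.mean(dim=(0, 1))          # [num_visual_tokens]
    k = min(budget, scores.numel())
    keep = scores.topk(k).indices
    # Restore temporal order so the surviving tokens remain a valid sequence.
    return keep.sort().values

# Toy usage: 8 heads, 12 text tokens, 1000 visual tokens, keep 100.
attn = torch.rand(8, 12, 1000)
selected = rank_visual_tokens(attn, budget=100)
```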

Performance

We conduct experiments on three VideoLLMs (LLaVA-Video, Qwen2.5-VL, InternVL2.5) across four benchmarks: LongVideoBench, VideoMME, LVBench, and MLVU.

| Model | Size | VideoMME (Long) | VideoMME (Overall) | MLVU (M-Avg) | LongVB (Val) | LVBench (Test) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 65.3 | 71.9 | 64.6 | 66.7 | 34.7 |
| Gemini-1.5-Pro | - | 67.4 | 75.0 | - | 64.0 | 33.1 |
| **Open-Source VideoLLMs** | | | | | | |
| mPLUG-Owl3 | 7B | 50.1 | 59.3 | 63.7 | 52.1 | 43.5 |
| Qwen2-VL | 7B | 53.8 | 63.3 | 66.9 | 55.6 | 42.4 |
| NVILA | 8B | 54.8 | 64.2 | 70.1 | 57.7 | - |
| VideoLLaMA3 | 7B | - | 66.2 | 73.0 | 59.8 | 45.3 |
| Aria | 8×3.5B | 58.8 | 67.6 | 70.6 | 65.3 | - |
| Oryx-1.5 | 34B | 59.3 | 67.3 | 72.3 | 62.0 | 30.8 |
| Video-XL-Pro | 3B | - | 60.0 | 70.6 | 56.7 | - |
| SF-LLaVA-1.5 | 7B | - | 63.9 | 71.5 | 62.5 | 45.3 |
| TPO | 7B | 55.4 | 65.6 | 71.1 | 60.1 | - |
| Quato | 7B | 55.7 | 65.9 | 71.9 | 59.0 | - |
| ViLAMP | 7B | 57.8 | 67.5 | 72.6 | 61.2 | 45.2 |
| LLaVA-Video | 7B | 52.9 | 64.4 | 68.6 | 58.2 | 43.1 |
| + FlexSelect | 7B | 59.8 (↑6.9) | 68.9 (↑4.5) | 73.2 (↑4.6) | 61.9 (↑3.7) | 52.9 (↑9.8) |
| + FlexSelect-Lite | 7B | 58.3 (↑5.4) | 68.3 (↑3.9) | 71.8 (↑3.2) | 60.7 (↑2.5) | 52.2 (↑9.1) |
| InternVL2.5 | 8B | 52.8 | 64.2 | 68.9 | 59.5 | 43.4 |
| + FlexSelect | 8B | 58.1 (↑5.3) | 67.0 (↑2.8) | 71.9 (↑3.0) | 60.1 (↑0.6) | 49.7 (↑6.3) |
| + FlexSelect-Lite | 8B | 57.9 (↑5.1) | 67.2 (↑3.0) | 71.9 (↑3.0) | 61.2 (↑1.7) | 49.9 (↑6.5) |
| Qwen2.5-VL | 7B | 55.6 | 65.4 | 70.2 | 59.5 | 45.3 |
| + FlexSelect | 7B | 59.3 (↑3.7) | 68.2 (↑2.8) | 72.5 (↑2.3) | 62.4 (↑2.9) | 51.2 (↑5.9) |
| + FlexSelect-Lite | 7B | 58.6 (↑3.0) | 67.4 (↑2.0) | 70.3 (↑0.1) | 61.9 (↑2.4) | 50.0 (↑4.7) |
| LLaVA-Video | 72B | 61.9 | 70.0 | 71.2 | 62.4 | 45.5 |
| + FlexSelect | 72B | 66.1 (↑4.2) | 73.1 (↑3.1) | 76.0 (↑4.8) | 66.9 (↑4.5) | 55.5 (↑10.0) |
| Qwen2.5-VL | 72B | 63.9 | 73.4 | 76.3 | 66.2 | 47.3 |
| + FlexSelect | 72B | 66.9 (↑3.0) | 74.4 (↑1.0) | 76.6 (↑0.3) | 66.4 (↑0.2) | 56.6 (↑9.3) |

Benchmark Data Preparation

All four benchmarks can be downloaded from the Hugging Face Hub: LongVideoBench, VideoMME, MLVU, and LVBench.

Prepare Data For VideoMME

1. Download the videos.
huggingface-cli download --repo-type dataset --resume-download lmms-lab/Video-MME --local-dir lmms-lab/Video-MME --local-dir-use-symlinks False
2. Unzip the videos.
cd lmms-lab/Video-MME
unzip 'videos_chunked_*.zip' -d videos/
3. Link the data into the eval directory.
ln -s lmms-lab/Video-MME/videos flexselect/eval/data/videomme/data
ln -s lmms-lab/Video-MME/videomme/test-00000-of-00001.parquet flexselect/eval/data/videomme/test-00000-of-00001.parquet

Prepare Data For MLVU

1. Download the videos.
huggingface-cli download --repo-type dataset --resume-download sy1998/MLVU_dev --local-dir sy1998/MLVU_dev --local-dir-use-symlinks False
2. Unzip the videos.
cd sy1998/MLVU_dev
unzip 'video_part_*.zip' -d videos/
3. Link the data into the eval directory.
ln -s sy1998/MLVU_dev/videos flexselect/eval/data/mlvu_test/data
ln -s sy1998/MLVU_dev/mlvu/test-00000-of-00001.parquet flexselect/eval/data/mlvu_test/test-00000-of-00001.parquet

Prepare Data For LVBench

1. Download the videos by following the instructions in the LVBench repository.

2. Move or link the videos directory under flexselect/eval/data/lvbench.

3. We reorganized the test file to support lmms-eval evaluation: flexselect/eval/data/lvbench/test.jsonl is our compiled test file, which conforms to the lmms-eval format. You can download it from here and move or link it under the data/lvbench/ directory.

Prepare Data For LongVideoBench

1. Download the videos.
huggingface-cli download --repo-type dataset --resume-download longvideobench/LongVideoBench --local-dir longvideobench/LongVideoBench --local-dir-use-symlinks False
2. Untar the videos.
cd longvideobench/LongVideoBench
cat videos.tar.part.* > videos.tar
mkdir -p videos
tar -xvf videos.tar -C videos
3. Link the data into the eval directory.
ln -s longvideobench/LongVideoBench/videos flexselect/eval/data/longvideobench/data
ln -s longvideobench/LongVideoBench/test-00000-of-00001.parquet flexselect/eval/data/longvideobench/test-00000-of-00001.parquet

Pretrained Models

The pretrained models can be found in their respective repositories: LLaVA-Video-7B, LLaVA-Video-72B, InternVL2.5-8B, Qwen2.5VL-7B, and Qwen2.5VL-72B.

Evaluation

FlexSelect works in two modes: training-free mode and lightweight mode. We evaluate both using LMMS-Eval and follow its environment installation guidelines. You can set up an environment by running:

sh setup.sh

Download the token selector weights from Hugging Face into flexselect/eval/models:

huggingface-cli download --resume-download yunzhuyunzhu/flexselect_llava_video --local-dir flexselect/eval/models/flexselect_llava_video
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_qwen2.5vl --local-dir flexselect/eval/models/flexselect_qwen2.5vl
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_internvl2.5 --local-dir flexselect/eval/models/flexselect_internvl2.5

Then you can reproduce our results:

cd flexselect/eval
sh scripts/eval_llavavideo.sh
sh scripts/eval_internvl2_5.sh
sh scripts/eval_qwenvl2_5.sh

Here are explanations of the variables in our eval scripts:

| Parameter | Type | Options / Notes | Default |
| --- | --- | --- | --- |
| use_token_selector | boolean | true: enable FlexSelect token selection; false: disable (standard eval) | false |
| token_selector_path | string | "self": training-free mode; "path/to/token selector model": lightweight mode | "self" |
| token_selector_layer | integer | Reference layer number (only effective in training-free mode) | -1 |
| drop_func_name | string | How the semantic relevance score is computed. "token_selection": average over the head and text dimensions; "token_selection_argmax": argmax over the head and text dimensions | "token_selection" |
| tkn_budget | integer | Maximum number of selected tokens | 6720 |
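
For intuition, the two drop_func_name options differ only in how the per-head, per-text-token attention map is reduced to one score per visual token. The snippet below is our reading of the table above (we interpret "argmax" as a max-style reduction), not the repository's exact code:

```python
import torch

# attn: cross-modal attention at the reference layer, shaped
# [num_heads, num_text_tokens, num_visual_tokens] (toy values here).
attn = torch.rand(8, 12, 1000)

# "token_selection": average over the head and text dimensions.
scores_mean = attn.mean(dim=(0, 1))

# "token_selection_argmax": score each visual token by its strongest
# single head/text response instead of the average.
scores_max = attn.amax(dim=(0, 1))
```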

Here are explanations of some command-line choices:

1. Model Selection (--model)

Specify the evaluation model with the following options:

| Value | Model Evaluated |
| --- | --- |
| llava_vid | LLaVA-Video-7B |
| internvl2 | InternVL2.5 |
| qwen2_5_vl | Qwen2.5-VL |

2. Task Selection (--tasks)

| Value | Task Name | Notes |
| --- | --- | --- |
| videomme | Video-MME | Standard video evaluation |
| mlvu_dev | MLVU | Multi-task long video understanding |
| lvbench | LVBench | Long-video benchmark |
| longvideobench_val_v | LongVideoBench | Default variant (e.g., for LLaVA) |
| longvideobench_val_v_sub | LongVideoBench | InternVL series only (uses captions) |

Token Selector Training

FlexSelect trains a 0.5B token selector for each of LLaVA-Video-7B, Qwen2.5-VL-7B, and InternVL2.5-8B.
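
At inference time, the trained selector acts as a cheap scoring pass in front of the full model: it scores all visual tokens, and only the survivors are fed to the large VideoLLM. The sketch below is hypothetical; the class name, score signature, and random scoring are stand-ins, not this repo's API:

```python
import torch

class TinySelector:
    """Stand-in for the 0.5B token selector (illustrative only)."""
    def score(self, video_tokens, question):
        # A real selector produces query-conditioned relevance scores;
        # here we return a random stand-in, one score per visual token.
        return torch.rand(video_tokens.shape[0])

def lightweight_select(selector, video_tokens, question, budget):
    scores = selector.score(video_tokens, question)
    k = min(budget, scores.numel())
    keep = scores.topk(k).indices.sort().values  # keep temporal order
    return video_tokens[keep]  # only these go to the full VideoLLM

# Toy usage: 1000 visual tokens of dim 1152, keep 100.
video_tokens = torch.randn(1000, 1152)
kept = lightweight_select(TinySelector(), video_tokens, "question", budget=100)
```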

We follow the environment installation guidelines of the corresponding projects to set up the training environments.

First download the LLaVA-Video-178K dataset. We randomly sample 5% of its data for training. You can download our sampled QA pairs from token_selector_train_data and place the files in the directories shown in the commands below.

# Train LLaVA-Video selector
mv output_rnd05.yaml flexselect/train/LLaVA-Video
cd flexselect/train/LLaVA-Video && sh scripts/train_selector.sh

# Train Qwen2.5-VL selector
mv llava_video_178k.json flexselect/train/Qwen2.5-VL/
cd flexselect/train/Qwen2.5-VL/qwen-vl-finetune && sh scripts/sft_7b.sh

# Train InternVL selector 
mv llava_video_178k.jsonl flexselect/train/InternVL/internvl_chat
cd flexselect/train/InternVL/internvl_chat && sh shell/internvl2.5/2nd_finetune/internvl2_5_8b_dynamic_res_2nd_finetune_full.sh

Visualization Results

We identify the reference layer in each VideoLLM where cross-modal attention scores best reflect text-visual semantic relevance. If you are interested in the visualization results from the paper, you can reproduce them with the following commands after setting up the conda environment.

Recall@K Results

Download the needle data from needle data and unzip it.

python3 flexselect/visualization/LLaVA-NeXT/needle_llava.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/QwenVL/needle_qwen2_5vl.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/InternVL2/needle_internvl2_5.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
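
These scripts measure how well each candidate layer's attention scores retrieve the inserted needle segment. As a rough sketch of the metric under our reading (the fraction of needle tokens that survive a top-K selection; function and variable names are illustrative):

```python
import torch

def recall_at_k(scores, needle_mask, k):
    """Fraction of needle tokens that appear in the top-k selection.

    scores: [num_visual_tokens] relevance scores from one layer.
    needle_mask: boolean tensor marking tokens from the needle segment.
    """
    top = scores.topk(k).indices
    hit = needle_mask[top].sum().item()
    return hit / max(needle_mask.sum().item(), 1)

# Toy example: tokens 500-519 form the needle; select the top 100.
scores = torch.rand(1000)
mask = torch.zeros(1000, dtype=torch.bool)
mask[500:520] = True
print(recall_at_k(scores, mask, k=100))
```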

Attention Scores Heatmap

python3 flexselect/visualization/LLaVA-NeXT/draw_heatmap_llava.py 
python3 flexselect/visualization/InternVL2/draw_heatmap_internvl.py

PCA Visualization

python3 flexselect/visualization/LLaVA-NeXT/pca_llava.py

Acknowledgement

This repository is built upon LMMS-EVAL, LLaVA-Video, InternVL2.5, and Qwen2.5VL. Thanks to the authors for these well-organized codebases.

License

FlexSelect is released under the CC BY-NC-SA 4.0 license.

Citation

@misc{zhang2025flexselectflexibletokenselection,
      title={FlexSelect: Flexible Token Selection for Efficient Long Video Understanding}, 
      author={Yunzhu Zhang and Yu Lu and Tianyi Wang and Fengyun Rao and Yi Yang and Linchao Zhu},
      year={2025},
      eprint={2506.00993},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.00993}, 
}