FlexSelect: Flexible Token Selection for Efficient Long Video Understanding (NeurIPS 2025)


Created by Yunzhu Zhang*, Yu Lu*, Tianyi Wang, Fengyun Rao, Yi Yang, Linchao Zhu

The official repository for the paper "FlexSelect: Flexible Token Selection for Efficient Long Video Understanding". Webpage | Paper | Huggingface

News

[2025-09-18]: 🎉 Our paper FlexSelect has been accepted by NeurIPS 2025!

[2025-06-01]: Source code uploaded.

[2025-05-20]: Code repository created.

Introduction

Framework

We present FlexSelect, a flexible and efficient token selection method that leverages cross-modal attention scores in VideoLLMs to identify query-relevant visual tokens. Our approach combines: (1) training-free attention-based token ranking, and (2) a lightweight selector for fast filtering.
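
As a rough illustration of the training-free mode, the sketch below ranks visual tokens by cross-modal attention at a single reference layer, averaging over heads and text positions and keeping the top tokens within a budget. Function names and tensor shapes are illustrative, not the repository's actual API:

```python
import torch

def rank_visual_tokens(attn, budget):
    """Training-free token ranking (illustrative sketch).

    attn: cross-modal attention from one reference layer, shaped
          [num_heads, num_text_tokens, num_visual_tokens], i.e. how much
          each text token attends to each visual token.
    budget: maximum number of visual tokens to keep (cf. tkn_budget below).
    """
    # Average over the head and text dimensions to get one relevance
    # score per visual token (the "token_selection" reduction).
    scores = attn.mean(dim=(0, 1))          # [num_visual_tokens]
    k = min(budget, scores.numel())
    keep = scores.topk(k).indices
    # Restore temporal order so the surviving tokens remain a valid sequence.
    return keep.sort().values

# Toy usage: 8 heads, 12 text tokens, 1000 visual tokens, keep 100.
attn = torch.rand(8, 12, 1000)
selected = rank_visual_tokens(attn, budget=100)
```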

Performance

We conduct experiments on three VideoLLMs (LLaVA-Video, Qwen2.5-VL, InternVL2.5) across four benchmarks: LongVideoBench, VideoMME, LVBench, and MLVU.

| Model | Size | VideoMME (Long) | VideoMME (Overall) | MLVU (M-Avg) | LongVB (Val) | LVBench (Test) |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 65.3 | 71.9 | 64.6 | 66.7 | 34.7 |
| Gemini-1.5-Pro | - | 67.4 | 75.0 | - | 64.0 | 33.1 |
| **Open-Source VideoLLMs** | | | | | | |
| mPLUG-Owl3 | 7B | 50.1 | 59.3 | 63.7 | 52.1 | 43.5 |
| Qwen2-VL | 7B | 53.8 | 63.3 | 66.9 | 55.6 | 42.4 |
| NVILA | 8B | 54.8 | 64.2 | 70.1 | 57.7 | - |
| VideoLLaMA3 | 7B | - | 66.2 | 73.0 | 59.8 | 45.3 |
| Aria | 8×3.5B | 58.8 | 67.6 | 70.6 | 65.3 | - |
| Oryx-1.5 | 34B | 59.3 | 67.3 | 72.3 | 62.0 | 30.8 |
| Video-XL-Pro | 3B | - | 60.0 | 70.6 | 56.7 | - |
| SF-LLaVA-1.5 | 7B | - | 63.9 | 71.5 | 62.5 | 45.3 |
| TPO | 7B | 55.4 | 65.6 | 71.1 | 60.1 | - |
| Quato | 7B | 55.7 | 65.9 | 71.9 | 59.0 | - |
| ViLAMP | 7B | 57.8 | 67.5 | 72.6 | 61.2 | 45.2 |
| LLaVA-Video | 7B | 52.9 | 64.4 | 68.6 | 58.2 | 43.1 |
| + FlexSelect | 7B | 59.8 (↑6.9) | 68.9 (↑4.5) | 73.2 (↑4.6) | 61.9 (↑3.7) | 52.9 (↑9.8) |
| + FlexSelect-Lite | 7B | 58.3 (↑5.4) | 68.3 (↑3.9) | 71.8 (↑3.2) | 60.7 (↑2.5) | 52.2 (↑9.1) |
| InternVL2.5 | 8B | 52.8 | 64.2 | 68.9 | 59.5 | 43.4 |
| + FlexSelect | 8B | 58.1 (↑5.3) | 67.0 (↑2.8) | 71.9 (↑3.0) | 60.1 (↑0.6) | 49.7 (↑6.3) |
| + FlexSelect-Lite | 8B | 57.9 (↑5.1) | 67.2 (↑3.0) | 71.9 (↑3.0) | 61.2 (↑1.7) | 49.9 (↑6.5) |
| Qwen2.5-VL | 7B | 55.6 | 65.4 | 70.2 | 59.5 | 45.3 |
| + FlexSelect | 7B | 59.3 (↑3.7) | 68.2 (↑2.8) | 72.5 (↑2.3) | 62.4 (↑2.9) | 51.2 (↑5.9) |
| + FlexSelect-Lite | 7B | 58.6 (↑3.0) | 67.4 (↑2.0) | 70.3 (↑0.1) | 61.9 (↑2.4) | 50.0 (↑4.7) |
| LLaVA-Video | 72B | 61.9 | 70.0 | 71.2 | 62.4 | 45.5 |
| + FlexSelect | 72B | 66.1 (↑4.2) | 73.1 (↑3.1) | 76.0 (↑4.8) | 66.9 (↑4.5) | 55.5 (↑10.0) |
| Qwen2.5-VL | 72B | 63.9 | 73.4 | 76.3 | 66.2 | 47.3 |
| + FlexSelect | 72B | 66.9 (↑3.0) | 74.4 (↑1.0) | 76.6 (↑0.3) | 66.4 (↑0.2) | 56.6 (↑9.3) |

Benchmark Data Preparation

All four benchmarks can be downloaded from the Hugging Face Hub: LongVideoBench, VideoMME, MLVU, and LVBench.

Prepare Data For VideoMME

1. Download the videos.
huggingface-cli download --repo-type dataset --resume-download lmms-lab/Video-MME --local-dir lmms-lab/Video-MME --local-dir-use-symlinks False
2. Unzip the videos.
cd lmms-lab/Video-MME
unzip 'videos_chunked_*.zip' -d videos/
3. Link the data into the eval directory.
ln -s lmms-lab/Video-MME/videos flexselect/eval/data/videomme/data
ln -s lmms-lab/Video-MME/videomme/test-00000-of-00001.parquet flexselect/eval/data/videomme/test-00000-of-00001.parquet

Prepare Data For MLVU

1. Download the videos.
huggingface-cli download --repo-type dataset --resume-download sy1998/MLVU_dev --local-dir sy1998/MLVU_dev --local-dir-use-symlinks False
2. Unzip the videos.
cd sy1998/MLVU_dev
unzip 'video_part_*.zip' -d videos/
3. Link the data into the eval directory.
ln -s sy1998/MLVU_dev/videos flexselect/eval/data/mlvu_test/data
ln -s sy1998/MLVU_dev/mlvu/test-00000-of-00001.parquet flexselect/eval/data/mlvu_test/test-00000-of-00001.parquet

Prepare Data For LVBench

1. Download the videos by following the instructions in the LVBench repository.

2. Move or link the videos directory under flexselect/eval/data/lvbench.

3. We reorganized the test file to support lmms-eval evaluation: flexselect/eval/data/lvbench/test.jsonl is our compiled test file, which conforms to the lmms-eval format. You can download it from here and move or link it under the data/lvbench/ directory.

Prepare Data For LongVideoBench

1. Download the videos.
huggingface-cli download --repo-type dataset --resume-download longvideobench/LongVideoBench --local-dir longvideobench/LongVideoBench --local-dir-use-symlinks False
2. Untar the videos.
cd longvideobench/LongVideoBench
cat videos.tar.part.* > videos.tar
mkdir -p videos
tar -xvf videos.tar -C videos
3. Link the data into the eval directory.
ln -s longvideobench/LongVideoBench/videos flexselect/eval/data/longvideobench/data
ln -s longvideobench/LongVideoBench/test-00000-of-00001.parquet flexselect/eval/data/longvideobench/test-00000-of-00001.parquet

Pretrained Models

The pretrained models can be found in their respective repositories: LLaVA-Video-7B, LLaVA-Video-72B, InternVL2.5-8B, Qwen2.5VL-7B, and Qwen2.5VL-72B.

Evaluation

FlexSelect works in two modes: training-free mode and lightweight mode. We evaluate both using LMMS-Eval and follow its environment installation guidelines. You can set up an environment by running:

sh setup.sh

Download the token selector weights from Hugging Face into flexselect/eval/models:

huggingface-cli download --resume-download yunzhuyunzhu/flexselect_llava_video --local-dir flexselect/eval/models/flexselect_llava_video
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_qwen2.5vl --local-dir flexselect/eval/models/flexselect_qwen2.5vl
huggingface-cli download --resume-download yunzhuyunzhu/flexselect_internvl2.5 --local-dir flexselect/eval/models/flexselect_internvl2.5

Then you can reproduce our results:

cd flexselect/eval
sh scripts/eval_llavavideo.sh
sh scripts/eval_internvl2_5.sh
sh scripts/eval_qwenvl2_5.sh

Here are explanations of the variables in our eval scripts:

| Parameter | Type | Options / Notes | Default |
| --- | --- | --- | --- |
| use_token_selector | boolean | true: enable FlexSelect token selection; false: disable (standard eval) | false |
| token_selector_path | string | "self": training-free mode; "path/to/token selector model": lightweight mode | "self" |
| token_selector_layer | integer | Reference layer number (only effective in training-free mode) | -1 |
| drop_func_name | string | How the semantic relevance score is computed. "token_selection": average over the head and text dimensions; "token_selection_argmax": argmax over the head and text dimensions | "token_selection" |
| tkn_budget | integer | Maximum number of selected tokens | 6720 |
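
For intuition, the two drop_func_name options differ only in how the per-head, per-text-token attention map is reduced to one score per visual token. The snippet below is our reading of the table above (we interpret "argmax" as a max-style reduction), not the repository's exact code:

```python
import torch

# attn: cross-modal attention at the reference layer, shaped
# [num_heads, num_text_tokens, num_visual_tokens] (toy values here).
attn = torch.rand(8, 12, 1000)

# "token_selection": average over the head and text dimensions.
scores_mean = attn.mean(dim=(0, 1))

# "token_selection_argmax": score each visual token by its strongest
# single head/text response instead of the average.
scores_max = attn.amax(dim=(0, 1))
```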

Here are explanations of some command-line choices:

1. Model Selection (--model)

Specify the evaluation model with the following options:

| Value | Model Evaluated |
| --- | --- |
| llava_vid | LLaVA-Video-7B |
| internvl2 | InternVL2.5 |
| qwen2_5_vl | Qwen2.5-VL |

2. Task Selection (--tasks)

| Value | Task Name | Notes |
| --- | --- | --- |
| videomme | Video-MME | Standard video evaluation |
| mlvu_dev | MLVU | Multi-task long video understanding |
| lvbench | LVBench | Long-video benchmark |
| longvideobench_val_v | LongVideoBench | Default variant (e.g., for LLaVA) |
| longvideobench_val_v_sub | LongVideoBench | InternVL series only (uses captions) |

Token Selector Training

FlexSelect trains a 0.5B token selector for each of LLaVA-Video-7B, Qwen2.5-VL-7B, and InternVL2.5-8B.
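
At inference time, the trained selector acts as a cheap scoring pass in front of the full model: it scores all visual tokens, and only the survivors are fed to the large VideoLLM. The sketch below is hypothetical; the class name, score signature, and random scoring are stand-ins, not this repo's API:

```python
import torch

class TinySelector:
    """Stand-in for the 0.5B token selector (illustrative only)."""
    def score(self, video_tokens, question):
        # A real selector produces query-conditioned relevance scores;
        # here we return a random stand-in, one score per visual token.
        return torch.rand(video_tokens.shape[0])

def lightweight_select(selector, video_tokens, question, budget):
    scores = selector.score(video_tokens, question)
    k = min(budget, scores.numel())
    keep = scores.topk(k).indices.sort().values  # keep temporal order
    return video_tokens[keep]  # only these go to the full VideoLLM

# Toy usage: 1000 visual tokens of dim 1152, keep 100.
video_tokens = torch.randn(1000, 1152)
kept = lightweight_select(TinySelector(), video_tokens, "question", budget=100)
```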

We follow the environment installation guidelines of the corresponding projects to set up the training environments.

First download the LLaVA-Video-178K dataset. We randomly sample 5% of its data for training. You can download our sampled QA pairs from token_selector_train_data and place the files in the directories shown in the commands below.

# Train LLaVA-Video selector
mv output_rnd05.yaml flexselect/train/LLaVA-Video
cd flexselect/train/LLaVA-Video && sh scripts/train_selector.sh

# Train Qwen2.5-VL selector
mv llava_video_178k.json flexselect/train/Qwen2.5-VL/
cd flexselect/train/Qwen2.5-VL/qwen-vl-finetune && sh scripts/sft_7b.sh

# Train InternVL selector 
mv llava_video_178k.jsonl flexselect/train/InternVL/internvl_chat
cd flexselect/train/InternVL/internvl_chat && sh shell/internvl2.5/2nd_finetune/internvl2_5_8b_dynamic_res_2nd_finetune_full.sh

Visualization Results

We identify the reference layer in each VideoLLM where cross-modal attention scores best reflect text-visual semantic relevance. If you are interested in the visualization results from the paper, you can reproduce them with the following commands after setting up the conda environment.

Recall@K Results

Download the needle data from needle data and unzip it.

python3 flexselect/visualization/LLaVA-NeXT/needle_llava.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/QwenVL/needle_qwen2_5vl.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
python3 flexselect/visualization/InternVL2/needle_internvl2_5.py --needle_data {path to needle data} --data_root {path to videomme video dir} --data {path to videomme test.parquet}
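
These scripts measure how well each candidate layer's attention scores retrieve the inserted needle segment. As a rough sketch of the metric under our reading (the fraction of needle tokens that survive a top-K selection; function and variable names are illustrative):

```python
import torch

def recall_at_k(scores, needle_mask, k):
    """Fraction of needle tokens that appear in the top-k selection.

    scores: [num_visual_tokens] relevance scores from one layer.
    needle_mask: boolean tensor marking tokens from the needle segment.
    """
    top = scores.topk(k).indices
    hit = needle_mask[top].sum().item()
    return hit / max(needle_mask.sum().item(), 1)

# Toy example: tokens 500-519 form the needle; select the top 100.
scores = torch.rand(1000)
mask = torch.zeros(1000, dtype=torch.bool)
mask[500:520] = True
print(recall_at_k(scores, mask, k=100))
```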

Attention Scores Heatmap

python3 flexselect/visualization/LLaVA-NeXT/draw_heatmap_llava.py 
python3 flexselect/visualization/InternVL2/draw_heatmap_internvl.py

PCA Visualization

python3 flexselect/visualization/LLaVA-NeXT/pca_llava.py

Acknowledgement

This repository is built upon LMMS-EVAL, LLaVA-Video, InternVL2.5, and Qwen2.5VL. Thanks to the authors for these well-organized codebases.

License

FlexSelect is released under the CC BY-NC-SA 4.0 license.

Citation

@misc{zhang2025flexselectflexibletokenselection,
      title={FlexSelect: Flexible Token Selection for Efficient Long Video Understanding}, 
      author={Yunzhu Zhang and Yu Lu and Tianyi Wang and Fengyun Rao and Yi Yang and Linchao Zhu},
      year={2025},
      eprint={2506.00993},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.00993}, 
}