README.md
June 12, 2026 · View on GitHub
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
CVPR 2026
This is the official implementation of GroundVTS, a Vid-LLM architecture that performs query-guided visual token sampling for video temporal grounding. GroundVTS introduces a Visual Token Sampling (VTS) module that dynamically selects the most informative visual tokens conditioned on the textual query, enabling fine-grained and efficient temporal grounding.

News
- [2026/06] Model checkpoints are available on Hugging Face and Model Scope.
- [2026/06] Code released.
- [2026/04] Paper available on arXiv.
- [2026/02] GroundVTS accepted at CVPR 2026.
- [2025/11] The Grounding-FT dataset is available on Hugging Face and Model Scope.
Overview
GroundVTS addresses the limitation of uniform frame sampling in existing Vid-LLMs by introducing a query-guided visual token sampling mechanism. Key features:
- Visual Token Sampling (VTS) module: Computes token-query similarity scores and performs weighted differentiable top-K sampling to retain the most informative tokens.
- Progressive optimization strategy: A three-stage training pipeline (VTS Warm-up → Joint LoRA Adaptation → Grounding Fine-tuning) that enables stable integration of VTS into existing Vid-LLMs.
- Architecture-agnostic: Applicable to different Vid-LLM backbones (demonstrated on Qwen2.5-VL and InternVL3.5).
Benchmarks
Moment Retrieval
| Method | Charades-STA | ActivityNet-Captions | ||||||
|---|---|---|---|---|---|---|---|---|
| R1@.3 | R1@.5 | R1@.7 | mIoU | R1@.3 | R1@.5 | R1@.7 | mIoU | |
| Qwen2.5VL-7B | 34.2 | 18.8 | 8.6 | 22.1 | 25.3 | 11.5 | 4.4 | 17.1 |
| GroundVTS-Q | 71.5 | 57.5 | 34.2 | 50.1 | 51.3 | 33.6 | 21.4 | 36.0 |
| InternVL3.5-8B | 35.5 | 25.7 | 13.2 | 24.6 | 22.1 | 12.0 | 5.6 | 15.8 |
| GroundVTS-I | 61.2 | 44.2 | 23.7 | 41.6 | 37.9 | 22.4 | 10.3 | 25.7 |
Highlight Detection (QVHighlights)
| Method | MR R1@.5 | MR R1@.7 | HD mAP | HD Hit@1 |
|---|---|---|---|---|
| GroundVTS-Q | 23.6 | 12.3 | 35.7 | 58.8 |
| GroundVTS-I | 63.6 | 40.7 | 52.5 | 88.4 |
Datasets
Training Data
- Stage 1 & 2: LLaVA-Video-178K — large-scale video dataset for multimodal pretraining.
- Stage 3: Grounding-FT — curated from Charades-STA, QVHighlights, and ActivityNet-Captions training splits (70K annotated video-query pairs).
Evaluation Benchmarks
| Benchmark | Task | Split |
|---|---|---|
| Charades-STA | Moment Retrieval | test |
| ActivityNet-Captions | Moment Retrieval | test |
| QVHighlights | MR + Highlight Detection | val |
| NExT-GQA | Grounded Video QA | test |
Models
| Model | Base Model | VTS Hidden Dim | Token Ratio |
|---|---|---|---|
| GroundVTS-Q | Qwen2.5-VL-7B-Instruct | 512 | 0.5 |
| GroundVTS-I | InternVL3.5-8B | 128 | 0.5 |
Installation
We recommend setting up a conda environment for the project.
For Qwen2.5-VL based model (GroundVTS-Q):
conda env create -f requirements/environment_qwen.yml
conda activate VTS_qwen
For InternVL3.5 based model (GroundVTS-I):
conda env create -f requirements/environment_intern.yml
conda activate VTS_intern
Alternatively, install from requirements files:
pip install -r requirements/requirements_qwen.txt # for GroundVTS-Q
pip install -r requirements/requirements_intern.txt # for GroundVTS-I
Usage
Data Preparation
- Download training data: Prepare LLaVA-Video-178K and the VTG benchmark training splits.
- Generate Grounding-FT dataset: Convert raw annotations to the LLaMA-Factory format:
python train/FT_data/data_generation/charades_to_LF.py
python train/FT_data/data_generation/qvhighlights_to_LF.py
python train/FT_data/data_generation/qvhighlights_to_LF_HD.py
python train/FT_data/data_generation/activitynetcap_to_LF.py
Update the paths inside each script before running.
Training
GroundVTS follows a three-stage progressive optimization strategy:
| Stage | Description | Config (Qwen) | Config (InternVL) |
|---|---|---|---|
| 1 | VTS Warm-up | qwen_stage1_vts_warmup.yaml | intern_stage1_vts_warmup.yaml |
| 2 | Joint LoRA Adaptation | qwen_stage2_joint_lora.yaml | intern_stage2_joint_lora.yaml |
| 3 | Grounding Fine-tuning | qwen_stage3_grounding_ft.yaml | intern_stage3_grounding_ft.yaml |
Update paths in the YAML configs (see placeholders), then run:
# Stage 1: VTS Warm-up
torchrun --nproc_per_node 8 train/src/train.py train/config/train/qwen_stage1_vts_warmup.yaml
# Stage 2: Joint LoRA Adaptation
torchrun --nproc_per_node 8 train/src/train.py train/config/train/qwen_stage2_joint_lora.yaml
# Stage 3: Grounding Fine-tuning
torchrun --nproc_per_node 8 train/src/train.py train/config/train/qwen_stage3_grounding_ft.yaml
Inference
Run inference on evaluation benchmarks. Predictions are written to the directory
given by --pred_path (file name is derived from the dataset/split/fps/frames).
Use --model_type qwen_qts for GroundVTS-Q or --model_type intern_qts for GroundVTS-I.
# General video temporal grounding benchmarks
python -m eval.infer_auto \
--model_type qwen_qts \
--dataset charades_sta \
--base_model_path <path/to/model> \
--pred_path <path/to/output_dir>
# QVHighlights (moment retrieval + highlight detection)
python -m eval.infer_qvhighlights \
--model_type qwen_qts \
--base_model_path <path/to/model> \
--pred_path <path/to/output_dir>
Evaluation
Evaluate saved predictions. Here --pred_path is the directory that holds the
prediction file, and --pred_name is the file name (without extension) produced by
the inference step.
# Moment retrieval evaluation
python -m eval.eval_auto \
--pred_path <path/to/output_dir> \
--pred_name output_charades_sta_test_1.0_8
# QVHighlights evaluation
python -m eval.eval_qvhighlights \
--pred_path <path/to/output_dir> \
--pred_name output_qvhighlights_valid_2.0_8 \
--anno_path <path/to/qvhighlights_val.jsonl>
LoRA Merging
Merge LoRA adapters into the base model for deployment:
python -m eval.merge_lora \
--base <path/to/base_model> \
--lora <path/to/lora_adapter> \
--out <path/to/merged_model>
Project Structure
GroundVTS/
├── models/ # Model architectures
│ ├── module/
│ │ └── vts_module.py # Visual Token Sampling (VTS) module
│ ├── vts_qwen2_5_vl/ # GroundVTS-Q (Qwen2.5-VL based)
│ ├── vts_internvl_3/ # GroundVTS-I (InternVL3.5 based)
│ ├── qwen2_5_vl/ # Base Qwen2.5-VL builder
│ └── internvl3_5/ # Base InternVL3.5 builder
├── train/ # Training pipeline
│ ├── config/
│ │ ├── deepspeed/ # DeepSpeed configs
│ │ └── train/ # Training stage configs
│ ├── FT_data/data_generation/ # Dataset conversion scripts
│ └── src/ # LLaMA-Factory based training
├── eval/ # Evaluation pipeline
│ ├── dataset/ # Benchmark dataset loaders
│ ├── utils/ # Evaluation utilities
│ ├── infer_auto.py # Multi-benchmark inference
│ ├── eval_auto.py # Multi-benchmark evaluation
│ ├── infer_qvhighlights.py # QVHighlights inference
│ └── eval_qvhighlights.py # QVHighlights evaluation
└── requirements/ # Environment configs
Citation
If you find this work useful, please cite our paper:
@inproceedings{fan2026groundvts,
title={GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding},
author={Fan, Rong and Xiao, Kaiyan and Zhu, Minghao and Wang, Liuyi and Dai, Kai and Yang, Zhao},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={10408--10418},
year={2026}
}
License
This project is released under the Apache 2.0 License.
Acknowledgements
This project builds upon several excellent open-source projects:
- LLaMA-Factory for the training framework
- Qwen2.5-VL and InternVL as base models
- VideoMind for evaluation scripts