README.md

June 12, 2026 · View on GitHub

GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

CVPR 2026

This is the official implementation of GroundVTS, a Vid-LLM architecture that performs query-guided visual token sampling for video temporal grounding. GroundVTS introduces a Visual Token Sampling (VTS) module that dynamically selects the most informative visual tokens conditioned on the textual query, enabling fine-grained and efficient temporal grounding.

Figure 1

News

  • [2026/06] Model checkpoints are available on Hugging Face and Model Scope.
  • [2026/06] Code released.
  • [2026/04] Paper available on arXiv.
  • [2026/02] GroundVTS accepted at CVPR 2026.
  • [2025/11] The Grounding-FT dataset is available on Hugging Face and Model Scope.

Overview

GroundVTS addresses the limitation of uniform frame sampling in existing Vid-LLMs by introducing a query-guided visual token sampling mechanism. Key features:

  • Visual Token Sampling (VTS) module: Computes token-query similarity scores and performs weighted differentiable top-K sampling to retain the most informative tokens.
  • Progressive optimization strategy: A three-stage training pipeline (VTS Warm-up → Joint LoRA Adaptation → Grounding Fine-tuning) that enables stable integration of VTS into existing Vid-LLMs.
  • Architecture-agnostic: Applicable to different Vid-LLM backbones (demonstrated on Qwen2.5-VL and InternVL3.5).

Benchmarks

Moment Retrieval

MethodCharades-STAActivityNet-Captions
R1@.3R1@.5R1@.7mIoUR1@.3R1@.5R1@.7mIoU
Qwen2.5VL-7B34.218.88.622.125.311.54.417.1
GroundVTS-Q71.557.534.250.151.333.621.436.0
InternVL3.5-8B35.525.713.224.622.112.05.615.8
GroundVTS-I61.244.223.741.637.922.410.325.7

Highlight Detection (QVHighlights)

MethodMR R1@.5MR R1@.7HD mAPHD Hit@1
GroundVTS-Q23.612.335.758.8
GroundVTS-I63.640.752.588.4

Datasets

Training Data

  • Stage 1 & 2: LLaVA-Video-178K — large-scale video dataset for multimodal pretraining.
  • Stage 3: Grounding-FT — curated from Charades-STA, QVHighlights, and ActivityNet-Captions training splits (70K annotated video-query pairs).

Evaluation Benchmarks

BenchmarkTaskSplit
Charades-STAMoment Retrievaltest
ActivityNet-CaptionsMoment Retrievaltest
QVHighlightsMR + Highlight Detectionval
NExT-GQAGrounded Video QAtest

Models

ModelBase ModelVTS Hidden DimToken Ratio
GroundVTS-QQwen2.5-VL-7B-Instruct5120.5
GroundVTS-IInternVL3.5-8B1280.5

Installation

We recommend setting up a conda environment for the project.

For Qwen2.5-VL based model (GroundVTS-Q):

conda env create -f requirements/environment_qwen.yml
conda activate VTS_qwen

For InternVL3.5 based model (GroundVTS-I):

conda env create -f requirements/environment_intern.yml
conda activate VTS_intern

Alternatively, install from requirements files:

pip install -r requirements/requirements_qwen.txt   # for GroundVTS-Q
pip install -r requirements/requirements_intern.txt  # for GroundVTS-I

Usage

Data Preparation

  1. Download training data: Prepare LLaVA-Video-178K and the VTG benchmark training splits.
  2. Generate Grounding-FT dataset: Convert raw annotations to the LLaMA-Factory format:
 python train/FT_data/data_generation/charades_to_LF.py
 python train/FT_data/data_generation/qvhighlights_to_LF.py
 python train/FT_data/data_generation/qvhighlights_to_LF_HD.py
 python train/FT_data/data_generation/activitynetcap_to_LF.py

Update the paths inside each script before running.

Training

GroundVTS follows a three-stage progressive optimization strategy:

StageDescriptionConfig (Qwen)Config (InternVL)
1VTS Warm-upqwen_stage1_vts_warmup.yamlintern_stage1_vts_warmup.yaml
2Joint LoRA Adaptationqwen_stage2_joint_lora.yamlintern_stage2_joint_lora.yaml
3Grounding Fine-tuningqwen_stage3_grounding_ft.yamlintern_stage3_grounding_ft.yaml

Update paths in the YAML configs (see placeholders), then run:

# Stage 1: VTS Warm-up
torchrun --nproc_per_node 8 train/src/train.py train/config/train/qwen_stage1_vts_warmup.yaml

# Stage 2: Joint LoRA Adaptation
torchrun --nproc_per_node 8 train/src/train.py train/config/train/qwen_stage2_joint_lora.yaml

# Stage 3: Grounding Fine-tuning
torchrun --nproc_per_node 8 train/src/train.py train/config/train/qwen_stage3_grounding_ft.yaml

Inference

Run inference on evaluation benchmarks. Predictions are written to the directory given by --pred_path (file name is derived from the dataset/split/fps/frames). Use --model_type qwen_qts for GroundVTS-Q or --model_type intern_qts for GroundVTS-I.

# General video temporal grounding benchmarks
python -m eval.infer_auto \
    --model_type qwen_qts \
    --dataset charades_sta \
    --base_model_path <path/to/model> \
    --pred_path <path/to/output_dir>

# QVHighlights (moment retrieval + highlight detection)
python -m eval.infer_qvhighlights \
    --model_type qwen_qts \
    --base_model_path <path/to/model> \
    --pred_path <path/to/output_dir>

Evaluation

Evaluate saved predictions. Here --pred_path is the directory that holds the prediction file, and --pred_name is the file name (without extension) produced by the inference step.

# Moment retrieval evaluation
python -m eval.eval_auto \
    --pred_path <path/to/output_dir> \
    --pred_name output_charades_sta_test_1.0_8

# QVHighlights evaluation
python -m eval.eval_qvhighlights \
    --pred_path <path/to/output_dir> \
    --pred_name output_qvhighlights_valid_2.0_8 \
    --anno_path <path/to/qvhighlights_val.jsonl>

LoRA Merging

Merge LoRA adapters into the base model for deployment:

python -m eval.merge_lora \
    --base <path/to/base_model> \
    --lora <path/to/lora_adapter> \
    --out <path/to/merged_model>

Project Structure

GroundVTS/
├── models/                          # Model architectures
│   ├── module/
│   │   └── vts_module.py            # Visual Token Sampling (VTS) module
│   ├── vts_qwen2_5_vl/             # GroundVTS-Q (Qwen2.5-VL based)
│   ├── vts_internvl_3/             # GroundVTS-I (InternVL3.5 based)
│   ├── qwen2_5_vl/                 # Base Qwen2.5-VL builder
│   └── internvl3_5/                # Base InternVL3.5 builder
├── train/                           # Training pipeline
│   ├── config/
│   │   ├── deepspeed/              # DeepSpeed configs
│   │   └── train/                  # Training stage configs
│   ├── FT_data/data_generation/    # Dataset conversion scripts
│   └── src/                        # LLaMA-Factory based training
├── eval/                            # Evaluation pipeline
│   ├── dataset/                    # Benchmark dataset loaders
│   ├── utils/                      # Evaluation utilities
│   ├── infer_auto.py               # Multi-benchmark inference
│   ├── eval_auto.py                # Multi-benchmark evaluation
│   ├── infer_qvhighlights.py       # QVHighlights inference
│   └── eval_qvhighlights.py        # QVHighlights evaluation
└── requirements/                    # Environment configs

Citation

If you find this work useful, please cite our paper:

@inproceedings{fan2026groundvts,
  title={GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding},
  author={Fan, Rong and Xiao, Kaiyan and Zhu, Minghao and Wang, Liuyi and Dai, Kai and Yang, Zhao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10408--10418},
  year={2026}
}

License

This project is released under the Apache 2.0 License.

Acknowledgements

This project builds upon several excellent open-source projects: