Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
November 20, 2025
This repository contains the official implementation of the paper "Sparse-Dense Side-Tuner for efficient Video Temporal Grounding".
Overview
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning, and particularly side-tuning (ST), has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention, a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound impact on performance. Overall, our method significantly improves on existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. existing SOTA methods.

Data preparation
Download the pre-extracted features from here, then update the data path in the Docker container initialization command below.
Installation
Docker Setup
To set up the environment using Docker, follow these steps:
- Build the Docker Image:
docker build -t sdst_image:latest .
- Run the Docker Container:
docker run --gpus 'all' -it --rm --shm-size 200gb -v ./:/SDST -v ./model_results:/SDST/model_results -v <path_to_data>:/data sdst_image
Replace <path_to_data> with the path where you saved the data (see Data preparation above).
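A mistyped data path is easy to miss, because `docker run -v` creates a missing host directory instead of failing, so the container would start with an empty `/data`. The following is a minimal sketch of a wrapper that checks the directory first. The function name `sdst_docker_run` and the `DRY_RUN` variable are hypothetical and not part of this repository; the Docker flags are copied from the command above.

```shell
# Hypothetical helper (not part of the repository): start the SDST container
# for a given data directory, refusing to run if the directory is missing.
sdst_docker_run() {
  data_dir="$1"
  if [ ! -d "$data_dir" ]; then
    echo "error: data directory not found: $data_dir" >&2
    return 1
  fi
  # Same flags as the documented command; only the data mount is filled in.
  set -- docker run --gpus all -it --rm --shm-size 200gb \
    -v ./:/SDST -v ./model_results:/SDST/model_results \
    -v "$data_dir:/data" sdst_image
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Dry run: print the command instead of executing it.
    echo "$*"
  else
    "$@"
  fi
}
```

With `DRY_RUN=1` set, calling `sdst_docker_run /path/to/features` only prints the command, which is a cheap way to confirm the mount before starting the container.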
Installing additional dependencies
Some additional dependencies, such as RoiAlign, must be built separately. See the original repository for more details.
cd models/ops; python setup.py build_ext --inplace; cd ../..
Training from scratch
To train the model from scratch, run the following command, replacing CONFIG_PATH with the path to your desired experiment configuration file:
python tools/launch.py ./configs/CONFIG_PATH --exp_name <experiment_name>
Concretely, to train on QVHighlights:
python tools/launch.py ./configs/qvhighlights/sdst_qvhighlights.py --exp_name debug
To train on Charades-STA:
python tools/launch.py ./configs/charades/sdst_charades.py --exp_name debug
Or to train on TACoS:
python tools/launch.py ./configs/tacos/sdst_tacos.py --exp_name debug
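The three training commands follow the same pattern, `./configs/<dataset>/sdst_<dataset>.py`, so they can be generated from the dataset name. The sketch below only prints the commands; the function name `sdst_train_cmd` and the `debug_<dataset>` experiment names are illustrative assumptions, not something the repository provides.

```shell
# Hypothetical helper: print the training command for one of the three
# datasets listed above, using the config layout shown in this README.
sdst_train_cmd() {
  dataset="$1"
  exp_name="$2"
  case "$dataset" in
    qvhighlights|charades|tacos) ;;
    *) echo "error: unknown dataset: $dataset" >&2; return 1 ;;
  esac
  echo "python tools/launch.py ./configs/${dataset}/sdst_${dataset}.py --exp_name ${exp_name}"
}

# Print one training command per dataset, each with its own experiment name.
for ds in qvhighlights charades tacos; do
  sdst_train_cmd "$ds" "debug_${ds}"
done
```

Piping the printed lines into `sh` from the repository root would run the three trainings one after another.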
Evaluate
To evaluate the performance of a given model, run the following command:
python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --eval
For QVHighlights:
python tools/launch.py configs/qvhighlights/sdst_qvhighlights.py --checkpoint /SDST/checkpoints_sdst/checkpoint_qvhighlights.pth --eval
For Charades-STA:
python tools/launch.py configs/charades/sdst_charades.py --checkpoint /SDST/checkpoints_sdst/checkpoint_charades_sta.pth --eval
For TACoS:
python tools/launch.py configs/tacos/sdst_tacos.py --checkpoint /SDST/checkpoints_sdst/checkpoint_tacos.pth --eval
Generate a submission
To generate a submission given a trained model, run the following command:
python tools/launch.py <path-to-config> --checkpoint <path-to-checkpoint> --dump
For instance, to do so for QVHighlights:
python tools/launch.py configs/qvhighlights/sdst_qvhighlights.py --checkpoint /SDST/checkpoints_sdst/checkpoint_qvhighlights.pth --dump
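The evaluation and submission commands differ only in the final flag (`--eval` or `--dump`), and the Charades-STA checkpoint file is named `checkpoint_charades_sta.pth` even though its config lives under `configs/charades`. A minimal sketch that encodes this mapping is shown below. The function name `sdst_ckpt_cmd` and the `CKPT_DIR` variable are hypothetical; the default directory and file names are taken from the commands in this README. The README only shows `--dump` for QVHighlights, so using it with the other datasets is an untested assumption.

```shell
# Hypothetical helper: print the evaluation (eval) or submission (dump)
# command for a dataset, using the checkpoint names shown in this README.
sdst_ckpt_cmd() {
  dataset="$1"
  mode="$2"
  case "$dataset" in
    qvhighlights) ckpt="checkpoint_qvhighlights.pth" ;;
    charades)     ckpt="checkpoint_charades_sta.pth" ;;
    tacos)        ckpt="checkpoint_tacos.pth" ;;
    *) echo "error: unknown dataset: $dataset" >&2; return 1 ;;
  esac
  case "$mode" in
    eval|dump) ;;
    *) echo "error: mode must be 'eval' or 'dump'" >&2; return 1 ;;
  esac
  # CKPT_DIR is an assumed override; the default matches the paths above.
  echo "python tools/launch.py configs/${dataset}/sdst_${dataset}.py --checkpoint ${CKPT_DIR:-/SDST/checkpoints_sdst}/${ckpt} --${mode}"
}
```

For example, `sdst_ckpt_cmd charades eval` prints the Charades-STA evaluation command, and setting `CKPT_DIR` points the same call at checkpoints stored elsewhere.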
Contact
For any questions or inquiries, please contact david dot pujolperich at gmail dot com
Acknowledgments
This implementation is based on the excellent work of R2-Tuning.
Citation
If you find this work useful, please cite our paper:
@inproceedings{pujol2025sparse,
title={Sparse-dense side-tuner for efficient video temporal grounding},
author={Pujol-Perich, David and Escalera, Sergio and Clap{\'e}s, Albert},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={21515--21524},
year={2025}
}