[ICCV2025] DyTo
October 24, 2025
A Training-Free Method for Zero-Shot Video Understanding
News
- (2025.06.29): Our paper has been accepted to ICCV 2025!
- (2024.12.15): Code has been released!
Overview
DyTo is a Dynamic Token merging framework for zero-shot video understanding that optimizes token efficiency while preserving scene details through hierarchical frame selection and bipartite token merging.
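To make the idea of bipartite token merging concrete, here is a minimal pure-Python sketch of generic bipartite soft matching: tokens are split alternately into two sets, each token in the first set is linked to its most similar token in the second, and the `r` strongest links are merged by averaging. This is an illustration under stated assumptions, not the DyTo implementation; the function names `cosine` and `bipartite_merge` and the fixed merge count `r` are hypothetical, and the dynamic, per-video choice of how much to merge is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bipartite_merge(tokens, r):
    """Merge up to r of the most similar (A, B) token pairs by averaging.

    tokens: list of feature vectors (lists of floats).
    r: requested number of merges, clamped to the size of set A.
    """
    # Alternate tokens into two sets: A (even positions) and B (odd positions).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx:
        return [list(t) for t in tokens]  # nothing to merge with
    r = max(0, min(r, len(a_idx)))

    # For each token in A, find its most similar partner in B.
    edges = []
    for i in a_idx:
        best = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        edges.append((cosine(tokens[i], tokens[best]), i, best))

    # Keep the r strongest edges; those A tokens are absorbed by their B partner.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged_into = {i: j for _, i, j in edges[:r]}

    # Collect, for each B token, itself plus every A token merged into it.
    groups = {j: [tokens[j]] for j in b_idx}
    for i, j in merged_into.items():
        groups[j].append(tokens[i])

    out = []
    for k, tok in enumerate(tokens):
        if k in merged_into:
            continue  # this A token now lives inside a B token
        members = groups.get(k, [tok])
        out.append([sum(col) / len(members) for col in zip(*members)])
    return out
```

Calling `bipartite_merge(tokens, r)` on a list of `n` vectors returns `n - r` vectors in their original order, provided `r` does not exceed the number of tokens in set A.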
Our paper: Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Quick Start
Environment
- CUDA 11.7
- Python 3.10.12+
- PyTorch 2.1.0+
Setup Guide
- Environment Setup
# Create and activate conda environment
conda create -n dyto python=3.10
conda activate dyto
# Install dependencies
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
apt-get update
apt-get install git-lfs
git-lfs install
- API Configuration
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
export OPENAI_ORG=$YOUR_OPENAI_ORG # Optional
- Model Download
# Get LLaVA-NeXT weights
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b
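Before launching an evaluation, it can help to confirm that the API variables from the API Configuration step are visible to Python. The following is a small sketch of such a check; the helper name `check_openai_env` is hypothetical and not part of this repository.

```python
import os

def check_openai_env(env=None):
    """Return a list of problems found in the OpenAI-related environment variables."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # OPENAI_ORG is optional, so its absence is not reported as a problem.
    return problems
```

An empty list means the required key is present; otherwise each entry describes what is missing.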
Data Setup
Ground Truth QA Files
The QA files for most datasets can be downloaded from here. For the Video-MME dataset, please download the QA files from here.
Prepare the QA files for the datasets you want to use with the script below. Example QA files are in the playground/gt_qa_files/ folder.
python scripts/data/prepare_${DATASET}_qa_file.py --qa_file $PATH_TO_CSV_FILE
Video Datasets
- Download directly from dataset providers:
Configuration
Key parameters in the YAML config:
- SCRIPT: Task selection
- DATA_DIR & CONV_MODE: Data paths and prompts
- NUM_FRAMES: Frame sampling count
- TEMPORAL_AGGREGATION: Dynamic Token Merge pathway settings
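As a quick sanity check before a run, the sketch below reports which of the parameters named above are absent from a loaded config. It assumes the config has already been parsed into a flat mapping with these names as top-level keys (for example with PyYAML's `yaml.safe_load`), which may not match the actual layout of this repository's config files; `DOCUMENTED_KEYS` and `missing_keys` are hypothetical names.

```python
# Parameter names taken from the list above; whether all of them are
# mandatory in every config is an assumption.
DOCUMENTED_KEYS = ("SCRIPT", "DATA_DIR", "CONV_MODE", "NUM_FRAMES", "TEMPORAL_AGGREGATION")

def missing_keys(config):
    """Return the documented keys that are absent from a loaded config mapping."""
    return [key for key in DOCUMENTED_KEYS if key not in config]
```

An empty result means every documented parameter is present at the top level.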
Running the Model
Evaluation
cd DYTO
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
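To evaluate several configs in sequence, the command above can be wrapped in a short driver. This is a sketch only: `inference_command` and `run_all` are hypothetical helpers, the config file names are placeholders, and the script is assumed to be started from the directory that contains `run_inference.py`.

```python
import subprocess
import sys

def inference_command(config_path):
    """Build the evaluation command for one config file."""
    return [sys.executable, "run_inference.py", "--exp_config", str(config_path)]

def run_all(config_paths, dry_run=True):
    """Run (or, by default, only print) the evaluation command for each config."""
    commands = [inference_command(path) for path in config_paths]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the batch at the first failing run.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` nothing is executed, which makes it easy to review the commands before committing GPU time; pass `dry_run=False` to launch the runs.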
Demo
python run_demo.py \
--video_path $PATH_TO_VIDEO \
--model_path $PATH_TO_YOUR_MODEL \
--question "Describe this video in details"
Output Structure
outputs/
├── artifacts/      # Inference outputs
├── eval_save_dir/  # GPT-3.5-turbo intermediate results
└── logs/           # Evaluation results
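After a run, a short check like the following can confirm that the three subdirectories listed above were created. It is a sketch that assumes the default `outputs/` root relative to the current directory; `missing_output_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Subdirectory names taken from the layout above.
EXPECTED_SUBDIRS = ("artifacts", "eval_save_dir", "logs")

def missing_output_dirs(root="outputs"):
    """Return the expected subdirectories that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all three directories exist under the given root.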
Citation
If you use the data, code, or model provided here in a publication, please cite our paper:
@InProceedings{Zhang_2025_ICCV,
author = {Zhang, Yiming and Zhao, Zhuokai and Chen, Zhaorun and Ding, Zenghui and Yang, Xianjun and Sun, Yining},
title = {Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {22046-22055}
}