🎥 [ICCV2025] DyTo

October 24, 2025

A Training-Free Method for Zero-Shot Video Understanding

📣 News

  • (2025.06.29): ✨Our paper has been accepted to ICCV 2025❗️
  • (2024.12.15): ✨Code has been released❗️

📖 Overview

DyTo (Dynamic Token merging) is a training-free framework for zero-shot video understanding. It reduces the number of visual tokens while preserving scene details by combining hierarchical frame selection with bipartite token merging.

Our paper: Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
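
The bipartite token merging named in the overview can be illustrated with a generic sketch in the style of ToMe's bipartite soft matching. This is not DyTo's implementation: the names `cosine` and `bipartite_merge`, the alternating split into two sets, and the plain averaging are assumptions chosen for illustration, and the decision of how many tokens to merge is left to the caller through `r`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bipartite_merge(tokens, r):
    """Merge up to r tokens by bipartite matching (illustrative sketch only)."""
    # Alternate tokens into two disjoint sets A and B.
    set_a, set_b = tokens[0::2], tokens[1::2]
    if not set_b:
        return [list(t) for t in tokens]
    # For every token in A, find its most similar partner in B.
    edges = []
    for i, a in enumerate(set_a):
        sims = [cosine(a, b) for b in set_b]
        j = max(range(len(set_b)), key=sims.__getitem__)
        edges.append((sims[j], i, j))
    # Keep the r strongest edges; those A tokens are merged away.
    edges.sort(reverse=True)
    merged = edges[: max(0, min(r, len(edges)))]
    merged_a = {i for _, i, _ in merged}
    # Average each B token with the A tokens assigned to it.
    groups = {j: [set_b[j]] for j in range(len(set_b))}
    for _, i, j in merged:
        groups[j].append(set_a[i])
    new_b = [
        [sum(col) / len(groups[j]) for col in zip(*groups[j])]
        for j in range(len(set_b))
    ]
    # Unmerged A tokens come first, then the (possibly averaged) B tokens.
    kept_a = [list(a) for i, a in enumerate(set_a) if i not in merged_a]
    return kept_a + new_b
```

Calling `bipartite_merge(tokens, r)` on `n` tokens returns `n - min(r, len(tokens[0::2]))` tokens. The output order is unmerged A tokens followed by B tokens, so a real pipeline that depends on token order would need to track the original positions.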

🚀 Quick Start

Environment

  • CUDA 11.7
  • Python 3.10.12+
  • PyTorch 2.1.0+

Setup Guide

  1. Environment Setup
# Create and activate conda environment
conda create -n dyto python=3.10
conda activate dyto

# Install dependencies
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir

apt-get update
apt-get install git-lfs
git-lfs install
  2. API Configuration
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
export OPENAI_ORG=$YOUR_OPENAI_ORG  # Optional
  3. Model Download
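# Get LLaVA-NeXT weights (requires Git LFS, installed in the environment setup step;
# `git lfs clone` is deprecated in favor of plain `git clone`)
git clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
git clone https://huggingface.co/liuhaotian/llava-v1.6-34b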
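
Before launching an evaluation, it can help to confirm that the API variables from the configuration step are visible to Python. The following is a minimal sketch; the helper name `check_openai_env` is hypothetical and not part of this repository.

```python
import os


def check_openai_env(env=None):
    """Return a list of problems with the OpenAI environment variables."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    # OPENAI_ORG is marked optional in the setup guide, so it is not checked.
    return problems


if __name__ == "__main__":
    for problem in check_openai_env():
        print(problem)
```

An empty result means the required key is present; it does not verify that the key is valid.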

📊 Data Setup

Ground Truth QA Files

The QA files for most datasets can be downloaded from here. For the VideoMME dataset, please download the QA files from here.

Prepare the QA files for the datasets you want to use. An example QA file is in the playground/gt_qa_files/ folder.

python scripts/data/prepare_${DATASET}_qa_file.py --qa_file $PATH_TO_CSV_FILE

Video Datasets

⚙️ Configuration

Key parameters in the YAML config:

  • SCRIPT: Task selection
  • DATA_DIR & CONV_MODE: Data paths and prompts
  • NUM_FRAMES: Frame sampling count
  • TEMPORAL_AGGREGATION: Dynamic Token Merge pathway settings
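
A quick sanity check on a loaded config can catch a missing parameter before a long run. The sketch below assumes the config has already been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`) and that the parameters listed above are top-level keys that must all be present; both points are assumptions, and `missing_keys` is a hypothetical helper rather than part of this repository.

```python
# Key names come from the parameter list above; treating all of them as
# required, top-level keys is an assumption made for this sketch.
EXPECTED_KEYS = ("SCRIPT", "DATA_DIR", "CONV_MODE", "NUM_FRAMES", "TEMPORAL_AGGREGATION")


def missing_keys(config):
    """Return the expected keys that are absent from a loaded config dict."""
    return [key for key in EXPECTED_KEYS if key not in config]


if __name__ == "__main__":
    # Placeholder values only; real values depend on the task and dataset.
    example = {"SCRIPT": "<task>", "DATA_DIR": "<path>", "NUM_FRAMES": 8}
    print(missing_keys(example))
```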

🔄 Running the Model

Evaluation

cd DYTO
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE

Demo

python run_demo.py \
    --video_path $PATH_TO_VIDEO \
    --model_path $PATH_TO_YOUR_MODEL \
    --question "Describe this video in detail"

📂 Output Structure

outputs/
├── artifacts/      # Inference outputs
├── eval_save_dir/  # GPT-3.5-turbo intermediate results
└── logs/           # Evaluation results
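
After a run, the layout above can be inspected with a few lines of Python. This is a minimal sketch that only relies on the three directory names shown; the helper name `summarize_outputs` is hypothetical, and the file formats inside each directory are not assumed.

```python
from pathlib import Path

# Subdirectory names taken from the output tree above.
EXPECTED_SUBDIRS = ("artifacts", "eval_save_dir", "logs")


def summarize_outputs(root):
    """Map each expected subdirectory to its file count, or None if it is missing."""
    root = Path(root)
    summary = {}
    for name in EXPECTED_SUBDIRS:
        sub = root / name
        if sub.is_dir():
            # Count regular files at any depth below the subdirectory.
            summary[name] = sum(1 for p in sub.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


if __name__ == "__main__":
    for name, count in summarize_outputs("outputs").items():
        print(f"{name}: {'missing' if count is None else count}")
```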

📚 Citation

If you use the data, code, or model provided here in a publication, please cite our paper:

@InProceedings{Zhang_2025_ICCV,
    author    = {Zhang, Yiming and Zhao, Zhuokai and Chen, Zhaorun and Ding, Zenghui and Yang, Xianjun and Sun, Yining},
    title     = {Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {22046-22055}
}