D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition

March 18, 2026

A training-free framework that adapts image-pretrained VLMs to video understanding, reaching state-of-the-art results on 7 benchmarks through dynamic compression and question decomposition, with no fine-tuning required.

EMNLP 2025 · arXiv Paper · Project Page · Code · License · Python · PyTorch

Key Results

D-CoDe achieves state-of-the-art performance across 7 video understanding benchmarks — all without any training.

Multiple-Choice VideoQA (↑ higher is better)

| Method | NExT-QA | EgoSchema | IntentQA |
| --- | --- | --- | --- |
| SF-LLaVA | 64.2 | 47.2 | 60.1 |
| TS-LLaVA | 66.5 | 50.2 | 61.7 |
| D-CoDe | 68.3 | 58.0 | 64.2 |

Open-Ended VideoQA — Accuracy (↑ higher is better)

| Method | MSVD | MSRVTT | TGIF | ANet |
| --- | --- | --- | --- | --- |
| SF-LLaVA | 79.1 | 65.8 | 78.7 | 55.5 |
| TS-LLaVA | 79.0 | 65.1 | 77.7 | 56.7 |
| D-CoDe | 80.0 | 64.2 | 79.1 | 56.4 |

Highlight: On the challenging long-video benchmark EgoSchema, D-CoDe achieves 58.0% accuracy, an improvement of +7.8 points over the previous best training-free method (TS-LLaVA, 50.2%).

Quick Start

```python
from Dcode import generate_subquestions, supp_frame_selection, token_select_and_merge, load_clip_model

# 1. Question decomposition (requires the OPENAI_API_KEY environment variable)
subquestions = generate_subquestions(
    question="What did the person do after picking up the cup?",
    prompt_variant="original"
)

# 2. Frame selection (based on semantic diversity)
clip_processor, clip_model = load_clip_model()
selected_frames, frame_idxs = supp_frame_selection(
    video_frames,           # list of PIL Images
    N=15,                   # number of frames to select
    uniform_ratio=0.85,     # ratio of frames taken by uniform sampling
    clip_model=clip_model,
    clip_processor=clip_processor
)

# 3. Token selection and merging
merged_features = token_select_and_merge(
    image_features,                  # tensor of shape (T, N, D)
    top_k=288,                       # tokens to keep per frame
    merge_strategy="mean",           # options: "mean", "max", "weighted_mean"
    similarity_threshold=0.8         # similarity threshold for merging
)
```
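For intuition, the token step can be pictured as a two-stage reduction over one frame's tokens: keep the `top_k` most salient tokens, then merge tokens whose cosine similarity exceeds the threshold. The sketch below is an illustrative NumPy approximation, not the repository's implementation; in particular, saliency is approximated here by token L2 norm, which is an assumption.

```python
import numpy as np

def sketch_select_and_merge(tokens, top_k=6, sim_threshold=0.8):
    """Illustrative reduction of one frame's tokens, shape (N, D).

    1. Keep the top_k tokens with the largest L2 norm (a stand-in
       for a real saliency score).
    2. Greedily merge each kept token into an earlier cluster when
       their cosine similarity exceeds sim_threshold (running mean).
    """
    norms = np.linalg.norm(tokens, axis=1)
    keep = np.argsort(norms)[::-1][:top_k]   # indices of most salient tokens
    kept = tokens[np.sort(keep)]             # preserve temporal/spatial order

    merged, counts = [], []
    for tok in kept:
        placed = False
        for i, rep in enumerate(merged):
            cos = tok @ rep / (np.linalg.norm(tok) * np.linalg.norm(rep) + 1e-8)
            if cos > sim_threshold:
                # running mean of all tokens assigned to this cluster
                merged[i] = (rep * counts[i] + tok) / (counts[i] + 1)
                counts[i] += 1
                placed = True
                break
        if not placed:
            merged.append(tok.copy())
            counts.append(1)
    return np.stack(merged)

frame_tokens = np.random.default_rng(0).normal(size=(16, 8))
reduced = sketch_select_and_merge(frame_tokens, top_k=6, sim_threshold=0.8)
print(reduced.shape)  # at most (6, 8); fewer rows when tokens merge
```

The running-mean merge mirrors the `merge_strategy="mean"` option above; near-duplicate tokens collapse into one representative, which is what cuts per-frame token count below `top_k`.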

Run Full Evaluation

```bash
# Multiple-choice VideoQA
bash scripts/run_eval_egoschema.sh
bash scripts/run_eval_nextqa.sh
bash scripts/run_eval_intentqa.sh

# Open-ended VideoQA
bash scripts/run_eval_msvd.sh
bash scripts/run_eval_msrvtt.sh
bash scripts/run_eval_tgif.sh
bash scripts/run_eval_activitynet.sh
```

Installation

```bash
conda create -n d_code python=3.10.12
conda activate d_code
bash setup_env.sh
```

Set up your OpenAI API key for question decomposition:

```bash
export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
```
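Question decomposition is the only step that calls an external LLM. As a rough sketch of how the request payload and reply parsing could be wired up: the prompt wording, numbering convention, and helper names below are assumptions for illustration, not the actual prompt in `Dcode.py`.

```python
import re

def build_decomposition_prompt(question: str) -> list[dict]:
    """Build a chat payload asking an LLM to split a video question
    into simpler sub-questions (prompt wording is illustrative)."""
    system = ("You decompose a video question into 2-3 simpler "
              "sub-questions, one per line, numbered '1.', '2.', ...")
    return [{"role": "system", "content": system},
            {"role": "user", "content": question}]

def parse_subquestions(reply: str) -> list[str]:
    """Extract numbered sub-questions from a model reply."""
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+\.\s*(.+)$", reply, re.MULTILINE)]

reply = "1. Who picked up the cup?\n2. What did they do next?"
print(parse_subquestions(reply))
# ['Who picked up the cup?', 'What did they do next?']
```

Each parsed sub-question is then answered against the compressed video representation, which is why the efficiency table below attributes most of the added latency to this stage.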

Download pre-trained LLaVA-NeXT weights:

```bash
git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
```

Data Preparation


Ground-Truth QA Files

GT question and answer CSV files are already included in playground/gt_qa_files: MSVD-QA, MSRVTT-QA, TGIF-QA, ActivityNet-QA, NExT-QA, EgoSchema, IntentQA.

Download Raw Videos

Expected Directory Structure

```
playground/data/
├── video_qa/
│   ├── MSVD_Zero_Shot_QA/videos/
│   ├── MSRVTT_Zero_Shot_QA/videos/all/
│   ├── TGIF_Zero_Shot_QA/mp4/
│   └── Activitynet_Zero_Shot_QA/all_test/
└── multiple_choice_qa/
    ├── NExTQA/video/
    ├── EgoSchema/video/
    └── IntentQA/video/
```
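Before launching the evaluation scripts, it can save time to verify the layout above is in place. This small sanity check is not part of the repository; it simply tests for the expected sub-directories:

```python
from pathlib import Path

# Expected sub-directories under playground/data (from the tree above).
EXPECTED = [
    "video_qa/MSVD_Zero_Shot_QA/videos",
    "video_qa/MSRVTT_Zero_Shot_QA/videos/all",
    "video_qa/TGIF_Zero_Shot_QA/mp4",
    "video_qa/Activitynet_Zero_Shot_QA/all_test",
    "multiple_choice_qa/NExTQA/video",
    "multiple_choice_qa/EgoSchema/video",
    "multiple_choice_qa/IntentQA/video",
]

def missing_dirs(root="playground/data"):
    """Return the expected sub-directories that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED if not (base / d).is_dir()]

if __name__ == "__main__":
    missing = missing_dirs()
    print("all directories present" if not missing else f"missing: {missing}")
```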

Detailed Results

Module Ablation (EgoSchema)

| Module | Acc. (%) |
| --- | --- |
| Baseline | 44.8 |
| + Dynamic Spatial Token Compression | 50.6 |
| + Dynamic Temporal Frame Selection | 51.8 |
| + Question Decomposition | 58.0 |

Full Module Ablation (All Benchmarks; open-ended columns report accuracy/score)

| Module | NExT-QA | IntentQA | MSVD | MSRVTT | TGIF | ANet |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 65.4 | 61.3 | 77.8/4.0 | 62.8/3.5 | 76.9/4.0 | 54.2/3.3 |
| + Spatial Compression | 66.7 | 62.2 | 79.4/4.0 | 63.6/3.5 | 78.9/4.1 | 55.4/3.3 |
| + Temporal Selection | 67.0 | 62.9 | 80.0/4.1 | 64.2/3.5 | 79.1/4.1 | 56.4/3.4 |
| + Question Decomposition | 68.3 | 64.2 | 72.4/3.8 | 62.2/3.5 | 75.7/4.0 | 53.8/3.3 |

Efficiency Analysis (EgoSchema)

| Module | Acc. (%) | s/sample |
| --- | --- | --- |
| Baseline | 44.8 | 3.927 |
| + Dynamic Compression | 51.8 | 6.115 |
| + Question Decomposition | 58.0 | 37.395 |

Core Components

The core implementation is in `Dcode.py`:

| Function | Description | Paper Method |
| --- | --- | --- |
| `generate_subquestions()` | Decomposes a question into sub-questions using GPT-3.5 | Question Decomposition |
| `supp_frame_selection()` | Selects frames based on CLIP semantic similarity | Dynamic Compression (frame) |
| `token_select_and_merge()` | Selects and merges visual tokens to reduce redundancy | Dynamic Compression (token) |
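The frame-selection idea can be approximated as: spend most of the budget on uniform temporal sampling, then fill the remainder with frames whose features are least similar to anything already picked. The sketch below is an illustrative approximation with random vectors standing in for CLIP features; the repository's actual scoring may differ.

```python
import numpy as np

def sketch_frame_selection(feats, n_select=15, uniform_ratio=0.85):
    """Pick n_select frame indices from per-frame feature vectors (T, D).

    Most of the budget goes to uniform temporal sampling; the rest is
    filled greedily with frames maximally dissimilar (in cosine) to
    the frames already picked.
    """
    T = feats.shape[0]
    n_uniform = int(round(n_select * uniform_ratio))
    picked = list(np.linspace(0, T - 1, n_uniform).round().astype(int))

    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    while len(picked) < n_select:
        # For each frame, its max cosine similarity to any picked frame.
        sim_to_picked = (normed @ normed[picked].T).max(axis=1)
        sim_to_picked[picked] = np.inf        # never re-pick a frame
        picked.append(int(sim_to_picked.argmin()))
    return sorted(set(picked))

feats = np.random.default_rng(0).normal(size=(60, 16))
idxs = sketch_frame_selection(feats, n_select=15, uniform_ratio=0.85)
print(len(idxs))
```

The `uniform_ratio=0.85` default mirrors the Quick Start example: uniform sampling preserves temporal coverage, while the dissimilarity-driven remainder recovers visually distinctive moments that a fixed stride would miss.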

Acknowledgement

We extend our gratitude to the following projects: LLaVA, IG-VLM, Video-LLaVA, SF-LLaVA and TS-LLaVA.

Citation

If you find this work useful, please cite our paper:

```bibtex
@inproceedings{huang-etal-2025-code,
    title = "{D}-{C}o{D}e: Scaling Image-Pretrained {VLM}s to Video via Dynamic Compression and Question Decomposition",
    author = "Huang, Yiyang  and
      Wang, Yizhou  and
      Fu, Yun",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    pages = "11798--11811",
}
```

arXiv version:

```bibtex
@article{huang2025d,
    title={D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition},
    author={Huang, Yiyang and Wang, Yizhou and Fu, Yun},
    journal={arXiv preprint arXiv:2510.08818},
    year={2025}
}
```

License

This project is released under the Apache 2.0 License.