Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

April 26, 2026

This is the official PyTorch implementation of the paper "Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration" (Accepted by CVPR 2026 Findings).

🚀 Overview

Multimodal Large Language Models (MLLMs) face a "less is more" paradox in Video-QA: feeding in more frames can degrade accuracy through context dilution while also incurring prohibitive token costs. We find that even state-of-the-art keyframe selectors produce prompts with significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This repository provides a plug-and-play post-processing framework that drastically reduces token consumption for any keyframe selector. It takes the initial keyframes chosen by an upstream baseline (e.g., VSLS, Uniform Sampling) and optimizes them through:

Adaptive Frame-Pruning (AFP): Consolidates redundant frames ('visual echoes') using multi-modal feature clustering.
Textual Semantic Graph: Compensates for potential visual information loss with a lightweight, LLM-generated semantic context.

AFP Framework

🔥 Highlights: The "Less is More" Phenomenon

Our approach reduces the number of required frames by up to 86.9% and total input tokens by up to 82.2%. Crucially, by providing a concise, high-quality set of frames, our method not only improves efficiency but often improves accuracy over baselines that use more frames. Validated on LongVideoBench and Video-MME with multiple state-of-the-art keyframe selectors, it consistently delivers high accuracy at a low frame count across diverse MLLMs. See the paper for the full quantitative analysis.

🛠️ Installation

# Clone the repository
git clone https://github.com/shaoguangwang/Adaptive-Frame-Pruning.git
cd Adaptive-Frame-Pruning

# Create conda environment
conda create -n afp_videoqa python=3.9 -y
conda activate afp_videoqa

# Install PyTorch via pip (recommended — much faster than conda's solver)
# Adjust the CUDA tag (cu121 / cu118 / cpu) to match your environment.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install remaining requirements
pip install -r requirements.txt

Note on PyTorch installation. We recommend pip over conda install pytorch ... because the classic conda solver can take 30+ minutes (or hang) when resolving the pytorch channel. If you prefer conda, switch to the libmamba solver first:
conda install -n base conda-libmamba-solver -y
conda config --set solver libmamba
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
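After either installation route, a quick sanity check can confirm that the packages are visible to the active interpreter. The helper below is a minimal sketch; `check_environment` is a hypothetical name and is not part of this repository. It only uses the standard library to look the packages up, and imports `torch` lazily so the check itself still runs if the install failed.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "torchaudio")):
    """Report which of the given packages are importable in this interpreter."""
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    if report.get("torch"):
        import torch  # lazy import: only attempted when PyTorch is present

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Running `print(check_environment())` inside the `afp_videoqa` environment should list every package as `True`; whether `cuda_available` is `True` depends on the wheel you chose and on your driver.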

📁 Data Preparation

Because our method acts as a universal refinement module, it is dataset-agnostic. You only need to prepare your data (including the initial keyframes selected by your upstream baseline) in JSON format. Please refer to example_input.json in our repository for the expected format:

[
  {
    "video_id": "fFjv93ACGo8",
    "video_path": "/path/to/your/video.mp4",
    "question": "When demonstrating the Germany modern Christmas tree is initially decorated with apples, candles and berries, which kind of the decoration has the largest number?",
    "options": "A) Apples.\nB) Candles.\nC) Berries.\nD) The three kinds are of the same number.",
    "answer": "C",
    "keyframe_timestamps": [10, 25, 26, 27, 40, 55, 56, 80]
  }
]

Note: keyframe_timestamps represents the timestamps (in frames or seconds, depending on your extraction logic) of the keyframes initially selected by your baseline.
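Before running the pipeline it can help to check that your file matches this layout. The following is a minimal sketch, not a script shipped with the repository: `validate_items` and `load_and_validate` are hypothetical helpers, and treating all six keys from the example as required is an assumption based only on the sample above.

```python
import json

# Assumption: every key shown in example_input.json is required.
REQUIRED_KEYS = (
    "video_id", "video_path", "question",
    "options", "answer", "keyframe_timestamps",
)


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_items(items):
    """Return a list of human-readable problems found in the input items."""
    problems = []
    for i, item in enumerate(items):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            problems.append(f"item {i}: missing keys {missing}")
            continue
        ts = item["keyframe_timestamps"]
        if not isinstance(ts, list) or not ts or not all(_is_number(t) for t in ts):
            problems.append(
                f"item {i}: keyframe_timestamps must be a non-empty list of numbers"
            )
    return problems


def load_and_validate(path):
    """Load the JSON file and return (items, problems)."""
    with open(path, "r", encoding="utf-8") as f:
        items = json.load(f)
    return items, validate_items(items)
```

For example, `items, problems = load_and_validate("./example_input.json")` returns an empty `problems` list when every entry has the expected keys.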

🏃 Pipeline & Quick Start

Our pipeline consists of three simple steps: Pruning, Graph Generation, and Downstream QA.

Before you start: Open example_input.json and replace the placeholder "video_path": "/path/to/your/video.mp4" with the actual path to a local video file on your machine. The script reads frames directly from this file.

Step 1: Adaptive Frame-Pruning (AFP)

Use cluster.py to prune redundant frames. It extracts fused ResNet50 + CLIP features and performs adaptive clustering.

python scripts/cluster.py \
    --input_json ./example_input.json \
    --output_dir ./data/afp_output \
    --num 32  # Number of initial keyframes to process

Output: Generates a JSON file at ./data/afp_output/example_input_cluster_centroid_32keyframes.json. The output filename follows the pattern {input_basename}_cluster_centroid_{num}keyframes.json, so it changes if you pass a different --num. Each entry gains a new key "frame_cluster" with the pruned timestamps.
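To build intuition for what consolidating 'visual echoes' means, here is a toy sketch in pure Python. It is not the algorithm in scripts/cluster.py: the greedy grouping of consecutive frames, the fixed `threshold`, and the function names are all assumptions made for illustration, and the 2-D vectors stand in for the fused ResNet50 + CLIP features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_visual_echoes(timestamps, features, threshold=0.9):
    """Group consecutive similar frames and keep one representative per group.

    Illustrative only: `threshold` and the greedy rule are assumptions,
    not the adaptive clustering used by the repository.
    """
    if not timestamps:
        return []
    groups = [[0]]
    for i in range(1, len(timestamps)):
        # Extend the current group while a frame resembles its predecessor.
        if cosine(features[i - 1], features[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])

    kept = []
    for group in groups:
        dim = len(features[group[0]])
        centroid = [sum(features[i][d] for i in group) / len(group) for d in range(dim)]
        # Keep the member most similar to the group centroid.
        best = max(group, key=lambda i: cosine(features[i], centroid))
        kept.append(timestamps[best])
    return kept


timestamps = [10, 25, 26, 27, 40, 55, 56, 80]
toy_features = [
    [1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0],
    [1.0, 1.0], [-1.0, 0.0], [-1.0, 0.0], [0.0, -1.0],
]
pruned = prune_visual_echoes(timestamps, toy_features)
reduction = 1 - len(pruned) / len(timestamps)
print(pruned, f"{reduction:.1%}")  # [10, 26, 40, 55, 80] 37.5%
```

With these made-up features, the near-duplicate runs 25/26/27 and 55/56 each collapse to a single frame, so 8 keyframes become 5. The real script writes its result to the "frame_cluster" key instead.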

Step 2: Semantic Graph Generation

For this step we release only the prompt template rather than a runnable client, so you can plug in any LLM you have access to (OpenAI API, Azure, Claude, local vLLM/Ollama, etc.). The prompt template — together with the expected output JSON schema — is documented in:

📄 prompts/semantic_graph_prompt.md

Recommended workflow:

1. For each item in the JSON produced by Step 1, fill the prompt's {question} and {options} placeholders with the item's question and options.
2. Send the prompt to a capable LLM (we used GPT-4o).
3. Parse the LLM output into the schema described in prompts/semantic_graph_prompt.md and append "grounding_objects" and "graph" to each item.

Output: The final JSON will now include "grounding_objects" and "graph" containing the semantic relations, ready for Step 3.
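The workflow above can be sketched as a small driver loop. Everything here is an assumption for illustration: `call_llm` is a hypothetical function you supply (prompt string in, response string out), and the sketch assumes the LLM replies with a JSON object that already contains both keys. Check prompts/semantic_graph_prompt.md for the actual schema and adapt the parsing accordingly.

```python
import json


def fill_prompt(template, item):
    """Substitute the {question} and {options} placeholders for one item."""
    # str.replace is used rather than str.format so that any other literal
    # braces in the template (e.g. a JSON schema) are left untouched.
    return (
        template
        .replace("{question}", item["question"])
        .replace("{options}", item["options"])
    )


def add_semantic_graphs(items, template, call_llm):
    """Attach "grounding_objects" and "graph" to every item, in place.

    `call_llm` is a hypothetical user-supplied callable: str -> str.
    Assumption: its response is a JSON object holding both keys.
    """
    for item in items:
        response = call_llm(fill_prompt(template, item))
        parsed = json.loads(response)
        item["grounding_objects"] = parsed["grounding_objects"]
        item["graph"] = parsed["graph"]
    return items
```

Passing the client in as a plain callable keeps the loop independent of any particular provider, so the same code works with a hosted API or a local model. In practice you would likely add retries and handling for responses that are not valid JSON.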

Step 3: Downstream Video-QA (MLLM Inference)

With the pruned frames and the semantic graph ready, you can feed them into your preferred MLLM (e.g., GPT-4o, Qwen2.5-VL, LLaVA-Video). To reproduce our results, please use exactly the following prompt template to construct your MLLM input:

Select the best answer to the following multiple-choice question based on the video.
<image>
<image>
...
[Here is an additional semantic graph context for this Video-QA]
Objects in video context: <Objects>
Relationships between objects: <Relations>

Question: <Question>
Options: <Options>
Answer with the option’s letter from the given choices directly.
Your response format should be strictly an upper case letter A,B,C,D or E.

Note: Replace each <image> placeholder with an actual image extracted at a timestamp from "frame_cluster", and replace <Objects> and <Relations> with the data parsed from the "graph" generated in Step 2.
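As a convenience, the template can be assembled programmatically. This is a minimal sketch under stated assumptions: `build_qa_prompt` is a hypothetical helper, and `objects` and `relations` are taken as already-formatted strings because how you serialize the "graph" field depends on the schema in prompts/semantic_graph_prompt.md. How the `<image>` placeholders are bound to actual frames depends on your MLLM's API.

```python
def build_qa_prompt(num_frames, objects, relations, question, options):
    """Assemble the text prompt with one <image> placeholder per pruned frame."""
    lines = [
        "Select the best answer to the following multiple-choice question "
        "based on the video."
    ]
    # One placeholder per timestamp in "frame_cluster".
    lines += ["<image>"] * num_frames
    lines += [
        "[Here is an additional semantic graph context for this Video-QA]",
        f"Objects in video context: {objects}",
        f"Relationships between objects: {relations}",
        "",
        f"Question: {question}",
        f"Options: {options}",
        "Answer with the option’s letter from the given choices directly.",
        "Your response format should be strictly an upper case letter A,B,C,D or E.",
    ]
    return "\n".join(lines)
```

A typical call passes `len(item["frame_cluster"])` as `num_frames` together with the item's `question` and `options`.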

📑 Citation

If you find our work helpful, please consider citing:

@article{wang2025less,
  title={Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration},
  author={Wang, Shaoguang and Guo, Weiyu and Chen, Ziyang and Xu, Yijie and Hu, Xuming and Xiong, Hui},
  journal={arXiv preprint arXiv:2508.03337},
  year={2025}
}

📜 License

This project is released under the MIT License.

✉️ Contact

For any questions, please contact Shaoguang Wang at shaoguangwang9@gmail.com.