Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
April 26, 2026
This is the official PyTorch implementation of the paper "Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration" (Accepted by CVPR 2026 Findings).
Overview
Multimodal Large Language Models (MLLMs) face a "less is more" paradox in Video-QA: excessive frames can degrade performance due to context dilution while also incurring prohibitive token costs. We identify that even state-of-the-art keyframe selectors produce prompts suffering from significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. Our repository provides a plug-and-play post-processing framework designed to drastically reduce token consumption for any keyframe selector. It takes the initial keyframes selected by any upstream baseline (e.g., VSLS, Uniform Sampling) and optimizes them through:
- Adaptive Frame-Pruning (AFP): Consolidates redundant frames ('visual echoes') using multi-modal feature clustering.
- Textual Semantic Graph: Compensates for potential visual information loss with a lightweight, LLM-generated semantic context.
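To make the idea of consolidating 'visual echoes' concrete, here is a minimal sketch; it is not the repository's implementation. It assumes each keyframe is already represented by a feature vector (toy 2-D vectors stand in for the fused features) and greedily groups consecutive frames whose cosine similarity to the first frame of the current group exceeds a threshold, keeping one representative per group. The helper name `prune_echoes` and the threshold value are illustrative assumptions, and the adaptive clustering used in the paper may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def prune_echoes(timestamps, features, threshold=0.95):
    """Greedy grouping of consecutive frames (hypothetical helper).

    A frame joins the current group when it is similar enough to the
    group's first frame; otherwise it starts a new group. The middle
    frame of each group is kept as its representative.
    """
    groups = []
    for ts, feat in zip(timestamps, features):
        if groups and cosine(groups[-1][0][1], feat) >= threshold:
            groups[-1].append((ts, feat))
        else:
            groups.append([(ts, feat)])
    return [g[len(g) // 2][0] for g in groups]

# Toy features (invented): frames 25-27 and 55-56 are near-duplicates.
timestamps = [10, 25, 26, 27, 40, 55, 56, 80]
features = [
    (1.0, 0.0), (0.0, 1.0), (0.01, 1.0), (0.02, 1.0),
    (1.0, 1.0), (-1.0, 0.5), (-1.0, 0.52), (0.5, -1.0),
]
kept = prune_echoes(timestamps, features)
print(kept)
```

In this toy run the eight initial timestamps collapse to five, one per group of near-duplicate frames.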
Highlights: The "Less is More" Phenomenon
Our approach reduces the number of required frames by up to 86.9% and total input tokens by up to 82.2%. Crucially, by providing a concise, high-quality set of frames, our method not only improves efficiency but often improves accuracy over baselines that use more frames. Extensively validated on LongVideoBench and Video-MME with multiple state-of-the-art keyframe selectors, our method consistently delivers high accuracy at a low frame count across diverse MLLMs. See the paper for the full quantitative analysis.
Installation
# Clone the repository
git clone https://github.com/shaoguangwang/Adaptive-Frame-Pruning.git
cd Adaptive-Frame-Pruning
# Create conda environment
conda create -n afp_videoqa python=3.9 -y
conda activate afp_videoqa
# Install PyTorch via pip (recommended; much faster than conda's solver)
# Adjust the CUDA tag (cu121 / cu118 / cpu) to match your environment.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install remaining requirements
pip install -r requirements.txt
Note on PyTorch installation. We recommend pip over conda install pytorch ... because the classic conda solver can take 30+ minutes (or hang) when resolving the pytorch channel. If you prefer conda, switch to the libmamba solver first:
conda install -n base conda-libmamba-solver -y
conda config --set solver libmamba
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
Data Preparation
Since our method acts as a universal refinement module, it is dataset-agnostic. You simply need to prepare your data (including the initial keyframes selected by your upstream baseline) in a JSON format. Please refer to example_input.json in our repository for the expected format:
[
  {
    "video_id": "fFjv93ACGo8",
    "video_path": "/path/to/your/video.mp4",
    "question": "When demonstrating the Germany modern Christmas tree is initially decorated with apples, candles and berries, which kind of the decoration has the largest number?",
    "options": "A) Apples.\nB) Candles.\nC) Berries.\nD) The three kinds are of the same number.",
    "answer": "C",
    "keyframe_timestamps": [10, 25, 26, 27, 40, 55, 56, 80]
  }
]
Note: keyframe_timestamps represents the timestamps (in frames or seconds, depending on your extraction logic) of the keyframes initially selected by your baseline.
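A quick sanity check before running the pipeline can catch malformed entries early. The sketch below is not part of the repository. It assumes that every field shown in the example above is required, and `validate_items` is a hypothetical helper name; the sample question is abbreviated for brevity.

```python
import json

# Field names taken from the example input above; treating all of them
# as required is an assumption.
REQUIRED_KEYS = {
    "video_id", "video_path", "question",
    "options", "answer", "keyframe_timestamps",
}

def validate_items(items):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {i}: missing keys {sorted(missing)}")
            continue
        ts = item["keyframe_timestamps"]
        if (not isinstance(ts, list) or not ts
                or any(not isinstance(t, (int, float)) for t in ts)):
            problems.append(
                f"item {i}: keyframe_timestamps must be a non-empty list of numbers"
            )
    return problems

sample = json.loads("""
[
  {
    "video_id": "fFjv93ACGo8",
    "video_path": "/path/to/your/video.mp4",
    "question": "Which decoration has the largest number?",
    "options": "A) Apples.\\nB) Candles.\\nC) Berries.\\nD) The three kinds are of the same number.",
    "answer": "C",
    "keyframe_timestamps": [10, 25, 26, 27, 40, 55, 56, 80]
  },
  {"video_id": "broken", "question": "?"}
]
""")
problems = validate_items(sample)
print(problems)
```

The second, deliberately incomplete entry is reported, while the first passes.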
Pipeline & Quick Start
Our pipeline consists of three simple steps: Pruning, Graph Generation, and Downstream QA.
Before you start: Open example_input.json and replace the placeholder "video_path": "/path/to/your/video.mp4" with the actual path to a local video file on your machine. The script reads frames directly from this file.
Step 1: Adaptive Frame-Pruning (AFP)
Use cluster.py to prune redundant frames. It extracts fused ResNet50 + CLIP features and performs adaptive clustering.
python scripts/cluster.py \
--input_json ./example_input.json \
--output_dir ./data/afp_output \
--num 32 # Number of initial keyframes to process
Output: Generates a JSON file at ./data/afp_output/example_input_cluster_centroid_32keyframes.json. The output filename follows the pattern {input_basename}_cluster_centroid_{num}keyframes.json, so it changes if you pass a different --num. Each entry gains a new key "frame_cluster" with the pruned timestamps.
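The naming rule and the new "frame_cluster" key can be used to locate the result and to measure how many frames were pruned. The following is a small sketch under those documented conventions; `expected_output_path` and `frame_reduction` are hypothetical helper names, "frame_cluster" is assumed to be a flat list of timestamps, and the sample values are invented for illustration.

```python
from pathlib import Path

def expected_output_path(input_json, output_dir, num):
    """Build the Step 1 output path from the documented filename pattern."""
    stem = Path(input_json).stem
    return Path(output_dir) / f"{stem}_cluster_centroid_{num}keyframes.json"

def frame_reduction(item):
    """Fraction of the initial keyframes removed by pruning."""
    before = len(item["keyframe_timestamps"])
    after = len(item["frame_cluster"])
    return 1.0 - after / before

out_path = expected_output_path("./example_input.json", "./data/afp_output", 32)
print(out_path.as_posix())

# Illustrative entry; the "frame_cluster" values here are invented.
item = {
    "keyframe_timestamps": [10, 25, 26, 27, 40, 55, 56, 80],
    "frame_cluster": [10, 26, 40, 56, 80],
}
print(f"{frame_reduction(item):.1%}")
```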
Step 2: Semantic Graph Generation
For this step we release only the prompt template rather than a runnable client, so you can plug in any LLM you have access to (OpenAI API, Azure, Claude, local vLLM/Ollama, etc.). The prompt template, together with the expected output JSON schema, is documented in prompts/semantic_graph_prompt.md.
Recommended workflow:
- For each item in the JSON produced by Step 1, fill the prompt's {question} and {options} placeholders with the item's question and options.
- Send the prompt to a capable LLM (we used GPT-4o).
- Parse the LLM output into the schema described in prompts/semantic_graph_prompt.md and append "grounding_objects" and "graph" to each item.
Output: The final JSON will now include "grounding_objects" and "graph" containing the semantic relations, ready for Step 3.
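The workflow above might be wired together roughly as follows. This is a sketch under several assumptions: `PROMPT_TEMPLATE` is a stand-in for the real text in prompts/semantic_graph_prompt.md, `fake_llm` is a placeholder for whichever LLM client you use, and the reply is assumed to be a JSON object holding "grounding_objects" and "graph". The actual schema is defined in the prompt file and may differ.

```python
import json

# Stand-in template; the real text lives in prompts/semantic_graph_prompt.md.
PROMPT_TEMPLATE = "Question: {question}\nOptions: {options}\nReturn JSON."

def build_prompt(item, template=PROMPT_TEMPLATE):
    """Fill the {question} and {options} placeholders for one item."""
    return template.replace("{question}", item["question"]).replace(
        "{options}", item["options"]
    )

def attach_graph(item, llm_reply):
    """Parse the reply as JSON and copy the two documented keys onto the item.

    Assumes the reply is a JSON object with "grounding_objects" and "graph".
    """
    parsed = json.loads(llm_reply)
    enriched = dict(item)
    enriched["grounding_objects"] = parsed["grounding_objects"]
    enriched["graph"] = parsed["graph"]
    return enriched

def fake_llm(prompt):
    """Placeholder for your own LLM client; returns a canned reply."""
    return json.dumps({
        "grounding_objects": ["tree", "apples"],
        "graph": [["apples", "hang on", "tree"]],
    })

item = {
    "question": "Which decoration is most numerous?",
    "options": "A) Apples.\nB) Candles.",
}
prompt = build_prompt(item)
result = attach_graph(item, fake_llm(prompt))
print(result["grounding_objects"], result["graph"])
```

Plain `str.replace` is used instead of `str.format` so that any literal braces in the real template, such as a JSON schema example, do not need escaping.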
Step 3: Downstream Video-QA (MLLM Inference)
With the pruned frames and the semantic graph ready, you can feed them into your preferred MLLM (e.g., GPT-4o, Qwen2.5-VL, LLaVA-Video). To reproduce our results, please strictly use the following prompt template to construct your VLM input:
Select the best answer to the following multiple-choice question based on the video.
<image>
<image>
...
[Here is an additional semantic graph context for this Video-QA]
Objects in video context: <Objects>
Relationships between objects: <Relations>
Question: <Question>
Options: <Options>
Answer with the option’s letter from the given choices directly.
Your response format should be strictly an upper case letter A,B,C,D or E.
(Note: Replace <image> placeholders with the actual images extracted using the timestamps from "frame_cluster". Replace <Objects> and <Relations> with the data parsed from the generated "graph" in Step 2).
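Assembling the text part of this prompt can be sketched as below. The helper name `build_qa_prompt` is hypothetical, and `objects` and `relations` are assumed to be lists of strings; adapt the joining logic to whatever structure your Step 2 output actually has. How the `<image>` placeholders are bound to real frames depends on your MLLM's API and is not shown.

```python
def build_qa_prompt(num_frames, objects, relations, question, options):
    """Assemble the text prompt following the template above.

    Assumes `objects` and `relations` are lists of strings.
    """
    lines = [
        "Select the best answer to the following multiple-choice "
        "question based on the video."
    ]
    # One placeholder per pruned frame from "frame_cluster".
    lines += ["<image>"] * num_frames
    lines += [
        "[Here is an additional semantic graph context for this Video-QA]",
        f"Objects in video context: {', '.join(objects)}",
        f"Relationships between objects: {'; '.join(relations)}",
        f"Question: {question}",
        f"Options: {options}",
        "Answer with the option’s letter from the given choices directly.",
        "Your response format should be strictly an upper case letter A,B,C,D or E.",
    ]
    return "\n".join(lines)

prompt = build_qa_prompt(
    num_frames=3,
    objects=["tree", "apples"],
    relations=["apples hang on tree"],
    question="Which decoration is most numerous?",
    options="A) Apples.\nB) Candles.",
)
print(prompt)
```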
Citation
If you find our work helpful, please consider citing:
@article{wang2025less,
title={Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration},
author={Wang, Shaoguang and Guo, Weiyu and Chen, Ziyang and Xu, Yijie and Hu, Xuming and Xiong, Hui},
journal={arXiv preprint arXiv:2508.03337},
year={2025}
}
License
This project is released under the MIT License.
Contact
For any questions, please contact Shaoguang Wang at shaoguangwang9@gmail.com.