🎞️ LVNet: Hierarchical Keyframe Selector for Long Video

February 10, 2026 · View on GitHub

[Main Conference @ EACL'26] [Workshop @ NeurIPS'24]

Official Code for Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA

We propose LVNet, a modular and training-free hierarchical keyframe selector for very long videos.

Paper Link | Daily Papers in HF | Model Ckpt

Comparison with Other Keyframe Selection Methods

  • The videos below compare LVNet to two representative two-stage keyframe selection baselines, VideoAgent and VideoTree, on the same 3-minute EgoSchema example.
  • LVNet’s hierarchical keyframe selector (TSC → CKD → FKD) concentrates the caption budget on the most relevant moments, which leads to selecting the correct option (e) Phones.
  • In contrast, VideoAgent and VideoTree extract many less-informative frames for this query, and end up selecting incorrect options (c), (d).

⬇️ LVNet (Ours) selects (e) - Correct

LVNet demo (EgoSchema)

⬇️ VideoAgent selects (d) - Wrong

videoagent demo (EgoSchema)

⬇️ VideoTree selects (c) - Wrong

videotree demo (EgoSchema)

This Figure compares keyframe selection stages of LVNet and VideoAgent shwon in the demo above. LVNet starts with uniformly sampled frames, then selects keyframes non-uniformly through TSC, CKD, and FKD to highlight relevant content. This yields 12 frames, 8 of which show “phone usage,” the correct activity. In contrast, VideoAgent continues uniform sampling due to insufficient initial frames, yeilding 0 relevant frames out of 9 and ultimately choosing the wrong answer.

vis_comparison

Abstract

Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely related. Therefore, when performing long-form video question answering (LVQA), all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature leverage large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is not efficient and can mostly be redundant. Motivated by this inefficiency, we propose LVNet, a modular and training-free framework featuring a novel Hierarchical Keyframe Selector (HKS) that efficiently selects a minimal set of informative frames tailored to each question. LVNet's modularity allows easy integration with existing approaches for more efficient LVQA. We achieve state-of-the-art performance among similarly configured models across four benchmark LVQA datasets: EgoSchema, NExT-QA, IntentQA, VideoMME.

Accuracy vs Captions on the EgoSchema Subset

  • LVNet shows a SOTA 68.2% accuracy, merely at 12 captions.
  • The result highlights the quality of keyframes from the LVNet's hierarchical keyframe selector.. acc_captions

Hierarchical Keyframe Selector: Structural Overview

  • Overall strategy: Generate captions by hierarchical keyframe selector and feed them to the separate LLM to answer the question.
  • Temporal Scene Clustering (TSC): Divides the long-video into scenes, enabling per-scene subsampling.
  • Coarse Keyframe Detector (CKD): Selects frames best-aligned with keywords relevant to the query.
  • Fine Keyframe detector (FKD): Selects frames by refining keyword alignements through a templated visual prompting. acc_captions

Hierarchical Keyframe Selector: Operational Visualization

  • Temporal Scene Clustering (TSC): 900 frames get clustered into scenes and uniformly subsampled within each scene to output around 280 frames.
  • Coarse Keyframe Detector (CKD): Coarse Keyframe Detector selects only 32 frames out of them, based on the alignment with keywords which are from options.
  • Visual Templating: Coarsely refined keyframes are then ordered according to confidence scores and temporal orders, and grouped them into 4 groups of 8 frames each.
  • Fine Keyframe Detector (FKD): Selects 12 frames by refining keyword alignments in visual templates. acc_captions

Experiments: Comparison to Keyframe Selection Methods on VideoMME

Table shows LVNet’s performance on VideoMME’s long split, which features videos up to an hour long. In combination with an open-source LLM (DeepSeek-V3), LVNet outperforms the single-stage keyframe selection methods VideoChat-T and Frame-Voyager without any video-level training. Moreover, LVNet outperforms the two-stage keyframe selectors VideoAgent and VideoTree under the identical GPT-4o and 24-frame budget settings.

videomme

Efficiency

  • On a single NVIDIA RTX A5000 GPU with a 180-frame input video, LVNet finishes in 25.6 s (7.0 FPS), achieving 3.4Ă— and 2.5Ă— speedups over VideoAgent (87.3 s, 2.1 FPS) and VideoTree (64.2 s, 2.8 FPS). The main gap comes from LVNet’s lightweight HKS (TSC+CKD+FKD) to select keyframes quickly and applies the expensive captioner+LLM only once on the final selected frames. runtime

  • LVNet processes 1,800 frames in 61.9 s (29.1 FPS) on a single GPU, while Qwen2.5-VL-72B encounters out of memory (OOM) at 24 frames. Despite architectural and hardware differences, it shows that LVNet’s keyframe selector is scalable to process large number of frames in long-form video as it is lightweight, single-GPU-friendly, and provides high FPS. runtime

  • Table shows the inference cost and performance against purely GPT-4o prompting. LVNet delivers competitive accuracy with over 10Ă— lower per-video LLM cost, thanks to efficient keyframe selection. apicost

Evaluation

Generate Answers Using LLM

You can easily run the LLM to generate answers for the questions using the pre-generated captions.

  1. Download the Captions for Dataset
  • EgoSchema: bash scripts/get_ES_captions.sh
  1. Run LLM bash scripts/eval_ES.sh

Generate captions using our provided modules

Hierarchical Keyframe Selector (HKS)

  • Temporal Scene Clustering (TSC): temporalSceneClustering.py
  • Coarse Keyframe Detector (CKD): coarseKeyframeDetector.py
  • Fine Keyframe detector (FKD): fineKeyframeDetector.py
  1. EgoSchema keyframe selection from images: bash config/run.sh

  2. Generate captions based on the keyframes: bash scripts/create_caption.sh

Data

Hierarchical Keyframe Selector hyper-parameters & paths

coarseKeyframeDetector.py CLIP model checkpoint

Citation

@inproceedings{Park26LVNet,
  title={Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA},
  author={Jongwoo Park and Kanchana Ranasinghe and Kumara Kahatapitiya and Wonjeong Ryoo and Donghyun Kim and Michael S. Ryoo},
  booktitle={European Chapter of the Association for Computational Linguistics}, 
  year={2026}
}