Chain-of-Frames
June 4, 2025
Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning
Paper | CoF-Data | CoF-Models | Quick Start | Acknowledgements
We propose chain-of-frames (CoF) to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant frames (see example in the left figure above).
We first create a large dataset of diverse questions, answers, and reasoning traces with references to frame IDs from both natural and synthetic videos. Then, we fine-tune existing video LLMs on this chain-of-frames data (CoF-Data). Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks or complex inference pipelines.
Our CoF-InternVL2.5-4B and CoF-InternVL3-8B models, both fine-tuned with CoF, outperform the baselines across several benchmarks (right figure above). Moreover, they generate interpretable reasoning traces that accurately reference the key frames needed to answer the given question.
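To make the idea of frame-grounded traces concrete, here is a minimal sketch of how one might pull frame references out of a generated trace and compare them with a set of annotated key frames. The textual pattern (`frame 7`, `frames 3-4`), the helper names, and the example trace are all assumptions made for illustration; the actual trace format of the CoF models may differ.

```python
import re
from typing import Set, Tuple

# Assumed reference style: "frame 7" or "frames 3-4" (hypothetical format).
FRAME_REF = re.compile(r"[Ff]rames?\s+(\d+)(?:\s*-\s*(\d+))?")


def extract_frame_ids(trace: str) -> Set[int]:
    """Collect every frame ID mentioned in a reasoning trace."""
    ids: Set[int] = set()
    for match in FRAME_REF.finditer(trace):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        # Expand ranges such as "frames 3-4" into individual IDs.
        ids.update(range(min(start, end), max(start, end) + 1))
    return ids


def frame_precision_recall(predicted: Set[int], key: Set[int]) -> Tuple[float, float]:
    """Precision and recall of referenced frames against annotated key frames."""
    hits = len(predicted & key)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(key) if key else 0.0
    return precision, recall


# Made-up trace and key frames, for illustration only.
trace = "In frames 3-4 the drawer is opened. In frame 7 the person holds scissors."
predicted = extract_frame_ids(trace)
precision, recall = frame_precision_recall(predicted, {4, 7, 8})
print(sorted(predicted))  # [3, 4, 7]
```

Note that this simple pattern only captures the first number in phrases such as "frames 3 and 5"; a real parser would need to match whatever convention the model was trained on.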
CoF-Data
The figure below summarizes the CoF-Data generation process, which yields our video annotations.
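As a rough illustration of what a single annotation might contain, the sketch below models a sample with a question, an answer, and reasoning steps that each point to frame IDs. The class names, field names, 1-based frame numbering, and the example values are hypothetical and not taken from the released CoF-Data schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    """One step of a reasoning trace with the frames it refers to."""
    text: str
    frame_ids: List[int] = field(default_factory=list)


@dataclass
class CoFSample:
    """Hypothetical layout of one chain-of-frames annotation."""
    video: str
    num_frames: int
    question: str
    steps: List[ReasoningStep]
    answer: str

    def referenced_frames(self) -> List[int]:
        """Sorted, de-duplicated frame IDs mentioned across all steps."""
        return sorted({f for step in self.steps for f in step.frame_ids})

    def validate(self) -> None:
        """Raise if a step points to a frame outside the sampled range (1-based, assumed)."""
        bad = [f for f in self.referenced_frames() if not 1 <= f <= self.num_frames]
        if bad:
            raise ValueError(f"frame IDs out of range: {bad}")


# Made-up example values, for illustration only.
sample = CoFSample(
    video="videos/example.mp4",
    num_frames=16,
    question="What does the person pick up after opening the drawer?",
    steps=[
        ReasoningStep("The person opens the drawer.", [3, 4]),
        ReasoningStep("They pick up a pair of scissors.", [7]),
    ],
    answer="A pair of scissors",
)
sample.validate()
print(sample.referenced_frames())  # [3, 4, 7]
```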
Checkpoints
Quick Start
The model loading and evaluation procedures are similar to those used in the InternVL repository; please refer to the InternVL documentation for additional details.
- To load our models:
import torch
from transformers import AutoTokenizer, AutoModel

# Local directory (or hub ID) of the CoF checkpoint to load
model_path = "path/to/CoF-8B"

# Load the weights in bfloat16, switch to eval mode, and move the model to the GPU
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# Greedy decoding with room for a long reasoning trace
generation_config = dict(max_new_tokens=2048, do_sample=False)
- Evaluation scripts for the video benchmarks:
bash scripts/eval/eval.sh
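For running a model on a custom video, the InternVL documentation samples frames uniformly and prefixes the question with one `Frame{i}: <image>` line per sampled frame. The sketch below reproduces only that index sampling and prompt construction in plain Python. The helper names are ours, and whether the CoF checkpoints expect exactly this prompt layout is an assumption.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_segments: int) -> List[int]:
    """Pick the middle frame of each of `num_segments` equal-length segments."""
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    segment = total_frames / num_segments
    return [
        min(int(segment * i + segment / 2), total_frames - 1)
        for i in range(num_segments)
    ]


def build_video_prompt(num_frames: int, question: str) -> str:
    """Prefix the question with one numbered image placeholder per frame."""
    prefix = "".join(f"Frame{i + 1}: <image>\n" for i in range(num_frames))
    return prefix + question


indices = sample_frame_indices(total_frames=120, num_segments=8)
prompt = build_video_prompt(len(indices), "What happens after the drawer is opened?")
print(indices)  # [7, 22, 37, 52, 67, 82, 97, 112]
```

In the InternVL examples, the resulting prompt is passed, together with the preprocessed frame tensors and `generation_config`, to the `chat` method that the checkpoint's remote code attaches to the model; we assume the CoF checkpoints keep that interface.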
Acknowledgements
This work leverages the code and resources from the InternVL repository.
We thank the authors for making their work publicly available and contributing to the research community.
Citation
If you use our code or models, please consider citing our work using the following BibTeX entry:
@article{ghazanfari2025chainofframes,
  title={Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning},
  author={Sara Ghazanfari and Francesco Croce and Nicolas Flammarion and Prashanth Krishnamurthy and Farshad Khorrami and Siddharth Garg},
  year={2025},
  journal={arXiv preprint arXiv:2506.00318}
}