Chain-of-Frames

June 4, 2025

Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

Paper | CoF-Data | CoF-Models | Quick Start | Acknowledgements

We propose chain-of-frames (CoF) to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant frames (see example in the left figure above).

We first create a large dataset of diverse questions, answers, and reasoning traces with references to frame IDs from both natural and synthetic videos. Then, we fine-tune existing video LLMs on this chain-of-frames data (CoF-Data). Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks or complex inference pipelines.
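To make the idea of frame-referencing reasoning traces concrete, here is a minimal sketch of what one record could look like and how the referenced frame IDs might be pulled out of a trace. The field names (`video`, `question`, `reasoning`, `answer`) and the "frame N" / "frames N-M" reference pattern are assumptions made for illustration; the actual CoF-Data schema may differ.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical reference pattern: matches "frame 3", "Frame 12", "frames 5-6".
FRAME_REF = re.compile(r"[Ff]rames?\s+(\d+)(?:\s*-\s*(\d+))?")


@dataclass
class CoFRecord:
    """Illustrative record layout (assumed, not the released schema)."""
    video: str
    question: str
    reasoning: str
    answer: str


def referenced_frames(reasoning: str) -> List[int]:
    """Return the sorted, de-duplicated frame IDs mentioned in a trace."""
    frames = set()
    for match in FRAME_REF.finditer(reasoning):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        if end < start:  # tolerate reversed ranges such as "frames 6-5"
            start, end = end, start
        frames.update(range(start, end + 1))
    return sorted(frames)


record = CoFRecord(
    video="example.mp4",
    question="What does the person pick up?",
    reasoning="In frame 3 the person reaches toward the table. "
              "In frames 5-6 they lift a cup.",
    answer="A cup",
)
print(referenced_frames(record.reasoning))  # → [3, 5, 6]
```

A helper like this can be used to check that every frame a trace cites is among the frames actually shown to the model.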

Our CoF-InternVL2.5-4B and CoF-InternVL3-8B models, based on CoF, outperform the baselines across several benchmarks (right figure above). Moreover, they generate interpretable reasoning traces that accurately refer to the key frames to answer the given question.

To load our models:

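import torch
from transformers import AutoTokenizer, AutoModel

# Path to a CoF checkpoint (placeholder; replace with your own).
model_path = "path/to/CoF-8B"

# Load the model in bfloat16; trust_remote_code is needed because the
# InternVL model classes ship with the checkpoint.
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,  # InternVL-specific option handled by the remote code
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# Greedy decoding with room for a long reasoning trace.
generation_config = dict(max_new_tokens=2048, do_sample=False)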
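Once the model is loaded, a video query needs a set of sampled frames and a prompt that numbers them, so that the reasoning can refer to frame IDs. The sketch below shows one way to do both without any model dependency. The helper names `sample_frame_indices` and `build_video_prompt` are hypothetical. The `Frame{i}: <image>` prefix follows the convention used in InternVL's published video examples, and that CoF models expect exactly this format is an assumption.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_segments: int) -> List[int]:
    """Pick the middle frame of each of `num_segments` equal-length segments."""
    if total_frames <= 0 or num_segments <= 0:
        raise ValueError("total_frames and num_segments must be positive")
    seg = total_frames / num_segments
    # Clamp to the last valid index in case of rounding at the end of the video.
    return [min(total_frames - 1, int(seg * i + seg / 2)) for i in range(num_segments)]


def build_video_prompt(num_frames: int, question: str) -> str:
    """Prefix the question with one numbered image placeholder per frame."""
    prefix = "".join(f"Frame{i + 1}: <image>\n" for i in range(num_frames))
    return prefix + question


indices = sample_frame_indices(total_frames=100, num_segments=4)
print(indices)  # → [12, 37, 62, 87]
print(build_video_prompt(len(indices), "What happens in the video?"))
```

In InternVL's examples, a prompt built this way is passed together with the stacked frame tensors to the model's `chat` method along with `tokenizer` and `generation_config`; check the model card for the exact call expected by the CoF checkpoints.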

To evaluate the models on the video benchmarks, run:

bash scripts/eval/eval.sh

Acknowledgements

This work builds on the code and resources from the InternVL repository.

We thank the authors for making their work publicly available and contributing to the research community.

Citation

If you use our code or models, please consider citing our work using the following BibTeX entry:

@article{ghazanfari2025chainofframes,
      title={Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning}, 
      author={Sara Ghazanfari and Francesco Croce and Nicolas Flammarion and Prashanth Krishnamurthy and Farshad Khorrami and Siddharth Garg},
      year={2025},
      journal={arXiv preprint arXiv:2506.00318}
}
