Video-CoM: Interactive Video Reasoning via Chain of Manipulations (🔥🔥 CVPR 2026)

April 22, 2026

Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Khan, Salman Khan

MBZUAI, University of California Merced, Linköping University, Australian National University

📣 Announcements

🔥🔥 Update: We release the code, the Video-CoM models, and the datasets used in the development of Video-CoM.

Paper: https://arxiv.org/abs/2511.23477
Code: training scripts for SFT and GRPO
Models:
- Video-CoM (final model): https://huggingface.co/MBZUAI/Video-CoM
- Video-CoM-SFT (after SFT): https://huggingface.co/MBZUAI/Video-CoM-SFT
Dataset: https://huggingface.co/datasets/MBZUAI/Video-CoM-Dataset

Video-CoM introduces a new paradigm for interactive video reasoning, enabling models to think with videos instead of merely thinking about them. Instead of relying on a single static video encoding, Video-CoM performs iterative visual actions (segment finding, frame selection, and spatial zooming) to actively gather evidence through a Chain of Manipulations (CoM).
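The three visual actions named above can be pictured as simple operations on a decoded video. The following is a minimal sketch only, not the Video-CoM implementation: the function names, the argument conventions (seconds for time, pixel corners for boxes), and the toy nested-list frames are all assumptions made for illustration.

```python
from typing import List, Tuple

# Toy grayscale frame: a list of rows, each row a list of pixel values.
Frame = List[List[int]]


def find_segment(frames: List[Frame], fps: float, start_s: float, end_s: float) -> List[Frame]:
    """Hypothetical helper: keep frames whose timestamps fall in [start_s, end_s)."""
    first = max(0, int(start_s * fps))
    last = min(len(frames), int(end_s * fps))
    return frames[first:last]


def find_frame(frames: List[Frame], fps: float, t_s: float) -> Frame:
    """Hypothetical helper: pick the frame closest to t_s (relative to frames[0])."""
    index = min(len(frames) - 1, max(0, round(t_s * fps)))
    return frames[index]


def spatial_zoom(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Hypothetical helper: crop the pixel box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in frame[y0:y1]]


# Toy video: 10 frames of 4x4 pixels, where every pixel equals the frame index.
video = [[[i] * 4 for _ in range(4)] for i in range(10)]

# Chain the manipulations: narrow to a segment, pause on a frame, zoom in.
segment = find_segment(video, fps=2.0, start_s=1.0, end_s=3.0)
frame = find_frame(segment, fps=2.0, t_s=0.5)
crop = spatial_zoom(frame, (1, 1, 3, 3))
```

Each call narrows the visual input handed to the next one, which is the sense in which the manipulations form a chain. In the actual model the retrieved frames or crops would be re-encoded as new visual input rather than kept as raw arrays.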

Highlight figure: Video-CoM reasons with videos through a coherent chain of manipulations, actively gathering and integrating visual evidence throughout reasoning.

🔥 Highlights

Interactive Video Reasoning Framework: Moves beyond passive video encoding by enabling the model to actively rewatch specific moments, pause on key frames, and zoom into fine details throughout its reasoning trajectory, allowing it to gather evidence step by step rather than relying on a single static video representation.
Chain of Manipulations (CoM): A structured, interpretable reasoning mechanism where each step involves retrieving new visual evidence before continuing textual reasoning.
Video-CoM-Instruct (18K) - Manipulation-Driven Dataset: Carefully curated videos + dense annotations designed specifically for active visual reasoning.
Reasoning-Aware GRPO (RA-GRPO): Unlike accuracy-only RL, RA-GRPO provides step-level reasoning rewards, guiding consistent and visually grounded reasoning.
Strong Performance: We show strong performance across five reasoning benchmarks and two generic video-understanding benchmarks, along with significant gains on our manipulation-focused benchmark, demonstrating the effectiveness of interactive visual reasoning.

📊 Dataset: Video-CoM-Instruct-18K

Video-CoM-Instruct is constructed in three key stages:

Curating information-dense videos suited for fine-grained reasoning
Generating manipulation-targeted QA pairs that require segment revisiting, frame inspection, and spatial zooming
Adding dense temporal and spatial annotations to enable step-level reinforcement learning

Building on this foundation, each example follows a structured reasoning format that alternates between exploratory reasoning, where the model infers which moment or region likely contains the needed evidence; visual manipulation, where it executes targeted actions such as find-segment, find-frame, or spatial-zoom to retrieve new visual input; and observation, where it interprets the newly revealed evidence and integrates it into the next step.
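The alternating format described above can be sketched as a small data structure. This is an illustrative assumption about how one example might be represented in code, not the released dataset schema: the `CoMStep` fields, the `argument` tuple convention, and the sample trace text are all hypothetical. Only the three manipulation names come from the description.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Manipulation names as they appear in the description above.
MANIPULATIONS = ("find-segment", "find-frame", "spatial-zoom")


@dataclass
class CoMStep:
    """One reasoning step: explore, manipulate, then observe (hypothetical schema)."""
    reasoning: str                # exploratory reasoning: where the evidence likely is
    manipulation: str             # one of MANIPULATIONS
    argument: Tuple[float, ...]   # assumed: (start_s, end_s), (t_s,), or (x0, y0, x1, y1)
    observation: str              # interpretation of the newly retrieved visual input


def is_valid_trace(steps: List[CoMStep]) -> bool:
    """Check that every step names a known manipulation and has both text parts."""
    if not steps:
        return False
    return all(
        step.manipulation in MANIPULATIONS
        and bool(step.reasoning.strip())
        and bool(step.observation.strip())
        for step in steps
    )


def render_trace(steps: List[CoMStep], answer: str) -> str:
    """Flatten a trace into text, one line per reasoning/manipulation/observation."""
    lines = []
    for i, step in enumerate(steps, start=1):
        lines.append(f"[{i}] reasoning: {step.reasoning}")
        lines.append(f"[{i}] manipulation: {step.manipulation}{step.argument}")
        lines.append(f"[{i}] observation: {step.observation}")
    lines.append(f"answer: {answer}")
    return "\n".join(lines)


# Made-up example trace, for illustration only.
trace = [
    CoMStep("The score is probably shown near the end.", "find-segment",
            (50.0, 60.0), "A scoreboard appears in this segment."),
    CoMStep("The scoreboard is clearest mid-segment.", "find-frame",
            (55.0,), "The frame shows the scoreboard in the top-right corner."),
    CoMStep("The digits are too small to read.", "spatial-zoom",
            (900.0, 20.0, 1260.0, 140.0), "The zoomed crop reads 3 to 1."),
]
```

Calling `render_trace(trace, "3 to 1")` yields a readable transcript of the chain, and `is_valid_trace` gives a cheap structural check before a trace is used for training or scoring.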

⚡ Reasoning-Aware GRPO (RA-GRPO)

Most existing video reasoning models rely solely on final-answer rewards, offering no guidance on whether intermediate reasoning steps are visually grounded or correct. To address this, we introduce reasoning-aware rewards enabled by our dense temporal and spatial annotations, allowing the model to receive feedback at every manipulation step. Reasoning-Aware GRPO (RA-GRPO) enhances interactive video reasoning by scoring the correctness of each predicted manipulation and using these scores as step-level rewards.
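To make the idea of step-level rewards concrete, here is a rough sketch under stated assumptions. The exact reward definition used by RA-GRPO is not given in this README, so the IoU-based step scores, the `weight` of 0.5, and the way the step scores are averaged and added to the answer reward are all illustrative choices. The group-relative normalization at the end follows the standard GRPO recipe of standardizing rewards within a group of rollouts for the same prompt.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU of two time intervals (start_s, end_s); assumed score for find-segment."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Sequence[float], gt: Sequence[float]) -> float:
    """IoU of two boxes (x0, y0, x1, y1); assumed score for spatial-zoom."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def rollout_reward(answer_correct: bool, step_scores: Sequence[float],
                   weight: float = 0.5) -> float:
    """Final-answer reward plus a weighted mean of step scores (assumed combination)."""
    reasoning = mean(step_scores) if step_scores else 0.0
    return float(answer_correct) + weight * reasoning


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two hypothetical rollouts for the same question, both with a correct answer.
grounded = rollout_reward(True, [temporal_iou((10, 20), (12, 20)), 1.0])
ungrounded = rollout_reward(True, [temporal_iou((40, 50), (12, 20)), 0.0])
advantages = group_advantages([grounded, ungrounded])
```

With an accuracy-only reward both rollouts would score the same and receive zero advantage. Once step scores are included, the rollout whose manipulations match the annotations gets the positive advantage, which is the behavior the step-level rewards are meant to encourage.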

👁️ Attention to Visual Cues

Video-CoM maintains dynamic visual attention throughout its reasoning process, re-engaging with frames and regions whenever new evidence is needed. Unlike prior models that tend to drift toward text tokens and rely on world knowledge, Video-CoM’s attention consistently anchors to vision tokens at each manipulation step, whether locating a segment, isolating a frame, or zooming into fine details.

Installation

A minimal environment for demo and evaluation:

conda create -n video-com python=3.12 -y
conda activate video-com
pip install -U pip

# We use torch v2.7.0, torchvision v0.22.0 and transformers v4.51.1 in the development of Video-CoM
# Please see requirements.txt for more details
pip install -r requirements.txt

Training

See train/README.md for details.

bash train/scripts/train_sft.sh
bash train/scripts/train_grpo.sh

📜 Citation

@article{rasheed2025videocom,
    title={Video-CoM: Interactive Video Reasoning via Chain of Manipulations},
    author={Rasheed, Hanoona and Zumri, Mohammed and Maaz, Muhammad and Yang, Ming-Hsuan and Khan, Fahad S. and Khan, Salman},
    journal={arXiv preprint arXiv:2511.23477},
    year={2025}
}