ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
September 21, 2025
Yiyang Zhou*, Yangfan He*, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao
ReAgent-V is a modular, extensible, and reward-aware video reasoning framework designed to elevate video question answering and reasoning through:
- Flexible Tool Integration: Plug-and-play support for OCR, ASR, object detection, scene graph generation, captioning, and more
- Reward-Guided Inference: Enables real-time self-correction via structured reward signals
- Adaptive Model Alignment: Aligns models dynamically based on inference-time feedback
- High-Quality Data Selection: Facilitates sample-efficient learning using reflective evaluation
- Entropy-Calibrated Frame Selection: Prioritizes key frames for focused reasoning
- Multi-Perspective Reflection: Refines answers through debate among conservative, neutral, and aggressive viewpoints
News
Update [Sept 19, 2025]: Our paper (arXiv:2506.01300) has been accepted to NeurIPS 2025!
[June 2, 2025] Our paper is now live on arXiv: arXiv:2506.01300!
Overview

Applications
ReAgent-V supports a range of real-world tasks via dedicated application modules:
VLA Alignment
Aligns Vision-Language-Action (VLA) models using Trajectory-wise Preference Optimization (TPO) guided by ReAgent-V's reward feedback. Specifically, ReAgent-V evaluates each trajectory along multiple axes (such as task success, temporal stability, visual grounding, and semantic precision) and performs multi-agent reflection to produce refined, high-fidelity reward scores for alignment.
- Module: `Application/VLA-Alignment`
- Instructions: VLA Alignment README
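As a rough illustration of how per-axis trajectory scores might be turned into preference pairs for TPO, consider the minimal sketch below. Everything in it is an assumption made for illustration: the `TrajectoryScore` container, the equal axis weights, the 0-10 scale, and the `margin` parameter are hypothetical and are not taken from the ReAgent-V code. See the VLA Alignment README for the actual pipeline.

```python
from dataclasses import dataclass
from itertools import combinations

# Axis names follow the description above; the 0-10 scale is an assumption.
AXES = ("task_success", "temporal_stability", "visual_grounding", "semantic_precision")


@dataclass
class TrajectoryScore:
    """Hypothetical container for one trajectory's per-axis scores."""
    trajectory_id: str
    axis_scores: dict  # axis name -> score


def overall_reward(score, weights=None):
    """Weighted mean of the per-axis scores (equal weights by default)."""
    weights = weights or {axis: 1.0 for axis in AXES}
    total_weight = sum(weights[axis] for axis in AXES)
    return sum(weights[axis] * score.axis_scores[axis] for axis in AXES) / total_weight


def preference_pairs(scores, margin=1.0):
    """Return (preferred_id, rejected_id) pairs whose reward gap is at least `margin`."""
    pairs = []
    for a, b in combinations(scores, 2):
        reward_a, reward_b = overall_reward(a), overall_reward(b)
        if abs(reward_a - reward_b) < margin:
            continue  # gap too small to count as a clear preference
        if reward_a > reward_b:
            pairs.append((a.trajectory_id, b.trajectory_id))
        else:
            pairs.append((b.trajectory_id, a.trajectory_id))
    return pairs
```

Skipping pairs with a small reward gap is one possible way to avoid training on near-ties; whether the released code does this is not stated here.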
Video Understanding
- Entropy-Calibrated Frame Selection: Efficiently selects the most informative frames for video reasoning.
- Tool-Augmented Inference: Dynamically integrates multimodal tools, including OCR, ASR, object detection, scene graph generation, and captioning.
- Multi-Agent Reflection: Iteratively refines outputs by encouraging disagreement and consensus among diverse agent personas (conservative / neutral / aggressive).
- Module: `ReAgent-V`
- Instructions: Video Understanding README
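This README does not spell out how frame entropy is computed or calibrated, so the sketch below is only a stand-in to convey the general idea of entropy-based frame selection. It assumes each frame is a flat list of 8-bit grayscale intensities and ranks frames by the Shannon entropy of a coarse intensity histogram. The function names and the `bins` parameter are hypothetical, and the method in the paper may differ substantially.

```python
import math
from collections import Counter


def frame_entropy(pixels, bins=16):
    """Shannon entropy (in bits) of a coarse histogram of 0-255 intensities."""
    counts = Counter(min(p * bins // 256, bins - 1) for p in pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def select_frames(frames, k):
    """Indices of the k highest-entropy frames, returned in temporal order."""
    ranked = sorted(range(len(frames)), key=lambda i: frame_entropy(frames[i]), reverse=True)
    return sorted(ranked[:k])
```

Returning the indices in temporal order keeps the selected frames usable as an ordered clip for downstream reasoning.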
Reward-Aware Data Curation and Collection for SFT, DPO, GRPO, and Beyond
ReAgent-V enables inference-time data curation by leveraging real-time rewards and reflection-based diagnostics. The extraction strategy varies with the optimization paradigm:
For SFT (Supervised Fine-Tuning)
ReAgent-V can directly collect samples with high reward scores (from the evaluation report) without requiring additional reflection.
- These samples indicate that the model's initial reasoning is reliable.
- Stored as supervised training pairs with accompanying scalar reward labels from the critic agent.
Simple, scalable, and label-efficient: reward scores enable dynamic filtering without manual annotation.
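The filtering step can be pictured with a small sketch. The record fields (`video`, `question`, `answer`, `reward`) and the `min_reward` cutoff of 8.0 are assumptions chosen for illustration; the actual evaluation report format and threshold are defined in the module's own README and code.

```python
def collect_sft_samples(records, min_reward=8.0):
    """Keep high-reward records as supervised pairs, retaining the scalar reward label.

    `records` is a hypothetical list of dicts with the keys
    'video', 'question', 'answer', and 'reward'.
    """
    return [
        {
            "video": r["video"],
            "prompt": r["question"],
            "response": r["answer"],
            "reward": r["reward"],
        }
        for r in records
        if r["reward"] >= min_reward
    ]
```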
For GRPO (Group Relative Policy Optimization)
To curate high-value training data for GRPO, ReAgent-V employs a reflection-triggering mechanism grounded in importance scoring, effectively identifying challenging yet informative video-text samples during the video understanding phase.
- Each input is a (video, text) pair, typically comprising a video and its initial response.
- During inference, ReAgent-V computes an importance score (denoted as `E.importance_score`) based on the critic agent's overall assessment of reasoning sufficiency.
- If this importance score falls below a threshold (e.g., `< 5 out of 10`), the sample is considered difficult, meaning the model struggled with initial reasoning and likely required further refinement.
- The resulting (video, text) samples are labeled as reflection-worthy and collected as valuable candidates for GRPO training.
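The selection rule above can be sketched as follows. The `EvaluationReport` class is a hypothetical stand-in for the evaluation object `E`, with only the `importance_score` attribute mirroring the text. The default threshold of 5.0 follows the example value given above and is not necessarily the value used in the released code.

```python
from dataclasses import dataclass


@dataclass
class EvaluationReport:
    """Hypothetical stand-in for the critic agent's report E."""
    importance_score: float  # assumed to lie on a 0-10 scale


def is_reflection_worthy(report, threshold=5.0):
    """A sample counts as difficult when its importance score is below the threshold."""
    return report.importance_score < threshold


def collect_grpo_candidates(samples, threshold=5.0):
    """Keep the (video, text) pairs whose report marks them as reflection-worthy.

    `samples` is a list of (video, text, EvaluationReport) tuples.
    """
    return [
        (video, text)
        for video, text, report in samples
        if is_reflection_worthy(report, threshold)
    ]
```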
For DPO (Direct Preference Optimization)
ReAgent-V supports Direct Preference Optimization (DPO) by reframing itself from a video reasoning agent into a reward-generating agent. This is achieved through a task template modification that emphasizes evaluating answer quality rather than producing a single correct answer.
- Transforms the task from "answer the video question" into "score the video from different perspectives" based on visual evidence.
- Uses multi-perspective reflection outputs (e.g., conservative, neutral, aggressive) to generate candidate rewards.
- Each candidate reward is generated along customizable reward dimensions, such as:
  - Visual alignment
  - Temporal accuracy
  - Linguistic precision
  - Reasoning specificity
  - Option disambiguation
- The system identifies the answer with the higher reflection reward as the preferred choice.
- Constructs (preferred, rejected) pairs from these outputs to serve as DPO training data.
Unlike static or hand-crafted rewards, ReAgent-V's feedback is context-aware, multi-dimensional, and fully dynamic, adapting to each video-question instance.
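A minimal sketch of the pair-construction step is shown below. It makes several assumptions: each candidate answer comes with per-persona, per-dimension scores, the reflection reward is the plain mean over personas and dimensions, and the output uses `prompt` / `chosen` / `rejected` keys, a common convention for preference datasets. The aggregation rule and all function names are hypothetical, not the project's API.

```python
from statistics import mean

# Dimension names follow the list above; the scoring scale is an assumption.
DIMENSIONS = (
    "visual_alignment",
    "temporal_accuracy",
    "linguistic_precision",
    "reasoning_specificity",
    "option_disambiguation",
)


def reflection_reward(persona_scores):
    """Average the per-dimension scores within each persona, then across personas.

    `persona_scores` maps a persona name (e.g. 'conservative') to a dict of
    dimension -> score.
    """
    return mean(
        mean(scores[d] for d in DIMENSIONS) for scores in persona_scores.values()
    )


def build_dpo_pair(question, answer_a, scores_a, answer_b, scores_b):
    """Return a preference record, or None when the two rewards tie."""
    reward_a = reflection_reward(scores_a)
    reward_b = reflection_reward(scores_b)
    if reward_a == reward_b:
        return None  # no clear preference, so no training pair
    if reward_a > reward_b:
        chosen, rejected = answer_a, answer_b
    else:
        chosen, rejected = answer_b, answer_a
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```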
Unified Insight: ReAgent-V closes the data curation loop. Its multi-agent reward pipeline not only improves current inference but also continuously supplies high-fidelity data for future optimization, making learning systems self-refining in the wild.
- Module: `ReAgent-V`
- Instructions: Video Understanding README
Getting Started
Each subfolder contains its own README.md with detailed installation, setup, and training instructions. To get started:
- Clone the repository
- Follow the environment setup and requirements in each module
- Explore the demo scripts and customize as needed
If you have questions or encounter any issues, feel free to open an issue or contact the maintainers.
Citation
If you find ReAgent-V helpful in your research or projects, please consider citing:
@article{zhou2025reagent,
title={ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding},
author={Zhou, Yiyang and He, Yangfan and Su, Yaofeng and Han, Siwei and Jang, Joel and Bertasius, Gedas and Bansal, Mohit and Yao, Huaxiu},
journal={arXiv preprint arXiv:2506.01300},
year={2025}
}