ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

September 21, 2025

Yiyang Zhou*, Yangfan He*, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao

*Equal Contribution

ReAgent-V is a modular, extensible, and reward-aware video reasoning framework designed to elevate video question answering and reasoning through:

  • 🔧 Flexible Tool Integration: Plug-and-play support for OCR, ASR, object detection, scene graph generation, captioning, and more
  • 🧠 Reward-Guided Inference: Enables real-time self-correction via structured reward signals
  • 🎯 Adaptive Model Alignment: Aligns models dynamically based on inference-time feedback
  • 🗂️ High-Quality Data Selection: Facilitates sample-efficient learning using reflective evaluation
  • 📊 Entropy-Calibrated Frame Selection: Prioritizes key frames for focused reasoning
  • 🔁 Multi-Perspective Reflection: Refines answers through debate among conservative, neutral, and aggressive viewpoints

News

🚀 Update [Sept 19, 2025]: Our paper (arXiv:2506.01300) has been accepted to NeurIPS 2025!

🔥 [June 2, 2025] Our latest paper is now live on arXiv: arXiv:2506.01300!


📌 Overview

Framework Overview

🚀 Applications

ReAgent-V supports a range of real-world tasks via dedicated application modules:

🧭 VLA Alignment

Aligns Vision-Language-Action (VLA) models using Trajectory-wise Preference Optimization (TPO) guided by ReAgent-V's reward feedback. Specifically, ReAgent-V evaluates each trajectory across multiple axes, such as task success, temporal stability, visual grounding, and semantic precision, and performs multi-agent reflection to produce refined, high-fidelity reward scores for alignment.
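To make the multi-axis scoring concrete, here is a minimal sketch of how per-axis scores from several reflection personas could be combined into one trajectory reward. Everything in it is an assumption for illustration: the axis keys, the `refined_reward` helper, the 0-10 scale, and the median-then-mean aggregation are hypothetical and are not taken from the ReAgent-V code or the TPO objective.

```python
from statistics import mean, median
from typing import Dict

# Hypothetical axis names, mirroring the axes listed in the text.
AXES = ("task_success", "temporal_stability", "visual_grounding", "semantic_precision")


def refined_reward(persona_scores: Dict[str, Dict[str, float]]) -> float:
    """Combine persona scores: median across personas per axis, then mean across axes."""
    per_axis = [
        median(scores[axis] for scores in persona_scores.values())
        for axis in AXES
    ]
    return mean(per_axis)


# Illustrative scores for one trajectory (assumed 0-10 scale).
trajectory_scores = {
    "conservative": {"task_success": 6, "temporal_stability": 7, "visual_grounding": 5, "semantic_precision": 6},
    "neutral": {"task_success": 7, "temporal_stability": 7, "visual_grounding": 6, "semantic_precision": 7},
    "aggressive": {"task_success": 9, "temporal_stability": 8, "visual_grounding": 6, "semantic_precision": 8},
}
reward = refined_reward(trajectory_scores)
```

The median is used here only because it keeps a single overly optimistic or pessimistic persona from dominating the score; a real pipeline may aggregate differently.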

🎥 Video Understanding

  • Entropy-Calibrated Frame Selection
    Efficiently selects the most informative frames for video reasoning.

  • Tool-Augmented Inference
    Dynamically integrates multimodal tools such as OCR, ASR, object detection, scene graph generation, and captioning.

  • Multi-Agent Reflection
    Iteratively refines outputs by encouraging disagreement and consensus among diverse agent personas (conservative / neutral / aggressive).

  • 📁 Module: ReAgent-V

  • 📘 Instructions: Video Understanding README
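As a rough illustration of the frame selection idea, the sketch below ranks frames by the Shannon entropy of their grayscale histogram and keeps the top k. This is an assumption-laden toy, not the ReAgent-V implementation: `frame_entropy`, `select_frames`, the 16-bin histogram, and the entropy-only criterion are hypothetical, and the actual method may also weigh relevance to the question.

```python
import math
from collections import Counter
from typing import List, Sequence


def frame_entropy(pixels: Sequence[int], bins: int = 16) -> float:
    """Shannon entropy (in bits) of a frame's grayscale histogram.

    `pixels` is a flat sequence of intensities in [0, 255].
    """
    if not pixels:
        return 0.0
    width = 256 // bins
    counts = Counter(min(p // width, bins - 1) for p in pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_frames(frames: List[Sequence[int]], k: int) -> List[int]:
    """Return indices of the k highest-entropy frames, in temporal order."""
    ranked = sorted(
        range(len(frames)),
        key=lambda i: frame_entropy(frames[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy "video": a flat frame, a two-tone frame, and a gradient frame.
frames = [
    [0] * 64,                     # no variation
    [0] * 32 + [255] * 32,        # two equally likely bins
    [i * 4 for i in range(64)],   # spread evenly over all 16 bins
]
selected = select_frames(frames, k=2)
```

In this toy input the flat frame is dropped and the two-tone and gradient frames are kept, since they carry more intensity variation.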

📈 Reward-Aware Data Curation and Collection for SFT, DPO, GRPO, and Beyond

ReAgent-V enables inference-time data curation by leveraging real-time rewards and reflection-based diagnostics. The extraction strategy varies with the optimization paradigm:

🧪 For SFT (Supervised Fine-Tuning)

ReAgent-V can directly collect samples with high reward scores (from the evaluation report) without requiring additional reflection.

  • ✅ These samples indicate that the model's initial reasoning is reliable.
  • 📥 Stored as supervised training pairs with accompanying scalar reward labels from the critic agent.

Simple, scalable, and label-efficient: reward scores enable dynamic filtering without manual annotation.
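The SFT filtering step can be sketched as follows. The `EvaluatedSample` record, the `collect_sft_samples` helper, the 0-10 reward scale, and the cutoff of 8.0 are all assumptions chosen for illustration; the source does not specify the data schema or the threshold.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class EvaluatedSample:
    """Hypothetical record of one inference result plus its critic score."""
    question: str
    answer: str
    reward: float  # scalar reward from the critic agent (assumed 0-10 scale)


def collect_sft_samples(
    samples: List[EvaluatedSample], min_reward: float = 8.0
) -> List[Dict[str, Union[str, float]]]:
    """Keep high-reward samples as supervised pairs, retaining the reward label."""
    return [
        {"prompt": s.question, "response": s.answer, "reward": s.reward}
        for s in samples
        if s.reward >= min_reward
    ]


evaluated = [
    EvaluatedSample("What does the sign say?", "It says 'Exit'.", reward=9.0),
    EvaluatedSample("How many people enter?", "Three.", reward=4.5),
]
sft_data = collect_sft_samples(evaluated)
```

Only the first sample passes the cutoff here; no reflection step is involved, which matches the description above.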

🔄 For GRPO (Group Relative Policy Optimization)

To curate high-value training data for GRPO, ReAgent-V employs a reflection-triggering mechanism grounded in importance scoring, effectively identifying challenging yet informative video-text samples during the video understanding phase.

  • 🎥 Each input is a (video, text) pair, typically comprising a video and its initial response.
  • 📊 During inference, ReAgent-V computes an importance score (denoted as E.importance_score) based on the critic agent's overall assessment of reasoning sufficiency.
  • ❗ If this importance score falls below a threshold (e.g., < 5 out of 10), the sample is considered difficult, meaning the model struggled with initial reasoning and likely required further refinement.
  • 📥 The resulting (video, text) samples are labeled as reflection-worthy and collected as valuable candidates for GRPO training.
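The thresholding rule above can be sketched in a few lines. The `EvaluationReport` and `Sample` classes and the `collect_grpo_candidates` helper are hypothetical stand-ins; only the `importance_score` field and the "< 5 out of 10" example come from the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EvaluationReport:
    """Hypothetical stand-in for the critic agent's report E."""
    importance_score: float  # 0-10, following the "< 5 out of 10" example


@dataclass
class Sample:
    video_path: str
    text: str
    report: EvaluationReport


def collect_grpo_candidates(
    samples: List[Sample], threshold: float = 5.0
) -> List[Tuple[str, str]]:
    """Return (video, text) pairs whose importance score is below the threshold."""
    return [
        (s.video_path, s.text)
        for s in samples
        if s.report.importance_score < threshold
    ]


samples = [
    Sample("clip_001.mp4", "The man opens the door.", EvaluationReport(8.0)),
    Sample("clip_002.mp4", "Unclear what happens next.", EvaluationReport(3.5)),
    Sample("clip_003.mp4", "She picks up a red cup.", EvaluationReport(5.0)),
]
grpo_candidates = collect_grpo_candidates(samples)
```

Note the strict inequality: a sample scoring exactly 5.0 is not treated as reflection-worthy in this sketch.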

⚖️ For DPO (Direct Preference Optimization)

ReAgent-V supports Direct Preference Optimization (DPO) by reframing itself from a video reasoning agent into a reward-generating agent. This is achieved through a task template modification that emphasizes evaluating answer quality rather than producing a single correct answer.

  • 🧠 Transforms the task from “answer the video question” into “score the video from different perspectives” based on visual evidence.

  • ♻️ Uses multi-perspective reflection outputs (e.g., conservative, neutral, aggressive) to generate candidate rewards.

  • 📊 Each candidate reward is generated along customizable reward dimensions, such as:

    • 🎯 Visual alignment
    • ⏱️ Temporal accuracy
    • 💬 Linguistic precision
    • 🧠 Reasoning specificity
    • 🔍 Option disambiguation
  • ✅ The system identifies the answer with the higher reflection reward as the preferred choice.

  • 🔗 Constructs (preferred, rejected) pairs from these outputs to serve as DPO training data.

Unlike static or hand-crafted rewards, ReAgent-V's feedback is context-aware, multi-dimensional, and fully dynamic, adapting to each video-question instance.
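A minimal sketch of the pair construction step, assuming each candidate answer already carries one score per reward dimension. The dimension keys, the unweighted mean in `total_reward`, the `build_dpo_pair` helper, and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions, not the ReAgent-V schema.

```python
from statistics import mean
from typing import Dict

# Hypothetical keys mirroring the reward dimensions listed above.
REWARD_DIMENSIONS = (
    "visual_alignment",
    "temporal_accuracy",
    "linguistic_precision",
    "reasoning_specificity",
    "option_disambiguation",
)


def total_reward(scores: Dict[str, float]) -> float:
    """Unweighted mean over the reward dimensions (an assumed aggregation)."""
    return mean(scores[d] for d in REWARD_DIMENSIONS)


def build_dpo_pair(question: str, candidates: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Pick the highest- and lowest-reward answers as (preferred, rejected)."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidate answers to form a pair")
    ranked = sorted(candidates, key=lambda a: total_reward(candidates[a]), reverse=True)
    return {"prompt": question, "chosen": ranked[0], "rejected": ranked[-1]}


candidates = {
    "The person pours water into a glass.": {
        "visual_alignment": 8, "temporal_accuracy": 7, "linguistic_precision": 8,
        "reasoning_specificity": 7, "option_disambiguation": 9,
    },
    "The person is cooking.": {
        "visual_alignment": 4, "temporal_accuracy": 5, "linguistic_precision": 7,
        "reasoning_specificity": 3, "option_disambiguation": 4,
    },
}
pair = build_dpo_pair("What is the person doing?", candidates)
```

Because the dimensions are customizable, a weighted sum could replace the plain mean without changing the rest of the sketch.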


🌀 Unified Insight: ReAgent-V closes the data curation loop. Its multi-agent reward pipeline not only improves current inference but continuously supplies high-fidelity data for future optimization, making learning systems self-refining in the wild.

🧑‍💻 Getting Started

Each subfolder contains its own README.md with detailed installation, setup, and training instructions. To get started:

  1. Clone the repository
  2. Follow the environment setup and requirements in each module
  3. Explore the demo scripts and customize as needed

💬 If you have questions or encounter any issues, feel free to open an issue or contact the maintainers.


📚 Citation

If you find ReAgent-V helpful in your research or projects, please consider citing:

@article{zhou2025reagent,
  title={ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding},
  author={Zhou, Yiyang and He, Yangfan and Su, Yaofeng and Han, Siwei and Jang, Joel and Bertasius, Gedas and Bansal, Mohit and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2506.01300},
  year={2025}
}