Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level

April 4, 2025

[CVPR 2025]
[Project Page] [Paper] [Supp] [Dataset]


🧠 Overview

We introduce a new task, Motion-Grounded Video Reasoning, in which models must answer motion-related questions by producing spatiotemporal segmentation masks as visual responses.

This task addresses key limitations in prior video understanding research by introducing:

  • โ“ Implicit question-based reasoning
  • ๐Ÿ•’ Motion-aware temporal localization
  • ๐Ÿง Object-level visual grounding
  • ๐ŸŽฏ Pixel-level mask generation across time
  • ๐Ÿงฉ Four question types: Causal, Sequential, Counterfactual, and Descriptive

📌 Comparison to Prior Tasks

Figure 1: Comparison to other motion understanding tasks. GROUNDMORE fills the gap between referring segmentation, temporal grounding, and reasoning by combining implicit QA with visual spatiotemporal output.


📋 Task Definition

The Motion-Grounded Video Reasoning task requires models to:

  • Input:

    • A video clip V ∈ ℝ^{t×h×w×3}
    • A motion-related question Q
  • Output:

    • Spatiotemporal segmentation masks M ∈ ℝ^{t′×h×w} (t′ ≤ t) highlighting the target object

This output represents the reasoning result visually by grounding the answer over space and time.
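To make the interface concrete, here is a minimal shape sketch (the tensor names and the commented-out model call are illustrative assumptions, not the released API):

import torch

# Hypothetical clip dimensions: t frames of h×w RGB.
T, H, W = 32, 480, 854
video = torch.rand(T, H, W, 3)   # V ∈ ℝ^{t×h×w×3}
question = "What does the boy do after picking up the ball?"  # illustrative

# A model for this task maps (video, question) to masks M ∈ ℝ^{t′×h×w},
# where t′ ≤ t covers only the frames in which the queried motion occurs.
# masks = model(video, question)   # masks.shape == (t_prime, H, W)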


🧪 Dataset Details

We collect GROUNDMORE, a new benchmark dataset designed to evaluate fine-grained motion reasoning.

  • 1.7K high-resolution video clips
  • 7.6K question-answer pairs
  • 249K object-level spatiotemporal masks
  • Diverse video categories: family scenes, animals, ball games, and outdoor activities

✔️ Task Coverage Comparison

Table 1: Comparison of motion understanding tasks. Motion-Grounded Video Reasoning supports all dimensions: spatial & temporal context, motion abstraction, pixel-level output, and implicit reasoning.


📊 Dataset Statistics

Table 2: Dataset statistics. GROUNDMORE provides denser QA and segmentation annotations than prior benchmarks, especially for motion-related reasoning.


🧠 MoRA: Motion-Grounded Reasoning Assistant

We propose a baseline model called MoRA, built for this task. It integrates:

  • LLaVA for multimodal reasoning
  • SAM decoder for spatial mask decoding
  • [SEG] token for object semantic embedding
  • [LOC] token for temporal localization of motion events

🧱 Model Architecture

Figure 3: MoRA model architecture. MoRA outputs pixel-level segmentation masks as the response to the input motion-related question.
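To illustrate how these pieces could fit together, here is a schematic skeleton of the token routing (class, argument, and layer names are our own sketch of the described design, not the released implementation):

import torch
import torch.nn as nn

class MoRASketch(nn.Module):
    # Sketch: a LLaVA-style LLM emits [SEG]/[LOC] tokens whose hidden
    # states drive a SAM mask decoder and a temporal-span head.
    def __init__(self, llm, sam_decoder, hidden_dim=4096):
        super().__init__()
        self.llm = llm                              # multimodal LLM (e.g., LLaVA)
        self.sam_decoder = sam_decoder              # SAM decoder for mask prediction
        self.seg_proj = nn.Linear(hidden_dim, 256)  # [SEG] state -> SAM prompt space
        self.loc_head = nn.Linear(hidden_dim, 2)    # [LOC] state -> normalized (start, end)

    def forward(self, video_feats, input_ids, seg_pos, loc_pos):
        hidden = self.llm(video_feats, input_ids)             # (seq_len, hidden_dim)
        seg_embed = self.seg_proj(hidden[seg_pos])            # object semantic embedding
        span = torch.sigmoid(self.loc_head(hidden[loc_pos]))  # temporal localization
        masks = self.sam_decoder(video_feats, seg_embed)      # per-frame masks
        return masks, span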


📈 Results on GROUNDMORE

🥇 Zero-shot Evaluation

Table 3: Benchmark results. MoRA achieves state-of-the-art results on all question types, outperforming previous baseline models.


🔍 Ablation Study

Table 5: Temporal localization ablation. Temporal localization via the [LOC] token significantly improves performance.


⚙️ Installation

# Clone the repository
git clone https://github.com/groundmore/GROUNDMORE.git
cd GROUNDMORE

# Create and activate a Python 3.10 environment
conda create -n groundmore python=3.10
conda activate groundmore

# Install dependencies; flash-attn is built against the local CUDA toolchain
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
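As a quick sanity check of the environment (our suggestion, not part of the official setup):

# verify_env.py — confirms PyTorch sees the GPU and flash-attn built correctly
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)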

🚀 Usage

GroundMoRe Download

GroundMoRe is available at: https://huggingface.co/datasets/groundmore/GroundMoRe
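For a scripted download, the standard huggingface_hub API also works (the local target directory below is an assumption; adjust it to your layout):

from huggingface_hub import snapshot_download

# Fetch the full GroundMoRe dataset snapshot from the Hugging Face Hub.
path = snapshot_download(
    repo_id="groundmore/GroundMoRe",
    repo_type="dataset",
    local_dir="./data/GroundMoRe",  # assumed location
)
print("GroundMoRe downloaded to:", path)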

Training

Before training, you need to obtain the pretrained LISA and SAM weights for model initialization.

Put the SAM pretrained weights under ./pretrain_weights/, as sketched below.
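For SAM, the official ViT-H checkpoint can be fetched like this (the URL is the public segment-anything release; keeping the original filename under ./pretrain_weights/ is our assumption):

import os
import urllib.request

# Official SAM ViT-H checkpoint from the segment-anything release.
SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"

os.makedirs("./pretrain_weights", exist_ok=True)
dest = os.path.join("./pretrain_weights", "sam_vit_h_4b8939.pth")
if not os.path.exists(dest):
    urllib.request.urlretrieve(SAM_URL, dest)
print("SAM weights saved to:", dest)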

Zero-Shot Training

We use the Refer-YouTube-VOS and MeViS datasets for zero-shot training.

bash run.sh

GroundMoRe Evaluation

python evaluate_groundmore.py
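Benchmarks of this kind are commonly scored with region similarity (J, i.e., mask IoU averaged over frames); here is a minimal sketch of that measure, assuming binary (T, H, W) numpy masks (the official script's exact metrics may differ):

import numpy as np

def region_similarity(pred, gt):
    # Mean per-frame IoU (the J measure) between binary mask stacks
    # of shape (T, H, W).
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = (pred & gt).sum(axis=(1, 2))
    union = (pred | gt).sum(axis=(1, 2))
    # Frames where both masks are empty count as perfect agreement.
    return float(np.where(union > 0, inter / np.maximum(union, 1), 1.0).mean())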

✅ TODO

  • Release MoRA-FT-LISA7B
  • Release MoRA-ZS-LISA13B
  • Release MoRA-FT-LISA13B

📣 Citation

If this work is useful for your research, please cite:

@inproceedings{deng2025groundmore,
  title={Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level},
  author={Deng, Andong and Chen, Tongjia and Yu, Shoubin and Yang, Taojiannan and Spencer, Lincoln and Tian, Yapeng and Mian, Ajmal Saeed and Bansal, Mohit and Chen, Chen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

🙏 Acknowledgements

This work is built upon LISA and SAM.

We also appreciate the valuable help from Wenshuo Chen and Erhang Zhang during the GroundMoRe data collection.