November 16, 2025

MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS


Demo Animation

MomentSeg is a unified MLLM for pixel-level vision–language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning image/video segmentation, video temporal grounding, and image/video question answering.

🔥 News

  • 2025.10.12 🔥 Our paper and video demo have been released.

🕒 Open-Source Plan

  • Paper and Video Demo
  • Model Weights and Inference Instructions: coming soon
  • Training Code and Detailed Documentation: to be released in a later phase

🎥 Demo

Demo 1 Input Video (Source: Internet):

Instruction: "Please segment the monkey that is scratching its ear."

Demo 2 Input Video (Source: Internet):

Instruction: "Please segment the person standing in the center wearing blue clothes."

🏆 Performance

🖼️ Image-level Segmentation

(Referring Image Segmentation & Reasoning Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| RefCOCO (RES) | val: 82.1/82.6, testA: 83.7/85.1, testB: 79.2/80.2 |
| RefCOCO+ (RES) | val: 76.9/78.2, testA: 81.1/81.9, testB: 71.8/71.3 |
| RefCOCOg (RES) | val(U): 78.8/80.1, test(U): 79.2/80.1 |
| ReasonSeg | val: 62.0/63.3, test: 64.3/65.5 |
| GCG | val: 67.0/67.8, test: 65.9/67.9 |

🎬 Video-level Segmentation

(Referring Video Object Segmentation)
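The video results in this section are reported as J, F, and J&F. In the usual video object segmentation convention, J is region similarity (mask intersection-over-union), F is contour accuracy, and J&F is their mean. The sketch below illustrates J and the averaging step only; it is not the project's evaluation code, the function names are hypothetical, and treating two empty masks as a perfect match is an assumed convention.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 = object pixel, 0 = background


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union of two binary masks with the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def j_and_f(j: float, f: float) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
toy_j = region_similarity(pred_mask, gt_mask)

# Averaging already-rounded J and F values only approximates a reported J&F,
# which is normally computed from unrounded scores.
approx_jf = j_and_f(59.7, 64.4)
```

Benchmark scores are typically aggregated over all frames and objects before rounding, so recomputing J&F from rounded table entries can differ in the last digit.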

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| ReVOS | J: 59.7/61.9, F: 64.4/66.1, J&F: 62.1/64.0 |
| ReasonVOS | J: 58.2/59.2, F: 65.3/66.1, J&F: 61.7/62.7 |
| MeViS (val_u) | J: 58.1/58.7, F: 65.9/66.5, J&F: 62.0/62.6 |
| MeViS (val) | J: 51.7/53.9, F: 58.0/60.2, J&F: 54.8/57.1 |
| Ref-YouTube-VOS | J: 69.8/70.1, F: 74.3/74.5, J&F: 72.0/72.3 |
| Ref-DAVIS17 | J: 72.2/73.2, F: 80.6/81.7, J&F: 76.4/77.4 |
| Ref-SAV | J: 79.2/80.1, F: 80.6/81.4, J&F: 79.9/80.8 |

⏱️ Temporal Sentence Grounding

(Temporal Sentence Grounding)

| Benchmark | Evaluation Results (3B) |
| --- | --- |
| Charades-STA | R@0.3: 76.1, R@0.5: 58.2, mIoU: 50.0 |
| ActivityNet-Grounding | R@0.3: 65.6, R@0.5: 45.6, mIoU: 45.1 |
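
For readers unfamiliar with the temporal grounding metrics above, here is a minimal sketch of how R@0.3, R@0.5, and mIoU are commonly computed from predicted and ground-truth time spans. It is an illustration under stated assumptions, not the project's evaluation script: the function names are hypothetical, one prediction per query is assumed, and a prediction counts as a hit when its IoU is greater than or equal to the threshold.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: List[Span],
    gts: List[Span],
    thresholds: Sequence[float] = (0.3, 0.5),
) -> Dict[str, float]:
    """Return R@threshold and mIoU as percentages over all queries."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    # R@t: share of queries whose predicted span reaches IoU >= t (assumed convention).
    report = {f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds}
    # mIoU: mean IoU over all queries.
    report["mIoU"] = 100.0 * sum(ious) / n
    return report


# Three toy queries: partial overlap (IoU 0.5), exact match (IoU 1.0), no overlap (IoU 0.0).
toy_preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 12.0)]
toy_gts = [(4.0, 10.0), (0.0, 5.0), (20.0, 25.0)]
toy_report = grounding_metrics(toy_preds, toy_gts)
```

In this toy case two of the three predictions clear both thresholds, and the mean IoU is 0.5, reported as 50.0 on the percentage scale used in the table.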

🤖 Model Zoo (TODO)

| Model Name | Base MLLM | HF Link |
| --- | --- | --- |
| MomentSeg-3B | Qwen2.5-VL-3B | [🤗 link] |
| MomentSeg-7B | Qwen2.5-VL-7B | [🤗 link] |

🚀 Training (TODO)

📊 Evaluation (TODO)

📖 Citation

Please kindly cite our paper if you find this project helpful.

@misc{momentseg,
      title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding}, 
      author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
      year={2025},
      eprint={2510.09274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09274}, 
}