November 16, 2025

MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS

MomentSeg is a unified MLLM for pixel-level vision–language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning image/video segmentation, video temporal grounding, and image/video question answering.

🔥 News

2025.10.12 🔥 Our paper and video demo have been released.

🕒 Open-Source Plan

Paper and Video Demo: released
Model Weights and Inference Instructions: coming soon
Training Code and Detailed Documentation: to be released in a later phase

🎥 Demo

Demo 1

Input Video (Source: Internet):

[Video not available in this text version]

Instruction: "Please segment the monkey that is scratching its ear."

Demo 2

Input Video (Source: Internet):

[Video not available in this text version]

Instruction: "Please segment the person standing in the center wearing blue clothes."

🏆 Performance

🖼️ Image-level Segmentation

(Referring Image Segmentation & Reasoning Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| RefCOCO (RES) | `val: 82.1/82.6` `testA: 83.7/85.1` `testB: 79.2/80.2` |
| RefCOCO+ (RES) | `val: 76.9/78.2` `testA: 81.1/81.9` `testB: 71.8/71.3` |
| RefCOCOg (RES) | `val(U): 78.8/80.1` `test(U): 79.2/80.1` |
| ReasonSeg | `val: 62.0/63.3` `test: 64.3/65.5` |
| GCG | `val: 67.0/67.8` `test: 65.9/67.9` |

🎬 Video-level Segmentation

(Referring Video Object Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| ReVOS | `J: 59.7/61.9` `F: 64.4/66.1` `J&F: 62.1/64.0` |
| ReasonVOS | `J: 58.2/59.2` `F: 65.3/66.1` `J&F: 61.7/62.7` |
| MeViS (val_u) | `J: 58.1/58.7` `F: 65.9/66.5` `J&F: 62.0/62.6` |
| MeViS (val) | `J: 51.7/53.9` `F: 58.0/60.2` `J&F: 54.8/57.1` |
| Ref-YouTube-VOS | `J: 69.8/70.1` `F: 74.3/74.5` `J&F: 72.0/72.3` |
| Ref-DAVIS17 | `J: 72.2/73.2` `F: 80.6/81.7` `J&F: 76.4/77.4` |
| Ref-SAV | `J: 79.2/80.1` `F: 80.6/81.4` `J&F: 79.9/80.8` |
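
The columns above appear to follow the usual video object segmentation convention: J is region similarity (the Jaccard index, i.e. mask IoU), F is contour accuracy, and J&F is their mean. The README does not spell this out, so treat it as an assumption. The sketch below is not the project's evaluation code. The function names `region_similarity` and `j_and_f` are hypothetical, the toy masks are invented, and F is taken as a given number because boundary matching is more involved.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (IoU) between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """J&F as the arithmetic mean of region similarity and contour accuracy."""
    return (j + f) / 2.0


# Toy 3x4 masks, invented for illustration only.
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
gt = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

j = region_similarity(pred, gt)  # 4 shared pixels, 6 pixels in the union
print(f"J = {j:.3f}")

# ReVOS 7B entries from the table: J = 61.9, F = 66.1.
print(f"J&F = {j_and_f(61.9, 66.1):.1f}")
```

Averaging the rounded table entries will not always reproduce the reported J&F to the last digit (ReasonVOS 3B is one such case), presumably because the reported mean is computed before rounding.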

⏱️ Temporal Sentence Grounding

| Benchmark | Evaluation Results (3B) |
| --- | --- |
| Charades-STA | `R@0.3: 76.1` `R@0.5: 58.2` `mIoU: 50.0` |
| ActivityNet-Grounding | `R@0.3: 65.6` `R@0.5: 45.6` `mIoU: 45.1` |
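
For context, temporal grounding metrics are typically computed from the temporal IoU between a predicted segment and the annotated one: R@m is the share of queries whose IoU is at least m, and mIoU is the average IoU over all queries. The sketch below assumes one predicted segment per query and that standard definition. It is not the project's evaluation script; the helper names and the example segments are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ious: List[float], threshold: float) -> float:
    """Percentage of queries whose IoU reaches the threshold."""
    hits = sum(1 for iou in ious if iou >= threshold)
    return 100.0 * hits / len(ious)


def mean_iou(ious: List[float]) -> float:
    """Average IoU over all queries, as a percentage."""
    return 100.0 * sum(ious) / len(ious)


# Invented example: three queries, one predicted segment each.
predictions = [(2.0, 8.0), (0.0, 5.0), (10.0, 12.0)]
ground_truth = [(3.0, 9.0), (4.0, 10.0), (0.0, 4.0)]

ious = [temporal_iou(p, g) for p, g in zip(predictions, ground_truth)]
print(f"R@0.3 = {recall_at(ious, 0.3):.1f}")
print(f"R@0.5 = {recall_at(ious, 0.5):.1f}")
print(f"mIoU  = {mean_iou(ious):.1f}")
```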

🤖 Model Zoo (TODO)

| Model Name | Base MLLM | HF Link |
| --- | --- | --- |
| MomentSeg-3B | Qwen2.5-VL-3B | [🤗 link] |
| MomentSeg-7B | Qwen2.5-VL-7B | [🤗 link] |

🚀 Training (TODO)

📊 Evaluation (TODO)

📖 Citation

If you find this project helpful, please cite our paper.

@misc{momentseg,
      title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding}, 
      author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
      year={2025},
      eprint={2510.09274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09274}, 
}