November 16, 2025
# MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS
MomentSeg is a unified MLLM for pixel-level vision-language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning image/video segmentation, video temporal grounding, and image/video question answering.
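To give a feel for the idea, here is a minimal sketch of what a moment-centric frame sampler could look like. This is an illustrative assumption about the general shape of the strategy, not the paper's actual algorithm; the function name and the `inside_ratio` parameter are hypothetical. Given a temporal moment grounded from the text query, most of the frame budget is spent inside that moment, with a few uniformly spaced frames kept for global context.

```python
def moment_centric_sample(num_frames, budget, moment, inside_ratio=0.75):
    """Hypothetical sketch: split a fixed frame budget between the
    grounded moment [start, end] (dense) and the full video (sparse)."""
    start, end = moment
    k_in = max(1, round(budget * inside_ratio))  # frames inside the moment
    k_out = budget - k_in                        # frames for global context

    def uniform(lo, hi, k):
        # k evenly spaced integer indices in [lo, hi]
        if k <= 0:
            return []
        if k == 1:
            return [(lo + hi) // 2]
        return [lo + round(i * (hi - lo) / (k - 1)) for i in range(k)]

    dense = uniform(start, end, k_in)            # fine-grained coverage
    sparse = uniform(0, num_frames - 1, k_out)   # coarse global coverage
    return sorted(set(dense + sparse))
```

Plain uniform sampling at the same budget would leave only one or two frames inside a short moment; concentrating the budget there preserves the fine-grained motion cues that referring segmentation depends on.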

## 🔥 News
**2025.10.12** 🔥 Our paper and video demo have been released.
## 📋 Open-Source Plan
- Paper and Video Demo
- Model Weights and Inference Instructions: coming soon
- Training Code and Detailed Documentation: to be released in a later phase
## 🎥 Demo
### Demo 1
Input Video (Source: Internet):
Instruction: "Please segment the monkey that is scratching its ear."
### Demo 2
Input Video (Source: Internet):
Instruction: "Please segment the person standing in the center wearing blue clothes."
## 📊 Performance
### 🖼️ Image-level Segmentation (Referring Image Segmentation & Reasoning Segmentation)
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| RefCOCO (RES) | val: 82.1/82.6 · testA: 83.7/85.1 · testB: 79.2/80.2 |
| RefCOCO+ (RES) | val: 76.9/78.2 · testA: 81.1/81.9 · testB: 71.8/71.3 |
| RefCOCOg (RES) | val(U): 78.8/80.1 · test(U): 79.2/80.1 |
| ReasonSeg | val: 62.0/63.3 · test: 64.3/65.5 |
| GCG | val: 67.0/67.8 · test: 65.9/67.9 |
### 🎬 Video-level Segmentation (Referring Video Object Segmentation)
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| ReVOS | J: 59.7/61.9 · F: 64.4/66.1 · J&F: 62.1/64.0 |
| ReasonVOS | J: 58.2/59.2 · F: 65.3/66.1 · J&F: 61.7/62.7 |
| MeViS (val_u) | J: 58.1/58.7 · F: 65.9/66.5 · J&F: 62.0/62.6 |
| MeViS (val) | J: 51.7/53.9 · F: 58.0/60.2 · J&F: 54.8/57.1 |
| Ref-YouTube-VOS | J: 69.8/70.1 · F: 74.3/74.5 · J&F: 72.0/72.3 |
| Ref-DAVIS17 | J: 72.2/73.2 · F: 80.6/81.7 · J&F: 76.4/77.4 |
| Ref-SAV | J: 79.2/80.1 · F: 80.6/81.4 · J&F: 79.9/80.8 |
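As a quick way to read the table: J is region similarity (mask IoU), F is contour accuracy, and J&F is conventionally their arithmetic mean. A small sanity check against a few rows (values transcribed from the table above, allowing for one-decimal rounding):

```python
# (J, F, reported J&F) transcribed from the table above
rows = {
    "ReVOS (7B)":       (61.9, 66.1, 64.0),
    "MeViS val (3B)":   (51.7, 58.0, 54.8),
    "Ref-DAVIS17 (3B)": (72.2, 80.6, 76.4),
}
for name, (j, f, jf) in rows.items():
    # reported J&F should equal (J + F) / 2 up to one-decimal rounding
    assert abs((j + f) / 2 - jf) <= 0.051, name
```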
### ⏱️ Temporal Sentence Grounding
| Benchmark | Evaluation Results (3B) |
|---|---|
| Charades-STA | R@0.3: 76.1 · R@0.5: 58.2 · mIoU: 50.0 |
| ActivityNet-Grounding | R@0.3: 65.6 · R@0.5: 45.6 · mIoU: 45.1 |
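Here R@m is the fraction of queries whose predicted moment overlaps the ground truth with temporal IoU of at least m, and mIoU is the mean IoU over all queries. A minimal sketch of these standard metrics (the function names are my own, not from this repo):

```python
def temporal_iou(pred, gt):
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(preds, gts, thresh):
    """Fraction of queries whose prediction reaches IoU >= thresh (R@thresh)."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, a predicted moment (2s, 8s) against a ground-truth moment (4s, 10s) has 4s of intersection over 8s of union, i.e. IoU 0.5, which counts as a hit for R@0.5 but not for R@0.7.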
## 🤗 Model Zoo (TODO)
| Model Name | Base MLLM | HF Link |
|---|---|---|
| MomentSeg-3B | Qwen2.5-VL-3B | [🤗 link] |
| MomentSeg-7B | Qwen2.5-VL-7B | [🤗 link] |
## 🚀 Training (TODO)
## 📊 Evaluation (TODO)
## 📖 Citation
Please cite our paper if you find this project helpful.
```bibtex
@misc{momentseg,
      title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding},
      author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
      year={2025},
      eprint={2510.09274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09274},
}
```