MomentSeg: Moment-Centric Sampling for Enhanced
Video Pixel Understanding
Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²
¹Southeast University; ²Baidu VIS
MomentSeg is a unified MLLM for pixel-level vision-language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning segmentation for images and videos, temporal sentence grounding, and image/video question answering.

News
- 2026.06.19: We release the training code, inference code, evaluation code, data package, and checkpoints for MomentSeg-3B and MomentSeg-7B.
- 2026.06.18: MomentSeg was accepted to ECCV 2026.
- 2025.10.12: We released the paper and video demo.
Release Status
- Paper and Video Demo
- Model Checkpoints and Inference Instructions
- Training Code and Detailed Documentation
- Data Release
Demo
Demo 1
Input video (source: Internet):
Prompt: "Please segment the monkey that is scratching its ear."
Demo 2
Input video (source: Internet):
Prompt: "Please segment the person standing in the center wearing blue clothes."
Performance
Image-level Segmentation
(Referring Image Segmentation & Reasoning Segmentation)
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| RefCOCO (RES) | val: 82.1/82.6; testA: 83.7/85.1; testB: 79.2/80.2 |
| RefCOCO+ (RES) | val: 76.9/78.2; testA: 81.1/81.9; testB: 71.8/72.3 |
| RefCOCOg (RES) | val(U): 78.8/80.1; test(U): 79.2/80.1 |
| ReasonSeg | val: 62.0/63.3; test: 64.3/65.5 |
| GCG | val: 67.0/67.8; test: 65.9/67.9 |
Video-level Segmentation
(Referring and Reasoning Video Object Segmentation)
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| ReVOS (overall) | J: 60.0/62.3; F: 65.2/67.8; J&F: 62.6/65.1 |
| ReasonVOS | J: 58.2/59.2; F: 65.3/66.1; J&F: 61.7/62.7 |
| MeViS (val_u) | J: 58.1/58.7; F: 65.9/66.5; J&F: 62.0/62.6 |
| MeViS (val) | J: 51.7/53.9; F: 58.0/60.2; J&F: 54.8/57.1 |
| Ref-YouTube-VOS | J: 69.8/70.1; F: 74.3/74.5; J&F: 72.0/72.3 |
| Ref-DAVIS17 | J: 72.2/73.2; F: 80.6/81.7; J&F: 76.4/77.4 |
| Ref-SAV | J: 62.9/--; F: 65.2/--; J&F: 64.0/-- |
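For reading the table: J is region similarity (mask IoU), F is contour accuracy, and J&F is their mean, following the DAVIS convention. The sketch below recomputes J&F from a few of the rounded 3B entries above. The helper name `j_and_f` is ours for illustration and is not part of the repository; since the published numbers come from unrounded scores, a difference of 0.1 in the last digit can occur.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


# Rounded 3B entries copied from the table above: (J, F, reported J&F).
rows = {
    "ReVOS (overall)": (60.0, 65.2, 62.6),
    "MeViS (val_u)": (58.1, 65.9, 62.0),
    "Ref-DAVIS17": (72.2, 80.6, 76.4),
}

for name, (j, f, reported) in rows.items():
    recomputed = j_and_f(j, f)
    # Tolerance accounts for rounding of the published J and F values.
    assert abs(recomputed - reported) <= 0.1 + 1e-9, name
    print(f"{name}: J&F = {recomputed:.2f} (reported {reported})")
```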
Temporal Sentence Grounding
(Temporal Sentence Grounding)
| Benchmark | Evaluation Results (3B) |
|---|---|
| Charades-STA | R@0.3: 76.1; R@0.5: 58.2; R@0.7: 25.8; mIoU: 50.2 |
| ActivityNet-Grounding | R@0.3: 67.5; R@0.5: 44.7; R@0.7: 23.2; mIoU: 45.4 |
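In the table, R@m is the fraction of queries whose predicted segment reaches a temporal IoU of at least m with the ground truth, and mIoU is the mean temporal IoU over all queries. The sketch below illustrates these standard definitions on made-up segments. It is not the repository's `tools/eval/eval_tvg.py`, the function names are hypothetical, and it returns fractions in [0, 1] whereas the table reports percentages.

```python
from typing import Dict, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Segment],
    gts: Sequence[Segment],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Recall at each IoU threshold plus mean IoU, as fractions."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    out = {f"R@{t}": sum(iou >= t for iou in ious) / n for t in thresholds}
    out["mIoU"] = sum(ious) / n
    return out


# Toy segments (not from any benchmark): exact match, partial overlap, no overlap.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
metrics = grounding_metrics(preds, gts)
for key, value in metrics.items():
    print(f"{key}: {value:.3f}")
```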
Model Zoo
| Model Name | Base MLLM | Mask Decoder | Checkpoint |
|---|---|---|---|
| MomentSeg-3B | Qwen2.5-VL-3B-Instruct | SAM2-Hiera-Large | Checkpoint |
| MomentSeg-7B | Qwen2.5-VL-7B-Instruct | SAM2-Hiera-Large | Checkpoint |
Installation
Create the environment and install the project dependencies:
conda create -n momentseg python=3.10 -y
conda activate momentseg
# Install the PyTorch build that matches your CUDA version first.
pip install -r requirements.txt
Prepare the base MLLM and mask decoder checkpoints under pretrained/:
pretrained/
├── Qwen2.5-VL-3B-Instruct/
├── Qwen2.5-VL-7B-Instruct/
└── sam2_hiera_large.pt
For example:
huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct \
    --local-dir pretrained/Qwen2.5-VL-3B-Instruct
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
    --local-dir pretrained/Qwen2.5-VL-7B-Instruct
curl -L \
    https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt \
    -o pretrained/sam2_hiera_large.pt
Data Preparation
The released data package is available at MomentSeg Data.
If you prepare the data manually, organize the original datasets and processed
annotations under data/ as follows:
data/
├── video_datas/
│   ├── revos/
│   ├── mevis/
│   │   └── train/
│   ├── rvos/
│   ├── ref_sav_eval/
│   │   ├── videos/
│   │   ├── meta_expressions_valid.json
│   │   └── mask_dict.json
│   ├── chat_univi/
│   │   ├── Activity_Videos/
│   │   └── video_chat.json
│   ├── sam_v_full/
│   └── sam_v_final_custom.json
├── ref_seg/
│   ├── refcoco/
│   ├── refcoco+/
│   └── refcocog/
├── reason_seg/
├── glamm_data/
│   ├── images/
│   └── annotations/
├── llava_data/
│   ├── llava_images/
│   ├── LLaVA-Instruct-150K/
│   └── LLaVA-Pretrain/
└── VTG/
    ├── NumPro_FT/
    ├── videos_1FPS/
    ├── train.caption_coco_format.json
    └── activitynet_captions_train.json
The main training configs read paths relative to ./data/. Please keep the
folder names consistent with the layout above, or update the corresponding
paths in:
projects/qwenvl_sam2/configs/momentseg-3B.py
projects/qwenvl_sam2/configs/momentseg-7B.py
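Before launching a long run, it can help to confirm that the expected folders exist. The sketch below is a hypothetical helper, not a script shipped with the repository: `find_missing` and `EXPECTED_PATHS` are names chosen for illustration, and the list is only a sample of entries copied from the layouts above. Trim or extend it to match the datasets you actually use.

```python
from pathlib import Path
from typing import Iterable, List

# Sample of entries copied from the pretrained/ and data/ layouts above.
# This list is illustrative, not exhaustive.
EXPECTED_PATHS = (
    "pretrained/sam2_hiera_large.pt",
    "data/video_datas/revos",
    "data/video_datas/mevis/train",
    "data/video_datas/ref_sav_eval/meta_expressions_valid.json",
    "data/ref_seg/refcoco",
    "data/reason_seg",
    "data/glamm_data/annotations",
    "data/llava_data/llava_images",
    "data/VTG",
)


def find_missing(
    repo_root: str = ".", expected: Iterable[str] = EXPECTED_PATHS
) -> List[str]:
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        for rel in missing:
            print(f"missing: {rel}")
    else:
        print("all listed paths are present")
```

Run it from the repository root so that the relative paths resolve the same way the configs resolve `./data/`.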
Training
MomentSeg provides training configs for both model scales:
projects/qwenvl_sam2/configs/momentseg-3B.py
projects/qwenvl_sam2/configs/momentseg-7B.py
Use the root training launcher:
bash train.sh
train.sh reads configs from CONFIG_LIST and uses NUM_GPUS and PORT for
distributed launch. Edit CONFIG_LIST to select the model scale, or override
the launch settings from the command line:
NUM_GPUS=8 PORT=29500 bash train.sh
You can also launch a config manually:
bash tools/dist.sh train projects/qwenvl_sam2/configs/momentseg-3B.py 8
After training, convert the checkpoint to Hugging Face format if needed:
python projects/qwenvl_sam2/hf/convert_to_hf_qwenv2.py \
    projects/qwenvl_sam2/configs/momentseg-3B.py \
    --pth-model work_dirs/momentseg-3B/iter_xxx.pth \
    --save-path work_dirs/momentseg-3B/hf_model
Demo Usage
Run inference on an image, a video, or a folder of frames with demo/demo.py.
The demo uses checkpoints/MomentSeg-3B by default and saves visualized results
to output/demo unless --work-dir is specified:
CUDA_VISIBLE_DEVICES=0 python demo/demo.py \
    <IMAGE_OR_VIDEO_OR_FRAME_FOLDER> \
    --model_path <MODEL_DIR> \
    --work-dir output/demo \
    --text "xxx"
<MODEL_DIR> can be a released MomentSeg checkpoint directory or a locally converted
Hugging Face model directory, such as:
checkpoints/MomentSeg-3B
checkpoints/MomentSeg-7B
Evaluation
For full evaluation, update MODEL_PATHS, NUM_GPUS, and MASTER_PORT in
the provided script, then run:
bash test.sh
The main video segmentation evaluation entry point is:
torchrun --nproc_per_node=4 --master_port=29506 \
    projects/qwenvl_sam2/evaluation/ref_vos_eval.py \
    <MODEL_DIR> \
    --dataset REVOS \
    --launcher pytorch \
    --work_dir <OUTPUT_DIR> \
    --frame_num 8 \
    --inference_mode multi-frame \
    --video_max_frames 100
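If you want to evaluate several checkpoints in a loop, the same command can be assembled from Python. This is a minimal sketch that reuses only the flags and default values shown above. The function name `build_ref_vos_eval_cmd` and the `output/eval/...` directory are assumptions for illustration, and `REVOS` is the only `--dataset` value confirmed by this README.

```python
import shlex
from typing import List


def build_ref_vos_eval_cmd(
    model_dir: str,
    work_dir: str,
    dataset: str = "REVOS",
    num_gpus: int = 4,
    master_port: int = 29506,
    frame_num: int = 8,
    video_max_frames: int = 100,
) -> List[str]:
    """Assemble the torchrun command shown above as an argument list."""
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        f"--master_port={master_port}",
        "projects/qwenvl_sam2/evaluation/ref_vos_eval.py",
        model_dir,
        "--dataset", dataset,
        "--launcher", "pytorch",
        "--work_dir", work_dir,
        "--frame_num", str(frame_num),
        "--inference_mode", "multi-frame",
        "--video_max_frames", str(video_max_frames),
    ]


# The output directory below is a hypothetical choice, not a repository default.
cmd = build_ref_vos_eval_cmd("checkpoints/MomentSeg-3B", "output/eval/MomentSeg-3B")
print(shlex.join(cmd))
# To actually launch it: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues when a model or output path contains spaces.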
Metric scripts are provided under tools/eval/, including:
tools/eval/eval_revos.py
tools/eval/eval_mevis.py
tools/eval/eval_davis.py
tools/eval/eval_ref_sav.py
tools/eval/eval_reasonvos.py
tools/eval/eval_tvg.py
For example:
python tools/eval/eval_revos.py <OUTPUT_DIR>/REVOS.json \
    --save_json_name revos_valid.json
Acknowledgements
This project builds on Sa2VA. We sincerely thank the Sa2VA team for releasing their codebase and model framework.
Citation
Please cite our paper if you find this project helpful.
@misc{momentseg,
  title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding},
  author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
  year={2025},
  eprint={2510.09274},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.09274},
}