
June 19, 2026

MomentSeg: Moment-Centric Sampling for Enhanced
Video Pixel Understanding

Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS


[Demo animation]

MomentSeg is a unified MLLM for pixel-level vision–language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning segmentation for images and videos, temporal sentence grounding, and image/video question answering.

🔥 News

🕒 Release Status

  • Paper and Video Demo
  • Model Checkpoints and Inference Instructions
  • Training Code and Detailed Documentation
  • Data Release

🎥 Demo

Demo 1 Input video (source: Internet):

[Video: Demo 1]

Prompt: "Please segment the monkey that is scratching its ear."

Demo 2 Input video (source: Internet):

[Video: Demo 2]

Prompt: "Please segment the person standing in the center wearing blue clothes."

🏆 Performance

🖼️ Image-level Segmentation

(Referring Image Segmentation & Reasoning Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| RefCOCO (RES) | val: 82.1/82.6, testA: 83.7/85.1, testB: 79.2/80.2 |
| RefCOCO+ (RES) | val: 76.9/78.2, testA: 81.1/81.9, testB: 71.8/72.3 |
| RefCOCOg (RES) | val(U): 78.8/80.1, test(U): 79.2/80.1 |
| ReasonSeg | val: 62.0/63.3, test: 64.3/65.5 |
| GCG | val: 67.0/67.8, test: 65.9/67.9 |

🎬 Video-level Segmentation

(Referring and Reasoning Video Object Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| ReVOS (overall) | J: 60.0/62.3, F: 65.2/67.8, J&F: 62.6/65.1 |
| ReasonVOS | J: 58.2/59.2, F: 65.3/66.1, J&F: 61.7/62.7 |
| MeViS (val_u) | J: 58.1/58.7, F: 65.9/66.5, J&F: 62.0/62.6 |
| MeViS (val) | J: 51.7/53.9, F: 58.0/60.2, J&F: 54.8/57.1 |
| Ref-YouTube-VOS | J: 69.8/70.1, F: 74.3/74.5, J&F: 72.0/72.3 |
| Ref-DAVIS17 | J: 72.2/73.2, F: 80.6/81.7, J&F: 76.4/77.4 |
| Ref-SAV | J: 62.9/--, F: 65.2/--, J&F: 64.0/-- |

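
In the video results above, J is region similarity (mask IoU), F is contour accuracy, and J&F is their mean. The following is a minimal sketch of that relationship using the rounded 3B ReVOS numbers; the helper name `j_and_f` is hypothetical and not part of the MomentSeg codebase. Some rows can differ in the last digit, presumably because the reported averages are computed before rounding.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


# ReVOS (overall), 3B column from the table: J = 60.0, F = 65.2.
print(round(j_and_f(60.0, 65.2), 1))  # prints 62.6
```
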
⏱️ Temporal Sentence Grounding

(Temporal Sentence Grounding)

| Benchmark | Evaluation Results (3B) |
| --- | --- |
| Charades-STA | R@0.3: 76.1, R@0.5: 58.2, R@0.7: 25.8, mIoU: 50.2 |
| ActivityNet-Grounding | R@0.3: 67.5, R@0.5: 44.7, R@0.7: 23.2, mIoU: 45.4 |
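
For readers unfamiliar with the grounding metrics: R@m is usually the percentage of queries whose predicted time span overlaps the ground truth with temporal IoU of at least m, and mIoU is the mean temporal IoU over all queries. The sketch below illustrates these usual definitions on made-up spans. It is not the project's `tools/eval/eval_tvg.py`, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time spans, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ious, threshold):
    """Percentage of queries whose IoU is at least `threshold`."""
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)


# Toy spans for illustration only (not benchmark data).
preds = [(0.0, 10.0), (5.0, 8.0), (2.0, 4.0)]
gts = [(0.0, 8.0), (6.0, 12.0), (10.0, 12.0)]

ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
miou = 100.0 * sum(ious) / len(ious)

for m in (0.3, 0.5, 0.7):
    print(f"R@{m}: {recall_at(ious, m):.1f}")
print(f"mIoU: {miou:.1f}")
```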

🤖 Model Zoo

| Model Name | Base MLLM | Mask Decoder | Checkpoint |
| --- | --- | --- | --- |
| MomentSeg-3B | Qwen2.5-VL-3B-Instruct | SAM2-Hiera-Large | Checkpoint |
| MomentSeg-7B | Qwen2.5-VL-7B-Instruct | SAM2-Hiera-Large | Checkpoint |

🛠️ Installation

Create the environment and install the project dependencies:

conda create -n momentseg python=3.10 -y
conda activate momentseg

# Install the PyTorch build that matches your CUDA version first.
pip install -r requirements.txt

Prepare the base MLLM and mask decoder checkpoints under pretrained/:

pretrained/
├── Qwen2.5-VL-3B-Instruct/
├── Qwen2.5-VL-7B-Instruct/
└── sam2_hiera_large.pt

For example:

huggingface-cli download Qwen/Qwen2.5-VL-3B-Instruct \
  --local-dir pretrained/Qwen2.5-VL-3B-Instruct

huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct \
  --local-dir pretrained/Qwen2.5-VL-7B-Instruct

curl -L \
  https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt \
  -o pretrained/sam2_hiera_large.pt
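
Once the downloads finish, the pretrained/ layout can be sanity-checked before training. The following is a minimal sketch using only the standard library; `missing_pretrained` is a hypothetical helper that does not ship with the repository. It checks all three entries listed above, although you presumably only need the base MLLM for the model scale you plan to use.

```python
from pathlib import Path

# Entries taken from the pretrained/ layout described above.
EXPECTED = [
    "Qwen2.5-VL-3B-Instruct",
    "Qwen2.5-VL-7B-Instruct",
    "sam2_hiera_large.pt",
]


def missing_pretrained(root="pretrained"):
    """Return the expected entries that do not exist under `root`."""
    base = Path(root)
    return [name for name in EXPECTED if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_pretrained()
    print("all present" if not missing else f"missing: {missing}")
```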

📚 Data Preparation

The released data package is available at MomentSeg Data. If you prepare the data manually, organize the original datasets and processed annotations under data/ as follows:

data/
├── video_datas/
│   ├── revos/
│   ├── mevis/
│   │   └── train/
│   ├── rvos/
│   ├── ref_sav_eval/
│   │   ├── videos/
│   │   ├── meta_expressions_valid.json
│   │   └── mask_dict.json
│   ├── chat_univi/
│   │   ├── Activity_Videos/
│   │   └── video_chat.json
│   ├── sam_v_full/
│   └── sam_v_final_custom.json
├── ref_seg/
│   ├── refcoco/
│   ├── refcoco+/
│   └── refcocog/
├── reason_seg/
├── glamm_data/
│   ├── images/
│   └── annotations/
├── llava_data/
│   ├── llava_images/
│   ├── LLaVA-Instruct-150K/
│   └── LLaVA-Pretrain/
└── VTG/
    └── NumPro_FT/
        ├── videos_1FPS/
        ├── train.caption_coco_format.json
        └── activitynet_captions_train.json

The main training configs read paths relative to ./data/. Please keep the folder names consistent with the layout above, or update the corresponding paths in:

projects/qwenvl_sam2/configs/momentseg-3B.py
projects/qwenvl_sam2/configs/momentseg-7B.py

🚀 Training

MomentSeg provides training configs for both model scales:

projects/qwenvl_sam2/configs/momentseg-3B.py
projects/qwenvl_sam2/configs/momentseg-7B.py

Use the root training launcher:

bash train.sh

train.sh reads configs from CONFIG_LIST and uses NUM_GPUS and PORT for distributed launch. Edit CONFIG_LIST to select the model scale, or override the launch settings from the command line:

NUM_GPUS=8 PORT=29500 bash train.sh

You can also launch a config manually:

bash tools/dist.sh train projects/qwenvl_sam2/configs/momentseg-3B.py 8

After training, convert the checkpoint to Hugging Face format if needed:

python projects/qwenvl_sam2/hf/convert_to_hf_qwenv2.py \
  projects/qwenvl_sam2/configs/momentseg-3B.py \
  --pth-model work_dirs/momentseg-3B/iter_xxx.pth \
  --save-path work_dirs/momentseg-3B/hf_model
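
The `iter_xxx.pth` placeholder above stands for whichever checkpoint you want to convert. If you simply want the most recent one, a small helper can pick it out. This sketch assumes checkpoints are saved as `iter_<number>.pth` inside the work directory, which is inferred from the placeholder rather than confirmed by the training code; `latest_checkpoint` is a hypothetical name.

```python
import re
from pathlib import Path


def latest_checkpoint(work_dir):
    """Return the iter_<N>.pth file with the largest N, or None if none exist."""
    pattern = re.compile(r"iter_(\d+)\.pth$")
    best = None
    for path in Path(work_dir).glob("iter_*.pth"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip names such as iter_last.pth
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best[1] if best else None


if __name__ == "__main__":
    print(latest_checkpoint("work_dirs/momentseg-3B"))
```

Comparing the iteration numbers as integers avoids the lexicographic trap where `iter_9000.pth` would sort after `iter_10000.pth`.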

🎬 Demo Usage

Run inference on an image, a video, or a folder of frames with demo/demo.py. The demo uses checkpoints/MomentSeg-3B by default and saves visualized results to output/demo unless --work-dir is specified:

CUDA_VISIBLE_DEVICES=0 python demo/demo.py \
  <IMAGE_OR_VIDEO_OR_FRAME_FOLDER> \
  --model_path <MODEL_DIR> \
  --work-dir output/demo \
  --text "xxx"

<MODEL_DIR> can be a released MomentSeg checkpoint directory or a locally converted Hugging Face model directory, such as:

checkpoints/MomentSeg-3B
checkpoints/MomentSeg-7B
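
To try several prompts on the same input, the command above can be generated in a loop. The sketch below only builds and prints the commands (a dry run) from the flags shown in this section. `demo_command` is a hypothetical helper, and writing each prompt to its own sub-folder of output/demo is an assumption made here to keep results apart, not documented behavior of demo/demo.py.

```python
import shlex
import sys


def demo_command(source, model_dir, text, work_dir="output/demo"):
    """Build the argument list for one demo/demo.py run."""
    return [
        sys.executable, "demo/demo.py", source,
        "--model_path", model_dir,
        "--work-dir", work_dir,
        "--text", text,
    ]


if __name__ == "__main__":
    prompts = [
        "Please segment the monkey that is scratching its ear.",
        "Please segment the person standing in the center wearing blue clothes.",
    ]
    for index, prompt in enumerate(prompts):
        cmd = demo_command(
            "<IMAGE_OR_VIDEO_OR_FRAME_FOLDER>",
            "checkpoints/MomentSeg-3B",
            prompt,
            work_dir=f"output/demo/prompt_{index}",
        )
        # Dry run: print the command. To execute it instead, use
        # subprocess.run(cmd, check=True) from the repository root.
        print(shlex.join(cmd))
```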

📊 Evaluation

For full evaluation, update MODEL_PATHS, NUM_GPUS, and MASTER_PORT in the provided script, then run:

bash test.sh

The main video segmentation evaluation entry point is:

torchrun --nproc_per_node=4 --master_port=29506 \
  projects/qwenvl_sam2/evaluation/ref_vos_eval.py \
  <MODEL_DIR> \
  --dataset REVOS \
  --launcher pytorch \
  --work_dir <OUTPUT_DIR> \
  --frame_num 8 \
  --inference_mode multi-frame \
  --video_max_frames 100

Metric scripts are provided under tools/eval/, including:

tools/eval/eval_revos.py
tools/eval/eval_mevis.py
tools/eval/eval_davis.py
tools/eval/eval_ref_sav.py
tools/eval/eval_reasonvos.py
tools/eval/eval_tvg.py

For example:

python tools/eval/eval_revos.py <OUTPUT_DIR>/REVOS.json \
  --save_json_name revos_valid.json

🙏 Acknowledgements

This project builds on Sa2VA. We sincerely thank the Sa2VA team for releasing their codebase and model framework.

📖 Citation

Please cite our paper if you find this project helpful.

@misc{momentseg,
      title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding}, 
      author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
      year={2025},
      eprint={2510.09274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09274}, 
}