
April 21, 2026

Weakly-Supervised Referring Video Object Segmentation through Text Supervision

Official implementation of the paper, accepted to CVPR 2026 Findings!

WSRVOS overview

Overview

WSRVOS studies weakly-supervised referring video object segmentation with only text supervision. The implementation in this repository follows the main components described in the paper:

  • contrastive referring expression augmentation
  • bi-directional vision-language feature selection
  • bi-directional vision-language feature interaction
  • instance-aware expression classification
  • positive-prediction fusion
  • temporal segment ranking

Environment Setup

Recommended Python: 3.10+

pip install -r requirements.txt
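If you want to confirm that the active interpreter meets the recommended version before installing, a minimal sketch follows. It is not part of this repository; the helper name `python_is_recent_enough` is hypothetical and only illustrates the check.

```python
import sys

# Minimum version taken from the "Recommended Python: 3.10+" note.
MIN_PYTHON = (3, 10)


def python_is_recent_enough(version_info=None, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_recent_enough():
        print(f"Python {current} meets the recommended version.")
    else:
        print(f"Python {current} is older than the recommended 3.10.")
```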

Notes:

  • RoBERTa weights are loaded from pretrained/pretrained_roberta.
  • Video Swin weights are loaded from pretrained/pretrained_swin_transformer.
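
Before launching training, it can help to confirm that both weight directories are in place. The sketch below assumes it is run from the repository root and that each path listed above is a directory; the function name `missing_pretrained_dirs` is hypothetical and not part of this repository. It does not inspect the files inside the directories.

```python
from pathlib import Path

# Directory names taken from the notes above.
PRETRAINED_DIRS = (
    "pretrained/pretrained_roberta",
    "pretrained/pretrained_swin_transformer",
)


def missing_pretrained_dirs(repo_root="."):
    """Return the expected weight directories that do not exist under repo_root.

    Hypothetical helper: it only checks that each directory exists,
    not that the weight files inside it are complete.
    """
    root = Path(repo_root)
    return [name for name in PRETRAINED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_pretrained_dirs()
    if missing:
        print("Missing pretrained weight directories:")
        for name in missing:
            print(f"  {name}")
    else:
        print("All pretrained weight directories found.")
```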

Dataset Preparation

Default config paths assume the repository is placed next to RVOS_datasets/:

../RVOS_datasets/a2d_sentences
../RVOS_datasets/jhmdb_sentences
../RVOS_datasets/refer_youtube_vos
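
To verify this layout before training, a small check like the following may be useful. It is a sketch under the assumption that the three paths above are directories in a sibling folder of the repository; the function name `dataset_status` is hypothetical, and the internal structure of each dataset is not checked.

```python
from pathlib import Path

# Dataset folder names taken from the default paths listed above.
DATASET_SUBDIRS = ("a2d_sentences", "jhmdb_sentences", "refer_youtube_vos")


def dataset_status(repo_root="."):
    """Map each expected dataset folder to True if it exists, else False.

    Hypothetical helper: looks in RVOS_datasets/ next to the repository,
    mirroring the ../RVOS_datasets/... default paths.
    """
    datasets_root = Path(repo_root).resolve().parent / "RVOS_datasets"
    return {name: (datasets_root / name).is_dir() for name in DATASET_SUBDIRS}


if __name__ == "__main__":
    for name, found in dataset_status().items():
        marker = "found" if found else "MISSING"
        print(f"../RVOS_datasets/{name}: {marker}")
```

Only the datasets you plan to train or evaluate on need to be present; a missing entry matters only for the scripts that use it.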

Training / Evaluation

Train on A2D-Sentences:

bash scripts/train_a2d.sh

Evaluate on A2D-Sentences:

bash scripts/eval_a2d.sh --checkpoint ./outputs/a2d/best.pth

Train on Ref-YouTube-VOS:

bash scripts/train_refytb.sh