April 21, 2026
Weakly-Supervised Referring Video Object Segmentation through Text Supervision
Official implementation of the paper accepted to CVPR 2026 Findings!
Overview
WSRVOS studies weakly-supervised referring video object segmentation with only text supervision. The implementation in this repository follows the main components described in the paper:
- contrastive referring expression augmentation
- bi-directional vision-language feature selection
- bi-directional vision-language feature interaction
- instance-aware expression classification
- positive-prediction fusion
- temporal segment ranking
Environment Setup
Recommended Python: 3.10+
pip install -r requirements.txt
Notes:
- RoBERTa weights are loaded from pretrained/pretrained_roberta.
- Video Swin weights are loaded from pretrained/pretrained_swin_transformer.
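Before launching training, it can help to confirm that both weight locations named in the notes above are present. The following is a minimal sketch, not part of the repository: the helper name `missing_weight_paths` is hypothetical, and it assumes the two paths are relative to the repository root.

```python
from pathlib import Path

# Paths named in the notes above, assumed relative to the repository root.
WEIGHT_PATHS = (
    "pretrained/pretrained_roberta",
    "pretrained/pretrained_swin_transformer",
)


def missing_weight_paths(repo_root="."):
    """Return the expected weight paths that do not exist under repo_root.

    Hypothetical helper for illustration; not part of the repository.
    """
    root = Path(repo_root)
    return [p for p in WEIGHT_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    for path in missing_weight_paths():
        print(f"missing: {path}")
```

Running the script from the repository root prints one line per missing path and nothing when both are in place.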
Dataset Preparation
Default config paths assume the repository is placed next to RVOS_datasets/:
../RVOS_datasets/a2d_sentences
../RVOS_datasets/jhmdb_sentences
../RVOS_datasets/refer_youtube_vos
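The sibling-directory layout above can be checked with a short script. This is a sketch under stated assumptions: the function name `resolve_dataset_dirs` is hypothetical, and it only verifies that the three top-level directories exist, not their internal structure.

```python
from pathlib import Path

# Dataset directory names taken from the default config paths above.
DATASET_DIRS = ("a2d_sentences", "jhmdb_sentences", "refer_youtube_vos")


def resolve_dataset_dirs(repo_root="."):
    """Map each dataset name to (resolved path, whether it is a directory).

    Hypothetical helper for illustration. RVOS_datasets/ is expected to sit
    next to the repository, i.e. in the parent directory of repo_root.
    """
    base = Path(repo_root).resolve().parent / "RVOS_datasets"
    return {name: (base / name, (base / name).is_dir()) for name in DATASET_DIRS}


if __name__ == "__main__":
    for name, (path, found) in resolve_dataset_dirs().items():
        print(f"{name}: {path} [{'ok' if found else 'missing'}]")
```

If the datasets live elsewhere, the config paths would need to be changed rather than this script.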
Training / Evaluation
Train on A2D-Sentences:
bash scripts/train_a2d.sh
Evaluate on A2D-Sentences:
bash scripts/eval_a2d.sh --checkpoint ./outputs/a2d/best.pth
Train on Ref-YouTube-VOS:
bash scripts/train_refytb.sh