[CVPR 2026 🎉🎉🎉] Towards Streaming Referring Video Segmentation via Large Language Model

June 14, 2026

[📂 GitHub] [📄 Paper] [🚀 Quick Start]

Network Architecture

Comparison between previous methods and our streaming framework

StreamingRVOS Performance

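| Model Name | RefCOCO | RefCOCO+ | RefCOCOg | DAVIS | Ref-YTBVOS | MeVIS | REVOS (Referring) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| StreamingRVOS-1B | 80.3 | 75.0 | 77.5 | 76.4 | 69.1 | 49.7 | 59.7 |
| StreamingRVOS-4B | 82.5 | 77.9 | 79.9 | 76.6 | 70.5 | 50.9 | 63.0 |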

Performance comparison on RVOS benchmarks

Dataset Structure

data/
├── video_datas
│   ├── revos
│   ├── mevis
│   ├── davis17
│   ├── sam_v_full          # please download from the SAM-2 official repo
│   └── sam_v_final_v3.json
├── ref_seg
│   ├── refclef
│   ├── refcoco
│   ├── refcoco+
│   └── refcocog
├── glamm_data
│   ├── images
│   └── annotations
├── osprey-724k
│   ├── Osprey-724K
│   └── coco
└── llava_data
    ├── llava_images
    ├── LLaVA-Instruct-150K
    └── LLaVA-Pretrain
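
Before launching training, it can help to confirm that the local `data/` directory matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `find_missing` and the constant `EXPECTED_DATA_PATHS` are hypothetical, and the path list is simply copied from the tree.

```python
from pathlib import Path

# Relative paths copied from the dataset tree above (hypothetical constant,
# not defined anywhere in the repository).
EXPECTED_DATA_PATHS = [
    "video_datas/revos",
    "video_datas/mevis",
    "video_datas/davis17",
    "video_datas/sam_v_full",
    "video_datas/sam_v_final_v3.json",
    "ref_seg/refclef",
    "ref_seg/refcoco",
    "ref_seg/refcoco+",
    "ref_seg/refcocog",
    "glamm_data/images",
    "glamm_data/annotations",
    "osprey-724k/Osprey-724K",
    "osprey-724k/coco",
    "llava_data/llava_images",
    "llava_data/LLaVA-Instruct-150K",
    "llava_data/LLaVA-Pretrain",
]


def find_missing(root, expected=EXPECTED_DATA_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = find_missing("data")
    if missing:
        print("Missing entries under data/:")
        for rel in missing:
            print("  " + rel)
    else:
        print("data/ matches the expected layout")
```

The check only tests that each entry exists; it does not validate the contents of any dataset.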

Pretrained Models

pretrained/
├── sam2_hiera_large.pt
├── InternVL2_5-1B
└── InternVL2_5-4B

Quick Start

Environment

conda create -n strvos python=3.10
conda activate strvos
pip install -r requirements.txt

Train

# stage1
CUDA_VISIBLE_DEVICES=0,1,2,3 bash tools/dist.sh train projects/streamingrvos/configs/1b_stage1.py 4

# stage2 (edit stage1 ckpt path in config first)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash tools/dist.sh train projects/streamingrvos/configs/1b_stage2.py 4
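
In the commands above, the trailing `4` has to agree with the number of devices listed in `CUDA_VISIBLE_DEVICES`. The sketch below shows one way to keep the two in sync by deriving both from a single list of GPU ids. The helper `train_command` is hypothetical and only builds the command string; it does not run anything.

```python
import shlex


def train_command(config, gpu_ids):
    """Build the tools/dist.sh training command for one config.

    The trailing GPU count is derived from gpu_ids so it always matches
    the devices listed in CUDA_VISIBLE_DEVICES.
    """
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    devices = ",".join(str(g) for g in gpu_ids)
    args = ["bash", "tools/dist.sh", "train", config, str(len(gpu_ids))]
    return f"CUDA_VISIBLE_DEVICES={devices} " + " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Print the stage1 and stage2 commands for four GPUs.
    for stage in ("1b_stage1.py", "1b_stage2.py"):
        print(train_command(f"projects/streamingrvos/configs/{stage}", [0, 1, 2, 3]))
```

Stage 2 still requires editing the stage 1 checkpoint path in its config by hand before the printed command is run.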

Convert to HF

PYTHONPATH=. python projects/streamingrvos/hf/convert_to_hf.py config_path pth_path --save-path save_path

Test

Supported datasets: DAVIS, MEVIS_U, MEVIS, REVOS, REFYTVOS, refcoco, refcoco_plus, refcocog

CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONPATH=. python -m torch.distributed.launch --master_port=29500 --nproc_per_node=4 \
    projects/streamingrvos/evaluation/ref_vos_eval.py --launcher pytorch --model_path your_model_path --dataset DAVIS

CUDA_VISIBLE_DEVICES=0,1,2,3 PYTHONPATH=. python -m torch.distributed.launch --master_port=29500 --nproc_per_node=4 \
    projects/streamingrvos/evaluation/ref_vos_eval.py --launcher pytorch --model_path your_model_path --dataset REFYTVOS --submit

PYTHONPATH=. CUDA_VISIBLE_DEVICES=2 python projects/streamingrvos/evaluation/refcoco_eval.py --model_path your_model_path --dataset refcocog
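
To evaluate several datasets in a row, the commands above can be generated from the dataset name. This is a rough sketch under stated assumptions: the examples above only show `DAVIS` and `REFYTVOS` with `ref_vos_eval.py` and `refcocog` with `refcoco_eval.py`, so routing the remaining video datasets to the first script and the remaining image datasets to the second is an assumption. The helper `eval_command` is hypothetical, and `--submit` is left as an opt-in flag because the source only shows it for `REFYTVOS`.

```python
# Assumed split of the supported datasets between the two evaluation scripts.
VIDEO_DATASETS = {"DAVIS", "MEVIS_U", "MEVIS", "REVOS", "REFYTVOS"}
IMAGE_DATASETS = {"refcoco", "refcoco_plus", "refcocog"}
EVAL_DIR = "projects/streamingrvos/evaluation"


def eval_command(dataset, model_path, gpu_ids, master_port=29500, submit=False):
    """Return the shell command string that evaluates one dataset."""
    if not gpu_ids:
        raise ValueError("gpu_ids must not be empty")
    if dataset in VIDEO_DATASETS:
        devices = ",".join(str(g) for g in gpu_ids)
        cmd = (
            f"CUDA_VISIBLE_DEVICES={devices} PYTHONPATH=. "
            f"python -m torch.distributed.launch --master_port={master_port} "
            f"--nproc_per_node={len(gpu_ids)} {EVAL_DIR}/ref_vos_eval.py "
            f"--launcher pytorch --model_path {model_path} --dataset {dataset}"
        )
        return (cmd + " --submit") if submit else cmd
    if dataset in IMAGE_DATASETS:
        # The image example above runs on a single GPU, so only the
        # first id is used here.
        return (
            f"PYTHONPATH=. CUDA_VISIBLE_DEVICES={gpu_ids[0]} "
            f"python {EVAL_DIR}/refcoco_eval.py "
            f"--model_path {model_path} --dataset {dataset}"
        )
    raise ValueError(f"unsupported dataset: {dataset}")


if __name__ == "__main__":
    for name in ("DAVIS", "MEVIS", "refcocog"):
        print(eval_command(name, "your_model_path", [0, 1, 2, 3]))
```

The function only returns strings, so the commands can be reviewed before being pasted into a shell or passed to a job scheduler.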

Eval

python tools/eval/eval_mevis.py  # edit the dataset path and prediction path inside the script first

Acknowledgement

Our work is built upon Sa2VA. We thank the authors for their excellent open-source contribution.

Citation

@InProceedings{Zhang_2026_CVPR,
    author    = {Zhang, Wenkang and Yang, Kaicheng and An, Xiang and Li, Qiang and Feng, Ziyong and Yang, Wankou and Deng, Jiankang},
    title     = {Towards Streaming Referring Video Segmentation via Large Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {24598-24607}
}