# Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
## 📢 News

- [2026.02.13] 🚨🚨 We released new demos.
- [2026.02.05] 🔥🔥 We released the training code.
- [2026.01.26] 🎉🎉 FlowRVS was accepted to ICLR 2026!
- [2025.12.01] 🔥🔥 We released the model weights and inference code.
## 🙋‍♀️ Overview

✨ Key Features:

- FlowRVS reformulates RVOS as learning a continuous, text-conditioned flow that deforms a video's spatio-temporal representation into its target mask (a minimal sketch of this objective follows the list).
- FlowRVS successfully transfers a powerful text-to-video generative model to the RVOS task through a suite of principled adaptation techniques.
- FlowRVS achieves new state-of-the-art (SOTA) results on key benchmarks.
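In code, the flow-matching objective amounts to regressing the constant velocity of a straight path between the two latents. The sketch below is illustrative only; the names `dit`, `video_latent`, `mask_latent`, and `text_emb`, and the model's call signature, are assumptions for exposition, not the repository's actual API.

```python
# Illustrative sketch of the text-conditioned flow-matching objective.
# All names here (dit, video_latent, mask_latent, text_emb) are assumptions.
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, video_latent, mask_latent, text_emb):
    """One training step: regress the velocity field that deforms the
    video's latent representation (t=0) into its mask latent (t=1)."""
    b = video_latent.shape[0]
    t = torch.rand(b, device=video_latent.device)        # random time per sample
    t_ = t.view(b, 1, 1, 1, 1)                           # broadcast over (C, T, H, W)

    x_t = (1.0 - t_) * video_latent + t_ * mask_latent   # point on the straight path
    v_target = mask_latent - video_latent                # constant path velocity

    v_pred = dit(x_t, t, text_emb)                       # text-conditioned DiT
    return F.mse_loss(v_pred, v_target)
```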
## 🎬 Demos

We provide weights trained exclusively on the challenging MeViS dataset. Despite not seeing these domains during training, FlowRVS demonstrates remarkable zero-shot generalization across movies, sports, and internet memes. Have fun exploring!

- 🦾 Ultraman
- 🏀 Basketball
- ⚖️ Better Call Saul
- 🐱 Cat Memes Segmentation: robust against severe occlusions (shelf, paper roll, sausage) and significant non-rigid body deformation; the model tracks the target continuously even when it is partially hidden or undergoing extreme pose changes.
## 🛠️ Environment Setup

1. Create a conda environment:

```bash
git clone https://github.com/xmz111/FlowRVS.git && cd FlowRVS
conda create -n flowrvs python=3.10 -y
conda activate flowrvs
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Prepare the Wan2.1 T2V model; its config is needed to construct the models, and its T5 encoder is used for text conditioning:

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers
```
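As a quick check that the download is usable, the sketch below loads the tokenizer and T5 text encoder from the checkpoint. The subfolder names and the `UMT5EncoderModel` class follow the standard Diffusers layout used by the Wan2.1 Diffusers release; verify them against the files you actually downloaded.

```python
# Hedged sketch: load the T5 text encoder from the downloaded checkpoint.
# Subfolder names ("tokenizer", "text_encoder") are assumed from the
# standard Diffusers layout; adjust if your local files differ.
from transformers import AutoTokenizer, UMT5EncoderModel

model_dir = "./Wan2.1-T2V-1.3B-Diffusers"
tokenizer = AutoTokenizer.from_pretrained(model_dir, subfolder="tokenizer")
text_encoder = UMT5EncoderModel.from_pretrained(model_dir, subfolder="text_encoder")

tokens = tokenizer(["the cat on the left"], return_tensors="pt")
text_emb = text_encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)
```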
## 💻 Inference

### Inference on the MeViS val and val_u splits

1. Prepare the data.

The dataset can be found at https://github.com/henghuiding/MeViS. After you have downloaded it, the file structure should look like this:
```
datasets
└── MeViS
    ├── valid
    │   ├── JPEGImages
    │   └── meta_expressions.json
    └── valid_u
        ├── JPEGImages
        ├── mask_dict.json
        └── meta_expressions.json
```
You can download the data via gdown:

```bash
pip install gdown
gdown https://drive.google.com/drive/folders/1MACaQ-O8seyMj-MBlycxRgCT08RVBZJp --folder -O datasets/MeViS/
```
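A quick sanity check (illustrative, not part of the repository) can confirm the layout above before running inference:

```python
# Verify the expected MeViS layout shown in the tree above.
import os

root = "datasets/MeViS"
expected = [
    "valid/JPEGImages",
    "valid/meta_expressions.json",
    "valid_u/JPEGImages",
    "valid_u/mask_dict.json",
    "valid_u/meta_expressions.json",
]
for rel in expected:
    status = "ok" if os.path.exists(os.path.join(root, rel)) else "MISSING"
    print(status, rel)
```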
2. Download the DiT and tuned-VAE checkpoints from https://huggingface.co/xmz111/FlowRVS and place them as FlowRVS_dit_mevis.pth and tuned_vae.pth (the filenames passed to the commands below).
3. Run inference:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference_mevis.py --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth --output_dir=result --split=valid_u
```

Note that this will use about 33 GB of GPU memory with the default settings.
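Conceptually, inference integrates the learned flow from the video latent toward the mask latent. A minimal sketch of such a sampling loop, assuming a fixed-step Euler solver and the same hypothetical `dit` interface as in the training sketch above:

```python
# Hedged sketch of the sampling loop: integrate the learned velocity field
# from the video latent (t=0) toward the mask latent (t=1). The step count
# and the dit interface are assumptions, not the repository's actual code.
import torch

@torch.no_grad()
def sample_mask_latent(dit, video_latent, text_emb, num_steps=10):
    x = video_latent.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * dit(x, t, text_emb)  # one Euler step along the flow
    return x  # decode with the tuned VAE to obtain the final mask
```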
### Inference on arbitrary videos

```bash
python inference_demo.py --input_path=video.mp4 --text_prompts "prompt_1" "prompt_2" --fps=12 --save_fig --output_dir=result --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth
```
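To inspect the predictions, a simple overlay of the masks on the input frames works well. The `result/frames` and `result/masks` paths below are assumptions; adapt them to whatever `inference_demo.py` actually writes under `--output_dir`.

```python
# Hedged visualization sketch: tint masked pixels red on each frame.
# Directory names below are assumptions, not the script's real output layout.
import glob, os
import numpy as np
from PIL import Image

os.makedirs("result/overlay", exist_ok=True)
frames = sorted(glob.glob("result/frames/*.jpg"))
masks = sorted(glob.glob("result/masks/*.png"))
for frame_path, mask_path in zip(frames, masks):
    frame = np.array(Image.open(frame_path).convert("RGB"), dtype=np.float32)
    mask = np.array(Image.open(mask_path).convert("L")) > 127  # binary mask
    frame[mask] = 0.5 * frame[mask] + 0.5 * np.array([255.0, 0.0, 0.0])
    out = os.path.join("result/overlay", os.path.basename(frame_path))
    Image.fromarray(frame.astype(np.uint8)).save(out)
```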
## 🔥 Training

Use `--dataset_file` to select the training dataset (`mevis`, `pretrain`, or `ytvos`), and use `--resume` to load a checkpoint.

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --dataset_file=mevis --num_frames=17 --lr=5e-5 --output_dir=mevis_training
```
## 🙏 Acknowledgement

We referenced the following works and appreciate their contributions to the community.
## 📜 BibTeX

If you find our FlowRVS useful for your research and applications, please kindly cite us:

```bibtex
@article{wang2025flowrvs,
  title={Deforming Videos to Masks: Flow Matching for Referring Video Segmentation},
  author={Wang, Zanyi and Jiang, Dengyang and Li, Liuzhuozheng and Dang, Sizhe and Li, Chengzu and Yang, Harry and Dai, Guang and Wang, Mengmeng and Wang, Jingdong},
  journal={arXiv preprint arXiv:2510.06139},
  year={2025}
}
```