# Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
## 📢 News

- [2026.02.13] 🚨🚨 We released new demos.
- [2026.02.05] 🔥🔥 We released the training code.
- [2026.01.26] 🎉🎉 FlowRVS was accepted to ICLR 2026!
- [2025.12.01] 🔥🔥 We released the model weights and inference code.
## 🙋‍♀️ Overview

✨ Key Features:

- FlowRVS reformulates RVOS as learning a continuous, text-conditioned flow that deforms a video's spatio-temporal representation into its target mask (a minimal sketch of this objective follows the list).
- FlowRVS successfully transfers a powerful text-to-video generative model to the RVOS task through a suite of principled adaptation techniques.
- FlowRVS achieves new state-of-the-art (SOTA) results on key benchmarks.
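In code, the flow-matching objective amounts to regressing the constant velocity of a straight path between the two latents. The sketch below is illustrative only; the names `dit`, `video_latent`, `mask_latent`, and `text_emb`, and the model's call signature, are assumptions for exposition, not the repository's actual API.

```python
# Illustrative sketch of the text-conditioned flow-matching objective.
# All names here (dit, video_latent, mask_latent, text_emb) are assumptions.
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, video_latent, mask_latent, text_emb):
    """One training step: regress the velocity field that deforms the
    video's latent representation (t=0) into its mask latent (t=1)."""
    b = video_latent.shape[0]
    t = torch.rand(b, device=video_latent.device)        # random time per sample
    t_ = t.view(b, 1, 1, 1, 1)                           # broadcast over (C, T, H, W)

    x_t = (1.0 - t_) * video_latent + t_ * mask_latent   # point on the straight path
    v_target = mask_latent - video_latent                # constant path velocity

    v_pred = dit(x_t, t, text_emb)                       # text-conditioned DiT
    return F.mse_loss(v_pred, v_target)
```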
## 🎬 Demos

We provide weights trained exclusively on the challenging MeViS dataset. Despite not seeing these domains during training, FlowRVS demonstrates remarkable zero-shot generalization across movies, sports, and internet memes. Have fun exploring!

- 🦾 Ultraman
- 🏀 Basketball
- ⚖️ Better Call Saul
- 🐱 Cat Memes Segmentation: robust against severe occlusions (shelf, paper roll, sausage) and significant non-rigid body deformation; the model tracks the target continuously even when it is partially hidden or undergoing extreme pose changes.
## 🛠️ Environment Setup

1. Create a conda environment:

```bash
git clone https://github.com/xmz111/FlowRVS.git && cd FlowRVS
conda create -n flowrvs python=3.10 -y
conda activate flowrvs
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Prepare the Wan2.1 T2V model; its config is needed to construct the models, and its T5 encoder is used for text conditioning:

```bash
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers
```
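As a quick check that the download is usable, the sketch below loads the tokenizer and T5 text encoder from the checkpoint. The subfolder names and the `UMT5EncoderModel` class follow the standard Diffusers layout used by the Wan2.1 Diffusers release; verify them against the files you actually downloaded.

```python
# Hedged sketch: load the T5 text encoder from the downloaded checkpoint.
# Subfolder names ("tokenizer", "text_encoder") are assumed from the
# standard Diffusers layout; adjust if your local files differ.
from transformers import AutoTokenizer, UMT5EncoderModel

model_dir = "./Wan2.1-T2V-1.3B-Diffusers"
tokenizer = AutoTokenizer.from_pretrained(model_dir, subfolder="tokenizer")
text_encoder = UMT5EncoderModel.from_pretrained(model_dir, subfolder="text_encoder")

tokens = tokenizer(["the cat on the left"], return_tensors="pt")
text_emb = text_encoder(**tokens).last_hidden_state  # (1, seq_len, hidden_dim)
```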
## 💻 Inference

### Inference on the MeViS val and val_u splits

1. Prepare the data.

The dataset can be found at https://github.com/henghuiding/MeViS. After you have downloaded it, the file structure should look like this:
```
datasets
└── MeViS
    ├── valid
    │   ├── JPEGImages
    │   └── meta_expressions.json
    └── valid_u
        ├── JPEGImages
        ├── mask_dict.json
        └── meta_expressions.json
```
You can download the data via gdown:

```bash
pip install gdown
gdown https://drive.google.com/drive/folders/1MACaQ-O8seyMj-MBlycxRgCT08RVBZJp --folder -O datasets/MeViS/
```
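A quick sanity check (illustrative, not part of the repository) can confirm the layout above before running inference:

```python
# Verify the expected MeViS layout shown in the tree above.
import os

root = "datasets/MeViS"
expected = [
    "valid/JPEGImages",
    "valid/meta_expressions.json",
    "valid_u/JPEGImages",
    "valid_u/mask_dict.json",
    "valid_u/meta_expressions.json",
]
for rel in expected:
    status = "ok" if os.path.exists(os.path.join(root, rel)) else "MISSING"
    print(status, rel)
```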
2. Download the DiT and tuned-VAE checkpoints from https://huggingface.co/xmz111/FlowRVS and place them as FlowRVS_dit_mevis.pth and tuned_vae.pth (the filenames passed to the commands below).
3. Run inference:

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference_mevis.py --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth --output_dir=result --split=valid_u
```

Note that this will use about 33 GB of GPU memory with the default settings.
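Conceptually, inference integrates the learned flow from the video latent toward the mask latent. A minimal sketch of such a sampling loop, assuming a fixed-step Euler solver and the same hypothetical `dit` interface as in the training sketch above:

```python
# Hedged sketch of the sampling loop: integrate the learned velocity field
# from the video latent (t=0) toward the mask latent (t=1). The step count
# and the dit interface are assumptions, not the repository's actual code.
import torch

@torch.no_grad()
def sample_mask_latent(dit, video_latent, text_emb, num_steps=10):
    x = video_latent.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * dit(x, t, text_emb)  # one Euler step along the flow
    return x  # decode with the tuned VAE to obtain the final mask
```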
### Inference on arbitrary videos

```bash
python inference_demo.py --input_path=video.mp4 --text_prompts "prompt_1" "prompt_2" --fps=12 --save_fig --output_dir=result --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth
```
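To inspect the predictions, a simple overlay of the masks on the input frames works well. The `result/frames` and `result/masks` paths below are assumptions; adapt them to whatever `inference_demo.py` actually writes under `--output_dir`.

```python
# Hedged visualization sketch: tint masked pixels red on each frame.
# Directory names below are assumptions, not the script's real output layout.
import glob, os
import numpy as np
from PIL import Image

os.makedirs("result/overlay", exist_ok=True)
frames = sorted(glob.glob("result/frames/*.jpg"))
masks = sorted(glob.glob("result/masks/*.png"))
for frame_path, mask_path in zip(frames, masks):
    frame = np.array(Image.open(frame_path).convert("RGB"), dtype=np.float32)
    mask = np.array(Image.open(mask_path).convert("L")) > 127  # binary mask
    frame[mask] = 0.5 * frame[mask] + 0.5 * np.array([255.0, 0.0, 0.0])
    out = os.path.join("result/overlay", os.path.basename(frame_path))
    Image.fromarray(frame.astype(np.uint8)).save(out)
```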
## 🔥 Training

Use `--dataset_file` to select the training dataset (`mevis`, `pretrain`, or `ytvos`), and use `--resume` to load a checkpoint.

```bash
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --dataset_file=mevis --num_frames=17 --lr=5e-5 --output_dir=mevis_training
```
## 🙏 Acknowledgement

We referenced the following works and appreciate their contributions to the community.
## 📜 BibTeX

If you find our FlowRVS useful for your research and applications, please kindly cite us:

```bibtex
@article{wang2025flowrvs,
  title={Deforming Videos to Masks: Flow Matching for Referring Video Segmentation},
  author={Wang, Zanyi and Jiang, Dengyang and Li, Liuzhuozheng and Dang, Sizhe and Li, Chengzu and Yang, Harry and Dai, Guang and Wang, Mengmeng and Wang, Jingdong},
  journal={arXiv preprint arXiv:2510.06139},
  year={2025}
}
```