VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

June 5, 2026

Temporal-spatial reasoning · Multi-turn keyframe exploration · SEG-aware reinforcement learning

Ming Dai¹, Sen Yang², Boqiang Duan², Boyuan Tong², Jiedong Zhuang³, Wankou Yang¹, Jingdong Wang²

¹Southeast University ²Baidu Inc. ³Zhejiang University

📖 Introduction

VideoSEG-O3 is a multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS). Instead of segmenting from a fixed set of sampled frames, VideoSEG-O3 actively explores temporal intervals and keyframes through a temporal-spatial chain-of-thought, enabling coarse-to-fine reasoning over object identity, motion, and linguistic references. The framework further introduces SEG-aware logit calibration to connect token-level policy optimization with pixel-level mask quality, and uses a decoupled thinking trace to structure temporal, spatial, and language reasoning. Training proceeds in three stages: supervised fine-tuning (SFT), a VTS-CoT cold start, and GRPO-based reinforcement learning for multi-turn video segmentation.
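
To make the coarse-to-fine exploration idea concrete, here is a minimal Python sketch of narrowing a temporal interval over several turns. It is an illustration under stated assumptions, not the repository's implementation: `explore_keyframe`, `score_frame`, `frames_per_turn`, and `max_turns` are hypothetical names, and `score_frame` stands in for the model's own judgment of how well a frame shows the referred object.

```python
from typing import Callable, List, Tuple


def explore_keyframe(
    num_frames: int,
    score_frame: Callable[[int], float],
    frames_per_turn: int = 4,
    max_turns: int = 3,
) -> Tuple[int, List[Tuple[int, int]]]:
    """Narrow a temporal interval turn by turn and return a keyframe index.

    `score_frame` is a hypothetical stand-in for the model's relevance
    judgment; it is not part of the VideoSEG-O3 code base.
    """
    start, end = 0, num_frames - 1  # inclusive frame interval
    visited: List[Tuple[int, int]] = []
    best = start
    for _ in range(max_turns):
        visited.append((start, end))
        span = end - start
        n = min(frames_per_turn, span + 1)
        if n == 1:
            candidates = [start]
        else:
            # Evenly spaced candidate frames inside the current interval.
            candidates = [start + round(i * span / (n - 1)) for i in range(n)]
        best = max(candidates, key=score_frame)
        if span == 0:
            break
        # Zoom in: keep one sampling step on each side of the best candidate.
        step = max(1, span // (n - 1))
        start, end = max(start, best - step), min(end, best + step)
    return best, visited


# Toy usage: pretend the referred object is most visible around frame 37.
keyframe, intervals = explore_keyframe(100, lambda t: -abs(t - 37))
print(keyframe, intervals)
```

In this toy setting each turn samples only four frames, yet the interval shrinks around the most relevant region. That is the intuition behind exploring intervals and keyframes rather than committing to a fixed frame set.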

✨ Highlights

Multi-turn temporal-spatial reasoning: VideoSEG-O3 actively explores temporal intervals and keyframes instead of relying on a fixed frame set.
Decoupled thinking trace: temporal localization, spatial grounding, and language reasoning are structured into an explicit multi-turn workflow.
SEG-aware reinforcement learning: token-level policy optimization is aligned with pixel-level mask quality through dense segmentation rewards.
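
The link between mask quality and policy optimization can be sketched as follows. This is a simplified illustration, assuming mask IoU as the dense reward and the standard GRPO group-relative normalization (reward minus group mean, divided by group standard deviation). The actual reward design and the SEG-aware logit calibration in VideoSEG-O3 may differ, and `mask_iou` and `group_relative_advantages` are hypothetical helper names.

```python
import numpy as np


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize rewards within one rollout group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


# Toy group of three rollouts for the same query (assumed data).
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True

exact = gt.copy()                      # perfect mask
loose = np.zeros_like(gt)
loose[1:3, 1:4] = True                 # one extra column of pixels
empty = np.zeros_like(gt)              # missed the object entirely

rewards = [mask_iou(m, gt) for m in (exact, loose, empty)]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Rollouts whose masks score above the group mean receive positive advantages, so the tokens that produced them are reinforced. Rollouts with poor masks are pushed down.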

📢 News

2026.06.05 🔥 The training code, evaluation code, and paper are now available. We have also released the RL checkpoints for VideoSEG-O3-2B and VideoSEG-O3-4B, together with the VTS-CoT and RL training data.
2026.04.30 🎉 VideoSEG-O3 has been accepted to ICML 2026.

🏆 Main Results

Referring Video Object Segmentation (J&F). VideoSEG-O3 achieves strong performance across five RefVOS benchmarks.

Model	MeViS	Ref-Youtube-VOS	Ref-DAVIS17	Ref-SAV	Long-RVOS
VideoSEG-O3-2B	55.6	70.5	80.0	62.9	54.8
VideoSEG-O3-4B	60.0	74.1	79.4	65.5	57.4
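
For readers unfamiliar with the metric, J&F is the mean of region similarity J (mask IoU) and boundary accuracy F (an F-measure over mask contours), following the DAVIS convention. The sketch below computes it for a single frame on a 0 to 1 scale, whereas the tables report percentages. It is deliberately simplified: boundaries are matched pixel-exactly, while official evaluation toolkits allow a small tolerance band, so the numbers are not comparable with benchmark scores. All function names are illustrative.

```python
import numpy as np


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = (pred | gt).sum()
    return 1.0 if union == 0 else float((pred & gt).sum() / union)


def boundary(mask: np.ndarray) -> np.ndarray:
    """Foreground pixels with at least one 4-neighbour outside the mask."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    interior = (
        p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    )
    return m & ~interior


def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """F: F-measure of boundary precision and recall (exact pixel match)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    matched = (bp & bg).sum()
    precision = matched / bp.sum() if bp.sum() else 0.0
    recall = matched / bg.sum() if bg.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: average of region similarity and boundary F-measure."""
    return 0.5 * (region_similarity(pred, gt) + boundary_f(pred, gt))


# Toy masks (assumed data): a 4x4 square, a shifted copy, and an empty mask.
gt = np.zeros((6, 6), dtype=bool)
gt[1:5, 1:5] = True
shifted = np.zeros_like(gt)
shifted[1:5, 2:6] = True
empty = np.zeros_like(gt)

print(j_and_f(gt, gt), j_and_f(shifted, gt), j_and_f(empty, gt))
```

A benchmark score is then obtained by averaging such per-frame values over the frames and objects of each video, and over the videos of the dataset.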

Reasoning Video Object Segmentation. On ReVOS, ReasonVOS, and GroundMoRe, VideoSEG-O3 achieves strong reasoning performance in both in-domain and zero-shot settings.

Model	ReVOS Referring J&F	ReVOS Reasoning J&F	ReVOS Overall J&F	ReasonVOS	GroundMoRe
VideoSEG-O3-2B	67.5	62.0	64.8	60.2	29.1
VideoSEG-O3-4B	70.3	65.1	67.7	62.9	31.9

🤖 Model Zoo

Model	Base MLLM	Mask Decoder	SFT	Cold-start	RL
VideoSEG-O3-2B	Qwen3-VL-2B-Instruct	SAM2-Hiera-Large	Coming soon	Coming soon	Checkpoint
VideoSEG-O3-4B	Qwen3-VL-4B-Instruct	SAM2-Hiera-Large	Coming soon	Coming soon	Checkpoint

📄 Citation

@inproceedings{dai2026videosego3,
  title     = {VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation},
  author    = {Dai, Ming and Yang, Sen and Duan, Boqiang and Tong, Boyuan and Zhuang, Jiedong and Yang, Wankou and Wang, Jingdong},
  booktitle = {ICML},
  year      = {2026}
}
