VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

June 5, 2026

Temporal-spatial reasoning · Multi-turn keyframe exploration · SEG-aware reinforcement learning

ICML 2026 · Paper · ModelScope · Data

Ming Dai¹, Sen Yang², Boqiang Duan², Boyuan Tong², Jiedong Zhuang³, Wankou Yang¹, Jingdong Wang²

¹Southeast University    ²Baidu Inc.    ³Zhejiang University

Figure: VideoSEG-O3 model pipeline

📖 Introduction

VideoSEG-O3 is a multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS). Instead of segmenting from a fixed set of sampled frames, VideoSEG-O3 actively explores temporal intervals and keyframes through a temporal-spatial chain-of-thought, which enables coarse-to-fine reasoning over object identity, motion, and linguistic references. The framework also introduces SEG-aware logit calibration, which connects token-level policy optimization to pixel-level mask quality, and a decoupled thinking trace that structures temporal, spatial, and language reasoning. The resulting pipeline combines supervised fine-tuning (SFT), a VTS-CoT cold start, and GRPO-based reinforcement learning for multi-turn video segmentation.
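The coarse-to-fine exploration described above can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the released implementation: in VideoSEG-O3 the policy model itself decides which interval to inspect next, whereas here a hypothetical `score_interval` callback stands in for that decision, and `explore_keyframe`, `num_splits`, and `max_turns` are names invented for the example.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open frame range [start, end)


def explore_keyframe(
    num_frames: int,
    score_interval: Callable[[int, int], float],  # hypothetical stand-in for the policy
    num_splits: int = 4,
    max_turns: int = 3,
) -> Tuple[int, List[Interval]]:
    """Narrow [0, num_frames) over several turns and pick a keyframe.

    Each turn splits the current interval into `num_splits` sub-intervals,
    scores every sub-interval, and keeps the highest-scoring one. The midpoint
    of the final interval is returned with the list of visited intervals.
    """
    start, end = 0, num_frames
    trace: List[Interval] = [(start, end)]
    for _ in range(max_turns):
        length = end - start
        if length <= num_splits:
            break  # interval is too short to split further
        step = length // num_splits
        candidates = [
            # the last sub-interval absorbs any remainder frames
            (start + i * step, end if i == num_splits - 1 else start + (i + 1) * step)
            for i in range(num_splits)
        ]
        start, end = max(candidates, key=lambda iv: score_interval(*iv))
        trace.append((start, end))
    return (start + end) // 2, trace


if __name__ == "__main__":
    # Toy scorer: the referred object is only visible around frame 70.
    keyframe, trace = explore_keyframe(128, lambda s, e: 1.0 if s <= 70 < e else 0.0)
    print(keyframe, trace)
```

With the toy scorer, the search narrows from the full 128-frame clip to the interval `[70, 72)` in three turns. A learned policy would replace the scorer and could also revisit or widen intervals, which this sketch does not model.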

✨ Highlights

  • Multi-turn temporal-spatial reasoning: VideoSEG-O3 actively explores temporal intervals and keyframes instead of relying on a fixed frame set.
  • Decoupled thinking trace: temporal localization, spatial grounding, and language reasoning are structured into an explicit multi-turn workflow.
  • SEG-aware reinforcement learning: token-level policy optimization is aligned with pixel-level mask quality through dense segmentation rewards.
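To make the link between mask quality and policy optimization concrete, here is a minimal sketch of how a segmentation reward could feed GRPO-style group-relative advantages. It assumes mask IoU as the reward and the population standard deviation for normalization; the actual reward design and the SEG-aware logit calibration used by VideoSEG-O3 are not reproduced here, and `mask_iou` and `group_relative_advantages` are illustrative names.

```python
from statistics import mean, pstdev
from typing import List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two binary masks of the same shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its rollout group.

    This is the group-relative advantage used in GRPO; the choice of
    population std and the value of `eps` are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gt = [[1, 1], [0, 0]]
    rollouts = [
        [[1, 1], [0, 0]],  # perfect mask
        [[1, 0], [0, 0]],  # half of the object
        [[0, 0], [1, 1]],  # wrong region
    ]
    rewards = [mask_iou(m, gt) for m in rollouts]
    print(rewards, group_relative_advantages(rewards))
```

Rollouts whose masks score above the group mean receive positive advantages and the rest receive negative ones, so the tokens that led to better masks are reinforced.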

📢 News

🏆 Main Results

Referring Video Object Segmentation (J&F). VideoSEG-O3 achieves strong performance across five RefVOS benchmarks.

| Model | MeViS | Ref-Youtube-VOS | Ref-DAVIS17 | Ref-SAV | Long-RVOS |
|---|---|---|---|---|---|
| VideoSEG-O3-2B | 55.6 | 70.5 | 80.0 | 62.9 | 54.8 |
| VideoSEG-O3-4B | 60.0 | 74.1 | 79.4 | 65.5 | 57.4 |
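For readers unfamiliar with the metric, J&F is the mean of region similarity J (mask IoU) and contour accuracy F (a boundary F-measure). The sketch below computes both for a single frame. It is a simplification, not the official benchmark code: boundary pixels are matched exactly, whereas standard evaluation toolkits allow a small distance tolerance, and benchmark scores are averaged over frames and objects. All function names are illustrative.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1


def region_j(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection-over-union of the two masks."""
    inter = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p and g)
    union = sum(1 for pr, gr in zip(pred, gt) for p, g in zip(pr, gr) if p or g)
    return inter / union if union else 1.0


def boundary(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    points = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    points.add((y, x))
                    break
    return points


def boundary_f(pred: Mask, gt: Mask) -> float:
    """Contour accuracy F with exact pixel matching (no distance tolerance)."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    if not bp or not bg:
        return 0.0
    matched = len(bp & bg)
    precision, recall = matched / len(bp), matched / len(bg)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred: Mask, gt: Mask) -> float:
    """Single-frame J&F: the mean of region J and boundary F."""
    return (region_j(pred, gt) + boundary_f(pred, gt)) / 2


if __name__ == "__main__":
    gt = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
    pred = [[0, 0, 0, 0], [0, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]]
    print(region_j(pred, gt), boundary_f(pred, gt), j_and_f(pred, gt))
```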

Reasoning Video Object Segmentation. On ReVOS, ReasonVOS, and GroundMoRe, VideoSEG-O3 shows strong reasoning performance in both in-domain and zero-shot settings.

| Model | ReVOS Referring J&F | ReVOS Reasoning J&F | ReVOS Overall J&F | ReasonVOS | GroundMoRe |
|---|---|---|---|---|---|
| VideoSEG-O3-2B | 67.5 | 62.0 | 64.8 | 60.2 | 29.1 |
| VideoSEG-O3-4B | 70.3 | 65.1 | 67.7 | 62.9 | 31.9 |

🤖 Model Zoo

| Model | Base MLLM | Mask Decoder | SFT | Cold-start | RL |
|---|---|---|---|---|---|
| VideoSEG-O3-2B | Qwen3-VL-2B-Instruct | SAM2-Hiera-Large | Coming soon | Coming soon | Checkpoint |
| VideoSEG-O3-4B | Qwen3-VL-4B-Instruct | SAM2-Hiera-Large | Coming soon | Coming soon | Checkpoint |

🛠️ Installation

Please follow Setup to prepare the environment and pretrained models.

📚 Data Preparation

Please follow Data Preparation to download the released VTS-CoT/RL annotations and organize the required original datasets under data/.

🚀 Training

VideoSEG-O3 is trained in three stages: SFT, CoT cold-start, and RL. Please see Training for the available configs and launch scripts.

📊 Evaluation

Please see Evaluation for benchmark evaluation commands and dataset-specific metric scripts.

🙌 Acknowledgements

This project is based on Sa2VA. We also thank Open-R1 and TRL for their open-source reinforcement learning training frameworks.

📮 Contact

If you have any questions, please feel free to open an issue or contact us at mingdai@seu.edu.cn. If this project is helpful to your research, we would appreciate a 🌟.

📄 Citation

If you find this project useful, please consider citing:

@inproceedings{dai2026videosego3,
  title     = {VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation},
  author    = {Dai, Ming and Yang, Sen and Duan, Boqiang and Tong, Boyuan and Zhuang, Jiedong and Yang, Wankou and Wang, Jingdong},
  booktitle = {ICML},
  year      = {2026}
}