VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation
June 5, 2026
Temporal-spatial reasoning · Multi-turn keyframe exploration · SEG-aware reinforcement learning
Ming Dai¹, Sen Yang², Boqiang Duan², Boyuan Tong², Jiedong Zhuang³, Wankou Yang¹, Jingdong Wang²
¹Southeast University ²Baidu Inc. ³Zhejiang University
📖 Introduction
VideoSEG-O3 is a multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS). Instead of segmenting from a fixed set of sampled frames, VideoSEG-O3 actively explores temporal intervals and keyframes through a temporal-spatial chain-of-thought, enabling coarse-to-fine reasoning over object identity, motion, and linguistic references. The framework further introduces SEG-aware logit calibration to connect token-level policy optimization with pixel-level mask quality, and uses a decoupled thinking trace to structure temporal, spatial, and language reasoning. The resulting pipeline combines SFT, VTS-CoT cold start, and GRPO-based reinforcement learning for multi-turn video segmentation.
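The coarse-to-fine exploration described above can be illustrated with a toy loop. This is only a sketch of the general idea, not the VideoSEG-O3 implementation: in the real framework the MLLM policy decides which interval to inspect, whereas here a hypothetical `score_fn` stands in for that decision, and the names `uniform_frames`, `explore_keyframe`, `frames_per_turn`, and `max_turns` are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def uniform_frames(start: int, end: int, k: int) -> List[int]:
    """Return up to k roughly evenly spaced frame indices in [start, end]."""
    if k <= 1 or end <= start:
        return [start]
    step = (end - start) / (k - 1)
    return sorted({round(start + i * step) for i in range(k)})


def explore_keyframe(
    num_frames: int,
    score_fn: Callable[[int], float],  # hypothetical stand-in for the policy
    frames_per_turn: int = 4,
    max_turns: int = 3,
) -> Tuple[int, Tuple[int, int]]:
    """Narrow a temporal interval turn by turn and return (keyframe, interval)."""
    start, end = 0, num_frames - 1
    keyframe = start
    for _ in range(max_turns):
        # Coarse view of the current interval.
        candidates = uniform_frames(start, end, frames_per_turn)
        # Keep the candidate frame that the stand-in scorer prefers.
        keyframe = max(candidates, key=score_fn)
        # Shrink the interval to the neighbours of that frame.
        i = candidates.index(keyframe)
        new_start = candidates[i - 1] if i > 0 else start
        new_end = candidates[i + 1] if i + 1 < len(candidates) else end
        if (new_start, new_end) == (start, end):
            break  # nothing left to refine
        start, end = new_start, new_end
    return keyframe, (start, end)


if __name__ == "__main__":
    # Toy scorer: pretend the target object is most visible around frame 37.
    print(explore_keyframe(100, lambda t: -abs(t - 37)))
```

Each turn looks at only a few frames, so the number of frames inspected grows with the number of turns rather than with the video length, which is the appeal of exploring instead of sampling a fixed frame set.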
✨ Highlights
- Multi-turn temporal-spatial reasoning: VideoSEG-O3 actively explores temporal intervals and keyframes instead of relying on a fixed frame set.
- Decoupled thinking trace: temporal localization, spatial grounding, and language reasoning are structured into an explicit multi-turn workflow.
- SEG-aware reinforcement learning: token-level policy optimization is aligned with pixel-level mask quality through dense segmentation rewards.
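To make the link between mask quality and policy optimization concrete, here is a minimal sketch of a GRPO-style group-relative advantage computed from mask IoU rewards. It assumes a plain IoU reward and a population standard deviation for normalization; it does not reproduce the SEG-aware logit calibration or the actual reward design of VideoSEG-O3. The helper names `mask_iou` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Set, Tuple

Pixels = Set[Tuple[int, int]]  # a mask as a set of (row, col) foreground pixels


def mask_iou(pred: Pixels, gt: Pixels) -> float:
    """Intersection over union of two masks; two empty masks count as a match."""
    union = pred | gt
    if not union:
        return 1.0
    return len(pred & gt) / len(union)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gt = {(0, 0), (0, 1)}
    # Three rollouts sampled for the same prompt, each producing a mask.
    rollouts = [{(0, 0), (0, 1)}, {(0, 0)}, {(1, 0), (1, 1)}]
    rewards = [mask_iou(r, gt) for r in rollouts]
    print(rewards, group_relative_advantages(rewards))
```

Rollouts whose masks beat the group average receive a positive advantage and the rest a negative one, so every token of a rollout is pushed up or down according to how well its final mask matched the ground truth.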
📢 News
- 2026.06.05 🔥 The training code, evaluation code, and paper are now available. We have also released the RL checkpoints for VideoSEG-O3-2B and VideoSEG-O3-4B, together with the VTS-CoT and RL training data.
- 2026.04.30 🎉 VideoSEG-O3 has been accepted by ICML 2026.
🏆 Main Results
Referring Video Object Segmentation (J&F). VideoSEG-O3 achieves strong performance across five RefVOS benchmarks.
| Model | MeViS | Ref-Youtube-VOS | Ref-DAVIS17 | Ref-SAV | Long-RVOS |
|---|---|---|---|---|---|
| VideoSEG-O3-2B | 55.6 | 70.5 | 80.0 | 62.9 | 54.8 |
| VideoSEG-O3-4B | 60.0 | 74.1 | 79.4 | 65.5 | 57.4 |
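For readers new to the metric, J&F is the average of region similarity J (mask IoU) and contour accuracy F (a boundary F-measure), usually reported as a percentage. The sketch below is a simplified pure-Python approximation for intuition only. It assumes a 4-neighbour boundary definition and a fixed pixel tolerance `tol`, both of which differ from the official benchmark toolkits, so it will not reproduce the numbers in the table. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # binary mask as rows of 0/1


def region_j(pred: Mask, gt: Mask) -> float:
    """Region similarity J: intersection over union of foreground pixels."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return 1.0 if union == 0 else inter / union


def boundary(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a background or out-of-image 4-neighbour."""
    h, w = len(mask), len(mask[0])
    points = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    points.add((y, x))
                    break
    return points


def _matched(src: Set[Tuple[int, int]], dst: Set[Tuple[int, int]], tol: int) -> int:
    """Count points in src that lie within tol pixels of some point in dst."""
    return sum(
        1 for (y, x) in src
        if any(abs(y - v) <= tol and abs(x - u) <= tol for (v, u) in dst)
    )


def contour_f(pred: Mask, gt: Mask, tol: int = 1) -> float:
    """Contour accuracy F: harmonic mean of boundary precision and recall."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    if not bp or not bg:
        return 0.0
    precision = _matched(bp, bg, tol) / len(bp)
    recall = _matched(bg, bp, tol) / len(bg)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred_frames: List[Mask], gt_frames: List[Mask]) -> float:
    """Average J and F over frames, then average the two scores."""
    js = [region_j(p, g) for p, g in zip(pred_frames, gt_frames)]
    fs = [contour_f(p, g) for p, g in zip(pred_frames, gt_frames)]
    return (sum(js) / len(js) + sum(fs) / len(fs)) / 2
```

Calling `j_and_f` on per-frame predicted and ground-truth masks returns a value in [0, 1]; multiplying by 100 gives the scale used in the tables.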
Reasoning Video Object Segmentation. Results on ReVOS, ReasonVOS, and GroundMoRe cover both in-domain and zero-shot reasoning settings.
| Model | ReVOS Referring J&F | ReVOS Reasoning J&F | ReVOS Overall J&F | ReasonVOS | GroundMoRe |
|---|---|---|---|---|---|
| VideoSEG-O3-2B | 67.5 | 62.0 | 64.8 | 60.2 | 29.1 |
| VideoSEG-O3-4B | 70.3 | 65.1 | 67.7 | 62.9 | 31.9 |
🤖 Model Zoo
| Model | Base MLLM | Mask Decoder | SFT | Cold-start | RL |
|---|---|---|---|---|---|
| VideoSEG-O3-2B | Qwen3-VL-2B-Instruct | SAM2-Hiera-Large | Coming soon | Coming soon | Checkpoint |
| VideoSEG-O3-4B | Qwen3-VL-4B-Instruct | SAM2-Hiera-Large | Coming soon | Coming soon | Checkpoint |
🛠️ Installation
Please follow Setup to prepare the environment and pretrained models.
📚 Data Preparation
Please follow Data Preparation to download the released VTS-CoT/RL annotations and organize the required original datasets under data/.
🚀 Training
VideoSEG-O3 is trained in three stages: SFT, VTS-CoT cold start, and RL. Please see Training for the available configs and launch scripts.
📊 Evaluation
Please see Evaluation for benchmark evaluation commands and dataset-specific metric scripts.
🙌 Acknowledgements
This project is based on Sa2VA. We also thank Open-R1 and TRL for their open-source reinforcement learning training frameworks.
📮 Contact
If you have any questions, please feel free to open an issue or contact us at mingdai@seu.edu.cn. If this project is helpful to your research, we would appreciate a 🌟.
📄 Citation
If you find this project useful, please consider citing:
@inproceedings{dai2026videosego3,
title = {VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation},
author = {Dai, Ming and Yang, Sen and Duan, Boqiang and Tong, Boyuan and Zhuang, Jiedong and Yang, Wankou and Wang, Jingdong},
booktitle = {ICML},
year = {2026}
}