CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
April 17, 2026
IEEE International Conference on Image Processing (ICIP) 2025
Oral Paper ⭐
InSeok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee
Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee
Yonsei University
Teaser
CMTM introduces cross-modal token modulation to effectively model interactions between appearance and motion cues for unsupervised video object segmentation.
Overview
Recent advances in unsupervised video object segmentation have highlighted the effectiveness of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interactions.
In this paper, we propose Cross-Modal Token Modulation (CMTM), a novel framework designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To further improve learning efficiency, we introduce a token masking strategy that encourages more effective representation learning beyond simply increasing model complexity.
CMTM achieves state-of-the-art performance on all public benchmarks.
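To make the core idea concrete, here is a minimal sketch of cross-modal token modulation as scaled dot-product cross-attention between appearance and motion tokens, with an optional token-masking step. All function names, shapes, and the masking scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_modulation(app_tokens, mot_tokens, mask_ratio=0.0, rng=None):
    """Illustrative sketch: modulate appearance tokens with motion tokens.

    app_tokens: (N, D) appearance tokens; mot_tokens: (M, D) motion tokens.
    mask_ratio > 0 randomly drops a fraction of motion tokens, mimicking a
    token masking strategy (hypothetical, not the paper's exact scheme).
    """
    if mask_ratio > 0:
        rng = rng or np.random.default_rng(0)
        keep = rng.random(mot_tokens.shape[0]) >= mask_ratio
        mot_tokens = mot_tokens[keep]
    d = app_tokens.shape[-1]
    # Dense appearance-to-motion attention, then residual modulation.
    attn = softmax(app_tokens @ mot_tokens.T / np.sqrt(d), axis=-1)
    return app_tokens + attn @ mot_tokens

app = np.random.default_rng(1).standard_normal((16, 64))
mot = np.random.default_rng(2).standard_normal((16, 64))
out = cross_modal_modulation(app, mot, mask_ratio=0.25)
print(out.shape)  # (16, 64)
```

The residual form keeps the appearance stream intact while injecting motion information; the symmetric motion-from-appearance direction would be the same computation with the roles swapped.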
Method
Our framework consists of:
- Appearance encoder
- Motion encoder
- Cross-modal token modulation module
- Transformer-based relation modeling
- Decoder for segmentation prediction
The proposed token modulation mechanism enables effective interaction between appearance and motion representations.
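The components above can be sketched end to end as a toy two-stream pipeline. Everything here (encoders as single projections, one self-attention block standing in for the relation transformer, a thresholded decoder) is a deliberately simplified assumption for illustration, not the repository's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Toy 'encoder': project per-patch features to D-dim tokens."""
    return np.tanh(x @ w)

def relation_block(tokens):
    """Stand-in for a relation transformer block: self-attention over
    the concatenated appearance + motion tokens, so information
    propagates both within and across modalities."""
    d = tokens.shape[-1]
    attn = np.exp(tokens @ tokens.T / np.sqrt(d))
    attn /= attn.sum(axis=-1, keepdims=True)
    return tokens + attn @ tokens

def decode(tokens, h, w):
    """Toy decoder: one logit per token, reshaped to a coarse mask grid."""
    logits = tokens.mean(axis=-1)
    return (logits.reshape(h, w) > 0).astype(np.uint8)

H, W, P, D = 4, 4, 32, 64                  # 4x4 token grid, 32-dim patches
frame = rng.standard_normal((H * W, P))    # appearance (RGB) patch features
flow = rng.standard_normal((H * W, P))     # motion (optical flow) patch features

app = encode(frame, rng.standard_normal((P, D)))
mot = encode(flow, rng.standard_normal((P, D)))
tokens = relation_block(np.concatenate([app, mot]))  # joint propagation
mask = decode(tokens[:H * W], H, W)        # predict from the appearance half
print(mask.shape)  # (4, 4)
```

The key design point mirrored here is that appearance and motion tokens attend to each other in one joint token set rather than being fused only at the feature-map level.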
Qualitative Results
Video Results
Qualitative segmentation results of CMTM across challenging scenarios including fast motion, occlusion, and complex scene dynamics.
Installation
Environment
- Python 3.8+
- PyTorch
- torchvision
- numpy
- opencv-python
Install dependencies:
pip install -r requirements.txt
Dataset Preparation
Download the following datasets from the official websites.
For convenience, you may download the pre-processed data from the following repository: TransFlow
Organize the datasets as follows:
dataset/
    DUTS/
    DAVIS/
    FBMS/
    YTOBJ/
    LVID/
Please modify dataset paths in the configuration file if necessary.
Training
To train the model:
python run.py --train
Evaluation
To evaluate the model:
python run.py --test
Citation
If you find this work useful for your research, please consider citing our paper.
@inproceedings{jeon2025cmtm,
  title={CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation},
  author={Jeon, Inseok and Cho, Suhwan and Lee, Minhyeok and Lee, Seunghoon and Kang, Minseok and Lee, Jungho and Park, Chaewon and Kim, Donghyeong and Lee, Sangyoun},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  pages={1390--1395},
  year={2025},
  organization={IEEE}
}
License
This project is released under the MIT License.
Acknowledgements
This repository builds upon prior research in unsupervised video object segmentation. We thank the research community for their valuable open-source contributions. We also thank the authors of TransFlow for providing their publicly available implementation, which served as a strong baseline for this work.
Contact
If you have any questions about the code or the paper, please feel free to contact:
InSeok Jeon
Email: sunlight3919@gmail.com