CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

April 17, 2026

IEEE International Conference on Image Processing (ICIP) 2025
Oral Paper ⭐

InSeok Jeon, Suhwan Cho, Minhyeok Lee, Seunghoon Lee
Minseok Kang, Jungho Lee, Chaewon Park, Donghyeong Kim, Sangyoun Lee

Yonsei University



Teaser

CMTM introduces cross-modal token modulation to effectively model interactions between appearance and motion cues for unsupervised video object segmentation.


Overview

Recent advances in unsupervised video object segmentation have highlighted the effectiveness of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interactions.

In this paper, we propose Cross-Modal Token Modulation (CMTM), a novel framework designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra-modal and inter-modal information propagation through relation transformer blocks. To further improve learning efficiency, we introduce a token masking strategy that encourages stronger representation learning without simply increasing model complexity.
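As a rough illustration of the token masking idea, here is a minimal sketch (ours, not the authors' implementation; the mask ratio and the zero-out masking scheme are assumptions):

import torch

def mask_tokens(tokens, mask_ratio=0.3):
    # tokens: (B, N, C). Randomly zero out roughly mask_ratio of the N tokens
    # per sample during training, so the surviving tokens must carry the
    # cross-modal information on their own.
    B, N, _ = tokens.shape
    keep = (torch.rand(B, N, device=tokens.device) >= mask_ratio).float()
    return tokens * keep.unsqueeze(-1)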

CMTM achieves state-of-the-art performance on all public benchmarks for unsupervised video object segmentation.


Method

Our framework consists of:

  • Appearance encoder
  • Motion encoder
  • Cross-modal token modulation module
  • Transformer-based relation modeling
  • Decoder for segmentation prediction

The proposed token modulation mechanism enables effective interaction between appearance and motion representations.
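For orientation, the sketch below shows how these components could fit together. It is a minimal structural sketch, not the paper's exact design: the patch-embedding encoders, the 3-channel flow input, a single joint transformer standing in for the relation transformer blocks, and the 1x1-conv decoder are all our assumptions.

import torch
import torch.nn as nn

class CMTMSketch(nn.Module):
    # Illustrative two-stream skeleton following the component list above
    # (assumed shapes and module choices, not the authors' code).
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        self.app_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # appearance tokens
        self.mot_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # motion (flow) tokens
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.relation = nn.TransformerEncoder(block, num_layers=layers)
        self.decoder = nn.Conv2d(dim, 1, kernel_size=1)              # per-token mask logits

    def forward(self, frame, flow):
        a = self.app_enc(frame)                       # (B, C, H, W)
        m = self.mot_enc(flow)
        B, C, H, W = a.shape
        # Concatenate both modalities into one token sequence: (B, 2*H*W, C)
        tokens = torch.cat([a.flatten(2), m.flatten(2)], dim=2).transpose(1, 2)
        tokens = self.relation(tokens)                # joint intra-/inter-modal attention
        a_out = tokens[:, : H * W].transpose(1, 2).reshape(B, C, H, W)
        return self.decoder(a_out)                    # (B, 1, H, W) segmentation logits

Concatenating appearance and motion tokens into a single sequence lets self-attention perform intra-modal and inter-modal propagation in one operation, mirroring the dense token connections described above.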


Qualitative Results


Video Results

Qualitative segmentation results of CMTM across challenging scenarios including fast motion, occlusion, and complex scene dynamics.


Installation

Environment

  • Python 3.8+
  • PyTorch
  • torchvision
  • numpy
  • opencv-python

Install dependencies:

pip install -r requirements.txt

Dataset Preparation

Download the following datasets from their official websites.

For convenience, you may download the pre-processed data from the TransFlow repository.

Organize the datasets as follows:

dataset/
    DUTS/
    DAVIS/
    FBMS/
    YTOBJ/
    LVID/

Please modify dataset paths in the configuration file if necessary.
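For example, a Python-style configuration might expose the paths as below; the file name config.py and all variable names are hypothetical and should be matched to the actual configuration file in this repository:

# config.py (hypothetical names; adapt to the repository's actual config file)
DATASET_ROOT = 'dataset'

DUTS_PATH  = f'{DATASET_ROOT}/DUTS'
DAVIS_PATH = f'{DATASET_ROOT}/DAVIS'
FBMS_PATH  = f'{DATASET_ROOT}/FBMS'
YTOBJ_PATH = f'{DATASET_ROOT}/YTOBJ'
LVID_PATH  = f'{DATASET_ROOT}/LVID'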


Training

To train the model:

python run.py --train

Evaluation

To evaluate the model:

python run.py --test

Citation

If you find this work useful for your research, please consider citing our paper.

@inproceedings{jeon2025cmtm,
  title={CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation},
  author={Jeon, Inseok and Cho, Suhwan and Lee, Minhyeok and Lee, Seunghoon and Kang, Minseok and Lee, Jungho and Park, Chaewon and Kim, Donghyeong and Lee, Sangyoun},
  booktitle={2025 IEEE International Conference on Image Processing (ICIP)},
  pages={1390--1395},
  year={2025},
  organization={IEEE}
}

License

This project is released under the MIT License.


Acknowledgements

This repository builds upon prior research in unsupervised video object segmentation. We thank the research community for their valuable open-source contributions. We also thank the authors of TransFlow for providing their publicly available implementation, which served as a strong baseline for this work.


Contact

If you have any questions about the code or the paper, please feel free to contact:

InSeok Jeon
Email: sunlight3919@gmail.com