June 23, 2026
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
CVPR 2026 · 📄 Paper
Narges Norouzi¹, Idil Esen Zulfikar²,*, Niccolò Cavagnero¹,*, Tommie Kerssies¹, Bastian Leibe², Gijs Dubbelman¹, Daan de Geus¹
¹ Eindhoven University of Technology, ² RWTH Aachen University, * Equal contribution
🚀 NEW: PMT
We released PMT (Plain Mask Transformer), the next generation of VidEoMT — a segmentation model that works on top of frozen Vision Foundation Model features, requiring no encoder finetuning. The encoder stays fully frozen and shareable across tasks, while matching the accuracy and speed of finetuned alternatives.
Both the research paper and the full source code are publicly available: 📄 Paper · 💻 Code
Overview

We introduce Video Encoder-only Mask Transformer (VidEoMT), a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It performs both spatial and temporal reasoning within the ViT encoder, without relying on dedicated tracking modules or heavy task-specific heads.
VidEoMT propagates information over time by reusing queries from the previous frame and fusing them with a compact set of learned, frame-agnostic queries. This design achieves competitive accuracy while being 5×–10× faster than existing approaches, reaching up to 160 FPS with a ViT-L backbone.
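The query propagation described above can be sketched as a simple recurrence over frames. This is a minimal illustration in plain Python, not the VidEoMT implementation: the element-wise addition used for fusion, the `encode_frame` stand-in for the ViT encoder, and all function names are assumptions made for illustration.

```python
from typing import Callable, List, Optional

Query = List[float]  # one query embedding


def fuse_queries(learned: List[Query], previous: Optional[List[Query]]) -> List[Query]:
    """Combine frame-agnostic learned queries with queries from the previous frame.

    Assumption: fusion is shown as element-wise addition; the actual fusion
    operator used by VidEoMT may differ.
    """
    if previous is None:  # first frame: nothing to propagate yet
        return [list(q) for q in learned]
    return [[a + b for a, b in zip(lq, pq)] for lq, pq in zip(learned, previous)]


def run_online(
    frames: List[float],
    learned: List[Query],
    encode_frame: Callable[[float, List[Query]], List[Query]],
) -> List[List[Query]]:
    """Process frames one at a time, carrying the output queries forward."""
    previous = None
    outputs = []
    for frame in frames:
        queries_in = fuse_queries(learned, previous)
        queries_out = encode_frame(frame, queries_in)  # hypothetical encoder call
        outputs.append(queries_out)
        previous = queries_out  # reused when processing the next frame
    return outputs


def toy_encoder(frame: float, queries: List[Query]) -> List[Query]:
    """Toy stand-in for the encoder: shifts every query value by the frame value."""
    return [[v + frame for v in q] for q in queries]


learned_queries = [[1.0, 0.0], [0.0, 1.0]]
results = run_online([0.5, 1.0], learned_queries, toy_encoder)
```

Because each frame only needs the queries produced for the frame before it, the loop runs online and needs no separate tracking step in this sketch.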
Installation
If you don't have Conda installed, install Miniconda and restart your shell:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Then create the environment, activate it, and install the dependencies:
conda create -n videomt python==3.12.3
conda activate videomt
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
python -m pip install --no-build-isolation 'git+https://github.com/facebookresearch/detectron2.git'
pip install git+https://github.com/cocodataset/panopticapi.git
python3 -m pip install -r requirements.txt
Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:
wandb login
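After installation, a quick check that the main packages are visible to the active environment can save a failed run later. The sketch below is a hypothetical helper, not part of the repository; it assumes the standard import names of the packages installed above and that `wandb` is provided by `requirements.txt`.

```python
import importlib.util

# Import names of the packages installed above (wandb is assumed to come
# from requirements.txt).
REQUIRED = ["torch", "torchvision", "detectron2", "panopticapi", "wandb"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```

Run it inside the activated `videomt` environment; anything listed as missing points to an installation step that did not complete.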
Data preparation
Download and prepare the datasets.
Usage
Evaluation
To evaluate a pre-trained VidEoMT model, first prepare the datasets (see Data preparation) and download the trained weights (see Model Zoo). Once these are set up, run:
python train_net_video.py \
--num-gpus 1 \
--config-file /path/to/config.yaml \
--eval-only MODEL.WEIGHTS /path/to/weight.pth \
MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
OUTPUT_DIR /path/to/output
🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
🔧 Replace /path/to/output with the path to the output folder.
🔧 Change the value of --num-gpus to the number of GPUs available to you.
For detailed instructions on running evaluation on different datasets, see Evaluation.
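When evaluating several checkpoints, it can help to assemble the command above programmatically. The following is a minimal sketch; `build_eval_command` is a hypothetical helper, and it uses only the flags and config keys shown in the command above.

```python
import shlex


def build_eval_command(config_file, weights, output_dir, num_gpus=1):
    """Assemble the evaluation command shown above as an argument list."""
    return [
        "python", "train_net_video.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config_file,
        "--eval-only",
        "MODEL.WEIGHTS", weights,
        "MODEL.BACKBONE.TEST.WINDOW_SIZE", "1",
        "OUTPUT_DIR", output_dir,
    ]


cmd = build_eval_command("/path/to/config.yaml", "/path/to/weight.pth", "/path/to/output")
print(shlex.join(cmd))  # to launch it, pass `cmd` to subprocess.run(cmd, check=True)
```

Keeping the arguments in a list avoids shell quoting issues when paths contain spaces.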
Training
To train an online VidEoMT model, run:
python3 train_net_video.py \
--num-gpus 4 \
--num-machines 2 \
--config-file /path/to/config.yaml \
MODEL.WEIGHTS /path/to/segmenter_weight.pth \
MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
OUTPUT_DIR /path/to/output
🔧 Replace /path/to/segmenter_weight.pth with the segmenter checkpoint used to initialize training. For DINOv2 models, choose this checkpoint from the Init Weights column in DINOv2 Models.
🔧 Replace /path/to/output with the directory where training logs and checkpoints should be written.
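If `train_net_video.py` uses Detectron2's default argument parser (an assumption; check the script's `--help`), `--num-gpus` counts GPUs per machine, so the command above would use 4 × 2 = 8 GPUs in total, and each machine would be started with its own `--machine-rank` and a shared `--dist-url`. The sketch below only builds the launcher flags per machine; the helper names and the `tcp://HOST:PORT` placeholder are hypothetical.

```python
def total_gpus(num_gpus_per_machine, num_machines):
    """Total number of GPUs when --num-gpus is interpreted per machine."""
    return num_gpus_per_machine * num_machines


def per_machine_flags(num_gpus, num_machines, dist_url):
    """Return one list of launcher flags for each machine rank.

    Assumes Detectron2's --machine-rank and --dist-url flags are accepted
    by the training script.
    """
    return [
        [
            "--num-gpus", str(num_gpus),
            "--num-machines", str(num_machines),
            "--machine-rank", str(rank),
            "--dist-url", dist_url,
        ]
        for rank in range(num_machines)
    ]


# Placeholder URL: replace HOST and PORT with the address of the rank-0 machine.
flags = per_machine_flags(4, 2, "tcp://HOST:PORT")
for rank, f in enumerate(flags):
    print(f"machine {rank}: {' '.join(f)}")
```

The remaining arguments (config file, weights, window size, and output directory) would stay the same on every machine.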
Benchmark
To calculate the FPS and GFLOPs, run:
# DINOv2 FPS
python benchmark.py \
--task fps \
--config-file /path/to/config.yaml \
--model-weights /path/to/weight.pth \
--warmup-iters 100 \
--model-type dinov2
# DINOv3 FPS
python benchmark.py \
--task fps \
--config-file /path/to/config.yaml \
--model-weights /path/to/weight.pth \
--warmup-iters 100 \
--model-type dinov3 \
--fused-qkv
# DINOv2 GFLOPs (timm fused attention disabled via TIMM_FUSED_ATTN=0)
export TIMM_FUSED_ATTN=0
python benchmark.py \
--task flops \
--config-file /path/to/config.yaml \
--model-weights /path/to/weight.pth \
--model-type dinov2
For DINOv3 FPS benchmarking, pass --fused-qkv, which is recommended to bring the measured FPS closer to the DINOv2 setup.
🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
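To make the role of `--warmup-iters` concrete, the sketch below shows the usual pattern for throughput measurement: run a number of untimed warm-up iterations, then time a fixed number of iterations and divide. It is a generic illustration, not the code in `benchmark.py`; `measure_fps` and the toy workload are hypothetical.

```python
import time


def measure_fps(step, warmup_iters=100, timed_iters=200):
    """Call `step()` repeatedly and return iterations per second, excluding warm-up."""
    for _ in range(warmup_iters):  # warm-up: caches, lazy initialization, autotuning
        step()
    # On a GPU, synchronize the device here and again before reading the end
    # time (e.g. torch.cuda.synchronize()), otherwise only kernel launches are timed.
    start = time.perf_counter()
    for _ in range(timed_iters):
        step()
    elapsed = time.perf_counter() - start
    return timed_iters / elapsed


# Toy workload standing in for one forward pass on one frame.
calls = []
fps = measure_fps(lambda: calls.append(sum(range(1000))), warmup_iters=10, timed_iters=50)
print(f"{fps:.1f} iterations/s")
```

Excluding the warm-up iterations keeps one-time setup costs out of the reported number.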
Demo
We provide example visualizations below.

To generate additional visualization samples, please use the code in Visualization.
Upcoming Features
- [x] Inference code
- [x] FLOPs and FPS code
- [x] Visualization code
- [x] Training code
- [x] DINOv3 model zoo and code
Model Zoo
We provide pre-trained weights for both DINOv2- and DINOv3-based VidEoMT models.
- DINOv2 Models - Original published results and pre-trained weights.
- DINOv3 Models - DINOv3-based models and pre-trained weights.
Citation
If you find this work useful in your research, please cite it using the BibTeX entry below:
@inproceedings{norouzi2026videomt,
author = {Norouzi, Narges and Zulfikar, Idil Esen and Cavagnero, Niccol\`{o} and Kerssies, Tommie and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},
title = {{VidEoMT: Your ViT is Secretly Also a Video Segmentation Model}},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026},
}
Acknowledgements
This project builds upon code from the following libraries and repositories:
- EoMT (MIT License)
- Hugging Face Transformers (Apache-2.0 License)
- PyTorch Image Models (timm) (Apache-2.0 License)
- CAVIS (MIT License)
- Mask2Former (Apache-2.0 License)
- Detectron2 (Apache-2.0 License)