June 23, 2026

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

CVPR 2026 · 📄 Paper

Narges Norouzi¹, Idil Esen Zulfikar²,*, Niccolò Cavagnero¹,*, Tommie Kerssies¹, Bastian Leibe², Gijs Dubbelman¹, Daan de Geus¹

¹ Eindhoven University of Technology, ² RWTH Aachen University, * Equal contribution

🚀 NEW: PMT

We released PMT (Plain Mask Transformer), the next generation of VidEoMT — a segmentation model that works on top of frozen Vision Foundation Model features, requiring no encoder finetuning. The encoder stays fully frozen and shareable across tasks, while matching the accuracy and speed of finetuned alternatives.

Both the research paper and the full source code are publicly available: 📄 Paper · 💻 Code

Overview

VidEoMT Overview

We introduce Video Encoder-only Mask Transformer (VidEoMT), a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It performs both spatial and temporal reasoning within the ViT encoder, without relying on dedicated tracking modules or heavy task-specific heads.

VidEoMT propagates information over time by reusing queries from the previous frame and fusing them with a compact set of learned, frame-agnostic queries. This design achieves competitive accuracy while being 5×–10× faster than existing approaches, reaching up to 160 FPS with a ViT-L backbone.
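The query propagation described above can be illustrated with a toy sketch. This is not the VidEoMT implementation: the fusion rule (element-wise addition), the `toy_encoder` stand-in for the ViT, and all function names are assumptions chosen only to show how the queries produced for one frame feed into the next.

```python
from typing import Callable, List, Optional

Vector = List[float]
Queries = List[Vector]


def fuse_queries(propagated: Queries, learned: Queries) -> Queries:
    """Fuse previous-frame queries with learned queries.

    Assumption: fusion is an element-wise sum; the real model may differ.
    """
    if len(propagated) != len(learned):
        raise ValueError("query sets must have the same length")
    return [[p + l for p, l in zip(pq, lq)] for pq, lq in zip(propagated, learned)]


def run_online(
    frames: List[float],
    learned: Queries,
    encoder: Callable[[float, Queries], Queries],
) -> List[Queries]:
    """Process frames one by one, carrying queries from frame to frame."""
    previous: Optional[Queries] = None
    outputs: List[Queries] = []
    for frame in frames:
        # The first frame has no history, so it starts from the learned queries.
        queries_in = learned if previous is None else fuse_queries(previous, learned)
        previous = encoder(frame, queries_in)
        outputs.append(previous)
    return outputs


def toy_encoder(frame: float, queries: Queries) -> Queries:
    """Hypothetical stand-in for the ViT encoder: shifts each query by the frame value."""
    return [[q + frame for q in query] for query in queries]
```

In this sketch the only state carried between frames is the query set, which mirrors the idea that no separate tracking module is needed.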

Installation

If you don't have Conda installed, install Miniconda and restart your shell:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Then create the environment, activate it, and install the dependencies:

conda create -n videomt python==3.12.3
conda activate videomt
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
python -m pip install --no-build-isolation 'git+https://github.com/facebookresearch/detectron2.git'  
pip install git+https://github.com/cocodataset/panopticapi.git
python3 -m pip install -r requirements.txt
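After installation, a quick check that the main packages are importable can save a failed run later. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list covers only the packages installed explicitly above, not everything in requirements.txt.

```python
import importlib.util
from typing import Iterable, List

# Import names of the packages installed explicitly in the steps above.
REQUIRED = ("torch", "torchvision", "detectron2", "panopticapi")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it inside the activated videomt environment so that it inspects the right interpreter.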

Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:

wandb login

Data preparation

Download and prepare the datasets.

Usage

Evaluation

To evaluate a pre-trained VidEoMT model, first prepare the datasets as described in Data preparation and download the trained weights from the Model Zoo. Then run:

python train_net_video.py \
  --num-gpus 1 \
  --config-file /path/to/config.yaml \
  --eval-only MODEL.WEIGHTS /path/to/weight.pth \
  MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
  OUTPUT_DIR /path/to/output

🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
🔧 Replace /path/to/output with the path to the output folder.
🔧 Change the value of --num-gpus to the number of GPUs available to you.

For detailed instructions on running evaluation on different datasets, see Evaluation.
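When evaluation is launched from a script, building the command as an argument list avoids shell line-continuation pitfalls. The sketch below assembles the same arguments as the command above; the helper name `build_eval_command` and the placeholder paths are illustrative, not part of the repository.

```python
import shlex
from typing import List


def build_eval_command(
    config: str, weights: str, output_dir: str, num_gpus: int = 1
) -> List[str]:
    """Assemble the evaluation command shown above as an argument list."""
    return [
        "python", "train_net_video.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config,
        "--eval-only",
        # Config overrides follow as KEY VALUE pairs.
        "MODEL.WEIGHTS", weights,
        "MODEL.BACKBONE.TEST.WINDOW_SIZE", "1",
        "OUTPUT_DIR", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_eval_command(
        "/path/to/config.yaml", "/path/to/weight.pth", "/path/to/output"
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```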

Training

To train an online VidEoMT model, run:

python3 train_net_video.py \
  --num-gpus 4 \
  --num-machines 2 \
  --config-file /path/to/config.yaml \
  MODEL.WEIGHTS /path/to/segmenter_weight.pth \
  MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
  OUTPUT_DIR /path/to/output

Replace /path/to/segmenter_weight.pth with the segmenter checkpoint used to initialize training. For DINOv2 models, choose this weight from the Init Weights column in DINOv2 Models.

Replace /path/to/output with the directory where training logs and checkpoints should be written.
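In Detectron2's default argument parser, --num-gpus counts GPUs per machine, so the command above corresponds to 4 × 2 = 8 processes in total. Multi-machine launches with that parser also take --machine-rank and --dist-url on each machine. The sketch below generates one command per machine, assuming train_net_video.py uses Detectron2's default parser; the helper names and the example address are hypothetical.

```python
from typing import List


def world_size(num_gpus_per_machine: int, num_machines: int) -> int:
    """Total number of training processes across all machines."""
    return num_gpus_per_machine * num_machines


def per_machine_commands(
    config: str,
    init_weights: str,
    output_dir: str,
    dist_url: str,
    num_gpus: int = 4,
    num_machines: int = 2,
) -> List[List[str]]:
    """Build one training command per machine; only --machine-rank differs."""
    commands = []
    for rank in range(num_machines):
        commands.append([
            "python3", "train_net_video.py",
            "--num-gpus", str(num_gpus),
            "--num-machines", str(num_machines),
            "--machine-rank", str(rank),  # assumed flag: Detectron2 default parser
            "--dist-url", dist_url,       # assumed flag: Detectron2 default parser
            "--config-file", config,
            "MODEL.WEIGHTS", init_weights,
            "MODEL.BACKBONE.TEST.WINDOW_SIZE", "1",
            "OUTPUT_DIR", output_dir,
        ])
    return commands
```

For example, `per_machine_commands(cfg, weights, out, "tcp://10.0.0.1:29500")` yields two commands, to be run on the machine with rank 0 and the machine with rank 1 respectively; the address must be reachable from both.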

Benchmark

To calculate the FPS and GFLOPs, run:

# DINOv2 FPS
python benchmark.py \
  --task fps \
  --config-file    /path/to/config.yaml \
  --model-weights  /path/to/weight.pth  \
  --warmup-iters 100 \
  --model-type dinov2

# DINOv3 FPS
python benchmark.py \
  --task fps \
  --config-file    /path/to/config.yaml \
  --model-weights  /path/to/weight.pth  \
  --warmup-iters 100 \
  --model-type dinov3 \
  --fused-qkv

# DINOv2 GFLOPs
export TIMM_FUSED_ATTN=0
python benchmark.py \
  --task flops \
  --config-file    /path/to/config.yaml \
  --model-weights  /path/to/weight.pth \
  --model-type dinov2

For DINOv3 FPS benchmarking, pass --fused-qkv. This is recommended so that the measured FPS is closer to that of the DINOv2 setup.

🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
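The role of warmup iterations in an FPS measurement can be shown with a generic timing loop. This is a minimal sketch, not the code in benchmark.py: the function name `measure_fps` and the `timed_iters` parameter are assumptions, and `step` stands for one forward pass on one frame.

```python
import time
from typing import Callable


def measure_fps(
    step: Callable[[], None], warmup_iters: int = 100, timed_iters: int = 200
) -> float:
    """Run `step` untimed for warmup, then time it and return iterations per second."""
    if timed_iters <= 0:
        raise ValueError("timed_iters must be positive")
    # Warmup runs are excluded so one-time costs (caching, lazy init) do not skew the result.
    for _ in range(warmup_iters):
        step()
    # For GPU models, pending kernels should be synchronized
    # (e.g. torch.cuda.synchronize()) before each clock read.
    start = time.perf_counter()
    for _ in range(timed_iters):
        step()
    elapsed = time.perf_counter() - start
    return timed_iters / elapsed
```

As a point of reference, 160 FPS corresponds to 1000 / 160 = 6.25 ms per frame.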

Demo

We provide example visualizations below.

To generate additional visualization samples, please use the code in Visualization.

Upcoming Features

- [x] Inference code
- [x] FLOPs and FPS code
- [x] Visualization code
- [x] Training code
- [x] DINOv3 model zoo and code

Model Zoo

We provide pre-trained weights for both DINOv2- and DINOv3-based VidEoMT models.

DINOv2 Models - Original published results and pre-trained weights.
DINOv3 Models - DINOv3-based models and pre-trained weights.

Citation

If you find this work useful in your research, please cite it using the BibTeX entry below:

@inproceedings{norouzi2026videomt,
  author     = {Norouzi, Narges and Zulfikar, Idil and Cavagnero, Niccol\`{o} and Kerssies, Tommie and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},
  title      = {{VidEoMT: Your ViT is Secretly Also a Video Segmentation Model}},
  booktitle  = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year       = {2026},
}

Acknowledgements

This project builds upon code from the following libraries and repositories:

EoMT (MIT License)
Hugging Face Transformers (Apache-2.0 License)
PyTorch Image Models (timm) (Apache-2.0 License)
CAVIS (MIT License)
Mask2Former (Apache-2.0 License)
Detectron2 (Apache-2.0 License)