PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders
June 11, 2026
CVPR 2026 Workshop · Paper
Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus
Eindhoven University of Technology
Overview
We present the Plain Mask Transformer (PMT), a fast Transformer-based segmentation model that operates on top of frozen Vision Foundation Model (VFM) features.
Encoder-only models like EoMT and VidEoMT achieve competitive accuracy with low latency but require finetuning the full encoder, preventing the VFM from being reused for other downstream tasks.
PMT addresses this by introducing the Plain Mask Decoder (PMD): a lightweight Transformer decoder that mimics the last encoder layers of EoMT, processing queries and frozen patch tokens jointly, without touching the encoder weights.
The result: a model that keeps the encoder frozen and shareable across tasks while matching the accuracy and speed of finetuned alternatives.
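To make the idea concrete, here is a minimal PyTorch sketch of a decoder that processes learnable queries and frozen patch tokens jointly. It is not the authors' implementation: `ToyFrozenEncoder`, `ToyPlainMaskDecoder`, every dimension, and the dot-product mask head are assumptions chosen for illustration, and a single strided convolution stands in for the VFM.

```python
import torch
from torch import nn


class ToyFrozenEncoder(nn.Module):
    """Hypothetical stand-in for a frozen VFM: image -> sequence of patch tokens."""

    def __init__(self, dim: int = 64, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.proj(images)                # (B, dim, H/patch, W/patch)
        return feats.flatten(2).transpose(1, 2)  # (B, N, dim)


class ToyPlainMaskDecoder(nn.Module):
    """Hypothetical decoder: queries and patch tokens share plain Transformer layers."""

    def __init__(self, dim: int = 64, num_queries: int = 8,
                 num_classes: int = 5, depth: int = 2, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                dropout=0.0, batch_first=True, norm_first=True,
            )
            for _ in range(depth)
        ])
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, patch_tokens: torch.Tensor):
        batch = patch_tokens.shape[0]
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Joint processing: self-attention runs over [queries; patch tokens].
        x = torch.cat([queries, patch_tokens], dim=1)
        for layer in self.layers:
            x = layer(x)
        num_queries = queries.shape[1]
        q_out, p_out = x[:, :num_queries], x[:, num_queries:]
        class_logits = self.class_head(q_out)  # (B, Q, num_classes + 1)
        # One mask logit per (query, patch) pair via a dot product.
        mask_logits = torch.einsum("bqc,bnc->bqn", self.mask_embed(q_out), p_out)
        return class_logits, mask_logits


# The encoder is frozen: no gradients are tracked and its weights are never updated.
encoder = ToyFrozenEncoder().requires_grad_(False).eval()
decoder = ToyPlainMaskDecoder()

images = torch.randn(2, 3, 64, 64)
with torch.no_grad():
    tokens = encoder(images)  # 64x64 image, 16x16 patches -> 16 tokens per image

class_logits, mask_logits = decoder(tokens)
loss = class_logits.sum() + mask_logits.sum()  # dummy loss, only to show gradient flow
loss.backward()
print(class_logits.shape, mask_logits.shape)
```

In this sketch only the decoder parameters receive gradients, so the same encoder instance could in principle be shared with other task heads.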
Repository Structure
The codebase is organized by task domain, with separate directories for image and video segmentation.
pmt/
├── requirements.txt   # shared dependencies
├── image/             # image segmentation
├── video/             # video segmentation
├── model_zoo/         # pre-trained weight catalogues
│   ├── image/         # image model weights (DINOv3)
│   └── video/         # video model weights (DINOv2, DINOv3)
└── docs/              # project page
Installation
If you don't have Conda installed, install Miniconda and restart your shell:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Then create the environment, activate it, and install the dependencies:
conda create -n pmt python==3.13.2
conda activate pmt
python3 -m pip install -r requirements.txt
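After installing, a quick sanity check can confirm that the listed packages are visible in the active environment. The helper below is a hypothetical sketch, not part of the repository: `missing_requirements` is an illustrative name, and its parsing is deliberately simple (it ignores version constraints, pip options, and URL requirements).

```python
import re
from importlib import metadata
from pathlib import Path
from typing import Iterable, List


def missing_requirements(lines: Iterable[str]) -> List[str]:
    """Return distribution names from requirement lines that are not installed."""
    missing = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        if re.match(r"[A-Za-z+]+://", line):  # skip bare URL requirements
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match is None:
            continue
        name = match.group(0)
        try:
            metadata.version(name)  # raises if the distribution is not installed
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Assumes the script is run from the repository root, next to requirements.txt.
requirements_file = Path("requirements.txt")
if requirements_file.exists():
    not_found = missing_requirements(requirements_file.read_text().splitlines())
    print("missing:", not_found if not_found else "none")
```

This only checks that each distribution is installed, not that its version satisfies the pin in requirements.txt.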
Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:
wandb login
Data Preparation
- Image datasets (COCO, ADE20K): follow the instructions in the EoMT repository.
Image Segmentation
Training
To train PMT from scratch, run:
python3 image/main.py fit \
-c image/configs/coco/panoptic/pmt_l_640.yaml \
--trainer.devices 4 \
--data.batch_size 4 \
--data.path /path/to/dataset
This trains PMT-L with a 640×640 input on COCO panoptic segmentation using 4 GPUs, for a total batch size of 16.
⚠️ Make sure the total batch size is devices × batch_size = 16.
🔧 Replace /path/to/dataset with the directory containing the dataset zip files.
This configuration takes ~6 hours on 4× NVIDIA H100 GPUs, each using ~26 GB of VRAM.
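If you train on a different number of GPUs, the constraint devices × batch_size = 16 determines the value to pass to `--data.batch_size`. The small helper below is an illustrative sketch (the function name is an assumption, not part of the codebase) that does the arithmetic and rejects device counts that do not divide the total evenly.

```python
def per_device_batch_size(total_batch_size: int, devices: int) -> int:
    """Return the per-device batch size so that devices * batch_size == total."""
    if devices <= 0:
        raise ValueError("devices must be a positive integer")
    if total_batch_size % devices != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by {devices} devices"
        )
    return total_batch_size // devices


# Total batch size of 16, as required for the COCO panoptic configuration.
for devices in (1, 2, 4, 8):
    batch_size = per_device_batch_size(16, devices)
    print(f"--trainer.devices {devices} --data.batch_size {batch_size}")
```

Note that fewer devices means a larger per-device batch, so the ~26 GB per-GPU memory figure quoted for 4 GPUs will not carry over.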
To fine-tune a pre-trained PMT model, add:
--model.ckpt_path /path/to/pytorch_model.bin \
--model.load_ckpt_class_head False
🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to fine-tune.
--model.load_ckpt_class_head False skips loading the classification head when fine-tuning on a dataset with different classes.
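Conceptually, skipping the classification head amounts to dropping its entries from the checkpoint before the remaining weights are loaded. The sketch below illustrates this with a plain dictionary. The `class_head.` key prefix, the other key names, and the shapes are all made up for illustration and may not match the actual PMT checkpoint layout or loading code.

```python
from typing import Any, Dict


def drop_class_head(state_dict: Dict[str, Any], prefix: str = "class_head.") -> Dict[str, Any]:
    """Return a copy of state_dict without the entries whose key starts with prefix."""
    return {key: value for key, value in state_dict.items() if not key.startswith(prefix)}


# Toy checkpoint: shape tuples stand in for tensors; names and sizes are hypothetical.
checkpoint = {
    "decoder.layers.0.weight": (64, 64),
    "mask_head.weight": (64, 64),
    "class_head.weight": (11, 64),
    "class_head.bias": (11,),
}

filtered = drop_class_head(checkpoint)
print(sorted(filtered))
```

With a real PyTorch model, the filtered dictionary would typically be passed to `load_state_dict(..., strict=False)`, so that the freshly initialized head for the new class set is left untouched.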
Evaluating
To evaluate a pre-trained PMT model, run:
python3 image/main.py validate \
-c image/configs/coco/panoptic/pmt_l_640.yaml \
--model.network.masked_attn_enabled False \
--trainer.devices 4 \
--data.batch_size 4 \
--data.path /path/to/dataset \
--model.ckpt_path /path/to/pytorch_model.bin
🔧 Replace /path/to/dataset with the directory containing the dataset zip files.
🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to evaluate.
Video Segmentation
For video segmentation training, evaluation, and model weights, see video/README.md.
Model Zoo
We provide pre-trained weights for PMT models with DINOv2 and DINOv3 encoders.
- Image Models - Image segmentation with DINOv3 encoder.
- Video Models – DINOv2 - Video segmentation with DINOv2 encoder.
- Video Models – DINOv3 - Video segmentation with DINOv3 encoder.
Citation
If you find this work useful in your research, please cite it using the BibTeX entry below:
@inproceedings{cavagnero2026pmt,
author = {Cavagnero, Niccol\`{o} and Norouzi, Narges and Dubbelman, Gijs and {de Geus}, Daan},
title = {{PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders}},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
year = {2026},
}
Acknowledgements
This project builds upon code from the following libraries and repositories:
- EoMT (MIT License)
- VidEoMT (MIT License)
- Hugging Face Transformers (Apache-2.0 License)
- PyTorch Image Models (timm) (Apache-2.0 License)
- PyTorch Lightning (Apache-2.0 License)
- TorchMetrics (Apache-2.0 License)
- Mask2Former (Apache-2.0 License)
- Detectron2 (Apache-2.0 License)