PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders

June 11, 2026

CVPR 2026 Workshop · 📄 Paper

Niccolò Cavagnero, Narges Norouzi, Gijs Dubbelman, Daan de Geus

Eindhoven University of Technology

Overview

We present the Plain Mask Transformer (PMT), a fast Transformer-based segmentation model that operates on top of frozen Vision Foundation Model (VFM) features.

Encoder-only models like EoMT and VidEoMT achieve competitive accuracy with low latency but require finetuning the full encoder, preventing the VFM from being reused for other downstream tasks.

PMT addresses this by introducing the Plain Mask Decoder (PMD): a lightweight Transformer decoder that mimics the last encoder layers of EoMT. It processes the queries jointly with the patch tokens produced by the frozen encoder, without touching the encoder weights.

The result: a model that keeps the encoder frozen and shareable across tasks while matching the accuracy and speed of finetuned alternatives.
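To make "processing queries and patch tokens jointly" concrete, here is a minimal pure-Python sketch of a single self-attention step over the concatenated sequence. It is an illustration under strong assumptions, not the PMD implementation: it uses identity query/key/value projections, a single head, and no residual connections, normalization, or MLP. The function names and toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(queries, patch_tokens):
    """Single-head self-attention over the sequence [queries; patch_tokens].

    Assumption: identity Q/K/V projections and no residual, norm, or MLP.
    A real decoder layer would learn all of these.
    """
    tokens = queries + patch_tokens  # one joint sequence (new list, inputs untouched)
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in tokens:
        # Every token, query or patch, attends to every token in the joint sequence.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)]
        )
    num_queries = len(queries)
    return outputs[:num_queries], outputs[num_queries:]


# Toy inputs: 2 queries and 3 patch tokens, all 4-dimensional.
queries = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]
patch_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

updated_queries, updated_patches = joint_self_attention(queries, patch_tokens)
print(len(updated_queries), len(updated_patches))
```

The point of the sketch is only the data flow: the encoder that produced `patch_tokens` is never called or modified here, so its weights can stay frozen while the decoder updates queries and tokens together.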

Repository Structure

The codebase is organized by task domain, with separate directories for image and video segmentation.

pmt/
├── requirements.txt          # shared dependencies
├── image/                    # image segmentation
├── video/                    # video segmentation 
├── model_zoo/                # pre-trained weight catalogues
│   ├── image/                # image model weights (DINOv3)
│   └── video/                # video model weights (DINOv2, DINOv3)
└── docs/                     # project page

Installation

If you don't have Conda installed, install Miniconda and restart your shell:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Then create the environment, activate it, and install the dependencies:

conda create -n pmt python==3.13.2
conda activate pmt
python3 -m pip install -r requirements.txt

Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:

wandb login

Data Preparation

Image datasets (COCO, ADE20K): follow the instructions in the EoMT repository.

Image Segmentation

Training

To train PMT from scratch, run:

python3 image/main.py fit \
  -c image/configs/coco/panoptic/pmt_l_640.yaml \
  --trainer.devices 4 \
  --data.batch_size 4 \
  --data.path /path/to/dataset

This trains PMT-L with a 640×640 input on COCO panoptic segmentation using 4 GPUs, for a total batch size of 16.

✅ Make sure the total batch size (devices × batch_size) equals 16.
🔧 Replace /path/to/dataset with the directory containing the dataset zip files.

This configuration takes ~6 hours on 4× NVIDIA H100 GPUs, each using ~26 GB of VRAM.
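If you train on a different number of GPUs, the per-device batch size has to change so that the total stays at 16. The helper below is a small sketch of that arithmetic. The function name is hypothetical and not part of the repository, and it does not check whether the resulting per-device batch fits in GPU memory.

```python
def per_device_batch_size(total_batch_size: int, devices: int) -> int:
    """Return the value for --data.batch_size so that
    devices * batch_size == total_batch_size."""
    if devices <= 0:
        raise ValueError("devices must be a positive integer")
    if total_batch_size % devices != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by {devices} devices"
        )
    return total_batch_size // devices


for devices in (1, 2, 4, 8):
    print(f"--trainer.devices {devices} --data.batch_size {per_device_batch_size(16, devices)}")
# --trainer.devices 1 --data.batch_size 16
# --trainer.devices 2 --data.batch_size 8
# --trainer.devices 4 --data.batch_size 4
# --trainer.devices 8 --data.batch_size 2
```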

To fine-tune a pre-trained PMT model, add:

  --model.ckpt_path /path/to/pytorch_model.bin \
  --model.load_ckpt_class_head False

🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to fine-tune.

--model.load_ckpt_class_head False skips loading the classification head when fine-tuning on a dataset with different classes.
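Conceptually, skipping the classification head amounts to dropping its entries from the checkpoint before the remaining weights are loaded. The sketch below shows that idea on a plain dictionary. It is not the repository's loading code, and the `class_head.` prefix and the other key names are hypothetical placeholders; inspect the keys of your own checkpoint to find the real ones.

```python
def drop_class_head(state_dict, prefix="class_head."):
    """Return a copy of state_dict without entries whose key starts with prefix.

    Assumption: the classification head's parameters share a common key prefix.
    """
    return {key: value for key, value in state_dict.items() if not key.startswith(prefix)}


# Placeholder checkpoint: the string values stand in for weight tensors.
checkpoint = {
    "decoder.layer0.weight": "tensor A",
    "mask_head.weight": "tensor B",
    "class_head.weight": "tensor C",  # shape depends on the number of classes
    "class_head.bias": "tensor D",
}

filtered = drop_class_head(checkpoint)
print(sorted(filtered))
# ['decoder.layer0.weight', 'mask_head.weight']
```

The dropped parameters are then initialized fresh for the new label set, while every other weight is reused from the checkpoint.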

Evaluating

To evaluate a pre-trained PMT model, run:

python3 image/main.py validate \
  -c image/configs/coco/panoptic/pmt_l_640.yaml \
  --model.network.masked_attn_enabled False \
  --trainer.devices 4 \
  --data.batch_size 4 \
  --data.path /path/to/dataset \
  --model.ckpt_path /path/to/pytorch_model.bin

🔧 Replace /path/to/dataset with the directory containing the dataset zip files.
🔧 Replace /path/to/pytorch_model.bin with the path to the checkpoint to evaluate.
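When evaluating several checkpoints, it can be convenient to assemble the command programmatically. The sketch below only rebuilds the command shown above from its arguments. The helper name and the `run_a` / `run_b` checkpoint paths are hypothetical, and the flags are exactly those listed in this README.

```python
import shlex


def build_validate_command(config, data_path, ckpt_path, devices=4, batch_size=4):
    """Assemble the evaluation command from this README as an argument list."""
    return [
        "python3", "image/main.py", "validate",
        "-c", config,
        "--model.network.masked_attn_enabled", "False",
        "--trainer.devices", str(devices),
        "--data.batch_size", str(batch_size),
        "--data.path", data_path,
        "--model.ckpt_path", ckpt_path,
    ]


# Hypothetical checkpoint locations; replace with your own paths.
checkpoints = [
    "/path/to/run_a/pytorch_model.bin",
    "/path/to/run_b/pytorch_model.bin",
]

for ckpt in checkpoints:
    cmd = build_validate_command(
        config="image/configs/coco/panoptic/pmt_l_640.yaml",
        data_path="/path/to/dataset",
        ckpt_path=ckpt,
    )
    # Print the shell-quoted command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```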

Video Segmentation

For video segmentation training, evaluation, and model weights, see video/README.md.

Model Zoo

We provide pre-trained weights for PMT models with DINOv2 and DINOv3 encoders.

Image Models: image segmentation with a DINOv3 encoder.
Video Models (DINOv2): video segmentation with a DINOv2 encoder.
Video Models (DINOv3): video segmentation with a DINOv3 encoder.

Citation

If you find this work useful in your research, please cite it using the BibTeX entry below:

@inproceedings{cavagnero2026pmt,
  author    = {Cavagnero, Niccol\`{o} and Norouzi, Narges and Dubbelman, Gijs and {de Geus}, Daan},
  title     = {{PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders}},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year      = {2026},
}

Acknowledgements

This project builds upon code from the following libraries and repositories:

EoMT (MIT License)
VidEoMT (MIT License)
Hugging Face Transformers (Apache-2.0 License)
PyTorch Image Models (timm) (Apache-2.0 License)
PyTorch Lightning (Apache-2.0 License)
TorchMetrics (Apache-2.0 License)
Mask2Former (Apache-2.0 License)
Detectron2 (Apache-2.0 License)