SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition

June 2, 2025

🧠 Overview

SemiVT-Surge is a semi-supervised learning framework for surgical phase recognition based on a Video Transformer architecture. Accurate phase recognition is crucial for computer-assisted interventions and surgical video analysis. However, annotating long surgical videos is time-consuming and costly, which motivates leveraging unlabeled data to reduce the need for manual annotations.

While self-supervised learning has gained traction by enabling large-scale pretraining followed by fine-tuning on smaller labeled subsets, semi-supervised approaches remain underexplored in the surgical domain.

We propose a video transformer-based model enhanced with a robust pseudo-labeling framework that integrates:

(a) Temporal consistency regularization for unlabeled data
(b) Contrastive learning using class prototypes to improve feature space separation
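
As a rough illustration of item (b), the sketch below computes class prototypes as per-class feature means and scores one feature against them with a temperature-scaled softmax over cosine similarities. This is a generic prototype-contrastive formulation in plain Python, not the loss implemented in this repository; the function names, the use of cosine similarity, and the temperature of 0.1 are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def class_prototypes(features, labels):
    """Return the mean feature vector per class (hypothetical helper)."""
    sums, counts = {}, defaultdict(int)
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {c: [s / counts[c] for s in vec] for c, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def prototype_contrastive_loss(feature, label, prototypes, temperature=0.1):
    """Cross-entropy over temperature-scaled similarities to all prototypes.

    The loss is low when `feature` is close to the prototype of `label`
    and far from the prototypes of the other classes.
    """
    logits = {c: cosine(feature, p) / temperature for c, p in prototypes.items()}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits.values())
    log_denom = max_logit + math.log(
        sum(math.exp(v - max_logit) for v in logits.values())
    )
    return log_denom - logits[label]


# Toy 2-D features for two phases (illustrative values only).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
prototypes = class_prototypes(features, labels)
loss_matching = prototype_contrastive_loss([1.0, 0.0], 0, prototypes)
loss_mismatched = prototype_contrastive_loss([1.0, 0.0], 1, prototypes)
```

In this toy case the feature scored against its own class prototype gets a much smaller loss than the same feature scored against the other class. Separating classes in feature space in this way is what the prototype term is meant to encourage.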

Our method is evaluated on:

RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) – a private dataset
Cholec80 – a public surgical video benchmark

📈 Key Results:

Achieved state-of-the-art performance on RAMIE, improving accuracy by +4.9%
Matched fully-supervised performance on Cholec80 using only 25% of the labeled data

📢 News

🗓️ 2024-05-12: Our paper has been early accepted to MICCAI 2025, placing in the top 9% of submissions!

Setup & Environment

We recommend using a Conda environment.

# Create a conda environment with Python 3.11
conda create -n semivt-surge python=3.11
conda activate semivt-surge

Install the required packages:

# PyTorch (matching your CUDA version, adjust as needed)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# FlashAttention 2 for faster training
# More info: https://github.com/Dao-AILab/flash-attention
pip install flash-attn --no-build-isolation

# Install other dependencies

Data Preparation

To prepare the Cholec80 dataset for training and evaluation, follow these steps:

Use utils/split_video.py to extract frames from each video and zip them into one archive per video. We zip the frames only because of file count limits on our server; this step is optional.
Run utils/generate_splits.py to create train, validation, and unlabeled splits. Each train split has a corresponding unlabeled split, and they do not overlap.
Use timesformer/phase_datasets/data_preprocess/generate_labels_ch80_zip.py to generate pickle files for labeled data.
Use timesformer/phase_datasets/data_preprocess/generate_labels_ch80_zip_unlabelled.py for unlabeled data.
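
The non-overlap property of the train and unlabeled splits can be sketched as follows. This is a minimal example and not the logic of utils/generate_splits.py: the function name, the seed, the 40-video training pool, and the 25% labeled fraction are assumptions chosen for illustration.

```python
import random


def make_semi_supervised_split(video_ids, labeled_fraction, seed=0):
    """Split training videos into disjoint labeled and unlabeled subsets.

    Hypothetical helper: every video lands in exactly one of the two lists.
    """
    if not 0.0 < labeled_fraction <= 1.0:
        raise ValueError("labeled_fraction must be in (0, 1]")
    ids = sorted(video_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_labeled = max(1, round(len(ids) * labeled_fraction))
    labeled = sorted(ids[:n_labeled])
    unlabeled = sorted(ids[n_labeled:])
    return labeled, unlabeled


# Assumed training pool of 40 videos, named like the archives in this README.
train_videos = [f"video{i:02d}" for i in range(1, 41)]
labeled, unlabeled = make_semi_supervised_split(train_videos, labeled_fraction=0.25)
```

Shuffling once and slicing guarantees that the two subsets are disjoint and together cover the whole training pool.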

Example data folder structure:

data/
└── Cholec80/
    ├── Cholec80-splits-256px_1fps/
    │   ├── video01.zip
    │   ├── ...
    │   └── video80.zip
    ├── frames_cutmargin/
    └── labels_pkl/
        ├── train/
        │   ├── 1-5-1fpstrain.pickle
        │   ├── 1-10-1fpstrain.pickle
        │   └── 1-20-1fpstrain.pickle
        ├── unlabelled/
        │   ├── 1-5-1fpsunlabelled.pickle
        │   ├── 1-10-1fpsunlabelled.pickle
        │   └── 1-20-1fpsunlabelled.pickle
        ├── val/
        │   └── 1fpsval.pickle
        └── test/
            └── 1pstest.pickle

All dataloaders are located in timesformer/phase_datasets/phase/.
Dataloaders are adapted to work with zipped frames - modify them if you're using unzipped data.
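
For orientation, here is a minimal sketch of the zipped-frame access pattern using only the Python standard library. It is not the repository's dataloader code: the helper names, the .jpg extension, and the zero-padded frame names are assumptions, and the demo writes placeholder bytes instead of real images.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_frames(frame_dir, archive_path):
    """Pack every .jpg in frame_dir into one archive, stored uncompressed."""
    frame_dir = Path(frame_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for frame in sorted(frame_dir.glob("*.jpg")):
            zf.write(frame, arcname=frame.name)


def list_frames(archive_path):
    """Return the sorted frame names inside a per-video archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return sorted(n for n in zf.namelist() if n.lower().endswith(".jpg"))


def read_frame_bytes(archive_path, frame_name):
    """Read the raw bytes of one frame without extracting the archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.read(frame_name)


# Demo with placeholder bytes standing in for JPEG data.
with tempfile.TemporaryDirectory() as tmp:
    frames = Path(tmp) / "video01"
    frames.mkdir()
    for i in range(3):
        (frames / f"{i:06d}.jpg").write_bytes(b"frame-%d" % i)
    archive = Path(tmp) / "video01.zip"
    zip_frames(frames, archive)
    names = list_frames(archive)
    payload = read_frame_bytes(archive, names[1])
```

With real frames, the returned bytes would still need to be decoded by an image library before being passed to the model. For unzipped data, the equivalent change is to read the frame file directly from the folder.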

Pretrained Parameters

We initialize our model with pretrained weights from TimeSformer, trained on the Kinetics-600 dataset using 8-frame input and a spatial resolution of 224×224.

📥 Download pretrained weights (Dropbox)

Place the .pyth file in an accessible directory and update your config accordingly.

Config file

An example config file is at configs/Kinetics/TimeSformer_base_Cholec80.yaml. Edit this YAML to set your file paths, datasets, batch size, and other parameters as needed.

Perform Training

Run the training script with your configuration YAML:

python main/train_net_emamix_triplet_semi_prototype_Cholec80.py --cfg <path_to_config.yaml>

Extract features & Perform prediction

Extract features and generate predictions using:

python main/extract_features_Cholec80.py --cfg <path_to_config.yaml>

Extracted features will be saved to the output directory specified in the config file.
The predicted phase labels will be stored in a file named predictions.yaml.

(Optional) TCN Training on Extracted Features

To train a Temporal Convolutional Network (TCN) on top of the extracted features:

Update the path to the features in TemporalModel/train.py
Run the TCN training script: python TemporalModel/train.py

The predictions will be saved as predictions.yaml, generated by the test_model function.

Evaluation

The file predictions.yaml is compatible with the PhaseMetrics evaluation repository. All evaluation results reported in the paper were obtained from the eval.yaml file that PhaseMetrics generates.

Acknowledgements

This codebase builds upon work from TimeSformer and SVFormer. We also acknowledge the PhaseMetrics repository for evaluation tools. Many thanks to the original authors for making their code publicly available!