SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition
June 2, 2025
Overview
SemiVT-Surge is a semi-supervised learning framework for surgical phase recognition based on a Video Transformer architecture. Accurate phase recognition is crucial for computer-assisted interventions and surgical video analysis. However, annotating long surgical videos is time-consuming and costly, which motivates leveraging unlabeled data to reduce the need for manual annotations.
While self-supervised learning has gained traction by enabling large-scale pretraining followed by fine-tuning on smaller labeled subsets, semi-supervised approaches remain underexplored in the surgical domain.
We propose a video transformer-based model enhanced with a robust pseudo-labeling framework that integrates:

- (a) Temporal consistency regularization for unlabeled data
- (b) Contrastive learning using class prototypes to improve feature space separation
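The two ingredients above can be illustrated with a small, framework-free sketch. This is not the implementation in this repository: the function names, the 0.9 confidence threshold, the squared-difference consistency term, and the temperature of 0.1 are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(probs, threshold=0.9):
    """Return the argmax class if its probability reaches the (assumed) threshold, else None."""
    conf = max(probs)
    return probs.index(conf) if conf >= threshold else None


def consistency_loss(probs_a, probs_b):
    """Mean squared difference between predictions for two temporal views of one clip."""
    return sum((a - b) ** 2 for a, b in zip(probs_a, probs_b)) / len(probs_a)


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def prototype_contrastive_loss(feature, label, prototypes, temperature=0.1):
    """Cross-entropy over temperature-scaled cosine similarities to class prototypes."""
    logits = [cosine(feature, p) / temperature for p in prototypes]
    return -math.log(softmax(logits)[label])


# Two hypothetical temporal views of the same unlabeled clip.
view_a = softmax([4.0, 0.5, 0.1])
view_b = softmax([3.5, 0.7, 0.2])
label = pseudo_label(view_a)  # confident enough to keep
if label is not None:
    prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one hypothetical prototype per phase
    total = consistency_loss(view_a, view_b) + prototype_contrastive_loss([0.9, 0.1], label, prototypes)
```

In this sketch, clips whose prediction falls below the threshold contribute no pseudo-label, which is one common way to limit the effect of noisy labels.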
Our method is evaluated on:
- RAMIE (Robot-Assisted Minimally Invasive Esophagectomy): a private dataset
- Cholec80: a public surgical video benchmark
Key Results:
- Achieved state-of-the-art performance on RAMIE, improving accuracy by +4.9%
- Matched fully-supervised performance on Cholec80 using only 25% of the labeled data
News
2024-05-12: Our paper has been early accepted to MICCAI 2025, placing in the top 9% of submissions!
Setup & Environment
We recommend using a Conda environment.
# Create a conda environment with Python 3.11
conda create -n semivt-surge python=3.11
conda activate semivt-surge
Install the required packages:
# PyTorch (matching your CUDA version, adjust as needed)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
# FlashAttention 2 for faster training
# More info: https://github.com/Dao-AILab/flash-attention
pip install flash-attn --no-build-isolation
# Install other dependencies
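After installation, a quick check can confirm that the main packages are visible to the active interpreter. The helper below is a hypothetical sketch and is not part of this repository; it only looks up whether each module can be found and does not import it or verify its version.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "torchvision", "flash_attn")):
    """Map each top-level module name to True if it can be located, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

A package reported as missing usually means the wrong Conda environment is active or the corresponding `pip install` step failed.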
Data Preparation
To prepare the Cholec80 dataset for training and evaluation, follow these steps:
- Use utils/split_video.py to extract frames from each video and zip them into individual archives. (We zip frames per video because of file count limitations on our server; zipping is not required.)
- Run utils/generate_splits.py to create train, validation, and unlabeled splits. Each train split has a corresponding unlabeled split, and the two do not overlap.
- Use timesformer/phase_datasets/data_preprocess/generate_labels_ch80_zip.py to generate pickle files for labeled data.
- Use timesformer/phase_datasets/data_preprocess/generate_labels_ch80_zip_unlabelled.py to generate pickle files for unlabeled data.
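The non-overlap requirement between each train split and its unlabeled counterpart can be sketched as follows. This is an illustration only and does not reproduce the logic of utils/generate_splits.py; the function name, the fixed seed, and the pool of 40 training videos are assumptions.

```python
import random


def make_splits(video_ids, labeled_fraction, seed=0):
    """Split training videos into disjoint labeled and unlabeled subsets."""
    if not 0 < labeled_fraction <= 1:
        raise ValueError("labeled_fraction must be in (0, 1]")
    ids = sorted(video_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducible splits
    n_labeled = max(1, round(len(ids) * labeled_fraction))
    return sorted(ids[:n_labeled]), sorted(ids[n_labeled:])


# Hypothetical pool of training videos; adjust to your own train/val/test partition.
train_videos = [f"video{i:02d}" for i in range(1, 41)]
labeled, unlabeled = make_splits(train_videos, labeled_fraction=0.25)

# Every training video lands in exactly one of the two subsets.
assert not set(labeled) & set(unlabeled)
assert set(labeled) | set(unlabeled) == set(train_videos)
```

Fixing the seed keeps the labeled subset identical across runs, which makes results for a given label ratio comparable.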
Example data folder structure:
data/
└── Cholec80/
    ├── Cholec80-splits-256px_1fps/
    │   ├── video01.zip
    │   ├── ...
    │   └── video80.zip
    ├── frames_cutmargin/
    └── labels_pkl/
        ├── train
        │   ├── 1-5-1fpstrain.pickle
        │   ├── 1-10-1fpstrain.pickle
        │   └── 1-20-1fpstrain.pickle
        ├── unlabelled
        │   ├── 1-5-1fpsunlabelled.pickle
        │   ├── 1-10-1fpsunlabelled.pickle
        │   └── 1-20-1fpsunlabelled.pickle
        ├── val
        │   └── 1fpsval.pickle
        └── test
            └── 1pstest.pickle
- All dataloaders are located in timesformer/phase_datasets/phase/.
- Dataloaders are adapted to work with zipped frames; modify them if you're using unzipped data.
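Reading frames straight from a per-video archive needs only the standard library. The sketch below is not taken from the repository's dataloaders; the helper names and the frame file names are hypothetical, and it assumes zero-padded frame names so that sorting by name matches temporal order.

```python
import io
import zipfile


def list_frames(zip_path, extensions=(".jpg", ".png")):
    """Return the image file names inside an archive, sorted by name."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(n for n in archive.namelist() if n.lower().endswith(extensions))


def read_frame_bytes(zip_path, frame_name):
    """Return the raw, still-encoded bytes of one frame without unpacking the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(frame_name)


# Self-contained demo on an in-memory archive instead of a real videoNN.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00002.jpg", b"frame-2")
    archive.writestr("00001.jpg", b"frame-1")
    archive.writestr("notes.txt", b"ignored")

frames = list_frames(buffer)
first = read_frame_bytes(buffer, frames[0])
```

With real data, the returned bytes would then be decoded by an image library before being stacked into a clip. For unzipped frames, the same two helpers could be replaced by a directory listing and a plain file read.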
Pretrained Parameters
We initialize our model with pretrained weights from TimeSformer, trained on the Kinetics-600 dataset using 8-frame input and a spatial resolution of 224×224.
Download pretrained weights (Dropbox)
Place the .pyth file in an accessible directory and update your config accordingly.
Config file
An example config file is at configs/Kinetics/TimeSformer_base_Cholec80.yaml. Edit this YAML to set your file paths, datasets, batch size, and other parameters as needed.
Perform Training
Run your training script with the appropriate configuration YAML.
python main/train_net_emamix_triplet_semi_prototype_Cholec80.py --cfg <path_to_config.yaml>
Extract features & Perform prediction
Extract features and generate predictions using:
python main/extract_features_Cholec80.py --cfg <path_to_config.yaml>
- Extracted features will be saved to the output directory specified in the config file.
- The predicted phase labels will be stored in a file named predictions.yaml.
(Optional) TCN Training on Extracted Features
To train a Temporal Convolutional Network (TCN) on top of the extracted features:
- Update the path to the features in TemporalModel/train.py
- Run the TCN training script:
python TemporalModel/train.py
The predictions will be saved as predictions.yaml, generated by the test_model function.
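For readers unfamiliar with TCNs, the building block they commonly rely on is a dilated causal 1D convolution over the per-frame feature sequence. The pure-Python sketch below shows that operation on a scalar sequence; it is an assumption-laden illustration, not the model in TemporalModel/train.py, which may use a different architecture (for example, non-causal or multi-stage).

```python
def dilated_causal_conv1d(sequence, kernel, dilation=1):
    """Causal 1D convolution: output[t] = sum_k kernel[k] * sequence[t - k * dilation].

    Positions before the start of the sequence are treated as zeros,
    so output[t] never depends on frames after t.
    """
    output = []
    for t in range(len(sequence)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                total += weight * sequence[idx]
        output.append(total)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input frames seen by one output frame after stacking one conv per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Each output adds the current value and the value two steps earlier.
smoothed = dilated_causal_conv1d([1, 2, 3, 4], kernel=[1, 1], dilation=2)

# Doubling the dilation per layer grows temporal context quickly.
context = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
```

Growing the receptive field this way is what lets a temporal model refine frame-level predictions using long-range context in videos that last tens of minutes.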
Evaluation
The file predictions.yaml is compatible with the PhaseMetrics evaluation repository.
All evaluation results reported in the paper were obtained using the eval.yaml file generated by this repo.
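For a quick sanity check before running the full evaluation, frame-wise accuracy and per-phase Jaccard can be computed directly from two label sequences. This is a minimal sketch and not the PhaseMetrics implementation, so its numbers are not guaranteed to match those reported in the paper; the function names and the toy sequences are hypothetical.

```python
def frame_accuracy(pred, target):
    """Fraction of frames whose predicted phase equals the ground-truth phase."""
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def phase_jaccard(pred, target, phase):
    """Intersection over union of the frames assigned to one phase, or None if it never occurs."""
    intersection = sum(p == phase and t == phase for p, t in zip(pred, target))
    union = sum(p == phase or t == phase for p, t in zip(pred, target))
    return intersection / union if union else None


# Toy example: six frames, three phases.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

accuracy = frame_accuracy(pred, target)
jaccards = {phase: phase_jaccard(pred, target, phase) for phase in sorted(set(target))}
```

Returning None for a phase that appears in neither sequence keeps absent phases from being counted as perfect or zero scores when averaging.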
Acknowledgements
This codebase builds upon work from TimeSformer and SVFormer. We also acknowledge the PhaseMetrics repository for evaluation tools. Many thanks to the original authors for making their code publicly available!