FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation (MICCAI 2025)

June 7, 2025

This paper was early-accepted at MICCAI 2025 (top 9%).

Huihan Wang¹* Zhiwen Yang¹* Hui Zhang² Dan Zhao³ Bingzheng Wei⁴ Yan Xu¹ ✉

¹BUAA ²THU ³PUMC ⁴ByteDance

* Equal contribution. ✉ Corresponding author.

https://github.com/user-attachments/assets/c0b3a5a7-8ef0-4524-a057-369278a9fb16


🛠Setup

git clone https://github.com/Yaziwel/FEAT.git
cd FEAT
conda create -n FEAT python=3.10
conda activate FEAT

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118

pip install -r requirements.txt

📚Data Preparation

Colonoscopic: The dataset provided by the paper can be found here. You can directly use the video data already processed by Endo-FM; no further processing is needed.

Kvasir-Capsule: The dataset provided by the paper can be found here. You can directly use the video data already processed by Endo-FM; no further processing is needed.

First, run process_data.py to split the videos into frames, then run process_list.py to generate the corresponding frame list.

CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s ./data/Colonoscopic -t ./data/Colonoscopic_frames

CUDA_VISIBLE_DEVICES=gpu_id python process_list.py -f ./data/Colonoscopic_frames -t ./data/Colonoscopic_frames/train_128_list.txt

The resulting file structure is as follows.

├── data
│   ├── Colonoscopic
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Kvasir-Capsule
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Colonoscopic_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
│   ├── Kvasir-Capsule_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
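
Before training, it can help to confirm that frame extraction produced the layout shown above. The following is a minimal sketch, not part of the FEAT repository: the helper name summarize_frames is hypothetical, and it only assumes that each video folder contains .jpg frames and that train_128_list.txt sits next to the video folders.

```python
from pathlib import Path


def summarize_frames(frames_root: str) -> dict[str, int]:
    """Map each video folder under frames_root to its number of .jpg frames."""
    root = Path(frames_root)
    counts = {}
    for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[video_dir.name] = sum(1 for _ in video_dir.glob("*.jpg"))
    return counts


if __name__ == "__main__":
    # Path taken from the process_data.py example above.
    root = Path("./data/Colonoscopic_frames")
    if root.is_dir():
        counts = summarize_frames(str(root))
        empty = [name for name, n in counts.items() if n == 0]
        print(f"{len(counts)} video folders, {sum(counts.values())} frames in total")
        if empty:
            print("Folders without frames:", empty)
        print("List file present:", (root / "train_128_list.txt").is_file())
    else:
        print(f"{root} not found; run process_data.py first")
```

A folder reported without frames usually points to a video that failed to decode, which is worth fixing before the list file is used for training.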

⏳Training

Train FEAT by running the script for the target dataset (col for Colonoscopic, kva for Kvasir-Capsule):

bash train_scripts/col/train_col.sh
bash train_scripts/kva/train_kva.sh

🎇Sampling

You can sample medical videos directly from a model checkpoint. Here is a quick example using our pre-trained models:

Download the pre-trained weights from here and put them in the path defined in the configs. You can also use huggingface_hub to download the weights. For example, a checkpoint can be downloaded like so:

from huggingface_hub import hf_hub_download

# 4 models supported: FEAT_L_col.pt, FEAT_L_kva.pt, FEAT_S_col.pt and FEAT_S_kva.pt
filepath = hf_hub_download(repo_id="WTHH031230/FEAT", filename="FEAT_L_col.pt")
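
To fetch all four checkpoints in one go, the same call can be wrapped in a loop. This is a sketch under stated assumptions: the helper names checkpoint_names and download_all are hypothetical, and the target folder ./checkpoints is a placeholder that should be replaced with the path your configs expect.

```python
from itertools import product


def checkpoint_names() -> list[str]:
    """Build the four checkpoint filenames listed above (sizes L/S, datasets col/kva)."""
    return [
        f"FEAT_{size}_{dataset}.pt"
        for size, dataset in product(("L", "S"), ("col", "kva"))
    ]


def download_all(local_dir: str = "./checkpoints") -> list[str]:
    """Download every checkpoint into local_dir and return the local file paths."""
    # Imported lazily so the filename helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id="WTHH031230/FEAT", filename=name, local_dir=local_dir)
        for name in checkpoint_names()
    ]
```

Calling download_all() requires network access and returns the local paths of the downloaded files, which can then be copied or linked to the locations defined in the configs.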

Sampling is done by sample.py, which is launched through the scripts below. Edit the scripts to customize arguments such as the number of sampling steps.

To sample a video with FEAT:

bash sample/col.sh
bash sample/kva.sh

DDP sampling:

bash sample/col_ddp.sh
bash sample/kva_ddp.sh

DDP sampling generates more than 3125 videos, which are then used to calculate the metrics.

📏Evaluation

The metrics we obtained on the Colonoscopic dataset are listed below:

Method	FVD↓	CD-FVD↓	FID↓	IS↑
StyleGAN-V	2110.7	1032.8	226.14	2.12
LVDM	1036.7	792.9	96.85	1.93
MoStGAN-V	468.5	592.0	53.17	3.37
Endora	460.7	545.3	13.41	3.90
FEAT-S (Ours)	415.4	444.0	13.34	3.96
FEAT-L (Ours)	351.1	397.0	12.31	4.01
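
To put the table in relative terms, the gap between FEAT and the strongest baseline (Endora) can be computed directly from the reported numbers. The snippet below is an illustrative calculation only: the values are copied from the table above, and the helper name relative_gain is hypothetical.

```python
# Colonoscopic results copied from the table above: (FVD, CD-FVD, FID, IS).
RESULTS = {
    "Endora": (460.7, 545.3, 13.41, 3.90),
    "FEAT-S": (415.4, 444.0, 13.34, 3.96),
    "FEAT-L": (351.1, 397.0, 12.31, 4.01),
}
METRICS = ("FVD", "CD-FVD", "FID", "IS")
HIGHER_IS_BETTER = {"IS"}  # the other three metrics are lower-is-better


def relative_gain(method: str, baseline: str = "Endora") -> dict[str, float]:
    """Percent improvement of method over baseline; positive means better."""
    gains = {}
    for name, ours, base in zip(METRICS, RESULTS[method], RESULTS[baseline]):
        diff = ours - base if name in HIGHER_IS_BETTER else base - ours
        gains[name] = 100.0 * diff / base
    return gains


if __name__ == "__main__":
    for method in ("FEAT-S", "FEAT-L"):
        gains = relative_gain(method)
        print(method, {k: round(v, 1) for k, v in gains.items()})
```

For example, FEAT-L lowers FVD from 460.7 to 351.1, a relative reduction of roughly 24% compared with Endora.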

Before calculating the metrics in our code, you may need the weights for several models, which can be downloaded from the following links:

Inception v3 for calculating FID and IS.
I3D for calculating FVD.
VideoMAE for calculating CD-FVD.

Alternatively, you can follow this part of the code in Endora to download the models automatically for metric calculation.

To evaluate the model, calculate the metrics with the steps below.

## FVD, FID and IS
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s /path/to/generated/video -t /path/to/video/frames
cd /path/to/stylegan-v
CUDA_VISIBLE_DEVICES=gpu_id python ./src/scripts/calc_metrics_for_dataset.py \
  --fake_data_path /path/to/video/frames \
  --real_data_path /path/to/dataset/frames 
  
## CD-FVD
CUDA_VISIBLE_DEVICES=gpu_id python calculate_cdfvd.py

🧰Running Other Methods

Because our setup follows Endora, you can run the other methods the same way Endora describes.

🎪Downstream Application

Because our setup follows Endora, you can run the downstream task the same way Endora describes.

Method	Colonoscopic
Supervised-only	74.5
LVDM	76.2
Endora	87.0
FEAT-S (ours)	89.9
FEAT-L (ours)	91.3

🎈Acknowledgements

We greatly appreciate the tremendous effort behind the following projects!

📜Citation

If you find FEAT useful in your research, please consider citing:

@article{wang2025feat,
  author    = {Huihan Wang and Zhiwen Yang and Hui Zhang and Dan Zhao and Bingzheng Wei and Yan Xu},
  title     = {FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation},
  journal   = {arXiv preprint arXiv:2506.04956},
  year      = {2025}
}