FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation (MICCAI 2025)
June 7, 2025
This paper was early accepted at MICCAI 2025 (top 9%).
Huihan Wang1* Zhiwen Yang1* Hui Zhang2 Dan Zhao3 Bingzheng Wei4 Yan Xu1 ✉
1BUAA 2THU 3PUMC 4ByteDance
* Equal Contributions. ✉ Corresponding Author.

https://github.com/user-attachments/assets/c0b3a5a7-8ef0-4524-a057-369278a9fb16

🛠Setup
git clone https://github.com/Yaziwel/FEAT.git
cd FEAT
conda create -n FEAT python=3.10
conda activate FEAT
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
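After installation, it can help to confirm that the pinned PyTorch packages were actually picked up. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `check_pins` are hypothetical and are not part of the FEAT repository.

```python
from importlib import metadata

# Version pins copied from the pip command above.
EXPECTED = {"torch": "2.1.2", "torchvision": "0.16.2", "torchaudio": "2.1.2"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(expected, lookup=installed_version):
    """Map each package name to a tuple (expected, found, ok).

    A local build tag such as '+cu118' is ignored when comparing versions.
    """
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        ok = found is not None and found.split("+")[0] == want
        report[name] = (want, found, ok)
    return report
```

Calling `check_pins(EXPECTED)` inside the `FEAT` environment returns one entry per package; any entry whose last field is `False` points to a missing or mismatched install.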
📚Data Preparation
Colonoscopic: The dataset provided by the paper can be found here. You can directly use the video data already processed by Endo-FM without further processing.
Kvasir-Capsule: The dataset provided by the paper can be found here. You can directly use the video data already processed by Endo-FM without further processing.
First, run process_data.py and process_list.py to split the videos into frames and generate the corresponding frame list.
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s ./data/Colonoscopic -t ./data/Colonoscopic_frames
CUDA_VISIBLE_DEVICES=gpu_id python process_list.py -f ./data/Colonoscopic_frames -t ./data/Colonoscopic_frames/train_128_list.txt
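The two commands above only cover Colonoscopic. As a rough sketch, the same pair can be generated for both datasets from Python. The `-s`, `-t` and `-f` flags are taken from the commands above; the Kvasir-Capsule paths are an assumption that simply mirrors the Colonoscopic ones, and `build_commands` and `run_preprocessing` are hypothetical helper names.

```python
import os
import subprocess


def build_commands(dataset, data_root="./data", list_name="train_128_list.txt"):
    """Return the two preprocessing commands (as argv lists) for one dataset."""
    videos = f"{data_root}/{dataset}"
    frames = f"{data_root}/{dataset}_frames"
    return [
        ["python", "process_data.py", "-s", videos, "-t", frames],
        ["python", "process_list.py", "-f", frames, "-t", f"{frames}/{list_name}"],
    ]


def run_preprocessing(datasets=("Colonoscopic", "Kvasir-Capsule"), gpu_id="0"):
    """Run both scripts for each dataset, stopping at the first failure."""
    # Assumption: the scripts honor CUDA_VISIBLE_DEVICES exactly as in the shell commands.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_id)
    for dataset in datasets:
        for command in build_commands(dataset):
            subprocess.run(command, check=True, env=env)
```

`run_preprocessing()` must be called from the repository root so that both scripts are found.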
The resulting file structure is as follows.
├── data
│   ├── Colonoscopic
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Kvasir-Capsule
│   │   ├── 00001.mp4
│   │   ├── ...
│   ├── Colonoscopic_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
│   ├── Kvasir-Capsule_frames
│   │   ├── train_128_list.txt
│   │   ├── 00001
│   │   │   ├── 00000.jpg
│   │   │   ├── ...
│   │   ├── ...
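To check that preprocessing produced the layout shown above, a small script can count the extracted frames per video folder. This is only a sketch: `summarize_frames` is a hypothetical helper, and it assumes frames are stored as `.jpg` files directly inside each numbered folder, as in the tree.

```python
from pathlib import Path


def summarize_frames(frames_root):
    """Return {video_folder_name: number_of_jpg_frames} for a *_frames directory."""
    root = Path(frames_root)
    counts = {}
    # Only sub-directories are video folders; files such as train_128_list.txt are skipped.
    for video_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[video_dir.name] = sum(1 for _ in video_dir.glob("*.jpg"))
    return counts
```

For example, `summarize_frames("./data/Colonoscopic_frames")` returns one entry per video; a folder with a count of zero suggests that frame extraction failed for that video.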
⏳Training
You can follow the steps below to train FEAT:
bash train_scripts/col/train_col.sh
bash train_scripts/kva/train_kva.sh
🎇Sampling
You can sample medical videos directly from a model checkpoint. Here is a quick-start example using our pre-trained models:
- Download the pre-trained weights from here and place them at the path specified in the configs. You can also use huggingface_hub to download the weights. For example, a checkpoint can be downloaded like so:
from huggingface_hub import hf_hub_download
# 4 models supported: FEAT_L_col.pt, FEAT_L_kva.pt, FEAT_S_col.pt and FEAT_S_kva.pt
filepath = hf_hub_download(repo_id="WTHH031230/FEAT", filename="FEAT_L_col.pt")
- Run sample.py via the following scripts, where you can customize arguments such as the number of sampling steps.
You can follow the steps below to sample a video using FEAT:
bash sample/col.sh
bash sample/kva.sh
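If you need all four checkpoints, the single-file download shown earlier can be wrapped in a loop. This is a minimal sketch: `download_checkpoints` is a hypothetical helper, the file names and `repo_id` are the ones listed above, and files land in the default Hugging Face cache, so you may still need to copy them to the path your config expects.

```python
CHECKPOINTS = ("FEAT_L_col.pt", "FEAT_L_kva.pt", "FEAT_S_col.pt", "FEAT_S_kva.pt")


def download_checkpoints(filenames=CHECKPOINTS, repo_id="WTHH031230/FEAT", fetch=None):
    """Download each checkpoint and return {filename: local_path}."""
    if fetch is None:
        # Imported lazily so the helper can be dry-run without the library installed.
        from huggingface_hub import hf_hub_download
        fetch = hf_hub_download
    return {name: fetch(repo_id=repo_id, filename=name) for name in filenames}
```

Passing a custom `fetch` callable (for instance one that only prints its arguments) gives a dry run without touching the network.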
DDP sampling:
bash sample/col_ddp.sh
bash sample/kva_ddp.sh
DDP sampling generates more than 3125 videos, which are then used to calculate the metrics.
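One plausible reason the count exceeds 3125 is that multi-GPU samplers often round the target up to a whole number of global batches. The sketch below illustrates that rounding only; whether the FEAT scripts do exactly this is an assumption, and `planned_samples` and its parameters are hypothetical names.

```python
import math


def planned_samples(target, per_gpu_batch, world_size):
    """Smallest multiple of the global batch size that is at least `target`."""
    global_batch = per_gpu_batch * world_size
    return math.ceil(target / global_batch) * global_batch


# With 4 GPUs and 2 videos per GPU per step, a target of 3125 rounds up to 3128.
print(planned_samples(3125, per_gpu_batch=2, world_size=4))  # → 3128
```

If the metrics are meant to use exactly 3125 videos, any surplus would need to be trimmed before evaluation.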
📏Evaluation
The metrics we calculated on the Colonoscopic dataset are shown below:
| Method | FVD↓ | CD-FVD↓ | FID↓ | IS↑ |
|---|---|---|---|---|
| StyleGAN-V | 2110.7 | 1032.8 | 226.14 | 2.12 |
| LVDM | 1036.7 | 792.9 | 96.85 | 1.93 |
| MoStGAN-V | 468.5 | 592.0 | 53.17 | 3.37 |
| Endora | 460.7 | 545.3 | 13.41 | 3.90 |
| FEAT-S (Ours) | 415.4 | 444.0 | 13.34 | 3.96 |
| FEAT-L (Ours) | 351.1 | 397.0 | 12.31 | 4.01 |
Before calculating the metrics with our code, you may need the weights of several models, which can be downloaded from the following links:
- Inception v3 for calculating FID and IS.
- I3D for calculating FVD.
- VideoMAE for calculating CD-FVD.
You can also simply follow this part of the code in Endora to automatically download models from the internet for metric calculation.
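For reference, FID and FVD are conventionally defined as the Fréchet distance between two Gaussians fitted to feature embeddings of real and generated data (Inception v3 features for FID, I3D features for FVD), and CD-FVD is understood to use the same form with VideoMAE features. Writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the feature means and covariances of the real and generated sets, the standard expression is given below. The symbols are introduced here for illustration, and the exact implementation is the one in the evaluation code, not this formula.

```latex
d^2\big((\mu_r, \Sigma_r), (\mu_g, \Sigma_g)\big)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated feature distribution is closer to the real one, which is why these columns are marked with ↓ in the table.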
To calculate the metrics, follow the steps below.
## FVD, FID and IS
CUDA_VISIBLE_DEVICES=gpu_id python process_data.py -s /path/to/generated/video -t /path/to/video/frames
cd /path/to/stylegan-v
CUDA_VISIBLE_DEVICES=gpu_id python ./src/scripts/calc_metrics_for_dataset.py \
--fake_data_path /path/to/video/frames \
--real_data_path /path/to/dataset/frames
## CD-FVD
CUDA_VISIBLE_DEVICES=gpu_id python calculate_cdfvd.py
🧰Running Other Methods
Since we follow Endora, you can run the other methods in the same way as described in Endora.
🎪Downstream Application
Since we follow Endora, you can run the downstream task in the same way as described in Endora.
| Method | Colonoscopic |
|---|---|
| Supervised-only | 74.5 |
| LVDM | 76.2 |
| Endora | 87.0 |
| FEAT-S (Ours) | 89.9 |
| FEAT-L (Ours) | 91.3 |
🎈Acknowledgements
We greatly appreciate the tremendous effort behind the following projects!
📜Citation
If you find FEAT useful in your research, please consider citing:
@article{wang2025feat,
author = {Huihan Wang and Zhiwen Yang and Hui Zhang and Dan Zhao and Bingzheng Wei and Yan Xu},
title = {FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation},
journal = {arXiv preprint arXiv:2506.04956},
year = {2025}
}