Medical AI for Synthetic Imaging (MAISI) Data Preparation

September 7, 2024

Disclaimer: We are not the hosts of the data. Please make sure to read the requirements and usage policies of the data and give credit to the authors of the datasets!

1 VAE training Data

For the released foundation autoencoder model weights in MAISI, we used 37243 CT volumes for training and 1963 CT volumes for validation, covering the chest, abdomen, and head and neck regions, and 17887 MRI volumes for training and 940 MRI volumes for validation, covering the brain, skull-stripped brain, chest, and below-abdomen regions. The training data come from TCIA Covid 19 Chest CT, TCIA Colon Abdomen CT, MSD03 Liver Abdomen CT, LIDC chest CT, TCIA Stony Brook Covid Chest CT, NLST Chest CT, TCIA Upenn GBM Brain MR, Aomic Brain MR, QTIM Brain MR, TCIA Acrin Chest MR, and TCIA Prostate MR Below-Abdomen MR.

In total, we included:

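| Index | Dataset Name | Number of Training Data | Number of Validation Data |
|-------|--------------|-------------------------|---------------------------|
| 1 | Covid 19 Chest CT | 722 | 49 |
| 2 | TCIA Colon Abdomen CT | 1522 | 77 |
| 3 | MSD03 Liver Abdomen CT | 104 | 0 |
| 4 | LIDC chest CT | 450 | 24 |
| 5 | TCIA Stony Brook Covid Chest CT | 2644 | 139 |
| 6 | NLST Chest CT | 31801 | 1674 |
| 7 | TCIA Upenn GBM Brain MR (skull-stripped) | 2550 | 134 |
| 8 | Aomic Brain MR | 2630 | 138 |
| 9 | QTIM Brain MR | 1275 | 67 |
| 10 | Acrin Chest MR | 6599 | 347 |
| 11 | TCIA Prostate MR Below-Abdomen MR | 928 | 49 |
| 12 | Aomic Brain MR, skull-stripped | 2630 | 138 |
| 13 | QTIM Brain MR, skull-stripped | 1275 | 67 |
|  | Total CT | 37243 | 1963 |
|  | Total MRI | 17887 | 940 |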
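
As a quick consistency check, the per-dataset counts can be summed and compared with the stated totals. The sketch below copies the numbers from the table above; the dictionary and function names are illustrative only and are not part of MAISI.

```python
# (training, validation) counts copied from the table above.
ct = {
    "Covid 19 Chest CT": (722, 49),
    "TCIA Colon Abdomen CT": (1522, 77),
    "MSD03 Liver Abdomen CT": (104, 0),
    "LIDC chest CT": (450, 24),
    "TCIA Stony Brook Covid Chest CT": (2644, 139),
    "NLST Chest CT": (31801, 1674),
}
mri = {
    "TCIA Upenn GBM Brain MR (skull-stripped)": (2550, 134),
    "Aomic Brain MR": (2630, 138),
    "QTIM Brain MR": (1275, 67),
    "Acrin Chest MR": (6599, 347),
    "TCIA Prostate MR Below-Abdomen MR": (928, 49),
    "Aomic Brain MR, skull-stripped": (2630, 138),
    "QTIM Brain MR, skull-stripped": (1275, 67),
}


def totals(rows):
    """Return (total training, total validation) for a group of datasets."""
    train = sum(t for t, _ in rows.values())
    val = sum(v for _, v in rows.values())
    return train, val


print(totals(ct))   # (37243, 1963)
print(totals(mri))  # (17887, 940)
```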

2 Diffusion model training Data

The training dataset for the Diffusion model used in MAISI comprises 10,277 CT volumes from 24 distinct datasets, encompassing various body regions and disease patterns.

The table below provides a summary of the number of volumes for each dataset.

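| Index | Dataset name | Number of volumes |
|-------|--------------|-------------------|
| 1 | AbdomenCT-1K | 789 |
| 2 | AeroPath | 15 |
| 3 | AMOS22 | 240 |
| 4 | autoPET23 | 200 |
| 5 | Bone-Lesion | 223 |
| 6 | BTCV | 48 |
| 7 | COVID-19 | 524 |
| 8 | CRLM-CT | 158 |
| 9 | CT-ORG | 94 |
| 10 | CTPelvic1K-CLINIC | 94 |
| 11 | LIDC | 422 |
| 12 | MSD Task03 | 88 |
| 13 | MSD Task06 | 50 |
| 14 | MSD Task07 | 224 |
| 15 | MSD Task08 | 235 |
| 16 | MSD Task09 | 33 |
| 17 | MSD Task10 | 87 |
| 18 | Multi-organ-Abdominal-CT | 65 |
| 19 | NLST | 3109 |
| 20 | Pancreas-CT | 51 |
| 21 | StonyBrook-CT | 1258 |
| 22 | TCIA_Colon | 1437 |
| 23 | TotalSegmentatorV2 | 654 |
| 24 | VerSe | 179 |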

3 ControlNet model training Data

3.1 Example preprocessed dataset

We provide a preprocessed subset of the C4KC-KiTS dataset, which is used in the finetuning config environment_maisi_controlnet_train.json. The dataset and the corresponding JSON data list can be downloaded and should be saved in the maisi/dataset/ folder.

The structure of an example folder in the preprocessed dataset is:

            |-*arterial*.nii.gz               # original image
            |-*arterial_emb*.nii.gz           # encoded image embedding
KiTS-000* --|-mask*.nii.gz                    # original labels
            |-mask_pseudo_label*.nii.gz       # pseudo labels
            |-mask_combined_label*.nii.gz     # combined mask of original and pseudo labels
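
The file patterns in this layout overlap: `*arterial*` also matches the embedding file, and `mask*` also matches the pseudo and combined masks. The sketch below shows one way to sort the files of a case folder into roles by testing the more specific patterns first. The function name and the role keys are hypothetical and are not part of the MAISI code.

```python
from pathlib import Path


def collect_case_files(case_dir):
    """Map each role in the folder layout to its file, or None if absent."""
    roles = {
        "image": None,
        "embedding": None,
        "label": None,
        "pseudo_label": None,
        "combined_label": None,
    }
    for path in sorted(Path(case_dir).glob("*.nii.gz")):
        name = path.name
        # Test the specific patterns first, because "*arterial*" also matches
        # the embedding and "mask*" also matches the derived masks.
        if name.startswith("mask_combined_label"):
            roles["combined_label"] = path
        elif name.startswith("mask_pseudo_label"):
            roles["pseudo_label"] = path
        elif name.startswith("mask"):
            roles["label"] = path
        elif "arterial_emb" in name:
            roles["embedding"] = path
        elif "arterial" in name:
            roles["image"] = path
    return roles
```

For ControlNet training, only the `embedding` and `combined_label` entries are paired in the JSON data list.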

An example combined mask of original and pseudo labels is shown below:

[Figure: example_combined_mask]

Please note that the Kidney Tumor label is mapped to index 129 in this preprocessed dataset. The encoded image embeddings are generated during preprocessing by the provided autoencoder, ./models/autoencoder_epoch273.pt, to reduce memory usage during training. The pseudo labels are generated by VISTA 3D. In addition, each dimension of every volume and its corresponding pseudo label is resampled to the closest multiple of 128 (e.g., 128, 256, 384, 512, ...).

The training workflow requires one JSON file that specifies the image embedding and segmentation pairs. The example file is located at maisi/dataset/C4KC-KiTS_subset.json.
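
The rounding to the closest multiple of 128 can be sketched as follows. This is a minimal illustration, not the actual MAISI preprocessing code: the function names are hypothetical, and both the tie-breaking rule (exact ties round up) and the lower bound of 128 are assumptions.

```python
def closest_multiple(size, base=128):
    """Round a positive voxel count to the nearest multiple of `base`."""
    # Integer arithmetic avoids float rounding. Exact ties round up, and the
    # result is never smaller than `base`; both choices are assumptions.
    return max(base, (size + base // 2) // base * base)


def target_dims(dims, base=128):
    """Apply the rounding to every axis of a volume shape."""
    return [closest_multiple(d, base) for d in dims]


print(target_dims([300, 333, 512]))  # [256, 384, 512]
```

The volume and its pseudo label would then be resampled to the resulting shape, with the voxel spacing adjusted accordingly.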

The JSON file has the following structure:

{
    "training": [
        {
            "image": "*/*arterial_emb*.nii.gz",  # relative path to the image embedding file
            "label": "*/mask_combined_label*.nii.gz",  # relative path to the combined label file
            "dim": [512, 512, 512],  # the dimension of image
            "spacing": [1.0, 1.0, 1.0],  # the spacing of image
            "top_region_index": [0, 1, 0, 0],  # the top region index of the image
            "bottom_region_index": [0, 0, 0, 1],  # the bottom region index of the image
            "fold": 0  # fold index for cross validation, fold 0 is used for training
        },

        ...
    ]
}
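
Before training, it can help to check that every entry in the data list has the expected fields. The sketch below is a hypothetical helper, not part of MAISI. It assumes that `dim` holds three multiples of 128 (following the resampling note above) and that the region indices are one-hot vectors of length 4, as in the example entry. Note that the `#` comments above are for illustration only; a file parsed with `json.load` must not contain them.

```python
import json

REQUIRED_KEYS = {
    "image", "label", "dim", "spacing",
    "top_region_index", "bottom_region_index", "fold",
}


def check_entry(entry):
    """Return a list of problems found in one "training" entry."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    dim = entry["dim"]
    if len(dim) != 3 or any(d <= 0 or d % 128 for d in dim):
        problems.append("dim should hold three positive multiples of 128")
    spacing = entry["spacing"]
    if len(spacing) != 3 or any(s <= 0 for s in spacing):
        problems.append("spacing should hold three positive values")
    for key in ("top_region_index", "bottom_region_index"):
        # Assumption: one-hot vector of length 4, as in the example entry.
        if sorted(entry[key]) != [0, 0, 0, 1]:
            problems.append(f"{key} should be a one-hot vector of length 4")
    return problems


def check_data_list(path):
    """Map the index of each faulty entry to its list of problems."""
    with open(path) as f:
        data = json.load(f)
    report = {}
    for i, entry in enumerate(data["training"]):
        problems = check_entry(entry)
        if problems:
            report[i] = problems
    return report
```

An empty result from `check_data_list("maisi/dataset/C4KC-KiTS_subset.json")` would mean that no entry violates these assumed constraints.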

3.2 ControlNet full training datasets

The ControlNet training dataset used in MAISI contains 6330 CT volumes (5058 and 1272 volumes are used for training and validation, respectively) across 20 datasets and covers different body regions and diseases.

The table below summarizes the number of volumes for each dataset.

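| Index | Dataset name | Number of volumes |
|-------|--------------|-------------------|
| 1 | AbdomenCT-1K | 789 |
| 2 | AeroPath | 15 |
| 3 | AMOS22 | 240 |
| 4 | Bone-Lesion | 237 |
| 5 | BTCV | 48 |
| 6 | CT-ORG | 94 |
| 7 | CTPelvic1K-CLINIC | 94 |
| 8 | LIDC | 422 |
| 9 | MSD Task03 | 105 |
| 10 | MSD Task06 | 50 |
| 11 | MSD Task07 | 225 |
| 12 | MSD Task08 | 235 |
| 13 | MSD Task09 | 33 |
| 14 | MSD Task10 | 101 |
| 15 | Multi-organ-Abdominal-CT | 64 |
| 16 | Pancreas-CT | 51 |
| 17 | StonyBrook-CT | 1258 |
| 18 | TCIA_Colon | 1436 |
| 19 | TotalSegmentatorV2 | 654 |
| 20 | VerSe | 179 |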

4 Questions and bugs

  • For questions relating to the use of MONAI, please use our Discussions tab on the main repository of MONAI.
  • For bugs relating to MONAI functionality, please create an issue on the main repository.
  • For bugs relating to the running of a tutorial, please create an issue in this repository.
