
December 21, 2025

ITA-MDT:
Image-Timestep-Adaptive Masked Diffusion Transformer Framework
for Image-Based Virtual Try-On

Ji Woo Hong, Tri Ton, Pham X. Trung, Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo
Korea Advanced Institute of Science and Technology (KAIST)


Requirements

git clone https://github.com/jiwoohong93/ita-mdt_code.git
cd ita-mdt_code

bash environment.sh
conda activate ITA-MDT

The above commands will create and activate the conda environment with all core dependencies for ITA-MDT.

(Optional) We recommend using Adan and xFormers for more efficient training and generation.

Pre-trained Models Required

Two pre-trained components are required and will be automatically downloaded on the first run of training or generation:

DINOv2 — Vision Transformer backbone for garment feature extraction.
Stable Diffusion VAE — Variational Autoencoder for image encoding/decoding in latent space.

Once downloaded, they will be cached locally for subsequent runs.

Dataset Preparation

Download VITON-HD from HERE

Download DressCode from HERE

Place both datasets inside the DATA/ folder:

DATA/
  ├── zalando-hd-resized/
  ├── DressCode/

1. Additional Images Required for DressCode

To generate the agnostic images and their corresponding masks, we follow the dataset preparation procedure of CAT-DM. For DensePose, we use the new part-based color-map DensePose images provided by IDM-VTON to stay consistent with the VITON-HD dataset.

These images are required for proper training and generation. They can all be downloaded from HERE.

After downloading, place each garment category’s folder and its images into the corresponding directory of the original DressCode dataset.

2. Pre-process SRE (Salient Region Extraction)

This step runs SRE ahead of time and saves the salient region images, so that training and generation are faster and more efficient.

Run the following command:

python preprocess_salient_region_extraction.py --path_to_datasets ./DATA

--path_to_datasets should point to the folder containing zalando-hd-resized and DressCode directories.
This script will process both datasets and save salient region images into the cloth_sr folder for each category.

Alternatively, you can download the pre-processed salient region images from HERE.

After downloading, place each garment category’s folder and its images into the corresponding directory.

Expected Data Structure

zalando-hd-resized/
  ├── test/
  │   ├── agnostic-mask
  │   ├── agnostic-v3.2
  │   ├── cloth
  │   ├── cloth_sr
  │   ├── image
  │   └── image-densepose
  ├── train/
  │   ├── agnostic-mask
  │   ├── agnostic-v3.2
  │   ├── cloth
  │   ├── cloth_sr
  │   ├── image
  │   └── image-densepose
  ├── test_pairs.txt
  └── train_pairs.txt

DressCode/
  ├── dresses/
  │   ├── agnostic
  │   ├── cloth_sr
  │   ├── image-densepose
  │   ├── images
  │   ├── mask
  │   ├── test_pairs_paired.txt
  │   ├── test_pairs_unpaired.txt
  │   └── train_pairs.txt
  ├── lower_body/
  │   ├── agnostic
  │   ├── cloth_sr
  │   ├── image-densepose
  │   ├── images
  │   ├── mask
  │   ├── test_pairs_paired.txt
  │   ├── test_pairs_unpaired.txt
  │   └── train_pairs.txt
  └── upper_body/
      ├── agnostic
      ├── cloth_sr
      ├── image-densepose
      ├── images
      ├── mask
      ├── test_pairs_paired.txt
      ├── test_pairs_unpaired.txt
      └── train_pairs.txt
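
Before training, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper names `expected_paths` and `missing_paths` are hypothetical, and the entries are simply copied from the two trees above, assuming both datasets sit under `./DATA`.

```python
from pathlib import Path

# Entries copied from the directory trees above.
VITONHD_SPLIT_DIRS = [
    "agnostic-mask", "agnostic-v3.2", "cloth",
    "cloth_sr", "image", "image-densepose",
]
VITONHD_FILES = ["test_pairs.txt", "train_pairs.txt"]
DRESSCODE_CATEGORIES = ["dresses", "lower_body", "upper_body"]
DRESSCODE_DIRS = ["agnostic", "cloth_sr", "image-densepose", "images", "mask"]
DRESSCODE_FILES = [
    "test_pairs_paired.txt", "test_pairs_unpaired.txt", "train_pairs.txt",
]


def expected_paths(data_root):
    """Return every path listed in the trees above, under data_root."""
    root = Path(data_root)
    viton = root / "zalando-hd-resized"
    paths = [
        viton / split / name
        for split in ("test", "train")
        for name in VITONHD_SPLIT_DIRS
    ]
    paths += [viton / name for name in VITONHD_FILES]
    for category in DRESSCODE_CATEGORIES:
        base = root / "DressCode" / category
        paths += [base / name for name in DRESSCODE_DIRS + DRESSCODE_FILES]
    return paths


def missing_paths(data_root):
    """List the expected paths that do not exist under data_root."""
    return [path for path in expected_paths(data_root) if not path.exists()]


if __name__ == "__main__":
    # Hypothetical usage: report anything missing under ./DATA.
    for path in missing_paths("./DATA"):
        print("missing:", path)
```

The check only looks at folder and file names. It does not verify that the images inside each folder are complete.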

Training

Run:

bash train.sh

Variables to Edit in `train.sh`

export CUDA_VISIBLE_DEVICES= → GPU IDs to use for training (comma-separated).
NUM_GPUS= → Number of GPUs to use.
export OPENAI_LOGDIR= → Directory to save training logs and checkpoints.
LR= → Learning rate.
BATCH_SIZE= → Batch size.
SAVE_INTERVAL= → Save model checkpoint every this many steps.
MASTER_PORT= → Port used for inter-process communication in distributed training (change if conflict occurs).
(Optional) --resume_checkpoint → Uncomment and set a path if resuming from a saved checkpoint.
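
As a quick sanity check on the first two variables, the sketch below verifies that `NUM_GPUS` does not ask for more devices than `CUDA_VISIBLE_DEVICES` exposes. The function names are hypothetical and this is not how `train.sh` itself validates its settings; it only illustrates the relationship between the two values.

```python
def parse_gpu_ids(cuda_visible_devices):
    """Split a comma-separated CUDA_VISIBLE_DEVICES value into a list of GPU IDs."""
    return [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]


def check_gpu_settings(cuda_visible_devices, num_gpus):
    """Raise ValueError if NUM_GPUS asks for more GPUs than are visible."""
    gpu_ids = parse_gpu_ids(cuda_visible_devices)
    if num_gpus < 1:
        raise ValueError("NUM_GPUS must be at least 1")
    if num_gpus > len(gpu_ids):
        raise ValueError(
            f"NUM_GPUS={num_gpus} exceeds the {len(gpu_ids)} GPU ID(s) "
            f"listed in CUDA_VISIBLE_DEVICES={cuda_visible_devices!r}"
        )
    return gpu_ids
```

For example, `check_gpu_settings("0,1,2,3", 4)` passes, while `check_gpu_settings("0,1", 4)` raises a `ValueError`.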

Generation

You can download the checkpoint of our ITA-MDT from HERE.

[2025-10-08] Reuploaded with the correct model weights.

VITON-HD

Run:

bash generate_vitonhd.sh

DressCode

Run:

bash generate_dc.sh

Variables to Edit in Generation Scripts

Common for both generate_vitonhd.sh and generate_dc.sh:

export CUDA_VISIBLE_DEVICES= → GPU ID to use for generation.
OUTPUT_DIR= → Path where generated images will be saved.
MODEL_PATH= → Path to the trained weights (EMA).
BATCH_SIZE= → Images generated per batch.
NUM_SAMPLING_STEPS= → Diffusion sampling steps.
UNPAIR=false → Whether to use unpaired garment-person combinations.

For generate_dc.sh:

SUBDATA= → Category of DressCode dataset (dresses, upper_body, or lower_body).
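
To make the interaction between `UNPAIR` and `SUBDATA` concrete, the sketch below maps a configuration to the test-pairs file it would most plausibly read, based only on the file names in the Expected Data Structure section. This mapping is an assumption, not taken from the generation scripts. `resolve_pairs_file` is a hypothetical helper, and the dataset names `vitonhd` and `dresscode` are borrowed from the `--dataset` option of `eval.sh`.

```python
from pathlib import Path

DRESSCODE_CATEGORIES = ("dresses", "upper_body", "lower_body")


def resolve_pairs_file(data_root, dataset, unpair=False, subdata=None):
    """Guess which test-pairs file a generation run would read (assumption)."""
    root = Path(data_root)
    if dataset == "vitonhd":
        # VITON-HD lists a single test_pairs.txt in the expected structure.
        return root / "zalando-hd-resized" / "test_pairs.txt"
    if dataset == "dresscode":
        if subdata not in DRESSCODE_CATEGORIES:
            raise ValueError(
                f"SUBDATA must be one of {DRESSCODE_CATEGORIES}, got {subdata!r}"
            )
        # DressCode lists separate paired and unpaired test files per category.
        name = "test_pairs_unpaired.txt" if unpair else "test_pairs_paired.txt"
        return root / "DressCode" / subdata / name
    raise ValueError(f"unknown dataset: {dataset!r}")
```

Under this assumption, setting `UNPAIR=true` with `SUBDATA=dresses` would correspond to `DATA/DressCode/dresses/test_pairs_unpaired.txt`.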

Evaluation

The evaluation code is adapted from LaDI-VTON. Please refer to the original repository for the environment required to run the evaluation.

Run:

bash eval.sh

Variables to Edit in `eval.sh`

CUDA_VISIBLE_DEVICES= → GPU ID to use for evaluation.
--batch_size= → Batch size for evaluation.
--gen_folder= → Path to generated images to be evaluated.
--dataset= → Dataset to evaluate on (vitonhd or dresscode).
--test_order= → Paired/unpaired evaluation (paired or unpaired). For unpaired, only FID is valid.
--category= → Category for DressCode dataset (upper_body, lower_body, dresses).
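
The options above constrain each other, so a small pre-flight check can catch inconsistent settings before a long evaluation run. The following is a minimal sketch with hypothetical function names. It only encodes the allowed values listed above and is not part of the evaluation code.

```python
DATASETS = ("vitonhd", "dresscode")
TEST_ORDERS = ("paired", "unpaired")
CATEGORIES = ("upper_body", "lower_body", "dresses")


def validate_eval_options(dataset, test_order, category=None):
    """Return a list of problems; an empty list means the options look consistent."""
    problems = []
    if dataset not in DATASETS:
        problems.append(f"--dataset must be one of {DATASETS}, got {dataset!r}")
    if test_order not in TEST_ORDERS:
        problems.append(f"--test_order must be one of {TEST_ORDERS}, got {test_order!r}")
    # --category is only meaningful for DressCode.
    if dataset == "dresscode" and category not in CATEGORIES:
        problems.append(f"--category must be one of {CATEGORIES}, got {category!r}")
    return problems


def fid_only(test_order):
    """True when only FID should be read from the results (unpaired evaluation)."""
    return test_order == "unpaired"
```

For instance, `validate_eval_options("dresscode", "paired")` reports a missing category, and `fid_only("unpaired")` is a reminder to disregard the other metrics in that setting.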

Citation

If you find our work useful, please consider citing it.

@article{hong2025ita,
  title={ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On},
  author={Hong, Ji Woo and Ton, Tri and Pham, Trung X and Koo, Gwanhyeong and Yoon, Sunjae and Yoo, Chang D},
  journal={arXiv preprint arXiv:2503.20418},
  year={2025}
}

License

The code in this repository is released under the CC BY-NC-SA 4.0 license.

Acknowledgement

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT)
(No. RS-2021-II211381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments),
and partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT)
(No. RS-2022-II220184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).
