MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

July 6, 2025

Paper [PDF] | Dataset [Google Drive] | Code [GitHub]

We propose MM-Skin, a large-scale multimodal dermatology dataset that covers three imaging modalities: clinical, dermoscopic, and pathological. It contains nearly 10k high-quality image-text pairs collected from professional textbooks and over 27k visual question answering (VQA) samples.

In addition, we developed SkinVL, a dermatology-specific VLM, and conducted comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks.

Code and model weights are coming soon.

Quick Start

1. Environment

First, clone the repo and cd into the directory:

git clone https://github.com/ZwQ803/MM-Skin.git
cd MM-Skin

Then create a conda env and install the dependencies:

conda create -n mmskin python=3.10 -y
conda activate mmskin
pip install -r requirements.txt

2. Download MM-SkinVL Pre-trained Weights

| Model Name | Link |
| --- | --- |
| SkinVL-MM | Link |
| SkinVL-Pub | Link |
| SkinVL-PubMM | Link |

Download Pre-training Datasets

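| Dataset | Modality | Link |
| --- | --- | --- |
| SCIN | Clinical | Link |
| DDI | Clinical | Link |
| Fitzpatrick17k | Clinical | Link |
| PAD | Clinical | Link |
| Dermnet | Clinical | Link |
| HAM10000 | Dermoscopy | Link |
| ISIC2019 | Dermoscopy | Link |
| BCN20000 | Dermoscopy | Link |
| HIBA | Dermoscopy | Link |
| MSKCC | Dermoscopy | Link |
| Patch16 | Pathology | Link |
| MM-Skin | Clinical, Dermoscopy, Pathology | Link |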

Training

To train the model using LoRA, run finetune_lora.sh with pre-trained LLaVA-Med weights (available here).
Update LLAVA_MED_WEIGHT_PATH in the script to your local path, and replace PRETRAIN_DATAFRAME with the processed JSON training file.
We provide training JSONs for SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM at: /Dataframe/Pretrain.
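
Before launching a long training run, it can help to sanity-check the training JSON. The sketch below assumes the file follows the LLaVA-style conversation format (a list of records with `image` and `conversations` fields, where turns alternate between `human` and `gpt`). This README does not confirm that layout, and the function names are hypothetical, so adjust the keys to match the actual files in /Dataframe/Pretrain.

```python
import json
from pathlib import Path


def check_llava_records(records):
    """Return a list of problems found in LLaVA-style conversation records.

    Assumed (not confirmed by the README): each record is a dict with a
    'conversations' list whose turns alternate human -> gpt.
    """
    problems = []
    for i, rec in enumerate(records):
        convs = rec.get("conversations")
        if not isinstance(convs, list) or not convs:
            problems.append(f"record {i}: missing or empty 'conversations'")
            continue
        # Turns are expected to alternate, starting with the human turn.
        for j, turn in enumerate(convs):
            expected = "human" if j % 2 == 0 else "gpt"
            if turn.get("from") != expected:
                problems.append(f"record {i}, turn {j}: expected '{expected}'")
        # Image records should reference the image in the first turn.
        if "image" in rec and "<image>" not in str(convs[0].get("value", "")):
            problems.append(f"record {i}: first turn lacks '<image>' token")
    return problems


def check_file(path):
    """Load a JSON file and run the record checks on its contents."""
    with Path(path).open(encoding="utf-8") as f:
        return check_llava_records(json.load(f))
```

An empty list from `check_file` means no structural problems were detected under these assumptions; it says nothing about the quality of the text itself.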

After training, merge the LoRA weights with the base model:

python merge_lora_weights.py \
    --model-path /path/to/lora_model \
    --model-base /path/to/base_model/llava-med-v1.5-mistral-7b \
    --save-model-path /path/to/merge_model
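
Conceptually, merging folds each low-rank LoRA update into the corresponding base weight matrix, so the merged model needs no adapter at inference time. The sketch below shows the standard LoRA formulation, W' = W + (alpha / r) · B·A, on plain nested lists. It illustrates the idea only; whether merge_lora_weights.py applies exactly this scaling is an assumption, and `merge_lora` is a hypothetical name.

```python
def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A) for plain nested-list matrices.

    Shapes: W is d_out x d_in, B is d_out x r, A is r x d_in.
    """
    scale = alpha / r
    return [
        [
            W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
            for j in range(len(W[0]))
        ]
        for i in range(len(W))
    ]


# Tiny worked example: a 2x2 identity base weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
merged = merge_lora(W, A, B, alpha=2, r=1)
```

Because the update is added directly to the weights, the merged checkpoint has the same shape and inference cost as the base model.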

You can also directly use our provided merged models by placing them in the /merge directory.

Evaluation

1. VQA Evaluation

To evaluate SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM, run:

python VQA_test.py --model-path MERGED_SKINVL_MODEL

Replace the caption file and image folder paths in the script with the paths to your own dataset. We provide preprocessed MM-Skin test data in /Dataframe/test/VQA, which can be used directly for evaluation.
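
A common failure when pointing the script at a new dataset is a caption file that references images missing from the image folder. The sketch below lists such files before evaluation starts. It assumes each caption record stores its filename under an `image` key, which this README does not confirm; `missing_images` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_images(records, image_dir, key="image"):
    """List image filenames referenced by the records but absent from image_dir.

    The 'image' key is an assumption about the caption file layout.
    """
    root = Path(image_dir)
    return sorted(
        {r[key] for r in records if key in r and not (root / r[key]).is_file()}
    )
```

To use it, load the caption JSON with `json.load` and pass the resulting list together with the image folder path; an empty result means every referenced image was found.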

2. Supervised Fine-Tuning (SFT) Classification

Run SFT_classify_test.sh for supervised classification. Replace all paths in the script with your local paths. Preprocessed data for reproducing our results can be found in /Dataframe/test/classification.

3. Zero-Shot Classification

Run ZS_classify_test.sh to perform zero-shot classification.
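
For readers unfamiliar with zero-shot classification in a generative VLM, one common pattern is to list the candidate classes in the prompt and then map the model's free-text answer back to a label. The sketch below illustrates that pattern only. The prompt wording, the `<image>` placeholder, and both function names are assumptions for illustration and do not describe what ZS_classify_test.sh actually does.

```python
def build_prompt(class_names):
    """Build a multiple-choice question listing the candidate classes."""
    options = ", ".join(class_names)
    return (
        "<image>\nWhich of the following skin conditions is shown? "
        f"Options: {options}. Answer with one option."
    )


def parse_label(answer, class_names):
    """Map a free-text answer to a class name, or None if nothing matches.

    When several class names occur in the answer, the longest one wins,
    so 'basal cell carcinoma' is preferred over 'carcinoma'.
    """
    text = answer.lower()
    hits = [c for c in class_names if c.lower() in text]
    return max(hits, key=len) if hits else None
```

Answers that match no class return `None`, which lets the caller count them as errors rather than silently assigning a label.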

Data Collection and Statistics

MM-Skin draws on 15 professional dermatology textbooks.

MM-Skin contains 11,039 dermatology images with expert descriptions across three modalities. It provides three subsets:

- MM-Skin-C (Captions)

- MM-Skin-O (Open-ended VQA)

- MM-Skin-D (Demographics)

Data Collection Process

  1. Image-Text Extraction: Extract images and text from 15 dermatology textbooks using OCR and the Adobe API.

  2. Alignment: Match images with captions.

  3. Modality Classification: Feature-based classification (color, texture) with manual verification.

  4. Text Cleaning: Extract age and gender info.

  5. Filtering: Remove sensitive or annotated images.
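
As an illustration of the text-cleaning step, age and gender can often be pulled from a caption with simple pattern matching. The sketch below is a minimal example under assumed caption phrasing (such as "45-year-old woman"); the patterns and the `extract_demographics` name are hypothetical and are not the authors' actual cleaning rules.

```python
import re

# Matches phrases such as "45-year-old" or "7 year old".
AGE_RE = re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.IGNORECASE)

# Map surface words to a normalized gender label.
GENDER_WORDS = {
    "male": "male", "man": "male", "boy": "male",
    "female": "female", "woman": "female", "girl": "female",
}
GENDER_RE = re.compile(r"\b(" + "|".join(GENDER_WORDS) + r")\b", re.IGNORECASE)


def extract_demographics(caption):
    """Return {'age': int or None, 'gender': str or None} for a caption."""
    age = AGE_RE.search(caption)
    gender = GENDER_RE.search(caption)
    return {
        "age": int(age.group(1)) if age else None,
        "gender": GENDER_WORDS[gender.group(1).lower()] if gender else None,
    }
```

Captions that mention neither field yield `None` for both, so missing demographics stay distinguishable from extracted values.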

Citation

If you find our work helpful, please consider citing our paper.

@article{zeng2025mm,
  title={MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks},
  author={Zeng, Wenqi and Sun, Yuqi and Ma, Chenxi and Tan, Weimin and Yan, Bo},
  journal={arXiv preprint arXiv:2505.06152},
  year={2025}
}