MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks

July 6, 2025 · View on GitHub

Paper[PDF] Dataset[Google Drive] Code[Github]

we propose MM-Skin, a large-scale multimodal dermatology dataset that encompasses 3 imaging modalities, including clinical, dermoscopic, and pathological and nearly 10k high-quality image-text pairs collected from professional textbooks and over 27k vision question answering (VQA) samples.

In addition, we developed SkinVL, a dermatology-specific VLM, and conducted comprehensive benchmark evaluations of SkinVL on VQA, supervised fine-tuning (SFT), and zero-shot classification tasks.

Code and model weights are coming soon.

Quick Start

1、Environment

First, clone the repo and cd into the directory:

git clone https://github.com/ZwQ803/MM-Skin.git
cd MM-Skin

Then create a conda env and install the dependencies:

conda create -n mmskin python=3.10 -y
conda activate mmskin
pip install -r requirements.txt

2、Download MM-SkinVL Pre-trained Weights

Model Name	Link
SkinVL-MM	Link
SkinVL-Pub	Link
SkinVL-PubMM	Link

Download Pre-training Datasets

Dataset	Modality	Link
SCIN	Clinical	Link
DDI	Clinical	Link
Fitzpatrick17k	Clinical	Link
PAD	Clinical	Link
Dermnet	Clinical	Link
HAM10000	Dermoscopy	Link
ISIC2019	Dermoscopy	Link
BCN20000	Dermoscopy	Link
HIBA	Dermoscopy	Link
MSKCC	Dermoscopy	Link
Patch16	Pathology	Link
MM-Skin	Clinical, Dermoscopy, Pathology	Link

Training

To train the model using LoRA, run finetune_lora.sh with pre-trained LLaVA-Med weights (available here).
Update LLAVA_MED_WEIGHT_PATH in the script to your local path, and replace PRETRAIN_DATAFRAME with the processed JSON training file.
We provide training JSONs for SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM at: /Dataframe/Pretrain.

After training, merge the LoRA weights with the base model:

python merge_lora_weights.py \
    --model-path /path/to/lora_model \
    --model-base /path/to/base_model/llava-med-v1.5-mistral-7b \
    --save-model-path /path/to/merge_model

You can also directly use our provided merged models by placing them in the /merge directory.

Evaluation

1、 VQA Evaluation。

To evaluate SkinVL-MM, SkinVL-Pub, and SkinVL-PubMM, run:

python VQA_test.py --model-path MERGED_SKINVL_MODEL

Replace caption file and image folder in the script with your dataset paths. We provide preprocessed MM-Skin test data in /Dataframe/test/VQA, which can be used directly for evaluation.

2. Supervised Fine-Tuning (SFT) Classification

Run SFT_classify_test.sh for supervised classification. Replace all paths with your local files. Preprocessed data for reproducing our results can be found in /Dataframe/test/classification.

3. Zero-Shot Classification

Run ZS_classify_test.sh to perform zero-shot classification.

Data Collection and Statistics

The 15 professional dermatology textbooks are:

MM-Skin contains 11,039 dermatology images with expert descriptions across three modalities. It provides three subsets:

- MM-Skin-C (Captions)

- MM-Skin-O (Open-ended VQA)

- MM-Skin-D (Demographics)

Data Collection Process

Image-Text Extraction: From 15 dermatology textbooks using OCR and Adobe API.
Alignment: Match images with captions.
Modality Classification: Feature-based classification (color, texture) with manual verification.
Text Cleaning: Extract age and gender info.
Filtering: Remove sensitive or annotated images.

Citation

If you find our work helpful, feel free to give us a cite.

@article{zeng2025mm,
  title={MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks},
  author={Zeng, Wenqi and Sun, Yuqi and Ma, Chenxi and Tan, Weimin and Yan, Bo},
  journal={arXiv preprint arXiv:2505.06152},
  year={2025}
}