ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text




(April 2025) Official implementation of ColorizeDiffusion.

ColorizeDiffusion is a Stable Diffusion-based sketch colorization framework that achieves high-quality results with arbitrary sketch-reference input pairs.

Foundational paper for this repository: ColorizeDiffusion (e-print).
Version 1 - Base training, 512px. Released; checkpoints start with mult.
Version 1.5 - Solves spatial entanglement, 512px. Released; checkpoints start with switch.
Version 2 - Enhances background and style transfer, 768px. Released; checkpoints start with v2.
Version XL - Enhances embedding guidance for character colorization and geometry disentanglement, 1024px. Code released; accepted by CVPR 2026.

Getting Started


conda env create -f environment.yaml
conda activate hf

Make sure open-clip-torch<=2.24.0 is installed.
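
If you are unsure which version is installed, you can check it from Python. This is a minimal sketch using only importlib.metadata and the widely available packaging package; it is not part of this repository:

```python
# Minimal check that the installed open-clip-torch satisfies the pin above.
# Assumes the "packaging" package is available in the environment.
from importlib.metadata import version
from packaging.version import Version

installed = Version(version("open-clip-torch"))
assert installed <= Version("2.24.0"), f"open-clip-torch {installed} is too new; install <=2.24.0"
print(f"open-clip-torch {installed} is OK")
```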

User Interface


We implement a fully-featured UI. To run it, just:

python -u app.py

The default server address is http://localhost:7860.

Important inference options

| Option | Description |
|---|---|
| BG enhance | Low-level feature injection for backgrounds in v2 models. |
| Style enhance | Low-level feature injection for style details in v2 models. |
| FG enhance | Not useful for the currently open-sourced models. |
| Reference strength | Decrease it to increase semantic fidelity to the sketch input. |
| Foreground strength | Similar to reference strength, but applied only to the foreground region. Requires FG or BG enhance to be activated. |
| Preprocessor | Sketch preprocessing. "Extract" is suggested if the sketch input is a complicated pencil drawing. |
| Line extractor | Line extractor used when the preprocessor is "Extract". |
| Sketch guidance scale | Classifier-free guidance scale for the sketch image; 1 is suggested. |
| Attention injection | Noised low-level feature injection; 2x inference time. |
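
For context, a classifier-free guidance scale of this kind typically blends the conditional and unconditional model predictions as sketched below. This is a generic illustration rather than the repository's implementation, and the tensor names are hypothetical:

```python
import torch

def apply_sketch_guidance(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, scale: float) -> torch.Tensor:
    """Standard classifier-free guidance: scale = 1 returns the conditional
    prediction unchanged; larger values push the output further towards the
    sketch condition."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With the suggested scale of 1 the expression reduces to the conditional prediction, which is why 1 is a safe default for the sketch branch.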

768-level Cross-content colorization results (from v2)


1536-level Character colorization results (from XL)


Manipulation


The colorization results can be manipulated using text prompts; see ColorizeDiffusion (e-print).

Manipulation is now deactivated by default. To activate it, run:

python -u app.py -manipulate

For local manipulations, a visualization is provided to show the correlation between each prompt and tokens in the reference image.

The manipulation result and correlation visualization of the settings:

Target prompt: the girl's blonde hair
Anchor prompt: the girl's brown hair
Control prompt: the girl's brown hair, 
Target scale: 8
Enhanced: false
Thresholds: 0.5, 0.55, 0.65, 0.95

As you can see, the manipulation unavoidably changes some unrelated regions, since it is performed on the reference embeddings.
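
Conceptually, a global manipulation of this kind shifts the reference image embedding along the direction from the anchor prompt's embedding towards the target prompt's embedding, controlled by the target scale. The snippet below is only a schematic of that idea with hypothetical names; the actual procedure, including the local threshold-based variant, is described in the paper:

```python
import torch

def manipulate_reference(image_emb: torch.Tensor,
                         target_text_emb: torch.Tensor,
                         anchor_text_emb: torch.Tensor,
                         target_scale: float) -> torch.Tensor:
    """Schematic global manipulation: move the reference embedding away from the
    anchor attribute (e.g. brown hair) towards the target attribute (e.g. blonde
    hair). A larger target_scale gives a stronger change, but also affects more
    unrelated attributes, as noted above."""
    direction = target_text_emb - anchor_text_emb
    return image_emb + target_scale * direction
```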

Manipulation options

| Option | Description |
|---|---|
| Group index | Index of the selected manipulation sequence's parameter group. |
| Target prompt | Prompt specifying the desired visual attribute for the image after manipulation. |
| Anchor prompt | Prompt specifying the anchored visual attribute for the image before manipulation. |
| Control prompt | Used for local manipulation (crossattn-based models). Prompt specifying the target regions. |
| Enhance | Whether this manipulation should be enhanced (more likely to influence unrelated attributes). |
| Target scale | Scale used to progressively control the manipulation. |
| Thresholds | Used for local manipulation (crossattn-based models). Four hyperparameters that reduce the influence on irrelevant visual attributes, where 0.0 < threshold 0 < threshold 1 < threshold 2 < threshold 3 < 1.0. |
| < Threshold 0 | Regions most related to the control prompt. Indicated by deep blue. |
| Threshold 0 - Threshold 1 | Regions related to the control prompt. Indicated by blue. |
| Threshold 1 - Threshold 2 | Neighbouring but unrelated regions. Indicated by green. |
| Threshold 2 - Threshold 3 | Unrelated regions. Indicated by orange. |
| > Threshold 3 | The most unrelated regions. Indicated by brown. |
| Add | Click Add to save the current manipulation in the sequence. |
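
As a rough illustration of how the four thresholds split reference tokens into the five colour bands listed above, here is a small sketch. It is a hypothetical helper, not the repository's code; the per-token score is whatever correlation-derived quantity the visualization thresholds:

```python
def threshold_band(score: float, t0: float, t1: float, t2: float, t3: float) -> str:
    """Map a per-token score to the colour bands above (0.0 < t0 < t1 < t2 < t3 < 1.0).
    Following the table, lower scores correspond to regions more related to the
    control prompt."""
    if score < t0:
        return "deep blue: most related to the control prompt"
    if score < t1:
        return "blue: related to the control prompt"
    if score < t2:
        return "green: neighbouring but unrelated"
    if score < t3:
        return "orange: unrelated"
    return "brown: most unrelated"
```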

Training, inference & validation

Our implementation is based on Accelerate and DeepSpeed.
Before starting training, collect your data and organize the training dataset as follows:

[dataset_path]
├── image_list.json    # Optional, used for image indexing
├── color              # Color images
│   ├── 0001.zip
│   │   ├── 10001.png
│   │   ├── 100001.jpg
│   │   └── ...
│   ├── 0002.zip
│   └── ...
├── sketch             # Sketch images
│   ├── 0001.zip
│   │   ├── 10001.png
│   │   ├── 100001.jpg
│   │   └── ...
│   ├── 0002.zip
│   └── ...
└── mask               # Mask images (required for fg-bg training)
    ├── 0001.zip
    │   ├── 10001.png
    │   ├── 100001.jpg
    │   └── ...
    ├── 0002.zip
    └── ...

For details of dataset organization, check data/dataloader.py.
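
As a quick sanity check of the layout above, the sketch below verifies that every color archive has a sketch (and, for fg-bg training, mask) archive with matching image names. It is a hypothetical helper written against the structure shown here; the authoritative loading logic is in data/dataloader.py:

```python
import zipfile
from pathlib import Path

def check_dataset(dataset_path: str, require_masks: bool = False) -> None:
    """Verify that color/sketch(/mask) zips exist in matching pairs and contain
    the same image names (compared by file stem, e.g. 10001)."""
    root = Path(dataset_path)
    folders = ["color", "sketch"] + (["mask"] if require_masks else [])
    for color_zip in sorted((root / "color").glob("*.zip")):
        stems = {}
        for folder in folders:
            archive = root / folder / color_zip.name
            if not archive.exists():
                raise FileNotFoundError(f"missing archive: {archive}")
            with zipfile.ZipFile(archive) as zf:
                stems[folder] = {Path(n).stem for n in zf.namelist() if not n.endswith("/")}
        if any(stems[folder] != stems["color"] for folder in folders):
            raise ValueError(f"image names differ across folders for {color_zip.name}")
    print("dataset layout looks consistent")

# Example: check_dataset("[dataset_path]", require_masks=True)
```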
Training command example:

accelerate launch --config_file [accelerate_config_file] \
    train.py \
    --name base \
    --dataroot [dataset_path] \
    --batch_size 64 \
    --num_threads 8 \
    -cfg configs/train/sd2.1/mult.yaml \
    -pt [pretrained_model_path]

Note that the batch size here is the micro batch size per GPU. If you run the command on 8 GPUs, the total batch size is 512.
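
As a worked example of the note above (the gradient accumulation value is hypothetical and, if used, comes from your Accelerate/DeepSpeed config):

```python
micro_batch_size = 64    # --batch_size, per GPU
num_gpus = 8
grad_accumulation = 1    # hypothetical; taken from the accelerate/deepspeed config

effective_batch_size = micro_batch_size * num_gpus * grad_accumulation
print(effective_batch_size)  # 512 with the values above
```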

Inference example:

# use configs/inference/val.yaml for mult/switch-eps models,
# configs/inference/v2-val.yaml for v2 models
python inference.py \
    --name inf \
    --dataroot [dataset_path] \
    --batch_size 64 \
    --num_threads 8 \
    -cfg configs/inference/val.yaml \
    -pt [pretrained_model_path] \
    -gs 5

Validation example:

# use configs/inference/val.yaml for mult/switch-eps models,
# configs/inference/v2-val.yaml for v2 models
python inference.py \
    --name val \
    --dataroot [dataset_path] \
    --batch_size 64 \
    --num_threads 8 \
    -cfg configs/inference/val.yaml \
    -pt [pretrained_model_path] \
    -gs 5 \
    -val

The difference between inference and validation modes is that validation mode uses randomly selected images as reference inputs. Refer to options.py for the full list of arguments.
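
In other words, validation pairs each sketch with a randomly drawn color image as its reference instead of a user-specified one. A minimal sketch of that idea with hypothetical names (the real pairing logic lives in the repository):

```python
import random

def pick_validation_reference(color_images: list[str]) -> str:
    """Validation mode: draw a random color image to serve as the reference,
    rather than using a fixed sketch-reference pairing."""
    return random.choice(color_images)
```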

Code reference

  1. Stable Diffusion v2
  2. Stable Diffusion XL
  3. SD-webui-ControlNet
  4. Stable-Diffusion-webui
  5. K-diffusion
  6. Deepspeed
  7. sketchKeras-PyTorch

Citation

@article{2024arXiv240101456Y,
    author  = {Yan, Dingkun and Yuan, Liang and Wu, Erwin and Nishioka, Yuma and Fujishiro, Issei and Saito, Suguru},
    title   = {ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text},
    journal = {arXiv e-prints},
    year    = {2024},
    doi     = {10.48550/arXiv.2401.01456},
}

@InProceedings{Yan_2025_WACV,
    author    = {Yan, Dingkun and Yuan, Liang and Wu, Erwin and Nishioka, Yuma and Fujishiro, Issei and Saito, Suguru},
    title     = {ColorizeDiffusion: Improving Reference-Based Sketch Colorization with Latent Diffusion Model},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2025},
    pages     = {5092-5102}
}

@article{2025arXiv250219937Y,
    author  = {Yan, Dingkun and Wang, Xinrui and Li, Zhuoru and Saito, Suguru and Iwasawa, Yusuke and Matsuo, Yutaka and Guo, Jiaxian},
    title   = {Image Referenced Sketch Colorization Based on Animation Creation Workflow},
    journal = {arXiv e-prints},
    year    = {2025},
    doi     = {10.48550/arXiv.2502.19937},
}

@article{yan2025colorizediffusionv2enhancingreferencebased,
    author  = {Yan, Dingkun and Wang, Xinrui and Iwasawa, Yusuke and Matsuo, Yutaka and Saito, Suguru and Guo, Jiaxian},
    title   = {ColorizeDiffusion v2: Enhancing Reference-based Sketch Colorization Through Separating Utilities},
    journal = {arXiv e-prints},
    year    = {2025},
    doi     = {10.48550/arXiv.2504.06895},
}