A Survey on Conditional Image Synthesis with Diffusion Models

June 13, 2025

License: MIT

This repository is based on our survey paper, Conditional Image Synthesis with Diffusion Models: A Survey.

TMLR (Survey Certification, 53 Pages): https://openreview.net/forum?id=ewwNKwh6SK

Zheyuan Zhan, Defang Chen, Jian-Ping Mei, Zhenghe Zhao, Jiawei Chen, Chun Chen, Siwei Lyu and Can Wang

Abstract

Conditional image synthesis based on user-specified requirements is a key component in creating complex visual content. In recent years, diffusion-based generative modeling has become a highly effective way for conditional image synthesis, leading to exponential growth in the literature. However, the complexity of diffusion-based modeling, the wide range of image synthesis tasks, and the diversity of conditioning mechanisms present significant challenges for researchers to keep up with rapid developments and understand the core concepts on this topic. In this survey, we categorize existing works based on how conditions are integrated into the two fundamental components of diffusion-based modeling, i.e., the denoising network and the sampling process. We specifically highlight the underlying principles, advantages, and potential challenges of various conditioning approaches in the training, re-purposing, and specialization stages to construct a desired denoising network. We also summarize six mainstream conditioning mechanisms in the essential sampling process. All discussions are centered around popular applications. Finally, we pinpoint some critical yet still open problems to be solved in the future and suggest some possible solutions.

News!

📆2024-10-05: Our comprehensive survey paper, summarizing related methods published before October 1, 2024, is now available.

📆2025-04-27: Our paper has been accepted by TMLR!

BibTeX

@article{zhan2024conditional,
  title={Conditional Image Synthesis with Diffusion Models: A Survey},
  author={Zhan, Zheyuan and Chen, Defang and Mei, Jian-Ping and Zhao, Zhenghe and Chen, Jiawei and Chen, Chun and Lyu, Siwei and Wang, Can},
  journal={arXiv preprint arXiv:2409.19365},
  year={2024}
}

Overview

The two figures below illustrate, respectively, the DCIS taxonomy used in this survey and the categorization of conditional image synthesis tasks.

Paper Structure

Conditional image synthesis with diffusion model

Conditional image synthesis tasks

[Figure: Conditional image synthesis tasks]

Papers

The dates in the tables are the publication dates of each paper's first version on arXiv.

DDPM denoising network

Workflow

Condition Integration in Denoising Networks

This figure shows an exemplar workflow for building the desired denoising network for conditional synthesis tasks, including text-to-image, visual-signal-to-image, and customization, through the three condition integration stages.

[Figure: Workflow]

Condition Integration in the Training Stage

Conditional Models for Text-to-Image (T2I)

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Vector quantized diffusion model for text-to-image synthesis | Text-to-image | 2021.11 | CVPR2022 |
| High-resolution image synthesis with latent diffusion models | Text-to-image | 2021.12 | CVPR2022 |
| GLIDE: towards photorealistic image generation and editing with text-guided diffusion models | Text-to-image | 2021.12 | ICML2022 |
| Hierarchical text-conditional image generation with CLIP latents | Text-to-image | 2022.4 | ARXIV2022 |
| Photorealistic text-to-image diffusion models with deep language understanding | Text-to-image | 2022.5 | NeurIPS2022 |
| ediffi: Text-to-image diffusion models with an ensemble of expert denoisers | Text-to-image | 2022.11 | ARXIV2022 |
| PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis | Text-to-image | 2023.10 | ICLR2024 |
| Scaling Rectified Flow Transformers for High-Resolution Image Synthesis | Text-to-image | 2024.03 | ICML2024 |

Conditional Models for Image Restoration

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Srdiff: Single image super-resolution with diffusion probabilistic models | Image restoration | 2021.4 | Neurocomputing2022 |
| Image super-resolution via iterative refinement | Image restoration | 2021.4 | TPAMI2022 |
| Cascaded diffusion models for high fidelity image generation | Image restoration | 2021.5 | JMLR2022 |
| Palette: Image-to-image diffusion models | Image restoration | 2021.11 | SIGGRAPH2022 |
| Denoising diffusion probabilistic models for robust image super-resolution in the wild | Image restoration | 2023.2 | ARXIV2023 |
| Resdiff: Combining cnn and diffusion model for image super-resolution | Image restoration | 2023.3 | AAAI2024 |
| Low-light image enhancement with wavelet-based diffusion models | Image restoration | 2023.6 | TOG2023 |
| Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration | Image restoration | 2023.11 | CVPR2024 |
| Diffusion-based blind text image super-resolution | Image restoration | 2023.12 | CVPR2024 |
| Low-light image enhancement via clip-fourier guided wavelet diffusion | Image restoration | 2024.1 | ARXIV2024 |

Conditional Models for Other Synthesis Scenarios

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Diffusion autoencoders: Toward a meaningful and decodable representation | Novel conditional control | 2021.11 | CVPR2022 |
| Semantic image synthesis via diffusion models | Visual feature map | 2022.6 | ARXIV2022 |
| A novel unified conditional score-based generative framework for multi-modal medical image completion | Medical image synthesis | 2022.7 | ARXIV2022 |
| A morphology focused diffusion probabilistic model for synthesis of histopathology images | Medical image synthesis | 2022.9 | WACV2023 |
| Humandiffusion: a coarse-to-fine alignment diffusion framework for controllable text-driven person image generation | Visual signal to image | 2022.11 | ARXIV2022 |
| Diffusion-based scene graph to image generation with masked contrastive pre-training | Graph to image | 2022.11 | ARXIV2022 |
| Dolce: A model-based probabilistic diffusion framework for limited-angle ct reconstruction | Medical image synthesis | 2022.11 | ICCV2023 |
| Zero-shot medical image translation via frequency-guided diffusion models | Image editing | 2023.4 | Trans. Med. Imaging 2023 |
| Learned representation-guided diffusion models for large-image generation | / | 2023.12 | ARXIV2023 |

Condition Integration in the Re-purposing Stage

Re-purposed Conditional Encoders

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Pretraining is all you need for image-to-image translation | Visual signal to image | 2022.5 | ARXIV2022 |
| T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models | Visual signal to image | 2023.2 | AAAI2024 |
| Adding conditional control to text-to-image diffusion models | Visual signal to image | 2023.2 | ICCV2023 |
| Encoder-based domain tuning for fast personalization of text-to-image models | Customization | 2023.2 | TOG2023 |
| Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models | Image editing, Image composition | 2023.3 | ARXIV2023 |
| Taming encoder for zero fine-tuning image customization with text-to-image diffusion models | Customization | 2023.4 | ARXIV2023 |
| Instantbooth: Personalized text-to-image generation without test-time finetuning | Customization | 2023.4 | CVPR2024 |
| Blip-diffusion: pre-trained subject representation for controllable text-to-image generation and editing | Customization | 2023.5 | NeurIPS2023 |
| Fastcomposer: Tuning-free multi-subject image generation with localized attention | Customization | 2023.5 | ARXIV2023 |
| Prompt-free diffusion: Taking "text" out of text-to-image diffusion models | Visual signal to image | 2023.5 | CVPR2024 |
| Paste, inpaint and harmonize via denoising: Subject-driven image editing with pre-trained diffusion model | Image composition | 2023.6 | ARXIV2023 |
| Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning | Customization, Layout control | 2023.7 | SIGGRAPH2024 |
| Imagebrush: Learning visual in-context instructions for exemplar-based image manipulation | Image editing | 2023.8 | NeurIPS2024 |
| Guiding instruction-based image editing via multimodal large language models | Image editing | 2023.9 | ARXIV2023 |
| Ranni: Taming text-to-image diffusion for accurate instruction following | Image editing | 2023.11 | ARXIV2023 |
| Smartedit: Exploring complex instruction-based image editing with multimodal large language models | Image editing | 2023.12 | ARXIV2023 |
| Instructany2pix: Flexible visual editing via multimodal instruction following | Image editing | 2023.12 | ARXIV2023 |
| Warpdiffusion: Efficient diffusion model for high-fidelity virtual try-on | Image composition | 2023.12 | ARXIV2023 |
| Coarse-to-fine latent diffusion for pose-guided person image synthesis | Customization | 2024.2 | CVPR2024 |
| Lightit: Illumination modeling and control for diffusion models | Visual signal to image | 2024.3 | CVPR2024 |
| Face2diffusion for fast and editable face personalization | Customization | 2024.3 | CVPR2024 |

Condition Injection

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| GLIGEN: open-set grounded text-to-image generation | Layout control | 2023.1 | CVPR2023 |
| Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation | Customization | 2023.2 | CVPR2023 |
| Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models | Customization | 2023.5 | NeurIPS2024 |
| Dragondiffusion: Enabling drag-style manipulation on diffusion models | Image editing | 2023.7 | ICLR2024 |
| Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models | Visual signal to image, Image editing | 2023.8 | ARXIV2023 |
| Interactdiffusion: Interaction control in text-to-image diffusion models | Layout control | 2023.12 | ARXIV2023 |
| Instancediffusion: Instance-level control for image generation | Layout control | 2024.2 | CVPR2024 |
| Deadiff: An efficient stylization diffusion model with disentangled representations | Image editing | 2024.3 | CVPR2024 |

Backbone Fine-tuning

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Instructpix2pix: Learning to follow image editing instructions | Image editing | 2022.11 | CVPR2023 |
| Paint by example: Exemplar-based image editing with diffusion models | Image composition | 2022.11 | CVPR2023 |
| Objectstitch: Object compositing with diffusion model | Image composition | 2022.12 | CVPR2023 |
| Smartbrush: Text and shape guided object inpainting with diffusion model | Image restoration | 2022.12 | CVPR2023 |
| Imagen editor and editbench: Advancing and evaluating text-guided image inpainting | Image restoration | 2022.12 | CVPR2023 |
| Reference-based image composition with sketch via structure-aware diffusion model | Image composition | 2023.3 | ARXIV2023 |
| Dialogpaint: A dialog-based image editing model | Image editing | 2023.3 | ARXIV2023 |
| Hive: Harnessing human feedback for instructional visual editing | Image editing | 2023.3 | CVPR2024 |
| Inst-inpaint: Instructing to remove objects with diffusion models | Image editing | 2023.4 | ARXIV2023 |
| Text-to-image editing by image information removal | Image editing | 2023.5 | WACV2024 |
| Magicbrush: A manually annotated dataset for instruction-guided image editing | Image editing | 2023.6 | NeurIPS2024 |
| Anydoor: Zero-shot object-level image customization | Image composition | 2023.7 | CVPR2024 |
| Instructdiffusion: A generalist modeling interface for vision tasks | Image editing | 2023.9 | ARXIV2023 |
| Emu edit: Precise image editing via recognition and generation tasks | Image editing | 2023.11 | CVPR2024 |
| Dreaminpainter: Text-guided subject-driven image inpainting with diffusion models | Image composition | 2023.12 | ARXIV2023 |

Condition Integration in the Specialization Stage

Conditional Projection

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| An image is worth one word: Personalizing text-to-image generation using textual inversion | Customization | 2022.8 | ICLR2023 |
| Imagic: Text-based real image editing with diffusion models | Image editing | 2022.10 | CVPR2023 |
| Uncovering the disentanglement capability in text-to-image diffusion models | Image editing | 2022.12 | CVPR2023 |
| Preditor: Text guided image editing with diffusion prior | Image editing | 2023.2 | ARXIV2023 |
| iedit: Localised text-guided image editing with weak supervision | Image editing | 2023.5 | CVPR2024 |
| Forgedit: Text guided image editing via learning and forgetting | Image editing | 2023.9 | ARXIV2023 |
| Prompting hard or hardly prompting: Prompt inversion for text-to-image diffusion models | Image editing | 2023.12 | CVPR2024 |

Testing-time Model Fine-Tuning

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation | Customization | 2022.8 | CVPR2023 |
| Imagic: Text-based real image editing with diffusion models | Image editing | 2022.10 | CVPR2023 |
| Unitune: Text-driven image editing by fine tuning a diffusion model on a single image | Image editing | 2022.10 | TOG2023 |
| Multi-concept customization of text-to-image diffusion | Customization | 2022.12 | CVPR2023 |
| Sine: Single image editing with text-to-image diffusion models | Image editing | 2022.12 | CVPR2023 |
| Encoder-based domain tuning for fast personalization of text-to-image models | Customization | 2023.2 | TOG2023 |
| Svdiff: Compact parameter space for diffusion fine-tuning | Customization | 2023.3 | ICCV2023 |
| Cones: concept neurons in diffusion models for customized generation | Customization | 2023.3 | ICML2023 |
| Custom-edit: Text-guided image editing with customized diffusion models | Customization | 2023.5 | ARXIV2023 |
| Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models | Customization | 2023.5 | NeurIPS2024 |
| Layerdiffusion: Layered controlled image editing with diffusion models | Image editing | 2023.5 | SIGGRAPH Asia2023 |
| Cones 2: Customizable image synthesis with multiple subjects | Customization | 2023.5 | NeurIPS2023 |

Condition Integration in the Sampling Process

We illustrate the six conditioning mechanisms with an exemplary image editing process in the figure below.

[Figure: Sampling]

Inversion

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Sdedit: Guided image synthesis and editing with stochastic differential equations | Image editing, Visual signal to image | 2021.8 | ICLR2022 |
| Dual diffusion implicit bridges for image-to-image translation | Image editing, Visual signal to image | 2022.3 | ICLR2023 |
| Null-text inversion for editing real images using guided diffusion models | Image editing | 2022.11 | CVPR2023 |
| Edict: Exact diffusion inversion via coupled transformations | Image editing | 2022.11 | CVPR2023 |
| A latent space of stochastic diffusion models for zero-shot image editing and guidance | Image editing | 2022.11 | ICCV2023 |
| Inversion-based style transfer with diffusion models | Image editing | 2022.11 | CVPR2023 |
| An edit friendly ddpm noise space: Inversion and manipulations | Image editing | 2023.4 | ARXIV2023 |
| Prompt tuning inversion for text-driven image editing using diffusion models | Image editing | 2023.5 | ICCV2023 |
| Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models | Image editing | 2023.5 | ARXIV2023 |
| Dragdiffusion: Harnessing diffusion models for interactive point-based image editing | Image editing | 2023.6 | CVPR2024 |
| Tf-icon: Diffusion-based training-free cross-domain image composition | Image editing | 2023.7 | ICCV2023 |
| Stylediffusion: Controllable disentangled style transfer via diffusion models | Image editing | 2023.8 | ICCV2023 |
| Kv inversion: Kv embeddings learning for text-conditioned real image action editing | Image editing | 2023.9 | PRCV2023 |
| Effective real image editing with accelerated iterative diffusion inversion | Image editing | 2023.9 | ICCV2023 |
| Direct inversion: Boosting diffusion-based editing with 3 lines of code | Image editing | 2023.10 | ARXIV2023 |
| Ledits++: Limitless image editing using text-to-image models | Image editing | 2023.11 | CVPR2024 |
| The blessing of randomness: Sde beats ode in general diffusion-based image editing | Image editing | 2023.11 | ICLR2024 |
| Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer | Image editing | 2023.12 | CVPR2024 |
| Fixed-point inversion for text-to-image diffusion models | Image editing | 2023.12 | ARXIV2023 |

Attention Manipulation

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Prompt-to-prompt image editing with cross attention control | Image editing | 2022.8 | ICLR2023 |
| Plug-and-play diffusion features for text-driven image-to-image translation | Image editing | 2022.11 | CVPR2023 |
| ediffi: Text-to-image diffusion models with an ensemble of expert denoisers | Layout control | 2022.11 | ARXIV2022 |
| Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing | Image editing | 2023.4 | ICCV2023 |
| Custom-edit: Text-guided image editing with customized diffusion models | Customization | 2023.5 | ARXIV2023 |
| Cones 2: Customizable image synthesis with multiple subjects | Customization | 2023.5 | NeurIPS2023 |
| Dragdiffusion: Harnessing diffusion models for interactive point-based image editing | Image editing | 2023.6 | CVPR2024 |
| Tf-icon: Diffusion-based training-free cross-domain image composition | Image editing | 2023.7 | ICCV2023 |
| Dragondiffusion: Enabling drag-style manipulation on diffusion models | Image editing | 2023.7 | ICLR2024 |
| Stylediffusion: Controllable disentangled style transfer via diffusion models | Image editing | 2023.8 | ICCV2023 |
| Face aging via diffusion-based editing | Image editing | 2023.9 | BMVC2023 |
| Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing | Image editing | 2023.9 | NeurIPS2024 |
| Style injection in diffusion: A training-free approach for adapting large-scale diffusion models for style transfer | Image editing | 2023.12 | CVPR2024 |
| Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation | Image editing | 2023.12 | ARXIV2023 |
| Towards understanding cross and self-attention in stable diffusion for text-guided image editing | Image editing | 2024.3 | CVPR2024 |
| Taming Rectified Flow for Inversion and Editing | Image editing | 2024.11 | ARXIV2024 |
| Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models | Image editing | 2024.11 | ARXIV2024 |

Noise Blending

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Compositional visual generation with composable diffusion models | General approach | 2022.6 | ECCV2022 |
| Classifier-free diffusion guidance | / | 2022.7 | ARXIV2022 |
| Sine: Single image editing with text-to-image diffusion models | Image editing | 2022.12 | CVPR2023 |
| Multidiffusion: Fusing diffusion paths for controlled image generation | Multiple control | 2023.2 | ICML2023 |
| Pair-diffusion: Object-level image editing with structure-and-appearance paired diffusion models | Image editing, Image composition | 2023.3 | ARXIV2023 |
| Magicfusion: Boosting text-to-image generation performance by fusing diffusion models | Image composition | 2023.3 | ICCV2023 |
| Effective real image editing with accelerated iterative diffusion inversion | Image editing | 2023.9 | ICCV2023 |
| Ledits++: Limitless image editing using text-to-image models | Image editing | 2023.11 | CVPR2024 |
| Noisecollage: A layout-aware text-to-image diffusion model based on noise cropping and merging | Image composition | 2024.3 | CVPR2024 |
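
As a rough illustration of what noise blending means in practice, the snippet below sketches the widely used classifier-free guidance rule, which combines an unconditional and a conditional noise prediction at a single sampling step. It is a minimal sketch rather than code from any listed paper: the function name `blend_noise`, its parameter names, and the use of flat Python lists in place of image tensors are all assumptions made for illustration.

```python
from typing import List, Sequence


def blend_noise(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend two noise predictions with the classifier-free guidance rule.

    Computes eps_uncond + guidance_scale * (eps_cond - eps_uncond) elementwise.
    All names here are hypothetical; a real sampler would apply the same
    arithmetic to the denoising network's output tensors.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# A scale of 1.0 recovers the conditional prediction; larger scales push the
# blended prediction further in the direction the condition suggests.
blended = blend_noise([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
```

With a scale of 0.0 the blend falls back to the unconditional prediction, so the single scalar controls how strongly the condition steers each sampling step.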

Revising Diffusion Process

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Snips: Solving noisy inverse problems stochastically | Image restoration | 2021.5 | NeurIPS2021 |
| Denoising diffusion restoration models | Image restoration | 2022.1 | NeurIPS2022 |
| Driftrec: Adapting diffusion models to blind jpeg restoration | Image restoration | 2022.11 | TIP2024 |
| Zero-shot image restoration using denoising diffusion null-space model | Image restoration | 2022.12 | ICLR2024 |
| Image restoration with mean-reverting stochastic differential equations | Image restoration | 2023.1 | ICML2023 |
| Inversion by direct iteration: An alternative to denoising diffusion for image restoration | Image restoration | 2023.3 | TMLR2023 |
| Resshift: Efficient diffusion model for image super-resolution by residual shifting | Image restoration | 2023.7 | NeurIPS2024 |
| Sinsr: diffusion-based image super-resolution in a single step | Image restoration | 2023.11 | CVPR2024 |

Guidance

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Diffusion models beat gans on image synthesis | Text-to-image | 2021.5 | NeurIPS2021 |
| Blended diffusion for text-driven editing of natural images | Image restoration | 2021.11 | CVPR2022 |
| More control for free! image synthesis with semantic diffusion guidance | Text/Image-to-image | 2021.12 | WACV2023 |
| Improving diffusion models for inverse problems using manifold constraints | Image restoration | 2022.6 | NeurIPS2022 |
| Diffusion posterior sampling for general noisy inverse problems | Image restoration | 2022.9 | ICLR2023 |
| Diffusion-based image translation using disentangled style and content representation | Image editing | 2022.9 | ICLR2023 |
| Sketch-guided text-to-image diffusion models | Visual signal to image | 2022.11 | SIGGRAPH2023 |
| High-fidelity guided image synthesis with latent diffusion models | Visual signal to image | 2022.11 | CVPR2023 |
| Parallel diffusion models of operator and image for blind inverse problems | Image restoration | 2022.11 | CVPR2023 |
| Zero-shot image-to-image translation | Image editing | 2023.2 | SIGGRAPH2023 |
| Universal guidance for diffusion models | General guidance framework | 2023.2 | CVPR2023 |
| Pseudoinverse-guided diffusion models for inverse problems | Image restoration | 2023.2 | ICLR2023 |
| Freedom: Training-free energy-guided conditional diffusion model | General guidance framework | 2023.3 | ICCV2023 |
| Training-free layout control with cross-attention guidance | Layout control | 2023.4 | WACV2024 |
| Generative diffusion prior for unified image restoration and enhancement | Image restoration | 2023.4 | CVPR2023 |
| Regeneration learning of diffusion models with rich prompts for zero-shot image translation | Image editing | 2023.5 | ARXIV2023 |
| Diffusion self-guidance for controllable image generation | Image editing | 2023.6 | NeurIPS2024 |
| Energy-based cross attention for bayesian context update in text-to-image diffusion models | Image editing | 2023.6 | NeurIPS2024 |
| Solving linear inverse problems provably via posterior sampling with latent diffusion models | Image restoration | 2023.7 | NeurIPS2024 |
| Dragondiffusion: Enabling drag-style manipulation on diffusion models | Image editing | 2023.7 | ICLR2024 |
| Readout guidance: Learning control from diffusion features | Visual signal to image | 2023.12 | CVPR2024 |
| Freecontrol: Training-free spatial control of any text-to-image diffusion model with any condition | Visual signal to image | 2023.12 | CVPR2024 |
| Diffeditor: Boosting accuracy and flexibility on diffusion-based image editing | Image editing | 2024.2 | CVPR2024 |

Conditional Correction

| Title | Task | Date | Publication |
| --- | --- | --- | --- |
| Score-based generative modeling through stochastic differential equations | Image restoration | 2020.11 | ICLR2021 |
| ILVR: conditioning method for denoising diffusion probabilistic models | Image restoration | 2021.8 | ICCV2021 |
| Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction | Image restoration | 2021.12 | CVPR2022 |
| Repaint: Inpainting using denoising diffusion probabilistic models | Image restoration | 2022.1 | CVPR2022 |
| Improving diffusion models for inverse problems using manifold constraints | Image restoration | 2022.6 | NeurIPS2022 |
| Diffedit: Diffusion-based semantic image editing with mask guidance | Image editing | 2022.10 | ICLR2023 |
| Region-aware diffusion for zero-shot text-driven image editing | Image editing | 2023.2 | ARXIV2023 |
| Localizing object-level shape variations with text-to-image diffusion models | Image editing | 2023.3 | ICCV2023 |
| Instructedit: Improving automatic masks for diffusion-based image editing with user instructions | Image editing | 2023.5 | ARXIV2023 |
| Text-driven image editing via learnable regions | Image editing | 2023.11 | CVPR2024 |
