Diffusion Theory
June 11, 2023
This note accompanies the 01-Diffusion-Sandbox notebook, which includes more visualizations of the diffusion process!
Diffusion models solve a task similar to GANs (and other generative model types, like VAEs or Normalizing Flows): they attempt to approximate some probability distribution $q(x)$ of a given domain and, most importantly, provide a way to sample from that distribution.

This is achieved by optimizing some parameters $\theta$ (represented as a neural network) that define a probability distribution $p_\theta(x)$. The objective of training is for $p_\theta(x)$ to produce samples similar to those drawn from the true underlying distribution $q(x)$.
:bulb: What's different from GANs?
- GANs produce samples from a latent vector in a single forward pass through the Generator network. The likelihood of the produced samples is controlled by the Discriminator network, which is trained to distinguish between real samples $x \sim q(x)$ and generated samples $\hat{x} \sim p_\theta(x)$.
- Diffusion models use a single network that sequentially converges to an approximation of a real sample over several estimation steps. So, the model input and output are generally of the same dimensionality.
:wrench: Mechanism of the Denoising Diffusion Process
The denoising diffusion process consists of a chain of steps in two directions, corresponding to the destruction and creation of information in the sample.
:point_right: Forward Process
With access to a sample $x_{t-1}$ at time step $t-1$, one can make an estimation about the next sample in the forward process, defined by the true distribution:

$$q(x_t \mid x_{t-1})$$

Quite often, what is available are the samples at time step $0$ (meaning clean samples), and it is then useful to use the types of operation that allow an easy and efficient formulation of:

$$q(x_t \mid x_0)$$
So far, the most common choice for a forward process has been Gaussian, as it is easy to compute and convenient in various respects:

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

The notation above simply means that the previous sample is scaled down by a factor of $\sqrt{1-\beta_t}$, and additional Gaussian noise (sampled from a zero-mean, unit-variance Gaussian) multiplied by $\sqrt{\beta_t}$ is added.

Furthermore, the $0 \to t$ step can also be easily defined as:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \tag{4}$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
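The $0 \to t$ jump above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the linear $\beta$ schedule from the DDPM paper and toy data; the helper name `q_sample` is my own.

```python
import numpy as np

# Assumptions: T = 1000 steps and a linear beta schedule, as in the DDPM paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product: alpha_bar_t

def q_sample(x0, t, eps):
    """Jump from a clean sample x0 directly to step t of the forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)       # stand-in for a data sample
eps = rng.standard_normal(4)      # eps ~ N(0, I)
x_noisy = q_sample(x0, 500, eps)  # at t = 500 the sample is mostly noise
```

Because `alpha_bars` shrinks towards zero as `t` grows, the signal term fades and the noise term dominates, which is exactly the information destruction the forward process is meant to perform.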
:point_left: Reverse Process
The reverse process is designed to restore the information in the sample, which allows generating a new sample from the distribution. Generally, it starts at some high time step $t$ (very often at $t = T$, the end of the diffusion chain, where the probability distribution is extremely close to a pure Gaussian) and attempts to approximate the distribution of the previous sample $q(x_{t-1} \mid x_t)$.

If the diffusion steps are small enough, the reverse of a Gaussian forward process can also be approximated by a Gaussian:

$$q(x_{t-1} \mid x_t) \approx \mathcal{N}(x_{t-1}; \mu, \sigma^2 \mathbf{I})$$

The reverse process is often parameterized using a neural network with parameters $\theta$, a common good candidate for approximating complex transformations. In many cases, a standard deviation function independent of $\theta$ can be used:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})$$
:steam_locomotive: DDPM: Denoising Diffusion Probabilistic Model
DDPM is one of the first popular approaches to denoising diffusion. It generates samples by following the reverse process through all $T$ steps of the diffusion chain.
When it comes to parameterizing the mean of the reverse process distribution, the network can either:
- Predict it directly as $\mu_\theta(x_t, t)$
- Predict the original sample $\hat{x}_0$, where $\mu_\theta$ is then given by the forward-process posterior mean $\tilde{\mu}_t(x_t, \hat{x}_0)$
- Predict the normal noise sample $\epsilon$ (from a unit-variance distribution) which has been added to the sample
The third option, where the network predicts $\epsilon_\theta(x_t, t)$, appears to be the most common, and that is what is done in DDPM. Inverting (4) gives an approximation of the clean sample:

$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \tag{9}$$

This yields a new equation for $\mu_\theta$ expressed in terms of $x_t$ and $\epsilon_\theta(x_t, t)$:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

and hence

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \quad z \sim \mathcal{N}(0, \mathbf{I})$$

...which is the key equation for DDPM used for sampling.
Training
Training a model tasked with predicting the noise sample $\epsilon$ is quite straightforward.
At each training step:
- Use the forward process to generate a noisy sample $x_t$ for a clean sample $x_0$ drawn from the dataset:
  - Sample the time step $t$ from a uniform distribution $t \sim \mathcal{U}\{1, \dots, T\}$
  - Sample $\epsilon$ from a normal Gaussian $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
  - Compute the noisy input sample for training via $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
- Compute the approximation of the noise $\epsilon_\theta(x_t, t)$ using the model with parameters $\theta$
- Minimize the error between $\epsilon$ and $\epsilon_\theta(x_t, t)$ by optimizing the parameters $\theta$
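The steps above can be sketched as a single training iteration. This is a toy sketch, not a real implementation: the "model" is a made-up one-parameter linear map standing in for a neural network, so the gradient can be written by hand.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (assumption)
alpha_bars = np.cumprod(1.0 - betas)

def model(x_t, t, theta):
    # Stand-in for eps_theta(x_t, t): a toy linear map, NOT a real network.
    return theta * x_t

rng = np.random.default_rng(42)
theta = 0.5                                  # toy parameter
x0 = rng.standard_normal(8)                  # clean sample from the "dataset"
t = int(rng.integers(0, T))                  # t ~ Uniform
eps = rng.standard_normal(8)                 # eps ~ N(0, I)
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
eps_pred = model(x_t, t, theta)              # predicted noise
loss = np.mean((eps - eps_pred) ** 2)        # MSE between true and predicted noise
grad = np.mean(2 * (eps_pred - eps) * x_t)   # d(loss)/d(theta) for the toy model
theta -= 1e-3 * grad                         # one SGD step on theta
```

In practice the same loop runs over minibatches with an autodiff framework computing the gradient, but the structure (sample $t$, sample $\epsilon$, noise the input, regress the noise) is identical.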
Sampling
Generation begins at $t = T$ by sampling $x_T$ from the last step in the diffusion process, which is modelled by a normal Gaussian: $x_T \sim \mathcal{N}(0, \mathbf{I})$.

Then, until $t = 0$ is reached, the network makes a prediction of the noise $\epsilon_\theta(x_t, t)$ in the sample and then approximates the mean of the process at $t-1$, using:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

Hence, the next sample at $t-1$ is sampled from the Gaussian distribution like below:

$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \quad z \sim \mathcal{N}(0, \mathbf{I})$$

...until $t = 0$ is reached, in which case only the mean $\mu_\theta$ is extracted as the output.
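The full DDPM sampling loop can be sketched as follows. The noise predictor here is a dummy that returns zeros (an assumption, purely so the snippet runs); a trained network would take its place, and $\sigma_t^2 = \beta_t$ is one common variance choice.

```python
import numpy as np

T = 50  # shortened chain so the toy loop runs quickly (assumption)
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    return np.zeros_like(x_t)  # dummy noise predictor (assumption)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)  # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    # Mean of the reverse-process Gaussian at step t-1
    mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_theta(x, t)) / np.sqrt(alphas[t])
    if t > 0:
        sigma = np.sqrt(betas[t])              # common choice: sigma_t^2 = beta_t
        x = mean + sigma * rng.standard_normal(4)
    else:
        x = mean                               # at t = 0, output only the mean
```

Note that fresh noise is injected at every step except the last, which is what makes DDPM sampling stochastic.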
:bullettrain_front: (Sampling Faster) DDIM: Denoising Diffusion Implicit Model
Warning: if you look up the original DDIM paper, you will see the symbol $\alpha$ used for $\bar{\alpha}$. In this note, no such notation change is made, for the sake of consistency.

The DDPM reverse process attempts to navigate the diffusion chain of $T$ steps in reverse order. However, as shown in (9), the reverse process involves an approximation $\hat{x}_0$ of the clean sample.
If we substitute $\hat{x}_0$ for $x_0$ in (4), written for step $t-1$:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon$$

which yields

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon$$

...and, based on the specific $\epsilon_\theta(x_t, t)$ measured at the previous step $t$, it can be rewritten as:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t \epsilon \tag{16}$$
Generally, $\sigma_t$ is set to the forward-process posterior standard deviation:

$$\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Further, we can introduce a new parameter $\eta$ to control the magnitude of the stochastic component:

$$\sigma_t^2 = \eta\,\tilde{\beta}_t \tag{18}$$
As found in the original DDIM paper, setting $\eta = 0$ appears to be particularly beneficial when fewer steps of the reverse process are applied; that specific type of process is known as the Denoising Diffusion Implicit Model (DDIM). The above formulation is still consistent with DDPM when $\eta = 1$.

:flashlight: So, how can the diffusion chain be navigated faster in the reverse direction? First, a sequence of fewer steps is defined as a subset of the original temporal steps of the forward process. Sampling is then based on (16).
At each step:
- Predict $\hat{x}_0$ from $x_t$ and $\epsilon_\theta(x_t, t)$ using (9)
- Compute the direction pointing towards the current $x_t$: $\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t, t)$
- (If not DDIM, i.e. $\eta > 0$) inject the noise $\sigma_t \epsilon$ for the stochastic functionality
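A single update over the shortened step sequence can be sketched like this. As before, the noise predictor is a dummy (an assumption), and the helper name `ddim_step` is my own; `s < t` are consecutive entries of the sub-sampled schedule.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (assumption)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    return np.zeros_like(x_t)  # dummy noise predictor (assumption)

def ddim_step(x_t, t, s, eta=0.0, rng=None):
    """Move from step t to an earlier step s (s < t) using equation (16)."""
    eps = eps_theta(x_t, t)
    # Predict the clean sample via (9)
    x0_hat = (x_t - np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
    beta_tilde = (1 - alpha_bars[s]) / (1 - alpha_bars[t]) * betas[t]
    sigma = np.sqrt(eta * beta_tilde)                       # sigma_t^2 = eta * beta_tilde
    dir_xt = np.sqrt(1 - alpha_bars[s] - sigma**2) * eps    # direction towards x_t
    noise = sigma * rng.standard_normal(x_t.shape) if eta > 0 else 0.0
    return np.sqrt(alpha_bars[s]) * x0_hat + dir_xt + noise

rng = np.random.default_rng(0)
x = rng.standard_normal(4)  # start from pure noise
# Navigate a short subsequence of steps, e.g. every 100th step, ending at 0
steps = list(range(T - 1, 0, -100)) + [0]
for t, s in zip(steps[:-1], steps[1:]):
    x = ddim_step(x, t, s)  # eta = 0: fully deterministic
```

With `eta=0` the loop is deterministic: running it twice from the same starting noise produces the same output, which is what enables the noise-to-sample matching mentioned below.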
It can generally be assumed that DDIM:
- Offers better sample quality at fewer steps
- Allows for deterministic matching between the starting noise and the generated sample
- Performs worse than DDPM for large numbers of steps (such as 1000)