Structured Denoising Diffusion Models in Discrete State-Spaces

December 10, 2022 · View on GitHub

Getting Started

See annotated_diffusion.ipynb for a walkthrough about the forward diffusion process.

See annotated_traning.ipynb for a detailed guide to train a model and generate samples from the model.

Training with fashion_mnist with 20 epochs, here are a few example GIFs of the image generation process:

Read Papers to learn more:

Structured Denoising Diffusion Models in Discrete State-Spaces

Understanding Diffusion Models: A Unified Perspective

Variational Autoencoder (VAE)

Parameters

  • Joint distribution: p(x,z)p(x,z)

  • Posterior: qϕ(zx)q_\phi(z|x)

Evidence Lower Bound

logp(x)Eqϕ(zx)[logp(x,z)qϕ(zx)]=Eqϕ(zx)[logpθ(xz)]DKL(qϕ(zx)  p(z))\begin{align*} \log p(x) &\geq \mathbb{E}_{q_\phi(z|x)}[\log \frac{p(x,z)}{q_\phi(z|x)}] \\ &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \ || \ p(z)) \end{align*}
  • Eqϕ(zx)[logpθ(xz)]\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] measures the reconstruction likelihood of the decoder from the variational distribution. (Monte Carlo estimate)

  • DKL(qϕ(zx)  p(z))D_{KL}(q_\phi(z|x) \ || \ p(z)) measures how similar the learned variational distribution is to a prior belief held over latent variables. (Analytical calculation)

Objective

argmaxϕ,θEqϕ(zx)[logpθ(xz)]DKL(qϕ(zx)  p(z))argmaxϕ,θl=1L[logpθ(xz(l))]DKL(qϕ(zx)  p(z))\begin{align*} \arg\max_{\phi,\theta} \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \ || \ p(z)) \\ \approx \arg\max_{\phi,\theta} \sum^L_{l=1}[\log p_\theta(x|z^{(l)})] - D_{KL}(q_\phi(z|x) \ || \ p(z)) \end{align*}

where latents {z(l)}l=1L\{ z^{(l)} \}^L_{l=1} are sampled from qϕ(zx)q_\phi(z|x).

Hierarchical Variational Autoencoder (HVAE)

Stacking VAEs on top of each other.

Parameters

  • Joint distribution: p(x,z1:T)=p(zT)pθ(xz1)t=2Tpθ(zt1zt)p(x,z_{1:T}) = p(z_T)p_\theta(x|z_1)\prod^T_{t=2}p_\theta(z_{t-1}|z_t)

  • Posterior: qϕ(zx)=qθ(z1x)t=2Tqθ(ztzt1)q_\phi(z|x) = q_\theta(z_1|x)\prod^T_{t=2}q_\theta(z_t|z_{t-1})

Evidence Lower Bound

logp(x)Eqϕ(z1:Tx)[logp(x,z1:T)qϕ(z1:Tx)]=Eqϕ(z1:Tx)[logp(zT)pθ(xz1)t=2Tpθ(zt1zt)qθ(z1x)t=2Tqθ(ztzt1)]\begin{align*} \log p(x) &\geq \mathbb{E}_{q_\phi(z_{1:T}|x)}[\log \frac{p(x,z_{1:T})}{q_\phi(z_{1:T}|x)}] \\ &= \mathbb{E}_{q_\phi(z_{1:T}|x)}[\log \frac{p(z_T)p_\theta(x|z_1)\prod^T_{t=2}p_\theta(z_{t-1}|z_t)}{q_\theta(z_1|x)\prod^T_{t=2}q_\theta(z_t|z_{t-1})}] \end{align*}

Objective

Similar to VAE.

Variational Diffusion Models (VDM)

HVAE but with three key restrictions:

  • The latent dimension is exactly equal to the data dimension     qϕ(z1:Tx)=q(z1:Tx0)=t=1Tq(xtxt1)\implies q_\phi(z_{1:T}|x)= q(z_{1:T}|x_0) = \prod^T_{t=1}q(x_t|x_{t-1})

  • The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model     \implies The latent encoder is a Gaussian distribution centered around the output of the previous timestep     q(xtxt1)=N(xt;αtxt1,(1αt)I)\implies q(x_t|x_{t-1}) = \mathcal{N}(x_t;\sqrt{\alpha_t}x_{t-1}, (1-\alpha_t)\pmb{I})

  • The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at final timestep T is a standard Gaussian     p(xT)=N(xT;0,I)\implies p(x_T)=\mathcal{N}(x_T;0,\pmb{I}), which is pure noise

Parameters

  • Joint distribution: p(x0:T)p(x_{0:T})

  • Posterior: q(z1:Tx0)=t=1Tq(xtxt1)q(z_{1:T}|x_0) = \prod^T_{t=1}q(x_t|x_{t-1})

Evidence Lower Bound

logp(x)Eq(x1x0)[logpθ(x0x1)]DKL(q(xTx0)  p(xT))t=2TEq(xtx0)[DKL(q(xt1xt,x0)  pθ(xt1xt))]\begin{align*} \log p(x) \geq \mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)] - D_{KL}(q(x_T|x_0) \ || \ p(x_T)) \\ - \sum^T_{t=2} \mathbb{E}_{q(x_t|x_0)}[D_{KL}(q(x_{t-1}|x_t,x_0) \ || \ p_\theta(x_{t-1}|x_t))] \end{align*}
  • The first term measures the reconstruction likelihood of the decoder from the variational distribution. (Monte Carlo estimate)

  • The second term measures how close the distribution of the final nosisified input is to the standard Gaussian prior.

Note that it has no trainable parameters, and is also equal to zero under the assumptions.

  • The third term is for denoising matching. We learn desired denoising transition step pθ(xt1xt)p_\theta(x_{t-1}|x_t) as an approximation to tracable, ground-truth denoising transition step q(xt1xt,x0)q(x_{t-1}|x_t, x_0).

Note that when T=1T=1, VDM's ELBO falls back into VAE's.

Note that the denoising matching term dominates the overall optimization cost because of the summation term.

Objective

For learning a neural network to predict the original ground truth image from an arbitrarily noisified version of it, minimize the summation term of the derived ELBO objective across all noise levels, which can be approximated by minimizing the expectation over all timesteps:

argminθEtU2,T[Eq(xtx0)DKL(q(xt1xt,x0)  pθ(xt1xt))]\arg\min_\theta \mathbb{E}_{t \sim U{2,T}}[\mathbb{E}_{q(x_t|x_0)}D_{KL}(q(x_{t-1}|x_t,x_0) \ || \ p_\theta(x_{t-1}|x_t))]

which can be optimized using stochastic samples over timesteps.

For generating a novel x0x_0, sample Gaussian noise from p(xT)anditerativelyrunningthedenoisingtransitionsp(x_T) and iteratively running the denoising transitions p_\theta(x_{t-1} | x_t)$ for T steps.