$\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]$ measures the reconstruction likelihood of the decoder from the variational distribution. (Monte Carlo estimate)
$D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p(z))$ measures how similar the learned variational distribution is to a prior belief held over latent variables. (Analytical calculation)
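The split between a sampled term and a closed-form term can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the encoder is a diagonal Gaussian given by `mu` and `log_var`, the prior is a standard Gaussian, the decoder likelihood is a unit-variance Gaussian around `decode(z)`, and the function names (`kl_to_standard_normal`, `mc_reconstruction`, `elbo`) are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def mc_reconstruction(x, mu, log_var, decode, n_samples=64, rng=None):
    """Monte Carlo estimate of E_q[log p(x|z)].

    Assumes a unit-variance Gaussian decoder, p(x|z) = N(x; decode(z), I).
    """
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        # Reparameterised sample: z = mu + sigma * eps, eps ~ N(0, 1)
        z = [
            m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)
        ]
        x_hat = decode(z)
        # log N(x; x_hat, I), summed over dimensions
        total += sum(
            -0.5 * (xi - xh) ** 2 - 0.5 * math.log(2.0 * math.pi)
            for xi, xh in zip(x, x_hat)
        )
    return total / n_samples


def elbo(x, mu, log_var, decode, n_samples=64, rng=None):
    """ELBO = sampled reconstruction term - analytical KL term."""
    rec = mc_reconstruction(x, mu, log_var, decode, n_samples, rng)
    return rec - kl_to_standard_normal(mu, log_var)
```

The KL term needs no sampling because both distributions are Gaussian, while the reconstruction term depends on an arbitrary decoder and therefore has to be estimated from samples.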
The latent dimension is exactly equal to the data dimension
⟹ $q_\phi(z_{1:T} \mid x) = q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$
The structure of the latent encoder at each timestep is not learned; it is pre-defined as a linear Gaussian model
⟹ The latent encoder is a Gaussian distribution centered around the (scaled) output of the previous timestep ⟹ $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\, x_{t-1}, (1 - \alpha_t) I)$
The Gaussian parameters of the latent encoders vary over time in such a way that the distribution of the latent at the final timestep $T$ is a standard Gaussian ⟹ $p(x_T) = \mathcal{N}(x_T; 0, I)$, which is pure noise
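A minimal sketch of this fixed forward process is given below. It assumes the variance-preserving form in which the mean is scaled by $\sqrt{\alpha_t}$ and the variance is $1 - \alpha_t$, and it picks a linear schedule for $\beta_t = 1 - \alpha_t$ purely for illustration; the schedule and the helper names (`linear_alphas`, `forward_step`, `run_forward`, `alpha_bar`) are assumptions, not something the notes prescribe.

```python
import math
import random


def linear_alphas(T, beta_start=1e-4, beta_end=0.02):
    """alpha_t = 1 - beta_t for a linear beta schedule (an assumed choice)."""
    step = (beta_end - beta_start) / (T - 1)
    return [1.0 - (beta_start + step * t) for t in range(T)]


def forward_step(x_prev, alpha_t, rng):
    """Sample x_t ~ N(sqrt(alpha_t) * x_prev, (1 - alpha_t) * I)."""
    scale = math.sqrt(alpha_t)
    sigma = math.sqrt(1.0 - alpha_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


def run_forward(x0, alphas, rng):
    """Apply every encoder step in turn; nothing here is learned."""
    x = list(x0)
    for alpha_t in alphas:
        x = forward_step(x, alpha_t, rng)
    return x


def alpha_bar(alphas):
    """Product of all alpha_t; the signal left at step T scales with its square root."""
    return math.prod(alphas)
```

With a schedule like this, `alpha_bar(linear_alphas(1000))` is very close to zero, so almost none of the original signal survives to timestep $T$, and the coordinates returned by `run_forward` have roughly zero mean and unit variance regardless of the starting point.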
The first term measures the reconstruction likelihood of the decoder from the variational distribution. (Monte Carlo estimate)
The second term measures how close the distribution of the final noisified input is to the standard Gaussian prior.
Note that it has no trainable parameters and is equal to zero under the assumptions above.
The third term is the denoising matching term. We learn the desired denoising transition step $p_\theta(x_{t-1} \mid x_t)$ as an approximation to the tractable, ground-truth denoising transition step $q(x_{t-1} \mid x_t, x_0)$.
Note that when $T = 1$, the VDM's ELBO reduces to the VAE's ELBO.
Note that the denoising matching term dominates the overall optimization cost because it is summed over the intermediate timesteps.
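For reference, the three terms discussed above can be written out explicitly. The following is a sketch of the standard decomposition, assuming the Markovian forward process defined earlier, with $x_0$ denoting the data and $x_{1:T}$ the latents:

$$
\log p(x) \;\geq\;
\underbrace{\mathbb{E}_{q(x_1 \mid x_0)}\big[\log p_\theta(x_0 \mid x_1)\big]}_{\text{reconstruction term}}
\;-\;
\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big)}_{\text{prior matching term}}
\;-\;
\sum_{t=2}^{T}
\underbrace{\mathbb{E}_{q(x_t \mid x_0)}\Big[D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big)\Big]}_{\text{denoising matching term}}
$$

For $T = 1$ the sum is empty, leaving only a reconstruction term and a prior matching term, which has the same shape as the VAE objective.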
To train a neural network that predicts the original ground-truth image from an arbitrarily noisified version of it, we minimize the summation term of the derived ELBO objective across all noise levels, which can be approximated by minimizing the expectation over all timesteps: