Denoising Diffusion Models#
Denoising diffusion probabilistic models (DDPMs) [HJA20, SDWMG15] are the deep generative models underlying image generation tools like DALL-E 2 (from OpenAI) and Stable Diffusion (from Stability AI). This lecture will unpack how they work. These notes are partly inspired by Turner et al. [TDM+24].
Key Ideas#
Diffusion models work by
Using a fixed (i.e., not learned), user-defined noising process to convert data into noise.
Learning the inverse of this process so that, starting from noise, we can generate samples that approximate the data distribution.
We can think of the DDPM as a giant latent variable model, where the latent variables are noisy versions of the data. As with other latent variable models (e.g., VAEs), once we’ve inferred the latent variables, the problem of learning the mapping from latents to observed data reduces to a supervised regression problem.
DDPMs were originally proposed for modeling continuous data, \(\mbx \in \reals^{D}\). For simplicity, we will present the framework for scalar data, \(x \in \reals\), and then discuss the straightforward generalization to multidimensional data afterward. Finally, we will close with a discussion of recent work on diffusion modeling for discrete data.
Noising process#
Let \(x \equiv x_0\) be our observed data. The noising process is a joint distribution over a sequence of latent variables \(x_{0:T} = (x_0, x_1, \ldots, x_T)\). We will denote the distribution as,

$$q(x_{0:T}) = q(x_0) \prod_{t=1}^T q(x_t \mid x_{t-1}).$$
At each step, the latents will become increasingly noisy versions of the original data, until at time \(T\) the latent variable \(x_T\) is essentially pure noise. The generative model will then proceed by sampling pure noise and attempting to invert the noising process to produce samples that approximate the data generating distribution.
Gaussian noising process#
For continuous data, the standard noising process is a first-order Gaussian autoregressive (AR) process,

$$q(x_t \mid x_{t-1}) = \mathrm{N}(x_t \mid \lambda_t x_{t-1}, \sigma_t^2).$$
The hyperparameters \(\{\lambda_t, \sigma_t^2\}_{t=1}^T\) and the number of steps \(T\) are fixed (not learned). We restrict \(\lambda_t < 1\) so that the process contracts, progressively forgetting the initial condition \(x_0\) as \(t\) grows.
Since the noising process has linear Gaussian dynamics, we can compute conditional distributions in closed form.
Computing the conditional distributions
Show that the conditional distributions are,

$$q(x_t \mid x_0) = \mathrm{N}(x_t \mid \lambda_{t|0} x_0, \sigma_{t|0}^2),$$

where the parameters are defined recursively,

$$\lambda_{t|0} = \lambda_t \lambda_{t-1|0}, \qquad \sigma_{t|0}^2 = \lambda_t^2 \sigma_{t-1|0}^2 + \sigma_t^2,$$

with base case \(\lambda_{1|0}=\lambda_1\) and \(\sigma_{1|0}^2=\sigma_1^2\).
Solution
Assume the equality above holds for \(t-1\), by induction. Then,
as desired.
The first latent is distributed as \(q(x_1 \mid x_0) = \mathrm{N}(x_1 \mid \lambda_1 x_0, \sigma_1^2)\), which matches the base case above.
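The recursion above is easy to check numerically. The following sketch (with an arbitrary, illustrative schedule of \(\lambda_t\) and \(\sigma_t^2\) values) compares the closed-form marginal parameters \(\lambda_{T|0}, \sigma_{T|0}^2\) against a Monte Carlo simulation of the AR(1) noising process:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
lams = 0.99 * np.ones(T)      # illustrative schedule (not from the notes)
sigs2 = 0.02 * np.ones(T)

# Closed-form marginal parameters via the recursion
lam_t0, sig2_t0 = 1.0, 0.0
for t in range(T):
    lam_t0 = lams[t] * lam_t0
    sig2_t0 = lams[t] ** 2 * sig2_t0 + sigs2[t]

# Monte Carlo: run the AR(1) noising process from a fixed x0 = 2
x = np.full(100_000, 2.0)
for t in range(T):
    x = lams[t] * x + np.sqrt(sigs2[t]) * rng.normal(size=x.shape)

# Simulated marginal mean/variance should match lambda_{T|0} x0 and sigma^2_{T|0}
print(np.isclose(x.mean(), lam_t0 * 2.0, atol=0.02))
print(np.isclose(x.var(), sig2_t0, atol=0.02))
```

The same check works for any fixed schedule; only the hyperparameter arrays change.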
Variance preserving diffusions#
It is common to set,

$$\sigma_t^2 = 1 - \lambda_t^2,$$

in which case the conditional variance simplifies to,

$$\sigma_{t|0}^2 = 1 - \lambda_{t|0}^2.$$
Under this setting, the noising process preserves the variance of the marginal distributions. If \(\bbE[x_0] = 0\) and \(\Var[x_0] = 1\), then the marginal distribution of \(x_t\) will be zero mean and unit variance as well.
Consider the following two limits:
As \(T \to \infty\), the conditional distribution goes to a standard normal, \(q(x_T \mid x_0) \to \mathrm{N}(0, 1)\), which makes the marginal distribution \(q(x_T)\) easy to sample from.
When \(\lambda_t \to 1\), the noising process adds infinitesimal noise so that \(x_t \approx x_{t-1}\), which makes the inverse process easier to learn.
Of course, these two limits are in conflict with one another. If we add a small amount of noise at each time step, the inverse process is easier to learn, but we need to take many time steps to converge to a Gaussian stationary distribution.
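To see the variance-preserving property concretely, here is a small simulation under an assumed constant schedule \(\lambda_t = \lambda\) with \(\sigma_t^2 = 1 - \lambda^2\); starting from unit-variance data, the marginal variance stays at one throughout:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
lam = np.sqrt(1 - 0.005)          # constant lambda_t, sigma_t^2 = 1 - lambda_t^2

x = rng.normal(size=200_000)      # x0 ~ N(0, 1): zero mean, unit variance
for _ in range(T):
    x = lam * x + np.sqrt(1 - lam**2) * rng.normal(size=x.shape)

# The marginal variance is preserved at every step
print(np.isclose(x.var(), 1.0, atol=0.02))
```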
Generative process#
The generative process is a parametric model that learns to invert the noising process,

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=0}^{T-1} p_\theta(x_t \mid x_{t+1}).$$
The initial distribution \(p(x_T)\) has no parameters because it is set to the stationary distribution of the noising process, \(q(x_\infty)\). E.g., for the Gaussian noising process above, \(p(x_T) = \mathrm{N}(0,1)\).
Evidence Lower Bound#
Like the other latent variable models we studied in this course, we will estimate the parameters by maximizing an evidence lower bound (ELBO),

$$\log p_\theta(x_0) \geq \bbE_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_{0:T}) - \log q(x_{1:T} \mid x_0)\right],$$
where \(q(x_{1:T} \mid x_0)\) is the conditional distribution of \(x_{1:T}\) under the noising process. Since there are no learnable parameters in the noising process, the objective simplifies to maximizing the expected log likelihood.
We can simplify further by expanding the log probability of the generative model,

$$\log p_\theta(x_{0:T}) = \log p(x_T) + \sum_{t=0}^{T-1} \log p_\theta(x_t \mid x_{t+1}),$$
which only depends on pairwise conditionals.
Gaussian generative process#
Since the noising process above adds a small amount of Gaussian noise at each step, it is reasonable to model the generative process as Gaussian as well,

$$p_\theta(x_t \mid x_{t+1}) = \mathrm{N}\left(x_t \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}_t^2\right),$$
where \(\mu_\theta: \reals \times [0,T] \to \reals\) is a nonlinear mean function that should denoise \(x_{t+1}\) to obtain the expected value of \(x_t\), and \(\widetilde{\sigma}_t^2\) is a fixed variance for the generative process.
Parameter sharing
Rather than learn a separate function for each time point, it is common to parameterize the mean function as a function of both the state \(x_{t+1}\) and the time \(t\). For example, \(\mu_\theta(\cdot, \cdot)\) can be a neural network that takes in the state and a positional embedding of the time \(t\), like the sinusoidal embeddings used in transformers.
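A minimal sketch of such a time embedding, assuming the transformer-style sinusoidal form (the dimension and `max_period` values are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def time_embedding(t, dim=16, max_period=10_000):
    """Sinusoidal embedding of a scalar time t, transformer-style:
    geometrically spaced frequencies, concatenated sin and cos features."""
    freqs = np.exp(-np.log(max_period) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# A single shared network can then take (x_{t+1}, time_embedding(t)) as input,
# so one set of parameters handles all T time steps.
emb = time_embedding(10)
print(emb.shape)  # (16,)
```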
Generative process variance
You could try to learn the generative process variance as a function of \(x_{t+1}\) and \(t\) as well, but the literature suggests this is difficult to make work in practice. Instead, it is common to set the variance to either
\(\widetilde{\sigma}_t^2 = \sigma_t^2 = 1-\lambda_t^2\), the conditional variance in the noising process, which tends to overestimate the conditional variance of the true generative process; or
\(\widetilde{\sigma}_t^2 = \Var_q[x_t \mid x_0, x_{t+1}]\), the conditional variance of the noising process given the data \(x_0\) and the next state \(x_{t+1}\). This tends to underestimate the conditional variance of the true generative process.
Rao-Blackwellization#
Under this Gaussian model for the generative process, we can analytically compute one of the expectations in the ELBO. This is called Rao-Blackwellization. It reduces the variance of the objective, which is good for SGD!
Using the chain rule and the Gaussian generative model, we have,

$$\bbE_{q(x_{1:T} \mid x_0)}\left[\log p_\theta(x_t \mid x_{t+1})\right] = \bbE_{q(x_{t+1} \mid x_0)}\Big[\bbE_{q(x_t \mid x_0, x_{t+1})}\left[\log p_\theta(x_t \mid x_{t+1})\right]\Big].$$
We already computed the conditional distribution \(q(x_{t+1} \mid x_0) = \mathrm{N}(x_{t+1} \mid \lambda_{t+1|0} x_0, \sigma_{t+1|0}^2)\) above. It turns out the second term is Gaussian as well!
Conditionals of a Gaussian noising process
Show that

$$q(x_t \mid x_0, x_{t+1}) = \mathrm{N}\left(x_t \mid \mu_{t|t+1,0}, \sigma_{t|t+1,0}^2\right), \qquad \sigma_{t|t+1,0}^2 = \frac{\sigma_{t|0}^2 \sigma_{t+1}^2}{\sigma_{t+1|0}^2},$$

where

$$\mu_{t|t+1,0} = a_t x_0 + b_t x_{t+1}$$

is a linear combination of \(x_0\) and \(x_{t+1}\) with weights,

$$a_t = \frac{\lambda_{t|0} \sigma_{t+1}^2}{\sigma_{t+1|0}^2}, \qquad b_t = \frac{\lambda_{t+1} \sigma_{t|0}^2}{\sigma_{t+1|0}^2}.$$
Solution
By Bayes rule and the Markovian nature of the noising process,
where, by completing the square,
The forms for \(a_t\) and \(b_t\) can now be read off.
Finally, to simplify the objective we need the Gaussian cross-entropy,
Gaussian cross-entropy
Let \(q(x) = \mathrm{N}(x \mid \mu_q, \sigma_q^2)\) and \(p(x) = \mathrm{N}(x \mid \mu_p, \sigma_p^2)\). Show that,

$$\bbE_{q}\left[\log p(x)\right] = -\frac{1}{2}\log 2\pi \sigma_p^2 - \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2 \sigma_p^2}.$$
Putting it all together,

$$\mathcal{L}(\theta) = -\sum_{t=0}^{T-1} \frac{1}{2\widetilde{\sigma}_t^2} \, \bbE_{q}\left[\big(a_t x_0 + b_t x_{t+1} - \mu_\theta(x_{t+1}, t)\big)^2\right] + c,$$
where we have absorbed terms that are independent of \(\theta\) into the constant \(c\).
Denoising mean function#
The loss function above suggests a particular form of the mean function,

$$\mu_\theta(x_{t+1}, t) = a_t \hat{x}_0(x_{t+1}, t; \theta) + b_t x_{t+1},$$
where the only part that is learned is \(\hat{x}_0(x_{t+1}, t; \theta)\), a function that attempts to denoise the current state. Since \(x_{t+1}\) is given and \(a_t\) and \(b_t\) are determined solely by the hyperparameters, we can use them in the mean function.
Under this parameterization, the loss function reduces to,

$$\mathcal{L}(\theta) = -\sum_{t=0}^{T-1} \frac{a_t^2}{2\widetilde{\sigma}_t^2} \, \bbE_{q}\left[\big(\hat{x}_0(x_{t+1}, t; \theta) - x_0\big)^2\right] + c.$$
One nice thing about this formulation is that the mean function is always outputting “the same thing” — an estimate of the completely denoised data, \(\hat{x}_0\), regardless of the time \(t\).
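The resulting training objective is just a regression loss, estimated by Monte Carlo: sample a time step, sample a noisy state from \(q(x_{t+1} \mid x_0)\) using the closed-form marginal, and penalize the squared denoising error. A sketch, assuming a constant variance-preserving schedule and a placeholder linear "network" (both illustrative, not from the notes); the per-step weights \(a_t^2 / 2\widetilde{\sigma}_t^2\) are dropped here, as is common in practice:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
lam = np.sqrt(1 - 5.0 / T)    # assumed constant variance-preserving schedule

def marginal(t):
    """Closed-form lambda_{t|0} and sigma^2_{t|0} under this schedule."""
    lam_t0 = lam ** t
    return lam_t0, 1 - lam_t0 ** 2

def denoiser(x_t1, t, theta):
    """Placeholder x_hat_0 'network': a single learned rescaling.
    A real model would be a neural net taking a time embedding of t."""
    return theta * x_t1

def loss(theta, x0_batch):
    """Unweighted Monte Carlo estimate of the denoising regression loss."""
    t = rng.integers(0, T, size=x0_batch.shape)       # random time steps
    lam_t0, sig2_t0 = marginal(t + 1)
    x_t1 = lam_t0 * x0_batch + np.sqrt(sig2_t0) * rng.normal(size=x0_batch.shape)
    return np.mean((denoiser(x_t1, t, theta) - x0_batch) ** 2)

x0 = rng.normal(size=10_000)
print(loss(0.5, x0) >= 0)     # a valid squared-error objective
```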
Inverting the noising process#
The generative process attempts to invert the noising process, but what is the actual inverse of the process? Since the noising process is a Markov chain, the reverse of the noising process must be Markovian as well. That is,
for some sequence of transition distributions \(q(x_t \mid x_{t+1})\). We can obtain those transition distributions by marginalizing and conditioning,
Using Bayes’ rule,
Now recall that \(q(x_0) = \frac{1}{n} \sum_{i=1}^n \delta_{x_0^{(i)}}(x_0)\) is the empirical measure of the data \(\{x_0^{(i)}\}_{i=1}^n\). Using this fact, the conditional is,

$$q(x_0 \mid x_{t+1}) = \sum_{i=1}^n w_i(x_{t+1}) \, \delta_{x_0^{(i)}}(x_0), \qquad w_i(x_{t+1}) = \frac{\mathrm{N}(x_{t+1} \mid \lambda_{t+1|0} x_0^{(i)}, \sigma_{t+1|0}^2)}{\sum_{j=1}^n \mathrm{N}(x_{t+1} \mid \lambda_{t+1|0} x_0^{(j)}, \sigma_{t+1|0}^2)},$$

where we have defined the weights \(w_i(x_{t+1})\) for each data point \(i=1,\ldots,n\). They are non-negative and sum to one.
Finally, we can give a simpler form for the optimal generative process,

$$q(x_t \mid x_{t+1}) = \sum_{i=1}^n w_i(x_{t+1}) \, \mathrm{N}\left(x_t \mid a_t x_0^{(i)} + b_t x_{t+1}, \sigma_{t|t+1,0}^2\right),$$
which we recognize as a mixture of Gaussians, all with the same variance, with means biased toward each of the \(n\) data points, and weighted by the relative likelihood of \(x_0^{(i)}\) having produced \(x_{t+1}\).
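For a tiny empirical dataset, this exact reverse transition is easy to compute. The sketch below (with made-up values for the schedule parameters and the current state, purely for illustration) forms the posterior weights \(w_i(x_{t+1})\) and the per-component means \(a_t x_0^{(i)} + b_t x_{t+1}\):

```python
import numpy as np

data = np.array([-2.0, 0.5, 3.0])         # toy empirical dataset {x0^(i)}

# Assumed noising parameters at one step t (illustrative values)
lam_t0, sig2_t0 = 0.8, 1 - 0.8**2         # q(x_t | x0)
lam_t1, sig2_t1 = 0.95, 1 - 0.95**2       # q(x_{t+1} | x_t)
lam_t10 = lam_t1 * lam_t0                 # lambda_{t+1|0}
sig2_t10 = lam_t1**2 * sig2_t0 + sig2_t1  # sigma^2_{t+1|0}

x_t1 = 0.7                                # current noisy state x_{t+1}

# Posterior weights w_i(x_{t+1}), proportional to N(x_{t+1} | lam_{t+1|0} x0^(i), sig2_{t+1|0})
logw = -0.5 * (x_t1 - lam_t10 * data) ** 2 / sig2_t10
w = np.exp(logw - logw.max())
w /= w.sum()

# Per-component means a_t x0^(i) + b_t x_{t+1} of the Gaussian mixture
a = lam_t0 * sig2_t1 / sig2_t10
b = lam_t1 * sig2_t0 / sig2_t10
means = a * data + b * x_t1

print(np.isclose(w.sum(), 1.0))           # weights form a proper distribution
print(w @ means)                          # mean of the exact reverse transition
```

The weights concentrate on data points whose noised versions are most likely to have produced \(x_{t+1}\), exactly as described above.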
For small step sizes, that mixture of Gaussians can be approximated by a single Gaussian with mean equal to the expected value of the mixture,
For small steps, this expected value is approximately,
Derivation
Using the definitions from above,
we can write
Now add and subtract \(\frac{\sigma_{t|t+1,0}^2 x_{t+1}}{\sigma_{t|0}^2}\) to obtain,
In the limit of many small steps where \(\sigma_{t+1} \to 0\) so that \(\sigma^2_{t+1} \ll \sigma^2_{t|0}\), we have
and
These approximations complete the derivation.
Though it’s not immediately obvious, the second term in the expectation is related to the marginal probability,
Specifically, the second term is the Stein score function of the marginal probability,
Final form
Putting it all together, for small steps, the reverse process is approximately Gaussian with mean and variance,

$$q(x_t \mid x_{t+1}) \approx \mathrm{N}\left(x_t \,\bigg|\, \frac{x_{t+1}}{\lambda_{t+1}} + \frac{\sigma_{t+1}^2}{\lambda_{t+1}} \frac{\partial}{\partial x_{t+1}} \log q(x_{t+1}), \; \sigma_{t+1}^2\right).$$
This has a nice interpretation: to invert the noise process, first undo the contraction and then take a step in the direction of the Stein score!
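This score-based reverse step can be sanity-checked in a case where the score is known in closed form. If the data are \(\mathrm{N}(0,1)\) and the schedule is variance preserving, every marginal \(q(x_t)\) is \(\mathrm{N}(0,1)\), whose score is simply \(-x\); ancestral sampling with that score should then keep the samples at zero mean and unit variance. A sketch with an assumed constant schedule:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 500
lam = np.sqrt(1 - 5.0 / T)        # assumed constant variance-preserving schedule
sig = np.sqrt(1 - lam**2)

def score(x):
    # Illustrative case: N(0,1) data makes every marginal N(0,1),
    # whose Stein score is -x.
    return -x

# Ancestral sampling: undo the contraction, step along the score, add noise
x = rng.normal(size=100_000)      # x_T ~ N(0, 1)
for _ in range(T):
    x = x / lam + (sig**2 / lam) * score(x) + sig * rng.normal(size=x.shape)

print(np.isclose(x.mean(), 0.0, atol=0.02), np.isclose(x.var(), 1.0, atol=0.02))
```

With \(\text{score}(x) = -x\) the update reduces to \(x \mapsto \lambda x + \sigma \varepsilon\), which preserves \(\mathrm{N}(0,1)\) exactly, so the check passes by construction.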
Continuous time limit#
In practice, the best performing diffusion models are based on a continuous-time formulation of the noising process as an SDE [SSDK+20]. To motivate this approach, think of the noise process above as a discretization of a continuous process \(x(t)\) for \(t \in [0,1]\) with time steps of size \(\Delta = \tfrac{1}{T}\). That is, map \(x_i \mapsto x(i/T)\), \(\lambda_i \mapsto \lambda(i/T)\), and \(\sigma_i \mapsto \sigma(i/T)\) for \(i=0,1,\ldots, T\). Then the discrete model can be rewritten as,
or equivalently,
We can view this as a discretization of the SDE,

$$\dif x = f(x, t) \, \dif t + g(t) \, \dif W,$$
where \(f(x,t)\) is the drift term, \(g(t)\) is the diffusion coefficient, and \(\dif W\) denotes standard Brownian motion.
The reverse (generative) process can be cast as an SDE as well! Following our derivation of the inverse process above, we can show that the reverse process is,

$$\dif x = \left[f(x, t) - g(t)^2 \, \frac{\partial}{\partial x} \log q_t(x)\right] \dif t + g(t) \, \dif W,$$
where \(\dif t\) is a negative time increment and \(\dif W\) is Brownian motion run in reverse time.
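A reverse-SDE sampler can be sketched with Euler–Maruyama integration. As in the discrete check above, take \(\mathrm{N}(0,1)\) data so the score is \(-x\) at every time, and use the variance-preserving drift \(f(x,t) = -\tfrac{1}{2}\beta x\), \(g(t) = \sqrt{\beta}\) with an assumed constant rate \(\beta\) (the constant rate is an illustrative simplification):

```python
import numpy as np

rng = np.random.default_rng(5)
beta = 5.0                       # assumed constant diffusion rate beta(t) = beta
n_steps = 1000
dt = 1.0 / n_steps               # integrate t from 1 down to 0

def score(x, t):
    # Illustrative case: N(0,1) data keeps every marginal at N(0,1),
    # so the score is -x at all times t.
    return -x

# Euler-Maruyama on the reverse SDE dx = [f - g^2 * score] dt + g dW,
# stepped with a negative time increment -dt.
x = rng.normal(size=100_000)     # start from the stationary distribution
for k in range(n_steps):
    t = 1.0 - k * dt
    drift = -0.5 * beta * x - beta * score(x, t)
    x = x + drift * (-dt) + np.sqrt(beta * dt) * rng.normal(size=x.shape)

print(np.isclose(x.var(), 1.0, atol=0.03))
```

In this special case the reverse dynamics leave \(\mathrm{N}(0,1)\) (approximately) invariant, so the variance check confirms the discretization is consistent; with real data, `score` would be replaced by a learned approximation.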
Multidimensional models#
Very few things need to change in order to apply this idea to multidimensional data \(\mbx_0 \in \reals^D\). The standard setup is to apply a Gaussian noising process to each coordinate \(x_{0,d}\) independently. Then, in the generative model,

$$p_\theta(\mbx_t \mid \mbx_{t+1}) = \prod_{d=1}^D \mathrm{N}\left(x_{t,d} \mid \mu_{\theta,d}(\mbx_{t+1}, t), \widetilde{\sigma}_t^2\right).$$
The generative process still produces a factored distribution, but we need a separate mean function for each coordinate. Moreover, the mean function needs to consider the entire state \(\mbx_{t+1}\). The reason is that \(x_{t,d}\) is not conditionally independent of \(x_{t+1,d'}\) given \(x_{t+1,d}\); the coordinates are coupled in the inverse process since all of \(\mbx_{t+1}\) provides information about the \(\mbx_0\) that generated it.
Conclusion#
There’s a lot we didn’t cover. The Stein score function that appeared in the inverse of the noising process allows for connections between denoising score matching [SE19] and denoising diffusion models.
Another important topic is conditional generation. Suppose we want to take in text and spit out images, like DALL-E 2 or Stable Diffusion. One way to do so is using a diffusion model, but to steer the reverse diffusion based on the text prompt. So rather than just following the score function, the reverse process is also biased toward outputs that match the prompt.
Finally, this class was nominally about models for discrete data, but this lecture has focused on continuous diffusions. There has been recent work on discrete denoising diffusion models [CBDB+22], which we’ll have to cover another time!