
Denoising Diffusion Models

Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are the deep generative models underlying image generation tools like DALL-E 2 (from OpenAI) and Stable Diffusion (from Stability AI). This lecture will unpack how they work. These notes are partly inspired by Turner (2024).

Key Ideas

Diffusion models work by

  1. Using a fixed (i.e., not learned), user-defined noising process to convert data into noise.

  2. Learning to invert this process so that, starting from noise, we can generate samples that approximate the data distribution.

We can think of the DDPM as a giant latent variable model, where the latent variables are noisy versions of the data. As with other latent variable models (e.g., VAEs), once we’ve inferred the latent variables, the problem of learning the mapping from latents to observed data reduces to a supervised regression problem.

DDPMs were originally proposed for modeling continuous data, $\mbx \in \reals^{D}$. For simplicity, we will present the framework for scalar data, $x \in \reals$, and then discuss the straightforward generalization to multidimensional data afterward. Finally, we will close with a discussion of recent work on diffusion modeling for discrete data.

Noising process

Let $x \equiv x_0$ be our observed data. The noising process is a joint distribution over a sequence of latent variables $x_{0:T} = (x_0, x_1, \ldots, x_T)$. We will denote the distribution as,

\begin{align*}
q(x_{0:T}) = q(x_0) \prod_{t=1}^T q(x_t \mid x_{t-1}).
\end{align*}

At each step, the latents become increasingly noisy versions of the original data, until at time $T$ the latent variable $x_T$ is essentially pure noise. The generative model will then proceed by sampling pure noise and attempting to invert the noising process to produce samples that approximate the data generating distribution.

Gaussian noising process

For continuous data, the standard noising process is a first-order Gaussian autoregressive (AR) process,

\begin{align*}
q(x_t \mid x_{t-1}) &= \mathrm{N}(x_t \mid \lambda_t x_{t-1}, \sigma_{t}^2).
\end{align*}

The hyperparameters $\{\lambda_t, \sigma_t^2\}_{t=1}^T$ and the number of steps $T$ are fixed (not learned). We restrict $\lambda_t < 1$ so that the process contracts toward zero.

Since the noising process has linear Gaussian dynamics, we can compute conditional distributions in closed form. In particular, composing the transitions gives

\begin{align*}
q(x_t \mid x_0) &= \mathrm{N}(x_t \mid \lambda_{t|0} x_0, \sigma_{t|0}^2), &
\lambda_{t|0} &= \prod_{s=1}^t \lambda_s, &
\sigma_{t|0}^2 &= \sum_{s=1}^t \left( \prod_{r=s+1}^t \lambda_r^2 \right) \sigma_s^2.
\end{align*}

Variance preserving diffusions

It is common to set,

\begin{align*}
\sigma_t^2 &= 1-\lambda_t^2,
\end{align*}

in which case the conditional variance simplifies to,

\begin{align*}
\sigma_{t|0}^2 = 1 - \prod_{s=1}^t \lambda_s^2 = 1 - \lambda_{t|0}^2.
\end{align*}

Under this setting, the noising process preserves the variance of the marginal distributions. If $\bbE[x_0] = 0$ and $\Var[x_0] = 1$, then the marginal distribution of $x_t$ will be zero mean and unit variance as well.
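
To make this concrete, here is a minimal NumPy sketch (not part of the original notes) of the variance-preserving noising process for scalar data. The schedule `lambdas` and the toy data distribution are illustrative assumptions; the check is that the marginal of $x_T$ ends up close to a standard normal.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# Illustrative schedule with lambda_t slightly below 1 (variance-preserving: sigma_t^2 = 1 - lambda_t^2).
lambdas = np.sqrt(1.0 - np.linspace(1e-4, 2e-2, T))
sigmas = np.sqrt(1.0 - lambdas**2)

# Toy data: a two-component Gaussian mixture with zero mean and unit variance.
n = 10_000
x0 = np.where(rng.random(n) < 0.5, -1.0, 1.0) * np.sqrt(0.5) + rng.normal(0, np.sqrt(0.5), n)

# Run the forward (noising) process x_t = lambda_t * x_{t-1} + sigma_t * eps_t.
x = x0.copy()
for lam, sig in zip(lambdas, sigmas):
    x = lam * x + sig * rng.normal(size=n)

print(np.mean(x), np.var(x))   # approximately 0 and 1: x_T is close to N(0, 1)
print(np.prod(lambdas))        # lambda_{T|0} is essentially 0, so q(x_T | x_0) ~= N(0, 1)
```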

Consider the following two limits:

  1. As $T \to \infty$, the conditional distribution goes to a standard normal, $q(x_T \mid x_0) \to \mathrm{N}(0, 1)$, which makes the marginal distribution $q(x_T)$ easy to sample from.

  2. When $\lambda_t \to 1$, the noising process adds infinitesimal noise so that $x_t \approx x_{t-1}$, which makes the inverse process easier to learn.

Of course, these two limits are in conflict with one another. If we add a small amount of noise at each time step, the inverse process is easier to learn, but we need to take many time steps to converge to a Gaussian stationary distribution.

Generative process

The generative process is a parametric model that learns to invert the noising process,

\begin{align*}
p(x_{0:T}; \theta) &= p(x_T) \prod_{t=T-1}^0 p(x_t \mid x_{t+1}; \theta).
\end{align*}

The initial distribution $p(x_T)$ has no parameters because it is set to the stationary distribution of the noising process, $q(x_\infty)$. E.g., for the Gaussian noising process above, $p(x_T) = \mathrm{N}(0,1)$.

Evidence Lower Bound

Like the other latent variable models we studied in this course, we will estimate the parameters by maximizing an evidence lower bound (ELBO),

\begin{align*}
\cL(\theta) &= \E_{q(x_0)} \E_{q(x_{1:T} \mid x_0)} \left[ \log p(x_{0:T}; \theta) - \log q(x_{1:T} \mid x_0) \right] \\
&= \E_{q(x_0)} \E_{q(x_{1:T} \mid x_0)} \left[ \log p(x_{0:T}; \theta) \right] + c,
\end{align*}

where $q(x_{1:T} \mid x_0)$ is the conditional distribution of $x_{1:T}$ under the noising process. Since there are no learnable parameters in the noising process, the objective simplifies to maximizing the expected log likelihood.

We can simplify further by expanding the log probability of the generative model,

\begin{align*}
\cL(\theta) &= \E_{q(x_0)} \sum_{t=0}^{T-1} \E_{q(x_t, x_{t+1} \mid x_0)} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] \\
&\propto \E_{q(x_0)} \E_{t \sim \mathrm{Unif}(0,T-1)} \E_{q(x_t, x_{t+1} \mid x_0)} \left[ \log p(x_t \mid x_{t+1}; \theta) \right]
\end{align*}

which only depends on pairwise conditionals.

Gaussian generative process

Since the noising process above adds a small amount of Gaussian noise at each step, it is reasonable to model the generative process as Gaussian as well,

\begin{align*}
p(x_t \mid x_{t+1}; \theta) &= \mathrm{N}(x_t \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}_t^2)
\end{align*}

where $\mu_\theta: \reals \times [0,T] \mapsto \reals$ is a nonlinear mean function that should denoise $x_{t+1}$ to obtain the expected value of $x_t$, and $\widetilde{\sigma}_t^2$ is a fixed variance for the generative process.
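
For intuition, here is a sketch of how sampling from this Gaussian generative process would proceed, assuming we already have some trained mean function `mu_theta(x, t)` (hypothetical; any regression model would do) and fixed variances `sigma_tilde_sq`. It is just ancestral sampling of the reverse chain, starting from the stationary distribution.

```python
import numpy as np

def sample_reverse(mu_theta, sigma_tilde_sq, T, n_samples, seed=0):
    """Ancestral sampling of the generative process p(x_{0:T}; theta).

    mu_theta: callable (x_next, t) -> mean of p(x_t | x_{t+1}); assumed already trained.
    sigma_tilde_sq: length-T array of fixed reverse-process variances.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_samples)              # x_T ~ N(0, 1), the stationary distribution
    for t in range(T - 1, -1, -1):              # t = T-1, ..., 0
        mean = mu_theta(x, t)                   # denoise one step
        x = mean + np.sqrt(sigma_tilde_sq[t]) * rng.normal(size=n_samples)
    return x                                    # approximate samples of x_0
```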

Rao-Blackwellization

Under this Gaussian model for the generative process, we can analytically compute one of the expectations in the ELBO. This is called Rao-Blackwellization. It reduces the variance of the objective, which is good for SGD!

Using the chain rule and the Gaussian generative model, we have,

\begin{align*}
\E_{q(x_t, x_{t+1} \mid x_0)} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] &= \E_{q(x_{t+1} \mid x_0)} \E_{q(x_t \mid x_{t+1}, x_0)} \left[\log \mathrm{N}(x_t \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}_t^2) \right]
\end{align*}

We already computed the conditional distribution $q(x_{t+1} \mid x_0) = \mathrm{N}(x_{t+1} \mid \lambda_{t+1|0} x_0, \sigma_{t+1|0}^2)$ above. It turns out the second term is Gaussian as well! By standard Gaussian conditioning,

\begin{align*}
q(x_t \mid x_{t+1}, x_0) &= \mathrm{N}(x_t \mid a_t x_0 + b_t x_{t+1}, \sigma_{t|t+1,0}^2),
\end{align*}

where

\begin{align*}
a_t &= \frac{\lambda_{t|0} \sigma_{t+1}^2}{\lambda_{t+1}^2 \sigma_{t|0}^2 + \sigma_{t+1}^2}, &
b_t &= \frac{\lambda_{t+1} \sigma_{t|0}^2}{\lambda_{t+1}^2 \sigma_{t|0}^2 + \sigma_{t+1}^2}, &
\sigma_{t|t+1,0}^2 &= \frac{\sigma_{t|0}^2 \sigma_{t+1}^2}{\lambda_{t+1}^2 \sigma_{t|0}^2 + \sigma_{t+1}^2}.
\end{align*}

Finally, to simplify the objective we need the Gaussian cross-entropy,

\begin{align*}
\E_{\mathrm{N}(x \mid \mu_1, \sigma_1^2)} \left[ \log \mathrm{N}(x \mid \mu_2, \sigma_2^2) \right] &= \log \mathrm{N}(\mu_1 \mid \mu_2, \sigma_2^2) - \frac{1}{2} \frac{\sigma_1^2}{\sigma_2^2}.
\end{align*}

Putting it all together,

\begin{align*}
\cL(\theta) &= \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \E_{q(x_t \mid x_0, x_{t+1})} \left[ \log p(x_t \mid x_{t+1}; \theta) \right] \\
&= \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \left[ \log \mathrm{N}(a_t x_0 + b_t x_{t+1} \mid \mu_\theta(x_{t+1}, t), \widetilde{\sigma}_t^2) - \frac{1}{2} \frac{\sigma_{t|t+1,0}^2}{\widetilde{\sigma}_t^2} \right] \\
&= -\frac{1}{2} \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \left[ \frac{1}{\widetilde{\sigma}_t^2} \left( a_t x_0 + b_t x_{t+1} - \mu_\theta(x_{t+1}, t) \right)^2 \right] + c
\end{align*}

where we have absorbed terms that are independent of θ\theta into the constant cc.

Denoising mean function

The loss function above suggests a particular form of the mean function,

\begin{align*}
\mu_\theta(x_{t+1}, t) &= a_t \hat{x}_0(x_{t+1}, t; \theta) + b_t x_{t+1},
\end{align*}

where the only part that is learned is $\hat{x}_0(x_{t+1}, t; \theta)$, a function that attempts to denoise the current state. Since $x_{t+1}$ is given and $a_t$ and $b_t$ are determined solely by the hyperparameters, we can use them in the mean function.

Under this parameterization, the loss function reduces to,

\begin{align*}
\cL(\theta) &= -\frac{1}{2} \E_{q(x_0)} \E_t \E_{q(x_{t+1} \mid x_0)} \left[ \frac{a_t^2}{\widetilde{\sigma}_t^2} \left(x_0 - \hat{x}_0(x_{t+1}, t; \theta) \right)^2 \right] + c
\end{align*}

One nice thing about this formulation is that the mean function is always outputting “the same thing” — an estimate of the completely denoised data, $\hat{x}_0$, regardless of the time $t$.
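
The resulting training procedure is simple: pick a random time, corrupt the data with the closed-form conditional $q(x_{t+1} \mid x_0)$, and regress the denoiser's output onto $x_0$. The sketch below (an illustration, not the notes' code) computes a Monte Carlo estimate of the squared-error objective for a hypothetical denoiser `x0_hat`; the per-step weight $a_t^2/\widetilde{\sigma}_t^2$ is often dropped in practice, giving the "simple" objective of Ho et al. (2020).

```python
import numpy as np

def diffusion_loss(x0_hat, theta, x0_batch, lambdas, rng):
    """One Monte Carlo sample of the denoising objective (with the per-step weight dropped).

    x0_hat: callable (x_noisy, t, theta) -> estimate of x_0 (hypothetical learned model).
    lambdas: forward-process schedule of length T (variance-preserving, sigma_t^2 = 1 - lambda_t^2).
    """
    T = len(lambdas)
    n = len(x0_batch)
    t = rng.integers(0, T, size=n)                # t ~ Uniform{0, ..., T-1}
    # Closed-form conditional q(x_{t+1} | x_0) = N(lambda_{t+1|0} x_0, sigma_{t+1|0}^2).
    lam_cum = np.cumprod(lambdas)                 # lambda_{s|0} for s = 1, ..., T
    lam_t1_0 = lam_cum[t]                         # lambda_{t+1|0}
    sig_t1_0 = np.sqrt(1.0 - lam_t1_0**2)         # sigma_{t+1|0}
    x_noisy = lam_t1_0 * x0_batch + sig_t1_0 * rng.normal(size=n)
    # Regress the denoiser's estimate onto the clean data.
    return np.mean((x0_batch - x0_hat(x_noisy, t, theta)) ** 2)
```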

Inverting the noising process

The generative process attempts to invert the noising process, but what is the actual inverse of the process? Since the noising process is a Markov chain, the reverse of the noising process must be Markovian as well. That is,

\begin{align*}
q(x_{0:T}) &= q(x_T) \prod_{t=T-1}^0 q(x_t \mid x_{t+1})
\end{align*}

for some sequence of transition distributions $q(x_t \mid x_{t+1})$. We can obtain those transition distributions by marginalizing and conditioning,

\begin{align*}
q(x_t \mid x_{t+1}) &= \int q(x_0, x_t \mid x_{t+1}) \dif x_0 \\
&= \int q(x_t \mid x_0, x_{t+1}) \, q(x_0 \mid x_{t+1}) \dif x_0.
\end{align*}

Using Bayes’ rule,

\begin{align*}
q(x_0 \mid x_{t+1}) &= \frac{q(x_0) \, q(x_{t+1} \mid x_0)}{\int q(x_0') q(x_{t+1} \mid x_0') \dif x_0'}
\end{align*}

Now recall that $q(x_0) = \frac{1}{n} \sum_{i=1}^n \delta_{x_0^{(i)}}(x_0)$ is the empirical measure of the data $\{x_0^{(i)}\}_{i=1}^n$. Using this fact, the conditional is,

\begin{align*}
q(x_0^{(i)} \mid x_{t+1}) &= \frac{q(x_{t+1} \mid x_0^{(i)})}{\sum_{j=1}^n q(x_{t+1} \mid x_0^{(j)})} \triangleq w_i(x_{t+1}),
\end{align*}

where we have defined the weights $w_i(x_{t+1})$ for each data point $i=1,\ldots,n$. They are non-negative and sum to one.

Finally, we can give a simpler form for the optimal generative process,

\begin{align*}
q(x_t \mid x_{t+1}) &= \sum_{i=1}^n w_i(x_{t+1}) \, q(x_t \mid x_0^{(i)}, x_{t+1}) \\
&= \sum_{i=1}^n w_i(x_{t+1}) \, \mathrm{N}(x_t \mid a_t x_0^{(i)} + b_t x_{t+1}, \sigma_{t|t+1,0}^2),
\end{align*}

which we recognize as a mixture of Gaussians, all with the same variance, with means biased toward each of the $n$ data points, and weighted by the relative likelihood of $x_0^{(i)}$ having produced $x_{t+1}$.
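
To see what this mixture looks like, the sketch below computes the weights $w_i(x_{t+1})$ for a tiny dataset using $q(x_{t+1} \mid x_0^{(i)}) = \mathrm{N}(\lambda_{t+1|0} x_0^{(i)}, \sigma_{t+1|0}^2)$. The helper function and the toy numbers are illustrative assumptions, not part of the notes.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log density of N(x | mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def reverse_weights(x_next, data, lam_next_0, sig2_next_0):
    """Posterior weights w_i(x_{t+1}) over which data point generated x_{t+1}."""
    logw = log_normal(x_next, lam_next_0 * data, sig2_next_0)   # log q(x_{t+1} | x_0^{(i)})
    logw -= logw.max()                                          # subtract max for numerical stability
    w = np.exp(logw)
    return w / w.sum()

# Toy example: three data points and a moderately noised state.
data = np.array([-2.0, 0.5, 1.5])
w = reverse_weights(x_next=0.8, data=data, lam_next_0=0.7, sig2_next_0=1 - 0.7**2)
print(w)   # weights concentrate on the data points most likely to have produced x_{t+1}
```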

For small step sizes, that mixture of Gaussians can be approximated by a single Gaussian with mean equal to the expected value of the mixture,

\begin{align*}
\E[x_t \mid x_{t+1}] &= \sum_{i=1}^n w_i(x_{t+1}) \left(a_t x_0^{(i)} + b_t x_{t+1} \right)
\end{align*}

For small steps, this expected value is approximately,

\begin{align*}
\E[x_t \mid x_{t+1}] &\approx \frac{x_{t+1}}{\lambda_{t+1}} + \sigma_{t+1}^2 \sum_{i=1}^n w_i(x_{t+1}) \left(\frac{ \lambda_{t|0} x_0^{(i)} - x_{t+1}}{\sigma_{t|0}^2} \right)
\end{align*}

Though it’s not immediately obvious, the second term in the expectation is related to the marginal probability,

\begin{align*}
q(x_t) &= \frac{1}{n} \sum_{i=1}^n q(x_t \mid x_0^{(i)}) \\
&= \frac{1}{n} \sum_{i=1}^n \mathrm{N}(x_t \mid \lambda_{t|0} x_0^{(i)}, \sigma_{t|0}^2)
\end{align*}

Specifically, the second term is the Stein score function of the marginal probability,

\begin{align*}
\nabla_{x} \log q_t(x_{t+1}) &= \frac{\nabla_{x} q_t(x_{t+1})}{q_t(x_{t+1})} \\
&= \frac{\frac{1}{n} \sum_{i=1}^n \mathrm{N}(x_{t+1} \mid \lambda_{t|0} x_0^{(i)}, \sigma_{t|0}^2) \left(-\frac{(x_{t+1} - \lambda_{t|0}x_0^{(i)})}{\sigma_{t|0}^2} \right)}{\frac{1}{n} \sum_{j=1}^n \mathrm{N}(x_{t+1} \mid \lambda_{t|0} x_0^{(j)}, \sigma_{t|0}^2) } \\
&= \sum_{i=1}^n w_i(x_{t+1}) \left(\frac{\lambda_{t|0}x_0^{(i)} - x_{t+1}}{\sigma_{t|0}^2} \right)
\end{align*}
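
Because the score of a Gaussian mixture has this closed form, we can sanity-check the identity numerically: the weighted sum of residuals should match a finite-difference derivative of $\log q_t$. A small sketch with illustrative helpers (not from the notes):

```python
import numpy as np

def log_q(x, data, lam, sig2):
    """Log of the mixture marginal q_t(x) = (1/n) sum_i N(x | lam * x0_i, sig2)."""
    logp = -0.5 * (np.log(2 * np.pi * sig2) + (x - lam * data) ** 2 / sig2)
    return np.log(np.mean(np.exp(logp)))

def score_closed_form(x, data, lam, sig2):
    """Score via the posterior weights: sum_i w_i * (lam * x0_i - x) / sig2."""
    logw = -0.5 * (x - lam * data) ** 2 / sig2      # shared normalizer cancels in the weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return np.sum(w * (lam * data - x)) / sig2

data = np.array([-2.0, 0.5, 1.5])
lam, sig2, x = 0.7, 1 - 0.7**2, 0.8
eps = 1e-5
fd = (log_q(x + eps, data, lam, sig2) - log_q(x - eps, data, lam, sig2)) / (2 * eps)
print(score_closed_form(x, data, lam, sig2), fd)    # the two values agree up to finite-difference error
```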

Continuous time limit

In practice, the best performing diffusion models are based on a continuous-time formulation of the noising process as an SDE (Song et al., 2021). To motivate this approach, think of the noising process above as a discretization of a continuous process $x(t)$ for $t \in [0,1]$ with time steps of size $\Delta = \tfrac{1}{T}$. That is, map $x_i \mapsto x(i/T)$, $\lambda_i \mapsto \lambda(i/T)$, and $\sigma_i \mapsto \sigma(i/T)$ for $i=0,1,\ldots, T$. Then the discrete model can be rewritten as,

\begin{align*}
x(t + \Delta) &\sim \mathrm{N}(\lambda(t) x(t), \sigma(t)^2),
\end{align*}

or equivalently,

\begin{align*}
x(t + \Delta) - x(t) &\sim \mathrm{N}\left(f(x(t), t) \Delta, g(t)^2 \Delta \right) \\
f(x, t) &= -\frac{1 - \lambda(t)}{\Delta} x \\
g(t) &= \frac{\sigma(t)}{\sqrt{\Delta}}.
\end{align*}

We can view this as a discretization of the SDE,

\begin{align*}
\dif X &= f(x, t) \dif t + g(t) \dif W
\end{align*}

where $f(x,t)$ is the drift term, $g(t)$ is the diffusion term, and $\dif W$ is the Brownian motion.

The reverse (generative) process can be cast as an SDE as well! Following our derivation of the inverse process above, we can show that the reverse process is,

\begin{align*}
\dif X = \left[f(x, t) - g(t)^2 \nabla_x \log q_t(x)\right] \dif t + g(t) \dif W
\end{align*}

where $\dif t$ is a negative time increment and $\dif W$ is Brownian motion run in reverse time.
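
As a rough illustration, the sketch below integrates the reverse-time SDE with Euler-Maruyama, using the exact mixture score for a toy dataset in place of a learned score network. The drift and diffusion follow the variance-preserving form $f(x,t) = -\tfrac{1}{2}\beta(t)x$, $g(t) = \sqrt{\beta(t)}$; the particular linear $\beta$ schedule is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([-2.0, 0.5, 1.5])      # toy "dataset" standing in for q(x_0)

def beta(t):
    """Linear noise schedule on [0, 1] (an illustrative choice)."""
    return 0.1 + 19.9 * t

def alpha(t):
    """lambda_{t|0} in continuous time: exp(-0.5 * int_0^t beta(s) ds)."""
    return np.exp(-0.5 * (0.1 * t + 0.5 * 19.9 * t**2))

def score(x, t):
    """Exact score of the Gaussian-mixture marginal q_t(x) for the toy dataset."""
    a, v = alpha(t), 1.0 - alpha(t) ** 2
    logw = -0.5 * (x[:, None] - a * data[None, :]) ** 2 / v
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return (w @ (a * data) - x) / v

# Euler-Maruyama integration of the reverse SDE from t = 1 down to t = 0.
n_steps, n_samples = 1000, 5000
dt = 1.0 / n_steps
x = rng.normal(size=n_samples)          # start from the stationary distribution N(0, 1)
for i in range(n_steps, 0, -1):
    t = i * dt
    f = -0.5 * beta(t) * x              # drift of the forward (variance-preserving) SDE
    g2 = beta(t)                        # squared diffusion coefficient
    drift = f - g2 * score(x, t)        # reverse-time drift
    x = x - drift * dt + np.sqrt(g2 * dt) * rng.normal(size=n_samples)

print(np.mean(x), np.std(x))            # should roughly match the toy data's mean (0.0) and std (~1.47)
```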

Multidimensional models

Very few things need to change in order to apply this idea to multidimensional data $\mbx_0 \in \reals^D$. The standard setup is to apply a Gaussian noising process to each coordinate $x_{0,d}$ independently. Then, in the generative model,

\begin{align*}
p(\mbx_t \mid \mbx_{t+1}; \theta) &= \prod_{d=1}^D p(x_{t,d} \mid \mbx_{t+1}; \theta) \\
&= \prod_{d=1}^D \mathrm{N}(x_{t,d} \mid \mu_\theta(\mbx_{t+1}, t, d), \widetilde{\sigma}_t^2).
\end{align*}

The generative process still produces a factored distribution, but we need a separate mean function for each coordinate. Moreover, the mean function needs to consider the entire state $\mbx_{t+1}$. The reason is that $x_{t,d}$ is not conditionally independent of $x_{t+1,d'}$ given $x_{t+1,d}$; the coordinates are coupled in the inverse process since all of $\mbx_{t+1}$ provides information about the $\mbx_0$ that generated it.
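
A minimal sketch of one reverse step in $D$ dimensions makes this concrete (the denoiser `x0_hat` and the coefficients are hypothetical placeholders): the denoiser consumes the full vector $\mbx_{t+1}$, even though the noise added to each coordinate is independent.

```python
import numpy as np

def reverse_step(x_next, t, x0_hat, a_t, b_t, sigma_tilde, rng):
    """One factored reverse transition p(x_t | x_{t+1}; theta) for D-dimensional data.

    x0_hat: callable mapping the full (D,) array x_{t+1} and the time t to a (D,)
            estimate of x_0. It must see every coordinate, because all of x_{t+1}
            carries information about the x_0 that generated it.
    """
    mean = a_t * x0_hat(x_next, t) + b_t * x_next                 # per-coordinate denoising mean
    return mean + sigma_tilde * rng.normal(size=x_next.shape)     # independent Gaussian noise per coordinate
```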

Conclusion

Denoising diffusion probabilistic models frame generative modeling as learning to invert a fixed, analytically tractable noising process. The key insight is that the optimal reverse transition is a mixture of Gaussians whose mean is a linear combination of the data and the noisy state, and that learning to denoise is equivalent to learning to generate. In the continuous-time limit, the reverse process is an SDE driven by the score function of the marginal density — a connection that unifies DDPMs with score-based generative modeling and Langevin dynamics. There is much more to explore: conditional generation (steering the reverse diffusion with text prompts), discrete diffusion models, and connections between the score function and denoising score matching.

References
  1. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. International Conference on Machine Learning, 2256–2265.
  2. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33, 6840–6851.
  3. Turner, R. E. (2024). Denoising Diffusion Probabilistic Models in Six Simple Steps. arXiv Preprint arXiv:2402.04384.
  4. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. International Conference on Learning Representations.