Denoising diffusion probabilistic models (DDPMs) (Sohl-Dickstein et al., 2015; Ho et al., 2020) are the deep generative models underlying image generation tools like DALL-E 2 (from OpenAI) and Stable Diffusion (from Stability AI). This lecture will unpack how they work. These notes are partly inspired by Turner (2024).
The DDPM recipe has two parts:
1. Using a fixed (i.e., not learned), user-defined noising process to convert data into noise.
2. Learning to invert this process so that, starting from noise, we can generate samples that approximate the data distribution.
We can think of the DDPM as a giant latent variable model, where the latent variables are noisy versions of the data. As with other latent variable models (e.g., VAEs), once we’ve inferred the latent variables, the problem of learning the mapping from latents to observed data reduces to a supervised regression problem.
DDPMs were originally proposed for modeling continuous data, x∈RD. For simplicity, we will present the framework for scalar data, x∈R, and then discuss the straightforward generalization to multidimensional data afterward. Finally, we will close with a discussion of recent work on diffusion modeling for discrete data.
Let x≡x0 be our observed data. The noising process is a joint distribution over a sequence of latent variables x0:T=(x0,x1,…,xT). We will denote the distribution as,
$$q(x_{0:T}) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \lambda_t x_{t-1},\, \sigma_t^2), \quad \lambda_t^2 + \sigma_t^2 = 1,$$
so that each step rescales the previous state and adds independent Gaussian noise.
At each step, the latents will become increasingly noisy versions of the original data, until at time T the latent variable xT is essentially pure noise. The generative model will then proceed by sampling pure noise and attempting to invert the noising process to produce samples that approximate the data-generating distribution.
Under this setting, the noising process preserves the variance of the marginal distributions. If E[x0]=0 and Var[x0]=1, then the marginal distribution of xt will be zero mean and unit variance as well.
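To make the variance-preserving property concrete, here is a small numpy simulation. The constant schedule λt = 0.995 and the standard-normal "data" are assumptions purely for illustration; the chain takes xt = λt xt−1 + √(1−λt²) εt and we check that the mean and variance of the samples stay near 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
# Assumed schedule: lam_t close to 1 means each step adds only a little noise.
lam = 0.995 * np.ones(T)

# Standardized "data": zero mean, unit variance (stand-in for real samples).
x = rng.standard_normal(10_000)
for t in range(T):
    eps = rng.standard_normal(x.shape)
    # Variance-preserving step: lam^2 + (1 - lam^2) = 1.
    x = lam[t] * x + np.sqrt(1.0 - lam[t] ** 2) * eps

print(np.mean(x), np.var(x))  # both stay near 0 and 1
```

Because λt² + σt² = 1, the marginal variance is exactly preserved at every step, regardless of how many steps we take.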
Consider the following two limits:
As T→∞, the conditional distribution goes to a standard normal, q(xT∣x0)→N(0,1), which makes the marginal distribution q(xT) easy to sample from.
When λt→1, the noising process adds infinitesimal noise so that xt≈xt−1, which makes the inverse process easier to learn.
Of course, these two limits are in conflict with one another. If we add a small amount of noise at each time step, the inverse process is easier to learn, but we need to take many time steps to converge to a Gaussian stationary distribution.
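Both limits can be read off the closed-form marginal. Assuming the variance-preserving transition xt = λt xt−1 + σt εt with λt² + σt² = 1, composing the Gaussian steps gives:

```latex
% Composing t Gaussian steps x_s = \lambda_s x_{s-1} + \sigma_s \epsilon_s
% with \lambda_s^2 + \sigma_s^2 = 1 yields a Gaussian marginal:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\; \bar{\lambda}_t x_0,\; 1 - \bar{\lambda}_t^2\right),
\qquad \bar{\lambda}_t = \prod_{s=1}^{t} \lambda_s .
% As t grows with each \lambda_s < 1, \bar{\lambda}_t \to 0 and
% q(x_t \mid x_0) \to \mathcal{N}(0, 1); as \lambda_t \to 1, an individual
% step adds vanishing noise, so many steps are needed for \bar{\lambda}_t \to 0.
```

The tension between the two limits is visible in λ̄t: slow noising (λt near 1) makes each reverse step easy but forces T to be large before λ̄t is negligible.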
The initial distribution p(xT) has no parameters because it is set to the stationary distribution of the noising process, q(x∞). E.g., for the Gaussian noising process above, p(xT)=N(0,1).
where q(x1:T∣x0) is the conditional distribution of x1:T under the noising process.
Since there are no learnable parameters in the noising process, the objective simplifies to maximizing the expected log likelihood.
We can simplify further by expanding the log probability of the generative model,
$$\log p(x_{0:T};\, \theta) = \log p(x_T) + \sum_{t=0}^{T-1} \log p(x_t \mid x_{t+1};\, \theta).$$
Since the noising process above adds a small amount of Gaussian noise at each step, it is reasonable to model the generative process as Gaussian as well,
$$p(x_t \mid x_{t+1};\, \theta) = \mathcal{N}\!\left(x_t;\; \mu_\theta(x_{t+1}, t),\; \sigma_t^2\right),$$
where μθ:R×[0,T]↦R is a nonlinear mean function that should denoise xt+1 to obtain the expected value of xt, and σt2 is a fixed variance for the generative process.
Under this Gaussian model for the generative process, we can analytically compute one of the expectations in the ELBO. This is called Rao-Blackwellization. It reduces the variance of the objective, which is good for SGD!
Using the chain rule and the Gaussian generative model, we have,
$$\mu_\theta(x_{t+1}, t) = a_t\, \hat{x}_0(x_{t+1}, t;\, \theta) + b_t\, x_{t+1},$$
where the only part that is learned is x̂0(xt+1,t;θ), a function that attempts to denoise the current state. Since xt+1 is given and at and bt are determined solely by the hyperparameters (the noise schedule), we can use them in the mean function.
Under this parameterization, the loss function reduces to,
$$\mathcal{L}(\theta) = \sum_{t=0}^{T-1} \frac{a_t^2}{2\sigma_t^2}\, \mathbb{E}_q\!\left[\left(\hat{x}_0(x_{t+1}, t;\, \theta) - x_0\right)^2\right] + \text{const},$$
a weighted sum of denoising regression errors.
One nice thing about this formulation is that the mean function is always outputting “the same thing” — an estimate of the completely denoised data, x^0, regardless of the time t.
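The training loop this implies is a plain regression: sample a time step, noise the data to that time using the closed-form marginal, and ask the denoiser to recover x0. The sketch below is schematic — the schedule, the uniform weighting over time steps, and the toy linear "denoiser" (standing in for a neural network) are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
lam = 0.95 * np.ones(T)        # assumed per-step scale schedule
lam_bar = np.cumprod(lam)      # cumulative scale: q(x_t | x_0) = N(lam_bar_t x_0, 1 - lam_bar_t^2)

def denoiser(x_t, t, theta):
    """Toy stand-in for x_hat_0(x_t, t; theta): a per-time linear rescaling.
    A real DDPM uses a neural network that takes (x_t, t) as input."""
    return theta[t] * x_t

def loss(theta, x0):
    """One-sample Monte Carlo estimate of the denoising regression objective:
    pick a random time, noise x0 forward to x_t in one shot, and score the
    denoiser's guess of x0. (Per-time weights are omitted for simplicity.)"""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    x_t = lam_bar[t] * x0 + np.sqrt(1.0 - lam_bar[t] ** 2) * eps
    return np.mean((denoiser(x_t, t, theta) - x0) ** 2)

x0 = rng.standard_normal(5000)  # standardized "data"
theta = np.ones(T)
print(loss(theta, x0))
```

Note that the one-shot noising q(xt∣x0) makes each gradient step cheap: no need to simulate the whole chain during training.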
The generative process attempts to invert the noising process, but what is the actual inverse of the process? Since the noising process is a Markov chain, the reverse of the noising process is Markovian as well, i.e., q(xt ∣ xt+1:T) = q(xt ∣ xt+1). For an empirical data distribution on n points x0(1),…,x0(n), Bayes' rule gives,
$$q(x_t \mid x_{t+1}) = \sum_{i=1}^{n} w_i(x_{t+1})\, q\!\left(x_t \mid x_{t+1}, x_0^{(i)}\right), \qquad w_i(x_{t+1}) \propto q\!\left(x_{t+1} \mid x_0^{(i)}\right),$$
which we recognize as a mixture of Gaussians, all with the same variance, with means biased toward each of the n data points, and weighted by the relative likelihood of x0(i) having produced xt+1.
For small step sizes, that mixture of Gaussians can be approximated by a single Gaussian with mean equal to the expected value of the mixture,
$$\mathbb{E}[x_t \mid x_{t+1}] = a_t\, \mathbb{E}[x_0 \mid x_{t+1}] + b_t\, x_{t+1} = a_t \sum_{i=1}^{n} w_i(x_{t+1})\, x_0^{(i)} + b_t\, x_{t+1},$$
which matches the parameterized mean of the generative process, with x̂0 playing the role of E[x0 ∣ xt+1].
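The mixture weights are easy to compute for a toy one-dimensional problem. In the sketch below, the three data points, the cumulative scale λ̄, and the observed noisy state are all made-up values; each data point is weighted by how likely it is to have produced xt+1, and the weighted average gives the implied denoised estimate:

```python
import numpy as np

def gauss(x, mu, var):
    """Gaussian density N(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

data = np.array([-2.0, 0.5, 3.0])        # tiny empirical dataset of n=3 points x0^(i)
lam_bar = 0.8                            # assumed cumulative scale at this step
var = 1.0 - lam_bar ** 2                 # marginal variance of q(x_{t+1} | x_0)

x_next = 1.0                             # an observed noisy state x_{t+1}
w = gauss(x_next, lam_bar * data, var)   # likelihood of x_{t+1} under each data point
w /= w.sum()                             # normalized mixture weights w_i
x0_hat = np.sum(w * data)                # posterior mean E[x0 | x_{t+1}]
print(w.round(3), x0_hat)
```

Data points whose noised versions are far from xt+1 get exponentially small weight, so the posterior mean is dominated by the nearby points — exactly the behavior the learned x̂0 must reproduce.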
In practice, the best performing diffusion models are based on a continuous-time formulation of the noising process as an SDE (Song et al., 2021). To motivate this approach, think of the noising process above as a discretization of a continuous process x(t) for t∈[0,1] with time steps of size Δ=1/T. That is, map xi↦x(i/T), λi↦λ(i/T), and σi↦σ(i/T) for i=0,1,…,T. Then, in the limit Δ→0, the discrete model can be rewritten as the stochastic differential equation,
$$\mathrm{d}x = f(x, t)\, \mathrm{d}t + g(t)\, \mathrm{d}W,$$
where f(x,t) is the drift term, g(t) is the diffusion coefficient, and dW is the increment of a standard Brownian motion.
The reverse (generative) process can be cast as an SDE as well! Following our derivation of the inverse process above, one can show that the reverse process is,
$$\mathrm{d}x = \left[f(x, t) - g(t)^2\, \nabla_x \log q_t(x)\right] \mathrm{d}t + g(t)\, \mathrm{d}\bar{W},$$
run backward in time, where qt is the marginal density of the noising process at time t and W̄ is a reverse-time Brownian motion. The drift is corrected by the score function, ∇x log qt(x).
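A reverse SDE can be sampled with Euler–Maruyama once we have the score. In the toy sketch below everything is analytic: the data distribution is a made-up Gaussian N(μ0, s0²), the forward process is the variance-preserving SDE with drift f = −½βx and diffusion g = √β (constant β assumed), so the marginals and their score are available in closed form — in a real model, a network would approximate the score:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 2.0            # assumed constant schedule: f(x,t) = -0.5*beta*x, g(t) = sqrt(beta)
mu0, s0 = 1.5, 0.5    # toy data distribution N(mu0, s0^2), chosen for illustration

def marginal(t):
    """Scale and variance of q_t when x(0) ~ N(mu0, s0^2) under this linear SDE."""
    a = np.exp(-0.5 * beta * t)
    return a, s0 ** 2 * a ** 2 + 1.0 - a ** 2

def score(x, t):
    """Analytic score grad_x log q_t(x) for this Gaussian toy problem."""
    a, v = marginal(t)
    return -(x - a * mu0) / v

T, n_steps = 1.0, 1000
dt = T / n_steps
a_T, v_T = marginal(T)
# Start from the exact terminal marginal q_T (already close to N(0, 1)).
x = a_T * mu0 + np.sqrt(v_T) * rng.standard_normal(100_000)
for i in range(n_steps):
    t = T - i * dt
    drift = -0.5 * beta * x - beta * score(x, t)  # f - g^2 * score (reverse-time drift)
    # Euler-Maruyama step backward in time.
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(x.size)
print(x.mean(), x.std())
```

After integrating backward, the samples' mean and standard deviation land close to μ0 and s0, i.e., the reverse SDE has transported noise back to the data distribution.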
Very few things need to change in order to apply this idea to multidimensional data x0∈RD. The standard setup is to apply a Gaussian noising process to each coordinate x0,d independently. Then, in the generative model,
$$p(x_t \mid x_{t+1};\, \theta) = \prod_{d=1}^{D} \mathcal{N}\!\left(x_{t,d};\; \mu_{\theta,d}(x_{t+1}, t),\; \sigma_t^2\right).$$
The generative process still produces a factored distribution, but we need a separate mean function for each coordinate. Moreover, the mean function needs to consider the entire state xt+1. The reason is that xt,d is not conditionally independent of xt+1,d′ given xt+1,d; the coordinates are coupled in the inverse process since all of xt+1 provides information about the x0 that generated it.
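A quick way to see this coupling: for Gaussian toy data with correlated coordinates (the correlation ρ and cumulative noise scale λ̄ below are arbitrary assumptions), the optimal denoiser E[x0 ∣ xt] is a matrix with nonzero off-diagonal entries, so denoising either coordinate uses both:

```python
import numpy as np

lam_bar = 0.7                          # assumed cumulative noise scale at this step
rho = 0.9                              # two strongly correlated data coordinates
Sigma0 = np.array([[1.0, rho], [rho, 1.0]])

# For jointly Gaussian data x0 ~ N(0, Sigma0) noised coordinate-wise,
# x_t = lam_bar * x0 + sqrt(1 - lam_bar^2) * eps, the optimal denoiser is
# linear, E[x0 | x_t] = W @ x_t, with W given by the usual Gaussian
# conditioning formula Cov(x0, x_t) @ Var(x_t)^{-1}:
W = lam_bar * Sigma0 @ np.linalg.inv(
    lam_bar ** 2 * Sigma0 + (1.0 - lam_bar ** 2) * np.eye(2)
)
print(W)  # off-diagonal entries are nonzero
```

With ρ = 0 the off-diagonal entries vanish and coordinate-wise denoising would suffice; any correlation in the data makes the full state informative.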
Denoising diffusion probabilistic models frame generative modeling as learning to invert a fixed, analytically tractable noising process. The key insight is that the optimal reverse transition is a mixture of Gaussians whose mean is a linear combination of the data and the noisy state, and that learning to denoise is equivalent to learning to generate. In the continuous-time limit, the reverse process is an SDE driven by the score function of the marginal density — a connection that unifies DDPMs with score-based generative modeling and Langevin dynamics. There is much more to explore: conditional generation (steering the reverse diffusion with text prompts), discrete diffusion models, and connections between the score function and denoising score matching.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. International Conference on Machine Learning, 2256–2265.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Turner, R. E. (2024). Denoising Diffusion Probabilistic Models in Six Simple Steps. arXiv preprint arXiv:2402.04384.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. International Conference on Learning Representations.