
Stochastic Differential Equations

Stochastic differential equations (SDEs) describe the continuous-time evolution of systems driven by random noise. They are the continuous-time counterpart of the linear dynamical systems studied in Part IV and underpin a broad range of models in statistics, physics, finance, and machine learning. In this course they arise most prominently in two places: as the principled mathematical framework for denoising diffusion models, and as a lens through which many stationary Gaussian processes can be formulated and computed efficiently.

Brownian Motion

The fundamental source of randomness in continuous-time stochastic systems is Brownian motion (also called the Wiener process).

A scalar Brownian motion $w_t$ is a process with continuous sample paths, $w_0 = 0$, and independent Gaussian increments $w_t - w_s \sim \mathcal{N}(0, t - s)$ for $s < t$. A $d$-dimensional Brownian motion $\mathbf{w}_t \in \mathbb{R}^d$ is a vector of $d$ independent scalar Brownian motions, so $\mathbf{w}_t - \mathbf{w}_s \sim \mathcal{N}(\mathbf{0},\, (t-s)I)$.

Brownian motion is nowhere differentiable almost surely: sample paths have unbounded variation on every interval and are far too rough for a classical derivative to exist. This is not a pathology; it is essential for the central limit theorem intuition underlying the construction. As we add finer and finer independent noise increments, the accumulated noise scales as $\sqrt{t}$ (not $t$), and this $\sqrt{\Delta t}$ scaling of increments is what forces us to treat Brownian motion differently from smooth driving signals.
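
To make the $\sqrt{\Delta t}$ scaling concrete, Brownian paths can be simulated by accumulating independent Gaussian increments. A minimal Python sketch (NumPy assumed; the horizon, grid size, and path count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, n_paths = 1.0, 1_000, 5_000
dt = T / n

# Brownian increments are independent N(0, dt) draws, i.e. sqrt(dt) * N(0, 1).
dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
w = np.cumsum(dw, axis=1)  # each row is one Brownian path (starting from w_0 = 0)

# The accumulated noise at time T has standard deviation sqrt(T), not T:
print("empirical std of w_T:", w[:, -1].std())  # ~ sqrt(T) = 1.0
```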

The Ito Integral

Because $w_t$ is not differentiable, integrals of the form $\int_0^t \sigma(u_s, s)\,dw_s$ must be defined with care. The Ito integral constructs this as an $L^2$ limit using left-endpoint Riemann sums:

$$
\int_0^t \sigma(u_s, s)\,dw_s \;=\; \lim_{n \to \infty} \sum_{i=0}^{n-1} \sigma(u_{t_i}, t_i)\,\bigl[w_{t_{i+1}} - w_{t_i}\bigr], \qquad t_i = \tfrac{i t}{n}.
$$

The integrand must be adapted to the Brownian filtration $\{\mathcal{F}_t\}$, meaning $\sigma(u_t, t)$ may depend only on the history up to time $t$, not on future values of $w$.
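
The construction can be checked numerically on an integrand with a known answer: for the choice $\sigma(u_s, s) = w_s$, the identity $\int_0^t w_s\,dw_s = \tfrac{1}{2}(w_t^2 - t)$ holds (it follows from Ito's lemma, derived below). A sketch of the left-endpoint sum (Python/NumPy; the grid size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
t, n = 1.0, 200_000
dt = t / n

dw = rng.normal(0.0, np.sqrt(dt), size=n)
w = np.concatenate([[0.0], np.cumsum(dw)])  # w_{t_0}, ..., w_{t_n}

# Left-endpoint Riemann sum: the integrand is evaluated at the START of each
# interval, as required for adaptedness to the filtration.
ito_sum = np.sum(w[:-1] * dw)

closed_form = 0.5 * (w[-1] ** 2 - t)
print(ito_sum, closed_form)  # agree up to O(sqrt(dt)) discretization error
```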

Two key properties of the Ito integral follow directly from the construction.

**Zero mean.** Each summand $\sigma(u_{t_i}, t_i)[w_{t_{i+1}} - w_{t_i}]$ has zero conditional expectation given $\mathcal{F}_{t_i}$, because the Brownian increment is independent of the past and has mean zero. By the tower property, every partial sum has mean zero, and the limit inherits this:

$$
\mathbb{E}\left[\int_0^t \sigma(u_s, s)\,dw_s\right] = 0.
$$

**Ito isometry.** Cross-terms in $\bigl(\sum_i \sigma_{t_i}\Delta w_i\bigr)^2$ vanish for $i \neq j$ by the same independence argument. Only the diagonal terms contribute, using $\mathbb{E}[(\Delta w_i)^2] = \Delta t$:

$$
\mathbb{E}\left[\left(\int_0^t \sigma(u_s, s)\,dw_s\right)^{\!2}\right] = \int_0^t \mathbb{E}\left[\sigma(u_s, s)^2\right] ds.
$$

The isometry says that the $L^2$ norm of the integral equals the $L^2$ norm of the integrand in the product space $[0, t] \times \Omega$.
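
Both properties can be verified by Monte Carlo. For the integrand $\sigma(u_s, s) = w_s$ on $[0, 1]$, the isometry predicts a second moment of $\int_0^1 \mathbb{E}[w_s^2]\,ds = \int_0^1 s\,ds = \tfrac{1}{2}$. A sketch (Python/NumPy; path and grid counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n = 5_000, 1_000
dt = 1.0 / n

dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
w = np.cumsum(dw, axis=1) - dw  # left endpoints w_{t_0} = 0, ..., w_{t_{n-1}}

integrals = np.sum(w * dw, axis=1)                # one Ito sum per path
print("mean:", integrals.mean())                  # ~ 0   (zero-mean property)
print("second moment:", (integrals ** 2).mean())  # ~ 0.5 (Ito isometry)
```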

Stochastic Differential Equations

Starting from Euler’s method for the ODE $\dot{u} = f(u, t)$,

$$
u(t + \Delta t) = u(t) + f(u(t), t)\,\Delta t,
$$

we augment each step with a Gaussian noise term whose variance scales as $\Delta t$, matching the variance of a Brownian increment over the same interval:

$$
u(t + \Delta t) = u(t) + f(u(t), t)\,\Delta t + \sigma(u(t), t)\,\varepsilon\sqrt{\Delta t}, \qquad \varepsilon \sim \mathcal{N}(0, I).
$$

Recognizing $\varepsilon\sqrt{\Delta t} = w(t + \Delta t) - w(t)$ and taking $\Delta t \to 0$ yields the stochastic differential equation

$$
du_t = f(u_t, t)\,dt + \sigma(u_t, t)\,dw_t,
$$

interpreted rigorously as the integral equation

$$
u_t = u_0 + \int_0^t f(u_s, s)\,ds + \int_0^t \sigma(u_s, s)\,dw_s.
$$

The function $f : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$ is the drift coefficient and $\sigma : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^{d \times d}$ is the diffusion coefficient. The process $\{u_t\}$ solving the SDE is called an Ito process or diffusion process. Under standard Lipschitz and growth conditions on $f$ and $\sigma$, existence and uniqueness of a strong solution are guaranteed.

**Why Gaussian noise?** Beyond analytical tractability, the central limit theorem gives a physical justification: a large number of small independent perturbations accumulates into something Gaussian. The SDE framework can be generalized to other driving processes (any semimartingale), but Gaussian noise is by far the most common choice.

Ito’s Lemma

The ordinary chain rule says: if $u(t)$ is a smooth function and $\phi$ is smooth, then $d\phi(u) = \phi'(u)\,du$. For an Ito process, there is an additional correction term arising from the quadratic variation of Brownian motion.

**Derivation sketch.** Expand $\phi$ to second order via Taylor’s theorem and substitute $du_t = f_t\,dt + \sigma_t\,dw_t$:

$$
d\phi = \frac{\partial\phi}{\partial t}\,dt + \frac{\partial\phi}{\partial u}\,du_t + \frac{1}{2}\frac{\partial^2\phi}{\partial u^2}(du_t)^2 + \cdots
$$

The key step is evaluating $(du_t)^2 = (f_t\,dt + \sigma_t\,dw_t)^2$. Using the multiplication rules $dt^2 \to 0$, $dt\,dw_t \to 0$, and $dw_t^2 \to dt$, which follow from the $L^2$ behavior of Brownian motion, gives $(du_t)^2 = \sigma_t^2\,dt$, and all higher-order terms vanish. Collecting terms yields the scalar Ito formula

$$
d\phi = \left(\frac{\partial\phi}{\partial t} + f_t\,\frac{\partial\phi}{\partial u} + \frac{\sigma_t^2}{2}\frac{\partial^2\phi}{\partial u^2}\right)dt + \sigma_t\,\frac{\partial\phi}{\partial u}\,dw_t.
$$

The extra $\frac{\sigma_t^2}{2}\frac{\partial^2\phi}{\partial u^2}$ term, absent in ordinary calculus, is called the Ito correction.

**Multivariate form.** For $\mathbf{u}_t \in \mathbb{R}^d$ satisfying $d\mathbf{u}_t = \mathbf{f}_t\,dt + G_t\,d\mathbf{w}_t$ and a smooth function $\phi : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}$,

$$
d\phi = \frac{\partial\phi}{\partial t}\,dt + (\nabla_u \phi)^\top d\mathbf{u}_t + \frac{1}{2}\operatorname{tr}\left(G_t G_t^\top \nabla_u^2 \phi\right)dt.
$$

**Example: geometric Brownian motion.** The SDE $du_t = \mu u_t\,dt + \sigma u_t\,dw_t$ is the classical model for asset prices. Applying Ito’s lemma to $\phi(u) = \log u$ gives

$$
d\log u_t = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dw_t,
$$

so $\log u_t$ drifts linearly and $u_t = u_0\exp\bigl((\mu - \tfrac{\sigma^2}{2})t + \sigma w_t\bigr)$ is log-normally distributed. The $-\sigma^2/2$ correction relative to the drift $\mu$ is a direct consequence of the Ito term.
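
The correction is easy to observe numerically: sampling the closed-form solution shows that $\log u_T$ has mean $(\mu - \sigma^2/2)T$, while $\mathbb{E}[u_T] = u_0 e^{\mu T}$. A sketch (Python/NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, u0, T = 0.1, 0.4, 1.0, 2.0

# Exact simulation of geometric Brownian motion at time T via the
# closed-form solution u_T = u0 * exp((mu - sigma^2/2) T + sigma * w_T).
w_T = rng.normal(0.0, np.sqrt(T), size=100_000)
u_T = u0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * w_T)

print(np.log(u_T).mean())  # ~ (mu - sigma^2/2) T = 0.04, not mu T = 0.2
print(u_T.mean())          # ~ u0 * exp(mu T) = 1.22: corrections cancel in E[u_T]
```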

The Fokker-Planck Equation

Instead of tracking individual trajectories, one can describe the evolution of the marginal density $p_t(u)$ of the process. For the SDE $du_t = f(u_t, t)\,dt + \sigma(u_t, t)\,dw_t$, the marginal density satisfies the Fokker-Planck equation (also called the forward Kolmogorov equation):

$$
\frac{\partial p_t}{\partial t}(u) = -\nabla_u \cdot \bigl[f(u, t)\,p_t(u)\bigr] + \frac{1}{2}\,\nabla_u^2 : \bigl[\sigma(u,t)\sigma(u,t)^\top p_t(u)\bigr].
$$

In the scalar case with state-independent diffusion $\sigma(t)$, this simplifies to

$$
\frac{\partial p_t}{\partial t} = -\frac{\partial}{\partial u}\bigl[f(u,t)\,p_t\bigr] + \frac{\sigma(t)^2}{2}\frac{\partial^2 p_t}{\partial u^2}.
$$

The first term describes probability transport due to the drift; it has the form of a continuity equation. The second term is a diffusion (heat-equation) term that spreads probability mass. Together they give a precise PDE governing how the distribution of $u_t$ evolves over time.
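
As a concrete check, the scalar Fokker-Planck equation can be integrated on a grid with explicit finite differences. The sketch below (Python/NumPy; the OU drift $f(u) = -\theta u$, the grid, and the step sizes are all illustrative choices) relaxes an off-center initial bump toward the Gaussian stationary density derived in the Stationary Distributions section:

```python
import numpy as np

# Explicit finite-difference integration of the scalar Fokker-Planck equation
# with drift f(u) = -theta * u and constant diffusion sigma.
theta, sigma = 1.0, 1.0
x = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / sigma**2            # small step keeps the explicit scheme stable

p = np.exp(-0.5 * (x - 2.0) ** 2 / 0.3**2)  # initial density centered at u = 2
p /= p.sum() * dx

for _ in range(int(6.0 / dt)):         # integrate to t = 6
    transport = -np.gradient(-theta * x * p, dx)           # -d/du [f p]
    diffusion = 0.5 * sigma**2 * (
        np.roll(p, -1) - 2 * p + np.roll(p, 1)) / dx**2    # (sigma^2/2) p''
    p += dt * (transport + diffusion)
    p[0] = p[-1] = 0.0                 # far-field boundary condition

stationary = np.exp(-theta * x**2 / sigma**2)  # N(0, sigma^2 / (2 theta))
stationary /= stationary.sum() * dx
print(np.abs(p - stationary).max())    # ~ 0: the density has relaxed
```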

The backward Kolmogorov equation is the adjoint of the Fokker-Planck operator and describes how expected functions of the terminal state $\mathbb{E}[\phi(u_T) \mid u_t = u]$ evolve backward in time from $T$ to $t$. It is the continuous-time analogue of the backwards recursions in dynamic programming.

Simulation: Euler–Maruyama

Given initial condition $u_0 \sim P_0$ and a time grid $0 = t_0 < t_1 < \cdots < t_n = T$ with step $\Delta t$, the Euler–Maruyama method approximates the SDE solution by

$$
u_{t_{i+1}} = u_{t_i} + f(u_{t_i}, t_i)\,\Delta t + \sigma(u_{t_i}, t_i)\,\varepsilon_i\sqrt{\Delta t}, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I).
$$
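
A direct scalar implementation (Python/NumPy sketch; the function name `euler_maruyama` and the OU coefficients in the example are ours for illustration):

```python
import numpy as np

def euler_maruyama(f, sigma, u0, T, n, rng):
    """Simulate one path of du = f(u,t) dt + sigma(u,t) dw on [0, T]."""
    dt = T / n
    u = np.empty(n + 1)
    u[0] = u0
    for i in range(n):
        t = i * dt
        u[i + 1] = (u[i] + f(u[i], t) * dt
                    + sigma(u[i], t) * np.sqrt(dt) * rng.normal())
    return u

# Example: an Ornstein-Uhlenbeck path with theta = 1, mu = 0, sigma = 0.5.
rng = np.random.default_rng(4)
path = euler_maruyama(f=lambda u, t: -u, sigma=lambda u, t: 0.5,
                      u0=3.0, T=5.0, n=5_000, rng=rng)
print(path[-1])  # by t = 5 the path is near the stationary regime N(0, 0.125)
```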

Euler–Maruyama is the direct stochastic analogue of Euler’s method and has strong convergence rate $O(\sqrt{\Delta t})$. The Milstein scheme adds a correction using Ito’s lemma applied to $\sigma$,

$$
u_{t_{i+1}} = u_{t_i} + f\,\Delta t + \sigma\,\varepsilon_i\sqrt{\Delta t} + \tfrac{1}{2}\sigma\sigma'\bigl(\varepsilon_i^2 - 1\bigr)\Delta t,
$$

improving the strong convergence rate to $O(\Delta t)$ when $\sigma$ depends on the state. When the transition density $p_{t_i, t_{i+1}}(u_{t_{i+1}} \mid u_{t_i})$ is known in closed form, as it is for linear SDEs, exact simulation is possible without any discretization error.
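
The rate difference can be measured directly on geometric Brownian motion, where the exact solution along each simulated Brownian path is available for comparison (see the Ito’s Lemma section). A sketch (Python/NumPy; path counts and step counts are arbitrary):

```python
import numpy as np

# Strong error of Euler-Maruyama vs Milstein for du = mu*u dt + sig*u dw,
# where sigma'(u) = sig and the exact solution along the path is known.
mu, sig, u0, T = 0.5, 1.0, 1.0, 1.0
rng = np.random.default_rng(5)

def strong_errors(n, n_paths=2_000):
    dt = T / n
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
    em = np.full(n_paths, u0)
    mil = np.full(n_paths, u0)
    for i in range(n):
        em = em + mu * em * dt + sig * em * dw[:, i]
        mil = (mil + mu * mil * dt + sig * mil * dw[:, i]
               + 0.5 * sig**2 * mil * (dw[:, i] ** 2 - dt))  # Milstein correction
    exact = u0 * np.exp((mu - 0.5 * sig**2) * T + sig * dw.sum(axis=1))
    return np.abs(em - exact).mean(), np.abs(mil - exact).mean()

for n in (10, 40, 160):
    print(n, strong_errors(n))  # EM error shrinks ~ n^{-1/2}, Milstein ~ n^{-1}
```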

Linear SDEs and the Ornstein–Uhlenbeck Process

The most tractable family of SDEs has a drift that is linear in the state:

$$
d\mathbf{u}_t = \bigl(A(t)\mathbf{u}_t + \mathbf{b}(t)\bigr)dt + G(t)\,d\mathbf{w}_t.
$$

Because the drift is affine in the state and the noise enters additively, solutions have Gaussian transition densities and therefore define Gaussian processes.

**Solution via integrating factor.** For the scalar time-invariant case $du_t = a\,u_t\,dt + g\,dw_t$, apply Ito’s lemma to $\phi(u_t, t) = e^{-at}u_t$:

$$
d(e^{-at}u_t) = e^{-at}\,g\,dw_t.
$$

Integrating both sides from $0$ to $t$ and rearranging,

$$
u_t = e^{at}u_0 + g\int_0^t e^{a(t-s)}\,dw_s.
$$

The Ito integral is Gaussian (as an $L^2$ limit of Gaussian sums), so the transition density is

$$
p(u_t \mid u_0) = \mathcal{N}\left(e^{at}u_0,\;\; g^2\int_0^t e^{2a(t-s)}\,ds\right).
$$

For $a < 0$, the mean decays exponentially toward zero and the variance saturates at $g^2/(2|a|)$ as $t \to \infty$.

**The Ornstein–Uhlenbeck process.** The canonical mean-reverting SDE is

$$
du_t = -\theta(u_t - \mu)\,dt + \sigma\,dw_t, \qquad \theta > 0.
$$

The drift pulls $u_t$ toward the long-run mean $\mu$ at rate $\theta$; $\sigma$ controls the noise level. The transition density is

$$
p(u_t \mid u_s) = \mathcal{N}\left(\mu + e^{-\theta(t-s)}(u_s - \mu),\;\; \frac{\sigma^2}{2\theta}\bigl(1 - e^{-2\theta(t-s)}\bigr)\right).
$$

The marginal distribution converges to the stationary distribution $\mathcal{N}(\mu,\, \sigma^2/(2\theta))$ regardless of the initial condition, with the exponential covariance kernel $k(s, t) = \frac{\sigma^2}{2\theta}e^{-\theta|t-s|}$.
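
Because the transition density is Gaussian and available in closed form, OU paths can be sampled exactly at any (even irregular) set of times, with no discretization error. A sketch (Python/NumPy; the helper name and parameters are ours for illustration):

```python
import numpy as np

def ou_sample_path(theta, mu, sigma, u0, ts, rng):
    """Sample an OU path exactly at the times ts via the Gaussian transition."""
    u = np.empty(len(ts))
    u[0] = u0
    for i in range(1, len(ts)):
        dt = ts[i] - ts[i - 1]
        mean = mu + np.exp(-theta * dt) * (u[i - 1] - mu)
        var = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * dt))
        u[i] = rng.normal(mean, np.sqrt(var))
    return u

rng = np.random.default_rng(6)
ts = np.linspace(0.0, 10.0, 101)
u = ou_sample_path(theta=1.0, mu=2.0, sigma=1.0, u0=-3.0, ts=ts, rng=rng)
print(u[-1])  # by t = 10, effectively a draw from N(mu, sigma^2/(2 theta)) = N(2, 0.5)
```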

**Multivariate case.** For $d\mathbf{u}_t = A\mathbf{u}_t\,dt + G\,d\mathbf{w}_t$ with stable $A$ (all eigenvalues in the left half-plane), the solution is

$$
\mathbf{u}_t = e^{At}\mathbf{u}_0 + \int_0^t e^{A(t-s)}G\,d\mathbf{w}_s,
$$

with transition covariance $P(t) = \int_0^t e^{As}GG^\top e^{A^\top s}\,ds$. The stationary covariance $P_\infty$ is the unique positive definite solution to the continuous-time Lyapunov equation

$$
AP_\infty + P_\infty A^\top + GG^\top = 0.
$$
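
In practice $P_\infty$ is computed with a Lyapunov solver rather than by evaluating the integral. A sketch using SciPy (the two-state system matrices are illustrative):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# A stable two-state system (a damped oscillator driven by noise in the
# velocity component; the values are illustrative).
A = np.array([[0.0, 1.0],
              [-4.0, -0.5]])
G = np.array([[0.0],
              [1.0]])

# solve_continuous_lyapunov solves A X + X A^T = Q, so take Q = -G G^T.
P_inf = solve_continuous_lyapunov(A, -G @ G.T)

residual = A @ P_inf + P_inf @ A.T + G @ G.T
print(np.abs(residual).max())     # ~ 0: the Lyapunov equation is satisfied
print(np.linalg.eigvalsh(P_inf))  # positive eigenvalues: P_inf is positive definite
```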

Stationary Distributions

When does the density $p_t$ converge as $t \to \infty$? A stationary distribution $p^*$ satisfies the time-independent Fokker-Planck equation (right-hand side set to zero). For the scalar constant-diffusion case $\sigma(u, t) = \sigma$, one can integrate the Fokker-Planck equation twice to obtain

$$
p^*(u) \propto \exp\left(\frac{2}{\sigma^2}\int^u f(v)\,dv\right).
$$

This is the Boltzmann distribution with energy $E(u) = -\int^u f(v)\,dv$.

**Langevin dynamics.** This connection between drift and stationary distribution is the foundation of gradient-based MCMC. To sample from a target distribution $p^*(u) \propto e^{-E(u)}$, one runs the SDE

$$
du_t = -\nabla E(u_t)\,dt + \sqrt{2}\,dw_t,
$$

which has $p^*(u) \propto e^{-E(u)}$ as its unique stationary distribution. In the Bayesian context, $E(\boldsymbol{\theta}) = -\log p(\mathbf{x} \mid \boldsymbol{\theta}) - \log p(\boldsymbol{\theta})$, and Langevin dynamics provides a continuous-time limit of gradient-based MCMC. Discretization via Euler–Maruyama gives the unadjusted Langevin algorithm (ULA); adding a Metropolis–Hastings correction step recovers the Metropolis-adjusted Langevin algorithm (MALA).
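
A minimal ULA sketch (Python/NumPy; the double-well energy $E(u) = (u^2 - 1)^2$ is an illustrative target, and the step size trades bias against mixing speed):

```python
import numpy as np

# Unadjusted Langevin algorithm for p*(u) ∝ exp(-E(u)), E(u) = (u^2 - 1)^2.
grad_E = lambda u: 4.0 * u * (u**2 - 1.0)

rng = np.random.default_rng(7)
dt, n_steps, n_chains = 1e-3, 50_000, 100
u = rng.normal(size=n_chains)

for _ in range(n_steps):
    u = u - grad_E(u) * dt + np.sqrt(2 * dt) * rng.normal(size=n_chains)

# Samples concentrate near the two modes u = ±1. ULA carries an O(dt) bias;
# a Metropolis-Hastings accept/reject step (MALA) would remove it.
print(np.mean(np.abs(u)))  # close to 1
```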

Linear SDEs as Gaussian Processes

Since solutions to linear SDEs have Gaussian finite-dimensional distributions, they are Gaussian processes. The GP mean and covariance can be computed from the SDE coefficients:

$$
m(t) = \mathbb{E}[u_t], \qquad k(s, t) = \operatorname{Cov}(u_s, u_t),
$$

both of which satisfy ODEs derivable from the SDE.

Conversely, many classical stationary covariance kernels correspond exactly to the stationary GPs generated by specific linear SDEs:

| Kernel | Corresponding SDE |
| --- | --- |
| Exponential: $k(r) = \sigma^2 e^{-\ell r}$ | First-order linear SDE (the Ornstein–Uhlenbeck process) |
| Matérn-3/2: $k(r) = (1 + \sqrt{3}\ell r)\,e^{-\sqrt{3}\ell r}$ | Second-order linear SDE (two-state system) |
| Matérn-5/2: $k(r) = (1 + \sqrt{5}\ell r + \tfrac{5}{3}\ell^2 r^2)\,e^{-\sqrt{5}\ell r}$ | Third-order linear SDE (three-state system) |
| Squared exponential (approximate) | Infinite-order SDE, approximated by truncating a series expansion |

This SDE–GP equivalence has an important computational consequence. Standard GP regression with $N$ observations requires $O(N^3)$ time to compute (and invert) the $N \times N$ covariance matrix. When the covariance function corresponds to a linear SDE, the Kalman filter performs the same inference in $O(N)$ time, by exploiting the Markov structure of the SDE solution. This makes SDE-based GP approximation a key technique for large-scale time series (Solin & Särkkä, 2020).
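
For the exponential kernel, this reduction amounts to a scalar Kalman filter run over the sorted inputs. A minimal sketch (Python/NumPy; the helper name, test function, and hyperparameters are ours for illustration):

```python
import numpy as np

def ou_gp_filter(t, y, s2, theta, noise_var):
    """O(N) Kalman filter for GP regression with kernel k(r) = s2 * exp(-theta r),
    i.e. a stationary OU prior observed under Gaussian noise."""
    m, P = 0.0, s2                        # stationary prior at the first input
    means, variances = [], []
    for i in range(len(t)):
        if i > 0:                         # predict through the OU transition
            a = np.exp(-theta * (t[i] - t[i - 1]))
            m, P = a * m, a**2 * P + s2 * (1 - a**2)
        K = P / (P + noise_var)           # Kalman gain for y_i = u_i + noise
        m, P = m + K * (y[i] - m), (1 - K) * P
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances)

# N noisy observations, filtered in a single O(N) pass.
rng = np.random.default_rng(8)
t = np.sort(rng.uniform(0, 10, size=1_000))
y = np.sin(t) + rng.normal(0, 0.3, size=t.size)
m, v = ou_gp_filter(t, y, s2=1.0, theta=1.0, noise_var=0.09)
print(np.abs(m - np.sin(t)).mean())  # small: the filtered mean tracks the signal
```

The forward pass conditions each estimate only on past observations; a backward Rauch–Tung–Striebel smoothing pass, also $O(N)$, recovers the full GP posterior.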

The Reverse-Time SDE

A foundational result due to Anderson (1982) shows that the time reversal of a diffusion process is also a diffusion. If $\{u_t : t \in [0, T]\}$ satisfies the forward SDE

$$
du_t = f(u_t, t)\,dt + \sigma(t)\,dw_t,
$$

then the reverse-time process $\bar{u}_t = u_{T-t}$ satisfies

$$
d\bar{u}_t = \bigl[-f(\bar{u}_t, T-t) + \sigma(T-t)^2\,\nabla_u \log p_{T-t}(\bar{u}_t)\bigr]dt + \sigma(T-t)\,d\bar{w}_t,
$$

where $\bar{w}_t$ is a new Brownian motion (running backward) and $\nabla_u \log p_t(u)$ is the score function of the marginal density at time $t$.

The reverse SDE has the same marginal distributions as the forward SDE, but run in reverse time. Starting from $\bar{u}_0 \sim p_T$ (a simple noise distribution) and simulating the reverse SDE to time $T$ produces a sample from $p_0$ (the data distribution).
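
For a Gaussian data distribution pushed through an OU forward process, every marginal stays Gaussian, so the score is available in closed form and the whole recipe can be checked end to end. A sketch (Python/NumPy; all parameter values are illustrative):

```python
import numpy as np

# Forward SDE: du = -u dt + sqrt(2) dw, whose stationary distribution is N(0, 1).
# With Gaussian data u_0 ~ N(m0, v0), every marginal p_t is Gaussian, so the
# score grad log p_t is analytic and the reverse SDE can be simulated directly.
m0, v0, T = 3.0, 0.25, 4.0
mean_t = lambda t: m0 * np.exp(-t)
var_t = lambda t: v0 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)
score = lambda u, t: -(u - mean_t(t)) / var_t(t)

rng = np.random.default_rng(9)
n, n_samples = 2_000, 50_000
dt = T / n

u = rng.normal(mean_t(T), np.sqrt(var_t(T)), size=n_samples)  # draw from p_T
for i in range(n):
    s = T - i * dt                 # forward time matching this reverse step
    drift = u + 2.0 * score(u, s)  # -f + sigma^2 * score, with f(u) = -u
    u = u + drift * dt + np.sqrt(2 * dt) * rng.normal(size=n_samples)

print(u.mean(), u.std())  # ~ (3.0, 0.5): the data distribution N(m0, v0) is recovered
```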

This reverse-time result is the mathematical heart of score-based generative modeling: design a forward SDE that converts data into noise, estimate the score $\nabla_u \log p_t$ using a neural network trained by denoising score matching, then run the reverse SDE to generate new data. The connection between the reverse-time correction and the score function explains why learning to denoise is equivalent to learning to generate, a remarkable duality discussed further in the Denoising Diffusion Models chapter.

Conclusion

Stochastic differential equations extend ordinary differential equations by adding Brownian noise, yielding continuous-time models whose sample paths are nowhere differentiable. The Ito integral and Ito’s lemma provide the calculus needed to work with these processes. Linear SDEs are especially tractable: their solutions are Gaussian processes with kernels that correspond to classical covariance functions (exponential, Matérn), and the Kalman filter performs GP inference in linear time by exploiting the Markov structure. Stationary distributions of SDEs are Boltzmann densities, connecting SDEs to Langevin MCMC. The reverse-time SDE of Anderson (1982) is the mathematical foundation of score-based generative modeling, providing a principled framework for understanding denoising diffusion models.

References
  1. Solin, A., & Särkkä, S. (2020). Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2), 419–446.
  2. Anderson, B. D. O. (1982). Reverse-time diffusion equation models. Stochastic Processes and Their Applications, 12(3), 313–326.