
Stochastic Differential Equations

Stochastic differential equations (SDEs) describe the continuous-time evolution of systems driven by random noise. They are the continuous-time counterpart of the linear dynamical systems studied in Part IV and underpin a broad range of models in statistics, physics, finance, and machine learning. In this course they arise most prominently in two places: as the principled mathematical framework for denoising diffusion models, and as a lens through which many stationary Gaussian processes can be formulated and computed efficiently.

Brownian Motion

The fundamental source of randomness in continuous-time stochastic systems is Brownian motion (also called the Wiener process).

A scalar Brownian motion $w_t$ is a process with continuous sample paths, $w_0 = 0$, and independent Gaussian increments $w_t - w_s \sim \mathcal{N}(0, t - s)$ for $s < t$. A $d$-dimensional Brownian motion $\mathbf{w}_t \in \mathbb{R}^d$ is a vector of $d$ independent scalar Brownian motions, so $\mathbf{w}_t - \mathbf{w}_s \sim \mathcal{N}(\mathbf{0},\, (t-s)I)$.

Brownian motion is nowhere differentiable almost surely: sample paths have unbounded variation on every interval and are far too rough for a classical derivative to exist. This is not a pathology; it is essential for the central limit theorem intuition underlying the construction. As we add finer and finer independent noise increments, the accumulated noise scales as $\sqrt{t}$ (not $t$), and this $\sqrt{\Delta t}$ scaling of increments is what forces us to treat Brownian motion differently from smooth driving signals.
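
To make the $\sqrt{\Delta t}$ scaling concrete, Brownian paths can be simulated by accumulating independent Gaussian increments. A minimal Python sketch (NumPy assumed; the horizon, grid size, and path count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, n_paths = 1.0, 1_000, 5_000
dt = T / n

# Brownian increments are independent N(0, dt) draws, i.e. sqrt(dt) * N(0, 1).
dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
w = np.cumsum(dw, axis=1)  # each row is one Brownian path (starting from w_0 = 0)

# The accumulated noise at time T has standard deviation sqrt(T), not T:
print("empirical std of w_T:", w[:, -1].std())  # ~ sqrt(T) = 1.0
```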

The Ito Integral

Because $w_t$ is not differentiable, integrals of the form $\int_0^t \sigma(u_s, s)\,dw_s$ must be defined with care. The Ito integral constructs this as an $L^2$ limit using left-endpoint Riemann sums:

$$
\int_0^t \sigma(u_s, s)\,dw_s \;=\; \lim_{n \to \infty} \sum_{i=0}^{n-1} \sigma(u_{t_i}, t_i)\,\bigl[w_{t_{i+1}} - w_{t_i}\bigr], \qquad t_i = \tfrac{i t}{n}.
$$

The integrand must be adapted to the Brownian filtration $\{\mathcal{F}_t\}$, meaning $\sigma(u_t, t)$ may depend only on the history up to time $t$, not on future values of $w$.
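
The construction can be checked numerically on an integrand with a known answer: for the choice $\sigma(u_s, s) = w_s$, the identity $\int_0^t w_s\,dw_s = \tfrac{1}{2}(w_t^2 - t)$ holds (it follows from Ito's lemma, derived below). A sketch of the left-endpoint sum (Python/NumPy; the grid size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
t, n = 1.0, 200_000
dt = t / n

dw = rng.normal(0.0, np.sqrt(dt), size=n)
w = np.concatenate([[0.0], np.cumsum(dw)])  # w_{t_0}, ..., w_{t_n}

# Left-endpoint Riemann sum: the integrand is evaluated at the START of each
# interval, as required for adaptedness to the filtration.
ito_sum = np.sum(w[:-1] * dw)

closed_form = 0.5 * (w[-1] ** 2 - t)
print(ito_sum, closed_form)  # agree up to O(sqrt(dt)) discretization error
```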

Two key properties of the Ito integral follow directly from the construction.

**Zero mean.** Each summand $\sigma(u_{t_i}, t_i)[w_{t_{i+1}} - w_{t_i}]$ has zero conditional expectation given $\mathcal{F}_{t_i}$, because the Brownian increment is independent of the past and has mean zero. By the tower property, every partial sum has mean zero, and the limit inherits this:

$$
\mathbb{E}\left[\int_0^t \sigma(u_s, s)\,dw_s\right] = 0.
$$

**Ito isometry.** Cross-terms in $\bigl(\sum_i \sigma_{t_i}\Delta w_i\bigr)^2$ vanish for $i \neq j$ by the same independence argument. Only the diagonal terms contribute, using $\mathbb{E}[(\Delta w_i)^2] = \Delta t$:

$$
\mathbb{E}\left[\left(\int_0^t \sigma(u_s, s)\,dw_s\right)^{\!2}\right] = \int_0^t \mathbb{E}\left[\sigma(u_s, s)^2\right] ds.
$$

The isometry says that the $L^2$ norm of the integral equals the $L^2$ norm of the integrand in the product space $[0, t] \times \Omega$.
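
Both properties can be verified by Monte Carlo. For the integrand $\sigma(u_s, s) = w_s$ on $[0, 1]$, the isometry predicts a second moment of $\int_0^1 \mathbb{E}[w_s^2]\,ds = \int_0^1 s\,ds = \tfrac{1}{2}$. A sketch (Python/NumPy; path and grid counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n_paths, n = 5_000, 1_000
dt = 1.0 / n

dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
w = np.cumsum(dw, axis=1) - dw  # left endpoints w_{t_0} = 0, ..., w_{t_{n-1}}

integrals = np.sum(w * dw, axis=1)                # one Ito sum per path
print("mean:", integrals.mean())                  # ~ 0   (zero-mean property)
print("second moment:", (integrals ** 2).mean())  # ~ 0.5 (Ito isometry)
```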

Stochastic Differential Equations

Starting from Euler’s method for the ODE $\dot{u} = f(u, t)$,

$$
u(t + \Delta t) = u(t) + f(u(t), t)\,\Delta t,
$$

we augment each step with a Gaussian noise term whose variance scales as $\Delta t$, matching the variance of a Brownian increment over the same interval:

$$
u(t + \Delta t) = u(t) + f(u(t), t)\,\Delta t + \sigma(u(t), t)\,\varepsilon\sqrt{\Delta t}, \qquad \varepsilon \sim \mathcal{N}(0, I).
$$

Recognizing $\varepsilon\sqrt{\Delta t} = w(t + \Delta t) - w(t)$ and taking $\Delta t \to 0$ yields the stochastic differential equation

$$
du_t = f(u_t, t)\,dt + \sigma(u_t, t)\,dw_t,
$$

interpreted rigorously as the integral equation

$$
u_t = u_0 + \int_0^t f(u_s, s)\,ds + \int_0^t \sigma(u_s, s)\,dw_s.
$$

The function $f : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d$ is the drift coefficient and $\sigma : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^{d \times d}$ is the diffusion coefficient. The process $\{u_t\}$ solving the SDE is called an Ito process or diffusion process. Under standard Lipschitz and growth conditions on $f$ and $\sigma$, existence and uniqueness of a strong solution are guaranteed.

**Why Gaussian noise?** Beyond analytical tractability, the central limit theorem gives a physical justification: a large number of small independent perturbations accumulates into something Gaussian. The SDE framework can be generalized to other driving processes (any semimartingale), but Gaussian noise is by far the most common choice.

Ito’s Lemma

The ordinary chain rule says: if $u(t)$ is a smooth function and $\phi$ is smooth, then $d\phi(u) = \phi'(u)\,du$. For an Ito process, there is an additional correction term arising from the quadratic variation of Brownian motion.

**Derivation sketch.** Expand $\phi$ to second order via Taylor’s theorem and substitute $du_t = f_t\,dt + \sigma_t\,dw_t$:

$$
d\phi = \frac{\partial\phi}{\partial t}\,dt + \frac{\partial\phi}{\partial u}\,du_t + \frac{1}{2}\frac{\partial^2\phi}{\partial u^2}(du_t)^2 + \cdots
$$

The key step is evaluating $(du_t)^2 = (f_t\,dt + \sigma_t\,dw_t)^2$. Using the multiplication rules $dt^2 \to 0$, $dt\,dw_t \to 0$, and $dw_t^2 \to dt$, which follow from the $L^2$ behavior of Brownian motion, gives $(du_t)^2 = \sigma_t^2\,dt$, and all higher-order terms vanish. Collecting terms yields the scalar Ito formula

$$
d\phi = \left(\frac{\partial\phi}{\partial t} + f_t\,\frac{\partial\phi}{\partial u} + \frac{\sigma_t^2}{2}\frac{\partial^2\phi}{\partial u^2}\right)dt + \sigma_t\,\frac{\partial\phi}{\partial u}\,dw_t.
$$

The extra $\frac{\sigma_t^2}{2}\frac{\partial^2\phi}{\partial u^2}$ term, absent in ordinary calculus, is called the Ito correction.

**Multivariate form.** For $\mathbf{u}_t \in \mathbb{R}^d$ satisfying $d\mathbf{u}_t = \mathbf{f}_t\,dt + G_t\,d\mathbf{w}_t$ and a smooth function $\phi : \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}$,

$$
d\phi = \frac{\partial\phi}{\partial t}\,dt + (\nabla_u \phi)^\top d\mathbf{u}_t + \frac{1}{2}\operatorname{tr}\left(G_t G_t^\top \nabla_u^2 \phi\right)dt.
$$

**Example: geometric Brownian motion.** The SDE $du_t = \mu u_t\,dt + \sigma u_t\,dw_t$ is the classical model for asset prices. Applying Ito’s lemma to $\phi(u) = \log u$ gives

$$
d\log u_t = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dw_t,
$$

so $\log u_t$ drifts linearly and $u_t = u_0\exp\bigl((\mu - \tfrac{\sigma^2}{2})t + \sigma w_t\bigr)$ is log-normally distributed. The $-\sigma^2/2$ correction relative to the drift $\mu$ is a direct consequence of the Ito term.
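
The correction is easy to observe numerically: sampling the closed-form solution shows that $\log u_T$ has mean $(\mu - \sigma^2/2)T$, while $\mathbb{E}[u_T] = u_0 e^{\mu T}$. A sketch (Python/NumPy; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, u0, T = 0.1, 0.4, 1.0, 2.0

# Exact simulation of geometric Brownian motion at time T via the
# closed-form solution u_T = u0 * exp((mu - sigma^2/2) T + sigma * w_T).
w_T = rng.normal(0.0, np.sqrt(T), size=100_000)
u_T = u0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * w_T)

print(np.log(u_T).mean())  # ~ (mu - sigma^2/2) T = 0.04, not mu T = 0.2
print(u_T.mean())          # ~ u0 * exp(mu T) = 1.22: corrections cancel in E[u_T]
```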

The Fokker-Planck Equation

Instead of tracking individual trajectories, one can describe the evolution of the marginal density $p_t(u)$ of the process. For the SDE $du_t = f(u_t, t)\,dt + \sigma(u_t, t)\,dw_t$, the marginal density satisfies the Fokker-Planck equation (also called the forward Kolmogorov equation):

$$
\frac{\partial p_t}{\partial t}(u) = -\nabla_u \cdot \bigl[f(u, t)\,p_t(u)\bigr] + \frac{1}{2}\,\nabla_u^2 : \bigl[\sigma(u,t)\sigma(u,t)^\top p_t(u)\bigr].
$$

In the scalar case with state-independent diffusion $\sigma(t)$, this simplifies to

$$
\frac{\partial p_t}{\partial t} = -\frac{\partial}{\partial u}\bigl[f(u,t)\,p_t\bigr] + \frac{\sigma(t)^2}{2}\frac{\partial^2 p_t}{\partial u^2}.
$$

The first term describes probability transport due to the drift; it has the form of a continuity equation. The second term is a diffusion (heat-equation) term that spreads probability mass. Together they give a precise PDE governing how the distribution of $u_t$ evolves over time.
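
As a concrete check, the scalar Fokker-Planck equation can be integrated on a grid with explicit finite differences. The sketch below (Python/NumPy; the OU drift $f(u) = -\theta u$, the grid, and the step sizes are all illustrative choices) relaxes an off-center initial bump toward the Gaussian stationary density derived in the Stationary Distributions section:

```python
import numpy as np

# Explicit finite-difference integration of the scalar Fokker-Planck equation
# with drift f(u) = -theta * u and constant diffusion sigma.
theta, sigma = 1.0, 1.0
x = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / sigma**2            # small step keeps the explicit scheme stable

p = np.exp(-0.5 * (x - 2.0) ** 2 / 0.3**2)  # initial density centered at u = 2
p /= p.sum() * dx

for _ in range(int(6.0 / dt)):         # integrate to t = 6
    transport = -np.gradient(-theta * x * p, dx)           # -d/du [f p]
    diffusion = 0.5 * sigma**2 * (
        np.roll(p, -1) - 2 * p + np.roll(p, 1)) / dx**2    # (sigma^2/2) p''
    p += dt * (transport + diffusion)
    p[0] = p[-1] = 0.0                 # far-field boundary condition

stationary = np.exp(-theta * x**2 / sigma**2)  # N(0, sigma^2 / (2 theta))
stationary /= stationary.sum() * dx
print(np.abs(p - stationary).max())    # ~ 0: the density has relaxed
```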

The backward Kolmogorov equation is the adjoint of the Fokker-Planck operator and describes how expected functions of the terminal state $\mathbb{E}[\phi(u_T) \mid u_t = u]$ evolve backward in time from $T$ to $t$. It is the continuous-time analogue of the backwards recursions in dynamic programming.

Simulation: Euler–Maruyama

Given initial condition $u_0 \sim P_0$ and a time grid $0 = t_0 < t_1 < \cdots < t_n = T$ with step $\Delta t$, the Euler–Maruyama method approximates the SDE solution by

$$
u_{t_{i+1}} = u_{t_i} + f(u_{t_i}, t_i)\,\Delta t + \sigma(u_{t_i}, t_i)\,\varepsilon_i\sqrt{\Delta t}, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, I).
$$
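
A direct scalar implementation (Python/NumPy sketch; the function name `euler_maruyama` and the OU coefficients in the example are ours for illustration):

```python
import numpy as np

def euler_maruyama(f, sigma, u0, T, n, rng):
    """Simulate one path of du = f(u,t) dt + sigma(u,t) dw on [0, T]."""
    dt = T / n
    u = np.empty(n + 1)
    u[0] = u0
    for i in range(n):
        t = i * dt
        u[i + 1] = (u[i] + f(u[i], t) * dt
                    + sigma(u[i], t) * np.sqrt(dt) * rng.normal())
    return u

# Example: an Ornstein-Uhlenbeck path with theta = 1, mu = 0, sigma = 0.5.
rng = np.random.default_rng(4)
path = euler_maruyama(f=lambda u, t: -u, sigma=lambda u, t: 0.5,
                      u0=3.0, T=5.0, n=5_000, rng=rng)
print(path[-1])  # by t = 5 the path is near the stationary regime N(0, 0.125)
```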

Euler–Maruyama is the direct stochastic analogue of Euler’s method and has strong convergence rate $O(\sqrt{\Delta t})$. The Milstein scheme adds a correction using Ito’s lemma applied to $\sigma$,

$$
u_{t_{i+1}} = u_{t_i} + f\,\Delta t + \sigma\,\varepsilon_i\sqrt{\Delta t} + \tfrac{1}{2}\sigma\sigma'\bigl(\varepsilon_i^2 - 1\bigr)\Delta t,
$$

improving the strong convergence rate to $O(\Delta t)$ when $\sigma$ depends on the state. When the transition density $p_{t_i, t_{i+1}}(u_{t_{i+1}} \mid u_{t_i})$ is known in closed form, as it is for linear SDEs, exact simulation is possible without any discretization error.
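
The rate difference can be measured directly on geometric Brownian motion, where the exact solution along each simulated Brownian path is available for comparison (see the Ito’s Lemma section). A sketch (Python/NumPy; path counts and step counts are arbitrary):

```python
import numpy as np

# Strong error of Euler-Maruyama vs Milstein for du = mu*u dt + sig*u dw,
# where sigma'(u) = sig and the exact solution along the path is known.
mu, sig, u0, T = 0.5, 1.0, 1.0, 1.0
rng = np.random.default_rng(5)

def strong_errors(n, n_paths=2_000):
    dt = T / n
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
    em = np.full(n_paths, u0)
    mil = np.full(n_paths, u0)
    for i in range(n):
        em = em + mu * em * dt + sig * em * dw[:, i]
        mil = (mil + mu * mil * dt + sig * mil * dw[:, i]
               + 0.5 * sig**2 * mil * (dw[:, i] ** 2 - dt))  # Milstein correction
    exact = u0 * np.exp((mu - 0.5 * sig**2) * T + sig * dw.sum(axis=1))
    return np.abs(em - exact).mean(), np.abs(mil - exact).mean()

for n in (10, 40, 160):
    print(n, strong_errors(n))  # EM error shrinks ~ n^{-1/2}, Milstein ~ n^{-1}
```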

Linear SDEs and the Ornstein–Uhlenbeck Process

The most tractable family of SDEs has a drift that is linear in the state:

$$
d\mathbf{u}_t = \bigl(A(t)\mathbf{u}_t + \mathbf{b}(t)\bigr)dt + G(t)\,d\mathbf{w}_t.
$$

Because the drift is affine in the state and the noise enters additively, solutions have Gaussian transition densities and therefore define Gaussian processes.

**Solution via integrating factor.** For the scalar time-invariant case $du_t = a\,u_t\,dt + g\,dw_t$, apply Ito’s lemma to $\phi(u_t, t) = e^{-at}u_t$:

$$
d(e^{-at}u_t) = e^{-at}\,g\,dw_t.
$$

Integrating both sides from $0$ to $t$ and rearranging,

$$
u_t = e^{at}u_0 + g\int_0^t e^{a(t-s)}\,dw_s.
$$

The Ito integral is Gaussian (as an $L^2$ limit of Gaussian sums), so the transition density is

$$
p(u_t \mid u_0) = \mathcal{N}\left(e^{at}u_0,\;\; g^2\int_0^t e^{2a(t-s)}\,ds\right).
$$

For $a < 0$, the mean decays exponentially toward zero and the variance saturates at $g^2/(2|a|)$ as $t \to \infty$.

**The Ornstein–Uhlenbeck process.** The canonical mean-reverting SDE is

$$
du_t = -\theta(u_t - \mu)\,dt + \sigma\,dw_t, \qquad \theta > 0.
$$

The drift pulls $u_t$ toward the long-run mean $\mu$ at rate $\theta$; $\sigma$ controls the noise level. The transition density is

$$
p(u_t \mid u_s) = \mathcal{N}\left(\mu + e^{-\theta(t-s)}(u_s - \mu),\;\; \frac{\sigma^2}{2\theta}\bigl(1 - e^{-2\theta(t-s)}\bigr)\right).
$$

The marginal distribution converges to the stationary distribution $\mathcal{N}(\mu,\, \sigma^2/(2\theta))$ regardless of the initial condition, with the exponential covariance kernel $k(s, t) = \frac{\sigma^2}{2\theta}e^{-\theta|t-s|}$.
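
Because the transition density is Gaussian and available in closed form, OU paths can be sampled exactly at any (even irregular) set of times, with no discretization error. A sketch (Python/NumPy; the helper name and parameters are ours for illustration):

```python
import numpy as np

def ou_sample_path(theta, mu, sigma, u0, ts, rng):
    """Sample an OU path exactly at the times ts via the Gaussian transition."""
    u = np.empty(len(ts))
    u[0] = u0
    for i in range(1, len(ts)):
        dt = ts[i] - ts[i - 1]
        mean = mu + np.exp(-theta * dt) * (u[i - 1] - mu)
        var = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * dt))
        u[i] = rng.normal(mean, np.sqrt(var))
    return u

rng = np.random.default_rng(6)
ts = np.linspace(0.0, 10.0, 101)
u = ou_sample_path(theta=1.0, mu=2.0, sigma=1.0, u0=-3.0, ts=ts, rng=rng)
print(u[-1])  # by t = 10, effectively a draw from N(mu, sigma^2/(2 theta)) = N(2, 0.5)
```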

**Multivariate case.** For $d\mathbf{u}_t = A\mathbf{u}_t\,dt + G\,d\mathbf{w}_t$ with stable $A$ (all eigenvalues in the left half-plane), the solution is

$$
\mathbf{u}_t = e^{At}\mathbf{u}_0 + \int_0^t e^{A(t-s)}G\,d\mathbf{w}_s,
$$

with transition covariance $P(t) = \int_0^t e^{As}GG^\top e^{A^\top s}\,ds$. The stationary covariance $P_\infty$ is the unique positive definite solution to the continuous-time Lyapunov equation

$$
AP_\infty + P_\infty A^\top + GG^\top = 0.
$$
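
In practice $P_\infty$ is computed with a Lyapunov solver rather than by evaluating the integral. A sketch using SciPy (the two-state system matrices are illustrative):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# A stable two-state system (a damped oscillator driven by noise in the
# velocity component; the values are illustrative).
A = np.array([[0.0, 1.0],
              [-4.0, -0.5]])
G = np.array([[0.0],
              [1.0]])

# solve_continuous_lyapunov solves A X + X A^T = Q, so take Q = -G G^T.
P_inf = solve_continuous_lyapunov(A, -G @ G.T)

residual = A @ P_inf + P_inf @ A.T + G @ G.T
print(np.abs(residual).max())     # ~ 0: the Lyapunov equation is satisfied
print(np.linalg.eigvalsh(P_inf))  # positive eigenvalues: P_inf is positive definite
```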

Stationary Distributions

When does the density $p_t$ converge as $t \to \infty$? A stationary distribution $p^*$ satisfies the time-independent Fokker-Planck equation (right-hand side set to zero). For the scalar constant-diffusion case $\sigma(u, t) = \sigma$, one can integrate the Fokker-Planck equation twice to obtain

$$
p^*(u) \propto \exp\left(\frac{2}{\sigma^2}\int^u f(v)\,dv\right).
$$

This is the Boltzmann distribution with energy $E(u) = -\int^u f(v)\,dv$.

**Langevin dynamics.** This connection between drift and stationary distribution is the foundation of gradient-based MCMC. To sample from a target distribution $p^*(u) \propto e^{-E(u)}$, one runs the SDE

$$
du_t = -\nabla E(u_t)\,dt + \sqrt{2}\,dw_t,
$$

which has $p^*(u) \propto e^{-E(u)}$ as its unique stationary distribution. In the Bayesian context, $E(\boldsymbol{\theta}) = -\log p(\mathbf{x} \mid \boldsymbol{\theta}) - \log p(\boldsymbol{\theta})$, and Langevin dynamics provides a continuous-time limit of gradient-based MCMC. Discretization via Euler–Maruyama gives the unadjusted Langevin algorithm (ULA); adding a Metropolis–Hastings correction step recovers the Metropolis-adjusted Langevin algorithm (MALA).
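
A minimal ULA sketch (Python/NumPy; the double-well energy $E(u) = (u^2 - 1)^2$ is an illustrative target, and the step size trades bias against mixing speed):

```python
import numpy as np

# Unadjusted Langevin algorithm for p*(u) ∝ exp(-E(u)), E(u) = (u^2 - 1)^2.
grad_E = lambda u: 4.0 * u * (u**2 - 1.0)

rng = np.random.default_rng(7)
dt, n_steps, n_chains = 1e-3, 50_000, 100
u = rng.normal(size=n_chains)

for _ in range(n_steps):
    u = u - grad_E(u) * dt + np.sqrt(2 * dt) * rng.normal(size=n_chains)

# Samples concentrate near the two modes u = ±1. ULA carries an O(dt) bias;
# a Metropolis-Hastings accept/reject step (MALA) would remove it.
print(np.mean(np.abs(u)))  # close to 1
```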

Linear SDEs as Gaussian Processes

Since solutions to linear SDEs have Gaussian finite-dimensional distributions, they are Gaussian processes. The GP mean and covariance can be computed from the SDE coefficients:

$$
m(t) = \mathbb{E}[u_t], \qquad k(s, t) = \operatorname{Cov}(u_s, u_t),
$$

both of which satisfy ODEs derivable from the SDE.

Conversely, many classical stationary covariance kernels correspond exactly to the stationary GPs generated by specific linear SDEs:

| Kernel | Corresponding SDE |
| --- | --- |
| Exponential: $k(r) = \sigma^2 e^{-\ell r}$ | First-order linear SDE (the Ornstein–Uhlenbeck process) |
| Matérn-3/2: $k(r) = (1 + \sqrt{3}\ell r)\,e^{-\sqrt{3}\ell r}$ | Second-order linear SDE (two-state system) |
| Matérn-5/2: $k(r) = (1 + \sqrt{5}\ell r + \tfrac{5}{3}\ell^2 r^2)\,e^{-\sqrt{5}\ell r}$ | Third-order linear SDE (three-state system) |
| Squared exponential (approximate) | Infinite-order SDE, approximated by truncating a series expansion |

This SDE–GP equivalence has an important computational consequence. Standard GP regression with $N$ observations requires $O(N^3)$ time to compute (and invert) the $N \times N$ covariance matrix. When the covariance function corresponds to a linear SDE, the Kalman filter performs the same inference in $O(N)$ time, by exploiting the Markov structure of the SDE solution. This makes SDE-based GP approximation a key technique for large-scale time series (Solin & Särkkä, 2020).
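
For the exponential kernel, this reduction amounts to a scalar Kalman filter run over the sorted inputs. A minimal sketch (Python/NumPy; the helper name, test function, and hyperparameters are ours for illustration):

```python
import numpy as np

def ou_gp_filter(t, y, s2, theta, noise_var):
    """O(N) Kalman filter for GP regression with kernel k(r) = s2 * exp(-theta r),
    i.e. a stationary OU prior observed under Gaussian noise."""
    m, P = 0.0, s2                        # stationary prior at the first input
    means, variances = [], []
    for i in range(len(t)):
        if i > 0:                         # predict through the OU transition
            a = np.exp(-theta * (t[i] - t[i - 1]))
            m, P = a * m, a**2 * P + s2 * (1 - a**2)
        K = P / (P + noise_var)           # Kalman gain for y_i = u_i + noise
        m, P = m + K * (y[i] - m), (1 - K) * P
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances)

# N noisy observations, filtered in a single O(N) pass.
rng = np.random.default_rng(8)
t = np.sort(rng.uniform(0, 10, size=1_000))
y = np.sin(t) + rng.normal(0, 0.3, size=t.size)
m, v = ou_gp_filter(t, y, s2=1.0, theta=1.0, noise_var=0.09)
print(np.abs(m - np.sin(t)).mean())  # small: the filtered mean tracks the signal
```

The forward pass conditions each estimate only on past observations; a backward Rauch–Tung–Striebel smoothing pass, also $O(N)$, recovers the full GP posterior.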

The Reverse-Time SDE

A foundational result due to Anderson (1982) shows that the time reversal of a diffusion process is also a diffusion. If $\{u_t : t \in [0, T]\}$ satisfies the forward SDE

$$
du_t = f(u_t, t)\,dt + \sigma(t)\,dw_t,
$$

then the reverse-time process $\bar{u}_t = u_{T-t}$ satisfies

$$
d\bar{u}_t = \bigl[-f(\bar{u}_t, T-t) + \sigma(T-t)^2\,\nabla_u \log p_{T-t}(\bar{u}_t)\bigr]dt + \sigma(T-t)\,d\bar{w}_t,
$$

where $\bar{w}_t$ is a new Brownian motion (running backward) and $\nabla_u \log p_t(u)$ is the score function of the marginal density at time $t$.

The reverse SDE has the same marginal distributions as the forward SDE, but run in reverse time. Starting from $\bar{u}_0 \sim p_T$ (a simple noise distribution) and simulating the reverse SDE to time $T$ produces a sample from $p_0$ (the data distribution).
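
For a Gaussian data distribution pushed through an OU forward process, every marginal stays Gaussian, so the score is available in closed form and the whole recipe can be checked end to end. A sketch (Python/NumPy; all parameter values are illustrative):

```python
import numpy as np

# Forward SDE: du = -u dt + sqrt(2) dw, whose stationary distribution is N(0, 1).
# With Gaussian data u_0 ~ N(m0, v0), every marginal p_t is Gaussian, so the
# score grad log p_t is analytic and the reverse SDE can be simulated directly.
m0, v0, T = 3.0, 0.25, 4.0
mean_t = lambda t: m0 * np.exp(-t)
var_t = lambda t: v0 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)
score = lambda u, t: -(u - mean_t(t)) / var_t(t)

rng = np.random.default_rng(9)
n, n_samples = 2_000, 50_000
dt = T / n

u = rng.normal(mean_t(T), np.sqrt(var_t(T)), size=n_samples)  # draw from p_T
for i in range(n):
    s = T - i * dt                 # forward time matching this reverse step
    drift = u + 2.0 * score(u, s)  # -f + sigma^2 * score, with f(u) = -u
    u = u + drift * dt + np.sqrt(2 * dt) * rng.normal(size=n_samples)

print(u.mean(), u.std())  # ~ (3.0, 0.5): the data distribution N(m0, v0) is recovered
```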

This reverse-time result is the mathematical heart of score-based generative modeling: design a forward SDE that converts data into noise, estimate the score $\nabla_u \log p_t$ using a neural network trained by denoising score matching, then run the reverse SDE to generate new data. The connection between the reverse-time correction and the score function explains why learning to denoise is equivalent to learning to generate, a remarkable duality discussed further in the Denoising Diffusion Models chapter.

Conclusion

Stochastic differential equations extend ordinary differential equations by adding Brownian noise, yielding continuous-time models whose sample paths are nowhere differentiable. The Ito integral and Ito’s lemma provide the calculus needed to work with these processes. Linear SDEs are especially tractable: their solutions are Gaussian processes with kernels that correspond to classical covariance functions (exponential, Matérn), and the Kalman filter performs GP inference in linear time by exploiting the Markov structure. Stationary distributions of SDEs are Boltzmann densities, connecting SDEs to Langevin MCMC. The reverse-time SDE of Anderson (1982) is the mathematical foundation of score-based generative modeling, providing a principled framework for understanding denoising diffusion models.

References
  1. Solin, A., & Särkkä, S. (2020). Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2), 419–446.
  2. Anderson, B. D. O. (1982). Reverse-time diffusion equation models. Stochastic Processes and Their Applications, 12(3), 313–326.