
Markov Chain Monte Carlo

Exact posterior inference is possible for conjugate models like the ones we studied in the previous chapters. But most interesting models — including the hierarchical Gaussian model with unknown per-school variances — are non-conjugate, and the posterior cannot be computed in closed form.

Markov chain Monte Carlo (MCMC) is the workhorse of Bayesian computation in these settings. Instead of evaluating the posterior analytically, we construct a Markov chain whose stationary distribution is the posterior and then collect samples. Those samples can be used to approximate any posterior expectation.

In this chapter we:

  • Motivate the Monte Carlo approach to posterior inference

  • Introduce Markov chains and their key properties (stationarity, detailed balance, ergodicity)

  • Derive the Metropolis–Hastings algorithm and its Gibbs sampling special case

  • Implement a Gibbs sampler for the hierarchical Gaussian model from the previous chapter

  • Assess convergence and efficiency with MCMC diagnostics

import torch
from torch.distributions import Normal, Gamma, TransformedDistribution
from torch.distributions.transforms import PowerTransform
from pyro.ops.stats import effective_sample_size, autocorrelation
import matplotlib.pyplot as plt

class ScaledInvChiSq(TransformedDistribution):
    '''Scaled inverse chi-squared distribution chi^{-2}(nu, sigma^2).

    Implemented as PowerTransform(-1) of Gamma(nu/2, nu*sigma^2/2),
    matching the parameterization from the previous chapter.
    '''
    def __init__(self, ν, σ2):
        super().__init__(Gamma(ν / 2, ν * σ2 / 2), [PowerTransform(-1)])

Posterior Expectations

The central object of Bayesian inference is the posterior distribution $p(\mbtheta \mid \mbX)$. In practice we rarely need the full density; instead we interact with it through expectations:

  • $\E_{p(\mbtheta \mid \mbX)}[\mbtheta]$ — the posterior mean,

  • $\E_{p(\mbtheta \mid \mbX)}[\mathbb{1}[\mbtheta \in \mathcal{A}]]$ — the posterior probability that $\mbtheta$ lies in the set $\mathcal{A}$,

  • $\E_{p(\mbtheta \mid \mbX)}[p(\mbX' \mid \mbtheta)]$ — the posterior predictive density of new data $\mbX'$.

All of these have the form $\E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)]$ for some function $f$.

Quadrature

One numerical approach is quadrature: place a grid $\{\mbtheta_m\}_{m=1}^M$ over $\Theta$ and approximate

\begin{align}
\E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)] &\approx \sum_{m=1}^M p(\mbtheta_m \mid \mbX) \, f(\mbtheta_m) \, \Delta_m,
\end{align}

where $\Delta_m$ is the volume element around $\mbtheta_m$. This works for low-dimensional posteriors (say $D \leq 5$), but the number of grid points needed grows as $M = O(\epsilon^{-D})$ for a desired accuracy $\epsilon$, so it quickly becomes infeasible.
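A minimal sketch of the quadrature sum, using a hypothetical one-dimensional posterior with a known answer (a Beta(3, 2), whose mean is exactly $3/5$; the choice is illustrative, not from the chapter):

```python
import torch
from torch.distributions import Beta

# Hypothetical 1-D posterior with a known answer: Beta(3, 2), E[theta] = 3/5.
posterior = Beta(3.0, 2.0)

# Grid over Theta = (0, 1) with uniform volume elements Delta_m.
grid = torch.linspace(1e-4, 1 - 1e-4, 1000)
delta = grid[1] - grid[0]

# Quadrature: sum_m p(theta_m | X) f(theta_m) Delta_m, with f(theta) = theta.
post_mean = torch.sum(posterior.log_prob(grid).exp() * grid * delta)
print(post_mean.item())  # ≈ 0.6
```

In one dimension a thousand grid points suffice; in $D$ dimensions the same resolution needs $1000^D$ points, which is the curse the text describes.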

Monte Carlo

Monte Carlo replaces the grid with samples:

\begin{align}
\E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)] &\approx \frac{1}{M} \sum_{m=1}^M f(\mbtheta^{(m)}), \qquad \mbtheta^{(m)} \sim p(\mbtheta \mid \mbX).
\end{align}

Letting $\hat{f} = \frac{1}{M}\sum_m f(\mbtheta^{(m)})$:

  • Unbiasedness: $\E[\hat{f}] = \E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)]$.

  • Variance: if the samples are uncorrelated, $\Var[\hat{f}] = \frac{1}{M}\Var_{p}[f(\mbtheta)]$, so the RMSE scales as $O(M^{-1/2})$ regardless of the dimension $D$.

The catch: how do we draw samples from $p(\mbtheta \mid \mbX)$ when we cannot even compute the normalizing constant $p(\mbX)$?
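The $O(M^{-1/2})$ rate can be seen directly by sampling a stand-in posterior that we happen to be able to draw from (again a Beta(3, 2) with mean $3/5$; the choice is illustrative):

```python
import torch
from torch.distributions import Beta

torch.manual_seed(0)
posterior = Beta(3.0, 2.0)  # stand-in for p(theta | X); true mean is 0.6

# Monte Carlo estimates of E[theta] with increasing sample counts M.
for M in [100, 10_000, 1_000_000]:
    f_hat = posterior.sample((M,)).mean()
    print(f"M = {M:>9,d}   error = {abs(f_hat.item() - 0.6):.5f}")
```

Each hundredfold increase in $M$ shrinks the error by roughly a factor of ten, matching the $M^{-1/2}$ rate, and nothing in the estimator depends on the dimension of $\mbtheta$.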

Markov Chain Monte Carlo

Idea: design a Markov chain whose stationary distribution is the posterior. Run it long enough and the chain's states become approximate draws from $p(\mbtheta \mid \mbX)$.

Markov Chains

A Markov chain is a sequence of random variables $\mbtheta^{(1)}, \mbtheta^{(2)}, \ldots$ with the Markov property: each state depends on the past only through the immediately preceding state,

\begin{align}
\pi(\mbtheta^{(1)}, \ldots, \mbtheta^{(M)}) &= \pi_1(\mbtheta^{(1)}) \prod_{m=2}^M \pi(\mbtheta^{(m)} \mid \mbtheta^{(m-1)}).
\end{align}

Here $\pi_1$ is the initial distribution and $\pi(\mbtheta' \mid \mbtheta)$ is the (homogeneous) transition kernel.

Stationary Distributions

Let $\pi_m(\mbtheta)$ denote the marginal distribution of the $m$-th state. A distribution $\pi^*(\mbtheta)$ is a stationary distribution if

\begin{align}
\pi^*(\mbtheta) &= \int \pi^*(\mbtheta') \, \pi(\mbtheta \mid \mbtheta') \, d\mbtheta'.
\end{align}

Starting from $\pi^*$ and applying the transition kernel leaves the distribution unchanged.

Detailed Balance

A sufficient condition for $\pi^*$ to be stationary is detailed balance:

\begin{align}
\pi^*(\mbtheta') \, \pi(\mbtheta \mid \mbtheta') &= \pi^*(\mbtheta) \, \pi(\mbtheta' \mid \mbtheta) \quad \text{for all } \mbtheta, \mbtheta'.
\end{align}

Integrating both sides over $\mbtheta'$ recovers the stationarity equation, so detailed balance implies stationarity.
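This implication can be checked numerically on a toy three-state chain (the state space and the choice of $\pi^*$ below are illustrative assumptions): build a Metropolis-style transition matrix that satisfies detailed balance with respect to $\pi^*$, then verify stationarity and convergence.

```python
import torch

pi_star = torch.tensor([0.2, 0.3, 0.5])
n = len(pi_star)

# Metropolis chain on {0, 1, 2}: propose one of the other two states uniformly,
# accept with probability min(1, pi_star[j] / pi_star[i]).
P = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        if i != j:
            P[i, j] = 0.5 * min(1.0, (pi_star[j] / pi_star[i]).item())
    P[i, i] = 1.0 - P[i].sum()

# Detailed balance: pi_i P_ij = pi_j P_ji for all i, j.
flows = pi_star[:, None] * P
assert torch.allclose(flows, flows.T)

# Stationarity follows: pi^T P = pi^T.
assert torch.allclose(pi_star @ P, pi_star)

# Ergodicity: from any start, the marginal converges to pi_star.
print(torch.matrix_power(P, 50)[0])  # ≈ tensor([0.2, 0.3, 0.5])
```

The same accept-reject construction, generalized to continuous state spaces, is exactly what Metropolis–Hastings does below.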

Ergodicity

Detailed balance shows $\pi^*$ is a stationary distribution, but not necessarily the unique one. A chain is ergodic if $\pi_m(\mbtheta) \to \pi^*(\mbtheta)$ regardless of the starting distribution $\pi_1$. For our purposes, ergodicity holds whenever the transition kernel assigns positive probability of reaching any region of $\Theta$ from any starting point (i.e., the chain is irreducible and aperiodic).

The Metropolis–Hastings Algorithm

Our goal is to build an ergodic Markov chain with stationary distribution $\pi^*(\mbtheta) = p(\mbtheta \mid \mbX)$.

Key observation: even though we cannot compute $p(\mbtheta \mid \mbX)$ (because $p(\mbX)$ is intractable), we can compute ratios of posterior densities, since the normalizing constant cancels:

\begin{align}
\frac{p(\mbtheta \mid \mbX)}{p(\mbtheta' \mid \mbX)} &= \frac{p(\mbtheta, \mbX)}{p(\mbtheta', \mbX)}.
\end{align}

The Accept–Reject Step

Metropolis–Hastings constructs the transition kernel from two steps:

  1. Propose a new state $\mbtheta$ by sampling from a proposal distribution $q(\mbtheta \mid \mbtheta')$.

  2. Accept the proposal with probability $a(\mbtheta' \to \mbtheta)$; otherwise stay at $\mbtheta'$.

The resulting transition kernel is

\begin{align}
\pi(\mbtheta \mid \mbtheta') &=
\begin{cases}
q(\mbtheta \mid \mbtheta') \, a(\mbtheta' \to \mbtheta) & \mbtheta \neq \mbtheta' \\
1 - \int q(\mbtheta'' \mid \mbtheta') \, a(\mbtheta' \to \mbtheta'') \, d\mbtheta'' & \mbtheta = \mbtheta'.
\end{cases}
\end{align}

Plugging into the detailed balance condition and solving for the acceptance probability gives the Metropolis–Hastings acceptance probability:

\begin{align}
a(\mbtheta' \to \mbtheta) &= \min \left\{ 1, \;
\frac{p(\mbtheta, \mbX) \; q(\mbtheta' \mid \mbtheta)}
     {p(\mbtheta', \mbX) \; q(\mbtheta \mid \mbtheta')} \right\}.
\end{align}

When the ratio exceeds 1 the proposal is always accepted; otherwise it is accepted with probability equal to the ratio. Any proposal $q$ that covers $\Theta$ yields an ergodic chain.

The Metropolis Algorithm

When the proposal is symmetric ($q(\mbtheta \mid \mbtheta') = q(\mbtheta' \mid \mbtheta)$), the proposal densities cancel and

\begin{align}
a(\mbtheta' \to \mbtheta) &= \min \left\{ 1, \; \frac{p(\mbtheta, \mbX)}{p(\mbtheta', \mbX)} \right\}.
\end{align}

This Metropolis algorithm always accepts moves that increase the joint probability and accepts downhill moves with probability equal to the ratio of joint probabilities.
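A minimal sketch of the Metropolis algorithm on a made-up one-dimensional target (unnormalized log joint $-\theta^4/4$, symmetric about zero) with a Gaussian random-walk proposal; the target, step size, and function names here are illustrative assumptions, not from the chapter:

```python
import torch

torch.manual_seed(0)

# Hypothetical unnormalized target: log p(theta, X) = -theta^4 / 4.
def log_joint_toy(theta):
    return -theta**4 / 4

def metropolis(log_joint, theta0, step=1.0, M=20_000):
    theta = torch.tensor(theta0)
    samples = []
    for _ in range(M):
        prop = theta + step * torch.randn(())           # symmetric Gaussian proposal
        log_ratio = log_joint(prop) - log_joint(theta)  # log of the joint-probability ratio
        if torch.rand(()).log() < log_ratio:            # accept with prob min(1, ratio)
            theta = prop
        samples.append(theta)
    return torch.stack(samples)

samples = metropolis(log_joint_toy, theta0=3.0)
print(samples[1000:].mean())  # target is symmetric about zero, so close to 0
```

Note that only the unnormalized `log_joint_toy` is ever evaluated; the normalizing constant never appears, which is the whole point of the ratio-based acceptance rule.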

Gibbs Sampling

Gibbs sampling is a special case of Metropolis–Hastings that updates one coordinate $\theta_d$ at a time by sampling from its complete conditional $p(\theta_d \mid \mbtheta_{\neg d}, \mbX)$.

The proposal is

\begin{align}
q_d(\mbtheta \mid \mbtheta') &= p(\theta_d \mid \mbtheta'_{\neg d}, \mbX) \; \delta_{\mbtheta'_{\neg d}}(\mbtheta_{\neg d}),
\end{align}

i.e., draw a new $\theta_d$ from its complete conditional and keep all other coordinates fixed. Substituting into the MH acceptance probability shows that this proposal is always accepted:

\begin{align}
a_d(\mbtheta' \to \mbtheta) &= 1.
\end{align}

Cycling through all coordinates produces an ergodic chain as long as each complete conditional covers the full support.

def gibbs(theta, X, num_samples):
    """Pseudocode: cycle through coordinates, resampling each from its
    complete conditional p(theta_d | theta_{-d}, X)."""
    samples = []
    for _ in range(num_samples):
        for d in range(len(theta)):
            # sample_conditional draws theta_d ~ p(theta_d | theta_{-d}, X)
            theta[d] = sample_conditional(d, theta, X)
        samples.append(theta.copy())
    return samples

The key requirement is that we can sample each complete conditional. In conjugate models (like the hierarchical Gaussian below), closed-form conditionals make Gibbs extremely efficient. When some conditionals are intractable, we can mix Gibbs and MH updates in the same chain — the Metropolis–Hastings-within-Gibbs strategy.

Worked Example: Hierarchical Gaussian Model

We return to the “8 Schools” example from the previous chapter, but now place a prior on the per-school variances $\sigma_s^2$ as well. This makes the model non-conjugate, and exact inference is intractable.

\begin{align}
\tau^2 &\sim \chi^{-2}(\nu_0, \tau_0^2) \\
\mu &\sim \mathcal{N}(\mu_0, \, \tau^2 / \kappa_0) \\
\theta_s &\sim \mathcal{N}(\mu, \, \tau^2), \quad s = 1, \ldots, S \\
\sigma_s^2 &\sim \chi^{-2}(\alpha_0, \, \sigma_0^2), \quad s = 1, \ldots, S \\
x_{s,n} &\sim \mathcal{N}(\theta_s, \, \sigma_s^2), \quad n = 1, \ldots, N_s
\end{align}

We derive the complete conditionals for each variable and implement a Gibbs sampler.

# Hyperparameters
S = 8                     # number of schools
Ns = torch.tensor(20)     # students per school
μ0, κ0 = 0.0, 0.1         # prior on μ
ν0, τ2_0 = 0.1, 100.0     # prior on τ²
α0, σ2_0 = 0.1, 10.0      # prior on σ_s²

# Synthetic data: match sample means/stds to the 8 Schools values
torch.manual_seed(305)
x_bars  = torch.tensor([28.,  8., -3.,  7., -1.,  1., 18., 12.])
σ_bars  = torch.tensor([15., 10., 16., 11.,  9., 11., 10., 18.])
xs_raw  = Normal(x_bars, torch.sqrt(Ns.float()) * σ_bars).sample((int(Ns),))
zs      = (xs_raw - xs_raw.mean(0)) / xs_raw.std(0)
xs      = x_bars + torch.sqrt(Ns.float()) * σ_bars * zs  # shape (Ns, S)
assert torch.allclose(xs.mean(0), x_bars)
assert torch.allclose(xs.std(0),  torch.sqrt(Ns.float()) * σ_bars)

Log Joint Probability

We track $\log p(\mbtheta, \mbX)$ during sampling as a convergence diagnostic.

def log_joint(τ2, μ, θs, σ2s, xs, τ2_0, ν0, μ0, κ0, α0, σ2_0):
    lp  = ScaledInvChiSq(ν0, τ2_0).log_prob(τ2)
    lp += Normal(μ0, torch.sqrt(τ2 / κ0)).log_prob(μ)
    lp += Normal(μ,  torch.sqrt(τ2)).log_prob(θs).sum()
    lp += ScaledInvChiSq(α0, σ2_0).log_prob(σ2s).sum()
    lp += Normal(θs, torch.sqrt(σ2s)).log_prob(xs).sum()
    return lp

Complete Conditional for $\theta_s$

Combining the prior $\mathcal{N}(\theta_s \mid \mu, \tau^2)$ with the $N_s$ likelihood terms $\mathcal{N}(x_{s,n} \mid \theta_s, \sigma_s^2)$ gives

\begin{align}
p(\theta_s \mid \mu, \tau^2, \sigma_s^2, \mbX) &= \mathcal{N}(\theta_s \mid \hat{\theta}_s, v_s),
\end{align}

where

\begin{align}
v_s &= \left(\frac{N_s}{\sigma_s^2} + \frac{1}{\tau^2}\right)^{-1}, \qquad
\hat{\theta}_s = v_s \left(\frac{\sum_n x_{s,n}}{\sigma_s^2} + \frac{\mu}{\tau^2}\right).
\end{align}

Because each $\theta_s$ depends only on its own school’s data, all $S$ can be sampled in parallel as a blocked Gibbs step.

def gibbs_θs(τ2, μ, σ2s, xs):
    Ns = xs.shape[0]
    v_θ   = 1 / (Ns / σ2s + 1 / τ2)
    θ_hat = v_θ * (xs.sum(0) / σ2s + μ / τ2)
    return Normal(θ_hat, torch.sqrt(v_θ)).sample()

Complete Conditional for $\sigma_s^2$

Combining the prior $\chi^{-2}(\sigma_s^2 \mid \alpha_0, \sigma_0^2)$ with the residuals gives a conjugate update:

\begin{align}
p(\sigma_s^2 \mid \theta_s, \mbX) &= \chi^{-2}(\sigma_s^2 \mid \alpha_N, \sigma_N^2),
\end{align}

where

\begin{align}
\alpha_N &= \alpha_0 + N_s, \qquad
\sigma_N^2 = \frac{1}{\alpha_N}\left(\alpha_0 \sigma_0^2 + \sum_{n=1}^{N_s}(x_{s,n} - \theta_s)^2\right).
\end{align}

All $S$ variances are conditionally independent and can be sampled in parallel.

def gibbs_σ2s(α0, σ2_0, θs, xs):
    Ns  = xs.shape[0]
    αN  = α0 + Ns
    σ2N = (α0 * σ2_0 + torch.sum((xs - θs) ** 2, dim=0)) / αN
    return ScaledInvChiSq(αN, σ2N).sample()

Complete Conditional for $\mu$

The prior $\mathcal{N}(\mu \mid \mu_0, \tau^2/\kappa_0)$ combined with the $S$ terms $\mathcal{N}(\theta_s \mid \mu, \tau^2)$ gives

\begin{align}
p(\mu \mid \{\theta_s\}, \tau^2) &= \mathcal{N}(\mu \mid \hat{\mu}, v_\mu),
\end{align}

where

\begin{align}
v_\mu &= \frac{\tau^2}{\kappa_0 + S}, \qquad
\hat{\mu} = \frac{\kappa_0 \mu_0 + \sum_{s=1}^S \theta_s}{\kappa_0 + S}.
\end{align}

def gibbs_μ(μ0, κ0, τ2, θs):
    S     = len(θs)
    v_μ   = τ2 / (κ0 + S)
    μ_hat = (κ0 * μ0 + θs.sum()) / (κ0 + S)
    return Normal(μ_hat, torch.sqrt(v_μ)).sample()

Complete Conditional for $\tau^2$

The factors $\chi^{-2}(\tau^2 \mid \nu_0, \tau_0^2) \cdot \mathcal{N}(\mu \mid \mu_0, \tau^2/\kappa_0) \cdot \prod_s \mathcal{N}(\theta_s \mid \mu, \tau^2)$ combine conjugately:

\begin{align}
p(\tau^2 \mid \mu, \{\theta_s\}) &= \chi^{-2}(\tau^2 \mid \nu_N, \tau_N^2),
\end{align}

where

\begin{align}
\nu_N &= \nu_0 + S + 1, \qquad
\tau_N^2 = \frac{1}{\nu_N}\left( \nu_0 \tau_0^2 + \kappa_0(\mu - \mu_0)^2 + \sum_{s=1}^S (\theta_s - \mu)^2 \right).
\end{align}

def gibbs_τ2(ν0, τ2_0, μ0, κ0, μ, θs):
    S   = len(θs)
    νN  = ν0 + S + 1
    τ2N = (ν0 * τ2_0 + κ0 * (μ - μ0)**2 + torch.sum((θs - μ)**2)) / νN
    return ScaledInvChiSq(νN, τ2N).sample()

The Gibbs Sampler

Cycle through the four updates, recording the log joint probability at each step.

def gibbs(xs, τ2_0, ν0, μ0, κ0, α0, σ2_0, num_samples=1000):
    Ns, S = xs.shape

    # Initialise from the prior
    τ2  = ScaledInvChiSq(ν0, τ2_0).sample()
    μ   = Normal(μ0, torch.sqrt(τ2 / κ0)).sample()
    θs  = Normal(μ,  torch.sqrt(τ2)).sample((S,))
    σ2s = ScaledInvChiSq(α0, σ2_0).sample((S,))

    records = []
    for _ in range(num_samples):
        τ2  = gibbs_τ2(ν0, τ2_0, μ0, κ0, μ, θs)
        μ   = gibbs_μ(μ0, κ0, τ2, θs)
        θs  = gibbs_θs(τ2, μ, σ2s, xs)
        σ2s = gibbs_σ2s(α0, σ2_0, θs, xs)
        lp  = log_joint(τ2, μ, θs, σ2s, xs,
                        τ2_0, ν0, μ0, κ0, α0, σ2_0)
        records.append((τ2, μ, θs, σ2s, lp))

    keys = ['τ2', 'μ', 'θs', 'σ2s', 'lps']
    return {k: torch.stack(v) for k, v in zip(keys, zip(*records))}

torch.manual_seed(305)
num_samples = 10000
samples = gibbs(xs, τ2_0, ν0, μ0, κ0, α0, σ2_0, num_samples)

MCMC Diagnostics

Before trusting the samples, we check that the chain has converged to the stationary distribution and is mixing efficiently.

Burn-in

The first few iterations reflect the arbitrary initial state. We discard this burn-in period before computing estimates.

fig, axs = plt.subplots(1, 2, figsize=(10, 3))

axs[0].plot(samples['lps'].numpy())
axs[0].set_xlabel('Iteration')
axs[0].set_ylabel('Log joint probability')
axs[0].set_title('All iterations')

burnin = 100
axs[1].plot(torch.arange(burnin, num_samples).numpy(),
            samples['lps'][burnin:].numpy())
axs[1].set_xlabel('Iteration')
axs[1].set_title(f'Iterations {burnin}+')

plt.tight_layout()

Trace Plots and Marginal Posteriors

After discarding burn-in, we visualize traces and marginal posterior histograms.

fig, axs = plt.subplots(2, 2, figsize=(10, 6))

τ_samp = torch.sqrt(samples['τ2'][burnin:])
axs[0,0].plot(τ_samp.numpy()); axs[0,0].set_ylabel(r'$\tau$'); axs[0,0].set_title('Trace')
axs[0,1].hist(τ_samp.numpy(), 50, density=True, alpha=0.6, ec='k')
axs[0,1].set_xlabel(r'$\tau$'); axs[0,1].set_title(r'Posterior $p(\tau \mid \mathbf{X})$')

μ_samp = samples['μ'][burnin:]
axs[1,0].plot(μ_samp.numpy()); axs[1,0].set_ylabel(r'$\mu$')
axs[1,1].hist(μ_samp.numpy(), 50, density=True, alpha=0.6, ec='k')
axs[1,1].set_xlabel(r'$\mu$'); axs[1,1].set_title(r'Posterior $p(\mu \mid \mathbf{X})$')

plt.tight_layout()
bins = torch.linspace(-30, 30, 80).numpy()
fig, axs = plt.subplots(S, 2, figsize=(10, 10), sharex='col')
for s in range(S):
    axs[s, 0].plot(samples['θs'][burnin:, s].numpy())
    axs[s, 0].set_ylabel(rf'$\theta_{{{s+1}}}$', fontsize=9)
    axs[s, 1].hist(samples['θs'][burnin:, s].numpy(), bins, density=True, alpha=0.6, ec='k')
    axs[s, 1].axvline(xs.mean(0)[s].item(), color='r', lw=1.5,
                      label=r'$\bar{x}_s$' if s == 0 else None)
    axs[s, 1].set_ylabel(rf'$p(\theta_{{{s+1}}}\mid\mathbf{{X}})$', fontsize=8)
axs[0, 0].set_title('Traces'); axs[0, 1].set_title('Posteriors')
axs[0, 1].legend()
axs[-1, 0].set_xlabel('Iteration'); axs[-1, 1].set_xlabel(r'$\theta_s$')
plt.tight_layout()
bins = torch.linspace(0, 30, 60).numpy()
σ_samp = torch.sqrt(samples['σ2s'] / Ns.float())  # sigma_bar_s = sigma_s / sqrt(Ns)
fig, axs = plt.subplots(S, 2, figsize=(10, 10), sharex='col')
for s in range(S):
    axs[s, 0].plot(σ_samp[burnin:, s].numpy())
    axs[s, 0].set_ylabel(rf'$\bar{{\sigma}}_{{{s+1}}}$', fontsize=9)
    axs[s, 1].hist(σ_samp[burnin:, s].numpy(), bins, density=True, alpha=0.6, ec='k')
    axs[s, 1].axvline(σ_bars[s].item(), color='r', lw=1.5,
                      label=r'$\bar{\sigma}_s$' if s == 0 else None)
    axs[s, 1].set_ylabel(rf'$p(\bar{{\sigma}}_{{{s+1}}}\mid\mathbf{{X}})$', fontsize=8)
axs[0, 0].set_title('Traces'); axs[0, 1].set_title('Posteriors')
axs[0, 1].legend()
axs[-1, 0].set_xlabel('Iteration'); axs[-1, 1].set_xlabel(r'$\bar{\sigma}_s$')
plt.tight_layout()

Autocorrelation and Effective Sample Size

Even after burn-in, consecutive MCMC samples are correlated. This correlation inflates the variance of Monte Carlo estimates.

The autocorrelation function at lag $\ell$ is

\begin{align}
\mathrm{acf}_f(\ell) &= \frac{\mathrm{Cov}[f(\mbtheta^{(m)}),\, f(\mbtheta^{(m+\ell)})]}{\mathrm{Var}[f(\mbtheta)]}.
\end{align}

The effective sample size (ESS) discounts for autocorrelation:

\begin{align}
M_\text{eff} &= \frac{M}{1 + 2\sum_{\ell=1}^\infty \mathrm{acf}(\ell)}.
\end{align}

$M_\text{eff} \ll M$ signals slow mixing. For intuition: a stationary AR(1) chain with lag-one autocorrelation $\rho$ has $\mathrm{acf}(\ell) = \rho^\ell$, giving $M_\text{eff} = M(1-\rho)/(1+\rho)$, which goes to zero as $\rho \to 1$.
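A quick numerical check of the geometric-series formula on a synthetic AR(1) chain with $\mathrm{acf}(\ell) = \rho^\ell$; the chain itself and the lag-200 truncation of the acf sum are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
rho, M = 0.9, 50_000

# Stationary AR(1): x_m = rho * x_{m-1} + sqrt(1 - rho^2) * eps_m, so acf(l) = rho^l.
eps = torch.randn(M)
x = torch.zeros(M)
for m in range(1, M):
    x[m] = rho * x[m - 1] + (1 - rho**2) ** 0.5 * eps[m]

# Empirical ESS from a truncated acf sum (lags 1..199; rho^200 is negligible).
xc = x - x.mean()
var = (xc * xc).mean()
acf = torch.stack([(xc[:-l] * xc[l:]).mean() / var for l in range(1, 200)])
ess = M / (1 + 2 * acf.sum())

print(round(ess.item()))                  # empirical estimate
print(round(M * (1 - rho) / (1 + rho)))  # theoretical M(1-rho)/(1+rho) ≈ 2632
```

Fifty thousand highly correlated draws carry roughly the information of a couple of thousand independent ones, which is exactly what the ESS printout below reports for the Gibbs chain.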

acf_τ  = autocorrelation(torch.sqrt(samples['τ2'][burnin:]))
acf_μ  = autocorrelation(samples['μ'][burnin:])
acf_θ1 = autocorrelation(samples['θs'][burnin:, 0])
acf_σ1 = autocorrelation(torch.sqrt(samples['σ2s'])[burnin:, 0])

plt.figure(figsize=(7, 4))
for acf, label in [(acf_τ, r'$\tau$'), (acf_μ, r'$\mu$'),
                   (acf_θ1, r'$\theta_1$'), (acf_σ1, r'$\sigma_1$')]:
    plt.plot(acf[:250].numpy(), label=label)
plt.axhline(0, color='k', lw=0.5)
plt.xlabel('Lag'); plt.ylabel('Autocorrelation'); plt.legend()
plt.tight_layout()
print(f'Effective sample sizes (out of {num_samples - burnin} post-burnin samples):')
print(f"  τ²:   {effective_sample_size(samples['τ2'][None, burnin:]).item():.0f}")
print(f"  μ:    {effective_sample_size(samples['μ'][None, burnin:]).item():.0f}")
print(f"  θ₁:   {effective_sample_size(samples['θs'][None, burnin:, 0]).item():.0f}")
print(f"  σ₁²:  {effective_sample_size(samples['σ2s'][None, burnin:, 0]).item():.0f}")
Effective sample sizes (out of 9900 post-burnin samples):
  τ²:   1538
  μ:    734
  θ₁:   1113
  σ₁²:  9069

Conclusion

This chapter introduced MCMC as a general-purpose engine for posterior inference and implemented a Gibbs sampler for the hierarchical Gaussian model. Key points:

  • Posterior expectations are the operative output; Monte Carlo provides a dimension-free approximation.

  • Metropolis–Hastings constructs a valid transition kernel from any proposal, using only the ratio of joint probabilities (no normalizing constant needed).

  • Gibbs sampling is MH with exact-conditional proposals that always accept — highly efficient in conjugate or partially-conjugate models.

  • Blocked Gibbs samples conditionally-independent variables jointly, which is faster and reduces autocorrelation.

  • Diagnostics — log joint traces, ACF plots, and ESS — are essential for verifying convergence and mixing.
