Markov Chain Monte Carlo

Exact posterior inference is possible for conjugate models like the ones we studied in the previous chapters. But most interesting models — including the hierarchical Gaussian model with unknown per-school variances — are non-conjugate, and the posterior cannot be computed in closed form.

Markov chain Monte Carlo (MCMC) is the workhorse of Bayesian computation in these settings. Instead of evaluating the posterior analytically, we construct a Markov chain whose stationary distribution is the posterior and then collect samples. Those samples can be used to approximate any posterior expectation.

In this chapter we:

Motivate the Monte Carlo approach to posterior inference
Introduce Markov chains and their key properties (stationarity, detailed balance, ergodicity)
Derive the Metropolis–Hastings algorithm and its Gibbs sampling special case
Implement a Gibbs sampler for the hierarchical Gaussian model from the previous chapter
Assess convergence and efficiency with MCMC diagnostics

import torch
from torch.distributions import Normal, Gamma, TransformedDistribution
from torch.distributions.transforms import PowerTransform
from pyro.ops.stats import effective_sample_size, autocorrelation
import matplotlib.pyplot as plt

class ScaledInvChiSq(TransformedDistribution):
    '''Scaled inverse chi-squared distribution chi^{-2}(nu, sigma^2).

    Implemented as PowerTransform(-1) of Gamma(nu/2, nu*sigma^2/2),
    matching the parameterization from the previous chapter.
    '''
    def __init__(self, ν, σ2):
        super().__init__(Gamma(ν / 2, ν * σ2 / 2), [PowerTransform(-1)])

/opt/hostedtoolcache/Python/3.13.13/x64/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Posterior Expectations¶

The central object of Bayesian inference is the posterior distribution $p(\mbtheta \mid \mbX)$ . In practice we rarely need the full density; instead we interact with it through expectations:

$\E_{p(\mbtheta \mid \mbX)}[\mbtheta]$ — the posterior mean,
$\E_{p(\mbtheta \mid \mbX)}[\mathbb{1}[\mbtheta \in \mathcal{A}]]$ — the posterior probability that $\mbtheta$ lies in set $\mathcal{A}$ ,
$\E_{p(\mbtheta \mid \mbX)}[p(\mbX' \mid \mbtheta)]$ — the posterior predictive density of new data $\mbX'$ .

All of these have the form $\E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)]$ for some function $f$ .

Quadrature¶

One numerical approach is quadrature: place a grid $\{\mbtheta_m\}_{m=1}^M$ over $\Theta$ and approximate

\begin{align} \E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)] &\approx \sum_{m=1}^M p(\mbtheta_m \mid \mbX) \, f(\mbtheta_m) \, \Delta_m, \end{align}

(1)

where $\Delta_m$ is the volume element around $\mbtheta_m$ . This works for low-dimensional posteriors (say $D \leq 5$ ), but the number of grid points needed grows as $M = O(\epsilon^{-D})$ for a desired accuracy $\epsilon$ , so it quickly becomes infeasible.

Monte Carlo¶

Monte Carlo replaces the grid with samples:

\begin{align} \E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)] &\approx \frac{1}{M} \sum_{m=1}^M f(\mbtheta^{(m)}), \qquad \mbtheta^{(m)} \sim p(\mbtheta \mid \mbX). \end{align}

(2)

Letting $\hat{f} = \frac{1}{M}\sum_m f(\mbtheta^{(m)})$ :

Unbiasedness: $\E[\hat{f}] = \E_{p(\mbtheta \mid \mbX)}[f(\mbtheta)]$ .
Variance: if the samples are uncorrelated, $\Var[\hat{f}] = \frac{1}{M}\Var_{p}[f(\mbtheta)]$ , so the RMSE scales as $O(M^{-1/2})$ regardless of dimension $D$ .

The catch: how do we draw samples from $p(\mbtheta \mid \mbX)$ when we cannot even compute the normalizing constant $p(\mbX)$ ?

Markov Chain Monte Carlo¶

Idea: design a Markov chain whose stationary distribution is the posterior. Run it long enough and the chain’s states become approximate draws from $p(\mbtheta \mid \mbX)$ .

Markov Chains¶

A Markov chain is a sequence of random variables $\mbtheta^{(1)}, \mbtheta^{(2)}, \ldots$ with the Markov property: each state depends on the past only through the immediately preceding state,

\begin{align} \pi(\mbtheta^{(1)}, \ldots, \mbtheta^{(M)}) &= \pi_1(\mbtheta^{(1)}) \prod_{m=2}^M \pi(\mbtheta^{(m)} \mid \mbtheta^{(m-1)}). \end{align}

(3)

$\pi_1$ is the initial distribution and $\pi(\mbtheta' \mid \mbtheta)$ is the (homogeneous) transition kernel.

Stationary Distributions¶

Let $\pi_m(\mbtheta)$ denote the marginal distribution of the $m$ -th state. A distribution $\pi^*(\mbtheta)$ is a stationary distribution if

\begin{align} \pi^*(\mbtheta) &= \int \pi^*(\mbtheta') \, \pi(\mbtheta \mid \mbtheta') \, d\mbtheta'. \end{align}

(4)

Starting from $\pi^*$ and applying the transition keeps the distribution unchanged.

Detailed Balance¶

A sufficient condition for $\pi^*$ to be stationary is detailed balance:

\begin{align} \pi^*(\mbtheta') \, \pi(\mbtheta \mid \mbtheta') &= \pi^*(\mbtheta) \, \pi(\mbtheta' \mid \mbtheta) \quad \text{for all } \mbtheta, \mbtheta'. \end{align}

(5)

Integrating both sides over $\mbtheta'$ recovers the stationarity equation, so detailed balance implies stationarity.

Ergodicity¶

Detailed balance shows $\pi^*$ is a stationary distribution, but not necessarily the unique one. A chain is ergodic if $\pi_m(\mbtheta) \to \pi^*(\mbtheta)$ regardless of the starting point $\pi_1$ . For our purposes, ergodicity holds whenever the transition kernel assigns positive probability to any region of $\Theta$ from any starting point (i.e., the chain is irreducible and aperiodic).

The Metropolis–Hastings Algorithm¶

Our goal is to build an ergodic Markov chain with stationary distribution $\pi^*(\mbtheta) = p(\mbtheta \mid \mbX)$ .

Key observation: even though we cannot compute $p(\mbtheta \mid \mbX)$ (because $p(\mbX)$ is intractable), we can compute ratios of posterior densities, since the normalizing constant cancels:

\begin{align} \frac{p(\mbtheta \mid \mbX)}{p(\mbtheta' \mid \mbX)} = \frac{p(\mbtheta, \mbX)}{p(\mbtheta', \mbX)}. \end{align}

(6)

The Accept–Reject Step¶

Metropolis–Hastings constructs the transition kernel from two steps:

Propose a new state $\mbtheta$ by sampling from a proposal distribution $q(\mbtheta \mid \mbtheta')$ .
Accept the proposal with probability $a(\mbtheta' \to \mbtheta)$ ; otherwise stay at $\mbtheta'$ .

The resulting transition kernel is

\begin{align} \pi(\mbtheta \mid \mbtheta') &= \begin{cases} q(\mbtheta \mid \mbtheta') \, a(\mbtheta' \to \mbtheta) & \mbtheta \neq \mbtheta' \\ 1 - \int q(\mbtheta'' \mid \mbtheta') \, a(\mbtheta' \to \mbtheta'') \, d\mbtheta'' & \mbtheta = \mbtheta'. \end{cases} \end{align}

(7)

Plugging into the detailed balance condition and solving for the acceptance probability gives the Metropolis–Hastings acceptance probability:

\begin{align} a(\mbtheta' \to \mbtheta) &= \min \left\{ 1, \; \frac{p(\mbtheta, \mbX) \; q(\mbtheta' \mid \mbtheta)} {p(\mbtheta', \mbX) \; q(\mbtheta \mid \mbtheta')} \right\}. \end{align}

(8)

When the ratio exceeds 1 the proposal is always accepted; otherwise it is accepted with probability equal to the ratio. Any proposal $q$ that covers $\Theta$ yields an ergodic chain.

The Metropolis Algorithm¶

When the proposal is symmetric ( $q(\mbtheta \mid \mbtheta') = q(\mbtheta' \mid \mbtheta)$ ), the proposal densities cancel and

\begin{align} a(\mbtheta' \to \mbtheta) &= \min \left\{ 1, \; \frac{p(\mbtheta, \mbX)}{p(\mbtheta', \mbX)} \right\}. \end{align}

(9)

This Metropolis algorithm always accepts moves that increase the joint probability and accepts downhill moves with some probability.

Gibbs Sampling¶

Gibbs sampling is a special case of Metropolis–Hastings that updates one coordinate $\theta_d$ at a time by sampling from its complete conditional $p(\theta_d \mid \mbtheta_{\neg d}, \mbX)$ .

The proposal is

\begin{align} q_d(\mbtheta \mid \mbtheta') &= p(\theta_d \mid \mbtheta'_{\neg d}, \mbX) \; \delta_{\mbtheta'_{\neg d}}(\mbtheta_{\neg d}), \end{align}

(10)

i.e., draw a new $\theta_d$ from its conditional and keep all other coordinates fixed. Substituting into the MH acceptance probability shows that this proposal is always accepted:

\begin{align} a_d(\mbtheta' \to \mbtheta) = 1. \end{align}

(11)

Cycling through all coordinates produces an ergodic chain as long as each complete conditional covers the full support.

def gibbs(theta, X, num_samples):
    samples = []
    for _ in range(num_samples):
        for d in range(len(theta)):
            theta[d] = sample_conditional(d, theta, X)
        samples.append(theta.copy())
    return samples

The key requirement is that we can sample each complete conditional. In conjugate models (like the hierarchical Gaussian below), closed-form conditionals make Gibbs extremely efficient. When some conditionals are intractable, we can mix Gibbs and MH updates in the same chain — the Metropolis-Hastings within Gibbs strategy.

Worked Example: Hierarchical Gaussian Model¶

We return to the “8 Schools” example from the previous chapter, but now place a prior on the per-school variances $\sigma_s^2$ as well. This makes the model non-conjugate and exact inference intractable.

\begin{align} \tau^2 &\sim \chi^{-2}(\nu_0, \tau_0^2) \\ \mu &\sim \mathcal{N}(\mu_0, \, \tau^2 / \kappa_0) \\ \theta_s &\sim \mathcal{N}(\mu, \, \tau^2), \quad s = 1, \ldots, S \\ \sigma_s^2 &\sim \chi^{-2}(\alpha_0, \, \sigma_0^2), \quad s = 1, \ldots, S \\ x_{s,n} &\sim \mathcal{N}(\theta_s, \, \sigma_s^2), \quad n = 1, \ldots, N_s \end{align}

(12)

We derive the complete conditionals for each variable and implement a Gibbs sampler.

# Hyperparameters
S = 8                    # number of schools
Ns = torch.tensor(20)    # students per school
μ0, κ0 = 0.0, 0.1  # prior on μ
ν0, τ2_0 = 0.1, 100.0 # prior on τ²
α0, σ2_0 = 0.1, 10.0  # prior on σ_s²

# Synthetic data: match sample means/stds to the 8 Schools values
torch.manual_seed(305)
x_bars  = torch.tensor([28.,  8., -3.,  7., -1.,  1., 18., 12.])
σ_bars  = torch.tensor([15., 10., 16., 11.,  9., 11., 10., 18.])
xs_raw  = Normal(x_bars, torch.sqrt(Ns.float()) * σ_bars).sample((Ns,))
zs      = (xs_raw - xs_raw.mean(0)) / xs_raw.std(0)
xs      = x_bars + torch.sqrt(Ns.float()) * σ_bars * zs  # shape (Ns, S)
assert torch.allclose(xs.mean(0), x_bars)
assert torch.allclose(xs.std(0),  torch.sqrt(Ns.float()) * σ_bars)

Log Joint Probability¶

We track $\log p(\mbtheta, \mbX)$ during sampling as a convergence diagnostic.

def log_joint(τ2, μ, θs, σ2s, xs, τ2_0, ν0, μ0, κ0, α0, σ2_0):
    lp  = ScaledInvChiSq(ν0, τ2_0).log_prob(τ2)
    lp += Normal(μ0, torch.sqrt(τ2 / κ0)).log_prob(μ)
    lp += Normal(μ,  torch.sqrt(τ2)).log_prob(θs).sum()
    lp += ScaledInvChiSq(α0, σ2_0).log_prob(σ2s).sum()
    lp += Normal(θs, torch.sqrt(σ2s)).log_prob(xs).sum()
    return lp

Complete Conditional for $\theta_s$ ¶

Combining the prior $\mathcal{N}(\theta_s \mid \mu, \tau^2)$ with the $N_s$ likelihoods $\mathcal{N}(x_{s,n} \mid \theta_s, \sigma_s^2)$ gives

\begin{align} p(\theta_s \mid \mu, \tau^2, \sigma_s^2, \mbX) &= \mathcal{N}(\theta_s \mid \hat{\theta}_s, v_s), \end{align}

(13)

where

\begin{align} v_s &= \left(\frac{N_s}{\sigma_s^2} + \frac{1}{\tau^2}\right)^{-1}, \qquad \hat{\theta}_s = v_s \left(\frac{\sum_n x_{s,n}}{\sigma_s^2} + \frac{\mu}{\tau^2}\right). \end{align}

(14)

Because each $\theta_s$ depends only on its own school’s data, all $S$ can be sampled in parallel as a blocked Gibbs step.

def gibbs_θs(τ2, μ, σ2s, xs):
    Ns = xs.shape[0]
    v_θ   = 1 / (Ns / σ2s + 1 / τ2)
    θ_hat = v_θ * (xs.sum(0) / σ2s + μ / τ2)
    return Normal(θ_hat, torch.sqrt(v_θ)).sample()

Complete Conditional for $\sigma_s^2$ ¶

Combining $\chi^{-2}(\sigma_s^2 \mid \alpha_0, \sigma_0^2)$ with the residuals gives a conjugate update:

\begin{align} p(\sigma_s^2 \mid \theta_s, \mbX) &= \chi^{-2}(\sigma_s^2 \mid \alpha_N, \sigma_N^2), \end{align}

(15)

where

\begin{align} \alpha_N &= \alpha_0 + N_s, \qquad \sigma_N^2 = \frac{1}{\alpha_N} \!\left(\alpha_0 \sigma_0^2 + \sum_{n=1}^{N_s}(x_{s,n} - \theta_s)^2\right). \end{align}

(16)

All $S$ variances are conditionally independent and sampled in parallel.

def gibbs_σ2s(α0, σ2_0, θs, xs):
    Ns  = xs.shape[0]
    αN  = α0 + Ns
    σ2N = (α0 * σ2_0 + torch.sum((xs - θs) ** 2, dim=0)) / αN
    return ScaledInvChiSq(αN, σ2N).sample()

Complete Conditional for $\mu$ ¶

The prior $\mathcal{N}(\mu \mid \mu_0, \tau^2/\kappa_0)$ with $S$ terms $\mathcal{N}(\theta_s \mid \mu, \tau^2)$ gives

\begin{align} p(\mu \mid \{\theta_s\}, \tau^2) &= \mathcal{N}(\mu \mid \hat{\mu}, v_\mu), \end{align}

(17)

where

\begin{align} v_\mu = \frac{\tau^2}{\kappa_0 + S}, \qquad \hat{\mu} = \frac{\kappa_0 \mu_0 + \sum_{s=1}^S \theta_s}{\kappa_0 + S}. \end{align}

(18)

def gibbs_μ(μ0, κ0, τ2, θs):
    S     = len(θs)
    v_μ   = τ2 / (κ0 + S)
    μ_hat = (κ0 * μ0 + θs.sum()) / (κ0 + S)
    return Normal(μ_hat, torch.sqrt(v_μ)).sample()

Complete Conditional for $\tau^2$ ¶

The joint factor $\chi^{-2}(\tau^2 \mid \nu_0, \tau_0^2) \cdot \mathcal{N}(\mu \mid \mu_0, \tau^2/\kappa_0) \cdot \prod_s \mathcal{N}(\theta_s \mid \mu, \tau^2)$ is conjugate:

\begin{align} p(\tau^2 \mid \mu, \{\theta_s\}) &= \chi^{-2}(\tau^2 \mid \nu_N, \tau_N^2), \end{align}

(19)

where

\begin{align} \nu_N &= \nu_0 + S + 1, \qquad \tau_N^2 = \frac{1}{\nu_N}\!\left( \nu_0 \tau_0^2 + \kappa_0(\mu - \mu_0)^2 + \sum_{s=1}^S (\theta_s - \mu)^2 \right). \end{align}

(20)

def gibbs_τ2(ν0, τ2_0, μ0, κ0, μ, θs):
    S   = len(θs)
    νN  = ν0 + S + 1
    τ2N = (ν0 * τ2_0 + κ0 * (μ - μ0)**2 + torch.sum((θs - μ)**2)) / νN
    return ScaledInvChiSq(νN, τ2N).sample()

The Gibbs Sampler¶

Cycle through the four updates, recording the log joint probability at each step.

def gibbs(xs, τ2_0, ν0, μ0, κ0, α0, σ2_0, num_samples=1000):
    Ns, S = xs.shape

    # Initialise from the prior
    τ2  = ScaledInvChiSq(ν0, τ2_0).sample()
    μ   = Normal(μ0, torch.sqrt(τ2 / κ0)).sample()
    θs  = Normal(μ,  torch.sqrt(τ2)).sample((S,))
    σ2s = ScaledInvChiSq(α0, σ2_0).sample((S,))

    records = []
    for _ in range(num_samples):
        τ2  = gibbs_τ2(ν0, τ2_0, μ0, κ0, μ, θs)
        μ   = gibbs_μ(μ0, κ0, τ2, θs)
        θs  = gibbs_θs(τ2, μ, σ2s, xs)
        σ2s = gibbs_σ2s(α0, σ2_0, θs, xs)
        lp  = log_joint(τ2, μ, θs, σ2s, xs,
                       τ2_0, ν0, μ0, κ0, α0, σ2_0)
        records.append((τ2, μ, θs, σ2s, lp))

    keys = ['τ2', 'μ', 'θs', 'σ2s', 'lps']
    return {k: torch.stack(v) for k, v in zip(keys, zip(*records))}

torch.manual_seed(305)
num_samples = 10000
samples = gibbs(xs, τ2_0, ν0, μ0, κ0, α0, σ2_0, num_samples)

MCMC Diagnostics¶

Before trusting the samples, we check that the chain has converged to the stationary distribution and is mixing efficiently.

Burn-in¶

The first few iterations reflect the arbitrary initial state. We discard this burn-in period before computing estimates.

fig, axs = plt.subplots(1, 2, figsize=(10, 3))

axs[0].plot(samples['lps'].numpy())
axs[0].set_xlabel('Iteration')
axs[0].set_ylabel('Log joint probability')
axs[0].set_title('All iterations')

burnin = 100
axs[1].plot(torch.arange(burnin, num_samples).numpy(),
            samples['lps'][burnin:].numpy())
axs[1].set_xlabel('Iteration')
axs[1].set_title(f'Iterations {burnin}+')

plt.tight_layout()

Trace Plots and Marginal Posteriors¶

After discarding burn-in, we visualize traces and marginal posterior histograms.

fig, axs = plt.subplots(2, 2, figsize=(10, 6))

τ_samp = torch.sqrt(samples['τ2'][burnin:])
axs[0,0].plot(τ_samp.numpy()); axs[0,0].set_ylabel(r'$\tau$'); axs[0,0].set_title('Trace')
axs[0,1].hist(τ_samp.numpy(), 50, density=True, alpha=0.6, ec='k')
axs[0,1].set_xlabel(r'$\tau$'); axs[0,1].set_title(r'Posterior $p(\tau \mid \mathbf{X})$')

μ_samp = samples['μ'][burnin:]
axs[1,0].plot(μ_samp.numpy()); axs[1,0].set_ylabel(r'$\mu$')
axs[1,1].hist(μ_samp.numpy(), 50, density=True, alpha=0.6, ec='k')
axs[1,1].set_xlabel(r'$\mu$'); axs[1,1].set_title(r'Posterior $p(\mu \mid \mathbf{X})$')

plt.tight_layout()

bins = torch.linspace(-30, 30, 80).numpy()
fig, axs = plt.subplots(S, 2, figsize=(10, 10), sharex='col')
for s in range(S):
    axs[s, 0].plot(samples['θs'][burnin:, s].numpy())
    axs[s, 0].set_ylabel(rf'$\theta_{{{s+1}}}$', fontsize=9)
    axs[s, 1].hist(samples['θs'][:, s].numpy(), bins, density=True, alpha=0.6, ec='k')
    axs[s, 1].axvline(xs.mean(0)[s].item(), color='r', lw=1.5,
                      label=r'$\bar{x}_s$' if s == 0 else None)
    axs[s, 1].set_ylabel(rf'$p(\theta_{{{s+1}}}\mid\mathbf{{X}})$', fontsize=8)
axs[0, 0].set_title('Traces'); axs[0, 1].set_title('Posteriors')
axs[0, 1].legend()
axs[-1, 0].set_xlabel('Iteration'); axs[-1, 1].set_xlabel(r'$\theta_s$')
plt.tight_layout()

bins = torch.linspace(0, 30, 60).numpy()
σ_samp = torch.sqrt(samples['σ2s'] / Ns.float())  # sigma_bar_s = sigma_s / sqrt(Ns)
fig, axs = plt.subplots(S, 2, figsize=(10, 10), sharex='col')
for s in range(S):
    axs[s, 0].plot(σ_samp[burnin:, s].numpy())
    axs[s, 0].set_ylabel(rf'$\bar{{\sigma}}_{{{s+1}}}$', fontsize=9)
    axs[s, 1].hist(σ_samp[:, s].numpy(), bins, density=True, alpha=0.6, ec='k')
    axs[s, 1].axvline(σ_bars[s].item(), color='r', lw=1.5,
                      label=r'$\bar{\sigma}_s$' if s == 0 else None)
    axs[s, 1].set_ylabel(rf'$p(\bar{{\sigma}}_{{{s+1}}}\mid\mathbf{{X}})$', fontsize=8)
axs[0, 0].set_title('Traces'); axs[0, 1].set_title('Posteriors')
axs[0, 1].legend()
axs[-1, 0].set_xlabel('Iteration'); axs[-1, 1].set_xlabel(r'$\bar{\sigma}_s$')
plt.tight_layout()

Autocorrelation and Effective Sample Size¶

Even after burn-in, consecutive MCMC samples are correlated. This inflates the variance of Monte Carlo estimates.

The autocorrelation function at lag $\ell$ is

\begin{align} \mathrm{acf}_f(\ell) &= \frac{\mathrm{Cov}[f(\mbtheta^{(m)}),\, f(\mbtheta^{(m+\ell)})]}{\mathrm{Var}[f(\mbtheta)]}. \end{align}

(21)

The effective sample size (ESS) accounts for autocorrelation:

\begin{align} M_\text{eff} &= \frac{M}{1 + 2\sum_{\ell=1}^\infty \mathrm{acf}(\ell)}. \end{align}

(22)

$M_\text{eff} \ll M$ signals slow mixing. For intuition: a random walk with autocorrelation $\rho$ has $M_\text{eff} = M(1-\rho)/(1+\rho)$ , which goes to zero as $\rho \to 1$ .

acf_τ  = autocorrelation(torch.sqrt(samples['τ2'][burnin:]))
acf_μ  = autocorrelation(samples['μ'][burnin:])
acf_θ1 = autocorrelation(samples['θs'][burnin:, 0])
acf_σ1 = autocorrelation(torch.sqrt(samples['σ2s'])[burnin:, 0])

plt.figure(figsize=(7, 4))
for acf, label in [(acf_τ, r'$\tau$'), (acf_μ, r'$\mu$'),
                   (acf_θ1, r'$\theta_1$'), (acf_σ1, r'$\sigma_1$')]:
    plt.plot(acf[:250].numpy(), label=label)
plt.axhline(0, color='k', lw=0.5)
plt.xlabel('Lag'); plt.ylabel('Autocorrelation'); plt.legend()
plt.tight_layout()

print(f'Effective sample sizes (out of {num_samples - burnin} post-burnin samples):')
print(f"  τ²:   {effective_sample_size(samples['τ2'][None, burnin:]).item():.0f}")
print(f"  μ:    {effective_sample_size(samples['μ'][None, burnin:]).item():.0f}")
print(f"  θ₁:   {effective_sample_size(samples['θs'][None, burnin:, 0]).item():.0f}")
print(f"  σ₁²:  {effective_sample_size(samples['σ2s'][None, burnin:, 0]).item():.0f}")

Effective sample sizes (out of 9900 post-burnin samples):
  τ²:   1538
  μ:    734

  θ₁:   1113
  σ₁²:  9069

Conclusion¶

This chapter introduced MCMC as a general-purpose engine for posterior inference and implemented a Gibbs sampler for the hierarchical Gaussian model. Key points:

Posterior expectations are the operative output; Monte Carlo provides a dimension-free approximation.
Metropolis–Hastings constructs a valid transition kernel from any proposal, using only the ratio of joint probabilities (no normalizing constant needed).
Gibbs sampling is MH with exact-conditional proposals that always accept — highly efficient in conjugate or partially-conjugate models.
Blocked Gibbs samples conditionally-independent variables jointly, which is faster and reduces autocorrelation.
Diagnostics — log joint traces, ACF plots, and ESS — are essential for verifying convergence and mixing.

Markov Chain Monte Carlo

Posterior Expectations¶

Quadrature¶

Monte Carlo¶

Markov Chain Monte Carlo¶

Markov Chains¶

Stationary Distributions¶

Detailed Balance¶

Ergodicity¶

The Metropolis–Hastings Algorithm¶

The Accept–Reject Step¶

The Metropolis Algorithm¶

Gibbs Sampling¶

Worked Example: Hierarchical Gaussian Model¶

Log Joint Probability¶

Complete Conditional for θs\theta_sθs​¶

Complete Conditional for σs2\sigma_s^2σs2​¶

Complete Conditional for μ\muμ¶

Complete Conditional for τ2\tau^2τ2¶

The Gibbs Sampler¶

MCMC Diagnostics¶

Burn-in¶

Trace Plots and Marginal Posteriors¶

Autocorrelation and Effective Sample Size¶

Conclusion¶

Complete Conditional for $\theta_s$ ¶

Complete Conditional for $\sigma_s^2$ ¶

Complete Conditional for $\mu$ ¶

Complete Conditional for $\tau^2$ ¶