
Hierarchical Gaussian Models

This lecture works through exact Bayesian inference in a hierarchical Gaussian model — a multi-level model that pools information across groups. The running example is the Eight Schools dataset from Gelman et al., 2013, used as a benchmark by many probabilistic programming systems.

The model places a shared prior over school-level means, allowing estimates to borrow strength from one another while still respecting per-school data. Because the model is conditionally linear and Gaussian, the posterior can be computed by sequential marginalization: every factor is available in closed form except a one-dimensional marginal over the population variance, which we normalize numerically on a grid. This makes it one of the more complex models for which essentially exact inference is still achievable.

In this lecture we cover:

  • The hierarchical Gaussian model and its graphical representation

  • Sequential marginalization: $p(\mbtheta \mid \cdot)$, $p(\mu \mid \tau^2, \cdot)$, $p(\tau^2 \mid \cdot)$

  • Numerical normalization of the one-dimensional marginal $p(\tau^2 \mid \mbX)$

  • Posterior sampling via ancestral sampling

  • Comparison to classical ANOVA / partial pooling

import torch
from torch.distributions import Normal, Gamma, Categorical, TransformedDistribution
from torch.distributions.transforms import PowerTransform
import matplotlib.pyplot as plt
from matplotlib.cm import Blues


class ScaledInvChiSq(TransformedDistribution):
    """Scaled inverse chi-squared: χ⁻²(ν, σ²).

    Equivalent to IGa(ν/2, νσ²/2). Implemented as a transformation of Gamma.
    """
    def __init__(self, ν, σ2):
        base = Gamma(ν / 2, ν * σ2 / 2)
        TransformedDistribution.__init__(self, base, [PowerTransform(-1)])
        self.ν  = ν
        self.σ2 = σ2
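As a quick sanity check, the transformed distribution's `log_prob` should agree with the inverse-gamma density $\mathrm{IGa}(\nu/2, \nu\sigma^2/2)$ written out by hand. The test values below ($\nu = 5$, $\sigma^2 = 2$) are arbitrary choices for this sketch, not part of the lecture.

# Sanity check (illustrative, arbitrary test values): the χ⁻²(ν, σ²) log-density
# should match the IGa(ν/2, νσ²/2) log-density computed by hand.
ν_test, σ2_test = torch.tensor(5.0), torch.tensor(2.0)
d = ScaledInvChiSq(ν_test, σ2_test)
x = torch.tensor([0.5, 1.0, 2.0, 4.0])
α, β = ν_test / 2, ν_test * σ2_test / 2
manual = α * torch.log(β) - torch.lgamma(α) - (α + 1) * torch.log(x) - β / x
print(torch.allclose(d.log_prob(x), manual, atol=1e-5))   # True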

Motivation

Data are often organized into groups — students within schools, patients within hospitals, experiments within labs. The individual observations within a group are not exchangeable with those in other groups (because group membership matters), but the groups themselves may be exchangeable.

Hierarchical models handle this by introducing group-level parameters $\theta_s$ (one per group $s$), drawn i.i.d. from a shared population distribution. Group-level parameters explain within-group variation, while the population distribution allows information to be shared across groups.

The key insight is that the scores within each school are not exchangeable with those across schools, but the schools themselves are exchangeable. This motivates the following hierarchical model:

\begin{align}
\mu, \tau^2 &\sim p(\mu, \tau^2), \\
\theta_s \mid \mu, \tau^2 &\iid{\sim} \cN(\mu, \tau^2), \quad s = 1, \ldots, S, \\
x_{s,n} \mid \theta_s &\iid{\sim} \cN(\theta_s, \sigma_s^2), \quad n = 1, \ldots, N_s.
\end{align}

Each school has its own mean $\theta_s$, drawn from a global distribution with mean $\mu$ and variance $\tau^2$. The global mean $\mu$ captures the overall population effect; the global variance $\tau^2$ controls how much schools differ from each other.

For the prior on $(\mu, \tau^2)$, we use a normal-inverse-chi-squared distribution (as in Lecture 2 but for the scalar case):

\begin{align}
p(\mu, \tau^2) &= \mathrm{NIX}(\mu, \tau^2 \mid \mu_0, \kappa_0, \nu_0, \tau_0^2)
= \cN(\mu \mid \mu_0, \tau^2 / \kappa_0) \, \chi^{-2}(\tau^2 \mid \nu_0, \tau_0^2).
\end{align}

The hyperparameters are $\boldsymbol{\eta} = (\mu_0, \kappa_0, \nu_0, \tau_0^2, \{\sigma_s^2\}_{s=1}^S)$.

The Eight Schools Dataset

We illustrate with a classic dataset from Gelman et al., 2013, Ch 5.5, also used as a benchmark by probabilistic programming systems such as Stan and NumPyro. Eight schools participated in a study of an SAT coaching program. For each school $s$, we are given the estimated treatment effect $\bar{x}_s$ (the difference in mean SAT scores between coached and uncoached students) and its standard error $\bar{\sigma}_s$.

| School | $\bar{x}_s$ | $\bar{\sigma}_s$ |
|:------:|------------:|-----------------:|
| A      | 28          | 15               |
| B      | 8           | 10               |
| C      | −3          | 16               |
| D      | 7           | 11               |
| E      | −1          | 9                |
| F      | 1           | 11               |
| G      | 18          | 10               |
| H      | 12          | 18               |

The standard errors $\bar{\sigma}_s$ are treated as known (they are themselves estimates from each school’s data, but we condition on them for simplicity). We set weakly informative hyperparameters: $\mu_0 = 0$, $\kappa_0 = 0.1$, $\nu_0 = 0.1$, and $\tau_0^2 = 100$.

torch.manual_seed(305)

S = 8                        # number of schools

# Hyperparameters
μ0   = torch.tensor(0.0)     # prior mean of global effect
κ0   = torch.tensor(0.1)     # prior concentration (NIX)
ν0   = torch.tensor(0.1)     # degrees of freedom (NIX)
τ2_0 = torch.tensor(100.0)   # prior scale of global variance

# Observed school-level sample means and standard errors
x_bars  = torch.tensor([28., 8., -3., 7., -1., 1., 18., 12.])
σ_bars  = torch.tensor([15., 10., 16., 11., 9., 11., 10., 18.])
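Before doing inference, it can help to see the generative process in action. The following sketch (purely illustrative) ancestrally samples one synthetic dataset from the model, using the hyperparameters above and treating the observed standard errors as the per-school noise scales.

# Generative sketch (illustrative): forward-sample one synthetic dataset from the model.
τ2_sim = ScaledInvChiSq(ν0, τ2_0).sample()                 # global variance
μ_sim  = Normal(μ0, torch.sqrt(τ2_sim / κ0)).sample()      # global mean
θ_sim  = Normal(μ_sim, torch.sqrt(τ2_sim)).sample((S,))    # per-school effects
x_sim  = Normal(θ_sim, σ_bars).sample()                    # synthetic school means
print(f"simulated μ = {μ_sim:.2f}, τ² = {τ2_sim:.2f}")
print("simulated school means:", [f"{x:.1f}" for x in x_sim.tolist()])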

Bayesian Inference in the Hierarchical Gaussian Model

Our goal is to compute the joint posterior,

\begin{align}
p(\mu, \tau^2, \mbtheta \mid \mbX, \boldsymbol{\eta}),
\end{align}

where $\mbtheta = (\theta_1, \ldots, \theta_S)$. We proceed by sequential marginalization, decomposing the posterior using the product rule:

\begin{align}
p(\mu, \tau^2, \mbtheta \mid \mbX, \boldsymbol{\eta})
&= p(\mbtheta \mid \mu, \tau^2, \mbX, \boldsymbol{\eta}) \; p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta}) \; p(\tau^2 \mid \mbX, \boldsymbol{\eta}).
\end{align}

We will compute each factor in turn.

Step 1: Sufficient Statistics

As a function of $\theta_s$, the likelihood for school $s$ depends on the data only through the school mean $\bar{x}_s = \frac{1}{N_s}\sum_n x_{s,n}$:

\begin{align}
\prod_{n=1}^{N_s} \cN(x_{s,n} \mid \theta_s, \sigma_s^2) &\propto \cN(\bar{x}_s \mid \theta_s, \bar{\sigma}_s^2),
\end{align}

where $\bar{\sigma}_s^2 = \sigma_s^2 / N_s$ is the variance of the school mean. The school mean is thus a sufficient statistic for $\theta_s$. In the Eight Schools dataset, $\bar{x}_s$ and $\bar{\sigma}_s$ are given directly.
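A quick numerical illustration (with made-up values, not the Eight Schools data): the full log-likelihood and the sufficient-statistic version differ only by a constant that does not depend on $\theta$.

# Sufficient statistic check (illustrative, made-up data): Σ_n log N(x_n | θ, σ²)
# equals log N(x̄ | θ, σ²/N) up to a θ-independent constant.
σ2_demo, N_demo = torch.tensor(4.0), 10
xs_demo = Normal(0.0, σ2_demo.sqrt()).sample((N_demo,))
θ_test  = torch.linspace(-3, 3, 5)
log_full = Normal(θ_test[:, None], σ2_demo.sqrt()).log_prob(xs_demo[None, :]).sum(-1)
log_suff = Normal(θ_test, (σ2_demo / N_demo).sqrt()).log_prob(xs_demo.mean())
print(log_full - log_suff)   # the same constant for every θ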

Step 2: Per-School Posteriors (given $\mu$ and $\tau^2$)

The per-school parameters $\theta_s$ are conditionally independent given $(\mu, \tau^2)$, so the posterior factors:

\begin{align}
p(\mbtheta \mid \mu, \tau^2, \mbX, \boldsymbol{\eta}) &= \prod_{s=1}^S p(\theta_s \mid \mu, \tau^2, \bar{x}_s).
\end{align}

Each factor is a product of a Gaussian prior and a Gaussian likelihood, so by the results of Lecture 1,

\begin{align}
p(\theta_s \mid \mu, \tau^2, \bar{x}_s) &= \cN(\theta_s \mid \hat{\theta}_s, v_s),
\end{align}

where

\begin{align}
v_s &= \left(\frac{1}{\bar{\sigma}_s^2} + \frac{1}{\tau^2}\right)^{-1},
\qquad
\hat{\theta}_s = v_s \left(\frac{\bar{x}_s}{\bar{\sigma}_s^2} + \frac{\mu}{\tau^2}\right).
\end{align}

The conditional posterior mean $\hat{\theta}_s$ is a precision-weighted average of the school observation $\bar{x}_s$ and the global mean $\mu$. When $\tau^2$ is large (schools vary a lot), the school mean dominates; when $\tau^2$ is small, the global mean dominates — this is the shrinkage or partial pooling effect.
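To see this numerically, the short sketch below evaluates the conditional means for a small and a large $\tau^2$, with $\mu$ fixed at 0 purely for illustration.

# Shrinkage sketch (μ fixed at 0 for illustration): conditional posterior means
# of θ_s under a small and a large global variance τ².
for τ2_fix in (1.0, 100.0):
    v_s    = 1 / (1 / σ_bars**2 + 1 / τ2_fix)
    θ_cond = v_s * (x_bars / σ_bars**2 + 0.0 / τ2_fix)
    print(f"τ² = {τ2_fix:5.0f}:", [round(t, 1) for t in θ_cond.tolist()])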

Step 3: Posterior of the Global Mean (given $\tau^2$)

To find $p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta})$, we integrate over $\mbtheta$. Because each $\theta_s$ enters the model linearly, the integral is tractable via the linear Gaussian model results from Lecture 2:

\begin{align}
p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta})
&\propto \cN(\mu \mid \mu_0, \tau^2/\kappa_0) \prod_{s=1}^S \int \cN(\theta_s \mid \mu, \tau^2)\, \cN(\bar{x}_s \mid \theta_s, \bar{\sigma}_s^2)\, \dif\theta_s \\
&= \cN(\mu \mid \mu_0, \tau^2/\kappa_0) \prod_{s=1}^S \cN(\bar{x}_s \mid \mu, \bar{\sigma}_s^2 + \tau^2).
\end{align}

Each integral collapses $\theta_s$ by the linear Gaussian marginalization formula: $\bar{x}_s \mid \mu \sim \cN(\mu, \bar{\sigma}_s^2 + \tau^2)$.
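This collapse is easy to verify by simulation; the sketch below uses arbitrary test values ($\mu = 2$, $\tau^2 = 9$, $\bar{\sigma}^2 = 4$) that are not part of the Eight Schools analysis.

# Monte Carlo check (illustrative, arbitrary values): sampling θ ~ N(μ, τ²) and then
# x̄ | θ ~ N(θ, σ̄²) yields x̄ with mean μ and variance σ̄² + τ².
μ_t, τ2_t, σ2_t = 2.0, 9.0, 4.0
θ_draws = Normal(μ_t, τ2_t ** 0.5).sample((200_000,))
x_draws = Normal(θ_draws, σ2_t ** 0.5).sample()
print(x_draws.mean(), x_draws.var())   # ≈ 2.0 and ≈ 13.0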

Collecting terms quadratic and linear in $\mu$ and completing the square gives,

\begin{align}
p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta}) &= \cN(\mu \mid \hat{\mu}, v_\mu),
\end{align}

where

\begin{align}
\lambda_0 &= \kappa_0 / \tau^2, \quad \lambda_s = 1 / (\bar{\sigma}_s^2 + \tau^2), \\
v_\mu &= \frac{1}{\lambda_0 + \sum_s \lambda_s}, \quad \hat{\mu} = v_\mu \!\left(\lambda_0 \mu_0 + \sum_s \lambda_s \bar{x}_s\right).
\end{align}

The posterior mean $\hat{\mu}$ is a precision-weighted average of the prior mean $\mu_0$ and the school means $\bar{x}_s$. Schools with smaller total variance $\bar{\sigma}_s^2 + \tau^2$ (i.e., those whose means are measured more precisely) receive more weight.

def compute_posterior_mu(τ2, μ0, κ0, x_bars, σ_bars):
    """Posterior mean and variance of μ given τ² and data.

    Returns (μ_hat, v_μ), each of shape (T,) for a length-T tensor of τ² values.
    """
    λ0    = κ0 / τ2                                         # (T,)
    λs    = 1 / (σ_bars[None, :]**2 + τ2[:, None])         # (T, S)
    v_μ   = 1 / (λ0 + λs.sum(-1))                          # (T,)
    μ_hat = v_μ * (λ0 * μ0 + (λs * x_bars[None, :]).sum(-1))
    return μ_hat, v_μ
τ2s = torch.tensor([1., 4., 9., 16., 25.])
μ_hat, v_μ = compute_posterior_mu(τ2s, μ0, κ0, x_bars, σ_bars)

μ_grid = torch.linspace(-10, 30, 200)
fig, ax = plt.subplots()
for τ2, mean, var in zip(τ2s, μ_hat, v_μ):
    ax.plot(μ_grid,
            torch.exp(Normal(mean, torch.sqrt(var)).log_prob(μ_grid)),
            color=Blues((τ2 / τ2s.max()).item()),
            label=r"$\tau^2 = {:.0f}$".format(τ2))
ax.set_xlabel(r"$\mu$")
ax.set_ylabel(r"$p(\mu \mid \tau^2, \mathbf{X}, {\eta})$")
ax.legend()
plt.tight_layout()
[Figure: conditional posterior densities $p(\mu \mid \tau^2, \mathbf{X}, \eta)$ for $\tau^2 \in \{1, 4, 9, 16, 25\}$.]

Step 4: Marginal Posterior of $\tau^2$

The last and hardest term is the marginal posterior of $\tau^2$, obtained by integrating out $\mu$:

\begin{align}
p(\tau^2 \mid \mbX, \boldsymbol{\eta}) &\propto \int p(\mu, \tau^2, \mbX \mid \boldsymbol{\eta})\, \dif\mu.
\end{align}

Rather than evaluating this integral directly, we use a trick: by Bayes’ rule,

\begin{align}
p(\tau^2 \mid \mbX, \boldsymbol{\eta}) &= \frac{p(\mu, \tau^2 \mid \mbX, \boldsymbol{\eta})}{p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta})},
\end{align}

which holds for any choice of $\mu$. Substituting the known expressions and evaluating at $\mu = \hat{\mu}$ (which cancels the exponential term in the denominator), we obtain an unnormalized function,

\begin{align}
f(\tau^2) \triangleq p(\tau^2) \cdot \frac{\sqrt{v_\mu}}{\tau} \cdot e^{-\frac{\kappa_0}{2\tau^2}(\hat{\mu} - \mu_0)^2} \cdot \prod_{s=1}^S \frac{1}{\sqrt{\bar{\sigma}_s^2 + \tau^2}} \exp\!\left(-\frac{(\bar{x}_s - \hat{\mu})^2}{2(\bar{\sigma}_s^2 + \tau^2)}\right),
\end{align}

where both $\hat{\mu}$ and $v_\mu$ depend on $\tau^2$. This function has no closed-form normalizer, but since $\tau^2$ is one-dimensional we can evaluate it on a dense grid and normalize numerically.

def compute_log_f(τ2, x_bars, σ_bars, μ0, κ0, ν0, τ2_0):
    """Compute log f(τ²) — the unnormalized log marginal posterior of τ².

    Args:
        τ2:     (T,) tensor of τ² grid values
        x_bars: (S,) observed school means
        σ_bars: (S,) school standard errors
        μ0, κ0, ν0, τ2_0: NIX hyperparameters

    Returns:
        (T,) tensor of log f(τ²) values
    """
    # Prior on τ²
    log_f = ScaledInvChiSq(ν0, τ2_0).log_prob(τ2)

    # Posterior on μ | τ²  (derived in Step 3)
    λ0    = κ0 / τ2
    λs    = 1 / (σ_bars[None, :]**2 + τ2[:, None])         # (T, S)
    v_μ   = 1 / (λ0 + λs.sum(-1))
    μ_hat = v_μ * (λ0 * μ0 + (λs * x_bars[None, :]).sum(-1))

    # Bayes-rule trick: evaluate at μ = μ_hat
    log_f += 0.5 * torch.log(v_μ) - 0.5 * torch.log(τ2)
    log_f += -0.5 * κ0 / τ2 * (μ_hat - μ0)**2
    log_f += 0.5 * torch.log(λs).sum(-1)
    log_f += -0.5 * (λs * (x_bars[None, :] - μ_hat[:, None])**2).sum(-1)

    return log_f
τ2_grid = torch.linspace(1e-1, 256, 1000)

log_f   = compute_log_f(τ2_grid, x_bars, σ_bars, μ0, κ0, ν0, τ2_0)
dτ2     = τ2_grid[1] - τ2_grid[0]
p_τ2    = torch.exp(log_f - torch.logsumexp(log_f, 0) - torch.log(dτ2))

# Change of variables to get p(τ)
τ_grid  = torch.sqrt(τ2_grid)
p_τ     = 2 * p_τ2 * τ_grid

τ_mean  = (p_τ[:-1] * torch.diff(τ_grid) * τ_grid[:-1]).sum()
τ_var   = (p_τ[:-1] * torch.diff(τ_grid) * (τ_grid[:-1] - τ_mean)**2).sum()
print(f"Posterior  E[τ] = {τ_mean:.2f},  Std[τ] = {τ_var.sqrt():.2f}")
Posterior  E[τ] = 4.46,  Std[τ] = 2.54
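As a quick check on the grid approximation, both discretized densities should sum to approximately one.

# Sanity check: the discretized densities should integrate to ≈ 1.
print((p_τ2 * dτ2).sum())                       # ≈ 1 on the uniform τ² grid
print((p_τ[:-1] * torch.diff(τ_grid)).sum())    # ≈ 1 on the nonuniform τ grid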
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].plot(τ2_grid, p_τ2)
axes[0].set_xlabel(r"$\tau^2$")
axes[0].set_ylabel(r"$p(\tau^2 \mid \mathbf{X}, {\eta})$")
axes[0].set_ylim(0)
axes[0].grid(True)

axes[1].plot(τ_grid, p_τ)
axes[1].set_xlabel(r"$\tau$")
axes[1].set_ylabel(r"$p(\tau \mid \mathbf{X}, {\eta})$")
axes[1].set_ylim(0)
axes[1].grid(True)

plt.tight_layout()
[Figure: marginal posteriors $p(\tau^2 \mid \mathbf{X}, \eta)$ (left) and $p(\tau \mid \mathbf{X}, \eta)$ (right).]

Step 5: Per-School Effects Marginalizing over $\mu$

Finally, we want the marginal posterior of each $\theta_s$ given $\tau^2$ and the data, having integrated out $\mu$. Using the product rule,

\begin{align}
p(\theta_s \mid \tau^2, \mbX, \boldsymbol{\eta}) &= \int p(\theta_s \mid \mu, \tau^2, \bar{x}_s) \, p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta}) \, \dif\mu.
\end{align}

Both factors are Gaussian in $\mu$, so the integral is again tractable via linear Gaussian marginalization:

\begin{align}
p(\theta_s \mid \tau^2, \mbX, \boldsymbol{\eta}) &= \cN(\theta_s \mid \hat{\theta}_s, v_{\theta_s}),
\end{align}

where, with $v_s = \left(\tfrac{1}{\bar{\sigma}_s^2} + \tfrac{1}{\tau^2}\right)^{-1}$ as in Step 2,

\begin{align}
\hat{\theta}_s &= v_s \left(\frac{\bar{x}_s}{\bar{\sigma}_s^2} + \frac{\hat{\mu}}{\tau^2}\right),
\qquad
v_{\theta_s} = v_s + \left(\frac{v_s}{\tau^2}\right)^2 v_\mu.
\end{align}

The mean $\hat{\theta}_s$ is the precision-weighted average from Step 2 evaluated at $\mu = \hat{\mu}$, and the additional $(v_s/\tau^2)^2 \, v_\mu$ term inflates the variance, reflecting our residual uncertainty about $\mu$.

def compute_posterior_theta(τ2, μ, x_bars, σ_bars):
    """Posterior mean/variance of θ_s given τ², μ, and data.

    Broadcasts over (T,) tensors of τ² and μ.
    Returns (θ_hat, v_θ) each of shape (T, S).
    """
    v_θ   = 1 / (1/σ_bars[None,:]**2 + 1/τ2[:,None])
    θ_hat = v_θ * (x_bars[None,:]/σ_bars[None,:]**2 + μ[:,None]/τ2[:,None])
    return θ_hat, v_θ


def compute_posterior_theta_marg(τ2, μ0, κ0, x_bars, σ_bars):
    """Posterior mean/variance of θ_s given τ², marginalizing over μ.

    Returns (θ_hat, v_θ) each of shape (T, S).
    """
    μ_hat, v_μ = compute_posterior_mu(τ2, μ0, κ0, x_bars, σ_bars)
    # Conditional (Step 2) quantities evaluated at μ = μ_hat ...
    v_s   = 1 / (1/σ_bars[None,:]**2 + 1/τ2[:,None])
    θ_hat = v_s * (x_bars[None,:]/σ_bars[None,:]**2 + μ_hat[:,None]/τ2[:,None])
    # ... plus the extra variance from residual uncertainty about μ
    v_θ   = v_s + (v_s / τ2[:,None])**2 * v_μ[:,None]
    return θ_hat, v_θ
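As a check on these formulas, we can compare the closed-form marginal moments against Monte Carlo estimates obtained by first sampling $\mu$ and then $\theta_s$, at a single arbitrary test value of $\tau^2$ (here $\tau^2 = 25$, an assumption of this sketch).

# Monte Carlo check (illustrative): marginalize μ by sampling and compare to the
# closed-form marginal mean/variance at the test value τ² = 25.
τ2_chk = torch.tensor([25.0])
μ_hat_chk, v_μ_chk = compute_posterior_mu(τ2_chk, μ0, κ0, x_bars, σ_bars)
μ_mc = Normal(μ_hat_chk, torch.sqrt(v_μ_chk)).sample((50_000,))          # (50000, 1)
θ_mc_mean, θ_mc_var = compute_posterior_theta(
    τ2_chk.expand(50_000), μ_mc[:, 0], x_bars, σ_bars)
θ_mc = Normal(θ_mc_mean, torch.sqrt(θ_mc_var)).sample()                  # (50000, S)
θ_hat_chk, v_θ_chk = compute_posterior_theta_marg(τ2_chk, μ0, κ0, x_bars, σ_bars)
print("max |mean error|:", (θ_mc.mean(0) - θ_hat_chk[0]).abs().max())    # small
print("max |var  error|:", (θ_mc.var(0)  - v_θ_chk[0]).abs().max())      # small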
θ_hat, v_θ = compute_posterior_theta_marg(τ2_grid, μ0, κ0, x_bars, σ_bars)

fig, ax = plt.subplots()
for s in range(S):
    ax.plot(τ_grid, θ_hat[:, s], label=f"School {s+1}")
ax.set_xlabel(r"$\tau$")
ax.set_ylabel(r"$\mathbb{E}[\theta_s \mid \tau, \mathbf{X}, {\eta}]$")
ax.legend(fontsize=8, ncol=2)
plt.tight_layout()
[Figure: conditional posterior means $\mathbb{E}[\theta_s \mid \tau, \mathbf{X}, \eta]$ as a function of $\tau$, one curve per school.]

Posterior Sampling via Ancestral Sampling

We now have all the pieces needed to draw samples from the joint posterior $p(\mu, \tau^2, \mbtheta \mid \mbX, \boldsymbol{\eta})$. Because we derived the conditionals in the order $\mbtheta$, then $\mu$, then $\tau^2$ (by sequential marginalization), we can draw samples in the reverse order, $\tau^2 \to \mu \to \mbtheta$ — this is called ancestral sampling:

  1. Sample $\tau$ from the normalized grid approximation $p(\tau \mid \mbX, \boldsymbol{\eta})$.

  2. Sample $\mu$ from the conditional $p(\mu \mid \tau^2, \mbX, \boldsymbol{\eta}) = \cN(\hat{\mu}, v_\mu)$.

  3. Sample $\mbtheta$ from $p(\mbtheta \mid \mu, \tau^2, \mbX, \boldsymbol{\eta}) = \prod_s \cN(\hat{\theta}_s, v_s)$.

Discarding $\tau$ and $\mu$ from each sample gives draws from the marginal posterior $p(\mbtheta \mid \mbX, \boldsymbol{\eta})$.

torch.manual_seed(305)
N_samp = 10000

# Step 1: sample τ from the discrete grid approximation
centers = 0.5 * (τ_grid[:-1] + τ_grid[1:])
widths  = torch.diff(τ_grid)
inds    = Categorical(probs=p_τ[:-1] * widths).sample((N_samp,))
τ_samp  = centers[inds]                                  # (N_samp,)

# Step 2: sample μ | τ²
μ_hat, v_μ = compute_posterior_mu(τ_samp**2, μ0, κ0, x_bars, σ_bars)
μ_samp = Normal(μ_hat, torch.sqrt(v_μ)).sample()        # (N_samp,)

# Step 3: sample θ_s | μ, τ²
θ_hat, v_θ = compute_posterior_theta(τ_samp**2, μ_samp, x_bars, σ_bars)
θ_samp = Normal(θ_hat, torch.sqrt(v_θ)).sample()        # (N_samp, S)

# Posterior summary statistics
print(f"{'School':>8}  {'mean':>7}  {'std':>6}  {'5%':>7}  {'95%':>7}")
print("-" * 42)
for s in range(S):
    print(f"{s+1:>8}  "
          f"{θ_samp[:,s].mean():>7.2f}  "
          f"{θ_samp[:,s].std():>6.2f}  "
          f"{torch.quantile(θ_samp[:,s], 0.05):>7.2f}  "
          f"{torch.quantile(θ_samp[:,s], 0.95):>7.2f}")
  School     mean     std       5%      95%
------------------------------------------
       1     8.65    6.46    -0.51    20.19
       2     6.81    5.38    -1.73    15.80
       3     5.75    5.94    -3.83    15.45
       4     6.75    5.56    -1.91    16.11
       5     5.05    5.19    -3.78    13.32
       6     5.81    5.48    -3.09    14.79
       7     8.59    5.79    -0.02    18.78
       8     6.99    6.24    -2.72    17.45
bins = torch.linspace(-30, 70, 50)
fig, axs = plt.subplots(S, 1, figsize=(8, 10), sharex=True)
for s in range(S):
    axs[s].hist(θ_samp[:, s].numpy(), bins.numpy(),
                edgecolor='k', alpha=0.5, density=True)
    if s == S - 1:
        axs[s].set_xlabel(r"$\theta_s$")
    axs[s].set_ylabel(r"$p(\theta_" + str(s+1) + r" \mid \mathbf{X})$", fontsize=9)
    axs[s].set_title(f"School {s+1}: mean = {θ_samp[:,s].mean():.1f}, "
                     f"std = {θ_samp[:,s].std():.1f}", fontsize=9)
plt.tight_layout()
[Figure: histograms of posterior samples of $\theta_s$ for each of the eight schools, with per-school means and standard deviations in the panel titles.]

Comparison to Classical Analysis of Variance

A classical approach to estimating $\theta_s$ offers two extremes:

  • No pooling (unpooled): $\hat{\theta}_s = \bar{x}_s$. Treat each school independently; ignore any similarity between schools.

  • Complete pooling: $\hat{\theta}_s = \bar{x}$. Treat all schools as identical; pool all data.

An ANOVA F-test helps choose between them: if the between-group mean square is significantly larger than the within-group mean square, use unpooled estimates; otherwise use pooled.

The hierarchical Bayesian approach automatically interpolates between these extremes. The posterior mean $\E[\theta_s \mid \mbX]$ is a precision-weighted blend of the school mean $\bar{x}_s$ and the global mean $\hat{\mu}$, with the blend determined by the posterior over $\tau^2$. When $\tau^2$ is large (schools differ a lot), estimates are close to the unpooled $\bar{x}_s$; when $\tau^2$ is small (schools are similar), estimates shrink toward the pooled mean.
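For a concrete side-by-side comparison, the sketch below lines up the unpooled estimates, a completely pooled estimate (taken here to be the unweighted grand mean of the school means, one simple choice), and the hierarchical posterior means from the samples drawn above.

# No pooling vs. complete pooling vs. hierarchical partial pooling.
unpooled = x_bars                          # θ̂_s = x̄_s
pooled   = x_bars.mean().expand(S)         # θ̂_s = grand mean (ignores differing σ̄_s)
partial  = θ_samp.mean(0)                  # hierarchical posterior means
for s in range(S):
    print(f"School {s+1}:  unpooled = {unpooled[s]:6.1f}   "
          f"pooled = {pooled[s]:6.1f}   partial = {partial[s]:6.1f}")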

Two advantages of the Bayesian approach:

  1. Uncertainty propagation. The posterior over $\theta_s$ reflects uncertainty about $\mu$ and $\tau^2$, not just about $\theta_s$ given fixed global parameters. A plug-in estimate $\hat{\tau}^2$ underestimates posterior variance.

  2. Coherence. The point estimate $\hat{\tau}^2 = (\mathrm{MS}_B - \mathrm{MS}_W)/N$ (with $N$ the per-group sample size) can be negative (set to zero in practice), which amounts to an overly strong claim that $\tau^2 = 0$. The posterior remains well-behaved.

Conclusion

The hierarchical Gaussian model demonstrates exact Bayesian inference in a multi-level setting. The key steps were:

  • Express the joint posterior as a product $p(\mbtheta \mid \cdot)\, p(\mu \mid \tau^2, \cdot)\, p(\tau^2 \mid \cdot)$ using the product rule.

  • Compute $p(\mbtheta \mid \mu, \tau^2, \cdot)$ and $p(\mu \mid \tau^2, \cdot)$ analytically using linear Gaussian identities.

  • Evaluate and normalize $p(\tau^2 \mid \mbX)$ numerically on a 1D grid.

  • Draw samples from the joint posterior via ancestral sampling: sample $\tau \to \mu \to \mbtheta$ in the order implied by the factorization.

The posterior mean of each $\theta_s$ automatically partially pools the school estimate toward the global mean, with the degree of pooling governed by the posterior over $\tau^2$.

References
  1. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). Chapman & Hall/CRC.
  2. Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. https://probml.github.io/pml-book/book2.html