
Dirichlet Process Mixture Models

Finite Bayesian mixture models require specifying the number of components $K$ in advance. Dirichlet process mixture models (DPMMs) are a Bayesian nonparametric alternative that allows the number of clusters to grow with the data. In this chapter we derive DPMMs from finite mixture models by taking $K \to \infty$, then study the random measure perspective and its connection to Poisson processes.

import torch
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t as student_t

Finite Bayesian Mixture Models

A finite Bayesian mixture model with $K$ components is defined by the following generative process:

  1. Sample mixing proportions from a Dirichlet prior with concentration $\boldsymbol{\alpha} \in \mathbb{R}_+^K$:

$$\boldsymbol{\pi} \sim \mathrm{Dirichlet}(\boldsymbol{\alpha}).$$

  2. Sample component parameters:

$$\boldsymbol{\theta}_k \overset{\text{iid}}{\sim} p(\boldsymbol{\theta} \mid \boldsymbol{\phi}, \nu), \qquad k = 1, \ldots, K.$$

  3. Sample assignments and observations:

$$z_n \overset{\text{iid}}{\sim} \boldsymbol{\pi}, \qquad \mathbf{x}_n \sim p(\mathbf{x} \mid \boldsymbol{\theta}_{z_n}), \qquad n = 1, \ldots, N.$$

The joint distribution factorizes as:

$$p(\boldsymbol{\pi}, \{\boldsymbol{\theta}_k\}, \{z_n, \mathbf{x}_n\}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) \prod_{k=1}^K p(\boldsymbol{\theta}_k \mid \boldsymbol{\phi}, \nu) \prod_{n=1}^N \prod_{k=1}^K \left[\pi_k \, p(\mathbf{x}_n \mid \boldsymbol{\theta}_k)\right]^{\mathbb{I}[z_n = k]}.$$
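To make the generative process concrete, here is a minimal sampling sketch for a Gaussian finite mixture. The Gaussian likelihood, the spherical Gaussian prior on the component means, and all hyperparameter values are illustrative assumptions rather than part of the general model above.

```python
import torch

def sample_finite_mixture(N, K, alpha=1.0, prior_std=3.0, obs_std=0.5, seed=0):
    """Draw one dataset from a finite Bayesian Gaussian mixture (illustrative sketch)."""
    torch.manual_seed(seed)
    pi = torch.distributions.Dirichlet(alpha * torch.ones(K)).sample()   # step 1: mixing proportions
    theta = prior_std * torch.randn(K, 2)                                # step 2: component means ~ N(0, prior_std^2 I)
    z = torch.distributions.Categorical(pi).sample((N,))                 # step 3a: assignments
    x = theta[z] + obs_std * torch.randn(N, 2)                           # step 3b: observations
    return pi, theta, z, x

pi, theta, z, x = sample_finite_mixture(N=200, K=4)
print(pi, torch.bincount(z, minlength=4))
```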

Collapsing Out Parameters

When the likelihood is an exponential family and the prior is its conjugate,

$$p(\mathbf{x} \mid \boldsymbol{\theta}_k) = h(\mathbf{x}) \exp\{\langle t(\mathbf{x}), \boldsymbol{\theta}_k \rangle - A(\boldsymbol{\theta}_k)\}, \qquad p(\boldsymbol{\theta}_k \mid \boldsymbol{\phi}, \nu) = \frac{1}{Z_\theta(\boldsymbol{\phi}, \nu)} \exp\{\langle \boldsymbol{\phi}, \boldsymbol{\theta}_k \rangle - \nu A(\boldsymbol{\theta}_k)\},$$

we can marginalize (or collapse) the component parameters and mixture proportions in closed form.

After integrating out $\{\boldsymbol{\theta}_k\}$ and $\boldsymbol{\pi}$, the marginal likelihood over assignments $\{z_n\}$ and observations $\{\mathbf{x}_n\}$ is:

$$p(\{z_n, \mathbf{x}_n\} \mid \boldsymbol{\phi}, \nu, \boldsymbol{\alpha}) = \frac{Z_\pi(\boldsymbol{\alpha}')}{Z_\pi(\boldsymbol{\alpha})} \prod_{k=1}^K \frac{Z_\theta(\boldsymbol{\phi}_k', \nu_k')}{Z_\theta(\boldsymbol{\phi}, \nu)},$$

where the updated hyperparameters are:

$$\boldsymbol{\alpha}' = [\alpha_1 + N_1, \ldots, \alpha_K + N_K], \qquad \boldsymbol{\phi}_k' = \boldsymbol{\phi} + \sum_{n:\, z_n=k} t(\mathbf{x}_n), \qquad \nu_k' = \nu + N_k,$$

where $N_k = \sum_n \mathbb{I}[z_n = k]$ is the number of points assigned to cluster $k$, and $Z_\pi(\boldsymbol{\alpha}) = \prod_k \Gamma(\alpha_k) / \Gamma(\sum_k \alpha_k)$ is the Dirichlet normalizing constant.

This is a general pattern: in conjugate exponential families, marginal likelihoods are ratios of posterior to prior normalizing functions.
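For the Dirichlet–multinomial factor, this ratio can be evaluated directly with log-gamma functions. The sketch below illustrates it; the concentration and the cluster counts are made-up values for illustration.

```python
import torch

def log_dirichlet_norm(alpha):
    """log Z_pi(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)."""
    return torch.lgamma(alpha).sum() - torch.lgamma(alpha.sum())

def log_assignment_prior(counts, alpha):
    """log p({z_n} | alpha) = log Z_pi(alpha') - log Z_pi(alpha), with alpha' = alpha + counts."""
    return log_dirichlet_norm(alpha + counts) - log_dirichlet_norm(alpha)

alpha = torch.ones(3)                       # symmetric Dirichlet prior
counts = torch.tensor([10.0, 5.0, 0.0])     # cluster sizes N_1, N_2, N_3
print(log_assignment_prior(counts, alpha).item())
```

The $Z_\theta(\boldsymbol{\phi}_k', \nu_k') / Z_\theta(\boldsymbol{\phi}, \nu)$ factors play the same role for the component parameters, and are essentially what the posterior-predictive computation later in this chapter evaluates.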

Collapsed Gibbs Sampling

Working with the collapsed distribution enables efficient collapsed Gibbs sampling. The conditional distribution of $z_n$, holding all other assignments fixed, simplifies to:

$$p(z_n = k \mid \mathbf{x}_n, \{z_{n'}\}_{n' \neq n}, \boldsymbol{\phi}, \nu, \boldsymbol{\alpha}) \propto \underbrace{(\alpha_k + N_k^{(\neg n)})}_{\text{cluster size}} \cdot \underbrace{p(\mathbf{x}_n \mid \{\mathbf{x}_{n'} : z_{n'} = k\}, \boldsymbol{\phi}, \nu)}_{\text{posterior predictive}},$$

where $N_k^{(\neg n)} = \sum_{n' \neq n} \mathbb{I}[z_{n'} = k]$ is the number of other points in cluster $k$.

The two factors have intuitive interpretations (a toy example follows this list):

  • The cluster size term $\alpha_k + N_k^{(\neg n)}$ comes from Dirichlet–multinomial conjugacy: larger clusters attract new points.

  • The posterior predictive term $p(\mathbf{x}_n \mid \ldots)$ measures how well $\mathbf{x}_n$ fits with the other points in cluster $k$.
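As a toy illustration of the two factors, consider clusters of binary observations with a Beta–Bernoulli conjugate pair. This model and the numbers below are assumptions made purely for illustration, not part of the DPMM developed here.

```python
def beta_bernoulli_predictive(x, cluster_x, a=1.0, b=1.0):
    """Posterior predictive p(x | cluster_x) under a Beta(a, b) prior on the Bernoulli parameter."""
    heads, n = sum(cluster_x), len(cluster_x)
    p_one = (a + heads) / (a + b + n)           # posterior mean of the Bernoulli parameter
    return p_one if x == 1 else 1.0 - p_one

alpha_k = 0.5                                   # Dirichlet pseudo-count for cluster k
cluster = [1, 1, 0, 1]                          # other points currently assigned to cluster k
x_new = 1
weight = (alpha_k + len(cluster)) * beta_bernoulli_predictive(x_new, cluster)
print(weight)   # unnormalized p(z_n = k | ...): cluster-size term times posterior predictive
```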

The Dirichlet Process Mixture Model

Now set $\boldsymbol{\alpha} = \frac{\alpha}{K} \mathbf{1}_K$ and take $K \to \infty$. In this limit, the collapsed Gibbs update simplifies to:

$$p(z_n = k \mid \mathbf{x}_n, \{z_{n'}\}_{n' \neq n}) \propto \begin{cases} N_k^{(\neg n)} \cdot p(\mathbf{x}_n \mid \{\mathbf{x}_{n'} : z_{n'} = k\}, \boldsymbol{\phi}, \nu) & \text{if } k \text{ is an existing cluster,} \\ \alpha \cdot p(\mathbf{x}_n \mid \boldsymbol{\phi}, \nu) & \text{if } k \text{ is a new cluster.} \end{cases}$$

This is the DPMM collapsed Gibbs update. It is nonparametric: $K$ is never fixed, and the number of occupied clusters can grow or shrink at each step:

  • A cluster is created when a data point is assigned to a new cluster (probability controlled by $\alpha$).

  • A cluster is deleted when its last data point is reassigned elsewhere.

The parameter $\alpha > 0$ is the concentration: larger $\alpha$ creates more clusters.

The Chinese Restaurant Process (CRP)

The prior on partitions induced by the DPMM has a beautiful sequential description called the Chinese restaurant process (CRP). Imagine $N$ customers entering a restaurant one at a time:

  • Customer 1 sits at a new table.

  • Customer $n$ sits at an existing table $k$ with probability $\frac{|\mathcal{C}_k|}{\alpha + n - 1}$, where $|\mathcal{C}_k|$ is the number of customers already seated there, or starts a new table with probability $\frac{\alpha}{\alpha + n - 1}$.

The resulting partition $\mathcal{C} = \{\mathcal{C}_1, \ldots\}$ is the CRP prior on partitions of $[N]$.
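A short simulation makes the rich-get-richer behavior of the CRP visible; the values of $N$ and $\alpha$ below are arbitrary. The expected number of occupied tables grows roughly like $\alpha \log N$.

```python
import numpy as np

def sample_crp(N, alpha, rng=None):
    """Sample table assignments for N customers from a CRP with concentration alpha."""
    rng = rng if rng is not None else np.random.default_rng(0)
    tables = []                                  # tables[k] = customers seated at table k
    assignments = np.zeros(N, dtype=int)
    for n in range(N):                           # customer n + 1 arrives
        probs = np.array(tables + [alpha], dtype=float)
        probs /= alpha + n                       # denominator alpha + (n + 1) - 1
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)                     # start a new table
        else:
            tables[k] += 1                       # join existing table k
        assignments[n] = k
    return assignments, tables

_, tables = sample_crp(500, alpha=2.0)
print(len(tables), "tables; alpha * log(N) =", round(2.0 * np.log(500), 2))
```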

Stick-Breaking Construction

The DPMM can also be defined directly via a random measure:

$$\Theta = \sum_{k=1}^\infty \pi_k \, \delta_{\boldsymbol{\theta}_k}, \qquad \boldsymbol{\theta}_k \overset{\text{iid}}{\sim} G,$$

where the weights follow a stick-breaking construction: sample $\ell_k \sim \mathrm{Beta}(1, \alpha)$ and set

$$\pi_k = \ell_k \prod_{j=1}^{k-1} (1 - \ell_j).$$

We write $\Theta \sim \mathrm{DP}(\alpha, G)$ and call $\alpha$ the concentration and $G$ the base measure.
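A truncated stick-breaking sampler gives a direct feel for how quickly the weights decay; the truncation level is an assumption needed to make the infinite sum finite in code.

```python
import torch

def stick_breaking_weights(alpha, K_trunc=1000, seed=0):
    """Truncated stick-breaking: pi_k = l_k * prod_{j<k} (1 - l_j), l_k ~ Beta(1, alpha)."""
    torch.manual_seed(seed)
    ell = torch.distributions.Beta(1.0, float(alpha)).sample((K_trunc,))
    remaining = torch.cumprod(1.0 - ell, dim=0)   # length of stick left after each break
    pi = ell.clone()
    pi[1:] = ell[1:] * remaining[:-1]
    return pi

pi = stick_breaking_weights(alpha=2.0)
print(pi[:5], pi.sum())   # weights decay quickly; the sum approaches 1 as K_trunc grows
```

The next cells implement the collapsed Gibbs sampler derived above for Gaussian data, with a Normal-Inverse-Gamma prior applied independently to each dimension, and then run it on synthetic data drawn from three Gaussians.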

def normal_log_posterior_predictive(x, cluster_points, mu0, kappa0, alpha0, beta0):
    '''Log posterior predictive p(x | cluster_points) under Normal-NormalInverseGamma prior.

    Model: x_n | mu, sigma^2 ~ N(mu, sigma^2)
           mu | sigma^2 ~ N(mu0, sigma^2 / kappa0)
           sigma^2 ~ IG(alpha0, beta0)

    The marginal p(x | data) is a Student-t distribution.

    Args:
        x: new point, shape (D,)
        cluster_points: tensor of shape (N_k, D); empty if prior predictive
        mu0: prior mean, shape (D,)
        kappa0: prior precision scaling (scalar)
        alpha0: prior shape (scalar)
        beta0: prior rate (scalar)
    Returns:
        log predictive density (scalar)
    '''
    D = x.shape[0]
    n = len(cluster_points)
    if n > 0:
        x_bar = cluster_points.mean(0)
        kappa_n = kappa0 + n
        mu_n = (kappa0 * mu0 + n * x_bar) / kappa_n
        alpha_n = alpha0 + n / 2.0
        S = ((cluster_points - x_bar) ** 2).sum(0)  # shape (D,)
        diff = x_bar - mu0
        beta_n = beta0 + 0.5 * S + 0.5 * kappa0 * n / kappa_n * diff ** 2
    else:
        kappa_n, mu_n, alpha_n, beta_n = kappa0, mu0, alpha0, beta0 * torch.ones(D)

    # Marginal is a Student-t: t(2*alpha_n, mu_n, beta_n*(kappa_n+1)/(alpha_n*kappa_n))
    df = 2.0 * alpha_n
    scale2 = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)
    log_p = 0.0
    for d in range(D):
        log_p += float(student_t.logpdf(x[d].item(), df=df, loc=mu_n[d].item(),
                                         scale=scale2[d].item() ** 0.5))
    return log_p


def collapsed_gibbs_dpmm(X, alpha, mu0, kappa0, alpha0, beta0, num_iters=100, seed=0):
    '''Collapsed Gibbs sampler for a DPMM with Normal-NormalInverseGamma components.

    Args:
        X: data, shape (N, D)
        alpha: DP concentration
        mu0, kappa0, alpha0, beta0: NIG prior hyperparameters
        num_iters: number of Gibbs sweeps
        seed: random seed
    Returns:
        z: final cluster assignments, shape (N,)
        K_history: number of clusters at each iteration
    '''
    torch.manual_seed(seed)
    np.random.seed(seed)  # the categorical draws below use np.random.choice
    N, D = X.shape
    mu0 = mu0 * torch.ones(D)

    # Initialize: one cluster per data point
    z = torch.arange(N)
    K_history = []

    for it in range(num_iters):
        for n in range(N):
            # Remove point n from its current cluster before computing its conditional
            z[n] = -1
            mask_n = (z >= 0)
            unique_ks = z[mask_n].unique().tolist()

            # Compute log probabilities for each existing cluster and a new one
            log_probs = []
            labels = []
            for k in unique_ks:
                pts = X[(z == k)]
                N_k = len(pts)
                log_like = normal_log_posterior_predictive(X[n], pts, mu0, kappa0, alpha0, beta0)
                log_probs.append(np.log(N_k) + log_like)
                labels.append(k)

            # New cluster
            log_like_new = normal_log_posterior_predictive(X[n], X[:0], mu0, kappa0, alpha0, beta0)
            log_probs.append(np.log(alpha) + log_like_new)
            labels.append(max(unique_ks + [-1]) + 1 if unique_ks else 0)

            # Sample
            log_probs = np.array(log_probs)
            log_probs -= log_probs.max()
            probs = np.exp(log_probs)
            probs /= probs.sum()
            chosen = np.random.choice(len(labels), p=probs)
            z[n] = labels[chosen]

        # Re-label clusters 0, 1, 2, ...
        unique_labels = z.unique().tolist()
        remap = {old: new for new, old in enumerate(unique_labels)}
        z = torch.tensor([remap[zi.item()] for zi in z])
        K_history.append(len(unique_labels))

    return z, K_history
torch.manual_seed(1)
np.random.seed(1)

# Generate data from 3 Gaussians
true_means = torch.tensor([[-2.0, 0.0], [2.0, 0.0], [0.0, 2.5]])
true_std = 0.6
N_per = 40
X_list = [true_means[k] + true_std * torch.randn(N_per, 2) for k in range(3)]
X = torch.cat(X_list)
N = len(X)

# Prior hyperparameters
mu0 = torch.zeros(2)
kappa0 = 0.5
alpha0 = 2.0
beta0 = 1.0
alpha_dp = 2.0

z_final, K_hist = collapsed_gibbs_dpmm(X, alpha=alpha_dp, mu0=mu0, kappa0=kappa0,
                                        alpha0=alpha0, beta0=beta0,
                                        num_iters=80, seed=42)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Left: final clustering
K_final = z_final.max().item() + 1
colors = plt.cm.tab10(np.linspace(0, 1, max(K_final, 10)))
for k in range(K_final):
    mask = (z_final == k).numpy()
    axes[0].scatter(X[mask, 0], X[mask, 1], c=[colors[k]], s=20, alpha=0.8)
for km, mu in enumerate(true_means):
    axes[0].scatter(*mu, marker='*', s=200, c='k', zorder=5)
axes[0].set_title(f'DPMM clustering ($\\alpha={alpha_dp}$): {K_final} clusters found')
axes[0].set_xlabel('$x_1$'); axes[0].set_ylabel('$x_2$')

# Right: K over iterations
axes[1].plot(K_hist, color='tab:blue')
axes[1].axhline(3, ls='--', color='k', label='True K=3')
axes[1].set_xlabel('Gibbs iteration')
axes[1].set_ylabel('Number of clusters')
axes[1].set_title('Cluster count over iterations')
axes[1].legend()

plt.tight_layout()
plt.savefig('dpmm_demo.png', dpi=100, bbox_inches='tight')
plt.show()
print(f"Final number of clusters: {K_final}")
[Figure: left, final DPMM clustering of the synthetic data with the true component means marked; right, number of clusters over Gibbs iterations.]
Final number of clusters: 4

Pitman–Yor Process

The Pitman–Yor process (Orbanz, 2014) generalizes the DP by adding a discount parameter $d \in [0, 1)$. We write $\Theta \sim \mathrm{PYP}(\alpha, d, G)$, where the weights use the modified stick-breaking construction:

$$\ell_k \sim \mathrm{Beta}(1 - d, \, \alpha + kd), \qquad \pi_k = \ell_k \prod_{j=1}^{k-1}(1 - \ell_j).$$

When $d = 0$ we recover the DP. When $d > 0$, the cluster sizes follow a power law, making the PYP well-suited to linguistic data where word frequencies exhibit Zipfian behavior.

The CRP analogue assigns customer $n$ to an existing table $k$ with probability $\frac{|\mathcal{C}_k| - d}{\alpha + n - 1}$, or to a new table with probability $\frac{\alpha + d \, K^{(n)}}{\alpha + n - 1}$, where $K^{(n)}$ is the current number of occupied tables.
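A sequential simulation contrasting the DP and PYP predictive rules (the sample size, $\alpha$, and $d$ below are arbitrary choices) shows the PYP producing many more tables, with size distributions closer to a power law.

```python
import numpy as np

def sample_pyp_tables(N, alpha, d, rng=None):
    """Table sizes from the Pitman-Yor CRP analogue; d = 0 recovers the ordinary CRP."""
    rng = rng if rng is not None else np.random.default_rng(0)
    tables = []
    for n in range(N):
        K = len(tables)
        probs = np.array([c - d for c in tables] + [alpha + d * K], dtype=float)
        probs /= alpha + n                       # normalizer alpha + (n + 1) - 1
        k = rng.choice(K + 1, p=probs)
        if k == K:
            tables.append(1)
        else:
            tables[k] += 1
    return np.array(tables)

print("DP  (d=0.0):", len(sample_pyp_tables(2000, alpha=2.0, d=0.0)), "tables")
print("PYP (d=0.5):", len(sample_pyp_tables(2000, alpha=2.0, d=0.5)), "tables")
```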

Mixture of Finite Mixtures

DPMMs are often used to select $K$ automatically, but the DP almost surely generates infinitely many clusters as $N \to \infty$. When the true number of clusters is finite but unknown, mixture of finite mixture models (MFMMs) (Miller & Harrison, 2018) are more appropriate:

$$K \sim p(K), \quad \boldsymbol{\pi} \sim \mathrm{Dir}(\alpha \mathbf{1}_K), \quad \boldsymbol{\theta}_k \overset{\text{iid}}{\sim} G, \quad z_n \overset{\text{iid}}{\sim} \boldsymbol{\pi}, \quad \mathbf{x}_n \sim p(\mathbf{x} \mid \boldsymbol{\theta}_{z_n}).$$

A natural choice is $K - 1 \sim \mathrm{Poisson}(\lambda)$. Collapsed Gibbs samplers for MFMMs have a similar form to the DPMM sampler but converge to a finite $K$.
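A minimal generative sketch of an MFMM with $K - 1 \sim \mathrm{Poisson}(\lambda)$ is shown below; the Gaussian components and the hyperparameter values are illustrative assumptions, not prescribed by the model.

```python
import torch

def sample_mfmm(N, lam=3.0, alpha=1.0, prior_std=3.0, obs_std=0.5, seed=0):
    """Draw one dataset from a mixture of finite mixtures with K - 1 ~ Poisson(lam)."""
    torch.manual_seed(seed)
    K = 1 + int(torch.distributions.Poisson(lam).sample().item())         # random but finite K
    pi = torch.distributions.Dirichlet(alpha * torch.ones(K)).sample()
    theta = prior_std * torch.randn(K, 2)                                 # component means (illustrative base measure)
    z = torch.distributions.Categorical(pi).sample((N,))
    x = theta[z] + obs_std * torch.randn(N, 2)
    return K, z, x

K, z, x = sample_mfmm(N=300)
print("K =", K, " occupied clusters =", len(z.unique()))
```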

Poisson Random Measures

Dirichlet processes and Poisson processes are deeply connected. In fact, the DP is obtained by normalizing a completely random measure (CRM) constructed from a Poisson process.

Random Measures

A random measure on $\mathbb{R}^D$ is a measure-valued random variable. An atomic random measure takes the form:

$$\mu = \sum_{k=1}^\infty w_k \, \delta_{\boldsymbol{\theta}_k},$$

where the weights $w_k \in \mathbb{R}_+$ and locations $\boldsymbol{\theta}_k \in \mathbb{R}^D$ are random.

Poisson Random Measures

Construct a random measure by sampling weight–location pairs from a marked Poisson process on $\mathbb{R}_+ \times \mathbb{R}^D$:

$$\{(w_k, \boldsymbol{\theta}_k)\}_{k=1}^\infty \sim \mathrm{PP}(\lambda(w, \boldsymbol{\theta})).$$

For a homogeneous measure the intensity factors as $\lambda(w, \boldsymbol{\theta}) = \lambda(w) \cdot g(\boldsymbol{\theta})$, where $g$ is a density on $\mathbb{R}^D$ (the density of the base measure $G$).

The Gamma Process and Dirichlet Process

Choose the weight intensity

$$\lambda(w) = \alpha w^{-1} e^{-\beta w}.$$

Then $\int_0^\infty \lambda(w)\, dw = \infty$, so $\mu$ has infinitely many atoms. Nevertheless, the total mass $W = \sum_k w_k \sim \mathrm{Gamma}(\alpha, 1/\beta)$ (shape $\alpha$, scale $1/\beta$) is almost surely finite. This is the gamma process, written $\mu \sim \mathrm{GaP}(\alpha, G)$.

Normalizing gives the Dirichlet process:

$$\mu \sim \mathrm{GaP}(\alpha, G) \quad \Longrightarrow \quad \Theta = \frac{\mu}{W} = \sum_{k=1}^\infty \frac{w_k}{W} \, \delta_{\boldsymbol{\theta}_k} \sim \mathrm{DP}(\alpha, G).$$

This is a key result: Dirichlet processes are normalized gamma processes.
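The result can be illustrated at the finite-dimensional level: independent $\mathrm{Gamma}(\alpha/K, 1)$ masses, once normalized, are $\mathrm{Dirichlet}(\frac{\alpha}{K}\mathbf{1}_K)$ distributed and approach the DP weights as $K \to \infty$, while the total mass is $\mathrm{Gamma}(\alpha, 1)$ and independent of the normalized weights. The snippet below is a sketch of this finite analogue, not a sampler for the full gamma process.

```python
import torch

torch.manual_seed(0)
alpha, K = 2.0, 10_000
# Independent gamma masses with shape alpha/K and rate 1 (finite-dimensional analogue).
w = torch.distributions.Gamma(alpha / K * torch.ones(K), torch.ones(K)).sample()
W = w.sum()                          # total mass ~ Gamma(alpha, 1)
pi = w / W                           # ~ Dirichlet((alpha/K) 1_K), approaching DP weights as K grows
print(W.item(), pi.sort(descending=True).values[:5])
```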

Other Completely Random Measures

Different weight intensities yield other useful random measures:

| Weight intensity $\lambda(w)$ | Name |
| --- | --- |
| $\alpha w^{-1} e^{-\beta w}$ | Gamma process → DP (after normalization) |
| $\gamma w^{-(\alpha+1)}$ | Stable process |
| $\gamma w^{-1}(1-w)^{\alpha-1}$ | Beta process |

A CRM $\mu$ has the remarkable property that $\Theta = \mu/W$ is independent of $W$ if and only if $\mu$ is a gamma process; the DP is the unique normalized CRM with this independence property (Orbanz, 2014).

References
  1. Orbanz, P. (2014). Lecture notes on Bayesian nonparametrics. http://www.gatsby.ucl.ac.uk/~porbanz/papers/porbanz_BNP_draft.pdf
  2. Miller, J. W., & Harrison, M. T. (2018). Mixture models with a prior on the number of components. Journal of the American Statistical Association, 113(521), 340–356.