
Probabilistic PCA

High-dimensional data often has far less intrinsic variation than the dimension $D$ would suggest. A class of students' grade transcripts, or a collection of images, may vary along just a few underlying "axes." Latent variable models make this intuition precise by positing that the high-dimensional observations $\mbx_n \in \reals^D$ are generated from a low-dimensional latent variable $\mbz_n \in \reals^M$ with $M \ll D$.

In this chapter we study the simplest such models, where the relationship between $\mbz_n$ and $\mbx_n$ is linear and all distributions are Gaussian:

  • Principal Components Analysis (PCA): two classical optimization formulations and their connection to the eigenvectors of the sample covariance matrix

  • Probabilistic PCA (PPCA): PCA as the maximum likelihood solution to a Gaussian latent variable model, enabling Bayesian inference

  • Factor Analysis (FA): a generalization of PPCA with per-dimension noise variances

  • Other linear LVMs: Independent Components Analysis (ICA) and Probabilistic Canonical Correlation Analysis (PCCA)

import torch
import matplotlib.pyplot as plt

Principal Components Analysis

Motivation

Suppose we observe grade transcripts $\{\mbx_n\}_{n=1}^N$, where $\mbx_n \in \reals^D$ is a vector of $D$ grades for student $n$. We might want to:

  • Reduce dimensionality: are there a few axes of variation?

  • Visualize: embed the points in 2–3 dimensions for plotting.

  • Compress: summarize each student’s record compactly.

PCA addresses all three. There are two classical equivalent formulations.

Maximum Variance Formulation

Goal: project the data onto an $M$-dimensional subspace while maximizing the variance of the projected data.

For $M = 1$, let $\mbu_1 \in \reals^D$ be a unit vector defining the subspace. The projection of $\mbx_n$ is the scalar $\mbu_1^\top \mbx_n$ (the score or embedding). Its variance over the $N$ data points is:

\begin{align} \frac{1}{N} \sum_{n=1}^N \bigl[\mbu_1^\top \mbx_n - \mbu_1^\top \bar{\mbx}\bigr]^2 = \mbu_1^\top \mbS \mbu_1, \end{align}

where $\bar{\mbx} = \frac{1}{N}\sum_n \mbx_n$ is the sample mean and $\mbS = \frac{1}{N} \sum_n (\mbx_n - \bar{\mbx})(\mbx_n - \bar{\mbx})^\top$ is the sample covariance matrix.

Maximizing $\mbu_1^\top \mbS \mbu_1$ subject to $\|\mbu_1\|^2 = 1$ via a Lagrange multiplier gives $\mbS \mbu_1 = \lambda_1 \mbu_1$, so $\mbu_1$ must be an eigenvector of $\mbS$. Left-multiplying by $\mbu_1^\top$ shows the projected variance equals $\lambda_1$, so we should choose the eigenvector with the largest eigenvalue.

More generally, the $M$-dimensional principal subspace is spanned by the $M$ eigenvectors $\mbu_1, \ldots, \mbu_M$ with the largest eigenvalues $\lambda_1 \geq \cdots \geq \lambda_M$.
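To make this concrete, here is a minimal sketch on hypothetical random data: compute the sample covariance, take its top eigenvector, and check that the projected variance equals $\lambda_1$.

import torch

torch.manual_seed(0)
X = torch.randn(100, 5)                    # hypothetical N x D data
Xc = X - X.mean(0)                         # centre the data
S = Xc.T @ Xc / len(X)                     # D x D sample covariance
evals, evecs = torch.linalg.eigh(S)        # eigenvalues in ascending order
u1 = evecs[:, -1]                          # eigenvector with the largest eigenvalue
print(u1 @ S @ u1, evals[-1])              # projected variance equals λ₁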

Linear Autoencoder Formulation

Goal: find a matrix $\mbW \in \reals^{D \times M}$ with orthonormal columns (i.e., $\mbW^\top \mbW = \mbI_M$) that minimizes the mean squared reconstruction error.

We encode as $\mbz_n = \mbW^\top (\mbx_n - \bar{\mbx})$ and decode as $\hat{\mbx}_n = \mbW \mbz_n + \bar{\mbx}$. The loss is:

\begin{align} \mathcal{L}(\mbW) &= \frac{1}{N}\sum_{n=1}^N \|\mbx_n - \hat{\mbx}_n\|^2 = \Tr\bigl[(\mbI - \mbW\mbW^\top)\mbS(\mbI - \mbW\mbW^\top)\bigr] = \Tr[\mbS] - \Tr[\mbW^\top \mbS \mbW]. \end{align}

Minimizing $\mathcal{L}(\mbW)$ is equivalent to maximizing $\Tr[\mbW^\top \mbS \mbW] = \sum_{m=1}^M \sum_{d=1}^D \lambda_d (\mbw_m^\top \mbu_d)^2$, which is again maximized by $\mbW = \mbU_M$, the matrix of the leading $M$ eigenvectors.
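As a sanity check, the following sketch (again on hypothetical data) verifies the identity numerically at $\mbW = \mbU_M$:

import torch

torch.manual_seed(1)
X = torch.randn(200, 6)                    # hypothetical N x D data
xbar = X.mean(0)
Xc = X - xbar
S = Xc.T @ Xc / len(X)                     # sample covariance
evals, evecs = torch.linalg.eigh(S)
W = evecs[:, -2:]                          # top M = 2 eigenvectors (orthonormal columns)
Z = Xc @ W                                 # encode
Xhat = Z @ W.T + xbar                      # decode
loss = ((X - Xhat) ** 2).sum(1).mean()     # mean squared reconstruction error
print(loss, torch.trace(S) - torch.trace(W.T @ S @ W))   # the two agree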

PCA and the Singular Value Decomposition

Let $\mbY = \frac{1}{\sqrt{N}} \mbX_c$, where $\mbX_c$ has rows $(\mbx_n - \bar{\mbx})^\top$. Then $\mbY^\top \mbY = \mbS$.

The SVD $\mbY = \mbV \mbLambda^{1/2} \mbU^\top$ reveals:

  • Right singular vectors of $\mbY$ (columns of $\mbU$) = eigenvectors of $\mbS$ = principal components.

  • Squared singular values = eigenvalues $\lambda_1, \ldots, \lambda_D$ of $\mbS$.

In practice, PCA is computed via the SVD of $\mbY$ rather than by eigendecomposing $\mbS$, since the SVD is numerically more stable.
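A quick numerical confirmation of this correspondence, on hypothetical data:

import torch

torch.manual_seed(2)
X = torch.randn(50, 4)                     # hypothetical N x D data
Xc = X - X.mean(0)
Y = Xc / len(X) ** 0.5                     # so that Y.T @ Y = S
_, s, Vt = torch.linalg.svd(Y, full_matrices=False)
evals = torch.linalg.eigvalsh(Y.T @ Y)     # eigenvalues of S, ascending
print(torch.allclose(s ** 2, torch.flip(evals, [0]), atol=1e-4))   # squared singular values = eigenvalues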

Explained Variance

The fraction of total variance captured by $M$ principal components is:

\begin{align} \text{variance explained} = \frac{\sum_{m=1}^M \lambda_m}{\sum_{d=1}^D \lambda_d}. \end{align}

A scree plot shows this quantity (per-component and cumulative) as a function of $M$. Below we generate synthetic data from a rank-3 linear Gaussian model and visualize the scree plot.

# Generate synthetic data from a rank-3 linear Gaussian model
torch.manual_seed(305)
N, D, M_true = 300, 20, 3   # datapoints, features, true latent dims
W_true = torch.randn(D, M_true)         # D x M weight matrix
Z_true = torch.randn(N, M_true)         # N x M latent variables
σ2     = torch.tensor(0.5)              # noise variance
X = Z_true @ W_true.T + torch.sqrt(σ2) * torch.randn(N, D)

# Centre the data and compute PCA via SVD
μ  = X.mean(0)                          # sample mean, shape (D,)
Xc = X - μ                              # centred data, shape (N, D)
Y  = Xc / N**0.5                        # Y.T @ Y = sample covariance S
_, s, Vt = torch.linalg.svd(Y, full_matrices=False)
UM  = Vt.T                              # principal components, columns of UM
λ   = s**2                              # eigenvalues of S (variances)
fig, axs = plt.subplots(1, 2, figsize=(9, 4))

pct = 100 * λ / λ.sum()
axs[0].bar(range(1, D+1), pct.numpy(), color='steelblue', alpha=0.8)
axs[0].set_xlabel('Principal component')
axs[0].set_ylabel('Variance explained (%)')
axs[0].set_title('Per-component variance')

axs[1].plot(range(1, D+1), torch.cumsum(pct, 0).numpy(), 'o-', color='steelblue')
axs[1].axhline(90, color='r', ls='--', lw=1, label='90%')
axs[1].set_xlabel('Number of components')
axs[1].set_ylabel('Cumulative variance explained (%)')
axs[1].set_title('Scree plot')
axs[1].legend()

plt.tight_layout()
Figure: per-component variance (left) and cumulative scree plot with a 90% reference line (right).

Probabilistic PCA

Model

The two classical formulations above cast PCA as an optimization problem. Probabilistic PCA (PPCA) instead defines a joint distribution over $\mbx_n$ and a latent variable $\mbz_n$:

\begin{align} \mbz_n &\iid{\sim} \cN(\mbzero, \mbI_M) \\ \mbx_n \mid \mbz_n &\sim \cN(\mbW \mbz_n + \mbmu,\; \sigma^2 \mbI_D), \end{align}

where $\mbW \in \reals^{D \times M}$ are the loadings (analogous to the principal components), $\mbmu \in \reals^D$ is the mean, and $\sigma^2$ is an isotropic noise variance.

Equivalently, $\mbx_n = \mbW \mbz_n + \mbmu + \mbepsilon_n$ with $\mbepsilon_n \sim \cN(\mbzero, \sigma^2 \mbI_D)$.

The marginal distribution of $\mbx_n$ is:

\begin{align} p(\mbx_n \mid \mbW, \mbmu, \sigma^2) = \cN(\mbx_n \mid \mbmu,\; \underbrace{\mbW\mbW^\top + \sigma^2 \mbI_D}_{\mbC}), \end{align}

a Gaussian with low-rank-plus-diagonal covariance, requiring only $O(MD)$ parameters instead of $O(D^2)$.
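As an aside, PyTorch's torch.distributions.LowRankMultivariateNormal represents exactly this covariance structure. A small sketch (with hypothetical loadings) comparing it to the dense parameterization:

import torch
from torch.distributions import LowRankMultivariateNormal, MultivariateNormal

torch.manual_seed(4)
D, M = 20, 3
W = torch.randn(D, M)                                  # hypothetical loadings
μ = torch.zeros(D)
σ2 = 0.5

# PPCA marginal, parameterized by its low-rank-plus-diagonal structure (O(MD) storage)
p_lr = LowRankMultivariateNormal(μ, cov_factor=W, cov_diag=σ2 * torch.ones(D))

# The same Gaussian with a dense D x D covariance, for comparison (O(D²) storage)
C = W @ W.T + σ2 * torch.eye(D)
p_full = MultivariateNormal(μ, covariance_matrix=C)

x = p_lr.sample()
print(p_lr.log_prob(x), p_full.log_prob(x))            # identical log-densities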

Maximum Likelihood Estimation

The marginal log-likelihood is:

\begin{align} \mathcal{L}(\mbW, \mbmu, \sigma^2) = -\frac{ND}{2}\log 2\pi - \frac{N}{2}\log|\mbC| - \frac{1}{2}\sum_{n=1}^N (\mbx_n - \mbmu)^\top \mbC^{-1} (\mbx_n - \mbmu). \end{align}

Tipping and Bishop (1999) showed the MLE has a closed form:

\begin{align} \mbmu_{\mathsf{ML}} &= \bar{\mbx}, \\ \mbW_{\mathsf{ML}} &= \mbU_M (\mbLambda_M - \sigma^2 \mbI)^{1/2} \mbR, \\ \sigma^2_{\mathsf{ML}} &= \frac{1}{D - M} \sum_{m=M+1}^D \lambda_m, \end{align}

where $\mbU_M$ contains the $M$ leading eigenvectors of the sample covariance $\mbS$, $\mbLambda_M = \diag(\lambda_1, \ldots, \lambda_M)$, and $\mbR \in \reals^{M \times M}$ is an arbitrary orthogonal matrix.

The weights are identifiable only up to orthogonal rotation: only the subspace spanned by $\mbU_M$ is uniquely determined. The MLE noise variance $\sigma^2_{\mathsf{ML}}$ is the average variance discarded in the trailing $D - M$ dimensions.

Posterior Distribution on Latent Variables

Given parameters $\mbW$, $\mbmu$, $\sigma^2$, the posterior on $\mbz_n$ combines the Gaussian prior with the Gaussian likelihood:

\begin{align} p(\mbz_n \mid \mbx_n, \mbW, \mbmu, \sigma^2) = \cN(\mbz_n \mid \mbJ^{-1}\mbh_n,\; \mbJ^{-1}), \end{align}

where

\begin{align} \mbJ &= \mbI_M + \frac{1}{\sigma^2} \mbW^\top \mbW \in \reals^{M \times M}, \qquad \mbh_n = \frac{1}{\sigma^2} \mbW^\top (\mbx_n - \mbmu) \in \reals^M. \end{align}

Note that the posterior precision matrix $\mbJ$ is the same for all $n$; only the information vector $\mbh_n$ varies. The $M \times M$ system is much cheaper to invert than the $D \times D$ marginal covariance.

Zero-noise limit: as $\sigma^2 \to 0$, the posterior mean converges to $(\mbW^\top \mbW)^{-1} \mbW^\top (\mbx_n - \mbmu)$, the ordinary least-squares projection, identical to classical PCA (up to scaling).

# MLE for PPCA with M components
M_fit  = 3
UM_fit = Vt[:M_fit].T              # D x M leading principal components
ΛM     = torch.diag(λ[:M_fit])    # M x M eigenvalue matrix
σ2_ML  = λ[M_fit:].mean()         # MLE noise variance

# MLE weights (orthogonal rotation R = I)
W_ML = UM_fit @ (ΛM - σ2_ML * torch.eye(M_fit)).sqrt()

# Posterior on latent variables
J     = torch.eye(M_fit) + (1/σ2_ML) * W_ML.T @ W_ML   # M x M precision
Σz    = torch.linalg.inv(J)                              # M x M posterior cov
H     = Xc @ W_ML / σ2_ML                               # N x M info vectors
Z_hat = H @ Σz.T                                         # N x M posterior means

print(f'MLE noise variance σ²  = {σ2_ML:.3f}  (true = {σ2.item():.3f})')
print(f'Posterior covariance Σz:\n{Σz.numpy().round(3)}')
MLE noise variance σ²  = 0.483  (true = 0.500)
Posterior covariance Σz:
[[ 0.024  0.    -0.   ]
 [ 0.     0.038  0.   ]
 [-0.     0.     0.082]]
# Visualize the 2D posterior mean embedding
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(Z_hat[:, 0].numpy(), Z_hat[:, 1].numpy(),
           c=Z_true[:, 0].numpy(), cmap='coolwarm', s=20, alpha=0.8)
ax.set_xlabel(r'$\hat{z}_1$ (posterior mean)')
ax.set_ylabel(r'$\hat{z}_2$ (posterior mean)')
ax.set_title('PPCA posterior mean embeddings')
plt.tight_layout()
Figure: PPCA posterior mean embeddings, colored by the first true latent coordinate.

Gibbs Sampling for Probabilistic PCA

If we want a fully Bayesian treatment, we place a prior on the parameters. For simplicity, assume $\mbmu = \mbzero$ (center the data first) and put:

\begin{align} p(\mbW, \sigma^2) = \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) \prod_{d=1}^D \cN(\mbw_d \mid \mbzero,\, \tfrac{\sigma^2}{\kappa_0} \mbI_M), \end{align}

where $\mbw_d \in \reals^M$ is the $d$-th row of $\mbW$.

Complete conditional for each row $\mbw_d$: combining the row prior with the $N$ likelihoods for the $d$-th output gives a Gaussian:

\begin{align} p(\mbw_d \mid \{\mbz_n, \mbx_n\}, \sigma^2) = \cN(\mbw_d \mid \mbJ_d^{-1} \mbh_d,\; \mbJ_d^{-1}), \end{align}

where

\begin{align} \mbJ_d &= \frac{\kappa_0}{\sigma^2} \mbI_M + \frac{1}{\sigma^2} \sum_{n=1}^N \mbz_n \mbz_n^\top, \qquad \mbh_d = \frac{1}{\sigma^2} \sum_{n=1}^N x_{n,d}\, \mbz_n. \end{align}

Complete conditional for $\sigma^2$: conjugate with the scaled inverse chi-squared family:

\begin{align} p(\sigma^2 \mid \{\mbz_n, \mbx_n\}, \mbW) &= \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2), \\ \nu_N &= \nu_0 + DM + DN, \\ \sigma_N^2 &= \frac{1}{\nu_N}\!\left[\nu_0 \sigma_0^2 + \kappa_0 \|\mbW\|_F^2 + \sum_{n=1}^N \|\mbx_n - \mbW \mbz_n\|^2\right]. \end{align}

Complete conditional for $\mbz_n$: same as the PPCA posterior derived above.
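Cycling through these three complete conditionals gives the Gibbs sampler. Here is a minimal sketch, reusing the centred data Xc from above; the hyperparameters ν0 = κ0 = σ0² = 1 and the iteration count are arbitrary choices, and σ² is drawn through the equivalent inverse-gamma parameterization of the scaled inverse chi-squared:

import torch

def gibbs_ppca(Xc, M, n_iters=200, ν0=1.0, κ0=1.0, σ02=1.0):
    """Gibbs sampler for PPCA with centred data Xc (N x D). A sketch, not a tuned implementation."""
    N, D = Xc.shape
    W  = torch.randn(D, M)                    # initialize loadings
    σ2 = torch.tensor(1.0)                    # initialize noise variance
    σ2_chain = []
    for _ in range(n_iters):
        # z_n | W, σ²: shared precision J, per-point information vectors h_n
        J  = torch.eye(M) + W.T @ W / σ2
        Σz = torch.linalg.inv(J)
        Lz = torch.linalg.cholesky(Σz)
        H  = Xc @ W / σ2                      # N x M, rows h_n
        Z  = H @ Σz.T + torch.randn(N, M) @ Lz.T
        # w_d | {z_n}, σ²: one Bayesian linear regression per output dimension
        Jw = (κ0 * torch.eye(M) + Z.T @ Z) / σ2   # same precision for every row d
        Σw = torch.linalg.inv(Jw)
        Lw = torch.linalg.cholesky(Σw)
        Hw = Z.T @ Xc / σ2                    # M x D, column d is h_d
        W  = (Σw @ Hw).T + torch.randn(D, M) @ Lw.T
        # σ² | {z_n}, W: scaled inverse chi-squared ⇔ inverse gamma
        νN = ν0 + D * M + D * N
        sN = ν0 * σ02 + κ0 * (W ** 2).sum() + ((Xc - Z @ W.T) ** 2).sum()
        σ2 = 1.0 / torch.distributions.Gamma(νN / 2, sN / 2).sample()
        σ2_chain.append(σ2.item())
    return W, Z, torch.tensor(σ2_chain)

W_s, Z_s, σ2_chain = gibbs_ppca(Xc, M=3)
print(f'posterior mean of σ² ≈ {σ2_chain[100:].mean():.3f}')   # near the MLE above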

Factor Analysis

Factor analysis (FA) relaxes the isotropic noise assumption of PPCA by allowing each output dimension to have its own noise variance:

\begin{align} \mbz_n &\iid{\sim} \cN(\mbzero, \mbI_M) \\ \mbx_n \mid \mbz_n &\sim \cN(\mbW \mbz_n + \mbmu,\; \diag(\mbsigma^2)), \end{align}

where $\mbsigma^2 = [\sigma_1^2, \ldots, \sigma_D^2]^\top$. The marginal covariance is $\mbC = \mbW\mbW^\top + \diag(\mbsigma^2)$: still low-rank plus diagonal, but with $D$ independent noise scales instead of one.

With the per-dimension prior:

\begin{align} p(\mbW, \mbsigma^2) = \prod_{d=1}^D \left[ \chi^{-2}(\sigma_d^2 \mid \nu_0, \sigma_0^2)\, \cN(\mbw_d \mid \mbzero,\, \tfrac{\sigma_d^2}{\kappa_0} \mbI_M) \right], \end{align}

the Gibbs updates for each $(\mbw_d, \sigma_d^2)$ pair are identical to those for PPCA but with $\sigma_d^2$ in place of the shared $\sigma^2$. All rows are independent given $\{\mbz_n\}$, so they can be updated in parallel, as in the sketch below.
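A vectorized sketch of one such sweep (hypothetical hyperparameters; Z is assumed to come from the same latent update as in PPCA). Note that $\sigma_d^2$ cancels in the posterior mean of $\mbw_d$, since $\mbJ_d = (\kappa_0 \mbI + \mbZ^\top \mbZ)/\sigma_d^2$:

import torch

def fa_gibbs_step(Xc, Z, σ2d, ν0=1.0, κ0=1.0, σ02=1.0):
    """One Gibbs sweep over (w_d, σ_d²) given latents Z. Xc is N x D, σ2d is (D,)."""
    N, D = Xc.shape
    M = Z.shape[1]
    # w_d | z, σ_d²: J_d⁻¹ = σ_d² Σ, so the mean Σ Z'x_{:,d} is free of σ_d²
    A  = κ0 * torch.eye(M) + Z.T @ Z          # shared M x M matrix
    Σ  = torch.linalg.inv(A)
    L  = torch.linalg.cholesky(Σ)
    Wm = (Σ @ Z.T @ Xc).T                     # D x M posterior means, all rows at once
    W  = Wm + σ2d.sqrt().unsqueeze(1) * (torch.randn(D, M) @ L.T)
    # σ_d² | z, w_d: scaled inverse chi-squared, independently per dimension d
    νN = ν0 + M + N
    sN = ν0 * σ02 + κ0 * (W ** 2).sum(1) + ((Xc - Z @ W.T) ** 2).sum(0)
    σ2d = 1.0 / torch.distributions.Gamma(νN / 2, sN / 2).sample()
    return W, σ2d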

Other Linear Latent Variable Models

Independent Components Analysis

ICA replaces the Gaussian latent prior with a product of non-Gaussian marginals:

\begin{align} \mbz_n &\iid{\sim} p(\mbz) = \prod_{m=1}^M p(z_m), \\ \mbx_n \mid \mbz_n &\sim \cN(\mbW \mbz_n + \mbmu,\; \diag(\mbsigma^2)). \end{align}

The non-Gaussian prior is essential: if $p(\mbz)$ were Gaussian, any correlation in $p(\mbz)$ could be absorbed into $\mbW$, reducing the model to factor analysis. A common choice is a Laplace (heavy-tailed) prior, $p(z_m) \propto e^{-|z_m|}$, which encourages the latent factors to be sparse.
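To illustrate, here is a hypothetical sketch of the (unnormalized) log joint under a Laplace prior; the $|z_m|$ penalty makes MAP inference of $\mbz_n$ a lasso-style problem that pulls small latent coordinates toward zero:

import torch

def ica_log_joint(x, z, W, μ, σ2d):
    """Unnormalized log p(x, z) for ICA with a Laplace latent prior."""
    log_prior = -z.abs().sum()                     # Laplace: heavy-tailed, sparsity-inducing
    resid = x - W @ z - μ
    log_lik = -0.5 * (resid ** 2 / σ2d).sum()      # Gaussian likelihood, up to constants
    return log_prior + log_lik

# MAP estimate of one latent vector by (sub)gradient ascent on the log joint
torch.manual_seed(3)
D, M = 10, 3
W, μ, σ2d = torch.randn(D, M), torch.zeros(D), torch.full((D,), 0.5)
x = torch.randn(D)
z = torch.zeros(M, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    (-ica_log_joint(x, z, W, μ, σ2d)).backward()
    opt.step()
print(z.detach())                                  # small coordinates are driven toward zero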

ICA is widely used in signal processing (e.g., separating audio sources) and neuroscience.

Probabilistic Canonical Correlation Analysis

Suppose we observe two data modalities, $\mbx_n \in \reals^{D_x}$ and $\mby_n \in \reals^{D_y}$. Probabilistic CCA posits:

\begin{align} \mbz_n^{(s)} &\iid{\sim} \cN(\mbzero, \mbI_{M_s}), & \mbz_n^{(x)} &\iid{\sim} \cN(\mbzero, \mbI_{M_x}), & \mbz_n^{(y)} &\iid{\sim} \cN(\mbzero, \mbI_{M_y}), \\ \mbx_n &\sim \cN(\mbW_{xx}\mbz_n^{(x)} + \mbW_{xs}\mbz_n^{(s)} + \mbmu_x,\; \sigma^2 \mbI_{D_x}), \\ \mby_n &\sim \cN(\mbW_{yy}\mbz_n^{(y)} + \mbW_{ys}\mbz_n^{(s)} + \mbmu_y,\; \sigma^2 \mbI_{D_y}). \end{align}

The shared latent variables $\mbz_n^{(s)}$ capture common structure across modalities; private variables $\mbz_n^{(x)}$ and $\mbz_n^{(y)}$ capture modality-specific variation.

Bach & Jordan (2005) showed that the MLE recovers the classical CCA solution: the shared weights $\mbW_{xs}$ and $\mbW_{ys}$ correspond to the left and right singular vectors of the whitened cross-covariance matrix $\mbS_{xx}^{-1/2} \mbS_{xy} \mbS_{yy}^{-1/2}$.
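Under this interpretation, a sketch of the classical computation (a hypothetical helper; the eps jitter is an implementation choice for numerical stability):

import torch

def cca_directions(X, Y, M=2, eps=1e-6):
    """Classical CCA via SVD of the whitened cross-covariance (a sketch)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    Sxx = Xc.T @ Xc / N + eps * torch.eye(X.shape[1])   # jitter for invertibility
    Syy = Yc.T @ Yc / N + eps * torch.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / N
    def inv_sqrt(S):                                     # S^{-1/2} via eigendecomposition
        e, V = torch.linalg.eigh(S)
        return V @ torch.diag(e.rsqrt()) @ V.T
    U, s, Vt = torch.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
    # singular vectors in the whitened spaces, plus canonical correlations
    return U[:, :M], Vt[:M].T, s[:M]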

Conclusion

This chapter developed a family of linear Gaussian latent variable models with progressively richer structure:

| Model | Noise covariance | Latent prior |
| --- | --- | --- |
| PPCA | $\sigma^2 \mbI$ (isotropic) | $\cN(\mbzero, \mbI)$ |
| Factor analysis | $\diag(\mbsigma^2)$ (diagonal) | $\cN(\mbzero, \mbI)$ |
| ICA | $\diag(\mbsigma^2)$ | $\prod_m p(z_m)$ (non-Gaussian) |
| PCCA | $\sigma^2 \mbI$ (per modality) | shared + private Gaussians |

Key takeaways:

  • PCA is the MLE for PPCA: the principal components are the leading eigenvectors of the sample covariance, computed efficiently via SVD.

  • The posterior on latent variables requires only inverting an $M \times M$ system (not $D \times D$), making inference tractable.

  • Gibbs sampling for PPCA / FA reduces to iterated Bayesian linear regressions — one per output dimension.

  • The non-Gaussian prior in ICA breaks the rotational indeterminacy of FA and enables recovery of independent source signals.

References
  1. Bach, F. R., & Jordan, M. I. (2005). A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley.
  2. Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  3. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611–622.