Linear Gaussian Models#

Mixture models and HMMs were discrete latent variable models (LVMs). Data points were attributed to one of \(K\) clusters or latent states. These models naturally extend to continuous latent variables as well, and these extensions are especially nice when we assume linear Gaussian structure. In fact, linear Gaussian LVMs are so common that you have probably heard them under different names, like (probabilistic) PCA, factor analysis, and Kalman filters. Today, we will show how to derive these classic models and algorithms.

Motivating Example#

Take HW3 as an example: for each month, you have temperature measurements at 9504 locations across the globe. Are the measurements really 9504 dimensional?

Global Temperatures

With high-dimensional measurements like these, we might have a few objectives in mind:

  • Dimensionality reduction: are there a few dimensions along which the temperatures primarily vary? Maybe northern and southern hemisphere, or land and sea?

  • Visualization: Sometimes, we want to embed high-dimensional points in 2 or 3 dimensions for visualization. (Not applicable here since the data is already spatial, but this is a general concern with high-dimensional data.)

  • Compression: How can I best summarize the data if I am willing to sacrifice some reconstruction accuracy?

Principal Components of Global Temperature#

One way you could look for lower dimensional structure is using principal components analysis (PCA). The first three principal components (PCs) of the temperature data look like this:

PC1

PC2

PC3

These PCs capture dimensions of maximal variance in the data. For example, the first PC captures whether a point is in the northern or southern hemisphere, which unsurprisingly accounts for a lot of the variance in temperature fluctuations over time. The second PC captures whether or not a point is over Antarctica, which must be another big determinant of temperature fluctuations.

Principal Components Analysis (PCA)#

There are two classical definitions of PCA (Quoted from Bishop, Ch 12):

  1. An orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized (Hotelling, 1933).

  2. The linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections (Pearson, 1901).

We will focus on the first definition today and consider the second next time.

PCA: Maximum Variance Formulation#

Goal: Project data \(\{\mbx_n\}_{n=1}^N\) onto a lower dimensional space of dimension \(M < D\) while maximizing the variance of the projected data.

To start, assume \(M = 1\). The principal subspace is defined by a unit vector \(\mbu_1 \in \reals^D\). This is called the first principal component.

Projecting a data point \(\mbx_n\) onto this subspace amounts to taking an inner product, \(\mbu_1^\top \mbx_n\). These projections are variously called scores, embeddings, or signals.

The mean of the projected data is,

\[\begin{align*} \frac{1}{N} \sum_{n=1}^N \mbu_1^\top \mbx_n = \mbu_1^\top \left( \frac{1}{N} \sum_{n=1}^N \mbx_n \right) = \mbu_1^\top \bar{\mbx}, \end{align*}\]

where \(\bar{\mbx}\) is the sample mean.

The variance is

\[\begin{align*} \frac{1}{N} \sum_{n=1}^N \left[ \mbu_1^\top \mbx_n - \mbu_1^\top \bar{\mbx} \right]^2 &= \frac{1}{N} \sum_{n=1}^N \left[ \mbu_1^\top (\mbx_n - \bar{\mbx}) \right]^2 \\ &=\frac{1}{N} \sum_{n=1}^N \mbu_1^\top (\mbx_n - \bar{\mbx}) (\mbx_n - \bar{\mbx})^\top \mbu_1 \\ &=\mbu_1^\top \left(\frac{1}{N} \sum_{n=1}^N (\mbx_n - \bar{\mbx}) (\mbx_n - \bar{\mbx})^\top \right) \mbu_1 \\ &=\mbu_1^\top \mbS \mbu_1 \end{align*}\]

where \(\mbS = \frac{1}{N} \sum_{n=1}^N (\mbx_n - \bar{\mbx}) (\mbx_n - \bar{\mbx})^\top \in \reals^{D \times D}\) is the sample covariance matrix.

Now maximize the projected variance wrt \(\mbu_1\),

\[\begin{align*} \mbu_1 &= \arg \max_{\mbu \in \bbS_{D}} \mbu^\top \mbS \mbu. \end{align*}\]

This is the variational definition of the eigenvector with maximal eigenvalue!

I.e., \(\mbu_1\) is the eigenvector of \(\mbS\) with the largest eigenvalue, \(\lambda_1\).

More generally, to find an \(M\) dimensional principal subspace, take the \(M\) eigenvectors \(\mbu_1, \ldots, \mbu_M\) with the largest eigenvalues \(\lambda_1, \ldots, \lambda_M\).

Since \(\mbS\) is real and symmetric, its eigenvectors can be chosen to be orthogonal (and since it is positive semi-definite, its eigenvalues are non-negative).

PCA and the Singular Value Decomposition#

The first \(M\) principal components are the leading \(M\) eigenvectors of the covariance matrix. Equivalently, they are the first \(M\) right singular vectors of the data matrix.

Let

\[\begin{align*} \mbY = \frac{1}{\sqrt{N}} \mbX = \frac{1}{\sqrt{N}} \begin{bmatrix} - & (\mbx_1 - \bar{\mbx})^\top & - \\ & \vdots & \\ - & (\mbx_N - \bar{\mbx})^\top & - \end{bmatrix} \end{align*}\]

be the centered and scaled data matrix. Then \(\mbY^\top \mbY = \frac{1}{N} \mbX^\top \mbX = \mbS\) is the covariance matrix.

The singular value decomposition (SVD) of \(\mbY\) is,

\[\begin{align*} \mbY = \mbV \mbLambda^{\frac{1}{2}} \mbU^\top \Rightarrow \mbY^\top \mbY = \mbU \mbLambda^{\frac{1}{2}} \mbV^\top \mbV \mbLambda^{\frac{1}{2}} \mbU^\top = \mbU \mbLambda \mbU^\top \end{align*}\]

I.e., the right singular vectors of \(\mbY\) are the same (up to sign flips) as the eigenvectors of \(\mbS\), and the singular values of \(\mbY\) are the square roots of the eigenvalues of \(\mbS\).
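
For concreteness, here is a minimal numpy sketch (toy data; variable names are ours) confirming that the two routes agree up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 10, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated toy data

# Route 1: eigendecomposition of the sample covariance.
Xc = X - X.mean(axis=0)                  # center the data
S = (Xc.T @ Xc) / N                      # sample covariance, D x D
evals, evecs = np.linalg.eigh(S)         # ascending eigenvalues
order = np.argsort(evals)[::-1]
lambdas, U = evals[order], evecs[:, order]

# Route 2: SVD of the centered, scaled data matrix Y = Xc / sqrt(N).
Y = Xc / np.sqrt(N)
_, svals, Ut = np.linalg.svd(Y, full_matrices=False)

# Right singular vectors match the eigenvectors up to sign, and the
# singular values are the square roots of the eigenvalues.
print(np.allclose(np.abs(Ut[:M]), np.abs(U[:, :M].T)))   # True
print(np.allclose(svals[:M] ** 2, lambdas[:M]))          # True
```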

PCA Explained Variance#

How well do the \(M\) principal components explain the data?

Let \(\mbz_n = \mbU_M^\top (\mbx_n - \bar{\mbx}) \in \reals^M\). Its covariance is,

\[\begin{align*} \mathrm{Cov}[\mbz] = \mathrm{Cov}[\mbU_M^\top (\mbx - \bar{\mbx})] = \mbU_M^\top \mathrm{Cov}[\mbx] \mbU_M = \diag([\lambda_1, \ldots, \lambda_M]). \end{align*}\]

Of course, if we let \(M = D\), then we have \(\mathrm{Cov}(\mbz) = \diag([\lambda_1, \ldots, \lambda_D])\).

One way of assessing how well \(M\) components fit the data is via the fraction of variance explained,

\[\begin{align*} \text{variance explained} = \frac{\Tr(\mathrm{Cov}[z; M \text{ components}])}{\Tr(\mathrm{Cov}[z; D \text{ components}])} = \frac{\sum_{m=1}^M \lambda_m}{\sum_{m=1}^D \lambda_m} \in [0, 1]. \end{align*}\]
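
In code, this is just a ratio of partial sums of eigenvalues. A minimal sketch (function name is ours):

```python
import numpy as np

def variance_explained(lambdas: np.ndarray, M: int) -> float:
    """Fraction of total variance captured by the leading M components."""
    lambdas = np.sort(lambdas)[::-1]     # ensure descending order
    return lambdas[:M].sum() / lambdas.sum()

# Cumulative curve for a scree plot:
# cum_frac = np.cumsum(np.sort(lambdas)[::-1]) / lambdas.sum()
```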

Scree Plots#

Scree Plots

A scree plot showing percent variance explained per component and cumulatively.

Probabilistic PCA: A Continuous Latent Variable Model#

We cast the principal components as the solutions to an optimization problem: maximize the projected variance.

A more modern view of PCA is as the maximum likelihood estimate of a latent variable model.

Advantages#

Probabilistic PCA (PPCA) has many advantages:

  1. It’s a multivariate normal model with low-rank plus diagonal covariance, which takes only \(O(MD)\) parameters.

  2. We can fit the model using a host of inference algorithms, including EM.

  3. It can handle missing data.

  4. We can obtain posterior distributions of the principal components and scores.

  5. It can be embedded in larger probabilistic models.

Model#

The PPCA model is quite simple,

\[\begin{align*} \mbz_n &\iid{\sim} \cN(\mbzero, \mbI) \\ \mbx_n \mid \mbz_n &\sim \cN(\mbW \mbz_n + \mbmu, \sigma^2 \mbI), \end{align*}\]

where \(\mbz_n \in \reals^M\) is a latent variable, \(\mbW \in \reals^{D \times M}\) are the weights, \(\mbmu \in \reals^D\) is the bias parameter, and \(\sigma^2 \in \reals_+\) is a variance.

Equivalently, we can think of \(\mbx_n\) as a linear function of \(\mbz_n\) with additive noise,

\[\begin{align*} \mbx_n &= \mbW \mbz_n + \mbmu + \mbepsilon_n, \end{align*}\]

where \(\mbepsilon_n \sim \cN(\mbzero, \sigma^2 \mbI) \in \reals^D\).
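
To make the generative story concrete, here is a small simulation sketch with arbitrary illustrative parameters. As the exercise below shows, the marginal covariance is \(\mbW \mbW^\top + \sigma^2 \mbI\), so the sample covariance of the simulated data should be close to it:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 100_000, 5, 2

# Arbitrary illustrative parameters (not fit to any data).
W = rng.normal(size=(D, M))
mu = rng.normal(size=D)
sigma2 = 0.1

# Generative process: z_n ~ N(0, I), x_n = W z_n + mu + eps_n.
Z = rng.normal(size=(N, M))
X = Z @ W.T + mu + np.sqrt(sigma2) * rng.normal(size=(N, D))

# Marginally, x_n ~ N(mu, W W^T + sigma^2 I); the sample covariance
# should be close to that marginal covariance.
C = W @ W.T + sigma2 * np.eye(D)
print(np.abs(np.cov(X.T, bias=True) - C).max())  # small
```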

Maximum likelihood estimation of the parameters#

Suppose we only need a point estimate of the parameters \(\mbW\), \(\mbmu\), and \(\sigma^2\).

A natural approach is the maximum likelihood estimate (MLE),

\[\begin{align*} \mbW_{\mathsf{ML}}, \mbmu_{\mathsf{ML}}, \sigma^2_{\mathsf{ML}} = \arg \max \cL(\mbW, \mbmu, \sigma^2), \end{align*}\]

where \(\cL\) is the marginal likelihood,

\[\begin{align*} \cL(\mbW, \mbmu, \sigma^2) &= \log p(\mbX \mid \mbW, \mbmu, \sigma^2) \\ &= \log \int \prod_{n=1}^N p(\mbx_n \mid \mbz_n, \mbW, \mbmu, \sigma^2) \, p(\mbz_n) \dif \mbz_n \\ &= \log \int \prod_{n=1}^N \cN(\mbx_n \mid \mbW \mbz_n + \mbmu, \sigma^2 \mbI) \, \cN(\mbz_n \mid \mbzero, \mbI) \dif \mbz_n \end{align*}\]

Exercise

Simplify this expression to show,

\[\begin{align*} \cL(\mbW, \mbmu, \sigma^2) &= \sum_{n=1}^N \log \cN(\mbx_n \mid \mbmu, \mbW \mbW^\top + \sigma^2 \mbI) \end{align*}\]

Then the log likelihood simplifies to,

\[\begin{align*} \cL(\mbW, \mbmu, \sigma^2) &= -\frac{ND}{2} \log 2 \pi - \frac{N}{2} \log |\mbC| -\frac{1}{2} \sum_{n=1}^N (\mbx_n - \mbmu)^\top \mbC^{-1} (\mbx_n - \mbmu) \end{align*}\]

where \(\mbC = \mbW \mbW^\top + \sigma^2 \mbI\).

Setting the derivative wrt \(\mbmu\) to zero and solving yields \(\mbmu_{\mathsf{ML}} = \bar{\mbx}\), the sample mean.

Maximizing wrt \(\mbW\) and \(\sigma^2\) is more complex but still has a closed form solution,

\[\begin{align*} \mbW_{\mathsf{ML}} = \mbU_M (\mbLambda_M - \sigma^2 \mbI)^{\frac{1}{2}} \mbR, \end{align*}\]

where \(\mbU_M \in \reals^{D \times M}\) has columns given by the leading eigenvectors of the sample covariance matrix \(\mbS\), where \(\mbLambda_M = \diag([\lambda_1, \ldots, \lambda_M])\), and where \(\mbR \in \reals^{M \times M}\) is an arbitrary orthogonal matrix.

Put differently, the MLE weights are only identifiable up to orthogonal transformation. Or, only the subspace spanned by \(\mbU_M\) is identifiable.

Finally, the MLE of the variance is,

\[\begin{align*} \sigma^2_{\mathsf{ML}} &= \frac{1}{D - M} \sum_{m=M+1}^D \lambda_m, \end{align*}\]

the average variance in the remaining dimensions.
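
Assuming these closed-form expressions, a sketch of the PPCA MLE in numpy (taking \(\mbR = \mbI\); function name is ours):

```python
import numpy as np

def ppca_mle(X: np.ndarray, M: int):
    """Closed-form PPCA maximum likelihood estimates, taking R = I."""
    N, D = X.shape
    mu = X.mean(axis=0)                      # mu_ML is the sample mean
    S = np.cov(X.T, bias=True)               # sample covariance matrix
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]          # sort eigenvalues descending
    lambdas, U = evals[order], evecs[:, order]

    sigma2 = lambdas[M:].mean()              # average variance of discarded dims
    W = U[:, :M] @ np.diag(np.sqrt(lambdas[:M] - sigma2))
    return W, mu, sigma2
```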

Questions

  1. What is the marginal covariance \(\mbC\) using the MLE \(\mbW_{\mathsf{ML}}\) and \(\sigma^2_{\mathsf{ML}}\)?

  2. Intuitively, why is the marginal covariance invariant to rotations of the weights?

The Posterior Distribution on the Latent Variables#

Now fix \(\mbW\), \(\mbmu\), and \(\sigma^2\) (e.g. to their maximum likelihood values). What is the posterior of \(\mbz_n\)?

\[\begin{align*} p(\mbz_n \mid \mbx_n, \mbW, \mbmu, \sigma^2) &\propto \cN(\mbz_n \mid \mbzero, \mbI) \, \cN(\mbx_n \mid \mbW \mbz_n + \mbmu, \sigma^2 \mbI) \\ &\propto \exp \left\{ -\frac{1}{2} \mbz_n^\top \mbz_n -\frac{1}{2} (\mbx_n - \mbW \mbz_n - \mbmu)^\top (\sigma^2 \mbI)^{-1} (\mbx_n - \mbW \mbz_n - \mbmu) \right\} \\ &\propto \exp \left\{ -\frac{1}{2} \mbz_n^\top \mbJ_n \mbz_n + \mbh_n^\top\mbz_n \right\} \\ \end{align*}\]

where \(\mbJ_n = \mbI + \frac{1}{\sigma^2} \mbW^\top \mbW\) and \( \mbh_n = \frac{1}{\sigma^2} \mbW^\top (\mbx_n - \mbmu)\)

Completing the square,

\[\begin{align*} p(\mbz_n \mid \mbx_n, \mbW, \mbmu, \sigma^2) &= \cN(\mbz_n \mid \mbJ_n^{-1} \mbh_n, \mbJ_n^{-1}) \\ &= \cN\left(\mbz_n \,\middle|\, \left(\mbI + \frac{1}{\sigma^2} \mbW^\top \mbW\right)^{-1} \left(\frac{1}{\sigma^2} \mbW^\top (\mbx_n - \mbmu)\right), \, \left(\mbI + \frac{1}{\sigma^2} \mbW^\top \mbW\right)^{-1}\right) \\ &= \cN\left(\mbz_n \,\middle|\, \left(\sigma^2 \mbI + \mbW^\top \mbW\right)^{-1} \mbW^\top (\mbx_n - \mbmu), \, \left(\mbI + \frac{1}{\sigma^2} \mbW^\top \mbW\right)^{-1}\right) \end{align*}\]
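
In code, the posterior is a few lines of linear algebra. A sketch (function name is ours) that works in the information form \((\mbJ_n, \mbh_n)\) and then converts to a mean and covariance:

```python
import numpy as np

def ppca_posterior(x, W, mu, sigma2):
    """Posterior p(z | x) = N(J^{-1} h, J^{-1}) for a single data point x."""
    M = W.shape[1]
    J = np.eye(M) + (W.T @ W) / sigma2       # posterior precision J_n
    h = W.T @ (x - mu) / sigma2              # linear (information) term h_n
    Sigma = np.linalg.inv(J)                 # posterior covariance
    m = Sigma @ h                            # posterior mean
    return m, Sigma
```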

The Posterior Distribution in the Zero Noise Limit#

In the limit where \(\sigma^2 \to 0\), the posterior mean of \(\mbz_n\) is,

\[\begin{align*} \lim_{\sigma^2 \to 0} \E[\mbz_n \mid \mbx_n, \mbW, \mbmu, \sigma^2] &= \lim_{\sigma^2 \to 0} (\mbI + \frac{1}{\sigma^2} \mbW^\top \mbW)^{-1} [\frac{1}{\sigma^2} \mbW^\top (\mbx_n - \mbmu)] \\ &= \lim_{\sigma^2 \to 0} (\sigma^2 \mbI + \mbW^\top \mbW)^{-1} \mbW^\top (\mbx_n - \mbmu) \\ &= (\mbW^\top \mbW)^{-1} \mbW^\top (\mbx_n - \mbmu) \end{align*}\]

Now suppose \(\mbW = \mbW_{\mathsf{ML}} = \mbU_M (\mbLambda_M - \sigma^2 \mbI)^{\frac{1}{2}} \mbR\) and set \(\mbR = \mbI\). As \(\sigma^2 \to 0\), this tends to \(\mbW = \mbU_M \mbLambda_M^{\frac{1}{2}}\). Then,

\[\begin{align*} \lim_{\sigma^2 \to 0} \E[\mbz_n \mid \mbx_n, \mbW, \mbmu, \sigma^2] &= (\mbW^\top \mbW)^{-1} \mbW^\top (\mbx_n - \mbmu) \\ &= \mbLambda_M^{-\frac{1}{2}} \mbU_M^\top (\mbx_n - \mbmu) \end{align*}\]

How does this relate to the scores in PCA? In PCA, the scores are simply \(\mbU_M^\top (\mbx_n - \mbmu)\). The difference is that here we have an extra scaling by \(\mbLambda_M^{-\frac{1}{2}}\). The difference is because in PCA, the weights are assumed to be orthonormal, and the scale is accounted for in the scores. In probabilistic PCA, the latent variables are assumed to be standard normal under the prior, and the scaling is accounted for in the weights.
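
As a quick numerical sanity check of this limit (all quantities below are synthetic), the posterior mean with \(\mbW = \mbU_M \mbLambda_M^{1/2}\) and a tiny \(\sigma^2\) matches \(\mbLambda_M^{-1/2} \mbU_M^\top (\mbx_n - \mbmu)\):

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 5, 2

# Synthetic eigenvectors, eigenvalues, and a data point, for illustration only.
U, _ = np.linalg.qr(rng.normal(size=(D, M)))        # orthonormal columns
lambdas = np.array([3.0, 1.5])
x, mu = rng.normal(size=D), np.zeros(D)

W = U @ np.diag(np.sqrt(lambdas))                   # W_ML with R = I as sigma^2 -> 0
sigma2 = 1e-10                                      # (nearly) zero noise

m = np.linalg.solve(np.eye(M) + W.T @ W / sigma2, W.T @ (x - mu) / sigma2)
print(np.allclose(m, np.diag(1 / np.sqrt(lambdas)) @ U.T @ (x - mu)))  # True
```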

EM for Probabilistic PCA#

The E-step is to compute the posterior \(q(\mbz_n) = p(\mbz_n \mid \mbx_n; \mbtheta)\), where \(\mbtheta = (\mbW, \mbmu, \sigma^2)\) are the current parameters. For simplicity, assume the data is centered so that \(\mbmu^\star = 0\).

The M-step is to maximize the expected complete data log likelihood (dropping the \(\log p(\mbz_n)\) term, which does not depend on the parameters),

\[\begin{align*} \cL(\mbtheta) &= \sum_{n=1}^N \E_{q(\mbz_n)} \left[ \log p(\mbx_n \mid \mbz_n; \mbtheta)\right] \\ &= \sum_{n=1}^N \E_{q(\mbz_n)} \left[ \log \cN(\mbx_n \mid \mbW \mbz_n, \sigma^2 \mbI) \right] \\ &= \sum_{n=1}^N \E_{q(\mbz_n)} \left[ -\frac{D}{2} \log \sigma^2 -\frac{1}{2\sigma^2} (\mbx_n - \mbW \mbz_n)^\top (\mbx_n - \mbW \mbz_n)\right]. \end{align*}\]

As a function of \(\mbW\),

\[\begin{align*} \cL(\mbW) &= \sum_{n=1}^N \E_{q(\mbz_n)} \left[ -\frac{1}{2\sigma^2} \left \langle \mbW^\top \mbW, \mbz_n \mbz_n^\top \right \rangle + \frac{1}{\sigma^2} \left \langle \mbW, \mbx_n \mbz_n^\top \right \rangle \right] \\ &= -\frac{1}{2\sigma^2} \left \langle \mbW^\top \mbW, \sum_{n=1}^N \E_{q(\mbz_n)} \left[ \mbz_n \mbz_n^\top \right] \right \rangle + \frac{1}{\sigma^2} \left \langle \mbW, \sum_{n=1}^N \E_{q(\mbz_n)} \left[ \mbx_n \mbz_n^\top \right] \right \rangle. \end{align*}\]

where \(\langle \mbA, \mbB \rangle = \Tr(\mbA^\top \mbB)\) is the Frobenius inner product for matrices \(\mbA\) and \(\mbB\).

Taking derivatives wrt \(\mbW\) and setting to zero yields,

\[\begin{align*} \mbW^\star &= \left( \sum_{n=1}^N \E_{q(\mbz_n)} \left[ \mbx_n \mbz_n^\top \right] \right) \left( \sum_{n=1}^N \E_{q(\mbz_n)} \left[ \mbz_n \mbz_n^\top \right] \right)^{-1}. \end{align*}\]

It depends on sums of expected sufficient statistics!
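
Putting the E-step and this M-step together, here is a sketch of one EM iteration in numpy (data assumed centered; function name is ours). The \(\sigma^2\) update is intentionally omitted, since it is the exercise below:

```python
import numpy as np

def ppca_em_step(X, W, sigma2):
    """One EM iteration for PPCA on centered data, returning the updated W."""
    N, D = X.shape
    M = W.shape[1]

    # E-step: q(z_n) = N(m_n, Sigma) with a shared posterior covariance.
    J = np.eye(M) + (W.T @ W) / sigma2        # posterior precision
    Sigma = np.linalg.inv(J)
    Ms = (X @ W) @ Sigma / sigma2             # posterior means m_n, stacked N x M

    # Expected sufficient statistics.
    sum_xz = X.T @ Ms                         # sum_n x_n E[z_n]^T   (D x M)
    sum_zz = N * Sigma + Ms.T @ Ms            # sum_n E[z_n z_n^T]   (M x M)

    # M-step for W; the sigma^2 update is the exercise below.
    return sum_xz @ np.linalg.inv(sum_zz)
```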

Exercise:

Derive the expected sufficient statistics and the M-step update for \(\sigma^2\).

Factor Analysis#

Factor analysis is another continuous latent variable model. In fact, it’s almost the same as probabilistic PCA!

The difference is that FA allows \(\sigma^2\) to vary across output dimensions. The generative model is,

\[\begin{align*} \mbz_n &\iid{\sim} \cN(\mbzero, \mbI) \\ \mbx_n \mid \mbz_n &\sim \cN(\mbW \mbz_n + \mbmu, \diag(\mbsigma^2)) \end{align*}\]

where \(\mbsigma^2 = [\sigma_1^2, \ldots, \sigma_D^2]^\top\).

Exercise

Without doing any math, derive EM for this factor analysis model.

Linear Dynamical Systems#

We generalized from mixture models to HMMs by assuming that the latent states were dependent.

HMMs assume a particular factorization of the joint distribution on latent states (\(z_t\)) and observations \((\mbx_t)\). The graphical model looks like this:

HMM graphical model

This graphical model says that the joint distribution factors as,

\[\begin{align*} p(z_{1:T}, \mbx_{1:T}) &= p(z_1) \prod_{t=2}^T p(z_t \mid z_{t-1}) \prod_{t=1}^T p(\mbx_t \mid z_t). \end{align*}\]

We call this an HMM because \(p(z_1) \prod_{t=2}^T p(z_t \mid z_{t-1})\) is a Markov chain.

State space models (SSMs)#

Note that nothing above assumes that \(z_t\) is a discrete random variable!

HMMs are the special case of more general state space models in which the latent states are discrete.

State space models assume the same graphical model but allow for arbitrary types of latent states.

For example, suppose that \(\mbz_t \in \mathbb{R}^D\) are continuous valued latent states and that,

\[\begin{align*} p(\mbz_{1:T}) &= p(\mbz_1) \prod_{t=2}^T p(\mbz_t \mid \mbz_{t-1}) \\ &= \cN(\mbz_1 \mid \mbb_1, \mbQ_1) \prod_{t=2}^T \cN(\mbz_t \mid \mbA \mbz_{t-1} + \mbb, \mbQ) \end{align*}\]

This is called a Gaussian linear dynamical system (LDS).
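
A minimal sketch of drawing a latent trajectory from this prior; the rotational dynamics below are just an illustrative choice of \(\mbA\):

```python
import numpy as np

def sample_lds_prior(T, A, b, Q, b1, Q1, rng=None):
    """Draw a latent trajectory z_{1:T} from the Gaussian LDS prior."""
    rng = rng if rng is not None else np.random.default_rng()
    zs = np.zeros((T, A.shape[0]))
    zs[0] = rng.multivariate_normal(b1, Q1)
    for t in range(1, T):
        zs[t] = rng.multivariate_normal(A @ zs[t - 1] + b, Q)
    return zs

# Example: slowly decaying rotational dynamics in 2D.
theta = 0.1
A = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
zs = sample_lds_prior(100, A, np.zeros(2), 0.01 * np.eye(2),
                      np.zeros(2), np.eye(2))
```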

Message passing in HMMs#

In the HMM with discrete states, we showed how to compute posterior marginal distributions using message passing,

\[\begin{align*} p(z_t \mid \mbx_{1:T}) &\propto \sum_{z_1} \cdots \sum_{z_{t-1}} \sum_{z_{t+1}} \cdots \sum_{z_T} p(z_{1:T}, \mbx_{1:T}) \\ &= \alpha_t(z_t) \, p(\mbx_t \mid z_t) \, \beta_t(z_t) \end{align*}\]

where the forward and backward messages are defined recursively

\[\begin{align*} \alpha_t(z_t) &= \sum_{z_{t-1}} p(z_t \mid z_{t-1}) \, p(\mbx_{t-1} \mid z_{t-1}) \, \alpha_{t-1}(z_{t-1}) \\ \beta_t(z_t) &= \sum_{z_{t+1}} \, p(z_{t+1} \mid z_t) \, p(\mbx_{t+1} \mid z_{t+1}) \, \beta_{t+1}(z_{t+1}) \end{align*}\]

The initial conditions are \(\alpha_1(z_1) = p(z_1)\) and \(\beta_{T}(z_T) = 1\).

What do the forward messages tell us?#

The forward messages are equivalent to,

\[\begin{align*} \alpha_t(z_t) &= \sum_{z_1} \cdots \sum_{z_{t-1}} p(z_{1:t}, \mbx_{1:t-1}) \\ &= p(z_t, \mbx_{1:t-1}). \end{align*}\]

The normalized message is the predictive distribution,

\[\begin{align*} \frac{\alpha_t(z_t)}{\sum_{z_t'} \alpha_t(z_t')} &= \frac{p(z_t, \mbx_{1:t-1})}{\sum_{z_t'} p(z_t', \mbx_{1:t-1})} = \frac{p(z_t, \mbx_{1:t-1})}{p(\mbx_{1:t-1})} = p(z_t \mid \mbx_{1:t-1}), \end{align*}\]

Finally, incorporating the last likelihood term and summing yields the marginal likelihood, \(\sum_{z_T} \alpha_T(z_T) \, p(\mbx_T \mid z_T) = p(\mbx_{1:T})\).

Message passing in state space models#

The same recursive algorithm applies (in theory) to any state space model with the same factorization, but the sums are replaced with integrals,

\[\begin{align*} p(\mbz_t \mid \mbx_{1:T}) &\propto \int \dif \mbz_1 \cdots \int \dif {\mbz_{t-1}} \int \dif{\mbz_{t+1}} \cdots \int \dif {\mbz_T} \, p(\mbz_{1:T}, \mbx_{1:T}) \\ &= \alpha_t(\mbz_t) \, p(\mbx_t \mid \mbz_t) \, \beta_t(\mbz_t) \end{align*}\]

where the forward and backward messages are defined recursively

\[\begin{align*} \alpha_t(\mbz_t) &= \int p(\mbz_t \mid \mbz_{t-1}) \, p(\mbx_{t-1} \mid \mbz_{t-1}) \, \alpha_{t-1}(\mbz_{t-1}) \dif {\mbz_{t-1}} \\ \beta_t(\mbz_t) &= \int p(\mbz_{t+1} \mid \mbz_t) \, p(\mbx_{t+1} \mid \mbz_{t+1}) \, \beta_{t+1}(\mbz_{t+1}) \dif {\mbz_{t+1}} \end{align*}\]

The initial conditions are \(\alpha_1(\mbz_1) = p(\mbz_1)\) and \(\beta_{T}(\mbz_T) \propto 1\).

Forward pass in a linear dynamical system#

Consider a linear dynamical system (LDS) with Gaussian emissions,

\[\begin{align*} p(\mbx_{1:T}, \mbz_{1:T}) &= p(\mbz_1) \prod_{t=2}^T p(\mbz_t \mid \mbz_{t-1}) \prod_{t=1}^T p(\mbx_t \mid \mbz_t) \\ &= \cN(\mbz_1 \mid \mbb_1, \mbQ_1) \prod_{t=2}^T \cN(\mbz_t \mid \mbA \mbz_{t-1} + \mbb, \mbQ) \prod_{t=1}^T \cN(\mbx_t \mid \mbC \mbz_t + \mbd, \mbR) \end{align*}\]

Let’s derive the forward message \(\alpha_{t+1}(\mbz_{t+1})\). Assume \(\alpha_{t}(\mbz_{t}) \propto \cN(\mbz_{t} \mid \mbmu_{t|t-1}, \mbSigma_{t|t-1})\).

\[\begin{align*} \alpha_{t+1}(\mbz_{t+1}) &= \int p(\mbz_{t+1} \mid \mbz_{t}) \, p(\mbx_{t} \mid \mbz_{t}) \, \alpha_{t}(\mbz_{t}) \dif {\mbz_{t}} \\ &= \int \cN(\mbz_{t+1} \mid \mbA \mbz_{t} + \mbb, \mbQ) \, \cN(\mbx_{t} \mid \mbC \mbz_{t} + \mbd, \mbR) \, \cN(\mbz_{t} \mid \mbmu_{t|t-1}, \mbSigma_{t|t-1}) \dif \mbz_{t} \end{align*}\]

The update step#

The first step is the update step, where we condition on the emission \(\mbx_t\),

Exercise

Expand the densities, collect terms, and complete the square to compute the mean \(\mbmu_{t|t}\) and covariance \(\mbSigma_{t|t}\) after the update step:

\[\begin{align*} \cN(\mbx_{t} \mid \mbC \mbz_{t} + \mbd, \mbR) \, \cN(\mbz_{t} \mid \mbmu_{t|t-1}, \mbSigma_{t|t-1}) &\propto \cN(\mbz_t \mid \mbmu_{t|t}, \mbSigma_{t|t}). \end{align*}\]

Write the joint distribution,

\[\begin{align*} p(\mbz_t, \mbx_t \mid \mbx_{1:t-1}) &= \cN(\mbx_{t} \mid \mbC \mbz_{t} + \mbd, \mbR) \, \cN(\mbz_{t} \mid \mbmu_{t|t-1}, \mbSigma_{t|t-1}) \\ &= \cN \left( \begin{bmatrix} \mbz_t \\ \mbx_t \end{bmatrix} \, \bigg| \, \begin{bmatrix} \mbmu_{t|t-1} \\ \mbC \mbmu_{t|t-1} + \mbd \end{bmatrix}, \begin{bmatrix} \mbSigma_{t|t-1} & \mbSigma_{t|t-1} \mbC^\top \\ \mbC \mbSigma_{t|t-1} & \mbC \mbSigma_{t|t-1} \mbC^\top + \mbR \end{bmatrix} \right) \end{align*}\]

Since \((\mbz_t, \mbx_t)\) are jointly Gaussian, \(\mbz_t\) must be conditionally Gaussian as well,

\[\begin{align*} p(\mbz_t \mid \mbx_{1:t}) &= \cN(\mbmu_{t|t}, \mbSigma_{t|t}). \end{align*}\]

Exercise

Now use the Schur complement to solve for \(\mbmu_{t|t}\) and \(\mbSigma_{t|t}\).

Kalman gain

Write \(\mbmu_{t|t}\) and \(\mbSigma_{t|t}\) in terms of the Kalman gain

\[\begin{align*} \mbK_t &= \mbSigma_{t|t-1} \mbC^\top (\mbC \mbSigma_{t|t-1} \mbC^\top + \mbR)^{-1} \end{align*}\]

What is the Kalman gain doing?
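
In code, the update step is only a few lines once the Kalman gain is available. A minimal sketch using the standard Gaussian conditioning results (deriving these expressions is exactly the exercise above; function name is ours):

```python
import numpy as np

def kalman_update(mu_pred, Sigma_pred, x, C, d, R):
    """Condition on x_t: map (mu_{t|t-1}, Sigma_{t|t-1}) to (mu_{t|t}, Sigma_{t|t})."""
    S = C @ Sigma_pred @ C.T + R                      # innovation covariance
    K = Sigma_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    mu_filt = mu_pred + K @ (x - C @ mu_pred - d)     # shift the mean toward the data
    Sigma_filt = Sigma_pred - K @ C @ Sigma_pred      # shrink the covariance
    return mu_filt, Sigma_filt
```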

The predict step#

The update step yields \(p(\mbz_t \mid \mbx_{1:t}) = \cN(\mbz_t \mid \mbmu_{t|t}, \mbSigma_{t|t})\). To complete the forward pass, we need the predict step,

\[\begin{align*} \alpha_{t+1}(\mbz_{t+1}) &= \int p(\mbz_{t+1} \mid \mbz_{t}) \, p(\mbx_{t} \mid \mbz_{t}) \, \alpha_{t}(\mbz_{t}) \dif {\mbz_{t}} \\ &= \int \cN(\mbz_{t+1} \mid \mbA \mbz_{t} + \mbb, \mbQ) \, \cN(\mbz_{t} \mid \mbmu_{t|t}, \mbSigma_{t|t}) \dif \mbz_{t} \\ &= \cN(\mbz_{t+1} \mid \mbmu_{t+1|t}, \mbSigma_{t+1|t}) \end{align*}\]

Exercise

Solve for the mean \(\mbmu_{t+1|t}\) and covariance \(\mbSigma_{t+1|t}\) after the predict step.

Completing the recursions#

That wraps up the recursions! All that’s left is the base case, which comes from the initial state distribution,

\[\begin{align*} \mbmu_{1|0} = \mbb_1 \qquad \text{and} \quad \mbSigma_{1|0} = \mbQ_1. \end{align*}\]

Computing the marginal likelihood#

Like in the discrete HMM, we can compute the marginal likelihood along the forward pass.

\[\begin{align*} p(\mbx_{1:T}) &= \prod_{t=1}^T p(\mbx_t \mid \mbx_{1:t-1}) \\ &= \prod_{t=1}^T \int p(\mbx_t \mid \mbz_t) \, p(\mbz_t \mid \mbx_{1:t-1}) \dif \mbz_t \\ &= \prod_{t=1}^T \int \cN(\mbx_t \mid \mbC \mbz_t + \mbd, \mbR) \, \cN(\mbz_t \mid \mbmu_{t|t-1}, \mbSigma_{t|t-1}) \dif \mbz_t \end{align*}\]

Exercise

Obtain a closed form expression for the integrals.
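
Putting the update and predict steps together, here is a sketch of the full forward pass that accumulates the log marginal likelihood as it goes. It assumes the standard closed form for each term, \(\int \cN(\mbx_t \mid \mbC \mbz_t + \mbd, \mbR) \, \cN(\mbz_t \mid \mbmu_{t|t-1}, \mbSigma_{t|t-1}) \dif \mbz_t = \cN(\mbx_t \mid \mbC \mbmu_{t|t-1} + \mbd, \mbC \mbSigma_{t|t-1} \mbC^\top + \mbR)\), which is the answer to the exercise above. Function names and array conventions are ours:

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate normal, using a solve instead of an inverse."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def kalman_filter(xs, A, b, Q, b1, Q1, C, d, R):
    """Forward pass: filtered means/covariances and the log marginal likelihood."""
    T, D = xs.shape[0], A.shape[0]
    mus, Sigmas = np.zeros((T, D)), np.zeros((T, D, D))
    mu_pred, Sigma_pred = b1, Q1                      # base case: p(z_1)
    loglik = 0.0
    for t in range(T):
        # log p(x_t | x_{1:t-1}) = log N(x_t | C mu_{t|t-1} + d, C Sigma_{t|t-1} C^T + R)
        S = C @ Sigma_pred @ C.T + R
        loglik += gaussian_logpdf(xs[t], C @ mu_pred + d, S)

        # Update step: condition on x_t.
        K = Sigma_pred @ C.T @ np.linalg.inv(S)
        mus[t] = mu_pred + K @ (xs[t] - C @ mu_pred - d)
        Sigmas[t] = Sigma_pred - K @ C @ Sigma_pred

        # Predict step: push forward through the dynamics.
        mu_pred = A @ mus[t] + b
        Sigma_pred = A @ Sigmas[t] @ A.T + Q
    return mus, Sigmas, loglik
```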

Backward pass in a linear dynamical system#

The backward pass in the LDS computes the smoothing distributions

  • The forward pass gives us the filtering distributions \(p(\mbz_t \mid \mbx_{1:t})\). How can we compute the smoothing distributions, \(p(\mbz_t \mid \mbx_{1:\textcolor{red}{T}})\)?

  • In the discrete HMM we essentially ran the same algorithm in reverse to get the backward messages, starting from \(\beta_{T}(\mbz_T) \propto 1\).

  • We can do the same sort of thing here, but it’s a bit funky because we need to start with an improper Gaussian distribution \(\beta_T(\mbz_T) \propto \cN(\mbzero, \infty \mbI)\).

  • Instead, we’ll derive an alternative way of computing the smoothing distributions.

Bayesian Smoothing#

Note: \(\mbz_t\) is conditionally independent of \(\mbx_{t+1:T}\) given \(\mbz_{t+1}\), so

\[\begin{align*} p(\mbz_t \mid \mbz_{t+1}, \mbx_{1:T}) &= p(\mbz_t \mid \mbz_{t+1}, \mbx_{1:t}) \\ &= \frac{p(\mbz_t, \mbz_{t+1} \mid \mbx_{1:t})}{p(\mbz_{t+1} \mid \mbx_{1:t})} \\ &= \frac{p(\mbz_t \mid \mbx_{1:t}) \, p(\mbz_{t+1} \mid \mbz_t)}{p(\mbz_{t+1} \mid \mbx_{1:t})} \end{align*}\]

Question

What rules of probability did we apply in each of these steps?

Now we can write the joint distribution as,

\[\begin{align*} p(\mbz_t, \mbz_{t+1} \mid \mbx_{1:T}) &= p(\mbz_t \mid \mbz_{t+1}, \mbx_{1:T}) \, p(\mbz_{t+1} \mid \mbx_{1:T}) \\ &= \frac{p(\mbz_t \mid \mbx_{1:t}) \, p(\mbz_{t+1} \mid \mbz_t) \, p(\mbz_{t+1} \mid \mbx_{1:T})}{p(\mbz_{t+1} \mid \mbx_{1:t})}. \end{align*}\]

Marginalizing over \(\mbz_{t+1}\) gives us,

\[\begin{align*} p(\mbz_t \mid \mbx_{1:T}) &= p(\mbz_t \mid \mbx_{1:t}) \int \frac{p(\mbz_{t+1} \mid \mbz_t) \, p(\mbz_{t+1} \mid \mbx_{1:T})}{p(\mbz_{t+1} \mid \mbx_{1:t})} \dif \mbz_{t+1} \end{align*}\]

Question

Can we compute each of these terms?

The Rauch-Tung-Striebel (RTS) Smoother, aka Kalman Smoother#

These recursions apply to any Markovian state space model. For the special case of a Gaussian linear dynamical system, the smoothing distributions are again Gaussians,

\[\begin{align*} p(\mbz_t \mid \mbx_{1:T}) &= \cN(\mbz_t \mid \mbmu_{t|T}, \mbSigma_{t|T}) \end{align*}\]

where

\[\begin{align*} \mbmu_{t|T} &= \mbmu_{t|t} + \mbG_t (\mbmu_{t+1|T} - \mbmu_{t+1|t}) \\ \mbSigma_{t|T} &= \mbSigma_{t|t} + \mbG_t (\mbSigma_{t+1|T} - \mbSigma_{t+1|t}) \mbG_t^\top \\ \mbG_t &\triangleq \mbSigma_{t|t} \mbA^\top \mbSigma_{t+1|t}^{-1}. \end{align*}\]

This is called the Rauch-Tung-Striebel (RTS) smoother or the Kalman smoother.
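
A sketch of this backward recursion in numpy, consuming the filtered means and covariances from a forward pass (array conventions and function name are ours):

```python
import numpy as np

def rts_smoother(mus, Sigmas, A, b, Q):
    """Backward pass: turn filtered estimates into smoothed estimates."""
    T, D = mus.shape
    mus_s, Sigmas_s = mus.copy(), Sigmas.copy()       # the t = T term is already smoothed
    for t in range(T - 2, -1, -1):
        # One-step predictions from the filtered estimate at time t.
        mu_pred = A @ mus[t] + b
        Sigma_pred = A @ Sigmas[t] @ A.T + Q

        G = Sigmas[t] @ A.T @ np.linalg.inv(Sigma_pred)            # smoother gain G_t
        mus_s[t] = mus[t] + G @ (mus_s[t + 1] - mu_pred)
        Sigmas_s[t] = Sigmas[t] + G @ (Sigmas_s[t + 1] - Sigma_pred) @ G.T
    return mus_s, Sigmas_s
```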

Conclusion#

Linear Gaussian latent variable models are ubiquitous in machine learning and statistics. They are even hiding in familiar places: whenever we apply PCA, we’re essentially using an LG-LVM. There are several more variants of these models, including canonical correlation analysis (CCA), independent component analysis (ICA), and many more.

State space models are near and dear to my heart, and I wish we had more time to cover them in this course. In particular, my lab has worked on switching linear dynamical systems, which are a mash-up of HMMs and LDSs. We also work on better algorithms for inference in SSMs, and deep SSMs that we’ll discuss in a later chapter.