Mixture Models and the EM Algorithm#
Unit II was all about supervised learning problems. Both encoding and decoding involve predicting one signal from another: neural firing rates given stimuli, or behavioral outputs given neural activity. Now we turn our attention to unsupervised learning problems. Given measurements of neural activity or behavior, we aim to find latent variables — cluster, factors, etc. — that can explain the data.
Unsupervised learning has many applications in neuroscience. First and foremost, it’s a means of dimensionality reduction. Latent variables offer a compressed representation of high-dimensional neural spike trains or behavioral videos. Low dimensional representations can aid in visualization as well as hypothesis generation.
We encountered a few examples of unsupervised learning in Unit I. Spike sorting and calcium deconvolution were both unsupervised problems: the spike times and templates were a low dimensional representation of a high dimensional voltage trace or video.
Unsupervised learning is all about constraints. We need to set some boundaries on what types of latent variables we are looking for. I like to think of constraints in terms of the five D’s
Dimensionality: how many latent clusters, factors, etc.?
Domain: are the latent variables discrete, continuous, bounded, sparse, etc.?
Dynamics: how do the latent variables change over time?
Dependencies: how do the latent variables relate to the observed data?
Distribution: do we have prior knowledge about the variables’ probability?
This are certainly not an exhaustive list of the types of constraints, but it covers many bases. Thinking in terms of constraints allows us to organize many commonly used models.
Gaussian mixture models#
Let’s start with a simple and canonical example: the Gaussian mixture model (GMM).
Let
Assume each observation is independently drawn according to the following model:

The parameters include the prior probabilities of the cluster assignments,
The joint probability is,
MAP estimation and K-means#
In Chapter 2, we fit this type of model using MAP estimation,
We solved for the MAP estimate by coordinate ascent. For the GMM, the coordinates updates are,
For each data point
:For each cluster
In words, we set the cluster means and covariances equal to the same means and covariances of the data points assigned to that cluster. If we were to fix
Expectation-Maximization (EM)#
MAP estimation gives us a point estimate of the latent variables and parameters. However, point estimates can lead to an overly optimistic view of the model. Specifically, MAP estimation found the best assignment of latent, which may not reflect how well samples from the model match the observed data.
Ideally, we would like to find parameters that maximize the marginal likelihood, aka the model evidence,
Note
For discrete latent variable models, the integral becomes a sum.
Once we have those parameters, we can compute the posterior distribution over latent variables,
The expectation-maximization (EM) algorithm does exactly that: it finds a local maximum of the marginal likelihood via an iterative algorithm very similar to the MAP estimation algorithm above. The key difference is that instead of using hard assignments of data points to the most likely cluser, we use soft assignments of data points based on their posterior probabilities.
For each data point
and cluster , set:For each cluster
, update the parameters as follows:
The posterior probabilities
The Evidence Lower Bound (ELBO)#
Why does EM work? We can view it as coordinate ascent on the evidence lower bound (ELBO). First, let’s rewrite the log marginal likelihood as the log of an expectation,
Note
We dropped the subscripts on
This equality holds for any distribution
Since log is a concave function, Jensen’s inequality says that swapping the order of the log and the expectation gives a lower bound on the log marginal likelihood,
Jensen’s inequality
Jensen’s inequality relates convex functions of expectations to expectations of convex functions. If
If
The functional
EM and coordinate ascent#
We can view the EM algorithm as coordinate ascent on the ELBO.
M-step: Update the parameters,
That is, maximize the expected log joint probability.
E-step: Update the variational posterior by setting it equal to the posterior,
To see why this maximizes the ELBO for fixed
, note thatwhere
denotes the Kullback-Leibler divergence from to . The divergence is non-negative and zero iff . Thus, maximizing the ELBO wrt amounts to setting the variational posterior equal to the true posterior.
Note
The ELBO is tight after the E-step. Substituting
Iterating between the E- and M-steps converges to a local optimum of the ELBO. Since after each E-step the ELBO is tight, the local optimum of the ELBO must also be a local optimum of the log marginal likelihood.
Exponential family mixtures#
So far we’ve focused on Gaussian mixture models. Now let’s consider the more general case of exponential family mixture models for which,
where
Warning
We have switched to indexing data points by
Under this model, the E-step reduces to computing the responsibilities,
As a function of the parameters
where
Taking derivatives and setting to zero yields,
Recall that
Normalized sufficient statistics
Note that M-step is invariant to rescaling the ELBO. We could have multiplied
Then
Stochastic EM#
Finally, the EM algorithm presented above is a batch algorithm. The M-step involves aggregating sufficient statistics from the entire dataset. For very large datasets, it is more efficient to do an M-step after each mini-batch of data. That is how stochastic EM works.
The key idea is to maintain a running estimate of the sufficient statistics. Assume we have
In stochastic EM, we keep a running estimate of the normalized statistics. After processing each mini-batch, we update the running estimate by taking a convex combination of the previous estimate and the estimate from the current mini-batch.
where
To ensure convergence, we decay the learning rate according to a schedule, starting with
One sweep through the entire set of
Stochastic EM for GMM#
Let’s consider the special case of the Gaussian mixture model and derive the stochastic EM algorithm. The Gaussian probability can be written in exponential family form as,
where the sufficient statistics are
and the log normalizer is
Warning
Natural parameters
We could write this log probability in terms of its natural parameters
In the M-step of stochastic EM, we need to maximize the ELBO with respect to
where
Taking the derivative with respect to
Setting to zero and solving for the mean yields,
Substituting this into the ELBO yields,
Taking the derivative with respect to
Then solving for
Conclusion#
This chapter introduced unsupervised learning methods using latent variables.
We started with the Gaussian mixture model and recalled the coordinate ascent algorithm for MAP estimation, which is very similar to K-means.
Then we introduced the expectation-maximization algorithm and saw that for GMMs, it nicely parallels the MAP estimation algorithm. However, it yields parameters that maximize the marginal likelihood instead of the joint.
We derived EM as coordinate ascent on the evidence lower bound (ELBO). This view will generalize to more complex models in subsequent chapters.
Finally, we considered the more general class of exponential family mixture models and derived a stochastic EM algorithm that operates on running estimates of the expected sufficient statistics.
Next time, we’ll consider sequential latent variable models called hidden Markov models (HMMs).