Variational Autoencoders#
Generative Model#
Variational Autoencoders (VAEs) are “deep” but conceptually simple generative models. To sample a data point \(\mbx_n\):
First, sample latent variables \(\mbz_n\) from the prior; e.g., a standard Gaussian,
\[\begin{align*} \mbz_n &\sim \mathrm{N}(\mbzero, \mbI). \end{align*}\]
Then sample the data point \(\mbx_n\) from a conditional distribution with mean \(g(\mbz_n; \mbtheta)\); e.g.,
\[\begin{align*} \mbx_n \mid \mbz_n &\sim \mathrm{N}(g(\mbz_n; \mbtheta), \mbI), \end{align*}\]
where \(g: \reals^H \to \reals^D\) is a nonlinear mapping parameterized by \(\mbtheta\).
We will assume \(g\) is a simple feedforward neural network composed of \(L\) layers, where each layer is a linear mapping followed by an element-wise nonlinearity (except for the last layer, perhaps). For example, with \(\mathrm{relu}(a) = \max(0, a)\) applied element-wise,
\[\begin{align*} g(\mbz; \mbtheta) &= \mbW_L \, \mathrm{relu}\big( \cdots \mathrm{relu}(\mbW_1 \mbz + \mbb_1) \cdots \big) + \mbb_L. \end{align*}\]
The generative parameters consist of the weights and biases, \(\mbtheta = \{\mbW_\ell, \mbb_\ell\}_{\ell=1}^L\).
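To make this concrete, here is a minimal sketch of such a generative model in PyTorch. The latent and data dimensions, the two-layer architecture, and the unit-variance Gaussian likelihood are illustrative assumptions, not part of the definition.

```python
import torch
import torch.nn as nn

H, D = 2, 784   # latent and data dimensions (arbitrary choices for illustration)

# Decoder g: R^H -> R^D, a small feedforward network with parameters theta.
decoder = nn.Sequential(
    nn.Linear(H, 128),
    nn.ReLU(),
    nn.Linear(128, D),   # last layer is linear (no nonlinearity)
)

def sample_data_point():
    """Ancestral sampling: z_n ~ N(0, I), then x_n ~ N(g(z_n; theta), I)."""
    z = torch.randn(H)              # sample latent variables from the prior
    mean = decoder(z)               # nonlinear mapping g(z; theta)
    x = mean + torch.randn(D)       # add unit-variance Gaussian noise
    return x, z
```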
Learning and Inference#
We have two goals. The learning goal is to find the parameters that maximize the marginal likelihood of the data,
\[\begin{align*} \mbtheta^\star &= \arg \max_{\mbtheta} \sum_{n=1}^N \log p(\mbx_n; \mbtheta) = \arg \max_{\mbtheta} \sum_{n=1}^N \log \int p(\mbx_n, \mbz_n; \mbtheta) \, \mathrm{d}\mbz_n. \end{align*}\]
The inference goal is to find the posterior distribution of latent variables,
\[\begin{align*} p(\mbz_n \mid \mbx_n; \mbtheta) &= \frac{p(\mbx_n, \mbz_n; \mbtheta)}{\int p(\mbx_n, \mbz_n'; \mbtheta) \, \mathrm{d}\mbz_n'}. \end{align*}\]
Both goals require an integral over \(\mbz_n\), which is intractable for deep generative models.
The Evidence Lower Bound (ELBO)#
Idea: Use the ELBO to get a lower bound on the log marginal probability and maximize that instead,
\[\begin{align*} \sum_{n=1}^N \log p(\mbx_n; \mbtheta) &\geq \cL(\mblambda, \mbtheta) = \sum_{n=1}^N \cL_n(\mblambda_n, \mbtheta), \\ \cL_n(\mblambda_n, \mbtheta) &= \mathbb{E}_{q(\mbz_n; \mblambda_n)}\left[ \log p(\mbx_n, \mbz_n; \mbtheta) - \log q(\mbz_n; \mblambda_n) \right], \end{align*}\]
where \(\mblambda = \{\mblambda_n\}_{n=1}^N\).
Here, I’ve written the ELBO as a sum of local ELBOs \(\cL_n\), one per data point.
Variational Inference#
The ELBO is still maximized (and the bound is tight) when each \(q_n\) is equal to the true posterior, \(q_n(\mbz_n) = p(\mbz_n \mid \mbx_n; \mbtheta)\).
Unfortunately, the posterior no longer has a simple, closed form.
Question
Suppose \(\mbx_n \sim \mathrm{N}(g(\mbz_n; \mbtheta), \mbI)\). This deep generative model has a Gaussian prior on \(\mbz_n\) and a Gaussian likelihood for \(\mbx_n\) given \(\mbz_n\). Why isn’t the posterior Gaussian?
Nevertheless, we can still constrain \(q_n\) to belong to a simple family. For example, we could constrain it to be Gaussian and seek the best Gaussian approximation to the posterior. This is sometimes called fixed-form variational inference. Let \(q(\mbz_n; \mblambda_n)\) denote a member of this family with variational parameters \(\mblambda_n \in \mbLambda\).
Then, for fixed parameters \(\mbtheta\), the best \(q_n\) in this variational family is the one that maximizes the local ELBO or, equivalently, minimizes the KL divergence to the true posterior,
\[\begin{align*} \mblambda_n^\star &= \arg \max_{\mblambda_n \in \mbLambda} \cL_n(\mblambda_n, \mbtheta) = \arg \min_{\mblambda_n \in \mbLambda} \mathrm{KL}\left( q(\mbz_n; \mblambda_n) \,\|\, p(\mbz_n \mid \mbx_n; \mbtheta) \right). \end{align*}\]
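Even without a closed form, the local ELBO for a diagonal Gaussian \(q\) is easy to estimate by Monte Carlo. Below is a minimal sketch; `log_joint` is a hypothetical user-supplied function computing \(\log p(\mbx_n, \mbz_n; \mbtheta)\). Note that this estimates only the value of the ELBO; estimating gradients with respect to \(\mblambda_n\) requires the reparameterization trick discussed below.

```python
import torch
from torch.distributions import Normal

def local_elbo(x, mu, log_sigma2, log_joint, num_samples=10):
    """Monte Carlo estimate of L_n(lambda_n, theta) for q = N(mu, diag(sigma^2)).

    `log_joint(x, z)` is assumed to return log p(x, z; theta) as a scalar tensor.
    """
    sigma = torch.exp(0.5 * log_sigma2)
    q = Normal(mu, sigma)
    z = q.sample((num_samples,))                  # z^(m) ~ q(z; lambda)
    log_q = q.log_prob(z).sum(-1)                 # log q(z^(m); lambda)
    log_p = torch.stack([log_joint(x, z_m) for z_m in z])
    return (log_p - log_q).mean()                 # average over the M samples
```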
Variational Expectation-Maximization (vEM)#
Now we can introduce a new algorithm.
(Variational EM (vEM))
Repeat until either the ELBO or the parameters converge:
M-step: Set \(\mbtheta \leftarrow \arg \max_{\mbtheta} \cL(\mblambda, \mbtheta)\)
E-step: Set \(\mblambda_n \leftarrow \arg \max_{\mblambda_n \in \mbLambda} \cL_n(\mblambda_n, \mbtheta)\) for \(n=1,\ldots,N\)
Compute (an estimate of) the ELBO \(\cL(\mblambda, \mbtheta)\).
In general, none of these steps will have closed form solutions, so we’ll have to use approximations.
Generic M-step with Stochastic Gradient Ascent#
For exponential family mixture models, the M-step had a closed form solution. For deep generative models, we need a more general approach.
If the parameters are unconstrained and the ELBO is differentiable wrt \(\mbtheta\), we can use stochastic gradient ascent.
Note that the gradient wrt \(\mbtheta\) can be moved inside the expectation (the expectation is under \(q_n\), which does not depend on \(\mbtheta\)), so it can be estimated with ordinary Monte Carlo; nothing fancy needed!
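For example, here is a minimal sketch of that Monte Carlo gradient estimate using PyTorch's autodiff. Here `log_joint` is again a hypothetical stand-in for \(\log p(\mbx_n, \mbz_n; \mbtheta)\), `q` is any distribution object we can sample from, and `theta_params` is the list of decoder parameters.

```python
import torch

def mc_grad_theta(x, q, log_joint, theta_params, num_samples=10):
    """Unbiased Monte Carlo estimate of the gradient of L_n wrt theta.

    Since the expectation is under q (which does not involve theta), we can
    sample z^(m) ~ q and average the gradients of log p(x, z^(m); theta).
    """
    z = q.sample((num_samples,))     # plain sampling; no reparameterization needed
    objective = torch.stack([log_joint(x, z_m) for z_m in z]).mean()
    grads = torch.autograd.grad(objective, theta_params)
    return grads
```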
The Variational E-step#
Assume \(\cQ\) is the family of Gaussian distributions with diagonal covariance:
\[\begin{align*} q(\mbz_n; \mblambda_n) &= \mathrm{N}\left(\mbz_n \mid \mbmu_n, \mathrm{diag}(\mbsigma_n^2)\right). \end{align*}\]
This family is indexed by variational parameters \(\mblambda_n = (\mbmu_n, \log \mbsigma_n^2) \in \reals^{2H}\).
To perform stochastic gradient ascent, we need an unbiased estimate of the gradient of the local ELBO, but here the distribution that the expectation is taken under itself depends on \(\mblambda_n\), so we cannot simply move the gradient inside the expectation as we did for \(\mbtheta\).
Reparameterization Trick#
One way around this problem is to use the reparameterization trick, a.k.a. the pathwise gradient estimator. Note that,
\[\begin{align*} \mbz_n \sim \mathrm{N}\left(\mbmu_n, \mathrm{diag}(\mbsigma_n^2)\right) \quad \iff \quad \mbz_n = r(\mblambda_n, \mbepsilon) \text{ with } \mbepsilon \sim \mathrm{N}(\mbzero, \mbI), \end{align*}\]
where \(r(\mblambda_n, \mbepsilon) = \mbmu_n + \mbsigma_n \mbepsilon\) (with the product taken element-wise) is a reparameterization of \(\mbz_n\) in terms of the parameters \(\mblambda_n\) and the noise \(\mbepsilon\).
We can use the law of the unconscious statistician to rewrite the expectation as,
\[\begin{align*} \cL_n(\mblambda_n, \mbtheta) &= \mathbb{E}_{\mbepsilon \sim \mathrm{N}(\mbzero, \mbI)}\left[ \log p(\mbx_n, r(\mblambda_n, \mbepsilon); \mbtheta) - \log q(r(\mblambda_n, \mbepsilon); \mblambda_n) \right], \end{align*}\]
where the expectation is now taken with respect to the noise \(\mbepsilon\) rather than \(\mbz_n\).
The distribution that the expectation is taken under no longer depends on the parameters \(\mblambda_n\), so we can simply take the gradient inside the expectation,
\[\begin{align*} \nabla_{\mblambda_n} \cL_n(\mblambda_n, \mbtheta) &= \mathbb{E}_{\mbepsilon \sim \mathrm{N}(\mbzero, \mbI)}\left[ \nabla_{\mblambda_n} \left( \log p(\mbx_n, r(\mblambda_n, \mbepsilon); \mbtheta) - \log q(r(\mblambda_n, \mbepsilon); \mblambda_n) \right) \right]. \end{align*}\]
Now we can use Monte Carlo to obtain an unbiased estimate of the final expectation!
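A minimal sketch of the resulting estimator for the diagonal Gaussian family, again with a hypothetical `log_joint` standing in for \(\log p(\mbx_n, \mbz_n; \mbtheta)\):

```python
import torch
from torch.distributions import Normal

def reparam_elbo_estimate(x, mu, log_sigma2, log_joint, num_samples=10):
    """Reparameterized Monte Carlo estimate of the local ELBO.

    Because z = mu + sigma * eps is a deterministic function of (mu, sigma),
    gradients wrt the variational parameters flow through the samples.
    """
    sigma = torch.exp(0.5 * log_sigma2)
    eps = torch.randn(num_samples, mu.shape[0])      # eps^(m) ~ N(0, I)
    z = mu + sigma * eps                             # z^(m) = r(lambda, eps^(m))
    log_q = Normal(mu, sigma).log_prob(z).sum(-1)    # log q(z^(m); lambda)
    log_p = torch.stack([log_joint(x, z_m) for z_m in z])
    return (log_p - log_q).mean()

# Usage sketch: differentiate the estimate wrt (mu, log_sigma2). This requires
# mu.requires_grad and log_sigma2.requires_grad to be True.
# elbo_hat = reparam_elbo_estimate(x, mu, log_sigma2, log_joint)
# grad_mu, grad_log_sigma2 = torch.autograd.grad(elbo_hat, (mu, log_sigma2))
```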
Working with mini-batches of data#
We can view the ELBO as an expectation over data indices,
\[\begin{align*} \cL(\mblambda, \mbtheta) &= N \, \mathbb{E}_{n \sim \mathrm{Unif}(1, \ldots, N)}\left[ \cL_n(\mblambda_n, \mbtheta) \right]. \end{align*}\]
We can use Monte Carlo to approximate the expectation (and its gradient) by drawing mini-batches of data points at random.
In practice, we often cycle through mini-batches of data points deterministically. Each pass over the whole dataset is called an epoch.
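For instance, with a mini-batch of \(B\) indices drawn uniformly at random, scaling the average local ELBO by \(N\) gives an unbiased estimate of the full ELBO. A minimal sketch, where `local_elbo_fn` and the batch size are assumptions:

```python
import torch

def minibatch_elbo_estimate(data, variational_params, local_elbo_fn, batch_size=64):
    """Unbiased mini-batch estimate of the full ELBO.

    `local_elbo_fn(x_n, lambda_n)` is assumed to return an estimate of L_n
    as a scalar tensor.
    """
    N = len(data)
    idx = torch.randint(N, (batch_size,)).tolist()    # n ~ Unif(1, ..., N)
    batch_elbos = torch.stack(
        [local_elbo_fn(data[n], variational_params[n]) for n in idx]
    )
    return N * batch_elbos.mean()                     # scale so the estimate is unbiased
```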
Algorithm#
Now we can add some detail to our variational expectation maximization algorithm.
(Variational EM (with the reparameterization trick))
For epoch \(i=1,\ldots,\infty\):
For \(n=1,\ldots,N\):
Sample \(\mbepsilon_n^{(m)} \overset{\text{iid}}{\sim} \mathrm{N}(\mbzero, \mbI)\) for \(m=1,\ldots,M\).
M-Step:
a. Estimate
\[\begin{align*} \hat{\nabla}_{\mbtheta} \cL_n(\mblambda_n, \mbtheta) &= \frac{1}{M} \sum_{m=1}^M \left[ \nabla_{\mbtheta} \log p(\mbx_n, r(\mblambda_n, \mbepsilon_n^{(m)}); \mbtheta) \right] \end{align*}\]

b. Set \(\mbtheta \leftarrow \mbtheta + \alpha_i N \hat{\nabla}_{\mbtheta} \cL_n(\mblambda_n, \mbtheta)\)
E-step:
a. Estimate
\[\begin{align*} \hat{\nabla}_{\mblambda} \cL_n(\mblambda_n, \mbtheta) &= \frac{1}{M} \sum_{m=1}^M \nabla_{\mblambda} \left[\log p(\mbx_n, r(\mblambda_n, \mbepsilon_n^{(m)}); \mbtheta) - \log q(r(\mblambda_n, \mbepsilon_n^{(m)}); \mblambda_n) \right] \end{align*}\]

b. Set \(\mblambda_n \leftarrow \mblambda_n + \alpha_i \hat{\nabla}_{\mblambda} \cL_n(\mblambda_n, \mbtheta)\).
Estimate the ELBO
\[\begin{align*} \hat{\cL}(\mblambda, \mbtheta) &= \frac{N}{M} \sum_{m=1}^M \left[ \log p(\mbx_n, r(\mblambda_n, \mbepsilon_n^{(m)}); \mbtheta) - \log q(r(\mblambda_n, \mbepsilon_n^{(m)}); \mblambda_n) \right] \end{align*}\]

Decay the step size \(\alpha_i\) according to a schedule.
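Here is a minimal end-to-end sketch of this procedure for the diagonal Gaussian family. For simplicity it takes a single joint gradient step on \(\mbtheta\) and \(\mblambda_n\) for each data point (rather than the alternating, \(N\)-scaled M- and E-steps written above); the architecture, step size, and unit-variance Gaussian likelihood are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

def fit_vem(data, H=2, num_epochs=50, M=5, lr=1e-3):
    """Variational EM with the reparameterization trick and per-data-point lambdas."""
    N, D = data.shape
    decoder = nn.Sequential(nn.Linear(H, 128), nn.ReLU(), nn.Linear(128, D))
    # One (mu_n, log_sigma2_n) pair per data point.
    mus = torch.zeros(N, H, requires_grad=True)
    log_sigma2s = torch.zeros(N, H, requires_grad=True)
    opt = torch.optim.SGD(list(decoder.parameters()) + [mus, log_sigma2s], lr=lr)

    def log_joint(x, z):
        """log p(x, z; theta) for a standard Gaussian prior and N(g(z), I) likelihood."""
        log_prior = Normal(0.0, 1.0).log_prob(z).sum(-1)
        log_lik = Normal(decoder(z), 1.0).log_prob(x).sum(-1)
        return log_prior + log_lik

    for epoch in range(num_epochs):
        total_elbo = 0.0
        for n in range(N):
            mu, log_sigma2 = mus[n], log_sigma2s[n]
            sigma = torch.exp(0.5 * log_sigma2)
            eps = torch.randn(M, H)                    # eps^(m) ~ N(0, I)
            z = mu + sigma * eps                       # reparameterized samples of z_n
            log_q = Normal(mu, sigma).log_prob(z).sum(-1)
            elbo_n = (log_joint(data[n], z) - log_q).mean()
            opt.zero_grad()
            (-elbo_n).backward()                       # gradient ascent on the local ELBO
            opt.step()
            total_elbo += elbo_n.item()
        print(f"epoch {epoch}: running ELBO estimate {total_elbo:.2f}")
    return decoder, mus, log_sigma2s
```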
Amortized Inference#
Note that vEM involves optimizing separate variational parameters \(\mblambda_n\) for each data point. For large datasets where we are optimizing using mini-batches of data points, this leads to a strange asymmetry: we update the generative model parameters \(\mbtheta\) every mini-batch, but we only update the variational parameters for the \(n\)-th data point once per epoch. Is there any way to share information across data points?
Note that the optimal variational parameters are just a function of the data point and the model parameters,
\[\begin{align*} \mblambda_n^\star &= f^\star(\mbx_n, \mbtheta), \end{align*}\]
for some implicit and generally nonlinear function \(f^\star\).
VAEs learn an approximation to \(f^\star(\mbx_n, \mbtheta)\) with an inference network, a.k.a. recognition network or encoder.
The inference network is (yet another) neural network that takes in a data point \(\mbx_n\) and outputs variational parameters \(\mblambda_n\),
\[\begin{align*} \mblambda_n &= f(\mbx_n; \mbphi), \end{align*}\]
where \(\mbphi\) are the weights of the network.
The advantage is that the inference network shares information across data points: it amortizes the cost of inference, hence the name. The disadvantage is that its output will not exactly minimize the KL divergence for any particular data point. However, in practice we might tolerate a worse variational posterior and a weaker lower bound if it leads to faster optimization of the ELBO overall.
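A minimal sketch of such an encoder for the diagonal Gaussian family; the hidden layer size is an arbitrary choice.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Inference network f(x; phi): maps a data point to (mu, log_sigma2)."""

    def __init__(self, D, H, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(D, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, H)            # mean of q(z | x)
        self.log_sigma2_head = nn.Linear(hidden, H)    # log-variance of q(z | x)

    def forward(self, x):
        h = self.net(x)
        return self.mu_head(h), self.log_sigma2_head(h)
```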
Putting it all together#
Logically, I find it helpful to distinguish between the E and M steps, but with recognition networks and stochastic gradient ascent, the line is blurred.
The final algorithm looks like this.
(Variational EM (with amortized inference))
Repeat until either the ELBO or the parameters converge:
Sample data point \(n \sim \mathrm{Unif}(1, \ldots, N)\). [Or a minibatch of data points.]
Estimate the local ELBO \(\cL_n(\mbphi, \mbtheta)\) with Monte Carlo. [Note: it is a function of \(\mbphi\) instead of \(\mblambda_n\).]
Compute unbiased Monte Carlo estimates of the gradients \(\widehat{\nabla}_{\mbtheta} \cL_n(\mbphi, \mbtheta)\) and \(\widehat{\nabla}_{\mbphi} \cL_n(\mbphi, \mbtheta)\). [The latter requires the reparameterization trick.]
Set
\[\begin{align*} \mbtheta &\leftarrow \mbtheta + \alpha_i N \widehat{\nabla}_{\mbtheta} \cL_n(\mbphi, \mbtheta), \\ \mbphi &\leftarrow \mbphi + \alpha_i N \widehat{\nabla}_{\mbphi} \cL_n(\mbphi, \mbtheta), \end{align*}\]
with step size \(\alpha_i\) decreasing over iterations \(i\) according to a valid schedule.
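Putting the pieces together, one iteration of this algorithm might look like the sketch below. The mini-batching, the standard Gaussian prior, and the unit-variance Gaussian likelihood are illustrative assumptions; `encoder` and `decoder` are networks like those sketched earlier, and `optimizer` holds both \(\mbtheta\) and \(\mbphi\).

```python
import torch
from torch.distributions import Normal

def training_step(x_batch, encoder, decoder, optimizer, N, M=1):
    """One stochastic gradient step on the amortized ELBO for a mini-batch.

    The mini-batch estimate is scaled by N / batch_size so that it is an
    unbiased estimate of the full ELBO.
    """
    B, D = x_batch.shape
    mu, log_sigma2 = encoder(x_batch)                  # lambda_n = f(x_n; phi)
    sigma = torch.exp(0.5 * log_sigma2)
    eps = torch.randn(M, B, mu.shape[-1])              # eps ~ N(0, I)
    z = mu + sigma * eps                               # reparameterized samples, (M, B, H)
    log_q = Normal(mu, sigma).log_prob(z).sum(-1)      # log q(z | x; phi), shape (M, B)
    log_prior = Normal(0.0, 1.0).log_prob(z).sum(-1)   # log p(z), shape (M, B)
    log_lik = Normal(decoder(z), 1.0).log_prob(x_batch).sum(-1)  # log p(x | z; theta)
    elbo = (N / B) * (log_lik + log_prior - log_q).mean(0).sum()
    optimizer.zero_grad()
    (-elbo).backward()                                 # ascend the ELBO wrt theta and phi
    optimizer.step()
    return elbo.item()
```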