Let’s turn our attention back to models for sequential data for the next few lectures. With all the excitement around large language modeling, this is a timely topic.
Consider a sequence of observations x1:T=(x1,…,xT) with each xt∈RD.
We can always factor a joint distribution over observations into a product of conditionals using the chain rule,
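$$
p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1}),
$$

with the convention that the first conditional is just the marginal p(x1).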
This is called an autoregressive model since the conditional of xt depends only on previous observations x1,…,xt−1.
Autoregressive models are well-suited to sequential modeling since they make forward generation or forecasting easy.
As long as we have access to the conditionals, we can sample forward indefinitely.
The question is, how should we parameterize these conditional distributions?
It looks challenging since each one takes in a variable-length history of previous observations.
Recurrent Neural Networks (RNNs) are autoregressive models in which the conditional distributions are functions of a finite-dimensional hidden state, ht∈RK,
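$$
h_t = \tanh\big(W h_{t-1} + b_t\big),
\qquad
p(x_t \mid x_{1:t-1}) = p\big(x_t \mid g(h_t; \theta)\big).
$$

(This is a sketch of one standard parameterization: W holds the dynamics weights, bt is an input-dependent term such as Bxt−1 plus a bias, so that ht depends only on past observations, and g(ht;θ) is a readout that sets the parameters of the conditional, as discussed below.)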
From a machine learning perspective, RNNs are useful function approximators for sequential data. From a neuroscience perspective, they have a long history as theoretical models of neural computation.
In such models, the hidden state ht∈RK corresponds to the relative firing rates of K neurons. With a hyperbolic tangent nonlinearity, ht∈(−1,1)K can be negative, but negative rates don’t make sense. Instead, we think of ht as an additive offset to a baseline firing rate.
The dynamics weights correspond to synaptic connections, with positive weights as excitatory synapses and negative weights as inhibitory. When a presynaptic neuron spikes, it induces an electrical current in postsynaptic neurons. Under this interpretation, the activation Wht corresponds to the input current generated by inputs from other neurons.
As a neuron receives input current, its voltage steadily increases until it reaches a threshold, at which point the voltage spikes and the neuron fires an action potential. These spikes induce currents on downstream neurons, as described above. After a cell fires, there is a short refractory period before the neuron can spike again. Thus, there is an upper bound on firing rates, which the hyperbolic tangent is meant to capture.
Now, you would simply use automatic differentiation to compute the necessary gradients to minimize this loss, but we can gain some insight by working them out manually. With some algebra, we can show that the Jacobian of the loss with respect to the dynamics weights (other parameters are similar) is,
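$$
\frac{\partial L}{\partial W}
= \sum_{t} \Big[\mathrm{diag}\big(1 - h_t \odot h_t\big)\, \nabla_{h_t} L\Big]\, h_{t-1}^\top,
\qquad
\nabla_{h_t} L = \nabla_{h_t} \ell_t + A_t\, \nabla_{h_{t+1}} L.
$$

Here At=W⊤diag(1−ht+1⊙ht+1), and this sketch assumes the recursion ht=tanh(Wht−1+bt) and a loss that decomposes over time as L(θ)=∑tℓt(ht;θ). The key point is that the gradient at time t accumulates products of the Jacobians At over all later time steps.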
Backpropagation is the most effective way we know of to train artificial RNNs, so it’s reasonable to think that the brain might be using a similar learning algorithm. Unfortunately, it’s not clear how the backpropagation through time algorithm could be implemented by a neural circuit. The multiplication by At in the gradient recursions amounts to passing information backward across synapses, and canonical synaptic models don’t have mechanisms to do this. Recent years have seen a substantial amount of research into biologically plausible mechanisms of backpropagation.
When the Jacobians At have small eigenvalues (≪1), we run into problems of vanishing gradients. Consider the case of a linear RNN in which the tanh is replaced with identity: then the Jacobians are At=W⊤ for all time steps. If the eigenvalues of W are much less than one, the gradients will decay to zero exponentially quickly, absent strong inputs bt.
Vanishing gradients are especially problematic when xt depends on much earlier observations, xs for s≪t. In that case, the hidden state must propagate information about xs for many time steps during the forward pass, and likewise, the gradient must pass information backward many timesteps during the backward pass. If the weights have small eigenvalues, those gradients will decay and the learning signal will fail to propagate.
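As a quick numerical illustration (a sketch in JAX; the state size K, the weight scale, and the horizons below are made-up values, and final_state is just a helper for this example), we can watch the Jacobian of a late hidden state with respect to an early one shrink as the horizon grows:

```python
import jax
import jax.numpy as jnp

# A linear RNN with no inputs: h_{t+1} = W h_t.
# The Jacobian of h_T with respect to h_1 is then W^{T-1},
# which decays exponentially in T when the eigenvalues of W are below one.
K = 4
W = 0.5 * jax.random.normal(jax.random.PRNGKey(0), (K, K)) / jnp.sqrt(K)

def final_state(h1, T):
    h = h1
    for _ in range(T - 1):
        h = W @ h  # swap in jnp.tanh(W @ h) for the nonlinear case
    return h

h1 = jnp.ones(K)
for T in [2, 5, 10, 20, 40]:
    J = jax.jacobian(final_state)(h1, T)    # dh_T / dh_1
    print(T, float(jnp.linalg.norm(J)))     # norm shrinks exponentially with T
```

Reintroducing the tanh only adds derivative factors in (0, 1] to each Jacobian, so the nonlinearity does not rescue the decay on its own.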
One way to combat the vanishing gradient problem is by modifying the RNN architecture. Architectures like long short-term memory (LSTM) networks achieve this via gated units.
An LSTM has internal (aka “cell”) states ct∈RK and hidden states ht∈RK. The internal states follow conditionally linear dynamics,
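$$
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
$$

(a sketch of the standard form: it∈[0,1]K are input gates and c̃t is a candidate update computed from ht−1 and xt; the exact candidate nonlinearity varies across LSTM variants).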
The bounded entries ft,k∈[0,1] ensure stability. When ft,k≈1, the state is propagated, and when ft,k≈0, the state is forgotten.
Thus, ft=(ft,1,…,ft,K)⊤∈[0,1]K are called the forget gates, and they are parameterized by the matrices W(f) and B(f).
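One standard parameterization (a sketch; the exact arguments of the logistic function σ are an assumption) is

$$
f_t = \sigma\big(W^{(f)} h_{t-1} + B^{(f)} x_t\big).
$$

The hidden state is then read out from the cell state through another gate, for example

$$
h_t = o_t \odot \tanh(c_t),
$$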
where ot∈[0,1]K are the output gates.
As in a vanilla RNN, the final prediction depends on a (generalized) linear function of the hidden state, g(ht;θ).
We can think of an LSTM as an RNN that operates on an extended state (ct,ht)∈R+K×[−1,1]K.
The forget gates let the eigenvalues of Ft be close to one, allowing cell states to be propagated for long periods of time on the forward pass, and gradients to be backpropagated without vanishing on the backward pass.
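Putting the pieces together, a single LSTM step might look like the following sketch (the params layout, gate names, and tanh candidate nonlinearity are illustrative and may differ from the exact variant described in these notes):

```python
import jax.numpy as jnp
from jax.nn import sigmoid

def lstm_step(params, c_prev, h_prev, x):
    """One LSTM step. The params layout {"forget": (W, B), ...} is illustrative."""
    def gate(name, squash):
        W, B = params[name]
        return squash(W @ h_prev + B @ x)

    f = gate("forget", sigmoid)       # f_t in [0,1]^K: how much of c_{t-1} to keep
    i = gate("input", sigmoid)        # i_t in [0,1]^K: how much new content to write
    o = gate("output", sigmoid)       # o_t in [0,1]^K: how much of the cell to expose
    c_tilde = gate("cell", jnp.tanh)  # candidate update (nonlinearity varies by variant)

    c = f * c_prev + i * c_tilde      # conditionally linear dynamics on the cell state
    h = o * jnp.tanh(c)               # hidden state passed to the readout g(h_t; theta)
    return c, h
```

Because the previous cell state enters the update only through the elementwise product with the forget gates, the Jacobian of ct with respect to ct−1 is the diagonal matrix Ft=diag(ft), which is exactly what lets gradients flow when ft≈1.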
There are many variants of gated RNNs. Besides the LSTM, the most commonly used is the gated recurrent unit (GRU), which has a slightly simplified architecture. See Goodfellow et al. (2016) (ch. 10) for more detail.
We motivated RNNs from an autoregressive modeling perspective, but they are useful in other sequential data settings as well. For example, suppose we want to predict the sentiment of a review y∈R given a variable-length sequence of input words x1:T. We can use an RNN to summarize the input sequence in terms of a hidden state for prediction,
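$$
h_t = f(h_{t-1}, x_t), \qquad \hat{y} = g(h_T; \theta),
$$

(a sketch, writing f for the RNN state update, e.g. the tanh recursion above, and g for the readout that maps the final hidden state to a predicted sentiment).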
Sometimes we want to map one sequence x1:T to another sequence y1:T′. The sequences may be of different length; e.g., when we want to translate a sentence from English to French. Again, we can train an encoder RNN to produce a hidden state hT that then becomes the initial condition for a decoder RNN that generates the output sequence. Formally,
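$$
h_t^{\mathrm{enc}} = f\big(h_{t-1}^{\mathrm{enc}}, x_t\big), \qquad
h_0^{\mathrm{dec}} = h_T^{\mathrm{enc}}, \qquad
h_s^{\mathrm{dec}} = f'\big(h_{s-1}^{\mathrm{dec}}, y_{s-1}\big), \qquad
y_s \sim p\big(y_s \mid g(h_s^{\mathrm{dec}}; \theta)\big),
$$

for s=1,…,T′ (a sketch: the enc/dec superscripts and the decoder's conditioning on its previous output ys−1 are illustrative choices).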
In the example above, one challenge is that the hidden state hT obtained by processing x1:T may not adequately represent early inputs like x1. For these purposes, you can use a bidirectional RNN that runs one recursion forward x1,…,xT and another backward xT,…,x1 to produce two hidden states at each time t. These combined states can then be passed into the decoder.
As with deep neural networks that stack layer upon layer, we can stack RNN upon RNN to construct a deeper model. In such models, the outputs of one layer, g(ht(i);θ(i)), become the inputs to the next layer, xt(i+1). Then we can backpropagate gradients through the entire stack to train the model.
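In equations (a sketch, writing f again for the per-layer state update):

$$
x_t^{(i+1)} = g\big(h_t^{(i)}; \theta^{(i)}\big), \qquad
h_t^{(i+1)} = f\big(h_{t-1}^{(i+1)}, x_t^{(i+1)}\big).
$$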
It turns out, we’ve already seen an RNN in this class!
We presented Hidden Markov Models (HMMs) as latent variable models with hidden states z1:T and conditionally independent observations x1:T, but we can also view them as an autoregressive model in which
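$$
p(x_t \mid x_{1:t-1}) = \sum_{k=1}^{K} p(z_t = k \mid x_{1:t-1})\; x_t^\top c_k,
$$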
where xt is a one-hot encoding of a variable that takes values in {1,…,V}, and ck∈ΔV−1 for k=1,…,K are pmfs.
Define the matrix of likelihoods C∈RV×K to have columns ck,
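$$
C = \begin{bmatrix} c_1 & c_2 & \cdots & c_K \end{bmatrix}.
$$

With π denoting the initial state distribution and lt=C⊤xt∈RK the vector of per-state likelihoods of xt, the marginal likelihood is a chain of matrix-vector products (a sketch, taking the rows of the transition matrix P to sum to one):

$$
p(x_{1:T}) = \pi^\top\, \mathrm{diag}(l_1)\, P\, \mathrm{diag}(l_2)\, P \cdots P\, \mathrm{diag}(l_T)\, \mathbf{1},
$$

and we could fit the model by evaluating this expression (in log space, for numerical stability) and maximizing it directly with gradient ascent.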
With automatic differentiation at our disposal, this sounds like it might be a lot easier!
Let’s pursue this idea a little further. First, we’d prefer to do unconstrained optimization, so let’s parameterize the model in terms of logP and logC. When we need the constrained versions, we will just apply the softmax to obtain simplex vectors:
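$$
P_{k,:} = \mathrm{softmax}\big((\log P)_{k,:}\big),
\qquad
c_k = \mathrm{softmax}\big((\log C)_{:,k}\big).
$$

(One common convention, as a sketch: the softmax is applied row-wise to logP, so each row of P is a distribution over next states, and column-wise to logC, so each column ck is a pmf over the V observation values.)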
To maximize the likelihood with gradient ascent, we need the Jacobians ∂L(θ)/∂logP and ∂L(θ)/∂logC. For the RNNs above, we computed them using backpropagation through time, but here we can use an even cooler trick.
First, note that the posterior distribution in an HMM can be written as an exponential family,
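$$
p(z_{1:T} \mid x_{1:T}, \theta) \propto
\exp\Big\{ z_1^\top \log\pi \;+\; \sum_{t=2}^{T} z_{t-1}^\top (\log P)\, z_t \;+\; \sum_{t=1}^{T} x_t^\top (\log C)\, z_t \Big\},
$$

where, as a sketch, zt∈{0,1}K is a one-hot encoding of the latent state and logπ is the log of the initial state distribution. The natural parameters are logπ, logP, and logC; the sufficient statistics are the indicators z1, zt−1zt⊤, and xtzt⊤; and the log-normalizer is exactly the log marginal likelihood L(θ)=logp(x1:T). Since the gradient of a log-normalizer with respect to the natural parameters equals the expected sufficient statistics, differentiating L(θ) with respect to these (log) parameters picks out posterior expectations of those indicators.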
The gradients are essentially the posterior marginals and pairwise marginals we computed in the EM algorithm!
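To make this concrete, here is a minimal sketch in JAX (the function name, the explicit initial-distribution parameter log_pi_unc, and the example sizes are illustrative assumptions, not taken from these notes). It evaluates L(θ) with a log-space forward pass and lets automatic differentiation produce the gradients:

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def hmm_log_likelihood(log_P_unc, log_C_unc, log_pi_unc, xs):
    """log p(x_{1:T}) via the forward algorithm, with unconstrained parameters.
    xs is an integer array of observations, each in {0, ..., V-1}."""
    log_P = jax.nn.log_softmax(log_P_unc, axis=-1)  # rows of P: distributions over next state
    log_C = jax.nn.log_softmax(log_C_unc, axis=0)   # columns c_k: pmfs over observations
    log_pi = jax.nn.log_softmax(log_pi_unc)         # initial state distribution

    log_alpha = log_pi + log_C[xs[0]]               # log alpha_1(k) = log pi_k + log c_k(x_1)
    for x in xs[1:]:
        log_alpha = logsumexp(log_alpha[:, None] + log_P, axis=0) + log_C[x]
    return logsumexp(log_alpha)                     # log p(x_{1:T})

# Autodiff gives the gradients with respect to the unconstrained parameters;
# by the exponential-family argument above, they are built from posterior
# marginals and pairwise marginals.
grads = jax.grad(hmm_log_likelihood, argnums=(0, 1, 2))

# Example with made-up sizes: K = 3 states, V = 5 observation values.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
log_P_unc = jax.random.normal(k1, (3, 3))
log_C_unc = jax.random.normal(k2, (5, 3))
log_pi_unc = jax.random.normal(k3, (3,))
xs = jnp.array([0, 2, 1, 4, 3])
dP, dC, dpi = grads(log_P_unc, log_C_unc, log_pi_unc, xs)
```

The forward pass costs O(TK²), the same as one E-step; in practice you would replace the Python loop with jax.lax.scan and jit-compile it, but the unrolled version keeps the sketch short.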
In the M-step of EM, we solve for the parameters that satisfy the constrained optimality conditions, whereas in SGD we just take a step in the direction of the gradient. EM tends to converge much faster in practice for this reason.
Recurrent neural networks are foundational models for sequential data — both as machine learning tools and as theoretical models of neural computation. The key insight is that any autoregressive model can be parameterized through a fixed-dimensional hidden state updated recursively. Gated architectures like LSTMs address the vanishing-gradient problem by allowing cell states to be propagated across many time steps.