Generalized Linear Models#

Now that we have a better sense for neural spike trains, let’s build probabilistic models that predict neural responses to sensory stimuli or other covariates. These are called encoding models, and ideally, these models will recapitulate summary statistics of interest.

Linear Nonlinear Poisson (LNP) models#

First, consider a single neuron. Let \(y_{t} \in \mathbb{N}_0\) denote the number of spikes it fires in the \(t\)-th time bin. (As before, assume the time bins have length \(\Delta\), typically 5-100 ms.) Let \(\mathbf{x}_t\) denote the covariates at time \(t\). For example, the covariates may be features of a sensory stimulus at time bin \(t\).

A common modeling assumption in neuroscience is that neural spike counts are conditionally Poisson

\[ y_{t} \sim \mathrm{Po}(\lambda(\mathbf{x}_{1:t}) \cdot \Delta), \]

where \(\mathbf{x}_{1:t} = (\mathbf{x}_1, \ldots, \mathbf{x}_t)\) is the stimulus up to and including time \(t\), and where \(\lambda(\mathbf{x}_{1:t})\) is a conditional firing rate that depends on the stimuli.

As written above, the firing rate \(\lambda\) is a rather complex function: it takes in an arbitrarily long stimulus history and outputs a non-negative scalar. We will make a few simplifying assumptions in order to construct our first model.

  1. Assume that \(\lambda\) only depends on a finite set of features of the stimulus history, \(\boldsymbol{\phi}_t = (\phi_1(\mathbf{x}_{1:t}), \ldots, \phi_{D}(\mathbf{x}_{1:t}))^\top \in \mathbb{R}^D\). For example, the features may be the most recent \(D\) frames of the stimulus, corresponding to \(\phi_d(\mathbf{x}_{1:t}) = \mathbf{x}_{t-d}\).

  2. Assume that \(\lambda\) only depends on linear projections of the features, \(\mathbf{w}^\top \boldsymbol{\phi}_t \in \mathbb{R}\), for some weights \(\mathbf{w} \in \mathbb{R}^D\). We will call \(\mathbf{w}^\top \boldsymbol{\phi}_t\) the activation at time \(t\).

  3. Finally, assume that \(\lambda\) maps the activation through a rectifying nonlinearity, \(f: \mathbb{R} \mapsto \mathbb{R}_+\), to obtain a non-negative firing rate.

Altogether, these assumptions imply a linear nonlinear Poisson (LNP) model,

\[ y_t \sim \mathrm{Po}(f(\mathbf{w}^\top \boldsymbol{\phi}_t) \cdot \Delta) \]

Typical choices of rectifying nonlinearity are the exponential function, \(f(a) = e^a\), and the softplus function, \(f(a) = \log (1+e^a)\).
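To make this concrete, here is a minimal simulation sketch (our own illustration, not part of the original derivation) using the softplus nonlinearity. The feature matrix `Phi`, the weights `w`, and the bin size `dt` are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 1000, 10               # number of time bins and features (arbitrary choices)
dt = 0.01                     # bin size Delta, in seconds

Phi = rng.normal(size=(T, D))      # hypothetical features phi_t
w = rng.normal(size=D) * 0.3       # hypothetical weights

def softplus(a):
    return np.log1p(np.exp(a))     # rectifying nonlinearity f(a) = log(1 + e^a)

rates = softplus(Phi @ w)          # firing rates lambda_t = f(w^T phi_t)
y = rng.poisson(rates * dt)        # spike counts y_t ~ Po(f(w^T phi_t) * Delta)
```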

Incorporating spike history#

The model above treats the spike counts \(y_t\) and \(y_{t'}\) as conditionally independent given the stimulus. However, we know this assumption is invalid due to neurons’ refractory period: after a neuron spikes, it cannot spike for at least a few milliseconds. For small time bins, these dependencies matter.

A simple way to address this model misspecification is to allow the firing rate to depend on both the stimulus and the spike history, \(\lambda(\mathbf{x}_{1:t}, \mathbf{y}_{1:t-1})\). We can do so by including the spike history in the features,

\[ \boldsymbol{\phi}_t = \left(\phi_1(\mathbf{x}_{1:t}, \mathbf{y}_{1:t-1}), \ldots, \phi_D(\mathbf{x}_{1:t}, \mathbf{y}_{1:t-1}) \right)^\top. \]

This way, some of our features can capture the stimulus, and others can capture recent spike history. For example, one of our features might be \(\phi_d(\mathbf{x}_{1:t}, \mathbf{y}_{1:t-1}) = y_{t-d}\). In the language of statistical time series models, these spike history terms make this an autoregressive (AR) model.
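As a sketch of how such features might be assembled in practice (the helper below is hypothetical, not from the text), we can append lagged spike counts to the stimulus features:

```python
import numpy as np

def add_spike_history(Phi_stim, y, num_lags):
    """Append lagged spike counts y_{t-1}, ..., y_{t-num_lags} to stimulus features.

    Phi_stim : (T, D_stim) array of stimulus features.
    y        : (T,) array of spike counts for one neuron.
    """
    T = len(y)
    hist = np.zeros((T, num_lags))
    for d in range(1, num_lags + 1):
        hist[d:, d - 1] = y[:-d]       # column d-1 holds y_{t-d}; early bins are zero-padded
    return np.column_stack([Phi_stim, hist])
```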

Exercise

Suppose our features were \(\phi_d(\mathbf{x}_{1:t}, \mathbf{y}_{1:t-1}) = y_{t-d}\) for \(d=1,\ldots,D\). If neurons have a refractory period that prevents firing in two adjacent time bins, what would you expect the best-fitting weights \(\mathbf{w} \in \mathbb{R}^D\) to look like?

Multi-neuronal spike train models#

So far, we’ve considered models for a single neuron. In practice, we will often record from many neurons simultaneously, and we would like our models to capture correlations between neurons.

Let \(\mathbf{y}_t = (y_{t,1}, \ldots, y_{t,N})^\top \in \mathbb{N}_0^N\) denote the vector of spike counts from \(N\) neurons in time bin \(t\). We can generalize the LNP model above as,

\[ y_{t,n} \sim \mathrm{Po}(f(\mathbf{w}_n^\top \boldsymbol{\phi}_t) \cdot \Delta) \]

where the weights \(\mathbf{w}_n \in \mathbb{R}^D\) are specific to neuron \(n\), and where \(\boldsymbol{\phi}_t \in \mathbb{R}^D\) now includes features of the stimulus as well as the spike history of all neurons.

For example, we might have,

\[ \boldsymbol{\phi}_t = (\mathbf{x}_t,\ldots,\mathbf{x}_{t-L}, y_{t-1,1}, \ldots, y_{t-L,1}, \ldots, y_{t-1,N}, \ldots, y_{t-L,N}, 1) \]

where \(L\) is the maximum lag of stimulus and spike history to be considered. The final 1 in \(\boldsymbol{\phi}_t\) is a bias term that allows the model to learn a baseline firing rate.

The entries of \(\mathbf{w}_n\) associated with the features \((y_{t-1,m}, \ldots, y_{t-L,m})^\top\) can be thought of as coupling filters, which model how spikes on neuron \(m\) influence the future firing rate of neuron \(n\).
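The following sketch (a hypothetical helper, with zero-padding conventions of our own choosing) constructs this feature vector for all time bins, grouping the spike-history columns by neuron so that each neuron's block of \(L\) weights in \(\mathbf{w}_n\) plays the role of a coupling filter.

```python
import numpy as np

def build_design_matrix(X, Y, L):
    """Stack phi_t = (x_t, ..., x_{t-L}, spike history of each neuron, 1) for all t.

    X : (T, D_x) stimulus, Y : (T, N) spike counts, L : maximum lag.
    Returns a (T, (L+1)*D_x + N*L + 1) feature matrix; early bins are zero-padded.
    """
    T, N = Y.shape
    stim_lags = [np.vstack([np.zeros((l, X.shape[1])), X[:T - l]]) for l in range(L + 1)]
    spk_lags = []
    for n in range(N):                              # one block of L columns per neuron
        hist_n = np.zeros((T, L))
        for l in range(1, L + 1):
            hist_n[l:, l - 1] = Y[:T - l, n]        # y_{t-l, n}
        spk_lags.append(hist_n)
    bias = np.ones((T, 1))                          # constant feature for the baseline rate
    return np.column_stack(stim_lags + spk_lags + [bias])
```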

Basis function encodings#

The model above has \(\mathcal{O}(N^2 L)\) weights for the coupling filters. For small bin sizes, \(L\) may need to span dozens of past time bins to capture the relevant pairwise interactions. However, these coupling filters are often approximately smooth functions of the time lag. One way to cut down on parameters and capture this smoothness is to use a basis function representation. For example, one of the features can be,

\[ \phi_{m,b}(\mathbf{x}_{1:t}, \mathbf{y}_{1:t-1}) = \sum_{\ell=1}^L y_{t-\ell,m} e^{-\frac{1}{2 \sigma^2}(\ell - \mu_b)^2}. \]

This is a radial basis function encoding of the spike history of neuron \(m\). It is a weighted sum of past spiking, where the weights are a squared exponential (aka Gaussian) kernel centered on delay \(\mu_b\). We can use \(B < L\) basis functions to summarize the spike history over the last \(L\) time bins.
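Here is a minimal sketch of this encoding (our own illustration; the bump centers `centers` and width `sigma` are hypothetical choices):

```python
import numpy as np

def rbf_history_features(y, L, centers, sigma):
    """phi_{m,b} = sum_{ell=1}^L y_{t-ell} * exp(-(ell - mu_b)^2 / (2 sigma^2)).

    y : (T,) spike counts of neuron m; centers : (B,) bump locations mu_b.
    Returns a (T, B) array of basis-summarized spike history.
    """
    T = len(y)
    lags = np.arange(1, L + 1)[:, None]                              # ell = 1, ..., L
    basis = np.exp(-0.5 * (lags - centers[None, :])**2 / sigma**2)   # (L, B) Gaussian bumps
    hist = np.zeros((T, L))
    for l in range(1, L + 1):
        hist[l:, l - 1] = y[:-l]                                     # y_{t-l}
    return hist @ basis                                              # weighted sums over lags

# For example, B = 5 bumps could summarize L = 50 lags:
# features = rbf_history_features(y, 50, np.array([2., 5., 10., 20., 40.]), sigma=3.0)
```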

Exercise

Show that the feature above can be written as a convolution of the spike history with a squared exponential kernel.

Generalized linear models (GLMs)#

The model described above, with stimulus features, spike history terms, and basis function encodings, is what neuroscientists often call “the” generalized linear model (GLM), after Pillow et al. [2008]. Of course, in statistics we know that this model is just one instance of a broad family of GLMs, which are characterized by linear projections of covariates, nonlinear link functions, and exponential family conditional distributions [McCullagh and Nelder, 1983]. In fact, we have already encountered one GLM in this course: the logistic regression model from Unit 1.

Maximum likelihood estimation (MLE)#

As with logistic regression, the GLMs constructed above won’t have closed form solutions for the MLE. However, they will be amenable to the optimization methods discussed previously: (stochastic) gradient descent, Newton’s method, etc.

Let’s follow the same recipe as before to compute the gradient of the negative log likelihood of a GLM with an exponential nonlinearity,

\[\begin{split} \begin{aligned} \mathcal{L}(\mathbf{w}) &= - \log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \\ &= -\sum_{t=1}^T \sum_{n=1}^N \log \mathrm{Po}(y_{t,n} \mid e^{\mathbf{w}_n^\top \boldsymbol{\phi}_t} \cdot \Delta) \\ &= -\sum_{t=1}^T \sum_{n=1}^N \left( -\log y_{t,n}! + y_{t,n} \mathbf{w}_n^\top \boldsymbol{\phi}_t + y_{t,n} \log \Delta - e^{\mathbf{w}_n^\top \boldsymbol{\phi}_t} \cdot \Delta \right) \\ &= -\sum_{t=1}^T \sum_{n=1}^N \left(y_{t,n} \mathbf{w}_n^\top \boldsymbol{\phi}_t - e^{\mathbf{w}_n^\top \boldsymbol{\phi}_t} \cdot \Delta \right) + c \end{aligned} \end{split}\]

where \(c\) is constant wrt \(\mathbf{w}_n\).
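A minimal single-neuron implementation of this loss, dropping the constant \(c\), might look like the following (the names `Phi`, `y`, and `dt` are placeholders):

```python
import numpy as np

def poisson_glm_nll(w, Phi, y, dt):
    """Negative log likelihood of a single-neuron Poisson GLM with exponential
    nonlinearity, up to the additive constant c (the log y! and y log Delta terms)."""
    a = Phi @ w                                 # activations w^T phi_t
    return -(y @ a - dt * np.exp(a).sum())
```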

Separability

The loss function is a sum over neurons, and the weights for neuron \(n\) do not interact with those of any other neuron \(m\). That means we can write the objective function as,

\[ \mathcal{L}(\mathbf{w}) = \sum_{n=1}^N \mathcal{L}_n(\mathbf{w}_n), \]

where

\[ \mathcal{L}_n(\mathbf{w}_n) = - \sum_{t=1}^T \left(y_{t,n} \mathbf{w}_n^\top \boldsymbol{\phi}_t - e^{\mathbf{w}_n^\top \boldsymbol{\phi}_t} \cdot \Delta \right). \]

We can optimize the weights for each neuron separately. (And in parallel, if you like!)
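One way this might look in code (a deliberately simple sketch that relies on SciPy's default quasi-Newton solver and finite-difference gradients, not an efficient fitter) is a loop over neurons, each minimizing its own \(\mathcal{L}_n\):

```python
import numpy as np
from scipy.optimize import minimize

def fit_neurons_separately(Phi, Y, dt):
    """Minimize each neuron's own NLL L_n(w_n); the iterations are independent,
    so they could just as well run in parallel."""
    N, D = Y.shape[1], Phi.shape[1]
    W = np.zeros((N, D))
    for n in range(N):
        def nll(w, y=Y[:, n]):
            a = Phi @ w
            return -(y @ a - dt * np.exp(a).sum())
        W[n] = minimize(nll, np.zeros(D)).x
    return W
```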

Gradients and Hessians#

Now take the gradient wrt the weights of neuron \(n\),

\[\begin{split} \begin{aligned} \nabla_{\mathbf{w}_n} \mathcal{L}_n(\mathbf{w}_n) &= -\sum_{t=1}^T (y_{t,n} - e^{\mathbf{w}_n^\top \boldsymbol{\phi}_t} \cdot \Delta) \boldsymbol{\phi}_t \\ &= -\sum_{t=1}^T (y_{t,n} - \mathbb{E}[y_{t,n} \mid \mathbf{w}_n] ) \boldsymbol{\phi}_t \end{aligned} \end{split}\]

The gradient takes an intuitive form: it is a sum of the feature vectors, \(\boldsymbol{\phi}_t\), weighted by the error \((y_{t,n} - \mathbb{E}[y_{t,n} \mid \mathbf{w}_n])\). Recall that we found the same simple form for the gradient of the logistic regression model. We’ll show that this form is characteristic of a special family of GLMs in just a second.

First, note that the negative log likelihood is again a convex function. We can see that from the Hessian,

\[ \nabla_{\mathbf{w}_n}^2 \mathcal{L}_n(\mathbf{w}_n) = \sum_{t=1}^T (e^{\mathbf{w}_n^\top \boldsymbol{\phi}_t} \cdot \Delta) \, \boldsymbol{\phi}_t \boldsymbol{\phi}_t^\top. \]

Since this is a weighted sum of outer products with positive weights, the Hessian is positive semi-definite (PSD) and the negative log likelihood is convex. Convexity implies that gradient descent and other optimization methods will (under weak conditions) find a global optimum.
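Here is a sketch of Newton's method for one neuron using these closed-form expressions; the small ridge term added to the Hessian is our own numerical safeguard, not part of the model.

```python
import numpy as np

def newton_poisson_glm(Phi, y, dt, num_iters=20, ridge=1e-6):
    """Newton's method for one neuron, using the gradient and Hessian derived above."""
    D = Phi.shape[1]
    w = np.zeros(D)
    for _ in range(num_iters):
        rate = np.exp(Phi @ w) * dt                   # E[y_t | w] = e^{w^T phi_t} * Delta
        grad = -Phi.T @ (y - rate)                    # -sum_t (y_t - E[y_t]) phi_t
        hess = (Phi * rate[:, None]).T @ Phi          # sum_t E[y_t] phi_t phi_t^T
        w = w - np.linalg.solve(hess + ridge * np.eye(D), grad)
    return w
```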

Exponential family distributions#

The reason why the Poisson GLM and logistic regression had such nice mathematical forms is because they are both exponential family GLMs. Exponential family distributions have densities of the form,

\[ p(y; \theta) = h(y) \exp \left\{ \langle t(y), \theta \rangle - A(\theta) \right\} \]

where

  • \(\theta\) is the natural parameter of the distribution

  • \(t(y)\) is a sufficient statistic of the data

  • \(h(y)\) is the base measure, which we won’t worry too much about

  • \(A(\theta)\) is the log normalizer, which ensures the density integrates to one.

In order for the density to integrate to one, the log normalizer must be,

\[ A(\theta) = \log \int h(y) \exp \left\{ \langle t(y), \theta \rangle \right\} \, \mathrm{d} y \]

Many of the distributions we’ve encountered thus far are exponential family distributions: Poisson, Bernoulli, Gaussian, gamma, etc.

Example

The Poisson distribution can be written in exponential family form by rearranging its pmf to,

\[\begin{split} \begin{aligned} \mathrm{Po}(y; \lambda) &= \frac{1}{y!} \lambda^y e^{-\lambda} \\ &= \frac{1}{y!} \exp \left\{ y \log \lambda - \lambda \right\} \\ &= \frac{1}{y!} \exp \left\{ y \theta - e^\theta \right\} \\ \end{aligned} \end{split}\]

where \(t(y) = y\) is the sufficient statistic, \(\theta = \log \lambda\) is the natural parameter, \(A(\theta) = e^\theta\) is the log normalizer, and \(h(y) = 1/y!\) is the base measure.

Exercise

Write the Bernoulli pmf,

\[ \mathrm{Bern}(y; p) = p^y (1-p)^{1-y}, \]

in exponential family form. What is the natural parameter, the sufficient statistic, the base measure, and the log normalizer?

Exponential family GLMs#

An exponential family GLM is one in which the observation distribution is an exponential family and the natural parameter is a linear-nonlinear function of the features,

\[ \theta = g(\mathbf{w}^\top \boldsymbol{\phi}), \]

where \(g: \mathbb{R} \mapsto \mathbb{R}\) is a monotonically increasing function.

Note that this model is defined in terms of the natural parameters \(\theta\) rather than the mean parameters \(\mathbb{E}[t(y) \mid \theta]\) (which for a Poisson distribution is simply \(\mathbb{E}[y \mid \theta]\)). How do the natural and mean parameters relate to one another? Through the gradient of the log normalizer,

\[\begin{split} \begin{aligned} A'(\theta) &\triangleq \frac{\mathrm{d}}{\mathrm{d} \theta} A(\theta) \\ &= \frac{\mathrm{d}}{\mathrm{d} \theta} \log \int h(y) \exp \left\{ \langle t(y), \theta \rangle\right\} \, \mathrm{d}y \\ &= \frac{\int t(y) h(y) \exp \left\{ \langle t(y), \theta \rangle\right\} \, \mathrm{d}y}{\int h(y) \exp \left\{ \langle t(y), \theta \rangle\right\} \, \mathrm{d}y} \\ &= \int t(y) h(y) \exp \left\{ \langle t(y), \theta \rangle - A(\theta) \right\} \, \mathrm{d}y \\ &= \mathbb{E}[t(y) \mid \theta] \end{aligned} \end{split}\]

In short, gradients of the log normalizer yield expected sufficient statistics!
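As a quick sanity check of this identity for the Poisson family (a sketch of our own, not from the notes): \(A(\theta) = e^\theta\), so \(A'(\theta) = e^\theta\) should match the empirical mean of samples drawn at that natural parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
samples = rng.poisson(np.exp(theta), size=100_000)   # Po(lambda) with lambda = e^theta
print(np.exp(theta), samples.mean())                 # A'(theta) vs. the Monte Carlo mean
```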

Now let’s write the GLM in terms of the mapping to the mean parameters,

\[ \begin{aligned} \mathbb{E}[t(y) \mid \mathbf{w}, \boldsymbol{\phi}] = A'(\theta) = f(\mathbf{w}^\top \boldsymbol{\phi}) \end{aligned} \]

where \(f(a) = A'(g(a))\) is the mean function. The canonical mean function is obtained by setting \(g\) to the identity function, so that \(f\) is simply the gradient of the log normalizer.

For example, in the Poisson distribution, the sufficient statistic is \(t(y) = y\), and the log normalizer is \(A(\theta) = e^\theta\). The derivative of the log normalizer is also \(A'(\theta) = e^\theta\), so the canonical Poisson GLM is,

\[ \mathbb{E}[y \mid \mathbf{w}, \boldsymbol{\phi}] = e^{\mathbf{w}^\top \boldsymbol{\phi}}, \]

just like we had above! (Note: the equivalence is exact when bin size is \(\Delta = 1\); the scaling factor doesn’t change the story much.)

Gradients for canonical exponential family GLMs#

Now consider the negative log likelihood in a canonical exponential family GLM,

\[\begin{split} \begin{aligned} \mathcal{L}(\mathbf{w}) &= - \sum_{i=1}^T \log p(y_i \mid \mathbf{w}, \boldsymbol{\phi}_i) \\ &= - \sum_{i=1}^T \left[ \langle t(y_i), \mathbf{w}^\top \boldsymbol{\phi}_i \rangle - A(\mathbf{w}^\top \boldsymbol{\phi}_i) \right] + c \end{aligned} \end{split}\]

The gradient is,

\[\begin{split} \begin{aligned} \nabla \mathcal{L}(\mathbf{w}) &= - \sum_{i=1}^T (t(y_i) - A'(\mathbf{w}^\top \boldsymbol{\phi}_i)) \, \boldsymbol{\phi}_i \\ &= - \sum_{i=1}^T (t(y_i) - \mathbb{E}[t(y_i) \mid \mathbf{w}, \boldsymbol{\phi}_i]) \, \boldsymbol{\phi}_i. \end{aligned} \end{split}\]

Thus, we see that the simple form of the gradient — a sum of features weighted by the corresponding error — follows from a general property of exponential family GLMs.
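A generic gradient routine expressing this form might look as follows (a sketch; the `mean_fn` argument is a placeholder for \(A'\)):

```python
import numpy as np

def canonical_glm_gradient(w, Phi, t_y, mean_fn):
    """Gradient of the NLL for a canonical exponential family GLM:
    -sum_i (t(y_i) - A'(w^T phi_i)) phi_i.

    t_y     : (T,) sufficient statistics t(y_i).
    mean_fn : callable implementing A' (e.g., np.exp for the Poisson family).
    """
    errors = t_y - mean_fn(Phi @ w)      # t(y_i) - E[t(y_i) | w, phi_i]
    return -Phi.T @ errors
```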

GLMs in neuroscience#

Generalized linear models are central tools in systems neuroscience. Work by Paninski [2004], Truccolo et al. [2005], and Pillow et al. [2008] laid the theoretical foundations for much of the experimental work that followed.

  • Park et al. [2014], Yates et al. [2017] applied GLMs to characterize heterogeneous responses in multi-neuronal spike train recordings from lateral intraparietal cortex (LIP) during decision making.

  • Hardcastle et al. [2017] applied a similar analysis to characterize functional cell types in medial entorhinal cortex (MEC), finding grid cells, border cells, and much more.

  • Vidne et al. [2012] characterized the responses of retinal ganglion cells, allowing for correlated noise processes, and Freeman et al. [2015] further refined the model to include common inputs from bipolar cells. McIntosh et al. [2016] framed these models as convolutional neural networks, which we will study in lab.

  • On the methodological side, much work has gone into fitting these models at scale [Ramirez and Paninski, 2014, Zoltowski and Pillow, 2018], extending the models with structured prior distributions [Linderman et al., 2016], making them more biophysically realistic [Latimer et al., 2019], and understanding the extent of spiking responses that GLMs can capture [Weber and Pillow, 2017].