Poisson Processes#

The preceding chapter discussed Poisson GLMs for neural spike counts. Changing notation slightly, we had,

\[ y_{i} \sim \mathrm{Po}(\lambda_i \Delta ) \]

where \(y_i \in \mathbb{N}_0\) denotes the number of spikes in the \(i\)-th time bin, \(\lambda_i \in \mathbb{R}_+\) is the firing rate (in spikes/second), and \(\Delta \in \mathbb{R}_+\) is the bin width (in seconds).

What happens if we take the bin width to zero? Then all but a finite number of bins will have zero counts, and we can instead represent the data as an unordered set of spike times \(\{t_n\}_{n=1}^N \subset [0, T)\). Instead of a vector of Poisson-distributed spike counts, we can model the spike times as a realization of a Poisson process.

Definition#

A Poisson process is a stochastic process that generates sets of points, like the set of spike times above. It is defined by an intensity function or firing rate, \(\lambda(t): [0, T) \mapsto \mathbb{R}_+\).

Let \(N(\mathcal{A}) = |\{n: t_n \in \mathcal{A}\}|\) denote the number of points in the set \(\mathcal{A} \subseteq [0,T)\); for example, \(\mathcal{A}\) could be an interval \([a_0, a_1)\). Let \(N(t)\) be shorthand for \(N([0,t))\), the counting function, which gives the number of points up to time \(t\). Assume \(N(0) = 0\).

A Poisson process has two defining properties:

  1. The number of points in an interval is Poisson distributed with expectation given by the integrated intensity function,

    \[N(\mathcal{A}) \sim \mathrm{Po}\left( \int_{\mathcal{A}} \lambda(t) \, \mathrm{d}t\right).\]
  2. The number of points in \(\mathcal{A}\) is independent of the number of points in any disjoint set \(\mathcal{A}'\),

    \[N(\mathcal{A}) \perp \!\!\! \perp N(\mathcal{A}')\]

    if \(\mathcal{A} \cap \mathcal{A}' = \varnothing\).

A homogeneous Poisson process has a constant intensity function, \(\lambda(t) \equiv \lambda\). Otherwise, the process is called inhomogeneous.

Sampling a Poisson process#

The two properties above imply a method for sampling a Poisson process. First, sample the total number of points from its Poisson distribution,

\[ N \sim \mathrm{Po} \left(\int_0^T \lambda(t) \, \mathrm{d}t \right). \]

Then sample the locations of the points by independently sampling from the normalized intensity,

\[ t_n \overset{\text{iid}}{\sim} \frac{\lambda(t)}{\int_0^T \lambda(t) \, \mathrm{d} t}, \]

to obtain the unordered set \(\{t_n\}_{n=1}^N\).
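
As a concrete illustration, here is a minimal sketch of this two-stage sampler in Python (NumPy). It assumes a hypothetical intensity function `lam` that we can evaluate on a grid; both the integral and the normalized density are approximated on that grid.

```python
import numpy as np

def sample_poisson_process(lam, T, num_grid=1_000, seed=0):
    """Sample an inhomogeneous Poisson process on [0, T) via the two-stage procedure."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0, T, num_grid, endpoint=False)  # left edges of grid cells
    dt = T / num_grid
    lam_grid = lam(grid)

    # Step 1: total number of points, N ~ Po(∫ λ(t) dt), with the integral
    # approximated by a Riemann sum on the grid.
    N = rng.poisson(np.sum(lam_grid) * dt)

    # Step 2: draw N points iid from the normalized intensity, approximated by
    # choosing grid cells with probability ∝ λ and jittering uniformly within each cell.
    cells = rng.choice(num_grid, size=N, p=lam_grid / lam_grid.sum())
    return grid[cells] + dt * rng.random(N)  # unordered set of spike times

# Example: a sinusoidally modulated firing rate on [0, 10) seconds.
spikes = sample_poisson_process(lambda t: 10 * (1 + np.sin(2 * np.pi * t / 5)), T=10.0)
```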

Interval distribution#

For a homogeneous Poisson process, the expected number of points is \(\mathbb{E}[N] = \lambda T\), and the locations are independently sampled according to a uniform distribution over \([0, T)\). The intervals between points, \(\delta_n = t_{(n)} - t_{(n-1)}\), where \(t_{(n)}\) is the \(n\)-th spike time in the ordered set, are exponential random variables,

\[\delta_{n} \overset{\text{iid}}{\sim} \mathrm{Exp}(\lambda).\]

This suggests another means of sampling a homogeneous Poisson process: sample intervals independently until the total elapsed time exceeds \(T\).

As a sanity check, note that the expected value of the intervals is \(\mathbb{E}[\delta_n] = \frac{1}{\lambda}\), so we should get about \(\lambda T\) points for large \(T\).
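
For the homogeneous case, the interval-based sampler is just a few lines. A minimal sketch, assuming a constant rate `lam` in spikes/second:

```python
import numpy as np

def sample_homogeneous_pp(lam, T, seed=0):
    """Sample a homogeneous Poisson process on [0, T) by accumulating
    exponential inter-spike intervals until the elapsed time exceeds T."""
    rng = np.random.default_rng(seed)
    times, t = [], rng.exponential(1 / lam)
    while t < T:
        times.append(t)
        t += rng.exponential(1 / lam)
    return np.array(times)

spikes = sample_homogeneous_pp(lam=20.0, T=5.0)
print(len(spikes), "spikes; expect about", 20.0 * 5.0)
```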

The exponential distribution is memoryless: the distribution of time until the next point arrives is independent of how much time has elapsed since the last point. More formally,

\[ \Pr(\delta_n > a + b \mid \delta_n > a) = \Pr(\delta_n > b). \]

Warning

The memoryless property of Poisson processes should give you pause. Don’t neurons have a refractory period, which sets a lower bound on the interval between spikes? Incorporating these types of dependencies will require us to move beyond Poisson processes, as discussed below.

Poisson process likelihood#

From the two-step sampling procedure above, we can derive the likelihood of a set of points under a Poisson process,

\[ \begin{aligned} p(\{t_n\}_{n=1}^N; \lambda) &= \exp\left\{-\int_0^T \lambda(t) \, \mathrm{d}t \right\} \prod_{n=1}^N \lambda(t_n) \end{aligned} \]

Derivation

The sampling procedure has two steps: sample the number of points, then sample their location. Thus,

\[ \begin{aligned} p(\{t_n\}_{n=1}^N; \lambda) &= p(N) \times \left[ \prod_{n=1}^N p(t_n) \right] \times N! \end{aligned} \]

Where does that trailing \(N!\) come from? The likelihood is defined over unordered sets of points \(\{t_n\}_{n=1}^N\), but the product over \(n\) implicitly assumes an ordering. Since the same unordered set is obtained from any of the \(N!\) orderings of the points, the likelihood sums over them, contributing a factor of \(N!\).

Now expand the equation above and simplify,

\[\begin{split} \begin{aligned} p(\{t_n\}_{n=1}^N; \lambda) &= \mathrm{Po}\left(N; \int_0^T \lambda(t) \, \mathrm{d}t \right) \times \left[ \prod_{n=1}^N \frac{\lambda(t_n)}{\int_0^T \lambda(t) \, \mathrm{d} t} \right] \times N! \\ &= \frac{1}{N!} \left(\int_0^T \lambda(t) \, \mathrm{d}t \right)^N \exp\left\{-\int_0^T \lambda(t) \, \mathrm{d}t \right\} \times \left[ \prod_{n=1}^N \frac{\lambda(t_n)}{\int_0^T \lambda(t) \, \mathrm{d} t} \right] \times N! \\ &= \exp\left\{-\int_0^T \lambda(t) \, \mathrm{d}t \right\} \prod_{n=1}^N \lambda(t_n) \end{aligned} \end{split}\]

Exercise

Derive the likelihood of the set of intervals \(\{\delta_n\}_{n=1}^N\) corresponding to the points \(\{t_n\}_{n=1}^N\). You should get the same form as above.

Maximum likelihood estimation#

In practice, we often want to estimate the intensity function from data. For example, we may assume the intensity function has a parametric form, \(\lambda(t; \theta)\), with parameters \(\theta\). We could estimate the parameters by maximum likelihood estimation, using the Poisson process likelihood above. Note that the log likelihood is a concave function of \(\lambda\), making it amenable to optimization.

The challenge is the integrated intensity in the exponent. For many models, there is no closed-form solution for the integral. For temporal point processes like the ones considered in this chapter, this is simply a one-dimensional integral. We can approximate it using numerical quadrature rules, as long as we can evaluate \(\lambda(t; \theta)\) for any \(t\).
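
As a sketch of the quadrature approach, the following helper approximates the Poisson process log likelihood for any intensity `lam` that we can evaluate pointwise; the midpoint rule is one simple choice of quadrature.

```python
import numpy as np

def pp_log_likelihood(spike_times, lam, T, num_grid=10_000):
    """Approximate log p({t_n}; λ) = -∫_0^T λ(t) dt + Σ_n log λ(t_n),
    using a midpoint rule for the integral."""
    dt = T / num_grid
    midpoints = (np.arange(num_grid) + 0.5) * dt
    integral = np.sum(lam(midpoints)) * dt
    return -integral + np.sum(np.log(lam(spike_times)))
```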

Some classes of models do afford closed-form integration. For example, consider a model of the intensity as a weighted sum of basis functions,

\[ \lambda(t; \theta) = \sum_{b=1}^B w_b \phi_b(t) \]

where \(\theta = (w_1, \ldots, w_B)^\top \in \mathbb{R}_+^B\) are non-negative weights and \(\phi_b: [0,T) \mapsto \mathbb{R}_+\) are non-negative basis functions. Then,

\[ \begin{aligned} \int \lambda(t; \theta) \, \mathrm{d} t &= \sum_{b=1}^B w_b C_b. \end{aligned} \]

where \(C_b = \int \phi_b(t) \, \mathrm{d} t\).

Since we get to choose the basis functions, we can simply choose functions that are easy to integrate, like continuous univariate probability densities. In the special case of rectangular basis functions of width \(\Delta\),

\[\phi_b(t) = \mathbb{I}\big[t \in [(b-1)\Delta, b\Delta) \big],\]

we obtain a piecewise-constant model for the intensity function. This brings us back to the discrete-time model for spike counts from the last chapter.
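
Here is a minimal sketch of the closed-form case with hypothetical Gaussian-bump basis functions, whose integrals over \([0, T)\) are available via the normal CDF. The number of bases, centers, and width below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical basis: B Gaussian bumps with evenly spaced centers and a shared width.
T, B, sigma = 10.0, 8, 0.5
centers = np.linspace(0.0, T, B)

def basis(t):
    """Evaluate all B basis functions at times t; returns an array of shape (len(t), B)."""
    return norm.pdf(np.atleast_1d(t)[:, None], loc=centers, scale=sigma)

# Closed-form integrals C_b = ∫_0^T φ_b(t) dt, via the normal CDF.
C = norm.cdf(T, loc=centers, scale=sigma) - norm.cdf(0.0, loc=centers, scale=sigma)

def log_likelihood(weights, spike_times):
    """Poisson process log likelihood for λ(t; θ) = Σ_b w_b φ_b(t)."""
    integral = weights @ C                    # ∫ λ(t; θ) dt = Σ_b w_b C_b
    rates = basis(spike_times) @ weights      # λ(t_n; θ) at each spike time
    return -integral + np.sum(np.log(rates))
```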

Limit of the discrete-time model#

We can arrive at a Poisson process by taking a limit of the discrete-time model for spike counts. Intuitively, think of the discrete-time model as a Poisson process with piecewise constant intensity,

\[ \lambda(t) = \lambda_{i(t)}, \]

where \(i(t) = \lfloor \frac{t}{\Delta} \rfloor + 1\) is the index of the time bin containing continuous time \(t\) (so that bins are indexed \(i = 1, \ldots, \lfloor \tfrac{T}{\Delta} \rfloor\)), and \(\lambda_{i(t)}\) is the constant intensity in that bin. The discrete-time model ignored the precise timing of spikes within each bin and simply focused on the spike counts.

Sampling a Poisson process with piecewise constant intensity is straightforward: each bin is an independent, homogeneous Poisson process with its own intensity. More precisely, sample

\[\begin{split} \begin{aligned} y_i &\sim \mathrm{Po}(\lambda_{i} \Delta) & \text{for } i&=1,\ldots, \lfloor \tfrac{T}{\Delta} \rfloor \\ t_{i,n} &\overset{\text{iid}}{\sim} \mathrm{Unif}([(i-1) \Delta, i\Delta)) & \text{for } n&=1,\ldots,y_i \end{aligned} \end{split}\]

Then return the union of all spikes, \(\cup_{i} \{t_{i,n}\}_{n=1}^{y_i}\). The total number of spikes is \(N=\sum_i y_i\).
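
A minimal sketch of this bin-by-bin sampler, assuming a vector `lams` of per-bin rates (bins are 0-indexed in the code):

```python
import numpy as np

def sample_piecewise_constant_pp(lams, delta, seed=0):
    """Sample spike times when lams[i] is the constant rate on the
    bin [i * delta, (i + 1) * delta)."""
    rng = np.random.default_rng(seed)
    spikes = []
    for i, lam_i in enumerate(lams):
        y_i = rng.poisson(lam_i * delta)                     # spike count in bin i
        spikes.append(i * delta + delta * rng.random(y_i))   # uniform times within the bin
    return np.concatenate(spikes)

spikes = sample_piecewise_constant_pp(lams=np.array([5.0, 20.0, 10.0]), delta=1.0)
```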

Now let’s derive the likelihood,

\[\begin{split} \begin{aligned} p(\{t_n\}_{n=1}^N; \lambda) &= \prod_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} p(\{t_{i,n}\}_{n=1}^{y_i}) \\ &= \prod_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \mathrm{Po}(y_i; \lambda_{i} \Delta) \left[ \prod_{n=1}^{y_i} \mathrm{Unif}(t_{i,n}; [(i-1)\Delta, i\Delta)) \right] y_i! \\ &= \prod_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \frac{1}{y_i!} (\lambda_{i} \Delta)^{y_i} e^{-\lambda_{i} \Delta} \left(\tfrac{1}{\Delta}\right)^{y_i} y_i! \\ &= \prod_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \lambda_{i}^{y_i} e^{-\lambda_{i} \Delta} \\ &= \exp \bigg\{-\sum_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \lambda_{i} \Delta \bigg\} \prod_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \lambda_{i}^{y_i} \\ \end{aligned} \end{split}\]

Now consider the limit as the bin width \(\Delta\) goes to zero. The sum in the exponent is a Riemann sum, and its limit is the integral of the intensity function. Moreover, as the bin width goes to zero, so does the probability of having more than one spike in a bin; in other words, \(y_i\) is either zero or one. Since \(\lambda_i^0 = 1\), empty bins drop out of the product, and each occupied bin contributes a factor of \(\lambda_i\), which converges to the instantaneous intensity at the spike it contains. Thus, the likelihood becomes a product of instantaneous intensities at the spike times,

\[\begin{split} \begin{aligned} p(\{t_n\}_{n=1}^N; \lambda) &= \lim_{\Delta \to 0} \; \exp \bigg\{-\sum_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \lambda_{i} \Delta \bigg\} \prod_{i=1}^{\lfloor \frac{T}{\Delta} \rfloor} \lambda_{i}^{y_i} \\ &= \exp \bigg\{ -\int_0^T \lambda(t) \, \mathrm{d} t \bigg\} \prod_{n=1}^N \lambda(t_n) \end{aligned} \end{split}\]

Renewal processes#

Let’s return to the interval distribution above. As we saw, under a Poisson process the intervals are exponentially distributed. This isn’t a very realistic model for neural spike trains because neurons have refractory periods.

Renewal processes are a more general class of models that explicitly model the interval distribution. Again, let \(\{\delta_n\}_{n=1}^N\) denote the intervals between the points \(\{t_n\}_{n=1}^N\). In a renewal process,

\[ \delta_n \overset{\text{iid}}{\sim} p(\delta), \]

where \(p(\delta)\) is an interval distribution with support on \(\mathbb{R}_+\). When \(p(\delta)\) is an exponential distribution, we recover the homogeneous Poisson process. To model neural spike trains, we could substitute a more general form like a gamma distribution or an inverse Gaussian distribution.

The inverse Gaussian distribution is appealing due to its connection to Brownian motion: it is the distribution of the first passage time at which a particle undergoing Brownian motion reaches a fixed level. A very simple stochastic integrate-and-fire model of a neuron treats the spike times as the first passage times when the membrane potential reaches a firing threshold. If the membrane potential is modeled as Brownian motion with positive drift, then the inter-spike intervals are inverse Gaussian distributed.
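
As a sketch, the interval-based sampler from before carries over directly; here the inverse Gaussian intervals are drawn with NumPy's `wald` generator, and the mean and scale values are arbitrary illustrative choices.

```python
import numpy as np

def sample_renewal_process(T, mean=0.05, scale=0.2, seed=0):
    """Sample a renewal process on [0, T) whose inter-spike intervals are
    inverse Gaussian (Wald) distributed with the given mean and scale, in seconds."""
    rng = np.random.default_rng(seed)
    times, t = [], rng.wald(mean, scale)
    while t < T:
        times.append(t)
        t += rng.wald(mean, scale)
    return np.array(times)

spikes = sample_renewal_process(T=2.0)
```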

Exercise

A defining property of a Poisson process is that the numbers of events in disjoint intervals are independent. Do renewal processes violate this assumption?

Conditional intensity functions#

So far we have treated the intensity as simply a function of time. Of course, the intensity could also be modeled as a function of external stimuli or covariates, \(\mathbf{x}(t)\). For example, the stimulus could be incorporated into the basis function model described above.

To incorporate direct interactions between the points themselves, we need to generalize our model via a conditional intensity function,

\[ \lambda(t \mid \mathcal{H}_t) \]

where \(\mathcal{H}_t\) represents the history of points up to time \(t\). In the language of stochastic processes, \(\mathcal{H}_t\) is a filtration.

Given the history, the process behaves like a Poisson process. However, as soon as a new point occurs, the intensity function changes to account for it. Intuitively, this allows us to construct continuous time analogs of discrete time autoregressive models.

Hawkes processes#

Perhaps the canonical example of a point process with conditional dependencies is the Hawkes process [Hawkes, 1971]. It is defined by its conditional intensity function,

\[ \lambda(t \mid \mathcal{H}_t) = \lambda_0 + \sum_{t_n \in \mathcal{H}_t} w \, f(t - t_n), \]

where \(f : \mathbb{R}_+ \mapsto \mathbb{R}_+\) is the impulse response function or, following the last chapter, the self-coupling filter. It specifies how past spikes affect the conditional intensity at time \(t\). Without loss of generality, assume the impulse response is normalized so that \(\int_0^\infty f(s) \, \mathrm{d}s = 1\). A common choice is an exponential density with time constant (inverse rate) \(\tau\),

\[ f(s) = \mathrm{Exp}(s; \tau^{-1}) = \tau^{-1} e^{-s / \tau}. \]

This simple formulation of the model has two parameters, the baseline intensity \(\lambda_0 \in \mathbb{R}_+\) and the self-coupling weight \(w \in \mathbb{R}_+\).

Since \(f\) is a positive function, a Hawkes process is self-exciting: past spikes can only increase the future firing rate. Self-excitation can easily become unstable, depending on the self-coupling weight. If \(w \geq 1\), the process becomes unstable because, intuitively, each spike induces more than one future spike in expectation. If \(w < 1\), the stationary firing rate is,

\[ \lambda_{\infty} = \frac{\lambda_0}{1 - w}. \]

When \(w=0\), we recover the standard Poisson process with rate \(\lambda_0\), but as \(w\) increases the self-excitation produces bursts of spikes and a larger stationary rate.
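
One standard way to simulate such a process is Ogata's thinning algorithm (not described above): because the exponential kernel only decays between events, the intensity just after the current time upper-bounds \(\lambda(s \mid \mathcal{H}_s)\) until the next accepted spike, so we can propose candidate points from a homogeneous process at that rate and accept each with probability \(\lambda(t \mid \mathcal{H}_t)/\bar{\lambda}\). A minimal sketch with illustrative parameter values:

```python
import numpy as np

def sample_hawkes(lam0, w, tau, T, seed=0):
    """Simulate a univariate Hawkes process with exponential impulse response
    f(s) = exp(-s / tau) / tau, using Ogata's thinning algorithm."""
    rng = np.random.default_rng(seed)
    spikes, t = [], 0.0
    while True:
        # The intensity just after time t bounds λ(s | H_s) for all s > t until the
        # next accepted spike, because each exponential kernel only decays.
        past = np.array(spikes)
        lam_bar = lam0 + (w / tau) * np.sum(np.exp(-(t - past) / tau))
        t += rng.exponential(1 / lam_bar)       # propose the next candidate point
        if t >= T:
            return np.array(spikes)
        lam_t = lam0 + (w / tau) * np.sum(np.exp(-(t - past) / tau))
        if rng.random() < lam_t / lam_bar:      # accept with probability λ(t)/λ̄
            spikes.append(t)

spikes = sample_hawkes(lam0=5.0, w=0.5, tau=0.1, T=10.0)
print(len(spikes) / 10.0, "spikes/sec vs. stationary rate", 5.0 / (1 - 0.5))
```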

Multivariate Hawkes processes#

Now let’s generalize the Hawkes process to multiple interacting point processes. Let \(M\) denote the number of processes — e.g., the number of neurons in a multi-neuronal spike train recording — and let \(\lambda_m(t \mid \mathcal{H}_t)\) denote the conditional intensity function for the \(m\)-th process. Let \(\mathcal{H}_{t,m}\) denote the history of events on process \(m\) and \(\mathcal{H}_t = \{\mathcal{H}_{t,m}\}_{m=1}^M\) denote the combined history of all processes.

The multivariate Hawkes process has conditional intensities,

\[ \lambda_m(t \mid \mathcal{H}_t) = \lambda_{0,m} + \sum_{m'=1}^M \sum_{t_n \in \mathcal{H}_{t,m'}} w_{m',m} \, f(t - t_n), \]

where the weight \(w_{m',m} \geq 0\) specifies the influence that spikes on process \(m'\) have on the future rate of process \(m\).

Let \(\mathbf{W} \in \mathbb{R}_+^{M \times M}\) denote the matrix of weights. In the multivariate case, stability is determined by the eigenvalues of the weight matrix. The multivariate process is stable if the spectral radius of \(\mathbf{W}\) is less than one. Then, the stationary rates are,

\[ \boldsymbol{\lambda}_\infty = (\mathbf{I} - \mathbf{W})^{-1} \boldsymbol{\lambda}_0 \]

Spectral radius and the Perron-Frobenius theorem

The spectral radius of a matrix is the maximum absolute value of its eigenvalues. Since \(\mathbf{W}\) is a non-negative, real-valued matrix, the Perron-Frobenius theorem says that the maximum eigenvalue is real-valued and non-negative.
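
A quick numerical check of stability and the stationary rates. Here the matrix is oriented so that `W[m, m_prime]` holds \(w_{m',m}\), the weight from process \(m'\) onto process \(m\); that orientation is an assumption chosen to match the formula above, and the numbers are arbitrary.

```python
import numpy as np

# Hypothetical weights: W[m, m_prime] is the influence of process m_prime on process m.
W = np.array([[0.2, 0.3],
              [0.1, 0.4]])
lam0 = np.array([1.0, 2.0])

spectral_radius = np.max(np.abs(np.linalg.eigvals(W)))
assert spectral_radius < 1, "unstable: spectral radius must be less than one"

lam_inf = np.linalg.solve(np.eye(2) - W, lam0)   # stationary rates (I - W)^{-1} λ_0
print(spectral_radius, lam_inf)
```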

Maximum likelihood estimation for Hawkes processes#

The likelihood of a Hawkes process has the same form as that of a Poisson process, but with the conditional intensity in place of the intensity. Consider the univariate case (the multivariate Hawkes process follows the same form),

\[ \begin{aligned} p(\{t_n\}_{n=1}^N; \theta) &= \exp \bigg\{-\int_0^T \lambda(t \mid \mathcal{H}_t; \theta) \, \mathrm{d} t\bigg\} \prod_{n=1}^N \lambda(t_n \mid \mathcal{H}_{t_n}; \theta), \end{aligned} \]

where \(\theta = (\lambda_0, w)\) are the two parameters.

For Hawkes processes, the integral can be simplified,

\[ \begin{aligned} \int_0^T \lambda(t \mid \mathcal{H}_t; \theta) \, \mathrm{d} t &= \lambda_0 T + \sum_{n=1}^N w \int_{t_n}^T f(t - t_n) \, \mathrm{d}t. \end{aligned} \]

If \(f\) has bounded support or decays quickly to zero — e.g., when \(f(s) = \mathrm{Exp}(s; \tau^{-1})\) the density is close to zero after delays greater than a few time constants \(\tau\) — then we can approximate the integral as,

\[ \begin{aligned} \int_0^T \lambda(t \mid \mathcal{H}_t; \theta) \, \mathrm{d} t &\approx \lambda_0 T + w N. \end{aligned} \]

Maximum likelihood estimation boils down to a convex optimization problem: minimize the negative log likelihood, which is convex and easy to evaluate, subject to non-negativity constraints. This problem is straightforward to solve with CVXPY, for example.
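
A minimal sketch of that convex program in CVXPY, using the exponential impulse response and the integral approximation above; the helper function and its variable names are illustrative, not a library API.

```python
import numpy as np
import cvxpy as cp

def fit_hawkes_mle(spike_times, T, tau):
    """Maximum likelihood estimates of (λ0, w) for a univariate Hawkes process with
    exponential impulse response, using the approximation ∫ λ dt ≈ λ0 T + w N."""
    t = np.sort(spike_times)
    # s[n] = Σ_{m < n} f(t_n - t_m): accumulated impulse responses at each spike.
    diffs = np.abs(t[:, None] - t[None, :])
    s = np.sum(np.tril(np.exp(-diffs / tau) / tau, k=-1), axis=1)

    lam0, w = cp.Variable(nonneg=True), cp.Variable(nonneg=True)
    nll = lam0 * T + w * len(t) - cp.sum(cp.log(lam0 + w * s))
    cp.Problem(cp.Minimize(nll)).solve()
    return lam0.value, w.value
```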

Poisson superposition and thinning#

Just like the sum of independent Poisson random variables, the superposition of two independent Poisson processes is another Poisson process. If,

\[\begin{split} \begin{aligned} \{t_{0,n}\}_{n=1}^{N_0} &\sim \mathrm{PP}(\lambda_0(t)) \\ \{t_{1,n}\}_{n=1}^{N_1} &\sim \mathrm{PP}(\lambda_1(t)) \end{aligned} \end{split}\]

are two independent Poisson processes, then their superposition is a Poisson process as well,

\[ \{t_{0,n}\}_{n=1}^{N_0} \cup \{t_{1,n}\}_{n=1}^{N_1} \sim \mathrm{PP}(\lambda_0(t) + \lambda_1(t)). \]

Likewise, if a Poisson process intensity is a sum of non-negative functions, we can use Poisson thinning to separate the points into independent processes. Suppose

\[ \{t_n\}_{n=1}^N \sim \mathrm{PP}(\lambda_0(t) + \lambda_1(t)). \]

Introduce independent binary variables,

\[ z_n \sim \mathrm{Bern}\left( \frac{\lambda_1(t_n)}{\lambda_0(t_n) + \lambda_1(t_n)} \right) \]

to separate the events. Then,

\[\begin{split} \begin{aligned} \{t_n: z_n = 0\} &\sim \mathrm{PP}(\lambda_0(t)) \\ \{t_n: z_n = 1\} &\sim \mathrm{PP}(\lambda_1(t)) \end{aligned} \end{split}\]
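
A small sketch of both identities with constant rates: sample the superposed process directly, then thin it into its two components (the rates are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
T, lam_a, lam_b = 100.0, 3.0, 7.0

# Superposition: sample the combined process directly at rate λ_a + λ_b.
N = rng.poisson((lam_a + lam_b) * T)
times = np.sort(rng.uniform(0, T, N))

# Thinning: assign each point to one of the two processes with probability ∝ its rate.
z = rng.random(N) < lam_b / (lam_a + lam_b)
times_a, times_b = times[~z], times[z]
print(len(times_a) / T, len(times_b) / T)   # empirical rates ≈ λ_a and λ_b
```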

Hawkes processes as cascades of Poisson processes#

We can use Poisson superposition and thinning to reinterpret the Hawkes process as a cascade of Poisson processes. Note that the conditional intensity of a Hawkes process is the sum of non-negative intensity functions: a homogeneous background rate plus weighted impulse responses. An equivalent way to sample the Hawkes process is to sample points from independent Poisson processes for each term in the intensity function.

  1. First, sample points from the background process,

    \[\mathcal{T}_0 = \{t_{0,n}\}_{n=1}^{N_0} \sim \mathrm{PP}(\lambda_0)\]
  2. Then, for each point \(t_n \in \mathcal{T}_0\), sample new points from a Poisson process whose intensity is its weighted impulse response,

    \[\mathcal{T}_n = \{t_{n'}\}_{n'=1}^{N_n} \sim \mathrm{PP}(w f(t - t_n))\]
  3. Repeat this process recursively for each newly sampled point.

  4. Return the union of all sampled points, \(\cup_n \mathcal{T}_n\).

From this perspective, we can view the Hawkes process as a cascade of Poisson processes. Each point can spawn a new generation of child points.

This perspective also sheds light on the weights. When the impulse response is normalized, we can think of the weight as the expected number of children. If the expected number of children is greater than 1, then the process above becomes unstable and may not halt.
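
Here is a minimal sketch of the cascade sampler with the exponential impulse response: background points arrive at rate \(\lambda_0\), and each point spawns a \(\mathrm{Po}(w)\)-distributed number of children at \(\mathrm{Exp}(\tau)\) delays, discarding any that land beyond \(T\). Parameter values are illustrative.

```python
import numpy as np
from collections import deque

def sample_hawkes_cascade(lam0, w, tau, T, seed=0):
    """Sample a Hawkes process as a cascade of Poisson processes: background points
    arrive at rate λ0, and each point spawns Po(w) children at Exp(τ) delays."""
    rng = np.random.default_rng(seed)
    # Background generation: a homogeneous Poisson process on [0, T).
    background = rng.uniform(0, T, rng.poisson(lam0 * T))
    spikes, queue = [], deque(background)
    while queue:
        parent = queue.popleft()
        spikes.append(parent)
        # Children of this point: a Poisson process with intensity w f(t - parent).
        children = parent + rng.exponential(tau, rng.poisson(w))
        queue.extend(children[children < T])   # discard children beyond the horizon
    return np.sort(np.array(spikes))

spikes = sample_hawkes_cascade(lam0=5.0, w=0.5, tau=0.1, T=10.0)
```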

Finally, we can use this perspective to formulate the Hawkes process as a latent variable model. Each point has a latent parent: either the background process or a preceding point. If we knew the parents, the Hawkes process would decompose into independent Poisson processes. We can leverage this formulation to do efficient Bayesian inference in Hawkes processes [Linderman and Adams, 2014].

Nonlinear Hawkes processes#

A limitation of Hawkes processes is that they are purely excitatory. If the weights were negative, the intensity function could be negative as well, and that would be an invalid point process.

One way around this is via nonlinear Hawkes processes, where the intensity function is,

\[ \lambda(t \mid \mathcal{H}_t) = g \left( w_0 + \sum_{t_n \in \mathcal{H}_t} w \, f(t - t_n) \right), \]

where \(g: \mathbb{R} \mapsto \mathbb{R}_+\) is a rectifying nonlinearity. This is essentially the continuous-time analog of the GLMs from the preceding chapter. Unfortunately, many of the nice properties of regular Hawkes processes are lost in this formulation:

  • The integrated intensity no longer has a simple closed-form solution.

  • The process can no longer be viewed as a superposition.

Still, we can approximate the log likelihood with numerical quadrature methods, and in some cases we can use fancy augmentation schemes to perform Bayesian inference [Zhou et al., 2019].

Conclusion#

  • Point processes are stochastic processes that produce discrete sets of points. They are fundamental tools for modeling neural spike trains, where the points are spike times.

  • The Poisson process is the fundamental point process from which more complex models are constructed.

  • There are many ways to sample a Poisson process. We showed two: the “two-stage” approach of sampling the number of points and then their times; and the “interval” approach of repeatedly sampling inter-spike intervals.

  • Both approaches led to the same expression for the Poisson process likelihood, whose logarithm is a concave function of the intensity.

  • However, the same simplifying assumptions that make Poisson processes so mathematically elegant also limit their utility for modeling data with dependencies. Renewal processes and Hawkes processes extend the Poisson process, without completely sacrificing tractability.

  • Finally, we saw how Hawkes processes can be viewed as cascades of Poisson processes thanks to the Poisson superposition and thinning principles.

To learn more about Poisson processes, the definitive reference is Kingman [1992].