
Gaussian Processes

A Gaussian process (GP) is a distribution over functions. Rather than placing a prior over a finite set of parameters, a GP places a prior over the function itself, giving a flexible non-parametric approach to regression and classification. This chapter covers:

  1. The GP definition and common kernels

  2. Exact GP regression

  3. GP classification via augmentation

  4. Elliptical slice sampling

  5. Sparse GPs and inducing points

import torch
import torch.distributions as dist
import matplotlib.pyplot as plt
import math

palette = list(plt.cm.Set2.colors)
torch.manual_seed(305)


A Gaussian process prior (shaded) and posterior (narrower shaded region) after conditioning on a few observations (dots).

We say $f \sim \mathcal{GP}(\mu(\cdot), K(\cdot, \cdot))$ if for every finite set of inputs $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\} \subset \mathbb{R}^D$, the joint distribution of the function values is Gaussian:

$$
\begin{bmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_N) \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mu(\mathbf{x}_1) \\ \vdots \\ \mu(\mathbf{x}_N) \end{bmatrix},\;
\begin{bmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\
\vdots & & \vdots \\
K(\mathbf{x}_N, \mathbf{x}_1) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N)
\end{bmatrix}
\right).
$$

Here $\mu: \mathbb{R}^D \to \mathbb{R}$ is the mean function and $K: \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ is the kernel (covariance function). The matrix of kernel evaluations is called the Gram matrix. The kernel must be positive definite: the Gram matrix must be positive definite for every finite set of inputs.
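
A minimal sketch of the defining property, assuming the SE kernel introduced in the next section: build the Gram matrix for a handful of inputs, check positive definiteness via a Cholesky factorisation, and draw the corresponding function values from the multivariate normal.

```python
import torch

# Five inputs; SE kernel with length-scale 1 and unit variance (defined formally below)
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
K = torch.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)      # Gram matrix

# The Cholesky factorisation succeeds only if K (plus a small jitter) is positive definite
L = torch.linalg.cholesky(K + 1e-6 * torch.eye(len(x)))

# One draw of [f(x_1), ..., f(x_5)] from N(0, K): exactly the finite marginal above
f = L @ torch.randn(len(x))
print(f)
```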

From Linear Regression to GPs

GPs are the natural limit of Bayesian linear regression with infinitely many features. Recall that with feature map $\boldsymbol{\phi}(x) = (\phi_1(x), \ldots, \phi_P(x))^\top$ and Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \frac{\lambda}{P}\mathbf{I})$, the function $f(x) = \boldsymbol{\phi}(x)^\top \mathbf{w}$ is a GP with kernel

$$
K(x_i, x_j) = \frac{\lambda}{P}\, \boldsymbol{\phi}(x_i)^\top \boldsymbol{\phi}(x_j).
$$

Example: squared exponential kernel. Use radial basis function features $\phi_p(x) = e^{-(x - c_p)^2 / 2\ell^2}$ centred at $P$ equally spaced points. Taking $P \to \infty$:

$$
\lim_{P\to\infty} K(x_i, x_j) = \lambda \int_{-\infty}^\infty e^{-(x_i-c)^2/2\ell^2}\, e^{-(x_j-c)^2/2\ell^2}\,dc = \sqrt{\pi}\,\ell\,\lambda\; e^{-(x_i-x_j)^2/4\ell^2}.
$$

This is the squared exponential (SE) kernel (also called the RBF or Gaussian kernel):

$$
K_{\mathrm{SE}}(x_i, x_j) = \sigma^2 \exp\!\left(-\frac{(x_i-x_j)^2}{2\ell^2}\right).
$$

The SE kernel corresponds to functions that are infinitely differentiable. The length-scale $\ell$ controls how quickly the function varies; $\sigma^2$ is the marginal variance.
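
A quick numerical check of the limit above (a sketch with assumed values $\ell = 0.7$, $\lambda = 1$, and centres spanning a wide interval; the weight prior variance is taken as $\lambda\,\Delta c$ with $\Delta c$ the spacing between centres, so the feature sum is a Riemann approximation of the integral):

```python
import math
import torch

ell, lam = 0.7, 1.0
P = 2000
c = torch.linspace(-10.0, 10.0, P)                 # feature centres
dc = c[1] - c[0]                                   # spacing between centres

def phi(x):                                        # RBF feature vector, shape (P,)
    return torch.exp(-(x - c) ** 2 / (2 * ell ** 2))

xi, xj = 0.3, 1.1
k_features = lam * dc * (phi(xi) * phi(xj)).sum()  # Riemann-sum approximation of the integral
k_closed = math.sqrt(math.pi) * ell * lam * math.exp(-(xi - xj) ** 2 / (4 * ell ** 2))
print(f'feature sum: {k_features.item():.4f}   closed form: {k_closed:.4f}')   # should agree closely
```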

Kernels

The Matérn Family

For many applications (e.g. spatial statistics, physical systems), functions are smooth but not infinitely differentiable. The Matérn family provides a spectrum of smoothness controlled by $\nu > 0$:

$$
K_\nu(x_i, x_j) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,|x_i-x_j|}{\ell}\right)^\nu \mathcal{K}_\nu\!\left(\frac{\sqrt{2\nu}\,|x_i-x_j|}{\ell}\right),
$$

where $\mathcal{K}_\nu$ is the modified Bessel function of the second kind. Special cases (writing $\Delta = x_i - x_j$):

| $\nu$ | Kernel | GP sample paths |
| --- | --- | --- |
| $1/2$ | $e^{-\lvert\Delta\rvert/\ell}$ | Continuous, non-differentiable (OU process) |
| $3/2$ | $\bigl(1 + \frac{\sqrt{3}\,\lvert\Delta\rvert}{\ell}\bigr)\,e^{-\sqrt{3}\,\lvert\Delta\rvert/\ell}$ | Once differentiable |
| $5/2$ | $\bigl(1 + \frac{\sqrt{5}\,\lvert\Delta\rvert}{\ell} + \frac{5\lvert\Delta\rvert^2}{3\ell^2}\bigr)\,e^{-\sqrt{5}\,\lvert\Delta\rvert/\ell}$ | Twice differentiable |
| $\infty$ | $e^{-\lvert\Delta\rvert^2/2\ell^2}$ | Infinitely differentiable (SE) |
Matérn kernel sample paths for $\nu = 1/2, 3/2, 5/2$ and the SE kernel ($\nu=\infty$).

Stationary Kernels and Bochner’s Theorem

Both the SE and Matérn kernels are stationary: $K(\mathbf{x}_i, \mathbf{x}_j)$ depends only on the difference $\boldsymbol{\Delta} = \mathbf{x}_i - \mathbf{x}_j$. Bochner's theorem characterises this class: a continuous stationary function is a valid (positive definite) kernel if and only if it is the Fourier transform of a non-negative, finite measure, its spectral density.

Composing Kernels

Kernels can be combined to encode richer structure:

Common kernel functions and their properties. From Williams & Rasmussen, 1996.

Rules for combining kernels (sums, products, compositions) and what structure each combination encodes. From the Kernel Cookbook.

Duvenaud et al., 2013 automated the search over kernel compositions to find the one that best explains a dataset — an approach they called the Automated Statistician.
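
A minimal sketch of composition in code, assuming the `rbf_kernel`, `matern_kernel`, and `sample_prior` helpers defined later in this chapter: sums and products of valid kernels are again valid kernels.

```python
import torch

# Sum of kernels: a slowly varying SE trend plus rough Matern-1/2 variation
k_sum = lambda a, b: (rbf_kernel(a, b, length_scale=3.0, variance=1.0)
                      + matern_kernel(a, b, length_scale=0.3, variance=0.2, nu=0.5))

# Product of kernels: multiplying two SE kernels gives another SE kernel with a
# shorter effective length-scale (1/ell^2 adds); products are how, e.g.,
# periodic x SE encodes locally periodic structure
k_prod = lambda a, b: rbf_kernel(a, b, length_scale=2.0) * rbf_kernel(a, b, length_scale=0.5)

x = torch.linspace(-4, 4, 200)
paths = sample_prior(x, k_sum, n_samples=3)   # sample paths exhibit both length-scales
```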

GP Regression

Posterior over Function Values

Given observations $(\mathbf{x}_n, y_n)$ with $y_n = f(\mathbf{x}_n) + \epsilon_n$, $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, the posterior over the $N$-vector $\mathbf{f} = [f(\mathbf{x}_1),\ldots,f(\mathbf{x}_N)]^\top$ is Gaussian:

$$
p(\mathbf{f} \mid \mathbf{y}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}', \mathbf{K}'), \qquad
\mathbf{K}' = \bigl(\mathbf{K}^{-1} + \sigma^{-2}\mathbf{I}\bigr)^{-1}, \qquad
\boldsymbol{\mu}' = \mathbf{K}'\bigl(\mathbf{K}^{-1}\boldsymbol{\mu} + \sigma^{-2}\mathbf{y}\bigr).
$$

Marginal Likelihood

Integrating out $\mathbf{f}$ gives a closed-form marginal likelihood:

$$
p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu},\, \mathbf{K} + \sigma^2\mathbf{I}).
$$

This can be used to select hyperparameters (kernel parameters, noise variance) by maximising $\log p(\mathbf{y})$.
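
A minimal, self-contained sketch of that selection on hypothetical toy data: evaluate $\log p(\mathbf{y})$ on a grid of length-scales and keep the best.

```python
import torch

def se_kernel(x1, x2, ell, var=1.0):
    return var * torch.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

torch.manual_seed(0)
x = torch.linspace(-3, 3, 25)
y = torch.sin(x) + 0.1 * torch.randn(25)
noise_var = 0.01

best = None
for ell in [0.1, 0.3, 1.0, 3.0, 10.0]:
    cov = se_kernel(x, x, ell) + noise_var * torch.eye(len(x))
    # log N(y | 0, K + sigma^2 I), the closed-form marginal likelihood
    log_ml = torch.distributions.MultivariateNormal(
        torch.zeros(len(x)), covariance_matrix=cov).log_prob(y)
    print(f'ell = {ell:5.1f}   log p(y) = {log_ml:.2f}')
    if best is None or log_ml > best[1]:
        best = (ell, log_ml)
print('selected length-scale:', best[0])
```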

Posterior Predictive Distribution

For a new input $\mathbf{x}_\star$, the joint distribution of $(\mathbf{y}, y_\star)$ is Gaussian. Applying the Schur complement (Chapter 2) yields:

$$
y_\star \mid \mathbf{x}_\star, \mathcal{D} \sim \mathcal{N}(m_\star,\, v_\star),
$$
$$
m_\star = \mu(\mathbf{x}_\star) + \mathbf{k}_\star^\top(\mathbf{K}+\sigma^2\mathbf{I})^{-1}(\mathbf{y}-\boldsymbol{\mu}),
$$
$$
v_\star = K(\mathbf{x}_\star,\mathbf{x}_\star) + \sigma^2 - \mathbf{k}_\star^\top(\mathbf{K}+\sigma^2\mathbf{I})^{-1}\mathbf{k}_\star,
$$

where $\mathbf{k}_\star = [K(\mathbf{x}_1,\mathbf{x}_\star),\ldots,K(\mathbf{x}_N,\mathbf{x}_\star)]^\top$.

Computational cost. Computing $(\mathbf{K}+\sigma^2\mathbf{I})^{-1}$ via a Cholesky factorisation takes $O(N^3)$ time and $O(N^2)$ memory: the key bottleneck for large datasets. The predictive mean has the form $m_\star = \sum_n \alpha_n y_n$, a linear predictor in the training targets.

GP Classification via Augmentation

For binary observations $y_n \in \{0, 1\}$, a natural model is:

$$
f \sim \mathcal{GP}(\mu(\cdot), K(\cdot, \cdot)), \qquad y_n \sim \mathrm{Bernoulli}\bigl(g(f(\mathbf{x}_n))\bigr),
$$

where $g$ is the probit or logistic link. The posterior $p(\mathbf{f} \mid \{\mathbf{x}_n, y_n\})$ is no longer Gaussian.

Probit Augmentation

Albert & Chib, 1993 and Girolami & Rogers, 2006 exploit the identity

$$
\Phi(f) = \Pr(z \le f),\; z \sim \mathcal{N}(0,1), \qquad \text{equivalently} \qquad \Phi(f) = \int_0^\infty \mathcal{N}(z \mid f, 1)\,dz,
$$

to introduce auxiliary variables $z_n$:

$$
p(y_n \mid z_n) = \begin{cases} \mathbf{1}[z_n > 0] & y_n = 1 \\ \mathbf{1}[z_n \le 0] & y_n = 0 \end{cases}
\qquad z_n \mid f(\mathbf{x}_n) \sim \mathcal{N}\bigl(f(\mathbf{x}_n), 1\bigr).
$$

The augmented posterior factorises cleanly, enabling a Gibbs sampler:

$$
p(\mathbf{f}, \{z_n\} \mid \{\mathbf{x}_n, y_n\}) \propto \prod_n p(y_n \mid z_n)\,\mathcal{N}\bigl(z_n \mid f(\mathbf{x}_n), 1\bigr)\cdot\mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K}).
$$

Conditional updates:

  • $z_n \mid \mathbf{f}, y_n$: truncated Gaussian: $\mathcal{N}(f(\mathbf{x}_n), 1)$ truncated to $(0, \infty)$ if $y_n=1$, to $(-\infty, 0]$ if $y_n=0$.

  • $\mathbf{f} \mid \{z_n\}$: standard GP regression with Gaussian observations $z_n \sim \mathcal{N}(f(\mathbf{x}_n), 1)$. A sketch of the full sampler follows this list.
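
A minimal sketch of the resulting Gibbs sampler on hypothetical toy data, assuming a zero-mean GP prior and an SE kernel defined inline (it does not reuse the helpers from the code cells below):

```python
import torch
import torch.distributions as dist

def se_kernel(x1, x2, ell=1.0, var=1.0):
    return var * torch.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

def sample_truncnorm(mean, positive):
    """Sample z ~ N(mean, 1) truncated to (0, inf) where positive, else (-inf, 0]."""
    std = dist.Normal(0.0, 1.0)
    p0 = std.cdf(-mean)                              # P(z <= 0) under N(mean, 1)
    u = torch.rand_like(mean)
    cdf = torch.where(positive, p0 + u * (1 - p0), u * p0)
    return mean + std.icdf(cdf.clamp(1e-6, 1 - 1e-6))

def probit_gp_gibbs(x, y, n_iters=500, ell=1.0, jitter=1e-4):
    N = len(x)
    K = se_kernel(x, x, ell) + jitter * torch.eye(N)
    # f | z is GP regression with unit-variance observations:
    #   f | z ~ N( K (K + I)^{-1} z,  K - K (K + I)^{-1} K )
    A = torch.linalg.solve(K + torch.eye(N), K)      # (K + I)^{-1} K
    cov_f = K - K @ A
    L_f = torch.linalg.cholesky(0.5 * (cov_f + cov_f.T) + jitter * torch.eye(N))

    f, samples = torch.zeros(N), []
    for _ in range(n_iters):
        z = sample_truncnorm(f, y.bool())            # z_n | f, y_n  (truncated Gaussian)
        f = A.T @ z + L_f @ torch.randn(N)           # f | z        (Gaussian draw)
        samples.append(f.clone())
    return torch.stack(samples)

# Usage on a toy 1-D classification problem
torch.manual_seed(0)
x = torch.linspace(-3, 3, 30)
y = (torch.sin(x) > 0).float()
f_samps = probit_gp_gibbs(x, y)
print('posterior mean of f near x = 0:', f_samps[100:, 15].mean().item())
```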

Elliptical Slice Sampling

For non-conjugate GP models (e.g. GP classification with a logistic link), standard Gibbs can mix poorly because the GP prior induces strong correlations between function values.

Murray et al., 2010 proposed elliptical slice sampling (ESS), which exploits the Gaussian structure of the prior. The key observation is that any draw from $\mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ can be written as

$$
\mathbf{f} = \boldsymbol{\nu}_0 \sin\theta + \boldsymbol{\nu}_1 \cos\theta, \qquad \boldsymbol{\nu}_0, \boldsymbol{\nu}_1 \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}).
$$

Given the current $\mathbf{f}$, a new proposal is made by rotating on the ellipse defined by $\mathbf{f}$ and a fresh auxiliary draw $\boldsymbol{\nu}$:

$$
\mathbf{f}' = \mathbf{f}\cos\theta + \boldsymbol{\nu}\sin\theta, \qquad \boldsymbol{\nu} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}).
$$

The angle $\theta$ is sampled by a slice over $[0, 2\pi)$: start with a random bracket, evaluate the likelihood at the proposal, and shrink the bracket towards the current state whenever the proposal is rejected. A proposal is always accepted eventually (the bracket collapses onto $\theta = 0$, i.e. the current state), and the update leaves the posterior invariant.

ESS proposes moves along an ellipse in function space, automatically respecting the GP prior covariance structure.
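
A minimal sketch of one ESS update and its use for GP classification with a logistic link (hypothetical toy data; the kernel and likelihood are illustrative assumptions):

```python
import math
import torch

def ess_step(f, L_prior, log_lik):
    """One elliptical slice sampling update for a zero-mean Gaussian prior with
    Cholesky factor L_prior and an arbitrary log-likelihood."""
    nu = L_prior @ torch.randn(len(f))              # auxiliary draw nu ~ N(0, Sigma)
    log_u = log_lik(f) + torch.log(torch.rand(()))  # slice height under the current state
    theta = torch.rand(()) * 2 * math.pi            # initial angle
    lo, hi = theta - 2 * math.pi, theta             # initial bracket (contains theta = 0)
    while True:
        f_prop = f * torch.cos(theta) + nu * torch.sin(theta)
        if log_lik(f_prop) > log_u:
            return f_prop                           # accept
        # reject: shrink the bracket towards theta = 0 (the current state) and retry
        if theta < 0:
            lo = theta
        else:
            hi = theta
        theta = lo + (hi - lo) * torch.rand(())

# Usage: GP classification with a logistic link
torch.manual_seed(1)
x = torch.linspace(-3, 3, 30)
y = (torch.sin(x) > 0).float()
Sigma = torch.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-4 * torch.eye(30)
L_prior = torch.linalg.cholesky(Sigma)
log_lik = lambda f: torch.nn.functional.logsigmoid((2 * y - 1) * f).sum()

f, samples = torch.zeros(30), []
for _ in range(1000):
    f = ess_step(f, L_prior, log_lik)
    samples.append(f)
print('posterior mean of f near x = 0:', torch.stack(samples[200:])[:, 15].mean().item())
```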

Sparse GPs and Inducing Points

Exact GP inference costs $O(N^3)$. For large $N$ we need approximations. Inducing point methods (Titsias, 2009) introduce $M \ll N$ pseudo-inputs $\{\mathbf{z}_m\}$ and approximate the full posterior with one conditioned on the inducing function values $\mathbf{u} = [f(\mathbf{z}_1),\ldots,f(\mathbf{z}_M)]^\top$.

Using the GP predictive distribution:

$$
p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\bigl(\mathbf{f} \mid \mathbf{K}_{NM}\mathbf{K}_{MM}^{-1}\mathbf{u},\; \tilde{\mathbf{K}}\bigr), \qquad
\tilde{\mathbf{K}} = \mathbf{K}_{NN} - \mathbf{K}_{NM}\mathbf{K}_{MM}^{-1}\mathbf{K}_{MN}.
$$

Variational Bound

Titsias, 2009 lower-bounds $\log p(\mathbf{y})$ by:

$$
\mathcal{L}_2 = \log \mathcal{N}\bigl(\mathbf{y} \mid \mathbf{0},\; \mathbf{K}_{NM}\mathbf{K}_{MM}^{-1}\mathbf{K}_{MN} + \sigma^2\mathbf{I}\bigr) - \frac{1}{2\sigma^2}\mathrm{tr}\bigl(\tilde{\mathbf{K}}\bigr).
$$

The first term is the likelihood of a low-rank GP; the trace term penalises unexplained prior variance. Computing $\mathcal{L}_2$ costs $O(NM^2 + M^3)$.
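
A minimal sketch of $\mathcal{L}_2$ on hypothetical toy data with a zero-mean GP and an SE kernel (the dense MultivariateNormal call below is $O(N^3)$ and kept only for clarity; the $O(NM^2)$ version rewrites it with the Woodbury identity):

```python
import torch

def se_kernel(x1, x2, ell=1.0, var=1.0):
    return var * torch.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

def titsias_bound(x, y, z, noise_var=0.1, jitter=1e-5):
    N, M = len(x), len(z)
    Kmm = se_kernel(z, z) + jitter * torch.eye(M)
    Knm = se_kernel(x, z)
    Lm = torch.linalg.cholesky(Kmm)
    A = torch.linalg.solve_triangular(Lm, Knm.T, upper=False)        # (M, N)
    Qnn = A.T @ A                                                    # K_nm K_mm^{-1} K_mn
    log_lik = torch.distributions.MultivariateNormal(
        torch.zeros(N), covariance_matrix=Qnn + noise_var * torch.eye(N)).log_prob(y)
    # tr(K_nn - Q_nn); for this SE kernel k(x, x) = 1
    trace_term = (torch.ones(N) - Qnn.diagonal()).sum() / (2 * noise_var)
    return log_lik - trace_term

# Usage: more inducing points give a tighter (larger) bound
torch.manual_seed(0)
x = torch.linspace(-3, 3, 200)
y = torch.sin(x) + 0.3 * torch.randn(200)
for M in (5, 10, 20):
    z = torch.linspace(-3, 3, M)
    print(f'M = {M:3d}   L2 = {titsias_bound(x, y, z, noise_var=0.09).item():.2f}')
```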

Stochastic Variational Inference

Hensman et al., 2013 showed that if we explicitly maintain a variational distribution $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$ rather than marginalising out $\mathbf{u}$, the ELBO factorises over data points:

$$
\mathcal{L}_3 = \sum_{n=1}^N \mathbb{E}_{q(f(\mathbf{x}_n))}\bigl[\log p(y_n \mid f(\mathbf{x}_n))\bigr] - \mathrm{KL}\bigl(q(\mathbf{u}) \,\|\, p(\mathbf{u})\bigr).
$$

This enables mini-batch stochastic gradient ascent, scaling GPs to millions of observations at $O(M^3)$ cost per step.
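
A minimal sketch of the resulting mini-batch training loop for a Gaussian likelihood with fixed inducing locations (hypothetical toy data; the variational parameterisation, learning rate, and batch size are illustrative assumptions, not tuned values):

```python
import math
import torch

def se_kernel(x1, x2, ell=1.0, var=1.0):
    return var * torch.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

torch.manual_seed(0)
N, M, B, noise_var = 2000, 20, 256, 0.05
x = torch.rand(N) * 6 - 3
y = torch.sin(x) + math.sqrt(noise_var) * torch.randn(N)
z = torch.linspace(-3, 3, M)                        # inducing inputs (held fixed here)

Kmm = se_kernel(z, z) + 1e-5 * torch.eye(M)
Lmm = torch.linalg.cholesky(Kmm)
p_u = torch.distributions.MultivariateNormal(torch.zeros(M), scale_tril=Lmm)

# Variational parameters of q(u) = N(m, S) with S = L_S L_S^T
m = torch.zeros(M, requires_grad=True)
L_raw = torch.eye(M).requires_grad_(True)
opt = torch.optim.Adam([m, L_raw], lr=0.01)

for step in range(2000):
    idx = torch.randint(0, N, (B,))                 # mini-batch of data points
    xb, yb = x[idx], y[idx]
    Knm = se_kernel(xb, z)                          # (B, M)
    A = torch.cholesky_solve(Knm.T, Lmm).T          # rows are k_n^T K_mm^{-1}
    # keep the Cholesky factor's diagonal positive via softplus
    L_S = torch.tril(L_raw, -1) + torch.diag(torch.nn.functional.softplus(L_raw.diagonal()))
    # marginals q(f_n) = N(mu_n, v_n)
    mu_f = A @ m
    v_f = 1.0 - (A * Knm).sum(-1) + ((A @ L_S) ** 2).sum(-1)
    # expected Gaussian log-likelihood (closed form), rescaled from the batch to N
    exp_ll = (N / B) * (-0.5 * math.log(2 * math.pi * noise_var)
                        - 0.5 * ((yb - mu_f) ** 2 + v_f) / noise_var).sum()
    q_u = torch.distributions.MultivariateNormal(m, scale_tril=L_S)
    elbo = exp_ll - torch.distributions.kl_divergence(q_u, p_u)
    opt.zero_grad()
    (-elbo).backward()
    opt.step()

print(f'final ELBO estimate: {elbo.item():.1f}')
```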

# ── Kernel functions ──────────────────────────────────────────────────────────

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    '''Squared exponential (RBF) kernel  k(x1, x2) = variance * exp(-r^2 / 2).

    Args:
        x1: (N1,)  1-D inputs
        x2: (N2,)  1-D inputs
    Returns:
        (N1, N2) Gram matrix
    '''
    r2 = ((x1[:, None] - x2[None, :]) / length_scale) ** 2
    return variance * torch.exp(-0.5 * r2)


def matern_kernel(x1, x2, length_scale=1.0, variance=1.0, nu=1.5):
    '''Matern kernel for nu in {0.5, 1.5, 2.5}.

    Args:
        x1: (N1,)  1-D inputs
        x2: (N2,)  1-D inputs
        nu: smoothness parameter (0.5, 1.5, or 2.5)
    Returns:
        (N1, N2) Gram matrix
    '''
    r = (x1[:, None] - x2[None, :]).abs() / length_scale
    if nu == 0.5:
        return variance * torch.exp(-r)
    elif nu == 1.5:
        sr = math.sqrt(3) * r
        return variance * (1 + sr) * torch.exp(-sr)
    elif nu == 2.5:
        sr = math.sqrt(5) * r
        return variance * (1 + sr + sr**2 / 3) * torch.exp(-sr)
    else:
        raise ValueError(f'Unsupported nu={nu}; use 0.5, 1.5, or 2.5')


# ── Exact GP regression ───────────────────────────────────────────────────────

def gp_posterior_predictive(x_train, y_train, x_test, kernel_fn,
                             noise_var=0.1, mu=0.0):
    '''Exact GP regression: posterior predictive mean and std.

    Model:
        f ~ GP(mu, K)
        y_n | f ~ N(f(x_n), noise_var)

    Args:
        x_train:    (N,)   training inputs
        y_train:    (N,)   training targets
        x_test:     (M,)   test inputs
        kernel_fn:  callable(x1, x2) -> (N1, N2) kernel matrix
        noise_var:  observation noise variance
        mu:         prior mean (scalar)
    Returns:
        mean_pred:  (M,)   posterior predictive mean
        std_pred:   (M,)   posterior predictive std
        log_ml:     scalar log marginal likelihood
    '''
    N = len(x_train)
    jitter = 1e-4

    K_nn = kernel_fn(x_train, x_train) + (noise_var + jitter) * torch.eye(N)  # jitter=1e-4
    k_ns = kernel_fn(x_train, x_test)   # (N, M)
    K_ss = kernel_fn(x_test,  x_test)   # (M, M)

    L     = torch.linalg.cholesky(K_nn)                        # (N, N)
    alpha = torch.cholesky_solve((y_train - mu).unsqueeze(1), L).squeeze(1)  # (N,)

    mean_pred = mu + k_ns.T @ alpha                             # (M,)

    v         = torch.linalg.solve_triangular(L, k_ns, upper=False)  # (N, M)
    var_pred   = K_ss.diagonal() - (v * v).sum(0)              # (M,)
    var_pred   = var_pred.clamp(min=0) + noise_var
    std_pred   = var_pred.sqrt()

    # Log marginal likelihood: log N(y | mu*1, K_nn)
    log_ml = (-0.5 * (y_train - mu) @ alpha
              - L.diagonal().log().sum()
              - 0.5 * N * math.log(2 * math.pi))

    return mean_pred, std_pred, log_ml
import numpy as np

# ── Helper: sample prior paths ────────────────────────────────────────────────
def sample_prior(x, kernel_fn, n_samples=5, seed=0):
    torch.manual_seed(seed)
    K = kernel_fn(x, x) + 1e-4 * torch.eye(len(x))
    L = torch.linalg.cholesky(K)
    eps = torch.randn(len(x), n_samples)
    return L @ eps   # (N, n_samples)

# ── GP regression demo ────────────────────────────────────────────────────────
x_all  = torch.linspace(-4, 4, 300)

# True function
f_true = lambda x: torch.sin(x) + 0.5 * torch.sin(3 * x)

# Training data
torch.manual_seed(7)
x_train = torch.tensor([-3.0, -2.0, -1.0, 0.5, 1.5, 2.5, 3.5])
y_train = f_true(x_train) + 0.2 * torch.randn(len(x_train))

# Kernel choice
ell, sigma2 = 1.0, 1.0
kfn = lambda a, b: rbf_kernel(a, b, length_scale=ell, variance=sigma2)

mean_pred, std_pred, log_ml = gp_posterior_predictive(
    x_train, y_train, x_all, kfn, noise_var=0.04)

print(f'Log marginal likelihood: {log_ml:.2f}')

# ── Plot ──────────────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

# Panel 1: prior samples
ax = axes[0]
prior_samples = sample_prior(x_all, kfn, n_samples=6)
for i in range(prior_samples.shape[1]):
    ax.plot(x_all.numpy(), prior_samples[:, i].numpy(),
            lw=1.0, alpha=0.7, color=palette[0])
ax.set_title(r'Samples from GP prior  ($\ell=1$, $\sigma^2=1$)')
ax.set_xlabel('$x$')
ax.set_ylabel('$f(x)$')
ax.set_xlim(-4, 4)

# Panel 2: posterior predictive
ax = axes[1]
mu_np  = mean_pred.numpy()
std_np = std_pred.numpy()
x_np   = x_all.numpy()

ax.plot(x_np, f_true(x_all).numpy(), 'k--', lw=1.2, label='True function')
ax.scatter(x_train.numpy(), y_train.numpy(), s=50, zorder=5,
           color='k', label='Training data')
ax.plot(x_np, mu_np, color=palette[1], lw=1.5, label='Posterior mean')
ax.fill_between(x_np, mu_np - 2*std_np, mu_np + 2*std_np,
                alpha=0.3, color=palette[1], label=r'$\pm 2\sigma$')
ax.set_title('GP regression: posterior predictive')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim(-4, 4)
ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

# ── Kernel comparison plot ────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(13, 4))

kernels = [
    ('SE', lambda a, b: rbf_kernel(a, b, length_scale=1.0)),
    ('Matern 1/2', lambda a, b: matern_kernel(a, b, length_scale=1.0, nu=0.5)),
    ('Matern 3/2', lambda a, b: matern_kernel(a, b, length_scale=1.0, nu=1.5)),
    ('Matern 5/2', lambda a, b: matern_kernel(a, b, length_scale=1.0, nu=2.5)),
]
x_dense = torch.linspace(-4, 4, 300)

# Kernel values vs distance
ax = axes[0]
dists = torch.linspace(0, 3, 200)
x0    = torch.zeros(1)
for i, (name, kfn_i) in enumerate(kernels):
    k_vals = kfn_i(x0, dists).squeeze()
    ax.plot(dists.numpy(), k_vals.detach().numpy(), label=name,
            lw=1.5, color=palette[i])
ax.set_title('Kernel functions $K(0, r)$')
ax.set_xlabel('Distance $r$')
ax.set_ylabel('Covariance')
ax.legend(fontsize=8)

# Sample paths per kernel
ax = axes[1]
for i, (name, kfn_i) in enumerate(kernels):
    s = sample_prior(x_dense, kfn_i, n_samples=1, seed=i)
    ax.plot(x_dense.numpy(), s[:, 0].detach().numpy(),
            label=name, lw=1.3, color=palette[i])
ax.set_title('One sample path per kernel')
ax.set_xlabel('$x$')
ax.set_ylabel('$f(x)$')
ax.legend(fontsize=8)

plt.tight_layout()
plt.show()
Log marginal likelihood: -7.93

Conclusion

| Topic | Key result |
| --- | --- |
| GP prior | Any finite marginal is Gaussian; the kernel encodes structure |
| SE kernel | Limit of linear regression with infinitely many RBF features |
| GP regression | Posterior and predictive in closed form via the Schur complement |
| Marginal likelihood | $\mathcal{N}(\mathbf{y}\mid\boldsymbol{\mu},\, \mathbf{K}+\sigma^2\mathbf{I})$; used to tune hyperparameters |
| GP classification | Probit augmentation converts to Gaussian observations |
| Elliptical slice sampling | Slice sampling that respects GP prior correlations |
| Sparse GPs | $O(NM^2)$ inference; SVI enables mini-batch training |

Key limitation. Exact GP inference costs $O(N^3)$. Sparse / inducing-point methods reduce this, but designing scalable, expressive GP models remains an active research area.

References
  1. Williams, C. K., & Rasmussen, C. E. (1996). Gaussian processes for regression. MIT Press.
  2. Duvenaud, D., Lloyd, J., Grosse, R., Tenenbaum, J., & Ghahramani, Z. (2013). Structure discovery in nonparametric regression through compositional kernel search. International Conference on Machine Learning, 1166–1174.
  3. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422), 669–679.
  4. Girolami, M., & Rogers, S. (2006). Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8), 1790–1817.
  5. Murray, I., Adams, R., & MacKay, D. (2010). Elliptical slice sampling. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 541–548.
  6. Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In D. van Dyk & M. Welling (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (Vol. 5, pp. 567–574). PMLR.
  7. Hensman, J., Fusi, N., & Lawrence, N. D. (2013). Gaussian processes for big data. Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 282–290.