
Continual Learning

A central assumption of classical statistical learning is that training data are drawn i.i.d. from a fixed distribution. Real-world systems routinely violate this: a medical diagnosis model must incorporate new disease variants without forgetting old ones; a language model deployed in production should update on new domains without degrading on existing benchmarks; a robot should accumulate motor skills over its lifetime rather than relearning each task from scratch. Continual learning (also called lifelong learning or sequential learning) is the study of how to learn from a non-stationary stream of data or tasks while retaining previously acquired knowledge (De Lange et al., 2021).

Problem Formulation

Task Sequences and the Forgetting Problem

A continual learning agent encounters a sequence of tasks $\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_T$. Each task $\mathcal{T}_t$ provides a dataset

$$\mathcal{D}_t = \{(x_n^{(t)}, y_n^{(t)})\}_{n=1}^{N_t}$$

drawn from a task-specific distribution $p_t(x, y)$. The agent has access to $\mathcal{D}_t$ only while training on task $t$; it cannot revisit $\mathcal{D}_1, \ldots, \mathcal{D}_{t-1}$ in full. The goal is to find parameters $\theta$ that perform well across all tasks simultaneously:

$$\min_\theta \frac{1}{T} \sum_{t=1}^T \mathcal{L}_t(\theta), \qquad \mathcal{L}_t(\theta) = \mathbb{E}_{(x,y)\sim p_t}[\ell(\theta; x, y)].$$

The difficulty is that naive sequential training — minimizing $\mathcal{L}_t$ at each step with gradient descent — causes catastrophic forgetting (McCloskey & Cohen, 1989): optimizing for the current task moves parameters into a region that performs poorly on previous tasks, and the loss of prior knowledge can be abrupt and nearly total.

Scenarios

Three standard scenarios differ in what information is available at test time:

  1. Task-incremental learning: the task identity $t$ is given at test time, so the model may use task-specific heads or modules.

  2. Domain-incremental learning: the task identity is not given, but the output space is shared across tasks; only the input distribution shifts.

  3. Class-incremental learning: the task identity is not given and each task introduces new classes, so the model must discriminate among all classes seen so far. This is the hardest setting.

Evaluation Metrics

After training on all $T$ tasks, let $a_{t', t}$ denote the accuracy on task $t$ immediately after training on task $t'$. Key metrics are the average accuracy after the final task, $\mathrm{ACC} = \frac{1}{T}\sum_{t=1}^T a_{T,t}$, and backward transfer, $\mathrm{BWT} = \frac{1}{T-1}\sum_{t=1}^{T-1} (a_{T,t} - a_{t,t})$, which is negative when later training degrades performance on earlier tasks — i.e., when forgetting occurs.
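
To make these metrics concrete, here is a minimal NumPy sketch that computes average accuracy and backward transfer from a full accuracy matrix. The accuracy values are hypothetical, chosen to illustrate a run in which task 0 is progressively forgotten:

```python
import numpy as np

# Accuracy matrix: a[i, j] = accuracy on task j after training on task i.
# Hypothetical numbers for a 3-task run, illustrating forgetting of task 0.
a = np.array([
    [0.95, 0.10, 0.10],
    [0.70, 0.93, 0.12],
    [0.55, 0.80, 0.91],
])
T = a.shape[0]

# Average accuracy over all tasks after training on the final task.
avg_acc = a[-1].mean()

# Backward transfer: change in earlier-task accuracy relative to just after
# each task was learned (negative values indicate forgetting).
bwt = np.mean([a[-1, t] - a[t, t] for t in range(T - 1)])

print(avg_acc, bwt)
```

Here `avg_acc` is about 0.753 and `bwt` is −0.265: training on tasks 1 and 2 cost, on average, 26.5 points of accuracy on the tasks learned before them.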

Catastrophic Forgetting

Why Gradient Descent Forgets

Let $\theta^*_{t-1}$ minimize the loss on all tasks seen so far. When we minimize $\mathcal{L}_t(\theta)$ starting from $\theta^*_{t-1}$, the gradient $\nabla_\theta \mathcal{L}_t(\theta^*_{t-1})$ points in a direction that reduces the current task loss. Nothing in this gradient respects the curvature of the previous losses $\mathcal{L}_1, \ldots, \mathcal{L}_{t-1}$: the step may move parameters to a region of high loss for earlier tasks.

The severity of forgetting depends on task similarity and parameter overlap. If tasks use largely disjoint subsets of parameters, interference is small. If the same parameters are critical for multiple tasks, any update for one task can disrupt the others.

The Stability–Plasticity Dilemma

Continual learning requires balancing two competing pressures:

  1. Stability: retain performance on previously learned tasks.

  2. Plasticity: remain able to acquire new tasks efficiently.

A model that is maximally stable (e.g., frozen weights) cannot learn new tasks. A model that is maximally plastic (e.g., standard SGD) forgets immediately. Effective continual learning algorithms navigate the trade-off between these two extremes.

Regularization-Based Methods

Regularization methods augment the loss for each new task with a penalty that discourages large changes to parameters that were important for past tasks.

Elastic Weight Consolidation

Elastic weight consolidation (EWC) (Kirkpatrick et al., 2017) approximates the posterior over parameters after task $t-1$ as a Gaussian centered on the MAP estimate $\theta^*_{t-1}$:

$$p(\theta \mid \mathcal{D}_{1:t-1}) \approx \mathcal{N}(\theta;\, \theta^*_{t-1},\, F_{t-1}^{-1}),$$

where $F_{t-1}$ is the Fisher information matrix, approximated by its diagonal:

$$F_{ii} = \mathbb{E}_{(x,y)\sim p_{t-1}}\!\left[\left(\frac{\partial \log p(y \mid x, \theta)}{\partial \theta_i}\Bigg|_{\theta = \theta^*_{t-1}}\right)^{\!2}\right] \approx \frac{1}{N}\sum_{n=1}^N \left(\frac{\partial \log p(y_n \mid x_n, \theta^*_{t-1})}{\partial \theta_i}\right)^{\!2}.$$

The diagonal Fisher $F_{ii}$ measures how much the log-likelihood changes when $\theta_i$ moves away from $\theta^*_{t-1}$: large $F_{ii}$ means $\theta_i$ is important for task $t-1$ and should be protected.

Using the previous posterior as a prior for the new task and taking the MAP gives the EWC loss:

$$\mathcal{L}_\mathrm{EWC}(\theta) = \mathcal{L}_t(\theta) + \frac{\lambda}{2} \sum_i F_{ii}\,(\theta_i - \theta^*_{t-1,i})^2.$$

This is a weighted $\ell_2$ proximal penalty: parameters with high Fisher weight are kept close to their previous values, while unimportant parameters are free to change.

For a sequence of tasks, the Fisher information from all previous tasks accumulates. In the online EWC variant (Schwarz et al., 2018), a single running estimate of the Fisher is maintained rather than storing one Fisher matrix per task.
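
The diagonal-Fisher computation and the resulting penalty can be sketched in a few lines of NumPy. The example below uses a toy logistic-regression model; `theta_prev` stands in for the (hypothetical) optimum of the previous task:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy task data and a hypothetical previous-task optimum theta_prev.
X = rng.normal(size=(200, 5))
theta_prev = rng.normal(size=5)
y = (sigmoid(X @ theta_prev) > 0.5).astype(float)

# Diagonal Fisher: average squared per-example score at theta_prev.
p = sigmoid(X @ theta_prev)
score = (y - p)[:, None] * X          # d/dtheta of log p(y | x, theta)
F_diag = np.mean(score**2, axis=0)

def ewc_penalty(theta, lam=1.0):
    # (lambda / 2) * sum_i F_ii * (theta_i - theta_prev_i)^2
    return 0.5 * lam * np.sum(F_diag * (theta - theta_prev) ** 2)
```

The penalty vanishes at `theta_prev` and grows quadratically as parameters drift, fastest along high-Fisher directions.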

Synaptic Intelligence

Synaptic intelligence (SI) (Zenke et al., 2017) estimates parameter importance online during training rather than post hoc. It tracks the cumulative contribution of each parameter to the decrease in the loss along the optimization trajectory. For a parameter $\theta_i$ moving from $\theta_{i,0}$ to $\theta_{i,T_t}$ during training on task $t$, the importance is:

$$\Omega_i^t = \frac{\sum_{k} \delta\theta_{i,k}\, (-\partial_{\theta_i} \mathcal{L}_t)_k}{\left(\Delta\theta_{i}^t\right)^2 + \xi},$$

where the sum is over optimizer steps $k$, $\delta\theta_{i,k}$ is the parameter update at step $k$, and $\xi$ is a small damping constant. The numerator is the inner product of the parameter path with the negative gradient — large when the parameter moved in directions that actually reduced the loss. The denominator normalizes by the total displacement $\Delta\theta_i^t = \theta_{i,T_t} - \theta_{i,0}$.

Accumulated importance across tasks, $\Omega_i = \sum_{t' < t} \Omega_i^{t'}$, defines the regularization:

$$\mathcal{L}_\mathrm{SI}(\theta) = \mathcal{L}_t(\theta) + c \sum_i \Omega_i\,(\theta_i - \theta^*_{t-1,i})^2.$$

SI requires no additional forward passes or Hessian computations, making it computationally lighter than EWC.
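
The path-integral accumulator is easy to sketch. The NumPy example below runs SGD on a hypothetical two-parameter quadratic loss (not from the original paper) in which parameter 0 matters far more than parameter 1, and checks that the online importance ranks them accordingly:

```python
import numpy as np

# Quadratic toy loss L(theta) = 0.5 * sum(h * theta^2); gradient = h * theta.
h = np.array([10.0, 0.1])          # parameter 0 affects the loss much more
theta = np.array([1.0, 1.0])
lr, xi = 0.05, 1e-3

omega_unnorm = np.zeros(2)         # running sum_k delta_theta_k * (-grad_k)
theta0 = theta.copy()
for _ in range(100):
    grad = h * theta
    delta = -lr * grad             # plain SGD step
    omega_unnorm += delta * (-grad)
    theta += delta

# Normalize by squared displacement, with damping xi.
Omega = omega_unnorm / ((theta - theta0) ** 2 + xi)
```

Parameter 0, whose movement accounted for nearly all of the loss decrease, ends up with a much larger importance `Omega[0]` than parameter 1.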

Bayesian Continual Learning

The Bayesian perspective gives a principled foundation for regularization-based continual learning and connects it to the sequential inference algorithms we have studied throughout this course.

Sequential Bayes

If tasks arrive sequentially and the parameters are shared, the exact Bayesian update after observing $\mathcal{D}_t$ is:

$$p(\theta \mid \mathcal{D}_{1:t}) = \frac{p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1})}{p(\mathcal{D}_t \mid \mathcal{D}_{1:t-1})}.$$

This is sequential Bayes: the posterior after task $t-1$ becomes the prior for task $t$. If we could maintain the exact posterior, there would be no forgetting — all past information is encoded in the current prior. The challenge is computational: the exact posterior is generally intractable and grows in complexity with each new task.

Bayesian Online Linear Regression

Many continual learning problems reduce, in their simplest form, to online estimation of a shared parameter vector from a streaming sequence of observations. When the model is linear and the noise is Gaussian, sequential Bayes is tractable and exactly equivalent to Kalman filtering — the same algorithm used for state-space models in Linear Dynamical Systems.

Setup. Suppose we observe a stream of regression pairs $(x_1, y_1), (x_2, y_2), \ldots$ where $x_t \in \mathbb{R}^D$ and $y_t \in \mathbb{R}$, generated by a linear model:

$$y_t = \theta^\top x_t + \epsilon_t, \qquad \epsilon_t \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2).$$

We place a Gaussian prior on the shared parameter $\theta \in \mathbb{R}^D$, and the goal is to maintain the posterior $p(\theta \mid y_{1:t})$ after each new pair arrives, without storing the full data history.

Equivalence with the Kalman filter. Treating $\theta$ as the hidden state of a linear dynamical system reduces this problem exactly to Kalman filtering:

| LDS quantity | Regression interpretation |
| --- | --- |
| State $z_t$ | Parameter $\theta$ (shared, constant) |
| Dynamics $\mathbf{A} = \mathbf{I}$, $\mathbf{Q} = \mathbf{0}$ | Parameters do not change across observations |
| Emission matrix $C_t = x_t^\top$ | Time-varying: the feature vector at step $t$ |
| Emission noise $R = \sigma^2$ | Observation noise variance |

The dynamics are static (identity transition, zero process noise), so the Kalman predict step is trivial: the predicted distribution equals the filtered distribution from the previous step. Only the update step is active. Starting from $p(\theta \mid y_{1:t-1}) = \mathcal{N}(\mu_{t-1}, \Sigma_{t-1})$, conditioning on $y_t$ gives:

$$\delta_t = y_t - x_t^\top \mu_{t-1}, \qquad S_t = x_t^\top \Sigma_{t-1} x_t + \sigma^2, \qquad K_t = \frac{\Sigma_{t-1}\, x_t}{S_t}$$
$$\mu_t = \mu_{t-1} + K_t\,\delta_t, \qquad \Sigma_t = (\mathbf{I} - K_t x_t^\top)\,\Sigma_{t-1}.$$

The innovation $\delta_t$ is the residual from the current prediction, $S_t$ is its variance (a scalar because $y_t$ is scalar), and $K_t \in \mathbb{R}^D$ is the Kalman gain. In the information form, defining the precision $\Lambda_t = \Sigma_t^{-1}$ and information vector $\eta_t = \Lambda_t \mu_t$, the update is simply additive (related to the covariance form by the Sherman–Morrison identity):

$$\Lambda_t = \Lambda_{t-1} + \frac{1}{\sigma^2}\, x_t x_t^\top, \qquad \eta_t = \eta_{t-1} + \frac{y_t}{\sigma^2}\, x_t.$$

After $t$ steps, $\Lambda_t = \Lambda_0 + \frac{1}{\sigma^2} X_t^\top X_t$, which is exactly the posterior precision from batch Bayesian linear regression on all $t$ observations. Sequential and batch inference are exactly equivalent for this model — no approximation is introduced by processing data one point at a time.
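
The sequential-equals-batch claim can be verified numerically. This NumPy sketch runs the scalar-observation Kalman update above over a synthetic stream, then compares against batch Bayesian linear regression with the same $\mathcal{N}(0, \mathbf{I})$ prior:

```python
import numpy as np

rng = np.random.default_rng(1)
D, T, sigma2 = 3, 50, 0.5
theta_true = rng.normal(size=D)
X = rng.normal(size=(T, D))
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=T)

# Sequential Kalman updates (static state, so the predict step is a no-op).
mu = np.zeros(D)
Sigma = np.eye(D)                       # prior N(0, I)
for t in range(T):
    x = X[t]
    S = x @ Sigma @ x + sigma2          # innovation variance (scalar)
    K = Sigma @ x / S                   # Kalman gain
    mu = mu + K * (y[t] - x @ mu)       # mean update
    Sigma = Sigma - np.outer(K, x) @ Sigma   # (I - K x^T) Sigma

# Batch Bayesian linear regression with the same prior.
Lam = np.eye(D) + X.T @ X / sigma2
mu_batch = np.linalg.solve(Lam, X.T @ y / sigma2)
```

Up to floating-point error, `mu` matches `mu_batch` and the inverse of `Sigma` matches the batch precision `Lam`.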

Drifting parameters. Static parameters are often unrealistic: the true data-generating process may shift over time. A natural generalization introduces a random-walk prior on $\theta$, i.e., $\mathbf{A} = \mathbf{I}$ with $\mathbf{Q} \succ 0$. The Kalman predict step then inflates the covariance before each update:

$$\Sigma_{t|t-1} = \Sigma_{t-1|t-1} + \mathbf{Q}.$$

This is a form of controlled forgetting: the additional uncertainty injected at each step means older observations have less influence on the current estimate. Setting $\mathbf{Q} = q^2 \mathbf{I}$ corresponds to exponential discounting of past data, with an effective memory horizon of $\sim \sigma^2 / q^2$ observations.
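
A small scalar demo (hypothetical noise levels, with $x_t = 1$ so observations directly measure the parameter) shows why this matters: when the true parameter drifts as a random walk, a filter with $Q > 0$ tracks it, while the static $Q = 0$ filter locks onto stale data.

```python
import numpy as np

rng = np.random.default_rng(2)
T, sigma2, q2 = 400, 0.25, 0.01

# Scalar parameter drifting as a random walk; observations y_t = theta_t + noise.
theta = np.cumsum(rng.normal(scale=np.sqrt(q2), size=T))
y = theta + rng.normal(scale=np.sqrt(sigma2), size=T)

def filter_estimates(Q):
    mu, P, out = 0.0, 1.0, []
    for t in range(T):
        P = P + Q                      # predict: inflate variance (forgetting)
        K = P / (P + sigma2)           # scalar Kalman gain (x_t = 1)
        mu = mu + K * (y[t] - mu)
        P = (1 - K) * P
        out.append(mu)
    return np.array(out)

err_static = np.mean((filter_estimates(0.0) - theta) ** 2)
err_drift = np.mean((filter_estimates(q2) - theta) ** 2)
```

The static filter's gain shrinks toward zero, so it stops adapting and its tracking error grows with the drift; the matched filter keeps a steady-state gain and a much lower error.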

Variational Continual Learning

Variational continual learning (VCL) (Nguyen et al., 2018) maintains a variational approximation $q_t(\theta) \approx p(\theta \mid \mathcal{D}_{1:t})$ at each step. The variational objective at task $t$ is:

$$\mathcal{F}_t(q) = D_\mathrm{KL}(q(\theta) \,\|\, q_{t-1}(\theta)) - \mathbb{E}_{q(\theta)}[\log p(\mathcal{D}_t \mid \theta)],$$

where $q_{t-1}$ is the approximate posterior from the previous task, used as the new prior. Using a Gaussian mean-field approximation $q_t(\theta) = \mathcal{N}(\theta;\, \mu_t, \mathrm{diag}(\sigma_t^2))$, the parameters $(\mu_t, \sigma_t)$ are updated by minimizing $\mathcal{F}_t$ via the reparameterization gradient.

The KL term in $\mathcal{F}_t$ plays the role of the EWC penalty: it discourages $q_t$ from drifting far from $q_{t-1}$, weighted by the inverse variance of the previous posterior. When $q_{t-1} = \mathcal{N}(\theta^*_{t-1}, F_{t-1}^{-1})$ (the Laplace approximation), the KL penalty reduces exactly to:

$$D_\mathrm{KL}(q_t \,\|\, q_{t-1}) = \frac{1}{2}\sum_i F_{t-1,ii}\,(\mu_{t,i} - \theta^*_{t-1,i})^2 + \text{terms in } \sigma_t.$$

This reveals that EWC is the MAP limit of VCL under a Laplace approximation to the posterior, with the Fisher information playing the role of the prior precision.

VCL also integrates naturally with coreset methods: a small set of representative data points from each past task is stored and used to refine the variational posterior, improving on the Laplace approximation when the posterior is non-Gaussian.

Replay-Based Methods

Replay methods counteract forgetting by periodically revisiting data from previous tasks, either stored explicitly or regenerated by a model.

Experience Replay

The simplest approach maintains a fixed-size memory buffer $\mathcal{M}$ containing a small number of exemplars from each past task, selected by random subsampling or a principled strategy (e.g., reservoir sampling to maintain a uniform random sample from the full stream). The loss at each step interleaves current-task data with replayed data:

$$\mathcal{L}(\theta) = \mathcal{L}_t(\theta) + \beta \,\mathcal{L}_\mathcal{M}(\theta), \qquad \mathcal{L}_\mathcal{M}(\theta) = \frac{1}{|\mathcal{M}|}\sum_{(x,y)\in\mathcal{M}} \ell(\theta; x, y).$$

Experience replay is conceptually simple and empirically effective, but its memory cost grows with the number of tasks (or the per-task exemplar budget shrinks).
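
The reservoir-sampling buffer mentioned above is the classic Algorithm R: each stream item is kept with probability capacity/$n$, so the buffer is always a uniform sample of everything seen so far. A minimal sketch:

```python
import random

class ReservoirBuffer:
    """Fixed-size buffer holding a uniform random sample of the stream."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

buf = ReservoirBuffer(capacity=10)
for i in range(1000):
    buf.add(i)
```

In a replay method, `item` would be an `(x, y)` exemplar (or `(x, z)` logits pair for DER), and each training step would mix a minibatch from `buf.items` into the current-task batch.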

Dark Experience Replay

Dark experience replay (DER) (Buzzega et al., 2020) improves on standard replay by storing the model’s soft predictions (logits) $z_n = f_\theta(x_n)$ at the time each exemplar is added to the buffer, in addition to the input $x_n$. The replay loss then includes a knowledge distillation term:

$$\mathcal{L}_\mathrm{DER}(\theta) = \mathcal{L}_t(\theta) + \alpha \underbrace{\frac{1}{|\mathcal{M}|}\sum_{(x_n, z_n)\in\mathcal{M}} \|f_\theta(x_n) - z_n\|^2}_{\text{distillation from stored logits}}.$$

Matching the current model’s predictions to the stored logits preserves not just the final decision but the full predictive distribution — a richer signal that slows forgetting more effectively than label-only replay.

Gradient Episodic Memory

Gradient episodic memory (GEM) (Lopez-Paz & Ranzato, 2017) uses the memory buffer $\mathcal{M}$ not to mix replay into the training loss, but to constrain the gradient update direction. GEM solves a quadratic program at each step to find the update $\tilde{g}$ that:

  1. does not increase the loss on any past task’s exemplars (the non-interference constraint), and

  2. is as close as possible to the current task’s gradient $g_t$.

Formally, let $g_{t'} = \nabla_\theta \mathcal{L}_{\mathcal{M}_{t'}}(\theta)$ be the gradient of the memory loss for past task $t'$. GEM solves:

$$\tilde{g} = \arg\min_{v} \tfrac{1}{2}\|v - g_t\|^2 \quad \text{subject to} \quad \langle v, g_{t'} \rangle \geq 0 \quad \forall t' < t.$$

The constraint $\langle \tilde{g}, g_{t'} \rangle \geq 0$ ensures the update does not increase past task losses to first order. When all constraints are satisfied by $g_t$ itself, no projection is needed; otherwise, $g_t$ is projected onto the intersection of the constraint half-spaces.
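
With a single past task the projection has a closed form (add just enough of $g_{t'}$ to zero out the violated inner product). For several constraints, GEM solves the QP in its dual; the sketch below instead uses simple cyclic projection onto the half-spaces, which is an approximation, not the paper's exact solver:

```python
import numpy as np

def gem_project(g, past_grads, sweeps=100):
    """Project g so that <g_tilde, g_past> >= 0 for each stored past gradient.
    Cyclic projection onto half-spaces -- a stand-in for GEM's exact QP."""
    g = g.astype(float).copy()
    for _ in range(sweeps):
        violated = False
        for gp in past_grads:
            dot = g @ gp
            if dot < 0:                         # constraint violated
                g = g - (dot / (gp @ gp)) * gp  # project onto the boundary
                violated = True
        if not violated:
            break
    return g

g = np.array([1.0, -2.0])          # current-task gradient
past = [np.array([1.0, 1.0])]      # one past-task memory gradient
g_tilde = gem_project(g, past)
```

Here the raw gradient would increase the past-task loss (`g @ past[0] = -1 < 0`); the projected update `[1.5, -1.5]` is the closest direction that leaves it unchanged to first order.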

Modern Approaches: Parameter-Efficient Continual Learning

The dominant paradigm for continual learning with large pretrained models has shifted toward parameter-efficient adaptation: rather than modifying all model weights, these methods encode task-specific knowledge in a small number of additional parameters, leaving the pretrained base frozen. This eliminates forgetting by construction — different tasks are stored in separate, non-interfering modules — and scales naturally to long task sequences without growing the base model.

Low-Rank Adaptation (LoRA)

LoRA (Hu et al., 2022) constrains task-specific weight updates to a low-dimensional subspace. For a weight matrix $W_0 \in \mathbb{R}^{m \times n}$, the adapted model uses:

$$W = W_0 + \Delta W = W_0 + BA,$$

where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$. The base weights $W_0$ are frozen; only $A$ and $B$ are trained, requiring $O(r(m+n))$ parameters per task instead of $O(mn)$. For continual learning, each task $t$ receives its own adapter pair $\{A^{(t)}, B^{(t)}\}$: switching tasks requires only swapping the adapter, and the base model is never modified. Zero forgetting follows by construction.

A practical question is whether adapters from different tasks can be composed at inference time — for instance, by taking a weighted combination $\sum_t \alpha_t \Delta W^{(t)}$ when the task identity is unknown. This connects to the broader challenge of merging independently trained models, an active area of research sometimes called model merging or task arithmetic.
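
The structure of a LoRA forward pass and its parameter savings are easy to see in NumPy. This toy sketch (no training loop shown) uses the common initialization $B = 0$, so the adapted model starts out exactly equal to the frozen base:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 512, 512, 8

W0 = rng.normal(size=(m, n))            # frozen pretrained weight
# Per-task adapter; B starts at zero so W = W0 + B A = W0 initially.
A = rng.normal(scale=0.01, size=(r, n))
B = np.zeros((m, r))

def adapted_forward(x, B, A):
    # W x = W0 x + B (A x): the full m-by-n delta-W is never materialized.
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=n)
out = adapted_forward(x, B, A)

full_params = m * n                     # 262,144
adapter_params = r * (m + n)            # 8,192 -- a 32x reduction here
```

Switching tasks just means swapping in a different `(A, B)` pair; `W0` is shared and untouched.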

Prefix Tuning and KV Cache Adaptation

Prefix tuning (Li & Liang, 2021) prepends a set of $L$ learned “virtual tokens” to the key-value cache of each attention layer. For a transformer with keys $K \in \mathbb{R}^{T \times d}$ and values $V \in \mathbb{R}^{T \times d}$ from the input tokens, the prefix-augmented attention computes:

$$\mathrm{Attn}\!\left(Q,\, [\bar{K};\, K],\, [\bar{V};\, V]\right),$$

where $\bar{K}, \bar{V} \in \mathbb{R}^{L \times d}$ are the only parameters trained and $d$ is the head dimension. The base model is fully frozen. Prefix tuning can be understood as soft prompting at every layer: the virtual tokens steer the model’s internal representations without touching its weights. For continual learning, separate prefixes per task require only $O(L \cdot d \cdot n_\mathrm{layers})$ storage, and switching tasks is a memory swap.
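
A single-head toy version of the prefix-augmented attention makes the mechanics concrete (all shapes here are small hypothetical values; real models apply this per head, per layer):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(4)
T, L, d = 6, 3, 8                       # input tokens, prefix length, head dim
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# Learned virtual tokens: the only trainable parameters under prefix tuning.
K_bar = rng.normal(size=(L, d))
V_bar = rng.normal(size=(L, d))

# Prefix-augmented attention: queries attend over [K_bar; K] and [V_bar; V].
out = attention(Q, np.vstack([K_bar, K]), np.vstack([V_bar, V]))
```

The output keeps the input's shape `(T, d)`: the prefix only redistributes attention mass, adding `L` extra slots each query can attend to.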

Cartridges

Cartridges (Eyuboglu et al., 2025) extend prefix tuning from a learned prompt to a compressed knowledge store. Rather than learning a prefix for a task specification, a cartridge encodes the content of an entire document corpus into a compact KV cache $(\bar{K}, \bar{V}) \in \mathbb{R}^{L \times d} \times \mathbb{R}^{L \times d}$ with $L \ll T$, via a process called self-study:

  1. Generate synthetic reference queries $Q_\mathrm{ref} \in \mathbb{R}^{q \times d}$ about the target documents using the base model.

  2. Run the model with the full $T$-token documents in context to obtain ground-truth attention outputs $Y \in \mathbb{R}^{q \times d}$.

  3. Train the cartridge $(\bar{K}, \bar{V})$ to minimize the discrepancy between $Y$ and the attention outputs produced by $(\bar{K}, \bar{V})$ alone.

The result is a compact, composable cache: multiple cartridges can be concatenated in the KV cache at inference time without retraining, enabling modular assembly of knowledge from multiple sources. Empirically, cartridges storing a 484k-token corpus require 38× less memory than the equivalent in-context representation and achieve 26× higher inference throughput.

For continual learning, each task or knowledge domain gets its own cartridge trained offline. Adding a new task means training one new cartridge; all previous cartridges are untouched. Forgetting is structurally impossible.

Attention Matching

Training a cartridge via self-study requires backpropagation through the model for each new task — taking hours for large contexts. Attention matching (Zweiger et al., 2026) finds the compact cache $(\bar{K}, \bar{V})$ in closed form, in seconds, by decomposing the problem into tractable least-squares subproblems.

Using the same notation as above, the goal is to find $(\bar{K}, \bar{V}) \in \mathbb{R}^{L \times d} \times \mathbb{R}^{L \times d}$ that reproduces the attention outputs $Y \in \mathbb{R}^{q \times d}$ of the full $T$-token cache on the $q$ reference queries $Q_\mathrm{ref}$.

Value fitting. Given compacted keys $\bar{K}$, let $X \in \mathbb{R}^{q \times L}$ be the matrix of normalized attention weights from each reference query to each compacted key. The compacted values $\bar{V}$ that best reproduce $Y$ solve:

$$\min_{\bar{V}} \|X \bar{V} - Y\|_F^2, \qquad \bar{V}^* = (X^\top X)^{-1} X^\top Y.$$

This is ordinary least squares: the closed-form solution is the same as Bayesian linear regression with an uninformative prior (cf. the information-form update in the Bayesian online regression section above, with $\Lambda_0 = 0$). Each column of $\bar{V}$ is fit independently, regressing the $d$-dimensional attention output onto the $L$ normalized attention weights from the compacted keys.

Bias fitting. To also match the total attention mass — ensuring $\bar{K}$ attracts the correct aggregate attention weight — the method adds per-token scalar log-biases $\beta \in \mathbb{R}^L$ to the attention scores and solves a nonnegative least-squares problem for $\beta$.

Key selection. Given fixed $\bar{V}$ and $\beta$, the $L$ key positions $\bar{K}$ are chosen greedily by orthogonal matching pursuit or by highest aggregated attention weight across $Q_\mathrm{ref}$.

The full three-stage pipeline (key selection → bias fitting → value fitting) runs in seconds and achieves 50× compression with minimal quality loss on long-context benchmarks, matching gradient-based methods that take hours. The closed-form structure mirrors the recursive Bayesian updates studied throughout this chapter: both reduce to sequential linear-regression problems that summarize a stream of information into a compact sufficient statistic.
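
The value-fitting stage can be sketched in isolation. The example below assumes the attention-weight matrix $X$ is already in hand (key selection and bias fitting are omitted) and synthesizes targets $Y$ from known values so the least-squares recovery can be checked exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
q, L, d = 64, 8, 16

# X: normalized attention weights from q reference queries to L compacted keys
# (each row is a softmax distribution, so rows sum to 1).
logits = rng.normal(size=(q, L))
X = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Y: target attention outputs. Synthesized here from known values V_true so
# that an exact solution exists and the OLS recovery can be verified.
V_true = rng.normal(size=(L, d))
Y = X @ V_true

# Closed-form value fitting V_bar = (X^T X)^{-1} X^T Y, via lstsq for
# numerical stability; each of the d output columns is fit independently.
V_bar, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

With `q > L` random softmax rows, `X` has full column rank almost surely, so `V_bar` recovers `V_true` up to floating-point error.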

Other Methods

Earlier work proposed two additional families that avoid forgetting by design rather than by parameter isolation. Though less prominent in the era of large pretrained models, they introduced ideas that continue to influence modern methods.

Architecture-based methods allocate dedicated parameters to each task. Progressive neural networks (Rusu et al., 2016) add a new column of weights per task, with lateral connections to all previous frozen columns — enabling positive forward transfer at the cost of linear model growth. PackNet avoids growth by pruning and reassigning freed weights to future tasks, assigning each task a disjoint binary mask $m^{(t)}$ over the shared parameter vector.

Gradient projection methods constrain updates to subspaces that do not interfere with past tasks. Orthogonal gradient descent (OGD) (Farajtabar et al., 2020) projects the current gradient onto the orthogonal complement of the gradients at previous task optima:

$$\tilde{g}_t = g_t - \sum_{t'<t} \frac{g_t^\top g^{(t')}}{\|g^{(t')}\|^2}\, g^{(t')}.$$

Gradient projection memory (GPM) (Saha et al., 2021) generalizes this by projecting layer-wise gradient matrices onto the null space of the feature subspace from past tasks, identified via SVD: $\tilde{G}^\ell = G^\ell - U^\ell {U^\ell}^\top G^\ell$.
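
The OGD projection is a few lines of NumPy. Note one caveat: subtracting the components one stored gradient at a time is exact only when the stored gradients are orthogonal (in practice OGD orthonormalizes them first, e.g. by Gram-Schmidt); this toy example uses orthogonal past gradients so the sequential sketch is exact:

```python
import numpy as np

def ogd_project(g, past_grads):
    """Remove from g its component along each stored past-task gradient.
    Exact when past_grads are mutually orthogonal; otherwise orthonormalize
    them first (e.g. Gram-Schmidt) before projecting."""
    g = g.astype(float).copy()
    for gp in past_grads:
        g -= (g @ gp) / (gp @ gp) * gp
    return g

g = np.array([1.0, 2.0, 3.0])                        # current-task gradient
past = [np.array([1.0, 0.0, 0.0]),                   # stored past gradients
        np.array([0.0, 1.0, 0.0])]
g_tilde = ogd_project(g, past)
```

Only the component of `g` orthogonal to every stored direction survives, so to first order the update cannot change past-task losses.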

Summary and Open Problems

Continual learning sits at the intersection of optimization, Bayesian inference, and representation learning. The key tension — the stability–plasticity dilemma — manifests differently across the methods reviewed here:

| Method | Anti-forgetting mechanism | Storage per task |
| --- | --- | --- |
| EWC (Kirkpatrick et al., 2017) | Diagonal Fisher penalty | $O(\|\theta\|)$ |
| SI (Zenke et al., 2017) | Online path-integral importance | $O(\|\theta\|)$ accumulated |
| VCL (Nguyen et al., 2018) | Variational posterior as prior | $O(\|\theta\|)$ + coresets |
| Experience replay | Stored exemplars | $O(\|\mathcal{M}\|)$ buffer |
| DER (Buzzega et al., 2020) | Stored exemplars + logit distillation | $O(\|\mathcal{M}\| \cdot C)$ |
| GEM (Lopez-Paz & Ranzato, 2017) | Gradient projection via QP | $O(\|\mathcal{M}\|)$ buffer |
| LoRA (Hu et al., 2022) | Task-specific low-rank adapters | $O(r(m+n))$ per task |
| Prefix tuning (Li & Liang, 2021) | Task-specific KV prefix | $O(L \cdot d \cdot n_\ell)$ per task |
| Cartridges (Eyuboglu et al., 2025) | Compressed KV cache per task | $O(L \cdot d \cdot n_\ell)$ per task |
| Attention matching (Zweiger et al., 2026) | Closed-form KV compaction (OLS) | $O(L \cdot d \cdot n_\ell)$ per task |

Several open problems remain active areas of research.

Conclusion

Continual learning studies how to learn from a non-stationary stream of tasks without forgetting previously acquired knowledge. The field has evolved from regularization-based methods (EWC, SI, VCL), which protect important parameters using the Fisher information or a variational posterior as a surrogate prior, through replay-based methods (experience replay, DER, GEM), which counteract forgetting by revisiting past data, to modern parameter-efficient approaches (LoRA, prefix tuning, cartridges, attention matching), which sidestep forgetting entirely by keeping the base model frozen and storing task knowledge in lightweight, composable modules. The Bayesian perspective — sequential updating of a posterior, made exact by the Kalman filter in the linear Gaussian case — provides a unifying foundation that connects all of these approaches and motivates the approximate methods needed for nonlinear models.

References
  1. De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2021). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3366–3385.
  2. McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165.
  3. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., & others. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.
  4. Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., & Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. International Conference on Machine Learning, 4528–4537.
  5. Zenke, F., Poole, B., & Ganguli, S. (2017). Continual learning through synaptic intelligence. International Conference on Machine Learning, 3987–3995.
  6. Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2018). Variational continual learning. International Conference on Learning Representations.
  7. Buzzega, P., Boschini, M., Porrello, A., Abati, D., & Calderara, S. (2020). Dark experience for general continual learning: a strong, simple baseline. Advances in Neural Information Processing Systems, 33, 15920–15930.
  8. Lopez-Paz, D., & Ranzato, M. (2017). Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30.
  9. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. arXiv Preprint arXiv:2106.09685.
  10. Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv Preprint arXiv:2101.00190.
  11. Eyuboglu, S., Ehrlich, R., Arora, S., Guha, N., Zinsley, D., Liu, E., Tennien, W., Rudra, A., Zou, J., Mirhoseini, A., & Ré, C. (2025). Cartridges: Lightweight and General-Purpose Long Context Representations via Self-Study. arXiv Preprint arXiv:2506.06266.
  12. Zweiger, A., Fu, X., Guo, H., & Kim, Y. (2026). Fast KV Compaction via Attention Matching. arXiv Preprint arXiv:2602.16284.
  13. Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., & Hadsell, R. (2016). Progressive neural networks. arXiv Preprint arXiv:1606.04671.
  14. Farajtabar, M., Azizan, N., Mott, A., & Li, A. (2020). Orthogonal gradient descent for continual learning. International Conference on Artificial Intelligence and Statistics, 3762–3773.
  15. Saha, G., Garg, I., & Roy, K. (2021). Gradient Projection Memory for Continual Learning. International Conference on Learning Representations.