Linear RNNs and structured SSMs are fast to train because their recurrences compose as affine functions — a property that admits an $O(\log T)$-depth parallel scan. Nonlinear RNNs, like the vanilla RNN and the GRU, do not enjoy this property: each state depends nonlinearly on the previous one, so evaluation seems inherently sequential. This chapter first introduces the parallel scan algorithm, then shows how iterative linearization extends it to nonlinear recurrences — reducing each iteration to a linear dynamical system (LDS) solve that can be parallelized with the same primitive.
## The Parallel Scan
Consider a linear RNN with time-varying dynamics,

$$h_t = A_t h_{t-1} + b_t, \qquad t = 1, \dots, T,$$

where $h_t \in \mathbb{R}^D$, $A_t \in \mathbb{R}^{D \times D}$, and $b_t \in \mathbb{R}^D$. Naively, computing the full trajectory $h_{1:T}$ requires $T$ sequential steps, since each $h_t$ depends on $h_{t-1}$. The parallel scan (Blelloch, 1990), also called the associative scan, reduces this to $O(\log T)$ depth by exploiting the structure of affine composition.
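Throughout this chapter, the code sketches are in JAX; names, shapes, and helper functions (e.g. `sequential_lds`) are illustrative choices, not taken from the cited papers. Here is a minimal sequential reference for the recurrence above:

```python
import jax.numpy as jnp

def sequential_lds(A, b, h0):
    """O(T) sequential reference for h_t = A_t h_{t-1} + b_t.

    A: (T, D, D) transition matrices, b: (T, D) biases, h0: (D,) initial state.
    Returns the trajectory h_1, ..., h_T as a (T, D) array.
    """
    h, trajectory = h0, []
    for A_t, b_t in zip(A, b):
        h = A_t @ h + b_t              # one step of the recurrence
        trajectory.append(h)
    return jnp.stack(trajectory)
```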
### Associativity and Closure
Each step of the recurrence is an affine map $h \mapsto A_t h + b_t$. Two consecutive steps compose as another affine map:

$$h \;\mapsto\; A_{t+1}(A_t h + b_t) + b_{t+1} \;=\; (A_{t+1} A_t)\, h + (A_{t+1} b_t + b_{t+1}).$$

Representing each step as a pair $(A_t, b_t)$, the composition rule is:

$$(A_{t+1}, b_{t+1}) \bullet (A_t, b_t) \;=\; \big(A_{t+1} A_t,\; A_{t+1} b_t + b_{t+1}\big).$$

This operation is associative — $(a \bullet b) \bullet c = a \bullet (b \bullet c)$ — and closed — the composition of two affine maps is again an affine map. These two properties are exactly what the parallel scan requires.
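The pair representation maps directly onto `jax.lax.associative_scan`, which only needs the associative composition operator. A sketch, reusing the shapes of `sequential_lds` above:

```python
import jax
import jax.numpy as jnp

def compose(first, second):
    """(A2, b2) ∘ (A1, b1) = (A2 A1, A2 b1 + b2), with `first` the earlier step.

    Both arguments carry a leading batch axis, as jax.lax.associative_scan requires.
    """
    A1, b1 = first
    A2, b2 = second
    return A2 @ A1, jnp.einsum('...ij,...j->...i', A2, b1) + b2

def parallel_lds(A, b, h0):
    """All cumulative affine maps via an O(log T)-depth scan, then apply them to h0."""
    A_cum, b_cum = jax.lax.associative_scan(compose, (A, b))
    return jnp.einsum('tij,j->ti', A_cum, h0) + b_cum
```

On random inputs, `parallel_lds(A, b, h0)` should match `sequential_lds(A, b, h0)` up to floating-point error.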
### Divide-and-Conquer
Given $T$ affine maps, we want all cumulative compositions $(A_t, b_t) \bullet \cdots \bullet (A_1, b_1)$ for $t = 1, \dots, T$. The parallel scan computes these in two phases:

- **Up-sweep:** pair up neighboring maps and reduce in parallel, $O(\log T)$ rounds, building products at powers of two.
- **Down-sweep:** propagate cumulative products back to fill all positions, another $O(\log T)$ rounds.

Together both phases run in $O(\log T)$ depth with $O(T)$ processors, yielding the full trajectory in $O(T)$ total work — the same as sequential evaluation, but parallelized across the time dimension.
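For intuition about the two phases, here is a recursive pure-Python sketch of the divide-and-conquer structure (the recursive formulation rather than Blelloch's in-place version): it performs $O(T)$ combines in total, organized into $O(\log T)$ levels that could each run in parallel.

```python
def scan_recursive(elems, combine):
    """Inclusive scan of a list under any associative, closed binary operation."""
    if len(elems) == 1:
        return list(elems)
    # Up-sweep: combine adjacent pairs, halving the problem size.
    pairs = [combine(elems[i], elems[i + 1]) for i in range(0, len(elems) - 1, 2)]
    pair_scan = scan_recursive(pairs, combine)
    # Down-sweep: odd positions come straight from the pair scan; even positions
    # combine the preceding cumulative value with the local element.
    out = [elems[0]]
    for i in range(1, len(elems)):
        if i % 2 == 1:
            out.append(pair_scan[i // 2])
        else:
            out.append(combine(pair_scan[i // 2 - 1], elems[i]))
    return out
```

It works with the affine `compose` operator above when the elements are `(A_t, b_t)` pairs.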
## Sequential Evaluation as Root-Finding
Let $h_t \in \mathbb{R}^D$ denote the hidden state at time $t$, and let $f_t$ denote the (possibly nonlinear, possibly input-dependent) transition function. Sequential evaluation of the recurrence,

$$h_t = f_t(h_{t-1}), \qquad t = 1, \dots, T,$$

requires $T$ serial steps because $h_t$ directly depends on $h_{t-1}$.

We can reframe the problem: the correct trajectory $h^\star_{1:T}$ is the unique solution to the system of equations,

$$h_t - f_t(h_{t-1}) = 0, \qquad t = 1, \dots, T.$$

This is a system of $T$ coupled nonlinear equations in $T$ unknowns. Fixed-point iteration is a natural approach: start from an initial guess (e.g., all zeros) and iteratively refine it by solving a simpler surrogate problem.
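In code, the residual is one shifted comparison against the dynamics. This sketch assumes a single transition function `f(h, t)` shared across time, with the time index standing in for any input dependence:

```python
import jax
import jax.numpy as jnp

def residual(h, f, h0):
    """Stacked residual r_t = h_t - f(h_{t-1}, t); identically zero on the true trajectory.

    h: (T, D) candidate trajectory, h0: (D,) known initial state.
    """
    T = h.shape[0]
    h_prev = jnp.concatenate([h0[None], h[:-1]], axis=0)   # h_0, h_1, ..., h_{T-1}
    return h - jax.vmap(f)(h_prev, jnp.arange(T))
```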
## A Unifying Framework
The key insight of Gonzalez et al. (2026) is that four prominent fixed-point methods — Newton, quasi-Newton, Picard, and Jacobi — all reduce to iterative evaluation of a linear dynamical system. Specifically, each iteration takes the common form,

$$h^{(i+1)}_t = \mathbf{A}^{(i)}_t\, h^{(i+1)}_{t-1} + \mathbf{b}^{(i)}_t, \qquad \mathbf{b}^{(i)}_t = f_t\!\big(h^{(i)}_{t-1}\big) - \mathbf{A}^{(i)}_t\, h^{(i)}_{t-1},$$

where $\mathbf{A}^{(i)}_t$ is an approximate Jacobian of the dynamics at $h^{(i)}_{t-1}$. This is a linear recursion in the unknown $h^{(i+1)}_{1:T}$, driven by the bias $\mathbf{b}^{(i)}_t$, which only depends on the previous iterate. Since it is an LDS, it can be evaluated via a parallel scan in $O(\log T)$ depth.
The four methods differ only in how they choose $\mathbf{A}^{(i)}_t$:

| Method | $\mathbf{A}^{(i)}_t$ | Cost per iteration | Parallelization |
|---|---|---|---|
| Newton | $J^{(i)}_t$ — full Jacobian | $O(D^3)$ per step (dense matmuls) | Parallel scan (dense) |
| Quasi-Newton | $\mathrm{diag}\big(J^{(i)}_t\big)$ — diagonal | $O(D)$ per step (elementwise) | Parallel scan (elementwise) |
| Picard | $I$ — identity | $O(D)$ per step (vector additions) | Prefix sum |
| Jacobi | $0$ — zero | one call to $f_t$ per step | Embarrassingly parallel |
All four methods are guaranteed to converge to the correct trajectory in at most $T$ iterations (Gonzalez et al., 2026).
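Concretely, one iteration of the unified scheme builds $\mathbf{A}^{(i)}_t$ and $\mathbf{b}^{(i)}_t$ from the previous iterate and hands them to the `parallel_lds` scan defined earlier. The sketch below is illustrative, not the authors' implementation; `make_A(h, t)` is a hypothetical hook that each of the four methods fills in differently:

```python
import jax
import jax.numpy as jnp

def linearized_iteration(h_prev_iter, f, h0, make_A):
    """One fixed-point iteration: linearize around the previous iterate, solve the LDS.

    h_prev_iter: (T, D) previous guess of the trajectory.
    make_A(h, t): (D, D) transition matrix for the chosen method (Newton, quasi-Newton, ...).
    """
    T = h_prev_iter.shape[0]
    ts = jnp.arange(T)
    # Linearization points h_{t-1}^(i) for every step t.
    h_shift = jnp.concatenate([h0[None], h_prev_iter[:-1]], axis=0)
    A = jax.vmap(make_A)(h_shift, ts)                                       # (T, D, D)
    b = jax.vmap(f)(h_shift, ts) - jnp.einsum('tij,tj->ti', A, h_shift)     # (T, D)
    return parallel_lds(A, b, h0)   # h^(i+1), computed in O(log T) depth
```

Starting from an all-zeros guess and repeating this update until the residual vanishes recovers the exact trajectory.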
## Four Root-Finding Methods
### Newton Iterations
Newton’s method for root-finding linearizes the residual using its full Jacobian. Applied to sequential evaluation, each Newton step is the first-order Taylor expansion of the recurrence around the current iterate:

$$h^{(i+1)}_t = f_t\!\big(h^{(i)}_{t-1}\big) + J^{(i)}_t \big(h^{(i+1)}_{t-1} - h^{(i)}_{t-1}\big), \qquad J^{(i)}_t = \frac{\partial f_t}{\partial h}\Big(h^{(i)}_{t-1}\Big),$$

i.e., $\mathbf{A}^{(i)}_t = J^{(i)}_t$.
This is the DEER algorithm of Lim et al. (2024). Because the transition matrix is the full $D \times D$ Jacobian, each iteration requires $O(D^2)$ memory per step and $O(D^3)$ work per step for the matrix–matrix multiplications in the parallel scan. For large state dimensions, this is prohibitive.

When $f_t$ is a linear function of $h_{t-1}$, the linearization is exact and Newton converges in a single iteration — recovering the standard parallel scan for LDSs as a special case.
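Under the assumptions of the sketch above, the Newton/DEER choice simply plugs the full Jacobian of the dynamics into the hypothetical `make_A` hook:

```python
import jax

def newton_A(f):
    """make_A for Newton/DEER: the full (D, D) Jacobian of f at the linearization point."""
    return lambda h, t: jax.jacobian(lambda x: f(x, t))(h)

# One Newton sweep over the whole trajectory:
# h_next = linearized_iteration(h_guess, f, h0, newton_A(f))
```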
### Quasi-Newton Iterations
To reduce the cost of Newton iterations, Gonzalez et al. (2024) replace the full Jacobian with its diagonal:

$$\mathbf{A}^{(i)}_t = \mathrm{diag}\!\big(J^{(i)}_t\big).$$

This is the ELK (Evaluating Levenberg–Marquardt via Kalman) algorithm. With a diagonal transition matrix, each parallel scan step is an elementwise vector multiplication, reducing the cost to $O(D)$ per step. The diagonal of the Jacobian can often be computed in closed form for common RNN architectures (e.g., GRUs), or estimated stochastically using the Hutchinson estimator in a single function call.
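In the same sketch, the quasi-Newton choice keeps only the diagonal. For clarity this version forms the full Jacobian and returns a dense matrix so it plugs into the generic scan; a practical implementation would keep just the diagonal vector and run an elementwise scan:

```python
import jax
import jax.numpy as jnp

def quasi_newton_A(f):
    """make_A for the diagonal (quasi-Newton) approximation."""
    def make_A(h, t):
        # For clarity we read the diagonal off the full Jacobian; in practice it is
        # computed in closed form or estimated with a single Hutchinson JVP probe, e.g.
        #   v = jax.random.rademacher(key, h.shape, dtype=h.dtype)   # given a PRNG `key`
        #   diag_est = v * jax.jvp(lambda x: f(x, t), (h,), (v,))[1]
        J = jax.jacobian(lambda x: f(x, t))(h)      # (D, D)
        return jnp.diag(jnp.diag(J))                # keep only the diagonal
    return make_A
```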
### Picard Iterations
Picard iterations set $\mathbf{A}^{(i)}_t = I$, approximating the Jacobian by the identity matrix. The update simplifies to a prefix sum:

$$h^{(i+1)}_t = h_0 + \sum_{s=1}^{t} \Big(f_s\!\big(h^{(i)}_{s-1}\big) - h^{(i)}_{s-1}\Big) \;=\; h_0 + \Delta \sum_{s=1}^{t} g_s\!\big(h^{(i)}_{s-1}\big),$$

where the second equality holds for ODE-based dynamics $f_s(h) = h + \Delta\, g_s(h)$, with $g_s$ the underlying vector field and $\Delta$ the discretization step. Picard iterations require only vector additions, making them the cheapest per-iteration method. Shih et al. (2023) used them to parallelize sampling in diffusion models.

The identity approximation is faithful when the true Jacobian $J_t \approx I$, which holds for dynamics with small step sizes (e.g., discretized ODEs with fine time steps).
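With $\mathbf{A}^{(i)}_t = I$ the scan degenerates to a cumulative sum, so a Picard iteration needs no `associative_scan` at all. A sketch under the same assumptions as before:

```python
import jax
import jax.numpy as jnp

def picard_iteration(h_prev_iter, f, h0):
    """One Picard iteration: A_t = I, so the LDS solve collapses to a prefix sum."""
    T = h_prev_iter.shape[0]
    h_shift = jnp.concatenate([h0[None], h_prev_iter[:-1]], axis=0)
    b = jax.vmap(f)(h_shift, jnp.arange(T)) - h_shift   # b_t = f_t(h^(i)_{t-1}) - h^(i)_{t-1}
    return h0 + jnp.cumsum(b, axis=0)                   # h^(i+1)_t = h_0 + sum_{s<=t} b_s
```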
### Jacobi Iterations
Jacobi iterations set $\mathbf{A}^{(i)}_t = 0$, giving the simplest possible update:

$$h^{(i+1)}_t = f_t\!\big(h^{(i)}_{t-1}\big).$$
Each element of the new trajectory is computed independently — no dependencies between time steps at all. This is embarrassingly parallel: all states can be updated simultaneously in a single kernel call. Song et al. (2021) used Jacobi iterations to accelerate feedforward computation in deep networks.
The zero-Jacobian approximation is faithful when consecutive states evolve nearly independently, i.e., when the true Jacobian $J_t \approx 0$.
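With $\mathbf{A}^{(i)}_t = 0$ there is no recursion left to solve: one Jacobi iteration is a single batched application of the dynamics to the shifted previous iterate.

```python
import jax
import jax.numpy as jnp

def jacobi_iteration(h_prev_iter, f, h0):
    """One Jacobi iteration: A_t = 0, so every state updates independently."""
    T = h_prev_iter.shape[0]
    h_shift = jnp.concatenate([h0[None], h_prev_iter[:-1]], axis=0)
    return jax.vmap(f)(h_shift, jnp.arange(T))          # h^(i+1)_t = f_t(h^(i)_{t-1})
```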
## Convergence Analysis
Gonzalez et al. (2026) derive a unified convergence bound for all four methods. Let $e^{(i)} = h^{(i)}_{1:T} - h^\star_{1:T}$ denote the error at iteration $i$, and let $\hat{J}$ and $J$ denote the block-bidiagonal approximate and true Jacobians of the residual. Then:

$$\big\|e^{(i+1)}\big\| \;\le\; \big\|\hat{J}^{-1}\big(\hat{J} - J\big)\big\|\, \big\|e^{(i)}\big\| + O\!\big(\|e^{(i)}\|^2\big).$$

As the error approaches zero, the asymptotic linear convergence rate is:

$$\rho \;=\; \big\|\hat{J}^{-1}\big(\hat{J} - J\big)\big\| \;\le\; \big\|\hat{J}^{-1}\big\|\, \big\|\hat{J} - J\big\|.$$

Convergence is fast ($\rho \ll 1$) when two conditions hold simultaneously:

- **Small approximation error:** $\|\hat{J} - J\|$ is small, i.e., $\hat{J}$ is a faithful approximation of the true Jacobian $J$.
- **Stable LDS:** $\|\hat{J}^{-1}\|$ is small, i.e., the linearized system with transition matrices $\mathbf{A}^{(i)}_t$ is stable (spectral norms well below one).
The two conditions trade off against each other:

- **Newton:** $\hat{J} = J$, so $\|\hat{J} - J\| = 0$ and $\rho = 0$; convergence is superlinear, and a single iteration suffices when the dynamics are linear — but only if the resulting LDS is stable.
- **Quasi-Newton:** $\mathrm{diag}(J_t)$ is typically a better approximation than the identity, and the diagonal LDS is often stable.
- **Picard:** $\mathbf{A}^{(i)}_t = I$, which is faithful when $J_t \approx I$, but the resulting LDS has unit eigenvalues, making $\|\hat{J}^{-1}\|$ grow with the sequence length — slow for long sequences.
- **Jacobi:** $\mathbf{A}^{(i)}_t = 0$, so $\hat{J} = I$ and $\|\hat{J}^{-1}\| = 1$ (minimal), but the approximation error $\|\hat{J} - J\|$ may be large.
## Conclusion
The parallel scan turns any associative, closed operation into an $O(\log T)$-depth computation. For nonlinear RNNs, iterative linearization reduces each fixed-point iteration to an LDS — making the parallel scan applicable even when the original dynamics are nonlinear. All four methods are guaranteed to converge in at most $T$ iterations; the rate depends on how faithfully $\mathbf{A}^{(i)}_t$ approximates the true Jacobian and on the stability of the induced LDS.
| Method | $\mathbf{A}^{(i)}_t$ | Cost/iter | Converges fast when |
|---|---|---|---|
| Newton | Full Jacobian | $O(D^3)$ per step (dense) | The induced LDS is stable |
| Quasi-Newton | Diag. Jacobian | $O(D)$ per step | Diagonal of the Jacobian dominates |
| Picard | Identity | $O(D)$ per step | Dynamics are close to an identity-plus-small-shift (fine time steps) |
| Jacobi | Zero | One call to $f_t$ per step | States evolve nearly independently |
- Blelloch, G. E. (1990). Prefix sums and their applications (Technical Report CMU-CS-90-190). School of Computer Science, Carnegie Mellon University.
- Gonzalez, X., Buchanan, E. K., Lee, H. D., Liu, J. W., Wang, K. A., Zoltowski, D. M., Kozachkov, L., Ré, C., & Linderman, S. W. (2026). A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems. Transactions on Machine Learning Research. https://openreview.net/forum?id=fw6GgAIGur
- Lim, Y. H., Zhu, Q., Selfridge, J., & Kasim, M. F. (2024). Parallelizing Non-Linear Sequential Models Over the Sequence Length. International Conference on Learning Representations (ICLR).
- Gonzalez, X., Buchanan, E. K., Zoltowski, D. M., & Linderman, S. W. (2024). ELK: Evaluating Levenberg–Marquardt via Kalman. arXiv Preprint.
- Shih, A., Belkhale, S., Ermon, S., Sadigh, D., & Anari, N. (2023). Parallel Sampling of Diffusion Models. Advances in Neural Information Processing Systems (NeurIPS).
- Song, Y., Meng, C., Liao, R., & Ermon, S. (2021). Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving. International Conference on Machine Learning (ICML).