Curriculum and multi-stage training

Part 3 — Training pathologies and remedies

Learning objectives

Recognise that hard PINN problems are dramatically easier when approached through easier surrogates
Apply frequency continuation (Bunks 1995) to a multi-scale PINN target
Distinguish curriculum learning from learning-rate scheduling
Build the link from classical seismic FWI workflows to modern PINN curricula

Krishnapriyan, Gholami, Zhe, Kirby & Mahoney (NeurIPS 2021) showed that PINN training on a hard PDE problem improves markedly when the network builds up to it: train on an easier version of the problem first, then progressively warp the training target toward the hard one. This is the same idea as frequency continuation in seismic FWI (Bunks et al. 1995), carried over to PINNs.

Why curriculum works for PINNs

The optimisation landscape of a multi-scale PINN problem has many local minima, and direct optimisation tends to land in whichever basin contains the random initialisation. Training on an easier surrogate (a low-frequency target) gives a smoother loss landscape with one dominant minimum, which the optimiser finds reliably. Warping toward the harder target then moves that minimum, but only locally: the network already sits in the basin of the right minimum. People approach hard regression problems the same way: start with a simple model, then add complexity.
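One way to write this argument down is as a warm-started sequence of optimisations. The symbols $\theta_k$, $w_k$, $\mathcal{L}$, $\theta_k^\star$ and $\mathcal{B}$ are notation introduced here for illustration, not taken from the cited papers:

$$
\theta_k \;=\; \operatorname{LocalMin}\Big(\mathcal{L}(\,\cdot\,; w_k)\ \text{started from}\ \theta_{k-1}\Big),
\qquad w_1 < w_2 < \dots < w_K ,
$$

where $\mathcal{L}(\theta; w)$ is the training loss at difficulty $w$ and $\theta_0$ is the random initialisation. The curriculum succeeds when every warm start lies in the basin of the desired minimum, $\theta_{k-1} \in \mathcal{B}(\theta_k^\star)$, which is plausible when consecutive $w_k$ are close enough that $\theta_k^\star$ moves only a little between stages. Direct training is the special case $K = 1$, where $\theta_0$ has to land in the right basin by chance.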

The technique has three flavours, distinguished by what is being warped:

Frequency continuation: warp the target frequency content from low to high. This is what Bunks et al. (1995) introduced for seismic FWI; Krishnapriyan et al. (2021) applied the same easy-to-hard idea to PINNs.
Spatial-domain expansion: train on a small subdomain first, then enlarge. Useful for problems with sharp features at the boundary.
Optimiser handoff: train with Adam (robust, fast initial convergence), then switch to L-BFGS (quasi-Newton, precise near a minimum). Standard practice in seismic PINN papers.
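The second flavour, spatial-domain expansion, mostly comes down to how collocation points are sampled at each stage. The sketch below is a minimal illustration under stated assumptions: a 1-D domain $[-1, 1]$, a subdomain that grows linearly from the centre, and a hypothetical helper name `collocation_points`. None of these come from the cited papers.

```python
import random

def collocation_points(stage, n_stages, n_points,
                       full_domain=(-1.0, 1.0), rng=random):
    """Sample points from a subdomain that grows with the stage index.

    stage 0 covers the central 1/n_stages of the domain; the last stage
    (stage == n_stages - 1) covers the full domain. Linear growth from the
    centre is an assumption made for this sketch.
    """
    lo, hi = full_domain
    centre = 0.5 * (lo + hi)
    half_width = 0.5 * (hi - lo) * (stage + 1) / n_stages
    return [rng.uniform(centre - half_width, centre + half_width)
            for _ in range(n_points)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
for stage in range(5):
    pts = collocation_points(stage, n_stages=5, n_points=200, rng=rng)
    print(f"stage {stage}: x in [{min(pts):+.2f}, {max(pts):+.2f}]")
```

The network weights are carried over from one stage to the next; only the sampling region changes.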

Try it: frequency continuation

The widget races three strategies on the multi-scale target $f(x; w) = \tfrac{1}{2}\sin(\pi x) + \tfrac{1}{2}\sin(w \pi x)$, with the hard target at $w = 10$. Crucially, the network is a vanilla 1-64-64-1 Tanh MLP, with no Fourier features and no SIREN. Spectral bias (§0.9, §2.2) is what is being fought.

naive: train directly at $w = 10$ for 2500 epochs.
lr-anneal: same target, with cosine-annealed learning rate. (Curriculum lite — tries to escape the minimum without changing the target.)
frequency continuation: five stages $w = 2 \to 3 \to 5 \to 7 \to 10$, each 500 epochs (2500 epochs in total, the same budget as naive).
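The three strategies differ only in what changes as a function of epoch: the target parameter $w$, the learning rate, or neither. The sketch below writes those schedules out explicitly. The stage values and epoch counts come from the list above; the learning-rate bounds `lr_max` and `lr_min` and all function names are assumptions, since the widget's actual settings are not given here.

```python
import math

TOTAL_EPOCHS = 2500
STAGES = [2, 3, 5, 7, 10]   # curriculum values of w
EPOCHS_PER_STAGE = 500

def target(x, w):
    """Multi-scale target f(x; w) = 0.5*sin(pi*x) + 0.5*sin(w*pi*x)."""
    return 0.5 * math.sin(math.pi * x) + 0.5 * math.sin(w * math.pi * x)

def w_naive(epoch):
    """naive and lr-anneal: always train on the hard target."""
    return 10

def w_curriculum(epoch):
    """frequency continuation: piecewise-constant w, one value per stage."""
    stage = min(epoch // EPOCHS_PER_STAGE, len(STAGES) - 1)
    return STAGES[stage]

def lr_cosine(epoch, lr_max=1e-3, lr_min=1e-5):
    """Cosine annealing from lr_max to lr_min (both values are assumptions)."""
    t = epoch / (TOTAL_EPOCHS - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

for epoch in (0, 499, 500, 1500, 2499):
    print(epoch, w_naive(epoch), w_curriculum(epoch), f"{lr_cosine(epoch):.2e}")
```

Written this way, the distinction is easy to see: lr-anneal changes how large the steps are, while frequency continuation changes what the steps are aiming at.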

What you should observe

naive: gets stuck on the low-frequency component, never resolves the $\sin(10 \pi x)$ feature. Final relative-L² is typically 60–80% — the spectral-bias plateau.
lr-anneal: similar to naive at typical seeds; learning-rate scheduling alone does not introduce the right inductive bias for multi-scale targets.
frequency continuation: the network learns the $w = 2$ target cleanly, then the $w = 3$ overlay is a small perturbation, and so on up to $w = 10$. The vertical dashed lines mark stage boundaries; the loss panel shows the curriculum descending in steps. Final relative-L² typically drops to 30–50%, a meaningful win on a target that vanilla architectures cannot fully resolve. With Fourier features in addition (§2.2), the same curriculum drives error to $10^{-4}$.
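The 60–80% plateau quoted for the naive run can be sanity-checked with a short calculation. The sketch below assumes the domain is $[0, 1]$ (the text does not state it) and a hypothetical network that has learned the low-frequency half of the target perfectly and none of the $\sin(10 \pi x)$ term. `relative_l2` is an illustrative helper, not the widget's code.

```python
import math

def relative_l2(pred, true):
    """Relative L2 error ||pred - true||_2 / ||true||_2 on sampled values."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

n = 1000
xs = [(i + 0.5) / n for i in range(n)]  # midpoint grid on [0, 1] (assumed domain)

# Full hard target at w = 10, and a prediction that captures only sin(pi*x).
true = [0.5 * math.sin(math.pi * x) + 0.5 * math.sin(10 * math.pi * x) for x in xs]
low_only = [0.5 * math.sin(math.pi * x) for x in xs]

plateau = relative_l2(low_only, true)
print(f"relative L2 when the w = 10 component is missed entirely: {plateau:.3f}")
```

Because the two sine components are orthogonal on this grid and carry equal energy, missing one of them entirely costs a relative error of $1/\sqrt{2} \approx 0.707$, which sits inside the 60–80% band reported for the naive strategy.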

Frequency continuation for seismic FWI

Bunks et al. (1995) introduced frequency continuation for seismic FWI: filter the data to a low-pass band and invert; widen the band and re-invert, starting from the previous result; repeat. The reasoning is the same as in the PINN case: low-pass data gives a more nearly convex misfit with one dominant minimum, which avoids the cycle-skipping pathology of direct full-bandwidth FWI. Modern PINN-FWI methods (Song et al. 2023; Liu et al. 2024) use the same idea: start the PINN training with a low-pass-filtered version of the wavefield, then progressively widen the bandwidth.
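Cycle-skipping and its cure can be reproduced in a one-parameter toy problem. The sketch below is not a waveform inversion: it assumes the only unknown is a time shift $\tau$, that the data contain three sinusoids at 1, 2 and 5 Hz, and that the misfit for each retained frequency takes the closed form $1 - \cos(2\pi f (\tau - \tau^\star))$. All names and numbers are hypothetical, chosen so the initial guess is more than half a period off at 5 Hz.

```python
import math

TRUE_SHIFT = 0.0    # true time shift in seconds (hypothetical)
START_SHIFT = 0.33  # initial guess: > half a period (0.1 s) away at 5 Hz
FULL_BAND = [1.0, 2.0, 5.0]  # frequencies present in the data, in Hz

def misfit_grad(tau, freqs):
    """d/dtau of sum_f [1 - cos(2*pi*f*(tau - TRUE_SHIFT))]."""
    return sum(2 * math.pi * f * math.sin(2 * math.pi * f * (tau - TRUE_SHIFT))
               for f in freqs)

def invert(tau, freqs, steps=500):
    """Plain gradient descent on the band-limited misfit, starting from tau."""
    # Step size 0.5 / (upper bound on curvature) keeps the descent stable.
    lr = 0.5 / sum((2 * math.pi * f) ** 2 for f in freqs)
    for _ in range(steps):
        tau -= lr * misfit_grad(tau, freqs)
    return tau

# Direct full-bandwidth inversion.
direct = invert(START_SHIFT, FULL_BAND)

# Frequency continuation: low-pass first, then widen, warm-starting each stage.
tau = START_SHIFT
for cutoff in (1.0, 2.0, 5.0):
    band = [f for f in FULL_BAND if f <= cutoff]
    tau = invert(tau, band)
continued = tau

print(f"direct:       tau = {direct:.3f}")
print(f"continuation: tau = {continued:.3f}")
```

In this toy, the direct run settles in a neighbouring local minimum roughly two 5 Hz periods away from the truth, while the staged run recovers the true shift, because the 1 Hz stage has a single basin that contains the initial guess.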

Part 6 returns to this in detail: the curriculum is the centrepiece of how PINN-FWI avoids cycle-skipping on Marmousi-class problems.

Optimiser handoff: Adam → L-BFGS

The widget does not include L-BFGS (which is hard to implement well in pure JS), but the technique deserves mention. After Adam reaches a plateau, switching to L-BFGS for ~100 iterations typically reduces the loss by another 1–2 orders of magnitude. L-BFGS builds an approximation to the curvature from recent gradient differences, so near a minimum it usually converges much faster than a first-order method such as Adam. As a quasi-Newton method its rate is at best superlinear, not the quadratic rate of an exact Newton iteration. The combination is the de-facto standard in published seismic PINN papers (Rasht-Behesht 2022, Song 2023, Wu 2024).
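The handoff pattern itself is short. The sketch below shows it on the 2-D Rosenbrock function rather than a PINN, with hand-written Adam and a simplified L-BFGS (two-loop recursion plus Armijo backtracking) so that it runs with the standard library alone. The test function, starting point, step counts and hyperparameters are all assumptions for illustration. In practice one would normally use a library optimiser such as `torch.optim.LBFGS` or SciPy's `minimize` with `method="L-BFGS-B"` rather than this minimal version.

```python
import math

def loss(p):
    """Rosenbrock function: a narrow curved valley with minimum 0 at (1, 1)."""
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def grad(p):
    x, y = p
    return [-2 * (1 - x) - 400 * x * (y - x * x), 200 * (y - x * x)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def adam(p, steps, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    p, m, v = list(p), [0.0] * len(p), [0.0] * len(p)
    for t in range(1, steps + 1):
        g = grad(p)
        for i, gi in enumerate(g):
            m[i] = b1 * m[i] + (1 - b1) * gi
            v[i] = b2 * v[i] + (1 - b2) * gi * gi
            m_hat = m[i] / (1 - b1 ** t)   # bias-corrected first moment
            v_hat = v[i] / (1 - b2 ** t)   # bias-corrected second moment
            p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p

def lbfgs(p, iters, history=10):
    p, g, pairs = list(p), grad(p), []     # pairs holds (s, y, 1/(s.y))
    for _ in range(iters):
        # Two-loop recursion: q becomes (approx. inverse Hessian) times g.
        q, alphas = list(g), []
        for s, y, rho in reversed(pairs):
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if pairs:
            s, y, _ = pairs[-1]
            scale = dot(s, y) / dot(y, y)  # initial inverse-Hessian scaling
            q = [scale * qi for qi in q]
        for (s, y, rho), a in zip(pairs, reversed(alphas)):
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]
        slope = dot(g, d)
        if slope >= 0:                     # not a descent direction: stop
            break
        # Armijo backtracking line search; accept only a sufficient decrease.
        f0, step, new_p = loss(p), 1.0, None
        while step > 1e-12:
            cand = [pi + step * di for pi, di in zip(p, d)]
            if loss(cand) <= f0 + 1e-4 * step * slope:
                new_p = cand
                break
            step *= 0.5
        if new_p is None:
            break
        new_g = grad(new_p)
        s = [a - b for a, b in zip(new_p, p)]
        y = [a - b for a, b in zip(new_g, g)]
        if dot(s, y) > 1e-12:              # keep the approximation positive definite
            pairs = (pairs + [(s, y, 1.0 / dot(s, y))])[-history:]
        p, g = new_p, new_g
    return p

start = [-1.5, 2.0]
p_adam = adam(start, steps=2000)        # phase 1: robust first-order descent
p_both = lbfgs(p_adam, iters=100)       # phase 2: warm-started quasi-Newton polish
print(f"after Adam:   loss = {loss(p_adam):.3e}")
print(f"after L-BFGS: loss = {loss(p_both):.3e}")
```

The essential design point is that L-BFGS starts from Adam's final parameters rather than from scratch. Because every accepted line-search step decreases the loss, the second phase cannot undo the progress made by the first.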

If you need L-BFGS in JavaScript, the Cephes-derived numericjs.minimize works for problems up to ~10k parameters; beyond that you need WebAssembly or native code.

References

Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., Mahoney, M.W. (2021). Characterizing possible failure modes in physics-informed neural networks. NeurIPS.
Bunks, C., Saleck, F.M., Zaleski, S., Chavent, G. (1995). Multiscale seismic waveform inversion. Geophysics 60(5).
Song, C., Alkhalifah, T., Waheed, U.B. (2023). A versatile framework to solve the Helmholtz equation using physics-informed neural networks. Geophys. J. Int.