Curriculum and multi-stage training
Learning objectives
- Recognise that hard PINN problems become far more tractable when approached through easier surrogates
- Apply frequency continuation (Bunks et al. 1995) to a multi-scale PINN target
- Distinguish curriculum learning from learning-rate scheduling
- Build the link from classical seismic FWI workflows to modern PINN curricula
Krishnapriyan, Gholami, Zhe, Kirby & Mahoney (NeurIPS 2021) showed that PINN training on a hard PDE problem improves dramatically when you build up to it: train a network on an easier version of the problem first, then progressively warp the training target toward the hard one. This is the same continuation principle that Bunks et al. (1995) introduced for seismic FWI as frequency continuation, carried over to PINNs.
Why curriculum works for PINNs
The optimisation landscape of a multi-scale PINN problem has many local minima, and direct optimisation tends to land in whichever one is closest to the random initialisation. Training on an easier surrogate (a low-frequency target) produces a smoother loss landscape with one dominant minimum, which the optimiser finds reliably. Warping toward the harder target then moves the minimum, but only locally: the network is already in the basin of the right minimum. People solve hard regression problems the same way: start with a simple model, then add complexity.
The technique has three flavours, distinguished by what is being warped:
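- Frequency continuation: warp the target frequency content from low to high. This is what Bunks et al. 1995 introduced for seismic FWI; Krishnapriyan et al. 2021 applied the analogous idea to PINNs by starting from an easy PDE coefficient and ramping it up to the hard one.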
- Spatial-domain expansion: train on a small subdomain first, then enlarge. Useful for problems with sharp features at the boundary.
- Optimiser handoff: train with Adam (robust, fast initial convergence), then switch to L-BFGS (precise, quasi-Newton). Standard practice in seismic PINN papers.
Try it: frequency continuation
The widget races three strategies on the multi-scale target with the hard target at . Crucially the network is a vanilla 1-64-64-1 Tanh MLP — no Fourier features, no SIREN. Spectral bias (§0.9, §2.2) is what is being fought.
- naive: train directly at for 2500 epochs.
- lr-anneal: same target, with cosine-annealed learning rate. (Curriculum lite — tries to escape the minimum without changing the target.)
- frequency continuation: five stages , each 500 epochs.
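The two schedules being compared can be sketched in a few lines. This is a minimal illustration, not the widget's code: the function names, the learning-rate bounds, and especially the stage frequencies in STAGE_FREQS are assumptions made for the example (the widget's actual stage values are not reproduced here). Only the 5 x 500 = 2500 epoch budget comes from the text above.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=1e-3, lr_min=1e-5):
    """Cosine-annealed learning rate: lr_max at epoch 0, lr_min at the final epoch.

    lr_max and lr_min are illustrative defaults, not the widget's settings.
    """
    t = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

def curriculum_stage(epoch, epochs_per_stage=500, n_stages=5):
    """Index of the curriculum stage active at a given epoch (clamped to the last stage)."""
    return min(epoch // epochs_per_stage, n_stages - 1)

# Hypothetical stage frequencies, ordered low to high (assumed for illustration).
STAGE_FREQS = [1.0, 2.0, 4.0, 8.0, 16.0]

def target_frequency(epoch):
    """Frequency content of the training target at a given epoch."""
    return STAGE_FREQS[curriculum_stage(epoch)]

if __name__ == "__main__":
    for epoch in (0, 499, 500, 1200, 2499):
        print(epoch, curriculum_stage(epoch), target_frequency(epoch),
              f"{cosine_lr(epoch, 2500):.2e}")
```

The difference between the strategies is visible in what each function changes: cosine_lr alters only the step size while the target stays fixed, whereas target_frequency changes the problem being solved and leaves the optimiser alone.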
What you should observe
- naive: gets stuck on the low-frequency component, never resolves the feature. Final relative-L² is typically 60–80% — the spectral-bias plateau.
- lr-anneal: similar to naive for typical seeds; learning-rate scheduling alone does not introduce the right inductive bias for multi-scale targets.
- frequency continuation: the network learns the component cleanly, then the overlay is a small perturbation, and so on up to . The vertical dashed lines mark stage boundaries; the loss panel shows the curriculum descending in steps. Final relative-L² typically drops to 30–50%, a meaningful win on a target that vanilla architectures cannot fully resolve. With Fourier features in addition (§2.2), the same curriculum drives error to .
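For reference, the relative-L² figures quoted above can be computed as the norm of the error divided by the norm of the reference solution. The sketch below assumes both are sampled on the same grid; the two-scale target sin(x) + sin(8x) is a hypothetical stand-in chosen for the example, not the widget's target.

```python
import math

def relative_l2(pred, true):
    """Relative L2 error ||pred - true||_2 / ||true||_2 over sampled points."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have the same length")
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    if den == 0.0:
        raise ValueError("reference solution is identically zero")
    return num / den

# Hypothetical two-scale target on one period; the "prediction" has learned
# only the low-frequency component, mimicking a spectral-bias plateau.
n = 1000
xs = [2.0 * math.pi * j / n for j in range(n)]
target = [math.sin(x) + math.sin(8.0 * x) for x in xs]
low_only = [math.sin(x) for x in xs]
err = relative_l2(low_only, target)
print(f"relative L2 error = {100 * err:.1f}%")
```

With two equal-amplitude components, missing one of them entirely gives an error of 1/√2, about 71%. For this assumed target, a model stuck on the low-frequency part would therefore sit in the same range as the plateau described for the naive strategy.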
Frequency continuation for seismic FWI
Bunks et al. (1995) introduced frequency continuation for seismic FWI: filter the data to a low-pass band, invert; widen the band, re-invert; repeat. The reasoning was identical to the PINN case: low-pass data gives a more nearly convex misfit with one dominant minimum, avoiding the cycle-skipping pathology of direct full-bandwidth FWI. Modern PINN-FWI methods (Song et al. 2023; Liu et al. 2024) use the same idea: start the PINN training with a low-pass-filtered version of the wavefield, then progressively widen the bandwidth.
Part 6 returns to this in detail: the curriculum is the centrepiece of how PINN-FWI avoids cycle-skipping on Marmousi-class problems.
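The cycle-skipping argument in this section can be checked on a toy problem. The sketch below is an assumption-laden stand-in for real FWI: a single time shift plays the role of the model error, and a Gaussian-windowed cosine plays the role of band-limited data. The wavelet shape, the frequencies (0.5 and 4.0), and the grid spacings are all chosen for illustration. It counts the local minima of the L² misfit as a function of the trial shift.

```python
import math

def wavelet(t, freq, sigma=0.5):
    """Gaussian-windowed cosine: a toy band-limited trace (illustrative choice)."""
    return math.exp(-(t / sigma) ** 2) * math.cos(2.0 * math.pi * freq * t)

def misfit_curve(freq, shifts, times):
    """L2 misfit between the 'observed' trace (true shift 0) and each trial shift."""
    observed = [wavelet(t, freq) for t in times]
    return [sum((wavelet(t - s, freq) - d) ** 2 for t, d in zip(times, observed))
            for s in shifts]

def count_local_minima(values):
    """Number of strict interior local minima of a sampled curve."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i] < values[i - 1] and values[i] < values[i + 1])

times = [-4.0 + 0.01 * i for i in range(801)]     # time axis
shifts = [-0.9 + 0.01 * i for i in range(181)]    # trial shifts around the truth

n_low = count_local_minima(misfit_curve(0.5, shifts, times))
n_high = count_local_minima(misfit_curve(4.0, shifts, times))
print(f"local minima: low-frequency = {n_low}, high-frequency = {n_high}")
```

At the low frequency the misfit has a single basin over this range of shifts, so gradient descent from any trial shift in the range reaches the true one. At the high frequency every shift that is off by a whole number of periods produces its own local minimum, which is the cycle-skipping trap that continuation is designed to avoid.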
Optimiser handoff: Adam → L-BFGS
The widget does not include L-BFGS (which is hard to implement well in pure JS), but the technique deserves mention. After Adam reaches a plateau, switching to L-BFGS for ~100 iterations typically reduces the loss by another 1–2 orders of magnitude. L-BFGS builds an approximation to the inverse Hessian from recent gradient differences, so near a minimum it usually converges much faster than a first-order method; its rate is superlinear at best, not the quadratic rate of a true Newton method. Adam uses only rescaled first-order gradients and tends to stall once the loss surface becomes badly conditioned. The combination is the de-facto standard in published seismic PINN papers (Rasht-Behesht 2022, Song 2023, Wu 2024).
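If you need a quasi-Newton optimiser in JavaScript, numeric.js provides numeric.uncmin, a BFGS-type unconstrained minimiser (full-memory rather than limited-memory); it works for problems up to ~10k parameters, and beyond that you need WebAssembly or native code.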
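The Adam-then-L-BFGS handoff described in this section can be sketched without any framework. The example below is a toy under stated assumptions: the loss is the 2-D Rosenbrock function standing in for a PINN loss, the step counts and learning rate are arbitrary, and the L-BFGS is a bare-bones two-loop recursion with Armijo backtracking rather than a production line search. In practice you would use your framework's built-in optimisers.

```python
import math

def loss(p):
    """Rosenbrock function: a curved, badly conditioned valley with minimum at (1, 1)."""
    x, y = p
    return (1.0 - x) ** 2 + 100.0 * (y - x * x) ** 2

def grad(p):
    x, y = p
    return [-2.0 * (1.0 - x) - 400.0 * x * (y - x * x), 200.0 * (y - x * x)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def adam(p, steps, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8):
    """Stage 1: plain Adam with bias correction."""
    p, m, v = list(p), [0.0] * len(p), [0.0] * len(p)
    for k in range(1, steps + 1):
        for i, gi in enumerate(grad(p)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** k)
            v_hat = v[i] / (1.0 - beta2 ** k)
            p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p

def lbfgs(p, iters, history=10):
    """Stage 2: L-BFGS (two-loop recursion) with Armijo backtracking."""
    p, g, pairs = list(p), grad(p), []            # pairs holds (s, y, 1 / s.y)
    for _ in range(iters):
        if dot(g, g) < 1e-20:                     # gradient is numerically zero
            break
        q, alphas = list(g), []
        for s, y, rho in reversed(pairs):         # newest to oldest
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if pairs:
            s, y, _ = pairs[-1]
            gamma = dot(s, y) / dot(y, y)         # usual initial Hessian scaling
        else:
            gamma = 1.0 / math.sqrt(dot(g, g))    # first trial step has unit length
        r = [gamma * qi for qi in q]
        for (s, y, rho), a in zip(pairs, reversed(alphas)):   # oldest to newest
            b = rho * dot(y, r)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]                     # quasi-Newton descent direction
        f0, slope, t = loss(p), dot(g, d), 1.0
        trial = [pi + t * di for pi, di in zip(p, d)]
        while t > 1e-12 and loss(trial) > f0 + 1e-4 * t * slope:
            t *= 0.5
            trial = [pi + t * di for pi, di in zip(p, d)]
        if t <= 1e-12:
            break                                 # line search failed; stop polishing
        g_new = grad(trial)
        s = [u - w for u, w in zip(trial, p)]
        y = [u - w for u, w in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 1e-10 * math.sqrt(dot(s, s) * dot(y, y)):   # keep positive curvature only
            pairs = (pairs + [(s, y, 1.0 / sy)])[-history:]
        p, g = trial, g_new
    return p

p_adam = adam([-1.2, 1.0], steps=500)
p_final = lbfgs(p_adam, iters=100)
print(f"after Adam:   loss = {loss(p_adam):.3e}")
print(f"after L-BFGS: loss = {loss(p_final):.3e}")
```

The design point is the handoff itself: Adam needs no line search and tolerates a poor starting point, so it does the coarse descent into the valley; L-BFGS then reuses the parameters Adam produced as its initial guess and exploits curvature to finish. The size of the gain on this toy should not be read as a prediction for PINN losses.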
References
- Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., Mahoney, M.W. (2021). Characterizing possible failure modes in physics-informed neural networks. NeurIPS.
- Bunks, C., Saleck, F.M., Zaleski, S., Chavent, G. (1995). Multiscale seismic waveform inversion. Geophysics 60(5).
- Song, C., Alkhalifah, T., Waheed, U.B. (2023). A versatile framework to solve the Helmholtz equation using physics-informed neural networks. Geophys. J. Int.