Why most beginner PINNs do not converge

Part 3 — Training pathologies and remedies

Learning objectives

Recognise that PINN training failure is not random — it falls into five named, well-studied pathologies
Diagnose a stalled PINN training by reading the per-term loss trace, not just the total loss
Map each pathology to the section of Part 3 that fixes it
Build the right diagnostic mindset before reaching for engineering tools

Part 1 introduced the PINN formulation and Part 2 surveyed architectures. With those two ingredients alone, you can write a PINN for almost any PDE. You will also discover, very quickly, that most beginner PINNs do not converge. The optimiser runs, the total loss decreases for a while, and then it plateaus at a value that is nowhere near zero. The network output looks plausible but is not actually a solution to the PDE you wrote down.

This is not a sign that PINNs do not work. It is a sign that the naive training loop has known, named failure modes. Modern PINN engineering — the toolbox developed since Wang, Teng & Perdikaris (2021), Wang, Yu & Perdikaris (2022) and a half-dozen subsequent papers — consists almost entirely of detecting these failure modes and applying targeted fixes. Part 3 is that toolbox. This first section is the map: it names the five pathologies you need to recognise and tells you which section of Part 3 fixes each.

The five named pathologies

Loss-balance crisis. The PINN loss is a sum of terms (PDE residual, IC, BC, data). When their gradient magnitudes differ by orders of magnitude, gradient descent follows the dominant direction and the other terms never reach zero. This is the single most common cause of PINN failure. Wang, Teng & Perdikaris (2021) named the symptom; §3.2 measures it; §3.3 (NTK weighting) and §3.4 (gradient-norm balancing, SA-PINN) fix it.
Spectral bias. Vanilla MLPs learn low frequencies before high frequencies; for a target with high spatial frequency the high-frequency component never enters the gradient before training ends. We met this in §0.9 and §2.2. The fixes (Fourier features, SIREN, multi-scale Fourier) are architectural and live in Part 2. The diagnostic signature — a fit loss that plateaus at $\textrm{Var}(y)/2$ — is what you look for.
Gradient pathology. Even when total loss balance looks fine, the per-parameter gradient magnitudes can be wildly different across loss terms. A term whose gradient vanishes nowhere (PDE residual at most points) overwhelms a term whose gradient is concentrated on a few points (sparse data, sharp IC). This is a finer-grained version of the loss-balance crisis. Wang et al. (2021) call it the gradient flow pathology; §3.4 introduces the gradient-norm-balancing remedy.
Causality violation. Time-domain PINNs are notorious for learning later times before earlier times. The total residual integrated over the entire space-time domain can be made small by finding a smooth function that fits the PDE near the final time, while the initial condition stays violated. Wang, Sankaran & Perdikaris (2024) named it the causality violation; §3.5 introduces the causality-aware loss schedule.
Sampling bias. The PDE residual is enforced at collocation points. Uniform sampling spreads points equally, which is wasteful when the true residual is concentrated in a small region (a shock, a wavefront, a layer). The network over-fits the easy regions and never resolves the hard one. Lu et al. (2021) introduced RAR (Residual-based Adaptive Refinement) to address this; Wu et al. (2023) generalised to RAD. §3.7 covers both.
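The loss-balance crisis in the first item can be made concrete on a toy problem. The sketch below is a minimal illustration in plain Python, not code from this course. It assumes a hypothetical three-parameter model u(x) = t0 + t1*x + t2*sin(w*x) for the problem u'' = -w^2 sin(w x), u(0) = 0, u(1) = sin(w), with w = 4π, and evaluates the gradient of the residual term and of the boundary term analytically. The name `grad_norms`, the model, and every numerical value are assumptions chosen for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def grad_norms(theta, w=4 * math.pi, n_colloc=200):
    """Gradient norms of the residual and boundary loss terms.

    Toy problem (assumed): u'' = -w^2 sin(w x), u(0) = 0, u(1) = sin(w).
    Toy model (assumed):   u(x) = t0 + t1*x + t2*sin(w*x).
    """
    t0, t1, t2 = theta
    xs = [(i + 0.5) / n_colloc for i in range(n_colloc)]  # uniform collocation

    # Residual r(x) = u''(x) + w^2 sin(w x) = (1 - t2) * w^2 * sin(w x).
    # L_res = mean(r^2), which depends on t2 only.
    mean_sin2 = sum(math.sin(w * x) ** 2 for x in xs) / n_colloc
    g_res = [0.0, 0.0, -2.0 * (1.0 - t2) * w ** 4 * mean_sin2]

    # L_bc = u(0)^2 + (u(1) - sin(w))^2.
    e0 = t0
    e1 = t0 + t1 + (t2 - 1.0) * math.sin(w)
    g_bc = [2.0 * e0 + 2.0 * e1, 2.0 * e1, 2.0 * e1 * math.sin(w)]

    return l2_norm(g_res), l2_norm(g_bc)


res_norm, bc_norm = grad_norms([0.5, 0.5, 0.0])
print(f"|grad L_res| = {res_norm:.3e}")
print(f"|grad L_bc|  = {bc_norm:.3e}")
print(f"ratio        = {res_norm / bc_norm:.1e}")
```

At this hypothetical starting point the residual gradient is several thousand times larger than the boundary gradient, because the second derivative contributes a factor of w^4 to the residual term. With equal loss weights, a gradient step is then determined almost entirely by the residual term, which is the situation the loss-balance item describes.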

Try it: the pathology gallery

The widget below runs a minimal in-browser repro for each of the five pathologies. The training settings are deliberately the simplest reasonable choice (vanilla architecture, equal loss weights, uniform sampling, no causality-aware schedule), so you can see the failure mode rather than read about it. After ~1500 epochs the widget reports the diagnostic signature: which loss term plateaued, and where. Each diagnosis ends with a cross-reference to the section of Part 3 that contains the fix.

Why the per-term trace is the only diagnostic that matters

The total loss $\mathcal{L}$ can decrease monotonically while $\mathcal{L}_{\textrm{IC}}$ stays constant. Watching the total alone hides the failure. Watching the per-term trace exposes it instantly: a term that decreases fast then plateaus is suspicious; a term that never decreases is a smoking gun. Every fix in Part 3 starts from a per-term diagnosis.

Practical rule: before applying any of the engineering remedies in §3.2–§3.8, log the per-term trace for at least the first 1000 training steps. If all terms are decreasing in lockstep and the solution is still wrong, you have a problem of architecture (Part 2) or learning rate (Part 0), not loss balance. The remedies of Part 3 are powerful but they are also a hammer; you only reach for them when you have first identified the right nail.
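The practical rule can be sketched as a small helper that labels each logged term. This is a minimal illustration in plain Python under stated assumptions: the function name `classify_term`, the thresholds `drop_tol` and `flat_tol`, and the synthetic traces are all hypothetical stand-ins for values you would log from your own training loop.

```python
import math


def classify_term(trace, tail=200, drop_tol=0.9, flat_tol=0.05):
    """Label one loss-term trace; the thresholds are illustrative assumptions."""
    tail = min(tail, len(trace))
    first, tail_start, last = trace[0], trace[-tail], trace[-1]
    if last > drop_tol * first:
        return "never decreased"      # the smoking gun
    if tail_start - last < flat_tol * tail_start:
        return "plateaued"            # fast early drop, then flat: suspicious
    return "still decreasing"


# Synthetic traces standing in for values logged over 1000 steps.
steps = range(1000)
history = {
    "pde": [math.exp(-t / 100) for t in steps],
    "bc": [0.01 + math.exp(-t / 20) for t in steps],
    "ic": [0.3 for _ in steps],
}
total = [sum(term[t] for term in history.values()) for t in steps]
labels = {name: classify_term(trace) for name, trace in history.items()}

print("total never increases:", all(a >= b for a, b in zip(total, total[1:])))
for name, label in labels.items():
    print(f"L_{name}: {label}")
```

In this synthetic case the total never increases, yet the per-term labels show that the IC term has not moved and the BC term has stalled. A real diagnostic would need thresholds tuned to the noise level of the logged losses.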

Why naive PINN training fails: the unifying picture

All five pathologies can be summarised in one sentence: the gradient of the total loss is dominated by easy parts of the problem at the expense of hard parts. "Easy" is whatever is concentrated in the gradient signal — the loss term with biggest magnitude (loss-balance), low-frequency content (spectral bias), the loss term with most points (gradient pathology), the late-time region of a small advection problem (causality), the smooth interior of a domain (sampling). The fixes in Part 3 are all variations on the same theme: rebalance the gradient signal so the optimiser spends compute where the residual is largest. Once you see this unification, the dozen-or-so techniques in this Part stop feeling like a grab-bag and start feeling like one principle applied through five different lenses.
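One lens on this principle, applied to sampling, is to draw collocation points with probability proportional to the current residual magnitude. The sketch below is a simplified illustration in plain Python, not the RAR or RAD algorithms as published. The residual profile `residual_magnitude` is a made-up function with a thin layer at x = 0.5, and the names `resample` and `in_layer` are hypothetical.

```python
import math
import random


def residual_magnitude(x):
    """Hypothetical |residual|: small everywhere except a thin layer at x = 0.5."""
    return 0.01 + math.exp(-((x - 0.5) / 0.02) ** 2)


def resample(candidates, n_points, rng):
    """Draw collocation points with probability proportional to |residual|."""
    weights = [residual_magnitude(x) for x in candidates]
    return rng.choices(candidates, weights=weights, k=n_points)


def in_layer(x):
    """True when x lies within the assumed hard region around x = 0.5."""
    return abs(x - 0.5) < 0.05


rng = random.Random(0)
candidates = [rng.random() for _ in range(5000)]  # dense uniform candidate pool
points = resample(candidates, 500, rng)

frac_uniform = sum(in_layer(x) for x in candidates) / len(candidates)
frac_adaptive = sum(in_layer(x) for x in points) / len(points)
print(f"fraction of points in the layer, uniform pool: {frac_uniform:.2f}")
print(f"fraction of points in the layer, resampled:    {frac_adaptive:.2f}")
```

Uniform sampling places points in the layer only in proportion to its width, while the residual-weighted draw concentrates most of them there. The same idea of reweighting towards the hard part of the problem underlies the loss-weighting and causality remedies, which act on loss terms and time slices rather than on point locations.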

What you now know

You can recognise the five named PINN failure modes by their per-term loss signatures. You know which Part 3 section addresses each. You have the diagnostic mindset that turns "my PINN does not converge" from a vague complaint into a concrete debugging exercise. Each subsequent section in Part 3 (§3.2–§3.8) develops one targeted fix in depth, with a widget that lets you see the fix working on the corresponding pathology.

References

Wang, S., Teng, Y., Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43(5), A3055–A3081.
Wang, S., Yu, X., Perdikaris, P. (2022). When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys. 449, 110768.
Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., Mahoney, M.W. (2021). Characterizing possible failure modes in physics-informed neural networks. NeurIPS.
Wang, S., Sankaran, S., Perdikaris, P. (2024). Respecting causality is all you need for training physics-informed neural networks. Comput. Methods Appl. Mech. Engrg.
Lu, L., Meng, X., Mao, Z., Karniadakis, G.E. (2021). DeepXDE: A deep learning library for solving differential equations. SIAM Review 63(1), 208–228.