The loss-balance crisis: data + PDE + BC weights
Learning objectives
- State the loss-balance crisis precisely: when ‖∇L_a‖ ≫ ‖∇L_b‖, gradient descent on L_a + L_b reduces only L_a
- Recognise that the optimal weight λ depends on gradient magnitudes that change as training progresses
- Hand-tune λ_IC for a harmonic IVP and find the Goldilocks zone empirically
- Build the intuition that motivates §3.3's NTK-based automatic weighting
Pathology #1 in §3.1 was the loss-balance crisis: the harmonic IVP with the default IC weight $\lambda_{\mathrm{IC}} = 1$ stalled, with the PDE residual collapsing while the IC loss plateaued at a non-zero value. This section makes that observation precise, names the underlying mechanism, and lets you hand-tune the IC weight to find a working configuration. The takeaway is uncomfortable but unavoidable: there is no a-priori-correct weight. The right value depends on the gradient magnitudes of each loss term, and those magnitudes shift as training progresses. §3.3 (NTK weighting) and §3.4 (gradient-norm balancing, SA-PINN) automate the choice.
The mathematical statement
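For a multi-term loss
$$L(\theta) = \sum_i \lambda_i L_i(\theta)$$
the gradient flow is
$$\frac{d\theta}{dt} = -\nabla_\theta L(\theta) = -\sum_i \lambda_i \nabla_\theta L_i(\theta).$$
If $\|\lambda_a \nabla_\theta L_a\| \gg \|\lambda_b \nabla_\theta L_b\|$ for two terms $a$ and $b$, the trajectory follows $-\lambda_a \nabla_\theta L_a$ almost exclusively; the $L_b$ term is essentially invisible to the optimiser. Wang, Teng & Perdikaris (2021) called this the gradient flow pathology. The empirical signature: one term plateaus at a non-zero value while the others decrease.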
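The mechanism can be reproduced without a neural network. The sketch below is a toy illustration, not the harmonic IVP itself: two decoupled quadratic terms with curvatures `k_a` and `k_b` (made-up values) stand in for a stiff term and a soft term. The step size is capped by the stiff term, since stability requires `lr < 2 / k_a`, so with equal weights the soft term barely moves.

```python
def descend(lam_b, k_a=1000.0, k_b=1.0, lr=5e-4, steps=200):
    """Gradient descent on L_a + lam_b * L_b for two decoupled quadratics.

    L_a = 0.5 * k_a * x**2        (stiff term, large gradients)
    L_b = 0.5 * k_b * (y - 1)**2  (soft term, small gradients)
    Returns the final (L_a, L_b) as fractions of their initial values.
    """
    x, y = 1.0, 0.0
    la0 = 0.5 * k_a * x**2
    lb0 = 0.5 * k_b * (y - 1.0) ** 2
    for _ in range(steps):
        grad_x = k_a * x                   # dL_a/dx
        grad_y = lam_b * k_b * (y - 1.0)   # d(lam_b * L_b)/dy
        x -= lr * grad_x
        y -= lr * grad_y
    return 0.5 * k_a * x**2 / la0, 0.5 * k_b * (y - 1.0) ** 2 / lb0


unweighted = descend(lam_b=1.0)      # initial gradient norms differ by 10^3
balanced = descend(lam_b=1000.0)     # lam_b = k_a / k_b matches the norms
print("lam_b = 1:    L_a ratio %.2e, L_b ratio %.2e" % unweighted)
print("lam_b = 1000: L_a ratio %.2e, L_b ratio %.2e" % balanced)
```

With `lam_b = 1` the stiff term vanishes while the soft term keeps most of its initial value; with the weight matched to the gradient ratio both terms shrink at the same rate.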
Why the right λ is not obvious
You might hope that $\lambda_i = 1$ for every term ("treat the terms equally") is a sensible default. It is not. Consider:
- $L_{\mathrm{IC}}$ is evaluated at one point. Its gradient with respect to $\theta$ has magnitude proportional to whatever the mismatch at that point happens to be, so it depends on initialisation.
- $L_{\mathrm{PDE}}$ is averaged over the collocation points and involves second derivatives. Its gradient magnitude is set by those second-derivative terms, scaled by random initialisation.
For random init, the second-derivative-heavy PDE gradient is typically 10²–10³ times larger than the IC gradient at $\lambda_{\mathrm{IC}} = 1$. The optimiser then sees only $L_{\mathrm{PDE}}$ and finds $u \equiv 0$, which satisfies the PDE trivially but violates the IC.
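You can measure the two gradient norms yourself before training. The sketch below is a minimal NumPy version under several assumptions that are not taken from the text: the IVP is $u'' + \omega^2 u = 0$ with $u(0) = 1$, $u'(0) = 0$, the frequency `OMEGA`, the width `HIDDEN`, the one-hidden-layer tanh network, and the standard-normal initialisation are all illustrative choices. Time derivatives are written in closed form and parameter gradients use central finite differences, so no autodiff library is needed.

```python
import numpy as np

OMEGA = 2.0 * np.pi   # assumed frequency; not specified in the text
HIDDEN = 16           # assumed hidden width


def net(theta, t):
    """Return u, u', u'' at times t for u(t) = sum_j a_j * tanh(w_j * t + b_j)."""
    a, w, b = np.split(theta, 3)
    s = np.tanh(np.outer(t, w) + b)   # tanh(z), shape (len(t), HIDDEN)
    ds = 1.0 - s**2                   # tanh'(z)
    dds = -2.0 * s * ds               # tanh''(z)
    return s @ a, (ds * w) @ a, (dds * w**2) @ a


def loss_pde(theta, t_col):
    u, _, upp = net(theta, t_col)
    return np.mean((upp + OMEGA**2 * u) ** 2)   # residual of u'' + omega^2 u = 0


def loss_ic(theta):
    u0, up0, _ = net(theta, np.array([0.0]))
    return float((u0[0] - 1.0) ** 2 + up0[0] ** 2)   # assumed IC: u(0)=1, u'(0)=0


def grad_norm(f, theta, eps=1e-6):
    """L2 norm of the central finite-difference gradient of f at theta."""
    g = np.empty_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        g[k] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return float(np.linalg.norm(g))


def measure(seed=0, n_col=64):
    """Gradient norms of the PDE and IC terms at a random initialisation."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=3 * HIDDEN)
    t_col = np.linspace(0.0, 1.0, n_col)
    return grad_norm(lambda th: loss_pde(th, t_col), theta), grad_norm(loss_ic, theta)


g_pde, g_ic = measure()
print(f"|grad L_PDE| = {g_pde:.3e}, |grad L_IC| = {g_ic:.3e}, ratio = {g_pde / g_ic:.1f}")
```

The ratio you get depends on the assumed frequency, width, and init scale, so treat the printed number as a diagnostic for your own setup rather than a reproduction of the figures quoted above. Try several seeds to see how much it moves.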
Try it: the manual weight tuner
The widget solves the same harmonic IVP from §3.1's loss-balance pathology with a slider for $\log_{10} \lambda_{\mathrm{IC}}$ from -3 to 5. For each setting, click Train for a fresh 2000-epoch fit. Watch:
- The per-term loss histories. Both should decrease together for a good $\lambda_{\mathrm{IC}}$.
- The relative-L² error against the analytic cosine solution. Below 5% is a clean fit; 30% or more means the network found a wrong steady state.
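For reference, the relative-L² metric is the norm of the error divided by the norm of the reference solution. The sketch below shows the calculation on sampled values; the frequency `OMEGA`, the evaluation grid, and the form $\cos(\omega t)$ of the analytic solution are assumptions made for illustration.

```python
import math


def relative_l2(pred, true):
    """Relative L2 error ||pred - true||_2 / ||true||_2 over sampled points."""
    num = math.sqrt(sum((p - q) ** 2 for p, q in zip(pred, true)))
    den = math.sqrt(sum(q ** 2 for q in true))
    return num / den


OMEGA = 2.0 * math.pi                        # assumed frequency, for illustration only
ts = [i / 200 for i in range(201)]           # evaluation grid on [0, 1]
u_true = [math.cos(OMEGA * t) for t in ts]   # assumed analytic solution cos(omega * t)

u_zero = [0.0 for _ in ts]                   # the trivial PDE-only answer u = 0
u_close = [1.02 * u for u in u_true]         # a fit that is 2% too large in amplitude

print(f"u = 0      : {100 * relative_l2(u_zero, u_true):.1f}%")
print(f"2% too big : {100 * relative_l2(u_close, u_true):.1f}%")
```

The trivial output $u \equiv 0$ scores exactly 100% under this metric regardless of the reference solution, which is why a reading near 100% is the fingerprint of the PDE-only failure.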
What you should observe
- $\lambda_{\mathrm{IC}} \ll 1$: PDE residual collapses to ≈ 0, IC stays ≈ 1, network output $u \approx 0$. Relative-L² ≈ 100%.
- $\lambda_{\mathrm{IC}} = 1$ (the naive default): same failure mode, slightly less extreme.
- Intermediate $\lambda_{\mathrm{IC}}$, above the default: sweet spot. Both terms decrease together, the cosine is recovered, relative-L² drops to a few percent.
- $\lambda_{\mathrm{IC}}$ far above the sweet spot: IC dominates, and the network fits the IC perfectly while ignoring the PDE. Relative-L² is again large.
The Goldilocks zone is narrow — about two orders of magnitude wide — and it sits where the gradient magnitudes of the two terms are matched, not where the loss values are matched. That is the whole game in a sentence.
Why hand-tuning is not a real fix
The widget gives you four points of evidence: bad, bad, good, bad. In a real PINN problem (Burgers, wave equation, FWI) the search space has many λs (PDE residual, IC, BC, data, sometimes split per spatial component), the gradient magnitudes shift across training, and you cannot afford to scan a 4D log-space grid of λ values to find the Goldilocks volume. Three responses have emerged in the literature:
- Hard constraints (Part 2 §2.4). When the geometry permits, reparameterise the network so the IC and BC are identically satisfied: only $L_{\mathrm{PDE}}$ remains. This eliminates the loss-balance crisis at the IC/BC by removing those loss terms entirely.
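- NTK-based automatic weighting (Wang, Yu, Perdikaris 2022; §3.3). Compute the NTK trace $\mathrm{Tr}(K_i)$ for each loss term and set $\lambda_i = \mathrm{Tr}(K) / \mathrm{Tr}(K_i)$, where $\mathrm{Tr}(K)$ is the sum of the per-term traces. This rebalances the gradient magnitudes automatically.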
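- Gradient-norm balancing (Wang, Teng, Perdikaris 2021; §3.4). Cheap heuristic: at each step, set $\lambda_i$ proportional to the ratio of the largest gradient norm among the terms to the gradient norm of term $i$. The original 2021 paper presents this as a learning-rate annealing algorithm; it is often described as a GradNorm-style fix.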
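To make the third response concrete, here is a minimal sketch of one balancing step. It follows the description in the list above (weights driven by ratios of gradient L²-norms); the moving average, its rate `alpha`, the fixed reference term, and the function names are assumptions for illustration, not details taken from the cited paper.

```python
import math


def l2_norm(grad):
    """L2 norm of a flat list of gradient entries."""
    return math.sqrt(sum(g * g for g in grad))


def update_weights(weights, grads, reference="pde", alpha=0.1):
    """One balancing step: move each weight toward ||grad_ref|| / ||grad_i||.

    weights: dict term -> current lambda
    grads:   dict term -> flat list holding that term's unweighted gradient
    alpha:   moving-average rate (hypothetical default)
    """
    ref = l2_norm(grads[reference])
    new = {}
    for term, lam in weights.items():
        if term == reference:
            new[term] = lam   # the reference weight stays fixed
            continue
        target = ref / (l2_norm(grads[term]) + 1e-12)
        new[term] = (1.0 - alpha) * lam + alpha * target
    return new


weights = {"pde": 1.0, "ic": 1.0}
grads = {"pde": [300.0, -400.0], "ic": [0.3, 0.4]}   # made-up gradients, norm ratio 1000
for _ in range(100):
    weights = update_weights(weights, grads)
print(weights)
```

In a real training loop the gradients change every step, so the weights keep tracking a moving target; the moving average keeps them from jumping when a single batch produces an unusual gradient.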
The widget exposes the limitation honestly
Even at the sweet-spot weight, the relative-L² error is typically a few percent: better than the failed cases, but short of the accuracy that NTK-balanced weighting reaches on the same problem. The hand-tuned weight is good enough to see the principle but it is not a winning strategy. §3.3 picks up the story.
Practical rule
Whenever you write down a multi-term PINN loss, log per-term gradient norms from the very first epoch. Print them every 100 steps. If any pair differs by more than 10², you have a loss-balance crisis. Fix it before tweaking anything else — learning rate, architecture, or sampling will not save you.
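The rule is easy to wire into any training loop. The sketch below is framework-agnostic: it assumes you already compute one gradient norm per loss term each step, and the function names, term labels, and sample numbers are all hypothetical.

```python
import itertools


def imbalance(norms, threshold=1e2):
    """Return (larger_term, smaller_term, ratio) for pairs whose norms differ by more than threshold."""
    flagged = []
    for a, b in itertools.combinations(norms, 2):
        big, small = (a, b) if norms[a] >= norms[b] else (b, a)
        ratio = norms[big] / max(norms[small], 1e-30)
        if ratio > threshold:
            flagged.append((big, small, ratio))
    return flagged


def log_step(step, norms, every=100):
    """Print per-term gradient norms every `every` steps and warn on imbalance."""
    if step % every != 0:
        return []
    print(f"step {step:5d} | " + " | ".join(f"{k}: {v:.2e}" for k, v in norms.items()))
    flagged = imbalance(norms)
    for big, small, ratio in flagged:
        print(f"  warning: |grad {big}| / |grad {small}| = {ratio:.1e}")
    return flagged


# Made-up norms standing in for values measured inside a training loop.
flags = log_step(0, {"pde": 4.1e2, "ic": 9.0e-1, "bc": 2.5e0})
```

Call `log_step` once per iteration with the current norms; it stays silent between logging steps and returns the flagged pairs so you can also stop the run or trigger reweighting programmatically.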
References
- Wang, S., Teng, Y., Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43(5), A3055–A3081.
- van der Meer, R., Oosterlee, C., Borovykh, A. (2022). Optimally weighted loss functions for solving PDEs with neural networks. JCAM 405, 113887.
- Bischof, R., Kraus, M. (2021). Multi-objective loss balancing for physics-informed deep learning. arXiv:2110.09813.