The loss-balance crisis: data + PDE + BC weights

Part 3 — Training pathologies and remedies

Learning objectives

  • State the loss-balance crisis precisely: when ‖∇L_a‖ ≫ ‖∇L_b‖, gradient descent on L_a + L_b reduces only L_a
  • Recognise that the optimal weight λ depends on gradient magnitudes that change as training progresses
  • Hand-tune λ_IC for a harmonic IVP and find the Goldilocks zone empirically
  • Build the intuition that motivates §3.3's NTK-based automatic weighting

Pathology #1 in §3.1 was the loss-balance crisis: the harmonic IVP with default IC weight 1 stalled with $\mathcal{L}_{\textrm{IC}} \approx 1$ while $\mathcal{L}_{\textrm{PDE}} \to 0$. This section makes that observation precise, names the underlying mechanism, and lets you hand-tune the IC weight to find a working configuration. The takeaway is uncomfortable but unavoidable: there is no a-priori-correct weight. The right value depends on the gradient magnitudes of each loss term, and those magnitudes shift as training progresses. §3.3 (NTK weighting) and §3.4 (gradient-norm balancing, SA-PINN) automate the choice.

The mathematical statement

For a multi-term loss

$$\mathcal{L}(\theta) = \lambda_{\textrm{PDE}} \mathcal{L}_{\textrm{PDE}}(\theta) + \lambda_{\textrm{IC}} \mathcal{L}_{\textrm{IC}}(\theta) + \lambda_{\textrm{BC}} \mathcal{L}_{\textrm{BC}}(\theta) + \lambda_{\textrm{data}} \mathcal{L}_{\textrm{data}}(\theta) ,$$

the gradient flow is

$$\dot{\theta} = -\lambda_{\textrm{PDE}} \nabla_{\theta} \mathcal{L}_{\textrm{PDE}} - \lambda_{\textrm{IC}} \nabla_{\theta} \mathcal{L}_{\textrm{IC}} - \lambda_{\textrm{BC}} \nabla_{\theta} \mathcal{L}_{\textrm{BC}} - \lambda_{\textrm{data}} \nabla_{\theta} \mathcal{L}_{\textrm{data}} .$$

If $\|\lambda_a \nabla \mathcal{L}_a\| \gg \|\lambda_b \nabla \mathcal{L}_b\|$, the trajectory follows $-\nabla \mathcal{L}_a$ almost exclusively; the $\mathcal{L}_b$ term is essentially invisible to the optimiser. Wang, Teng & Perdikaris (2021) called this the gradient flow pathology. The empirical signature: one term plateaus at a non-zero value while the others decrease.
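A small worked example makes "essentially invisible" quantitative. The decoupled quadratic loss below is an illustrative assumption, not a model of a real PINN loss: the curvatures $k_a$, $k_b$, the step size $\eta$ and the one-parameter-per-term structure are all hypothetical.

```latex
% Toy loss with one parameter per term (illustrative assumption)
\begin{align*}
\mathcal{L}(x, y) &= \tfrac{1}{2} \lambda_a k_a x^2 + \tfrac{1}{2} \lambda_b k_b (y - 1)^2 \\
% Gradient descent with step size \eta contracts each error independently
x_n &= (1 - \eta \lambda_a k_a)^n \, x_0 , \qquad
y_n - 1 = (1 - \eta \lambda_b k_b)^n \, (y_0 - 1) \\
% Stability of the stiff term a requires \eta < 2 / (\lambda_a k_a), hence
1 - \eta \lambda_b k_b &> 1 - \frac{2 \lambda_b k_b}{\lambda_a k_a}
\end{align*}
% With \lambda_a k_a / (\lambda_b k_b) = 10^3 the b-error shrinks by
% less than 0.2% per step, however the stable step size is chosen.
```

Under these assumptions the step size is capped by the stiff term, so the weak term can only creep towards its minimum: that is the plateau seen in the loss history.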

Why the right λ is not obvious

You might hope that $\lambda_{\textrm{PDE}} = \lambda_{\textrm{IC}} = 1$ ("treat the terms equally") is a sensible default. It is not. Consider:

  • $\mathcal{L}_{\textrm{IC}} = (u_\theta(0) - 1)^2 + (u'_\theta(0))^2$ is evaluated at one point. Its gradient with respect to $\theta$ has magnitude proportional to whatever $\partial u_\theta(0) / \partial \theta$ happens to be, which depends on the initialisation.
  • $\mathcal{L}_{\textrm{PDE}} = \frac{1}{N_c} \sum_{i=1}^{N_c} (u''_\theta(t_i) + \omega^2 u_\theta(t_i))^2$ is averaged over $N_c = 50$ collocation points and involves second derivatives. Its gradient has magnitude proportional to $\omega^2 \sim 39$, scaled by the random initialisation.

For random initialisation, the second-derivative-heavy PDE gradient is typically 10²–10³ times larger than the IC gradient at $\lambda_{\textrm{IC}} = 1$. The optimiser then sees only $\mathcal{L}_{\textrm{PDE}}$ and finds $u(t) \approx 0$, which satisfies the PDE trivially but violates the IC.
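To check the gradient imbalance yourself, the sketch below measures both gradient norms at a random initialisation. Everything about the model is an assumption made for illustration: a single hidden layer of 8 tanh units, standard-normal weights, $\omega = 2\pi$ (so that $\omega^2 \approx 39$), the domain $[0, 1]$, and finite-difference gradients in place of autodiff. The ratio it prints depends on all of these choices, so treat it as one data point rather than a reproduction of the widget.

```python
import math
import random

OMEGA = 2.0 * math.pi                       # assumed, so omega**2 is about 39
N_C = 50                                    # collocation points, as in the text
T = [i / (N_C - 1) for i in range(N_C)]     # assumed domain [0, 1]
H = 8                                       # hidden units (hypothetical choice)

def u_derivs(theta, t):
    """Return u, u', u'' for u(t) = sum_k a_k * tanh(w_k * t + b_k)."""
    u = du = d2u = 0.0
    for k in range(H):
        a, w, b = theta[3 * k], theta[3 * k + 1], theta[3 * k + 2]
        th = math.tanh(w * t + b)
        sech2 = 1.0 - th * th
        u += a * th
        du += a * w * sech2
        d2u += a * w * w * (-2.0 * th * sech2)
    return u, du, d2u

def loss_ic(theta):
    u0, du0, _ = u_derivs(theta, 0.0)
    return (u0 - 1.0) ** 2 + du0 ** 2

def loss_pde(theta):
    total = 0.0
    for t in T:
        u, _, d2u = u_derivs(theta, t)
        total += (d2u + OMEGA ** 2 * u) ** 2
    return total / N_C

def grad_norm(loss, theta, eps=1e-6):
    """L2 norm of a central finite-difference gradient of loss at theta."""
    sq = 0.0
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += eps
        dn[i] -= eps
        g = (loss(up) - loss(dn)) / (2.0 * eps)
        sq += g * g
    return math.sqrt(sq)

random.seed(0)
theta0 = [random.gauss(0.0, 1.0) for _ in range(3 * H)]
g_ic = grad_norm(loss_ic, theta0)
g_pde = grad_norm(loss_pde, theta0)
print(f"|grad L_IC| = {g_ic:.3e}  |grad L_PDE| = {g_pde:.3e}  ratio = {g_pde / g_ic:.1f}")
```

Changing the seed or the weight scale is a quick way to see how strongly the ratio depends on the initialisation.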

Try it: the manual weight tuner

The widget solves the same harmonic IVP from §3.1's loss-balance pathology with a slider for $\log_{10}(\lambda_{\textrm{IC}})$ from $-3$ to $5$. For each setting, click Train for a fresh 2000-epoch fit. Watch:

  • The per-term loss histories. Both should decrease together for a good $\lambda_{\textrm{IC}}$.
  • The relative-L² error against the analytic solution $u(t) = \cos(\omega t)$. Below 5% is a clean fit; 30% or more means the network found a wrong steady state.
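For reference, the relative-L² error can be computed as in the sketch below. The grid on $[0, 1]$ and the value $\omega = 2\pi$ are assumptions for illustration; the widget's own grid and frequency may differ.

```python
import math

def relative_l2(pred, ref):
    """Relative L2 error ||pred - ref||_2 / ||ref||_2 on a shared grid."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

omega = 2.0 * math.pi                      # assumed value
ts = [i / 200 for i in range(201)]         # assumed evaluation grid on [0, 1]
exact = [math.cos(omega * t) for t in ts]

zero_fit = [0.0 for _ in ts]               # the trivial PDE solution u(t) = 0
print(f"u = 0 everywhere: {100 * relative_l2(zero_fit, exact):.0f}% error")
```

The trivial solution $u(t) = 0$ scores exactly 100% by construction, because the numerator and denominator coincide. That is why a reading near 100% signals that the network has collapsed to zero.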

[Interactive figure: Loss Weight Tuner]

What you should observe

  • $\lambda_{\textrm{IC}} = 10^{-3}$: PDE residual collapses to ≈ 0, IC stays ≈ 1, network output $u(t) \approx 0$. Relative-L² ≈ 100%.
  • $\lambda_{\textrm{IC}} = 1$ (the naive default): same failure mode, slightly less extreme.
  • $\lambda_{\textrm{IC}} = 10^{2}$: sweet spot. Both terms decrease together, the cosine is recovered, and relative-L² drops to a few percent.
  • $\lambda_{\textrm{IC}} = 10^{5}$: IC dominates, network outputs $u(t) \approx 1$ to fit the IC perfectly while ignoring the PDE. Relative-L² is again large.

The Goldilocks zone is narrow — about two orders of magnitude wide — and it sits where the gradient magnitudes of the two terms are matched, not where the loss values are matched. That is the whole game in a sentence.
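The same bad, good, bad pattern appears in a toy sweep that needs no neural network. The sketch below is an assumption-laden stand-in, not the widget's model: `x` plays the PDE error, `y - 1` the IC error, the curvatures `k_pde` and `k_ic` are made up, and the step size is tied to the stiffest weighted term to mimic a stability-limited learning rate. In this toy the relevant quantity is the weighted curvature, that is, the gradient magnitude per unit of remaining error.

```python
def final_losses(lam, k_pde=1000.0, k_ic=1.0, steps=200):
    """Gradient descent on 0.5*k_pde*x**2 + lam*0.5*k_ic*(y - 1)**2.

    Returns the unweighted (PDE-like, IC-like) loss values after `steps` updates.
    """
    lr = 0.5 / max(k_pde, lam * k_ic)   # half the inverse of the largest weighted curvature
    x, y = 1.0, 0.0                     # both errors start at magnitude 1
    for _ in range(steps):
        x -= lr * k_pde * x
        y -= lr * lam * k_ic * (y - 1.0)
    return 0.5 * k_pde * x * x, 0.5 * k_ic * (y - 1.0) ** 2

for exponent in range(-3, 10, 3):       # log10(lam) = -3, 0, 3, 6, 9
    l_pde, l_ic = final_losses(10.0 ** exponent)
    print(f"log10(lam) = {exponent:+d}:  L_PDE = {l_pde:.2e}  L_IC = {l_ic:.2e}")
```

Both terms converge only near `lam = k_pde / k_ic`, where the weighted curvatures match. Far below that value the IC-like term stalls, and far above it the PDE-like term stalls.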

Why hand-tuning is not a real fix

The widget gives you four points of evidence: bad, bad, good, bad. In a real PINN problem (Burgers, wave equation, FWI) the search space has many λs (PDE residual, IC, BC, data — sometimes split per spatial component), the gradient magnitudes shift across training, and you cannot afford to scan a 4D log-space grid of $\lambda$ values to find the Goldilocks volume. Three responses have emerged in the literature:

  • Hard constraints (Part 2 §2.4). When the geometry permits, reparameterise the network so the IC and BC are identically satisfied: only $\mathcal{L}_{\textrm{PDE}}$ remains. This eliminates the loss-balance crisis at the IC/BC by removing those loss terms entirely.
  • NTK-based automatic weighting (Wang, Yu, Perdikaris 2022; §3.3). Compute $\textrm{tr}(K_a)$ for each loss term and set $\lambda_a \propto 1 / \textrm{tr}(K_a)$. This rebalances the gradient magnitudes automatically.
  • Gradient-norm balancing (Wang, Teng, Perdikaris 2021; §3.4). Cheap heuristic: at each step, set $\lambda_a$ proportional to the ratio of the largest gradient L²-norm among the terms to the gradient L²-norm of term $a$. The 2021 paper introduces its version as a learning-rate annealing algorithm; it is similar in spirit to GradNorm.
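As a preview of the trace-based rule, the sketch below computes $\lambda_a \propto 1 / \textrm{tr}(K_a)$ from per-term Jacobians, using the identity $\textrm{tr}(J J^\top) = \|J\|_F^2$. The Jacobian entries are made-up numbers, the helper names are hypothetical, and the normalisation by the total trace is an assumed convention; §3.3 may use a different one.

```python
def ntk_trace(jacobian):
    """tr(J J^T) for a Jacobian given as rows d r_i / d theta (squared Frobenius norm)."""
    return sum(entry * entry for row in jacobian for entry in row)

def trace_weights(jacobians):
    """Map each term name to tr(K) / tr(K_a), with tr(K) the sum of per-term traces."""
    traces = {name: ntk_trace(jac) for name, jac in jacobians.items()}
    total = sum(traces.values())
    return {name: total / tr for name, tr in traces.items()}

# Made-up Jacobians: two residual rows per term, three parameters.
jacobians = {
    "pde": [[30.0, -40.0, 0.0], [0.0, 50.0, 0.0]],
    "ic": [[0.3, 0.4, 0.0], [0.0, 0.0, 0.5]],
}
weights = trace_weights(jacobians)
for name, lam in weights.items():
    print(f"lambda_{name} = {lam:.4g}")
```

The term with the smaller trace receives the larger weight, so the weighted traces $\lambda_a \textrm{tr}(K_a)$ come out equal across terms.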

The widget exposes the limitation honestly

Even at $\lambda_{\textrm{IC}} = 10^{2}$, the relative-L² error is typically a few percent: better than the failed cases, but not the $10^{-4}$-level fits that NTK-balanced weighting reaches on the same problem. The hand-tuned weight is good enough to see the principle but it is not a winning strategy. §3.3 picks up the story.

Practical rule

Whenever you write down a multi-term PINN loss, log per-term gradient norms $\|\nabla_\theta \mathcal{L}_a\|_2$ from the very first epoch. Print them every 100 steps. If any pair differs by more than a factor of 10², you have a loss-balance crisis. Fix it before tweaking anything else: learning rate, architecture, or sampling will not save you.
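A minimal, framework-agnostic sketch of that check follows. It assumes you can obtain each term's gradient as a flat list of floats (for example by backpropagating each term separately); the function names and the example gradient values are hypothetical.

```python
import math
from itertools import combinations

def grad_norms(grads):
    """Map each loss-term name to the L2 norm of its flattened gradient."""
    return {name: math.sqrt(sum(g * g for g in vec)) for name, vec in grads.items()}

def imbalanced_pairs(norms, threshold=1e2):
    """Return (larger, smaller, ratio) for each pair whose norms differ by more than threshold."""
    flagged = []
    for a, b in combinations(norms, 2):
        hi, lo = (a, b) if norms[a] >= norms[b] else (b, a)
        ratio = norms[hi] / max(norms[lo], 1e-30)   # guard against a zero gradient
        if ratio > threshold:
            flagged.append((hi, lo, ratio))
    return flagged

# Made-up per-term gradients standing in for one training step.
grads = {"pde": [300.0, -400.0], "ic": [0.3, 0.4], "bc": [6.0, 8.0]}
step = 0
if step % 100 == 0:                                 # print every 100 steps
    norms = grad_norms(grads)
    print(step, {name: f"{value:.3e}" for name, value in norms.items()})
    for hi, lo, ratio in imbalanced_pairs(norms):
        print(f"  loss-balance warning: |grad {hi}| / |grad {lo}| = {ratio:.0f}")
```

Measuring the unweighted gradients, as here, tells you what weights would be needed; measuring them after multiplying by the current $\lambda_a$ tells you whether the chosen weights already balance the terms.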

References

  • Wang, S., Teng, Y., Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43(5), A3055–A3081.
  • van der Meer, R., Oosterlee, C., Borovykh, A. (2022). Optimally weighted loss functions for solving PDEs with neural networks. J. Comput. Appl. Math. 405, 113887.
  • Bischof, R., Kraus, M. (2021). Multi-objective loss balancing for physics-informed deep learning. arXiv:2110.09813.
