The loss-balance crisis: data + PDE + BC weights

Part 3 — Training pathologies and remedies

Learning objectives

State the loss-balance crisis precisely: when ‖∇L_a‖ ≫ ‖∇L_b‖, gradient descent on L_a + L_b reduces only L_a
Recognise that the optimal weight λ depends on gradient magnitudes that change as training progresses
Hand-tune λ_IC for a harmonic IVP and find the Goldilocks zone empirically
Build the intuition that motivates §3.3's NTK-based automatic weighting

Pathology #1 in §3.1 was the loss-balance crisis: the harmonic IVP with default IC weight 1 stalled with $\mathcal{L}_{\textrm{IC}}$ stuck at a non-zero value while $\mathcal{L}_{\textrm{PDE}} \to 0$. This section makes that observation precise, names the underlying mechanism, and lets you hand-tune the IC weight to find a working configuration. The takeaway is uncomfortable but unavoidable: there is no a-priori-correct weight. The right value depends on the gradient magnitudes of each loss term, and those magnitudes shift as training progresses. §3.3 (NTK weighting) and §3.4 (gradient-norm balancing, SA-PINN) automate the choice.

The mathematical statement

For a multi-term loss

\mathcal{L}(\theta) = \lambda_{\textrm{PDE}} \mathcal{L}_{\textrm{PDE}}(\theta) + \lambda_{\textrm{IC}} \mathcal{L}_{\textrm{IC}}(\theta) + \lambda_{\textrm{BC}} \mathcal{L}_{\textrm{BC}}(\theta) + \lambda_{\textrm{data}} \mathcal{L}_{\textrm{data}}(\theta) ,

the gradient flow is

\dot{\theta} = -\lambda_{\textrm{PDE}} \nabla_{\theta} \mathcal{L}_{\textrm{PDE}} - \lambda_{\textrm{IC}} \nabla_{\theta} \mathcal{L}_{\textrm{IC}} - \lambda_{\textrm{BC}} \nabla_{\theta} \mathcal{L}_{\textrm{BC}} - \lambda_{\textrm{data}} \nabla_{\theta} \mathcal{L}_{\textrm{data}} .

If $\|\lambda_a \nabla \mathcal{L}_a\| \gg \|\lambda_b \nabla \mathcal{L}_b\|$, the trajectory follows $-\nabla \mathcal{L}_a$ almost exclusively; the $\mathcal{L}_b$ term is essentially invisible to the optimiser. Wang, Teng & Perdikaris (2021) called this the gradient flow pathology. The empirical signature: one term plateaus at a non-zero value while the others decrease.
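The mechanism can be seen in a toy problem far simpler than a PINN. The sketch below is a hypothetical illustration, not the harmonic IVP: two quadratic loss terms on separate parameters, with assumed curvatures `K_A` and `K_B` chosen so that their gradients differ by five orders of magnitude. In a real PINN the terms share parameters, but the step-size argument is the same: the step is capped by the stiff term, so the weak term barely moves.

```python
# Toy illustration (hypothetical, not the harmonic IVP): two quadratic loss
# terms whose gradients differ by five orders of magnitude.
K_A, K_B = 1.0e3, 1.0e-2   # assumed curvatures; they set the gradient scales


def loss_a(theta):
    return 0.5 * K_A * theta[0] ** 2


def loss_b(theta):
    return 0.5 * K_B * (theta[1] - 1.0) ** 2


def descend(lam_a, lam_b, steps=2000):
    """Plain gradient descent on lam_a * L_a + lam_b * L_b."""
    theta = [1.0, 0.0]
    # The step size is limited by the stiffest weighted term.
    lr = 0.5 / max(lam_a * K_A, lam_b * K_B)
    for _ in range(steps):
        grad_a = K_A * theta[0]            # dL_a / dtheta_0
        grad_b = K_B * (theta[1] - 1.0)    # dL_b / dtheta_1
        theta[0] -= lr * lam_a * grad_a
        theta[1] -= lr * lam_b * grad_b
    return loss_a(theta), loss_b(theta)


la_equal, lb_equal = descend(1.0, 1.0)         # equal weights
la_bal, lb_bal = descend(1.0, K_A / K_B)       # weights match the gradient scales
print(f"equal weights:    L_a={la_equal:.2e}  L_b={lb_equal:.2e}")
print(f"balanced weights: L_a={la_bal:.2e}  L_b={lb_bal:.2e}")
```

With equal weights, `L_b` stays close to its initial value of 0.005 while `L_a` vanishes. With the weight ratio set to the ratio of gradient scales, both terms decay at the same rate.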

Why the right λ is not obvious

You might hope that $\lambda_{\textrm{PDE}} = \lambda_{\textrm{IC}} = 1$ — "treat the terms equally" — is a sensible default. It is not. Consider:

$\mathcal{L}_{\textrm{IC}}$ is evaluated at one point. Its gradient with respect to $\theta$ has magnitude proportional to whatever $\partial u_{\theta}(0) / \partial \theta$ happens to be, which depends on initialisation.
$\mathcal{L}_{\textrm{PDE}}$ is averaged over $N_c = 50$ collocation points and involves second derivatives. Its gradient has magnitude proportional to $\omega^2 \sim 39$, scaled by random initialisation.

For random init, the second-derivative-heavy PDE gradient is typically 10²–10³ times larger than the IC gradient at $\lambda_{\textrm{IC}} = 1$ . The optimiser then sees only $\mathcal{L}_{\textrm{PDE}}$ and finds $u(t) \approx 0$ — which satisfies the PDE trivially but violates the IC.
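One way to check this kind of claim is to measure both gradient norms at initialisation. The sketch below does so for a random-feature surrogate rather than a full network; this is an assumption made so the gradients can be written by hand without an autodiff library. Other assumptions: the ODE is $u'' + \omega^2 u = 0$ with $\omega = 2\pi$, the domain is $t \in [0, 1]$, the IC loss is $(u(0) - 1)^2$ only, and only the output weights `a` are trainable. The ratio it prints depends on these choices and on the seed, so treat it as indicative, not as a reproduction of the figures quoted above.

```python
import math
import random

OMEGA = 2.0 * math.pi      # assumed, so that omega^2 ~ 39
N_C = 50                   # collocation points, as in the text
K = 16                     # number of random tanh features (assumption)

rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(K)]   # fixed hidden weights
b = [rng.gauss(0.0, 1.0) for _ in range(K)]   # fixed hidden biases
a = [rng.gauss(0.0, 1.0) for _ in range(K)]   # trainable output weights


def phi(k, t):
    """Feature k: tanh(w_k t + b_k)."""
    return math.tanh(w[k] * t + b[k])


def phi_tt(k, t):
    """Second time derivative of feature k."""
    s = math.tanh(w[k] * t + b[k])
    return -2.0 * w[k] ** 2 * s * (1.0 - s * s)


def grad_ic():
    """Gradient of L_IC = (u(0) - 1)^2 with respect to a."""
    u0 = sum(a[k] * phi(k, 0.0) for k in range(K))
    return [2.0 * (u0 - 1.0) * phi(k, 0.0) for k in range(K)]


def grad_pde():
    """Gradient of L_PDE = mean_i (u''(t_i) + omega^2 u(t_i))^2 with respect to a."""
    g = [0.0] * K
    for i in range(N_C):
        t = i / (N_C - 1)                                # uniform grid on [0, 1]
        basis = [phi_tt(k, t) + OMEGA ** 2 * phi(k, t) for k in range(K)]
        r = sum(a[k] * basis[k] for k in range(K))       # ODE residual at t
        for k in range(K):
            g[k] += 2.0 * r * basis[k] / N_C
    return g


def norm(v):
    return math.sqrt(sum(x * x for x in v))


norm_ic, norm_pde = norm(grad_ic()), norm(grad_pde())
print(f"|grad L_IC|  = {norm_ic:.3e}")
print(f"|grad L_PDE| = {norm_pde:.3e}")
print(f"ratio PDE/IC = {norm_pde / norm_ic:.1f}")
```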

Try it: the manual weight tuner

The widget solves the same harmonic IVP from §3.1's loss-balance pathology with a slider for $\log_{10}(\lambda_{\textrm{IC}})$ from -3 to 5. For each setting, click Train for a fresh 2000-epoch fit. Watch:

The per-term loss histories. Both should decrease together for a good $\lambda_{\textrm{IC}}$ .
The relative-L² error against the analytic solution $u(t) = \cos(\omega t)$ . Below 5% is a clean fit; 30% or more means the network found a wrong steady state.
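For reference, the relative-L² metric can be computed as below. This is a minimal sketch, not the widget's code: the evaluation grid, the domain $t \in [0, 1]$, and $\omega = 2\pi$ are assumptions, and `relative_l2` is a hypothetical helper name.

```python
import math

OMEGA = 2.0 * math.pi   # assumed angular frequency (omega^2 ~ 39)


def relative_l2(u_pred, u_true):
    """Relative L2 error ||u_pred - u_true||_2 / ||u_true||_2 on a shared grid."""
    num = math.sqrt(sum((p - q) ** 2 for p, q in zip(u_pred, u_true)))
    den = math.sqrt(sum(q ** 2 for q in u_true))
    return num / den


# Evaluation grid on the assumed domain t in [0, 1].
ts = [i / 200 for i in range(201)]
u_exact = [math.cos(OMEGA * t) for t in ts]

err_zero = relative_l2([0.0] * len(ts), u_exact)   # network collapsed to u = 0
err_one = relative_l2([1.0] * len(ts), u_exact)    # network stuck at u = 1
print(f"u = 0: {100 * err_zero:.0f}%   u = 1: {100 * err_one:.0f}%")
```

A network that collapses to $u(t) \approx 0$ scores exactly 100% on this metric, because the numerator and denominator coincide. A constant $u(t) \approx 1$ scores even worse on this grid.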

What you should observe

$\lambda_{\textrm{IC}} = 10^{-3}$ : PDE residual collapses to ≈ 0, IC stays ≈ 1, network output $u(t) \approx 0$ . Relative-L² ≈ 100%.
$\lambda_{\textrm{IC}} = 1$ (the naive default): same failure mode, slightly less extreme.
$\lambda_{\textrm{IC}} = 10^{2}$ : sweet spot — both terms decrease together, the cosine is recovered, relative-L² drops to a few percent.
$\lambda_{\textrm{IC}} = 10^{5}$ : IC dominates, network outputs $u(t) \approx 1$ to fit the IC perfectly while ignoring the PDE. Relative-L² is again large.

The Goldilocks zone is narrow — about two orders of magnitude wide — and it sits where the gradient magnitudes of the two terms are matched, not where the loss values are matched. That is the whole game in a sentence.
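The matching condition can be written out as a rough estimate. Taking $\lambda_{\textrm{PDE}} = 1$ and treating the gradient norms as fixed over a short window (an assumption, since they drift during training), equal contributions to the gradient flow require

\lambda_{\textrm{IC}} \, \|\nabla_{\theta} \mathcal{L}_{\textrm{IC}}\| \approx \|\nabla_{\theta} \mathcal{L}_{\textrm{PDE}}\| \quad \Longrightarrow \quad \lambda_{\textrm{IC}} \approx \frac{\|\nabla_{\theta} \mathcal{L}_{\textrm{PDE}}\|}{\|\nabla_{\theta} \mathcal{L}_{\textrm{IC}}\|} .

With the PDE gradient 10²–10³ times larger than the IC gradient at random initialisation, this estimate puts $\lambda_{\textrm{IC}}$ in the range 10²–10³, consistent with the sweet spot observed near $10^{2}$.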

Why hand-tuning is not a real fix

The widget gives you four points of evidence: bad, bad, good, bad. In a real PINN problem (Burgers, wave equation, FWI) the search space has many λs (PDE residual, IC, BC, data — sometimes split per spatial component), the gradient magnitudes shift across training, and you cannot afford to scan a 4D log-space grid of $\lambda$ values to find the Goldilocks volume. Three responses have emerged in the literature:

Hard constraints (Part 2 §2.4). When the geometry permits, reparameterise the network so the IC and BC are identically satisfied: only $\mathcal{L}_{\textrm{PDE}}$ remains. This eliminates the loss-balance crisis at the IC/BC by removing those loss terms entirely.
NTK-based automatic weighting (Wang, Yu, Perdikaris 2022; §3.3). Compute $\textrm{tr}(K_a)$ for each loss term and set $\lambda_a \propto 1 / \textrm{tr}(K_a)$ . This rebalances the gradient magnitudes automatically.
Gradient-norm balancing (Wang, Teng, Perdikaris 2021; §3.4). Cheap heuristic: at each step, set $\lambda_a$ from the ratio of the largest PDE-residual gradient entry to the mean absolute gradient entry of term $a$, smoothed by a moving average. The original 2021 paper presents this as learning-rate annealing, a GradNorm-style fix.
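The trace-based rule in the second response can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation from §3.3: it takes $K_a = J_a J_a^{\top}$, where $J_a$ is the Jacobian of term $a$'s residuals with respect to the parameters, so that $\textrm{tr}(K_a)$ is the squared Frobenius norm of $J_a$. The total-trace numerator is an assumed normalisation, since the text fixes $\lambda_a$ only up to proportionality. The helper names and the example Jacobians are hypothetical.

```python
def ntk_trace(jacobian):
    """tr(K_a) for K_a = J_a J_a^T, i.e. the squared Frobenius norm of J_a."""
    return sum(entry * entry for row in jacobian for entry in row)


def trace_weights(jacobians):
    """Weights lambda_a = (sum_b tr K_b) / tr K_a for each named loss term.

    `jacobians` maps a term name to its Jacobian: one row per residual point,
    one column per parameter. The total-trace numerator is an assumed
    normalisation; only the proportionality to 1 / tr K_a comes from the text.
    """
    traces = {name: ntk_trace(j) for name, j in jacobians.items()}
    total = sum(traces.values())
    return {name: total / tr for name, tr in traces.items()}


# Hypothetical Jacobians for a two-parameter model.
jacobians = {
    "ic": [[1.0, 0.0]],                  # one IC point
    "pde": [[10.0, 0.0], [0.0, 10.0]],   # two collocation points
}
weights = trace_weights(jacobians)
print(weights)
```

In this example the PDE term has a trace 200 times larger than the IC term, so the IC weight comes out 200 times larger than the PDE weight.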

Even at $\lambda_{\textrm{IC}} = 10^{2}$ , the relative-L² error is typically a few percent — better than the failed cases, but not the $10^{-4}$ -level fits that NTK-balanced weighting reaches on the same problem. The hand-tuned weight is good enough to see the principle but it is not a winning strategy. §3.3 picks up the story.

Practical rule

Whenever you write down a multi-term PINN loss, log per-term gradient norms $\|\nabla_\theta \mathcal{L}_a\|_2$ from the very first epoch. Print them every 100 steps. If any pair differs by more than a factor of 10², you have a loss-balance crisis. Fix it before tweaking anything else: changes to learning rate, architecture, or sampling will not save you.
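The rule is easy to automate. The sketch below assumes the per-term gradients have already been flattened into plain lists of floats; how you obtain them depends on your framework. `imbalanced_pairs` is a hypothetical helper name and the example gradients are made up.

```python
import itertools
import math


def grad_norm(grad):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def imbalanced_pairs(grads, threshold=1.0e2):
    """Return (larger, smaller, ratio) for each pair of terms whose norms differ by more than threshold."""
    norms = {name: grad_norm(g) for name, g in grads.items()}
    flagged = []
    for a, b in itertools.combinations(norms, 2):
        hi, lo = (a, b) if norms[a] >= norms[b] else (b, a)
        ratio = norms[hi] / norms[lo] if norms[lo] > 0.0 else math.inf
        if ratio > threshold:
            flagged.append((hi, lo, ratio))
    return flagged


# Hypothetical per-term gradients, as they might be captured at one step.
grads = {
    "pde": [300.0, -400.0],   # norm 500
    "ic": [0.6, 0.8],         # norm 1
    "bc": [30.0, 40.0],       # norm 50
}
flagged = imbalanced_pairs(grads)
for hi, lo, ratio in flagged:
    print(f"|grad {hi}| / |grad {lo}| = {ratio:.0f}: loss-balance crisis")
```

In a training loop you would call the helper every 100 steps with the current gradients and log the flagged pairs alongside the per-term losses.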

References

Wang, S., Teng, Y., Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43(5), A3055–A3081.
van der Meer, R., Oosterlee, C., Borovykh, A. (2022). Optimally weighted loss functions for solving PDEs with neural networks. J. Comput. Appl. Math. 405, 113887.
Bischof, R., Kraus, M. (2021). Multi-objective loss balancing for physics-informed deep learning. arXiv:2110.09813.