Loss-weight sensitivity in FWI-PINN

Part 6 — Velocity inversion with PINNs

Learning objectives

Recognise the multi-term FWI-PINN loss as an instance of the §3.2 loss-balance crisis
See empirically how the regularisation weight α shifts the location c₂ of the total-misfit minimum
Identify the Goldilocks zone where α is large enough to break cycle-skipping but small enough to honour data
Connect to PINN-FWI's λ_d/λ_p balance (Wang-Teng-Perdikaris 2021)

The §3.2 widget demonstrated the loss-balance crisis on the harmonic IVP: the joint loss $\lambda_{\mathrm{ic}} L_{\mathrm{ic}} + L_{\mathrm{pde}}$ has weights that strongly affect convergence. PINN-FWI inherits this in spades. The full joint loss is

$$\mathcal{L} = \lambda_d \mathcal{L}_{\mathrm{data}} + \lambda_p \mathcal{L}_{\mathrm{pde}} + \lambda_i \mathcal{L}_{\mathrm{ic}} + \lambda_b \mathcal{L}_{\mathrm{bc}} + \lambda_r \mathcal{L}_{\mathrm{reg}} ,$$

with five weights, hence four independent weight ratios (a common factor can be divided out). Each weight balances different physics. Get any one wrong and convergence stalls or finds the wrong velocity model.
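As a short worked step on the ratio count (assuming $\lambda_p > 0$; normalising by $\lambda_p$ is a choice made here for illustration, any positive weight would do): dividing the joint loss by $\lambda_p$ leaves its minimiser unchanged and exposes the four ratios.

$$\frac{\mathcal{L}}{\lambda_p} = \frac{\lambda_d}{\lambda_p} \mathcal{L}_{\mathrm{data}} + \mathcal{L}_{\mathrm{pde}} + \frac{\lambda_i}{\lambda_p} \mathcal{L}_{\mathrm{ic}} + \frac{\lambda_b}{\lambda_p} \mathcal{L}_{\mathrm{bc}} + \frac{\lambda_r}{\lambda_p} \mathcal{L}_{\mathrm{reg}}$$

For plain gradient descent the discarded common factor is equivalent to a rescaling of the learning rate.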

The simplest 2-term version

To build intuition, this widget studies the simplest version: classical-FWI data misfit + a Tikhonov regulariser pulling the velocity toward a prior.

$$J_{\mathrm{total}}(c_2) = J_{\mathrm{data}}(c_2) + \alpha (c_2 - c_2^{\mathrm{init}})^2 ,$$

with $c_2^{\mathrm{init}} = 1.0$ (a deliberately wrong prior: the top-layer velocity used as a guess). Drag the $\alpha$ slider across several orders of magnitude and watch the total-misfit minimum shift:

α very small (1e-7): regulariser is negligible. Total-misfit minimum = data-misfit minimum, which on this 1D problem may sit at a cycle-skipped point, $c_2 \approx 0.7$ or 2.3, depending on which basin the inversion lands in.
α very large (1e+2): regulariser dominates. Total-misfit minimum = $c_2^{\mathrm{init}} = 1.0$ . The data is ignored.
α "Goldilocks" (~1e-4): balanced. The regulariser kills the spurious data-misfit local minima but doesn't override the global one. Total-misfit minimum = truth (1.5).
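The three regimes can be reproduced on a toy curve. The sketch below is a minimal illustration, not the widget's code: `toy_data_misfit` is a hypothetical stand-in for $J_{\mathrm{data}}$ (the real one needs a wave simulation), built with its global minimum at the truth and side minima roughly 0.75 on either side, and the sweep range 0.5 to 2.5 is likewise an assumption.

```python
import numpy as np

C2_TRUE, C2_INIT = 1.5, 1.0

def toy_data_misfit(c2):
    """Hypothetical stand-in for J_data: oscillatory, zero only at C2_TRUE."""
    d = c2 - C2_TRUE
    return 1.0 - np.cos(2.0 * np.pi * d / 0.8) * np.exp(-(d / 0.6) ** 2)

# 80-sample sweep of the data misfit, computed once and then reused.
c2 = np.linspace(0.5, 2.5, 80)
j_data = toy_data_misfit(c2)

def total_argmin(alpha):
    """Grid argmin of J_total = J_data + alpha * (c2 - c2_init)^2."""
    j_total = j_data + alpha * (c2 - C2_INIT) ** 2
    return c2[np.argmin(j_total)]

for alpha in (1e-7, 1e-4, 1e2):
    print(f"alpha = {alpha:.0e}   argmin c2 = {total_argmin(alpha):.3f}")
```

Because the sweep is cached, changing `alpha` costs only one vector addition and one `argmin`. Note that this reports the global argmin on the grid; in this toy it stays near the truth until the prior takes over, whereas a gradient-based run can stop in whichever basin it starts in.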

Try it

The widget pre-computes $J_{\mathrm{data}}(c_2)$ on an 80-sample sweep at startup (once, ~5 s). The slider then re-computes $J_{\mathrm{total}}(c_2)$ instantly for any $\alpha$ . Three traces are plotted:

Orange: $J_{\mathrm{data}}(c_2)$ — fixed.
Purple: $\alpha (c_2 - c_2^{\mathrm{init}})^2$ — quadratic in $c_2$ , scales with $\alpha$ .
Cyan: $J_{\mathrm{total}}$ . The dot marks the argmin.

The cyan dot is the global argmin over the sweep; gradient-descent FWI converges to it only when it starts inside the same basin. As you change $\alpha$ , watch the dot move from the data-misfit minimum (small $\alpha$ , regulariser negligible) to prior=1.0 (large $\alpha$ , when the prior dominates). The Goldilocks zone is the narrow range in between where the dot lands at truth=1.5.

How production codes pick weights

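Discrepancy principle (Tikhonov 1963; Hanke 1995). Choose $\alpha$ such that $J_{\mathrm{data}}(c^*) \approx \sigma^2 N$ , where $c^*$ is the minimiser of $J_{\mathrm{total}}$ for that $\alpha$ , $\sigma$ is the noise standard deviation, and $N$ is the number of data samples (this assumes $J_{\mathrm{data}}$ is the unnormalised sum of squared residuals). The data is fit to its own noise floor, no further.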
L-curve method (Hansen 1992). Plot $\log J_{\mathrm{reg}}$ vs $\log J_{\mathrm{data}}$ for a range of $\alpha$ ; choose the $\alpha$ at the corner of the resulting "L". Standard for ill-posed inverse problems.
Generalized cross-validation (GCV; Golub, Heath, Wahba 1979). Pick $\alpha$ to minimise the predicted error on left-out data. Asymptotically optimal under suitable assumptions on the noise.
Bayesian / hierarchical. Treat $\alpha$ as a hyperparameter to be marginalised. Most rigorous; computationally heaviest.
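Two of the rules above, the discrepancy principle and GCV, can be sketched on a small linear Tikhonov problem. Everything specific here is an assumption made for illustration: the smoothing matrix `A`, the signal `x_true`, the noise level `sigma`, and the `alphas` grid are hypothetical, and the regulariser is the plain $\alpha \|x\|^2$ rather than the FWI prior term. The L-curve corner search is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear test problem b = A x_true + noise (not the FWI misfit).
n = 50
idx = np.arange(n)
A = 0.9 ** np.abs(idx[:, None] - idx[None, :])      # smoothing operator
x_true = np.sin(np.pi * np.linspace(0.0, 1.0, n))
sigma = 0.1                                          # assumed known noise std
b = A @ x_true + sigma * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(A)
beta = U.T @ b

def tikhonov(alpha):
    """Minimiser of ||A x - b||^2 + alpha * ||x||^2 via SVD filter factors."""
    return Vt.T @ (s * beta / (s**2 + alpha))

alphas = np.logspace(-8, 4, 121)
residual = np.array([np.sum((A @ tikhonov(a) - b) ** 2) for a in alphas])

# Discrepancy principle: the data misfit grows with alpha, so take the
# largest alpha whose misfit is still at or below the noise floor sigma^2 N.
target = sigma**2 * n
ok = np.flatnonzero(residual <= target)
alpha_dp = alphas[ok[-1]]

# GCV: minimise N * ||r||^2 / trace(I - H)^2, where H is the influence
# matrix and trace(I - H) = sum_i alpha / (s_i^2 + alpha).
trace_term = np.array([np.sum(a / (s**2 + a)) for a in alphas])
gcv = n * residual / trace_term**2
alpha_gcv = alphas[np.argmin(gcv)]

print(f"discrepancy principle: alpha = {alpha_dp:.2e}")
print(f"GCV:                   alpha = {alpha_gcv:.2e}")
```

The discrepancy principle needs `sigma`, while GCV does not, which is the usual reason for preferring GCV when the noise level is unknown.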

PINN-FWI weights

The PINN-FWI joint loss $\mathcal{L} = \lambda_d L_d + \lambda_p L_p + \lambda_i L_i + \lambda_b L_b + \lambda_r L_r$ has 4 independent weight ratios. The Wang-Teng-Perdikaris 2021 NTK-balance trick from §3.3 generalises directly to this setting: at each epoch, scale each $\lambda$ inversely to the recent gradient-norm of its term. This forces all loss terms to contribute to the gradient at comparable scales, eliminating the "one term dominates" failure mode that plagues hand-tuned weights.
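A minimal sketch of the inverse-gradient-norm rule described above, assuming the gradient norm of each unweighted loss term has already been measured. The function name `rebalance`, the use of one term as a fixed reference, and the moving-average `momentum` are illustrative assumptions; the published algorithm uses its own gradient statistics, so treat this as a simplified variant.

```python
import numpy as np

def rebalance(lams, grad_norms, ref=0, momentum=0.9, eps=1e-12):
    """Move each weight toward the value that equalises weighted gradient norms.

    lams       : current weights, one per loss term
    grad_norms : gradient norm of each *unweighted* loss term
    ref        : index of the reference term, whose weight tends to 1
    """
    lams = np.asarray(lams, dtype=float)
    grad_norms = np.asarray(grad_norms, dtype=float)
    target = grad_norms[ref] / (grad_norms + eps)   # inverse to own gradient norm
    return momentum * lams + (1.0 - momentum) * target

# Hypothetical gradient norms for the data, pde, ic, bc and reg terms.
g = np.array([2.0, 50.0, 0.01, 1.0, 5.0])
lams = np.ones(5)
for _ in range(200):          # in training, g would be re-measured each epoch
    lams = rebalance(lams, g)

print(lams * g)               # weighted gradient norms, now of comparable size
```

The moving average keeps the weights from jumping when a single epoch produces an unusually large or small gradient.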

The McClenny-Braga-Neto SA-PINN trick (§3.4) further provides per-collocation-point weights $\gamma_k$ — useful when some receiver locations or PDE collocation points are systematically harder than others. Both NTK and SA-PINN have been ported into PINN-FWI by Sun & Alkhalifah and others; see §3.3 / §3.4 for the in-depth treatment.
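The per-point idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SA-PINN implementation: the weights enter the loss linearly as $\sum_k \gamma_k r_k^2$, the residuals `residuals` are hypothetical and held fixed, and `sa_weight_step` is a name introduced here. In training the network would descend on the same loss while the weights ascend.

```python
import numpy as np

def sa_weight_step(gamma, residuals, lr=0.1):
    """One gradient-ascent step on sum_k gamma_k * r_k^2 with respect to gamma."""
    return gamma + lr * np.asarray(residuals, dtype=float) ** 2

gamma = np.ones(4)
residuals = np.array([0.1, 0.1, 2.0, 0.1])   # hypothetical; point 2 is the hard one
for _ in range(10):
    gamma = sa_weight_step(gamma, residuals)

print(gamma)   # the weight of the hard point has grown the most
```

Points with persistently large residuals accumulate weight, which is what steers the optimiser toward the harder receivers or collocation points.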

What §6.9 will do

§6.9 closes Part 6 with the convergence-diagnostics question: how do you know your FWI run has converged for the right reason? Misfit reduction is necessary but not sufficient. Production codes track gradient norm, model-update magnitude, model-residual decay rate, and the model-data residual cross-spectrum. The widget visualises all four for a complete §6.2 inversion.

References

Tikhonov, A.N., Arsenin, V.Y. (1977). Solutions of Ill-Posed Problems. Wiley.
Hansen, P.C. (1992). Analysis of discrete ill-posed problems by means of the L-curve. SIAM Review 34(4), 561–580. The L-curve weight-selection method.
Wang, S., Teng, Y., Perdikaris, P. (2021). Understanding and mitigating gradient flow pathologies in physics-informed neural networks. SIAM J. Sci. Comput. 43(5), A3055–A3081. The NTK-balance §3.3 paper, applied to PINN-FWI weights.
McClenny, L.D., Braga-Neto, U. (2023). Self-adaptive physics-informed neural networks. Journal of Computational Physics 474, 111722. Per-point adaptive weights, §3.4.