The classical FWI loss vs the PINN-FWI loss

Part 6 — Velocity inversion with PINNs

Learning objectives

Write down the classical FWI misfit J(m) and the PINN-FWI joint loss L(θ_u, θ_m)
Derive the adjoint-state gradient ∂J/∂c(x) = −(2/c³) ∫ u_tt · λ dt and understand each factor
Distinguish the two paradigms: classical = forward + adjoint PDE solves; PINN-FWI = joint backprop
See visually that the adjoint wavefield λ is the time-reversed propagation of the data residual
Set up the rest of Part 6: every following section stands on this gradient or its PINN-FWI analogue

Part 6 is the central application of this textbook. Parts 1–4 built the PINN forward model. Part 5 honestly admitted that, for forward modelling, FDTD wins. Part 6 turns the PINN machinery sideways and asks the only question where PINN can decisively beat FDTD: given recorded seismic data, recover the velocity field that produced it. This is full-waveform inversion (FWI).

Classical FWI was crystallised by Tarantola in 1984 and refined in the 2000s. PINN-FWI is the 2020 reformulation that replaces hand-coded adjoint solvers with neural networks trained jointly. Both share the same underlying objective. They differ entirely in how the gradient is computed. Section §6.1 makes that difference precise.

The classical FWI loss

Let $m(x)$ be the unknown velocity model. Let $d_{\mathrm{obs}}(t)$ be the recorded seismogram at one or more receiver locations. Let $F(m)$ denote the forward operator: feed in $m$ , run an FDTD wave-equation solve, sample at the receivers. The FWI loss is

J(m) = \tfrac{1}{2} \int_0^T \bigl| F(m)(t) - d_{\mathrm{obs}}(t) \bigr|^2 \, dt .

This is just the $L^2$ data-fit. Inverting for $m$ means finding $m^* = \arg\min_m J(m)$ . To use any gradient-based optimiser we need $\partial J / \partial m(x)$ at every point of the domain. Naively differentiating $F$ would require $O(N_x)$ forward solves, one per spatial cell — utterly infeasible. The breakthrough in geophysics (Lailly 1983; Tarantola 1984) was to recognise that the gradient can be assembled from two PDE solves regardless of how many cells the model has.

The adjoint-state method

Consider the acoustic wave PDE $u_{tt} = c(x)^2 u_{xx}$ with sources and receivers at known locations. Take the partial derivative of $J$ with respect to $c$ at a single point $x$ and apply the chain rule through the PDE constraint. After integration by parts in time and space — the standard derivation that occupies a chapter of Plessix's 2006 review — the result is

\boxed{\quad \frac{\partial J}{\partial c(x)} = -\frac{2}{c(x)^3} \int_0^T u_{tt}(x, t) \cdot \lambda(x, t) \, dt \quad}

where

$u(x, t)$ is the FORWARD wavefield — solve $u_{tt} = c^2 u_{xx}$ with the actual source and the current model $c(x)$ .
$\lambda(x, t)$ is the ADJOINT wavefield — solve the SAME PDE $\lambda_{tt} = c^2 \lambda_{xx}$ , but with the data residual $r(t) = d_{\mathrm{pred}}(t) - d_{\mathrm{obs}}(t)$ injected at the receiver location, with terminal conditions $\lambda(x, T) = 0$ and $\lambda_t(x, T) = 0$ , running BACKWARDS in time from $t = T$ down to $t = 0$ .
The integral $\int u_{tt} \cdot \lambda , dt$ is a TIME CORRELATION between the second time derivative of the forward field and the time-reversed adjoint field, evaluated at every point $x$ .
The prefactor $-2/c(x)^3$ comes from the chain rule of the change of variables from $m = 1/c^2$ (the natural slowness-squared parameterisation) to $c$ : $dm/dc = -2/c^3$ . The MINUS sign reflects that "faster c means smaller slowness-squared".

Two key facts. First, this gradient is exact in the continuous limit — no truncation error, no finite-difference Jacobian approximation. In the discrete world, the forward and adjoint operators must be exact transposes of each other (same stencil, same time stepping, mirror-imaged source/receiver swap) for the discrete gradient to inherit this exactness — this is the "discretization-consistency" requirement (Symes 2007). Second, the gradient at every point of the domain is recovered from just two PDE solves. That is the magic of the adjoint-state method: it makes large-scale FWI tractable.

The next paragraph builds an interactive picture of exactly this gradient.

Try it: the adjoint-state gradient at work

What the widget does, step by step:

Builds a 3-layer 1D model with true velocities $(c_1, c_2, c_3) = (1.0, 1.5, 1.2)$ . Source at $x = 0.05$ , receiver at $x = 0.95$ — a transmission geometry. The cached "observed" seismogram $d_{\mathrm{obs}}(t)$ is the wavefield arriving at the right boundary after travelling through all three layers.
For your chosen middle-layer velocity $c_2$ in the starting model, runs an FDTD forward solve and samples $d_{\mathrm{pred}}(t)$ at the receiver.
Computes the residual $r(t) = d_{\mathrm{pred}}(t) - d_{\mathrm{obs}}(t)$ and the misfit $J = \tfrac{1}{2} \int r^2 , dt$ .
Runs an adjoint FDTD solve with the residual injected at the receiver and time-reverses the result to obtain $\lambda(x, t)$ .
Computes $u_{tt}(x, t)$ by 2nd-order central differences in time, then evaluates the correlation integral $\int u_{tt} \cdot \lambda , dt$ at every grid point and multiplies by $-2 / c(x)^3$ .
Plots $c_{\mathrm{true}}(x)$ vs $c_{\mathrm{start}}(x)$ , the seismograms, the residual, both wavefields as space-time images, and the gradient profile.

Drag the slider. When $c_2 > 1.5$ the gradient at the middle layer is positive — descent says decrease $c_2$ . When $c_2 < 1.5$ the gradient is negative — descent says increase $c_2$ . At $c_2 = 1.5$ the residual collapses, the adjoint wavefield vanishes, and the gradient goes to zero. This is exactly the local minimum the optimiser would converge to. Multiple iterations of $c \leftarrow c - \eta , \partial J / \partial c$ would produce a cooling sequence of velocity updates, each computed by one forward + one adjoint solve.

One important caveat will already be visible: as you slide $c_2$ farther from truth, the gradient sign eventually FLIPS — descent now points AWAY from the global minimum. This is cycle skipping, the canonical FWI failure mode (covered in detail in §6.5). It happens when the predicted-vs-observed arrival-time mismatch exceeds half the source's dominant period at the receiver: the residual then aligns with the WRONG cycle of the waveform, the correlation $\int u_{tt} \lambda , dt$ changes sign, and gradient descent runs in the wrong direction. Slide systematically and you will notice the flips are not monotonic — there are multiple disconnected "islands" of correct gradient sign separated by cycle-skipped intervals (e.g.\ on this widget the correct-sign islands are roughly $c_2 \in [0.5, 0.7] \cup [1.0, 2.1] \cup [2.4]$ , with cycle-skipping flips around $c_2 \approx 0.8\text{--}0.9$ and $c_2 \approx 2.15\text{--}2.35$ ). Each island is a distinct local basin of attraction. Local-minimum convergence is contingent on starting inside the correct island, which is why classical FWI uses frequency-continuation (§6.4) and modern FWI uses Wasserstein or adaptive-waveform-inversion misfits to merge the islands and extend the global basin.

The PINN-FWI loss

Now contrast PINN-FWI (Rasht-Behesht et al. 2022; Song & Alkhalifah 2022; broader programme by Sun & Alkhalifah and others). Instead of one network for the wavefield (as in Parts 1–4) and FDTD for the forward model, PINN-FWI parameterises BOTH unknowns by neural networks:

$u_{\mathrm{NN}}(x, t; \theta_u)$ — the wavefield network, exactly as in Part 4.
$m_{\mathrm{NN}}(x; \theta_m)$ — a SECOND network whose only input is the spatial coordinate, whose output is the velocity. Often a small MLP or even a positional grid.

The joint loss combines a data-fit term, a PDE-residual term that uses the velocity NN, and the usual IC/BC terms:

\mathcal{L}(\theta_u, \theta_m) = \lambda_{\mathrm{data}} \, \mathcal{L}_{\mathrm{data}}(\theta_u) + \lambda_{\mathrm{pde}} \, \mathcal{L}_{\mathrm{pde}}(\theta_u, \theta_m) + \mathcal{L}_{\mathrm{ic}}(\theta_u) + \mathcal{L}_{\mathrm{bc}}(\theta_u)

where

\mathcal{L}_{\mathrm{data}}(\theta_u) = \tfrac{1}{N_{\mathrm{rec}}} \sum_{i, n} \bigl( u_{\mathrm{NN}}(x_i, t_n; \theta_u) - d_{\mathrm{obs}}(x_i, t_n) \bigr)^2 ,

\mathcal{L}_{\mathrm{pde}}(\theta_u, \theta_m) = \tfrac{1}{N_c} \sum_k \bigl( \partial_t^2 u_{\mathrm{NN}} - m_{\mathrm{NN}}(x_k)^2 \, \partial_x^2 u_{\mathrm{NN}} \bigr)^2 \Big|_{(x_k, t_k)} .

Both gradients $\partial \mathcal{L} / \partial \theta_u$ and $\partial \mathcal{L} / \partial \theta_m$ are computed by ordinary backprop through the joint computational graph. Adam updates both sets of parameters simultaneously, every epoch.

Where the two paradigms differ

Classical FWI. The velocity model $m$ lives on a grid. Each iteration: one forward solve, one adjoint solve, one model update via the explicit gradient formula. The adjoint wavefield $\lambda(x, t)$ is the central object; you literally compute it and integrate. Discretisation choices (stencil order, time-stepping scheme, source/receiver injection) propagate into the gradient and must be self-consistent. Tools: SPECFEM, Devito, Madagascar, custom MPI codes.

PINN-FWI. The velocity model is a neural network. Each epoch: one forward+backward pass through the joint loss; Adam updates $\theta_u$ and $\theta_m$ . The adjoint wavefield is never formed explicitly — but the chain-rule traceback through the wavefield network $u_{\mathrm{NN}}$ is its implicit realisation. Auto-diff handles all the discretisation-consistency book-keeping. Adding regularisation (Tikhonov, total-variation, sparsity in transform domain) is one extra term in the loss; classical FWI requires re-deriving the gradient.

The honest comparison

PINN-FWI is not obviously better. It has three documented weaknesses:

Wavefield-network spectral bias (Tancik et al. 2020; covered in §2.2). High-frequency wavefield content is hard to fit. PINN-FWI can stall when the data has frequencies above what $u_{\mathrm{NN}}$ can represent.
Coupled convergence. The wavefield network must be roughly correct before the velocity network gets meaningful gradients. Until $u_{\mathrm{NN}}$ approximately satisfies the PDE on the current $m_{\mathrm{NN}}$ , gradients $\partial \mathcal{L} / \partial \theta_m$ are noise. Classical FWI sidesteps this by always solving the PDE exactly.
Cycle skipping (covered in §6.5). The non-convexity of the $L^2$ misfit hits both methods, but PINN-FWI has no equivalent of the Wasserstein or AWI misfits that classical FWI uses to extend the basin of attraction.

And four genuine strengths:

No discretisation-consistency burden. Auto-diff is exact.
Mesh-free. PDE residual evaluated at random points; no body-fitted grid required for irregular geometry.
Easy regularisation. Anything you can write as a loss is automatically differentiable.
Continuous velocity field. $m_{\mathrm{NN}}(x)$ is a smooth function — no aliasing artefacts, natural smoothness.

Published PINN-FWI demonstrations (Rasht-Behesht et al. 2022 on benchmark layered models; Song & Alkhalifah 2022 on wavefield reconstruction inversion) show the method recovering velocity fields from seismic data, but at substantially higher compute cost than classical FWI on the same data. Both paradigms are alive and competitive in 2024 research; neither has obviously won at scale. Part 6 follows the PINN-FWI line because it is the natural continuation of Parts 1–4 and because it is the part of the PINN literature most often misunderstood.

What §6.2 will do

§6.2 chains 25 of the §6.1 adjoint-state gradients into the canonical classical FWI iteration loop, on the same 3-layer 1D model. You will pick a starting middle-layer velocity, click ▶ Run, and watch the velocity descend over 25 outer iterations toward truth. Every iteration runs one forward + one adjoint solve to compute the Plessix gradient, applies a smoothing preconditioner, and takes a backtracking-line-search step. The full PINN-FWI version (joint Adam over $\theta_u$ and $\theta_m$ ) is structurally identical except backprop replaces the explicit adjoint solve; §6.2's pseudocode comparison makes the contrast precise.

References

Tarantola, A. (1984). Inversion of seismic reflection data in the acoustic approximation. Geophysics 49(8), 1259–1266. The original FWI formulation.
Lailly, P. (1983). The seismic inverse problem as a sequence of before stack migrations. SIAM Conference on Inverse Scattering. Co-discoverer of the adjoint approach for seismic inversion.
Plessix, R.-E. (2006). A review of the adjoint-state method for computing the gradient of a functional with geophysical applications. Geophys. J. Int. 167(2), 495–503. The canonical reference for the adjoint-state method in geophysics.
Virieux, J., Operto, S. (2009). An overview of full-waveform inversion in exploration geophysics. Geophysics 74(6), WCC1–WCC26. The community-standard FWI overview.
Symes, W.W. (2007). Reverse time migration with optimal checkpointing. Geophysics 72(5), SM213–SM221. Discusses discretisation consistency for adjoint-state methods.
Sun, J., Alkhalifah, T. (2020). ML-descent: An optimization algorithm for full-waveform inversion using machine learning. Geophysics 85(6), R477–R492.
Rasht-Behesht, M., Huber, C., Shukla, K., Karniadakis, G.E. (2022). Physics-informed neural networks (PINNs) for wave propagation and full waveform inversions. JGR Solid Earth 127, e2021JB023120.
Song, C., Alkhalifah, T. (2022). Wavefield reconstruction inversion via physics-informed neural networks. IEEE Trans. Geosci. Remote Sens. 60, 5908012. (arXiv:2104.06897, 2021.)