The inverse problem, mathematically

Part 6 — Full-Waveform Inversion

Learning objectives

State the L2 FWI objective function and identify its inputs
Derive the gradient update rule via the adjoint state method
Explain cycle skipping and how frequency controls the basin of attraction
Describe the full FWI iteration: forward, adjoint, gradient, line search, repeat

Full-waveform inversion (FWI) treats seismic imaging as a nonlinear least-squares optimisation problem. You have observed data $d_{obs}(s, r, t)$ recorded at receivers $r$ from sources $s$. You have a model $m$ (usually velocity, sometimes density or anisotropy) and a wave simulator that produces synthetic data $d_{syn}(m)$. Pick the $m$ that minimises the difference. That is FWI in one sentence. The rest of this section covers what that optimisation actually costs and why it is so hard to do without getting trapped in local minima.

1. The L2 objective function

J(m) = \tfrac{1}{2} \sum_{s,r,t} \bigl[d_{\text{obs}}(s,r,t) - d_{\text{syn}}(s,r,t;\,m)\bigr]^2

This is the ordinary squared-error misfit summed over every source, receiver, and time sample. Small $J$ means the synthetic data match the observed data; large $J$ means they do not. FWI is gradient descent on $J(m)$ in the space of all possible velocity models. L2 is the default because it is smooth and its gradient is driven directly by the residual; more robust variants (L1, Huber, correlation-based) are used when the data have outliers or large systematic errors.
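As a minimal sketch, the misfit can be evaluated with NumPy, assuming the observed and synthetic data are stored as arrays of identical shape (source, receiver, time sample). The function names `l2_misfit` and `huber_misfit` and the threshold `delta` are illustrative choices, not part of any particular FWI package; the Huber variant is included only to show how a robust misfit damps a single outlier.

```python
import numpy as np

def l2_misfit(d_obs, d_syn):
    """J = 1/2 * sum of squared residuals over all sources, receivers, samples."""
    residual = np.asarray(d_obs, dtype=float) - np.asarray(d_syn, dtype=float)
    return 0.5 * np.sum(residual ** 2)

def huber_misfit(d_obs, d_syn, delta=1.0):
    """Huber misfit: quadratic for |residual| <= delta, linear beyond it.

    delta is a hypothetical threshold chosen for illustration.
    """
    residual = np.abs(np.asarray(d_obs, dtype=float) - np.asarray(d_syn, dtype=float))
    quadratic = 0.5 * residual ** 2
    linear = delta * (residual - 0.5 * delta)
    return np.sum(np.where(residual <= delta, quadratic, linear))

# Toy data: 1 source, 1 receiver, 2 time samples; the second sample is an outlier.
d_obs = np.array([[[0.5, 10.0]]])
d_syn = np.zeros_like(d_obs)

j_l2 = l2_misfit(d_obs, d_syn)        # 0.5 * (0.25 + 100) = 50.125
j_huber = huber_misfit(d_obs, d_syn)  # 0.125 + 1.0 * (10 - 0.5) = 9.625
print(j_l2, j_huber)
```

The outlier dominates the L2 value but contributes only linearly to the Huber value, which is the behaviour the robust variants are chosen for.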

2. The gradient and the adjoint state method

To run gradient descent we need $\nabla_m J$, the partial derivative of $J$ with respect to every pixel of the velocity model. A naive finite-difference approach would perturb each pixel, re-run the simulator, and measure the change in $J$. That costs one forward simulation per pixel, which for a model with $10^6$ pixels is computationally out of reach. The adjoint state method gets the entire gradient with just two simulations per source, regardless of model size:

Forward: Run the wave equation forward from the source wavelet to get the forward wavefield $U_s(x, z, t)$ and the synthetic data $d_{syn}$.
Residual: Compute the trace-by-trace residual $r(s, r, t) = d_{syn} - d_{obs}$.
Adjoint: Inject $r(s, r, t)$, reversed in time, as a source at each receiver and propagate the wave equation backward to get the adjoint wavefield $U_r^\dagger(x, z, t)$.
Cross-correlate: The gradient at each pixel is

\partial J/\partial m(x,z) \propto -\int U_s(x,z,t)\,U_r^\dagger(x,z,t)\,dt

which is a zero-lag cross-correlation between the forward and adjoint wavefields. This is structurally identical to the RTM imaging condition of §5.7: FWI and RTM share the same machinery and differ only in how they interpret the output. RTM outputs an image (reflectivity); FWI outputs a velocity-model correction.
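The "one forward plus one adjoint" idea can be checked on a toy problem. The sketch below is not a wave simulation: it assumes a linear stand-in simulator $d_{syn} = G m$, for which the adjoint step is simply multiplication by $G^T$. The matrix `G`, the problem sizes, and the function names are all hypothetical. It compares the adjoint gradient with the brute-force finite-difference gradient that needs two misfit evaluations per model parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_data, n_model = 40, 25

# Hypothetical linear "simulator": d_syn = G @ m (a stand-in for the wave equation).
G = rng.normal(size=(n_data, n_model))
m_true = rng.normal(size=n_model)
d_obs = G @ m_true

def forward(m):
    return G @ m

def misfit(m):
    return 0.5 * np.sum((d_obs - forward(m)) ** 2)

def adjoint_gradient(m):
    """Whole gradient from one forward run and one adjoint (transpose) run."""
    residual = forward(m) - d_obs   # d_syn - d_obs
    return G.T @ residual           # adjoint operator applied to the residual

def finite_difference_gradient(m, h=1e-6):
    """Brute force: perturb every model parameter in turn."""
    g = np.zeros_like(m)
    for i in range(m.size):
        dm = np.zeros_like(m)
        dm[i] = h
        g[i] = (misfit(m + dm) - misfit(m - dm)) / (2.0 * h)
    return g

m0 = np.zeros(n_model)
g_adj = adjoint_gradient(m0)
g_fd = finite_difference_gradient(m0)
print(np.max(np.abs(g_adj - g_fd)))
```

The two gradients should agree to rounding error, but the finite-difference version calls the simulator `2 * n_model` times while the adjoint version calls it once in each direction. In real FWI the transpose is replaced by the reverse-time propagation described above.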

3. The cycle-skipping problem

Gradient descent converges to the nearest local minimum of $J(m)$. If the initial model is close to the truth, that local minimum is the global minimum and the answer is right. If the initial model is far from the truth, the local minimum can be a cycle-skipped solution: a model whose synthetic data match the observed data shifted by one or more full wavelet periods, so that each peak lines up with the wrong cycle. Gradient descent cannot see past the next peak of the misfit, so it never finds the true minimum.

The boundary is set by frequency. If the synthetic and observed traces are misaligned by more than half a wavelet period ($T/2 = 1/(2f)$), the gradient tells you to move in the wrong direction, toward the cycle-skipped minimum instead of the true one. The global basin of $J(m)$, the region where gradient descent converges to the truth, is therefore the set of models whose traveltime error satisfies $|\Delta t| < 1/(2f)$.

Simplified 1D FWI. Single reflector at 1000 m, true velocity $V_{true} = 2000\ \text{m/s}$, so the observed trace is a Ricker wavelet centred at 1.0 s. The synthetic trace for trial velocity $V$ is a Ricker centred at $2z/V$. Left panel shows both traces; right panel shows the L2 misfit $J(V)$ with a dot at the current $V_{guess}$.
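The misfit curve in the right panel can be reproduced offline with a few lines of NumPy. This is a sketch under the caption's assumptions (reflector at 1000 m, $V_{true} = 2000$ m/s, Ricker wavelets); the 1 ms sampling, the 1800 to 2200 m/s scan range, and the function names are illustrative choices and are not taken from the widget's own code.

```python
import numpy as np

def ricker(t, f, t0):
    """Ricker wavelet with peak frequency f (Hz), centred at time t0 (s)."""
    a = (np.pi * f * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def misfit(v, f, z=1000.0, v_true=2000.0, dt=0.001, t_max=2.0):
    """L2 misfit J(V) between the observed trace and the trial trace."""
    t = np.arange(0.0, t_max, dt)
    d_obs = ricker(t, f, 2.0 * z / v_true)  # arrival at 1.0 s
    d_syn = ricker(t, f, 2.0 * z / v)       # arrival at 2z/V
    return 0.5 * np.sum((d_obs - d_syn) ** 2)

# Scan trial velocities at 15 Hz (scan range is an assumption).
velocities = np.arange(1800.0, 2201.0, 5.0)
J = np.array([misfit(v, f=15.0) for v in velocities])

# Interior grid points that are lower than both neighbours are local minima.
is_min = (J[1:-1] < J[:-2]) & (J[1:-1] < J[2:])
local_minima = velocities[1:-1][is_min]
print(local_minima)
```

The scan should report the true minimum at 2000 m/s together with shallower side minima on either side, which are the cycle-skipped solutions.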

Slide $V_{guess}$ with $f = 15\ \text{Hz}$: the landscape has a deep central basin around 2000 m/s with oscillating side-lobes at cycle-skipped velocities. The info strip tells you whether the current guess is inside the global basin, near its edge, or cycle-skipped. Now drop the frequency to 5 Hz: the basin widens dramatically, so you can start further from $V_{true}$ and still converge. Raise it to 45 Hz: the basin shrinks; even a guess 200 m/s off is already cycle-skipped. The lowest usable frequency is the single most important number in production FWI: it sets how forgiving the method is of your initial model.
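The half-period rule translates directly into a velocity window for this toy geometry. The sketch below assumes the same single reflector (depth 1000 m, $V_{true} = 2000$ m/s) and simply solves $|2z/V - 2z/V_{true}| < 1/(2f)$ for $V$. The function name `basin_velocity_bounds` is illustrative, and the rule itself is a rule of thumb rather than an exact basin boundary.

```python
def basin_velocity_bounds(f, z=1000.0, v_true=2000.0):
    """Velocity range whose two-way-time error stays below half a period."""
    t_true = 2.0 * z / v_true        # true two-way time (s)
    half_period = 1.0 / (2.0 * f)    # cycle-skipping threshold (s)
    if half_period >= t_true:
        raise ValueError("half period exceeds the arrival time; bound undefined")
    v_low = 2.0 * z / (t_true + half_period)   # slowest admissible velocity
    v_high = 2.0 * z / (t_true - half_period)  # fastest admissible velocity
    return v_low, v_high

for f in (5.0, 15.0, 45.0):
    v_low, v_high = basin_velocity_bounds(f)
    print(f"{f:4.0f} Hz: {v_low:7.1f} to {v_high:7.1f} m/s")
```

Under these assumptions the window is roughly 1818 to 2222 m/s at 5 Hz, 1935 to 2069 m/s at 15 Hz, and 1978 to 2022 m/s at 45 Hz, so a guess 200 m/s off is inside the basin only at 5 Hz.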

5. The FWI iteration in full

Initialise with a smooth velocity model from tomography (§5.9), well logs, or prior seismic.
Filter the observed data to the lowest available frequency band.
Forward the source wavelet through the current model to get synthetic data.
Residual: $r = d_{syn} - d_{obs}$.
Adjoint: reverse-propagate $r$ as a receiver-side source.
Gradient: zero-lag cross-correlate forward and adjoint wavefields per pixel.
Pre-condition the gradient (scale by approximate inverse Hessian, apply masks that zero out regions outside the illumination cone).
Line search along the descent direction $-g$ to find the step length $\alpha > 0$ that minimises $J(m - \alpha \cdot g)$.
Update: $m \leftarrow m - \alpha \cdot g$.
Repeat until convergence (gradient magnitude below threshold, or J stops decreasing).
Raise the frequency band, re-filter the observed data to it, and go back to step 3 (Forward). This is multi-scale continuation.
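The loop above can be sketched on a one-parameter toy problem: a single reflector at 1000 m, $V_{true} = 2000$ m/s, and Ricker traces. Everything here is a simplifying assumption. The "simulator" is an analytic Ricker wavelet, the gradient is a finite difference rather than an adjoint computation, the update steps against the sign of the gradient, and the line search is plain backtracking on a step length in m/s. The starting velocity of 1880 m/s, the band list, and all function names are illustrative.

```python
import numpy as np

Z, V_TRUE, DT = 1000.0, 2000.0, 0.001
T = np.arange(0.0, 2.0, DT)

def ricker(t, f, t0):
    a = (np.pi * f * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def misfit(v, f):
    d_obs = ricker(T, f, 2.0 * Z / V_TRUE)  # "observed" data in band f
    d_syn = ricker(T, f, 2.0 * Z / v)       # forward step for the trial model
    return 0.5 * np.sum((d_obs - d_syn) ** 2)

def gradient(v, f, h=0.01):
    # Finite-difference stand-in for the adjoint gradient (one parameter only).
    return (misfit(v + h, f) - misfit(v - h, f)) / (2.0 * h)

def descend(v, f, n_iter=30, max_step=50.0, min_step=1e-6):
    """Descent in one frequency band with a backtracking line search."""
    for _ in range(n_iter):
        g = gradient(v, f)
        if g == 0.0:
            break
        direction = -np.sign(g)  # move against the gradient
        j0 = misfit(v, f)
        step = max_step
        # Halve the step until the misfit actually decreases.
        while step > min_step and misfit(v + direction * step, f) >= j0:
            step *= 0.5
        if step <= min_step:
            break                # no decrease found: converged
        v = v + direction * step
    return v

v_start = 1880.0

# Single band at 15 Hz: the start is outside the global basin.
v_single = descend(v_start, f=15.0)

# Multi-scale continuation: converge at 5 Hz, then raise the band.
v_multi = v_start
for f in (5.0, 15.0, 45.0):
    v_multi = descend(v_multi, f)

print(v_single, v_multi)
```

On this toy problem the single-band run should stall in a side minimum well below 2000 m/s, while the 5, 15, 45 Hz sequence should end very close to 2000 m/s, because each band starts inside the basin left by the previous one.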

A production 3D FWI runs this loop for 50–200 outer iterations across 5–10 frequency bands. Each outer iteration costs at least 2 wave simulations (forward + adjoint) per shot × thousands of shots, plus extra forward runs for the line search. GPU clusters run for days to weeks per frequency band. The final model is worth it: FWI can resolve velocity detail down to roughly half the shortest wavelength in the inverted band, producing images with a clarity no ray-based method can match.

6. What can go wrong

Cycle skipping: the widget's whole message. Mitigate by starting at low frequency; mitigate further by using an initial model whose predicted arrival times fall within half a period of the observed ones at the starting frequency.
Local minima from unmodelled physics. If the data contain elastic converted waves and the simulator is acoustic, no velocity model can reproduce those events, yet FWI still tries to match them by distorting velocity, yielding artefacts. Elastic FWI (§6.4) is the answer.
Source-wavelet errors. A mismatched source wavelet maps to a systematic velocity bias. Solution: jointly invert for the source wavelet, or use source-independent misfits (correlation coefficient, trace-envelope matching).
Noise in the observations. Low-frequency swell, dip lines, 60 Hz hum. FWI tries to fit all of it. Pre-filter aggressively; use robust misfits in noisy bands.
Computational cost. Forward + adjoint per shot per iteration per band gets expensive quickly. See §6.3 for source-encoded FWI, which collapses thousands of shots into a few "super-shots".
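One of the misfits named in the list, the correlation coefficient, can be sketched in a few lines. The version below is an assumption about its simplest form: one minus the normalised zero-lag correlation of each trace pair, summed over traces. The function name and the `eps` safeguard are illustrative. By construction it ignores the overall amplitude of each trace, which removes one kind of source-wavelet error (a wrong scale factor) but not errors in wavelet shape or phase.

```python
import numpy as np

def correlation_misfit(d_obs, d_syn, eps=1e-12):
    """Sum over traces of 1 - normalised zero-lag correlation.

    The last axis is time; eps (an illustrative safeguard) avoids division
    by zero for dead traces.
    """
    d_obs = np.asarray(d_obs, dtype=float)
    d_syn = np.asarray(d_syn, dtype=float)
    num = np.sum(d_obs * d_syn, axis=-1)
    den = np.linalg.norm(d_obs, axis=-1) * np.linalg.norm(d_syn, axis=-1) + eps
    return np.sum(1.0 - num / den)

t = np.linspace(0.0, 1.0, 200, endpoint=False)
trace = np.sin(2.0 * np.pi * 5.0 * t)

same_shape = correlation_misfit(trace, 3.0 * trace)  # amplitude error only
flipped = correlation_misfit(trace, -trace)          # polarity reversed
print(same_shape, flipped)
```

A trace that differs only by a scale factor gives a misfit of essentially zero, while a polarity flip gives the maximum value of 2 per trace.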

The one sentence to remember

FWI is gradient descent on ½Σ(d_obs − d_syn(m))², using the adjoint-state method to compute the gradient in two wave simulations per shot; the whole game is avoiding cycle skipping, and the answer is to start at low frequency and climb.

Where this goes next

§6.2 turns the cycle-skipping trade-off into a concrete workflow: multi-scale frequency continuation, data preconditioning, envelope FWI, time-domain windowing strategies, and the family of tricks production FWI uses to stretch the usable frequency band downward.

References

Tarantola, A. (1984). Inversion of seismic reflection data in the acoustic approximation. Geophysics, 49, 1259.
Virieux, J., Operto, S. (2009). An overview of full-waveform inversion in exploration geophysics. Geophysics, 74, WCC1.
Pratt, R. G. (1999). Seismic waveform inversion in the frequency domain, Part 1. Geophysics, 64, 888.
Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge.