Cycle skipping: detection and remedies

Part 6 — Velocity inversion with PINNs

Learning objectives

Define cycle skipping in terms of the half-period traveltime threshold
Recognise cycle skipping in convergence traces and gradient sign
Understand the three modern alternative misfits: envelope, Wasserstein, AWI
See empirically that envelope misfit cures cycle skipping at the cost of resolution
Know the production-FWI workflow: envelope → L² (multi-stage misfit continuation)

§6.4 cured cycle skipping by changing the DATA (low-pass filtering). This section cures it by changing the MISFIT FUNCTION. Both ideas compose in production: filter the data AND use a robust misfit, layered curriculum-style.

What cycle skipping looks like

The §6.1 widget showed the gradient sign flipping at extreme c₂ values. The §6.2 widget showed the line search stalling. The §6.4 widget showed the freq-continuation rescue. Here is the canonical signature in three pictures:

Predicted vs observed seismograms. The two waveforms look very similar in shape but one is shifted in time by more than half the dominant period. Visually: peaks of $d_{\mathrm{pred}}$ align with TROUGHS of $d_{\mathrm{obs}}$ .
The L² residual $r = d_{\mathrm{pred}} - d_{\mathrm{obs}}$ . Has a "double pulse" shape — positive lobe at the predicted arrival, negative lobe at the observed arrival, both of comparable magnitude. The gradient $\partial J/\partial m$ correlates this residual with the time-reversed adjoint, giving CONSTRUCTIVE cancellation between the two lobes — wrong sign.
Convergence trace. $J$ stalls at a non-zero plateau. The line search reduces step size to zero and gives up. The model error $|c - c_{\mathrm{true}}|$ stays at its initial value (or grows).

The mathematical condition

Let $T_0 = 1/f_0$ be the dominant period of the source. Let $\Delta t$ be the traveltime mismatch between the predicted and observed arrivals at a receiver. Cycle skipping happens when

|\Delta t| > T_0 / 2 .

For a Ricker pulse with $\sigma$ -parameterisation $s(t) = (1 - 2u^2) e^{-u^2}$ , $u = (t-t_0)/\sigma$ , the dominant frequency is $f_0 = 1/(\pi \sigma)$ and the period is $T_0 = \pi \sigma$ . So the cycle-skipping threshold is

|\Delta t| > \pi \sigma / 2 .

For our §6.4 widget: $\sigma = 0.012$ s, so the threshold is $\sim 19$ ms. A 1D problem with travel-time differences exceeding 19 ms across the model is cycle-skipped at single-frequency.

Three modern alternative misfits

The trick: replace the $L^2$ misfit $J = \tfrac{1}{2} \int (d_{\mathrm{pred}} - d_{\mathrm{obs}})^2$ with something that is SMOOTH in time-shift, so the cycle-skipping non-convexity disappears. Three contenders:

Envelope misfit (Wu, Luo, Wu 2014). Replace each seismogram by its envelope $\mathrm{env}(s)(t) = \sqrt{s(t)^2 + \epsilon^2}$ before computing the residual. The envelope strips the carrier oscillations — peaks and troughs become bumps — so cycle-skipping is impossible at the envelope level. The price: resolution loss; the envelope smooths over the high-frequency content that distinguishes nearby reflectors. Production workflow: invert the envelope first, then refine with L².
Wasserstein / optimal-transport misfit (Engquist & Froese 2014; Métivier et al. 2016). Treat the seismograms as probability distributions (after positivity-shift normalisation) and compute the Wasserstein-1 or Wasserstein-2 distance between them. The Wasserstein metric is convex in time-shift by construction, so it cures cycle skipping cleanly. Mathematically elegant; computationally intensive — Métivier 2016's SOT (Sliced Optimal Transport) approximation made it production-tractable.
Adaptive Waveform Inversion (AWI) (Warner & Guasch 2016). Instead of comparing $d_{\mathrm{pred}}$ to $d_{\mathrm{obs}}$ directly, find the Wiener filter $w$ that best matches $d_{\mathrm{pred}} \ast w$ to $d_{\mathrm{obs}}$ , then minimise the deviation of $w$ from a $\delta$ -function. AWI absorbs time-shift residuals into the filter $w$ ; the misfit only sees what is left. Provably immune to first-order cycle skipping; a workhorse misfit at the major service companies (CGG, BP, etc.).

Try it: misfit landscapes

The widget above computes both misfits as a function of the middle-layer velocity $c_2$ on the §6.2 problem, with $c_1 = 1.0$ and $c_3 = 1.2$ fixed at truth. 60 sampled values of $c_2 \in [0.6, 2.4]$ each take one FDTD forward solve. Plotted side-by-side on log y-axis, with local minima circled.

Empirical result on this problem (60-sample landscape):

L² landscape: typically 10 local minima — one near $c_2 = 1.5$ (truth) plus 9 spurious minima at multiples of the Ricker period offset. Each spurious minimum is a "cycle-skipping island" where the predicted-vs-observed traveltime mismatch is an integer number of half-periods. Gradient descent converges to whichever island the starting model lies in.
Envelope landscape: 1 minimum at $c_2 = 1.5$ . The smoothed- $s^2$ envelope $\sqrt{\mathrm{smooth}(s^2, \sigma) + \epsilon^2}$ removes the carrier oscillations that produce the L² wiggles. The result is a smooth bowl with a global minimum at truth.

This is the Wu, Luo, Wu 2014 figure-of-merit, made interactive. Their published version uses Hilbert-transform envelopes (computed via FFT) which produce essentially the same convex landscape.

You might expect a "race two FWI runs" widget like §6.4's. The mathematical claim of the convex landscape is that envelope-FWI gradient descent converges from any starting point. In practice, the simple $\sqrt{s^2 + \epsilon^2}$ envelope has a known limitation: the gradient adjoint source

\textrm{adj\_src}(t) = s_{\mathrm{pred}}(t) \cdot \textrm{smooth}\!\left(\frac{\mathrm{env}_p - \mathrm{env}_o}{\mathrm{env}_p}, \sigma\right)(t)

still contains the carrier $s_{\mathrm{pred}}(t)$ as a multiplicative factor, so the adjoint wavefield $\lambda(x, t)$ inherits the same phase oscillations as L²'s. The correlation $\int u_{tt} , \lambda , dt$ then suffers the same phase-cancellation issues at the gradient level even though the misfit landscape is convex. The Wu 2014 paper carefully derives the Hilbert-envelope adjoint chain rule (which requires FFT in the backprop step too) to fix this — beyond this widget's in-browser scope.

Production envelope FWI codes use Hilbert envelopes; production frequency-continuation codes use band-pass filtering; production AWI codes use Wiener filter convolution. All three converge from cycle-skipped starting models because their misfits + their adjoint chain rules together preserve the convex landscape all the way to the gradient.

The production workflow: misfit continuation

Real production FWI runs a CASCADE of misfits:

First, low-frequency envelope inversion (immune to cycle skipping; resolves only large-scale structure).
Then, low-frequency L² inversion (warm-started from envelope; recovers mid-scale structure).
Then, frequency continuation through the §6.4 stages (refining successively higher-frequency content).
Finally, full-band L² inversion at the highest frequencies (high-resolution detail).

Each stage starts from the previous stage's converged model. This is the standard recipe in CGG's Geovation, Schlumberger's WesternGeco, and BP's reservoir-FWI workflows. The §6.4 frequency-continuation widget builds curriculum on the data dimension; this section's envelope idea builds curriculum on the misfit-function dimension; production stacks both.

What about PINN-FWI?

The same misfit-replacement trick works for PINN-FWI. Replace the $L_{\mathrm{data}}$ term in the joint loss

\mathcal{L} = \lambda_{\mathrm{data}} \mathcal{L}_{\mathrm{data}} + \lambda_{\mathrm{pde}} \mathcal{L}_{\mathrm{pde}} + \ldots

by an envelope or Wasserstein version. Auto-diff handles the gradient automatically. Sun & Alkhalifah (2021) and Yang et al. (2023 Geophysics) both report PINN-FWI with envelope or Wasserstein misfits successfully recovering velocity from cycle-skipped starting models that defeated L²-PINN-FWI. The §6.4 frequency-continuation idea also applies — just bandpass-filter the data input to the PINN-FWI joint loss.

What §6.6 will do

§6.6 frees a SECOND layer parameter (the top-layer velocity $c_1$ ) in addition to the middle-layer $c_2$ , and shows the resulting 2D misfit landscape $J(c_1, c_2)$ . The cross-talk between layer velocities — the inability of single-shot transmission data to independently constrain both — appears as an elongated valley in the heatmap. The same valley structure shows up at much larger scale in production multi-parameter FWI (P-wave velocity vs S-wave velocity vs density); the §6.6 prose covers the full $v_p$ / $v_s$ / $\rho$ generalisation and its remedies (multi-shot illumination, joint inversion with auxiliary data, impedance reparameterisation).

References

Wu, R.-S., Luo, J., Wu, B. (2014). Seismic envelope inversion and modulation signal model. Geophysics 79(3), WA13–WA24.
Engquist, B., Froese, B.D. (2014). Application of the Wasserstein metric to seismic signals. Communications in Mathematical Sciences 12(5), 979–988.
Métivier, L., Brossier, R., Mérigot, Q., Oudet, E., Virieux, J. (2016). Measuring the misfit between seismograms using an optimal transport distance: application to full waveform inversion. Geophys. J. Int. 205(1), 345–377.
Warner, M., Guasch, L. (2016). Adaptive waveform inversion: Theory. Geophysics 81(6), R429–R445.
Sun, B., Alkhalifah, T. (2021). The application of optimal transport for FWI in seismic exploration. Geophys. J. Int.
Yang, F., Ma, J., Lu, Y., Liu, X. (2023). Physics-informed deep learning for full-waveform inversion via Wasserstein loss. Geophysics.