Source-encoded FWI-PINN

Part 6 — Velocity inversion with PINNs

Learning objectives

Understand stochastic source encoding (Krebs 2009)
See empirically that encoded inversion converges with ~N× less compute
Recognise the noise-vs-compute trade-off (encoded gradients are noisy)
Connect to mini-batch SGD: source encoding is FWI's mini-batch trick

Real seismic surveys have hundreds to thousands of shots. Naive FWI runs ONE forward + ONE adjoint solve PER SHOT per outer iteration. With 1000 shots and 30 iterations, that is 60,000 PDE solves total. On a 2D Marmousi-class problem, that is days of CPU time on a small cluster.

Krebs et al. (2009) showed this can be reduced to $\sim 60$ PDE solves total (one forward + one adjoint per iteration, for 30 iterations), a 1000× saving, via STOCHASTIC SOURCE ENCODING. The idea: combine all shots into a single random superposition; run ONE forward + ONE adjoint solve; compute the gradient as if the encoded super-shot were a single shot. The expected gradient over random encoding signs equals the shot-by-shot gradient. The variance of the encoded gradient is the cost paid for the compute saving.

The math

Let $s_k(x, t) = \delta(x - x_k) f_k(t)$ be the source for shot $k$, and $d_{k}^{\mathrm{obs}}(t)$ be its recorded data at the receivers. Choose random encoding signs $\varepsilon_k \in \{-1, +1\}$ independently and uniformly, fresh each iteration. Define the encoded super-source and encoded data:

$$
s_{\mathrm{enc}}(x, t) = \sum_k \varepsilon_k \, s_k(x, t) , \qquad d_{\mathrm{enc}}^{\mathrm{obs}}(t) = \sum_k \varepsilon_k \, d_{k}^{\mathrm{obs}}(t) .
$$
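The encoding step itself is only a signed sum over shots. A minimal sketch in plain Python, assuming the per-shot data are stored as equal-length lists of samples; the helper names `draw_signs` and `encode` and the toy numbers are illustrative, not taken from the course code:

```python
import random

def draw_signs(n_shots, rng):
    """Draw independent, uniform +/-1 encoding signs, one per shot."""
    return [rng.choice((-1, 1)) for _ in range(n_shots)]

def encode(per_shot, signs):
    """Signed sum over shots: result[i] = sum_k signs[k] * per_shot[k][i]."""
    n_samples = len(per_shot[0])
    return [sum(e * shot[i] for e, shot in zip(signs, per_shot))
            for i in range(n_samples)]

# Hypothetical toy data: 3 shots, 4 time samples at a single receiver.
d_obs = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
    [4.0, 0.0, 0.0, 1.0],
]

rng = random.Random(0)
eps = draw_signs(len(d_obs), rng)   # redraw these at every outer iteration
d_enc = encode(d_obs, eps)          # encoded observed data
```

The same `eps` must be applied to the source wavelets, so that the encoded super-source and the encoded data stay consistent with each other.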

Run ONE forward solve with $s_{\mathrm{enc}}$ on the current model to get $u_{\mathrm{pred}}^{\mathrm{enc}}$. Compute the residual at the receivers, $r_{\mathrm{enc}} = u_{\mathrm{pred}}^{\mathrm{enc}}|_{\mathrm{rec}} - d_{\mathrm{enc}}^{\mathrm{obs}}$. Run ONE adjoint solve. Apply the Plessix correlation. Result: an unbiased estimate of the shot-by-shot FWI gradient.

Why unbiased? The $\varepsilon_k$ are independent and zero-mean. Because the wavefield is linear in the source, the encoded gradient expands into a double sum over shot pairs. Cross-terms (between shot $k$ and shot $\ell \ne k$) carry $\varepsilon_k \varepsilon_\ell$, which has expectation zero. Diagonal terms ($k = \ell$) carry $\varepsilon_k^2 = 1$ deterministically. Summing over $k$ leaves the same diagonal sum as the shot-by-shot gradient.
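The argument can be checked exactly on a toy problem by averaging over every possible sign pattern. The sketch below assumes a single scalar model parameter `m` and predicted data `m * g_k` for shot `k`, which is linear in both the source and the model. That linearity in `m` is a simplifying assumption for illustration, not how a real wave solver behaves, and the vectors `g_shots` are made up.

```python
from itertools import product

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(per_shot, signs):
    """Signed sum over shots."""
    n = len(per_shot[0])
    return [sum(e * s[i] for e, s in zip(signs, per_shot)) for i in range(n)]

def gradient(m, g, d):
    """d/dm of 0.5 * ||m*g - d||^2 for a scalar model m."""
    residual = [m * gi - di for gi, di in zip(g, d)]
    return dot(g, residual)

# Hypothetical toy: 4 shots, 3 data samples each, noise-free data at m_true.
g_shots = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
m_true, m = 2.0, 1.5
d_shots = [[m_true * gi for gi in g] for g in g_shots]

# Shot-by-shot gradient: one term per shot.
grad_seq = sum(gradient(m, g, d) for g, d in zip(g_shots, d_shots))

# Encoded gradient for each of the 2^4 = 16 sign patterns, then the exact mean.
patterns = list(product((-1, 1), repeat=len(g_shots)))
grad_enc = [gradient(m, encode(g_shots, eps), encode(d_shots, eps))
            for eps in patterns]
grad_mean = sum(grad_enc) / len(patterns)
```

Here `grad_mean` matches `grad_seq`, while the individual entries of `grad_enc` scatter around it. That scatter is the cross-term contribution.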

The cost: the cross-terms are present in every single iteration, and they form the gradient noise (cross-talk). With noise of order $\sqrt{N}$ relative to the signal, this is the same trade-off as mini-batch SGD versus full-batch gradient descent in deep learning.

Try it

The widget runs 4-shot FWI on the §6.2 problem with $c_2$ as the only unknown. Shot-by-shot does 4 forward + 4 adjoint solves per outer iteration. Encoded does 1 forward + 1 adjoint with random ±1 signs flipping each iteration. With 12 outer iterations:

Shot-by-shot: 12 × 4 × 2 = 96 PDE solves total. Smooth convergence trace.
Encoded: 12 × 1 × 2 = 24 PDE solves total. Noisier trace but reaches truth.
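The two solve counts follow from one line of arithmetic. A small sketch, assuming every simulated shot (real or encoded) costs exactly one forward and one adjoint solve; the function name `pde_solves` is illustrative:

```python
def pde_solves(n_iters, n_shots, encoded=False, n_encodings=1):
    """Total PDE solves, assuming one forward + one adjoint per simulated shot."""
    sims_per_iter = n_encodings if encoded else n_shots
    return n_iters * sims_per_iter * 2

shot_by_shot = pde_solves(12, 4)                 # 12 * 4 * 2 = 96
encoded = pde_solves(12, 4, encoded=True)        # 12 * 1 * 2 = 24
saving = shot_by_shot / encoded                  # 4.0
```

The `n_encodings` argument covers schemes that simulate several independent super-shots per outer iteration.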

For 4 shots the saving is 4×. For a 1000-shot real survey the saving is 1000×, so a 30-iteration FWI that would take days shot-by-shot finishes in minutes. Production codes use a more conservative scheme: 4–8 random encodings averaged per outer iteration to reduce noise (still a $125$–$250\times$ saving for 1000 shots).
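Averaging several independent encodings per iteration trades compute back for lower noise. The sketch below reuses a hypothetical scalar toy problem (noise-free data, predicted data linear in the model parameter, made-up vectors `G`) and compares the spread of the gradient estimate for 1 and 8 encodings. For independent draws the spread should shrink roughly as $1/\sqrt{M}$ with $M$ encodings.

```python
import random
from statistics import mean, pstdev

# Hypothetical toy: 4 shots, 3 samples each; data are M_TRUE * g_k (noise-free).
G = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
M_TRUE, M_NOW = 2.0, 1.5

def encoded_gradient(rng):
    """One encoded gradient of 0.5 * ||m*g_enc - d_enc||^2 with fresh +/-1 signs."""
    eps = [rng.choice((-1, 1)) for _ in G]
    g_enc = [sum(e * g[i] for e, g in zip(eps, G)) for i in range(len(G[0]))]
    # d_enc = M_TRUE * g_enc here, so the residual is (M_NOW - M_TRUE) * g_enc.
    return sum(gi * (M_NOW - M_TRUE) * gi for gi in g_enc)

def averaged_gradient(rng, n_encodings):
    """Mean of n_encodings independent encoded gradients."""
    return mean(encoded_gradient(rng) for _ in range(n_encodings))

rng = random.Random(0)
trials = 2000
single = [averaged_gradient(rng, 1) for _ in range(trials)]
eight = [averaged_gradient(rng, 8) for _ in range(trials)]
spread_1, spread_8 = pstdev(single), pstdev(eight)   # noise level of each scheme
```

Both schemes scatter around the same shot-by-shot gradient; only the spread differs, at 8× the per-iteration cost for the averaged version.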

Time-shift encoding (Krebs 2009 Appendix). Instead of ±1 binary, use random time-shifts $\tau_k$ per shot. This delocalises the cross-talk noise in TIME rather than in amplitude.
Frequency-domain encoding (Plessix & Mulder 2008). Encode in the Fourier domain — random phase rotations per shot per frequency. Equivalent to mini-batch in frequency-domain FWI.
Plane-wave decomposition. Convert the source line into plane-wave shots; invert plane waves instead of point shots. Naturally encodes via the plane-wave parameter.
SGD vs Adam-style averaging. Moghaddam et al. 2013 propose averaging encoded gradients over a window of past iterations — the encoded version of momentum. Convergence is smoother.
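As an illustration of the first variant above, time-shift encoding replaces the ±1 signs with per-shot delays. A minimal sketch, assuming delays are whole numbers of samples and that each trace is zero-padded at the start and truncated at the end to keep its length. Both choices are assumptions made here for simplicity, and the function names are illustrative.

```python
def shift(trace, n):
    """Delay a trace by n >= 0 samples, zero-padding the start, keeping the length."""
    n = min(n, len(trace))
    return [0.0] * n + list(trace[:len(trace) - n])

def time_shift_encode(per_shot, shifts):
    """Sum of per-shot traces, each delayed by its own number of samples."""
    delayed = [shift(tr, n) for tr, n in zip(per_shot, shifts)]
    return [sum(col) for col in zip(*delayed)]

# Hypothetical toy: two shots with an impulse at sample 0, delayed by 1 and 3.
d_obs = [[1.0, 0.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0, 0.0]]
d_enc = time_shift_encode(d_obs, [1, 3])
```

The two impulses end up at different times in the encoded trace, which is the sense in which the cross-talk is spread in time. The same delays would have to be applied to the source wavelets.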

PINN-FWI source encoding

The PINN-FWI version is structurally trivial: the data-fit term $\mathcal{L}_{\mathrm{data}}$ is summed over all shots. Replace it with the encoded version:

$$
\mathcal{L}_{\mathrm{data}}^{\mathrm{enc}}(\theta_u) = \Bigl( u_{\mathrm{NN}}|_{\mathrm{rec}, \, t} - \sum_k \varepsilon_k d_k^{\mathrm{obs}} \Bigr)^2 ,
$$

where the wavefield network $u_{\mathrm{NN}}$ is now trained against the encoded super-shot data. Auto-diff handles backprop through both networks. Adam updates $\theta_u$ and $\theta_m$ with the noisy encoded gradient. With fresh $\varepsilon_k$ each epoch, the noise averages out. Sun & Alkhalifah have demonstrated this for 2D Marmousi-class PINN-FWI at 5–10× compute savings vs shot-by-shot PINN-FWI.
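The encoded data-fit term can be written down without any particular framework. In the sketch below, `u_nn` is a hypothetical stand-in for the wavefield network evaluated at one receiver (here just a closed-form function of time), and the squared misfit is summed over the receiver time samples, which is an assumption about how the loss is accumulated. In a real PINN-FWI code this function would live inside an auto-diff framework so that gradients reach both networks.

```python
import math

def encoded_data_loss(u_nn, times, d_obs, eps):
    """Sum over time samples of (u_nn(t) - sum_k eps_k * d_k(t))^2."""
    loss = 0.0
    for i, t in enumerate(times):
        d_enc = sum(e * d[i] for e, d in zip(eps, d_obs))  # encoded datum at t
        loss += (u_nn(t) - d_enc) ** 2
    return loss

# Hypothetical stand-ins: two shots recorded at one receiver, three time samples.
times = [0.0, 0.5, 1.0]
d_obs = [[math.sin(t) for t in times], [math.cos(t) for t in times]]
eps = [1, -1]   # in training these signs would be redrawn every epoch

u_perfect = lambda t: math.sin(t) - math.cos(t)   # matches the encoded data
u_zero = lambda t: 0.0                            # predicts nothing

loss_perfect = encoded_data_loss(u_perfect, times, d_obs, eps)
loss_zero = encoded_data_loss(u_zero, times, d_obs, eps)
```

Only one encoded target is compared per time sample, regardless of the number of shots, which is where the saving over a per-shot sum comes from.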

What §6.8 will do

§6.8 returns to the loss-balance question of §3.2 in an FWI setting. The PINN-FWI joint loss $\mathcal{L} = \lambda_d \mathcal{L}_{\mathrm{data}} + \cdots$ (a weighted sum of the data-fit term and the other loss terms) has multiple weights that strongly affect convergence. The §6.8 widget demonstrates the simplest 2-term version on classical FWI (data misfit + Tikhonov regulariser) so the trade-off is visible in a single 80-sample misfit-landscape sweep. The full PINN-FWI version inherits all the same balancing concerns, covered in the §6.8 prose with cross-references to the §3.3 NTK-balance and §3.4 SA-PINN auto-tuning trio.

References

Krebs, J.R., Anderson, J.E., Hinkley, D., Neelamani, R., Lee, S., Baumstein, A., Lacasse, M.-D. (2009). Fast full-wavefield seismic inversion using encoded sources. Geophysics 74(6), WCC177–WCC188.
Moghaddam, P.P., Keers, H., Herrmann, F.J., Mulder, W.A. (2013). A new optimization approach for source-encoding full-waveform inversion. Geophysics 78(3), R125–R132.
Plessix, R.-E., Mulder, W.A. (2008). Source separation in seismic full-waveform inversion using random Krylov methods. SEG Annual Meeting.