Encoded FWI & computational strategies

Part 6 — Full-Waveform Inversion

Learning objectives

State the source-encoding identity and the cross-talk mechanism
Compare naive, encoded, and mini-batch FWI by cost per iteration and iterations to converge
Describe the memory-reduction tricks for storing the forward wavefield
Understand why L-BFGS (quasi-Newton) is the default solver

Production FWI is computationally extreme. A full-physics 3D acoustic FWI at 10 Hz on a modest survey (10 km × 10 km × 8 km, $\Delta x = 25\ \text{m}$) involves thousands of shots, each requiring a forward plus an adjoint simulation per iteration, for tens of outer iterations in each of 5–10 frequency bands. Run the arithmetic and you get days to weeks of GPU-cluster time per survey, and that is already after using every computational trick available. This section catalogues the tricks.

1. Source encoding — the central trick

The wave equation is linear in the source: if shot $i$ (source $s_i$) produces data $d_i$, the combined source $\sum_i c_i s_i$ produces data $\sum_i c_i d_i$. Pick random signs $c_i \in \{-1, +1\}$, build a super-source and a super-data, and run one wave simulation that collectively informs every shot's gradient:

$$s_{\text{enc}} = \sum_i c_i s_i,\quad d_{\text{enc}} = \sum_i c_i d_i,\quad g_{\text{enc}} = \sum_i c_i^2\, g_i + \sum_{i \neq j} c_i c_j\, X_{ij}$$

Since $c_i^2 = 1$, the first term is $\sum_i g_i$, the true FWI gradient we wanted. The second term is cross-talk: $X_{ij}$ is the contribution from shot $i$'s forward wavefield correlated with shot $j$'s adjoint wavefield. Because $c_i c_j$ (for $i \neq j$) averages to zero when fresh random signs are drawn at each iteration, the cross-talk averages out over iterations, leaving only the correct gradient. One super-shot simulation per iteration replaces $N_{shots}$ of them, potentially a 100- to 1000-fold speedup per iteration, at the cost of ~3× more iterations to convergence because of the added noise.
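The identity and the cross-talk averaging can be checked numerically with a toy sketch. It assumes random vectors as stand-ins for the per-shot forward and adjoint wavefields (no wave physics is simulated), and takes each shot's gradient to be their elementwise product; the names `fwd`, `adj`, and `encoded_gradient` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_shots, n_model = 16, 200

# Stand-ins for per-shot forward and adjoint wavefields (random, not real physics).
fwd = rng.standard_normal((n_shots, n_model))
adj = rng.standard_normal((n_shots, n_model))

# "True" gradient: sum over shots of each shot's forward/adjoint product.
g_true = np.sum(fwd * adj, axis=0)

def encoded_gradient(rng):
    """One encoded gradient: a single product of the two super-wavefields."""
    c = rng.choice([-1.0, 1.0], size=n_shots)  # random signs c_i
    f_enc = c @ fwd                            # sum_i c_i * forward_i
    a_enc = c @ adj                            # sum_j c_j * adjoint_j
    # Product expands to sum_i c_i^2 g_i plus the cross-talk terms i != j.
    return f_enc * a_enc

def rel_error(g):
    return np.linalg.norm(g - g_true) / np.linalg.norm(g_true)

g_one = encoded_gradient(rng)
g_avg = np.mean([encoded_gradient(rng) for _ in range(2000)], axis=0)

print("relative error, one encoding:   ", rel_error(g_one))
print("relative error, 2000 encodings: ", rel_error(g_avg))
```

A single encoding is dominated by cross-talk, while the average over many independent sign draws approaches `g_true`. In an actual inversion the averaging happens implicitly across iterations rather than by explicit repetition.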

Three horizontal bars show total-simulations-to-convergence for the three strategies:

Naive FWI: one sim per shot per iteration. For $N = 1000$ shots and 50 iterations: 50 000 sims.
Encoded FWI: one sim per iteration, but 3× more iterations (150 instead of 50) due to cross-talk: 150 sims total, 333× cheaper than naive for $N = 1000$.
Mini-batch FWI: $k$ random shots per iteration, 1.5× more iterations (75 instead of 50). For $k = 0.1N = 100$: 7500 sims, 6.7× cheaper than naive.

The speedup from encoded FWI grows with $N_{shots}$: at $N = 10\,000$ it is $10\,000 \times 50 / 150 \approx 3333\times$. Production FWI over 10 000 shots that would take years naively finishes in days with encoding.
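The counts above follow from a one-line formula per strategy. The sketch below reproduces them, assuming the iteration penalties quoted in the text (3× for encoding, 1.5× for mini-batching); the function name `total_sims` and its arguments are hypothetical.

```python
def total_sims(n_shots, base_iters, strategy, batch_fraction=0.1):
    """Total simulations to convergence under the assumptions in the text."""
    if strategy == "naive":
        # One simulation per shot per iteration.
        return n_shots * base_iters
    if strategy == "encoded":
        # One super-shot per iteration, 3x more iterations.
        return round(3 * base_iters)
    if strategy == "minibatch":
        # k = batch_fraction * n_shots shots per iteration, 1.5x more iterations.
        return round(batch_fraction * n_shots) * round(1.5 * base_iters)
    raise ValueError(f"unknown strategy: {strategy}")

for n in (1000, 10_000):
    naive = total_sims(n, 50, "naive")
    for strategy in ("naive", "encoded", "minibatch"):
        sims = total_sims(n, 50, strategy)
        print(f"N={n:>6} {strategy:>9}: {sims:>7} sims, {naive / sims:7.1f}x cheaper than naive")
```

Because the encoded count does not depend on `n_shots`, its advantage over the naive scheme grows linearly with the number of shots, while the mini-batch advantage stays fixed for a fixed batch fraction.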

3. When encoded FWI breaks

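Irregular acquisition geometries. Encoded FWI assumes every shot is recorded by the same receivers, so that the summed observed data match what the super-source would produce. Missing or unevenly distributed sources and receivers break that correspondence, and the cross-talk stops averaging out.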
Strong amplitude variations between shots. If one shot is 10× stronger than another, the random-sign sum is dominated by the strong shot and you effectively only invert one shot’s gradient.
Locally correlated noise. Noise that is coherent across shots (coherent swell, electrical hum) does not average out under encoding — it stacks additively.
Salt imaging. At sharp reflectors, the cross-talk can add coherent artefacts that never fully wash out. Hybrid strategies (encoded outer, naive inner) are used.

4. Memory — storing the forward wavefield

The adjoint-state gradient requires the forward source wavefield $U_s(x, z, t)$ at every grid point and every time step. For a 3D volume with $400 \times 400 \times 320 \times 4000$ samples (single precision), that is ~800 GB per shot. Three standard mitigations:

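Checkpointing: save $U_s$ at every $k$-th time step only. To compute the gradient at the skipped time steps, re-propagate forward from the nearest earlier checkpoint. Checkpoint storage drops by a factor of $k$ (plus a working buffer of up to $k$ steps for the segment being recomputed) at the cost of roughly one extra forward simulation. Griewank's binomial checkpointing gives the optimal schedule for a fixed number of checkpoints, with memory and recomputation both growing only logarithmically in the number of time steps.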
Random boundaries: instead of storing the wavefield, replace the absorbing boundary with a layer of randomised velocity so outgoing waves return scrambled rather than being absorbed. Because no energy is lost, the forward simulation is time-reversible: propagating backward from the final time steps re-derives $U_s$ in the interior. Storage drops to almost nothing (only the final wavefield state is kept) at the cost of re-propagation.
Wavefield reconstruction: solve the wave equation backward in time from the last saved slab (usually just the boundary). Cheaper than re-forward, and accurate enough for L-BFGS-quality gradients.
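The storage figure and the simplest of these mitigations, uniform checkpointing, can be sketched as follows. The `step` function is a hypothetical stand-in for one time step of a wave solver (a simple neighbour average on a periodic 1D grid), and the schedule is the plain every-$k$-th-step scheme, not Griewank's binomial one.

```python
import numpy as np

# Full storage for the 3D example: nx * ny * nz * nt samples, 4 bytes each.
full_bytes = 400 * 400 * 320 * 4000 * 4
print(f"full wavefield: {full_bytes / 1e9:.1f} GB per shot")

def step(u):
    # Hypothetical stand-in for one time step of a wave solver.
    return 0.5 * (np.roll(u, 1) + np.roll(u, -1))

def forward_with_checkpoints(u0, n_steps, k):
    """Run forward, keeping only the states at t = 0, k, 2k, ..."""
    checkpoints = {0: u0.copy()}
    u = u0.copy()
    for t in range(1, n_steps + 1):
        u = step(u)
        if t % k == 0:
            checkpoints[t] = u.copy()
    return checkpoints

def states_in_reverse(checkpoints, n_steps, k):
    """Yield (t, u_t) for t = n_steps, ..., 0, recomputing one segment at a time."""
    t_hi = n_steps
    while t_hi >= 0:
        t_lo = (t_hi // k) * k              # nearest checkpoint at or before t_hi
        segment = [checkpoints[t_lo]]
        for _ in range(t_hi - t_lo):        # re-propagate forward inside the segment
            segment.append(step(segment[-1]))
        for offset in range(t_hi - t_lo, -1, -1):
            yield t_lo + offset, segment[offset]
        t_hi = t_lo - 1

n_steps, k = 22, 5
u0 = np.random.default_rng(1).standard_normal(64)
checkpoints = forward_with_checkpoints(u0, n_steps, k)
recovered = dict(states_in_reverse(checkpoints, n_steps, k))
print(f"stored {len(checkpoints)} states instead of {n_steps + 1}")
```

The reverse generator delivers the forward states in the order the adjoint loop consumes them, holding at most one segment of $k$ states in memory at a time.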

5. Parallelism

Shot-parallel: each shot's simulation is independent; distribute across GPU nodes. Embarrassingly parallel up to $N_{shots}$ workers.
Domain-decomposed: split the model grid across GPUs within a node; each GPU handles its slab. Adds inter-GPU communication at slab boundaries but necessary for very large models.
Frequency-parallel: multi-scale runs can be pipelined across independent GPU pools, one frequency per pool, with results handed off when ready.
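Shot parallelism is the simplest of the three to illustrate. The sketch below uses a thread pool as a stand-in for a set of GPU nodes and a hypothetical `shot_gradient` that returns a deterministic random vector in place of a real forward plus adjoint simulation; production codes would more likely use MPI or a job scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def shot_gradient(shot_id, n_model=100):
    # Hypothetical per-shot gradient; a real code would run a forward and an
    # adjoint simulation here. Seeding by shot_id keeps the result deterministic.
    rng = np.random.default_rng(shot_id)
    return rng.standard_normal(n_model)

def total_gradient(shot_ids, max_workers=4):
    """Compute per-shot gradients independently, then sum them (the reduction)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        grads = list(pool.map(shot_gradient, shot_ids))  # results keep input order
    return np.sum(grads, axis=0)

g_parallel = total_gradient(range(32))
g_serial = np.sum([shot_gradient(i) for i in range(32)], axis=0)
print("max difference:", np.max(np.abs(g_parallel - g_serial)))
```

The only communication is the final sum over shots, which is why this layer scales to as many workers as there are shots.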

6. Hessian, or: why L-BFGS is the default

Pure steepest-descent FWI converges slowly because $J(m)$ has very different curvature along different model directions. Newton's method $m \leftarrow m - H^{-1} g$ fixes this but requires the full Hessian $H = \partial^2 J / \partial m^2$, which has $N_{model}^2$ entries and takes on the order of $N_{model}$ Hessian-vector products (each costing a few simulations) to form. That is infeasible for $N_{model} \sim 10^8$.

L-BFGS (limited-memory BFGS) approximates $H^{-1}$ from the last $k$ pairs of model and gradient differences (typically $k$ between 5 and 20), giving Newton-like convergence at a storage cost of order $k \cdot N_{model}$. Combined with an approximate diagonal pre-conditioner (Gauss-Newton on the diagonal, or the Hessian of the acquisition illumination), L-BFGS is the default FWI solver in every production package. Convergence is typically 3–5× faster than steepest descent per outer iteration, paying for itself immediately.
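The effect of the curvature spread can be seen on a toy problem. The sketch below implements the standard L-BFGS two-loop recursion and compares it with steepest descent on an ill-conditioned quadratic, assuming an exact line search (available only because the objective is quadratic; FWI codes use inexact line searches). The function names and the test matrix are illustrative, not taken from any FWI package.

```python
import numpy as np

def two_loop(g, pairs):
    """Apply the L-BFGS inverse-Hessian approximation to g.

    pairs holds (s, y) = (model difference, gradient difference), oldest first.
    With no pairs this returns g itself, i.e. steepest descent.
    """
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):                 # newest to oldest
        a = (s @ q) / (y @ s)
        q -= a * y
        alphas.append(a)
    if pairs:
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)                   # initial scaling of H^{-1}
    for (s, y), a in zip(pairs, reversed(alphas)):  # oldest to newest
        b = (y @ q) / (y @ s)
        q += (a - b) * s
    return q

def minimise_quadratic(A, b, memory, tol=1e-6, max_iter=20000):
    """Minimise 0.5 m^T A m - b^T m; memory=0 gives steepest descent."""
    m = np.zeros_like(b)
    g = A @ m - b
    g0 = np.linalg.norm(g)
    pairs = []
    for it in range(max_iter):
        if np.linalg.norm(g) <= tol * g0:
            return m, it
        d = -two_loop(g, pairs)
        step = -(g @ d) / (d @ (A @ d))          # exact line search on a quadratic
        m_new = m + step * d
        g_new = A @ m_new - b
        if memory > 0:
            pairs.append((m_new - m, g_new - g))
            pairs = pairs[-memory:]              # keep only the last `memory` pairs
        m, g = m_new, g_new
    return m, max_iter

# Curvature spanning three orders of magnitude along different directions.
A = np.diag(np.logspace(0, 3, 50))
b = np.ones(50)
m_lbfgs, iters_lbfgs = minimise_quadratic(A, b, memory=10)
m_sd, iters_sd = minimise_quadratic(A, b, memory=0)
print("L-BFGS iterations:          ", iters_lbfgs)
print("steepest-descent iterations:", iters_sd)
```

The gap on this idealised quadratic is far larger than the 3–5× quoted for FWI, where noise, nonlinearity, and inexact line searches all erode the advantage.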

7. A realistic computational budget

For a 3D deep-water sub-salt project, 10 km × 15 km × 8 km at $\Delta x = 25\ \text{m}$, 8 Hz, 5000 shots:

Naive: 2 sims × 5000 shots × 200 iters × 6 bands = 12 million simulations ⇒ ~2 years on a 100-GPU cluster.
Encoded: 2 sims × 1 super-shot × 600 iters × 6 bands = 7200 simulations ⇒ ~3 days on the same cluster.
Mini-batch, $k = 500$: 2 sims × 500 shots × 300 iters × 6 bands = 1.8 M simulations ⇒ ~100 days.

Only encoded FWI fits in a reasonable project budget. Every production FWI code supports at least source encoding; most support all three.
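The simulation counts in this budget are a product of four factors, which a short sketch makes explicit. The function name `budget` is hypothetical, and wall-clock times are deliberately left out because they depend on per-simulation cost and cluster overheads not given here.

```python
def budget(sims_per_shot, shots_per_iter, iters_per_band, bands):
    """Total simulations: (forward + adjoint) x shots x iterations x bands."""
    return sims_per_shot * shots_per_iter * iters_per_band * bands

counts = {
    "naive": budget(2, 5000, 200, 6),
    "encoded": budget(2, 1, 600, 6),
    "mini-batch": budget(2, 500, 300, 6),
}

for name, sims in counts.items():
    ratio = counts["naive"] / sims
    print(f"{name:>10}: {sims:>10,} simulations ({ratio:,.1f}x fewer than naive)")
```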

**The one sentence to remember**

Source encoding buys 100–1000× fewer simulations per iteration in exchange for 3× more iterations — a net 30–300× speedup that is the only thing that makes 3D production FWI affordable.

Where this goes next

§6.4 moves from acoustic to elastic/anisotropic physics — what happens when the acoustic wave equation is wrong and converted waves or anisotropy-induced travel-time errors dominate the residual.
