When operator learning beats per-instance training

Part 8 — Operator learning for seismology

Learning objectives

  • Compute the crossover N* from measured costs in the actual browser
  • Recognise the regimes where operator learning amortises favourably
  • Identify when classical solvers and per-instance PINNs still win
  • Anticipate out-of-distribution failures and the data-coverage requirement
  • Wrap up Part 8 and look ahead to Part 9 (UQ + hybrid PINN-FWI)

Part 8 has presented operator learning as a paradigm-shifting toolkit. We now confront the honest cost-benefit question: WHEN does it pay off, and when do classical methods (FDTD, FSM, per-instance PINNs) remain the right choice? The answer is a single inequality.

The crossover formula

Let:

  • $T_{\mathrm{op}}$: cost of pretraining the operator network ONCE.
  • $T_{\mathrm{per}}$: cost of solving ONE problem instance from scratch (e.g., training a per-instance PINN, or running FDTD).
  • $T_{\mathrm{inf}}$: cost of ONE operator inference (a single forward pass through the trained network).
  • $N$: number of distinct problem instances we want to solve.

Total cost for each strategy:

$$\mathrm{cost}_{\mathrm{per}}(N) = N \cdot T_{\mathrm{per}}, \qquad \mathrm{cost}_{\mathrm{op}}(N) = T_{\mathrm{op}} + N \cdot T_{\mathrm{inf}} .$$

Operator wins when $\mathrm{cost}_{\mathrm{op}}(N) < \mathrm{cost}_{\mathrm{per}}(N)$, which gives

$$N > N^* = \frac{T_{\mathrm{op}}}{T_{\mathrm{per}} - T_{\mathrm{inf}}} \approx \frac{T_{\mathrm{op}}}{T_{\mathrm{per}}} \quad \text{(typical regime } T_{\mathrm{inf}} \ll T_{\mathrm{per}}\text{)} .$$
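
The step from the cost inequality to $N^*$ is short, but it hides one assumption, so it may help to spell it out. A sketch of the algebra, using only the symbols defined above:

$$
\begin{aligned}
T_{\mathrm{op}} + N\,T_{\mathrm{inf}} &< N\,T_{\mathrm{per}} \\
T_{\mathrm{op}} &< N\,(T_{\mathrm{per}} - T_{\mathrm{inf}}) \\
N &> \frac{T_{\mathrm{op}}}{T_{\mathrm{per}} - T_{\mathrm{inf}}} \qquad \text{(requires } T_{\mathrm{per}} > T_{\mathrm{inf}}\text{)} .
\end{aligned}
$$

The final division preserves the inequality only when $T_{\mathrm{per}} > T_{\mathrm{inf}}$. If a single inference costs at least as much as a fresh solve, no finite crossover exists and the operator never amortises.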

For seismic operators $T_{\mathrm{inf}}$ is typically $10^{-4}$–$10^{-2}$ s and $T_{\mathrm{per}}$ is $10$–$10^4$ s, so the per-instance speedup ratio $T_{\mathrm{per}} / T_{\mathrm{inf}}$ ranges from about $10^3$ to $10^8$. The crossover $N^*$ depends primarily on how cheap the problem already is per-instance vs how expensive pretraining is. Common ranges:

  • 1-D toy problems (this textbook): $T_{\mathrm{per}} \sim 1$ s, $T_{\mathrm{op}} \sim 5$ s. Crossover at $N^* \sim 5$.
  • 2-D acoustic FWI: $T_{\mathrm{per}} \sim 60$ s (PINN per source), $T_{\mathrm{op}} \sim 1$ hr. Crossover at $N^* \sim 60$ sources.
  • 3-D elastic wave propagation: $T_{\mathrm{per}} \sim 1$ hr (FDTD shot), $T_{\mathrm{op}} \sim 1$ week (F-FNO training, Lehmann et al. 2024). Crossover at $N^* \sim 168$ shots.

Full survey-scale FWI projects, with thousands of shots and many velocity-model iterations, push $N$ far beyond any of these crossover values and sit deep in the operator-wins regime.
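
The three crossover values above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not code from this textbook: the function name `crossover_n` and the `regimes` table are illustrative, and the inference cost is set to zero to match the approximation $N^* \approx T_{\mathrm{op}} / T_{\mathrm{per}}$.

```python
import math


def crossover_n(t_op: float, t_per: float, t_inf: float = 0.0) -> float:
    """Return N* = T_op / (T_per - T_inf), with all times in seconds."""
    if t_per <= t_inf:
        # An inference costs at least as much as a fresh solve,
        # so the operator strategy never catches up.
        return math.inf
    return t_op / (t_per - t_inf)


HOUR = 3600.0
WEEK = 7 * 24 * HOUR

# (T_op, T_per) pairs taken from the rough figures quoted in the list above.
regimes = {
    "1-D toy problem": (5.0, 1.0),
    "2-D acoustic FWI": (1 * HOUR, 60.0),
    "3-D elastic propagation": (1 * WEEK, 1 * HOUR),
}

for name, (t_op, t_per) in regimes.items():
    print(f"{name}: N* ~ {crossover_n(t_op, t_per):.0f}")
```

Passing a non-zero `t_inf` gives the exact crossover; for the ranges quoted here it shifts $N^*$ by well under one percent.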

Try it: measure your own crossover

[Interactive figure: Operator Crossover. Enable JavaScript to interact.]

The widget runs in three timed phases: (1) pretrain a DeepONet on the §8.5-style 7-parameter heat-equation family, (2) measure inference time over 100 forward passes for tight statistics, (3) fit a small per-instance MLP to ONE specific instance via supervised regression. After all three measurements, the cost-vs-N chart shows where operator learning beats per-instance for THIS browser, on THIS machine, with THESE problem sizes. The N slider lets you place yourself at any working point and read off the speedup factor.
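
Outside the browser, the same three-phase measurement can be approximated with a wall-clock timer. The sketch below is an assumption-laden stand-in, not the widget's code: `pretrain_operator`, `infer_once`, and `fit_one_instance` are hypothetical placeholder workloads that you would replace with real training and inference calls, and `speedup` is simply the ratio $\mathrm{cost}_{\mathrm{per}}(N) / \mathrm{cost}_{\mathrm{op}}(N)$.

```python
import statistics
import time
from typing import Callable, Tuple


def time_call(fn: Callable[[], object], repeats: int = 1) -> Tuple[float, float]:
    """Return (mean, standard deviation) of wall-clock seconds over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    spread = statistics.stdev(samples) if repeats > 1 else 0.0
    return statistics.fmean(samples), spread


def speedup(n: int, t_op: float, t_per: float, t_inf: float) -> float:
    """cost_per(N) / cost_op(N); values above 1 mean the operator strategy is cheaper."""
    return (n * t_per) / (t_op + n * t_inf)


# Hypothetical placeholder workloads; swap in real training and inference code.
def pretrain_operator() -> int:
    return sum(i * i for i in range(200_000))


def infer_once() -> int:
    return sum(i * i for i in range(200))


def fit_one_instance() -> int:
    return sum(i * i for i in range(20_000))


t_op, _ = time_call(pretrain_operator)                # phase 1: timed once
t_inf, t_inf_sd = time_call(infer_once, repeats=100)  # phase 2: 100 passes
t_per, _ = time_call(fit_one_instance)                # phase 3: timed once

print(f"T_op={t_op:.2e} s, T_inf={t_inf:.2e} s (sd {t_inf_sd:.1e}), T_per={t_per:.2e} s")
print(f"speedup at N=1000: {speedup(1000, t_op, t_per, t_inf):.1f}x")
```

Only the inference phase is repeated, because it is the shortest measurement and therefore the one most affected by timer noise.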

Beyond N*: the qualitative arguments

Crossover is the quantitative argument. There are four qualitative arguments that reinforce it:

  • Bayesian-friendly forward model. Sampling from a posterior $p(c \mid \mathrm{data})$ via MCMC needs $O(10^4\text{–}10^6)$ forward evaluations. With FDTD this is impractical at scale; with an operator surrogate, a few hours of MCMC suffices. This unlocks UNCERTAINTY QUANTIFICATION on FWI results, the central topic of Part 9.
  • Differentiable end-to-end. Operator networks are differentiable through their inputs (initial conditions, velocity model). For inverse problems formulated as gradient descent on the velocity model, the operator provides $\partial T / \partial c$ via auto-diff. FDTD requires manual adjoint-state implementation per equation type.
  • Real-time interactivity. §8.5's parameter sliders are the example: with FDTD, a designer waits seconds to minutes per parameter change, whereas with an operator surrogate the display updates at 60 fps.
  • GPU-efficient inference. Operator networks batch many forward passes into a single batched GPU call. A typical TensorRT or ONNX deployment of an FNO does 1000 forward passes in 100 ms, far faster than 1000 separate FDTD invocations would manage.
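
To make the first argument concrete, here is a minimal random-walk Metropolis sketch in which every posterior evaluation calls a cheap forward model. Everything in it is an illustrative assumption: `surrogate_forward` is a closed-form stand-in for a trained operator (travel times along three hypothetical ray paths through a uniform velocity), and the prior bounds, step size, and noise level are arbitrary choices.

```python
import numpy as np

PATH_LENGTHS_KM = np.array([1.0, 2.0, 3.0])  # hypothetical ray-path lengths


def surrogate_forward(c: float) -> np.ndarray:
    """Stand-in for a trained operator: travel times (s) at uniform velocity c (km/s)."""
    return PATH_LENGTHS_KM / c


def log_posterior(c: float, data: np.ndarray, sigma: float,
                  c_min: float = 1.0, c_max: float = 5.0) -> float:
    """Gaussian likelihood with a uniform prior on (c_min, c_max)."""
    if not (c_min < c < c_max):
        return -np.inf
    misfit = surrogate_forward(c) - data
    return -0.5 * float(np.sum(misfit ** 2)) / sigma ** 2


def metropolis(data: np.ndarray, sigma: float, n_steps: int = 5000,
               step: float = 0.02, c0: float = 3.0, seed: int = 0) -> np.ndarray:
    """Random-walk Metropolis; one surrogate evaluation per step."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    c, logp = c0, log_posterior(c0, data, sigma)
    for i in range(n_steps):
        proposal = c + step * rng.standard_normal()
        logp_new = log_posterior(proposal, data, sigma)
        # Accept with probability min(1, exp(logp_new - logp)).
        if np.log(rng.uniform()) < logp_new - logp:
            c, logp = proposal, logp_new
        chain[i] = c
    return chain


# Synthetic observations generated from a known velocity plus noise.
noise_rng = np.random.default_rng(42)
c_true, sigma = 2.0, 0.01
data = surrogate_forward(c_true) + sigma * noise_rng.standard_normal(3)

chain = metropolis(data, sigma)
posterior = chain[1000:]  # discard burn-in
print(f"posterior mean {posterior.mean():.3f} km/s, std {posterior.std():.3f}")
```

The chain makes 5000 forward evaluations. At surrogate speed that is negligible, whereas at a per-instance cost of $T_{\mathrm{per}}$ seconds each the same loop would take $5000 \cdot T_{\mathrm{per}}$ seconds.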

When operator learning loses

Three scenarios where classical solvers and per-instance PINNs still win:

  • One-off problems with N = 1. If you have a single legacy survey to analyse and never need to re-do it, pretraining an operator network is wasted effort. Just run FDTD or train one PINN.
  • Out-of-distribution problems. Operator networks trained on Marmousi-class velocity models will not generalise to volcanic basement structures or strong-anisotropy salt domes. The training distribution is the operating envelope; outside it, predictions silently fail. For exotic case studies, classical solvers handle out-of-distribution inputs trivially.
  • Verification against classical-solver baselines. Even with a deployed operator surrogate, important production runs typically include at least one FDTD verification for trust calibration. The operator surrogate is the workhorse; FDTD is the safety net.

Hybrid architectures: best of both

Modern production seismic-imaging pipelines often use hybrid architectures that combine operator pretraining with PINN fine-tuning:

  • Operator warm-start for PINN. Pretrain an operator surrogate on a velocity-model family. For a NEW velocity model, evaluate the surrogate to get an initial guess, then refine with a per-instance PINN initialised from that guess. The PINN can converge in roughly 10× fewer epochs because it starts from a good answer.
  • Operator + classical FWI gradient correction. Use the operator for the wave-equation forward solve in FWI; combine its gradient with a small classical FDTD correction to reduce out-of-distribution errors. Saves $\sim 90\%$ of compute.
  • Operator + Bayesian UQ. Cheap operator forwards enable HMC/Stein VI over the posterior, with a final classical-FDTD posterior-mean simulation for trust verification.

These are the architectures Part 9 will build on. Operator surrogates are not a replacement for PINNs and FDTD; they are a NEW LAYER in the toolkit, slotted in where amortisation pays off, and bypassed where it does not.

Out-of-distribution detection: a critical gap

The single biggest engineering risk of operator-based seismic workflows is SILENT FAILURE on OOD inputs. A network trained on smooth gradient velocity models will produce confident-looking predictions for a velocity model containing a salt dome — and those predictions can be ARBITRARILY WRONG without any in-band warning. This is fundamentally different from FDTD failure modes, which are usually loud (numerical instability, NaN propagation, etc.).

Production OOD detection techniques:

  • Likelihood under the training distribution. Compute $\log p(c \mid \mathrm{train})$ for each new input. If below a threshold, flag and revert to FDTD.
  • Model ensembles. Train several operator networks with different seeds; high disagreement on a new input indicates OOD.
  • PDE-residual check. Apply the network to predict $T(x)$, then check whether $T_{\mathrm{NN}}$ satisfies the eikonal equation, i.e. whether the residual $\big|\,|\nabla T_{\mathrm{NN}}| - 1/c\,\big|$ stays small. If not, fall back to FSM/FDTD.
  • Posterior uncertainty (Bayesian Operators). Train operator networks with weight-space uncertainty (BNN, MC-Dropout). High predictive variance signals OOD inputs. This costs 1.5–2× normal training but provides a direct uncertainty signal.
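
Two of the checks above, ensemble disagreement and the PDE residual, are simple enough to sketch on a 1-D grid. The code below is a hedged illustration rather than a production detector: the function names, the 1-D eikonal form $|dT/dx| = 1/c$, and both thresholds are assumptions chosen for the example, and the "ensemble" is a hand-built array rather than the output of real networks.

```python
import numpy as np


def ensemble_disagreement(predictions: np.ndarray) -> float:
    """Largest across-member standard deviation, relative to the largest mean magnitude.

    `predictions` has shape (n_members, n_points).
    """
    spread = predictions.std(axis=0).max()
    scale = np.abs(predictions.mean(axis=0)).max()
    return float(spread / (scale + 1e-12))


def eikonal_residual(T: np.ndarray, c: np.ndarray, dx: float) -> float:
    """Maximum of | |dT/dx| - 1/c | on a uniform 1-D grid."""
    grad = np.gradient(T, dx)
    return float(np.max(np.abs(np.abs(grad) - 1.0 / c)))


def needs_fallback(predictions: np.ndarray, c: np.ndarray, dx: float,
                   max_disagreement: float = 0.05,
                   max_residual: float = 0.05) -> bool:
    """True if either check fails; thresholds here are arbitrary placeholders."""
    mean_T = predictions.mean(axis=0)
    return (ensemble_disagreement(predictions) > max_disagreement
            or eikonal_residual(mean_T, c, dx) > max_residual)


# Uniform medium with c = 2, for which the exact travel time is T(x) = x / 2.
x = np.linspace(0.0, 10.0, 101)
dx = float(x[1] - x[0])
c = np.full_like(x, 2.0)
exact = x / 2.0

# Hand-built ensembles: members that nearly agree vs members that diverge.
agreeing = np.stack([exact * (1.0 + eps) for eps in (-0.001, 0.0, 0.001)])
diverging = np.stack([exact * scale for scale in (0.6, 1.0, 1.5)])

print("agreeing ensemble needs fallback:", needs_fallback(agreeing, c, dx))
print("diverging ensemble needs fallback:", needs_fallback(diverging, c, dx))
```

In a real pipeline a `True` result would route the input to FSM or FDTD instead of returning the surrogate's prediction, and the thresholds would be calibrated on held-out in-distribution data.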

Part 9 will revisit OOD detection in the broader UQ context.

Part 8 wrap-up

Part 8 covered the operator-learning paradigm end-to-end:

  • §8.1 Per-instance vs operator framing — the conceptual pivot, with a worked DeepONet on the antiderivative operator demonstrating amortised inference.
  • §8.2 DeepONet architecture deep-dive — the branch + trunk decomposition as a learned-basis representation of the operator, visualised on the Poisson BVP.
  • §8.3 Fourier Neural Operators (FNO) — spectral-convolution layers, resolution-invariance, and a single-layer FNO that recovered the heat-equation operator to machine precision.
  • §8.4 Learned wave-equation propagators — time-stepping with FNO, the eigenvalue-stability constraint that determines whether rollouts stay bounded, and structural-stability tricks (β=-1 freeze, α clamp) that mirror production-code techniques.
  • §8.5 Real-time parametric explorers — the amortisation made tactile, with 7-D heat-operator family explored at 60 fps after a 5-second pretraining.
  • §8.6 (this section) The crossover analysis: when operator learning amortises, when classical methods still win, and the hybrid architectures that combine the strengths of both.

The reader who completes Part 8 has the toolkit to BUILD an operator surrogate for any new PDE family, EVALUATE the cost-benefit decision honestly, and DEPLOY hybrid PINN-operator architectures in production seismic-imaging workflows. Part 9 takes operator learning to the next step: hybridise it with classical FWI, add Bayesian uncertainty quantification on top, and confront the unique challenges of seismic-inverse problems at production scale.

References

  • Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G.E. (2021). DeepONet. Nat. Mach. Intell. 3(3), 218–229.
  • Li, Z., Kovachki, N., Azizzadenesheli, K., et al. (2020). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR 2021. arXiv:2010.08895.
  • Lu, L., Meng, X., Cai, S., et al. (2022). A comprehensive and fair comparison of two neural operators. CMAME 393, 114778. Empirical comparison of DeepONet and FNO across multiple problem families.
  • Lehmann, F., Gatti, F., Bertin, M., Clouteau, D. (2024). F-FNO 3D elastic-wave propagation. CMAME 420, 116718. Production-scale operator-based seismic surrogate.
  • Hendrycks, D., Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR 2017. Foundational OOD-detection paper.
