When operator learning beats per-instance training

Part 8 — Operator learning for seismology

Learning objectives

Compute the crossover N* from measured costs in the actual browser
Recognise the regimes where operator learning amortises favourably
Identify when classical solvers and per-instance PINNs still win
Anticipate out-of-distribution failures and the data-coverage requirement
Wrap up Part 8 and look ahead to Part 9 (UQ + hybrid PINN-FWI)

Part 8 has presented operator learning as a paradigm-shifting toolkit. We now confront the honest cost-benefit question: WHEN does it pay off, and when do classical methods (FDTD, FSM, per-instance PINNs) remain the right choice? The answer is a single inequality.

The crossover formula

Let:

$T_{\mathrm{op}}$ : cost of pretraining the operator network ONCE.
$T_{\mathrm{per}}$ : cost of solving ONE problem instance from scratch (e.g., training a per-instance PINN, or running FDTD).
$T_{\mathrm{inf}}$ : cost of ONE operator inference (a single forward pass through the trained network).
$N$ : number of distinct problem instances we want to solve.

Total cost for each strategy:

\mathrm{cost}_{\mathrm{per}}(N) = N \cdot T_{\mathrm{per}}, \qquad \mathrm{cost}_{\mathrm{op}}(N) = T_{\mathrm{op}} + N \cdot T_{\mathrm{inf}} .

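The operator wins when $\mathrm{cost}_{\mathrm{op}}(N) < \mathrm{cost}_{\mathrm{per}}(N)$, which (assuming $T_{\mathrm{per}} > T_{\mathrm{inf}}$) gives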

N > N^* = \frac{T_{\mathrm{op}}}{T_{\mathrm{per}} - T_{\mathrm{inf}}} \approx \frac{T_{\mathrm{op}}}{T_{\mathrm{per}}} \quad \text{(typical regime } T_{\mathrm{inf}} \ll T_{\mathrm{per}}\text{)} .

For seismic operators $T_{\mathrm{inf}}$ is typically $10^{-4}\text{–}10^{-2}$ s and $T_{\mathrm{per}}$ is $10\text{–}10^4$ s, so the speedup ratio $T_{\mathrm{per}} / T_{\mathrm{inf}}$ is enormous. The crossover $N^*$ depends primarily on how cheap the problem already is per-instance vs how expensive pretraining is. Common ranges:

1-D toy problems (this textbook): $T_{\mathrm{per}} \sim 1$ s, $T_{\mathrm{op}} \sim 5$ s. Crossover at $N^* \sim 5$.
2-D acoustic FWI: $T_{\mathrm{per}} \sim 60$ s (PINN per source), $T_{\mathrm{op}} \sim 1$ hr. Crossover at $N^* \sim 60$ sources.
3-D elastic wave propagation: $T_{\mathrm{per}} \sim 1$ hr (FDTD shot), $T_{\mathrm{op}} \sim 1$ week (F-FNO training, Lehmann et al 2024). Crossover at $N^* \sim 168$ shots.

For full survey-scale FWI projects with thousands of shots and many velocity-model iterations, $N$ exceeds $N^*$ by a wide margin in all three cases, so operator learning wins comfortably.
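The three crossover figures above can be checked with a few lines of arithmetic. The following is a minimal sketch: the function name `crossover_n` is introduced here for illustration, and the choice $T_{\mathrm{inf}} = 10^{-2}$ s (the slow end of the range quoted above) is an assumption.

```python
def crossover_n(t_op, t_per, t_inf=0.0):
    """Return N* = T_op / (T_per - T_inf); all times in the same unit."""
    if t_per <= t_inf:
        # Inference is no cheaper than a fresh solve: the operator never wins.
        return float("inf")
    return t_op / (t_per - t_inf)


HOUR = 3600.0
WEEK = 7 * 24 * HOUR

# (T_op, T_per) in seconds, taken from the three ranges quoted above.
cases = {
    "1-D toy": (5.0, 1.0),
    "2-D acoustic FWI": (1 * HOUR, 60.0),
    "3-D elastic": (WEEK, 1 * HOUR),
}

# Assumption: T_inf = 1e-2 s, the slow end of the quoted inference range.
results = {
    name: crossover_n(t_op, t_per, t_inf=1e-2)
    for name, (t_op, t_per) in cases.items()
}

for name, n_star in results.items():
    print(f"{name}: N* ~ {n_star:.1f}")
```

Because $T_{\mathrm{inf}} \ll T_{\mathrm{per}}$ in every case, including it changes $N^*$ by about one percent at most, which is why the approximation $N^* \approx T_{\mathrm{op}} / T_{\mathrm{per}}$ is safe here.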

Try it: measure your own crossover

The widget runs in three timed phases: (1) pretrain a DeepONet on the §8.5-style 7-parameter heat-equation family, (2) measure inference time over 100 forward passes for tight statistics, (3) fit a small per-instance MLP to ONE specific instance via supervised regression. After all three measurements, the cost-vs-N chart shows where operator learning beats per-instance for THIS browser, on THIS machine, with THESE problem sizes. The N slider lets you place yourself at any working point and read off the speedup factor.
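The same three-phase measurement can be reproduced outside the widget. The sketch below is not the widget's code: the three `fake_*` workloads are hypothetical stand-ins to be replaced with real pretraining, inference, and per-instance fitting routines, and the helper names are introduced here for illustration. It takes the median over 100 inference calls after a short warm-up, since single-call timings at the sub-millisecond scale are noisy.

```python
import statistics
import time


def time_once(fn):
    """Wall-clock seconds for a single call of fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def time_median(fn, repeats=100, warmup=5):
    """Median seconds per call over `repeats` timed runs, after warm-up."""
    for _ in range(warmup):  # discard warm-up runs (caches, allocation)
        fn()
    return statistics.median(time_once(fn) for _ in range(repeats))


# Hypothetical stand-in workloads; replace with the real routines.
def fake_pretrain():
    return sum(i * i for i in range(1_000_000))


def fake_per_instance():
    return sum(i * i for i in range(200_000))


def fake_inference():
    return sum(i * i for i in range(1_000))


t_op = time_once(fake_pretrain)        # phase 1: paid once
t_inf = time_median(fake_inference)    # phase 2: 100 timed forward passes
t_per = time_once(fake_per_instance)   # phase 3: one instance from scratch

# Crossover from the measured costs; infinite if inference is not cheaper.
n_star = t_op / (t_per - t_inf) if t_per > t_inf else float("inf")
print(f"T_op={t_op:.4f}s  T_inf={t_inf:.6f}s  T_per={t_per:.4f}s  N*={n_star:.1f}")
```

The numbers depend on the machine and on the stand-in workloads, so only the structure of the measurement carries over to a real setup.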

Beyond N*: the qualitative arguments

Crossover is the quantitative argument. There are four qualitative arguments that reinforce it:

Bayesian-friendly forward model. Sampling from a posterior $p(c \mid \mathrm{data})$ via MCMC needs $O(10^4\text{–}10^6)$ forward evaluations. With FDTD this is impractical at scale; with an operator surrogate, a few hours of MCMC suffices. This unlocks UNCERTAINTY QUANTIFICATION on FWI results — the central topic of Part 9.
Differentiable end-to-end. Operator networks are differentiable through their inputs (initial conditions, velocity model). For inverse problems formulated as gradient descent on the velocity model, the operator provides $\partial T / \partial c$ via auto-diff. FDTD requires manual adjoint-state implementation per equation type.
Real-time interactivity. §8.5's parameter sliders are the example: with FDTD, a designer waits seconds to minutes per parameter change; with an operator surrogate, design happens at 60 fps.
GPU-efficient inference. Operator networks pack many forward passes into a single GPU kernel call. A typical TensorRT or ONNX deployment of an FNO does 1000 forward passes in 100 ms — far faster than 1000 separate FDTD invocations would manage.
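The MCMC argument in the first item is easy to quantify. The sketch below multiplies the number of forward evaluations by a cost per evaluation; it assumes $T_{\mathrm{per}} = 1$ hr (the 3-D FDTD shot figure quoted earlier) and $T_{\mathrm{inf}} = 10^{-2}$ s (the slow end of the quoted inference range), and it ignores every MCMC overhead other than the forward solves.

```python
def total_hours(n_evals, seconds_per_eval):
    """Total wall-clock hours for n_evals sequential forward evaluations."""
    return n_evals * seconds_per_eval / 3600.0


T_PER = 3600.0  # assumed: one FDTD shot, in seconds
T_INF = 1e-2    # assumed: one operator forward pass, in seconds

for n_evals in (10**4, 10**5, 10**6):
    fdtd_h = total_hours(n_evals, T_PER)
    op_h = total_hours(n_evals, T_INF)
    print(f"N={n_evals:>7}: FDTD {fdtd_h:,.0f} h, operator {op_h:.2f} h")
```

Under these assumptions even the smallest chain costs $10^4$ hours of FDTD, more than a year of sequential compute, while the largest chain stays under three hours with the surrogate.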

When operator learning loses

Three scenarios where classical solvers and per-instance PINNs still win:

One-off problems with N = 1. If you have a single legacy survey to analyse and never need to re-do it, pretraining an operator network is wasted effort. Just run FDTD or train one PINN.
Out-of-distribution problems. Operator networks trained on Marmousi-class velocity models will not generalise to volcanic basement structures or strong-anisotropy salt domes. The training distribution is the operating envelope; outside it, predictions silently fail. For exotic case studies, classical solvers handle out-of-distribution inputs trivially.
Verification against classical-solver baselines. Even with a deployed operator surrogate, important production runs typically include at least one FDTD run to verify the surrogate and calibrate trust in it. The operator surrogate is the workhorse; FDTD is the safety net.

Hybrid architectures: best of both

Modern production seismic-imaging pipelines often use hybrid architectures that combine operator pretraining with PINN fine-tuning:

Operator warm-start for PINN. Pretrain an operator surrogate on a velocity-model family. For a NEW velocity model, evaluate the surrogate to get an initial guess, then refine with a per-instance PINN initialised from that guess. The PINN converges in 10× fewer epochs because it starts from a good answer.
Operator + classical FWI gradient correction. Use the operator for the wave-equation forward solve in FWI; combine its gradient with a small classical FDTD correction to reduce out-of-distribution errors. Saves $\sim 90\%$ of compute.
Operator + Bayesian UQ. Cheap operator forwards enable HMC/Stein VI over the posterior, with a final classical-FDTD posterior-mean simulation for trust verification.
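The warm-start idea in the first item can be illustrated with a deliberately tiny toy. This is not a PINN: it is plain gradient descent on a one-dimensional quadratic, with the function name, learning rate, and tolerance all chosen here for illustration. It shows only the mechanism, namely that an iterative refinement started close to the answer needs fewer steps; it does not reproduce the 10× figure quoted above.

```python
def steps_to_converge(x0, target, lr=0.1, tol=1e-3, max_steps=10_000):
    """Gradient-descent steps on f(x) = 0.5 * (x - target)**2
    until |x - target| < tol (or max_steps is reached)."""
    x, steps = x0, 0
    while abs(x - target) >= tol and steps < max_steps:
        x -= lr * (x - target)  # the gradient of f at x is (x - target)
        steps += 1
    return steps


target = 2.0

# Cold start: no prior information about the solution.
cold = steps_to_converge(x0=0.0, target=target)

# Warm start: a hypothetical surrogate guess that is already 1% off.
warm = steps_to_converge(x0=1.98, target=target)

print(f"cold start: {cold} steps, warm start: {warm} steps")
```

For a linearly convergent method like this one, the saving grows with the logarithm of the ratio of initial errors, so the quality of the surrogate's guess directly sets how much refinement is left to do.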

These are the architectures Part 9 will build on. Operator surrogates are not a replacement for PINNs and FDTD; they are a NEW LAYER in the toolkit, slotted in where amortisation pays off, and bypassed where it does not.

Out-of-distribution detection: a critical gap

The single biggest engineering risk of operator-based seismic workflows is SILENT FAILURE on OOD inputs. A network trained on smooth gradient velocity models will produce confident-looking predictions for a velocity model containing a salt dome — and those predictions can be ARBITRARILY WRONG without any in-band warning. This is fundamentally different from FDTD failure modes, which are usually loud (numerical instability, NaN propagation, etc.).

Production OOD detection techniques:

Likelihood under the training distribution. Compute $\log p(c \mid \mathrm{train})$ for each new input. If below a threshold, flag and revert to FDTD.
Model ensembles. Train several operator networks with different seeds; high disagreement on a new input indicates OOD.
PDE-residual check. Apply the network to predict $T(x)$, then check whether $T_{\mathrm{NN}}$ satisfies the eikonal equation $|\nabla T| = 1/c$ to within a tolerance. If not, fall back to FSM/FDTD.
Posterior uncertainty (Bayesian Operators). Train operator networks with weight-space uncertainty (BNN, MC-Dropout). High predictive variance signals OOD inputs. Implementation cost: 1.5-2× normal training but provides a direct uncertainty signal.
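Two of the checks above, ensemble disagreement and the PDE residual, are simple enough to sketch in a few lines. The example below is a 1-D toy under stated assumptions: the residual is the normalised eikonal mismatch $\big|\,|dT/dx|\,c - 1\,\big|$ evaluated by central differences, the function names and both tolerances are hypothetical, and the "wrong" prediction is a hand-made stand-in rather than real network output.

```python
import statistics


def eikonal_residual(T, c, h):
    """Max of | |dT/dx| * c - 1 | over interior points of a 1-D grid."""
    worst = 0.0
    for i in range(1, len(T) - 1):
        slope = (T[i + 1] - T[i - 1]) / (2.0 * h)  # central difference
        worst = max(worst, abs(abs(slope) * c[i] - 1.0))
    return worst


def ensemble_spread(predictions):
    """Largest pointwise standard deviation across ensemble members."""
    return max(statistics.pstdev(vals) for vals in zip(*predictions))


def needs_fallback(T, c, h, members, res_tol=0.05, spread_tol=0.05):
    """True if either check suggests the input is outside the envelope.
    Both tolerances are hypothetical and would need calibration."""
    return (eikonal_residual(T, c, h) > res_tol
            or ensemble_spread(members) > spread_tol)


# Toy setup: constant velocity c0, exact travel time T = x / c0.
h, c0, n = 10.0, 2000.0, 50
c = [c0] * n
T_good = [i * h / c0 for i in range(n)]
T_bad = [1.5 * t for t in T_good]  # stand-in for a wrong prediction

members_agree = [T_good, T_good, T_good]
members_split = [T_good, T_bad, T_good]

print(needs_fallback(T_good, c, h, members_agree))
print(needs_fallback(T_bad, c, h, members_split))
```

The residual check needs no extra training and no reference solution, which makes it a cheap first gate; the ensemble check costs several trained networks but also catches errors that happen to satisfy the PDE locally.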

Part 9 will revisit OOD detection in the broader UQ context.

Part 8 wrap-up

Part 8 covered the operator-learning paradigm end-to-end:

§8.1 Per-instance vs operator framing — the conceptual pivot, with a worked DeepONet on the antiderivative operator demonstrating amortised inference.
§8.2 DeepONet architecture deep-dive — the branch + trunk decomposition as a learned-basis representation of the operator, visualised on the Poisson BVP.
§8.3 Fourier Neural Operators (FNO) — spectral-convolution layers, resolution-invariance, and a single-layer FNO that recovered the heat-equation operator to machine precision.
§8.4 Learned wave-equation propagators — time-stepping with FNO, the eigenvalue-stability constraint that determines whether rollouts stay bounded, and structural-stability tricks (β=-1 freeze, α clamp) that mirror production-code techniques.
§8.5 Real-time parametric explorers — the amortisation made tactile, with 7-D heat-operator family explored at 60 fps after a 5-second pretraining.
§8.6 (this section) The crossover analysis: when operator learning amortises, when classical methods still win, and the hybrid architectures that combine the strengths of both.

The reader who completes Part 8 has the toolkit to BUILD an operator surrogate for any new PDE family, EVALUATE the cost-benefit decision honestly, and DEPLOY hybrid PINN-operator architectures in production seismic-imaging workflows. Part 9 takes operator learning to the next step: hybridise it with classical FWI, add Bayesian uncertainty quantification on top, and confront the unique challenges of seismic-inverse problems at production scale.

References

Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G.E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3(3), 218–229.
Li, Z., Kovachki, N., Azizzadenesheli, K., et al. (2021). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR 2021. arXiv:2010.08895.
Lu, L., Meng, X., Cai, S., et al. (2022). A comprehensive and fair comparison of two neural operators. CMAME 393, 114778. Empirical comparison of DeepONet and FNO across multiple problem families.
Lehmann, F., Gatti, F., Bertin, M., Clouteau, D. (2024). F-FNO 3D elastic-wave propagation. CMAME 420, 116718. Production-scale operator-based seismic surrogate.
Hendrycks, D., Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR 2017. Foundational OOD-detection paper.