From per-instance solvers to operator learners

Part 8 — Operator learning for seismology

Learning objectives

Recognise the conceptual pivot from per-instance PINNs to operator networks
State the universal approximation theorem for operators (Chen-Chen 1995)
Sketch the two main families: DeepONet (branch + trunk) and FNO (spectral convolution)
Pretrain a small DeepONet end-to-end and observe the amortisation in action
Identify the regimes where operator learning is profitable vs wasteful

Every PINN we built in Parts 1-7 followed the same recipe: parameterise the SOLUTION of one specific problem instance by a neural network, train against the residual of one specific PDE plus one specific boundary condition. The §7.2 EikoNet solves the eikonal equation for ONE velocity model and ONE source position; if you change either, you re-train. For a survey with thousands of source positions and time-lapse studies that update the velocity model monthly, this is wasteful — every problem starts from scratch.

The pivot: function approximation vs operator approximation

Classical machine learning is FUNCTION APPROXIMATION: fit a network $f_\theta : \mathbb{R}^d \to \mathbb{R}$ to a target $f$. PINNs add a physics constraint but stay in this regime: the input is a coordinate $x$, the output is the solution at that coordinate, and the network is fit to ONE function $u(x)$.

Operator learning is a level higher. The target is an OPERATOR — a map between function spaces:

$$\mathcal{G} : \mathcal{V} \to \mathcal{U}, \quad v \mapsto \mathcal{G}[v],$$

where $\mathcal{V}$ is a space of "input" functions (e.g., velocity models) and $\mathcal{U}$ is a space of "output" functions (e.g., travel-time fields). For seismology the canonical operator is the velocity-to-wavefield map: given any velocity model $v(x)$, return the wavefield $u(x, t; v)$ that solves the wave equation in that velocity. Once such an operator is learned, evaluating $\mathcal{G}[v]$ for a NEW $v$ is a single forward pass: no PDE solver, no PINN training, no FDTD.
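To make the function-versus-operator distinction concrete, here is a minimal sketch built on a stand-in operator that is not taken from this part: the 1D travel-time map $T(y) = \int_0^y ds / v(s)$, which solves the 1D eikonal equation with the source at $y = 0$. The names `Function` and `travel_time_operator` are hypothetical, and the operator is evaluated by plain quadrature rather than by a trained network. The point is only the type signature: a function goes in, a function comes out.

```python
import math
from typing import Callable

Function = Callable[[float], float]   # a point in function space: x -> value


def travel_time_operator(v: Function, n_steps: int = 1000) -> Function:
    """Hypothetical 1D operator G: velocity model v -> travel-time field T.

    T(y) = integral_0^y ds / v(s), evaluated with the midpoint rule.
    """
    def T(y: float) -> float:
        h = y / n_steps
        return sum(h / v((i + 0.5) * h) for i in range(n_steps))
    return T


# Function approximation would fit ONE of these travel-time fields.
# The operator maps EVERY admissible velocity model to its own field.
T_const = travel_time_operator(lambda s: 2.0)        # v(s) = 2
T_grad = travel_time_operator(lambda s: 1.0 + s)     # v(s) = 1 + s

print(T_const(1.0))                    # exact value is 1 / 2
print(T_grad(1.0), math.log(2.0))      # exact value is ln 2
```

An operator network replaces the quadrature inside `travel_time_operator` with a learned forward pass, but keeps the same contract.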

Universal approximation theorem for operators (Chen-Chen 1995)

The theoretical foundation is older than people think. Chen and Chen (1995) proved that for any continuous nonlinear operator $\mathcal{G}$ on a compact set of input functions, there exist branch and trunk networks such that

$$\mathcal{G}[v](y) \approx \sum_{k=1}^{K} b_k(v(x_1), v(x_2), \ldots, v(x_m)) \cdot t_k(y),$$

where the $b_k$ are functions of $m$ samples of the input function $v$ at fixed sensor locations $\{x_1, \ldots, x_m\}$, and the $t_k$ are functions of the query coordinate $y$. The approximation is uniform: for any tolerance $\varepsilon > 0$ there exist $K$, $m$ and network weights such that the error stays below $\varepsilon$ for every input function in the compact set and every query point simultaneously. This is the operator analogue of the Cybenko-Hornik 1989 universal approximation theorem.
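Written out in the notation above, the uniform guarantee reads roughly as follows. This is a paraphrase, not the paper's exact statement; the symbols $V$ (the compact set of input functions) and $D$ (a compact domain of query points) are labels introduced here for illustration.

$$\forall\, \varepsilon > 0 \;\; \exists\, K,\, m,\, \{x_j\}_{j=1}^{m},\, \{b_k, t_k\}_{k=1}^{K} : \quad \sup_{v \in V} \, \sup_{y \in D} \left| \mathcal{G}[v](y) - \sum_{k=1}^{K} b_k\big(v(x_1), \ldots, v(x_m)\big)\, t_k(y) \right| < \varepsilon .$$

The statement is one of existence. It does not say how large $K$ and $m$ must be for a given $\varepsilon$, nor that gradient-based training will find the required weights.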

Chen-Chen 1995 was largely overlooked for 25 years. Lu et al. (2021) revived it, named the architecture DeepONet, and showed empirically that it works for a wide range of nonlinear operators, including many PDE solution operators relevant to physics and engineering.

Two families of operator network

Two architectures dominate the literature:

DeepONet (Lu et al 2021). Branch network $B(v(x_1), \ldots, v(x_m); \theta_B) \in \mathbb{R}^K$ encodes the input function. Trunk network $T(y; \theta_T) \in \mathbb{R}^K$ encodes the query coordinate. Output is the inner product $\mathcal{G}_{\mathrm{NN}}[v](y) = \sum_{k=1}^{K} B_k(v)\, T_k(y)$. Generalisable, simple to implement, well-suited when the input function space is bounded and not too high-dimensional. This is what §8.2 builds in detail and the widget below pretrains.
Fourier Neural Operator (FNO) (Li et al 2020). The network operates in the FOURIER DOMAIN: each layer applies a learnable spectral convolution $\mathcal{F}^{-1}(R \cdot \mathcal{F}(v))$, where $R$ is a learnable Fourier multiplier. Layers stack with residual connections and a pointwise nonlinearity. Built-in resolution invariance: train on a 64×64 grid, evaluate on 256×256 with no retraining. This is §8.3.
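The spectral convolution at the core of an FNO layer can be sketched in a few lines of NumPy. This is a single-channel, 1D illustration under stated assumptions, not the FNO reference implementation: the multiplier weights are random stand-ins for trained parameters, the grid is assumed uniform and periodic, and the function names `spectral_conv_1d` and `sample_signal` are hypothetical. A full FNO layer also carries multiple channels, a parallel pointwise linear path and a nonlinearity, all omitted here.

```python
import numpy as np


def spectral_conv_1d(v: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply F^{-1}(R . F(v)) on a uniform periodic 1D grid.

    v       : real samples, shape (n,)
    weights : complex multipliers for the lowest len(weights) Fourier modes;
              all higher modes are truncated to zero.
    """
    n = v.shape[0]
    n_modes = weights.shape[0]
    v_hat = np.fft.rfft(v)                       # shape (n // 2 + 1,)
    out_hat = np.zeros_like(v_hat)
    out_hat[:n_modes] = weights * v_hat[:n_modes]
    return np.fft.irfft(out_hat, n=n)            # back to physical space


def sample_signal(n: int) -> np.ndarray:
    """Band-limited periodic input sampled on n points of [0, 1)."""
    x = np.arange(n) / n
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(4 * np.pi * x)


rng = np.random.default_rng(0)
n_modes = 8
# Random stand-ins for trained multipliers (illustration only).
R = rng.normal(size=n_modes) + 1j * rng.normal(size=n_modes)

coarse = spectral_conv_1d(sample_signal(64), R)    # "training" resolution
fine = spectral_conv_1d(sample_signal(256), R)     # same weights, finer grid

# Largest gap between the subsampled fine-grid output and the coarse output.
print(np.max(np.abs(fine[::4] - coarse)))
```

Because the weights live on Fourier modes rather than on grid points, the same `R` applies at any resolution. The two outputs agree to rounding error here because the input is band-limited below the truncation; for inputs with energy above the retained modes the agreement is only approximate.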

Other architectures exist (Graph Neural Operators, Random Feature Operators, Geometric DeepONet variants), but DeepONet and FNO are the two production-relevant families and account for the large majority (an estimated ~95%) of operator-learning papers in computational physics today.

The amortisation argument

When does operator learning pay off? The cost-benefit is straightforward. Per-instance PINN training costs $T_{\mathrm{train}}$ per problem. Operator pretraining costs $T_{\mathrm{op}}$ once, then each evaluation costs $T_{\mathrm{infer}}$. For $N$ problem instances:

$$\text{Total cost (per-instance):} \quad N \cdot T_{\mathrm{train}},$$

$$\text{Total cost (operator):} \quad T_{\mathrm{op}} + N \cdot T_{\mathrm{infer}}.$$

The crossover $N^*$, beyond which the operator approach is cheaper, follows from equating the two totals:

$$N^* = \frac{T_{\mathrm{op}}}{T_{\mathrm{train}} - T_{\mathrm{infer}}} \approx \frac{T_{\mathrm{op}}}{T_{\mathrm{train}}} \quad (T_{\mathrm{infer}} \ll T_{\mathrm{train}}).$$

For seismic operators the numbers strongly favour operator learning: a per-instance EikoNet trains in ~30 s and operator pretraining typically takes hours to days on a GPU, but $T_{\mathrm{infer}}$ is microseconds. With $T_{\mathrm{op}} \sim 10$ hours ($36\,000$ s) and $T_{\mathrm{train}} \sim 30$ seconds, the crossover sits at $N^* \approx 36\,000 / 30 = 1200$ problem instances. For real-time microseismic monitoring with thousands of events per day, or time-lapse surveys that re-solve thousands of source positions after each monthly velocity-model update, this is a clear win. For one-shot historical case studies, per-instance training is fine.
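The crossover arithmetic is easy to reproduce. The sketch below plugs in the figures quoted above; the function names are hypothetical, and $T_{\mathrm{infer}}$ is assumed to be exactly one microsecond as a stand-in for "microseconds".

```python
def total_cost_per_instance(n: int, t_train: float) -> float:
    """Total seconds when every instance is trained from scratch."""
    return n * t_train


def total_cost_operator(n: int, t_op: float, t_infer: float) -> float:
    """Total seconds for one pretraining run plus n forward passes."""
    return t_op + n * t_infer


def crossover_instances(t_op: float, t_train: float, t_infer: float) -> float:
    """N* at which both strategies cost the same."""
    if t_train <= t_infer:
        raise ValueError("operator inference must be cheaper than per-instance training")
    return t_op / (t_train - t_infer)


T_OP = 10 * 3600.0   # 10 hours of pretraining, in seconds
T_TRAIN = 30.0       # one per-instance EikoNet run, in seconds
T_INFER = 1e-6       # assumed: one microsecond per forward pass

n_star = crossover_instances(T_OP, T_TRAIN, T_INFER)
print(round(n_star))  # 1200

for n in (100, 1200, 10_000):
    print(n, total_cost_per_instance(n, T_TRAIN), total_cost_operator(n, T_OP, T_INFER))
```

Below the crossover the pretraining cost dominates and per-instance training is cheaper; above it the operator total is essentially flat while the per-instance total keeps growing linearly.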

Try it: pretrain a DeepONet on the antiderivative operator

The widget pretrains a small DeepONet on the family of integrable functions $f_a(x) = a_1 \sin(\pi x) + a_2 \sin(2\pi x) + a_3 \sin(3\pi x)$ on $x \in [0, 1]$, with the antiderivative as the operator:

$$\mathcal{G}[f_a](y) = \int_0^y f_a(s)\,ds = \sum_{k=1}^{3} a_k \frac{1 - \cos(k\pi y)}{k\pi}.$$
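The closed form is worth checking once against direct quadrature, since it is the ground truth the network is later compared with. The sketch below does that in plain Python; the coefficient values are arbitrary examples and the function names are hypothetical.

```python
import math


def f(a, x):
    """Input function f_a(x) = sum_k a_k sin(k pi x), k = 1..3."""
    return sum(a_k * math.sin(k * math.pi * x) for k, a_k in enumerate(a, start=1))


def antiderivative_exact(a, y):
    """Closed form: sum_k a_k (1 - cos(k pi y)) / (k pi)."""
    return sum(a_k * (1.0 - math.cos(k * math.pi * y)) / (k * math.pi)
               for k, a_k in enumerate(a, start=1))


def antiderivative_numeric(a, y, n_steps=2000):
    """Composite trapezoidal rule for integral_0^y f_a(s) ds."""
    h = y / n_steps
    interior = sum(f(a, i * h) for i in range(1, n_steps))
    return h * (0.5 * f(a, 0.0) + interior + 0.5 * f(a, y))


a = (0.7, -0.3, 0.5)          # arbitrary example coefficients
for y in (0.25, 0.5, 1.0):
    print(y, antiderivative_exact(a, y), antiderivative_numeric(a, y))
```

The two columns should agree to several decimal places, with the small remaining gap due to the trapezoidal rule rather than to the formula.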

Architecture: the branch network $B : \mathbb{R}^{32} \to \mathbb{R}^{8}$ takes 32 samples of $f$ on a fixed grid, and the trunk network $T : \mathbb{R}^1 \to \mathbb{R}^{8}$ takes the query $y$. Output: $\mathcal{G}_{\mathrm{NN}}[f](y) = \sum_{k=1}^{8} B_k(f)\, T_k(y)$. Training: 1500 epochs of Adam on a bank of 80 random training functions, with mini-batches of 6 functions × 6 random query points per epoch. Browser wall-clock: ~10-20 s.
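The data side of this setup can be sketched as follows. This is not the widget's source code: the sizes (80 functions, 32 sensors, 6 × 6 mini-batches) come from the description above, while the coefficient range $[-1, 1]$, the uniform sensor grid on $[0, 1]$, uniformly random query points, and all names are assumptions made for illustration.

```python
import numpy as np

N_FUNCS, N_SENSORS = 80, 32        # sizes quoted in the text
BATCH_FUNCS, BATCH_QUERIES = 6, 6  # mini-batch shape quoted in the text
K_MODES = np.arange(1, 4)          # k = 1, 2, 3

rng = np.random.default_rng(0)
sensors = np.linspace(0.0, 1.0, N_SENSORS)            # assumed uniform sensor grid
coeffs = rng.uniform(-1.0, 1.0, size=(N_FUNCS, 3))    # assumed range for (a1, a2, a3)

# Branch inputs: every training function sampled at the fixed sensors.
# Shape (80, 32).
branch_bank = coeffs @ np.sin(np.pi * K_MODES[:, None] * sensors[None, :])


def exact_antiderivative(a: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Targets G[f_a](y) for coefficients a (B, 3) and queries y (B, Q)."""
    basis = (1.0 - np.cos(np.pi * K_MODES * y[..., None])) / (np.pi * K_MODES)  # (B, Q, 3)
    return np.einsum("bk,bqk->bq", a, basis)


def sample_batch(rng: np.random.Generator):
    """Draw one mini-batch: branch inputs, trunk inputs, regression targets."""
    idx = rng.choice(N_FUNCS, size=BATCH_FUNCS, replace=False)
    y = rng.uniform(0.0, 1.0, size=(BATCH_FUNCS, BATCH_QUERIES))
    return branch_bank[idx], y, exact_antiderivative(coeffs[idx], y)


u, y, target = sample_batch(rng)
print(u.shape, y.shape, target.shape)   # (6, 32) (6, 6) (6, 6)
```

Each training step would feed `u` to the branch network and `y` to the trunk network, then regress their inner product onto `target` with Adam. Note that the network never sees the coefficients themselves, only the 32 sensor samples.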

After pretraining, the (a₁, a₂, a₃) sliders let you explore the input-function space. EVERY new combination is evaluated INSTANTLY: one branch forward pass on the 32 samples + one trunk forward pass per query y + a dot product. No re-training. The exact analytic answer (cyan dashed) overlays the DeepONet prediction (orange) so you can see the operator network has genuinely learned the antiderivative MAP, not just memorised specific cases.
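The evaluation path described here (one branch pass, one trunk pass per query, one dot product) looks roughly like the sketch below. The weights are random and untrained, so the printed values carry no meaning; the hidden width of 32, the tanh activation, the single hidden layer and all function names are assumptions rather than the widget's actual configuration. Only the 32-sample input and the latent size 8 come from the text.

```python
import numpy as np

N_SENSORS, LATENT, HIDDEN = 32, 8, 32   # HIDDEN is an assumed layer width
rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random weights for a small tanh MLP (stand-in for trained parameters)."""
    return [(rng.normal(scale=1.0 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, x):
    """Forward pass: tanh on hidden layers, linear output layer."""
    for w, b in params[:-1]:
        x = np.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b


branch = init_mlp([N_SENSORS, HIDDEN, LATENT])   # B : R^32 -> R^8
trunk = init_mlp([1, HIDDEN, LATENT])            # T : R^1  -> R^8


def deeponet(f_samples, y):
    """G_NN[f](y) = sum_k B_k(f) T_k(y) for one input function and many queries."""
    b = mlp(branch, f_samples)        # one branch pass, shape (8,)
    t = mlp(trunk, y[:, None])        # one trunk pass per query, shape (Q, 8)
    return t @ b                      # dot product per query, shape (Q,)


sensors = np.linspace(0.0, 1.0, N_SENSORS)
a1, a2, a3 = 0.5, -0.2, 0.8                       # one slider setting
f_samples = (a1 * np.sin(np.pi * sensors) + a2 * np.sin(2 * np.pi * sensors)
             + a3 * np.sin(3 * np.pi * sensors))
queries = np.linspace(0.0, 1.0, 101)

prediction = deeponet(f_samples, queries)
print(prediction.shape)   # (101,)
```

Moving a slider changes only `f_samples`, so the cost of a new prediction is a handful of small matrix products. The trunk output `t` depends only on the query grid and could be cached across slider settings.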

Why the antiderivative? Why not jump straight to seismic?

The antiderivative operator is a MINIMUM VIABLE TEACHING TARGET: it is

Linear: the operator network is fitting a clean signal, with no nonlinear traps.
Closed-form: we can evaluate the truth and compare exactly. No FSM solver needed.
Cheap to train: 10-20 seconds in the browser, which fits the textbook iteration budget.
Architecturally identical to the velocity-to-travel-time DeepONet of §8.4. Same branch + trunk topology, same training loop, just a different operator and harder data.

By pretraining the network and evaluating it across the (a₁, a₂, a₃) sliders, you see the OPERATOR-LEARNING property concretely: the network learned a MAP between function spaces, not a specific function. The same architecture and training pattern carries to all of §8.2-§8.6.

What §8.2-§8.6 will do

§8.2 DeepONet — branch and trunk networks. The mathematical and architectural deep dive on what just trained. Branch / trunk separation, dimensionality choices, sensor placement, training schemes, common failure modes.
§8.3 Fourier Neural Operators (FNO). The other dominant architecture. Spectral convolutions, resolution-invariance, and where FNO beats DeepONet (and where it does not).
§8.4 Learned propagators for fast forward modelling. Train a network to ADVANCE THE WAVEFIELD ONE TIME STEP given the current state. Compose for many steps. Surrogate or replacement for FDTD when $\sim 1000\times$ speedup matters.
§8.5 Parametric PDE explorers in real time. The amortisation payoff: pretrained operator networks let users explore parameter space (velocity, source, geometry) with INSTANT response. Demos that would take seconds to minutes per query with a classical solver run at 60 fps.
§8.6 When operator learning beats per-instance training. Honest cost-benefit analysis. Crossover N*, training-data requirements, generalisation outside the training distribution, and when classical solvers (FDTD, FSM) still win.

References

Chen, T., Chen, H. (1995). Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans. Neural Netw. 6(4), 911–917. The foundational theorem; quietly overlooked for 25 years.
Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G.E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3(3), 218–229. The DeepONet paper that revived Chen-Chen 1995 and demonstrated it works in practice.
Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., Anandkumar, A. (2020). Fourier Neural Operator for Parametric Partial Differential Equations. arXiv:2010.08895 (ICLR 2021). The FNO paper.
Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K., Stuart, A., Anandkumar, A. (2023). Neural operator: Learning maps between function spaces with applications to PDEs. J. Mach. Learn. Res. 24(89), 1–97. Recent review covering both families, theory, and convergence rates.