DeepONet: branch and trunk networks

Part 8 — Operator learning for seismology

Learning objectives

Open the hood on DeepONet — the architecture from §8.1 in detail
Recognise DeepONet as a learned-basis decomposition: trunk = basis, branch = coefficients
Visualise the LEARNED trunk basis after training
Understand sensor placement, basis dimension K, and common architectural variants
Identify failure modes: too few basis dims, too sparse sensors, out-of-distribution inputs

§8.1 trained a DeepONet without dwelling on the architecture. Now we open the hood. The DeepONet output for input function $f$ at query point $y$ is

$$\mathcal{G}_{\mathrm{NN}}[f](y) = \sum_{k=1}^{K} \underbrace{B_k(f(x_1), \ldots, f(x_m); \theta_B)}_{\text{branch (coefficients)}} \cdot \underbrace{T_k(y; \theta_T)}_{\text{trunk (basis)}} ,$$

which is structurally a SEPARATION OF VARIABLES between input function and query coordinate. The branch produces a vector $B \in \mathbb{R}^K$ that depends only on $f$; the trunk produces a vector $T \in \mathbb{R}^K$ that depends only on $y$. Their inner product is the operator output.
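The inner-product structure can be sketched in a few lines of plain Python. This is a minimal illustration, not the widget's implementation: the helper names (`mlp`, `init_mlp`, `deeponet`), the tanh activation, the uniform sensor grid and the random, untrained weights are all assumptions made for the example.

```python
import math
import random

def mlp(x, layers):
    """Apply a small tanh MLP; `layers` is a list of (W, b), W given as rows."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:          # no activation on the output layer
            x = [math.tanh(v) for v in x]
    return x

def init_mlp(sizes, rng):
    """Random (untrained) weights, only so the sketch has something to evaluate."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers

def deeponet(f_samples, y, branch, trunk):
    """G_NN[f](y) = sum_k B_k(f(x_1), ..., f(x_m)) * T_k(y)."""
    B = mlp(f_samples, branch)   # K coefficients, depend only on f
    T = mlp([y], trunk)          # K basis values, depend only on y
    return sum(b * t for b, t in zip(B, T))

rng = random.Random(0)
m, K = 32, 8                                  # sensors and basis dimension
sensors = [j / (m - 1) for j in range(m)]     # assumed uniform grid on [0, 1]
branch = init_mlp([m, 32, 32, K], rng)
trunk = init_mlp([1, 32, 32, K], rng)

f_samples = [math.sin(math.pi * x) for x in sensors]
u_pred = deeponet(f_samples, 0.5, branch, trunk)
```

Because the branch never sees $y$ and the trunk never sees $f$, the branch output can be computed once per input function and reused for every query point.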

Why it works: separation of variables

Think of the DeepONet as a CHANGE OF BASIS for the output-function space. The trunk learns a set $\{T_1(y), T_2(y), \ldots, T_K(y)\}$ of $K$ basis functions on the query domain. The branch learns to compute, given any input function $f$, the $K$ coefficients that expand $\mathcal{G}[f]$ in that basis. The architecture mirrors the construction in the universal approximation theorem for operators of Chen and Chen (1995), which covers continuous nonlinear operators on compact sets of input functions.

The DEEP INSIGHT is that the basis is not chosen by us. We do not say "use a Fourier basis" or "use Chebyshev polynomials". The trunk discovers the basis during training: whatever shapes $T_k(y)$ minimise the operator-fitting loss. For the Poisson operator on a sine-mode input family, the natural basis IS the sine basis (the Poisson operator is diagonal in that basis), and the trained trunk typically spans sine-like shapes. Individual $T_k$ need not be pure sines, because any invertible mixing of the basis functions can be absorbed by the branch coefficients. For more complex operators on more complex input families, the basis is whatever the operator demands.

Architectural details

Sensor placement. The branch consumes $f$ sampled at $m$ fixed locations $\{x_1, \ldots, x_m\}$. These sensors define the resolution at which input functions are represented. Too few sensors miss high-frequency content; too many bloat the branch network. For PDE solution operators, $m = 32$ to $256$ is typical; we use $m = 32$ in the widget below.
Basis dimension K. The number of basis functions, which equals the number of branch outputs. Too small a $K$ underfits. Too large a $K$ overfits and slows training. The intrinsic dimension of the output space is the right starting point: for 3-parameter input families, $K = 4$ to $8$ is plenty; for high-dimensional output spaces (e.g., 2-D wavefields parameterised by a velocity model), $K = 64$ to $256$ is typical.
Network depths. Branch and trunk are independent MLPs, so they can have different widths and depths. The widget uses two hidden layers of 32 units each in both, which is a reasonable default. For harder operators, the branch should generally be wider than the trunk, because the branch handles the high-dimensional sampled input function while the trunk only handles a low-dimensional query coordinate.
Training loss. Standard MSE between predicted and target output values at random query points. For operators learned from PDE physics rather than data, an additional PDE-residual term can be added (the so-called "physics-informed DeepONet" or PI-DeepONet, Wang et al 2021).
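The data side of this setup can be sketched as follows. It is an illustration under stated assumptions, not the widget's code: the helper names `sample_triple` and `mse`, the uniform sensor grid, the coefficient range $[-1, 1]$ and the batch size of 64 are all hypothetical choices. Each training example is a triple: the input function at the $m$ sensors, one random query point $y$, and the target $u(y)$ for the Poisson problem introduced below.

```python
import math
import random

def sample_triple(rng, sensors, n_modes=3):
    """One example (f at sensors, query y, target u(y)) for -u'' = f."""
    a = [rng.uniform(-1.0, 1.0) for _ in range(n_modes)]   # assumed range
    f_samples = [
        sum(a[k] * math.sin((k + 1) * math.pi * x) for k in range(n_modes))
        for x in sensors
    ]
    y = rng.random()                                       # query point in [0, 1)
    u = sum(
        a[k] * math.sin((k + 1) * math.pi * y) / ((k + 1) * math.pi) ** 2
        for k in range(n_modes)
    )
    return f_samples, y, u

def mse(predictions, targets):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

rng = random.Random(0)
m = 32
sensors = [j / (m - 1) for j in range(m)]
batch = [sample_triple(rng, sensors) for _ in range(64)]

# A "predict zero everywhere" baseline: the loss an untrained model must beat.
baseline_loss = mse([0.0] * len(batch), [u for _, _, u in batch])
```

Sampling a fresh query point per example is what lets the trunk be trained pointwise, without ever assembling a full output field.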

Variants and common modifications

Several DeepONet variants appear in the literature:

Stacked / Unstacked DeepONet. The original Lu et al 2021 paper presents two variants that differ in the branch: STACKED ($K$ separate branch networks, each producing one coefficient) and UNSTACKED (a single branch network producing all $K$ coefficients). Both use a single trunk. The unstacked form is what we use here and what dominates the literature.
Bias term. Some implementations add a learnable bias $b_0$: $\mathcal{G}_{\mathrm{NN}}[f](y) = \sum_k B_k(f)\, T_k(y) + b_0$. Useful when output values do not centre at zero.
POD-DeepONet. Lu et al 2022 propose pre-conditioning by Proper Orthogonal Decomposition: project the training output functions onto their leading POD modes, train the branch to predict POD coefficients, and use the POD modes themselves (not a learned MLP) as the trunk. Trades flexibility for speed and interpretability — the basis is fixed and orthogonal.
Physics-Informed DeepONet (PI-DeepONet). Wang et al 2021 add a PDE residual loss on top of the MSE data loss. Useful when training data is sparse but the governing PDE is known.
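Multiple-input DeepONet (MIONet). Jin et al 2022 extend to operators with multiple input functions (e.g., simultaneously vary velocity AND density). MIONet uses one branch per input function and combines the branch outputs by element-wise product before taking the inner product with the trunk.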
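To make the PI-DeepONet item concrete, one plausible form of the combined loss for the 1-D Poisson problem used later in this section is written out below. It is a sketch in this section's notation, not the exact formulation of Wang et al 2021: the weight $\lambda$, the sample counts $N$ and $N_r$, and the choice of collocation points $y_j$ are illustrative assumptions.

```latex
\mathcal{L}(\theta_B, \theta_T)
  = \underbrace{\frac{1}{N} \sum_{i=1}^{N}
      \bigl| \mathcal{G}_{\mathrm{NN}}[f_i](y_i) - u_i(y_i) \bigr|^2}_{\text{data (MSE)}}
  + \lambda \,
    \underbrace{\frac{1}{N_r} \sum_{j=1}^{N_r}
      \Bigl| -\frac{\partial^2}{\partial y^2} \mathcal{G}_{\mathrm{NN}}[f_j](y_j) - f_j(y_j) \Bigr|^2}_{\text{PDE residual of } -u'' = f}
```

Only the trunk depends on $y$, so the second derivative acts on $T_k(y)$ alone and can be obtained by automatic differentiation. The boundary conditions $u(0) = u(1) = 0$ would need their own penalty term or a hard constraint on the trunk; they are omitted here for brevity.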

Try it: see the learned basis

The widget trains a DeepONet on the 1-D Poisson boundary-value problem (BVP):

$$-u''(y) = f(y), \quad u(0) = u(1) = 0, \quad y \in [0, 1] .$$

For the 3-parameter input family $f_a(x) = \sum_{k=1}^{3} a_k \sin(k\pi x)$, the exact solution is $u(y) = \sum_{k=1}^{3} a_k \sin(k\pi y)/(k\pi)^2$. Each Fourier mode is an eigenfunction of $-d^2/dy^2$ with eigenvalue $(k\pi)^2$, so the Poisson operator simply attenuates each mode by $1/(k\pi)^2$. This is the classical Poisson SMOOTHING property: high-frequency input modes contribute LESS to the output (factor $1/(9\pi^2)$ for the third mode vs $1/\pi^2$ for the first).
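The closed-form solution and the attenuation factors can be checked numerically. The sketch below applies a central finite difference to the stated $u$ and compares $-u''$ with $f$ at a few interior points. The coefficient values, the step size `h` and the function names are arbitrary choices for the example.

```python
import math

def f(x, a):
    """Input function: sum_k a_k sin(k pi x), k = 1..len(a)."""
    return sum(ak * math.sin((k + 1) * math.pi * x) for k, ak in enumerate(a))

def u_exact(y, a):
    """Closed-form solution: each mode attenuated by 1 / (k pi)^2."""
    return sum(
        ak * math.sin((k + 1) * math.pi * y) / ((k + 1) * math.pi) ** 2
        for k, ak in enumerate(a)
    )

def neg_second_derivative(g, y, h=1e-3):
    """Central finite-difference approximation of -g''(y)."""
    return -(g(y + h) - 2.0 * g(y) + g(y - h)) / h ** 2

a = [1.0, -0.5, 0.25]                      # arbitrary example coefficients
ys = [0.1 * i for i in range(1, 10)]       # interior check points
max_residual = max(
    abs(neg_second_derivative(lambda t: u_exact(t, a), y) - f(y, a)) for y in ys
)

# Attenuation per mode: the third mode is damped 9 times more than the first.
attenuation = [1.0 / (k * math.pi) ** 2 for k in (1, 2, 3)]
```

The residual is limited only by the finite-difference truncation error, which confirms that the formula satisfies $-u'' = f$ together with the zero boundary values.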

After 2000 epochs of training (roughly 10 to 20 s in the browser), the widget displays:

Input function $f(x)$ for the current sliders.
Output $u(y)$ — exact (cyan dashed) vs DeepONet prediction (orange).
Trunk basis $\{T_1(y), \ldots, T_8(y)\}$: eight curves, one per basis function. THIS IS THE WHOLE POINT: you see the basis the network discovered.
Branch coefficients $\{B_1(f), \ldots, B_8(f)\}$: bar chart for the current input. Adjust the sliders and watch the bars rearrange; the same trunk basis with different branch weights gives different output functions.
Training loss on log-y vs epoch.

The trunk-basis panel often reveals interesting structure. For the Poisson operator on a 3-mode family, the network has effectively 3 USEFUL basis directions (because the output space is 3-dimensional). The other 5 basis functions either approximate the same shapes redundantly or sit near zero — the network learned that K=8 was overkill for this problem.
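One way to put a number on "useful basis directions" is to count how many of the sampled trunk curves are linearly independent. The sketch below does this with modified Gram-Schmidt. The eight curves are a hand-built stand-in (fixed mixtures of three sine modes), not output from a trained network, and the function name `numerical_rank` and the tolerance are assumptions for the example.

```python
import math

def numerical_rank(functions, tol=1e-8):
    """Count linearly independent sampled functions (modified Gram-Schmidt)."""
    basis = []
    for g in functions:
        r = list(g)
        for q in basis:                       # remove components already spanned
            c = sum(ri * qi for ri, qi in zip(r, q))
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(sum(ri * ri for ri in r))
        g_norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
        if norm > tol * g_norm:               # keep only genuinely new directions
            basis.append([ri / norm for ri in r])
    return len(basis)

ys = [i / 100 for i in range(101)]
modes = [[math.sin(k * math.pi * y) for y in ys] for k in (1, 2, 3)]

# Stand-in for K = 8 trunk outputs: each row mixes the three sine modes.
mix = [
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0],
    [0.5, -1, 2], [2, 0, -1], [0, 0, 0], [1, 1, 1],
]
trunk_curves = [
    [sum(c * mode[i] for c, mode in zip(row, modes)) for i in range(len(ys))]
    for row in mix
]

rank = numerical_rank(trunk_curves)   # 3 for this stand-in
```

With a real trained trunk the redundant directions are only approximately dependent, so the count depends on the tolerance; a singular-value spectrum of the sampled trunk matrix gives a more graded picture.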

Failure modes

Too few basis dims $K$. If $K = 2$ for an input family with 3 dimensions of variation, no amount of training will let the DeepONet match the target: a two-term expansion cannot span a three-dimensional output space. Diagnostic: training loss plateaus at a non-trivial floor.
Too sparse sensors $m$. If there are only $m = 4$ sensors but the input function has five or more oscillations across the domain, the branch cannot disambiguate inputs with similar samples but different high-frequency content. Diagnostic: training loss is fine but test error on aliased inputs is poor.
Out-of-distribution input functions. If you train on $a_k \in [-1, 1]$ but evaluate on $a_k = 5$, the DeepONet will extrapolate poorly. The branch has only been fitted on inputs from the training range, and nothing constrains its behaviour outside that range. This is a general limitation of statistical operator learning; classical solvers (FSM, FDTD) solve each new input from the governing equations, so they have no training distribution to fall outside of.
Discontinuous operators. If $\mathcal{G}$ has discontinuities (e.g., piecewise solutions of conservation laws with shocks), the smooth trunk basis cannot represent the discontinuity exactly. POD-DeepONet, or FNO with shock-adapted modifications, may fare better.
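The sparse-sensor failure mode can be reproduced exactly with two sine modes. In the sketch below, four equally spaced interior sensors are an assumed placement (the widget's may differ), and mode 11 lies deliberately outside the 3-mode training family. At these sensors $\sin(11\pi x)$ and $\sin(\pi x)$ take identical values, so the branch receives the same input for both, although the true Poisson outputs differ.

```python
import math

m = 4
sensors = [j / (m + 1) for j in range(1, m + 1)]   # 0.2, 0.4, 0.6, 0.8 (assumed)

def mode_samples(k):
    """Samples of sin(k pi x) at the sensor locations."""
    return [math.sin(k * math.pi * x) for x in sensors]

low, high = mode_samples(1), mode_samples(11)
max_sample_gap = max(abs(p - q) for p, q in zip(low, high))   # rounding error only

# True solutions of -u'' = f at y = 0.5 for the two inputs.
y = 0.5
u_low = math.sin(1 * math.pi * y) / (1 * math.pi) ** 2
u_high = math.sin(11 * math.pi * y) / (11 * math.pi) ** 2
output_gap = abs(u_low - u_high)   # clearly non-zero despite identical samples
```

No branch network can separate these two inputs, whatever its size, because the information is lost before the network sees the data. The remedy is more sensors, not more parameters.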

Looking ahead: when DeepONet vs FNO

DeepONet shines when:

The input function space is bounded and not too high-dimensional.
The output domain is fixed (the trunk evaluates pointwise; resolution is set at evaluation time).
You want simple, interpretable architecture.
Training data is available pointwise (random query coordinates with target values), not as full output fields.

The Fourier Neural Operator (§8.3 next) shines when:

Inputs and outputs live on the SAME spatial grid (e.g., velocity field → travel-time field, both on a 256×256 grid).
You want resolution-invariance: train on 64×64, evaluate on 1024×1024 with no retraining.
The operator is approximately translation-invariant or has natural Fourier structure.
Training data is dense (full output fields available, not pointwise).

Most seismology applications (velocity-to-wavefield, velocity-to-travel-time on a fixed grid) sit closer to the FNO regime. We will see why in §8.3 and how the choice plays out for the §8.4 learned wave-equation propagator.

References

Lu, L., Jin, P., Pang, G., Zhang, Z., Karniadakis, G.E. (2021). Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nat. Mach. Intell. 3(3), 218–229.
Lanthaler, S., Mishra, S., Karniadakis, G.E. (2022). Error estimates for DeepONets: A deep learning framework in infinite dimensions. Trans. Math. Appl. 6(1), tnac001. Convergence rates and theoretical analysis.
Lu, L., Meng, X., Cai, S., Mao, Z., Goswami, S., Zhang, Z., Karniadakis, G.E. (2022). A comprehensive and fair comparison of two neural operators (with practical extensions) based on FAIR data. Comput. Methods Appl. Mech. Eng. 393, 114778. POD-DeepONet and architectural variants.
Wang, S., Wang, H., Perdikaris, P. (2021). Learning the solution operator of parametric partial differential equations with physics-informed DeepONets. Sci. Adv. 7(40), eabi8605. PI-DeepONet — the physics-informed extension.
Jin, P., Meng, S., Lu, L. (2022). MIONet: Learning multiple-input operators via tensor product. SIAM J. Sci. Comput. 44(6), A3490–A3514. Multi-input extension.