Fourier Neural Operators (FNO)

Part 8 — Operator learning for seismology

Learning objectives

Derive the spectral-convolution layer that defines the FNO architecture
Recognise resolution-invariance as a structural property of FNO
Train a single-layer FNO to recover the heat-equation operator and watch it discover the diagonal Fourier multipliers
Compare DeepONet (§8.2) and FNO on the same operator family
Identify when FNO wins (gridded data, resolution-invariance) and when DeepONet wins (pointwise queries, irregular geometry)

The Fourier Neural Operator (Li et al 2020, ICLR 2021) takes a different tack from DeepONet. Instead of decomposing the operator into branch and trunk networks, FNO works in the FOURIER DOMAIN. Each layer applies a learnable spectral convolution plus a pointwise linear term, followed by a pointwise nonlinearity:

v^{(l+1)}(x) = \sigma\Bigl( \mathcal{F}^{-1}\bigl( R^{(l)} \cdot \mathcal{F}(v^{(l)}) \bigr)(x) + W^{(l)} v^{(l)}(x) \Bigr) ,

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are forward and inverse Fourier transforms, $R^{(l)}$ is a LEARNABLE complex multiplier on the first $K$ Fourier modes (modes beyond $K$ are truncated), $W^{(l)}$ is a pointwise (1×1) convolution in spatial domain, and $\sigma$ is a nonlinearity. Stacking $L$ such layers builds a deep neural operator.
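The layer formula above can be sketched in a few lines of NumPy. This is a minimal single-channel illustration on a periodic 1-D grid, not the reference implementation: the function name `spectral_layer`, the scalar `w` standing in for $W^{(l)}$, the `tanh` nonlinearity, and the random multipliers are all assumptions made for the example.

```python
import numpy as np

def spectral_layer(v, R, w, sigma=np.tanh):
    """One FNO layer on a periodic 1-D grid, single channel (illustrative).

    v     : (N,) real samples of v^(l)
    R     : (K,) complex multipliers for the K lowest Fourier modes
    w     : scalar standing in for the pointwise operator W^(l)
    sigma : pointwise nonlinearity
    """
    N, K = v.size, R.size
    v_hat = np.fft.rfft(v)                  # forward transform: N//2 + 1 modes
    out_hat = np.zeros_like(v_hat)
    out_hat[:K] = R * v_hat[:K]             # modes with index >= K are truncated
    spectral = np.fft.irfft(out_hat, n=N)   # back to the spatial grid
    return sigma(spectral + w * v)          # add pointwise term, then sigma

rng = np.random.default_rng(0)
N, K = 64, 8
x = np.arange(N) / N
v = np.sin(2 * np.pi * x) + 0.3 * np.cos(2 * np.pi * 5 * x)
R = rng.normal(size=K) + 1j * rng.normal(size=K)   # untrained, random
out = spectral_layer(v, R, w=0.5)
```

With all multipliers set to one, no truncation, `w = 0`, and an identity nonlinearity, the layer reduces to the identity map, which is a quick sanity check on the transform pair.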

Why Fourier?

The spectral-convolution layer is a GLOBAL OPERATION: a single Fourier coefficient depends on the input function across the whole domain. This contrasts with classical convolutional networks, which are local in space — each output pixel depends only on a local neighbourhood. Many PDE solution operators ARE global: the value of $u(x_0)$ depends on the source distribution everywhere within the domain of dependence (the wave equation), the entire forcing function (Poisson equation), or the entire initial condition (heat equation). FNO captures this naturally.

Three structural advantages drop out for free:

Resolution invariance. The learnable multiplier $R^{(l)}_k$ is a function of MODE NUMBER $k$ , not grid index. Train an FNO on a 64×64 grid; evaluate it on 1024×1024 with no retraining (provided the new grid resolves at least the first $K$ modes). This is a direct consequence of the Fourier representation: same modes, more grid points just mean a finer interpolation.
Translation equivariance (with periodic boundaries). Because Fourier convolution commutes with translations, an FNO trained on data centred around one location applies identically to data centred elsewhere. Useful for tomography of survey patches with similar geological style at different positions.
Mode truncation as regularisation. Only the first $K$ modes are learnable. High-frequency content beyond $K$ is effectively projected out. This forces the operator to be smooth in spectral content and prevents the network from overfitting to high-frequency noise.
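The first two properties can be checked numerically for the linear spectral part of a layer. The sketch below is an illustration under stated assumptions: the helper `spectral_conv`, the band-limited test function `u`, the grid sizes, and the random multipliers are hypothetical choices, and the pointwise term and nonlinearity are left out so that the comparison is exact up to round-off.

```python
import numpy as np

def spectral_conv(v, R):
    """Multiply the K lowest Fourier modes of v by R and zero the rest."""
    v_hat = np.fft.rfft(v)
    out_hat = np.zeros_like(v_hat)
    out_hat[:R.size] = R * v_hat[:R.size]
    return np.fft.irfft(out_hat, n=v.size)

def u(x):
    # Band-limited periodic test input (highest mode 6, below K = 8).
    return (np.sin(2 * np.pi * x)
            + 0.5 * np.cos(2 * np.pi * 3 * x)
            + 0.2 * np.sin(2 * np.pi * 6 * x))

rng = np.random.default_rng(1)
K = 8
R = rng.normal(size=K) + 1j * rng.normal(size=K)   # same weights on both grids

x_coarse = np.arange(64) / 64
x_fine = np.arange(256) / 256
out_coarse = spectral_conv(u(x_coarse), R)
out_fine = spectral_conv(u(x_fine), R)

# Resolution invariance: every 4th fine-grid point is a coarse-grid point.
res_gap = np.max(np.abs(out_fine[::4] - out_coarse))

# Translation equivariance: shift-then-apply equals apply-then-shift.
shift = 10
equiv_gap = np.max(np.abs(
    spectral_conv(np.roll(u(x_coarse), shift), R) - np.roll(out_coarse, shift)))
print(res_gap, equiv_gap)
```

Both gaps should sit at floating-point round-off. In a full nonlinear FNO the agreement across resolutions is only approximate, because the pointwise nonlinearity creates high-frequency content that the coarse grid may alias.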

The price: gridded data and translation-invariant operators

FNO assumes the input function is sampled on a REGULAR GRID. Irregular geometries (real seismic surveys with sparse and unevenly-spaced sources/receivers) need a separate gridding step, which can introduce error. DeepONet handles arbitrary sensor placements naturally — the branch consumes whatever values you sample at whatever locations.

FNO also implicitly assumes some translation symmetry — at minimum, the operator should not violently change behaviour from one part of the domain to another. For wave-equation operators on heterogeneous velocity models this is a soft assumption: the SAME wave equation applies everywhere, but the velocity-model heterogeneity breaks strict translation invariance. In practice FNO works well on heterogeneous-medium operators provided the heterogeneity is statistically homogeneous (no abrupt regime changes).

Try it: FNO discovering the heat-equation operator

The widget trains a SINGLE-LAYER FNO on the 1-D heat equation:

\frac{\partial u}{\partial t} = \alpha \frac{\partial^2 u}{\partial x^2} ,\quad u(0, t) = u(1, t) = 0 ,\quad u(x, 0) = u_0(x) .

The forward operator $\mathcal{G}: u_0(\cdot) \mapsto u(\cdot, T)$ at fixed $T$ is EXACTLY DIAGONAL in the Dirichlet sine basis:

\hat{u}_k(T) = \hat{u}_k(0) \cdot \exp\bigl(-\alpha (k\pi)^2 T\bigr) .
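The diagonal structure can be made concrete with an explicit sine transform. The sketch below builds the transform as a dense matrix for clarity rather than speed, and the values of $\alpha$, $T$, and the grid size are assumptions chosen for illustration (the section does not fix them); `dst`, `idst`, and `heat_operator` are hypothetical helper names.

```python
import numpy as np

N = 64                       # number of grid intervals (assumed)
alpha, T = 0.05, 0.1         # diffusivity and final time (assumed values)
x = np.arange(1, N) / N      # interior points; u = 0 at x = 0 and x = 1
k = np.arange(1, N)          # sine mode numbers
S = np.sin(np.pi * np.outer(k, x))   # S[k-1, j-1] = sin(k pi x_j)

def dst(u):
    """Sine coefficients u_hat_k from grid samples."""
    return (2.0 / N) * (S @ u)

def idst(u_hat):
    """Grid samples from sine coefficients."""
    return S.T @ u_hat

decay = np.exp(-alpha * (k * np.pi) ** 2 * T)   # exact per-mode multipliers

def heat_operator(u0):
    """Apply G: u_0 -> u(., T) as a diagonal multiplier in the sine basis."""
    return idst(decay * dst(u0))

# A random 5-mode initial condition, as in the training family.
rng = np.random.default_rng(2)
a = rng.uniform(-1, 1, size=5)
u0 = sum(a[m] * np.sin((m + 1) * np.pi * x) for m in range(5))

# Analytic series solution for comparison.
uT_series = sum(a[m] * np.exp(-alpha * ((m + 1) * np.pi) ** 2 * T)
                * np.sin((m + 1) * np.pi * x) for m in range(5))
err = np.max(np.abs(heat_operator(u0) - uT_series))
```

The transform-multiply-invert route only sees grid samples of $u_0$, yet it should reproduce the analytic series to round-off, because each sine mode decays independently.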

Training data: random initial conditions $u_0(x) = \sum_{k=1}^{5} a_k \sin(k\pi x)$ with $a_k \in [-1, 1]$ . Loss: spectral MSE between predicted and exact spectra. The single-layer FNO has $K = 16$ learnable real multipliers $\hat{K}[k]$ for the sine modes; the architecture is

\mathcal{G}_{\mathrm{NN}}[u_0] = \mathrm{DST}^{-1}\bigl( \hat{K} \cdot \mathrm{DST}(u_0) \bigr)

where DST is the discrete sine transform (a real-valued FFT specialised to Dirichlet boundaries; see widget for the implementation). After 1000 epochs of mini-batch SGD (~3-5 s in browser), the LEARNED multipliers $\hat{K}[k]$ should converge to the EXACT decay factors $\exp(-\alpha (k\pi)^2 T)$ .
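A rough sketch of this training loop, not the widget's code, is given below. Because the DST is linear and the loss is spectral, the sketch works directly on sine coefficients and skips the transform. The values of $\alpha$ and $T$, the initial multiplier value, the learning rate, the batch size, and the use of update steps in place of epochs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n_active = 16, 5                   # learnable modes; modes present in the data
alpha, T = 0.05, 0.1                  # assumed values
k = np.arange(1, K + 1)
decay = np.exp(-alpha * (k * np.pi) ** 2 * T)   # exact multipliers (targets)

K_hat = np.full(K, 0.5)               # assumed initialisation
lr, batch, steps = 0.5, 32, 1000      # assumed SGD settings

for _ in range(steps):
    # Mini-batch of initial-condition spectra: a_k in [-1, 1] for k <= 5.
    A = np.zeros((batch, K))
    A[:, :n_active] = rng.uniform(-1, 1, size=(batch, n_active))
    resid = (K_hat - decay) * A               # predicted minus exact spectrum
    grad = 2.0 * np.mean(resid * A, axis=0)   # gradient of batch-mean squared error
    K_hat -= lr * grad

max_err = np.max(np.abs(K_hat[:n_active] - decay[:n_active]))
print(max_err)
```

In this sketch the first five multipliers converge to the exact decay factors, while the remaining eleven never move from their initial value because their gradient is identically zero.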

The widget displays four panels:

Input $u_0(x)$ for the current sliders.
Output $u(x, T)$ : exact heat propagation (cyan dashed) vs FNO (orange).
Learned multipliers $\hat{K}[k]$ as orange bars; the cyan dots are the EXACT decay factors $\exp(-\alpha (k\pi)^2 T)$ . After training, the bars sit on the dots — the FNO has discovered the operator's spectral structure from data alone.
Spectral MSE trace on log-y.

Notice that for modes $k > 5$ , the input has zero amplitude (the family is 5-mode), so there is NO TRAINING SIGNAL on $\hat{K}[k]$ for those modes. They remain near their initial values. This is honest: an FNO learns only what the training data forces it to learn. To train multipliers for higher modes, one would need richer training inputs.
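The missing training signal can be read off the gradient. As a worked step, assume the spectral MSE is the squared coefficient error averaged over a batch of $B$ samples (the widget's exact normalisation is not stated here, and it only rescales the result). Then

\frac{\partial \mathcal{L}}{\partial \hat{K}[k]} = \frac{2}{B} \sum_{b=1}^{B} \Bigl( \hat{K}[k] - \exp\bigl(-\alpha (k\pi)^2 T\bigr) \Bigr) \bigl( \hat{u}^{(b)}_k(0) \bigr)^2 ,

where $\hat{u}^{(b)}_k(0)$ is the $k$-th sine coefficient of the $b$-th initial condition. For $k \le 5$ the gradient vanishes only when $\hat{K}[k]$ equals the exact decay factor. For $k > 5$ every $\hat{u}^{(b)}_k(0)$ is zero, so the gradient is zero whatever value $\hat{K}[k]$ takes.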

FNO vs DeepONet — when each wins

Both architectures are universal operator approximators on suitable function spaces, so in the LIMIT of unlimited network size each can approximate any continuous operator, linear or nonlinear, to arbitrary accuracy on compact sets of inputs. The practical differences are:

FNO wins on regular grids with continuous fields. Velocity-to-wavefield, density-to-gravity-anomaly, vorticity-to-streamfunction: all defined on regular grids, all benefit from resolution invariance. FNO is also generally faster to train per epoch because the spectral convolution is fast (O(N log N) for FFT, O(K) for the multiplier).
DeepONet wins for pointwise queries. If the output is needed at irregular query points (e.g., picked travel times at receivers placed wherever the survey crew could install them), DeepONet evaluates the trunk at each point individually. FNO produces a full output field on the same grid as the input, so off-grid queries need an extra interpolation step.
DeepONet handles irregular sensor placements. The branch network can take any function-sampling scheme. FNO requires regular gridded input.
FNO has fewer hyperparameters to tune. Choose the mode count K, the depth L, and the channel width (not to be confused with the pointwise operator $W^{(l)}$ ). DeepONet has branch depth, trunk depth, basis dimension, and sensor positions: more knobs.
FNO scales naturally to 2D and 3D. The spectral convolution generalises trivially via 2D/3D FFTs. DeepONet scales by enlarging the trunk to take 2D/3D coordinates.
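The extra interpolation step for off-grid queries mentioned above can be as simple as the following sketch. The grid field here is a stand-in for an FNO output, and the receiver positions are hypothetical. Piecewise-linear interpolation is only one option and adds its own error on top of the network's.

```python
import numpy as np

# Regular output grid, as an FNO would produce (stand-in field values).
x_grid = np.linspace(0.0, 1.0, 65)
field = np.sin(3 * np.pi * x_grid)

# Hypothetical receiver positions: irregular and off-grid.
x_recv = np.array([0.013, 0.207, 0.3141, 0.55, 0.8129, 0.97])

# Piecewise-linear interpolation from the grid to the receivers.
at_recv = np.interp(x_recv, x_grid, field)

# Interpolation error against the known stand-in field.
interp_err = np.max(np.abs(at_recv - np.sin(3 * np.pi * x_recv)))
print(interp_err)
```

The interpolation error shrinks as the grid is refined, so a resolution-invariant FNO can trade a finer evaluation grid for more accurate off-grid values.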

For seismology specifically: velocity-model-to-travel-time operators on regular surveying grids → FNO. Off-grid hypocentre location with arbitrary network layouts → DeepONet (or hybrid).

Beyond 1-layer: the deep FNO

The widget uses a single layer to keep the relationship between learned multipliers and the operator transparent. Production FNOs typically use 4-8 layers with ~12-20 modes each, and add the pointwise $W$ operator and a GELU/ReLU nonlinearity between layers. Stacking layers lets the network represent NONLINEAR operators (e.g., the Burgers viscous-shock operator, or full Navier-Stokes time-stepping). For a LINEAR operator that is diagonal in the transform basis, like our heat equation in the sine basis, a single spectral layer with no nonlinearity is sufficient: the exact operator is itself a set of per-mode multipliers.
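A forward pass through a stacked, multi-channel FNO might look like the sketch below. It is illustrative only: the weights are random and untrained, the channel count, mode count, and depth are arbitrary, and the lifting and projection vectors (`lift`, `project`) follow the general design of Li et al rather than anything specified in this section. Each mode $k$ gets its own $C \times C$ complex matrix, which mixes channels.

```python
import numpy as np

def gelu(z):
    # tanh approximation to GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def fno_layer(v, R, W):
    """v: (C, N) real channels; R: (K, C, C) complex; W: (C, C) real."""
    N, K = v.shape[1], R.shape[0]
    v_hat = np.fft.rfft(v, axis=1)                    # (C, N//2 + 1)
    out_hat = np.zeros_like(v_hat)
    # Per-mode channel mixing: out_hat[i, k] = sum_j R[k, i, j] * v_hat[j, k]
    out_hat[:, :K] = np.einsum("kij,jk->ik", R, v_hat[:, :K])
    spectral = np.fft.irfft(out_hat, n=N, axis=1)
    return gelu(spectral + W @ v)                     # pointwise W, then GELU

def fno_forward(a, lift, layers, project):
    v = np.outer(lift, a)          # lift scalar input to C channels: (C, N)
    for R, W in layers:
        v = fno_layer(v, R, W)
    return project @ v             # project back to one channel: (N,)

rng = np.random.default_rng(4)
C, K, L = 8, 12, 4                 # channels, modes, depth (arbitrary)
lift = rng.normal(size=C)
project = rng.normal(size=C) / C
layers = [((rng.normal(size=(K, C, C)) + 1j * rng.normal(size=(K, C, C))) / C,
           rng.normal(size=(C, C)) / C) for _ in range(L)]

# The same parameters accept inputs sampled on different grids.
outputs = {}
for N in (64, 128):
    x = np.arange(N) / N
    outputs[N] = fno_forward(np.sin(2 * np.pi * x), lift, layers, project)
```

None of the parameter shapes depend on the grid size `N`, which is why the same weights can be evaluated at both resolutions.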

Several FNO variants extend the basic architecture:

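Adaptive FNO (AFNO) (Guibas et al 2021): replace the per-mode weight matrices with a block-diagonal MLP whose weights are shared across modes, and sparsify the modes by soft-thresholding. Designed as an efficient token mixer for transformers, with far fewer parameters than a dense per-mode weight tensor.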
U-FNO and Galerkin FNO: combine FNO blocks with U-Net-style downsampling/upsampling for multi-scale capture.
Tensor-Train FNO: low-rank decomposition of the spectral weight tensor to reduce parameters in 3D applications.
Geometry-informed Neural Operator (GINO) (Li et al 2023): couple graph neural operator layers, which handle irregular point clouds and meshes, with an FNO acting on a regular latent grid.

What §8.4 will do

§8.4 builds a LEARNED PROPAGATOR: a network that takes the wavefield at time $t$ and produces the wavefield at time $t + \Delta t$ . Composing it over many timesteps gives a fast surrogate for FDTD. The task suits the FNO architecture naturally (regular grids, time-stepping, translation-invariant operator). We will see how stability, accuracy, and training-data efficiency play out for this real seismic surrogate.

References

Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K., Stuart, A., Anandkumar, A. (2020). Fourier Neural Operator for Parametric Partial Differential Equations. ICLR 2021. arXiv:2010.08895.
Kovachki, N., Lanthaler, S., Mishra, S. (2021). On universal approximation and error bounds for Fourier Neural Operators. JMLR 22, 1–76. Universal-approximation theorem and convergence rates for FNO.
Guibas, J., Mardani, M., Li, Z., Tao, A., Anandkumar, A., Catanzaro, B. (2021). Adaptive Fourier Neural Operators: efficient token mixers for transformers. arXiv:2111.13587. AFNO and the AFNO-transformer connection.
Li, Z., Kovachki, N., Choy, C., Li, B., Kossaifi, J., Otta, S., Nabian, M., Stadler, M., Hundt, C., Azizzadenesheli, K., Anandkumar, A. (2023). Geometry-informed Neural Operator for large-scale 3D PDEs. arXiv:2309.00583.