Vanilla MLPs and their limits

Part 2 — Architectures for PINNs

Learning objectives

  • Map the three architecture knobs of a vanilla MLP — depth, width, activation — onto practical PINN consequences
  • Recognise that depth and width help on smooth problems but cannot fix spectral bias
  • Identify when a vanilla MLP is the right baseline (smooth, low-frequency, simple BC) and when it is not (high-frequency wavefield, complicated geometry)
  • Be ready for the architectural fixes in §2.2–§2.5 with a clear sense of what each one is fixing

Part 1 used vanilla MLPs as the baseline architecture for every PINN. They are the natural starting point because they are the simplest universal approximator: stack a few fully-connected layers, pick a smooth activation, and you have a black box that can in principle approximate any continuous function on a bounded domain to arbitrary accuracy, given enough neurons. For many seismic PINN problems they are still the production architecture of choice. But they are not magic, and the way they fail is well understood.

The three architectural knobs

A vanilla MLP has exactly three architecture-time decisions:

  • Depth — how many hidden layers. More depth lets the network compose more transformations; for PINN problems the typical choice is 3–8 hidden layers, with diminishing returns past about 6.
  • Width — how many neurons per hidden layer. More width gives more capacity per layer. Typical PINN choices: 32–64 neurons per layer for 1D and 2D problems, 64–128 for 3D, occasionally 256+ for very high-resolution work.
  • Activation — the elementwise nonlinearity. We covered the inductive-bias differences in §0.3. The typical PINN default is Tanh (smooth, infinitely differentiable, well-behaved second derivatives needed for PDE residuals); Swish is a modern alternative; sin (SIREN, §2.3) is for high-frequency targets.

That is the whole vocabulary. Three numbers and a pick from a small enum. You can do a tremendous amount with these three choices alone.
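
To make the three knobs concrete, here is a minimal, dependency-free sketch of a vanilla MLP in plain Python. The function names (init_mlp, forward, swish), the Glorot-style initialisation, and the scalar input and output are assumptions made for illustration, not the setup used elsewhere in this course. A real PINN would use a framework with automatic differentiation.

```python
import math
import random

def init_mlp(depth, width, in_dim=1, out_dim=1, seed=0):
    """Return a list of (weights, biases) pairs, one per layer.

    depth counts hidden layers, so the list has depth + 1 entries;
    the last entry is the linear output layer.
    """
    rng = random.Random(seed)
    sizes = [in_dim] + [width] * depth + [out_dim]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # Glorot-style uniform range (an assumed, commonly used choice).
        limit = math.sqrt(6.0 / (fan_in + fan_out))
        weights = [[rng.uniform(-limit, limit) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        layers.append((weights, biases))
    return layers

def forward(layers, x, activation=math.tanh):
    """Evaluate the MLP at one input point x (a list of in_dim floats)."""
    h = list(x)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        # Hidden layers apply the activation; the output layer stays linear.
        h = z if i == len(layers) - 1 else [activation(v) for v in z]
    return h

def swish(v):
    """Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

# Depth and width are two integers; the activation is one function argument.
net = init_mlp(depth=4, width=32)
u_tanh = forward(net, [0.5])
u_swish = forward(net, [0.5], activation=swish)
u_sin = forward(net, [0.5], activation=math.sin)
```

Passing math.sin here only swaps the nonlinearity. It does not reproduce a full SIREN, which also prescribes its own initialisation and frequency scaling (§2.3).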

Try it

[Interactive figure: Vanilla Arch]

Pick a target, set the depth and width, choose an activation, and watch the network train. The status strip shows the parameter count for the current configuration. The pedagogical experiments to run:

  • Smooth target, depth sweep. With "Smooth sine" as the target, fix width = 32 and sweep depth from 1 to 5. The fit and the loss floor improve modestly with depth, but a 1-hidden-layer 32-wide network is already very close to the truth (loss ~ 1e-4). For smooth targets vanilla MLPs are basically a solved problem.
  • Smooth target, width sweep. Same target, fix depth = 2, sweep width from 4 to 64. The loss floor again improves: a width-4 network underfits visibly, width-32 is clean, width-64 is overkill. With two or more hidden layers the parameter count grows roughly quadratically with width, because each hidden-to-hidden weight matrix has width × width entries.
  • Spectral-bias regime. Switch target to "Three-frequency mix". Now no choice of depth or width fixes the loss floor — it sits around 1e-2 regardless. The network captures the low and medium frequencies but misses the high (9π) component. This is the hard floor that motivates §2.2 (Fourier features) and §2.3 (SIREN).
  • Activation comparison. On the smooth target switch between Tanh, Swish, and sin. All three converge well; Tanh and Swish are essentially indistinguishable, sin shows a slightly different basin (the SIREN-style oscillatory inductive bias is overkill but not harmful here). On the multi-frequency target, sin closes some of the spectral-bias gap on its own — a hint of why SIREN exists.
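
The parameter counts behind the depth and width sweeps can be reproduced with a few lines of Python. This is a sketch that assumes a scalar input and a scalar output (in_dim = out_dim = 1) and counts both weights and biases. The helper name count_params is hypothetical, and the widget's status strip may differ if it uses other input or output sizes.

```python
def count_params(depth, width, in_dim=1, out_dim=1):
    """Weights plus biases of a fully-connected MLP with `depth` hidden layers."""
    sizes = [in_dim] + [width] * depth + [out_dim]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# Width sweep at depth 2: the width * width hidden-to-hidden block dominates.
width_sweep = {w: count_params(depth=2, width=w) for w in (4, 8, 16, 32, 64)}
print(width_sweep)   # {4: 33, 8: 97, 16: 321, 32: 1153, 64: 4353}

# Depth sweep at width 32: each extra hidden layer adds 32 * 32 + 32 = 1056.
depth_sweep = {d: count_params(depth=d, width=32) for d in range(1, 6)}
print(depth_sweep)   # {1: 97, 2: 1153, 3: 2209, 4: 3265, 5: 4321}
```

Under these assumptions, doubling the width at depth 2 roughly quadruples the count once the network is wide, while adding a layer at fixed width adds a constant amount.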

The depth-vs-width trade-off, in one rule of thumb

For a fixed parameter budget, deeper-and-narrower networks usually outperform wider-and-shallower ones on PINN problems with PDE residuals. Reason: the residual loss differentiates the network output with respect to its inputs, often more than once, and each additional layer gives the network another round of nonlinear composition through which to shape those derivatives smoothly. Extra width adds capacity, but each layer remains a single nonlinear transform. As a starting point, 4–6 hidden layers of 32–64 neurons is a sensible default for a 1D or 2D PINN; deviate from it only when you have a specific reason.
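
To compare shapes at a fixed budget, it helps to compute which width each depth can afford. The sketch below assumes scalar input and output and uses two hypothetical helpers, count_params and widest_under_budget. It only lists the configurations to compare; it says nothing about which one trains better.

```python
def count_params(depth, width, in_dim=1, out_dim=1):
    """Weights plus biases of a fully-connected MLP with `depth` hidden layers."""
    sizes = [in_dim] + [width] * depth + [out_dim]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

def widest_under_budget(depth, budget, in_dim=1, out_dim=1):
    """Largest width whose parameter count does not exceed `budget`."""
    width = 0
    while count_params(depth, width + 1, in_dim, out_dim) <= budget:
        width += 1
    return width

# Take the budget of a 2-layer, 64-wide network and spend it at other depths.
budget = count_params(depth=2, width=64)          # 4353 parameters
for depth in (2, 4, 6):
    width = widest_under_budget(depth, budget)
    print(depth, width, count_params(depth, width))
# 2 64 4353
# 4 37 4330
# 6 28 4145
```

So, under these assumptions, a 2 × 64 network, a 4 × 37 network, and a 6 × 28 network all cost about the same number of parameters, which makes them a fair set for testing the rule of thumb on your own problem.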

What vanilla MLPs cannot do (alone)

Three categories of failure justify the rest of Part 2:

  • High-frequency targets: spectral bias keeps the high-frequency components from being learned. §2.2 (Fourier features) and §2.3 (SIREN) are the architectural fixes.
  • Strict boundary conditions: the soft-enforcement penalty cannot satisfy a Dirichlet condition exactly, and the loss-balance crisis (§1.4) makes it brittle. §2.4 (hard-constrained architectures) bakes the BC into the network output.
  • Multi-scale problems: when the solution has features on several length scales, neither a single Fourier scale nor a single SIREN frequency captures all of them. §2.5 (multi-scale architectures) handles this by combining features from several scales in one network.
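
As a small preview of what "baking the BC into the network output" can mean, the sketch below wraps an arbitrary function so that Dirichlet values on the unit interval hold exactly. The ansatz u(x) = a(1 - x) + bx + x(1 - x)·net(x) is one common construction, assumed here for illustration; §2.4 may use a different one. The names hard_constrained and raw_net are hypothetical, and raw_net is only a stand-in for an untrained MLP.

```python
import math

def hard_constrained(net, a, b):
    """Wrap net(x) so that u(0) = a and u(1) = b hold for any net."""
    def u(x):
        # The linear part carries the boundary values; the x * (1 - x)
        # factor vanishes at both ends, so net cannot disturb them.
        return a * (1.0 - x) + b * x + x * (1.0 - x) * net(x)
    return u

def raw_net(x):
    """Stand-in for an untrained network output (assumption)."""
    return math.tanh(3.0 * x - 1.0)

u = hard_constrained(raw_net, a=2.0, b=-1.0)
print(u(0.0), u(1.0))   # 2.0 -1.0
```

With this wrapper there is no boundary term left in the loss to balance against the PDE residual; only the interior of the domain depends on the network.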

The arc of Part 2 is: vanilla as the baseline (here), each major architectural fix in turn (§2.2 to §2.5), then a decision tree (§2.6) to pick the right one for a given problem. By the end you should be able to look at any PINN problem and pick its architecture from first principles, not by trial and error.

Pause-and-check. (1) For the smooth-sine target, what depth/width combination reaches the lowest loss floor with the smallest parameter count? Argue why depth and width are not equivalent for that purpose. (2) On the three-frequency target, the loss floor is ~1e-2 for almost every architecture. What is the single quantity in the loss decomposition that prevents going below this? (3) The default activation here is Tanh. Argue from §0.3 why ReLU would be a poor choice, even before considering spectral bias.

References

  • Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251–257.
  • Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 6. MIT Press.
  • Rahaman, N., Baratin, A., Arpit, D., et al. (2019). On the spectral bias of neural networks. ICML.
  • Wang, S., Yu, X., Perdikaris, P. (2022). When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys. 449, 110768.
