Why supervised ML alone cannot solve PDEs

Part 1 — The PINN formulation

Learning objectives

See, on a 1D toy, that supervised MLPs have no extrapolation guarantee outside their training data
Recognise that low training loss does not imply correct out-of-distribution behaviour
Identify three reasons supervised ML alone fails on PDE problems: extrapolation, sample efficiency, and physics-blindness
Be primed for the PINN solution introduced in §1.2

By the end of Part 0 you can train an MLP to fit any continuous function on its training data. That is a powerful capability, but for solving partial differential equations it is not enough. Three problems combine to make pure supervised learning the wrong tool for PDEs, and Part 1 is about replacing it with something better. This section is just about seeing the problems clearly.

Problem 1: no extrapolation guarantee

A neural network trained by minimising mean-squared error on a dataset is doing exactly that and nothing else: minimising in-distribution loss. It has no obligation to generalise outside the data. The widget below makes this brutally concrete.

Pick any of the four target functions, restrict the training data to a sub-interval, and press Play. The MLP will find a curve that fits the highlighted interval cleanly: the training loss will descend by orders of magnitude. But the prediction outside the interval is, in general, garbage. Try this on every target. Can the MLP extrapolate sin into another half-cycle? No. Can it extrapolate the Gaussian peak into its tails? It often diverges instead. Can it extrapolate the cubic? It flattens to nonsense.

This is not an artefact of the architecture or optimiser. Every universal approximator has this property. There are infinitely many functions that match the training data on a closed interval; supervised learning chooses one based on its inductive bias, not on the actual physics. For most physical problems, the inductive-bias-driven choice is wrong.
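The same experiment can be sketched without the widget. What follows is a minimal NumPy sketch under stated assumptions, not the widget's implementation: the one-hidden-layer tanh network, its width of 32, plain gradient descent, the learning rate, the step count, and the intervals [-1, 1] (training) and [2, 4] (extrapolation) are all chosen here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
WIDTH = 32  # hidden width (assumed, not the widget's setting)

# One hidden tanh layer: u(x) = tanh(x W1 + b1) W2 + b2
W1 = rng.normal(0.0, 1.0, (1, WIDTH))
b1 = np.zeros(WIDTH)
W2 = rng.normal(0.0, 1.0 / np.sqrt(WIDTH), (WIDTH, 1))
b2 = np.zeros(1)

def forward(x):
    """Return the prediction and the hidden activations."""
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2, h

def mse(x, y):
    return float(np.mean((forward(x)[0] - y) ** 2))

# Training data lives on [-1, 1]; the extrapolation test lives on [2, 4].
x_train = np.linspace(-1.0, 1.0, 64).reshape(-1, 1)
y_train = np.sin(x_train)
x_out = np.linspace(2.0, 4.0, 64).reshape(-1, 1)
y_out = np.sin(x_out)

lr = 0.01  # assumed learning rate for plain full-batch gradient descent
for _ in range(5000):
    pred, h = forward(x_train)
    g_pred = 2.0 * (pred - y_train) / len(x_train)   # dMSE / dprediction
    g_z = (g_pred @ W2.T) * (1.0 - h ** 2)           # back through tanh
    W2 -= lr * (h.T @ g_pred)
    b2 -= lr * g_pred.sum(axis=0)
    W1 -= lr * (x_train.T @ g_z)
    b1 -= lr * g_z.sum(axis=0)

train_mse = mse(x_train, y_train)
extrap_mse = mse(x_out, y_out)
print(f"MSE on [-1, 1] (training interval): {train_mse:.2e}")
print(f"MSE on [2, 4]  (never seen):        {extrap_mse:.2e}")
```

The loss only ever sees points in [-1, 1], so nothing in the update rule pulls the prediction toward sin on [2, 4]. Changing the seed, width, or step count changes the numbers but not that structural fact.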

Problem 2: hideous sample efficiency for PDE problems

Suppose you wanted to use supervised learning to learn the solution of an acoustic wave equation in 2D space and time. The solution u(x, z, t) is a function of three real variables. To cover this space densely enough that "all the training data is in-distribution" for a target volume of interest, you need a lot of samples — millions for moderate accuracy. And every sample requires running a wave-equation forward solve to generate the label. So pure supervised learning here means: run a classical wave solver millions of times to generate training data, then train a network to mimic the solver's output.

That is a workable approach for some problems (and is exactly what operator-learning approaches do, which we will meet in Part 8). But for a one-shot inverse problem — "given this one set of seismic recordings, find the velocity model that explains them" — there are no labels to begin with. Pure supervised learning has nothing to chew on.
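To get a feel for the sample counts, here is a small arithmetic sketch. It assumes, purely for illustration, that "dense coverage" means a uniform grid with the same number of points along each of the three axes (x, z, t); the function name and the resolutions tried are hypothetical.

```python
def dense_grid_samples(points_per_axis, n_axes=3):
    """Number of samples on a uniform grid over (x, z, t)."""
    return points_per_axis ** n_axes

# Hypothetical resolutions; the count grows as the cube of the resolution.
counts = {n: dense_grid_samples(n) for n in (50, 100, 200)}
for n, total in counts.items():
    print(f"{n} points per axis -> {total:,} samples")
```

Doubling the resolution along every axis multiplies the number of labelled samples by eight, which is why the count reaches the millions so quickly.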

Problem 3: physics-blindness

Even when supervised data exists, a vanilla MLP's prediction has no relationship to the underlying physics. If you train a network to predict acoustic wavefields and ask it for a wavefield between two training samples, the network interpolates between them but does not, in general, satisfy the wave equation. Rough-data interpolation is not the same as physics simulation. For applications where the prediction has to be physically consistent (which is most exploration seismology applications), this is a non-starter.
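A small numerical check makes the gap between interpolation and simulation visible. This is a sketch under stated assumptions: the 1D travelling wave sin(x - ct), the wave speed, the two snapshot times, and the use of linear-in-time blending as a stand-in for a physics-blind surrogate are all illustrative choices, not a trained network.

```python
import numpy as np

C = 1.0            # wave speed (assumed)
T0, T1 = 0.0, 0.5  # times of the two "training" snapshots (assumed)

def exact(x, t):
    # Travelling wave: satisfies u_tt = C^2 u_xx exactly.
    return np.sin(x - C * t)

def interpolated(x, t):
    # Stand-in for a physics-blind surrogate: blend the two
    # snapshots linearly in time. It matches the data at T0 and T1.
    a = (t - T0) / (T1 - T0)
    return (1.0 - a) * exact(x, T0) + a * exact(x, T1)

def wave_residual(u, x, t, h=1e-3):
    # Central finite differences for the residual u_tt - C^2 u_xx.
    u_tt = (u(x, t + h) - 2.0 * u(x, t) + u(x, t - h)) / h**2
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / h**2
    return u_tt - C**2 * u_xx

x = np.linspace(0.0, 2.0 * np.pi, 200)
t_mid = 0.25  # halfway between the two snapshots

res_exact = float(np.max(np.abs(wave_residual(exact, x, t_mid))))
res_interp = float(np.max(np.abs(wave_residual(interpolated, x, t_mid))))
print(f"max |residual|, exact wave:        {res_exact:.2e}")
print(f"max |residual|, interpolated wave: {res_interp:.2e}")
```

The blended field has zero second time derivative but a non-zero second space derivative, so its wave-equation residual is of order one even though it reproduces both snapshots exactly.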

Low training loss is not generalisation. The widget shows training loss dropping by 4–6 orders of magnitude while the prediction outside the interval is qualitatively wrong. The two phenomena are decoupled.
"More layers" or "more neurons" does not fix it. Crank N up to 64. The training fit gets better; the extrapolation does not. There are still infinitely many functions that fit the data, and the bigger network just searches a wider space of plausible-but-wrong extrapolations.
Activation choice rearranges the failure but does not fix it. Swap Tanh → sin: now the prediction outside oscillates rather than flattens. Swap to swish: another flavour of wrong. Each activation has its preferred extrapolation behaviour, none of which is "the underlying physics says X".

What a PINN does about it

The fix introduced by Raissi, Perdikaris, and Karniadakis (2017–2019) is conceptually simple: add a physics-based loss term. If the function we want is supposed to satisfy a PDE, we can penalise the network for violating that PDE at points other than the training samples. Suddenly the network has a reason to behave correctly outside the training interval: the PDE residual loss pushes it toward functions that obey the equation there. §1.2 sets up the formulation; §1.3 puts a real PINN to work on the canonical example, the 1D Burgers equation.
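The idea of a physics-based loss term can be sketched in a few lines. This is a hedged illustration, not the formulation of §1.2: it uses a 1D stand-in equation u'' + u = 0 (which sin satisfies), finite differences instead of the automatic differentiation a real PINN would use, and two hand-picked candidate functions instead of a trained network. The names `composite_loss`, `cubic`, and `weight` are hypothetical.

```python
import numpy as np

def second_derivative(u, x, h=1e-3):
    # Central finite difference; a real PINN would use autodiff here.
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2

def composite_loss(u, x_data, y_data, x_colloc, weight=1.0):
    """Data misfit plus the residual of u'' + u = 0 at collocation points."""
    data_term = float(np.mean((u(x_data) - y_data) ** 2))
    residual = second_derivative(u, x_colloc) + u(x_colloc)
    physics_term = float(np.mean(residual ** 2))
    return data_term, physics_term, data_term + weight * physics_term

# Labels exist only on [-0.5, 0.5]; collocation points cover [-4, 4].
x_data = np.linspace(-0.5, 0.5, 32)
y_data = np.sin(x_data)
x_colloc = np.linspace(-4.0, 4.0, 200)

def cubic(x):
    # Fits sin closely on the data interval, diverges from it outside.
    return x - x**3 / 6.0

data_sin, phys_sin, total_sin = composite_loss(np.sin, x_data, y_data, x_colloc)
data_cub, phys_cub, total_cub = composite_loss(cubic, x_data, y_data, x_colloc)
print(f"sin:   data {data_sin:.2e}  physics {phys_sin:.2e}")
print(f"cubic: data {data_cub:.2e}  physics {phys_cub:.2e}")
```

Both candidates have a negligible data term, so a purely supervised loss cannot tell them apart. The physics term can, because it is evaluated at collocation points that need no labels.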

Pause-and-check. (1) On the cubic target, set the training interval to [-0.3, 0.3] and run training. Where does the MLP's prediction first qualitatively diverge from the truth? What does this say about extrapolation distance? (2) Why is using more samples, or a bigger network, not a fix for problem 1? (3) Suppose the target is u(t) = e^(-t). What single piece of information about u, used as an auxiliary loss term, would force the network to extrapolate correctly?

References

Raissi, M., Perdikaris, P., Karniadakis, G.E. (2019). Physics-informed neural networks. J. Comput. Phys. 378, 686–707.
Karniadakis, G.E., Kevrekidis, I.G., Lu, L., Perdikaris, P., Wang, S., Yang, L. (2021). Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 5 (generalisation, extrapolation). MIT Press.
Xu, K., Zhang, M., Li, J., et al. (2021). How neural networks extrapolate: From feedforward to graph neural networks. ICLR.