Activation functions and inductive bias

Neural networks from absolute zero

Learning objectives

See in concrete shape what each activation’s inductive bias looks like — ReLU’s polyhedral creases, Tanh’s smooth bends, sin’s oscillations
Match an activation to a target class on first principles
Anticipate which activations will be PINN-friendly (smoothly differentiable to high orders) and which will not
Read a side-by-side fit comparison and rank activations by suitability for a given problem

In §0.1 you met a single neuron and noticed that the activation function controlled the shape of its response. In §0.2 we stacked many neurons, and a single activation choice produced a fit. The natural follow-up is: what changes if we change the activation? The answer is more dramatic than most beginners expect, and it is one of the main reasons a working machine-learning practitioner has to think before grabbing the obvious default.

Inductive bias, in one paragraph

Every neural network architecture has an inductive bias: a built-in preference for some kinds of solutions over others. The activation function is one of the largest contributors to that bias. A network of ReLUs wants to produce piecewise-linear functions; you can fit smoother things with enough of them, but the underlying vocabulary is polygonal. A network of Tanhs wants to produce smooth functions with rounded transitions; you can approximate a step with steeply sloped Tanhs, but the discontinuity will always be a little soft. A network of sines wants to oscillate. None of these is wrong; they are different shapes of "willingness".
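To make the "polygonal vocabulary" concrete, here is a minimal sketch in plain Python of a one-hidden-layer ReLU network. The weights `w`, `b`, and `a` are made-up values chosen for illustration; they are not taken from the widget. Each hidden unit contributes one crease at $x = -b_i / w_i$, and between neighbouring creases the output is exactly a straight line.

```python
def relu(z):
    return max(0.0, z)

# Hypothetical weights for a 3-unit hidden layer (illustration only).
w = [1.0, -2.0, 0.5]   # input-to-hidden weights
b = [0.5, 1.0, -1.0]   # hidden biases
a = [1.5, -0.7, 2.0]   # hidden-to-output weights

def relu_net(x):
    """One-hidden-layer ReLU network with an identity output."""
    return sum(ai * relu(wi * x + bi) for ai, wi, bi in zip(a, w, b))

# Unit i switches on or off where w_i * x + b_i = 0, i.e. at x = -b_i / w_i.
creases = sorted(-bi / wi for wi, bi in zip(w, b))

def is_linear_on(lo, hi, tol=1e-12):
    """True if the midpoint value equals the average of the endpoint values."""
    mid = 0.5 * (lo + hi)
    return abs(relu_net(mid) - 0.5 * (relu_net(lo) + relu_net(hi))) < tol

print("creases at x =", creases)
for lo, hi in zip(creases[:-1], creases[1:]):
    print(f"linear on [{lo}, {hi}]:", is_linear_on(lo, hi))
# An interval that straddles a crease is not linear.
print("linear on [-1.5, 0.5]:", is_linear_on(-1.5, 0.5))
```

Adding more hidden units adds more creases, which is how a ReLU network approximates a curve: with a finer polygon, not with a smoother one.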

For physics-informed neural networks specifically, inductive bias matters even more than for standard supervised learning, because the loss involves derivatives of the network output. A wave-equation PINN computes $\partial^2 u / \partial t^2$; an eikonal PINN computes $|\nabla T|^2$; an FWI-PINN computes mixed first and second derivatives. ReLU's second derivative is zero almost everywhere and undefined at zero, so a ReLU network contributes nothing useful to a second-order residual. Tanh, Swish, and sin are infinitely differentiable everywhere. That single fact is why most production PINN code uses Tanh or Swish, and why SIREN (sinusoidal networks, Sitzmann et al., 2020) is a serious contender for high-frequency problems despite being only a few years old.
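The second-derivative claim is easy to check numerically. The sketch below uses a central finite difference as a stand-in for the automatic differentiation a real PINN would use; the step size `h`, the sample points, and the choice of Swish with $\beta = 1$ are assumptions made for this illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def swish(z):
    # Swish with beta = 1 (also known as SiLU): z * sigmoid(z).
    return z / (1.0 + math.exp(-z))

activations = {"relu": relu, "tanh": math.tanh, "swish": swish, "sin": math.sin}

def second_derivative(f, z, h=1e-3):
    """Central finite-difference estimate of f''(z)."""
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / (h * h)

# Away from zero, ReLU's second derivative vanishes; the smooth
# activations give non-trivial values that a PDE residual can use.
for name, f in activations.items():
    values = [second_derivative(f, z) for z in (-1.0, 0.5, 2.0)]
    print(f"{name:>5}: " + "  ".join(f"{v:+.4f}" for v in values))

# At z = 0 the ReLU estimate equals 1/h, so it grows without bound
# as h shrinks: the second derivative is not defined there.
for h in (1e-2, 1e-3):
    print(f"relu at 0 with h = {h}: {second_derivative(relu, 0.0, h):.1f}")
```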

Try it

Five identical 1-hidden-layer MLPs train in parallel in the widget above, on the same data, with the same hidden width $N$ and the same Adam optimiser. The only thing that differs between them is the activation function. Press Play to start them all at once. Each panel shows the target curve in gray and that activation's fit in its own colour. The bar chart at the bottom shows final mean-squared-error losses on a logarithmic scale; differences of one or two orders of magnitude are typical.
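If you want to reproduce the comparison offline, the following NumPy sketch trains the same kind of network once per activation. It is not the widget's code: the initialisation, learning rate, step count, seed, and the Gaussian target formula are all assumptions picked for illustration, and the gradients are written out by hand so the block has no dependency beyond NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# name -> (activation, derivative of the activation)
ACTIVATIONS = {
    "relu": (lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float)),
    "tanh": (np.tanh, lambda z: 1.0 - np.tanh(z) ** 2),
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "swish": (lambda z: z * sigmoid(z),
              lambda z: sigmoid(z) * (1.0 + z * (1.0 - sigmoid(z)))),
    "sin": (np.sin, np.cos),
}

def loss_and_grads(params, x, y, act, dact):
    """MSE of the network a . act(w * x + b) + c, and its gradients."""
    w, b, a, c = params
    z = np.outer(x, w) + b                 # (n_samples, width) pre-activations
    h = act(z)
    err = h @ a + c - y                    # identity output layer
    dz = (2.0 / len(x)) * err[:, None] * a * dact(z)
    grads = [x @ dz, dz.sum(axis=0), (2.0 / len(x)) * (h.T @ err),
             np.array([2.0 * err.mean()])]
    return float(np.mean(err ** 2)), grads

def train(name, x, y, width=12, steps=2000, lr=1e-2, seed=0):
    """Full-batch Adam on one activation; returns the final MSE."""
    act, dact = ACTIVATIONS[name]
    rng = np.random.default_rng(seed)
    params = [rng.normal(0, 1, width), rng.normal(0, 1, width),
              rng.normal(0, 1 / np.sqrt(width), width), np.zeros(1)]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    beta1, beta2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        _, grads = loss_and_grads(params, x, y, act, dact)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)   # bias-corrected second moment
            params[i] = params[i] - lr * m_hat / (np.sqrt(v_hat) + eps)
    return loss_and_grads(params, x, y, act, dact)[0]

x = np.linspace(-1.0, 1.0, 128)
y = np.exp(-x ** 2 / (2 * 0.2 ** 2))   # assumed stand-in for a smooth Gaussian peak
losses = {name: train(name, x, y) for name in ACTIVATIONS}
for name, loss in sorted(losses.items(), key=lambda kv: kv[1]):
    print(f"{name:>8}: final MSE = {loss:.2e}")
```

Every run shares the same seed, data, width, and optimiser settings, so any difference in the printed losses comes from the activation alone. Swap the definition of `y` to try other targets; the ranking you get depends on these assumed settings and need not match the widget's.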

Four experiments worth doing

Sharp step. With "Sharp step" as the target, train all five with $N = 12$. ReLU's polyhedral inductive bias matches a step almost perfectly: you should see it converge an order of magnitude faster than Tanh and Swish. Sigmoid, with its bounded saturating shape, is a close second. sin oscillates around the step (Gibbs-style ringing) and its loss stays high.
Two-frequency mix. Switch to the multi-frequency target. Now the high-frequency component dominates the loss, and ReLU and the smooth saturating activations all struggle with the fast oscillation. sin (or Tanh / Swish at very large $N$) is your only hope. This previews spectral bias, which we name and tame in §0.9.
Sawtooth. The sawtooth is discontinuous and periodic. ReLU's piecewise-linear vocabulary fits it almost exactly with $N$ no larger than twice the number of teeth; Tanh blurs each discontinuity; sin tries to be helpful but oscillates incorrectly between the teeth.
Smooth Gaussian peak. The smooth bump rewards smooth activations. Tanh, Sigmoid, and Swish all fit it cleanly at $N = 8$; ReLU produces a visible polygon if $N$ is small.
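The step experiment can also be explored by hand, without any training. The sketch below builds a step approximation from two ReLU units and another from one Tanh unit, matched to have the same slope `k` at the jump, and compares their mean-squared error against an ideal step on $[-1, 1]$. The function names, the grid, and the slope values are assumptions for illustration, and because nothing is trained, this shows the shape argument only, not the convergence speed seen in the widget.

```python
import math

def relu(z):
    return max(0.0, z)

def step(x):
    return 1.0 if x >= 0.0 else 0.0

def relu_ramp(x, k):
    """Two ReLU units: rises linearly from 0 to 1 over a window of width 1/k."""
    return relu(k * x + 0.5) - relu(k * x - 0.5)

def tanh_step(x, k):
    """One Tanh unit rescaled to the range (0, 1), with slope k at x = 0."""
    return 0.5 * (1.0 + math.tanh(2.0 * k * x))

def mse(f, k, n=2001):
    """Mean-squared error against the ideal step on a uniform grid over [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return sum((f(x, k) - step(x)) ** 2 for x in xs) / n

for k in (5.0, 20.0, 80.0):
    print(f"k = {k:5.1f}   ReLU ramp MSE = {mse(relu_ramp, k):.5f}   "
          f"Tanh MSE = {mse(tanh_step, k):.5f}")
```

The ReLU ramp is exactly flat outside its transition window, whereas the Tanh version keeps small tails on both sides, so at matched slope the ramp's error is somewhat lower.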

Why we mostly use Tanh and Swish for PINNs

The PINN community largely uses Tanh and Swish for two converging reasons. First, both are smooth and infinitely differentiable, so the second-derivative terms in PDE residuals are well-defined and well-behaved. Second, both produce reasonable fits across a wide range of target classes — Tanh struggles slightly more with discontinuities, Swish slightly more with sharp peaks, but neither produces the catastrophic failures ReLU produces on smooth targets or sin produces on monotone targets. They are the safe defaults. We will revisit the choice in Part 2 when we look at SIREN, Fourier-feature MLPs, and other high-frequency-friendly architectures.
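To see how the smoothness of the activation carries over to the residual, consider a worked case under the assumption of a one-hidden-layer network with a scalar input, $u(x) = c + \sum_{i=1}^{N} a_i\,\sigma(w_i x + b_i)$. The symbols $a_i$, $w_i$, $b_i$, and $c$ are notation introduced here for illustration. Differentiating twice with the chain rule gives

$$\frac{\partial^2 u}{\partial x^2} = \sum_{i=1}^{N} a_i\, w_i^2\, \sigma''(w_i x + b_i).$$

For three of the activations discussed here,

$$\tanh''(z) = -2\tanh(z)\,\bigl(1 - \tanh^2 z\bigr), \qquad \sin''(z) = -\sin z, \qquad \mathrm{ReLU}''(z) = 0 \ \text{ for } z \neq 0.$$

So a Tanh or sin network has a second derivative that varies smoothly with the weights, while for a ReLU network the sum vanishes wherever it is defined.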

What about the output activation?

Throughout Part 0, every MLP's output layer uses the identity activation — the network output is unbounded and continuous. This is the right default for regression problems, including all PINN forward and inverse problems we will meet in Parts 1–9. Bounded output activations (Sigmoid, Tanh) on the output layer are appropriate only when the quantity you predict is bounded by physics (e.g., a probability, a normalised reflectivity). For PINN-velocity inversion we sometimes use a bounded output to enforce $v_p > 0$ — but that is a hard-constraint trick we will see in §1.5, not an inductive-bias choice.
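As a minimal sketch of the difference between the two output choices, the code below contrasts the identity output with one common way to bound a predicted velocity: a rescaled Sigmoid. The bounds `V_MIN` and `V_MAX` and the function names are hypothetical, chosen only for illustration; the construction actually used in §1.5 may differ.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical velocity bounds in m/s (illustration only).
V_MIN, V_MAX = 1500.0, 5000.0

def identity_output(z):
    """Default regression head: the raw network output, unbounded."""
    return z

def bounded_velocity(z):
    """Map a raw output z to a velocity strictly inside (V_MIN, V_MAX)."""
    return V_MIN + (V_MAX - V_MIN) * sigmoid(z)

for z in (-5.0, 0.0, 5.0):
    print(f"z = {z:+.1f}   identity = {identity_output(z):+.1f}   "
          f"bounded v_p = {bounded_velocity(z):.1f}")
```

Because `V_MIN` is positive, every value of the raw output maps to a positive velocity, which is what makes this a hard constraint and not a penalty term.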

Pause-and-check. (1) Why is ReLU a poor choice for a PINN whose loss involves $\partial^2 u / \partial x^2$? (2) On the "Smooth Gaussian peak" target, which activation produces the cleanest fit at $N = 8$? Does that match your intuition about the activation's inductive bias? (3) The widget shows sin losing on most non-periodic targets. What does this suggest about when you should choose sin in practice?

References

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 6.3 (activation functions). MIT Press.
Glorot, X., Bordes, A., Bengio, Y. (2011). Deep sparse rectifier neural networks. AISTATS.
Sitzmann, V., Martel, J.N.P., Bergman, A.W., Lindell, D.B., Wetzstein, G. (2020). Implicit neural representations with periodic activation functions (SIREN). NeurIPS.
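Raissi, M., Perdikaris, P., Karniadakis, G.E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707.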