Layers, depth, and universal approximation

Neural networks from absolute zero

Learning objectives

See that stacking many simple neurons produces an extraordinarily expressive function approximator
Watch a real MLP train in your browser and fit an arbitrary 1D target
Recognise that the output of a 1-hidden-layer MLP is literally the sum of its hidden neurons' contributions, plus a bias
Read off how target complexity, hidden width, and activation choice trade against each other

In §0.1 we stripped a single neuron down to its three knobs. A single neuron, on its own, is a one-trick pony: it can only bend the input through one fixed nonlinearity. The interesting thing is what happens when we compose them.

From neuron to layer to network

A layer is just several neurons that all see the same inputs. If we have $N$ neurons in a layer, each with its own weight vector and bias, the layer's output is an $N$-dimensional vector, one number per neuron. Stack two layers, and the second layer's neurons take the first layer's outputs as their inputs. Stack many layers, and you have a "deep" network.
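As a minimal sketch of that description, a layer can be written as one matrix-vector product followed by an activation. The function name `layer`, the sizes, and the random weights below are illustrative assumptions, not anything fixed by this course.

```python
import numpy as np

def layer(x, W, b, activation=np.tanh):
    """One layer: every neuron sees the same input vector x.

    W has shape (N, D): row j is neuron j's weight vector.
    b has shape (N,):   one bias per neuron.
    Returns an N-dimensional vector, one number per neuron.
    """
    return activation(W @ x + b)

rng = np.random.default_rng(0)   # illustrative random weights, not trained
x = np.array([0.5, -1.0, 2.0])   # D = 3 inputs

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer: 4 neurons
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer: 2 neurons

h1 = layer(x, W1, b1)    # shape (4,)
h2 = layer(h1, W2, b2)   # shape (2,): layer 2 takes layer 1's outputs as inputs
print(h1.shape, h2.shape)
```

Stacking is nothing more than feeding one call's output into the next call.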

For most of Part 0 we will work with a particular tiny architecture: one input, one hidden layer with $N$ neurons, and one output. Schematically:

$$x \;\rightarrow\; \big[\,h_1, h_2, \ldots, h_N\,\big] \;\rightarrow\; y$$

Each hidden neuron computes $h_j = \sigma(w_j x + b_j)$, exactly the single-neuron formula from §0.1. The output then takes a weighted sum of those hidden activations and adds an output bias:

$$y \;=\; \sum_{j=1}^{N} v_j \, \sigma(w_j x + b_j) \;+\; c$$

Read that equation slowly. The output is literally a sum of $N$ scaled, shifted activation functions, plus a constant. If you can fit any continuous function with such a sum (for large enough $N$), then the network is "universal".
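The output formula translates almost symbol for symbol into code. The sketch below evaluates it on a handful of inputs and also returns each neuron's separate contribution $v_j \sigma(w_j x + b_j)$. The function name `mlp_1d` and the hand-picked parameter values are assumptions made purely for illustration.

```python
import numpy as np

def mlp_1d(x, v, w, b, c, sigma=np.tanh):
    """Evaluate y = sum_j v_j * sigma(w_j * x + b_j) + c at each point in x.

    x: shape (M,) inputs; v, w, b: shape (N,) parameters; c: scalar output bias.
    Returns (y, contributions), where contributions[i, j] = v_j * sigma(w_j * x_i + b_j).
    """
    h = sigma(np.outer(x, w) + b)      # hidden activations, shape (M, N)
    contributions = h * v              # scale each neuron's column by its v_j
    y = contributions.sum(axis=1) + c  # add the neurons up, then the output bias
    return y, contributions

# Hypothetical hand-picked parameters for N = 3 neurons.
v = np.array([1.0, -0.5, 0.25])
w = np.array([2.0, 1.0, -3.0])
b = np.array([0.0, 0.5, 1.0])
c = 0.1

x = np.linspace(-1.0, 1.0, 5)
y, contributions = mlp_1d(x, v, w, b, c)
print(y)
```

Each column of `contributions` is one neuron's curve; summing the columns and adding `c` reproduces `y` exactly.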

The universal approximation theorem

That is exactly what was proved, in several independent forms, between 1989 and 1991. The result, in plain English:

**Universal approximation theorem** (Cybenko 1989; Hornik, Stinchcombe & White 1989; Hornik 1991). For any continuous function $f : [a, b] \to \mathbb{R}$ and any tolerance $\varepsilon > 0$, there exists an $N$ and a choice of weights $\{v_j, w_j, b_j\}$ such that the 1-hidden-layer network above approximates $f$ uniformly on $[a, b]$ to within $\varepsilon$.

The classical statements assume a sigmoidal or, more generally, a bounded, non-constant continuous activation (Hornik 1991); later generalisations require only that $\sigma$ be non-polynomial, which covers Tanh, Sigmoid, ReLU, Swish, and sin alike. The theorem says nothing about how large $N$ must be, nor whether the optimiser will find the right weights. In practice both questions matter a lot, and the rest of Part 0 is largely about answering them.
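To make the "there exists an $N$ and a choice of weights" claim concrete, here is a minimal sketch that builds the weights by hand rather than training them. It places one steep Tanh "step" in each of $N$ equal cells of $[a, b]$, so the network traces a smoothed staircase through the target. This is an illustrative construction, not the argument used in the cited proofs; the function names and the `sharpness` parameter are assumptions.

```python
import numpy as np

def staircase_weights(f, a, b, N, sharpness=10.0):
    """Hand-build (v, w, bias, c) so that a 1-hidden-layer Tanh network
    approximates f on [a, b] with N steep steps. No training involved."""
    grid = np.linspace(a, b, N + 1)          # x_0, ..., x_N
    mids = 0.5 * (grid[:-1] + grid[1:])      # step locations m_1, ..., m_N
    jumps = np.diff(f(grid))                 # f(x_k) - f(x_{k-1})
    s = sharpness * N / (b - a)              # steepness grows as cells shrink
    v = 0.5 * jumps                          # each tanh spans a height of 2
    w = np.full(N, s)
    bias = -s * mids                         # centres neuron k at m_k
    c = f(grid[0]) + v.sum()                 # so the output starts near f(x_0)
    return v, w, bias, c

def network(x, v, w, bias, c):
    """y = sum_j v_j * tanh(w_j * x + bias_j) + c, evaluated at each x."""
    return np.tanh(np.outer(x, w) + bias) @ v + c

f = lambda x: np.sin(2 * np.pi * x)
xs = np.linspace(0.0, 1.0, 4001)
max_err = {}
for N in (10, 40, 160):
    params = staircase_weights(f, 0.0, 1.0, N)
    max_err[N] = np.max(np.abs(network(xs, *params) - f(xs)))
    print(N, max_err[N])
```

The worst-case error on the sample grid shrinks as $N$ grows, which is the theorem's promise in miniature. A trained network usually finds far more economical weights than this staircase does.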

Try it

This is the first widget where a real neural network trains on your machine. Pick a target, pick a hidden width, choose an activation, and press Play. The blue curve is the network's current output $y(x)$. The faint coloured curves (visible when $N \le 12$) are each hidden neuron's individual contribution $v_j \sigma(w_j x + b_j)$; watch them sum to the blue curve. The lower panel is the mean-squared-error loss on a logarithmic scale; you should see it drop several orders of magnitude in the first few seconds.
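For readers who want to see the same kind of training outside the browser, here is a minimal NumPy sketch of fitting a 1-hidden-layer Tanh network to a 1D target by mean-squared error. The widget's actual optimiser, initialisation, and settings are not stated here, so plain full-batch gradient descent, the learning rate, the step count, and the sine target are all assumptions. The gradient formulas are a preview of material covered in later sections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target and training data (assumed: a smooth sine on [-1, 1]).
x = np.linspace(-1.0, 1.0, 64)
target = np.sin(np.pi * x)

# Parameters of y = sum_j v_j * tanh(w_j * x + b_j) + c, with N hidden neurons.
N = 8
w = rng.normal(size=N)
b = rng.normal(size=N)
v = rng.normal(size=N) / np.sqrt(N)
c = 0.0

lr = 0.02          # learning rate (assumed, untuned)
losses = []
for step in range(3000):
    # Forward pass.
    h = np.tanh(np.outer(x, w) + b)        # hidden activations, shape (64, N)
    y = h @ v + c
    err = y - target
    losses.append(np.mean(err ** 2))       # mean-squared error

    # Backward pass: gradients of the MSE via the chain rule.
    dy = 2.0 * err / x.size                # dL/dy, shape (64,)
    dv = h.T @ dy
    dc = dy.sum()
    dz = np.outer(dy, v) * (1.0 - h ** 2)  # d tanh(z)/dz = 1 - tanh(z)^2
    dw = x @ dz
    db = dz.sum(axis=0)

    # Plain gradient-descent update.
    v -= lr * dv
    w -= lr * dw
    b -= lr * db
    c -= lr * dc

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

Plotting `losses` on a logarithmic axis gives the analogue of the widget's lower panel, and the columns of `h * v` are the per-neuron contribution curves.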

Three experiments worth doing right now

Width sweep: With "Smooth sine" as the target, train at $N = 1$. The fit will be terrible, because a single Tanh can only bend up once. Bump $N$ to 4, then 8, then 16. Each time you should see the loss floor drop by roughly an order of magnitude, and the per-neuron contributions reveal how the network shares the work between neurons.
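Hard target: Switch to "Sawtooth", keep $N = 8$, and train. With eight neurons a Tanh network can only get so close to a discontinuous periodic target. This is the universal-approximation theorem's "for any tolerance you can find an $N$" caveat made vivid, with one extra wrinkle: the theorem speaks about continuous targets, so at the jumps themselves a smooth network can only shrink the region of large error, never remove it. Push $N$ to 24 to close most of the gap.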
Wrong activation: Switch to the "Step" target, set the activation to tanh with $N = 4$, train, then switch the activation to relu. ReLU's piecewise-linear inductive bias matches a step almost perfectly; Tanh has to invest several neurons to approximate the same shape. Different activations have very different "vocabularies", which is the topic of §0.3.

Depth (and why we are not exploring it yet)

The widget above has exactly one hidden layer. Depth — stacking multiple hidden layers — buys an additional kind of expressivity that a single wide layer cannot match efficiently (Telgarsky 2016 made this rigorous). In Part 2 we will build deeper networks and see why depth matters for high-frequency functions and for parametric problems. But for the universal-approximation idea, one wide layer is enough, and the per-neuron-contribution visualisation only works cleanly with one layer. We will earn depth in Part 2.
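A small taste of why depth can be efficient, in the spirit of the sawtooth construction behind that result: a two-neuron ReLU layer can implement a "tent" on $[0, 1]$, and composing it with itself doubles the number of teeth each time. The sketch below is an illustrative assumption about how to demonstrate this, with hypothetical function names, not code taken from the cited paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    """One hidden layer of two ReLU neurons: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent map `depth` times, i.e. `depth` hidden layers of width 2."""
    for _ in range(depth):
        x = tent(x)
    return x

# Dyadic grid so that every peak lands exactly on a sample point.
xs = np.arange(2**10 + 1) / 2**10

for depth in (1, 2, 3, 5):
    ys = deep_sawtooth(xs, depth)
    peaks = int(np.sum(np.isclose(ys, 1.0)))
    print(f"depth={depth}: {2 * depth} ReLU neurons, {peaks} peaks")
```

The neuron count grows linearly with depth while the number of peaks doubles per layer. A single hidden ReLU layer with $N$ neurons has at most $N + 1$ linear pieces in 1D, so matching the same sawtooth with one layer needs a number of neurons that grows exponentially in the depth used here.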

What you now know

A neural network is a sum of neurons. With enough neurons of a non-polynomial activation, that sum can approximate any continuous function. The practical question of how you actually find the right weights $\{v_j, w_j, b_j\}$ is what §0.4 through §0.8 are for: loss functions, gradient descent, the chain rule, auto-differentiation, and training loops. By §0.8 you will have trained networks the same way the widget above does, but consciously instead of by black magic.

Pause-and-check. (1) For a 1-hidden-layer MLP with $N = 4$ Tanh neurons, how many trainable scalar parameters are there in total (count the weights, the biases, and the output bias)? (2) If the target is the constant function $f(x) = 0.5$, what is the smallest $N$ that can fit it exactly? (3) The universal approximation theorem promises that some $N$ exists. What does it not promise? Drag the widget sliders until each answer is unambiguous, then move on to §0.3.

References

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks 4(2), 251–257.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 6. MIT Press.
LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature 521, 436–444.