The chain rule and backpropagation

Neural networks from absolute zero

Learning objectives

Compute every gradient in a 1-2-1 MLP by hand, one chain-rule step at a time
See backpropagation as nothing more than the chain rule applied carefully right-to-left
Recognise the three-factor pattern (downstream gradient × local Jacobian × upstream factor) at every node
Understand why backpropagation is linear in the number of parameters, the property that makes deep learning tractable

In §0.5 we computed the gradient of the loss with respect to the two parameters of a one-Tanh-neuron model by hand. That worked because there were only two parameters and one composition of functions. As soon as you have more layers, computing the gradient by hand becomes a bookkeeping nightmare, unless you adopt the systematic right-to-left procedure called backpropagation.

The substance of backpropagation is the chain rule from calculus. The trick is the order: start from the loss and walk backwards through the network, computing local Jacobians and accumulating downstream gradients as you go. The result is that the gradients of all parameters are computed together in time proportional to the number of parameters: not exponential, not even quadratic. This is the algorithmic miracle that makes training a million-parameter network feasible.

The chain rule, in the only form you need

Suppose $y = f(g(h(x)))$. The chain rule says

$$\frac{dy}{dx} \;=\; f'(g(h(x)))\cdot g'(h(x))\cdot h'(x)$$

Each factor is the derivative of one operation, evaluated at its input. The chain rule in a network is just this, applied node by node. The key bookkeeping insight: if I want $\partial L / \partial w$ for some interior parameter $w$, I can write it as

$$\frac{\partial L}{\partial w} \;=\; \underbrace{\frac{\partial L}{\partial z}}_{\text{downstream}}\cdot \underbrace{\frac{\partial z}{\partial w}}_{\text{local}}$$

where $z$ is the variable that immediately depends on $w$. The downstream factor $\partial L / \partial z$ is the same for every parameter that feeds $z$. So if I compute it once, I get many parameter gradients almost for free. That is the entire optimisation behind backpropagation.
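The reuse of the downstream factor can be checked numerically on a single Tanh neuron. The sketch below is illustrative only: the squared-error loss $L = \tfrac{1}{2}(h - y)^2$, the function names, and the numeric values are assumptions, not taken from the text. It computes $\partial L / \partial z$ once, multiplies it by the two local factors, and compares the results with central finite differences.

```python
import math

def neuron_gradients(w, b, x, y):
    """Gradients of L = 0.5 * (tanh(w*x + b) - y)**2 with respect to w and b."""
    z = w * x + b                    # the variable that depends directly on w and b
    h = math.tanh(z)
    dL_dh = h - y                    # derivative of the assumed squared-error loss
    dL_dz = dL_dh * (1.0 - h * h)    # downstream factor, computed once
    dL_dw = dL_dz * x                # local factor dz/dw = x
    dL_db = dL_dz * 1.0              # local factor dz/db = 1
    return dL_dw, dL_db

def loss(w, b, x, y):
    return 0.5 * (math.tanh(w * x + b) - y) ** 2

# Hypothetical values, chosen only for illustration.
w, b, x, y = 0.7, -0.2, 1.5, 0.3
dL_dw, dL_db = neuron_gradients(w, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
fd_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
fd_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print("dL/dw:", dL_dw, "finite difference:", fd_w)
print("dL/db:", dL_db, "finite difference:", fd_b)
```

The two parameter gradients differ only in their local factor ($x$ versus 1); everything else is shared.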

Try it

This network has three layers: one input $x$, two Tanh hidden neurons $h_1, h_2$, and one identity-output neuron $\hat{y}$. There are seven parameters in total: $w_{11}, b_1, w_{12}, b_2, w_{o1}, w_{o2}, b_o$. Press Forward step seven times to fill in every node's value (the diagram lights up as you go). Then press Backward step repeatedly to propagate gradients back, one chain-rule application per click. The log on the right records each step's formula and numeric value so you can see exactly which factors multiplied to produce which gradient.
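For readers without the interactive diagram at hand, the forward pass of this 1-2-1 network can be sketched in a few lines of Python. The parameter values and the dictionary layout are hypothetical, and the way the computation is split into named intermediate values is one reasonable choice that may not match the widget's steps exactly.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
params = {"w11": 0.5, "b1": 0.1, "w12": -0.8, "b2": 0.3,
          "wo1": 1.2, "wo2": -0.6, "bo": 0.05}

def forward(p, x):
    """Forward pass of the 1-2-1 MLP, keeping every intermediate value."""
    z1 = p["w11"] * x + p["b1"]                        # pre-activation of hidden neuron 1
    h1 = math.tanh(z1)
    z2 = p["w12"] * x + p["b2"]                        # pre-activation of hidden neuron 2
    h2 = math.tanh(z2)
    zo = p["wo1"] * h1 + p["wo2"] * h2 + p["bo"]       # output pre-activation
    y_hat = zo                                         # identity output
    return {"x": x, "z1": z1, "h1": h1, "z2": z2, "h2": h2, "zo": zo, "y_hat": y_hat}

trace = forward(params, 1.0)
for name, value in trace.items():
    print(f"{name} = {value:.4f}")
```

Storing the intermediate values matters: the backward sweep reads them back instead of recomputing them.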

The pattern, made visible

Watch the backward sweep carefully. Every backward step has the same shape: downstream gradient × local Jacobian. For a weight $w$ feeding into pre-activation $z$, the local Jacobian is just the input on the other side of $w$. For a Tanh activation $h = \tanh(z)$, the local Jacobian is $\tanh'(z) = 1 - \tanh^2(z)$. For a sum, the local Jacobian is 1. That is it. Every neural network gradient ever computed is one of those three patterns, repeated at scale.

Notice also that the gradient at $h_1$, $\partial L / \partial h_1$, is computed once and then reused: it gives $\partial L / \partial z_1$, and that single value in turn yields both parameter gradients $\partial L / \partial w_{11}$ and $\partial L / \partial b_1$. That sharing is what gives backprop its linear cost. If we had instead computed each parameter's gradient from scratch via the full chain expansion, we would have re-evaluated the downstream factor for every parameter, at a cost that grows roughly quadratically with depth.
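The whole backward sweep for the 1-2-1 network fits in a short function. The sketch below assumes a squared-error loss $L = \tfrac{1}{2}(\hat{y} - y)^2$ and uses hypothetical parameter values; the function name `backward_1_2_1` is invented here and is not part of any library. Each line is one "downstream × local" step, and a finite-difference loop checks all seven gradients.

```python
import math

# Hypothetical parameter values and data point, chosen only for illustration.
params = {"w11": 0.5, "b1": 0.1, "w12": -0.8, "b2": 0.3,
          "wo1": 1.2, "wo2": -0.6, "bo": 0.05}
x, y = 1.0, 0.4

def forward(p, x):
    h1 = math.tanh(p["w11"] * x + p["b1"])
    h2 = math.tanh(p["w12"] * x + p["b2"])
    y_hat = p["wo1"] * h1 + p["wo2"] * h2 + p["bo"]    # identity output
    return h1, h2, y_hat

def loss(p, x, y):
    return 0.5 * (forward(p, x)[2] - y) ** 2           # assumed squared-error loss

def backward_1_2_1(p, x, y):
    h1, h2, y_hat = forward(p, x)
    dL_dzo = y_hat - y                                 # loss derivative; identity output
    grads = {
        "bo": dL_dzo,                                  # sum node: local Jacobian is 1
        "wo1": dL_dzo * h1,                            # weight: local Jacobian is its input
        "wo2": dL_dzo * h2,
    }
    dL_dh1 = dL_dzo * p["wo1"]                         # pass the gradient back through wo1
    dL_dh2 = dL_dzo * p["wo2"]
    dL_dz1 = dL_dh1 * (1.0 - h1 * h1)                  # Tanh: local Jacobian is 1 - tanh^2
    dL_dz2 = dL_dh2 * (1.0 - h2 * h2)
    grads["w11"], grads["b1"] = dL_dz1 * x, dL_dz1     # dL_dz1 shared by w11 and b1
    grads["w12"], grads["b2"] = dL_dz2 * x, dL_dz2     # dL_dz2 shared by w12 and b2
    return grads

grads = backward_1_2_1(params, x, y)

# Independent check with central finite differences.
eps = 1e-6
max_err = 0.0
for name in params:
    plus, minus = dict(params), dict(params)
    plus[name] += eps
    minus[name] -= eps
    fd = (loss(plus, x, y) - loss(minus, x, y)) / (2 * eps)
    max_err = max(max_err, abs(fd - grads[name]))
    print(f"dL/d{name}: backprop = {grads[name]: .6f}, finite difference = {fd: .6f}")
print("largest discrepancy:", max_err)
```

Note the cost asymmetry: the backward function runs one forward pass and a handful of multiplications, while the finite-difference check needs two forward passes per parameter.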

What the runtime does for you

You will not, in practice, write out backprop equations by hand for a real PINN. The shared nn-runtime.js module the textbook is built on does exactly the procedure above for arbitrary depth. The backward(trace, dLdy) function returns the same kind of per-parameter gradients you just computed by hand, but for an MLP of any size. You are, in effect, looking at the entire substance of what an automatic-differentiation framework (PyTorch, JAX, TensorFlow) does, in a teaching version you can step through one operation at a time. §0.7 generalises this to the computational-graph view that lets you take derivatives of arbitrary expressions, not just neural-network structures.
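To make the "arbitrary depth" idea concrete, here is a minimal Python sketch of a layer-by-layer backward pass. It does not reproduce nn-runtime.js or its API: the names `mlp_forward` and `mlp_backward`, the list-of-rows weight layout, the squared-error loss, and all numeric values are assumptions made for illustration. Hidden layers use Tanh and the last layer is the identity, as in the hand computation.

```python
import math

def mlp_forward(layers, x):
    """layers: list of (W, b) with W a list of rows. Returns every layer's activations."""
    activations = [x]                                   # activations[k] is the input to layer k
    for k, (W, b) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + bi
             for row, bi in zip(W, b)]
        is_last = (k == len(layers) - 1)
        activations.append(z if is_last else [math.tanh(v) for v in z])
    return activations

def mlp_backward(layers, activations, dL_dy):
    """One right-to-left sweep; each layer is visited exactly once."""
    grads = [None] * len(layers)
    dL_da = dL_dy                                       # gradient at the current layer's output
    for k in reversed(range(len(layers))):
        W, _ = layers[k]
        a_in, a_out = activations[k], activations[k + 1]
        is_last = (k == len(layers) - 1)
        # Through the activation: identity on the last layer, Tanh elsewhere.
        dL_dz = dL_da if is_last else [g * (1.0 - h * h) for g, h in zip(dL_da, a_out)]
        dW = [[gz * a for a in a_in] for gz in dL_dz]   # weight gradient = downstream * input
        db = list(dL_dz)                                # bias gradient = downstream * 1
        grads[k] = (dW, db)
        # Hand the gradient to the previous layer through the weights.
        dL_da = [sum(W[i][j] * dL_dz[i] for i in range(len(W))) for j in range(len(a_in))]
    return grads

# Hypothetical 1-2-2-1 network and data point.
layers = [
    ([[0.5], [-0.8]], [0.1, 0.3]),
    ([[0.4, -0.7], [0.9, 0.2]], [0.0, -0.1]),
    ([[1.2, -0.6]], [0.05]),
]
x, y = [1.0], 0.4

def loss(layers):
    return 0.5 * (mlp_forward(layers, x)[-1][0] - y) ** 2

acts = mlp_forward(layers, x)
grads = mlp_backward(layers, acts, [acts[-1][0] - y])

# Finite-difference check over every weight and bias.
eps, max_err = 1e-6, 0.0
for k, (W, b) in enumerate(layers):
    for i in range(len(W)):
        for j in range(len(W[i]) + 1):                  # last index j addresses the bias
            target, idx = (W[i], j) if j < len(W[i]) else (b, i)
            old = target[idx]
            target[idx] = old + eps
            lp = loss(layers)
            target[idx] = old - eps
            lm = loss(layers)
            target[idx] = old
            analytic = grads[k][0][i][j] if j < len(W[i]) else grads[k][1][i]
            max_err = max(max_err, abs((lp - lm) / (2 * eps) - analytic))
print("largest discrepancy:", max_err)
```

The backward loop touches each weight a constant number of times, which is where the linear cost comes from.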

Why this matters for PINNs

A PINN loss involves derivatives of the network output with respect to its inputs: things like $\partial u / \partial t$, $\partial^2 u / \partial x^2$. Computing those requires backpropagation through the network not once (to get loss gradients with respect to parameters) but several times nested: once to get $\partial u / \partial x$, again to differentiate that, and finally to get gradients of the resulting PDE residual loss with respect to the parameters. Modern PINN frameworks rely on the same chain-rule machinery you just saw, applied recursively. The fact that PINN training is feasible at all rests on backpropagation's linear cost. If it were quadratic, the higher-order derivatives would be impossibly expensive.
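For the tiny 1-2-1 network, the input derivatives can be written out by hand, which shows what "differentiating the network with respect to its input" means. The sketch below is an illustration under assumptions: it treats the network output as $u(x)$, uses hypothetical parameter values, and derives $\partial u / \partial x$ and $\partial^2 u / \partial x^2$ manually, whereas a real PINN framework would obtain them by automatic differentiation. Finite differences serve as the check.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
p = {"w11": 0.5, "b1": 0.1, "w12": -0.8, "b2": 0.3,
     "wo1": 1.2, "wo2": -0.6, "bo": 0.05}

def hidden_units(p):
    """(input weight, bias, output weight) for each hidden neuron."""
    return ((p["w11"], p["b1"], p["wo1"]), (p["w12"], p["b2"], p["wo2"]))

def u(p, x):
    return sum(w_out * math.tanh(w_in * x + b) for w_in, b, w_out in hidden_units(p)) + p["bo"]

def du_dx(p, x):
    total = 0.0
    for w_in, b, w_out in hidden_units(p):
        h = math.tanh(w_in * x + b)
        total += w_out * (1.0 - h * h) * w_in           # output weight * tanh' * input weight
    return total

def d2u_dx2(p, x):
    total = 0.0
    for w_in, b, w_out in hidden_units(p):
        h = math.tanh(w_in * x + b)
        # d/dz of (1 - tanh^2 z) is -2 tanh(z) (1 - tanh^2 z); dz/dx contributes w_in twice.
        total += w_out * (-2.0 * h * (1.0 - h * h)) * w_in ** 2
    return total

x0 = 0.7
e1, e2 = 1e-5, 1e-4
fd_first = (u(p, x0 + e1) - u(p, x0 - e1)) / (2 * e1)
fd_second = (u(p, x0 + e2) - 2 * u(p, x0) + u(p, x0 - e2)) / e2 ** 2

print("du/dx  :", du_dx(p, x0), "finite difference:", fd_first)
print("d2u/dx2:", d2u_dx2(p, x0), "finite difference:", fd_second)
```

Both derivative expressions still depend on the parameters, so a loss built from them can itself be differentiated with respect to the weights and biases. That last step is the nested pass described above.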

Pause-and-check. (1) After all forward steps but before any backward steps, what does the diagram tell you about the gradients? (2) Why is $\partial L / \partial b_o$ equal to $\partial L / \partial z_o$ rather than something more complicated? (3) If we doubled the network's depth (say, three hidden layers of two neurons each), how many backward steps would the algorithm need? Roughly proportional to the number of parameters, or to something larger?

References

Rumelhart, D.E., Hinton, G.E., Williams, R.J. (1986). Learning representations by back-propagating errors. Nature 323, 533-536.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 6.5 (back-propagation). MIT Press.
LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature 521, 436-444.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proc. IEEE 78(10), 1550-1560.