Auto-differentiation as a computational graph

Neural networks from absolute zero

Learning objectives

See backpropagation generalised: any composition of differentiable primitives can be reverse-mode differentiated
Read any expression as a directed acyclic graph of elementary operations
Recognise the local-Jacobian-times-upstream-gradient pattern at every operation type
Understand why this generalisation is exactly what PyTorch / JAX / TensorFlow do under the hood

Backpropagation in §0.6 was specific to a fixed neural-network shape. The same machinery generalises: any expression built out of a small library of differentiable primitives — add, multiply, sin, tanh, exp, log, square, ... — can be evaluated as a computational graph, then traversed in reverse to compute gradients. This is reverse-mode automatic differentiation, and it is what every modern ML framework rests on.

The graph view

Take any expression. Identify the elementary operations and the data flow between them. Draw a node for each operation and an arrow for each data dependency. The result is a directed acyclic graph. For example:

$$f(x, y) \;=\; x \cdot y \;+\; \sin(x)$$

has two input nodes ($x$ and $y$), a multiplication node ($z_1 = x \cdot y$), a sine node ($z_2 = \sin(x)$), and an addition node ($f = z_1 + z_2$). The forward pass evaluates the graph from inputs to output, left to right. The backward pass starts at the output with $\partial f / \partial f = 1$ and propagates leftward, from the output back to the inputs, multiplying by each operation's local Jacobian along the way.
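The two passes for this example can be written out by hand. The following is a minimal sketch in plain Python, not the textbook's runtime; the function name `forward_backward` and the input values are chosen here purely for illustration.

```python
import math

def forward_backward(x, y):
    """Evaluate f(x, y) = x*y + sin(x) and its gradient by hand."""
    # Forward pass: evaluate each node from inputs to output.
    z1 = x * y          # multiplication node
    z2 = math.sin(x)    # sine node
    f = z1 + z2         # addition node

    # Backward pass: start from df/df = 1 and walk the graph in reverse.
    df = 1.0
    dz1 = df * 1.0      # add: local derivative is 1 for each input
    dz2 = df * 1.0
    dx = dz1 * y        # mul: d(x*y)/dx = y
    dy = dz1 * x        # mul: d(x*y)/dy = x
    dx += dz2 * math.cos(x)  # sin: second path into x, so accumulate
    return f, dx, dy

f, dx, dy = forward_backward(2.0, 3.0)
print(f, dx, dy)
```

The returned `dx` should agree with the analytic derivative $y + \cos(x)$, and `dy` with $x$.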

The local Jacobian, per operation

Each operation type has a closed-form derivative with respect to its inputs:

add: $\partial(a + b)/\partial a = 1$ and $\partial(a + b)/\partial b = 1$.
mul: $\partial(ab)/\partial a = b$ and $\partial(ab)/\partial b = a$.
sin: $\partial \sin(z)/\partial z = \cos(z)$.
tanh: $\partial \tanh(z)/\partial z = 1 - \tanh^2(z)$.
square: $\partial z^2 / \partial z = 2z$.

Each backward step at a node multiplies the upstream gradient (a single scalar coming from the right) by the appropriate local Jacobian, and adds the result to the gradient of each input. Adds, not assigns — because if a node's value is used by multiple downstream consumers, all of those contributions are summed when the node's own gradient is finalised.
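The per-operation rules and the accumulate-on-backward step fit into a very small engine. The sketch below is one possible scalar implementation in plain Python, written for illustration only; the `Node` class and the helper names (`add`, `mul`, `sin`, `tanh`, `square`, `backward`) are hypothetical and are not taken from nn-runtime.js or from any framework.

```python
import math

class Node:
    """A scalar value in the graph, with a gradient slot (hypothetical class)."""
    def __init__(self, value, inputs=(), local_grads=()):
        self.value = value
        self.grad = 0.0
        self.inputs = inputs            # Nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(input), one per input

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def sin(z):
    return Node(math.sin(z.value), (z,), (math.cos(z.value),))

def tanh(z):
    t = math.tanh(z.value)
    return Node(t, (z,), (1.0 - t * t,))

def square(z):
    return Node(z.value ** 2, (z,), (2.0 * z.value,))

def backward(output):
    """Reverse-mode sweep: visit nodes in reverse topological order."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0   # df/df = 1
    for node in reversed(order):
        for parent, local in zip(node.inputs, node.local_grads):
            parent.grad += node.grad * local  # accumulate, never assign

# Usage: f(x, y) = x*y + sin(x) at illustrative input values.
x, y = Node(2.0), Node(3.0)
f = add(mul(x, y), sin(x))
backward(f)
print(x.grad, y.grad)
```

Because gradients are accumulated with `+=`, this sketch assumes every `grad` slot starts at zero; calling `backward` a second time on the same graph would add to the stale values unless they are reset first.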

Try it

Three small expressions are pre-built. Switch between them with the dropdown, drag the input sliders, then press Forward and Backward. The amber gradient at each input is exactly the partial derivative of the output with respect to that input. The graph is the same chain-rule machinery as in §0.6, with nothing conceptually new, but applied to expressions that are not network-shaped.

The "shared input" pattern

Look at the second expression, the single neuron $f = \tanh(w x + b)$. The input $x$ feeds into only one downstream node, so its gradient has a single contribution. But the multivariate expression $f = x y + \sin(x)$ has $x$ feeding into two downstream nodes (the multiplication and the sine). When we propagate backward, both downstream gradients flow into $x$ and are summed. That is the "sum over paths" property of multivariable calculus, made concrete by the AD engine. The single-variable composite expression $(x \sin x)^2$ has the same pattern: $x$ is used twice, and its final gradient is the sum of the two contributions.
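The sum over paths can be checked numerically for the composite expression $(x \sin x)^2$. The sketch below, in plain Python with a hypothetical helper `grad_two_paths` and an arbitrarily chosen input value, computes the two backward contributions separately and compares their sum with a central finite difference.

```python
import math

def grad_two_paths(x):
    """Gradient of f(x) = (x * sin(x))**2, split by path into x."""
    s = math.sin(x)     # sine node
    g = x * s           # multiplication node; the output is f = g**2

    dg = 2.0 * g                      # square: df/dg = 2g
    direct = dg * s                   # path 1: x enters mul directly
    via_sin = dg * x * math.cos(x)    # path 2: x enters mul through sin
    return direct, via_sin

def f(x):
    return (x * math.sin(x)) ** 2

x0 = 1.2
direct, via_sin = grad_two_paths(x0)
total = direct + via_sin

# Independent check: central finite difference on f itself.
h = 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)
print(direct, via_sin, total, numeric)
```

Neither contribution alone matches the finite-difference estimate; only their sum does, up to the approximation error of the finite difference.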

Why this matters for PINNs

A PINN PDE-residual loss looks like, for the 1D acoustic wave equation,

$$L_{\mathrm{PDE}} \;=\; \frac{1}{N}\sum_i \left(\frac{\partial^2 u_\theta}{\partial t^2}(x_i, t_i) - c^2\,\frac{\partial^2 u_\theta}{\partial x^2}(x_i, t_i)\right)^2$$

where $u_\theta$ is a neural network with parameters $\theta$. To compute $L_{\mathrm{PDE}}$, we need second derivatives of $u_\theta$ with respect to its inputs $x$ and $t$. To train, we then need gradients of $L_{\mathrm{PDE}}$ with respect to its parameters $\theta$. All of this is auto-differentiation. The textbook's runtime, nn-runtime.js, runs a forward and a backward pass through the network; Part 4 will extend it to handle the nested derivatives with respect to inputs. The principle is the same as in the toy graph above, applied recursively.
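Nesting works when the backward pass itself produces graph nodes, so that a gradient can be differentiated again. The sketch below illustrates this in plain Python under strong simplifying assumptions: the "network" is a single hypothetical unit $u_\theta(x, t) = a \tanh(w_x x + w_t t + b)$, there is one collocation point ($N = 1$), and the `Var` class, `lift`, `tanh`, and `grad` are illustrative names, not the API of nn-runtime.js or of any framework.

```python
import math

class Var:
    """Scalar graph node whose local derivatives are themselves graph nodes."""
    def __init__(self, value, parents=()):
        self.value = value
        # Pairs (input Var, function returning d(self)/d(input) as a Var).
        self.parents = parents

    def __add__(self, other):
        other = lift(other)
        return Var(self.value + other.value,
                   ((self, lambda: Var(1.0)), (other, lambda: Var(1.0))))
    __radd__ = __add__

    def __mul__(self, other):
        other = lift(other)
        return Var(self.value * other.value,
                   ((self, lambda: other), (other, lambda: self)))
    __rmul__ = __mul__

    def __sub__(self, other):
        return self + (-1.0) * lift(other)

def lift(v):
    """Wrap plain numbers as constant nodes."""
    return v if isinstance(v, Var) else Var(float(v))

def tanh(z):
    # The local derivative 1 - tanh(z)^2 is built from Var operations.
    out = Var(math.tanh(z.value), ((z, lambda: Var(1.0) - out * out),))
    return out

def grad(output, wrt):
    """Return d(output)/d(wrt) as a Var, so it can be differentiated again."""
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)
    adjoint = {id(output): Var(1.0)}
    for v in reversed(order):
        for parent, local in v.parents:
            contribution = adjoint[id(v)] * local()
            if id(parent) in adjoint:
                adjoint[id(parent)] = adjoint[id(parent)] + contribution
            else:
                adjoint[id(parent)] = contribution
    return adjoint.get(id(wrt), Var(0.0))

# Hypothetical one-unit "network" with illustrative parameter values.
a, w_x, w_t, b = Var(0.7), Var(1.3), Var(-0.4), Var(0.1)
x, t, c = Var(0.5), Var(0.2), 2.0

u = a * tanh(w_x * x + w_t * t + b)
u_tt = grad(grad(u, t), t)        # nested AD with respect to the input t
u_xx = grad(grad(u, x), x)        # nested AD with respect to the input x
residual = u_tt - c * c * u_xx
loss = residual * residual        # one collocation point, so N = 1
dloss_da = grad(loss, a)          # AD once more, with respect to a parameter
print(residual.value, dloss_da.value)
```

Here `grad` is called four times to form the residual and again for each parameter gradient. This sketch favours clarity over efficiency: it rebuilds the derivative graph on every call and returns one gradient at a time, whereas a practical engine would return all parameter gradients from a single backward sweep.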

Pause-and-check. (1) For the multivariate expression $f = x y + \sin(x)$, with $x = 1.5$ and $y = 0.6$, what should $\partial f/\partial x$ be analytically? Verify against the widget. (2) Why does the gradient of the input $x$ in expression 1 require two backward contributions to be summed? (3) In a real PINN, why is the AD engine called twice (or more) per training step?

References

Baydin, A.G., Pearlmutter, B.A., Radul, A.A., Siskind, J.M. (2018). Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 18, 1–43.
Bradbury, J., Frostig, R., Hawkins, P., et al. (2018). JAX: Composable transformations of Python+NumPy programs. https://github.com/google/jax.
Paszke, A., Gross, S., Massa, F., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. NeurIPS.
Margossian, C.C. (2019). A review of automatic differentiation and its efficient implementation. WIREs Data Min. Knowl. Discov. 9(4), e1305.