Gradient descent by hand

Neural networks from absolute zero

Learning objectives

Take explicit gradient-descent steps on a 2D loss surface and watch the trajectory
Recognise the three classical pathologies: too-small lr (slow), too-large lr (overshoot / divergence), long valleys (zig-zag)
Use momentum to fix the long-valley pathology
Explain why a learning-rate sweep is non-negotiable in any real PINN training run

You can now see a loss surface (§0.4). The next question is: how do we descend it? The answer is the most important algorithm in modern machine learning, and it is genuinely simple. We will work it out by hand on a 2D toy, then trust that the same procedure scales to networks with millions of parameters.

The algorithm, in one line

Start at any point $\theta_0$ on the surface. Compute the gradient $\nabla L(\theta_0)$ — the direction of steepest ascent. Move a small step in the opposite direction:

$$\theta_{k+1} \;=\; \theta_k \;-\; \eta\,\nabla L(\theta_k)$$

Here $\eta$ is the learning rate: how far to step. Repeat. That is the entire algorithm. There is genuinely nothing else.
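The update rule above fits in a few lines of code. The following is a minimal sketch, not the widget's implementation: the function name `gradient_descent` and the toy loss $L(\theta) = \theta_1^2 + \theta_2^2$ are assumptions made for illustration only.

```python
import numpy as np

def gradient_descent(grad, theta0, lr, n_steps):
    """Repeat theta <- theta - lr * grad(theta) and return every iterate."""
    theta = np.asarray(theta0, dtype=float)
    trajectory = [theta.copy()]
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)   # one step against the gradient
        trajectory.append(theta.copy())
    return np.array(trajectory)

# Hypothetical toy loss L(theta) = theta_1^2 + theta_2^2, minimum at the origin.
def bowl_grad(theta):
    return 2.0 * theta

path = gradient_descent(bowl_grad, theta0=[3.0, -2.0], lr=0.1, n_steps=50)
print("start:", path[0])
print("end:  ", path[-1])
```

On this bowl every step shrinks $\theta$ by the factor $1 - 2\eta$, so with $\eta = 0.1$ the iterates contract steadily toward the origin.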

The gradient, by hand

For our toy model $\hat{y} = \tanh(w x + b)$ with MSE loss $L = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$ over $N$ points, the gradient follows directly from the chain rule:

$$\frac{\partial L}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} 2(\hat{y}_i - y_i)\cdot \tanh'(w x_i + b)\cdot x_i$$

$$\frac{\partial L}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} 2(\hat{y}_i - y_i)\cdot \tanh'(w x_i + b)$$

where $\tanh'(z) = 1 - \tanh^2(z)$. Three factors multiply together: the loss derivative $2(\hat{y} - y)$, the activation derivative $\tanh'(z)$, and the sensitivity of the pre-activation $z = w x + b$ to the parameter (just $x$ for $w$; just $1$ for $b$). This is the chain rule applied to a one-layer network. §0.6 will generalise it to many layers.
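A hand-derived gradient is worth checking numerically. The sketch below codes the two partial derivatives above and compares them with central finite differences. The data points, the parameter values and the function names `loss` and `grad` are hypothetical, chosen only for the check.

```python
import numpy as np

def loss(w, b, x, y):
    """MSE loss of the toy model y_hat = tanh(w*x + b)."""
    y_hat = np.tanh(w * x + b)
    return np.mean((y_hat - y) ** 2)

def grad(w, b, x, y):
    """Analytic gradient (dL/dw, dL/db) from the chain rule."""
    y_hat = np.tanh(w * x + b)
    # loss derivative times activation derivative, shared by both partials
    common = 2.0 * (y_hat - y) * (1.0 - y_hat ** 2)
    return np.mean(common * x), np.mean(common)

# Hypothetical data and parameter values, for illustration only.
x = np.array([-1.0, 0.0, 0.5, 2.0])
y = np.array([-0.5, 0.1, 0.4, 0.9])
w, b = 0.7, -0.2

dw, db = grad(w, b, x, y)

# Central finite differences as an independent check.
eps = 1e-6
dw_fd = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
db_fd = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print("dL/dw analytic vs finite difference:", dw, dw_fd)
print("dL/db analytic vs finite difference:", db, db_fd)
```

If the two columns disagree beyond a few digits, the usual culprit is a dropped factor in the chain rule, which is exactly the kind of slip this check is meant to catch.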

Try it

Click anywhere on the heatmap to drop a starting point. The white arrow shows the descent direction $-\nabla L$ at the current location. Press Step (1) to take one gradient step; the yellow trail records your trajectory. The right-hand panels show the resulting fit and the loss-vs-step curve.

Three pathologies and one fix

(1) Learning rate too small. Pick "Two points (clean isolated minimum)", set lr to 0.01, click somewhere far from the minimum, then press Step (50). The trail crawls toward the minimum but covers very little ground: after 50 steps you are still high up the bowl. A real training run at this rate would need far more steps than it should.
(2) Learning rate too large. Same target, set lr to 1.5, click the same starting point, and step. The trail overshoots the minimum, lands far on the other side, overshoots again, and either oscillates forever or escapes the visible region entirely. The loss can increase from one step to the next.
(3) Long valley. Switch to "Sloped line (long shallow valley)", set lr to 0.3, momentum 0, click in the upper-left corner of the heatmap. Press Step (50). The trail zig-zags wildly across the valley walls and needs many steps to make any forward progress along the valley axis. This is the classic narrow-valley pathology that motivates momentum.
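The three experiments can be imitated without the widget. The sketch below is a stand-in under stated assumptions: instead of the tanh loss it uses a hypothetical quadratic bowl $L = \tfrac{1}{2}(h_1\theta_1^2 + h_2\theta_2^2)$, and the learning rates are chosen for that bowl rather than copied from the widget settings. On such a bowl each step multiplies $\theta_j$ by $1 - \eta h_j$, which makes all three behaviours easy to trigger.

```python
import numpy as np

def run_gd(curvatures, theta0, lr, n_steps):
    """Gradient descent on L = 0.5 * sum(curvatures * theta**2)."""
    h = np.asarray(curvatures, dtype=float)
    theta = np.asarray(theta0, dtype=float)
    path = [theta.copy()]
    for _ in range(n_steps):
        theta = theta - lr * h * theta   # the gradient of L is h * theta
        path.append(theta.copy())
    return np.array(path)

start = [2.0, 2.0]
round_bowl = [1.0, 1.0]       # hypothetical well-conditioned bowl
long_valley = [0.05, 20.0]    # hypothetical shallow axis and steep walls

slow = run_gd(round_bowl, start, lr=0.01, n_steps=50)       # factor 0.99 per step
diverging = run_gd(round_bowl, start, lr=2.5, n_steps=50)   # factor -1.5 per step
valley = run_gd(long_valley, start, lr=0.09, n_steps=50)    # wall factor -0.8

print("too small: distance to minimum after 50 steps =", np.linalg.norm(slow[-1]))
print("too large: distance to minimum after 50 steps =", np.linalg.norm(diverging[-1]))

# Count how often the across-valley coordinate changes sign (the zig-zag).
sign_flips = np.sum(np.diff(np.sign(valley[:, 1])) != 0)
print("valley: across-valley sign flips =", sign_flips)
print("valley: along-valley coordinate went from", valley[0, 0], "to", valley[-1, 0])
```

The valley run shows both halves of the pathology at once: the steep coordinate flips sign on every step while the shallow coordinate has hardly moved.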

Now turn momentum up to 0.9 and re-run experiment (3). The trail straightens dramatically. The velocity update $\mathbf{v}_{k+1} = \mu \mathbf{v}_k - \eta \nabla L(\theta_k)$, followed by the step $\theta_{k+1} = \theta_k + \mathbf{v}_{k+1}$, accumulates an exponentially weighted sum of past gradients. That sum cancels the across-valley component (which flips sign each step) and reinforces the along-valley component (which keeps the same sign). This is the simplest example of why optimisers more sophisticated than vanilla gradient descent exist, a topic we will return to in §0.8.
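The momentum update can be sketched in the same style. This is an illustrative heavy-ball loop, not the widget's code: the quadratic valley, its curvatures `h`, the step count and the function name `run_momentum` are all assumptions. Setting `mu=0.0` recovers plain gradient descent, so one function gives both runs.

```python
import numpy as np

def loss(theta, h):
    """Hypothetical quadratic valley L = 0.5 * sum(h * theta**2)."""
    return 0.5 * np.sum(h * theta ** 2)

def run_momentum(h, theta0, lr, mu, n_steps):
    """Heavy-ball update: v <- mu*v - lr*grad, then theta <- theta + v."""
    h = np.asarray(h, dtype=float)
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = h * theta            # gradient of the quadratic valley
        v = mu * v - lr * g      # velocity accumulates past gradients
        theta = theta + v
    return theta

h = np.array([0.05, 20.0])       # shallow valley axis, steep walls (assumed)
start = [2.0, 2.0]

plain = run_momentum(h, start, lr=0.09, mu=0.0, n_steps=200)
heavy = run_momentum(h, start, lr=0.09, mu=0.9, n_steps=200)

loss_plain = loss(plain, h)
loss_heavy = loss(heavy, h)
print("loss after 200 steps, mu = 0.0:", loss_plain)
print("loss after 200 steps, mu = 0.9:", loss_heavy)
```

On this particular valley the gain comes mainly from faster travel along the shallow axis, where successive gradients point the same way and the velocity builds up. How much momentum helps on another surface depends on its curvatures, so treat the numbers as specific to this toy.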

Choosing a learning rate

There is no universally correct learning rate. You have to find one that is right for your problem, your loss, and your initialisation. The widget above lets you do this by feel; in practice, you do a learning-rate sweep: run a short training trial at, say, six log-spaced learning rates between $10^{-4}$ and $10^{0}$, plot loss-vs-step for each, and pick the one where the loss falls fastest without diverging. This is non-negotiable for any PINN training run, where the loss surface is much harder to characterise than the toy bowl above.
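A sweep of this kind might look as follows for the tanh toy model. The two data points, the starting point $(w, b) = (0, 0)$, the 20-step trial length and the selection rule (lowest final loss among trials that stayed finite and improved) are all assumptions for this sketch. Ranking by final loss is a crude proxy for "falls fastest without diverging"; in a real run you would also look at the loss curves.

```python
import numpy as np

def loss_and_grad(w, b, x, y):
    """MSE loss of y_hat = tanh(w*x + b) and its gradient (dL/dw, dL/db)."""
    y_hat = np.tanh(w * x + b)
    common = 2.0 * (y_hat - y) * (1.0 - y_hat ** 2)
    return np.mean((y_hat - y) ** 2), np.mean(common * x), np.mean(common)

def short_trial(lr, x, y, n_steps=20, w0=0.0, b0=0.0):
    """Run a short gradient-descent trial and return the loss at every step."""
    w, b = w0, b0
    history = []
    for _ in range(n_steps):
        current_loss, dw, db = loss_and_grad(w, b, x, y)
        history.append(current_loss)
        w, b = w - lr * dw, b - lr * db
    history.append(loss_and_grad(w, b, x, y)[0])
    return np.array(history)

# Hypothetical two-point data set.
x = np.array([-1.0, 1.0])
y = np.array([-0.5, 0.5])

lrs = np.logspace(-4, 0, 6)      # six log-spaced rates from 1e-4 to 1e0
curves = {lr: short_trial(lr, x, y) for lr in lrs}

# Keep trials whose loss stayed finite and ended below where it started.
usable = [lr for lr in lrs
          if np.all(np.isfinite(curves[lr])) and curves[lr][-1] < curves[lr][0]]
best_lr = min(usable, key=lambda lr: curves[lr][-1])

for lr in lrs:
    print(f"lr = {lr:.4g}   final loss = {curves[lr][-1]:.3e}")
print("selected learning rate:", best_lr)
```

On this easy two-point problem the largest rate in the sweep comes out ahead, which should not be read as a general rule: on a harder surface the largest rates typically diverge and the winner sits somewhere in the middle of the range.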

Why this matters for PINNs

A PINN in the wild has tens of thousands to millions of parameters and a loss surface that nobody can directly visualise. The exact same algorithm (take the gradient, step downhill, repeat) is what trains it. The pathologies you saw above do not go away with scale; they get worse. Long valleys appear in the form of loss-balance crises when the data-fit and PDE-residual loss terms have different orders of magnitude (Part 3, §3.2). Saddle points and local minima multiply. The fixes (momentum, adaptive learning rates, loss reweighting, curriculum, domain decomposition) are the substance of Parts 3 and 6. They all build on the toy you just played with.

Pause-and-check. (1) Why does the descent direction never point uphill? (2) On the "Sloped line" target, what value of momentum lets you reach the minimum in roughly half as many steps as with $\mu = 0$? (3) If you double the learning rate, what happens to the smallest loss you can reach in a fixed number of steps?
