Training loops and optimisers

Neural networks from absolute zero

Learning objectives

Watch SGD, SGD+momentum, and Adam train identical networks side by side and rank them by convergence speed
Recognise the trade-offs between simplicity, robustness, and adaptive per-parameter scaling
Pick a sensible default optimiser for a new problem and a sensible learning rate to start from
Read a multi-trace loss curve and judge which optimiser is winning

You now have all the pieces of a training loop:

Forward-pass the network on a batch of data (§0.6).
Compute the loss (§0.4).
Backpropagate to get parameter gradients (§0.6 / §0.7).
Update parameters with an optimiser step (§0.5 + this section).
Repeat.
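
The five steps above can be sketched as a loop. This is a minimal illustration, not the course's own code: the toy data (`y = 2x + 1`), the linear model `w * x + b`, the function name `train`, and the learning rate are all assumptions, chosen so that the gradients can be written by hand instead of backpropagated.

```python
# Toy data from y = 2x + 1 (an assumption made purely for illustration).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 1.0 for x in xs]

def train(steps=500, lr=0.1):
    """Fit y ~ w * x + b with plain gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    losses = []
    n = len(xs)
    for _ in range(steps):                      # 5. repeat
        preds = [w * x + b for x in xs]         # 1. forward pass
        errs = [p - y for p, y in zip(preds, ys)]
        loss = sum(e * e for e in errs) / n     # 2. loss (mean squared error)
        losses.append(loss)
        # 3. backward pass: hand-derived gradients of the loss w.r.t. w and b
        grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2.0 * sum(errs) / n
        # 4. optimiser step (plain SGD here)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

w, b, losses = train()
```

With these settings `w` and `b` should end up close to 2 and 1, and the recorded loss should fall from its starting value. Swapping the two lines under step 4 for a different update rule is all it takes to change optimiser.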

Step four is the optimiser, and there are several choices. They all share the goal of §0.5 (walk downhill on the loss surface) but differ in how aggressively they step, how much they adapt to each parameter, and how much gradient history they carry from one step to the next. The three classical optimisers are SGD, SGD with momentum, and Adam.

The three optimisers, in equations

SGD is the algorithm from §0.5: subtract a fixed multiple of the gradient.

\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k)
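
As a rough sketch of this update, the snippet below applies it to a hypothetical toy loss $L(\theta) = \sum_i \theta_i^2$, whose gradient is $2\theta$. The function names `sgd_step` and `grad_quadratic`, the starting point, and $\eta = 0.1$ are assumptions for illustration only.

```python
def sgd_step(theta, grad, lr):
    """One plain SGD update: theta - lr * grad, element by element."""
    return [t - lr * g for t, g in zip(theta, grad)]

def grad_quadratic(theta):
    # Gradient of the hypothetical toy loss L(theta) = sum(theta_i ** 2).
    return [2.0 * t for t in theta]

theta = [1.0, -2.0]
for _ in range(50):
    theta = sgd_step(theta, grad_quadratic(theta), lr=0.1)
```

On this loss each step multiplies every parameter by $1 - 2\eta = 0.8$, so the parameters shrink geometrically toward the minimum at zero.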

SGD + momentum adds a velocity term that accumulates an exponentially decaying sum of past gradients:

\mathbf{v}_{k+1} = \mu \mathbf{v}_k + \nabla L(\theta_k), \qquad \theta_{k+1} = \theta_k - \eta\,\mathbf{v}_{k+1}

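Typically $\mu = 0.9$. Momentum lets the optimiser plough through long shallow valleys by averaging out the across-valley component (which oscillates) and amplifying the along-valley component (which is consistent). Under a constant gradient the velocity settles at $\nabla L/(1-\mu)$, so $\mu = 0.9$ gives up to a tenfold larger effective step along consistent directions.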
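
To make the valley picture concrete, here is a small sketch that runs both update rules on a hypothetical quadratic valley $L(x, y) = \tfrac{1}{2}(x^2 + 50y^2)$: shallow along $x$, steep across $y$. The loss, the starting point, the step count, and the names `run_sgd` and `run_momentum` are assumptions, not part of the interactive demo.

```python
def grad(theta):
    x, y = theta
    return [1.0 * x, 50.0 * y]      # shallow along x, steep across y

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 50.0 * y * y)

def run_sgd(theta, lr, steps):
    for _ in range(steps):
        g = grad(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def run_momentum(theta, lr, mu, steps):
    v = [0.0] * len(theta)          # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        v = [mu * vi + gi for vi, gi in zip(v, g)]         # v <- mu * v + grad
        theta = [t - lr * vi for t, vi in zip(theta, v)]   # theta <- theta - lr * v
    return theta

start = [10.0, 1.0]
loss_sgd = loss(run_sgd(start, lr=0.03, steps=100))
loss_mom = loss(run_momentum(start, lr=0.03, mu=0.9, steps=100))
```

With the same learning rate and the same number of steps, the momentum run should finish at a noticeably lower loss, because plain SGD crawls along the shallow $x$ direction while the velocity term builds up speed there.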

Adam (Kingma and Ba 2015) combines momentum with adaptive per-parameter learning rates. For each parameter, Adam tracks two running averages: the first moment of the gradient (like momentum) and the second moment (the squared gradient). The actual step divides the first by the square root of the second:

m_{k+1} = \beta_1 m_k + (1 - \beta_1)\,\nabla L,\quad v_{k+1} = \beta_2 v_k + (1 - \beta_2)\,(\nabla L)^2

\theta_{k+1} = \theta_k - \eta\,\hat{m}_{k+1}/(\sqrt{\hat{v}_{k+1}} + \epsilon)

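where $\hat{m}_{k+1} = m_{k+1}/(1-\beta_1^{k+1})$ and $\hat{v}_{k+1} = v_{k+1}/(1-\beta_2^{k+1})$ are bias-corrected versions of $m, v$ (assuming $m_0 = v_0 = 0$ and steps counted from $k = 0$; without the correction the early averages are biased toward zero). The square, square root, and division all act element-wise. The intuition: parameters with consistently small gradients get larger effective learning rates; parameters with consistently large gradients get smaller ones. This per-parameter rescaling is what makes Adam robust to the loss-balance pathologies that plague PINN training (which we will name in Part 3).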
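
A single Adam update can be sketched as follows. The function name `adam_step`, the list-of-floats representation, and the example gradients are assumptions for illustration; `k` is the 1-based count of the step being taken, and $\epsilon = 10^{-8}$ is the commonly used default.

```python
import math

def adam_step(theta, grad, m, v, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; k is the 1-based step count."""
    # Running averages of the gradient and of the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction: m and v start at zero, so early values are too small.
    m_hat = [mi / (1 - beta1 ** k) for mi in m]
    v_hat = [vi / (1 - beta2 ** k) for vi in v]
    theta = [
        t - lr * mh / (math.sqrt(vh) + eps)
        for t, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v

# Two parameters whose gradients differ by five orders of magnitude.
theta, m, v = adam_step([0.0, 0.0], [100.0, 0.001], [0.0, 0.0], [0.0, 0.0], k=1, lr=0.01)
```

On the first step both parameters should move by almost exactly `lr`, despite the very different gradient sizes. That is the per-parameter rescaling at work: the step depends on the gradient's sign and consistency rather than its raw magnitude.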

Try it

Three identical 1-N-1 Tanh MLPs train in parallel above. Same target, same width, same starting weights, same shared learning rate. The only difference is the optimiser. Press Play and watch the loss-curve panel: SGD is usually slowest, SGD+momentum is faster, and Adam typically wins or ties on the harder targets. Switch to "Two bumps" or "Sawtooth" and the spread widens dramatically.

Three observations

SGD learning-rate sensitivity. Set the shared learning rate (lr) very small (try 0.001). All three optimisers slow down, but SGD slows down most. Now raise lr to 0.3 (still moderate). SGD may start to oscillate; Adam holds up because of its adaptive rescaling.
Momentum rescues SGD. On the "Two bumps" target, SGD often gets stuck in a slow plateau while SGDM cuts through it. Same gradients, different update rule.
Adam is the safe default. For almost all PINN training in the literature today, Adam is the first-choice optimiser. It is rarely the absolute fastest, but it is rarely catastrophic, and it saves you from doing a heavy learning-rate sweep on every problem. Most PINN papers use Adam with $\eta \in [10^{-4}, 10^{-3}]$ and $\beta_1 = 0.9, \beta_2 = 0.999$.
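
The learning-rate sensitivity in the first observation can be reproduced outside the demo with a rough sketch. Everything here is an assumption for illustration: a toy quadratic loss with curvatures 1 and 50 stands in for the MLP, and the names `run_sgd` and `run_adam`, the starting point, and the 50-step budget are hypothetical. On a quadratic, plain SGD is only stable while the learning rate stays below 2 divided by the largest curvature (0.04 here).

```python
import math

CURV = [1.0, 50.0]  # curvatures of the toy loss 0.5 * sum(c * t ** 2)

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURV, theta))

def grad(theta):
    return [c * t for c, t in zip(CURV, theta)]

def run_sgd(lr, steps=50):
    theta = [10.0, 1.0]
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta))]
    return loss(theta)

def run_adam(lr, steps=50, b1=0.9, b2=0.999, eps=1e-8):
    theta = [10.0, 1.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for k in range(1, steps + 1):
        g = grad(theta)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        # Bias-corrected step, written inline for brevity.
        theta = [
            t - lr * (mi / (1 - b1 ** k)) / (math.sqrt(vi / (1 - b2 ** k)) + eps)
            for t, mi, vi in zip(theta, m, v)
        ]
    return loss(theta)

start_loss = loss([10.0, 1.0])
# Final loss as (SGD, Adam) for a safe and an overly large learning rate.
results = {lr: (run_sgd(lr), run_adam(lr)) for lr in (0.03, 0.3)}
```

At `lr = 0.03` SGD reduces the loss, but at `lr = 0.3` its loss blows up far past the starting value, while Adam's normalised step keeps its loss finite and many orders of magnitude smaller. This says nothing about which optimiser is faster in general; it only shows why Adam tolerates a wider range of learning rates.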

Beyond the basics

There are dozens of optimiser variants (RMSProp, AdaGrad, AdaDelta, Adamax, Nadam, Lion, Sophia, ...) and a smaller set of tricks that show up in PINN training: NTK-balanced step sizes, soft-Adam, RAdam (§3.4 will name them). For Part 0 you only need to know that plain Adam works most of the time, that SGDM is a respectable cheaper alternative, and that plain SGD without momentum is rarely used for serious training outside of theoretical analysis.

What you now know about training

The pieces of training are: data loading, forward pass, loss evaluation, backward pass to get parameter gradients, and an optimiser step. Looped many times, that is training. Of the three classical optimisers, Adam is the safe default; SGDM can be faster on simple problems if you tune it; SGD is for textbook analysis. §0.9 turns to one specific failure mode that Adam does not fix on its own, spectral bias, and what to do about it.

Pause-and-check. (1) Why does momentum help most on long valleys and least on bowl-shaped surfaces? (2) Why does Adam need two running averages while SGDM needs only one? (3) If two trainings of the same network give different final losses with everything else identical, what changes between runs?

References

Kingma, D.P., Ba, J. (2015). Adam: A method for stochastic optimization. ICLR.
Sutskever, I., Martens, J., Dahl, G., Hinton, G. (2013). On the importance of initialization and momentum in deep learning. ICML.
Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 8. MIT Press.
Loshchilov, I., Hutter, F. (2019). Decoupled weight decay regularization (AdamW). ICLR.