Loss functions and the loss landscape

Neural networks from absolute zero

Learning objectives

Read a 2D loss landscape — contours, valleys, isolated minima
Compare four standard regression losses (MSE, MAE, Huber, log-cosh) and recognise their characteristic surface shapes
Match a loss function to a problem (clean data, outliers, robust regression)
Understand why the loss must be a scalar function of the parameters — it is the surface that gradient descent walks on

By §0.3 you can build and stack neurons, choose an activation, and watch a network train. But what is "training" actually doing? The black-box answer is: the optimiser is adjusting the weights and biases to minimise a loss function. To understand training, you have to understand the loss — the scalar number that says how badly the current network is doing.

What a loss function is

A loss function is a recipe that takes the network's prediction $\hat{y}$ and the true target $y$ and produces a non-negative scalar that is zero when $\hat{y} = y$ and grows with the error. The most common choice for regression is the squared error, which becomes the mean squared error (MSE) once it is averaged over a dataset:

$$
\mathrm{MSE}(\hat{y}, y) = (\hat{y} - y)^2
$$
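As a minimal sketch, the four regression losses named in the learning objectives can be written as plain Python functions of a single prediction and target. The Huber threshold `delta=1.0` and the factor of one half in its quadratic branch follow the usual textbook convention and are assumptions here, not settings taken from the widget; the function names are illustrative.

```python
import math

def mse(pred, target):
    """Squared error for one sample: (pred - target)^2."""
    r = pred - target
    return r * r

def mae(pred, target):
    """Absolute error for one sample: |pred - target|."""
    return abs(pred - target)

def huber(pred, target, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it (delta is an assumed default)."""
    r = abs(pred - target)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def log_cosh(pred, target):
    """log(cosh(r)), written in a form that does not overflow for large |r|."""
    r = abs(pred - target)
    # cosh(r) = exp(r) * (1 + exp(-2r)) / 2, so take the log term by term.
    return r + math.log1p(math.exp(-2.0 * r)) - math.log(2.0)

# Compare the four losses at a zero, a small, and a large residual.
for r in (0.0, 0.5, 3.0):
    print(r, mse(r, 0.0), mae(r, 0.0), huber(r, 0.0), log_cosh(r, 0.0))
```

All four are zero at a zero residual; they differ in how quickly they grow as the residual gets large, which is what gives each one its characteristic surface shape.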

For a whole training set $\{(x_i, y_i)\}_{i=1}^N$, the total loss is the average:

$$
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(f_\theta(x_i),\, y_i)
$$

where $\theta$ is shorthand for all the trainable parameters of the network (every weight, every bias) collected into one big vector. The loss $L$ is then a single scalar function of $\theta$. That is a profound simplification: no matter how complicated the architecture, training is the problem of finding the $\theta$ that minimises one scalar.
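The averaging formula can be sketched in a few lines of Python. To keep it concrete, the sketch borrows the one-neuron model $\hat{y} = \tanh(w x + b)$ used later in this section as a stand-in for $f_\theta$; the names `predict`, `squared_error`, and `total_loss`, and the three data points, are made up for illustration.

```python
import math

def predict(theta, x):
    """Stand-in for f_theta: a single tanh neuron with theta = (w, b)."""
    w, b = theta
    return math.tanh(w * x + b)

def squared_error(pred, target):
    return (pred - target) ** 2

def total_loss(theta, xs, ys, loss=squared_error):
    """Average the per-sample loss over the whole dataset: one scalar per theta."""
    n = len(xs)
    return sum(loss(predict(theta, x), y) for x, y in zip(xs, ys)) / n

# Illustrative data generated by the neuron itself with w = 2, b = 0.
xs = [-1.0, 0.0, 1.0]
ys = [math.tanh(-2.0), 0.0, math.tanh(2.0)]

print(total_loss((2.0, 0.0), xs, ys))  # parameters that generated the data
print(total_loss((0.0, 0.0), xs, ys))  # flat prediction, larger loss
```

Whatever `theta` contains, `total_loss` returns a single number, which is the point of the paragraph above.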

The loss landscape, made visible

For a real-sized network, $\theta$ has thousands or millions of components, and we cannot draw the surface. But we can visualise the surface by squashing the network down to two parameters and plotting the loss as a 2D heatmap. That is what the widget below does.

The "network" here is exactly one Tanh neuron, $\hat{y} = \tanh(w x + b)$. It has two trainable parameters: $w$ and $b$. The left panel shows the loss $L(w, b)$ as a heatmap on a logarithmic colour scale (purple = low loss, yellow = high loss). The right panel shows the resulting prediction (blue curve) and the target points (gray dots) with red residual lines connecting them. Click anywhere on the heatmap to drop $(w, b)$ at that location and see the resulting fit.
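A heatmap like this can be approximated by evaluating $L(w, b)$ on a grid. The sketch below does so for a hypothetical two-point target; the data values, the grid range of $[-4, 4]$, and the step of 0.1 are assumptions chosen for illustration, not the widget's actual settings.

```python
import math

def loss_at(w, b, xs, ys):
    """MSE of the one-neuron model tanh(w*x + b) on the given points."""
    return sum((math.tanh(w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical two-point target (not taken from the widget).
xs = [-1.0, 1.0]
ys = [-0.5, 0.5]

# Regular grid over the (w, b) plane.
ws = [-4.0 + 0.1 * i for i in range(81)]
bs = [-4.0 + 0.1 * j for j in range(81)]
grid = [[loss_at(w, b, xs, ys) for w in ws] for b in bs]  # grid[j][i] = L(ws[i], bs[j])

# Lowest grid cell: a coarse estimate of where the basin sits.
best_loss, best_w, best_b = min(
    (grid[j][i], ws[i], bs[j]) for j in range(len(bs)) for i in range(len(ws))
)

# Logarithmic colour scale; the small offset avoids log10(0) at a perfect fit.
log_grid = [[math.log10(v + 1e-12) for v in row] for row in grid]

print(best_w, best_b, best_loss)
```

For these two points the exact minimum is at $b = 0$, $w = \operatorname{artanh}(0.5) \approx 0.55$, so the grid search should land within one grid step of that location.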

Find the minimum visually. Pick the "Two points only" target. Watch the heatmap: the dark purple region is the basin around the minimum. Click into it and the right panel should show the prediction passing through both target points with tiny residuals. Click far away (top-right corner, say) and see the residuals blow up.
Compare loss surfaces. Keep the same target but switch from MSE to MAE. The minimum is in the same place, but the surface is now kinked rather than a smooth bowl: MAE has a corner wherever a residual changes sign. Switch to Huber and the surface looks like MSE near the minimum and like MAE far away: quadratic where it matters, robust where it does not.
Find the long valley. Switch to "Sloped line through origin". The loss surface has a long shallow valley along the $w$ axis. Click anywhere along that valley and the fit will look very similar, because the loss there is far less sensitive to $w$ than to $b$. This is a preview of one of the things gradient descent will struggle with in §0.5: long valleys make optimisation slow.
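The smooth-versus-kinked behaviour seen in the widget can be read off the derivative of each loss with respect to the residual $r = \hat{y} - y$. As a worked sketch, assuming the common Huber convention with threshold $\delta$ and a factor of one half in the quadratic branch (written $H_\delta$ here, a notation introduced for this example):

$$
\frac{d}{dr}\, r^2 = 2r, \qquad
\frac{d}{dr}\, |r| = \operatorname{sign}(r) \;\; (r \neq 0), \qquad
\frac{d}{dr}\, H_\delta(r) =
\begin{cases}
r, & |r| \le \delta \\
\delta \operatorname{sign}(r), & |r| > \delta
\end{cases}, \qquad
\frac{d}{dr}\, \log\cosh r = \tanh r
$$

The MSE derivative passes smoothly through zero, while the MAE derivative jumps from $-1$ to $+1$ at $r = 0$, which is the corner visible on the surface. Huber and log-cosh are both smooth at zero and both have a bounded derivative for large $|r|$, so a single large residual cannot dominate the gradient.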

Why this matters for PINNs

A real PINN loss is the sum of multiple terms: a data-fit term, one or more PDE-residual terms, and boundary-condition terms. Each term is itself an MSE-like average over its own set of points (data points, collocation points, or boundary points). The total loss surface in the high-dimensional parameter space $\theta$ has rough valleys, saddle points, and many local minima. Most of Part 3 ("Training pathologies and remedies") is about shaping this surface, that is, weighting the loss terms so the optimiser can actually reach a good minimum. You cannot fix what you cannot see, and the 2D toy version above is the simplest visible analogue of what is actually going on.
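Written out, such a composite loss might look like the following. The weights $\lambda$, the subscripts, and the residual operator $\mathcal{R}$ are notation assumed for this sketch rather than taken from a later chapter:

$$
L(\theta) = \lambda_{\text{data}}\, L_{\text{data}}(\theta) + \lambda_{\text{PDE}}\, L_{\text{PDE}}(\theta) + \lambda_{\text{BC}}\, L_{\text{BC}}(\theta),
\qquad
L_{\text{PDE}}(\theta) = \frac{1}{N_r} \sum_{j=1}^{N_r} \big( \mathcal{R}[f_\theta](x_j) \big)^2
$$

Here $\mathcal{R}[f_\theta](x_j)$ stands for the PDE residual of the network output at the $j$-th of $N_r$ collocation points. The result is still one scalar function of $\theta$; changing the weights changes the shape of the surface without changing what kind of object it is.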

What you now know

A loss function is a scalar function of the parameters. Different choices (MSE, MAE, Huber, log-cosh, and beyond) shape the surface differently — some are smooth, some are kinked, some are robust to outliers, some are not. Training is the problem of finding the minimum of this surface. The next four sections (§0.5 to §0.8) are about how we walk to that minimum, and how to do it efficiently when $\theta$ has more than two components.
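To make "robust to outliers" concrete, here is a small sketch that fits the simplest possible model, a single constant $c$, to five made-up values, one of which is an outlier. The data and the grid of candidate constants are assumptions chosen for illustration.

```python
import statistics

# Four values near 1.0 plus one outlier (made-up data).
ys = [1.0, 1.1, 0.9, 1.0, 10.0]

def mean_loss(c, per_sample):
    """Average per-sample loss when every prediction is the constant c."""
    return sum(per_sample(c - y) for y in ys) / len(ys)

# Brute-force search over candidate constants 0.00, 0.01, ..., 10.00.
candidates = [i / 100 for i in range(1001)]
c_mse = min(candidates, key=lambda c: mean_loss(c, lambda r: r * r))
c_mae = min(candidates, key=lambda c: mean_loss(c, abs))

print(c_mse, statistics.mean(ys))    # MSE minimiser next to the mean
print(c_mae, statistics.median(ys))  # MAE minimiser next to the median
```

The constant that minimises MSE is the mean of the data, which the single outlier drags upward, while the constant that minimises MAE is the median, which stays with the bulk of the points.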

Pause-and-check. (1) On the "Two points only" target with MSE loss, where in the $(w, b)$ plane is the minimum? (2) Why does the MAE surface have visible kinks while the MSE surface is smooth? (3) For an experiment with a few outlier data points, would you choose MSE, MAE, or Huber, and why?

References

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 5 & 8 (loss functions, optimisation). MIT Press.
Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35(1), 73–101.
Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T. (2018). Visualizing the loss landscape of neural nets. NeurIPS.
Krishnapriyan, A., Gholami, A., Zhe, S., Kirby, R., Mahoney, M.W. (2021). Characterizing possible failure modes in physics-informed neural networks. NeurIPS.