Gradient Descent and Sampling Strategies

Part 3, Chapter 3: Numerical Optimization for Learning

Learning objectives

Explain why optimization is central to machine learning
Derive and apply the gradient descent update rule
Describe the role of the learning rate and its effect on convergence
Compare gradient descent, SGD, mini-batch SGD, and Newton's method
Apply optimization concepts to geoscience problems

Why Optimization Matters in ML

Training a machine learning model means finding the best parameters (weights, biases) that minimize a cost function. This is fundamentally an optimization problem.

Consider linear regression: we want to find the line $y = wx + b$ that best fits our data. "Best" means minimizing the Mean Squared Error:

$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2$
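The cost above translates almost directly into code. The following is a minimal sketch in plain Python; the function name mse_cost and the four data points are made up for illustration.

```python
def mse_cost(w, b, xs, ys):
    """Mean squared error J(w, b) for the line y = w*x + b."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / m

# Hypothetical toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

cost_at_truth = mse_cost(2.0, 1.0, xs, ys)  # the generating line: every residual is zero
cost_at_zero = mse_cost(0.0, 0.0, xs, ys)   # predicts 0 everywhere: (1 + 9 + 25 + 49) / 4

print(cost_at_truth, cost_at_zero)
```

The parameters that generated the data give zero cost, and any other choice gives a larger value, which is what "best fit" means here.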

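For a simple model with two parameters, we could evaluate the cost on a grid of candidate values and pick the best one. But real ML models have millions of parameters, and the number of grid points grows exponentially with the number of parameters. We need efficient optimization algorithms.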

Gradient Descent: The Core Algorithm

Gradient descent is the workhorse of ML optimization. The idea is simple and elegant:

Start with an initial guess for the parameters $\theta$.
Compute the gradient $\nabla J(\theta)$, the direction of steepest increase in the cost.
Take a step in the opposite direction (toward steepest decrease).
Repeat until the cost stops decreasing.

The Gradient Descent Update Rule

$\theta_{t+1} = \theta_t - \alpha \, \nabla J(\theta_t)$

where:

$\theta_t$ = current parameter values at step $t$
$\alpha$ = learning rate (a positive scalar controlling step size)
$\nabla J(\theta_t)$ = gradient of the cost function at $\theta_t$
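The update rule can be sketched as a short loop. This is an illustrative implementation, not a library routine: the names gradient_descent and grad_J, the two-parameter cost, and the fixed step count are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta0, alpha, num_steps):
    """Repeat theta <- theta - alpha * grad(theta) for num_steps steps.

    theta is a list of floats; grad(theta) returns a list of the same length.
    """
    theta = list(theta0)
    for _ in range(num_steps):
        g = grad(theta)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta

# Illustrative cost: J(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2,
# whose minimum is at (3, -1).
def grad_J(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

theta_final = gradient_descent(grad_J, theta0=[0.0, 0.0], alpha=0.1, num_steps=100)
print(theta_final)
```

Here the loop simply runs for a fixed number of steps; a practical version would also check a stopping criterion.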

Intuition: Imagine you are lost in a foggy mountain range and want to reach the valley (lowest point). You cannot see far, but you can feel the slope beneath your feet. Gradient descent says: always step downhill. The gradient tells you which direction is "uphill," so you go the opposite way.

What Is a Gradient?

The gradient is a vector of partial derivatives. For a cost function $J$ with parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$:

$\nabla J(\theta) = \begin{pmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{pmatrix}$

Each partial derivative tells you how much the cost changes when you slightly change one parameter, holding the others fixed.
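The "change one parameter, hold the others fixed" description suggests a simple numerical check: nudge each parameter in turn and measure how the cost responds. The sketch below uses central differences; the cost function, the step size h, and the function names are assumptions for illustration only.

```python
def numerical_gradient(J, theta, h=1e-6):
    """Approximate each partial derivative of J with a central difference."""
    grad = []
    for k in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[k] += h    # nudge only parameter k upward
        minus[k] -= h   # and downward, holding the others fixed
        grad.append((J(plus) - J(minus)) / (2.0 * h))
    return grad

# Illustrative cost: J(theta) = theta_1^2 + 3 * theta_1 * theta_2
def J(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]

# Partial derivatives worked out by hand for the same cost.
def analytic_gradient(theta):
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]

point = [1.0, 2.0]
approx = numerical_gradient(J, point)
exact = analytic_gradient(point)
print(approx, exact)
```

Comparing a hand-derived gradient against a finite-difference estimate like this is a common way to catch derivative mistakes before running an optimizer.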

Example: For $J(\theta) = \theta^2$ (a parabola), the derivative is $\frac{dJ}{d\theta} = 2\theta$. At $\theta = 4$, the gradient is $8$ (pointing "uphill" to the right). So gradient descent moves left: $\theta_{\text{new}} = 4 - \alpha \cdot 8$.

The Learning Rate $\alpha$

The learning rate controls how big each step is. It is one of the most important hyperparameters in ML.

Too Small ( $\alpha = 0.001$ )

Steps are tiny. Convergence is very slow and may take thousands of iterations. Safe but inefficient.

Just Right ( $\alpha = 0.1$ for this problem)

Steps are appropriately sized. Converges in a reasonable number of iterations.

Too Large ( $\alpha = 1.5$ )

Steps overshoot the minimum. The cost may oscillate wildly or even diverge to infinity!

In practice, common starting values are $\alpha \in \{0.001, 0.01, 0.1\}$. Many practitioners use learning rate schedules that start large and decay over time.
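The three regimes above can be reproduced on $J(x) = x^2$ with a few lines of code. This is a small sketch; the function name run, the starting point, and the choice of 20 steps are arbitrary.

```python
def run(alpha, x0=4.0, num_steps=20):
    """Gradient descent on J(x) = x^2; returns the list of iterates."""
    xs = [x0]
    for _ in range(num_steps):
        xs.append(xs[-1] - alpha * 2.0 * xs[-1])  # the gradient of x^2 is 2x
    return xs

for alpha in (0.001, 0.1, 1.5):
    final = run(alpha)[-1]
    print(f"alpha={alpha}: x after 20 steps = {final:.4g}")
```

For this particular cost, each step multiplies $x$ by $1 - 2\alpha$, so the three rates give factors of $0.998$ (barely moves), $0.8$ (steady decay), and $-2$ (sign flips and doubles every step).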

Gradient Descent Step-by-Step: $J(x) = x^2$

Let's trace gradient descent on $J(x) = x^2$ with initial value $x_0 = 4$ and learning rate $\alpha = 0.1$:

Step $t$	$x_t$	$J(x_t)$	$\nabla J = 2x_t$	$x_{t+1}$
0	4.000	16.000	8.000	3.200
1	3.200	10.240	6.400	2.560
2	2.560	6.554	5.120	2.048
3	2.048	4.194	4.096	1.638

The pattern: $x_{t+1} = x_t - 0.1 \cdot 2x_t = 0.8 \cdot x_t$. So $x_t = 4 \cdot 0.8^t \to 0$ as $t \to \infty$.
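The table can be regenerated with a short script, which is a handy check on hand arithmetic. A minimal sketch (variable names are illustrative):

```python
x = 4.0
alpha = 0.1
rows = []
for t in range(4):
    cost = x ** 2                # J(x_t)
    grad = 2.0 * x               # gradient of x^2 at x_t
    x_next = x - alpha * grad    # gradient descent update
    rows.append((t, x, cost, grad, x_next))
    x = x_next

# Columns: step, x_t, J(x_t), gradient, x_{t+1}
for t, x_t, cost, grad, x_next in rows:
    print(f"{t}\t{x_t:.3f}\t{cost:.3f}\t{grad:.3f}\t{x_next:.3f}")
```

Each recorded $x_t$ agrees with the closed form $4 \cdot 0.8^t$ up to floating-point rounding.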

The trace above fixes $\alpha = 0.1$. Varying the rate changes the behavior: small rates crawl toward the minimum; a well-chosen rate converges in a handful of steps; and for $J(x) = x^2$, any $\alpha$ above 1 makes the updates overshoot and grow without bound, because each step multiplies $x$ by $1 - 2\alpha$.

Notice the geometry: each step moves $x$ opposite the gradient, by an amount proportional to both the slope and $\alpha$. Where the curve is steep the steps are large; near the minimum the gradient flattens and the steps shrink on their own, which is why a single fixed learning rate can still converge.

Variants of Gradient Descent

Batch Gradient Descent

Computes the gradient using all $m$ training examples at each step:

$\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla L(h_\theta(x^{(i)}), y^{(i)})$

Pro: Stable, converges smoothly. Con: Very slow for large datasets (must process every example per step).
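As a sketch of batch gradient descent, the code below fits the line $y = wx + b$ under the MSE cost, for which $\frac{\partial J}{\partial w} = \frac{2}{m} \sum_i (wx^{(i)} + b - y^{(i)})\,x^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{2}{m} \sum_i (wx^{(i)} + b - y^{(i)})$. The data, learning rate, and step count are illustrative assumptions.

```python
def batch_gradient(w, b, xs, ys):
    """Gradient of the MSE cost, averaged over ALL training examples."""
    m = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / m
    return grad_w, grad_b

# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b, alpha = 0.0, 0.0, 0.05
for _ in range(2000):
    gw, gb = batch_gradient(w, b, xs, ys)  # one pass over the full dataset per step
    w -= alpha * gw
    b -= alpha * gb

print(w, b)
```

Every update touches all $m$ examples, which is cheap for four points but becomes the bottleneck when $m$ is in the millions.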

Stochastic Gradient Descent (SGD)

Computes the gradient using one random training example at each step:

$\theta_{t+1} = \theta_t - \alpha \, \nabla L(h_\theta(x^{(i)}), y^{(i)})$

Pro: Very fast per step; can handle huge datasets. Con: Noisy gradients cause the path to zigzag; may never settle exactly at the minimum.
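A minimal SGD sketch for the same kind of line fit follows. The data are hypothetical points scattered slightly around $y = 2x + 1$ (chosen so the least-squares fit is still $w = 2$, $b = 1$), and the seed, learning rate, and step count are arbitrary.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: y = 2x + 1 plus small offsets of +0.1 or -0.1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 4.9, 7.1]

w, b, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    i = rng.randrange(len(xs))        # pick ONE random training example
    error = w * xs[i] + b - ys[i]
    w -= alpha * 2.0 * error * xs[i]  # gradient of (w*x_i + b - y_i)^2 w.r.t. w
    b -= alpha * 2.0 * error          # and w.r.t. b

print(w, b)
```

With a constant learning rate, the parameters end up hovering near the least-squares solution rather than settling exactly on it, because each single-example gradient is nonzero even at the optimum.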

Mini-Batch SGD

A compromise: compute the gradient on a mini-batch of $B$ examples (typically $B = 32$, $64$, or $128$):

$\nabla J(\theta) \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla L(h_\theta(x^{(i)}), y^{(i)})$

Pro: Balances speed and stability; efficient on GPUs. Con: Still noisy (but less than pure SGD). This is the most widely used variant in practice.
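The mini-batch variant can be sketched by shuffling the example indices each epoch and stepping through them $B$ at a time. The function name minibatch_sgd, the synthetic data, and all hyperparameter values below are assumptions for illustration.

```python
import random

def minibatch_sgd(xs, ys, batch_size, alpha, num_epochs, seed=0):
    """Fit y = w*x + b with mini-batch SGD on the MSE cost."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, b = 0.0, 0.0
    for _ in range(num_epochs):
        rng.shuffle(indices)  # new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-example gradients over this batch only.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

# Hypothetical noise-free data on y = 2x + 1, with x from 0.0 to 6.3.
xs = [i / 10.0 for i in range(64)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys, batch_size=8, alpha=0.01, num_epochs=500)
print(w, b)
```

Setting batch_size to 1 gives plain SGD, and setting it to len(xs) gives batch gradient descent, so the same loop covers all three variants.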

Newton's Method

Newton's method uses second-order information (curvature) for faster convergence:

$\theta_{t+1} = \theta_t - H^{-1} \nabla J(\theta_t)$

where $H$ is the Hessian matrix of second partial derivatives:

$H_{ij} = \frac{\partial^2 J}{\partial \theta_i \partial \theta_j}$
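To make the update concrete, here is a sketch of one Newton step on a two-parameter quadratic cost, using the closed-form inverse of a 2x2 matrix. The cost function, starting point, and function names are illustrative assumptions; real problems would use a linear solver rather than an explicit inverse.

```python
def newton_step(theta, grad, hess):
    """One Newton update theta - H^{-1} grad for a 2-parameter problem."""
    (a, b), (c, d) = hess
    det = a * d - b * c
    # Apply the closed-form 2x2 inverse of H to the gradient.
    step0 = (d * grad[0] - b * grad[1]) / det
    step1 = (-c * grad[0] + a * grad[1]) / det
    return [theta[0] - step0, theta[1] - step1]

# Illustrative quadratic: J = 2*t1^2 + t1*t2 + t2^2 - 4*t1 - 3*t2
def grad_J(theta):
    t1, t2 = theta
    return [4.0 * t1 + t2 - 4.0, t1 + 2.0 * t2 - 3.0]

hessian = [[4.0, 1.0], [1.0, 2.0]]  # constant, because J is quadratic

start = [10.0, -7.0]
theta1 = newton_step(start, grad_J(start), hessian)
print(theta1, grad_J(theta1))
```

Because the cost is exactly quadratic, a single Newton step lands on the minimum at $(5/7, 8/7)$ from any starting point; on non-quadratic costs the step is only an approximation and must be repeated.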

Newton's Method: Pros & Cons

Pros: Converges much faster (quadratic convergence near the minimum). Adapts step size automatically.

Cons: Storing the Hessian takes $O(n^2)$ memory, and inverting it takes $O(n^3)$ time, which is prohibitively expensive for large models (millions of parameters). Not practical for deep learning.

Quasi-Newton methods (e.g., L-BFGS) approximate the Hessian without computing it fully, offering a middle ground between gradient descent and Newton's method.

Convergence Criteria

How do we know when to stop iterating? Common stopping criteria:

Gradient magnitude: Stop when $\|\nabla J(\theta)\| < \epsilon$ (the gradient is nearly zero, indicating a flat region).
Cost change: Stop when $|J(\theta_{t+1}) - J(\theta_t)| < \epsilon$ (the cost barely changes between steps).
Parameter change: Stop when $\|\theta_{t+1} - \theta_t\| < \epsilon$ (parameters barely move).
Maximum iterations: Set a cap (e.g., 10,000 steps) to prevent infinite loops.
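The four criteria can be combined in one loop that reports which test fired. The sketch below handles a single scalar parameter; the function name minimize, the shared tolerance eps, and the order of the checks are assumptions made for the example.

```python
def minimize(J, grad, x0, alpha=0.1, eps=1e-8, max_iters=10_000):
    """Scalar gradient descent that returns (x, steps taken, stopping reason)."""
    x = x0
    for t in range(max_iters):
        g = grad(x)
        if abs(g) < eps:                      # gradient magnitude
            return x, t, "gradient magnitude"
        x_new = x - alpha * g
        if abs(J(x_new) - J(x)) < eps:        # cost change
            return x_new, t + 1, "cost change"
        if abs(x_new - x) < eps:              # parameter change
            return x_new, t + 1, "parameter change"
        x = x_new
    return x, max_iters, "maximum iterations"  # safety cap reached

x_final, steps, reason = minimize(lambda x: x ** 2, lambda x: 2.0 * x, x0=4.0)
print(x_final, steps, reason)
```

On $J(x) = x^2$ with $\alpha = 0.1$ the cost-change test fires first: the cost shrinks by a factor of $0.64$ per step while the gradient and the step length shrink only by $0.8$. The same $\epsilon$ therefore means different things for different criteria, and in practice each usually gets its own tolerance.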

Optimization in Geoscience

Optimization appears throughout geoscience:

Seismic Inversion: Finding the velocity model that minimizes the misfit between observed and predicted seismograms.
Well Placement: Optimizing drilling locations to maximize resource extraction while minimizing cost.
History Matching: Adjusting reservoir model parameters so that simulated production matches observed production data.
Full Waveform Inversion (FWI): A gradient-based method that iteratively updates the velocity model to fit the full seismic waveform, one of the most compute-intensive optimization problems in geophysics.
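As a deliberately tiny stand-in for the inversion problems above, the sketch below recovers a single constant velocity from travel times by gradient descent. Everything here is hypothetical: the straight-ray model $t = s \cdot d$ (with slowness $s = 1/v$), the synthetic distances and times, and the learning rate. Real seismic inversion involves far larger models and wave-equation physics.

```python
# Synthetic survey: source-receiver distances in km and travel times in s,
# generated from an assumed true velocity of 2 km/s (slowness 0.5 s/km).
distances = [1.0, 2.0, 3.0, 4.0]
times = [0.5, 1.0, 1.5, 2.0]

def misfit(s):
    """Mean squared misfit between predicted (s * d) and observed times."""
    return sum((s * d - t) ** 2 for d, t in zip(distances, times)) / len(times)

def misfit_gradient(s):
    """Derivative of the misfit with respect to the slowness s."""
    return sum(2.0 * (s * d - t) * d for d, t in zip(distances, times)) / len(times)

s = 1.0        # initial guess: 1 s/km, i.e. a velocity of 1 km/s
alpha = 0.05
for _ in range(50):
    s -= alpha * misfit_gradient(s)

velocity = 1.0 / s
print(velocity, misfit(s))
```

The structure mirrors the seismic inversion description: a forward model predicts data from parameters, a misfit measures the disagreement with observations, and the gradient of the misfit drives the parameter update.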
