Gradient Descent and Sampling Strategies

Part 3, Chapter 3: Numerical Optimization for Learning

Learning objectives

  • Explain why optimization is central to machine learning
  • Derive and apply the gradient descent update rule
  • Describe the role of the learning rate and its effect on convergence
  • Compare gradient descent, SGD, mini-batch SGD, and Newton's method
  • Apply optimization concepts to geoscience problems

Why Optimization Matters in ML

Training a machine learning model means finding the best parameters (weights, biases) that minimize a cost function. This is fundamentally an optimization problem.

Consider linear regression: we want to find the line y=wx+by = wx + b that best fits our data. "Best" means minimizing the Mean Squared Error:

J(w,b)=frac1msumi=1m(wx(i)+by(i))2J(w, b) = \frac{1}{m} \sum_{i=1}^{m} (wx^{(i)} + b - y^{(i)})^2i=1m(wx(i)+by(i))2

For a simple model with 2 parameters, we could try all combinations. But real ML models have millions of parameters. We need efficient optimization algorithms.

Gradient Descent: The Core Algorithm

Gradient descent is the workhorse of ML optimization. The idea is simple and elegant:

  1. Start with an initial guess for the parameters theta\theta.
  2. Compute the gradient nablaJ(theta)\nabla J(\theta), the direction of steepest increase in the cost.
  3. Take a step in the opposite direction (toward steepest decrease).
  4. Repeat until the cost stops decreasing.

The Gradient Descent Update Rule

θt+1=θtαJ(θt)\theta_{t+1} = \theta_t - \alpha \, \nabla J(\theta_t)

where:

  • θt\theta_t = current parameter values at step tt
  • α\alpha = learning rate (a positive scalar controlling step size)
  • J(θt)\nabla J(\theta_t) = gradient of the cost function at θt\theta_t

Intuition: Imagine you are lost in a foggy mountain range and want to reach the valley (lowest point). You cannot see far, but you can feel the slope beneath your feet. Gradient descent says: always step downhill. The gradient tells you which direction is "uphill," so you go the opposite way.

What Is a Gradient?

The gradient is a vector of partial derivatives. For a cost function JJ with parameters theta=(theta1,theta2,ldots,theta_n)\theta = (\theta_1, \theta_2, \ldots, \theta_n)n):

nablaJ(theta)=beginpmatrixfracpartialJpartialtheta1fracpartialJpartialtheta2vdotsfracpartialJpartialtheta_nendpmatrix\nabla J(\theta) = \begin{pmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{pmatrix}nendpmatrix

Each partial derivative tells you how much the cost changes when you slightly change one parameter, holding the others fixed.

Example: For J(theta)=theta2J(\theta) = \theta^2 (a parabola), the derivative is fracdJdtheta=2theta\frac{dJ}{d\theta} = 2\theta. At theta=4\theta = 4, the gradient is 88 (pointing "uphill" to the right). So gradient descent moves left: thetatextnew=4alphacdot8\theta_{\text{new}} = 4 - \alpha \cdot 8textnew=4alphacdot8.

The Learning Rate alpha\alpha

The learning rate controls how big each step is. It is one of the most important hyperparameters in ML.

Too Small (α=0.001\alpha = 0.001)

Steps are tiny. Convergence is very slow, may take thousands of iterations. Safe but inefficient.

Just Right (α=0.1\alpha = 0.1 for this problem)

Steps are appropriately sized. Converges in a reasonable number of iterations.

Too Large (α=1.5\alpha = 1.5)

Steps overshoot the minimum. The cost may oscillate wildly or even diverge to infinity!

In practice, common starting values are alphain0.001,0.01,0.1\alpha \in \{0.001, 0.01, 0.1\}. Many practitioners use learning rate schedules that start large and decay over time.

Gradient Descent Step-by-Step: J(x)=x2J(x) = x^2

Let's trace gradient descent on J(x)=x2J(x) = x^2 with initial value x_0=4x_0 = 4 and learning rate alpha=0.1\alpha = 0.1:

Step ttxtx_tJ(xt)J(x_t)J=2xt\nabla J = 2x_txt+1x_{t+1}
04.00016.0008.0003.200
13.20010.2406.4002.560
22.5606.5545.1202.048
32.0484.1944.0961.638

The pattern: xt+1=xt0.1cdot2xt=0.8cdotxtx_{t+1} = x_t - 0.1 \cdot 2x_t = 0.8 \cdot x_tt0.1cdot2xt=0.8cdotxt. So xt=4cdot0.8tto0x_t = 4 \cdot 0.8^t \to 0t=4cdot0.8tto0 as ttoinftyt \to \infty.

Variants of Gradient Descent

Batch Gradient Descent

Computes the gradient using all mm training examples at each step:

J(θ)=1mi=1mL(hθ(x(i)),y(i))\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla L(h_\theta(x^{(i)}), y^{(i)})

Pro: Stable, converges smoothly. Con: Very slow for large datasets (must process every example per step).

Stochastic Gradient Descent (SGD)

Computes the gradient using one random training example at each step:

θt+1=θtαL(hθ(x(i)),y(i))\theta_{t+1} = \theta_t - \alpha \, \nabla L(h_\theta(x^{(i)}), y^{(i)})

Pro: Very fast per step; can handle huge datasets. Con: Noisy gradients cause the path to zigzag; may never settle exactly at the minimum.

Mini-Batch SGD

A compromise: compute the gradient on a mini-batch of BB examples (typically B=32,64,128B = 32, 64, 128):

J(θ)1BibatchL(hθ(x(i)),y(i))\nabla J(\theta) \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla L(h_\theta(x^{(i)}), y^{(i)})

Pro: Balances speed and stability; efficient on GPUs. Con: Still noisy (but less than pure SGD). This is the most widely used variant in practice.

Newton's Method

Newton's method uses second-order information (curvature) for faster convergence:

thetat+1=thetatH1nablaJ(thetat)\theta_{t+1} = \theta_t - H^{-1} \nabla J(\theta_t)t+1=thetatH1nablaJ(thetat)

where HH is the Hessian matrix of second partial derivatives:

Hij=fracpartial2JpartialthetaipartialthetajH_{ij} = \frac{\partial^2 J}{\partial \theta_i \partial \theta_j}ij=fracpartial2Jpartialthetaipartialthetaj

Newton's Method: Pros & Cons

Pros: Converges much faster (quadratic convergence near the minimum). Adapts step size automatically.

Cons: Computing and inverting the Hessian is O(n3)O(n^3), prohibitively expensive for large models (millions of parameters). Not practical for deep learning.

Quasi-Newton methods (e.g., L-BFGS) approximate the Hessian without computing it fully, offering a middle ground between gradient descent and Newton's method.

Convergence Criteria

How do we know when to stop iterating? Common stopping criteria:

  • Gradient magnitude: Stop when nablaJ(theta)<epsilon\|\nabla J(\theta)\| < \epsilon (the gradient is nearly zero, indicating a flat region).
  • Cost change: Stop when J(thetat+1)J(thetat)<epsilon|J(\theta_{t+1}) - J(\theta_t)| < \epsilont)<epsilon (the cost barely changes between steps).
  • Parameter change: Stop when thetat+1thetat<epsilon\|\theta_{t+1} - \theta_t\| < \epsilont<epsilon (parameters barely move).
  • Maximum iterations: Set a cap (e.g., 10,000 steps) to prevent infinite loops.

Optimization in Geoscience

Optimization appears throughout geoscience:

  • Seismic Inversion: Finding the velocity model that minimizes the misfit between observed and predicted seismograms.
  • Well Placement: Optimizing drilling locations to maximize resource extraction while minimizing cost.
  • History Matching: Adjusting reservoir model parameters so that simulated production matches observed production data.
  • Full Waveform Inversion (FWI): A gradient-based method that iteratively updates the velocity model to fit the full seismic waveform, one of the most compute-intensive optimization problems in geophysics.

References

  • Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 4 & 8 (numerical computation, optimization for training deep models). MIT Press.
  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 5.3 (error backpropagation, gradient methods). Springer.
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 8 (optimization). MIT Press.
  • Bergen, K.J., Johnson, P.A., de Hoop, M.V., Beroza, G.C. (2019). Machine learning for data-driven discovery in solid Earth geoscience. Science 363, eaau0323.

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.