Gradient Descent and Sampling Strategies
Learning objectives
- Explain why optimization is central to machine learning
- Derive and apply the gradient descent update rule
- Describe the role of the learning rate and its effect on convergence
- Compare gradient descent, SGD, mini-batch SGD, and Newton's method
- Apply optimization concepts to geoscience problems
Why Optimization Matters in ML
Training a machine learning model means finding the best parameters (weights, biases) that minimize a cost function. This is fundamentally an optimization problem.
Consider linear regression: we want to find the line that best fits our data. "Best" means minimizing the Mean Squared Error:
For a simple model with 2 parameters, we could try all combinations. But real ML models have millions of parameters. We need efficient optimization algorithms.
Gradient Descent: The Core Algorithm
Gradient descent is the workhorse of ML optimization. The idea is simple and elegant:
- Start with an initial guess for the parameters .
- Compute the gradient , the direction of steepest increase in the cost.
- Take a step in the opposite direction (toward steepest decrease).
- Repeat until the cost stops decreasing.
The Gradient Descent Update Rule
where:
- = current parameter values at step
- = learning rate (a positive scalar controlling step size)
- = gradient of the cost function at
Intuition: Imagine you are lost in a foggy mountain range and want to reach the valley (lowest point). You cannot see far, but you can feel the slope beneath your feet. Gradient descent says: always step downhill. The gradient tells you which direction is "uphill," so you go the opposite way.
What Is a Gradient?
The gradient is a vector of partial derivatives. For a cost function with parameters :
Each partial derivative tells you how much the cost changes when you slightly change one parameter, holding the others fixed.
Example: For (a parabola), the derivative is . At , the gradient is (pointing "uphill" to the right). So gradient descent moves left: .
The Learning Rate
The learning rate controls how big each step is. It is one of the most important hyperparameters in ML.
Too Small ()
Steps are tiny. Convergence is very slow, may take thousands of iterations. Safe but inefficient.
Just Right ( for this problem)
Steps are appropriately sized. Converges in a reasonable number of iterations.
Too Large ()
Steps overshoot the minimum. The cost may oscillate wildly or even diverge to infinity!
In practice, common starting values are . Many practitioners use learning rate schedules that start large and decay over time.
Gradient Descent Step-by-Step:
Let's trace gradient descent on with initial value and learning rate :
| Step | ||||
|---|---|---|---|---|
| 0 | 4.000 | 16.000 | 8.000 | 3.200 |
| 1 | 3.200 | 10.240 | 6.400 | 2.560 |
| 2 | 2.560 | 6.554 | 5.120 | 2.048 |
| 3 | 2.048 | 4.194 | 4.096 | 1.638 |
The pattern: . So as .
Variants of Gradient Descent
Batch Gradient Descent
Computes the gradient using all training examples at each step:
Pro: Stable, converges smoothly. Con: Very slow for large datasets (must process every example per step).
Stochastic Gradient Descent (SGD)
Computes the gradient using one random training example at each step:
Pro: Very fast per step; can handle huge datasets. Con: Noisy gradients cause the path to zigzag; may never settle exactly at the minimum.
Mini-Batch SGD
A compromise: compute the gradient on a mini-batch of examples (typically ):
Pro: Balances speed and stability; efficient on GPUs. Con: Still noisy (but less than pure SGD). This is the most widely used variant in practice.
Newton's Method
Newton's method uses second-order information (curvature) for faster convergence:
where is the Hessian matrix of second partial derivatives:
Newton's Method: Pros & Cons
Pros: Converges much faster (quadratic convergence near the minimum). Adapts step size automatically.
Cons: Computing and inverting the Hessian is , prohibitively expensive for large models (millions of parameters). Not practical for deep learning.
Quasi-Newton methods (e.g., L-BFGS) approximate the Hessian without computing it fully, offering a middle ground between gradient descent and Newton's method.
Convergence Criteria
How do we know when to stop iterating? Common stopping criteria:
- Gradient magnitude: Stop when (the gradient is nearly zero, indicating a flat region).
- Cost change: Stop when (the cost barely changes between steps).
- Parameter change: Stop when (parameters barely move).
- Maximum iterations: Set a cap (e.g., 10,000 steps) to prevent infinite loops.
Optimization in Geoscience
Optimization appears throughout geoscience:
- Seismic Inversion: Finding the velocity model that minimizes the misfit between observed and predicted seismograms.
- Well Placement: Optimizing drilling locations to maximize resource extraction while minimizing cost.
- History Matching: Adjusting reservoir model parameters so that simulated production matches observed production data.
- Full Waveform Inversion (FWI): A gradient-based method that iteratively updates the velocity model to fit the full seismic waveform, one of the most compute-intensive optimization problems in geophysics.
References
- Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, ch. 4 & 8 (numerical computation, optimization for training deep models). MIT Press.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 5.3 (error backpropagation, gradient methods). Springer.
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 8 (optimization). MIT Press.
- Bergen, K.J., Johnson, P.A., de Hoop, M.V., Beroza, G.C. (2019). Machine learning for data-driven discovery in solid Earth geoscience. Science 363, eaau0323.