Optimization basics

Processing Prerequisites

Learning objectives

Write down a cost function and find its minimum by gradient descent
Pick a step size that converges; recognize zigzag, slow, and divergent failures
Explain why ill-conditioning (elongated level sets) slows gradient descent
Connect cost, gradient, and regularization to FWI, tomography, and least-squares inversion

Section §0.7 handed us the normal equations: the best least-squares solution to Ax = b is $\hat x = (A^{T}A)^{-1}A^{T}b$ . That is a closed-form answer to a linear problem. FWI and modern seismic inversion are non-linear — the forward operator is a wave-equation simulator, not a matrix — and there is no closed-form anything. What replaces the matrix inverse is iterative optimization.

1. The cost function

A cost function f(x) measures how bad a proposed solution x is. The optimizer’s job is to find the x that makes f(x) as small as possible. For seismic:

Least-squares inversion: $f(x) = |Ax - b|_2^2$ . Convex (one minimum). Closed-form solution available.
FWI: $f(m) = \tfrac{1}{2}|,\mathcal{F}(m) - d,|^{2}$ where m is the velocity model and ℱ(m) is a wave-equation forward model. Not convex in general — has many local minima.
Regularized inversion: $f(x) = |Ax - b|_2^2 + \epsilon,|Dx|_2^2$ where D is a roughness penalty. Shrinks solution norm or smooths it out.

2. Gradient descent

x_{k+1} = x_{k} - \alpha\,\nabla f(x_k)

Take a step in the direction opposite the gradient. The step size α (also called the learning rate) controls how far. Too small and you crawl; too large and you overshoot or even diverge. The widget below makes this physical.

Click anywhere on the contour to drop a starting point. Hit Play. Try:

Convex bowl with step size 0.15: smooth convergence to the origin. Now push the step to 0.24: overshoot and zigzag. Push further toward 0.5 (not slider-reachable, try typing): divergence.
Ill-conditioned bowl: the level sets are long ellipses. Gradient descent bounces side-to-side through the narrow direction, barely making progress along the long direction. This is the “zigzag” pathology that makes plain gradient descent slow on many real problems.
Two-minimum: start in the right well or the left well. Gradient descent falls into whichever minimum is downhill from its starting point — it has no way to see the other one. Finding the global minimum on a non-convex surface is an unsolved problem; FWI uses frequency continuation and careful initialization to dodge this.

Slow convergence. Too-small step: slow monotone approach. Fix: increase α, or use accelerated methods (momentum, conjugate gradients).
Zigzag / overshoot. Step size too large or surface too ill-conditioned. Fix: smaller α, preconditioning, or quasi-Newton methods that use curvature information.
Local minimum. Non-convex surface. Fix: multi-start, annealing, frequency continuation in FWI, or convex relaxations.

4. The Lipschitz / condition-number connection

For a quadratic $f(x) = \tfrac{1}{2}x^{T}Ax$ , the optimal step size for plain gradient descent is $\alpha^{\ast} = 2/(\lambda_{\max} + \lambda_{\min})$ , and the convergence rate depends on the condition number

\kappa = \lambda_{\max}/\lambda_{\min}

If κ = 1, gradient descent converges in one step. If κ = 100, it takes roughly 100 steps to converge to within 1/e of the optimum. The ill-conditioned-bowl preset has κ ≈ 15; that is why it zigzags.

Every practical inversion is ill-conditioned — the Jacobian has huge singular-value spread. The solutions people actually use are conjugate gradients, preconditioning, and quasi-Newton updates like L-BFGS. These are all in the same family as gradient descent; they just spend more memory or compute per step to bend the trajectory smarter.

5. Regularization: adding a penalty

If the cost has many equally-good solutions, or the gradient is unstable near the optimum, add a penalty that breaks ties toward desirable solutions:

Tikhonov (L2): $+\epsilon|x|_2^{2}$ shrinks the norm toward zero. Adds ε I to A^TA — numerically stabilizes inversion.
Smoothing: $+\epsilon|Dx|_2^{2}$ where D is a discrete derivative. Prefers smooth solutions. Tomographic velocity inversion uses this heavily.
Sparsity (L1): $+\epsilon|x|_{1}$ prefers sparse solutions with few non-zero entries. Compressed-sensing reconstruction lives here.

6. From one iteration to FWI

FWI is the outer loop above the widget. Every FWI iteration:

Run a wave-equation simulator forward to get predicted data.
Subtract from observed data to get the data residual.
Back-project the residual via the adjoint-state method to get a gradient in model space.
Take a gradient step: $m_{k+1} = m_k - \alpha, g_k$ .
Check stopping criterion; loop.

One FWI iteration on a modern 3D dataset can take thousands of CPU hours. The efficiency of the outer optimizer (gradient vs. L-BFGS vs. truncated-Newton) directly determines how many iterations you can afford.

**The one sentence to remember**

Pick a cost function, compute its gradient, step downhill, repeat — with a step size small enough to avoid zigzag and a regularizer strong enough to stabilize. That recipe underlies FWI, tomography, least-squares migration, and every modern seismic inversion.

Where this goes next

Part 0 closes with §0.10: the wave equation. The physics behind the simulator inside FWI, the propagation model that migration inverts, and the starting point for modern imaging. Thirty minutes of PDEs, then we are out of prerequisites and into processing proper.

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge.
Tarantola, A. (1984). Inversion of seismic reflection data in the acoustic approximation. Geophysics, 49, 1259.
Virieux, J., Operto, S. (2009). An overview of full-waveform inversion in exploration geophysics. Geophysics, 74, WCC1.
Pratt, R. G. (1999). Seismic waveform inversion in the frequency domain, Part 1. Geophysics, 64, 888.