Overfitting, Underfitting, Variance, and Bias

Chapter 10: Generalization, Bias, and Variance

Learning objectives

Explain the bias-variance tradeoff mathematically and intuitively
Identify underfitting (high bias) and overfitting (high variance) from training/test performance
Describe how training and validation error curves change with model complexity
Apply L1 (Lasso), L2 (Ridge), and Elastic Net regularization
Understand k-fold cross-validation and its variants
Explain early stopping and dropout as regularization techniques
Recognize overfitting risks specific to geoscience (small datasets, spatial correlation)

The Fundamental Problem

Every machine learning model faces a tension: we want the model to be complex enough to capture the true patterns in the data, but not so complex that it memorizes noise. This tension is formalized by the bias-variance tradeoff.

The Bias-Variance Decomposition

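For a regression problem with observations $y = f(x) + \epsilon$, where $f(x)$ is the true function and the noise $\epsilon$ satisfies $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2$, the expected squared prediction error of a fitted model $\hat{y}$ at a point $x$, averaged over training sets and noise, can be decomposed as: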

Bias-Variance Decomposition

$$E[(y - \hat{y})^2] = \underbrace{(E[\hat{y}] - f(x))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{y} - E[\hat{y}])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$

Bias: The error from incorrect assumptions in the model. High bias means the model is too simple to capture the true relationship. Measures how far the average prediction is from the truth.
Variance: The error from sensitivity to fluctuations in the training data. High variance means the model changes dramatically with different training samples. Measures how spread out predictions are across different training sets.
Irreducible noise: The inherent randomness in the data that no model can eliminate (measurement error, natural variability).

The total error is the sum of all three components. Since we cannot reduce the irreducible noise, we must balance bias and variance:

Decreasing model complexity → increases bias but decreases variance
Increasing model complexity → decreases bias but increases variance

The optimal model minimizes the total error. This typically occurs at an intermediate complexity, where any further reduction in bias would be outweighed by the increase in variance.
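
The decomposition can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original material: the true function (a sine), the noise level, the fixed input grid, and the polynomial degrees are all arbitrary choices. It refits polynomials of several degrees to many simulated training sets and estimates bias² and variance at a single point `x0`.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def f(x):
    """True function; chosen only for illustration."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                           # noise standard deviation (assumed)
x_train = np.linspace(0.0, 1.0, 20)   # fixed inputs; only the noise is redrawn
x0 = 0.3                              # point at which the error is decomposed
n_repeats = 500                       # number of simulated training sets

def bias_variance_at_x0(degree):
    """Monte Carlo estimate of (bias^2, variance) of a polynomial fit at x0."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y_train = f(x_train) + rng.normal(0.0, sigma, x_train.size)
        preds[r] = Polynomial.fit(x_train, y_train, degree)(x0)
    bias_sq = (preds.mean() - f(x0)) ** 2   # (E[y_hat] - f(x0))^2
    variance = preds.var()                  # E[(y_hat - E[y_hat])^2]
    return float(bias_sq), float(variance)

results = {degree: bias_variance_at_x0(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in results.items():
    total = bias_sq + variance + sigma**2   # add the irreducible noise
    print(f"degree {degree}: bias^2={bias_sq:.4f} "
          f"variance={variance:.4f} total={total:.4f}")
```

In this setup the straight line should show the largest bias² and the degree-9 fit the largest variance, which mirrors the two bullet points above.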

Underfitting (High Bias)

A model underfits when it is too simple to capture the underlying pattern. Signs of underfitting:

Poor performance on training data (high training error)
Poor performance on test/validation data (high test error)
Training error ≈ test error (both are high)

Example: Fitting a straight line (degree-1 polynomial) to data that follows a parabolic curve. The line cannot capture the curvature, regardless of how much data is available.
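
A minimal sketch of this example, assuming synthetic data $y = x^2 + \epsilon$ on $[-1, 1]$ with an arbitrary noise level: a straight line is fitted with 50 and then 5000 training samples. If the claim holds, both training and test error stay far above the noise variance no matter how much data is used.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1                 # noise standard deviation (assumed)
noise_floor = sigma**2      # best achievable mean squared error

def line_fit_errors(n_train, n_test=2000):
    """Fit a straight line to noisy parabolic data; return (train MSE, test MSE)."""
    x_tr = rng.uniform(-1.0, 1.0, n_train)
    x_te = rng.uniform(-1.0, 1.0, n_test)
    y_tr = x_tr**2 + rng.normal(0.0, sigma, n_train)
    y_te = x_te**2 + rng.normal(0.0, sigma, n_test)
    slope, intercept = np.polyfit(x_tr, y_tr, 1)   # degree-1 least squares

    def mse(x, y):
        return float(np.mean((slope * x + intercept - y) ** 2))

    return mse(x_tr, y_tr), mse(x_te, y_te)

errors = {n: line_fit_errors(n) for n in (50, 5000)}
for n, (train_mse, test_mse) in errors.items():
    print(f"n={n}: train MSE={train_mse:.3f} test MSE={test_mse:.3f} "
          f"(noise floor {noise_floor:.3f})")
```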

Fixes: Use a more complex model, add more features, reduce regularization, train longer (for neural networks).

Overfitting (High Variance)

A model overfits when it memorizes the training data, including the noise. Signs of overfitting:

Excellent performance on training data (very low training error)
Poor performance on test/validation data (high test error)
Large gap between training error and test error

Example: Fitting a degree-20 polynomial to 25 data points. With 21 free coefficients for 25 points, the polynomial can track the training data almost exactly, but it oscillates wildly between points and makes terrible predictions on new data.
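
The example can be reproduced with a short experiment. The sketch below makes several assumptions: a sine as the true function, 25 equally spaced training inputs, and a Chebyshev basis (used only for numerical stability; it spans the same polynomials). It compares a degree-3 and a degree-20 fit on training and test data.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(2)

def f(x):
    """Assumed smooth true function."""
    return np.sin(2 * np.pi * x)

sigma = 0.2
x_train = np.linspace(0.0, 1.0, 25)                     # 25 training points
y_train = f(x_train) + rng.normal(0.0, sigma, x_train.size)
x_test = rng.uniform(0.0, 1.0, 1000)                    # new data, same range
y_test = f(x_test) + rng.normal(0.0, sigma, x_test.size)

def train_test_mse(degree):
    """Least-squares polynomial fit; return (train MSE, test MSE)."""
    model = Chebyshev.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse

mse_by_degree = {degree: train_test_mse(degree) for degree in (3, 20)}
for degree, (train_mse, test_mse) in mse_by_degree.items():
    print(f"degree {degree}: train MSE={train_mse:.4f} test MSE={test_mse:.4f}")
```

The signature of overfitting is the combination of a lower training error and a higher test error for the degree-20 model.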

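Fixes: Get more training data, reduce model complexity, add regularization, select features, apply dropout (neural networks), or stop training early. Cross-validation does not reduce overfitting by itself, but it detects it and guides the choice of model complexity and regularization strength.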

Training vs. Validation Curves

The relationship between training error and validation error as model complexity increases reveals the bias-variance tradeoff:

How to Read the Curves

Low complexity (left side): Both training and validation errors are high. The model underfits — it cannot even fit the training data well.

Optimal complexity (middle): Training error is low and validation error is at its minimum. The model captures the true pattern without memorizing noise.

High complexity (right side): Training error approaches zero, but validation error increases. The model overfits — it memorizes training noise that does not generalize.

The sweet spot is where validation error is minimized. The gap between training and validation error indicates the degree of overfitting.
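
The curves described above can be traced by sweeping a complexity parameter. In this sketch the complexity is the polynomial degree, and the data (a noisy sine with 40 training and 400 validation samples) are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(3)

def sample(n, sigma=0.2):
    """Draw n noisy observations of an assumed true function."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n)

x_tr, y_tr = sample(40)
x_va, y_va = sample(400)

degrees = list(range(1, 21))
train_err, val_err = [], []
for degree in degrees:
    model = Chebyshev.fit(x_tr, y_tr, degree)
    train_err.append(float(np.mean((model(x_tr) - y_tr) ** 2)))
    val_err.append(float(np.mean((model(x_va) - y_va) ** 2)))

# The "sweet spot" is the complexity with the lowest validation error.
best_degree = degrees[int(np.argmin(val_err))]
gap_at_best = val_err[best_degree - 1] - train_err[best_degree - 1]
print(f"best degree: {best_degree}, train/validation gap there: {gap_at_best:.4f}")
```

Plotting `train_err` and `val_err` against `degrees` should give the familiar picture: a training curve that keeps falling and a U-shaped validation curve.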

Regularization

Regularization penalizes model complexity by adding a penalty term to the loss function. This discourages the model from fitting noise.

L1 Regularization (Lasso)

$$J_{\text{Lasso}} = J_{\text{original}} + \lambda \sum_{i=1}^{p} |w_i|$$

The L1 penalty is the sum of absolute values of the weights. It encourages sparsity: many weights become exactly zero, effectively performing feature selection. Use when you suspect many features are irrelevant.
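
To see the sparsity effect, the following sketch solves the Lasso by cyclic coordinate descent with soft-thresholding. This is one standard solver, not necessarily the one a given library uses. It assumes $J_{\text{original}}$ is the mean squared error scaled by 1/2, no intercept, and synthetic data in which only the first three of ten features matter.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values with |z| <= t become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * sum(|w_i|) one weight at a time."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(p):
            partial_residual = y - X @ w + X[:, j] * w[j]   # exclude feature j
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:3] = [3.0, -2.0, 1.5]          # only three relevant features (assumed)
y = X @ w_true + rng.normal(0.0, 0.5, n)

w_lasso = lasso_coordinate_descent(X, y, lam=0.5)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # unregularized, for comparison
print("lasso:", np.round(w_lasso, 2))
print("ols:  ", np.round(w_ols, 2))
```

The unregularized fit assigns small nonzero weights to the irrelevant features, whereas the Lasso weights for those features should be exactly zero.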

L2 Regularization (Ridge)

$$J_{\text{Ridge}} = J_{\text{original}} + \lambda \sum_{i=1}^{p} w_i^2$$

The L2 penalty is the sum of squared weights. It shrinks all weights toward zero but rarely makes them exactly zero. It handles correlated features well by distributing weight among them. Use when all features may be relevant.
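
For linear regression, ridge has a closed-form solution. The sketch below assumes $J_{\text{original}}$ is the residual sum of squares with no intercept, which gives $w = (X^\top X + \lambda I)^{-1} X^\top y$. The synthetic data contain two nearly identical features to illustrate how the weight is shared between them.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: solve (X^T X + lam * I) w = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)
n = 200
z = rng.normal(size=n)
# Two almost identical (highly correlated) features plus one independent one.
X = np.column_stack([
    z + 0.01 * rng.normal(size=n),
    z + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
y = 2.0 * z + 1.0 * X[:, 2] + rng.normal(0.0, 0.5, n)

w_ols = ridge_fit(X, y, lam=0.0)      # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)
print("ols:  ", np.round(w_ols, 2))
print("ridge:", np.round(w_ridge, 2))
```

Without regularization the two correlated weights can take large values of opposite sign that nearly cancel. With the penalty, they should come out close to each other and sum to roughly the true combined effect of 2, and none of the weights is exactly zero.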

Elastic Net

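$$J_{\text{ElasticNet}} = J_{\text{original}} + \lambda_1 \sum_{i=1}^{p} |w_i| + \lambda_2 \sum_{i=1}^{p} w_i^2$$

Combines the L1 and L2 penalties to get sparsity from L1 and stability under correlated features from L2. The two weights are often reparameterized by an overall strength $\lambda$ and a mixing parameter $\alpha \in [0, 1]$, with $\lambda_1 = \alpha \lambda$ and $\lambda_2 = (1 - \alpha) \lambda$ (up to library-specific constant factors): $\alpha = 1$ is pure Lasso, $\alpha = 0$ is pure Ridge.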
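
As a small worked example, the penalty term can be written with an overall strength and a mixing parameter. The mapping used here, $\lambda_1 = \alpha \lambda$ and $\lambda_2 = (1 - \alpha) \lambda$, is one common convention and is an assumption of this sketch. Libraries differ in names and constant factors, and the function name is hypothetical.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Return lam * (alpha * sum|w_i| + (1 - alpha) * sum w_i^2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(w) for w in weights)   # Lasso part
    l2 = sum(w * w for w in weights)    # Ridge part
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

weights = [0.5, -2.0, 0.0]              # sum|w| = 2.5, sum w^2 = 4.25
pure_lasso = elastic_net_penalty(weights, lam=0.1, alpha=1.0)
pure_ridge = elastic_net_penalty(weights, lam=0.1, alpha=0.0)
mixed = elastic_net_penalty(weights, lam=0.1, alpha=0.5)
print(pure_lasso, pure_ridge, mixed)
```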

The hyperparameter $\lambda$ controls the strength of regularization (some libraries instead expose its inverse, $C = 1/\lambda$, where a larger $C$ means weaker regularization):

$\lambda = 0$: no regularization (original model)
Small $\lambda$: weak regularization, model can be complex
Large $\lambda$: strong regularization, model is forced to be simple
$\lambda \to \infty$: all weights shrink to zero

Cross-Validation

Instead of a single train/test split, cross-validation provides a more robust estimate of model performance by using multiple splits.

k-Fold Cross-Validation

Split the data into $k$ folds of approximately equal size (commonly $k = 5$ or $k = 10$).
For each fold $i = 1, \ldots, k$: train on all folds except fold $i$, validate on fold $i$.
Average the $k$ validation scores.

Result: every sample is used for validation exactly once. The average score is a more reliable estimate of generalization than a single split.
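
The three steps can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the "model" is deliberately trivial (it predicts the training mean) so that the splitting logic stays in focus.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) for each of k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

def cross_val_mse(y, k=5, seed=0):
    """k-fold estimate of the MSE of a predict-the-training-mean baseline."""
    scores = []
    for train, val in k_fold_indices(len(y), k, seed):
        mean = sum(y[i] for i in train) / len(train)               # "fit"
        scores.append(sum((y[i] - mean) ** 2 for i in val) / len(val))
    return sum(scores) / len(scores)                               # average over folds

y = [float(i % 7) for i in range(23)]   # toy targets, for illustration only
splits = list(k_fold_indices(len(y), k=5))
cv_score = cross_val_mse(y, k=5)
print(f"5-fold CV estimate of MSE: {cv_score:.3f}")
```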

Stratified k-Fold

Ensures that each fold has approximately the same proportion of each class as the full dataset. Essential for imbalanced classification (e.g., rare mineral deposits comprise only 5% of samples).
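
A minimal sketch of stratification, assuming sortable class labels and no shuffling within a class (a real implementation would normally shuffle first): samples of each class are dealt out to the folds in turn, so every fold receives its share of the rare class. The function name is hypothetical.

```python
from collections import defaultdict

def stratified_fold_assignment(labels, k):
    """Return a list giving the fold index (0..k-1) of every sample."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    offset = 0
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            fold_of[idx] = (position + offset) % k   # deal out round-robin
        offset += len(by_class[label])               # stagger the next class
    return fold_of

# 100 samples, 5% positives (e.g., a rare deposit class), 5 folds.
labels = [1 if i % 20 == 0 else 0 for i in range(100)]
fold_of = stratified_fold_assignment(labels, k=5)
positives_per_fold = [
    sum(1 for i, fold in enumerate(fold_of) if fold == j and labels[i] == 1)
    for j in range(5)
]
fold_sizes = [fold_of.count(j) for j in range(5)]
print(positives_per_fold, fold_sizes)
```

With a plain random split, some folds could easily contain no positive sample at all, which makes per-fold metrics such as recall undefined.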

Leave-One-Out Cross-Validation (LOOCV)

$k = n$ (one fold per sample). Trains $n$ models, each leaving out one sample. Gives an almost unbiased estimate of the generalization error, although the estimate itself can have high variance, and is computationally expensive for large datasets. Useful when $n$ is very small, which is common in geoscience (e.g., only 20 core samples available).
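
For a dataset of 20 samples, LOOCV is cheap enough to write out directly. The sketch below uses a simple straight-line fit and synthetic data standing in for 20 hypothetical core samples. The variable names and the linear relationship are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def loocv_mse(xs, ys):
    """Train n models, each leaving one sample out; return (MSE, n_models)."""
    squared_errors = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared_errors.append((ys[i] - (slope * xs[i] + intercept)) ** 2)
    return sum(squared_errors) / len(squared_errors), len(squared_errors)

noise = random.Random(0)
xs = [i / 19 for i in range(20)]                         # 20 samples
ys = [2.0 * x + 1.0 + noise.gauss(0.0, 0.1) for x in xs]

cv_mse, n_models = loocv_mse(xs, ys)
slope, intercept = fit_line(xs, ys)
train_mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"models trained: {n_models}, LOOCV MSE: {cv_mse:.4f}, training MSE: {train_mse:.4f}")
```

The LOOCV error is never smaller than the training error for a least-squares fit, since each held-out point is predicted by a model that did not see it.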

Early Stopping

For iterative algorithms (gradient descent, neural networks, boosting), early stopping monitors the validation error during training and stops when it starts increasing (in practice, when it has not improved for a set number of iterations), even if training error is still decreasing. This prevents the model from entering the overfitting regime.

The number of training iterations serves as an implicit regularization parameter: fewer iterations = simpler model.
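
A minimal sketch of early stopping for gradient descent on a linear model with polynomial features. The `patience` rule (stop after a fixed number of iterations without improvement), the learning rate, and the synthetic data are assumptions of this example, and the function name is hypothetical.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

def train_with_early_stopping(X_tr, y_tr, X_va, y_va,
                              lr=0.05, max_iters=5000, patience=20):
    """Gradient descent on training MSE; keep the weights with the best validation MSE."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_iter = w.copy(), np.inf, 0
    val_history = []
    for it in range(max_iters):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val = float(np.mean((X_va @ w - y_va) ** 2))
        val_history.append(val)
        if val < best_val:                    # validation error improved
            best_w, best_val, best_iter = w.copy(), val, it
        elif it - best_iter >= patience:      # no improvement for `patience` steps
            break
    return best_w, best_iter, val_history

rng = np.random.default_rng(6)

def make_data(n, degree=12, sigma=0.3):
    """Noisy samples of an assumed true function, expanded in Chebyshev features."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, sigma, n)
    return chebvander(x, degree), y

X_tr, y_tr = make_data(15)     # few samples, many features: prone to overfitting
X_va, y_va = make_data(200)
best_w, best_iter, val_history = train_with_early_stopping(X_tr, y_tr, X_va, y_va)
print(f"stopped after {len(val_history)} iterations; best iteration: {best_iter}")
```

Returning the weights from the best iteration, rather than the last one, means the patience window does not cost any validation performance.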

Dropout (Neural Network Preview)

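Dropout randomly sets a fraction of neuron outputs to zero during each training step (e.g., dropout rate = 0.5 means each neuron has a 50% chance of being dropped). This discourages neurons from co-adapting and creates an implicit ensemble of sub-networks. At prediction time, all neurons are active, and outputs are scaled by the keep probability (1 minus the dropout rate) so that their expected magnitude matches what the next layer saw during training. Many implementations instead apply the equivalent scaling of 1/(1 - rate) to the surviving outputs during training.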
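
The mechanism can be sketched as a single function. This version uses the "inverted" variant, in which surviving activations are rescaled during training so that nothing needs to change at prediction time. It matches the classic formulation in expectation. The function name and array shapes are assumptions for illustration.

```python
import numpy as np

def dropout_forward(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of entries while training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    if not training or rate == 0.0:
        return activations                     # prediction: all neurons active
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True = neuron kept
    return activations * mask / keep_prob      # rescale so the expectation is unchanged

rng = np.random.default_rng(7)
acts = np.ones((1000, 100))                    # dummy layer outputs
out_train = dropout_forward(acts, rate=0.5, training=True, rng=rng)
out_eval = dropout_forward(acts, rate=0.5, training=False, rng=rng)
print(f"fraction dropped: {(out_train == 0).mean():.3f}, "
      f"mean activation in training: {out_train.mean():.3f}")
```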

Geoscience-Specific Considerations

Small Datasets

Geoscience datasets are often small (tens to hundreds of samples from expensive well logs or field campaigns). This makes overfitting a severe risk. Strategies:

Use simpler models (fewer parameters relative to number of samples)
Use strong regularization
Use k-fold or leave-one-out CV instead of a single train/test split
Consider data augmentation where physically meaningful

Spatial Autocorrelation

Geoscience data is spatially correlated: samples from nearby wells or adjacent grid cells are more similar than random pairs. Standard random train/test splitting can cause data leakage through spatial proximity — a model that appears to generalize well may simply be interpolating between nearby training samples.

Solution: Use spatial cross-validation — split data by spatial blocks (e.g., leave one well out, or split by geographic region) so that training and validation data are spatially separated.
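
A leave-one-well-out split can be sketched with the standard library alone. The well identifiers below are hypothetical, and the function is an illustration of the idea rather than a reference implementation. The same pattern applies to any spatial block label, such as a region or grid tile.

```python
from collections import defaultdict

def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, val_indices), one split per group."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    for held_out in sorted(by_group):
        val = by_group[held_out]
        train = [i for group, idxs in by_group.items()
                 if group != held_out for i in idxs]
        yield held_out, train, val

# One label per sample: which well the sample came from (hypothetical IDs).
well_ids = ["W1"] * 4 + ["W2"] * 3 + ["W3"] * 5
spatial_splits = list(leave_one_group_out(well_ids))
for held_out, train, val in spatial_splits:
    print(f"validate on {held_out}: {len(train)} training / {len(val)} validation samples")
```

Because no well contributes samples to both sides of a split, the validation score reflects prediction at an unseen location rather than interpolation within a well.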

References

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 7 (model assessment, bias-variance) & ch. 3 (regularization, ridge, lasso). Springer.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 5 (resampling, cross-validation) & ch. 6 (regularization). Springer.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 1.5 & 3.1 (bias-variance, regularization). Springer.
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 4.5 (regularization). MIT Press.