Overfitting, Underfitting, Variance, and Bias
Learning objectives
- Explain the bias-variance tradeoff mathematically and intuitively
- Identify underfitting (high bias) and overfitting (high variance) from training/test performance
- Describe how training and validation error curves change with model complexity
- Apply L1 (Lasso), L2 (Ridge), and Elastic Net regularization
- Understand k-fold cross-validation and its variants
- Explain early stopping and dropout as regularization techniques
- Recognize overfitting risks specific to geoscience (small datasets, spatial correlation)
The Fundamental Problem
Every machine learning model faces a tension: we want the model to be complex enough to capture the true patterns in the data, but not so complex that it memorizes noise. This tension is formalized by the bias-variance tradeoff.
The Bias-Variance Decomposition
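For a regression problem $y = f(x) + \varepsilon$ with true function $f$ and noise $\varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, the expected prediction error of a fitted model $\hat{f}$ at a point $x_0$ can be decomposed as:
Bias-Variance Decomposition
$$\mathbb{E}\big[(y_0 - \hat{f}(x_0))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$
Here $y_0 = f(x_0) + \varepsilon$, and the expectation is taken over training sets and noise.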
- Bias: The error from incorrect assumptions in the model. High bias means the model is too simple to capture the true relationship. Measures how far the average prediction is from the truth.
- Variance: The error from sensitivity to fluctuations in the training data. High variance means the model changes dramatically with different training samples. Measures how spread out predictions are across different training sets.
- Irreducible noise: The inherent randomness in the data that no model can eliminate (measurement error, natural variability).
The expected error is the sum of three components: squared bias, variance, and irreducible noise. Since we cannot reduce the irreducible noise, we must balance bias and variance:
- Decreasing model complexity → increases bias but decreases variance
- Increasing model complexity → decreases bias but increases variance
The optimal model minimizes the total error, which is typically achieved at an intermediate level of complexity, where neither bias nor variance dominates.
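The tradeoff can be checked numerically. The following is a minimal sketch, not a definitive experiment: it assumes a sine-shaped true function, fixed evenly spaced inputs, and Gaussian noise (all choices made here for illustration), refits polynomials of several degrees on many noisy training sets, and estimates squared bias and variance of the prediction at a single point.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function, chosen only for illustration
    return np.sin(2 * np.pi * x)

def bias_variance_at(degree, x0=0.3, n_train=30, sigma=0.3, n_repeats=500):
    """Monte Carlo estimate of squared bias and variance of a polynomial fit at x0."""
    x = np.linspace(0.0, 1.0, n_train)  # fixed inputs; only the noise is redrawn
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y = true_f(x) + rng.normal(0.0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # distance of the average prediction from the truth
    variance = preds.var()                      # spread of predictions across training sets
    return float(bias_sq), float(variance)

results = {d: bias_variance_at(d) for d in (1, 3, 9)}
for d, (bias_sq, variance) in results.items():
    print(f"degree {d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Under these assumptions the straight line should show the largest squared bias and the degree-9 fit the largest variance, which is the pattern described in the two bullets above.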
Underfitting (High Bias)
A model underfits when it is too simple to capture the underlying pattern. Signs of underfitting:
- Poor performance on training data (high training error)
- Poor performance on test/validation data (high test error)
- Training error ≈ test error (both are high)
Example: Fitting a straight line (degree-1 polynomial) to data that follows a parabolic curve. The line cannot capture the curvature, regardless of how much data is available.
Fixes: Use a more complex model, add more features, reduce regularization, train longer (for neural networks).
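The straight-line example can be reproduced with a short sketch. The data here are synthetic (a parabola plus Gaussian noise, an assumption made for illustration); the point is that the line has similar, high error on both training and test data, while a quadratic does not.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_parabola(n, sigma=0.1):
    # Assumed synthetic data: y = x^2 plus Gaussian noise
    x = rng.uniform(-1.0, 1.0, n)
    return x, x**2 + rng.normal(0.0, sigma, n)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_parabola(200)
x_test, y_test = make_parabola(200)

line = np.polyfit(x_train, y_train, 1)  # too simple: cannot bend
quad = np.polyfit(x_train, y_train, 2)  # matches the true shape

line_train, line_test = mse(line, x_train, y_train), mse(line, x_test, y_test)
quad_train, quad_test = mse(quad, x_train, y_train), mse(quad, x_test, y_test)
print(f"line: train MSE = {line_train:.4f}, test MSE = {line_test:.4f}")
print(f"quad: train MSE = {quad_train:.4f}, test MSE = {quad_test:.4f}")
```

Adding more samples to `make_parabola` would not help the line: its error is dominated by bias, not by a shortage of data.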
Overfitting (High Variance)
A model overfits when it memorizes the training data, including the noise. Signs of overfitting:
- Excellent performance on training data (very low training error)
- Poor performance on test/validation data (high test error)
- Large gap between training error and test error
Example: Fitting a degree-20 polynomial to 25 data points. With 21 free coefficients, the polynomial can pass through or very close to nearly every training point, but it oscillates wildly between points and makes terrible predictions on new data.
Fixes: Get more training data, reduce model complexity, add regularization, select features, use dropout (neural networks) or early stopping, and use cross-validation to choose the model complexity or regularization strength.
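The degree-20 example can be sketched as follows. The sine-shaped target, noise level, and evenly spaced inputs are assumptions made for illustration, and a Chebyshev basis is used only because it is numerically better behaved than raw powers of x at high degree.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(2)
sigma = 0.2

def noisy_sine(x):
    # Assumed data-generating process: a sine wave plus Gaussian noise
    return np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, x.shape)

x_train = np.linspace(0.0, 1.0, 25)
y_train = noisy_sine(x_train)
x_test = rng.uniform(0.0, 1.0, 500)
y_test = noisy_sine(x_test)

def train_test_mse(degree):
    model = Chebyshev.fit(x_train, y_train, degree)  # least-squares polynomial fit
    train = float(np.mean((model(x_train) - y_train) ** 2))
    test = float(np.mean((model(x_test) - y_test) ** 2))
    return train, test

errors = {d: train_test_mse(d) for d in (3, 20)}
for d, (train, test) in errors.items():
    print(f"degree {d}: train MSE = {train:.4f}, test MSE = {test:.4f}")
```

The degree-20 fit should report a lower training error than the cubic and a noticeably higher test error, which is the large train/test gap listed among the signs above.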
Training vs. Validation Curves
The relationship between training error and validation error as model complexity increases reveals the bias-variance tradeoff:
How to Read the Curves
Low complexity (left side): Both training and validation errors are high. The model underfits — it cannot even fit the training data well.
Optimal complexity (middle): Training error is low and validation error is at its minimum. The model captures the true pattern without memorizing noise.
High complexity (right side): Training error approaches zero, but validation error increases. The model overfits — it memorizes training noise that does not generalize.
The sweet spot is where validation error is minimized. The gap between training and validation error indicates the degree of overfitting.
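A sweep over model complexity makes the curves concrete. This is a minimal sketch under assumed conditions (synthetic sine data, polynomial degree as the complexity axis, a single held-out validation set); it records both errors at each degree and picks the degree with the lowest validation error.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(9)

def sample(x, sigma=0.2):
    # Assumed data-generating process: a sine wave plus Gaussian noise
    return np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, x.shape)

x_train = np.linspace(0.0, 1.0, 20)
y_train = sample(x_train)
x_val = rng.uniform(0.0, 1.0, 200)
y_val = sample(x_val)

degrees = list(range(1, 16))
train_err, val_err = [], []
for d in degrees:
    model = Chebyshev.fit(x_train, y_train, d)
    train_err.append(float(np.mean((model(x_train) - y_train) ** 2)))
    val_err.append(float(np.mean((model(x_val) - y_val) ** 2)))

best_degree = degrees[int(np.argmin(val_err))]  # the "sweet spot"
for d, tr, va in zip(degrees, train_err, val_err):
    marker = " <- lowest validation error" if d == best_degree else ""
    print(f"degree {d:2d}: train = {tr:.4f}, validation = {va:.4f}{marker}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones; validation error falls and then rises again.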
Regularization
Regularization penalizes model complexity by adding a penalty term to the loss function. This discourages the model from fitting noise.
L1 Regularization (Lasso)
The L1 penalty is the sum of absolute values of the weights. It encourages sparsity: many weights become exactly zero, effectively performing feature selection. Use when you suspect many features are irrelevant.
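The sparsity effect can be seen in a small from-scratch sketch. It assumes a squared-error loss scaled by 1/(2n), synthetic data in which only three of ten features matter, and a simple proximal gradient (ISTA) solver; production libraries generally use different, faster solvers.

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_threshold(z, t):
    # Shrinks every entry toward zero by t and sets small entries exactly to zero
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Assumed synthetic data: 10 features, only the first 3 influence y
n, p = 100, 10
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + rng.normal(0.0, 0.5, n)

w_lasso = lasso_ista(X, y, lam=0.2)
print("weights:", np.round(w_lasso, 3))
print("weights exactly zero:", int(np.sum(w_lasso == 0.0)))
```

The soft-thresholding step is what produces exact zeros: any weight whose update falls within the threshold is set to zero rather than merely made small.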
L2 Regularization (Ridge)
The L2 penalty is the sum of squared weights. It shrinks all weights toward zero but rarely makes them exactly zero. It handles correlated features well by distributing weight among them. Use when all features may be relevant.
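For a linear model with squared-error loss, the ridge solution has a closed form, which makes the shrinkage easy to inspect. The sketch below assumes no intercept and uses two nearly identical synthetic features (an illustrative setup) to show how the penalty spreads weight across correlated inputs.

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (no intercept): solve (X^T X + lam * I) w = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Assumed synthetic data: two almost identical (highly correlated) features
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0.0, 0.01, n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0.0, 0.5, n)

weights = {lam: ridge_fit(X, y, lam) for lam in (0.0, 10.0, 1000.0)}
for lam, w in weights.items():
    print(f"lambda = {lam:6.1f}: weights = {np.round(w, 3)}, norm = {np.linalg.norm(w):.3f}")
```

Without a penalty, the split of weight between the two features is poorly determined. With a penalty, the two weights become nearly equal and shrink as the penalty grows, but neither reaches exactly zero.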
Elastic Net
Combines the L1 and L2 penalties, aiming for sparsity from L1 and stability from L2. Controlled by a mixing parameter $\rho \in [0, 1]$ that weights the L1 term (called `l1_ratio` in scikit-learn): $\rho = 1$ is pure Lasso, $\rho = 0$ is pure Ridge.
The hyperparameter $\lambda$ (also written as $\alpha$ in some libraries) controls the strength of regularization:
- $\lambda = 0$: no regularization (original model)
- Small $\lambda$: weak regularization, model can be complex
- Large $\lambda$: strong regularization, model is forced to be simple
- $\lambda \to \infty$: all weights shrink to zero
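For reference, the three penalized objectives can be written side by side. This is one common way to write them, not the only one: $L(w)$ stands for the unpenalized data loss, the symbol $\rho$ for the Elastic Net mixing parameter is an assumption made here, and libraries differ in constant factors such as 1/2 or 1/n.

```latex
\begin{aligned}
\text{Ridge (L2):} \quad & J(w) = L(w) + \lambda \sum_{j} w_j^2 \\
\text{Lasso (L1):} \quad & J(w) = L(w) + \lambda \sum_{j} |w_j| \\
\text{Elastic Net:} \quad & J(w) = L(w) + \lambda \Big( \rho \sum_{j} |w_j| + (1 - \rho) \sum_{j} w_j^2 \Big)
\end{aligned}
```

In each case the regularization strength multiplies the penalty, so setting it to zero recovers the unpenalized loss $L(w)$.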
Cross-Validation
Instead of a single train/test split, cross-validation provides a more robust estimate of model performance by using multiple splits.
k-Fold Cross-Validation
- Split the data into $k$ equally-sized folds (commonly $k = 5$ or $k = 10$).
- For each fold $i$: train on all folds except fold $i$, validate on fold $i$.
- Average the $k$ validation scores.
Result: every sample is used for validation exactly once. The average score is a more reliable estimate of generalization than a single split.
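The three steps above can be sketched with NumPy alone. The helper names `k_fold_indices` and `cross_val_mse` are hypothetical, and the data are synthetic; libraries such as scikit-learn provide equivalent, more complete tools (for example `KFold`).

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # fold sizes differ by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_val_mse(x, y, degree, k=5):
    """Average validation MSE of a polynomial fit over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        scores.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Assumed synthetic data: a noisy parabola
rng = np.random.default_rng(5)
x = rng.uniform(-1.0, 1.0, 60)
y = x**2 + rng.normal(0.0, 0.1, 60)

cv_line = cross_val_mse(x, y, degree=1)
cv_quad = cross_val_mse(x, y, degree=2)
print(f"5-fold CV MSE: line = {cv_line:.4f}, quadratic = {cv_quad:.4f}")
```

Shuffling before splitting matters when the samples are stored in a meaningful order. It is not appropriate when samples are spatially or temporally dependent.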
Stratified k-Fold
Ensures that each fold has approximately the same proportion of each class as the full dataset. Essential for imbalanced classification (e.g., rare mineral deposits comprise only 5% of samples).
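A simple way to see what stratification does is to assign folds class by class. The sketch below uses a hypothetical helper, `stratified_fold_ids`, and toy labels with 5% positives; it is a simplified illustration, and scikit-learn's `StratifiedKFold` handles uneven class sizes more carefully.

```python
import numpy as np

def stratified_fold_ids(labels, k, seed=0):
    """Assign each sample a fold id in [0, k) so class proportions are preserved."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Deal the shuffled members of this class round-robin across the folds
        fold_ids[idx] = np.arange(len(idx)) % k
    return fold_ids

# Assumed toy labels: 10 positives (e.g. "deposit") among 200 samples, i.e. 5%
labels = np.array([1] * 10 + [0] * 190)
fold_ids = stratified_fold_ids(labels, k=5)
for f in range(5):
    in_fold = fold_ids == f
    print(f"fold {f}: size = {int(in_fold.sum())}, positives = {int(labels[in_fold].sum())}")
```

With a plain random split, a fold could easily end up with no positive samples at all, which would make its validation score meaningless for the rare class.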
Leave-One-Out Cross-Validation (LOOCV)
LOOCV is k-fold cross-validation with $k = n$ (one fold per sample). It trains $n$ models, each leaving out one sample. It gives an almost unbiased estimate of generalization error but is computationally expensive for large datasets. It is useful when $n$ is very small (common in geoscience, e.g., when only 20 core samples are available).
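A minimal LOOCV loop for a 20-sample dataset might look like the sketch below. The data are synthetic (a noisy straight line standing in for a small set of core measurements), and `loocv_mse` is a hypothetical helper name.

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out CV: n fits, each validated on the single held-out sample."""
    n = len(x)
    sq_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # every sample except sample i
        coeffs = np.polyfit(x[keep], y[keep], degree)
        sq_errors[i] = (np.polyval(coeffs, x[i]) - y[i]) ** 2
    return float(sq_errors.mean())

# Assumed toy data: 20 samples from a noisy linear relationship
rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.2, 20)

scores = {d: loocv_mse(x, y, d) for d in (1, 3, 8)}
for d, score in scores.items():
    print(f"degree {d}: LOOCV MSE = {score:.4f}")
```

With only 20 samples the loop is cheap, and each model is trained on 19 of them, which is as close to the full dataset as a held-out estimate allows.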
Early Stopping
For iterative algorithms (gradient descent, neural networks, boosting), early stopping monitors the validation error during training and stops when it starts increasing, even if training error is still decreasing. This prevents the model from entering the overfitting regime.
The number of training iterations serves as an implicit regularization parameter: fewer iterations = simpler model.
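Early stopping can be sketched with plain gradient descent on a linear model. The setup is an assumption chosen so that long training overfits (30 noisy samples, 25 features, only 3 of which matter), and the `patience` rule, which stops after a fixed number of steps without improvement, is one common convention rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed setup: few noisy samples and many features, so long training overfits
n_train, n_val, p = 30, 200, 25
true_w = np.zeros(p)
true_w[:3] = [1.0, -2.0, 0.5]
X_train = rng.normal(size=(n_train, p))
X_val = rng.normal(size=(n_val, p))
y_train = X_train @ true_w + rng.normal(0.0, 1.0, n_train)
y_val = X_val @ true_w + rng.normal(0.0, 1.0, n_val)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(p)
lr, patience = 0.01, 20
best_val, best_w, best_step, bad_steps = np.inf, w.copy(), 0, 0
for step in range(1, 5001):
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / n_train
    w = w - lr * grad
    val = mse(X_val, y_val, w)
    if val < best_val:
        best_val, best_w, best_step, bad_steps = val, w.copy(), step, 0
    else:
        bad_steps += 1
        if bad_steps >= patience:
            break  # validation error has stopped improving

# Least-squares solution: where gradient descent would end up if never stopped
w_full = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
val_full = mse(X_val, y_val, w_full)
print(f"stopped at step {step}, best step {best_step}")
print(f"validation MSE: early-stopped = {best_val:.3f}, fully trained = {val_full:.3f}")
```

Keeping a copy of the best weights (`best_w`) matters: the weights at the moment of stopping are already `patience` steps past the best point.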
Dropout (Neural Network Preview)
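Dropout randomly sets a fraction of neuron outputs to zero during each training step (e.g., dropout rate = 0.5 means each neuron has a 50% chance of being dropped). This prevents neurons from co-adapting and creates an implicit ensemble of sub-networks. At prediction time, all neurons are active, and outputs are scaled by the keep probability (1 minus the dropout rate) so that their expected magnitude matches training. Many frameworks instead apply the equivalent rescaling during training ("inverted dropout"), so no scaling is needed at prediction time.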
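The mechanics fit in a few lines. The sketch below implements the "inverted" variant, in which the surviving activations are rescaled during training so that nothing needs to change at prediction time; the function name and the all-ones activations are illustrative assumptions.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # prediction time: all units active, no scaling needed
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit is kept
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(8)
h = np.ones((1000, 100))  # stand-in for a layer's activations
h_train = dropout(h, rate=0.5, training=True, rng=rng)
h_eval = dropout(h, rate=0.5, training=False, rng=rng)

print(f"fraction dropped in training: {np.mean(h_train == 0.0):.3f}")
print(f"mean activation: training = {h_train.mean():.3f}, prediction = {h_eval.mean():.3f}")
```

A fresh mask is drawn on every call, so each training step effectively trains a different sub-network.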
Geoscience-Specific Considerations
Small Datasets
Geoscience datasets are often small (tens to hundreds of samples from expensive well logs or field campaigns). This makes overfitting a severe risk. Strategies:
- Use simpler models (fewer parameters relative to number of samples)
- Use strong regularization
- Use k-fold or leave-one-out CV instead of a single train/test split
- Consider data augmentation where physically meaningful
Spatial Autocorrelation
Geoscience data is spatially correlated: samples from nearby wells or adjacent grid cells are more similar than random pairs. Standard random train/test splitting can cause data leakage through spatial proximity — a model that appears to generalize well may simply be interpolating between nearby training samples.
Solution: Use spatial cross-validation — split data by spatial blocks (e.g., leave one well out, or split by geographic region) so that training and validation data are spatially separated.
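Leave-one-well-out splitting can be sketched as below. The well labels are hypothetical, and the helper name `leave_one_group_out` is an assumption; scikit-learn offers comparable splitters (`LeaveOneGroupOut`, `GroupKFold`).

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, val_idx), holding out one group at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        val_mask = groups == g
        yield g, np.flatnonzero(~val_mask), np.flatnonzero(val_mask)

# Hypothetical well labels: 12 samples from 3 wells
well_ids = np.array(["W1"] * 5 + ["W2"] * 4 + ["W3"] * 3)
splits = list(leave_one_group_out(well_ids))
for well, train_idx, val_idx in splits:
    train_wells = sorted(set(well_ids[train_idx].tolist()))
    print(f"validate on {well} ({len(val_idx)} samples), train on {train_wells}")
```

Because all samples from a well move together, no validation sample has a training neighbor from the same well. The same idea applies to geographic blocks by using a block id as the group label.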