Cross-validation done right
Learning objectives
- Distinguish in-sample (training) error from out-of-sample (cross-validation) error
- Implement k-fold CV: partition data into k folds; train on k-1, test on the held-out fold; average
- Recognise the U-SHAPED CV curve as the empirical bias-variance trade-off across model complexity
- Recognise LOO-CV and leave-p-out CV as special cases
- Avoid common CV PITFALLS: data leakage, time-series violation, feature-selection contamination, hyperparameter optimisation overfit
Cross-validation (CV) is the workhorse for predictive-model evaluation: estimate out-of-sample performance by ITERATIVELY HOLDING OUT part of the data, training on the rest, predicting on the held-out part, and averaging the errors. Combined with model selection (pick the model with the minimum CV error), CV is the canonical safeguard against overfitting.
k-fold cross-validation
Partition the data into k roughly-equal folds. For each fold f:
- Fit the model on data EXCLUDING fold f (the training set).
- Predict on fold f (the test set).
- Record the prediction errors on fold f.
The k-fold CV estimate of out-of-sample error is the average of fold-level prediction errors. For regression, typically MSE; for classification, error rate or log-loss. Common choices: k = 5 or k = 10. Larger k → less BIAS (training set is closer to full N) but higher VARIANCE and computational cost; smaller k → more bias, less variance.
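The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (kfold_indices, kfold_mse), the OLS model, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def kfold_mse(X, y, k=5, seed=0):
    """k-fold CV estimate of MSE for ordinary least squares."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), k, rng)
    fold_mse = []
    for f in range(k):
        test = folds[f]
        train = np.concatenate([folds[j] for j in range(k) if j != f])
        # Fit on the k-1 training folds only.
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Score on the held-out fold.
        resid = y[test] - X[test] @ beta
        fold_mse.append(np.mean(resid ** 2))
    return float(np.mean(fold_mse)), np.array(fold_mse)

# Hypothetical synthetic data: y = 1 + 2x + noise with SD 0.5.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)

cv_mse, per_fold = kfold_mse(X, y, k=5)
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

Because the noise SD is 0.5 in this made-up example, the CV MSE should land somewhere near the noise variance of 0.25; the per-fold values in per_fold show how much individual folds scatter around that average.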
Special cases
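- Leave-One-Out CV (LOOCV): k = N. Each held-out "fold" is one observation. Has nearly zero bias (training set is almost the full N) but high variance. For OLS, LOOCV has a closed-form shortcut: $\mathrm{CV}_{\mathrm{LOO}} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2$, where $\hat{y}_i$ is the fitted value from the full-data fit and $h_{ii}$ is the leverage of observation $i$ (no refitting needed).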
- Leave-p-Out CV: hold out p observations at a time, exhaustively over all C(N, p) subsets. LOOCV is the p = 1 case; for p > 1 the number of subsets grows so fast that it is usually too expensive.
- Repeated k-fold: run k-fold CV multiple times with different fold assignments and average. Reduces variance at the cost of compute.
- Stratified k-fold: for classification, ensure each fold has the same class proportions as the full data. Standard in classification.
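The OLS shortcut for LOOCV mentioned above rests on the identity that each leave-one-out residual equals the ordinary residual divided by one minus that observation's leverage. A small numerical check, assuming a hypothetical quadratic design matrix and synthetic data, compares the shortcut with brute-force refitting:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x, x ** 2])       # illustrative design
y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)  # illustrative response

# Shortcut: one fit, then rescale residuals by 1 - leverage.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
h = np.diag(H)                          # leverages h_ii
loocv_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out.
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ b) ** 2)
loocv_brute = np.mean(errs)

print(loocv_shortcut, loocv_brute)
```

The two numbers should agree to floating-point precision. The shortcut holds for least-squares fits that are linear in the response; it is not assumed here to carry over to other model classes.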
The U-shaped CV curve
Plot CV error vs model complexity (polynomial degree, regularisation parameter, tree depth, etc.). The curve is typically U-shaped:
- Underfit (left): model too simple, both training and CV errors are high.
- Optimal (middle): training error decreasing, CV error at minimum. This is the model selected by CV.
- Overfit (right): training error still decreasing, but CV error rising. The model memorises noise in the training set.
This curve is the empirical realisation of the BIAS-VARIANCE TRADE-OFF: complex models reduce bias (better fit on average) but increase variance (more sensitive to training-set noise). CV picks the sweet spot.
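A degree sweep like the one behind this curve can be sketched as follows. The data-generating function (a sine plus a linear trend), the sample size, and the degree range are assumptions chosen for illustration, not the settings of any particular demo.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
n, sigma, k = 50, 0.5, 5
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.3 * x + rng.normal(scale=sigma, size=n)  # assumed truth

folds = np.array_split(rng.permutation(n), k)
degrees = range(1, 11)
train_mse, cv_mse = [], []
for d in degrees:
    # Training error: fit and score on all the data.
    p = Polynomial.fit(x, y, d)
    train_mse.append(np.mean((y - p(x)) ** 2))
    # CV error: fit on k-1 folds, score on the held-out fold.
    errs = []
    for f in range(k):
        test = folds[f]
        train = np.concatenate([folds[j] for j in range(k) if j != f])
        p_f = Polynomial.fit(x[train], y[train], d)
        errs.append(np.mean((y[test] - p_f(x[test])) ** 2))
    cv_mse.append(np.mean(errs))

best = list(degrees)[int(np.argmin(cv_mse))]
for d, tr, cv in zip(degrees, train_mse, cv_mse):
    print(f"degree {d:2d}  train {tr:.3f}  cv {cv:.3f}")
print("degree with minimum CV error:", best)
```

Training MSE can only go down as the degree grows, because each polynomial family contains the previous one. CV MSE has no such guarantee, which is what lets it turn upward once the extra flexibility starts fitting noise.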
The 1-SE rule
The degree with the strict minimum CV error is the point-estimate optimum, but that minimum is itself noisy. The 1-STANDARD-ERROR rule picks the SIMPLEST model whose CV error is within one standard error of the minimum. Compute the SE of the fold errors at the minimum; from the minimum, walk left (toward simpler models) until you find the first model whose CV error exceeds the minimum by more than 1 SE, and use the model just before that. The bias toward simpler models buys interpretability and a lower risk of overfitting. The rule is built into glmnet (reported as lambda.1se); in scikit-learn's GridSearchCV it can be implemented by passing a callable to the refit parameter.
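The walk-left procedure can be written out directly. The sketch below assumes the fold errors are stored as a models-by-folds array with rows ordered from simplest to most complex, and takes the standard error at the minimising model; the function name one_se_choice and the numbers are made up for illustration.

```python
import numpy as np

def one_se_choice(fold_errors):
    """Rows = models ordered simple -> complex, columns = folds.
    Returns (index chosen by the 1-SE rule, index of the minimum)."""
    fold_errors = np.asarray(fold_errors, dtype=float)
    k = fold_errors.shape[1]
    mean = fold_errors.mean(axis=1)
    se = fold_errors.std(axis=1, ddof=1) / np.sqrt(k)
    i_min = int(np.argmin(mean))
    threshold = mean[i_min] + se[i_min]
    # Walk toward simpler models; stop at the first one above the threshold.
    choice = i_min
    for i in range(i_min - 1, -1, -1):
        if mean[i] > threshold:
            break
        choice = i
    return choice, i_min

# Hypothetical per-fold MSEs for five models of increasing complexity.
errors = [
    [1.00, 1.20, 0.90, 1.10],
    [0.50, 0.60, 0.40, 0.58],
    [0.45, 0.55, 0.40, 0.52],
    [0.40, 0.56, 0.38, 0.50],
    [0.60, 0.70, 0.50, 0.80],
]
choice, i_min = one_se_choice(errors)
print("minimum at model", i_min, "| 1-SE rule picks model", choice)
```

In this toy table the minimum mean error (0.46) is at model 3 with an SE of about 0.042, so the threshold is roughly 0.502. Model 2 (mean 0.48) is inside the band and model 1 (mean 0.52) is not, so the rule selects model 2.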
CV for time series
Vanilla k-fold randomises the partition, which DESTROYS time ordering. For time-series data:
- Walk-forward / expanding-window CV: train on data up to time t, predict t+1. Move forward. Respects time ordering.
- Block CV: partition into contiguous blocks; use each block as held-out test set with the rest as training. Preserves temporal structure within blocks.
The danger of random k-fold on time series: training on FUTURE data and testing on PAST. This contaminates the test set with information the model "shouldn't have", inflating CV performance estimates. Always use temporal-aware CV for time series.
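An expanding-window split is simple enough to generate by hand. The sketch below is one possible formulation; the function name and the min_train and horizon parameters are illustrative choices. scikit-learn's TimeSeriesSplit offers a library version of a similar idea.

```python
def walk_forward_splits(n, min_train, horizon=1):
    """Yield (train_indices, test_indices) for expanding-window CV.

    Training always uses observations 0..t-1 and testing uses the next
    `horizon` observations, so no test point precedes a training point.
    """
    t = min_train
    while t + horizon <= n:
        yield list(range(0, t)), list(range(t, t + horizon))
        t += horizon

splits = list(walk_forward_splits(n=8, min_train=5))
for train, test in splits:
    print("train", train, "-> test", test)
# train [0, 1, 2, 3, 4] -> test [5]
# train [0, 1, 2, 3, 4, 5] -> test [6]
# train [0, 1, 2, 3, 4, 5, 6] -> test [7]
```

Every training index is strictly earlier than every test index in the same split, which is exactly the property random k-fold fails to guarantee.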
Data leakage: the silent CV killer
The most common CV bug is DATA LEAKAGE: information from the test fold sneaks into the training process. Examples:
- Feature scaling before splitting: standardise X to (X − mean(X))/SD(X) using the WHOLE data's mean and SD. The test fold's information leaks into the training mean. Fix: compute scaling stats from training data only, apply to test.
- Feature selection on full data: pick top-k correlated features using all the data, then CV the model on these features. The selected features include information from test folds. Fix: select features INSIDE each CV fold using only training data.
- Imputation of missing values using all data: same problem as scaling. Impute INSIDE each fold.
- Hyperparameter tuning on the CV folds, then reporting CV performance: the hyperparameter chosen IS overfit to those folds. Fix: NESTED CV — outer CV for honest performance, inner CV for hyperparameter selection.
Modern best practice: build a single PIPELINE (e.g., sklearn Pipeline, R caret) where preprocessing happens inside each CV fold. Never compute anything from the test data outside this pipeline.
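The feature-selection case is easy to demonstrate on pure noise, where the honest accuracy must be about 50%. The following sketch is an illustration under assumed settings (40 observations, 5000 noise features, a simple nearest-centroid classifier); all function names are hypothetical. It runs the same CV twice, once with selection on the full data and once with selection inside each fold.

```python
import numpy as np

def top_features(X, y, m):
    """Indices of the m features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:m]

def centroid_predict(X_train, y_train, X_test):
    """Nearest-class-centroid classifier for labels 0/1."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, folds, m, select_inside):
    if not select_inside:
        cols_all = top_features(X, y, m)  # LEAK: sees test-fold labels
    hits = 0
    for f in range(len(folds)):
        test = folds[f]
        train = np.concatenate([folds[j] for j in range(len(folds)) if j != f])
        # Honest version: select features from the training folds only.
        cols = top_features(X[train], y[train], m) if select_inside else cols_all
        pred = centroid_predict(X[train][:, cols], y[train], X[test][:, cols])
        hits += np.sum(pred == y[test])
    return hits / len(y)

rng = np.random.default_rng(0)
n, p, m, k = 40, 5000, 20, 5
X = rng.normal(size=(n, p))   # pure noise: no feature predicts y
y = np.tile([0, 1], n // 2)   # balanced labels, unrelated to X
folds = np.array_split(rng.permutation(n), k)

leaky = cv_accuracy(X, y, folds, m, select_inside=False)
honest = cv_accuracy(X, y, folds, m, select_inside=True)
print(f"selection on full data: {leaky:.2f}  selection inside folds: {honest:.2f}")
```

With selection on the full data the reported accuracy is typically far above chance even though no signal exists, while selection inside each fold stays near 50%. A pipeline object automates the second pattern so the in-fold discipline cannot be forgotten.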
Nested cross-validation
For honest performance estimation under hyperparameter optimisation:
- OUTER LOOP: k-fold split. For each outer fold:
  - INNER LOOP: nested k-fold on the training set ONLY. Use it to pick hyperparameters.
  - Re-fit the best hyperparameter setting on the full outer training set.
  - Predict on the outer test fold.
The outer-fold predictions give an honest estimate of generalisation error AFTER hyperparameter optimisation. Skipping this step (i.e., reporting inner-CV error as the final performance) yields systematically over-optimistic estimates.
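The two loops can be sketched with a closed-form ridge regression as the model being tuned. This is an illustrative outline, not a prescribed implementation: ridge regression, the penalty grid, the helper names, and the synthetic data are all assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept; assumes centred data)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, lam, folds):
    """CV MSE of ridge at penalty lam over a given list of index folds."""
    errs = []
    for f in range(len(folds)):
        test = folds[f]
        train = np.concatenate([folds[j] for j in range(len(folds)) if j != f])
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return np.mean(errs)

def nested_cv(X, y, lams, k_outer=5, k_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), k_outer)
    outer_mse, chosen = [], []
    for f in range(k_outer):
        test = outer[f]
        train = np.concatenate([outer[j] for j in range(k_outer) if j != f])
        X_tr, y_tr = X[train], y[train]
        # INNER LOOP: pick lambda using the outer training set only.
        inner = np.array_split(rng.permutation(len(train)), k_inner)
        inner_scores = [kfold_mse(X_tr, y_tr, lam, inner) for lam in lams]
        best_lam = lams[int(np.argmin(inner_scores))]
        # Refit on the full outer training set, score on the outer test fold.
        beta = ridge_fit(X_tr, y_tr, best_lam)
        outer_mse.append(np.mean((y[test] - X[test] @ beta) ** 2))
        chosen.append(best_lam)
    return float(np.mean(outer_mse)), chosen

# Hypothetical data: 60 observations, 10 features, 3 of them informative.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 10))
true_beta = np.array([2.0, -1.0, 0.5] + [0.0] * 7)
y = X @ true_beta + rng.normal(scale=1.0, size=60)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
estimate, chosen = nested_cv(X, y, lams)
print(f"nested-CV MSE: {estimate:.3f}  lambdas chosen per outer fold: {chosen}")
```

The chosen penalty may differ from one outer fold to the next. That is expected: nested CV evaluates the whole tuning procedure, not a single fixed hyperparameter value.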
Try it
- Defaults: N = 50, σ = 0.5, k = 5. Look at the right panel: training MSE decreases as polynomial degree grows; CV MSE is U-shaped with a clear minimum near degree 4-7. The best-fit polynomial on the left tracks the true sine + linear shape.
- Crank σ up to 1.5. More noise. The U-shape becomes more pronounced; the optimal degree may move lower (simpler model is preferred when noise drowns out higher-order structure).
- Reduce N to 20. Smaller training sets raise the risk of overfitting; CV error at high degrees explodes. With only 20 data points, a degree-15 polynomial is barely constrained by the data and predicts wildly out of sample.
- Increase k from 5 to 10. The CV curve becomes smoother (less variance from fold-to-fold differences) but takes slightly longer to compute.
- Re-sample several times. The best degree fluctuates with the seed at small N; this is variance in the CV estimate. The 1-SE rule dampens this fluctuation by preferring a simpler model that is statistically indistinguishable from the minimum.
An analyst standardises features (z-scores) using the full dataset, then runs 10-fold CV on a downstream classifier. They report CV accuracy of 92%. The model performs at 80% on a truly held-out test set. What likely went wrong?
What you now know
k-fold CV estimates out-of-sample predictive error by iterative hold-outs. The minimum of the U-shaped CV curve marks the bias-variance optimum. Special cases and variants: LOOCV, stratified, repeated, time-series-aware. Data leakage is the #1 silent CV killer, so keep preprocessing inside the fold. For hyperparameter tuning, use NESTED CV. §8.4 next: rank-based methods, which complement CV by remaining valid under arbitrary distributions.
References
- Stone, M. (1974). "Cross-validatory choice and assessment of statistical predictions." JRSS-B 36(2), 111–147. (Foundational paper.)
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Section 7.10.
- Arlot, S., Celisse, A. (2010). "A survey of cross-validation procedures for model selection." Statistics Surveys 4, 40–79.
- Bergmeir, C., Benítez, J.M. (2012). "On the use of cross-validation for time series predictor evaluation." Information Sciences 191, 192–213.
- Cawley, G.C., Talbot, N.L.C. (2010). "On over-fitting in model selection and subsequent selection bias in performance evaluation." JMLR 11, 2079–2107. (Nested CV.)