Cross-validation done right

Part 8 — Resampling and nonparametrics

Learning objectives

  • Distinguish in-sample (training) error from out-of-sample (cross-validation) error
  • Implement k-fold CV: partition data into k folds; train on k-1, test on the held-out fold; average
  • Recognise the U-SHAPED CV curve as the empirical bias-variance trade-off across model complexity
  • Recognise LOO-CV and leave-p-out CV as special cases
  • Avoid common CV PITFALLS: data leakage, time-series violation, feature-selection contamination, hyperparameter optimisation overfit

Cross-validation (CV) is the workhorse for predictive-model evaluation: estimate out-of-sample performance by ITERATIVELY HOLDING OUT part of the data, training on the rest, predicting on the held-out part, and averaging the errors. Combined with model selection (pick the model with the minimum CV error), CV is the canonical safeguard against overfitting.

k-fold cross-validation

Partition the data into k roughly-equal folds. For each fold f:

  • Fit the model on data EXCLUDING fold f (the training set).
  • Predict on fold f (the test set).
  • Record the prediction errors on fold f.

The k-fold CV estimate of out-of-sample error is the average of fold-level prediction errors. For regression, typically MSE; for classification, error rate or log-loss. Common choices: k = 5 or k = 10. Larger k → less BIAS (training set is closer to full N) but higher VARIANCE and computational cost; smaller k → more bias, less variance.
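The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (kfold_indices, kfold_mse), the polynomial model, and the toy sine-plus-linear data are all assumptions made for the example.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def kfold_mse(x, y, degree, k=5, seed=0):
    """k-fold CV estimate of MSE for a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(x), k, rng)
    fold_mse = []
    for f in range(k):
        test = folds[f]
        train = np.concatenate([folds[j] for j in range(k) if j != f])
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the other k-1 folds
        pred = np.polyval(coefs, x[test])                 # predict the held-out fold
        fold_mse.append(np.mean((y[test] - pred) ** 2))   # record the fold error
    return float(np.mean(fold_mse)), np.array(fold_mse)

# Hypothetical toy data: sine plus linear trend with Gaussian noise.
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-3, 3, 50))
y = np.sin(x) + 0.3 * x + rng.normal(0, 0.5, 50)

cv_mse, per_fold = kfold_mse(x, y, degree=3, k=5)
print(f"5-fold CV MSE (degree 3): {cv_mse:.3f}")
print("per-fold MSE:", np.round(per_fold, 3))
```

The spread of the per-fold values gives a first impression of how noisy the CV estimate is for this sample size.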

Special cases

  • Leave-One-Out CV (LOOCV): k = N. Each held-out "fold" is one observation. Has nearly zero bias (training set is almost the full N) but high variance. For OLS, LOOCV has a closed-form shortcut: $\text{LOOCV} = \frac{1}{N} \sum_i \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$, where $h_i$ is the leverage (no refitting needed).
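  • Leave-p-Out CV: hold out every possible subset of p observations in turn (p = 1 recovers LOOCV). Exhaustive, so the number of fits, C(N, p), grows combinatorially; for p > 1 it is usually too expensive.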
  • Repeated k-fold: run k-fold CV multiple times with different fold assignments and average. Reduces variance at the cost of compute.
  • Stratified k-fold: for classification, ensure each fold has the same class proportions as the full data. Standard in classification.
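The OLS shortcut for LOOCV can be checked numerically against brute-force refitting. The sketch below assumes a small simulated design matrix with an intercept column; the data and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # intercept + 2 features
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, N)

# Shortcut: one fit on all data, then rescale residuals by (1 - leverage).
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
h = np.diag(H)                           # leverages h_i
resid = y - H @ y                        # full-data residuals y_i - yhat_i
loocv_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: refit N times, each time leaving one observation out.
errs = []
for i in range(N):
    keep = np.arange(N) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ beta) ** 2)
loocv_brute = np.mean(errs)

print(f"shortcut: {loocv_shortcut:.6f}   brute force: {loocv_brute:.6f}")
```

The two numbers should agree up to floating-point error, which is why LOOCV costs only one fit for linear least squares.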

The U-shaped CV curve

Plot CV error vs model complexity (polynomial degree, regularisation parameter, tree depth, etc.). The curve is typically U-shaped:

  • Underfit (left): model too simple, both training and CV errors are high.
  • Optimal (middle): training error decreasing, CV error at minimum. This is the model selected by CV.
  • Overfit (right): training error still decreasing, but CV error rising. The model memorises noise in the training set.

This curve is the empirical realisation of the BIAS-VARIANCE TRADE-OFF: complex models reduce bias (better fit on average) but increase variance (more sensitive to training-set noise). CV picks the sweet spot.
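A rough way to see the curve numerically is to compute training MSE and k-fold CV MSE over a range of polynomial degrees. The following sketch assumes the same kind of toy data as the interactive example (sine plus linear trend, N = 50, σ = 0.5, k = 5) and stops at degree 8 to keep the polynomial fits well conditioned; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma, k = 50, 0.5, 5
x = rng.uniform(-3, 3, N)
y = np.sin(x) + 0.3 * x + rng.normal(0, sigma, N)
folds = np.array_split(rng.permutation(N), k)

degrees = list(range(1, 9))
train_mse, cv_mse = [], []
for d in degrees:
    # Training error: fit and evaluate on the same (full) data.
    fit_all = np.polyfit(x, y, d)
    train_mse.append(np.mean((y - np.polyval(fit_all, x)) ** 2))
    # CV error: fit on k-1 folds, evaluate on the held-out fold, average.
    errs = []
    for f in range(k):
        tr = np.concatenate([folds[j] for j in range(k) if j != f])
        c = np.polyfit(x[tr], y[tr], d)
        errs.append(np.mean((y[folds[f]] - np.polyval(c, x[folds[f]])) ** 2))
    cv_mse.append(np.mean(errs))

for d, tr_e, cv_e in zip(degrees, train_mse, cv_mse):
    print(f"degree {d}: train MSE {tr_e:.3f}   CV MSE {cv_e:.3f}")
print("degree with minimum CV MSE:", degrees[int(np.argmin(cv_mse))])
```

Training MSE can only fall as the degree grows, because each polynomial family contains the previous one; CV MSE has no such guarantee, and where it turns upward depends on the sample and the seed.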

The 1-SE rule

The model with the strict minimum CV error is the estimated prediction-error optimum, but the location of that minimum is sensitive to noise in the fold errors. The 1-STANDARD-ERROR rule picks the SIMPLEST model whose CV error is within one standard error of the minimum. Compute the SE of the mean CV error at the minimum (the standard deviation of its k fold errors divided by √k); from the minimum, walk left (toward simpler models) until you find the first model whose CV error exceeds the minimum by more than 1 SE. Use the model just before that. The bias toward simpler models buys interpretability and a lower risk of overfitting. glmnet reports this choice as lambda.1se; in scikit-learn it can be implemented by passing a callable to GridSearchCV's refit parameter.
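The walk-left procedure can be written as a small function. This is a sketch under stated assumptions: the function name one_se_choice is hypothetical, the rows of fold_errors are assumed to be ordered from simplest to most complex model, and the numbers in the example are made up for illustration.

```python
import numpy as np

def one_se_choice(fold_errors):
    """fold_errors: array of shape (n_models, k), rows ordered simplest -> most complex.
    Returns (index of the minimum-CV model, index chosen by the 1-SE rule)."""
    fold_errors = np.asarray(fold_errors, dtype=float)
    k = fold_errors.shape[1]
    mean = fold_errors.mean(axis=1)
    se = fold_errors.std(axis=1, ddof=1) / np.sqrt(k)   # SE of each mean CV error
    best = int(np.argmin(mean))
    threshold = mean[best] + se[best]
    # Walk left from the minimum while the simpler model stays within 1 SE.
    chosen = best
    while chosen > 0 and mean[chosen - 1] <= threshold:
        chosen -= 1
    return best, chosen

# Made-up fold errors for four models of increasing complexity (5 folds each).
fold_errors = [
    [1.00, 1.10, 0.90, 1.05, 0.95],   # mean 1.00
    [0.50, 0.62, 0.55, 0.58, 0.45],   # mean 0.54
    [0.48, 0.60, 0.50, 0.56, 0.46],   # mean 0.52 (minimum)
    [0.55, 0.70, 0.52, 0.60, 0.53],   # mean 0.58
]
best, chosen = one_se_choice(fold_errors)
print("minimum-CV model index:", best)
print("1-SE model index:", chosen)
```

Here the minimum sits at the third model, but the second model's mean error (0.54) lies within one SE (about 0.026) of the minimum (0.52), so the rule prefers the second, simpler model.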

CV for time series

Vanilla k-fold randomises the partition, which DESTROYS time ordering. For time-series data:

  • Walk-forward / expanding-window CV: train on data up to time t, predict t+1. Move forward. Respects time ordering.
  • Block CV: partition into contiguous blocks; use each block as held-out test set with the rest as training. Preserves temporal structure within blocks.

The danger of random k-fold on time series: training on FUTURE data and testing on PAST. This contaminates the test set with information the model "shouldn't have", inflating CV performance estimates. Always use temporal-aware CV for time series.
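Walk-forward splitting is easy to sketch as a generator that only ever hands the model indices from the past. The function name expanding_window_splits, the AR(1) toy series, and the least-squares AR(1) forecaster below are assumptions made for illustration.

```python
import numpy as np

def expanding_window_splits(n, initial, horizon=1):
    """Yield (train_idx, test_idx) with train = [0, t) and test = [t, t + horizon)."""
    t = initial
    while t + horizon <= n:
        yield np.arange(t), np.arange(t, t + horizon)
        t += horizon

# Hypothetical series: AR(1) with coefficient 0.8 and unit-variance noise.
rng = np.random.default_rng(0)
n = 120
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal()

sq_errs = []
for train, test in expanding_window_splits(n, initial=60):
    past = y[train]
    # Fit y[t] = phi * y[t-1] by least squares, using past data only.
    phi = (past[:-1] @ past[1:]) / (past[:-1] @ past[:-1])
    pred = phi * y[test[0] - 1]            # one-step-ahead forecast
    sq_errs.append((y[test[0]] - pred) ** 2)

print(f"walk-forward MSE: {np.mean(sq_errs):.3f} over {len(sq_errs)} forecasts")
```

Every training index is strictly smaller than every test index, so no forecast is ever informed by future observations.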

Data leakage: the silent CV killer

The most common CV bug is DATA LEAKAGE: information from the test fold sneaks into the training process. Examples:

  • Feature scaling before splitting: standardise X to (X − mean(X))/SD(X) using the WHOLE data's mean and SD. The test fold's values then influence the mean and SD applied to the training data, so test information leaks into training. Fix: compute scaling stats from training data only, apply to test.
  • Feature selection on full data: pick top-k correlated features using all the data, then CV the model on these features. The selected features include information from test folds. Fix: select features INSIDE each CV fold using only training data.
  • Imputation of missing values using all data: same problem as scaling. Impute INSIDE each fold.
  • Hyperparameter tuning on the CV folds, then reporting CV performance: the hyperparameter chosen IS overfit to those folds. Fix: NESTED CV — outer CV for honest performance, inner CV for hyperparameter selection.

Modern best practice: build a single PIPELINE (e.g., sklearn Pipeline, R caret) where preprocessing happens inside each CV fold. Never compute anything from the test data outside this pipeline.
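The feature-selection leak is easy to demonstrate on pure noise, where the true accuracy of any classifier is 50%. The sketch below is an illustration under assumptions, not a library recipe: it uses NumPy only, a hand-rolled nearest-centroid classifier, and hypothetical helper names; the sizes (50 rows, 2000 noise features, top 20 kept) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, TOP, K = 50, 2000, 20, 5
X = rng.normal(size=(N, P))          # pure-noise features
y = np.repeat([0, 1], N // 2)        # labels unrelated to X: true accuracy is 50%
folds = np.array_split(rng.permutation(N), K)

def top_features(X, y, top):
    """Indices of the `top` features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:top]

def nearest_centroid_predict(Xtr, ytr, Xte):
    """Assign each test row to the class whose training centroid is closer."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(select_inside_fold):
    global_sel = top_features(X, y, TOP)   # uses ALL rows: this is the leak
    correct = 0
    for f in range(K):
        te = folds[f]
        tr = np.concatenate([folds[j] for j in range(K) if j != f])
        sel = top_features(X[tr], y[tr], TOP) if select_inside_fold else global_sel
        pred = nearest_centroid_predict(X[tr][:, sel], y[tr], X[te][:, sel])
        correct += int((pred == y[te]).sum())
    return correct / N

leaky_acc = cv_accuracy(select_inside_fold=False)
honest_acc = cv_accuracy(select_inside_fold=True)
print(f"selection on full data (leaky): {leaky_acc:.2f}")
print(f"selection inside each fold:     {honest_acc:.2f}")
```

The leaky estimate should come out well above chance even though there is nothing to learn, while selecting inside each fold should land near 50%. A pipeline object automates exactly this "inside the fold" discipline for every preprocessing step.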

Nested cross-validation

For honest performance estimation under hyperparameter optimisation:

  • OUTER LOOP: k-fold split. For each outer fold:
      • INNER LOOP: nested k-fold on the outer training set ONLY. Use it to pick hyperparameters.
      • Re-fit the best hyperparameter setting on the full outer training set.
      • Predict on the outer test fold.

The outer-fold predictions give an honest estimate of generalisation error AFTER hyperparameter optimisation. Skipping this step (i.e., reporting inner-CV error as the final performance) yields systematically over-optimistic estimates.
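The two loops can be sketched with ridge regression as the model and the penalty λ as the hyperparameter. Everything here is an assumption chosen for illustration: the closed-form ridge helper, the simulated zero-mean data (so no intercept is fitted), the λ grid, and the 5 outer / 4 inner folds.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept; toy data are zero-mean)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_mse(X, y, lam, folds):
    """Plain k-fold CV MSE for one value of lam."""
    errs = []
    for f in range(len(folds)):
        te = folds[f]
        tr = np.concatenate([folds[j] for j in range(len(folds)) if j != f])
        beta = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
N, P = 80, 30
X = rng.normal(size=(N, P))
beta_true = np.concatenate([np.ones(5), np.zeros(P - 5)])
y = X @ beta_true + rng.normal(0, 1.0, N)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

outer = np.array_split(rng.permutation(N), 5)
outer_mse, chosen = [], []
for f in range(5):
    te = outer[f]
    tr = np.concatenate([outer[j] for j in range(5) if j != f])
    # Inner loop: folds index positions within the outer training set only.
    inner = np.array_split(rng.permutation(len(tr)), 4)
    inner_scores = [cv_mse(X[tr], y[tr], lam, inner) for lam in lambdas]
    best_lam = lambdas[int(np.argmin(inner_scores))]
    beta = ridge_fit(X[tr], y[tr], best_lam)   # refit on the full outer training set
    outer_mse.append(np.mean((y[te] - X[te] @ beta) ** 2))
    chosen.append(best_lam)

nested_estimate = float(np.mean(outer_mse))
print("lambda chosen in each outer fold:", chosen)
print(f"nested CV estimate of MSE: {nested_estimate:.3f}")
```

The outer test fold never influences which λ is chosen, so nested_estimate measures the whole procedure (tuning plus fitting), not a single pre-selected model. The chosen λ may differ between outer folds; that is expected.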

[Interactive figure: Cross Validation Explorer. Left panel: data with the best-fit polynomial. Right panel: training MSE and CV MSE against polynomial degree.]

Try it

  • Defaults: N = 50, σ = 0.5, k = 5. Look at the right panel: training MSE decreases as polynomial degree grows; CV MSE is U-shaped with a clear minimum near degree 4-7. The best-fit polynomial on the left tracks the true sine + linear shape.
  • Crank σ up to 1.5. More noise. The U-shape becomes more pronounced; the optimal degree may move lower (simpler model is preferred when noise drowns out higher-order structure).
  • Reduce N to 20. Smaller training sets raise the risk of overfitting; CV error at high degrees explodes. With only 20 data points, a degree-15 polynomial has almost as many coefficients (16) as there are observations, so the data barely constrain its out-of-sample predictions.
  • Increase k from 5 to 10. The CV curve becomes smoother (less variance from fold-to-fold differences) but takes slightly longer to compute.
  • Re-sample several times. The best degree fluctuates with seed at small N: this is variance in the CV estimate. The 1-SE rule damps this instability by preferring the simplest model whose CV error is within one standard error of the minimum.

An analyst standardises features (z-scores) using the full dataset, then runs 10-fold CV on a downstream classifier. They report CV accuracy of 92%. The model performs at 80% on a truly held-out test set. What likely went wrong?

What you now know

k-fold CV estimates out-of-sample predictive error by iterative hold-outs. The minimum of the U-shaped CV curve marks the empirical bias-variance optimum. Variants: LOOCV, stratified, repeated, time-series-aware. Data leakage is the most common silent CV killer: keep preprocessing inside the fold. For hyperparameter tuning, use NESTED CV. §8.4 next: rank-based methods, which complement CV by remaining valid under arbitrary distributions.

References

  • Stone, M. (1974). "Cross-validatory choice and assessment of statistical predictions." JRSS-B 36(2), 111–147. (Foundational paper.)
  • Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Section 7.10.
  • Arlot, S., Celisse, A. (2010). "A survey of cross-validation procedures for model selection." Statistics Surveys 4, 40–79.
  • Bergmeir, C., Benítez, J.M. (2012). "On the use of cross-validation for time series predictor evaluation." Information Sciences 191, 192–213.
  • Cawley, G.C., Talbot, N.L.C. (2010). "On over-fitting in model selection and subsequent selection bias in performance evaluation." JMLR 11, 2079–2107. (Nested CV.)
