Model selection: AIC, BIC, cross-validation

Part 4 — Linear regression, done seriously

Learning objectives

  • Diagnose overfitting via the gap between training and test error
  • Apply AIC = -2 log L + 2p and BIC = -2 log L + p log n as in-sample penalised likelihood criteria
  • Distinguish AIC's prediction-optimal asymptotics from BIC's true-model-recovery asymptotics
  • Run K-fold and leave-one-out cross-validation honestly
  • Recognise that model selection should match the inferential goal (prediction vs explanation vs scientific testing)

By Part 4 we've seen how to ENRICH a regression model: add covariates, interactions, polynomial terms, splines (§4.6); add robust estimators (§4.5); add sandwich SEs or WLS (§4.4). The remaining question: how to CHOOSE among the many possible specifications? This is model selection — arguably the most consequential step in applied regression, and the one most prone to silent abuse via specification searches.

The overfitting problem

For almost ANY dataset, you can drive the training-set residual sum of squares to zero by adding enough parameters (p = n linearly independent terms give a perfect interpolating fit). But the resulting model predicts new data terribly. Training error never increases as terms are added to a nested model; TEST error is typically U-shaped: too simple means under-fit (high bias), too complex means overfit (high variance). The model-selection job is to land near the bottom of that U.
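The two error curves can be reproduced in a few lines. The following is a minimal sketch, assuming a sinusoidal truth, Gaussian noise, and polynomial fits of increasing degree; the function name `simulate`, the sample sizes, and the seed are illustrative choices, not part of the lesson's widget.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def simulate(n, sigma=0.4):
    """Noisy draws from a sinusoidal truth (an illustrative assumption)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, np.sin(np.pi * x) + sigma * rng.normal(size=n)

x_train, y_train = simulate(30)     # small training set
x_test, y_test = simulate(1000)     # large held-out set from the same process

degrees = np.arange(0, 13)
train_mse, test_mse = [], []
for d in degrees:
    # Least-squares polynomial fit of degree d on the training data only
    fit = Polynomial.fit(x_train, y_train, deg=int(d))
    train_mse.append(np.mean((y_train - fit(x_train)) ** 2))
    test_mse.append(np.mean((y_test - fit(x_test)) ** 2))
train_mse, test_mse = np.array(train_mse), np.array(test_mse)

best_degree = int(degrees[np.argmin(test_mse)])
print("degree minimising held-out MSE:", best_degree)
print("training MSE at lowest and highest degree:", train_mse[0], train_mse[-1])
```

Training MSE keeps falling as the degree grows, while held-out MSE bottoms out at an intermediate degree and then rises again.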

AIC: prediction-optimal in-sample criterion

Akaike Information Criterion:

$$\mathrm{AIC} = -2 \log L(\hat{\boldsymbol{\beta}}) + 2p,$$

where $L$ is the maximised likelihood and $p$ is the number of estimated parameters. Lower AIC is better. The $-2 \log L$ term rewards goodness of fit; the $+2p$ term penalises complexity. Asymptotic interpretation: AIC selects the model with the lowest expected out-of-sample prediction error (under regularity conditions).
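For a linear model with Gaussian errors, the likelihood term reduces to a function of the residual sum of squares, which is how AIC is usually computed for OLS. A sketch of the reduction, assuming the error variance is set to its maximum-likelihood value $\hat{\sigma}^2 = \mathrm{RSS}/n$ (the symbols $\mathrm{RSS}$ and $\hat{\sigma}^2$ are introduced here for illustration):

$$
-2 \log L(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2)
= n \log\!\left(2\pi\hat{\sigma}^2\right) + \frac{\mathrm{RSS}}{\hat{\sigma}^2}
= n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + n\left(1 + \log 2\pi\right),
$$

$$
\mathrm{AIC} = n \log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p + n\left(1 + \log 2\pi\right).
$$

The last term is identical for every model fitted to the same $n$ observations, so it drops out of any comparison. Conventions differ on whether $p$ counts the error variance as a parameter; either choice shifts every candidate's AIC by the same amount and leaves the ranking unchanged.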

BIC: true-model-recovery criterion

Bayesian Information Criterion:

$$\mathrm{BIC} = -2 \log L(\hat{\boldsymbol{\beta}}) + p \log n.$$

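Same likelihood term, but the complexity penalty grows with n: log n > 2 once n ≥ 8, so each parameter costs more under BIC than under AIC. BIC selects the TRUE model with probability → 1 as n → ∞, IF the true model is in the candidate set. For n ≥ 8 and barring ties, BIC never picks a model with more parameters than AIC's choice, and it often picks a SMALLER one (heavier penalty).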
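To see the two penalties side by side, here is a minimal sketch that scores polynomial fits of increasing degree with both criteria. It assumes Gaussian errors, so that -2 log L = n(log(2π·RSS/n) + 1), and it counts the error variance as a parameter. The helper `gaussian_aic_bic` and the simulated data are hypothetical, written for this illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

def gaussian_aic_bic(y, fitted, n_coef):
    """AIC and BIC of a least-squares fit under an assumed Gaussian error model."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    p = n_coef + 1                                   # assumption: variance counts as a parameter
    neg2_loglik = n * (np.log(2 * np.pi * rss / n) + 1)
    return neg2_loglik + 2 * p, neg2_loglik + p * np.log(n)

rng = np.random.default_rng(3)
n = 60
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + 0.4 * rng.normal(size=n)     # illustrative truth plus noise

scores = []
for d in range(0, 11):                               # list index equals polynomial degree
    fit = Polynomial.fit(x, y, deg=d)
    scores.append(gaussian_aic_bic(y, fit(x), n_coef=d + 1))
aic, bic = np.array(scores).T

aic_degree, bic_degree = int(np.argmin(aic)), int(np.argmin(bic))
print("AIC picks degree", aic_degree, "| BIC picks degree", bic_degree)
```

Because the two criteria share the likelihood term, their difference for each candidate is exactly p(log n - 2), which is why BIC's choice cannot have more parameters than AIC's once log n exceeds 2.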

When to use AIC vs BIC

  • AIC: predictive goals; you want the model that forecasts best on new data. Robust to "true model is just an approximation".
  • BIC: hypothesis-testing / scientific-discovery goals; you want to know "is this covariate truly part of the data-generating process?" Risk: BIC is harsh on important-but-small effects.

Both are IN-SAMPLE criteria — they don't actually evaluate prediction on held-out data.

Cross-validation: out-of-sample evaluation

The honest test of prediction. K-fold CV:

  • Randomly split data into K folds (typically K=5 or 10).
  • For each fold: fit the model on the OTHER K-1 folds; predict the held-out fold; compute prediction error.
  • Average prediction errors across folds.
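The three steps above can be sketched directly with NumPy. This is a minimal illustration for OLS with squared-error loss; the function name `kfold_mse` and the simulated data are assumptions made for the example.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """K-fold cross-validated mean squared error of an OLS fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # random split into k folds
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Fit on the other k-1 folds only
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Predict the held-out fold and record its error
        resid = y[held_out] - X[held_out] @ beta
        fold_errors.append(np.mean(resid ** 2))
    return float(np.mean(fold_errors))                   # average across folds

# Illustrative data: a linear truth with noise
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=n)

X_intercept = np.ones((n, 1))
X_linear = np.column_stack([np.ones(n), x])
cv_intercept = kfold_mse(X_intercept, y)
cv_linear = kfold_mse(X_linear, y)
print("CV MSE, intercept only:", cv_intercept)
print("CV MSE, intercept + slope:", cv_linear)
```

Using the same `seed` for every candidate keeps the fold assignment identical across models, so differences in CV error reflect the models rather than the split.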

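Leave-one-out (LOO) is K=n. Computationally expensive for large n in general, but mathematically elegant. For OLS, LOO has a closed form that needs only a single fit:

$$\mathrm{CV}_{\mathrm{loo}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2,$$

where $e_i$ is the i-th residual from the full-data fit and $h_{ii}$ is the i-th diagonal entry of the hat matrix $H = X (X^\top X)^{-1} X^\top$.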
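The closed form can be checked numerically against a brute-force loop that refits the model n times. A minimal sketch on simulated data (the design matrix, coefficients, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 covariates
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Closed form: one full-data fit, residuals rescaled by leverage
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))               # diagonal of the hat matrix
loo_closed = float(np.mean((e / (1.0 - h)) ** 2))

# Brute force: drop each observation in turn, refit, predict it
errors = []
for i in range(n):
    keep = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errors.append((y[i] - X[i] @ b) ** 2)
loo_brute = float(np.mean(errors))

print(loo_closed, loo_brute)
```

The two numbers agree up to floating-point error, and the closed form costs one fit instead of n.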

The hidden hazard: model selection inflates Type-I error

The biggest danger of model selection is using the SAME data to (a) select the model and (b) test inferences from the model. If you tried 100 specifications and reported the best, the "best" p-values are biased toward significance — the same p-hacking risk as §2.4.

Defences:

  • Pre-register the model specification.
  • Split the data: select on a TRAINING set, test on a HOLDOUT.
  • Use cross-validation for the FULL selection procedure (not just the selected model's prediction error).
  • Report ALL specifications considered.
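A small simulation makes the hazard and the holdout defence concrete. In this sketch the response is pure noise, unrelated to any of 30 candidate covariates. The covariate with the largest |t| is selected on a training half and then re-tested on a holdout half. The threshold |t| > 2 is used as a rough stand-in for a two-sided 5% test, and the helper `abs_t_stats` and all simulation settings are assumptions made for the example.

```python
import numpy as np

def abs_t_stats(X, y):
    """|t| for the slope in a simple regression of y on each column of X."""
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(r) * np.sqrt((n - 2) / (1.0 - r ** 2))

rng = np.random.default_rng(42)
n_half, n_cov, n_reps = 50, 30, 500
naive_hits, honest_hits = 0, 0

for _ in range(n_reps):
    X = rng.normal(size=(2 * n_half, n_cov))
    y = rng.normal(size=2 * n_half)              # y is independent of every covariate

    # Select the "best" covariate on the training half, test it on the same data
    t_train = abs_t_stats(X[:n_half], y[:n_half])
    best = int(np.argmax(t_train))
    naive_hits += int(t_train[best] > 2.0)

    # Re-test the selected covariate on the untouched holdout half
    t_hold = abs_t_stats(X[n_half:, [best]], y[n_half:])
    honest_hits += int(t_hold[0] > 2.0)

naive_rate = naive_hits / n_reps
honest_rate = honest_hits / n_reps
print("false-positive rate, same data:", naive_rate)
print("false-positive rate, holdout:  ", honest_rate)
```

Testing on the data used for selection flags a "significant" covariate in most replications even though none exists, while the holdout test stays near the nominal level.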

What about R² and adjusted R²?

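R² never decreases as covariates are added, so it cannot serve as a model-selection criterion. Adjusted R² penalises complexity, but only weakly: it rises whenever a single added covariate has |t| > 1, so it readily admits noise terms. AIC and BIC are sharper. Use R² for descriptive reporting; use AIC/BIC/CV for selection.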
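The monotonicity is easy to verify by appending covariates that are pure noise. A minimal sketch, assuming the design matrix includes an intercept column; the helper `r2_and_adjusted` and the simulated data are illustrative.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """R² and adjusted R² for an OLS fit; X must include the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k)   # degrees-of-freedom correction
    return float(r2), float(adj)

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_base = np.column_stack([np.ones(n), x])
X_noise = np.column_stack([X_base, rng.normal(size=(n, 5))])   # 5 irrelevant covariates

r2_base, adj_base = r2_and_adjusted(X_base, y)
r2_noise, adj_noise = r2_and_adjusted(X_noise, y)
print("R²:          ", r2_base, "->", r2_noise)
print("adjusted R²: ", adj_base, "->", adj_noise)
```

R² cannot go down when the noise columns are added; adjusted R² may move in either direction, depending on how much spurious fit the extra columns happen to buy.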

Interactive figure: model selection with AIC, BIC, and cross-validation (enable JavaScript to interact).

Try it

  • Defaults: N = 60, σ = 0.40, currently displayed degree = 4. The top plot shows the underlying sinusoidal truth (green), noisy data (blue dots), and the degree-4 polynomial fit (red). The bottom plot shows training MSE (blue), held-out test MSE (red), CV-5 (orange dashed), and rescaled AIC + BIC (purple + green dashed). Note each criterion has its OWN argmin at a possibly different degree.
  • Drag the degree slider from 0 to 15. At d = 0 the fit is a flat line (underfit). At d = 15 the fit wiggles wildly as it chases individual data points (overfit). The TEST MSE finds the sweet spot, typically around d = 3–6 for this truth.
  • Increase σ from 0.40 to 1.20. Noise dominates. The argmin degree shifts LOWER — with more noise, simpler models are preferred because fitting the noise hurts generalisation more.
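  • Reduce N from 60 to 20. The criteria curves become noisier, and with less data every criterion finds it harder to justify extra terms, so low degrees tend to win more often. Note that BIC's per-parameter penalty log(n) is actually smaller at N = 20 (about 3.0) than at N = 60 (about 4.1); what grows is the penalty relative to the likelihood term, which scales with n. AIC and CV are still relatively close.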
  • Click Resample. The data changes but the truth stays. With a fresh seed, the argmin degrees may shift by 1–2 across criteria — small-sample model selection is itself noisy.
  • Compare AIC (purple) vs BIC (green): with N = 60 and the true model close to a low-degree polynomial, BIC tends to pick a LOWER degree than AIC. This reflects BIC's true-model-recovery asymptotics versus AIC's prediction-optimal asymptotics.

A colleague reports an AIC-best model with 12 covariates from a candidate pool of 30. They did not pre-register and report only the final p-values from this single model. What two things do you ask before believing the p-values?

What you now know

Model selection trades fit quality against complexity. AIC for prediction; BIC for scientific discovery; CV for honest out-of-sample evaluation. The biggest hazard is using the same data twice. §4.8 closes Part 4 with a critical reminder: ALL OF THIS IS REGRESSION, NOT CAUSATION — even the best-selected model with the highest-quality covariates fits CORRELATIONS, not effects.

References

  • Akaike, H. (1974). "A new look at the statistical model identification." IEEE Trans. Automatic Control 19(6), 716–723. (The foundational AIC paper.)
  • Schwarz, G. (1978). "Estimating the dimension of a model." Annals of Statistics 6(2), 461–464. (The BIC paper.)
  • Stone, M. (1974). "Cross-validatory choice and assessment of statistical predictions." J. Roy. Stat. Soc. B 36(2), 111–147. (The foundational CV paper.)
  • Burnham, K.P., Anderson, D.R. (2002). Model Selection and Multimodel Inference, 2nd ed. Springer. (Comprehensive treatment of AIC and multi-model inference.)
  • Hastie, T., Tibshirani, R., Friedman, J. (2009). Elements of Statistical Learning, 2nd ed. Chapter 7 covers model selection and CV.
