The ML mindset: prediction vs explanation

Part 9 — Machine learning for researchers

Learning objectives

Articulate the PREDICTION vs EXPLANATION trade-off as the core ML choice
Recognise Breiman's (2001) "two cultures" framing of statistics vs ML
Identify when prediction is the goal (clinical decision support, recommender systems) vs explanation (causal inference, scientific reporting)
Recognise SHRINKAGE, REGULARISATION, and FLEXIBILITY as ML's defining moves
Identify modern hybrid approaches (double ML, causal forests) that pursue both prediction and causal inference

Classical statistics emphasises EXPLANATION: produce parameter estimates with interpretable meaning, test specific hypotheses, build confidence in causal mechanisms. Machine learning emphasises PREDICTION: produce a model that performs well on new data, regardless of whether its internal structure is interpretable. These are different ENDS, and they justify different MEANS. §9.1 develops this distinction and shows when each is appropriate.

The two cultures (Breiman 2001)

Leo Breiman's "two cultures" paper named the cultural divide explicitly:

Data Modeling Culture (classical stats): assume a probability model (e.g., linear regression, GLM, Cox survival). Fit. Interpret coefficients. Build inferential machinery (tests, CIs, p-values) on the model. Goal: scientific understanding of the data-generating process.
Algorithmic Modeling Culture (machine learning): treat the data-generating process as a black box. Fit any algorithm (random forest, neural network, gradient boosting) that minimises out-of-sample prediction error. Don't worry about parameters; worry about predictions. Goal: useful predictions on new data.

Breiman argued (controversially) that algorithmic modeling often beats data modeling on predictive tasks, and that stats was missing out by neglecting it. Two decades later, the synthesis is mainstream: use ML for prediction tasks, classical stats for inference, and modern hybrid approaches (DOUBLE ML, CAUSAL FORESTS) when you need both.
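The two deliverables can be put side by side in a short sketch. Everything here is an illustrative assumption rather than part of Breiman's paper: the toy data-generating process, the choice of kNN as the "algorithm", and the helper names fit_ols, knn_predict and mse. The data-modeling route reports a slope with a confidence interval; the algorithmic route reports held-out prediction error.

```python
import math
import random

random.seed(0)

# Assumed toy data: y = 1 + 0.5*x + noise (linear truth, illustrative only).
n = 200
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [1.0 + 0.5 * x + random.gauss(0, 1) for x in xs]
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

def fit_ols(x, y):
    """Return intercept, slope and the slope's standard error."""
    m = len(x)
    xbar, ybar = sum(x) / m, sum(y) / m
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(rss / (m - 2) / sxx)
    return intercept, slope, se_slope

def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

# Data-modeling deliverable: a coefficient with an interval.
b0, b1, se = fit_ols(train_x, train_y)
ci = (b1 - 1.96 * se, b1 + 1.96 * se)  # normal-approximation 95% CI

# Algorithmic-modeling deliverable: held-out prediction error.
ols_mse = mse([b0 + b1 * x for x in test_x], test_y)
knn_mse = mse([knn_predict(train_x, train_y, x, 7) for x in test_x], test_y)

print(f"OLS slope {b1:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"held-out MSE: OLS {ols_mse:.3f}, kNN {knn_mse:.3f}")
```

kNN produces no counterpart to the slope and its interval; its only output is predictions, which is the point of the contrast.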

The ML toolkit's defining moves

FLEXIBILITY: use highly expressive models (deep neural networks, gradient-boosted trees, kernel methods) that can capture nearly any functional form.
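REGULARISATION (§9.2): constrain the model's complexity to prevent overfitting. L1 and L2 penalties do this by SHRINKAGE, pulling coefficients toward zero; dropout plays a similar role in neural networks.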
CROSS-VALIDATION (§8.3): use held-out data to select model complexity and hyperparameters.
ENSEMBLES (§9.3): combine many models (bagging, boosting, stacking) to reduce variance and improve predictive performance.
FEATURE ENGINEERING / REPRESENTATION LEARNING: create informative features automatically (deep learning) or via domain knowledge (classical ML).
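As a minimal sketch of how flexibility and cross-validation work together, the block below picks the number of neighbours k for a kNN regressor by 5-fold cross-validation. The sine-shaped toy data, the candidate grid, and the helper names knn_predict and cv_mse are assumptions made for illustration.

```python
import math
import random

random.seed(1)

# Assumed toy data with a nonlinear truth (illustrative only).
n = 120
xs = [random.uniform(-3, 3) for _ in range(n)]
ys = [math.sin(2 * x) + random.gauss(0, 0.3) for x in xs]

def knn_predict(x_train, y_train, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / len(nearest)

def cv_mse(x, y, k, n_folds=5):
    """Squared error of kNN, averaged over all points when each is held out once."""
    idx = list(range(len(x)))
    total = 0.0
    for fold in range(n_folds):
        held_out = idx[fold::n_folds]  # every n_folds-th point forms one fold
        held_set = set(held_out)
        tr_x = [x[i] for i in idx if i not in held_set]
        tr_y = [y[i] for i in idx if i not in held_set]
        total += sum((knn_predict(tr_x, tr_y, x[i], k) - y[i]) ** 2 for i in held_out)
    return total / len(x)

candidates = [1, 3, 5, 9, 15, 30, 60]
scores = {k: cv_mse(xs, ys, k) for k in candidates}
best_k = min(scores, key=scores.get)

for k in candidates:
    print(f"k={k:>2}  CV MSE={scores[k]:.3f}")
print("selected k:", best_k)
```

Very small k fits noise and very large k averages the signal away, so the cross-validated error is typically lowest somewhere in between.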

When prediction is the goal

Clinical decision support: predict patient risk of stroke from EHR data; the decision is "treat or don't" and accuracy matters more than interpretability.
Recommender systems: predict which movie a user will like; no one cares why.
Risk scoring: credit scoring, fraud detection.
Operations: demand forecasting, supply-chain optimisation.

In these contexts, "WHY was this particular prediction made?" matters less than "WILL the prediction be right next time?" ML's toolkit of flexible models tuned by cross-validation is well suited to that question.

When explanation is the goal

Scientific reporting: "what is the effect of treatment X on outcome Y?" requires a single, interpretable estimate with a CI.
Policy evaluation: regulatory decisions require traceable, defensible reasoning.
Causal inference: §6 of this book; the interpretation IS the deliverable.
Trustworthy decision-making in high-stakes settings: medical, legal, fiscal.

Classical statistics (linear models, GLMs, causal inference toolkit, Bayesian methods) provides the interpretive infrastructure these applications need. Pure ML methods (black-box predictors) are not enough.

Modern hybrids

Double ML / Debiased ML (Chernozhukov et al. 2018): use ML to estimate the nuisance functions (e.g., the propensity score and the conditional mean of Y given X), then plug them into a Neyman-orthogonal moment condition, with cross-fitting, that yields valid CIs for the parameter of interest. Combines ML flexibility with classical inference.
Causal forests (Athey-Wager 2019, §9.6): a random-forest variant that estimates conditional average treatment effects, i.e. how the treatment effect varies with covariates, using honest sample splitting so that the estimates support valid confidence intervals.
Targeted Maximum Likelihood Estimation (TMLE) (van der Laan): semi-parametric efficient estimation using ML for nuisances.
Bayesian nonparametrics: Gaussian processes, BART (Bayesian additive regression trees) for flexible models that still report posterior uncertainty.
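The double ML idea can be sketched for the partially linear model Y = θD + f(X) + noise. This is a simplified illustration under stated assumptions, not the authors' implementation: the simulated data, the use of kNN as a stand-in for the ML learner, the two-fold split, and the names knn_means, theta_hat and naive are all hypothetical. Nuisances E[D|X] and E[Y|X] are learned on one fold, residuals are formed on the other, and θ comes from a residual-on-residual regression.

```python
import random

random.seed(2)

# Assumed data-generating process (illustrative): X confounds D and Y.
THETA = 1.0  # true effect of D on Y
n = 1000
X = [random.uniform(-2, 2) for _ in range(n)]
D = [x ** 2 + random.gauss(0, 1) for x in X]
Y = [THETA * d + 2 * x ** 2 + random.gauss(0, 1) for d, x in zip(D, X)]

def knn_means(train_idx, x0, k=25):
    """kNN estimates of E[D | X=x0] and E[Y | X=x0] using only the training fold."""
    nearest = sorted(train_idx, key=lambda i: abs(X[i] - x0))[:k]
    return sum(D[i] for i in nearest) / k, sum(Y[i] for i in nearest) / k

# Cross-fitting: nuisances for each fold are learned on the other fold.
fold_a, fold_b = list(range(0, n, 2)), list(range(1, n, 2))
d_res, y_res = [], []
for held_out, train in ((fold_a, fold_b), (fold_b, fold_a)):
    for i in held_out:
        m_hat, l_hat = knn_means(train, X[i])
        d_res.append(D[i] - m_hat)
        y_res.append(Y[i] - l_hat)

# Residual-on-residual regression (partialling out X).
j = sum(dr ** 2 for dr in d_res) / n
theta_hat = sum(dr * yr for dr, yr in zip(d_res, y_res)) / (j * n)

# Plug-in standard error based on the orthogonal score.
psi_sq = sum((dr * (yr - theta_hat * dr)) ** 2 for dr, yr in zip(d_res, y_res)) / n
se = (psi_sq / j ** 2 / n) ** 0.5

# Naive comparison: regress Y on D alone, ignoring the confounder X.
dbar, ybar = sum(D) / n, sum(Y) / n
naive = (sum((d - dbar) * (y - ybar) for d, y in zip(D, Y))
         / sum((d - dbar) ** 2 for d in D))

print(f"true theta      {THETA:.2f}")
print(f"naive OLS slope {naive:.2f}")
print(f"double ML       {theta_hat:.2f} (se {se:.3f})")
```

In this simulation the naive slope absorbs the confounding through X, while the cross-fitted estimate lands near the true value and comes with a standard error.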

How ML can mislead

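Over-confidence from spurious patterns: a sufficiently flexible model will find apparent structure even in pure noise. CV catches some of this but not all of it, for example when features are selected or hyperparameters tuned on the same data that is later used for evaluation.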
Dataset shift / distribution drift: ML models trained on past data fail when the distribution of new data differs (e.g., COVID-19 disrupting hospital-data-trained risk models).
Biased training data → biased predictions: if training data reflects discriminatory historical decisions, the model will perpetuate them (see §9.5).
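Confounding: predictive accuracy says NOTHING about causation. A feature can be highly predictive only because it shares a common cause with the outcome; a model that relies on it will fail under intervention, because changing the feature does not change the outcome.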
Out-of-distribution failures: ML models often perform extraordinarily well on samples from the training distribution and arbitrarily badly outside it.
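The first failure mode in this list is easy to reproduce. In the sketch below (all settings are illustrative assumptions: 200 pure-noise features, 50 training rows, and the helper names corr and draw) the feature that looks best in-sample is selected and then checked on fresh data.

```python
import random

random.seed(3)

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

n_train, n_fresh, n_features = 50, 2000, 200

def draw(n):
    """Pure noise: by construction no feature carries information about y."""
    feats = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]
    y = [random.gauss(0, 1) for _ in range(n)]
    return feats, y

train_feats, train_y = draw(n_train)
fresh_feats, fresh_y = draw(n_fresh)

# Pick the feature with the strongest in-sample correlation with y.
in_sample = [abs(corr(f, train_y)) for f in train_feats]
best = max(range(n_features), key=lambda j: in_sample[j])
r_train = in_sample[best]

# Re-check the same feature on fresh data.
r_fresh = abs(corr(fresh_feats[best], fresh_y))

print(f"best of {n_features} noise features, in-sample |r| = {r_train:.2f}")
print(f"same feature on fresh data,        |r| = {r_fresh:.2f}")
```

The in-sample correlation looks meaningful only because it is the maximum over many tries; on fresh data it collapses toward zero.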

Try it

Defaults: nonlinearity = 1.0, kNN k = 7. The OLS line (red) captures the overall trend but misses the sine wave and quadratic shape; kNN (green) tracks the data more closely. kNN MSE is lower — better predictively. But OLS gives an interpretable slope β₁ ≈ 0.5; kNN gives no single coefficient.
Drag nonlinearity to 0.0 (purely linear truth). OLS now fits well, and kNN achieves a similar MSE. Here the classical-stats advantage is clear: comparable accuracy plus an interpretable model with full inferential machinery.
Drag nonlinearity to 2.5 (highly nonlinear truth). OLS line is bad — the linear model can't capture the curves. kNN tracks them. The ML advantage is clear.
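With high nonlinearity, change k to 1 (very flexible). kNN wiggles through every point and overfits. Change k to 30 (heavily smoothed). kNN now averages over so many neighbours that it smooths away most of the wiggles and behaves much more like the rigid OLS fit. The k slider IS a complexity slider, playing the same role as polynomial degree in §8.3 but in reverse: smaller k means a more complex fit.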
Re-sample several times under high nonlinearity. kNN MSE varies (kNN is high-variance); OLS MSE is stable but high (high bias). The CLASSICAL bias-variance trade-off, made concrete.

A researcher uses gradient boosting to predict patient risk of stroke from 200 EHR features, achieving AUC = 0.85. They then claim "feature X has a strong causal effect on stroke risk" based on the model's permutation importance. What's wrong with this reasoning?
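A small simulation can help probe this scenario. It is a deliberately simplified sketch: a one-feature linear model stands in for gradient boosting, and the data-generating process (a hidden Z drives both X and Y, while X has no effect on Y) and the names observe, fit and mse are assumptions chosen for illustration.

```python
import random

random.seed(4)

def observe(n, intervene=False):
    """Z drives both X and Y; X has NO causal effect on Y (assumed toy setup)."""
    rows = []
    for _ in range(n):
        z = random.gauss(0, 1)
        # Under intervention, X is set at random, independently of Z.
        x = random.gauss(0, 1) if intervene else z + random.gauss(0, 0.3)
        y = 2 * z + random.gauss(0, 0.5)
        rows.append((x, y))
    return rows

def fit(rows):
    """Least-squares intercept and slope of y on x."""
    n = len(rows)
    xbar = sum(x for x, _ in rows) / n
    ybar = sum(y for _, y in rows) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in rows)
             / sum((x - xbar) ** 2 for x, _ in rows))
    return ybar - slope * xbar, slope

def mse(rows, b0, b1):
    return sum((y - b0 - b1 * x) ** 2 for x, y in rows) / len(rows)

train, test = observe(2000), observe(2000)
b0, b1 = fit(train)
base_mse = mse(test, b0, b1)

# Permutation importance: shuffle X in the test set and measure the loss increase.
shuffled_x = [x for x, _ in test]
random.shuffle(shuffled_x)
permuted_mse = mse(list(zip(shuffled_x, [y for _, y in test])), b0, b1)
importance = permuted_mse - base_mse

# Intervention: set X independently of Z and see whether Y responds.
intervened = observe(2000, intervene=True)
slope_under_intervention = fit(intervened)[1]
mse_under_intervention = mse(intervened, b0, b1)

print(f"permutation importance of X:     {importance:.2f}")
print(f"slope of Y on X when X is set:   {slope_under_intervention:.2f}")
print(f"model MSE, observed vs. set X:   {base_mse:.2f} vs {mse_under_intervention:.2f}")
```

X earns a large permutation importance because it is a proxy for Z, yet setting X directly leaves Y unchanged and the model's accuracy collapses. Importance measures what the model relies on, not what causes the outcome.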

What you now know

Statistical inference and machine learning are two cultures with different goals. Use ML for prediction; use classical stats for inference. Modern hybrid approaches (double ML, causal forests, TMLE) bridge both. The chief ML failure modes: spurious patterns, dataset shift, biased training data, confounding, and out-of-distribution (OOD) failures. Subsequent sections develop ML's tools: regularisation (§9.2), trees and forests (§9.3), calibration (§9.4), fairness (§9.5), causal ML (§9.6), reporting (§9.7), and when NOT to use ML (§9.8).

References

Breiman, L. (2001). "Statistical modeling: The two cultures." Statistical Science 16(3), 199–231. (The seminal essay.)
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. (The standard graduate ML+stats text.)
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer. (Undergraduate-level companion.)
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J. (2018). "Double/debiased machine learning for treatment and structural parameters." The Econometrics Journal 21(1), C1–C68.
Athey, S., Imbens, G.W. (2019). "Machine learning methods that economists should know about." Annual Review of Economics 11, 685–725.