When NOT to use ML

Part 9, Machine learning for researchers

Learning objectives

Recognise hypothesis testing and inference as classical-stats territory, not ML
Identify small-N or N < p settings (fewer samples than features) where ML overfits and parametric methods win
Recognise distribution-shift and OOD-failure risk as ML's deployment Achilles heel
Recognise high-stakes settings requiring interpretability + fairness audit + calibration
Recognise regulated settings where interpretable models or SHAP explanations are required

Machine learning has become so dominant that the question "what tool should I use?" often becomes "which ML method?". Part 9 has shown ML's power. §9.8 closes by emphasising that ML is one tool among many; the right tool depends on the problem. Sometimes classical stats wins. Sometimes a structured approach beats a black box. Knowing WHEN NOT to use ML is as important as knowing how to use it.

Cases where ML is NOT the right tool

1. Hypothesis testing about specific parameters

If your goal is "is treatment X effective compared to placebo?", classical stats (a randomised controlled trial analysed with a t-test or a Bayesian model) is the right tool. ML can fit a flexible model, but it is designed for prediction; a single defensible point estimate with a confidence interval for the treatment parameter is what classical stats was designed to deliver. Use ML ONLY for prediction tasks downstream of the inferential question.
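As a minimal sketch of what "a point estimate with CI" looks like in practice, the snippet below simulates a two-arm trial and reports the difference in means with a 95% interval. The data, arm sizes and effect size are invented for illustration, and the interval uses a large-sample normal approximation; with small arms a t-based interval would be the more careful choice.

```python
import math
import random
from statistics import NormalDist, mean, variance

def diff_in_means_ci(treated, control, level=0.95):
    """Point estimate and large-sample CI for mean(treated) - mean(control)."""
    estimate = mean(treated) - mean(control)
    # Welch-style standard error: the two arms may have different variances.
    se = math.sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return estimate, (estimate - z * se, estimate + z * se)

# Simulated trial (assumption: true effect of +1.0 on an outcome with SD 2.0).
rng = random.Random(0)
control = [rng.gauss(10.0, 2.0) for _ in range(200)]
treated = [rng.gauss(11.0, 2.0) for _ in range(200)]

estimate, (lo, hi) = diff_in_means_ci(treated, control)
print(f"effect = {estimate:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The output is one number with an interval for the quantity the question is about, which a flexible predictive model does not give you directly.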

2. Small sample sizes (N < 100 or N < p)

With small N, flexible ML overfits dramatically. Parametric models (Bayesian linear regression with informative priors, simple logistic, lasso with strong regularisation) often beat random forests or gradient boosting in this regime. Cross-validation has high variance with small N; nested CV is barely feasible. Use ML when you have N >> p and ideally N > 1000.

3. Out-of-distribution deployment

ML models assume training and deployment distributions match. They generalise within distribution but often fail catastrophically out-of-distribution. If deployment will involve significantly different conditions (new geography, new time period, demographic shifts), ML may fail unexpectedly. Solutions: domain adaptation, ensemble methods, careful monitoring, or staying with simpler models that are more robust.
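One simple form of the "careful monitoring" mentioned above is to compare each incoming feature against its training distribution. The sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand on simulated data; the feature values, the size of the shift and the alert threshold are all assumptions chosen for illustration, not calibrated values.

```python
import bisect
import random

def ks_statistic(reference, live):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted, live_sorted = sorted(reference), sorted(live)
    largest_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point.
    for value in ref_sorted + live_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, value) / len(ref_sorted)
        cdf_live = bisect.bisect_right(live_sorted, value) / len(live_sorted)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_live))
    return largest_gap

rng = random.Random(7)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(500)]
same_regime = [rng.gauss(0.0, 1.0) for _ in range(500)]
new_regime = [rng.gauss(2.0, 1.0) for _ in range(500)]  # simulated shift in the mean

ALERT_THRESHOLD = 0.2  # arbitrary illustrative cut-off, not a critical value

for name, batch in [("same regime", same_regime), ("new regime", new_regime)]:
    d = ks_statistic(train_feature, batch)
    print(f"{name}: KS = {d:.3f}, alert = {d > ALERT_THRESHOLD}")
```

A check like this only detects shift in the inputs it watches; it does not tell you whether the model's predictions are still valid, which needs labelled outcomes from the new regime.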

4. High-stakes decisions requiring trust

For medical, legal and financial decisions, a black-box ML model that makes an unexplainable wrong prediction is a liability. For such applications, use INTERPRETABLE models (logistic regression, decision trees, linear models) by default; deploy black-box ML only with SHAP-based per-instance explanations + calibration + fairness audit + Model Card + human-in-the-loop oversight + monitoring.
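Of the safeguards listed above, calibration is the easiest to check numerically. The sketch below computes an expected calibration error (ECE) with equal-width bins; the binning scheme, the function name and the toy predictions are assumptions made for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - observed_rate)
    return ece

# Toy case 1: the model says 0.8 and 8 of 10 cases are positive (well calibrated).
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)

# Toy case 2: the model says 0.9 but only 5 of 10 cases are positive (overconfident).
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)

print(f"calibrated ECE = {calibrated:.2f}, overconfident ECE = {overconfident:.2f}")
```

A model can rank cases well and still be badly calibrated, which matters whenever a stated probability feeds a human decision.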

5. Regulated settings

Many domains require explainability or interpretability by law or regulation: credit scoring (Equal Credit Opportunity Act), healthcare (FDA / EMA rules), HR/hiring (Title VII, EU AI Act), criminal justice. In these settings, treat interpretable models or per-decision explanations (for example SHAP-based) as MANDATORY; an unexplained deep black-box model is unacceptable. Modern best practice: build with interpretability in mind; deploy with full explainability infrastructure.
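To make "per-decision explanation" concrete, here is a sketch of how an interpretable linear model can report which features pushed an individual score down. The coefficients, feature names and baseline applicant are entirely hypothetical, and whether output of this kind satisfies any particular regulation is a legal question the sketch does not settle.

```python
import math

# Hypothetical coefficients of an already-fitted logistic credit model.
INTERCEPT = -1.0
COEFFICIENTS = {"debt_to_income": -3.0, "years_employed": 0.2, "missed_payments": -0.8}

# Hypothetical reference applicant used as the baseline for comparisons.
BASELINE = {"debt_to_income": 0.30, "years_employed": 5.0, "missed_payments": 0.0}

def approval_probability(applicant):
    logit = INTERCEPT + sum(COEFFICIENTS[k] * applicant[k] for k in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))

def reason_codes(applicant, top_k=2):
    """Features that lowered the log-odds most, relative to the baseline applicant."""
    contributions = {
        k: COEFFICIENTS[k] * (applicant[k] - BASELINE[k]) for k in COEFFICIENTS
    }
    negative = sorted((value, name) for name, value in contributions.items() if value < 0)
    return [name for _, name in negative[:top_k]]

applicant = {"debt_to_income": 0.55, "years_employed": 1.0, "missed_payments": 2.0}
print(f"P(approve) = {approval_probability(applicant):.3f}")
print("main reasons:", reason_codes(applicant))
```

Because the model is linear in the log-odds, each contribution is exact rather than an approximation, which is the practical advantage of an interpretable model over a post-hoc explanation of a black box.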

6. When the right answer is "more / better data"

ML can't fix data quality problems. If your dataset has selection bias, mismeasurement, or missing key variables, ML may perpetuate or amplify these flaws. The right answer is to collect better data, label it correctly, and address selection bias at the source. ML applied to bad data reliably produces bad predictions.
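A small simulation shows why more of the same biased data does not help. The population and the selection rule (only positive outcomes get recorded) are invented for illustration; real selection mechanisms are usually subtler and harder to see.

```python
import random
from statistics import mean

rng = random.Random(42)

# Invented population: outcome is standard normal, so the true mean is 0.
population = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Assumed selection mechanism: only units with a positive outcome are recorded.
selected = [y for y in population if y > 0.0]

print(f"true population mean = {mean(population):.3f}")
for n in (100, 1000, len(selected)):
    # A larger biased sample gives a more precise estimate of the wrong quantity.
    print(f"biased sample, n = {n}: mean = {mean(selected[:n]):.3f}")
```

The estimate from the selected sample stabilises as n grows, but it stabilises away from the population value; no model fitted only to `selected` can see the units that were never recorded.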

7. When simpler is sufficient

If a simple OLS gets you 90% of the accuracy of XGBoost, often OLS is the right answer: faster, interpretable, easier to debug, easier to maintain in production. ML's marginal accuracy gain is sometimes not worth the operational complexity.
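The "90% of the accuracy" rule of thumb can be written down as an explicit selection rule. In this sketch the candidate names, complexity ranks and cross-validated scores are hypothetical placeholders, and the rule assumes scores are positive with higher meaning better.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    complexity: int   # 1 = simplest to run, explain and maintain
    cv_score: float   # higher is better, e.g. cross-validated R^2

def pick_simplest(candidates, min_fraction_of_best=0.9):
    """Return the least complex candidate scoring within a fraction of the best."""
    best = max(c.cv_score for c in candidates)
    good_enough = [c for c in candidates if c.cv_score >= min_fraction_of_best * best]
    return min(good_enough, key=lambda c: c.complexity)

# Hypothetical results from one model comparison.
candidates = [
    Candidate("OLS", complexity=1, cv_score=0.74),
    Candidate("random forest", complexity=2, cv_score=0.78),
    Candidate("XGBoost", complexity=3, cv_score=0.80),
]

print(pick_simplest(candidates).name)                             # 90% tolerance
print(pick_simplest(candidates, min_fraction_of_best=0.95).name)  # stricter tolerance
```

The tolerance is a judgment call that depends on what an accuracy point is worth in the application; writing it down makes that judgment visible and reviewable.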

Modern best practice: tool selection

State the GOAL: prediction, explanation, hypothesis test, decision support.
Match the goal to the right tool family: classical stats for inference/hypothesis tests; ML for prediction; causal inference for effects.
Consider deployment constraints: interpretability, fairness, reliability, distribution-shift exposure.
Pick the SIMPLEST tool that meets the requirements. Add complexity only when justified.
Document trade-offs in the Model Card.
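The steps above, combined with the cases listed earlier in this section, can be sketched as a small checklist function. The field names and flag wording are hypothetical; the sample-size thresholds are the rules of thumb quoted earlier, not hard limits.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    goal: str              # "prediction", "explanation", "hypothesis test", "decision support"
    n_samples: int
    n_features: int
    shift_expected: bool   # will deployment conditions differ from training?
    high_stakes: bool
    regulated: bool
    data_quality_ok: bool

def caveats(problem):
    """Return one flag per consideration that argues against defaulting to ML."""
    flags = []
    if problem.goal == "hypothesis test":
        flags.append("inferential goal: use classical statistics")
    if problem.n_samples < 100 or problem.n_samples < problem.n_features:
        flags.append("small N or N < p: prefer regularised parametric models")
    if problem.shift_expected:
        flags.append("distribution shift expected: plan monitoring or use a simpler model")
    if problem.high_stakes:
        flags.append("high stakes: interpretable by default, audit calibration and fairness")
    if problem.regulated:
        flags.append("regulated setting: explanations are required")
    if not problem.data_quality_ok:
        flags.append("data quality is the bottleneck: fix the data first")
    return flags

trial = Problem("hypothesis test", n_samples=60, n_features=5, shift_expected=False,
                high_stakes=True, regulated=True, data_quality_ok=True)
for flag in caveats(trial):
    print("-", flag)
```

An empty list does not mean ML is the right choice, only that none of these particular warning signs applies; the simplest tool that meets the requirements is still the default.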

Try it

Walk through the 5 questions with a real problem you've worked on. Each "Yes" to caveats flags a consideration.
Compare to a recent ML deployment in your field: would the decision tree have predicted the failures or trade-offs you saw?
Note that NONE of the questions ask "is ML cool enough for this problem?". The decision is always driven by goals and constraints, not popularity.

A regulatory body proposes that high-stakes ML deployment in healthcare must include interpretable predictions, fairness audits, and calibration. Argue why this is essential, citing failure modes you've learned in §§9.1-9.7.

What you now know

ML is not the right tool for: hypothesis testing, small N, distribution-shifted deployment, high-stakes unexplained decisions, regulated settings, or when data quality is the bottleneck. The right tool matches the goal. Modern responsible practice: state goals, evaluate options, pick simplest tool meeting requirements. PART 9 COMPLETE. The book continues with capstones (§10) and master workflow reference (§12).

References

Rudin, C. (2019). "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1(5), 206-215.
Heaven, D. (2019). "Why deep-learning AIs are so easy to fool." Nature 574, 163-166.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A. (2021). "A survey on bias and fairness in machine learning." ACM Computing Surveys 54(6), Article 115.
D'Amour, A., et al. (2022). "Underspecification presents challenges for credibility in modern machine learning." JMLR 23(226), 1-61.
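Ribeiro, M.T., Singh, S., Guestrin, C. (2016). "'Why should I trust you?': Explaining the predictions of any classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1135-1144. (LIME.)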