Calibration and probability outputs

Part 9 — Machine learning for researchers

Learning objectives

Distinguish CALIBRATION (predicted probabilities match empirical frequencies) from DISCRIMINATION (AUC, accuracy)
Compute the BRIER SCORE and EXPECTED CALIBRATION ERROR (ECE)
Recognise OVERCONFIDENT vs UNDERCONFIDENT predictions on a reliability diagram
Apply PLATT SCALING and ISOTONIC REGRESSION as post-hoc recalibration techniques
Recognise that high AUC does not imply good calibration

Many ML applications need PROBABILITIES, not just rankings. In a medical decision, a statement such as "patient has 70% probability of stroke" only makes sense if roughly 70% of similar patients actually go on to have a stroke. CALIBRATION measures whether predicted probabilities are trustworthy as probabilities, which is a fundamentally different question from whether predictions are accurate.

Calibration vs discrimination

DISCRIMINATION (measured by AUC, accuracy): how well does the model separate positives from negatives, i.e. RANK positives above negatives? It does not require the scores to be on a meaningful probability scale.
CALIBRATION: do predicted probabilities match the empirical frequencies? A predicted 0.7 should correspond to a 70% positive rate in the bin of all observations with predicted ≈ 0.7.

A model can have HIGH AUC but POOR CALIBRATION: it ranks correctly, but only 60% of the cases it scores at 0.9 turn out to be positive. For decisions based on probability thresholds (treat or don't treat), this is a disaster.
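To make the gap between ranking and calibration concrete, here is a minimal sketch in pure Python on simulated data. The helper names (auc, high_bin_gap, sharpen) and the distortion exponent are illustrative assumptions, not taken from any library. It applies a monotone distortion to calibrated probabilities: the ranking, and therefore AUC, is essentially unchanged, while the predictions above 0.9 stop matching the observed positive rate.

```python
import random

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def high_bin_gap(probs, labels, cutoff=0.9):
    """Mean prediction minus observed positive rate among cases scored >= cutoff."""
    chosen = [(p, y) for p, y in zip(probs, labels) if p >= cutoff]
    mean_pred = sum(p for p, _ in chosen) / len(chosen)
    pos_rate = sum(y for _, y in chosen) / len(chosen)
    return mean_pred - pos_rate

def sharpen(p, power=4.0):
    """Monotone distortion that pushes probabilities toward 0/1 (assumed for illustration)."""
    a, b = p ** power, (1.0 - p) ** power
    return a / (a + b)

rng = random.Random(0)
# Simulated calibrated model: draw p, then draw the label as Bernoulli(p).
p_cal = [rng.random() for _ in range(2000)]
y = [1 if rng.random() < p else 0 for p in p_cal]
# Overconfident model: same ranking, distorted probability scale.
p_over = [sharpen(p) for p in p_cal]

auc_cal, auc_over = auc(p_cal, y), auc(p_over, y)
gap_cal, gap_over = high_bin_gap(p_cal, y), high_bin_gap(p_over, y)
print(f"AUC: calibrated={auc_cal:.3f} overconfident={auc_over:.3f}")
print(f"Gap above 0.9: calibrated={gap_cal:.3f} overconfident={gap_over:.3f}")
```

A positive gap means the model promises more than it delivers in its most confident predictions, which AUC cannot detect.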

The reliability diagram

The canonical calibration plot: bin predictions by predicted probability (e.g., 10 equal-width bins, or deciles), compute the empirical frequency of positives within each bin, and plot each bin's mean predicted probability (x-axis) against its empirical frequency (y-axis). PERFECT calibration: points lie on the diagonal y = x. OVERCONFIDENT models lie BELOW the diagonal at high p (predicted high; actual lower) and above it at low p; UNDERCONFIDENT models lie ABOVE the diagonal at high p and below it at low p.
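The binning step behind a reliability diagram can be sketched as follows. This is a minimal illustration with equal-width bins; the function name reliability_table and the toy data are assumptions made for the example, and each returned row is one point on the plot.

```python
def reliability_table(probs, labels, n_bins=10):
    """Return (mean predicted prob, observed positive rate, count) per non-empty bin."""
    sums_p = [0.0] * n_bins
    sums_y = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sums_p[k] += p
        sums_y[k] += y
        counts[k] += 1
    return [(sums_p[k] / counts[k], sums_y[k] / counts[k], counts[k])
            for k in range(n_bins) if counts[k] > 0]

# Toy data (assumed): six predictions, two bins.
probs = [0.05, 0.15, 0.15, 0.85, 0.95, 0.95]
labels = [0, 0, 1, 1, 1, 1]
rows = reliability_table(probs, labels, n_bins=2)
for mean_pred, pos_rate, n in rows:
    side = "above" if pos_rate > mean_pred else "on or below"
    print(f"pred={mean_pred:.2f} observed={pos_rate:.2f} n={n} ({side} the diagonal)")
```

Plotting the first column against the second, with the diagonal for reference, gives the reliability diagram; the counts show which points rest on too few observations to trust.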

Brier score

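The Brier score is the mean squared difference between the predicted probability $p_i$ and the binary outcome $y_i \in \{0, 1\}$; it combines calibration and discrimination: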

\text{Brier} = \frac{1}{N} \sum_{i=1}^N (p_i - y_i)^2.

Lower is better. Murphy (1973) decomposes it into reliability − resolution + uncertainty; grouping the last two terms as refinement gives Brier = Calibration error + Refinement loss. A no-skill model that always predicts the base rate achieves Brier = base_rate × (1 − base_rate); a perfect classifier achieves 0. The Brier score is a proper scoring rule: it is minimised in expectation by the true probabilities. Most R and Python ML packages compute it (e.g., scikit-learn's brier_score_loss).
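A small worked example may help tie the formula to the decomposition. The sketch below is an illustration on assumed toy data with hypothetical function names; it groups observations by identical predicted value, in which case calibration + refinement reproduces the Brier score exactly (with wider bins the identity holds only approximately).

```python
from collections import defaultdict

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def brier_decomposition(probs, labels):
    """Split the Brier score into (calibration, refinement), grouping identical predictions."""
    groups = defaultdict(list)
    for p, y in zip(probs, labels):
        groups[p].append(y)
    n = len(labels)
    calibration = refinement = 0.0
    for p, ys in groups.items():
        y_bar = sum(ys) / len(ys)          # observed positive rate in the group
        weight = len(ys) / n
        calibration += weight * (p - y_bar) ** 2
        refinement += weight * y_bar * (1 - y_bar)
    return calibration, refinement

# Toy data (assumed): the 0.2 group is calibrated, the 0.8 group is overconfident.
probs = [0.2] * 5 + [0.8] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

score = brier_score(probs, labels)
calibration, refinement = brier_decomposition(probs, labels)
base_rate = sum(labels) / len(labels)
no_skill = base_rate * (1 - base_rate)     # Brier score of always predicting the base rate
print(f"Brier={score:.3f} calibration={calibration:.3f} "
      f"refinement={refinement:.3f} no-skill={no_skill:.3f}")
```

Only the calibration term can be removed by post-hoc recalibration; the refinement term reflects how well the model separates cases and needs a better model to improve.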

Expected Calibration Error (ECE)

ECE isolates calibration from refinement:

\text{ECE} = \sum_{k=1}^K \frac{n_k}{N} |\bar{p}_k - \bar{y}_k|,

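where bin k has $n_k$ observations with mean predicted probability $\bar{p}_k$ and mean outcome $\bar{y}_k$. ECE = 0 means perfect calibration at the resolution of the bins; as a rough rule of thumb, ECE > 0.05 warrants recalibration. The value depends on the number and placement of bins, so report the binning alongside it. ECE is the metric most commonly reported in modern ML calibration papers.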
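The ECE formula translates almost line for line into code. The following is a minimal sketch with equal-width bins; the function name and the toy data are assumptions for illustration.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average of |mean prediction - observed rate| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[k].append((p, y))
    n = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        p_bar = sum(p for p, _ in members) / len(members)
        y_bar = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(p_bar - y_bar)
    return ece

# Toy data (assumed): six predictions evaluated with two bins.
probs = [0.05, 0.15, 0.15, 0.85, 0.95, 0.95]
labels = [0, 0, 1, 1, 1, 1]
ece_toy = expected_calibration_error(probs, labels, n_bins=2)
print(f"ECE = {ece_toy:.3f}")
```

Rerunning with a different n_bins on real data is a quick way to see how sensitive the estimate is to the binning choice.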

Why models miscalibrate

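Overconfidence: deep neural networks (Guo et al. 2017) and naive Bayes often produce predictions too close to 0/1.
Underconfidence: heavily regularised models (L2 logistic), models trained on small or noisy data, and max-margin methods such as SVMs and classical boosted trees, whose scores are pushed away from 0/1 in a sigmoid-shaped distortion (Niculescu-Mizil and Caruana 2005).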
Bias: distribution shift (training class balance ≠ deployment), incorrect cost functions.
Random forest extremes: outputs are fractions of trees voting positive (e.g., 30/100 → 0.30); because averaging rarely yields values near 0 or 1, the extremes can be miscalibrated. Calibrating on out-of-bag predictions helps.

Post-hoc recalibration

If a model is miscalibrated, fix it on a held-out CALIBRATION SET:

Platt scaling (Platt 1999): fit a logistic regression of (true labels) on (predicted scores) on the calibration set; use the fitted sigmoid to transform predictions. Two parameters (slope + intercept). Best for sigmoidal miscalibration (SVMs, boosting).
Isotonic regression: fit a non-decreasing piecewise-constant function of predicted score to outcome. More flexible than Platt; needs more calibration data. Best when miscalibration is non-sigmoidal.
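Temperature scaling (Guo et al. 2017): for neural networks, divide the logits by a temperature T (equivalently, multiply by 1/T) and apply softmax; tune T on held-out data. T > 1 softens overconfident predictions; T < 1 sharpens underconfident ones. Single parameter, simple, and it leaves the ranking of predictions (hence AUC) unchanged.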
Beta calibration (Kull et al. 2017): three-parameter Beta-family fit; handles asymmetric miscalibration.
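The first two methods in the list above can be sketched in pure Python. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the Platt fit uses plain Newton steps without the target smoothing from Platt's paper, the isotonic fit is a basic pool-adjacent-violators pass that assumes distinct scores, and the data are simulated. In practice a library implementation (for example scikit-learn's IsotonicRegression) is the more likely choice.

```python
import bisect
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_platt(scores, labels, n_iter=25):
    """Fit p = sigmoid(a * score + b) by Newton's method on the log loss."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        ga = gb = haa = hab = hbb = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            w = p * (1.0 - p)
            ga += (p - y) * s
            gb += p - y
            haa += w * s * s
            hab += w * s
            hbb += w
        det = haa * hbb - hab * hab
        if det < 1e-12:
            break  # Hessian is numerically singular; stop updating
        a -= (hbb * ga - hab * gb) / det
        b -= (haa * gb - hab * ga) / det
    return a, b

def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns (upper score per block, fitted value per block)."""
    blocks = []  # each block is [sum of labels, count, largest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the current one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            sum_y, count, top = blocks.pop()
            blocks[-1][0] += sum_y
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [blk[2] for blk in blocks], [blk[0] / blk[1] for blk in blocks]

def predict_isotonic(model, s):
    """Piecewise-constant prediction from the fitted step function."""
    uppers, values = model
    return values[min(bisect.bisect_left(uppers, s), len(values) - 1)]

# Simulated calibration set (assumed): raw scores whose true link is sigmoid(2s - 0.5).
rng = random.Random(0)
scores = [rng.uniform(-3, 3) for _ in range(2000)]
labels = [1 if rng.random() < sigmoid(2.0 * s - 0.5) else 0 for s in scores]

a, b = fit_platt(scores, labels)
iso = fit_isotonic(scores, labels)
print(f"Platt: slope={a:.2f} intercept={b:.2f}")
print(f"Isotonic: {len(iso[1])} steps, p(-2.5)={predict_isotonic(iso, -2.5):.2f}, "
      f"p(2.5)={predict_isotonic(iso, 2.5):.2f}")
```

The contrast in flexibility is visible in the output: Platt scaling summarises the whole correction in two numbers, while the isotonic fit uses as many steps as the data support, which is why it needs a larger calibration set.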

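Cross-validation calibration: when data are too scarce for a separate calibration set, split the training data into folds, train on all but one fold, and calibrate on the held-out fold, rotating so that every observation is used for calibration exactly once (scikit-learn's CalibratedClassifierCV follows this pattern). Never calibrate on the same observations used to fit the model.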

Conformal prediction: calibrated UNCERTAINTY

Vovk et al. (2005) and modern split conformal prediction (Romano et al. 2019, Angelopoulos and Bates 2021) provide DISTRIBUTION-FREE PREDICTION INTERVALS with guaranteed marginal coverage. For each new prediction, output a SET of plausible labels (classification) or an INTERVAL (regression) that contains the true value with probability at least 90% (or 95%), regardless of model class, provided the calibration and test data are exchangeable. The guarantee is marginal (averaged over test points), not conditional on a particular input. A companion to calibration: it gives reliable uncertainty estimates, not just point probabilities.
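For regression, split conformal prediction reduces to a few lines. The sketch below assumes absolute residuals as the conformity score, a hypothetical and deliberately imperfect fitted model, and simulated data; it illustrates the idea and does not reproduce any particular paper's procedure.

```python
import math
import random

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(residuals)[k - 1]

def model(x):
    """Hypothetical fitted model; it ignores the true intercept on purpose."""
    return 2.0 * x

rng = random.Random(0)

def draw(n):
    """Simulated data (assumed): y = 2x + 0.3 + Gaussian noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [2.0 * x + 0.3 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

x_cal, y_cal = draw(1000)    # held-out calibration set
x_test, y_test = draw(2000)  # fresh test set

alpha = 0.1  # target 90% coverage
q = conformal_quantile([abs(y - model(x)) for x, y in zip(x_cal, y_cal)], alpha)

# The interval for each test point is model(x) +/- q.
covered = [model(x) - q <= y <= model(x) + q for x, y in zip(x_test, y_test)]
coverage = sum(covered) / len(covered)
print(f"half-width q={q:.2f}, empirical coverage={coverage:.3f}")
```

The empirical coverage lands near the 90% target even though the model is biased; the price of a poor model is wider intervals, not broken coverage.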

Try it

Default scenario: Calibrated. The reliability curve hugs the green diagonal; ECE is near 0; the Brier score is almost entirely refinement loss and sits below the no-skill baseline. Trustworthy.
Switch to Overconfident. Predictions are pushed toward 0/1; the curve dips below the diagonal at high p and above it at low p. ECE rises and the Brier score worsens. Fix: Platt or isotonic recalibration.
Switch to Underconfident. Predictions are pushed toward 0.5; the curve sits above the diagonal at high p (model says 0.6, actual rate 0.85) and below it at low p. ECE rises. Fix: temperature scaling with T < 1.
Switch to Biased. The curve is parallel to the diagonal but SHIFTED: same shape, wrong intercept. Fix: an intercept (bias) correction estimated on a calibration set.
Crank N up to 5000. The reliability curve smooths out and the ECE estimate stabilises. With small N, ECE has high variance, so interpret it with caution.
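The temperature fix suggested for the underconfident scenario can be sketched as follows. This is a minimal binary version (a sigmoid standing in for the two-class softmax) with T chosen by grid search on held-out log loss; the function names, the grid, and the simulated data are assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(logits, labels, temperature):
    """Mean negative log-likelihood after dividing the logits by the temperature."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on the grid that minimises held-out log loss."""
    grid = grid or [0.05 * k for k in range(2, 101)]  # 0.10, 0.15, ..., 5.00
    return min(grid, key=lambda t: log_loss(logits, labels, t))

# Simulated underconfident model (assumed): it reports half the true logit,
# so its probabilities are squeezed toward 0.5.
rng = random.Random(0)
true_logits = [rng.gauss(0, 2) for _ in range(2000)]
labels = [1 if rng.random() < sigmoid(z) else 0 for z in true_logits]
model_logits = [z / 2.0 for z in true_logits]

t_hat = fit_temperature(model_logits, labels)
loss_before = log_loss(model_logits, labels, 1.0)
loss_after = log_loss(model_logits, labels, t_hat)
print(f"fitted T={t_hat:.2f}, log loss {loss_before:.3f} -> {loss_after:.3f}")
```

A fitted T below 1 sharpens the predictions; on an overconfident model the same procedure would return T above 1. Because dividing by a positive constant never reorders the logits, the ranking of cases is untouched.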

A binary classifier achieves AUC = 0.92 (excellent ranking), but its reliability diagram shows a curve consistently below the diagonal (overconfident). What should the analyst do, and is AUC enough to deploy this model?

What you now know

Calibration measures whether predicted probabilities are trustworthy. Reliability diagrams visualise it; the Brier score combines calibration + refinement; ECE isolates calibration. Post-hoc recalibration via Platt, isotonic, temperature, or Beta methods fixes most miscalibration using a held-out set. Modern conformal prediction provides distribution-free intervals with guaranteed marginal coverage. §9.5 next: fairness audits, which check that ML predictions are equitable across demographic groups.

References

Platt, J. (1999). "Probabilistic outputs for support vector machines." In Advances in Large Margin Classifiers, 61–74. MIT Press.
Niculescu-Mizil, A., Caruana, R. (2005). "Predicting good probabilities with supervised learning." ICML.
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q. (2017). "On calibration of modern neural networks." ICML. (Temperature scaling.)
Murphy, A.H. (1973). "A new vector partition of the probability score." J. Applied Meteorology 12(4), 595–600.
Angelopoulos, A., Bates, S. (2021). "A gentle introduction to conformal prediction and distribution-free uncertainty quantification." arXiv:2107.07511.