Calibration: when 95% really means 95%

Part 3, Confidence intervals and uncertainty

Learning objectives

Define CALIBRATION as the property that a procedure claiming (1 − α) coverage / a forecast claiming probability p delivers that frequency in repeated experience. State that calibration is a property of the PROCEDURE under repeated sampling, not of a single realised interval or forecast
Operationalise empirical CI calibration: simulate R datasets from a known DGP, build the CI on each, count the fraction covering true θ, compare to nominal. State that for discrete sampling distributions (Binomial, Poisson) the coverage can be computed EXACTLY by summation, no Monte-Carlo noise, and that this is what the §3.1 coverage-explorer and §3.5 coverage-calibration widgets do
Recall the §3.1 Brown-Cai-DasGupta (2001) verdict: Wald systematically UNDER-COVERS near binomial boundaries (a structural failure, not a finite-n artefact); Wilson is close to nominal; Clopper-Pearson over-covers (a deliberate price for guaranteed coverage on a discrete sample space). State that bootstrap-percentile coverage can drift under non-smooth functionals or near boundaries
Define probabilistic-forecast calibration for a Bernoulli outcome: P(Y = 1 | p̂ = p) = p for all p ∈ [0, 1]. Distinguish from accuracy (a forecast can be perfectly calibrated yet uninformative, e.g. always 0.5)
Define the RELIABILITY DIAGRAM (Murphy & Winkler 1987, Monthly Weather Review): bin predictions into K bins, plot mean predicted probability (x) vs observed frequency (y). Perfect calibration is the y = x diagonal. State that bin counts matter: sparse bins have wide confidence bands, dense bins have tight bands
State the BRIER SCORE (Brier 1950): B = (1/N) Σ (p̂_i − Y_i)². The proper-scoring decomposition (Murphy 1973): Brier = reliability − resolution + uncertainty. Reliability ≥ 0 measures calibration; resolution ≥ 0 measures how much the observed frequency varies across distinct forecast values; uncertainty depends on the marginal Pr(Y = 1) only. Brier is a STRICTLY PROPER scoring rule (Gneiting & Raftery 2007): the forecaster minimises E[Brier] by reporting the true conditional probability
Define the EXPECTED CALIBRATION ERROR (ECE, Naeini-Cooper-Hauskrecht 2015; Guo et al. 2017): ECE = Σ_b (n_b / N) · |mean(p̂_b) − mean(Y_b)|. Computable from a reliability diagram in one line. Lower is better; 0 means perfect calibration in the chosen binning
Describe REGRESSION calibration: standardised residuals (Y_i − Ŷ_i)/SÊ_i should be approximately N(0, 1). Q-Q plot vs the standard Normal visualises calibration; departures show under-/over-coverage of the regression CI / PI
Describe PLATT SCALING (Platt 1999): fit logit(p_recal) = a · logit(p̂) + b on a held-out CALIBRATION SET. Slope a < 1 cures overconfidence; slope a > 1 sharpens an underconfident forecaster; intercept b removes bias. Parametric, two degrees of freedom
Describe ISOTONIC REGRESSION (Zadrozny & Elkan 2002): nonparametric monotone-non-decreasing fit obtained by the Pool-Adjacent-Violators (PAV) algorithm. More flexible than Platt, can fit non-sigmoid miscalibration, but uses more degrees of freedom and can overfit on small calibration sets
State the ML CALIBRATION FINDING (Guo, Pleiss, Sun, Weinberger 2017, ICML): modern deep neural networks are systematically OVERCONFIDENT, high-capacity classifiers achieve high accuracy at the cost of poor calibration. Temperature scaling, Platt scaling with a single shared slope across logits, is the cheap, surprisingly-effective fix. Preview §9.4 (ML for researchers, calibration and probability outputs)
Articulate the THREE HONEST CAVEATS: (i) calibration ≠ accuracy: an uninformative constant forecast can be perfectly calibrated; (ii) calibration requires enough data per bin: sparse bins are noisy; (iii) on-training-distribution calibration does NOT guarantee OOD calibration. The recalibration set must be representative of deployment

The four sections that opened Part 3 each delivered a CONFIDENCE-INTERVAL procedure that CLAIMS (1 − α) coverage. The Wald CI claims it via the CLT (§3.1); Wilson via the score-test inversion (§3.1); Clopper-Pearson via exact-binomial summation (§3.1); the bootstrap via the empirical-quantile pivot (§3.2); the LRT via the chi-square inversion (§3.3); the Normal-PI and conformal-PI via the predictive pivot (§3.4). Every one of those procedures is a CLAIM. §3.5 is the EMPIRICAL TEST.

The unifying concept is CALIBRATION: a procedure that claims a frequency $1 - \alpha$ should DELIVER that frequency in repeated sampling. For a CI, calibration means the long-run fraction of intervals covering $\theta$ matches $1 - \alpha$. For a probabilistic forecast (Murphy & Winkler 1987, Monthly Weather Review), calibration means $P(Y = 1 \mid \hat p = p) = p$ for all $p \in [0, 1]$: on days the forecaster says "70% chance of rain", it rains on $\approx 70\%$ of them. For a regression, calibration means the standardised residuals $(Y_i - \hat Y_i)/\widehat{\mathrm{SE}}_i$ behave like $\mathcal{N}(0, 1)$: the bands around predictions cover the realised outcomes at nominal frequency. The mathematical machinery differs across these contexts, but the empirical-testability question is the same.

The most-cited finding in §3.5 is the Brown-Cai-DasGupta (2001) verdict from §3.1, recast as a calibration statement: the Wald CI for the binomial CLAIMS $1 - \alpha$ coverage but DELIVERS substantially less near the boundary (e.g. about 78% at $p = 0.05, n = 30$ by exact summation). The CI is MIS-CALIBRATED. Clopper-Pearson CLAIMS $1 - \alpha$ and DELIVERS more (about 98% at the same setting): also mis-calibrated, but on the conservative side. Wilson stays much closer to nominal. The §3.5 coverage-calibration widget extends this from the §3.1 single-n snapshot to a SAMPLE-SIZE SWEEP, showing how the calibration of each method evolves as $n$ grows.

The §3.5 arc has twelve stops. First, the formal definition of calibration as a procedural property. Second, the empirical-calibration recipe: simulate, count, compare. Third, the coverage-calibration widget, the empirical-coverage curve vs $n$ for four binomial CIs. Fourth, probabilistic-forecast calibration and the reliability diagram (Murphy & Winkler 1987). Fifth, the Brier score (Brier 1950) and the strictly-proper-scoring framework (Gneiting & Raftery 2007). Sixth, the expected calibration error (ECE, Guo et al. 2017). Seventh, the reliability-diagram widget. Eighth, regression calibration via standardised residuals and the Q-Q plot. Ninth, Platt scaling (Platt 1999). Tenth, isotonic regression (Zadrozny & Elkan 2002) and the Pool-Adjacent-Violators algorithm. Eleventh, the ML calibration finding (Guo et al. 2017) and a §9.4 preview. Twelfth, the three honest caveats: calibration ≠ accuracy, sparse-bin noise, and OOD failure.

Calibration: a property of the procedure, not of the realised interval

The §3.1 distinction between NOMINAL and ACTUAL coverage is the cleanest entry point. A CI procedure $C(X)$ at nominal level $1 - \alpha$ satisfies, by construction or by claim, $P_\theta(\theta \in C(X)) \ge 1 - \alpha$ . The ACTUAL coverage $C_{\mathrm{emp}}(\theta)$ is computed by integrating over the true sampling distribution:

$$C_{\mathrm{emp}}(\theta) \;=\; P_\theta\bigl(\theta \in C(X)\bigr) \;=\; \int \mathbb{1}\!\left[\theta \in C(x)\right] f(x \mid \theta)\,dx.$$

For a discrete sampling distribution like Binomial$(n, p)$ the integral is a sum and is computable EXACTLY: $C_{\mathrm{emp}}(p) = \sum_{k=0}^n \mathbb{1}[p \in C(k)] \binom{n}{k} p^k(1-p)^{n-k}$, which is what the §3.1 coverage-explorer widget computes. The procedure is CALIBRATED at $\theta$ when $C_{\mathrm{emp}}(\theta) \approx 1 - \alpha$; UNDER-CALIBRATED (under-covers) when $C_{\mathrm{emp}}(\theta) < 1 - \alpha$; OVER-CALIBRATED (over-covers) when $C_{\mathrm{emp}}(\theta) > 1 - \alpha$. The CALIBRATION ERROR is the gap $|C_{\mathrm{emp}}(\theta) - (1 - \alpha)|$, integrated or maximised over the $\theta$-space depending on the report you want.

Calibration is therefore a property of the PROCEDURE under the FULL sampling distribution. A single realised interval [0.42, 0.61] is not "calibrated" or "miscalibrated"; those words are reserved for the procedure that produced it. Just as §3.1 warned against reading "I am 95% confident $\theta \in [0.42, 0.61]$" as a probability statement about a single realised interval, §3.5 warns against reading "this CI is calibrated" without the implicit "the procedure that produced it is calibrated, in the long run, under the assumed DGP." Calibration is a long-run frequency claim, and it is empirically testable.
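The exact-summation formula can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the widget's actual code; the function names `wald_ci`, `wilson_ci`, and `exact_coverage` are chosen here for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_ci(k, n, z):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_ci(k, n, z):
    """Wilson score interval (inversion of the score test)."""
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def exact_coverage(ci, n, p, level=0.95):
    """Sum the Binomial(n, p) pmf over every k whose interval contains p."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    total = 0.0
    for k in range(n + 1):
        lo, hi = ci(k, n, z)
        if lo <= p <= hi:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total
```

Evaluating `exact_coverage(wald_ci, n, p)` and `exact_coverage(wilson_ci, n, p)` over a grid of `n` gives a coverage-versus-$n$ curve with no Monte-Carlo noise, because the only randomness has been summed out analytically.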

The empirical-calibration recipe

For any CI procedure $C(\cdot)$ and any DGP $P_\theta$ , the empirical calibration is computed in five steps:

Fix the procedure $C$ and the true parameter $\theta$ (or scan a grid of $\theta$ values).
Simulate $R$ datasets $X^{(1)}, \ldots, X^{(R)}$ from $P_\theta$ .
For each, compute the CI $C(X^{(r)})$ .
Count the fraction $\hat C = \frac{1}{R}\sum_{r=1}^R \mathbb{1}[\theta \in C(X^{(r)})]$ .
Report $\hat C$ with a normal-approximation (Wald) Monte-Carlo band, $\hat C \pm z_{1-\alpha/2}\sqrt{\hat C(1-\hat C)/R}$; a Wilson-score band is the safer choice when $\hat C$ is close to 0 or 1.

For discrete sampling distributions the integration in steps (2)-(4) collapses to summation and the answer is EXACT: no Monte-Carlo noise, no $R$-dependence. For continuous distributions the $R \to \infty$ limit of $\hat C$ is the true coverage, with the Monte-Carlo band shrinking at the $1/\sqrt R$ rate. Either way, the procedural fact ("what fraction of intervals contain $\theta$ when the procedure runs on data from $P_\theta$?") is empirically computable. This is the conceptual axis on which calibration becomes testable.
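The five-step recipe can be sketched for a continuous DGP as follows. The setting is an assumption made for illustration: Exponential data with true mean $\theta$ and a normal-approximation interval $\bar x \pm z\,s/\sqrt n$, which is known to under-cover at small $n$ for skewed data. The names `z_interval` and `mc_coverage` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_interval(xs, level=0.95):
    """Normal-approximation CI for a mean: x_bar +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * stdev(xs) / sqrt(len(xs))
    centre = mean(xs)
    return centre - half, centre + half


def mc_coverage(n, theta=1.0, reps=4000, level=0.95, seed=0):
    """Simulate, build the CI, count hits, attach a Monte-Carlo band."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # expovariate takes the rate, so 1 / theta gives mean theta.
        xs = [rng.expovariate(1 / theta) for _ in range(n)]
        lo, hi = z_interval(xs, level)
        hits += lo <= theta <= hi
    c_hat = hits / reps
    # 95% normal-approximation band on the coverage estimate itself.
    band = 1.96 * sqrt(c_hat * (1 - c_hat) / reps)
    return c_hat, band
```

Calling `mc_coverage(10)` and `mc_coverage(200)` shows the estimated coverage rising toward nominal as $n$ grows, and quadrupling `reps` roughly halves the band.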

The empirical recipe also works for procedures that have NO analytic coverage formula. Bootstrap-percentile CIs have approximate coverage $1 - \alpha + O(1/\sqrt n)$ asymptotically (Efron 1979; §3.2), but at finite $n$ the coverage depends on the underlying distribution. Simulation IS the calibration check. The coverage-calibration widget below uses the exact summation for Wald, Wilson, and Clopper-Pearson, and Monte-Carlo (B = 400 bootstrap resamples per $k$ ) for the bootstrap-percentile method; the noise on the bootstrap curve at small $n$ is the signature of the $1/\sqrt B$ Monte-Carlo error.
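A hedged sketch of how such a hybrid calculation might look for the bootstrap-percentile interval: the outer sum over $k$ is exact, while the interval for each $k$ comes from $B$ resamples and is therefore noisy. This assumes a simple nearest-rank percentile rule; the widget's own quantile convention is not stated in the text, and the function names are illustrative.

```python
import random
from math import comb


def percentile_ci(k, n, rng, level=0.95, B=400):
    """Bootstrap-percentile CI for p from k successes in n trials."""
    p_hat = k / n
    # Resampling n Bernoulli observations with replacement is equivalent
    # to drawing a Binomial(n, p_hat) count.
    boots = sorted(
        sum(rng.random() < p_hat for _ in range(n)) / n for _ in range(B)
    )
    alpha = 1 - level
    lo = boots[round((alpha / 2) * (B - 1))]       # nearest-rank lower percentile
    hi = boots[round((1 - alpha / 2) * (B - 1))]   # nearest-rank upper percentile
    return lo, hi


def bootstrap_coverage(n, p, level=0.95, B=400, seed=0):
    """Exact outer sum over k; Monte-Carlo interval inside each term."""
    rng = random.Random(seed)
    total = 0.0
    for k in range(n + 1):
        lo, hi = percentile_ci(k, n, rng, level, B)
        if lo <= p <= hi:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total
```

Re-running `bootstrap_coverage` with different seeds changes the answer slightly whenever some $k$ has an interval endpoint that sits close to $p$; that seed-to-seed wobble is the Monte-Carlo noise described above.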

The first widget extends the §3.1 coverage-explorer in two ways: from a single $n$ to a SWEEP over $n$, and from a fixed CI method to a four-method comparison. Pick a true binomial $p$, a nominal level (90% / 95% / 99%), and which CI methods to display. The widget evaluates $C_{\mathrm{emp}}(p)$ at $n \in \{5, 10, 15, 20, 30, 40, 50, 75, 100, 150, 200, 300, 500\}$ via the exact-summation formula (Wald, Wilson, Clopper-Pearson) or the bootstrap (400 resamples per $k$), plots the resulting empirical-coverage curve on a log-x axis, and overlays the nominal-coverage horizontal line.

Things to verify in the widget:

Default settings: $p = 0.10$, 95% nominal, all methods. The Wald (red) curve is far below nominal at $n = 5$ (roughly 40% by exact summation, because $k = 0$ and every $k \ge 3$ produce intervals that miss $p$) and drifts UP toward 95% as $n$ grows, but is still short of it at the largest $n$ shown for this near-boundary $p$. Wilson (green) stays within a few points of 95% across the whole range. Clopper-Pearson (blue) sits at or above nominal across the whole range, typically around 97-98%. Bootstrap-percentile (amber) tracks Wilson at large $n$ and degrades at small $n$ to near-Wald levels.
Slide true $p$ to 0.30 (interior). All four methods converge to within $\pm 1\%$ of nominal by $n = 30$. The large Wald shortfall is a BOUNDARY phenomenon, not a failure of the same size at every parameter value. Brown-Cai-DasGupta (2001) also document smaller, oscillating Wald shortfalls at interior $p$, so "fine" here means close to nominal, not exactly on it; the boundary regions are where Wald breaks badly.
Slide $p$ down to 0.05 (very near boundary). Wald coverage drops dramatically: to roughly 40% at $n = 10$ (where $k = 0$ alone has probability $0.95^{10} \approx 0.60$ and yields a zero-width interval), climbing to just under 90% at $n = 100$, and still under nominal at $n = 500$. This is the regime where Wald is structurally MIS-CALIBRATED; the symmetric $\hat p \pm z\widehat{\mathrm{SE}}$ formula cannot land on a near-boundary $p$ when $\hat p$ has discrete support. Wilson continues to stay near nominal and Clopper-Pearson stays on the conservative side; the bootstrap-percentile curve degrades at small $n$ along with Wald.
Toggle the bootstrap method ON. Note the curve has VISIBLE noise compared to Wilson: that is the Monte-Carlo error from B = 400 bootstrap resamples per $k$. Theoretical large-sample coverage is $1 - \alpha + O(1/\sqrt n)$ (Efron 1987); at $n = 5$ the bootstrap is far from the limit and coverage drifts; at $n = 500$ it tracks Wilson closely. The bootstrap is a general-purpose CI but it is NOT a free lunch: small-$n$ near-boundary coverage requires more sophistication (BCa, double bootstrap).
Toggle the nominal level to 99%. Wald coverage drops further BELOW the new 99% line: the structural Wald failure scales with the gap between symmetric-Normal tails and the discrete-binomial reality. Clopper-Pearson hits 99.5%+ at small $n$.
Move the focus-n slider. The numeric table updates with the exact coverage at the focus $n$ , the gap (nominal − empirical), and a verdict ("under-covers" / "over-covers" / "on target"). The colour cues mirror the §3.1 table: Wald is typically red ("under-covers"); Clopper-Pearson is amber ("over-covers"); Wilson is green ("on target"); bootstrap depends on $n$ .

Probabilistic forecast calibration: from CI to weather forecasting

The CI calibration story generalises beyond intervals. Murphy and Winkler (1987, Monthly Weather Review 115(7), 1330-1338) gave the canonical framework for verifying probabilistic forecasts: a forecaster issues predicted probabilities $\hat p_i \in [0, 1]$ for binary outcomes $Y_i \in \{0, 1\}$, and the forecasts are CALIBRATED if

$$P(Y = 1 \mid \hat p = p) \;=\; p \qquad \text{for all } p \in [0, 1].$$

The frequency interpretation: across all days the forecaster says "70% chance of rain", the actual rain rate is $70\%$. Across all days the forecaster says "5% chance", the actual rain rate is $5\%$. This is the same calibration concept as the CI one: the procedure should DELIVER the frequency it CLAIMS. The objects differ (interval vs predicted probability) but the property is identical.

The empirical test is the RELIABILITY DIAGRAM (Murphy 1973; Murphy & Winkler 1987). Bin the predictions into $K$ equal-width bins (typically $K = 10$ ). In each bin $b$ , compute the bin's MEAN predicted probability $\bar{\hat p}_b$ and the bin's OBSERVED frequency $\bar Y_b$ . Plot $(\bar{\hat p}_b, \bar Y_b)$ for each bin on a unit square. Perfect calibration is the $y = x$ DIAGONAL. Above the diagonal: the forecaster is UNDER-confident (the actual rate exceeds the predicted). Below the diagonal: the forecaster is OVER-confident.

The reliability-diagram visualisation is now standard in operational weather forecasting (Brier 1950; Murphy & Winkler 1987), in machine learning (Niculescu-Mizil & Caruana 2005; Guo et al. 2017), in medical risk prediction (Steyerberg et al. 2010), and in climate / hurricane risk communication. The bin-count distribution matters: bins with few observations have noisy frequencies; bins with many observations are statistically tight. Many implementations draw the diagram with marker size proportional to bin count and add binomial confidence bands on each bin's observed frequency.
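The binning step behind a reliability diagram is short enough to sketch directly. The helper names below (`reliability_bins`, `simulate_calibrated`) are hypothetical, and the simulated forecaster is an assumed toy DGP in which the outcome is drawn with exactly the stated probability.

```python
import random


def reliability_bins(p_hat, y, K=10):
    """Equal-width bins on [0, 1].

    Returns one (mean predicted prob, observed frequency, count) tuple
    per non-empty bin, in increasing order of predicted probability.
    """
    sums_p = [0.0] * K
    sums_y = [0.0] * K
    counts = [0] * K
    for p, yi in zip(p_hat, y):
        b = min(int(p * K), K - 1)  # p = 1.0 falls in the last bin
        sums_p[b] += p
        sums_y[b] += yi
        counts[b] += 1
    return [
        (sums_p[b] / counts[b], sums_y[b] / counts[b], counts[b])
        for b in range(K)
        if counts[b] > 0
    ]


def simulate_calibrated(N, seed=0):
    """Toy calibrated forecaster: Y ~ Bernoulli(p_hat), p_hat uniform."""
    rng = random.Random(seed)
    p_hat = [rng.random() for _ in range(N)]
    y = [1 if rng.random() < p else 0 for p in p_hat]
    return p_hat, y
```

Plotting the first two entries of each tuple against each other, with marker size driven by the third, gives the diagram; for `simulate_calibrated` the points should scatter around the $y = x$ diagonal.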

The Brier score and the expected calibration error

A reliability diagram is a picture. Two NUMBERS summarise it:

The BRIER SCORE (Brier 1950, Monthly Weather Review 78(1), 1-3):

$$B \;=\; \frac{1}{N}\sum_{i=1}^N (\hat p_i - Y_i)^2.$$

It is the mean squared error between the predicted probability and the realised outcome (an indicator on $\{0, 1\}$). LOWER is BETTER; the minimum $B = 0$ requires perfect prediction ($\hat p_i = Y_i$ for every $i$, which is impossible for genuinely random outcomes); the maximum $B = 1$ requires the forecaster to be exactly wrong on every observation. A constant forecast of $\bar Y$ (the marginal mean) achieves $B = \bar Y(1 - \bar Y)$: the Bernoulli variance evaluated at the marginal.

The MURPHY (1973) decomposition of the Brier score, given a reliability-diagram binning, is

$$B \;=\; \underbrace{\frac{1}{N}\sum_b n_b (\bar{\hat p}_b - \bar Y_b)^2}_{\text{reliability}} \;-\; \underbrace{\frac{1}{N}\sum_b n_b (\bar Y_b - \bar Y)^2}_{\text{resolution}} \;+\; \underbrace{\bar Y(1 - \bar Y)}_{\text{uncertainty}}.$$

Reliability is the calibration component, zero for a perfectly calibrated forecaster. Resolution is the discrimination component, high for a forecaster whose different forecast values are followed by different observed frequencies. Uncertainty is intrinsic to the outcome and identical for every forecaster on the same data. The decomposition is an exact identity when every forecast in a bin takes the same value; with coarser bins a within-bin remainder appears. Brier is a STRICTLY PROPER scoring rule (Gneiting & Raftery 2007, JASA): the forecaster minimises $\mathbb{E}[B]$ by reporting the true conditional probability, and the minimum is unique. This makes Brier suitable for COMPETITIVE FORECASTING: different forecasters can be ranked by their Brier scores and the ranking respects truthful prediction.
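A numerical check of the Murphy decomposition, as a sketch: grouping by distinct forecast value (so each "bin" contains a single forecast value) makes the identity hold to floating-point precision. The function names are illustrative.

```python
from collections import defaultdict


def brier(p_hat, y):
    """Mean squared difference between forecast and outcome."""
    return sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)


def murphy_decomposition(p_hat, y):
    """Return (reliability, resolution, uncertainty).

    Observations are grouped by distinct forecast value, so that
    brier == reliability - resolution + uncertainty exactly.
    """
    groups = defaultdict(list)
    for p, yi in zip(p_hat, y):
        groups[p].append(yi)
    N = len(y)
    y_bar = sum(y) / N
    rel = sum(len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()) / N
    res = sum(len(ys) * (sum(ys) / len(ys) - y_bar) ** 2 for ys in groups.values()) / N
    unc = y_bar * (1 - y_bar)
    return rel, res, unc
```

For a constant forecast equal to the sample mean of `y`, the function returns zero reliability and zero resolution, so the Brier score collapses to the uncertainty term.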

The EXPECTED CALIBRATION ERROR (ECE), introduced by Naeini, Cooper, and Hauskrecht (2015) and popularised in ML by Guo et al. (2017, ICML):

$$\mathrm{ECE} \;=\; \sum_{b=1}^K \frac{n_b}{N}\, \bigl|\bar{\hat p}_b - \bar Y_b\bigr|.$$

ECE is a weighted L1 distance between each bin's mean predicted probability and its observed frequency, with weights proportional to bin size. LOWER is BETTER; 0 is perfect calibration in the chosen binning. Unlike Brier, ECE depends on the chosen $K$ and binning scheme (equal-width vs equal-mass), so reporting both the value and the binning convention is critical. With too few observations per bin the ECE estimate is itself noisy. Guo et al. (2017) use $K = 15$ equal-width bins in their experiments; a smaller $K$ is sensible for smaller $N$.
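Both binning conventions can be sketched side by side, which makes the binning-dependence of ECE easy to see on your own data. The function names are illustrative, not from any particular library.

```python
def _gap(pairs):
    """|mean predicted probability - observed frequency| within one bin."""
    n = len(pairs)
    return abs(sum(p for p, _ in pairs) / n - sum(y for _, y in pairs) / n)


def ece_equal_width(p_hat, y, K=10):
    """ECE with K equal-width bins on [0, 1]; empty bins contribute nothing."""
    N = len(y)
    bins = [[] for _ in range(K)]
    for p, yi in zip(p_hat, y):
        bins[min(int(p * K), K - 1)].append((p, yi))
    return sum(len(b) / N * _gap(b) for b in bins if b)


def ece_equal_mass(p_hat, y, K=10):
    """ECE with K bins holding (as nearly as possible) equal counts."""
    pairs = sorted(zip(p_hat, y))
    N = len(pairs)
    total = 0.0
    for j in range(K):
        b = pairs[j * N // K:(j + 1) * N // K]
        if b:
            total += len(b) / N * _gap(b)
    return total
```

As a hand-checkable case, forecasts `[0.1, 0.1, 0.9, 0.9]` with outcomes `[0, 0, 1, 1]` and `K = 2` give a gap of 0.1 in each bin and weights of 0.5 each, so either function returns 0.1 up to rounding.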

The second widget makes the calibration story visible for probabilistic forecasts. Pick a miscalibration profile (well-calibrated baseline, overconfident sigmoid, underconfident, biased), a sample size $N \in \{200, 500, 1000, 2000, 5000\}$, a latent-q spread (the Beta prior on the true probability), a bin count $K$, and a recalibration option (none, Platt scaling, isotonic regression). The widget draws $N$ (predicted prob, binary outcome) pairs, plots the reliability diagram with markers sized by bin count, and reports Brier and ECE both BEFORE and AFTER recalibration.

Things to verify in the widget:

Start with "calibrated" profile, $N = 1000$, $K = 10$, recal = none. The reliability points should lie on the $y = x$ diagonal within sampling noise (standard error $\sqrt{\bar Y_b(1 - \bar Y_b)/n_b} \le 0.5/\sqrt{n_b}$ per bin). Brier and ECE are small. This is the reference case the §3.5 calibration story is built around.
Switch profile to "overconfident". The reliability curve is an S-SHAPE: predictions near $0$ fire too often (points above the diagonal at low $\hat p$ ); predictions near $1$ fail too often (points below the diagonal at high $\hat p$ ). Brier increases. ECE increases. This is the canonical neural-network miscalibration (Guo et al. 2017): high-capacity classifiers extract sharp signal from data and the resulting probabilities are pulled toward the extremes.
With "overconfident" still selected, apply Platt scaling. The recalibrated curve straightens toward the diagonal; Brier drops; ECE drops. The widget reports the fitted slope $a$ and intercept $b$ of $\mathrm{logit}(p_{\mathrm{recal}}) = a\cdot\mathrm{logit}(\hat p) + b$ . For overconfident inputs the fitted slope is $< 1$ : Platt scaling cures overconfidence by FLATTENING the sigmoid.
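Switch profile to "biased (+0.10)". The reliability curve is a PARALLEL OFFSET below the diagonal: every bin's observed frequency is $\sim 0.10$ lower than the predicted bin mean. Apply Platt scaling: the recalibrated curve moves onto the diagonal, with the fitted Platt intercept $b$ absorbing most of the shift. A constant offset on the probability scale is only approximately a constant offset on the logit scale, so small residual gaps can remain at the extremes; systematic bias of this kind is nonetheless what the intercept is designed to remove.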
Compare Platt vs isotonic on the same biased data. Platt straightens the curve to the diagonal (parametric correction with 2 d.o.f.). Isotonic is also near-diagonal but with VISIBLE step structure: isotonic can use as many d.o.f. as it has pool-adjacent-violators blocks (up to $N$ in principle), and the higher flexibility shows on a small calibration set as small overfitting wobbles.
Reduce $N$ from 1000 to 200, keeping "overconfident" with isotonic recalibration. The recalibrated curve becomes MORE jagged: isotonic overfits on small calibration sets. Switch to Platt: the parametric correction is more stable. This is the bias-variance trade-off (§1.5): Platt is biased (it assumes sigmoid miscalibration) but low variance; isotonic is low-bias (it assumes only monotonicity, not a sigmoid shape) but higher variance.

Regression calibration: standardised residuals and Q-Q plots

For a fitted regression $Y_i = X_i^\top \hat\beta + \hat\varepsilon_i$ , the predicted value $\hat Y_i = X_i^\top \hat\beta$ comes with a standard error $\widehat{\mathrm{SE}}_i$ on the predictive distribution. The STANDARDISED residual is

$$r_i \;=\; \frac{Y_i - \hat Y_i}{\widehat{\mathrm{SE}}_i}.$$

Under the homoscedastic-Normal assumption, the $r_i$ are approximately $\mathcal{N}(0, 1)$ . CALIBRATION of the regression-based PI then asks: does the empirical distribution of the $r_i$ match the standard Normal? A Q-Q plot of the empirical $r_{(i)}$ quantiles against the theoretical Normal quantiles answers this. Departures from the $y = x$ Q-Q diagonal are interpreted as Part 4 will describe: heavy tails (S-shaped Q-Q), skew (curved Q-Q), heteroscedasticity (Q-Q OK marginally but $r_i$ vs $\hat Y_i$ plot widens).

The CI for the conditional mean $\mathbb{E}[Y \mid X]$ and the PI for $Y_{\mathrm{new}}$ each carry a NOMINAL level. The empirical-calibration test is a held-out PI coverage check: build the PI from training data, count the fraction of test observations inside, compare to nominal. This was the §3.4 pi-calibration widget mechanic. Part 4 will revisit it with regression-specific PI machinery (predictor-dependent widths, confidence bands vs prediction bands, conformal prediction). For §3.5 the point is structural: the same calibration question (DOES THE PROCEDURE DELIVER ITS CLAIM?) applies across CIs, PIs, probabilistic forecasts, and regression.
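A held-out check of this kind might look as follows for simple linear regression. This is a sketch under stated assumptions: the DGP in `simulate` is a made-up well-specified example, the prediction standard error uses the textbook formula $s\sqrt{1 + 1/n + (x - \bar x)^2/S_{xx}}$, and Normal rather than $t$ quantiles are used (reasonable only when the training set is large).

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def fit_line(xs, ys):
    """OLS for y = a + b x; returns (a, b, residual SD, x_bar, Sxx)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    s = sqrt(sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    return a, b, s, x_bar, sxx


def standardised_errors(train, test):
    """(Y_new - Y_hat) / SE_pred on held-out points."""
    xs, ys = zip(*train)
    a, b, s, x_bar, sxx = fit_line(xs, ys)
    n = len(xs)
    out = []
    for x, y in test:
        se = s * sqrt(1 + 1 / n + (x - x_bar) ** 2 / sxx)
        out.append((y - a - b * x) / se)
    return out


def pi_coverage(r, level=0.95):
    """Fraction of held-out points inside the nominal-level PI."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return sum(abs(ri) <= z for ri in r) / len(r)


def simulate(n, rng, sigma=1.0):
    """Hypothetical DGP: Y = 1 + 2 x + Normal(0, sigma^2), x uniform on [0, 1]."""
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 + 2 * x + rng.gauss(0, sigma)) for x in xs]
```

Checking `pi_coverage` at several levels (say 50%, 80%, 95%) is the numerical counterpart of reading the Q-Q plot: a heavy-tailed or heteroscedastic DGP can match one level by accident while missing the others.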

Platt scaling: parametric recalibration of a binary classifier

When the reliability diagram shows the predicted probabilities are mis-calibrated, RECALIBRATION post-processes them to recover calibration. Platt (1999, Advances in Large Margin Classifiers) introduced the simplest parametric recalibrator. Given raw scores $\hat p_i$ from a base classifier and binary outcomes $Y_i$ on a held-out CALIBRATION set, fit a one-variable logistic regression with predictor $x_i = \mathrm{logit}(\hat p_i)$ :

$$\mathrm{logit}\bigl(p_{\mathrm{recal}}(\hat p)\bigr) \;=\; a \cdot \mathrm{logit}(\hat p) \;+\; b.$$

Estimate $(a, b)$ by maximum likelihood, typically via iteratively-reweighted least squares (IRLS) or gradient descent (the widget uses IRLS, 20 iterations max). At inference time, apply the fitted map $p_{\mathrm{recal}}(\hat p) = \sigma(a \cdot \mathrm{logit}(\hat p) + b)$ to every raw probability the classifier emits. Two parameters: slope $a$ controls how sharp the recalibrated probabilities are (slope $< 1$ flattens overconfidence; slope $> 1$ sharpens an underconfident forecaster), and intercept $b$ controls bias (positive $b$ shifts all probabilities upward; negative $b$ shifts down).

Platt scaling assumes the miscalibration has a SIGMOID-shaped reliability curve. When the miscalibration is genuinely sigmoid (the canonical SVM-margin case Platt originally addressed; the canonical neural-net overconfidence case Guo et al. 2017 documented), Platt is near-optimal. When the miscalibration is monotone but not sigmoid-shaped (for example with plateaus or multiple bumps), Platt is mis-specified and isotonic regression does better. Platt is parametric (2 d.o.f., low variance, requires the model assumption to hold) and isotonic is nonparametric (up to $N$ d.o.f., higher variance, no model assumption beyond monotonicity).

Platt scaling on the LOGIT (not on the raw probability) is critical. A linear correction on the probability scale can push recalibrated values outside $[0, 1]$, and clipping them back produces non-smooth corrections at the boundaries. The logit transformation moves the boundaries to $\pm\infty$, makes the recalibration smooth, and lets the IRLS optimisation converge robustly.
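A minimal sketch of the Platt fit by Newton's method (equivalent to IRLS for logistic regression), using only the standard library. Assumptions to note: raw probabilities are clipped to $[\epsilon, 1 - \epsilon]$ so the logit stays finite, the iteration starts at the identity map $(a, b) = (1, 0)$, and `simulate_overconfident` is a hypothetical toy DGP in which the true slope is 0.5.

```python
import random
from math import exp, log


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clip so logit is finite at 0 and 1
    return log(p / (1 - p))


def sigmoid(t):
    return 1 / (1 + exp(-t))


def fit_platt(p_hat, y, iters=20):
    """Newton/IRLS for logit(p_recal) = a * logit(p_hat) + b; returns (a, b)."""
    x = [logit(p) for p in p_hat]
    a, b = 1.0, 0.0  # start at the identity map
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y):
            mu = sigmoid(a * xi + b)
            w = mu * (1 - mu)
            g_a += (yi - mu) * xi      # score for the slope
            g_b += yi - mu             # score for the intercept
            h_aa += w * xi * xi        # information matrix entries
            h_ab += w * xi
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if det <= 1e-12:
            break
        # Newton step: solve the 2x2 system (information) * delta = score.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def recalibrate(p, a, b):
    return sigmoid(a * logit(p) + b)


def simulate_overconfident(N, seed=0):
    """Toy DGP: true probability q, reported p_hat = sigmoid(2 * logit(q))."""
    rng = random.Random(seed)
    q = [rng.uniform(0.05, 0.95) for _ in range(N)]
    p_hat = [sigmoid(2 * logit(qi)) for qi in q]
    y = [1 if rng.random() < qi else 0 for qi in q]
    return p_hat, y
```

On the toy data the fitted slope should land near 0.5 and the intercept near 0, which is the "slope $< 1$ flattens overconfidence" behaviour in numbers. In practice the fit must use a held-out calibration set, not the data the base classifier was trained on.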

Isotonic regression: nonparametric monotone recalibration

Zadrozny and Elkan (2002, KDD-02) introduced ISOTONIC REGRESSION as a more flexible alternative to Platt scaling. The goal is the same, fit a monotone-non-decreasing map $m: [0, 1] \to [0, 1]$ such that $m(\hat p) \approx P(Y = 1 \mid \hat p)$ , but the form of $m$ is nonparametric. The fit is computed by the POOL-ADJACENT-VIOLATORS (PAV) algorithm:

Sort the pairs $(\hat p_i, Y_i)$ by $\hat p_i$ ascending.
Initialise blocks: each observation is a block of size 1 with mean $Y_i$ .
While there exist adjacent blocks with $\bar m_b > \bar m_{b+1}$ (a monotonicity violation), MERGE them: combine the observations, recompute the pooled mean, replace the two blocks by one.
Repeat until no violations remain. The resulting block means form a non-decreasing step function, the isotonic fit.
At inference time, for a new $\hat p_{\mathrm{new}}$ , find the block containing it and return that block's mean (or interpolate linearly between adjacent block means).

The PAV algorithm runs in $O(N)$ amortised time after the $O(N\log N)$ sort. Its output is the unique minimiser (at the observed $\hat p_i$) of $\sum_i (m(\hat p_i) - Y_i)^2$ subject to $m$ being non-decreasing in $\hat p$, a constrained-least-squares problem with a beautifully simple combinatorial solution (Barlow, Bartholomew, Bremner, Brunk 1972, Statistical Inference Under Order Restrictions). Niculescu-Mizil and Caruana (2005, ICML) compared Platt and isotonic on a broad ML benchmark suite: isotonic dominates Platt when $N \gtrsim 1000$ in the calibration set; Platt dominates isotonic for smaller calibration sets where the nonparametric variance bites.
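The PAV steps can be sketched with a stack of blocks. This is an illustrative implementation, not a library's: tied $\hat p$ values are pooled into one block, and prediction uses the step-function rule (the linear-interpolation variant is omitted).

```python
from bisect import bisect_right


def pav(p_hat, y):
    """Pool-Adjacent-Violators.

    Returns blocks as (left edge, right edge, mean, count), with
    non-decreasing means from left to right.
    """
    blocks = []  # each entry: [min p_hat, max p_hat, sum of y, count]
    for p, yi in sorted(zip(p_hat, y)):
        blocks.append([p, p, float(yi), 1])
        # Merge backwards while the previous block shares this p_hat (a tie)
        # or has a larger mean (a monotonicity violation).
        while len(blocks) > 1 and (
            blocks[-2][1] == blocks[-1][0]
            or blocks[-2][2] / blocks[-2][3] > blocks[-1][2] / blocks[-1][3]
        ):
            _, hi, s, c = blocks.pop()
            blocks[-1][1] = hi
            blocks[-1][2] += s
            blocks[-1][3] += c
    return [(lo, hi, s / c, c) for lo, hi, s, c in blocks]


def predict(blocks, p_new):
    """Step-function rule: mean of the last block starting at or below p_new."""
    edges = [lo for lo, _, _, _ in blocks]
    j = max(bisect_right(edges, p_new) - 1, 0)
    return blocks[j][2]
```

For inputs `p_hat = [0.1, 0.2, 0.3, 0.4]` and `y = [0, 1, 0, 1]`, the middle two observations violate monotonicity and are pooled, giving block means 0, 0.5, 1.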

Both Platt and isotonic require a HELD-OUT calibration set: data NOT used to fit the base classifier. Using the same data for both fitting and calibrating produces optimistic estimates of the calibration improvement; the held-out requirement is the same train/test discipline that pervades cross-validation (§8.3). scikit-learn's `CalibratedClassifierCV`, for example, offers both calibrators (`method="sigmoid"` for Platt and `method="isotonic"`) and fits them on cross-validation folds by default.

Modern neural networks are systematically miscalibrated

Guo, Pleiss, Sun, and Weinberger (2017, "On calibration of modern neural networks", ICML) documented what is now a canonical observation in ML: deep neural networks trained with modern recipes (greater depth and width, batch normalisation, reduced weight decay) are systematically OVERCONFIDENT on held-out data from their own training distribution. The same networks achieve dramatically higher accuracy than the small networks of the 1990s, but their predicted probabilities are concentrated at the extremes (close to $0$ and close to $1$) regardless of the actual conditional probability.

The visible signature is a reliability diagram with an S-SHAPED curve below the diagonal at high predicted probability and above the diagonal at low predicted probability, exactly the "overconfident" profile in the §3.5 widget. ECE on ImageNet for the deep ResNet-style networks in Guo et al. (2017) is several percent (roughly 4-8%) before recalibration; after temperature scaling (a single-parameter Platt-scaling variant: divide the network's logits by a learned scalar $T$), ECE drops to the 1-2% range without affecting accuracy, because dividing all logits by the same positive $T$ leaves the arg-max unchanged. The fix is cheap and effective; the diagnostic is even cheaper. Guo et al. (2017) helped make the reliability diagram and ECE table standard artefacts in many ML benchmarks.
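Temperature scaling itself fits in a few lines. The sketch below is an illustration under assumptions, not the code of Guo et al. (2017): it minimises the held-out negative log-likelihood over $T$ by golden-section search on $\log T$ (the NLL is convex in $1/T$, hence unimodal in $T$), and `simulate_overconfident_logits` is a hypothetical toy DGP whose reported logits are inflated by a factor of 3.

```python
import random
from math import exp, log


def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = [v / T for v in z]
        m = max(s)
        log_norm = m + log(sum(exp(v - m) for v in s))  # stable log-sum-exp
        total += log_norm - s[y]
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search for the NLL-minimising T in [lo, hi]."""
    a, b = log(lo), log(hi)
    g = (5 ** 0.5 - 1) / 2
    c, d = b - g * (b - a), a + g * (b - a)
    for _ in range(iters):
        if nll(logits, labels, exp(c)) < nll(logits, labels, exp(d)):
            b, d = d, c                 # minimum lies in [a, d]
            c = b - g * (b - a)
        else:
            a, c = c, d                 # minimum lies in [c, b]
            d = a + g * (b - a)
    return exp((a + b) / 2)


def simulate_overconfident_logits(N, n_classes=5, inflate=3.0, seed=0):
    """Toy DGP: labels follow softmax(z); the 'network' reports inflate * z."""
    rng = random.Random(seed)
    logits, labels = [], []
    for _ in range(N):
        z = [rng.gauss(0, 1.5) for _ in range(n_classes)]
        weights = [exp(v) for v in z]
        labels.append(rng.choices(range(n_classes), weights=weights)[0])
        logits.append([inflate * v for v in z])
    return logits, labels
```

On the toy data the fitted $T$ should come out close to the inflation factor, and the predicted class for every example is unchanged by the rescaling. As with Platt scaling, $T$ should be fitted on a held-out calibration split.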

The connection to §3.5 calibration is direct: a network that achieves 95% top-1 accuracy on ImageNet does NOT thereby produce 95%-calibrated predicted probabilities. ACCURACY and CALIBRATION are separate axes of evaluation. Reporting accuracy without a reliability diagram or ECE leaves the calibration story untold, and downstream consumers of the network's probabilities (decision systems, ensembles, scientific risk assessment) need calibration, not just accuracy.

Part 9 (ML for researchers) §9.4 develops this in depth: temperature scaling, vector scaling, matrix scaling (multi-class Platt variants), label smoothing as a calibration-aware training trick, focal loss, distillation, and modern uncertainty-aware methods (deep ensembles, Bayesian neural nets, Monte-Carlo dropout). For §3.5 the message is the entry point: predicted probabilities from any classifier (neural net, random forest, logistic regression) should be calibration-checked before they are used as probabilities in any downstream pipeline.

Three honest caveats

The §3.5 framework is empirically testable AND has known limitations. The three honest caveats:

Calibration is NOT accuracy. A forecaster that always reports $\hat p = \bar Y$ (the marginal positive rate) is perfectly calibrated by construction: conditioning on a forecast that never varies is no conditioning at all, so $P(Y = 1 \mid \hat p = \bar Y)$ is just the marginal rate $\bar Y$. The forecaster is also USELESS: it produces no discrimination between different examples. Calibration is necessary but not sufficient. The Murphy (1973) decomposition makes this concrete: Brier = reliability − RESOLUTION + uncertainty. Calibration sets the reliability term to 0; resolution captures the discrimination side. A good forecaster needs a LOW reliability term AND HIGH resolution. Many ML reports focus on accuracy and ignore calibration; many calibration reports focus on Brier and ignore resolution. The full Murphy decomposition is the responsible report.
Calibration tests need enough data per bin. The reliability-diagram diagnostic with $K = 10$ bins requires $\gtrsim 30$ observations per bin for a reliable visual; with $N = 100$ overall, each bin gets $\sim 10$ observations and the standard error of the per-bin observed frequency is $\sqrt{0.5 \cdot 0.5 / 10} \approx 0.16$ at $\bar Y_b = 0.5$. Sparse bins are NOISY and can falsely suggest miscalibration where the data just has too few points per bin. Standard fix: equal-MASS binning (each bin has the same number of observations) instead of equal-width. Standard caveat: ECE is binning-dependent; report the binning used.
On-distribution calibration does NOT imply OOD calibration. A classifier trained on ImageNet and calibrated on a held-out ImageNet split is calibrated FOR IMAGENET. Deploy it on COVID-radiograph images and the calibration breaks: the OOD test distribution has features the calibration set did not see. The OOD and dataset-shift literature (Hendrycks & Gimpel 2017, ICLR; Ovadia et al. 2019, NeurIPS) shows the breakdown empirically: calibration is itself a function of the deployment distribution. The §3.4 conformal-prediction machinery is one response (distribution-free coverage in finite samples, provided calibration and deployment data are exchangeable); Bayesian / deep-ensemble methods are another. Neither makes OOD calibration automatic; both flag uncertainty that recalibration alone cannot create.

The §3.5 calibration story is therefore POWERFUL but PARTIAL. It gives the reader a checkable property (does my procedure deliver what it claims?) and a recalibration toolkit (Platt, isotonic, conformal). It does NOT substitute for accuracy, discrimination, or out-of-distribution robustness. The mature practitioner reports calibration alongside accuracy alongside OOD diagnostics, with each axis carrying its own diagnostic suite.

Try it

In the coverage-calibration widget, set $p = 0.10$, 95% nominal, all four methods. Confirm the verdict table: Wald under-covers, Clopper-Pearson over-covers, Wilson is on target. Read off the empirical coverage at $n = 50$ for each method. Exact summation gives Wald $\approx 88\%$ and both Wilson and CP $\approx 97\%$ at this particular $n$; coverage oscillates with $n$, so judge each method on its whole curve, not on a single point.
Same widget. Slide $p$ to 0.30 (interior). All four methods now converge to within $\pm 1\%$ of nominal by $n = 30$. State: Wald is close to nominal at interior $p$; the severe under-coverage in the §3.1 Brown-Cai-DasGupta verdict is concentrated near the BOUNDARY.
Same widget. Slide $p$ to 0.03 (very near boundary). Wald coverage collapses at small $n$ (below 50% for $n \le 20$, because $k = 0$ alone has probability $0.97^n > 0.5$ there) and reaches only $\sim 92\%$ at $n = 500$. Why is Wald still short of 95% at this $p$ even at $n = 500$? Hint: the symmetric $\hat p \pm z\widehat{\mathrm{SE}}$ formula has $\widehat{\mathrm{SE}} = 0$ when $k = 0$, and the quality of the Normal approximation is governed by $np(1 - p)$, not by $n$ alone, so the binomial is still visibly skewed when $np = 15$.
In the reliability-diagram widget, set profile = overconfident, $N = 1000$, $K = 10$, recal = none. Read off the raw Brier score and ECE. Note the S-curve shape: above the diagonal at low $\hat p$, below the diagonal at high $\hat p$. This is the canonical neural-network miscalibration (Guo et al. 2017).
Same widget. Apply Platt scaling. Compare the Brier and ECE rows before and after. Platt typically reduces ECE by 40-70%. Note the fitted slope $a < 1$: overconfidence is cured by flattening the sigmoid.
Same widget. Switch profile to "biased (+0.10)". Note the parallel offset of the reliability curve below the diagonal. Apply Platt. The fitted intercept $b$ should be negative (it absorbs the +0.10 shift); the slope $a$ should be close to 1 (no shape correction needed). ECE drops to $\sim 0\%$.
Same widget. Compare Platt vs isotonic on the "overconfident" profile at $N = 200$. Isotonic shows visible jagged steps: it is overfitting the small calibration set. Increase $N$ to 5000: isotonic now smooths out and dominates Platt. State the trade-off: Platt is biased / low-variance; isotonic is low-bias / high-variance.
Pen-and-paper. Define the Brier score and its Murphy (1973) decomposition (reliability − resolution + uncertainty). Argue that a constant forecast $\hat p \equiv \bar Y$ has reliability = 0 (perfect calibration), resolution = 0, and Brier = uncertainty = $\bar Y(1 - \bar Y)$. Conclude that calibration is necessary but not sufficient: resolution captures the useful signal.
Pen-and-paper. Show that for a binary outcome, the EXPECTED Brier score $\mathbb{E}[B] = \mathbb{E}[(\hat p - Y)^2]$ is minimised by $\hat p = \mathbb{E}[Y \mid X] = P(Y = 1 \mid X)$ . This is the strictly-proper-scoring-rule property (Gneiting & Raftery 2007): the forecaster minimises expected Brier by reporting the true conditional probability.
Pen-and-paper. Sketch a reliability diagram for: (a) a perfectly calibrated forecaster; (b) an overconfident classifier; (c) an underconfident classifier; (d) a biased forecaster (+0.10 shift). Annotate where each curve lies relative to the $y = x$ diagonal.
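One possible route through the expected-Brier exercise above, sketched for a fixed covariate value $x$. The shorthand $\pi = P(Y = 1 \mid X = x)$ and the candidate report $q$ are symbols introduced here for illustration only.

$$
\mathbb{E}\big[(q - Y)^2 \mid X = x\big] = \pi(1 - q)^2 + (1 - \pi)q^2 = (q - \pi)^2 + \pi(1 - \pi).
$$

The term $\pi(1 - \pi)$ does not depend on $q$, so the conditional expectation is minimised uniquely at $q = \pi$. Averaging over $X$ then gives $\hat p = P(Y = 1 \mid X)$ as the minimiser of $\mathbb{E}[(\hat p - Y)^2]$, and the uniqueness of the minimum is what "strictly" proper refers to.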

Pause and reflect: §3.5 has cast CALIBRATION as the unifying empirical-testability property linking §§3.1-3.4 to the broader inference world. A CI procedure CLAIMS $1 - \alpha$ coverage; a probabilistic forecast CLAIMS a frequency; a regression CI / PI CLAIMS a coverage rate; a neural-network softmax output CLAIMS a class probability. CALIBRATION asks: does the procedure DELIVER that claim under repeated sampling? The empirical recipe (simulate, count, compare) and the diagnostic toolkit (reliability diagram, Brier score, ECE, Q-Q plot of standardised residuals) make calibration EMPIRICALLY TESTABLE. The recalibration toolkit (Platt scaling, isotonic regression, conformal prediction) fixes detected miscalibration. The three caveats (calibration ≠ accuracy; bins need data; on-distribution ≠ OOD) keep the framework honest. §3.6 closes Part 3 with the communication side: how to REPORT the calibrated uncertainty without lying.

What you now know

You can articulate CALIBRATION as the procedural property that a CI / forecast / regression claim is DELIVERED under repeated sampling. You know calibration is a property of the PROCEDURE under the full sampling distribution, not of a single realised interval or prediction, and that empirical calibration is computable EXACTLY for discrete sampling distributions (Binomial summation) and via Monte Carlo for continuous ones.
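The exact-summation idea can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the widget's code: the function names `wald_interval`, `wilson_interval` and `exact_coverage` are made up here, and the intervals are the textbook Wald and Wilson formulas with the two-sided 95% normal quantile.

```python
from math import comb, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile, about 1.96

def wald_interval(k, n, z=Z95):
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)  # zero width when k = 0 or k = n
    return p_hat - half, p_hat + half

def wilson_interval(k, n, z=Z95):
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def exact_coverage(interval, n, p):
    """Sum the Binomial(n, p) pmf over every k whose interval contains p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p <= hi:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

# Coverage at p = 0.10 for a few sample sizes: no simulation, no Monte Carlo noise.
for n in (20, 50, 200):
    wald = exact_coverage(wald_interval, n, 0.10)
    wilson = exact_coverage(wilson_interval, n, 0.10)
    print(f"n = {n:4d}  Wald = {wald:.4f}  Wilson = {wilson:.4f}")
```

Because the sample space is discrete, the exact coverage oscillates with $n$ (the Wald sawtooth), so values at a single $n$ need not match rounded figures quoted elsewhere.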

You can describe the empirical-calibration recipe: simulate $R$ datasets from a known DGP, build the CI on each, count the fraction covering $\theta$, compare to nominal, report with a Wilson-score Monte-Carlo band. You can use the coverage-calibration widget to verify the §3.1 Brown-Cai-DasGupta (2001) verdict on a sample-size sweep: Wald drifts up from below; Wilson hugs nominal; Clopper-Pearson sits above; bootstrap-percentile depends on $n$. The Wald failure is BOUNDARY-specific: at interior $p$ all methods converge to nominal quickly.
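The simulate-count-compare recipe might look like the following for a continuous DGP. Everything specific here is an assumption chosen for illustration: the Exponential DGP with mean `theta`, the normal-approximation interval $\bar x \pm z\,s/\sqrt{n}$, and the helper names `mc_coverage` and `wilson_band`.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

Z95 = NormalDist().inv_cdf(0.975)

def wilson_band(hits, reps, z=Z95):
    """Wilson score interval for the Monte Carlo coverage estimate hits / reps."""
    f = hits / reps
    denom = 1 + z * z / reps
    centre = (f + z * z / (2 * reps)) / denom
    half = z * sqrt(f * (1 - f) / reps + z * z / (4 * reps * reps)) / denom
    return centre - half, centre + half

def mc_coverage(n, reps, theta=1.0, seed=0):
    """Fraction of normal-approximation CIs for the mean that cover theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Assumed DGP: Exponential with mean theta (skewed, so small-n coverage can suffer).
        x = [rng.expovariate(1 / theta) for _ in range(n)]
        centre = mean(x)
        half = Z95 * stdev(x) / sqrt(n)
        hits += (centre - half <= theta <= centre + half)
    return hits / reps, wilson_band(hits, reps)

coverage, band = mc_coverage(n=20, reps=2000)
print(f"nominal 0.95, empirical {coverage:.3f}, MC band ({band[0]:.3f}, {band[1]:.3f})")
```

If the nominal level lies outside the Monte Carlo band, the shortfall is unlikely to be simulation noise; increasing `reps` tightens the band at the usual $1/\sqrt{R}$ rate.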

You can define probabilistic-forecast calibration (Murphy & Winkler 1987): $P(Y = 1 \mid \hat p = p) = p$ . You can construct and interpret a RELIABILITY DIAGRAM (bin predictions, plot mean predicted prob vs observed frequency, compare to $y = x$ diagonal). You know that bin counts matter and that sparse bins produce noisy frequencies.
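A reliability diagram is just a small table before it is a plot. The sketch below builds that table with equal-width bins; the function name `reliability_table`, the simulated overconfident forecaster, and the rough normal-approximation half-width per bin are all illustrative assumptions rather than the widget's implementation.

```python
import random
from math import sqrt

def reliability_table(p_hat, y, n_bins=10):
    """Equal-width bins on [0, 1]: (count, mean prediction, observed frequency, half-width)."""
    acc = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of predictions, sum of outcomes
    for p, yi in zip(p_hat, y):
        b = min(int(p * n_bins), n_bins - 1)    # a prediction of exactly 1.0 joins the last bin
        acc[b][0] += 1
        acc[b][1] += p
        acc[b][2] += yi
    rows = []
    for count, sum_p, sum_y in acc:
        if count == 0:
            continue                             # empty bins carry no information
        freq = sum_y / count
        half = 1.96 * sqrt(freq * (1 - freq) / count)  # rough band; wide when count is small
        rows.append((count, sum_p / count, freq, half))
    return rows

# Hypothetical overconfident forecaster: the true probability is pulled towards 0.5
# relative to the reported one.
rng = random.Random(1)
p_hat = [rng.random() for _ in range(5000)]
y = [int(rng.random() < 0.5 + 0.6 * (p - 0.5)) for p in p_hat]

table = reliability_table(p_hat, y)
for count, mean_pred, freq, half in table:
    print(f"n = {count:4d}  predicted = {mean_pred:.3f}  observed = {freq:.3f} +/- {half:.3f}")
```

Plotting the second column against the third gives the diagram; here the low bins sit above the $y = x$ diagonal and the high bins below it.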

You can state the BRIER SCORE $B = (1/N)\sum(\hat p_i - Y_i)^2$ (Brier 1950), the Murphy (1973) decomposition (reliability − resolution + uncertainty), and the strictly-proper-scoring property (Gneiting & Raftery 2007): the expected Brier score is minimised by truthful prediction. You can compute the EXPECTED CALIBRATION ERROR (ECE = $\sum_b (n_b/N)|\bar{\hat p}_b - \bar Y_b|$, Naeini et al. 2015; Guo et al. 2017). You know that the ECE and the estimated reliability and resolution terms depend on the binning convention, while the raw Brier score does not.
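A small worked example makes the three quantities concrete. The sketch below assumes forecasts that take only a few distinct values and groups by those values, in which case Brier = reliability − resolution + uncertainty holds exactly; with wider bins an extra within-bin term appears. The function names and the ten-point dataset are invented for illustration.

```python
from collections import defaultdict

def brier(p_hat, y):
    return sum((p - yi) ** 2 for p, yi in zip(p_hat, y)) / len(y)

def murphy_decomposition(p_hat, y):
    """Reliability, resolution, uncertainty, grouping by distinct forecast value."""
    n = len(y)
    base_rate = sum(y) / n
    groups = defaultdict(list)
    for p, yi in zip(p_hat, y):
        groups[p].append(yi)
    rel = sum(len(g) * (p - sum(g) / len(g)) ** 2 for p, g in groups.items()) / n
    res = sum(len(g) * (sum(g) / len(g) - base_rate) ** 2 for g in groups.values()) / n
    return rel, res, base_rate * (1 - base_rate)

def ece(p_hat, y, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1]."""
    bins = defaultdict(lambda: [0, 0.0, 0])  # count, sum of predictions, sum of outcomes
    for p, yi in zip(p_hat, y):
        cell = bins[min(int(p * n_bins), n_bins - 1)]
        cell[0] += 1
        cell[1] += p
        cell[2] += yi
    return sum(abs(sp - sy) for _, sp, sy in bins.values()) / len(y)

p_hat = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

rel, res, unc = murphy_decomposition(p_hat, y)
print(f"Brier = {brier(p_hat, y):.3f}")
print(f"reliability = {rel:.3f}, resolution = {res:.3f}, uncertainty = {unc:.3f}")
print(f"ECE = {ece(p_hat, y):.3f}")
```

By hand: the Brier score is 0.218, reliability 0.018, resolution 0.050, uncertainty 0.250, and 0.018 − 0.050 + 0.250 = 0.218. Replacing every forecast by the base rate 0.5 drives both reliability and resolution to zero and leaves Brier = uncertainty = 0.25.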

You can describe REGRESSION CALIBRATION via standardised residuals $(Y_i - \hat Y_i)/\widehat{\mathrm{SE}}_i \sim \mathcal{N}(0, 1)$ and the Q-Q plot diagnostic. You know Part 4 develops the regression-specific machinery: predictor-dependent PI widths, confidence bands vs prediction bands, conformal prediction for regression.
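The standardised-residual check can be sketched without any plotting library: compute the residuals, pair their sorted values with standard-normal quantiles (the coordinates of a Q-Q plot), and count how many fall inside the central 95% range. The simulated data, the plotting positions $(i + 0.5)/n$, and the helper names are assumptions made for this illustration.

```python
import random
from statistics import NormalDist

def standardised_residuals(y, y_hat, se):
    return [(yi - mi) / si for yi, mi, si in zip(y, y_hat, se)]

def qq_points(z):
    """Pairs (theoretical N(0, 1) quantile, empirical quantile) for a Q-Q plot."""
    n = len(z)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), zi) for i, zi in enumerate(sorted(z))]

def central_coverage(z, level=0.95):
    """Share of standardised residuals inside the central N(0, 1) interval."""
    cut = NormalDist().inv_cdf(0.5 + level / 2)
    return sum(abs(zi) <= cut for zi in z) / len(z)

rng = random.Random(2)
y_hat = [rng.uniform(0, 10) for _ in range(5000)]
y = [m + rng.gauss(0, 2.0) for m in y_hat]  # simulated truth: noise SD is 2

honest = standardised_residuals(y, y_hat, [2.0] * len(y))      # claimed SE matches the truth
too_narrow = standardised_residuals(y, y_hat, [1.0] * len(y))  # claimed SE is half the truth

print(f"honest SE:     {central_coverage(honest):.3f} of residuals inside +/- 1.96")
print(f"too-narrow SE: {central_coverage(too_narrow):.3f} of residuals inside +/- 1.96")
```

On a Q-Q plot the honest residuals follow the diagonal, while the too-narrow ones follow a line about twice as steep, which is the visual signature of understated uncertainty.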

You can describe RECALIBRATION techniques. Platt scaling (Platt 1999) fits $\mathrm{logit}(p_{\mathrm{recal}}) = a\cdot\mathrm{logit}(\hat p) + b$ on a held-out calibration set: it is parametric, has 2 d.o.f. and low variance, and assumes sigmoid-shaped miscalibration. Isotonic regression (Zadrozny & Elkan 2002) fits a monotone non-decreasing map via Pool-Adjacent-Violators: it is nonparametric and more flexible, but has higher variance and is prone to overfitting on small calibration sets. The bias-variance trade-off: Platt for small $N$; isotonic for large $N$.
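Both recalibrators fit in a short standard-library sketch. Two simplifications are assumed and worth flagging: `fit_platt` is a plain maximum-likelihood fit by Newton steps (it omits the smoothed targets of Platt's original paper), and `pav` ignores ties in the scores. The simulated forecaster, whose reported log-odds are twice the true log-odds, is hypothetical.

```python
import random
from math import exp, log

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # keep the logit finite at 0 and 1
    return log(p / (1 - p))

def fit_platt(p_hat, y, iters=25):
    """Maximum-likelihood fit of logit(p_recal) = a * logit(p_hat) + b by Newton steps."""
    x = [logit(p) for p in p_hat]
    a, b = 1.0, 0.0  # start at the identity map
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0  # gradient and Hessian of the negative log-likelihood
        for xi, yi in zip(x, y):
            q = sigmoid(a * xi + b)
            w = q * (1 - q)
            ga += (q - yi) * xi
            gb += q - yi
            haa += w * xi * xi
            hab += w * xi
            hbb += w
        det = haa * hbb - hab * hab
        if det <= 1e-12:
            break
        a -= (hbb * ga - hab * gb) / det
        b -= (haa * gb - hab * ga) / det
    return a, b

def pav(y_by_score):
    """Pool-Adjacent-Violators: monotone non-decreasing fit to outcomes sorted by score."""
    blocks = []  # each block is [sum of outcomes, count]
    for v in y_by_score:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

rng = random.Random(3)
z = [rng.gauss(0, 1) for _ in range(5000)]
y = [int(rng.random() < sigmoid(t)) for t in z]  # true log-odds are z
p_hat = [sigmoid(2 * t) for t in z]              # reported log-odds are 2z: overconfident

a, b = fit_platt(p_hat, y)
order = sorted(range(len(y)), key=lambda i: p_hat[i])
iso = pav([y[i] for i in order])                 # recalibrated values, in score order
print(f"Platt slope a = {a:.2f}, intercept b = {b:.2f}; isotonic levels = {len(set(iso))}")
```

In this setup the fitted slope should land near 0.5 and the intercept near 0, the "flattening" that cures overconfidence. The isotonic fit is a step function whose number of distinct levels grows with the calibration-set size, which is where its extra variance comes from.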

You can describe the ML calibration finding (Guo, Pleiss, Sun, Weinberger 2017): modern deep neural networks are systematically OVERCONFIDENT, with high accuracy bought at the cost of poor calibration. Temperature scaling (a one-parameter relative of Platt scaling: every logit is divided by a single shared temperature $T$, with no intercept) is the cheap, effective fix. Reliability diagram + ECE table are now standard ML reporting artefacts. Part 9.4 develops the full ML-calibration toolkit.
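Temperature scaling can be sketched as a one-dimensional search: choose the $T$ that minimises the held-out negative log-likelihood of $\mathrm{softmax}(\text{logits}/T)$. The grid search, the function names, and the simulated three-class model whose logits are three times too large are assumptions made for illustration; practical implementations usually use a gradient-based optimiser instead of a grid.

```python
import random
from math import exp, log

def nll_at_temperature(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    total = 0.0
    for row, k in zip(logits, labels):
        s = [v / T for v in row]
        m = max(s)                                   # subtract the max for numerical stability
        log_norm = m + log(sum(exp(v - m) for v in s))
        total += log_norm - s[k]
    return total / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Grid search for the temperature with the lowest held-out NLL."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(60)]    # 0.5, 0.6, ..., 6.4
    return min(grid, key=lambda T: nll_at_temperature(logits, labels, T))

def softmax(row):
    m = max(row)
    e = [exp(v - m) for v in row]
    total = sum(e)
    return [v / total for v in e]

# Hypothetical overconfident model: labels are drawn from softmax(u),
# but the model reports logits 3 * u.
rng = random.Random(4)
true_logits = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(1500)]
labels = [rng.choices(range(3), weights=softmax(u))[0] for u in true_logits]
model_logits = [[3 * v for v in u] for u in true_logits]

T_hat = fit_temperature(model_logits, labels)
print(f"fitted temperature T = {T_hat:.1f}")
```

A fitted $T > 1$ softens the softmax and signals overconfidence. Because dividing all logits by the same positive constant does not change their ranking, the predicted class, and hence the accuracy, is unaffected.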

You can articulate the THREE HONEST CAVEATS: (i) calibration ≠ accuracy: a constant forecast at the marginal mean is perfectly calibrated and useless; (ii) calibration tests require enough data per bin: sparse bins are noisy; (iii) on-distribution calibration does NOT imply OOD calibration. Calibration is empirically testable, but the test is local to the DGP that produced the data.

Where this lands in the rest of Part 3 and the textbook. §3.6 closes Part 3 with the communication side: how to REPORT uncertainty without lying about its calibration. Part 4 (regression) uses standardised-residual diagnostics and Q-Q plots as the default calibration check for OLS, and develops PI calibration for predictions. Part 5 (GLMs) covers GLM-specific calibration (deviance residuals, Pearson residuals). Part 7 (Bayesian) introduces POSTERIOR-PREDICTIVE CHECKS, the Bayesian analogue of calibration. Part 9.4 (ML calibration) develops temperature scaling, label smoothing, focal loss, deep ensembles, conformal prediction, and the modern uncertainty-aware ML toolkit. The Brier-score / reliability-diagram / ECE machinery you just learned is the backbone of all of those.

References

Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review 78(1), 1-3. (Originates the Brier score $B = (1/N)\sum(\hat p_i - Y_i)^2$ as the verification statistic for probabilistic forecasts in meteorology. Foundational for modern calibration measurement.)
Murphy, A.H. (1973). "A new vector partition of the probability score." Journal of Applied Meteorology 12(4), 595-600. (The Brier score decomposition Brier = reliability − resolution + uncertainty. Splits the calibration component out as its own term.)
Murphy, A.H., Winkler, R.L. (1987). "A general framework for forecast verification." Monthly Weather Review 115(7), 1330-1338. (The canonical framework for verifying probabilistic forecasts. Defines reliability, resolution, and the reliability diagram as the diagnostic visualisation.)
Gneiting, T., Raftery, A.E. (2007). "Strictly proper scoring rules, prediction, and estimation." Journal of the American Statistical Association 102(477), 359-378. (Defines strictly proper scoring rules, scores that are minimised in expectation by truthful prediction. Brier score, log score, CRPS are the canonical examples. Foundational reference for modern forecast-verification theory.)
Platt, J.C. (1999). "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in Large Margin Classifiers 10(3), 61-74, MIT Press. (Introduces Platt scaling: fit a 1-D logistic regression on the logit of the raw score to recalibrate. Originally for SVMs; now the default parametric recalibrator for any binary classifier.)
Zadrozny, B., Elkan, C. (2002). "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694-699. (Introduces isotonic regression via Pool-Adjacent-Violators as a nonparametric monotone alternative to Platt scaling. The other half of the standard recalibration toolkit.)
Niculescu-Mizil, A., Caruana, R. (2005). "Predicting good probabilities with supervised learning." Proceedings of the 22nd International Conference on Machine Learning, 625-632. (Empirical comparison of Platt vs isotonic across a broad ML benchmark suite. Isotonic dominates at N ≥ 1000; Platt dominates for smaller calibration sets.)
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q. (2017). "On calibration of modern neural networks." Proceedings of the 34th International Conference on Machine Learning (ICML), 1321-1330. (Documents the systematic OVERCONFIDENCE of modern deep neural networks, high accuracy at the cost of poor calibration. Introduces temperature scaling as the canonical Platt-style fix for multi-class classifiers. The reliability diagram + ECE table is now standard ML reporting.)
Naeini, M.P., Cooper, G.F., Hauskrecht, M. (2015). "Obtaining well calibrated probabilities using Bayesian binning." Proceedings of the 29th AAAI Conference on Artificial Intelligence, 2901-2907. (Introduces the Expected Calibration Error (ECE), the weighted L1 distance between bin-mean predicted probability and bin-mean observed frequency. The numeric summary of a reliability diagram.)
Brown, L.D., Cai, T.T., DasGupta, A. (2001). "Interval estimation for a binomial proportion." Statistical Science 16(2), 101-117. (Exact-coverage comparison framework for binomial CIs. Documents the Wald sawtooth, the Wilson near-nominal behaviour, the Clopper-Pearson over-coverage. The reference for §3.5's coverage-calibration widget.)
Steyerberg, E.W., Vickers, A.J., Cook, N.R., et al. (2010). "Assessing the performance of prediction models: a framework for traditional and novel measures." Epidemiology 21(1), 128-138. (Medical-statistics view: discrimination, calibration, clinical utility as three orthogonal evaluation axes for clinical prediction models. The §3.5 reliability-diagram + Brier-score approach for medical risk scores.)
Hendrycks, D., Gimpel, K. (2017). "A baseline for detecting misclassified and out-of-distribution examples in neural networks." International Conference on Learning Representations (ICLR). (Foundational OOD-detection paper. Demonstrates that on-distribution calibration does NOT carry over to OOD test examples. The empirical evidence behind the §3.5 OOD caveat.)
Ovadia, Y., Fertig, E., Ren, J., et al. (2019). "Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift." Advances in Neural Information Processing Systems 32 (NeurIPS). (Large-scale empirical comparison of calibration under dataset shift. Cited alongside Hendrycks & Gimpel for the §3.5 OOD caveat.)