Prediction intervals vs confidence intervals

Part 3 — Confidence intervals and uncertainty

Learning objectives

State the conceptual contrast: a CONFIDENCE INTERVAL is about a PARAMETER (e.g. μ) — a fixed unknown — whereas a PREDICTION INTERVAL is about a RANDOM VARIABLE (e.g. the next observation X_{n+1}). The CI bands estimation uncertainty alone; the PI bands estimation uncertainty PLUS the intrinsic variance of the future draw
Derive the Normal-model PI with KNOWN σ². X_{n+1} − X̄ has mean 0 and variance σ²·(1 + 1/n) under independence (since X̄ uses n iid draws independent of X_{n+1}). Hence the (1 − α) PI is X̄ ± z_{1−α/2} · σ · √(1 + 1/n). The √(1 + 1/n) factor exceeds 1 for every finite n and converges to 1 as n → ∞
Derive the Normal-model PI with UNKNOWN σ². Replace σ by the sample sd s; the pivot (X_{n+1} − X̄)/[s · √(1 + 1/n)] follows Student-t with n − 1 degrees of freedom. The (1 − α) PI is X̄ ± t_{n−1, 1−α/2} · s · √(1 + 1/n). The CI for μ uses the SAME t quantile with the √(1 + 1/n) replaced by 1/√n
Compare the LIMITS: as n → ∞, the CI half-width z·σ/√n → 0 (the sampling distribution of X̄ concentrates on μ), while the PI half-width z·σ·√(1 + 1/n) → z·σ (the next-draw intrinsic variance σ² remains regardless of training-sample size). PI half-width has a non-vanishing FLOOR of z·σ
State the ratio PI/CI: at finite n the PI half-width / CI half-width = √(n + 1) ≈ √n for moderate n. At n = 20 the ratio is √21 ≈ 4.58; at n = 200 the ratio is √201 ≈ 14.18. The PI is always wider than the CI, and the gap grows with n on the absolute scale (CI shrinks faster than PI does)
Identify the MISUSE in literature: many papers report a CI when describing the range a NEW patient / NEW measurement should land in. Statements like "based on the analysis, the new sample falls in [a, b] with 95% confidence" are PIs, not CIs — they require the √(1 + 1/n) factor. The classical reference is Hahn & Meeker (1991, Statistical Intervals: A Guide for Practitioners, §2)
PREVIEW prediction intervals in REGRESSION (deferred to Part 4). For a new x_new, the predicted value Ŷ has TWO sources of uncertainty: (i) regression coefficient uncertainty (Var(β̂·x_new)) and (ii) residual noise σ². The PI half-width is z · √(Var(β̂·x_new) + σ²); it is always wider than the CI half-width for E[Y | x_new], which is z · √Var(β̂·x_new), again because of an additive σ² inside the square root
State BOOTSTRAP / NONPARAMETRIC PIs: instead of a parametric (Normal) PI, the bootstrap of the empirical predictive distribution yields quantile-based PIs. For predictive intervals from a fitted model, the bootstrap of (Ŷ − Y) residuals gives a quantile-based PI: PI = Ŷ ± quantile(bootstrap residuals, 1 − α/2). More robust to non-Normality than the Normal-PI (Geisser 1993, Predictive Inference: An Introduction)
Define CALIBRATION for PIs: a (1 − α) PI is CALIBRATED if it covers the next observation in (1 − α) of repeated experiments. State that PI calibration is EMPIRICALLY TESTABLE via cross-validation, leave-one-out, or held-out data: build the PI from training, count the fraction of test points inside, compare to nominal
Articulate the failure modes of Normal PIs: (1) HEAVY-TAILED data: the next draw is more often beyond ±z·σ than Normal-PI assumes; coverage < nominal. (2) OUT-OF-DISTRIBUTION test: training distribution ≠ test distribution; PI was calibrated to the wrong DGP. (3) HETEROSCEDASTICITY in regression: PI assumes constant σ²; with σ² = σ²(x), the PI width should vary with x and a constant-width PI is mis-calibrated
Preview CONFORMAL PREDICTION as a distribution-free fix (Vovk-Gammerman-Shafer 2005, Algorithmic Learning in a Random World; Lei et al. 2018, JASA). Use a held-out calibration set to compute residual quantiles, then PI = Ŷ ± empirical (1 − α) quantile of |Ŷ − Y| on calibration. Coverage holds in finite samples under EXCHANGEABILITY of (X, Y) pairs — no parametric assumption needed. Locally adaptive variants handle heteroscedasticity
State the PRACTICAL recommendation: USE A CI when reporting uncertainty about a parameter (mean treatment effect, regression coefficient, prevalence). USE A PI when reporting where a future observation will fall (next patient outcome, prediction for a new x). Never present a CI's [a, b] as if it bounded the next observation — that is the headline pedagogical error this section is built to prevent

Sections §3.1–§3.3 each built a CONFIDENCE INTERVAL methodology — Wald, Wilson, Clopper–Pearson, Garwood, Student-t (§3.1); percentile, basic, BCa, bootstrap-t (§3.2); profile likelihood and the LRT (§3.3). Every CI in those sections has the same conceptual content: it bands a PARAMETER — the population mean $\mu$ , the binomial $p$ , the Poisson rate $\lambda$ , the regression coefficient $\beta$ . The parameter is a fixed unknown; the band captures sampling variability of the estimation procedure.

§3.4 turns to the parallel object: the PREDICTION INTERVAL (PI). The PI bands NOT a parameter but a RANDOM VARIABLE — specifically, the next observation $X_{n+1}$ drawn from the same distribution as the training sample. The CI answers "where is the true mean?"; the PI answers "where will the NEXT observation fall?". These are different objects with different uncertainty sources, and they require different formulas.

The CI and PI for the Normal model with known $\sigma$ :

\boxed{\;C^{\mathrm{CI}}_{1-\alpha}(\mu) \;=\; \bar X \pm z_{1-\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}\;}

\boxed{\;C^{\mathrm{PI}}_{1-\alpha}(X_{n+1}) \;=\; \bar X \pm z_{1-\alpha/2} \cdot \sigma \cdot \sqrt{1 + \tfrac{1}{n}}\;}

Two differences. First, the PI has a $\sqrt{1 + 1/n}$ factor where the CI has a $1/\sqrt n$ factor: the PI is always wider. Second — and the conceptually heavier point — as $n \to \infty$ the CI half-width $\to 0$ (the sampling distribution of $\bar X$ concentrates on $\mu$ ) but the PI half-width $\to z_{1-\alpha/2} \cdot \sigma$ (the intrinsic variance $\sigma^2$ of a future draw remains). The CI vanishes in the limit; the PI converges to a non-zero floor.

The arc has ten stops. First, the conceptual contrast: parameter vs random variable. Second, the Normal PI derivation with known $\sigma$ . Third, the unknown- $\sigma$ generalisation via Student-t. Fourth, the limit comparison and the canonical PI/CI ratio $\sqrt{n+1}$ . Fifth, the ci-vs-pi-explorer widget. Sixth, the misuse in literature: CIs reported where PIs should be. Seventh, prediction intervals in regression (a Part 4 preview). Eighth, robust / bootstrap PIs. Ninth, the pi-calibration widget — empirical coverage from train/test splits, including heavy-tailed and OOD failure modes. Tenth, conformal prediction as the modern distribution-free PI. Try-it, recap, references.

Parameter vs random variable: the keystone distinction

The most-cited misuse in applied statistics is mixing up the CI and the PI. The 95% confidence interval for $\mu$ says, roughly, "the procedure that produced this interval covers the true mean in 95% of repeated samples." The frequentist interpretation (Neyman 1937; §3.1) is procedural: a property of the CI-building procedure, not a probability about the realised interval. The PARAMETER $\mu$ is a fixed unknown number; the INTERVAL is a random object that depends on the sample.

The 95% prediction interval for $X_{n+1}$ is conceptually different. Now BOTH endpoints AND the target are random. The PI is a procedure that, in 95% of repeated experiments, produces an interval that covers the value of the next iid draw $X_{n+1}$ . The 95% lives in the joint probability over (i) the sample $X_1, \ldots, X_n$ that builds the PI and (ii) the future draw $X_{n+1}$ from the same distribution.

Why this matters in practice. Suppose we estimate the mean weight of newborns at a hospital and compute a 95% CI of $(3.2, 3.4)$ kg. The CI tells us, "the average newborn weight is somewhere in $(3.2, 3.4)$". It does NOT tell us the next newborn will weigh between 3.2 and 3.4 kg; that is a PI claim. If individual weights have a standard deviation of about $0.5$ kg, the PI for the next newborn is roughly $3.3 \pm 1.96 \cdot 0.5$, i.e. about $(2.3, 4.3)$ kg, ten times wider. Treating the CI as a PI ("we are 95% confident the next newborn weighs $(3.2, 3.4)$ kg") would be wrong by an order of magnitude and would systematically under-cover. Hahn & Meeker (1991, §2) document this exact confusion as the most-common statistical interval error in applied work.

The Normal PI with known σ²: deriving the $\sqrt{1 + 1/n}$ factor

Assume $X_1, \ldots, X_n \overset{\mathrm{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ and a future draw $X_{n+1}$ from the same distribution, INDEPENDENT of the training sample. The point predictor is $\hat X_{n+1} = \bar X$ . The prediction ERROR is

X_{n+1} - \bar X.

Its mean is $E[X_{n+1}] - E[\bar X] = \mu - \mu = 0$ , and its variance, using independence of $X_{n+1}$ and $\bar X$ , is

\mathrm{Var}(X_{n+1} - \bar X) \;=\; \mathrm{Var}(X_{n+1}) + \mathrm{Var}(\bar X) \;=\; \sigma^2 + \dfrac{\sigma^2}{n} \;=\; \sigma^2 \left(1 + \dfrac{1}{n}\right).

Two TERMS, two SOURCES of uncertainty:

$\sigma^2$ is the INTRINSIC variance of the next draw. Even with infinite training data, the next observation has variance $\sigma^2$ around its mean $\mu$ .
$\sigma^2/n$ is the estimation uncertainty: how far $\bar X$ is from $\mu$ . This shrinks with $n$ .

By Normality of both $X_{n+1}$ and $\bar X$, the prediction error is Normal: $X_{n+1} - \bar X \sim \mathcal{N}\!\left(0,\;\sigma^2(1 + 1/n)\right)$. The pivot $(X_{n+1} - \bar X)/[\sigma\sqrt{1 + 1/n}]$ is standard Normal. Inverting that pivot for the central $1 - \alpha$ band gives

P\!\left(\bar X - z_{1-\alpha/2}\cdot\sigma\sqrt{1 + 1/n} \;\le\; X_{n+1} \;\le\; \bar X + z_{1-\alpha/2}\cdot\sigma\sqrt{1 + 1/n}\right) \;=\; 1 - \alpha.

The interval $\bar X \pm z_{1-\alpha/2} \cdot \sigma \sqrt{1 + 1/n}$ is the $(1 - \alpha)$ PI for $X_{n+1}$ . The $\sqrt{1 + 1/n}$ factor is the immediate algebraic difference from the CI, and it captures BOTH the $\sigma^2$ next-draw variance AND the $\sigma^2/n$ training-sample variance combined into one variance.

Three numerical landmarks at 95% nominal ( $z_{0.975} = 1.96$ ):

$n = 10$: $\sqrt{1 + 1/10} = \sqrt{1.10} \approx 1.049$. PI half-width = $1.96 \cdot \sigma \cdot 1.049 \approx 2.06\sigma$. CI half-width = $1.96\sigma / \sqrt{10} \approx 0.620\sigma$. Ratio PI/CI $= \sqrt{11} \approx 3.32$.
$n = 100$: $\sqrt{1.01} \approx 1.005$. PI half-width $\approx 1.97\sigma$. CI half-width $= 1.96\sigma / 10 = 0.196\sigma$. Ratio PI/CI $= \sqrt{101} \approx 10.05$.
$n = 10{,}000$: $\sqrt{1.0001} \approx 1.00005$. PI half-width $\approx 1.960\sigma$. CI half-width = $0.0196\sigma$. Ratio $= \sqrt{10001} \approx 100.0$.

The CI shrinks linearly in $1/\sqrt n$ ; the PI converges to the FIXED half-width $z_{1-\alpha/2} \cdot \sigma \approx 1.96\sigma$ . The gap between them grows on the absolute scale as $n$ grows: the CI vanishes, the PI does not.
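The landmarks above can be reproduced in a few lines. What follows is a minimal sketch using only the Python standard library; the function name `half_widths` and its defaults are illustrative choices, not part of the text or of any library.

```python
from math import sqrt
from statistics import NormalDist


def half_widths(n, sigma=1.0, level=0.95):
    """Return (CI half-width, PI half-width) for the known-sigma Normal model."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{1 - alpha/2}
    ci = z * sigma / sqrt(n)           # estimation uncertainty only
    pi = z * sigma * sqrt(1 + 1 / n)   # estimation uncertainty + next-draw variance
    return ci, pi


for n in (10, 100, 10_000):
    ci, pi = half_widths(n)
    print(f"n={n:>6}  CI={ci:.4f}  PI={pi:.4f}  "
          f"ratio={pi / ci:.3f}  sqrt(n+1)={sqrt(n + 1):.3f}")
```

The last two columns agree for every $n$, because the $z$ and $\sigma$ factors cancel in the ratio and leave $\sqrt{1 + 1/n} \cdot \sqrt n = \sqrt{n + 1}$.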

Unknown σ²: replace by sample s and use the Student-t

In practice $\sigma$ is rarely known. The standard fix is to replace it by the sample sd $s = \sqrt{\sum(X_i - \bar X)^2 / (n-1)}$ and use the Student-t distribution instead of the Normal. The pivot becomes

T \;=\; \dfrac{X_{n+1} - \bar X}{s\sqrt{1 + 1/n}} \;\sim\; t_{n-1}.

This is the Student-t pivot for the PI, analogous to the Student-t pivot for the CI in §3.1. The proof uses independence of $\bar X$ and $s^2$ for a Normal sample (Fisher 1925; Casella & Berger 2002, §5.3): $(X_{n+1} - \bar X)/[\sigma\sqrt{1 + 1/n}]$ is standard Normal, $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ independently, and their ratio (with the appropriate denominator scaling) is Student-t with $n-1$ df.

The $(1 - \alpha)$ PI under unknown $\sigma$ :

\boxed{\;C^{\mathrm{PI}}_{1-\alpha}(X_{n+1}) \;=\; \bar X \;\pm\; t_{n-1, 1-\alpha/2} \cdot s \cdot \sqrt{1 + \tfrac{1}{n}}\;}

The corresponding $(1 - \alpha)$ CI for $\mu$ shares the t multiplier:

C^{\mathrm{CI}}_{1-\alpha}(\mu) \;=\; \bar X \;\pm\; t_{n-1, 1-\alpha/2} \cdot \dfrac{s}{\sqrt n}.

The only difference between the CI and PI formulae under unknown $\sigma$ is the multiplier on $s$ : $1/\sqrt n$ for the CI, $\sqrt{1 + 1/n}$ for the PI. At $n = 20$ and 95% nominal: $t_{19, 0.975} \approx 2.093$ . CI half-width = $2.093 \cdot s / \sqrt{20} \approx 0.468 s$ . PI half-width = $2.093 \cdot s \cdot \sqrt{1.05} \approx 2.145 s$ . PI/CI $\approx 4.58 = \sqrt{21}$ . Same ratio $\sqrt{n+1}$ as the known- $\sigma$ case (because the $t$ multiplier cancels).

The Student-t pivot is the exact-finite-sample answer for Normal data with unknown variance. For non-Normal data the $t$ -based PI is approximate (CLT for $\bar X$ , but the next-draw distribution is NOT Normal so the predictive part is approximate even asymptotically). Robust / nonparametric PIs (later in this section) relax the Normality assumption.
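As a sketch of the unknown-$\sigma$ formulas, the snippet below computes both intervals from a sample. It assumes the caller supplies the quantile $t_{n-1, 1-\alpha/2}$ from a table (here 2.093 for $n = 20$, the value quoted above), because the Python standard library has no Student-t quantile function. The name `t_intervals` and the simulated sample are illustrative only.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_intervals(sample, t_quantile):
    """CI for mu and PI for X_{n+1} under the Normal model with sigma unknown.

    t_quantile is t_{n-1, 1-alpha/2}, looked up by the caller.
    """
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)                            # sample sd, divides by n - 1
    ci_half = t_quantile * s / sqrt(n)           # multiplier 1/sqrt(n)
    pi_half = t_quantile * s * sqrt(1 + 1 / n)   # multiplier sqrt(1 + 1/n)
    return (xbar - ci_half, xbar + ci_half), (xbar - pi_half, xbar + pi_half)


# Simulated illustrative data: 20 draws from N(0, 1) with a fixed seed.
rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(20)]

ci, pi = t_intervals(sample, t_quantile=2.093)
ratio = (pi[1] - pi[0]) / (ci[1] - ci[0])
print(f"CI for mu:      ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"PI for X_(n+1): ({pi[0]:.3f}, {pi[1]:.3f})")
print(f"width ratio:    {ratio:.3f}")
```

Whatever the sample, the width ratio equals $\sqrt{n + 1}$, since $t$ and $s$ appear in both half-widths and cancel.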

The first widget makes the CI–PI distinction visible. Pick a Normal model ( $\mu, \sigma$ ), a sample size $n$ , and a confidence level. The widget draws one sample, computes the CI for $\mu$ and the PI for $X_{n+1}$ , and plots both as horizontal bars against the $x$ -axis. Above the bars it draws TWO densities: the SAMPLING DISTRIBUTION of $\bar X$ (narrow, scales as $\sigma/\sqrt n$ ) and the PREDICTIVE DISTRIBUTION of $X_{n+1}$ (wider, scales as $\sigma$ ). The reader can toggle between $\sigma$ known (z multiplier) and $\sigma$ unknown (t multiplier).

Things to verify in the widget:

Start at $\mu = 0, \sigma = 1, n = 20$ , 95% confidence, $\sigma$ known. The CI bar (green) sits tightly around $\bar X$ ; the PI bar (blue) is roughly $4.58\times$ wider — the $\sqrt{21}$ ratio. The green sampling-of- $\bar X$ density is a tall narrow Gaussian; the blue predictive-of- $X_{n+1}$ density is the wider unit-variance Gaussian. The widget reports the exact ratio in the table.
Slide $n$ up to 200. The CI half-width shrinks from $\sim 0.44$ to $\sim 0.14$ . The PI half-width barely changes — it converges to $1.96 \cdot \sigma \approx 1.96$ . Compare the table's "n → ∞ limit (half)" column: 0 for the CI, $\approx 1.96$ for the PI. The ratio jumps from $\sqrt{21} \approx 4.58$ to $\sqrt{201} \approx 14.18$ .
Drop $n$ to 3. The PI half-width inflates: $\sqrt{1 + 1/3} = \sqrt{4/3} \approx 1.155$, so PI half-width $\approx 1.96 \cdot 1.155 \approx 2.26$. The CI inflates much more in relative terms: $1.96 / \sqrt 3 \approx 1.13$, up from $0.44$ at $n = 20$. Both intervals are wide at $n = 3$; as $n$ grows again the CI shrinks fast while the PI barely moves.
Toggle to $\sigma$ UNKNOWN. The widget switches to the $t$ quantile: at $n = 20$, $t_{19, 0.975} \approx 2.09$ vs the z quantile 1.96. Both intervals widen by about 7%. At $n = 5$, $t_{4, 0.975} \approx 2.78$: both intervals widen by $\sim 42\%$; the small-sample $t$-correction is significant.
Slide $\sigma$ up from 1 to 2. The PI half-width doubles (it scales linearly with $\sigma$ ). The CI half-width also doubles. The PI floor $z \cdot \sigma$ moves from $\sim 1.96$ to $\sim 3.92$ : the irreducible noise of the next draw scales with the true noise of the DGP.
Re-roll the sample a few times. The CI moves around $\mu$ ; sometimes covers, sometimes misses (it should cover in 95% of re-rolls under the assumed model). The PI also moves with $\bar X$ but is wide enough that it almost always brackets the predictive density. The "covers truth?" flag in the table tracks the CI; the PI flag is omitted because there is no single "truth" to check — the truth is the random $X_{n+1}$ density.

The misuse in literature: CIs reported when PIs were needed

Hahn & Meeker (1991), Statistical Intervals: A Guide for Practitioners, document the CI-as-PI confusion as the most common statistical interval error in applied work. The phrasing is the giveaway. A CI says "we estimate the mean is in [a, b]"; a PI says "the next observation will fall in [a, b]". Common misuse patterns:

Clinical trials. "Based on our analysis, the next patient's blood-pressure response will fall in $[120, 130]$ with 95% confidence." This is a PI claim — needs the $\sqrt{1 + 1/n}$ factor. The reported CI half-width is roughly $\sigma/\sqrt n$ ; the PI half-width is roughly $\sigma\sqrt{1 + 1/n} \approx \sigma$ , which is $\sqrt n$ times wider. For $n = 100$ : the CI says $\pm 0.196\sigma$ (about $\pm 1$ mmHg if $\sigma = 5$ ); the PI says $\pm 1.97\sigma$ (about $\pm 10$ mmHg) — TEN TIMES wider.
Manufacturing tolerance. "The 95% CI for the mean weight of a component is $(99.8, 100.2)$ grams." This bands the mean weight; if a quality engineer reads it as "any new component will weigh $(99.8, 100.2)$ grams" they are wrong — the new component weight has its OWN variance $\sigma^2$ added on top.
Forecasting. A CI for the mean of a forecast distribution is not the same as a prediction interval for the next realisation. Demand forecasting, financial returns, and engineering reliability all need PIs, not CIs.

The remedy is procedural: when describing where a future observation will fall, USE A PI. The Normal-PI formula differs from the CI formula in a single factor: replace $1/\sqrt n$ with $\sqrt{1 + 1/n}$. The cost is trivial; the correctness is non-negotiable.
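To see how badly a CI performs when it is misread as a PI, one can simulate the situation directly. The sketch below assumes a known-$\sigma$ Normal model with illustrative values $\mu = 125$, $\sigma = 5$, $n = 100$ (chosen to echo the blood-pressure example; they are not data from any study) and counts how often the next draw lands inside each interval. The function name `next_draw_coverage` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def next_draw_coverage(n=100, mu=125.0, sigma=5.0, level=0.95,
                       reps=5000, seed=0):
    """Fraction of replications in which X_{n+1} falls inside the CI and the PI."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci_half = z * sigma / sqrt(n)
    pi_half = z * sigma * sqrt(1 + 1 / n)
    in_ci = in_pi = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n   # training mean
        x_next = rng.gauss(mu, sigma)                            # future draw
        err = abs(x_next - xbar)                                 # prediction error
        in_ci += err <= ci_half
        in_pi += err <= pi_half
    return in_ci / reps, in_pi / reps


ci_cov, pi_cov = next_draw_coverage()
print(f"next draw inside the CI: {ci_cov:.3f}")
print(f"next draw inside the PI: {pi_cov:.3f}")
```

Theory says the PI should capture the next draw about 95% of the time, while the CI captures it with probability $2\Phi(0.196/\sqrt{1.01}) - 1$, roughly 15%. The simulated fractions should land close to those values.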

Prediction intervals in regression (a Part 4 preview)

The CI vs PI distinction extends to regression and to any prediction model. In ordinary-least-squares regression $Y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ , the predicted value at a NEW $x_{\mathrm{new}}$ is $\hat Y = x_{\mathrm{new}}^\top \hat\beta$ . Two parallel intervals exist:

\mathrm{CI~for~}E[Y \mid x_{\mathrm{new}}]: \quad \hat Y \;\pm\; z_{1-\alpha/2} \cdot \sqrt{\sigma^2 \cdot x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}}}.

\mathrm{PI~for~}Y_{\mathrm{new}}: \quad \hat Y \;\pm\; z_{1-\alpha/2} \cdot \sqrt{\sigma^2 \cdot x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}} \;+\; \sigma^2}.

Read the difference. The CI for $E[Y \mid x_{\mathrm{new}}]$ bands the REGRESSION-FUNCTION uncertainty: how far the fitted line is from the true conditional mean at $x_{\mathrm{new}}$ . The PI for $Y_{\mathrm{new}}$ bands the FUTURE-OBSERVATION uncertainty: where the actual $Y_{\mathrm{new}}$ will land. The PI adds $\sigma^2$ inside the square root — the same $\sigma^2$ as the residual noise. As $n \to \infty$ , the coefficient-uncertainty term $\sigma^2 \cdot x^\top (X^\top X)^{-1} x \to 0$ (the regression coefficients are consistent); the residual $\sigma^2$ term DOES NOT shrink. The PI converges to $\pm z \sigma$ — the same intrinsic-noise floor as the iid Normal case.

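In Part 4 this generalises to confidence vs prediction bands (curves of CI/PI half-widths plotted as a function of $x$). Both half-widths are largest at the EDGES of the training-$X$ range, where the leverage $h = x^\top (X^\top X)^{-1} x$ is largest, and smallest near the centroid. The PI/CI ratio $\sqrt{1 + 1/h}$ behaves the other way round: it is largest near the centroid (small $h$) and smallest at the edges. For non-Normal errors the formulae are first-order approximations; conformal prediction (below) gives a distribution-free finite-sample replacement.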
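The two regression formulas can be sketched for the simplest case, a straight line with an intercept and known $\sigma$. For that model the leverage $x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}}$ reduces to $1/n + (x_{\mathrm{new}} - \bar x)^2 / S_{xx}$, a standard identity used below. The function name and the design points $x = 1, \ldots, 20$ are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def regression_half_widths(xs, x_new, sigma=1.0, level=0.95):
    """CI and PI half-widths at x_new for y = b0 + b1*x with known sigma."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    h = 1 / n + (x_new - xbar) ** 2 / sxx    # leverage of x_new
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci_half = z * sigma * sqrt(h)            # coefficient uncertainty only
    pi_half = z * sigma * sqrt(h + 1)        # plus residual noise sigma^2
    return ci_half, pi_half


xs = [float(i) for i in range(1, 21)]        # illustrative design: x = 1, ..., 20
for x_new in (10.5, 15.0, 20.0):
    ci, pi = regression_half_widths(xs, x_new)
    print(f"x_new={x_new:5.1f}  CI={ci:.3f}  PI={pi:.3f}  PI/CI={pi / ci:.2f}")
```

At the centroid $x_{\mathrm{new}} = 10.5$ the leverage is $1/n$ and the ratio is $\sqrt{n + 1}$, matching the iid case. Moving toward the edge of the design widens both intervals and lowers the ratio $\sqrt{1 + 1/h}$.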

Robust and bootstrap PIs

The Normal-PI formula assumes Normal data. For heavy-tailed or skewed populations the formula is mis-calibrated: it under-covers when the true distribution puts more probability in the tails than the assumed Normal does. Two non-parametric alternatives:

Quantile-based PI. Read off the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles of the training sample directly. The PI is $(X_{(\alpha/2)}, X_{(1-\alpha/2)})$ . For large $n$ this approaches the true marginal-distribution quantiles. No distributional assumption needed. Cost: requires $n \gtrsim 100$ for stable tail quantiles.
Bootstrap predictive distribution PI. Bootstrap the training sample to get bootstrap means $\bar X^*_b$ and bootstrap residuals $e^*_b = X^*_b - \bar X^*_b$. The predictive distribution of $X_{n+1}$ is approximated by the convolution of the bootstrap distribution of $\bar X$ with the residual distribution. Quantile-based PI from this convolution gives a distribution-free PI (Davison & Hinkley 1997, §5.4).

Geisser (1993), Predictive Inference: An Introduction, develops the predictive-inference framework: the goal is the predictive distribution, not the parameter, and Bayesian and bootstrap methods give natural quantile-based PIs from posterior or empirical predictive distributions. The robust approach is wider than the Normal-PI under Normality (efficiency cost) but better calibrated under non-Normality (robustness gain).
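Both nonparametric alternatives fit in a short sketch. The version below is one plausible reading of the two recipes, not a reference implementation: the quantile PI uses linear interpolation between order statistics, and the bootstrap PI pairs each bootstrap mean $\bar X^*_b$ with one independent draw $X^*_b$ from the empirical distribution to form the prediction error $e^*_b = X^*_b - \bar X^*_b$. All function names and the simulated exponential data are assumptions made for illustration.

```python
import random


def empirical_quantile(sorted_vals, p):
    """Linear-interpolation quantile of an already sorted list, 0 <= p <= 1."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantile_pi(sample, alpha=0.05):
    """PI read directly off the empirical alpha/2 and 1 - alpha/2 quantiles."""
    s = sorted(sample)
    return empirical_quantile(s, alpha / 2), empirical_quantile(s, 1 - alpha / 2)


def bootstrap_pi(sample, alpha=0.05, n_boot=4000, seed=0):
    """PI from the bootstrap distribution of prediction errors X* - Xbar*."""
    rng = random.Random(seed)
    n = len(sample)
    xbar = sum(sample) / n
    errors = []
    for _ in range(n_boot):
        resample = rng.choices(sample, k=n)   # bootstrap training sample
        xbar_star = sum(resample) / n         # bootstrap mean
        x_star = rng.choice(sample)           # bootstrap stand-in for X_{n+1}
        errors.append(x_star - xbar_star)
    errors.sort()
    return (xbar + empirical_quantile(errors, alpha / 2),
            xbar + empirical_quantile(errors, 1 - alpha / 2))


# Illustrative skewed data: 200 simulated exponential draws with mean 1.
rng = random.Random(1)
data = [rng.expovariate(1.0) for _ in range(200)]
q_lo, q_hi = quantile_pi(data)
b_lo, b_hi = bootstrap_pi(data)
print(f"quantile PI:  ({q_lo:.3f}, {q_hi:.3f})")
print(f"bootstrap PI: ({b_lo:.3f}, {b_hi:.3f})")
```

On skewed data like this both intervals come out asymmetric around $\bar X$, which a symmetric Normal-PI cannot reproduce.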

Unlike the CI, which has no observable truth (the parameter is unknown), the PI has an observable truth: the next observation. PI calibration is EMPIRICALLY TESTABLE. The procedure is simple:

Split the data (or simulate): $n$ training points, $m$ test points.
Build the PI from the training data.
Count the fraction of test points inside the PI.
Average over many train/test splits (Monte-Carlo or cross-validation).
Compare to nominal: a 95% PI should cover 95% of test points.
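The five steps above can be sketched as a small Monte-Carlo experiment. This is a minimal illustration under stated assumptions: training draws are $\mathcal{N}(0, 1)$, test draws are $\mathcal{N}(\text{shift}, 1)$, the PI is the Student-t Normal-PI, and the quantile 2.093 ($t_{19, 0.975}$) is hard-coded to match $n = 20$. The function name `pooled_coverage` and its parameters are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def pooled_coverage(n=20, m=50, splits=500, t_quantile=2.093,
                    test_shift=0.0, seed=0):
    """Pooled fraction of test points that fall inside the t-based Normal PI.

    t_quantile must match n (2.093 is t_{19, 0.975}, i.e. n = 20 at 95%).
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(splits):
        train = [rng.gauss(0.0, 1.0) for _ in range(n)]       # step 1: training draw
        xbar, s = mean(train), stdev(train)
        half = t_quantile * s * sqrt(1 + 1 / n)               # step 2: build the PI
        test = (rng.gauss(test_shift, 1.0) for _ in range(m))
        inside += sum(abs(x - xbar) <= half for x in test)    # step 3: count hits
    return inside / (splits * m)                              # step 4: pool over splits


matched = pooled_coverage()
shifted = pooled_coverage(test_shift=1.0)
print(f"matched Normal test:       {matched:.3f}")           # step 5: compare to 0.95
print(f"test shifted by one sigma: {shifted:.3f}")
```

Under matched Normal data the pooled coverage should sit close to the nominal 95%; with the test mean shifted by one $\sigma$ it should come out noticeably lower.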

The pi-calibration widget runs this experiment. Pick a training distribution (Normal or heavy-tailed Student- $t_3$ ), a test distribution (Normal, $t_3$ , or shifted Normal for OOD), a training size $n$ , a test size $m$ , and the number of splits $R$ . The widget runs the Monte-Carlo simulation and reports pooled empirical coverage with a Wilson-score 95% confidence band on the proportion estimate.

Things to verify in the widget:

Start with Normal training and Normal test (matched), $n = 20, m = 50, R = 500$, 95% nominal. Pooled empirical coverage should be $\approx 95\%$ with a tight Monte-Carlo band ([94%, 96%] or similar). The widget says "calibrated". The histogram of per-split coverage rates centres on 0.95. This is the canonical "PIs work" picture under correct distributional assumptions.
Switch test distribution to "Normal shifted by +σ (OOD)". The pooled coverage drops well below nominal: for a one-σ shift it lands at roughly 85%, since the upper PI endpoint now sits only about one σ above the test mean. The widget flags "out-of-distribution test: training was Normal but the test draws come from a SHIFTED Normal." The PI assumes the test draws are i.i.d. from the training distribution; OOD test draws break that assumption and coverage falls.
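Switch test distribution to "Student-$t_3$ (heavy tails)" with Normal training. The PI was built assuming Normal residuals but the test draws have $t_3$ tails ($\mathrm{Var}(t_3) = 3$; rescale to $\sigma^2 = 1$). Compare pooled coverage with nominal. On the unit-variance scale the central-95% quantile of $t_3$ is $3.18/\sqrt 3 \approx 1.84$, slightly inside the Normal's 1.96, so any shortfall at 95% nominal is modest. The heavy tails bite further out: the 99.5% quantile is $5.84/\sqrt 3 \approx 3.37$ against the Normal's 2.58, so the Normal-PI under-covers more and more as the nominal level rises.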
Set BOTH train and test to $t_3$. The Normal-PI formula still UNDER-COVERS, even with matched distributions, because the $t$-quantile correction only adjusts for ESTIMATION of $\sigma$ from a Normal sample; it does NOT correct for non-Normal TAILS. The PI uses $t_{n-1}$ but the true predictive distribution is $t_3$; mismatch persists. Coverage typically $\sim 90\%$ at 95% nominal.
Increase $R$ from 500 to 2000. The Wilson-score Monte-Carlo band tightens by $\sqrt 4 = 2\times$ ; statistical precision on the coverage estimate improves. The verdict ("calibrated" / "under-covers" / "over-covers") stabilises.
Slide $n$ from 5 to 200 (matched Normal). The $t$-correction shrinks as $n$ grows; the PI converges to the z-based PI. Coverage stays $\approx 95\%$ across $n$ because under matched Normal the formula is exact: calibration does not depend on $n$ if the model is correct.

Conformal prediction: the modern distribution-free PI

The widget makes the failure modes of the Normal-PI visible. The cure, for distribution-free finite-sample coverage, is CONFORMAL PREDICTION. The framework was developed by Vovk, Gammerman, and Shafer (2005), Algorithmic Learning in a Random World, and made widely accessible to statistics by Lei, G'Sell, Rinaldo, Tibshirani, and Wasserman (2018), "Distribution-free predictive inference for regression," JASA 113(523), 1094–1111. The SPLIT-CONFORMAL recipe (Lei et al. 2018, §2):

Split data into TRAINING set $D_{\mathrm{tr}}$ (size $n_1$ ) and CALIBRATION set $D_{\mathrm{cal}}$ (size $n_2$ ).
Fit a regression / prediction model $\hat\mu$ on $D_{\mathrm{tr}}$ .
Compute calibration residuals $R_i = |Y_i - \hat\mu(X_i)|$ for $(X_i, Y_i) \in D_{\mathrm{cal}}$ .
Let $\hat q_{1-\alpha}$ be the $\lceil (n_2 + 1)(1 - \alpha) \rceil$ -th smallest residual.
For a new $X_{\mathrm{new}}$ : PI = $\hat\mu(X_{\mathrm{new}}) \pm \hat q_{1-\alpha}$ .

The theorem (Vovk-Gammerman-Shafer 2005; Lei et al. 2018): if $(X_i, Y_i)$ are EXCHANGEABLE (in particular, iid), the resulting PI satisfies $P(Y_{\mathrm{new}} \in \mathrm{PI}) \ge 1 - \alpha$ in FINITE samples, distribution-free. The coverage holds for ANY base prediction model $\hat\mu$ — including misspecified ones. The cost is the calibration set (data not used for fitting) and possibly looseness when $\hat\mu$ is a poor fit (the residual distribution is wider).
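The split-conformal recipe can be sketched end to end with the standard library. The example below is illustrative and makes several assumptions not in the text: the base model $\hat\mu$ is a least-squares straight line, the data come from a hypothetical model $Y = 2 + 0.5X + \varepsilon$ with heavy-tailed $t_3$ noise, and when the calibration set is too small for the requested rank the half-width is set to infinity so the guarantee is kept. All function names are hypothetical.

```python
import random
from math import ceil, sqrt


def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns a prediction function."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return lambda x: b0 + b1 * x


def split_conformal(train, calib, alpha=0.05):
    """Return (mu_hat, q_hat): fitted model and conformal half-width."""
    mu_hat = fit_line(*zip(*train))                            # fit on D_tr only
    residuals = sorted(abs(y - mu_hat(x)) for x, y in calib)   # R_i on D_cal
    k = ceil((len(calib) + 1) * (1 - alpha))                   # conformal rank
    q_hat = residuals[k - 1] if k <= len(calib) else float("inf")
    return mu_hat, q_hat


def t3_noise(rng):
    """One Student-t draw with 3 df: N(0, 1) / sqrt(chi2_3 / 3)."""
    return rng.gauss(0.0, 1.0) / sqrt(rng.gammavariate(1.5, 2.0) / 3)


def draw(rng, size):
    """Simulated (x, y) pairs from the illustrative model y = 2 + 0.5x + t3 noise."""
    pts = []
    for _ in range(size):
        x = rng.uniform(0.0, 10.0)
        pts.append((x, 2.0 + 0.5 * x + t3_noise(rng)))
    return pts


rng = random.Random(0)
train, calib, test = draw(rng, 200), draw(rng, 999), draw(rng, 2000)
mu_hat, q_hat = split_conformal(train, calib, alpha=0.05)
coverage = sum(abs(y - mu_hat(x)) <= q_hat for x, y in test) / len(test)
print(f"half-width q_hat = {q_hat:.3f}, test coverage = {coverage:.3f}")
```

No Normality is assumed anywhere: the half-width comes from ranked calibration residuals, so the heavy $t_3$ tails are absorbed into $\hat q_{1-\alpha}$. The guarantee is marginal over the calibration draw, so coverage in any single run fluctuates around the nominal level.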

Locally adaptive variants (CQR, conformal quantile regression; Romano-Patterson-Candes 2019) replace the constant width $\hat q$ with $x$ -dependent widths — handling heteroscedasticity. Group-conditional and cross-conformal variants tighten further. Conformal prediction is now the standard finite-sample valid PI procedure for ML predictions; it gives the formal coverage guarantee that the Normal-PI cannot deliver under non-Normal data.

Try it

In the ci-vs-pi-explorer, set $\mu = 0, \sigma = 1, n = 20$, 95%, $\sigma$ known. Read off the CI half-width and PI half-width from the table. Verify the PI/CI ratio $\approx \sqrt{21} \approx 4.58$. Compute by hand: CI half = $1.96/\sqrt{20} = 0.438$, PI half = $1.96\sqrt{1.05} = 2.008$, ratio = $2.008/0.438 = 4.58$. Match the widget reading to four decimal places.
Same widget. Slide $n$ from 20 to 2000 in steps. Watch the CI half-width shrink toward 0 and the PI half-width converge toward 1.96. Plot mentally the two curves vs $n$: CI decays like $1/\sqrt n$, PI asymptotes at $z\sigma$. Note that the ratio $\sqrt{n+1}$ is approximately $\sqrt n$ for large $n$.
Same widget. Toggle $\sigma$ unknown at $n = 5$. Note the t-quantile $t_{4, 0.975} = 2.776$ vs the z-quantile 1.96: the small-sample correction widens both intervals by $\sim 42\%$. Re-roll a few times to see the $s$ estimate fluctuate; the CI and PI both inherit the $s$ noise.
In the pi-calibration widget, set matched Normal training and Normal test, $n = 20, m = 50, R = 500$, 95% nominal. Verify pooled empirical coverage $\approx 95\%$ with a tight band. This is the "PI works under matched assumptions" reference case.
Same widget. Switch test distribution to "Normal shifted by +σ". Observe that pooled coverage drops to roughly 85%. Argue: the PI is centred on $\bar X_{\mathrm{train}} \approx \mu$ but the test draws now have mean $\mu + \sigma$, so the upper PI endpoint $\mu + 1.96\sigma\sqrt{1.05} \approx \mu + 2.01\sigma$ sits only about $1\sigma$ above the test mean. About $1 - \Phi(1) \approx 16\%$ of test draws fall above it, and almost none fall below the lower endpoint.
Same widget. Switch training to $t_3$, test to $t_3$ (both heavy-tailed) and compare pooled coverage with nominal. Compute the 97.5% quantile of $t_3$: it is $\approx 3.18$ on the raw scale, which is $3.18/\sqrt 3 \approx 1.84$ in units of the $t_3$ standard deviation $\sqrt 3$, close to the Normal's 1.96. Repeat for the 99.5% quantile: $5.84/\sqrt 3 \approx 3.37$ versus the Normal's 2.58. The Normal-PI misses the actual heavy tail most clearly at high nominal levels.
Pen-and-paper. State the variance decomposition for $X_{n+1} - \bar X$ under iid Normality: $\sigma^2 + \sigma^2/n$ . Why are the two terms additive (not multiplicative)? Hint: independence of $X_{n+1}$ and $\bar X$ .
Pen-and-paper. Derive the regression PI: $\hat Y = x_{\mathrm{new}}^\top \hat\beta$ . Compute $\mathrm{Var}(Y_{\mathrm{new}} - \hat Y) = \mathrm{Var}(Y_{\mathrm{new}}) + \mathrm{Var}(\hat Y) = \sigma^2 + \sigma^2 x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}}$ . Compare with the CI for $E[Y \mid x_{\mathrm{new}}]$ which only has the second term. Argue when the difference is large (small $n$ ) vs small (large $n$ relative to $x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}}$ ).
Pen-and-paper. Describe the split-conformal PI procedure: training, calibration, prediction. Why does the coverage $1 - \alpha$ hold in finite samples? Hint: exchangeability of $(R_1, \ldots, R_{n_2}, R_{\mathrm{new}})$ means the rank of $R_{\mathrm{new}}$ is uniform on $\{1, \ldots, n_2 + 1\}$, so $P(R_{\mathrm{new}} \le R_{(\lceil (n_2+1)(1-\alpha)\rceil)}) \ge 1 - \alpha$.

Pause and reflect: §3.4 has made the CI–PI distinction explicit. The CI is about a PARAMETER (the true mean $\mu$ , a regression coefficient, a probability) — a fixed unknown. The PI is about a RANDOM VARIABLE (the next observation $X_{n+1}$ , the next predicted $Y$ ) — both endpoints AND target are random. For the Normal model with known $\sigma$ , the difference is one algebraic factor: $1/\sqrt n$ for the CI, $\sqrt{1 + 1/n}$ for the PI. The CI vanishes as $n \to \infty$ ; the PI converges to the irreducible noise floor $z\sigma$ . The two ci-vs-pi-explorer and pi-calibration widgets make this visible and EMPIRICALLY TESTABLE — PI calibration is checkable from data, where CI calibration of an unknown parameter is not. §3.5 will pick up the broader calibration thread: when does a procedure that CLAIMS 95% coverage really deliver 95% coverage, and how do you check, across all the CI methodologies of §§3.1–3.3 and the PIs of §3.4?

What you now know

You can articulate the CONCEPTUAL distinction between a CI (about a parameter — a fixed unknown) and a PI (about a random variable — a future observation). You know the CI bands estimation uncertainty alone, while the PI bands estimation uncertainty PLUS the intrinsic variance of the next draw.

You can derive the Normal PI with known $\sigma^2$ : $\mathrm{Var}(X_{n+1} - \bar X) = \sigma^2(1 + 1/n)$ by independence, leading to PI = $\bar X \pm z_{1-\alpha/2} \cdot \sigma\sqrt{1 + 1/n}$ . You can derive the unknown- $\sigma$ version via the Student-t pivot: PI = $\bar X \pm t_{n-1, 1-\alpha/2} \cdot s\sqrt{1 + 1/n}$ . You know the corresponding CIs have the $\sqrt{1 + 1/n}$ factor replaced by $1/\sqrt n$ and the LIMIT behaviour: CI half-width $\to 0$ , PI half-width $\to z_{1-\alpha/2}\sigma$ . PI/CI ratio at finite $n$ is $\sqrt{n + 1}$ .

You can identify the MISUSE in literature where authors report a CI when describing where a future observation will fall; Hahn & Meeker (1991) call this the most common error in applied statistical intervals. You know the remedy: use the $\sqrt{1 + 1/n}$ factor and call it a PI.

You can state the regression PI: $\hat Y \pm z \cdot \sqrt{\sigma^2 \cdot x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}} + \sigma^2}$ where the second $\sigma^2$ is the next-draw noise that the CI for $E[Y \mid x_{\mathrm{new}}]$ does NOT include. You know that as $n \to \infty$ the CI for the regression mean shrinks to 0 but the PI converges to the irreducible $\pm z\sigma$ floor — same shape as the iid case.

You can describe NONPARAMETRIC PIs: quantile-based PIs from the empirical CDF, bootstrap-based PIs from the predictive distribution, and the rationale (robustness to non-Normal data at the cost of needing larger $n$ ). You know Geisser (1993) developed the predictive-inference framework where these alternatives sit naturally.

You can describe PI CALIBRATION as an empirically testable property — coverage is verified by train/test splits or cross-validation, unlike CIs for unknown parameters which lack an observable truth. You can use the pi-calibration widget to verify that Normal-PI coverage matches nominal under matched Normal data, drops several percentage points under heavy tails, and collapses under out-of-distribution test draws.

You can describe CONFORMAL PREDICTION as the distribution-free finite-sample-valid PI procedure: split into training and calibration sets, compute calibration residuals, set the PI half-width to the appropriate empirical-residual quantile. Coverage $\ge 1 - \alpha$ holds under exchangeability for ANY base prediction model. Vovk, Gammerman, Shafer (2005) is the canonical reference; Lei et al. (2018) is the modern statistics-friendly treatment. Locally adaptive variants (CQR) handle heteroscedasticity.

Where this lands in the rest of Part 3 and the textbook. §3.5 takes CALIBRATION as a topic in its own right: when does a CI procedure that CLAIMS 95% coverage really deliver 95%, and how do you check across all the methodologies (Wald, bootstrap, profile-LRT, Normal-PI, conformal)? §3.6 closes Part 3 on the communication side — how to report uncertainty without lying. Part 4 (regression) develops the regression-PI machinery in full: predictor-dependent widths, confidence bands vs prediction bands, conformal prediction for regression. The $\sqrt{1 + 1/n}$ factor you just learned generalises to $\sqrt{\mathrm{leverage} + 1}$ in regression.
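To see why the regression factor contains the iid one as a special case, consider an intercept-only model. The symbol $h_{\mathrm{new}}$ for the leverage of the new point is notation assumed here for illustration.

```latex
X = \mathbf{1}_n,\quad x_{\mathrm{new}} = 1
\;\Longrightarrow\;
X^\top X = n,\qquad
h_{\mathrm{new}} = x_{\mathrm{new}}^\top (X^\top X)^{-1} x_{\mathrm{new}} = \frac{1}{n},
\qquad
\sqrt{h_{\mathrm{new}} + 1} = \sqrt{1 + 1/n}.
```

With only an intercept the fitted value is $\bar X$, so the regression PI collapses to the Normal PI derived in this section.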

References

Hahn, G.J., Meeker, W.Q. (1991). Statistical Intervals: A Guide for Practitioners. Wiley. (The standard practitioner reference. Chapter 2 distinguishes the four interval types — confidence, prediction, tolerance, and enclosure — and documents the CI-as-PI confusion as the most-cited error in applied work. Chapter 4 gives the Normal PI formulae with $\sqrt{1 + 1/n}$ ; Chapter 5 covers regression PIs.)
Faulkenberry, G.D. (1973). "A method of obtaining prediction intervals." Journal of the American Statistical Association 68(343), 433–435. (Early formal treatment of the Normal-model PI and its derivation via the predictive pivot. Cited as a foundational PI reference.)
Geisser, S. (1993). Predictive Inference: An Introduction. Chapman & Hall. (The predictive-inference framework. Argues that the prediction of future observations is the natural object of statistical inference and develops Bayesian and bootstrap predictive distributions. Chapter 3 covers Normal-model PIs; Chapter 5 covers nonparametric / bootstrap PIs.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (§7.2 distinguishes CIs and PIs for the Normal mean and derives the $\sqrt{1 + 1/n}$ formula. Readable introductory treatment.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (§9.2 develops the pivot-based interval framework; the prediction interval emerges as a pivot on $X_{n+1} - \bar X$. §11.3 covers the regression PI.)
Vovk, V., Gammerman, A., Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. (The conformal-prediction monograph. Defines transductive and inductive conformal predictors; proves finite-sample distribution-free coverage under exchangeability.)
Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R.J., Wasserman, L. (2018). "Distribution-free predictive inference for regression." Journal of the American Statistical Association 113(523), 1094–1111. (The modern statistics-friendly conformal-prediction treatment. Develops split conformal, full conformal, and jackknife+ for regression with finite-sample distribution-free coverage guarantees.)
Romano, Y., Patterson, E., Candes, E.J. (2019). "Conformalized quantile regression." NeurIPS 2019. (Conformal quantile regression: locally adaptive PIs that handle heteroscedasticity. The state-of-the-art conformal PI for regression with predictor-dependent width.)
Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. (§5.4 covers bootstrap predictive distributions and quantile-based bootstrap PIs. The practical reference for nonparametric PIs.)
