Prediction intervals vs confidence intervals
Learning objectives
- State the conceptual contrast: a CONFIDENCE INTERVAL is about a PARAMETER (e.g. μ) — a fixed unknown — whereas a PREDICTION INTERVAL is about a RANDOM VARIABLE (e.g. the next observation X_{n+1}). The CI bands estimation uncertainty alone; the PI bands estimation uncertainty PLUS the intrinsic variance of the future draw
- Derive the Normal-model PI with KNOWN σ². X_{n+1} − X̄ has mean 0 and variance σ²·(1 + 1/n) under independence (since X̄ uses n iid draws independent of X_{n+1}). Hence the (1 − α) PI is X̄ ± z_{1−α/2} · σ · √(1 + 1/n). The √(1 + 1/n) factor exceeds 1 for every finite n and converges to 1 as n → ∞
- Derive the Normal-model PI with UNKNOWN σ². Replace σ by the sample sd s; the pivot (X_{n+1} − X̄)/[s · √(1 + 1/n)] follows Student-t with n − 1 degrees of freedom. The (1 − α) PI is X̄ ± t_{n−1, 1−α/2} · s · √(1 + 1/n). The CI for μ uses the SAME t quantile with the √(1 + 1/n) replaced by 1/√n
- Compare the LIMITS: as n → ∞, the CI half-width z·σ/√n → 0 (the sampling distribution of X̄ concentrates on μ), while the PI half-width z·σ·√(1 + 1/n) → z·σ (the next-draw intrinsic variance σ² remains regardless of training-sample size). PI half-width has a non-vanishing FLOOR of z·σ
- State the ratio PI/CI: at finite n the PI half-width / CI half-width = √(n + 1) ≈ √n for moderate n. At n = 20 the ratio is √21 ≈ 4.58; at n = 200 the ratio is √201 ≈ 14.18. The PI is always wider than the CI, and the gap grows with n on the absolute scale (CI shrinks faster than PI does)
- Identify the MISUSE in literature: many papers report a CI when describing the range a NEW patient / NEW measurement should land in. Statements like "based on the analysis, the new sample falls in [a, b] with 95% confidence" are PIs, not CIs — they require the √(1 + 1/n) factor. The classical reference is Hahn & Meeker (1991, Statistical Intervals: A Guide for Practitioners, §2)
- PREVIEW prediction intervals in REGRESSION (deferred to Part 4). For a new x_new, the prediction of Y_new has TWO sources of uncertainty: (i) regression coefficient uncertainty, Var(x_newᵀβ̂), and (ii) residual noise σ². The PI half-width is z · √(Var(x_newᵀβ̂) + σ²), always wider than the half-width z · √Var(x_newᵀβ̂) of the CI for E[Y | x_new], again by an additive σ² inside the square root
- State BOOTSTRAP / NONPARAMETRIC PIs: instead of a parametric (Normal) PI, the bootstrap of the empirical predictive distribution yields quantile-based PIs. For predictive intervals from a fitted model, the bootstrap of the residuals Y − Ŷ gives a quantile-based PI: PI = [Ŷ + q_{α/2}, Ŷ + q_{1−α/2}], where q_p is the p-quantile of the bootstrap residuals (this reduces to Ŷ ± q_{1−α/2} when the residual distribution is symmetric). More robust to non-Normality than the Normal-PI (Geisser 1993, Predictive Inference: An Introduction)
- Define CALIBRATION for PIs: a (1 − α) PI is CALIBRATED if it covers the next observation in (1 − α) of repeated experiments. State that PI calibration is EMPIRICALLY TESTABLE via cross-validation, leave-one-out, or held-out data: build the PI from training, count the fraction of test points inside, compare to nominal
- Articulate the failure modes of Normal PIs: (1) HEAVY-TAILED data: the next draw is more often beyond ±z·σ than Normal-PI assumes; coverage < nominal. (2) OUT-OF-DISTRIBUTION test: training distribution ≠ test distribution; PI was calibrated to the wrong DGP. (3) HETEROSCEDASTICITY in regression: PI assumes constant σ²; with σ² = σ²(x), the PI width should vary with x and a constant-width PI is mis-calibrated
- Preview CONFORMAL PREDICTION as a distribution-free fix (Vovk-Gammerman-Shafer 2005, Algorithmic Learning in a Random World; Lei et al. 2018, JASA). Use a held-out calibration set to compute residual quantiles, then PI = Ŷ ± empirical (1 − α) quantile of |Ŷ − Y| on calibration. Coverage holds in finite samples under EXCHANGEABILITY of (X, Y) pairs — no parametric assumption needed. Locally adaptive variants handle heteroscedasticity
- State the PRACTICAL recommendation: USE A CI when reporting uncertainty about a parameter (mean treatment effect, regression coefficient, prevalence). USE A PI when reporting where a future observation will fall (next patient outcome, prediction for a new x). Never present a CI's [a, b] as if it bounded the next observation — that is the headline pedagogical error this section is built to prevent
Sections §3.1–§3.3 each built a CONFIDENCE INTERVAL methodology: Wald, Wilson, Clopper–Pearson, Garwood, Student-t (§3.1); percentile, basic, BCa, bootstrap-t (§3.2); profile likelihood and the LRT (§3.3). Every CI in those sections has the same conceptual content: it bands a PARAMETER, such as the population mean μ, the binomial p, the Poisson rate λ, or the regression coefficient β. The parameter is a fixed unknown; the band captures sampling variability of the estimation procedure.
§3.4 turns to the parallel object: the PREDICTION INTERVAL (PI). The PI bands NOT a parameter but a RANDOM VARIABLE — specifically, the next observation drawn from the same distribution as the training sample. The CI answers "where is the true mean?"; the PI answers "where will the NEXT observation fall?". These are different objects with different uncertainty sources, and they require different formulas.
The CI and PI for the Normal model with known σ²:
CI for μ: X̄ ± z_{1−α/2} · σ/√n
PI for X_{n+1}: X̄ ± z_{1−α/2} · σ · √(1 + 1/n)
Two differences. First, the PI has a √(1 + 1/n) factor where the CI has a 1/√n factor: the PI is always wider. Second, and this is the conceptually heavier point, as n → ∞ the CI half-width → 0 (the sampling distribution of X̄ concentrates on μ) but the PI half-width → z_{1−α/2} · σ (the intrinsic variance of a future draw remains). The CI vanishes in the limit; the PI converges to a non-zero floor.
The arc has ten stops. First, the conceptual contrast: parameter vs random variable. Second, the Normal PI derivation with known σ². Third, the unknown-σ² generalisation via Student-t. Fourth, the limit comparison and the canonical PI/CI ratio √(n + 1). Fifth, the ci-vs-pi-explorer widget. Sixth, the misuse in literature: CIs reported where PIs should be. Seventh, prediction intervals in regression (a Part 4 preview). Eighth, robust / bootstrap PIs. Ninth, the pi-calibration widget: empirical coverage from train/test splits, including heavy-tailed and OOD failure modes. Tenth, conformal prediction as the modern distribution-free PI. Try-it, recap, references.
Parameter vs random variable: the keystone distinction
The most-cited misuse in applied statistics is mixing up the CI and the PI. The 95% confidence interval for μ says, roughly, "the procedure that produced this interval covers the true mean in 95% of repeated samples." The frequentist interpretation (Neyman 1937; §3.1) is procedural: a property of the CI-building procedure, not a probability about the realised interval. The PARAMETER is a fixed unknown number; the INTERVAL is a random object that depends on the sample.
The 95% prediction interval for X_{n+1} is conceptually different. Now BOTH endpoints AND the target are random. The PI is a procedure that, in 95% of repeated experiments, produces an interval that covers the value of the next iid draw X_{n+1}. The 95% lives in the joint probability over (i) the sample that builds the PI and (ii) the future draw from the same distribution.
Why this matters in practice. Suppose we estimate the mean weight of newborns at a hospital and compute a 95% CI of [a, b] kg. The CI tells us, "the average newborn weight is somewhere in [a, b]". It does NOT tell us the next newborn will weigh between a and b kg; that is a PI claim. The PI for the next newborn is much wider, because individual weights vary around the mean by roughly the population sd σ, not by the much smaller standard error σ/√n. Treating the CI as a PI ("we are 95% confident the next newborn weighs between a and b kg") can be wrong by an order of magnitude (the width ratio is √(n + 1)) and would systematically under-cover. Hahn & Meeker (1991, §2) document this exact confusion as the most-common statistical interval error in applied work.
The Normal PI with known σ²: deriving the √(1 + 1/n) factor
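Assume X_1, …, X_n are iid N(μ, σ²) and X_{n+1} is a future draw from the same distribution, INDEPENDENT of the training sample. The point predictor is X̄. The prediction ERROR is
X_{n+1} − X̄.
Its mean is μ − μ = 0, and its variance, using independence of X_{n+1} and X̄, is
Var(X_{n+1} − X̄) = Var(X_{n+1}) + Var(X̄) = σ² + σ²/n = σ² · (1 + 1/n).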
Two TERMS, two SOURCES of uncertainty:
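- σ² is the INTRINSIC variance of the next draw. Even with infinite training data, the next observation has variance σ² around its mean μ.
- σ²/n is the estimation uncertainty: how far X̄ is from μ. This shrinks as n grows.
By Normality of both X_{n+1} and X̄ (and their independence), the prediction error is Normal: X_{n+1} − X̄ ~ N(0, σ² · (1 + 1/n)). The pivot (X_{n+1} − X̄) / [σ · √(1 + 1/n)] is standard Normal. Inverting that pivot for the central (1 − α) band gives
P( X̄ − z_{1−α/2} · σ · √(1 + 1/n) ≤ X_{n+1} ≤ X̄ + z_{1−α/2} · σ · √(1 + 1/n) ) = 1 − α.
The interval X̄ ± z_{1−α/2} · σ · √(1 + 1/n) is the (1 − α) PI for X_{n+1}. The factor √(1 + 1/n) is the immediate algebraic difference from the CI, and it captures BOTH the next-draw variance AND the training-sample variance combined into one variance.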
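The variance identity Var(X_{n+1} − X̄) = σ² · (1 + 1/n) can be checked numerically. The sketch below is a minimal Monte Carlo illustration using only the Python standard library; the function name `prediction_error_variance`, the seed, and the replication count are illustrative choices, not part of the section's widgets.

```python
import random
import statistics

def prediction_error_variance(n, mu=0.0, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo estimate of Var(X_{n+1} - Xbar) for iid N(mu, sigma^2) draws."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]  # training sample
        x_next = rng.gauss(mu, sigma)                      # independent future draw
        errors.append(x_next - statistics.fmean(sample))   # prediction error
    return statistics.pvariance(errors)

n, sigma = 20, 1.0
simulated = prediction_error_variance(n, sigma=sigma)
theoretical = sigma ** 2 * (1 + 1 / n)  # sigma^2 (intrinsic) + sigma^2 / n (estimation)
print(f"simulated {simulated:.3f} vs theoretical {theoretical:.3f}")
```

The two numbers should agree up to Monte Carlo noise; increasing `reps` tightens the agreement.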
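Three numerical landmarks at 95% nominal (z_{0.975} ≈ 1.96):
- n = 20: √(1 + 1/20) ≈ 1.025. PI half-width ≈ 2.008σ. CI half-width ≈ 0.438σ. Ratio PI/CI = √21 ≈ 4.58.
- n = 200: √(1 + 1/200) ≈ 1.002. PI half-width ≈ 1.965σ. CI half-width ≈ 0.139σ. Ratio PI/CI = √201 ≈ 14.18.
- n = 2000: √(1 + 1/2000) ≈ 1.000. PI half-width ≈ 1.960σ. CI half-width ≈ 0.044σ. Ratio = √2001 ≈ 44.73.
The CI shrinks linearly in 1/√n; the PI converges to the FIXED half-width z_{0.975} · σ ≈ 1.96σ. The gap between them grows on the absolute scale as n grows: the CI vanishes, the PI does not.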
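The known-σ half-widths are easy to tabulate. The following sketch assumes σ = 1 by default and uses the standard-library `statistics.NormalDist` for the z quantile; the helper name `half_widths` and the grid n = 20, 200, 2000 are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def half_widths(n, sigma=1.0, alpha=0.05):
    """Return (ci_half, pi_half) for the Normal model with known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    ci_half = z * sigma / sqrt(n)             # estimation uncertainty only
    pi_half = z * sigma * sqrt(1 + 1 / n)     # estimation + next-draw variance
    return ci_half, pi_half

for n in (20, 200, 2000):
    ci, pi = half_widths(n)
    print(f"n={n:5d}  CI half={ci:.4f}  PI half={pi:.4f}  ratio={pi / ci:.2f}")
```

The ratio column equals √(n + 1) exactly, because z and σ cancel.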
Unknown σ²: replace by sample s and use the Student-t
In practice σ is rarely known. The standard fix is to replace it by the sample sd s and use the Student-t distribution instead of the Normal. The pivot becomes
T = (X_{n+1} − X̄) / [s · √(1 + 1/n)] ~ t_{n−1}.
This is the Student-t pivot for the PI, analogous to the Student-t pivot for the CI in §3.1. The proof uses independence of X̄ and s² for a Normal sample (Fisher 1925; Casella & Berger 2002, §5.3): (X_{n+1} − X̄) / [σ · √(1 + 1/n)] is standard Normal, (n − 1)s²/σ² is χ² with n − 1 df, independently, and their ratio (with the appropriate denominator scaling) is Student-t with n − 1 df.
The PI under unknown σ²:
X̄ ± t_{n−1, 1−α/2} · s · √(1 + 1/n).
The corresponding CI for μ shares the t multiplier:
X̄ ± t_{n−1, 1−α/2} · s/√n.
The only difference between the CI and PI formulae under unknown σ² is the multiplier on s: 1/√n for the CI, √(1 + 1/n) for the PI. At n = 20 and 95% nominal: t_{19, 0.975} ≈ 2.093. CI half-width = 2.093 · s/√20 ≈ 0.468s. PI half-width = 2.093 · s · √1.05 ≈ 2.145s. PI/CI = √21 ≈ 4.58. Same ratio as the known-σ² case (because the t multiplier cancels).
The Student-t pivot is the exact finite-sample answer for Normal data with unknown variance. For non-Normal data the t-based PI is only approximate: the CLT eventually handles X̄, but the next-draw distribution is NOT Normal, so the predictive part stays approximate even asymptotically. Robust / nonparametric PIs (later in this section) relax the Normality assumption.
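The unknown-σ intervals can be computed in a few lines. The sketch below is one possible standard-library implementation: the Student-t quantile is obtained by Simpson integration of the density plus bisection (an assumption made to avoid third-party dependencies; with SciPy installed, `scipy.stats.t.ppf` returns the same quantile). The helper names and the five-point sample are hypothetical.

```python
from math import exp, lgamma, pi, sqrt
from statistics import fmean, stdev

def t_pdf(x, df):
    """Student-t density with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """CDF for x >= 0: 0.5 plus Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile for 0.5 < p < 1 by bracketing and bisection on t_cdf."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:      # expand the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_intervals(sample, alpha=0.05):
    """Return (CI for mu, PI for the next draw) under the Normal model, sigma unknown."""
    n = len(sample)
    xbar, s = fmean(sample), stdev(sample)        # stdev uses the n - 1 denominator
    tq = t_quantile(1 - alpha / 2, n - 1)
    ci_half = tq * s / sqrt(n)
    pi_half = tq * s * sqrt(1 + 1 / n)
    return (xbar - ci_half, xbar + ci_half), (xbar - pi_half, xbar + pi_half)

sample = [4.1, 5.3, 4.8, 5.9, 5.0]                # hypothetical measurements
ci, pred = t_intervals(sample)
print("CI for mu:", ci)
print("PI for next draw:", pred)
```

Both intervals share the centre X̄ and the t multiplier; only the factor on s differs, so the width ratio is again √(n + 1).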
The ci-vs-pi-explorer widget
The first widget makes the CI–PI distinction visible. Pick a Normal model (μ, σ), a sample size n, and a confidence level. The widget draws one sample, computes the CI for μ and the PI for X_{n+1}, and plots both as horizontal bars against the x-axis. Above the bars it draws TWO densities: the SAMPLING DISTRIBUTION of X̄ (narrow, sd σ/√n) and the PREDICTIVE DISTRIBUTION of X_{n+1} (wider, sd σ). The reader can toggle between known σ (z multiplier) and unknown σ (t multiplier).
Things to verify in the widget:
- Start at n = 20, σ = 1, 95% confidence, σ known. The CI bar (green) sits tightly around X̄; the PI bar (blue) is roughly 4.6× wider, the √(n + 1) = √21 ratio. The green sampling-of-X̄ density is a tall narrow Gaussian; the blue predictive-of-X_{n+1} density is the wider unit-variance Gaussian. The widget reports the exact ratio in the table.
- Slide n up to 200. The CI half-width shrinks from 0.438 to 0.139. The PI half-width barely changes (2.008 to 1.965); it converges to z · σ = 1.96. Compare the table's "n → ∞ limit (half)" column: 0 for the CI, 1.96 for the PI. The ratio jumps from 4.58 to 14.18.
- Drop n to 3. The PI half-width inflates: √(1 + 1/3) ≈ 1.155, so PI half-width ≈ 1.96 · 1.155 ≈ 2.263. The CI inflates much more in relative terms: 1.96/√3 ≈ 1.132, about 2.6 times its n = 20 value. Both intervals are wide, but as n grows again the CI shrinks fast while the PI stays nearly put.
- Toggle σ to UNKNOWN. The widget switches to the t_{n−1, 1−α/2} quantile: at n = 20, t_{19, 0.975} ≈ 2.093 vs the z quantile 1.96. Both intervals widen by about 7%. At n = 3, t_{2, 0.975} ≈ 4.303, so both intervals widen by a factor of about 2.2; the small-sample t-correction is significant.
- Slide σ up from 1 to 2. The PI half-width doubles (it scales linearly with σ). The CI half-width also doubles. The PI floor moves from 1.96 to 3.92: the irreducible noise of the next draw scales with the true noise of the DGP.
- Re-roll the sample a few times. The CI moves around μ; sometimes it covers, sometimes it misses (it should cover in 95% of re-rolls under the assumed model). The PI also moves with X̄ but is wide enough that it almost always brackets the bulk of the predictive density. The "covers truth?" flag in the table tracks the CI; the PI flag is omitted because there is no single "truth" to check: the target is a random draw from the predictive density.
The misuse in literature: CIs reported when PIs were needed
Hahn & Meeker (1991), Statistical Intervals: A Guide for Practitioners, document the CI-as-PI confusion as the most common statistical interval error in applied work. The phrasing is the giveaway. A CI says "we estimate the mean is in [a, b]"; a PI says "the next observation will fall in [a, b]". Common misuse patterns:
- Clinical trials. "Based on our analysis, the next patient's blood-pressure response will fall in [a, b] with 95% confidence." This is a PI claim and needs the √(1 + 1/n) factor. The reported CI half-width is roughly 2σ/√n; the PI half-width is roughly 2σ, which is about √n times wider. For n = 100: the CI says ±0.2σ (about ±2 mmHg if σ = 10 mmHg); the PI says ±2σ (about ±20 mmHg), TEN TIMES wider.
- Manufacturing tolerance. "The 95% CI for the mean weight of a component is [a, b] grams." This bands the mean weight; if a quality engineer reads it as "any new component will weigh between a and b grams" they are wrong, because the new component weight has its OWN variance added on top.
- Forecasting. A CI for the mean of a forecast distribution is not the same as a prediction interval for the next realisation. Demand forecasting, financial returns, and engineering reliability all need PIs, not CIs.
The remedy is procedural: when describing where a future observation will fall, USE A PI. The Normal-PI formula is a one-line change from the CI formula: replace σ/√n with σ · √(1 + 1/n). The cost is trivial; the correctness is non-negotiable.
Prediction intervals in regression (a Part 4 preview)
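The CI vs PI distinction extends to regression and to any prediction model. In ordinary-least-squares regression with Y = xᵀβ + ε, ε ~ N(0, σ²), the predicted value at a NEW x_new is Ŷ = x_newᵀβ̂. Two parallel intervals exist:
CI for E[Y | x_new]: Ŷ ± z_{1−α/2} · √Var(x_newᵀβ̂)
PI for Y_new: Ŷ ± z_{1−α/2} · √(Var(x_newᵀβ̂) + σ²)
where Var(x_newᵀβ̂) = σ² · x_newᵀ(XᵀX)⁻¹x_new (with σ unknown, replace σ by s and z by the t quantile with n − p df, p being the number of coefficients).
Read the difference. The CI for E[Y | x_new] bands the REGRESSION-FUNCTION uncertainty: how far the fitted line is from the true conditional mean at x_new. The PI for Y_new bands the FUTURE-OBSERVATION uncertainty: where the actual Y_new will land. The PI adds σ² inside the square root, the same σ² as the residual noise. As n → ∞, the coefficient-uncertainty term Var(x_newᵀβ̂) → 0 (the regression coefficients are consistent); the residual term σ² DOES NOT shrink. The PI half-width converges to z_{1−α/2} · σ, the same intrinsic-noise floor as the iid Normal case.
In Part 4 this generalises to confidence vs prediction bands (curves of CI/PI half-widths plotted as a function of x). Both bands are widest at the EDGES of the training-x range and narrowest near the centroid; the PI/CI ratio behaves the other way, largest near the centroid (where the coefficient-uncertainty term is smallest) and smallest at the edges. For non-Normal errors the formulae are first-order approximations; conformal prediction (below) gives a distribution-free finite-sample replacement.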
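How the two regression half-widths vary with x can be sketched for the simplest case. The example below assumes simple linear regression with an intercept, known σ, and the standard leverage formula h(x) = 1/n + (x − x̄)²/S_xx, which equals x_newᵀ(XᵀX)⁻¹x_new in that setting; the function name and the design points 0, 1, …, 20 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

def regression_half_widths(x_train, x_new, sigma=1.0, alpha=0.05):
    """CI (for E[Y | x_new]) and PI (for Y_new) half-widths, known sigma."""
    n = len(x_train)
    x_bar = fmean(x_train)
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    leverage = 1 / n + (x_new - x_bar) ** 2 / sxx    # coefficient-uncertainty term / sigma^2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ci_half = z * sigma * sqrt(leverage)              # regression-function uncertainty
    pi_half = z * sigma * sqrt(1 + leverage)          # plus the next-draw noise sigma^2
    return ci_half, pi_half

x_train = [float(i) for i in range(21)]               # hypothetical design: 0, 1, ..., 20
for x_new in (10.0, 20.0, 30.0):                      # centroid, edge, extrapolation
    ci, pi = regression_half_widths(x_train, x_new)
    print(f"x_new={x_new:5.1f}  CI half={ci:.3f}  PI half={pi:.3f}  ratio={pi / ci:.2f}")
```

Both half-widths grow as x_new moves away from the centroid, while the ratio √(1 + 1/h) shrinks because the leverage h grows.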
Robust and bootstrap PIs
The Normal-PI formula assumes Normal data. For heavy-tailed or skewed populations the formula is mis-calibrated: it under-covers when the true distribution puts more probability in the tails than the assumed Normal does. Two non-parametric alternatives:
- Quantile-based PI. Read off the empirical α/2 and 1 − α/2 quantiles of the training sample directly. The PI is [q̂_{α/2}, q̂_{1−α/2}]. For large n this approaches the true marginal-distribution quantiles. No distributional assumption needed. Cost: requires n large enough that several training points fall in each tail (n well above 2/α) for stable tail quantiles.
- Bootstrap predictive distribution PI. Bootstrap the training sample to get bootstrap means X̄* and bootstrap residuals e* (resampled from X_i − X̄). The predictive distribution of X_{n+1} is approximated by the convolution of the bootstrap distribution of X̄* with the residual distribution. Quantile-based PI from this convolution gives a distribution-free PI (Davison & Hinkley 1997, §5.4).
Geisser (1993), Predictive Inference: An Introduction, develops the predictive-inference framework: the goal is the predictive distribution, not the parameter, and Bayesian and bootstrap methods give natural quantile-based PIs from posterior or empirical predictive distributions. The robust approach is wider than the Normal-PI under Normality (efficiency cost) but better calibrated under non-Normality (robustness gain).
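Both nonparametric recipes fit in a short sketch. The code below is an illustration under stated assumptions, not a reference implementation: the quantile rule is plain linear interpolation, the bootstrap version follows the convolution description above (bootstrap mean plus a resampled residual), and the skewed sample is simulated from an exponential distribution purely for demonstration.

```python
import random
from statistics import fmean

def empirical_quantile(sorted_values, p):
    """Linear-interpolation quantile of an already sorted list, 0 <= p <= 1."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def quantile_pi(sample, alpha=0.05):
    """PI read directly off the empirical alpha/2 and 1 - alpha/2 quantiles."""
    s = sorted(sample)
    return empirical_quantile(s, alpha / 2), empirical_quantile(s, 1 - alpha / 2)

def bootstrap_pi(sample, alpha=0.05, reps=5000, seed=0):
    """PI from quantiles of (bootstrap mean + resampled residual)."""
    rng = random.Random(seed)
    n = len(sample)
    xbar = fmean(sample)
    residuals = [x - xbar for x in sample]
    draws = []
    for _ in range(reps):
        boot_mean = fmean(rng.choices(sample, k=n))    # estimation uncertainty
        draws.append(boot_mean + rng.choice(residuals))  # plus next-draw variability
    draws.sort()
    return empirical_quantile(draws, alpha / 2), empirical_quantile(draws, 1 - alpha / 2)

rng = random.Random(42)
sample = [rng.expovariate(1.0) for _ in range(200)]    # hypothetical skewed data
q_lo, q_hi = quantile_pi(sample)
b_lo, b_hi = bootstrap_pi(sample)
print(f"quantile PI : [{q_lo:.3f}, {q_hi:.3f}]")
print(f"bootstrap PI: [{b_lo:.3f}, {b_hi:.3f}]")
```

Unlike X̄ ± z · s · √(1 + 1/n), neither interval is forced to be symmetric about the mean, which matters for skewed data like this.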
The pi-calibration widget: empirical coverage from train/test splits
Unlike the CI, which has no observable truth (the parameter is unknown), the PI has an observable truth: the next observation. PI calibration is EMPIRICALLY TESTABLE. The procedure is simple:
- Split the data (or simulate): n_train training points, n_test test points.
- Build the PI from the training data.
- Count the fraction of test points inside the PI.
- Average over many train/test splits (Monte-Carlo or cross-validation).
- Compare to nominal: a 95% PI should cover 95% of test points.
The pi-calibration widget runs this experiment. Pick a training distribution (Normal or heavy-tailed Student-t), a test distribution (Normal, Student-t, or shifted Normal for OOD), a training size n_train, a test size n_test, and the number of train/test splits. The widget runs the Monte-Carlo simulation and reports pooled empirical coverage with a Wilson-score 95% confidence band on the proportion estimate.
Things to verify in the widget:
- Start with Normal training and Normal test (matched) at 95% nominal. Pooled empirical coverage should be ≈ 95% with a tight Monte-Carlo band ([94%, 96%] or similar). The widget says "calibrated". The histogram of per-split coverage rates centres on 0.95. This is the canonical "PIs work" picture under correct distributional assumptions.
- Switch test distribution to "Normal shifted by +σ (OOD)". The pooled coverage falls well below nominal: at large n_train a shift of exactly σ leaves Φ(0.96) − Φ(−2.96) ≈ 83% coverage, and larger shifts push it lower (about 48% at 2σ). The widget flags "out-of-distribution test: training was Normal but the test draws come from a SHIFTED Normal." The PI assumes test draws are i.i.d. with the training distribution; OOD test draws break that assumption and coverage tanks.
- Switch test distribution to "Student-t (heavy tails)" with Normal training. The PI was built assuming Normal residuals but the test draws have Student-t tails (rescaled to unit variance). Coverage drops by 2–5 percentage points below nominal. The heavy tails put more probability beyond the PI endpoints than the Normal does, so the Normal-PI under-covers.
- Set BOTH train and test to Student-t. The Normal-PI formula still UNDER-COVERS, even with matched distributions, because the t-quantile correction only adjusts for ESTIMATION of σ from a Normal sample; it does NOT correct for non-Normal TAILS. The PI uses the t_{n−1} reference distribution but the true predictive distribution is the heavy-tailed one; the mismatch persists. Coverage typically sits below the 95% nominal level.
- Increase the number of splits from 500 to 2000. The Wilson-score Monte-Carlo band tightens by roughly a factor of 2 (its width scales like one over the square root of the number of pooled test points); statistical precision on the coverage estimate improves. The verdict ("calibrated" / "under-covers" / "over-covers") stabilises.
- Slide n_train from 5 to 200 (matched Normal). The t-correction shrinks as n_train grows; the PI converges to the z-based PI. Coverage stays ≈ 95% across n_train because under matched Normal the formula is exact: calibration does not depend on n_train if the model is correct.
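The train/test coverage experiment can be imitated offline. This is a minimal sketch, assuming the known-σ Normal PI (so the formula is exact under matched Normal data) and a mean-shifted Normal as the out-of-distribution case; `pooled_coverage` and its defaults are hypothetical, and the sketch does not reproduce the widget's Wilson band or heavy-tailed options.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def pooled_coverage(n_train=20, n_test=50, splits=2000, shift=0.0,
                    sigma=1.0, alpha=0.05, seed=0):
    """Fraction of test draws inside the known-sigma Normal PI, pooled over splits."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sigma * sqrt(1 + 1 / n_train)            # PI half-width
    inside = 0
    for _ in range(splits):
        train = [rng.gauss(0.0, sigma) for _ in range(n_train)]
        xbar = fmean(train)                             # PI is built from training data only
        lo, hi = xbar - half, xbar + half
        # test draws come from N(shift, sigma^2); shift = 0 is the matched case
        inside += sum(lo <= rng.gauss(shift, sigma) <= hi for _ in range(n_test))
    return inside / (splits * n_test)

matched = pooled_coverage(shift=0.0)
shifted = pooled_coverage(shift=1.0)                    # test mean moved up by one sigma
print(f"matched Normal coverage: {matched:.3f}")
print(f"shifted (+sigma) coverage: {shifted:.3f}")
```

The matched run should land near the nominal 95%; the shifted run should fall visibly below it, since the PI is still centred on the training mean.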
Conformal prediction: the modern distribution-free PI
The widget makes the failure modes of the Normal-PI visible. The cure, for distribution-free finite-sample coverage, is CONFORMAL PREDICTION. The framework was developed by Vovk, Gammerman, and Shafer (2005), Algorithmic Learning in a Random World, and made widely accessible to statistics by Lei, G'Sell, Rinaldo, Tibshirani, and Wasserman (2018), "Distribution-free predictive inference for regression," JASA 113(523), 1094–1111. The SPLIT-CONFORMAL recipe (Lei et al. 2018, §2):
- Split data into a TRAINING set D_train (size n_train) and a CALIBRATION set D_cal (size n_cal).
- Fit a regression / prediction model f̂ on D_train.
- Compute calibration residuals R_i = |Y_i − f̂(X_i)| for i ∈ D_cal.
- Let q̂ be the ⌈(1 − α)(n_cal + 1)⌉-th smallest residual.
- For a new x_new: PI = [f̂(x_new) − q̂, f̂(x_new) + q̂].
The theorem (Vovk-Gammerman-Shafer 2005; Lei et al. 2018): if the calibration pairs (X_i, Y_i) and the new pair (x_new, Y_new) are EXCHANGEABLE (in particular, iid), the resulting PI satisfies P(Y_new ∈ PI(x_new)) ≥ 1 − α in FINITE samples, distribution-free. The coverage holds for ANY base prediction model, including misspecified ones. The cost is the calibration set (data not used for fitting) and possibly looseness when f̂ is a poor fit (the residual distribution is wider).
Locally adaptive variants (CQR, conformal quantile regression; Romano-Patterson-Candes 2019) replace the constant width q̂ with x-dependent widths, which handles heteroscedasticity. Group-conditional and cross-conformal variants tighten further. Conformal prediction is now the standard finite-sample valid PI procedure for ML predictions; it gives the formal coverage guarantee that the Normal-PI cannot deliver under non-Normal data.
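The split-conformal recipe can be sketched end to end in a few lines. The example below is illustrative only: the base model is a one-predictor least-squares line, the data-generating process (y = 2x plus Student-t(3) noise) is a hypothetical choice made to show that no Normality is needed, and the names `fit_line`, `split_conformal`, and `draw` are not from any library.

```python
import random
from math import ceil
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope

def split_conformal(train, cal, alpha=0.1):
    """Return (predict, q_hat): the fitted model and the conformal half-width."""
    intercept, slope = fit_line(*zip(*train))            # fit on the training set only

    def predict(x):
        return intercept + slope * x

    residuals = sorted(abs(y - predict(x)) for x, y in cal)
    k = ceil((1 - alpha) * (len(cal) + 1))               # rank of the calibrated residual
    q_hat = residuals[k - 1] if k <= len(cal) else float("inf")
    return predict, q_hat

def draw(rng, n):
    """Hypothetical DGP: y = 2x + Student-t(3) noise (heavy-tailed)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        chi2 = sum(rng.gauss(0, 1) ** 2 for _ in range(3))
        noise = rng.gauss(0, 1) / (chi2 / 3) ** 0.5
        data.append((x, 2 * x + noise))
    return data

rng = random.Random(0)
train, cal, test = draw(rng, 500), draw(rng, 500), draw(rng, 2000)
predict, q_hat = split_conformal(train, cal, alpha=0.1)
coverage = sum(abs(y - predict(x)) <= q_hat for x, y in test) / len(test)
print(f"q_hat = {q_hat:.3f}, empirical coverage on test = {coverage:.3f}")
```

The PI for a new point is `predict(x_new) ± q_hat`. The guarantee is marginal (averaged over calibration and test draws), so a single run fluctuates around the 90% target rather than hitting it exactly.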
Try it
- In the ci-vs-pi-explorer, set n = 20, σ = 1, 95%, σ known. Read off the CI half-width and PI half-width from the table. Verify the PI/CI ratio √(n + 1) = √21. Compute by hand: CI half = 1.96/√20 ≈ 0.4383, PI half = 1.96 · √1.05 ≈ 2.0084, ratio = √21 ≈ 4.5826. Match the widget reading to four decimal places.
- Same widget. Slide n from 20 to 2000 in steps. Watch the CI half-width shrink toward 0 and the PI half-width converge toward 1.96. Plot mentally the two curves vs n: CI decays like 1/√n, PI asymptotes at z · σ = 1.96. Note the ratio √(n + 1) grows without bound.
- Same widget. Toggle σ unknown at n = 20. Note the t-quantile t_{19, 0.975} ≈ 2.093 vs the z-quantile 1.96; the small-sample correction widens both intervals by about 7%. Re-roll a few times to see the estimate s fluctuate; the CI and PI both inherit the noise.
- In the pi-calibration widget, set matched Normal training and Normal test at 95% nominal. Verify pooled empirical coverage ≈ 95% with a tight band. This is the "PI works under matched assumptions" reference case.
- Same widget. Switch test distribution to "Normal shifted by +σ". Observe that pooled coverage drops well below 95%. Argue: the PI is centred on X̄ ≈ μ but the test draws now have mean μ + σ, so the upper PI endpoint ≈ μ + 1.96σ sits only 0.96σ above the test mean and about 17% of the test draws land above it (about half would if the shift were 2σ).
- Same widget. Switch training to Student-t, test to Student-t (both heavy-tailed). Coverage falls below nominal because the Normal-PI assumed Normal tails. Compute the quantile that bounds the central 95% of the heavy-tailed distribution (rescaled to unit variance) and compare it with 1.96. The Normal-PI uses t_{n−1, 0.975} · s · √(1 + 1/n), which tracks the sample spread s, not the actual tail shape.
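- Pen-and-paper. State the variance decomposition for X_{n+1} − X̄ under iid Normality: Var(X_{n+1} − X̄) = σ² + σ²/n. Why are the two terms additive (not multiplicative)? Hint: independence of X_{n+1} and X̄.
- Pen-and-paper. Derive the regression PI: Ŷ ± z_{1−α/2} · √(σ² + σ² · x_newᵀ(XᵀX)⁻¹x_new). Compute Var(Y_new − Ŷ) = σ² + σ² · x_newᵀ(XᵀX)⁻¹x_new. Compare with the CI for E[Y | x_new], which only has the second term. Argue when the difference is large (small leverage x_newᵀ(XᵀX)⁻¹x_new) vs small (leverage large relative to 1).
- Pen-and-paper. Describe the split-conformal PI procedure: training, calibration, prediction. Why does the coverage hold in finite samples? Hint: exchangeability of the calibration residuals and the new residual R_new means the rank of R_new among all n_cal + 1 residuals is uniform on {1, …, n_cal + 1}, so P(R_new ≤ q̂) ≥ 1 − α.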
Pause and reflect: §3.4 has made the CI–PI distinction explicit. The CI is about a PARAMETER (the true mean μ, a regression coefficient, a probability), a fixed unknown. The PI is about a RANDOM VARIABLE (the next observation X_{n+1}, the next predicted Y_new); both endpoints AND target are random. For the Normal model with known σ², the difference is one algebraic factor: 1/√n for the CI, √(1 + 1/n) for the PI. The CI vanishes as n → ∞; the PI converges to the irreducible noise floor z · σ. The two widgets, ci-vs-pi-explorer and pi-calibration, make this visible and EMPIRICALLY TESTABLE: PI calibration is checkable from data, where CI calibration of an unknown parameter is not. §3.5 will pick up the broader calibration thread: when does a procedure that CLAIMS 95% coverage really deliver 95% coverage, and how do you check, across all the CI methodologies of §§3.1–3.3 and the PIs of §3.4?
What you now know
You can articulate the CONCEPTUAL distinction between a CI (about a parameter — a fixed unknown) and a PI (about a random variable — a future observation). You know the CI bands estimation uncertainty alone, while the PI bands estimation uncertainty PLUS the intrinsic variance of the next draw.
You can derive the Normal PI with known σ²: Var(X_{n+1} − X̄) = σ² · (1 + 1/n) by independence, leading to PI = X̄ ± z_{1−α/2} · σ · √(1 + 1/n). You can derive the unknown-σ² version via the Student-t pivot: PI = X̄ ± t_{n−1, 1−α/2} · s · √(1 + 1/n). You know the corresponding CIs have the factor √(1 + 1/n) replaced by 1/√n, and you know the LIMIT behaviour: CI half-width → 0, PI half-width → z · σ. The PI/CI ratio at finite n is √(n + 1).
You can identify the MISUSE in literature where authors report a CI when describing where a future observation will fall; Hahn & Meeker (1991) call this the most common error in applied statistical intervals. You know the remedy: use the √(1 + 1/n) factor and call it a PI.
You can state the regression PI: Ŷ ± z · √(Var(x_newᵀβ̂) + σ²), where the second term σ² is the next-draw noise that the CI for E[Y | x_new] does NOT include. You know that as n → ∞ the CI for the regression mean shrinks to 0 but the PI converges to the irreducible floor z · σ, the same shape as the iid case.
You can describe NONPARAMETRIC PIs: quantile-based PIs from the empirical CDF, bootstrap-based PIs from the predictive distribution, and the rationale (robustness to non-Normal data at the cost of needing larger n). You know Geisser (1993) developed the predictive-inference framework where these alternatives sit naturally.
You can describe PI CALIBRATION as an empirically testable property — coverage is verified by train/test splits or cross-validation, unlike CIs for unknown parameters which lack an observable truth. You can use the pi-calibration widget to verify that Normal-PI coverage matches nominal under matched Normal data, drops several percentage points under heavy tails, and collapses under out-of-distribution test draws.
You can describe CONFORMAL PREDICTION as the distribution-free finite-sample-valid PI procedure: split into training and calibration sets, compute calibration residuals, set the PI half-width to the appropriate empirical-residual quantile. Coverage holds under exchangeability for ANY base prediction model. Vovk, Gammerman, Shafer (2005) is the canonical reference; Lei et al. (2018) is the modern statistics-friendly treatment. Locally adaptive variants (CQR) handle heteroscedasticity.
Where this lands in the rest of Part 3 and the textbook. §3.5 takes CALIBRATION as a topic in its own right: when does a CI procedure that CLAIMS 95% coverage really deliver 95%, and how do you check across all the methodologies (Wald, bootstrap, profile-LRT, Normal-PI, conformal)? §3.6 closes Part 3 on the communication side: how to report uncertainty without lying. Part 4 (regression) develops the regression-PI machinery in full: predictor-dependent widths, confidence bands vs prediction bands, conformal prediction for regression. The √(1 + 1/n) factor you just learned generalises to √(1 + x_newᵀ(XᵀX)⁻¹x_new) in regression.
References
- Hahn, G.J., Meeker, W.Q. (1991). Statistical Intervals: A Guide for Practitioners. Wiley. (The standard practitioner reference. Chapter 2 distinguishes the four interval types (confidence, prediction, tolerance, and enclosure) and documents the CI-as-PI confusion as the most-cited error in applied work. Chapter 4 gives the Normal PI formulae with the √(1 + 1/n) factor; Chapter 5 covers regression PIs.)
- Faulkenberry, G.D. (1973). "A method of obtaining prediction intervals." Journal of the American Statistical Association 68(343), 433–435. (Early formal treatment of the Normal-model PI and its derivation via the predictive pivot. Cited as a foundational PI reference.)
- Geisser, S. (1993). Predictive Inference: An Introduction. Chapman & Hall. (The predictive-inference framework. Argues that the prediction of future observations is the natural object of statistical inference and develops Bayesian and bootstrap predictive distributions. Chapter 3 covers Normal-model PIs; Chapter 5 covers nonparametric / bootstrap PIs.)
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (§7.2 distinguishes CIs and PIs for the Normal mean and derives the formula. Readable introductory treatment.)
- Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (§9.2 develops the pivot-based interval framework; the prediction interval emerges as a pivot on X_{n+1} − X̄. Section 11.3 covers the regression PI.)
- Vovk, V., Gammerman, A., Shafer, G. (2005). Algorithmic Learning in a Random World. Springer. (The conformal-prediction monograph. Defines transductive and inductive conformal predictors; proves finite-sample distribution-free coverage under exchangeability.)
- Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R.J., Wasserman, L. (2018). "Distribution-free predictive inference for regression." Journal of the American Statistical Association 113(523), 1094–1111. (The modern statistics-friendly conformal-prediction treatment. Develops split conformal and full conformal prediction for regression with finite-sample distribution-free coverage guarantees, and compares them with jackknife prediction intervals.)
- Romano, Y., Patterson, E., Candes, E.J. (2019). "Conformalized quantile regression." NeurIPS 2019. (Conformal quantile regression: locally adaptive PIs that handle heteroscedasticity. The state-of-the-art conformal PI for regression with predictor-dependent width.)
- Davison, A.C., Hinkley, D.V. (1997). Bootstrap Methods and Their Application. Cambridge University Press. (§5.4 covers bootstrap predictive distributions and quantile-based bootstrap PIs. The practical reference for nonparametric PIs.)