Exact vs asymptotic CIs

Part 3, Confidence intervals and uncertainty

Learning objectives

State the FREQUENTIST definition of a confidence interval as a PROCEDURE: for confidence level 1 − α, the procedure C(X) = [L(X), U(X)] satisfies P_θ(θ ∈ C(X)) ≥ 1 − α for every θ, coverage is a property of the random PROCEDURE under the true parameter, NOT a probability statement about a single computed [a, b]
Recognise that 'I am 95% confident θ ∈ [a, b]' is a frequentist sloppy short-hand; the formally correct reading is 'the procedure that produced [a, b] covers θ in 95% of repeated samples'
Derive the WALD CI θ̂ ± z_{1 − α/2} · SÊ(θ̂) from the asymptotic normality result (CLT, §1.6, §1.9) and recognise it as the asymptotic CI
List the regimes where Wald BREAKS DOWN: parameter values near a boundary (Binomial p near 0 or 1; Poisson λ near 0), highly skewed sampling distributions, small n
State the CLOPPER-PEARSON (1934) exact CI for Binomial p by inverting the binomial test: [Beta(α/2; k, n − k + 1), Beta(1 − α/2; k + 1, n − k)]; note the EXACT coverage guarantee P(θ ∈ C) ≥ 1 − α (NOT equality), the CI is CONSERVATIVE
State the WILSON (1927) score CI: (p̂ + z²/(2n) ± z·√(p̂(1-p̂)/n + z²/(4n²)))/(1 + z²/n). Recognise it as the inversion of the SCORE test, with coverage close to nominal across most of [0, 1] and no need for boundary clipping Agresti-Coull (1998) and Brown-Cai-DasGupta (2001) endorse it as the modern default
Preview the LIKELIHOOD-RATIO CI (§3.3 develops it): {θ : 2[ℓ(θ̂) − ℓ(θ)] ≤ χ²_{1, 1−α}}. Recognise it as the general-purpose finite-sample CI that often outperforms both Wald and exact CIs for irregular models
Define COVERAGE PROBABILITY as the actual fraction of times the procedure covers θ in repeated samples; distinguish NOMINAL from ACTUAL coverage; show that Wald under-covers and Clopper-Pearson over-covers, with Wilson closest to nominal (Brown-Cai-DasGupta 2001)
State the EXACT POISSON CI (gamma-based, Garwood 1936): lo = ½χ²_{2k, α/2}/n, hi = ½χ²_{2(k+1), 1−α/2}/n; recognise it as the gamma-Poisson conjugacy applied to the rate λ
State the NORMAL μ CI (Student-t, Gosset 1908): X̄ ± t_{1−α/2, n−1}·s/√n; recognise it as exact under Normality (the assumption is what 'exact' means)
Recognise the difference between a two-sided CI [L, U] and a one-sided UPPER or LOWER confidence BOUND (∞-or 0-truncated); state when each is appropriate, and recall that the bioequivalence machinery in §2.7 used a one-sided / TOST construction
Distinguish FREQUENTIST CIs (a procedure property) from BAYESIAN CREDIBLE intervals (P(θ ∈ [a, b] | data), a posterior-probability statement); both can take the same numeric value yet they answer different questions, and §7 develops the Bayesian alternative

Part 2 spent eight sections on HYPOTHESIS TESTS, the Neyman-Pearson decision rule, the p-value, multiple-testing corrections, the equivalence framework, and the replication crisis that follows when those disciplines are absent at the literature level. Part 3 turns to the DUAL inferential object: the CONFIDENCE INTERVAL. A test answers "is the parameter different from the null?"; a CI answers "given the data, what range of parameter values is consistent with what we observed?". The two are mathematically equivalent, every CI inverts a test, every test corresponds to a CI, but the CI carries an effect size and an uncertainty band on the SAME object, which is exactly the reporting recommendation Wasserstein, Schirm, Lazar (2019, American Statistician) made the centrepiece of "Moving to a world beyond p < 0.05".

The §3.1 task is to lay out the CI FRAMEWORK and the two complementary roads into it. The asymptotic road, the Wald CI θ̂ ± z·SÊ(θ̂), is the workhorse: it works for any estimator with a tractable standard error and an asymptotic normality result. It dominates introductory textbooks because it is short and mnemonic. The exact road, Clopper-Pearson (1934) for the binomial, the gamma-based CI (Garwood 1936) for the Poisson, the Student-t CI (Gosset 1908) for the Normal mean, gives an interval whose coverage is guaranteed to hit the nominal level on the parameter space, not merely in the n → ∞ limit. Both have failure modes. The §3.1 widgets make them visible.

The arc has twelve stops. First, the formal definition of a CI and the SLOPPY-SHORT-HAND warning. Second, the Wald CI derivation from CLT-style asymptotics. Third, where Wald breaks: boundaries, skew, small n. Fourth, the Clopper-Pearson exact CI for the binomial. Fifth, the Wilson score CI and why it is the modern default. Sixth, the EXACT POISSON CI and the Normal-μ Student-t CI. Seventh, a preview of the likelihood-ratio CI (which §3.3 develops in depth). Eighth, COVERAGE PROBABILITY, nominal vs actual, and the Brown-Cai-DasGupta (2001) verdict on Wald. Ninth, the ci-methods-comparison widget: one sample, four CIs, all drawn on the same axis. Tenth, the coverage-explorer widget: EXACT coverage vs true p, by full summation over k = 0, …, n. Eleventh, one-sided CIs and the bioequivalence cross-link. Twelfth, the credible-interval cross-link to Part 7.

What a confidence interval IS, and is NOT

The formal definition (Neyman 1937, Phil. Trans. R. Soc. A 236, 333-380) is operational. Let $X = (X_1, \ldots, X_n)$ be the sample and $\theta$ the parameter of interest. A confidence interval procedure at level $1 - \alpha$ is a pair of functions $L(X) \le U(X)$ such that, for EVERY parameter value $\theta$ ,

P_{\theta}\bigl(L(X) \le \theta \le U(X)\bigr) \;\ge\; 1 - \alpha.

Read the LHS slowly. The randomness lives in $X$ ; the CI bounds $L(X), U(X)$ are random because they are functions of $X$ . The parameter $\theta$ is FIXED but unknown. The probability is taken over the distribution of $X$ when $\theta$ is the true value. So the statement is: over the long run of imagined replications of the experiment, the procedure that produces these bounds covers the true $\theta$ at least $1 - \alpha$ of the time.

The deceptive part is what happens AFTER you observe a single dataset. Suppose you collect data, compute $L(x) = 0.42$ and $U(x) = 0.61$ , and write down "95% CI: [0.42, 0.61]". The temptation is to say "I am 95% confident $\theta$ is in [0.42, 0.61]" or, worse, "the probability that $\theta \in [0.42, 0.61]$ is 0.95". Both are technically WRONG under the frequentist reading. After you observe the data, the bounds $0.42, 0.61$ are FIXED numbers; $\theta$ is also a fixed number; the event " $\theta \in [0.42, 0.61]$ " either happens or does not happen, it has no probability between 0 and 1 from a frequentist standpoint. The 95% lives in the PROCEDURE, not in the post-data interval.

The honest reading is: "the procedure that produced this interval covers $\theta$ in 95% of repeated samples". The procedure is what is 95%-reliable. The single interval [0.42, 0.61] either contains $\theta$ or it does not, and we cannot know which without seeing the truth. Calling [0.42, 0.61] "the 95% CI" is convention, every applied textbook uses the phrase, but the underlying object that carries the 95% is the procedure $(L, U)$ , not the realised numbers. The widgets in this section drive that distinction home: when you re-roll a simulation, you see DIFFERENT realised intervals from the SAME procedure, and the long-run fraction that covers $\theta$ is what the 95% refers to.

Bayesian credible intervals (Part 7) make a different statement: given a prior on $\theta$ and the data, $P(\theta \in [a, b] \mid X) = 1 - \alpha$ DOES hold for the realised interval. The numeric value can be the same as a frequentist CI in many problems (Jeffreys-prior Bayesian intervals for the binomial coincide with mid-P CIs to high accuracy at moderate n), but the conceptual content is different. Part 7 develops that distinction. For now, every CI in this section is a frequentist procedure, and "95%" is its long-run coverage rate.

The Wald CI: derived from asymptotic normality

The asymptotic CI most readers see first is the WALD CI. Its derivation is one line of §1.6 + §1.9 machinery. Let $\hat\theta_n$ be an estimator with an asymptotic-normality result of the form

\sqrt{n}\,(\hat\theta_n - \theta) \;\xrightarrow{d}\; N\bigl(0, \sigma^2(\theta)\bigr) \qquad \text{as } n \to \infty.

Examples: the sample mean $\bar X$ for any iid population with finite variance (CLT, §0.7); the MLE for a regular model (§1.3 + §1.9); the sample proportion $\hat p = k/n$ for the binomial (special case of the CLT with $\sigma^2 = p(1-p)$ ). In every case, for large $n$ the distribution of $\hat\theta_n$ is approximately $N\bigl(\theta, \sigma^2(\theta) / n\bigr)$ . Plug in a consistent estimate $\widehat{\mathrm{SE}}(\hat\theta_n) = \hat\sigma_n/\sqrt{n}$ (typically $\hat\sigma_n = \sigma(\hat\theta_n)$ ; this is the "plug-in" step) and you get the asymptotic pivot

\frac{\hat\theta_n - \theta}{\widehat{\mathrm{SE}}(\hat\theta_n)} \;\xrightarrow{d}\; N(0, 1).

Invert this to construct an interval. For nominal coverage $1 - \alpha$ in a two-sided test, the central-95% region of a standard Normal is $(-z_{1-\alpha/2}, z_{1-\alpha/2})$ with $z_{0.975} \approx 1.96$ . Then

P\!\left(\hat\theta_n - z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta_n) \;\le\; \theta \;\le\; \hat\theta_n + z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta_n)\right) \;\to\; 1 - \alpha.

The Wald CI at level $1 - \alpha$ is therefore

$\boxed{;\hat\theta_n ;\pm; z_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}(\hat\theta_n);}$ .

For the binomial proportion with $\hat p = k/n$ this reads $\hat p \pm z\sqrt{\hat p(1-\hat p)/n}$ . For the Poisson rate with $\hat\lambda = k/n$ (total events $k$ over $n$ unit-exposure observations) it reads $\hat\lambda \pm z\sqrt{\hat\lambda/n}$ . For the Normal mean with known $\sigma$ it reads $\bar X \pm z,\sigma/\sqrt n$ . In every case the form is the same: point estimate ± normal quantile × estimated standard error.

The Wald CI is the FIRST CI most readers learn because it is short, generic, and works for any estimator that has an asymptotic-normality theorem. It is the standard reporting form in regression output, generalised linear models (Part 5), and most software. Wasserman (2004, All of Statistics, Chapter 6) gives it as the default; Casella-Berger (2002, Chapter 9.1) discuss it under "inverting a pivot". For large $n$ and a parameter value in the interior of its space, the Wald CI is hard to improve on.

Where Wald breaks: boundaries, skew, small n

The "for large $n$ and a parameter value in the interior" qualifier is doing a lot of work in the previous paragraph. Three regimes spoil the Wald approximation; each is the canonical motivation for the exact and score alternatives.

Parameter values near a boundary. The Wald CI has the form $\hat\theta \pm z\cdot \widehat{\mathrm{SE}}$ , which is symmetric and unbounded. For binomial $p$ this is a problem because $p \in [0, 1]$ : when $\hat p$ is close to 0 (e.g., $k = 1$ event in $n = 50$ ), the Wald CI extends below 0 (or extends close to 0, where the $\sqrt{\hat p(1 - \hat p)}$ term itself is small, so the interval becomes too narrow). Many implementations CLIP the interval to [0, 1] post hoc, which makes the coverage non-monotone and reproduces the famous Brown-Cai-DasGupta (2001) sawtooth. The exact and score CIs handle boundaries by construction.
Highly skewed sampling distributions. Wald assumes the sampling distribution of $\hat\theta$ is well-approximated by a symmetric Normal. For estimators whose sampling distribution is skewed, Poisson rate at low counts (right-skewed), ratio of two means with small denominators (extreme tails), MLEs for irregular models, the symmetric ± width misrepresents the actual uncertainty. The exact CIs are typically asymmetric and capture this; the LRT-based CIs do so via the non-quadratic shape of the log-likelihood.
Small n. Asymptotic normality is a limit theorem: it holds AS $n \to \infty$ . For finite $n$ the quality of the approximation depends on the parameter (the "third moment" / Berry-Esseen rate, see §1.9). For the binomial with $n = 30$ and $p \approx 0.05$ , the Wald CI under-covers by 10+ percentage points (Brown-Cai-DasGupta 2001 Table 2). The Student-t correction (Gosset 1908) is the simplest small- $n$ fix for the Normal mean; for other distributions the LRT or exact CIs are the analogues.

The diagnostic for "is Wald OK here?" is therefore: how close is the parameter to a boundary, how skewed is the sampling distribution, and how large is $n$ ? When ALL three are favourable (interior parameter, near-symmetric distribution, large $n$ ), the Wald CI matches the exact and score CIs to within Monte Carlo noise. When ANY one of them fails, Wald can be badly off.

The Clopper-Pearson exact CI (binomial)

Clopper and Pearson (1934, Biometrika 26(4), 404-413) constructed an EXACT confidence interval for the binomial proportion by inverting the binomial test. The construction is:

Lower bound $p_L$ : the smallest $p$ such that $P(\text{Binomial}(n, p) \ge k) \ge \alpha/2$ . Upper bound $p_U$ : the largest $p$ such that $P(\text{Binomial}(n, p) \le k) \ge \alpha/2$ .

Closed form via the gamma-beta duality:

$p_L = \text{Beta}\bigl(\alpha/2;;k,;n - k + 1\bigr)$ (the $\alpha/2$ -quantile of a Beta distribution)

p_U = \text{Beta}\bigl(1 - \alpha/2;\;k + 1,\;n - k\bigr)

with the boundary conventions $p_L = 0$ when $k = 0$ and $p_U = 1$ when $k = n$ . The interval is GUARANTEED to satisfy $P_p(p \in C) \ge 1 - \alpha$ for every $p \in [0, 1]$ . Read the inequality carefully: the coverage is AT LEAST nominal. The actual coverage is typically ABOVE the nominal level, Clopper-Pearson is CONSERVATIVE. The over-coverage is the price for a discrete sample space (the test cannot achieve exactly $\alpha/2$ on each side because the binomial probability mass function only takes a finite set of values).

Two things to know about Clopper-Pearson. (1) When you have ZERO events ( $k = 0$ ), the lower bound is exactly 0 and the upper bound is the so-called "rule of three" approximation: for 95% CI, $p_U \approx 3/n$ at large $n$ (Hanley & Lippman-Hand 1983, JAMA). This is the canonical answer to "no events in $n$ trials, what is the upper bound for $p$ ?". (2) Some software reports a MID-P version (Lancaster 1961) that splits the discrete probability mass at $k$ in half, reducing the conservatism. Both are exact in different conventions; both are widely used.

The Wilson score CI (binomial)

Wilson (1927, JASA 22(158), 209-212) constructed a CI by inverting the SCORE test rather than the Wald test. The score test is the test that uses $\sqrt{n}(\hat p - p)/\sqrt{p(1-p)}$ , the standard error computed UNDER THE NULL $p_0$ , not at the point estimate $\hat p$ . Setting that pivot equal to $\pm z$ and solving for $p$ gives a quadratic in $p$ ; the two roots are the CI endpoints:

C^{\mathrm{Wilson}} \;=\; \frac{\hat p + z^2/(2n) \;\pm\; z\,\sqrt{\hat p(1-\hat p)/n + z^2/(4n^2)}}{1 + z^2/n}.

Three properties make Wilson the modern default:

Coverage close to nominal across most of [0, 1]. Brown, Cai, DasGupta (2001, Statistical Science 16(2), 101-117) computed exact coverage probabilities for n up to several hundred and a fine grid of p values. Wilson coverage stays within ± 1 percentage point of nominal almost everywhere; the rare deviations are at small $n$ and very small $p$ .
Bounded inside [0, 1] by construction. The denominator $(1 + z^2/n)$ and the $z^2/(2n)$ pull-toward-1/2 effect mean the bounds never escape [0, 1], even when $k = 0$ or $k = n$ . No post-hoc clipping is needed.
Closed form. Unlike Clopper-Pearson, no incomplete-beta call is required, Wilson is a single quadratic-root formula.

Agresti and Coull (1998, American Statistician 52(2), 119-126) made the empirical case explicitly: "Approximate is better than 'exact' for interval estimation of binomial proportions". Their Agresti-Coull modification, apply the Wald formula to $\tilde p = (k + z^2/2)/(n + z^2)$ with $\tilde n = n + z^2$ , is almost numerically identical to Wilson at $z = 1.96$ (95%) and is sometimes preferred for its mnemonic simplicity (the "add two successes, add two failures" rule). Brown-Cai-DasGupta (2001) recommend Wilson or Agresti-Coull as the universal binomial CI default; the Wald CI's near-boundary behaviour disqualifies it from textbook teaching, in their view.

Exact CIs for Poisson and Normal

The Clopper-Pearson logic generalises beyond the binomial. Two other settings carry their own exact CI machinery:

Poisson rate (Garwood 1936, Biometrika). If $k$ events are observed over $n$ unit-exposure observations, $k \sim \text{Poisson}(n\lambda)$ . Inverting the Poisson test (which uses gamma-distribution tails via the gamma-Poisson identity) yields the exact CI for the rate $\lambda$ :

\lambda_L = \frac{1}{2n}\;\chi^2_{2k,\,\alpha/2}, \qquad \lambda_U = \frac{1}{2n}\;\chi^2_{2(k+1),\,1-\alpha/2}.

with $\lambda_L = 0$ when $k = 0$ . The interval is conservative (always ≥ nominal coverage) and asymmetric, the upper bound is further from $\hat\lambda$ than the lower, reflecting the right-skew of the Poisson sampling distribution at low counts.

Normal mean (Gosset 1908, Biometrika, published under the pseudonym "Student"). When the data are iid $N(\mu, \sigma^2)$ with $\sigma$ ESTIMATED from the data as $s$ , the pivot $T = (\bar X - \mu)/(s/\sqrt{n})$ has an exact Student-t distribution with $n - 1$ degrees of freedom. The CI is

\bar X \;\pm\; t_{1-\alpha/2,\,n-1} \cdot s/\sqrt{n}.

The Student-t correction widens the interval for small $n$ : $t_{0.975, 5} \approx 2.57$ vs $z_{0.975} \approx 1.96$ . For $n \ge 30$ the two are within 3% of each other and the z-based and t-based intervals converge. The "exactness" relies on the Normality assumption, under non-Normal data the t-CI is approximate (by CLT, which is why it still works well at moderate $n$ ).

For other distributions the construction is the same in principle (invert a test based on the sampling distribution of $\hat\theta$ ), but the algebra is messier. Exponential rate: invert via gamma tails. Geometric: invert via negative-binomial tails. Multinomial: simultaneous CIs via Bonferroni or the Goodman (1965) procedure. The pattern, "express the test's acceptance region as an inequality in $\theta$ , solve for $\theta$ , that's the CI", is universal. Casella-Berger (2002, Chapter 9.2) call this the "inverting a test" approach and walk through several examples.

Profile likelihood and LRT CIs, preview

Section §3.3 develops the LIKELIHOOD-RATIO TEST CI in depth. For now, the preview is enough to round out the §3.1 toolkit. For a regular model with likelihood $L(\theta; X) = \prod_i f(X_i; \theta)$ and log-likelihood $\ell(\theta) = \log L(\theta)$ , the generalised likelihood ratio test (Wilks 1938) rejects $H_0: \theta = \theta_0$ when

2\,[\ell(\hat\theta) - \ell(\theta_0)] \;>\; \chi^2_{1,\,1-\alpha}.

Inverting this test gives the LRT-based CI:

C^{\mathrm{LRT}} \;=\; \left\{\theta:\; 2\,[\ell(\hat\theta) - \ell(\theta)] \;\le\; \chi^2_{1,\,1-\alpha}\right\}.

For Binomial $p$ this becomes the set of $p$ satisfying $2[k\log(\hat p/p) + (n-k)\log((1-\hat p)/(1-p))] \le \chi^2_{1, 1-\alpha}$ , with $\hat p = k/n$ . For Poisson rate this becomes $2[k\log(\hat\lambda/\lambda) - n(\hat\lambda - \lambda)] \le \chi^2_{1, 1-\alpha}$ . In both cases the CI is asymmetric, follows the shape of the likelihood, and is generally well-calibrated even for small $n$ . Cox and Hinkley (1974, Theoretical Statistics) and Lehmann and Romano (2005, Testing Statistical Hypotheses) make the case that LRT CIs are the "best general-purpose" finite-sample CI for regular models.

The §3.3 section develops the multi-parameter PROFILE-LIKELIHOOD generalisation (profile out nuisance parameters; the resulting one-dimensional log-profile-likelihood inversions inherit the $\chi^2_1$ calibration). For §3.1 we include the LRT CI as a fourth bar in the ci-methods-comparison widget, so the reader can see how it tracks Wilson and the exact CIs across parameter regions.

Coverage probability, nominal vs ACTUAL

The single most important diagnostic for any CI procedure is its COVERAGE PROBABILITY: the actual fraction of times the procedure covers $\theta$ when $\theta$ is the true parameter. The NOMINAL level $1 - \alpha$ is what the procedure claims. The ACTUAL coverage may differ. The gap between nominal and actual is the CALIBRATION ERROR (§3.5 develops this in depth).

For a discrete sampling distribution like Binomial $(n, p)$ , the actual coverage can be COMPUTED EXACTLY by summation. The coverage of a CI procedure $C(\cdot)$ at parameter $p$ is

C(p) \;=\; \sum_{k=0}^{n}\,\mathbb{1}\!\left[p \in C(k)\right]\,\binom{n}{k}\,p^k(1-p)^{n-k}.

Sum over the $n + 1$ possible $k$ values, count which CIs contain $p$ , weight by the binomial PMF, done. No Monte Carlo error. This is what the coverage-explorer widget below computes for each method on a fine grid of $p$ values.

The Brown-Cai-DasGupta (2001) verdict on the four binomial CIs, after running this exact-coverage computation for $n$ from 5 to 100 and the full $p \in (0, 0.5]$ range (by symmetry the $(0.5, 1)$ half is the mirror image):

Wald, chaotic. The coverage oscillates as $p$ varies, sometimes dropping 10+ percentage points below nominal, especially near the boundary. The oscillation is the discrete-binomial signature: each step in $k$ shifts the Wald CI bounds discontinuously, and the coverage as a function of $p$ takes on a sawtooth pattern. Brown-Cai-DasGupta recommend RETIRING Wald from textbook teaching for the binomial.
Wilson score, close to nominal. Coverage stays within ± 1 percentage point of nominal across most of [0, 1], with the only visible deviations near $p = 0$ and $p = 1$ where the discrete-binomial mass is concentrated. This is the recommended default.
Clopper-Pearson, always over-covers. The exact CI ALWAYS attains AT LEAST nominal coverage; the actual coverage typically sits at 96-98% rather than the claimed 95%. The over-coverage is the price for a strictly-conservative procedure.
Agresti-Coull, nearly identical to Wilson at 95%. Brown-Cai-DasGupta note Agresti-Coull is a smooth approximation to Wilson; the two are interchangeable for most applied work.

The "exact" name in "exact CI" is a guarantee about coverage being AT LEAST nominal, it does not mean the coverage IS exactly nominal. Conversely, "approximate" CIs (Wilson) can be CLOSER to nominal coverage on average than exact CIs, because they do not pay the discrete-conservatism premium. This is the heart of the Agresti-Coull paper title.

The first widget puts all four CIs on a single canvas so the reader can see how dramatically they can DISAGREE on the same data. Pick a parameter (Binomial p, Poisson λ, or Normal μ), the sample size n, the TRUE parameter value, and the confidence level. The widget draws ONE simulated sample and computes:

Wald CI, $\hat\theta \pm z \cdot \widehat{\mathrm{SE}}(\hat\theta)$ (asymptotic, the workhorse).
Score / Wilson CI, closed-form (binomial / Poisson) score inversion.
Exact CI, Clopper-Pearson (binomial), Garwood gamma (Poisson), Student-t (Normal).
Profile-LRT CI, the LRT inversion against the $\chi^2_{1, 1-\alpha}$ threshold.

Each CI is drawn as a horizontal bar on a common axis. The dashed blue line is the TRUE parameter value (the truth). The white vertical mark is the point estimate. Re-roll the dataset to draw a fresh sample under the same (n, true parameter, confidence) settings.

Things to verify in the widget:

Start at Binomial p, n = 30, true p = 0.10. Click Re-roll a few times. Note how the four bars often AGREE qualitatively, all four typically contain the truth, but disagree on width. Clopper-Pearson is the widest; Wald the narrowest; Wilson and profile-LRT sit in between.
Slide true p down to 0.02. Wald sometimes returns a lower bound clipped to 0 (the formula $\hat p - z\sqrt{\hat p(1-\hat p)/n}$ would have produced a NEGATIVE number; we cap to 0). Wilson and profile-LRT stay above 0 by construction. Clopper-Pearson is the widest. Re-roll: on some samples Wald MISSES the truth entirely (the bar lies entirely above p = 0.02 because k happened to be 2 or 3, pushing $\hat p$ enough above 0.02 that the symmetric ± width does not reach down to 0.02). The other three rarely miss.
Switch to Poisson λ with n = 20, true λ = 0.5. Re-roll. Note how the gamma-exact and profile CIs are ASYMMETRIC about $\hat\lambda$ , wider on the right than the left, because the Poisson sampling distribution is right-skewed at small $n\lambda$ . Wald is symmetric and misrepresents this skew. The "covers truth?" column shows that Wald has more "no" entries near the boundary than the other methods.
Switch to Normal μ with n = 8, true μ = 0, true σ = 1. The z-based Wald CI (using σ as if known) is exact under Normality. The Student-t CI (using s estimated from the data) is wider because t_{0.975, 7} ≈ 2.36 vs z_{0.975} ≈ 1.96. For genuinely-unknown σ the t-CI is the correct procedure; the z-CI under-covers (it pretends σ is known when it is estimated). Slide n up to 100: the two converge.
Pick a setting where exactly one method misses (e.g., binomial, n = 20, true p = 0.05, re-roll to find a sample with k = 0). With k = 0 the Wald CI is [0, 0] (the SE is 0 because $\hat p(1-\hat p) = 0$ !), Wald has COMPLETELY collapsed. Wilson returns a non-trivial upper bound (the $z^2/(2n)$ term saves it). Clopper-Pearson returns $[0, 1 - (\alpha/2)^{1/n}] \approx [0, 3.7/n]$ , the famous "rule of three" answer. This is the canonical Wald failure mode.

The second widget reverses the question. Instead of one sample with multiple CIs, it fixes the CI procedure and varies the parameter, computing the EXACT coverage probability at every value via the $\sum_k \mathbb{1}[p \in C(k)],\binom{n}{k} p^k(1-p)^{n-k}$ formula. No Monte Carlo error. The output is the "coverage curve" the Brown-Cai-DasGupta (2001) paper made famous.

Things to verify in the widget:

At n = 30, 95% nominal, with Wald + Wilson + Clopper-Pearson selected: the Wald (red) curve oscillates and drops as low as 80-85% near $p = 0.05$ , far below the 95% nominal line. Wilson (yellow) stays inside the 94-96% band almost everywhere. Clopper-Pearson (green) sits above 95% across the whole range, hitting 98+% in some regions.
Slide n up to 100. The Wald oscillation tightens and the overall coverage approaches nominal, the asymptotic-normality approximation is now working harder for it. The other three already hugged nominal at n = 30 and barely change.
Slide n down to 10. The Wald curve becomes WILD, coverage drops below 70% in some regions. Wilson still mostly works (within 90-95%). Clopper-Pearson now over-covers more dramatically (some regions at 99+%). This is the "small n where boundary matters" regime where the choice of method makes the biggest practical difference.
Toggle Agresti-Coull on. Note that it tracks Wilson almost exactly at 95% nominal, Brown-Cai-DasGupta (2001) and Agresti-Coull (1998) note this near-equivalence. At 99% nominal, they diverge slightly: Agresti-Coull is a touch more conservative.
Move the "Point-evaluation true p" slider. The numeric table updates with the exact coverage of each method at that p. Wald typically shows in red (UNDER-covers); Clopper-Pearson in yellow (over-covers); Wilson in green (≈ nominal). The verdict column quantifies what the curves show.
Try 90% nominal vs 99% nominal. At 99% the Wald sawtooth becomes more pronounced (the boundary effect is larger relative to the smaller $\alpha$ ). At 90% nominal Wald is less catastrophic but still oscillates. The take-away: the Wald deficiency is structural, not specific to the 95% convention.

One-sided CIs and the §2.7 cross-link

A confidence interval has TWO bounds. A confidence BOUND has one. A one-sided UPPER bound $U(X)$ at level $1 - \alpha$ satisfies $P_\theta(\theta \le U(X)) \ge 1 - \alpha$ for every $\theta$ ; a one-sided LOWER bound is the mirror. The post-hoc form is "we are 95% confident $\theta \le 0.40$ " or "we are 95% confident $\theta \ge 0.05$ ", with the usual frequentist sloppy-short-hand caveat.

When to use which?

Two-sided CI, the default for "what range of $\theta$ is consistent with the data?". Most reporting uses this.
One-sided UPPER bound, when only an upper limit is scientifically meaningful. Canonical example: bounding a rare-event rate ("the rate of complications is at most …"); regulatory non-inferiority studies (the new drug's relative risk is at most 1.05); detection-limit settings ("the contamination is at most …").
One-sided LOWER bound, when only a lower limit is meaningful. Canonical example: bounding a sensitivity or coverage rate ("the test's true-positive rate is at least 0.90"); shelf-life lower bounds.

The §2.7 BIOEQUIVALENCE machinery was effectively a TWO ONE-SIDED test pair: the equivalence claim succeeds iff the $1 - 2\alpha$ two-sided CI for the difference fits inside the margin $(-\delta, +\delta)$ . Equivalently: BOTH one-sided tests reject at level $\alpha$ . The Schuirmann (1987) TOST procedure exploits this dual. The §3.1 lesson: every two-sided CI has a one-sided sibling, and the appropriate one depends on the research question.

One CRITICAL caveat: the one-sided $1 - \alpha$ confidence bound is DIFFERENT from a two-sided $1 - \alpha$ CI's endpoint. The one-sided 95% upper bound corresponds to the upper end of a TWO-SIDED 90% CI, not a 95% CI (because 95% one-sided uses $z_{0.95} = 1.645$ , not $z_{0.975} = 1.96$ ). Confusing the two is a common error, see the FDA guidance documents on confidence-interval reporting for examples.

Confidence vs credible intervals, the cross-link to Part 7

A BAYESIAN CREDIBLE INTERVAL at level $1 - \alpha$ is an interval $[a, b]$ such that, GIVEN A PRIOR $\pi(\theta)$ and the data $X$ ,

P\bigl(\theta \in [a, b]\;\big|\; X\bigr) \;=\; 1 - \alpha.

Read the LHS as a POSTERIOR probability: a probability distribution on $\theta$ , conditioned on the observed data, integrated over $[a, b]$ . The credible interval makes the statement "the probability that $\theta \in [a, b]$ is 95%" formally TRUE, exactly the statement the frequentist CI cannot make.

Three things to know about the comparison:

Different conceptual meaning. Frequentist CI: a property of the procedure under repeated sampling. Bayesian credible interval: a posterior probability statement about $\theta$ . Both can take the SAME numeric value (e.g., for binomial with a Beta-uniform prior the Jeffreys credible interval and Wilson CI are numerically similar at moderate $n$ ), but they answer different questions.
Frequentist CI has long-run frequency interpretation, Bayesian does not need one. If you do not believe in the "imagined replications" thought experiment (which underlies the frequentist 95%), the credible interval is more direct. If you do, the CI is conceptually cleaner because it requires no prior. Part 7 develops the Bayesian framework in detail.
Wasserstein-Schirm-Lazar (2019) endorse both as alternatives to dichotomous p-values. A 95% credible interval, a 95% CI, and a posterior-density visualisation are all in the recommended-reporting set; the choice between them is methodological and field-dependent.

For the rest of Part 3 we stay in the frequentist framework, bootstrap CIs (§3.2), profile-likelihood CIs (§3.3), prediction intervals (§3.4), calibration (§3.5), communication (§3.6). Part 7 picks up the Bayesian thread.

Try it

In the ci-methods-comparison, set Binomial p, n = 20, true p = 0.05, 95% confidence. Re-roll five times. For each sample, note (i) the four CIs and (ii) whether each covers the truth. Count the misses across the five re-rolls per method. Which method has the most misses? Connect this empirical count to the Brown-Cai-DasGupta (2001) coverage-curve result.
Same widget. Pick Binomial p with n = 30. Try p = 0.5 (centre of the parameter space). Re-roll a few times. Note how the four CIs are almost identical, width and position. Now slide p down to 0.05. Re-roll. The CIs now disagree visibly. Argue: the Wald approximation is best in the INTERIOR of the parameter space and worst near the BOUNDARY.
Same widget. Switch to Poisson λ with n = 20, true λ = 0.5. Re-roll until you get a sample with total k ≤ 5. Read off the four CIs. The exact-gamma and profile CIs are ASYMMETRIC (longer right tail); Wald is symmetric. Argue why this matters when the parameter is close to a hard boundary (λ ≥ 0): the symmetric Wald form requires capping, the asymmetric exact form does not.
In the coverage-explorer, set n = 30, 95% nominal, all four methods on. Find the value of p between 0.01 and 0.50 where the Wald coverage drops the LOWEST. (Hint: the dip near p ≈ 0.02-0.03.) Read off the exact coverage at the point slider. Now find the corresponding Clopper-Pearson coverage. Argue: at this p, a literature relying on Wald 95% CIs would have an actual error rate around 12-15%, not 5%, and a literature relying on Clopper-Pearson would have closer to 2-3%, not 5%.
Same widget. Slide n from 10 to 200 at p = 0.05. Watch the Wald curve's minimum coverage climb toward 95%. At what n is Wald coverage within 1 percentage point of nominal at p = 0.05? This is the EFFECTIVE LARGE-n threshold for Wald near p = 0.05, typically n > 200. Argue why textbook statements like "Wald is OK for np ≥ 5" are conservative-enough rules of thumb but Brown-Cai-DasGupta (2001) prefer to abandon Wald entirely.
Pen-and-paper. Derive the Wilson score CI from scratch. Start with the score-test pivot $(\hat p - p)/\sqrt{p(1-p)/n}$ . Set its absolute value equal to $z$ and SOLVE for $p$ . Show that the two roots of the resulting quadratic in $p$ give the Wilson endpoints.
Pen-and-paper. For Binomial(n = 20, p = 0.1), compute the Clopper-Pearson 95% CI for k = 1. Use the Beta-quantile form: $p_L = \text{Beta}(0.025; 1, 20)$ , $p_U = \text{Beta}(0.975; 2, 19)$ . (You may use software for the quantiles; the answer is approximately [0.001, 0.249].) Compare with the Wald CI $1/20 \pm 1.96\sqrt{0.05 \cdot 0.95/20} \approx [-0.045, 0.145]$ , note the Wald LOWER bound is NEGATIVE. Argue why Wald is unusable here.
Pen-and-paper. For the rule of three: argue that if $k = 0$ events are observed in $n$ trials, the 95% upper confidence bound for $p$ satisfies $(1 - p_U)^n = 0.025$ , hence $p_U \approx -\log(0.025)/n \approx 3.7/n$ . Hanley & Lippman-Hand (1983, JAMA) gave this the punchier name "rule of three" using log(1/0.05) ≈ 3. State the result for n = 100 (no adverse events: 95% upper bound ≈ 3%) and for n = 1000 (95% upper bound ≈ 0.3%).
Pen-and-paper. The Student-t correction matters most for SMALL n. Compute $t_{0.975, df}$ for df = 5, 10, 30, 100, 1000 (use a t-table). Note the convergence to $z_{0.975} = 1.96$ . Argue: for n ≥ 30, the practical difference between z- and t-based CIs is negligible; for n ≤ 10, it is substantial; for n ≤ 5, the t-CI is significantly wider.
Pen-and-paper. Show that the LRT-based CI for binomial p with n large is asymptotically equivalent to the Wilson CI. (Hint: Taylor-expand the log-likelihood $\ell(p)$ around $\hat p$ to second order; the LRT $2[\ell(\hat p) - \ell(p)]$ becomes a quadratic in $p - \hat p$ ; setting that quadratic equal to $z^2$ gives an equation algebraically identical to the Wilson construction.) This is why the four-method comparison widget shows the Wilson and profile-LRT bars almost overlapping at moderate n.
Pen-and-paper. State the difference between a one-sided 95% upper confidence bound and the upper endpoint of a two-sided 95% CI. (The one-sided uses $z_{0.95} = 1.645$ ; the two-sided upper endpoint uses $z_{0.975} = 1.96$ . The one-sided 95% upper bound is therefore LOWER than the two-sided 95% upper endpoint.) Give a research scenario where the difference would matter, e.g., a regulatory non-inferiority filing where the relevant question is "is the relative risk < 1.05?" and the right tool is the one-sided 95% upper bound, not the two-sided 95% CI upper endpoint.
Pen-and-paper. A clinical study reports "95% CI for cure rate: [0.62, 0.78]". A reader writes "the probability that the true cure rate is in [0.62, 0.78] is 0.95". Diagnose the misstatement. Provide the CORRECT frequentist reading. Then provide the BAYESIAN reading (which DOES make the probability statement directly, conditional on a prior). State why both readings can lead to similar practical decisions but answer different formal questions.

Pause and reflect: §3.1 has set out the CI framework. A CI is a PROCEDURE-level guarantee, not a statement about the realised interval. The Wald CI is short, generic, and works asymptotically; it breaks near boundaries, under skew, and at small n. The Clopper-Pearson CI is exact in the conservative sense (coverage ≥ nominal) for the binomial; the Garwood gamma is the Poisson analogue; the Student-t CI is exact under Normality for the Normal mean. The Wilson score CI is the modern default for the binomial because its coverage is closest to nominal across the whole parameter space. The likelihood-ratio CI (developed in §3.3) is the general-purpose finite-sample tool. The COVERAGE PROBABILITY is the key diagnostic, Brown-Cai-DasGupta (2001) computed it exactly and argued Wald should be retired from textbook teaching. The two widgets make these abstract claims visible. §3.2 picks up with BOOTSTRAP CIs, a resampling-based approach that side-steps the asymptotic-normality assumption entirely, with its own trade-offs.

What you now know

You can state the formal definition of a confidence interval as a procedure-level property: for confidence level $1 - \alpha$ , the procedure $C(X)$ satisfies $P_\theta(\theta \in C(X)) \ge 1 - \alpha$ for every $\theta$ . You can articulate why the post-data sloppy short-hand "I am 95% confident $\theta \in [a, b]$ " is technically incorrect under the frequentist reading; you can supply the correct phrasing ("the procedure covers $\theta$ in 95% of repeated samples").

You can derive the Wald CI $\hat\theta \pm z_{1-\alpha/2},\widehat{\mathrm{SE}}(\hat\theta)$ from the asymptotic-normality result and identify the three regimes where it breaks: parameter values near a boundary, highly skewed sampling distributions, and small $n$ . You can write down the Clopper-Pearson (1934) exact binomial CI via the Beta-quantile form, the Wilson (1927) score CI as a closed-form quadratic root, the Garwood (1936) gamma-based Poisson CI, and the Student-t (Gosset 1908) CI for the Normal mean. You can preview the likelihood-ratio CI ${\theta : 2[\ell(\hat\theta) - \ell(\theta)] \le \chi^2_{1, 1-\alpha}}$ and know that §3.3 develops it.

You can define coverage probability as the actual fraction of times a CI procedure covers the truth in repeated samples, distinguish nominal from actual, and quote the Brown-Cai-DasGupta (2001) verdict: Wald is chaotic with severe under-coverage near binomial boundaries; Wilson is close to nominal; Clopper-Pearson always over-covers. You can use the ci-methods-comparison widget to see how dramatically the four CIs disagree on the same data, especially near boundaries and small $n$ , and the coverage-explorer widget to see the exact coverage probabilities computed by summing over the full discrete sample space.

You can distinguish two-sided CIs from one-sided confidence bounds; you can cite the §2.7 TOST procedure as a two-one-sided construction; and you can warn about the $z_{0.95}$ vs $z_{0.975}$ confusion between one-sided 95% bounds and two-sided 95% endpoints. You can distinguish frequentist CIs from Bayesian credible intervals, different conceptual content even when the numerics happen to agree, and you can preview Part 7 as the home of the Bayesian alternative.

Where this lands in the rest of Part 3. §3.2 picks up with BOOTSTRAP CIs (percentile, BCa, basic), a resampling-based approach that side-steps the asymptotic-normality assumption entirely. §3.3 develops the PROFILE-LIKELIHOOD and LRT-based CIs in full generality, including the multi-parameter case. §3.4 distinguishes prediction intervals (uncertainty about a future observation) from confidence intervals (uncertainty about a parameter). §3.5 takes calibration seriously: when does a 95% CI actually mean 95%, and how do we diagnose miscalibration? §3.6 closes Part 3 on the communication side: how to report uncertainty without lying, drawing on Wasserstein-Schirm-Lazar (2019) and the broader reproducibility agenda.

References

Neyman, J. (1937). "Outline of a theory of statistical estimation based on the classical theory of probability." Philosophical Transactions of the Royal Society of London, Series A 236(767), 333-380. (The foundational paper defining confidence intervals as a frequentist procedure. The "procedure covers $\theta$ in $1 - \alpha$ of repeated samples" reading is Neyman's.)
Clopper, C.J., Pearson, E.S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial." Biometrika 26(4), 404-413. (The exact binomial CI via inversion of the binomial test. The interval is guaranteed to attain at least nominal coverage; it is also conservative.)
Wilson, E.B. (1927). "Probable inference, the law of succession, and statistical inference." Journal of the American Statistical Association 22(158), 209-212. (The score CI for the binomial. Pre-dates Clopper-Pearson by seven years and is the modern default for proportion intervals.)
Agresti, A., Coull, B.A. (1998). "Approximate is better than 'exact' for interval estimation of binomial proportions." The American Statistician 52(2), 119-126. (The "add two successes, add two failures" simplification of Wilson, plus the empirical case that approximate methods can outperform exact ones in coverage at the nominal level.)
Brown, L.D., Cai, T.T., DasGupta, A. (2001). "Interval estimation for a binomial proportion." Statistical Science 16(2), 101-117. (The definitive paper on binomial CI coverage. Computes exact coverage of Wald, Wilson, Clopper-Pearson, Agresti-Coull, and Jeffreys-Bayes across the parameter space; recommends retiring Wald from textbook teaching for the binomial.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 6, "Estimation": the Wald CI as the default asymptotic procedure.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 9, "Interval Estimation": the inverting-a-test framework, with worked examples for Wald, score, and exact CIs.)
Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Section 7.2 on LRT-based CIs and the $\chi^2_1$ calibration via Wilks 1938.)
Lehmann, E.L., Romano, J.P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer. (Chapter 12 on the duality between tests and confidence sets, including the multi-parameter generalisation of the LRT inversion.)
Hanley, J.A., Lippman-Hand, A. (1983). "If nothing goes wrong, is everything all right? Interpreting zero numerators." JAMA 249(13), 1743-1745. (The "rule of three" for the $k = 0$ binomial: $p_U \approx 3/n$ for the 95% upper bound.)