Exact vs asymptotic CIs

Part 3 — Confidence intervals and uncertainty

Learning objectives

  • State the FREQUENTIST definition of a confidence interval as a PROCEDURE: for confidence level 1 − α, the procedure C(X) = [L(X), U(X)] satisfies P_θ(θ ∈ C(X)) ≥ 1 − α for every θ — coverage is a property of the random PROCEDURE under the true parameter, NOT a probability statement about a single computed [a, b]
  • Recognise that 'I am 95% confident θ ∈ [a, b]' is a frequentist sloppy short-hand; the formally correct reading is 'the procedure that produced [a, b] covers θ in 95% of repeated samples'
  • Derive the WALD CI θ̂ ± z_{1 − α/2} · SÊ(θ̂) from the asymptotic normality result (CLT, §1.6, §1.9) and recognise it as the asymptotic CI
  • List the regimes where Wald BREAKS DOWN: parameter values near a boundary (Binomial p near 0 or 1; Poisson λ near 0), highly skewed sampling distributions, small n
  • State the CLOPPER–PEARSON (1934) exact CI for Binomial p by inverting the binomial test: [Beta(α/2; k, n − k + 1), Beta(1 − α/2; k + 1, n − k)]; note the EXACT coverage guarantee P(θ ∈ C) ≥ 1 − α (NOT equality) — the CI is CONSERVATIVE
  • State the WILSON (1927) score CI: (p̂ + z²/(2n) ± z·√(p̂(1-p̂)/n + z²/(4n²)))/(1 + z²/n). Recognise it as the inversion of the SCORE test, with coverage close to nominal across most of [0, 1] and no need for boundary clipping — Agresti-Coull (1998) and Brown-Cai-DasGupta (2001) endorse it as the modern default
  • Preview the LIKELIHOOD-RATIO CI (§3.3 develops it): {θ : 2[ℓ(θ̂) − ℓ(θ)] ≤ χ²_{1, 1−α}}. Recognise it as the general-purpose finite-sample CI that often outperforms both Wald and exact CIs for irregular models
  • Define COVERAGE PROBABILITY as the actual fraction of times the procedure covers θ in repeated samples; distinguish NOMINAL from ACTUAL coverage; show that Wald under-covers and Clopper–Pearson over-covers, with Wilson closest to nominal (Brown-Cai-DasGupta 2001)
  • State the EXACT POISSON CI (gamma-based, Garwood 1936): lo = ½χ²_{2k, α/2}/n, hi = ½χ²_{2(k+1), 1−α/2}/n; recognise it as the gamma-Poisson tail identity applied to the rate λ
  • State the NORMAL μ CI (Student-t, Gosset 1908): X̄ ± t_{1−α/2, n−1}·s/√n; recognise it as exact under Normality (the assumption is what 'exact' means)
  • Recognise the difference between a two-sided CI [L, U] and a one-sided UPPER or LOWER confidence BOUND (∞-or 0-truncated); state when each is appropriate — and recall that the bioequivalence machinery in §2.7 used a one-sided / TOST construction
  • Distinguish FREQUENTIST CIs (a procedure property) from BAYESIAN CREDIBLE intervals (P(θ ∈ [a, b] | data), a posterior-probability statement); both can take the same numeric value yet they answer different questions, and §7 develops the Bayesian alternative

Part 2 spent eight sections on HYPOTHESIS TESTS — the Neyman–Pearson decision rule, the p-value, multiple-testing corrections, the equivalence framework, and the replication crisis that follows when those disciplines are absent at the literature level. Part 3 turns to the DUAL inferential object: the CONFIDENCE INTERVAL. A test answers "is the parameter different from the null?"; a CI answers "given the data, what range of parameter values is consistent with what we observed?". The two are mathematically equivalent — every CI inverts a test, every test corresponds to a CI — but the CI carries an effect size and an uncertainty band on the SAME object, which is exactly the reporting recommendation Wasserstein, Schirm, Lazar (2019, American Statistician) made the centrepiece of "Moving to a world beyond p < 0.05".

The §3.1 task is to lay out the CI FRAMEWORK and the two complementary roads into it. The asymptotic road — the Wald CI θ̂ ± z·SÊ(θ̂) — is the workhorse: it works for any estimator with a tractable standard error and an asymptotic normality result. It dominates introductory textbooks because it is short and mnemonic. The exact road — Clopper–Pearson (1934) for the binomial, the gamma-based CI (Garwood 1936) for the Poisson, the Student-t CI (Gosset 1908) for the Normal mean — gives an interval whose coverage is guaranteed to be AT LEAST the nominal level at every parameter value, not merely in the n → ∞ limit. Both have failure modes. The §3.1 widgets make them visible.

The arc has twelve stops. First, the formal definition of a CI and the SLOPPY-SHORT-HAND warning. Second, the Wald CI derivation from CLT-style asymptotics. Third, where Wald breaks: boundaries, skew, small n. Fourth, the Clopper–Pearson exact CI for the binomial. Fifth, the Wilson score CI and why it is the modern default. Sixth, the EXACT POISSON CI and the Normal-μ Student-t CI. Seventh, a preview of the likelihood-ratio CI (which §3.3 develops in depth). Eighth, COVERAGE PROBABILITY — nominal vs actual — and the Brown-Cai-DasGupta (2001) verdict on Wald. Ninth, the ci-methods-comparison widget: one sample, four CIs, all drawn on the same axis. Tenth, the coverage-explorer widget: EXACT coverage vs true p, by full summation over k = 0, …, n. Eleventh, one-sided CIs and the bioequivalence cross-link. Twelfth, the credible-interval cross-link to Part 7.

What a confidence interval IS — and is NOT

The formal definition (Neyman 1937, Phil. Trans. R. Soc. A 236, 333–380) is operational. Let $X = (X_1, \ldots, X_n)$ be the sample and $\theta$ the parameter of interest. A confidence interval procedure at level $1 - \alpha$ is a pair of functions $L(X) \le U(X)$ such that, for EVERY parameter value $\theta$,

$$P_{\theta}\bigl(L(X) \le \theta \le U(X)\bigr) \;\ge\; 1 - \alpha.$$

Read the LHS slowly. The randomness lives in $X$; the CI bounds $L(X), U(X)$ are random because they are functions of $X$. The parameter $\theta$ is FIXED but unknown. The probability is taken over the distribution of $X$ when $\theta$ is the true value. So the statement is: over the long run of imagined replications of the experiment, the procedure that produces these bounds covers the true $\theta$ at least $1 - \alpha$ of the time.

The deceptive part is what happens AFTER you observe a single dataset. Suppose you collect data, compute $L(x) = 0.42$ and $U(x) = 0.61$, and write down "95% CI: [0.42, 0.61]". The temptation is to say "I am 95% confident $\theta$ is in [0.42, 0.61]" or, worse, "the probability that $\theta \in [0.42, 0.61]$ is 0.95". Both are technically WRONG under the frequentist reading. After you observe the data, the bounds $0.42, 0.61$ are FIXED numbers; $\theta$ is also a fixed number; the event "$\theta \in [0.42, 0.61]$" either happens or does not happen — it has no probability between 0 and 1 from a frequentist standpoint. The 95% lives in the PROCEDURE, not in the post-data interval.

The honest reading is: "the procedure that produced this interval covers $\theta$ in 95% of repeated samples". The procedure is what is 95%-reliable. The single interval [0.42, 0.61] either contains $\theta$ or it does not, and we cannot know which without seeing the truth. Calling [0.42, 0.61] "the 95% CI" is convention — every applied textbook uses the phrase — but the underlying object that carries the 95% is the procedure $(L, U)$, not the realised numbers. The widgets in this section drive that distinction home: when you re-roll a simulation, you see DIFFERENT realised intervals from the SAME procedure, and the long-run fraction that covers $\theta$ is what the 95% refers to.
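The repeated-sampling reading can be checked numerically. The sketch below is a minimal illustration, not the widget's code: it assumes a Normal population with KNOWN σ so that the z-interval is exact, and the names `z_interval` and `simulated_coverage`, the seed, and the replication count are all illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, level=0.95):
    """Two-sided z-interval for a Normal mean with KNOWN sigma (assumed setting)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    centre = fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return centre - half_width, centre + half_width

def simulated_coverage(mu=0.0, sigma=1.0, n=25, reps=4000, level=0.95, seed=1):
    """Fraction of replications whose realised interval contains the true mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma, level)
        hits += lo <= mu <= hi  # each realised interval either covers mu or it does not
    return hits / reps

coverage = simulated_coverage()
print(f"fraction of 4000 realised intervals that cover mu: {coverage:.3f}")
```

Each pass through the loop produces a different realised interval from the same procedure; the 95% describes the fraction of passes that cover μ, not any single interval.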

Bayesian credible intervals (Part 7) make a different statement: given a prior on $\theta$ and the data, $P(\theta \in [a, b] \mid X) = 1 - \alpha$ DOES hold for the realised interval. The numeric value can be the same as a frequentist CI in many problems (Jeffreys-prior Bayesian intervals for the binomial coincide with mid-P CIs to high accuracy at moderate n), but the conceptual content is different. Part 7 develops that distinction. For now, every CI in this section is a frequentist procedure, and "95%" is its long-run coverage rate.

The Wald CI: derived from asymptotic normality

The asymptotic CI most readers see first is the WALD CI. Its derivation is one line of §1.6 + §1.9 machinery. Let $\hat\theta_n$ be an estimator with an asymptotic-normality result of the form

$$\sqrt{n}\,(\hat\theta_n - \theta) \;\xrightarrow{d}\; N\bigl(0, \sigma^2(\theta)\bigr) \qquad \text{as } n \to \infty.$$

Examples: the sample mean $\bar X$ for any iid population with finite variance (CLT, §0.7); the MLE for a regular model (§1.3 + §1.9); the sample proportion $\hat p = k/n$ for the binomial (special case of the CLT with $\sigma^2 = p(1-p)$). In every case, for large $n$ the distribution of $\hat\theta_n$ is approximately $N\bigl(\theta, \sigma^2(\theta) / n\bigr)$. Plug in a consistent estimate $\widehat{\mathrm{SE}}(\hat\theta_n) = \hat\sigma_n/\sqrt{n}$ (typically $\hat\sigma_n = \sigma(\hat\theta_n)$; this is the "plug-in" step) and you get the asymptotic pivot

$$\frac{\hat\theta_n - \theta}{\widehat{\mathrm{SE}}(\hat\theta_n)} \;\xrightarrow{d}\; N(0, 1).$$

Invert this to construct an interval. For nominal two-sided coverage $1 - \alpha$, the central $1 - \alpha$ region of a standard Normal is $(-z_{1-\alpha/2}, z_{1-\alpha/2})$; at the 95% level, $z_{0.975} \approx 1.96$. Then

$$P\!\left(\hat\theta_n - z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta_n) \;\le\; \theta \;\le\; \hat\theta_n + z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta_n)\right) \;\to\; 1 - \alpha.$$

The Wald CI at level $1 - \alpha$ is therefore

$$\boxed{\;\hat\theta_n \;\pm\; z_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}(\hat\theta_n)\;}.$$

For the binomial proportion with $\hat p = k/n$ this reads $\hat p \pm z\sqrt{\hat p(1-\hat p)/n}$. For the Poisson rate with $\hat\lambda = k/n$ (total events $k$ over $n$ unit-exposure observations) it reads $\hat\lambda \pm z\sqrt{\hat\lambda/n}$. For the Normal mean with known $\sigma$ it reads $\bar X \pm z\,\sigma/\sqrt n$. In every case the form is the same: point estimate ± normal quantile × estimated standard error.
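The three special cases share one code path. The following is a minimal sketch using only the Python standard library; the function names (`wald_ci`, `wald_binomial`, `wald_poisson`, `wald_normal_known_sigma`) are illustrative, not taken from any particular package.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(estimate, se, level=0.95):
    """Point estimate +/- Normal quantile x estimated standard error."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # z_{1 - alpha/2}
    return estimate - z * se, estimate + z * se

def wald_binomial(k, n, level=0.95):
    """k successes in n trials; SE uses the plug-in p_hat(1 - p_hat)/n."""
    p_hat = k / n
    return wald_ci(p_hat, sqrt(p_hat * (1 - p_hat) / n), level)

def wald_poisson(k, n, level=0.95):
    """k total events over n unit-exposure observations; SE is sqrt(lambda_hat/n)."""
    lam_hat = k / n
    return wald_ci(lam_hat, sqrt(lam_hat / n), level)

def wald_normal_known_sigma(xbar, sigma, n, level=0.95):
    """Normal mean with sigma treated as known."""
    return wald_ci(xbar, sigma / sqrt(n), level)

print(wald_binomial(15, 50))
print(wald_poisson(10, 20))
print(wald_normal_known_sigma(0.2, 1.0, 25))
```

Only the standard-error formula changes between the three wrappers; the unclipped interval is returned as-is, so endpoints outside the parameter space are possible.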

The Wald CI is the FIRST CI most readers learn because it is short, generic, and works for any estimator that has an asymptotic-normality theorem. It is the standard reporting form in regression output, generalised linear models (Part 5), and most software. Wasserman (2004, All of Statistics, Chapter 6) gives it as the default; Casella–Berger (2002, Chapter 9.1) discuss it under "inverting a pivot". For large $n$ and a parameter value in the interior of its space, the Wald CI is hard to improve on.

Where Wald breaks: boundaries, skew, small n

The "for large $n$ and a parameter value in the interior" qualifier is doing a lot of work in the previous paragraph. Three regimes spoil the Wald approximation; each is the canonical motivation for the exact and score alternatives.

  • Parameter values near a boundary. The Wald CI has the form $\hat\theta \pm z\cdot \widehat{\mathrm{SE}}$, which is symmetric and unbounded. For binomial $p$ this is a problem because $p \in [0, 1]$: when $\hat p$ is close to 0 (e.g., $k = 1$ event in $n = 50$), the Wald CI extends below 0, and because the $\sqrt{\hat p(1 - \hat p)}$ term is itself small there, the interval is also too narrow. Many implementations CLIP the interval to [0, 1] post hoc. Clipping repairs only the cosmetic problem: for any $p \in [0, 1]$ it leaves the coverage unchanged, so the non-monotone coverage and the famous Brown-Cai-DasGupta (2001) sawtooth remain. The exact and score CIs handle boundaries by construction.
  • Highly skewed sampling distributions. Wald assumes the sampling distribution of $\hat\theta$ is well-approximated by a symmetric Normal. For estimators whose sampling distribution is skewed — Poisson rate at low counts (right-skewed), ratio of two means with small denominators (extreme tails), MLEs for irregular models — the symmetric ± width misrepresents the actual uncertainty. The exact CIs are typically asymmetric and capture this; the LRT-based CIs do so via the non-quadratic shape of the log-likelihood.
  • Small n. Asymptotic normality is a limit theorem: it holds AS $n \to \infty$. For finite $n$ the quality of the approximation depends on the parameter (the "third moment" / Berry-Esseen rate, see §1.9). For the binomial with $n = 30$ and $p \approx 0.05$, the Wald CI under-covers by 10+ percentage points (Brown-Cai-DasGupta 2001 Table 2). The Student-t correction (Gosset 1908) is the simplest small-$n$ fix for the Normal mean; for other distributions the LRT or exact CIs are the analogues.

The diagnostic for "is Wald OK here?" is therefore: how close is the parameter to a boundary, how skewed is the sampling distribution, and how large is $n$? When ALL three are favourable (interior parameter, near-symmetric distribution, large $n$), the Wald CI matches the exact and score CIs to within Monte Carlo noise. When ANY one of them fails, Wald can be badly off.
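The boundary failure is easy to reproduce by hand. This short sketch evaluates the unclipped binomial Wald formula at the $k = 1$, $n = 50$ case mentioned above, at $k = 0$, and at an interior case for contrast; `wald_binomial` is an illustrative helper name.

```python
from math import sqrt
from statistics import NormalDist

def wald_binomial(k, n, level=0.95):
    """Unclipped Wald interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_hat = k / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

near_boundary = wald_binomial(1, 50)   # lower endpoint falls below 0
zero_events = wald_binomial(0, 50)     # estimated SE is 0, so the interval collapses
interior = wald_binomial(25, 50)       # interior p_hat: both endpoints inside (0, 1)

for label, (lo, hi) in [("k=1", near_boundary), ("k=0", zero_events), ("k=25", interior)]:
    print(f"{label:>5}: [{lo:+.4f}, {hi:+.4f}]")
```

The $k = 0$ row is the degenerate case: the plug-in standard error vanishes, so the interval has zero width regardless of $n$.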

The Clopper–Pearson exact CI (binomial)

Clopper and Pearson (1934, Biometrika 26(4), 404–413) constructed an EXACT confidence interval for the binomial proportion by inverting the binomial test. The construction is:

Lower bound $p_L$: the smallest $p$ such that $P(\text{Binomial}(n, p) \ge k) \ge \alpha/2$. Upper bound $p_U$: the largest $p$ such that $P(\text{Binomial}(n, p) \le k) \ge \alpha/2$.

Closed form via the binomial–beta duality (binomial tail probabilities are regularised incomplete beta functions):

  • $p_L = \text{Beta}\bigl(\alpha/2;\;k,\;n - k + 1\bigr)$ (the $\alpha/2$-quantile of a Beta distribution)
  • $p_U = \text{Beta}\bigl(1 - \alpha/2;\;k + 1,\;n - k\bigr)$

with the boundary conventions $p_L = 0$ when $k = 0$ and $p_U = 1$ when $k = n$. The interval is GUARANTEED to satisfy $P_p(p \in C) \ge 1 - \alpha$ for every $p \in [0, 1]$. Read the inequality carefully: the coverage is AT LEAST nominal. The actual coverage is typically ABOVE the nominal level — Clopper–Pearson is CONSERVATIVE. The over-coverage is the price for a discrete sample space (the test cannot achieve exactly $\alpha/2$ on each side because the binomial probability mass function only takes a finite set of values).

Two things to know about Clopper–Pearson. (1) When you have ZERO events ($k = 0$), the lower bound is exactly 0 and the upper bound is $p_U = 1 - (\alpha/2)^{1/n}$, which for a two-sided 95% CI is about $3.7/n$ at large $n$. The closely related "rule of three" (Hanley & Lippman-Hand 1983, JAMA) is the one-sided 95% version: $p_U = 1 - 0.05^{1/n} \approx 3/n$. This is the canonical answer to "no events in $n$ trials, what is the upper bound for $p$?". (2) Some software reports a MID-P version (Lancaster 1961) that splits the discrete probability mass at $k$ in half, reducing the conservatism. The mid-P interval gives up the strict at-least-nominal guarantee in exchange for coverage closer to nominal; both versions are widely used.
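The test-inversion definition can be implemented directly, without a Beta-quantile routine, by solving the two tail equations numerically. The sketch below is one such approach under stated assumptions: it uses bisection on exact binomial tail sums (standard library only), and the helper names `binom_cdf`, `bisect_root`, and `clopper_pearson` are illustrative. Production code would normally call a Beta quantile function instead.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(Binomial(n, p) <= k) by direct summation of the PMF."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def bisect_root(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi], assuming f changes sign exactly once in between."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, level=0.95):
    alpha = 1 - level
    # Lower endpoint solves P(X >= k | p) = alpha/2; convention: 0 when k = 0.
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper endpoint solves P(X <= k | p) = alpha/2; convention: 1 when k = n.
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

print(clopper_pearson(5, 20))
print(clopper_pearson(0, 30))  # upper endpoint equals 1 - 0.025**(1/30)
```

For $k = 0$ the upper tail equation reduces to $(1 - p)^n = \alpha/2$, which is where the closed form $1 - (\alpha/2)^{1/n}$ comes from.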

The Wilson score CI (binomial)

Wilson (1927, JASA 22(158), 209–212) constructed a CI by inverting the SCORE test rather than the Wald test. The score test is the test that uses $\sqrt{n}(\hat p - p)/\sqrt{p(1-p)}$ — the standard error computed UNDER THE NULL $p_0$, not at the point estimate $\hat p$. Setting that pivot equal to $\pm z$ and solving for $p$ gives a quadratic in $p$; the two roots are the CI endpoints:

$$C^{\mathrm{Wilson}} \;=\; \frac{\hat p + z^2/(2n) \;\pm\; z\,\sqrt{\hat p(1-\hat p)/n + z^2/(4n^2)}}{1 + z^2/n}.$$

Three properties make Wilson the modern default:

  • Coverage close to nominal across most of [0, 1]. Brown, Cai, DasGupta (2001, Statistical Science 16(2), 101–117) computed exact coverage probabilities for n up to several hundred and a fine grid of p values. Wilson coverage stays within ± 1 percentage point of nominal almost everywhere; the rare deviations are at small $n$ and very small $p$.
  • Bounded inside [0, 1] by construction. The denominator $(1 + z^2/n)$ and the $z^2/(2n)$ pull-toward-1/2 effect mean the bounds never escape [0, 1], even when $k = 0$ or $k = n$. No post-hoc clipping is needed.
  • Closed form. Unlike Clopper–Pearson, no incomplete-beta call is required — Wilson is a single quadratic-root formula.

Agresti and Coull (1998, American Statistician 52(2), 119–126) made the empirical case explicitly: "Approximate is better than 'exact' for interval estimation of binomial proportions". Their Agresti–Coull modification — apply the Wald formula to $\tilde p = (k + z^2/2)/(n + z^2)$ with $\tilde n = n + z^2$ — is almost numerically identical to Wilson at $z = 1.96$ (95%) and is sometimes preferred for its mnemonic simplicity (the "add two successes, add two failures" rule). Brown-Cai-DasGupta (2001) recommend Wilson or Agresti–Coull as the universal binomial CI default; the Wald CI's near-boundary behaviour disqualifies it from textbook teaching, in their view.
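Both formulas are short enough to write out. The following minimal sketch transcribes the Wilson formula and the Agresti–Coull adjustment as stated above; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(k, n, level=0.95):
    """Wilson score interval: the two roots of the score-test quadratic in p."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

def agresti_coull_ci(k, n, level=0.95):
    """Wald formula applied to p_tilde = (k + z^2/2)/(n + z^2), n_tilde = n + z^2."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    n_tilde = n + z**2
    p_tilde = (k + z**2 / 2) / n_tilde
    half_width = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half_width, p_tilde + half_width

for k, n in [(15, 50), (1, 50), (0, 20)]:
    print(k, n, wilson_ci(k, n), agresti_coull_ci(k, n))
```

The two intervals share the same centre, since $\tilde p = (\hat p + z^2/(2n))/(1 + z^2/n)$; they differ only in width. Note that Agresti–Coull, being a Wald-type formula, can still produce an endpoint slightly outside [0, 1], whereas Wilson cannot.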

Exact CIs for Poisson and Normal

The Clopper–Pearson logic generalises beyond the binomial. Two other settings carry their own exact CI machinery:

  • Poisson rate (Garwood 1936, Biometrika). If $k$ events are observed over $n$ unit-exposure observations, $k \sim \text{Poisson}(n\lambda)$. Inverting the Poisson test (which uses gamma-distribution tails via the gamma-Poisson identity) yields the exact CI for the rate $\lambda$:
$$\lambda_L = \frac{1}{2n}\;\chi^2_{2k,\,\alpha/2}, \qquad \lambda_U = \frac{1}{2n}\;\chi^2_{2(k+1),\,1-\alpha/2}.$$

with $\lambda_L = 0$ when $k = 0$. The interval is conservative (always ≥ nominal coverage) and asymmetric — the upper bound is further from $\hat\lambda$ than the lower, reflecting the right-skew of the Poisson sampling distribution at low counts.

  • Normal mean (Gosset 1908, Biometrika — published under the pseudonym "Student"). When the data are iid $N(\mu, \sigma^2)$ with $\sigma$ ESTIMATED from the data as $s$, the pivot $T = (\bar X - \mu)/(s/\sqrt{n})$ has an exact Student-t distribution with $n - 1$ degrees of freedom. The CI is
$$\bar X \;\pm\; t_{1-\alpha/2,\,n-1} \cdot s/\sqrt{n}.$$

The Student-t correction widens the interval for small $n$: $t_{0.975, 5} \approx 2.57$ vs $z_{0.975} \approx 1.96$. For $n \ge 30$ the two quantiles are within about 5% of each other ($t_{0.975, 29} \approx 2.05$), and the z-based and t-based intervals converge as $n$ grows. The "exactness" relies on the Normality assumption — under non-Normal data the t-CI is approximate (by CLT, which is why it still works well at moderate $n$).

For other distributions the construction is the same in principle (invert a test based on the sampling distribution of $\hat\theta$), but the algebra is messier. Exponential rate: invert via gamma tails. Geometric: invert via negative-binomial tails. Multinomial: simultaneous CIs via Bonferroni or the Goodman (1965) procedure. The pattern — "express the test's acceptance region as an inequality in $\theta$, solve for $\theta$, that's the CI" — is universal. Casella–Berger (2002, Chapter 9.2) call this the "inverting a test" approach and walk through several examples.
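The "invert the test" pattern can be made concrete for the Poisson rate. The sketch below solves the two Poisson tail equations by bisection instead of calling a chi-square quantile function, which keeps it within the standard library; by the gamma-Poisson identity the solutions should coincide with the Garwood endpoints given earlier. The names `poisson_cdf`, `bisect_root`, and `garwood_ci` are illustrative.

```python
from math import exp

def poisson_cdf(k, mean):
    """P(Poisson(mean) <= k) by direct summation of the PMF."""
    term = exp(-mean)
    total = term
    for j in range(1, k + 1):
        term *= mean / j
        total += term
    return total

def bisect_root(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi], assuming f changes sign exactly once in between."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def garwood_ci(k, n, level=0.95):
    """Exact CI for the rate: k total events over n unit-exposure observations."""
    alpha = 1 - level
    # Lower endpoint (on the scale of the total mean n*lambda): P(X >= k) = alpha/2.
    mean_lo = 0.0 if k == 0 else bisect_root(
        lambda m: (1 - poisson_cdf(k - 1, m)) - alpha / 2, 0.0, float(k))
    # Upper endpoint: P(X <= k) = alpha/2; grow the bracket until it contains the root.
    bracket = k + 1.0
    while poisson_cdf(k, bracket) > alpha / 2:
        bracket *= 2
    mean_hi = bisect_root(lambda m: poisson_cdf(k, m) - alpha / 2, 0.0, bracket)
    return mean_lo / n, mean_hi / n

print(garwood_ci(10, 20))
print(garwood_ci(0, 10))  # upper endpoint is -log(0.025)/10
```

With $k = 10$ and $n = 20$ the point estimate is $\hat\lambda = 0.5$, and the upper endpoint sits further from it than the lower one, which is the asymmetry described above.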

Profile likelihood and LRT CIs — preview

Section §3.3 develops the LIKELIHOOD-RATIO TEST CI in depth. For now, the preview is enough to round out the §3.1 toolkit. For a regular model with likelihood $L(\theta; X) = \prod_i f(X_i; \theta)$ and log-likelihood $\ell(\theta) = \log L(\theta)$, the generalised likelihood ratio test (Wilks 1938) rejects $H_0: \theta = \theta_0$ when

$$2\,[\ell(\hat\theta) - \ell(\theta_0)] \;>\; \chi^2_{1,\,1-\alpha}.$$

Inverting this test gives the LRT-based CI:

$$C^{\mathrm{LRT}} \;=\; \left\{\theta:\; 2\,[\ell(\hat\theta) - \ell(\theta)] \;\le\; \chi^2_{1,\,1-\alpha}\right\}.$$

For Binomial $p$ this becomes the set of $p$ satisfying $2[k\log(\hat p/p) + (n-k)\log((1-\hat p)/(1-p))] \le \chi^2_{1, 1-\alpha}$, with $\hat p = k/n$. For Poisson rate this becomes $2[k\log(\hat\lambda/\lambda) - n(\hat\lambda - \lambda)] \le \chi^2_{1, 1-\alpha}$. In both cases the CI is asymmetric, follows the shape of the likelihood, and is generally well-calibrated even for small $n$. Cox and Hinkley (1974, Theoretical Statistics) and Lehmann and Romano (2005, Testing Statistical Hypotheses) make the case that LRT CIs are the "best general-purpose" finite-sample CI for regular models.

The §3.3 section develops the multi-parameter PROFILE-LIKELIHOOD generalisation (profile out nuisance parameters; the resulting one-dimensional log-profile-likelihood inversions inherit the $\chi^2_1$ calibration). For §3.1 we include the LRT CI as a fourth bar in the ci-methods-comparison widget, so the reader can see how it tracks Wilson and the exact CIs across parameter regions.
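As a preview of the mechanics, the binomial LRT interval can be found by locating the two points where the deviance crosses the threshold. This is a minimal sketch under stated assumptions: it uses bisection on each side of $\hat p$, the fact that the $\chi^2_1$ quantile equals $z_{1-\alpha/2}^2$, and the convention $0 \cdot \log 0 = 0$; the function names are illustrative.

```python
from math import log
from statistics import NormalDist

def binomial_deviance(p, k, n):
    """2[l(p_hat) - l(p)] for k successes in n trials, with 0*log(0) taken as 0."""
    p_hat = k / n
    dev = 0.0
    if k > 0:
        dev += k * log(p_hat / p)
    if k < n:
        dev += (n - k) * log((1 - p_hat) / (1 - p))
    return 2 * dev

def lrt_ci_binomial(k, n, level=0.95, tol=1e-12):
    # The chi-square(1) quantile is the square of the two-sided Normal quantile.
    threshold = NormalDist().inv_cdf(1 - (1 - level) / 2) ** 2
    p_hat = k / n

    def crossing(inside, outside):
        # Invariant: deviance(inside) <= threshold < deviance(outside).
        while abs(outside - inside) > tol:
            mid = (inside + outside) / 2
            if binomial_deviance(mid, k, n) <= threshold:
                inside = mid
            else:
                outside = mid
        return inside

    lower = 0.0 if k == 0 else crossing(p_hat, 0.0)
    upper = 1.0 if k == n else crossing(p_hat, 1.0)
    return lower, upper

print(lrt_ci_binomial(5, 20))
print(lrt_ci_binomial(0, 20))
```

The deviance is zero at $\hat p$ and increases monotonically on either side, which is what makes a one-dimensional bisection on each side sufficient here.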

Coverage probability — nominal vs ACTUAL

The single most important diagnostic for any CI procedure is its COVERAGE PROBABILITY: the actual fraction of times the procedure covers $\theta$ when $\theta$ is the true parameter. The NOMINAL level $1 - \alpha$ is what the procedure claims. The ACTUAL coverage may differ. The gap between nominal and actual is the CALIBRATION ERROR (§3.5 develops this in depth).

For a discrete sampling distribution like Binomial $(n, p)$, the actual coverage can be COMPUTED EXACTLY by summation. Write $C(k)$ for the interval the procedure returns when $k$ successes are observed. The coverage of the procedure at parameter $p$ is

$$\mathrm{Cov}(p) \;=\; \sum_{k=0}^{n}\,\mathbb{1}\!\left[p \in C(k)\right]\,\binom{n}{k}\,p^k(1-p)^{n-k}.$$

Sum over the $n + 1$ possible $k$ values, count which CIs contain $p$, weight by the binomial PMF, done. No Monte Carlo error. This is what the coverage-explorer widget below computes for each method on a fine grid of $p$ values.

The Brown-Cai-DasGupta (2001) verdict on the four binomial CIs, after running this exact-coverage computation for $n$ from 5 to 100 and the full $p \in (0, 0.5]$ range (by symmetry the $(0.5, 1)$ half is the mirror image):

  • Wald — chaotic. The coverage oscillates as $p$ varies, sometimes dropping 10+ percentage points below nominal, especially near the boundary. The oscillation is the discrete-binomial signature: each step in $k$ shifts the Wald CI bounds discontinuously, and the coverage as a function of $p$ takes on a sawtooth pattern. Brown-Cai-DasGupta recommend RETIRING Wald from textbook teaching for the binomial.
  • Wilson score — close to nominal. Coverage stays within ± 1 percentage point of nominal across most of [0, 1], with the only visible deviations near $p = 0$ and $p = 1$ where the discrete-binomial mass is concentrated. This is the recommended default.
  • Clopper–Pearson — always over-covers. The exact CI ALWAYS attains AT LEAST nominal coverage; the actual coverage typically sits at 96-98% rather than the claimed 95%. The over-coverage is the price for a strictly-conservative procedure.
  • Agresti–Coull — nearly identical to Wilson at 95%. Brown-Cai-DasGupta note Agresti-Coull is a smooth approximation to Wilson; the two are interchangeable for most applied work.

The "exact" name in "exact CI" is a guarantee about coverage being AT LEAST nominal — it does not mean the coverage IS exactly nominal. Conversely, "approximate" CIs (Wilson) can be CLOSER to nominal coverage on average than exact CIs, because they do not pay the discrete-conservatism premium. This is the heart of the Agresti–Coull paper title.
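The exact-coverage summation is a few lines of code. The sketch below is an illustration, not the widget's implementation: it applies the summation to the Wald and Wilson intervals at $n = 30$, $p = 0.05$, with the 95% level fixed and helper names (`wald`, `wilson`, `exact_coverage`) chosen for this example.

```python
from math import comb, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

def wald(k, n, z=Z95):
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson(k, n, z=Z95):
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def exact_coverage(ci, n, p):
    """Sum the binomial PMF over every k whose interval contains p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = ci(k, n)
        if lo <= p <= hi:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total

wald_cov = exact_coverage(wald, 30, 0.05)
wilson_cov = exact_coverage(wilson, 30, 0.05)
print(f"Wald:   {wald_cov:.4f}")
print(f"Wilson: {wilson_cov:.4f}")
```

Evaluating `exact_coverage` on a fine grid of $p$ values, one method at a time, traces out the coverage curves discussed in this section. Clipping the Wald interval to [0, 1] would not change its result, because membership of any $p \in [0, 1]$ is unaffected.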

One sample, multiple CIs — the ci-methods-comparison widget

The first widget puts all four CIs on a single canvas so the reader can see how dramatically they can DISAGREE on the same data. Pick a parameter (Binomial p, Poisson λ, or Normal μ), the sample size n, the TRUE parameter value, and the confidence level. The widget draws ONE simulated sample and computes:

  • Wald CI — $\hat\theta \pm z \cdot \widehat{\mathrm{SE}}(\hat\theta)$ (asymptotic, the workhorse).
  • Score / Wilson CI — closed-form (binomial / Poisson) score inversion.
  • Exact CI — Clopper–Pearson (binomial), Garwood gamma (Poisson), Student-t (Normal).
  • Profile-LRT CI — the LRT inversion against the $\chi^2_{1, 1-\alpha}$ threshold.

Each CI is drawn as a horizontal bar on a common axis. The dashed blue line is the TRUE parameter value (the truth). The white vertical mark is the point estimate. Re-roll the dataset to draw a fresh sample under the same (n, true parameter, confidence) settings.

[Interactive figure: ci-methods-comparison widget]

Things to verify in the widget:

  • Start at Binomial p, n = 30, true p = 0.10. Click Re-roll a few times. Note how the four bars often AGREE qualitatively — all four typically contain the truth — but disagree on width. Clopper–Pearson is the widest; Wald the narrowest; Wilson and profile-LRT sit in between.
  • Slide true p down to 0.02. Wald sometimes returns a lower bound clipped to 0 (the formula $\hat p - z\sqrt{\hat p(1-\hat p)/n}$ would have produced a NEGATIVE number; we cap to 0). Wilson and profile-LRT stay inside [0, 1] by construction. Clopper-Pearson is the widest. Re-roll: on some samples Wald MISSES the truth entirely. At n = 30 this happens when k = 0 (the interval collapses to [0, 0]) or when k ≥ 5 (the bar lies entirely above p = 0.02 because $\hat p$ is far enough above 0.02 that the symmetric ± width does not reach down to it). The other three rarely miss.
  • Switch to Poisson λ with n = 20, true λ = 0.5. Re-roll. Note how the gamma-exact and profile CIs are ASYMMETRIC about $\hat\lambda$ — wider on the right than the left, because the Poisson sampling distribution is right-skewed at small $n\lambda$. Wald is symmetric and misrepresents this skew. The "covers truth?" column shows that Wald has more "no" entries near the boundary than the other methods.
  • Switch to Normal μ with n = 8, true μ = 0, true σ = 1. The z-based Wald CI (using σ as if known) is exact under Normality. The Student-t CI (using s estimated from the data) is wider because t_{0.975, 7} ≈ 2.36 vs z_{0.975} ≈ 1.96. For genuinely-unknown σ the t-CI is the correct procedure; the z-CI under-covers (it pretends σ is known when it is estimated). Slide n up to 100: the two converge.
  • Pick a setting where exactly one method misses (e.g., binomial, n = 20, true p = 0.05, re-roll to find a sample with k = 0). With k = 0 the Wald CI is [0, 0] (the SE is 0 because $\hat p(1-\hat p) = 0$!) — Wald has COMPLETELY collapsed. Wilson returns a non-trivial upper bound (the $z^2/(2n)$ term saves it). Clopper-Pearson returns $[0, 1 - (\alpha/2)^{1/n}] \approx [0, 3.7/n]$, the two-sided counterpart of the famous "rule of three" (whose $3/n$ is the one-sided 95% bound). This is the canonical Wald failure mode.

Coverage, exactly computed — the coverage-explorer widget

The second widget reverses the question. Instead of one sample with multiple CIs, it fixes the CI procedure and varies the parameter, computing the EXACT coverage probability at every value via the $\sum_k \mathbb{1}[p \in C(k)]\,\binom{n}{k} p^k(1-p)^{n-k}$ formula. No Monte Carlo error. The output is the "coverage curve" the Brown-Cai-DasGupta (2001) paper made famous.

[Interactive figure: coverage-explorer widget]

Things to verify in the widget:

  • At n = 30, 95% nominal, with Wald + Wilson + Clopper–Pearson selected: the Wald (red) curve oscillates and drops below 80% near $p = 0.05$ — far below the 95% nominal line. Wilson (yellow) stays inside the 94–96% band almost everywhere. Clopper–Pearson (green) sits above 95% across the whole range, hitting 98+% in some regions.
  • Slide n up to 100. The Wald oscillation tightens and the overall coverage approaches nominal — the asymptotic-normality approximation is now working harder for it. The other three already hugged nominal at n = 30 and barely change.
  • Slide n down to 10. The Wald curve becomes WILD — coverage drops below 70% in some regions. Wilson still mostly works (within 90-95%). Clopper-Pearson now over-covers more dramatically (some regions at 99+%). This is the "small n where boundary matters" regime where the choice of method makes the biggest practical difference.
  • Toggle Agresti–Coull on. Note that it tracks Wilson almost exactly at 95% nominal — Brown-Cai-DasGupta (2001) and Agresti-Coull (1998) note this near-equivalence. At 99% nominal, they diverge slightly: Agresti-Coull is a touch more conservative.
  • Move the "Point-evaluation true p" slider. The numeric table updates with the exact coverage of each method at that p. Wald typically shows in red (UNDER-covers); Clopper-Pearson in yellow (over-covers); Wilson in green (≈ nominal). The verdict column quantifies what the curves show.
  • Try 90% nominal vs 99% nominal. At 99% the Wald sawtooth becomes more pronounced (the boundary effect is larger relative to the smaller $\alpha$). At 90% nominal Wald is less catastrophic but still oscillates. The take-away: the Wald deficiency is structural, not specific to the 95% convention.

A confidence interval has TWO bounds. A confidence BOUND has one. A one-sided UPPER bound $U(X)$ at level $1 - \alpha$ satisfies $P_\theta(\theta \le U(X)) \ge 1 - \alpha$ for every $\theta$; a one-sided LOWER bound is the mirror. The post-hoc form is "we are 95% confident $\theta \le 0.40$" or "we are 95% confident $\theta \ge 0.05$", with the usual frequentist sloppy-short-hand caveat.

When to use which?

  • Two-sided CI — the default for "what range of $\theta$ is consistent with the data?". Most reporting uses this.
  • One-sided UPPER bound — when only an upper limit is scientifically meaningful. Canonical example: bounding a rare-event rate ("the rate of complications is at most …"); regulatory non-inferiority studies (the new drug's relative risk is at most 1.05); detection-limit settings ("the contamination is at most …").
  • One-sided LOWER bound — when only a lower limit is meaningful. Canonical example: bounding a sensitivity or coverage rate ("the test's true-positive rate is at least 0.90"); shelf-life lower bounds.

The §2.7 BIOEQUIVALENCE machinery was effectively a TWO ONE-SIDED test pair: the equivalence claim succeeds iff the $1 - 2\alpha$ two-sided CI for the difference fits inside the margin $(-\delta, +\delta)$. Equivalently: BOTH one-sided tests reject at level $\alpha$. The Schuirmann (1987) TOST procedure exploits this dual. The §3.1 lesson: every two-sided CI has a one-sided sibling, and the appropriate one depends on the research question.

One CRITICAL caveat: the one-sided $1 - \alpha$ confidence bound is DIFFERENT from a two-sided $1 - \alpha$ CI's endpoint. The one-sided 95% upper bound corresponds to the upper end of a TWO-SIDED 90% CI, not a 95% CI (because 95% one-sided uses $z_{0.95} = 1.645$, not $z_{0.975} = 1.96$). Confusing the two is a common error — see the FDA guidance documents on confidence-interval reporting for examples.
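The correspondence between a one-sided 95% bound and a two-sided 90% interval can be verified numerically. The sketch below uses the Wald form for a binomial proportion purely as a convenient example; the data values ($\hat p = 0.30$, $n = 200$) and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wald_upper_bound(p_hat, n, level=0.95):
    """One-sided upper bound: uses z_{1 - alpha}."""
    z = NormalDist().inv_cdf(level)
    return p_hat + z * sqrt(p_hat * (1 - p_hat) / n)

def wald_two_sided(p_hat, n, level=0.95):
    """Two-sided interval: uses z_{1 - alpha/2}."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

p_hat, n = 0.30, 200  # hypothetical data, chosen well inside the parameter space
upper_95 = wald_upper_bound(p_hat, n, 0.95)
two_sided_90 = wald_two_sided(p_hat, n, 0.90)
two_sided_95 = wald_two_sided(p_hat, n, 0.95)

print(f"one-sided 95% upper bound : {upper_95:.4f}")
print(f"two-sided 90% upper end   : {two_sided_90[1]:.4f}")
print(f"two-sided 95% upper end   : {two_sided_95[1]:.4f}")
```

The first two printed values agree; the third is larger, which is why reporting the upper end of a two-sided 95% CI as a "95% upper bound" overstates the bound.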

A BAYESIAN CREDIBLE INTERVAL at level $1 - \alpha$ is an interval $[a, b]$ such that, GIVEN A PRIOR $\pi(\theta)$ and the data $X$,

$$P\bigl(\theta \in [a, b]\;\big|\; X\bigr) \;=\; 1 - \alpha.$$

Read the LHS as a POSTERIOR probability: a probability distribution on $\theta$, conditioned on the observed data, integrated over $[a, b]$. The credible interval makes the statement "the probability that $\theta \in [a, b]$ is 95%" formally TRUE — exactly the statement the frequentist CI cannot make.

Three things to know about the comparison:

  • Different conceptual meaning. Frequentist CI: a property of the procedure under repeated sampling. Bayesian credible interval: a posterior probability statement about $\theta$. Both can take the SAME numeric value (e.g., for the binomial with the Jeffreys Beta(1/2, 1/2) prior, the credible interval and the Wilson CI are numerically similar at moderate $n$), but they answer different questions.
  • Frequentist CI has long-run frequency interpretation, Bayesian does not need one. If you do not believe in the "imagined replications" thought experiment (which underlies the frequentist 95%), the credible interval is more direct. If you do, the CI is conceptually cleaner because it requires no prior. Part 7 develops the Bayesian framework in detail.
  • Wasserstein-Schirm-Lazar (2019) endorse both as alternatives to dichotomous p-values. A 95% credible interval, a 95% CI, and a posterior-density visualisation are all in the recommended-reporting set; the choice between them is methodological and field-dependent.
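To make the comparison concrete, here is a minimal sketch that computes an equal-tailed 95% credible interval for a binomial proportion and sets it beside the Wilson CI. Everything in it is an assumption made for illustration: the data (12 successes in 40 trials) are hypothetical, and the prior is a flat Beta(1, 1), chosen only because integer Beta parameters let the posterior CDF be written as a binomial tail sum with the standard library alone.

```python
from math import comb, sqrt
from statistics import NormalDist

def beta_cdf_int(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) at x for positive INTEGER a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1 - x)**(m - j) for j in range(a, m + 1))

def beta_quantile_int(q: float, a: int, b: int, tol: float = 1e-12) -> float:
    """Quantile of Beta(a, b) by bisection (the CDF is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def flat_prior_credible_interval(k: int, n: int, level: float = 0.95):
    """Equal-tailed credible interval; posterior is Beta(k + 1, n - k + 1)."""
    alpha = 1 - level
    a, b = k + 1, n - k + 1
    return beta_quantile_int(alpha / 2, a, b), beta_quantile_int(1 - alpha / 2, a, b)

def wilson_ci(k: int, n: int, level: float = 0.95):
    """Wilson score CI in its closed form."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    p_hat = k / n
    centre = p_hat + z * z / (2 * n)
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - half) / denom, (centre + half) / denom

k, n = 12, 40  # hypothetical data
credible = flat_prior_credible_interval(k, n)
wilson = wilson_ci(k, n)
print("flat-prior 95% credible interval:", credible)
print("Wilson 95% CI:                   ", wilson)
```

The two intervals should land close together, yet only the first supports a direct posterior probability statement about the parameter, and only conditional on the chosen prior.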

For the rest of Part 3 we stay in the frequentist framework — bootstrap CIs (§3.2), profile-likelihood CIs (§3.3), prediction intervals (§3.4), calibration (§3.5), communication (§3.6). Part 7 picks up the Bayesian thread.

Try it

  • In the ci-methods-comparison, set Binomial p, n = 20, true p = 0.05, 95% confidence. Re-roll five times. For each sample, note (i) the four CIs and (ii) whether each covers the truth. Count the misses across the five re-rolls per method. Which method has the most misses? Connect this empirical count to the Brown-Cai-DasGupta (2001) coverage-curve result.
  • Same widget. Pick Binomial p with n = 30. Try p = 0.5 (centre of the parameter space). Re-roll a few times. Note how the four CIs are almost identical — width and position. Now slide p down to 0.05. Re-roll. The CIs now disagree visibly. Argue: the Wald approximation is best in the INTERIOR of the parameter space and worst near the BOUNDARY.
  • Same widget. Switch to Poisson λ with n = 20, true λ = 0.5. Re-roll until you get a sample with total k ≤ 5. Read off the four CIs. The exact-gamma and profile CIs are ASYMMETRIC (longer right tail); Wald is symmetric. Argue why this matters when the parameter is close to a hard boundary (λ ≥ 0): the symmetric Wald form requires capping, the asymmetric exact form does not.
  • In the coverage-explorer, set n = 30, 95% nominal, all four methods on. Find the value of p between 0.01 and 0.50 where the Wald coverage drops the LOWEST. (Hint: the dip near p ≈ 0.02-0.03.) Read off the exact coverage at the point slider. Now find the corresponding Clopper-Pearson coverage. Argue: at this p, a literature relying on Wald 95% CIs would have an actual error rate around 12-15%, not 5% — and a literature relying on Clopper-Pearson would have closer to 2-3%, not 5%.
  • Same widget. Slide n from 10 to 200 at p = 0.05. Watch the Wald curve's minimum coverage climb toward 95%. At what n is Wald coverage within 1 percentage point of nominal at p = 0.05? This is the EFFECTIVE LARGE-n threshold for Wald near p = 0.05 — typically n > 200. Argue why textbook statements like "Wald is OK for np ≥ 5" are conservative-enough rules of thumb but Brown-Cai-DasGupta (2001) prefer to abandon Wald entirely.
  • Pen-and-paper. Derive the Wilson score CI from scratch. Start with the score-test pivot $(\hat p - p)/\sqrt{p(1-p)/n}$. Set its absolute value equal to $z$ and SOLVE for $p$. Show that the two roots of the resulting quadratic in $p$ give the Wilson endpoints.
  • Pen-and-paper. For Binomial(n = 20, p = 0.1), compute the Clopper-Pearson 95% CI for k = 1. Use the Beta-quantile form: $p_L = \text{Beta}(0.025; 1, 20)$, $p_U = \text{Beta}(0.975; 2, 19)$. (You may use software for the quantiles; the answer is approximately [0.001, 0.249].) Compare with the Wald CI $1/20 \pm 1.96\sqrt{0.05 \cdot 0.95/20} \approx [-0.045, 0.145]$ and note that the Wald LOWER bound is NEGATIVE. Argue why Wald is unusable here.
  • Pen-and-paper. For the rule of three: argue that if $k = 0$ events are observed in $n$ trials, the upper endpoint of the TWO-SIDED 95% Clopper-Pearson CI for $p$ satisfies $(1 - p_U)^n = 0.025$, hence $p_U \approx -\log(0.025)/n \approx 3.7/n$, whereas the ONE-SIDED 95% upper confidence bound satisfies $(1 - p_U)^n = 0.05$, hence $p_U \approx -\log(0.05)/n \approx 3/n$. Hanley & Lippman-Hand (1983, JAMA) gave the one-sided version the punchier name "rule of three" using log(1/0.05) ≈ 3. State the result for n = 100 (no adverse events: one-sided 95% upper bound ≈ 3%) and for n = 1000 (one-sided 95% upper bound ≈ 0.3%).
  • Pen-and-paper. The Student-t correction matters most for SMALL n. Compute $t_{0.975, df}$ for df = 5, 10, 30, 100, 1000 (use a t-table). Note the convergence to $z_{0.975} = 1.96$. Argue: for n ≥ 30, the practical difference between z- and t-based CIs is negligible; for n ≤ 10, it is substantial; for n ≤ 5, the t-CI is significantly wider.
  • Pen-and-paper. Show that the LRT-based CI for binomial p with n large is asymptotically equivalent to the Wilson CI. (Hint: Taylor-expand the log-likelihood $\ell(p)$ around $\hat p$ to second order; the LRT statistic $2[\ell(\hat p) - \ell(p)]$ becomes the quadratic $n(p - \hat p)^2/[\hat p(1 - \hat p)]$ in $p - \hat p$. Setting that quadratic equal to $z^2$ gives the Wald equation, which differs from the Wilson equation $n(\hat p - p)^2/[p(1 - p)] = z^2$ only in evaluating the variance at $\hat p$ rather than at $p$, a difference that vanishes as $n$ grows.) This first-order agreement is consistent with the four-method comparison widget showing the Wilson and profile-LRT bars almost overlapping at moderate n.
  • Pen-and-paper. State the difference between a one-sided 95% upper confidence bound and the upper endpoint of a two-sided 95% CI. (The one-sided uses $z_{0.95} = 1.645$; the two-sided upper endpoint uses $z_{0.975} = 1.96$. The one-sided 95% upper bound is therefore LOWER than the two-sided 95% upper endpoint.) Give a research scenario where the difference would matter, e.g., a regulatory non-inferiority filing where the relevant question is "is the relative risk < 1.05?" and the right tool is the one-sided 95% upper bound, not the two-sided 95% CI upper endpoint.
  • Pen-and-paper. A clinical study reports "95% CI for cure rate: [0.62, 0.78]". A reader writes "the probability that the true cure rate is in [0.62, 0.78] is 0.95". Diagnose the misstatement. Provide the CORRECT frequentist reading. Then provide the BAYESIAN reading (which DOES make the probability statement directly, conditional on a prior). State why both readings can lead to similar practical decisions but answer different formal questions.
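The numeric answers in the pen-and-paper exercises can be checked with a few lines of code. The sketch below is one way to do so under stated assumptions: it uses only the standard library, obtains the Clopper-Pearson endpoints by directly inverting the binomial tail probabilities with bisection (equivalent to the Beta-quantile form), and uses helper names that are illustrative rather than taken from the lesson.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def solve_p(k: int, n: int, target: float, tol: float = 1e-12) -> float:
    """Find p in [0, 1] with P(X <= k | p) = target (the CDF decreases in p)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid  # CDF still too large, so p must be larger
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    # Lower endpoint: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_p(k - 1, n, 1 - alpha / 2)
    # Upper endpoint: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve_p(k, n, alpha / 2)
    return lower, upper

def wald(k: int, n: int, alpha: float = 0.05):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def zero_event_upper_bound(n: int, alpha: float = 0.05) -> float:
    """Exact one-sided upper bound when k = 0: solves (1 - p)^n = alpha."""
    return 1 - alpha ** (1 / n)

cp_1_of_20 = clopper_pearson(1, 20)
wald_1_of_20 = wald(1, 20)
rule3_100 = zero_event_upper_bound(100)
rule3_1000 = zero_event_upper_bound(1000)

print("Clopper-Pearson, k=1, n=20:", cp_1_of_20)
print("Wald, k=1, n=20:           ", wald_1_of_20)
print("zero events, n=100:  exact", rule3_100, "vs 3/n =", 3 / 100)
print("zero events, n=1000: exact", rule3_1000, "vs 3/n =", 3 / 1000)
```

The Clopper-Pearson interval stays inside [0, 1], the Wald lower limit falls below zero, and the exact zero-event bounds sit just under 3/n.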

Pause and reflect: §3.1 has set out the CI framework. A CI is a PROCEDURE-level guarantee, not a statement about the realised interval. The Wald CI is short, generic, and works asymptotically; it breaks near boundaries, under skew, and at small n. The Clopper–Pearson CI is exact in the conservative sense (coverage ≥ nominal) for the binomial; the Garwood gamma is the Poisson analogue; the Student-t CI is exact under Normality for the Normal mean. The Wilson score CI is the modern default for the binomial because its coverage is closest to nominal across the whole parameter space. The likelihood-ratio CI (developed in §3.3) is the general-purpose finite-sample tool. The COVERAGE PROBABILITY is the key diagnostic — Brown-Cai-DasGupta (2001) computed it exactly and argued Wald should be retired from textbook teaching. The two widgets make these abstract claims visible. §3.2 picks up with BOOTSTRAP CIs — a resampling-based approach that side-steps the asymptotic-normality assumption entirely, with its own trade-offs.

What you now know

You can state the formal definition of a confidence interval as a procedure-level property: for confidence level $1 - \alpha$, the procedure $C(X)$ satisfies $P_\theta(\theta \in C(X)) \ge 1 - \alpha$ for every $\theta$. You can articulate why the post-data sloppy short-hand "I am 95% confident $\theta \in [a, b]$" is technically incorrect under the frequentist reading; you can supply the correct phrasing ("the procedure covers $\theta$ in 95% of repeated samples").

You can derive the Wald CI $\hat\theta \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta)$ from the asymptotic-normality result and identify the three regimes where it breaks: parameter values near a boundary, highly skewed sampling distributions, and small $n$. You can write down the Clopper-Pearson (1934) exact binomial CI via the Beta-quantile form, the Wilson (1927) score CI as a closed-form quadratic root, the Garwood (1936) gamma-based Poisson CI, and the Student-t (Gosset 1908) CI for the Normal mean. You can preview the likelihood-ratio CI $\{\theta : 2[\ell(\hat\theta) - \ell(\theta)] \le \chi^2_{1, 1-\alpha}\}$ and know that §3.3 develops it.
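As a quick self-check on the Garwood interval, the sketch below computes it without any chi-square or gamma quantile routine, by inverting the Poisson tail probabilities for the total mean $n\lambda$ with bisection; this is equivalent to the chi-square form in the learning objectives. It assumes only the standard library, the example counts are hypothetical, and the function names are illustrative.

```python
from math import exp, lgamma, log

def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), with terms computed on the log scale."""
    if mu <= 0:
        return 1.0
    return sum(exp(-mu + j * log(mu) - lgamma(j + 1)) for j in range(k + 1))

def solve_mu(k: int, target: float, hi: float, tol: float = 1e-12) -> float:
    """Find mu in [0, hi] with P(X <= k | mu) = target (the CDF decreases in mu)."""
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if poisson_cdf(k, mid) > target:
            lo = mid  # CDF still too large, so mu must be larger
        else:
            hi = mid
    return (lo + hi) / 2

def garwood_ci(k: int, n: int, alpha: float = 0.05):
    """Exact CI for a Poisson rate, given total count k over n units of exposure."""
    bracket = 2.0 * k + 20.0  # generous upper search limit for the total mean
    # Lower: P(X >= k | mu) = alpha/2, i.e. P(X <= k-1 | mu) = 1 - alpha/2.
    mu_lo = 0.0 if k == 0 else solve_mu(k - 1, 1 - alpha / 2, bracket)
    # Upper: P(X <= k | mu) = alpha/2.
    mu_hi = solve_mu(k, alpha / 2, bracket)
    return mu_lo / n, mu_hi / n

# Hypothetical data: 5 events in total across n = 20 units.
rate_ci = garwood_ci(5, 20)
print("point estimate:", 5 / 20)
print("Garwood 95% CI:", rate_ci)
```

The interval is asymmetric around the point estimate, with the longer arm on the right, and its lower limit never crosses zero.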

You can define coverage probability as the actual fraction of times a CI procedure covers the truth in repeated samples, distinguish nominal from actual, and quote the Brown-Cai-DasGupta (2001) verdict: Wald is chaotic with severe under-coverage near binomial boundaries; Wilson is close to nominal; Clopper-Pearson always over-covers. You can use the ci-methods-comparison widget to see how dramatically the four CIs disagree on the same data (especially near boundaries and at small $n$) and the coverage-explorer widget to see the exact coverage probabilities computed by summing over the full discrete sample space.
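The "summing over the full discrete sample space" computation is short enough to sketch. The code below is a minimal, standard-library illustration and not the widget's implementation: for a chosen $n$ and true $p$, it adds up the binomial probabilities of every outcome $k$ whose interval contains $p$. The setting n = 30, p = 0.05 and all function names are assumptions made for the example.

```python
from math import comb, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

def wald(k, n):
    p_hat = k / n
    half = Z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson(k, n):
    p_hat = k / n
    centre = p_hat + Z * Z / (2 * n)
    half = Z * sqrt(p_hat * (1 - p_hat) / n + Z * Z / (4 * n * n))
    denom = 1 + Z * Z / n
    return (centre - half) / denom, (centre + half) / denom

def _solve_p(k, n, target, tol=1e-12):
    """Bisection for P(X <= k | p) = target; the CDF decreases in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper

def coverage(method, n, p):
    """Exact coverage: total probability of outcomes k whose interval contains p."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = method(k, n)
        if lo <= p <= hi:
            total += binom_pmf(k, n, p)
    return total

n, p = 30, 0.05  # assumed setting for illustration
wald_cov = coverage(wald, n, p)
wilson_cov = coverage(wilson, n, p)
cp_cov = coverage(clopper_pearson, n, p)
print(f"Wald            {wald_cov:.4f}")
print(f"Wilson          {wilson_cov:.4f}")
print(f"Clopper-Pearson {cp_cov:.4f}")
```

At this setting the ordering should match the verdict quoted above: Wald well below nominal, Clopper-Pearson above it, Wilson the closest of the three. Looping the same function over a grid of $p$ values traces out a coverage curve.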

You can distinguish two-sided CIs from one-sided confidence bounds; you can cite the §2.7 TOST procedure as a two-one-sided construction; and you can warn about the $z_{0.95}$ vs $z_{0.975}$ confusion between one-sided 95% bounds and two-sided 95% endpoints. You can distinguish frequentist CIs from Bayesian credible intervals (different conceptual content even when the numerics happen to agree), and you can preview Part 7 as the home of the Bayesian alternative.

Where this lands in the rest of Part 3. §3.2 picks up with BOOTSTRAP CIs (percentile, BCa, basic) — a resampling-based approach that side-steps the asymptotic-normality assumption entirely. §3.3 develops the PROFILE-LIKELIHOOD and LRT-based CIs in full generality, including the multi-parameter case. §3.4 distinguishes prediction intervals (uncertainty about a future observation) from confidence intervals (uncertainty about a parameter). §3.5 takes calibration seriously: when does a 95% CI actually mean 95%, and how do we diagnose miscalibration? §3.6 closes Part 3 on the communication side: how to report uncertainty without lying, drawing on Wasserstein-Schirm-Lazar (2019) and the broader reproducibility agenda.

References

  • Neyman, J. (1937). "Outline of a theory of statistical estimation based on the classical theory of probability." Philosophical Transactions of the Royal Society of London, Series A 236(767), 333–380. (The foundational paper defining confidence intervals as a frequentist procedure. The "procedure covers $\theta$ in $1 - \alpha$ of repeated samples" reading is Neyman's.)
  • Clopper, C.J., Pearson, E.S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial." Biometrika 26(4), 404–413. (The exact binomial CI via inversion of the binomial test. The interval is guaranteed to attain at least nominal coverage; it is also conservative.)
  • Wilson, E.B. (1927). "Probable inference, the law of succession, and statistical inference." Journal of the American Statistical Association 22(158), 209–212. (The score CI for the binomial. Pre-dates Clopper-Pearson by seven years and is the modern default for proportion intervals.)
  • Agresti, A., Coull, B.A. (1998). "Approximate is better than 'exact' for interval estimation of binomial proportions." The American Statistician 52(2), 119–126. (The "add two successes, add two failures" simplification of Wilson, plus the empirical case that approximate methods can outperform exact ones in coverage at the nominal level.)
  • Brown, L.D., Cai, T.T., DasGupta, A. (2001). "Interval estimation for a binomial proportion." Statistical Science 16(2), 101–117. (The definitive paper on binomial CI coverage. Computes exact coverage of Wald, Wilson, Clopper-Pearson, Agresti-Coull, and Jeffreys-Bayes across the parameter space; recommends retiring Wald from textbook teaching for the binomial.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 6, "Estimation": the Wald CI as the default asymptotic procedure.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 9, "Interval Estimation": the inverting-a-test framework, with worked examples for Wald, score, and exact CIs.)
  • Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Section 7.2 on LRT-based CIs and the $\chi^2_1$ calibration via Wilks 1938.)
  • Lehmann, E.L., Romano, J.P. (2005). Testing Statistical Hypotheses (3rd ed.). Springer. (Chapter 12 on the duality between tests and confidence sets, including the multi-parameter generalisation of the LRT inversion.)
  • Hanley, J.A., Lippman-Hand, A. (1983). "If nothing goes wrong, is everything all right? Interpreting zero numerators." JAMA 249(13), 1743–1745. (The "rule of three" for the $k = 0$ binomial: $p_U \approx 3/n$ for the 95% upper bound.)
