Profile-likelihood and likelihood-ratio CIs

Part 3 — Confidence intervals and uncertainty

Learning objectives

  • State the LIKELIHOOD-RATIO statistic Lambda(theta) = L(theta_hat) / L(theta) and its log-form 2[ell(theta_hat) - ell(theta)], and recognise it as the test statistic for H_0: theta = theta_0 vs the unrestricted alternative
  • State WILKS'S THEOREM (1938, Annals of Mathematical Statistics): under regularity (interior parameter, smooth likelihood, identifiability), 2[ell(theta_hat) - ell(theta_0)] converges in distribution to chi^2_q under H_0, where q is the number of parameters being tested
  • Derive the LRT-based CI by inverting the test: C^LRT_{1-alpha} = { theta : 2[ell(theta_hat) - ell(theta)] <= chi^2_{1, 1-alpha} } = { theta : ell(theta) >= ell(theta_hat) - chi^2_{1, 1-alpha}/2 }. State the chi^2/2 cutoff (= 1.92 for 95% nominal, = 2.71/2 = 1.35 for 90%, = 6.63/2 = 3.32 for 99%)
  • Work the BERNOULLI p case: ell(p) = k log p + (n-k) log(1-p), p_hat = k/n. The LRT statistic at p is 2[k log(p_hat/p) + (n-k) log((1-p_hat)/(1-p))]. CI by numerical root-finding (bisection): the two p values where the LRT crosses chi^2/2
  • Work the EXPONENTIAL lambda case: ell(lambda) = n log lambda - lambda T (T = sum x_i), lambda_hat = n/T. The log-likelihood is asymmetric (right-skewed), so the LRT CI is wider on the right of lambda_hat than the left — exactly the asymmetry that Wald cannot capture
  • Work the NORMAL mu (known sigma) case: ell(mu) = -n(mu - x_bar)^2/(2 sigma^2). The log-likelihood is exactly quadratic, so the LRT and Wald CIs COINCIDE: x_bar +/- z_{1-alpha/2} * sigma/sqrt(n). This is the special-case equivalence that confirms Wald is the LRT's quadratic approximation
  • Define PROFILE LIKELIHOOD for multi-parameter models: with parameter vector theta = (theta_1, theta_2) where theta_1 is the parameter of interest and theta_2 is nuisance, the profile log-likelihood is ell_p(theta_1) = max_{theta_2} ell(theta_1, theta_2). The profile LRT CI is { theta_1 : 2[ell_p(theta_1_hat) - ell_p(theta_1)] <= chi^2_{1, 1-alpha} }
  • Recognise that the chi^2 distribution of the profile LRT statistic for a single parameter (with nuisance parameters profiled) is STILL chi^2_1 (Wilks 1938, Chernoff 1954): the dimension of the chi^2 equals the number of CONSTRAINED parameters under H_0, NOT the total number of parameters in the model
  • List the THREE reasons LRT CIs are PREFERRED over Wald CIs in finite samples: (1) no Gaussian / quadratic approximation needed — works for any log-concave likelihood; (2) automatically asymmetric when the log-likelihood is asymmetric — correctly allocates uncertainty near boundaries; (3) REPARAMETERIZATION INVARIANCE — the LRT CI for theta and the LRT CI for g(theta) commute via the monotone map g, while Wald CIs do not (a transformed Wald CI differs from the Wald CI of the transformation, by the delta-method approximation error)
  • Quantify the COMPUTATIONAL COST: LRT CIs require numerical root-finding (typically bisection) for the lower and upper endpoints because the equation 2[ell(theta_hat) - ell(theta)] = chi^2_{1, 1-alpha} (equivalently ell(theta_hat) - ell(theta) = chi^2_{1, 1-alpha}/2) has no closed form for most models. In one-parameter problems this is one bisection per endpoint, ~30-80 likelihood evaluations. In profile-likelihood problems the inner max_{theta_2} is itself a sub-optimisation per evaluation of ell_p, so the total cost is roughly (number of bisection iterations) x (cost of one profile-likelihood evaluation)
  • State the REGULARITY CONDITIONS for Wilks's theorem: (i) the true theta is in the INTERIOR of the parameter space; (ii) the likelihood is identifiable; (iii) the third-derivative-bounded smoothness assumption; (iv) the Fisher-information matrix is positive-definite. When (i) fails (boundary parameter, e.g. variance component sigma^2 = 0), the chi^2_1 calibration is wrong — the asymptotic distribution becomes a 50:50 mixture of a point mass at 0 and a chi^2_1 (Chernoff 1954, Self & Liang 1987, Stram & Lee 1994)
  • Recognise that PROFILE-LIKELIHOOD CIs are what confint() returns for a glm() fit in R (the confint.glm method), and that many maximum-likelihood packages offer them. Wald CIs are what summary output typically prints for ease (in Python, the .conf_int() method of statsmodels GLM results is Wald-based), but the profile-LRT CIs are the gold standard for regression model inference

Section §3.1 set out the asymptotic-vs-exact dichotomy: Wald and Wilson CIs use the asymptotic Normal pivot, while Clopper–Pearson and Garwood invert exact discrete tests. Section §3.2 worked through the empirical, resampling-based bootstrap alternative — percentile, basic, BCa, bootstrap-t — that gives a CI for any functional of the empirical CDF without a parametric model. §3.3 turns to the THIRD major confidence-interval methodology: the PROFILE-LIKELIHOOD CI, also known as the LRT-based CI or the likelihood-ratio CI. This is the parametric, likelihood-based alternative that owns the middle ground between the rigid Wald form and the assumption-light bootstrap.

The starting point is the LIKELIHOOD-RATIO STATISTIC. For a parametric model with likelihood $L(\theta \mid X)$, the LRT statistic at any candidate $\theta$ is

$$\Lambda(\theta) \;=\; \dfrac{L(\hat\theta)}{L(\theta)}, \qquad \text{or in log form:} \qquad 2\bigl[\ell(\hat\theta) - \ell(\theta)\bigr],$$

where $\hat\theta$ is the MLE and $\ell = \log L$. The statistic is non-negative (the MLE maximises $\ell$), zero at $\theta = \hat\theta$, and grows as $\theta$ moves away from the MLE. Large values are evidence AGAINST $H_0: \theta = \theta_0$. The classical Neyman–Pearson LRT rejects $H_0$ when $\Lambda(\theta_0)$ exceeds a threshold. The LRT-based confidence interval INVERTS that test: it is the set of $\theta$ values that the LRT would NOT reject at level $\alpha$.

The arc has nine stops. First, the LRT statistic and Wilks's theorem: the asymptotic $\chi^2_1$ calibration that makes the inversion work. Second, the $\chi^2/2$ cutoff in log-likelihood units and the inversion that defines the CI. Third, three worked examples: Bernoulli $p$, Exponential $\lambda$, Normal $\mu$ with known $\sigma$. Fourth, the lrt-ci-builder widget that draws the log-likelihood curve and shows the LRT CI as the region above the cutoff. Fifth, PROFILE LIKELIHOOD for multi-parameter models (Venzon and Moolgavkar 1988; Pawitan 2001). Sixth, the THREE structural advantages of LRT over Wald: no quadratic approximation, asymmetric CIs near boundaries, and reparameterization invariance. Seventh, the profile-vs-wald-coverage widget that compares exact coverage curves for the binomial. Eighth, the regularity conditions and what happens when they fail (boundary parameters, the Chernoff 1954 50:50 mixture). Ninth, the practical recommendation and the connection to Part 4 regression.

Wilks's theorem and the $\chi^2/2$ cutoff

The keystone result is WILKS'S THEOREM (Wilks 1938, Annals of Mathematical Statistics 9(1), 60–62). Under regularity (interior parameter, smooth likelihood, identifiability, positive-definite Fisher information), the log-likelihood-ratio statistic at the true parameter is asymptotically $\chi^2$:

$$2\bigl[\ell(\hat\theta) - \ell(\theta)\bigr] \;\xrightarrow{d}\; \chi^2_q \qquad \text{as}\; n \to \infty,$$

where $q$ is the number of parameters being constrained under $H_0$. For a one-parameter test ($\theta$ is a scalar), $q = 1$. For a multi-parameter test (e.g. testing $H_0: \beta_1 = \beta_2 = 0$ in a regression), $q = 2$. The dimension is the number of FREE-vs-CONSTRAINED parameter directions.
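A quick Monte Carlo check can make the $\chi^2_1$ calibration concrete. The sketch below assumes the Exponential rate model that this section works through later, with illustrative settings (n = 50, true rate 1.25, 4000 replications, a fixed seed) that are not taken from the text; the function names are hypothetical. It generates data under $H_0$ and counts how often the LRT statistic exceeds the 95% cutoff.

```python
import random
from math import log
from statistics import NormalDist

def exp_lrt_stat(xs, lam0):
    """2[ell(lam_hat) - ell(lam0)] for an Exponential(rate) sample."""
    n, total = len(xs), sum(xs)
    lam_hat = n / total
    return 2 * n * (log(lam_hat / lam0) - (1 - lam0 / lam_hat))

def rejection_rate(n=50, lam0=1.25, reps=4000, conf=0.95, seed=1):
    """Fraction of samples drawn under H0 whose LRT statistic exceeds the cutoff."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf((1 + conf) / 2) ** 2   # chi^2_{1, conf} = z^2
    hits = 0
    for _ in range(reps):
        xs = [rng.expovariate(lam0) for _ in range(n)]   # data generated under H0
        hits += exp_lrt_stat(xs, lam0) > cutoff
    return hits / reps

rate = rejection_rate()
print(rate)   # should sit near 0.05 if chi^2_1 is a good calibration
```

If Wilks's theorem is doing its job at this sample size, the printed rate should land near the nominal 5%; shrinking n is a simple way to watch the approximation degrade.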

For a CI on a single parameter, $q = 1$ and the cutoff is $\chi^2_{1, 1-\alpha}$. Three values matter:

  • 90% confidence: $\chi^2_{1, 0.90} \approx 2.706$, so $\chi^2/2 \approx 1.353$.
  • 95% confidence: $\chi^2_{1, 0.95} \approx 3.841$, so $\chi^2/2 \approx 1.921$.
  • 99% confidence: $\chi^2_{1, 0.99} \approx 6.635$, so $\chi^2/2 \approx 3.317$.

The relationship $\chi^2_{1, 1-\alpha} = z_{1-\alpha/2}^2$ (where $z$ is the standard-Normal quantile) is exact: 95% Wald uses $z = 1.96$ and 95% LRT uses $\chi^2 = 1.96^2 = 3.8416$. They are the same cutoff written in two different scales, squared standard-Normal vs $\chi^2_1$. The shift to a half-cutoff in log-likelihood units arises because the LRT statistic is $2[\ell(\hat\theta) - \ell(\theta)]$: dividing the $\chi^2$ cutoff by that factor of 2 gives the allowed drop in $\ell$ itself.
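The three cutoffs can be reproduced from the identity $\chi^2_{1, 1-\alpha} = z_{1-\alpha/2}^2$ alone. This is a minimal sketch using only the Python standard library; the helper name `chi2_1_quantile` is an illustrative choice, not something the text defines.

```python
from statistics import NormalDist

def chi2_1_quantile(conf):
    """Quantile of chi^2_1 at level `conf`, via chi^2_{1, conf} = z_{(1 + conf)/2}^2."""
    z = NormalDist().inv_cdf((1 + conf) / 2)
    return z * z

for conf in (0.90, 0.95, 0.99):
    c = chi2_1_quantile(conf)
    # c is the cutoff for the LRT statistic; c / 2 is the allowed drop in log-likelihood
    print(f"{conf:.2f}: chi2 = {c:.3f}, log-likelihood drop = {c / 2:.3f}")
```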

The LRT-based confidence interval is the inversion:

$$\boxed{\;C^{\mathrm{LRT}}_{1-\alpha} \;=\; \bigl\{\theta \;:\; 2[\ell(\hat\theta) - \ell(\theta)] \le \chi^2_{1, 1-\alpha}\bigr\} \;=\; \bigl\{\theta \;:\; \ell(\theta) \ge \ell(\hat\theta) - \chi^2_{1, 1-\alpha}/2\bigr\}\;}$$

Read the second form geometrically: draw the log-likelihood curve $\ell(\theta)$. Mark the peak at $\ell(\hat\theta)$. Drop a horizontal line at height $\ell(\hat\theta) - \chi^2/2$. The set of $\theta$ where the log-likelihood lies ABOVE that line is the LRT confidence interval. For a well-behaved unimodal log-likelihood the set is a single connected interval $[\theta_L, \theta_U]$; the endpoints are the two solutions of the equation $\ell(\theta) = \ell(\hat\theta) - \chi^2/2$.

Three worked examples: Bernoulli, Exponential, Normal

The LRT CI shines on examples where the log-likelihood is clearly non-quadratic, and reduces to the Wald CI when the log-likelihood is exactly quadratic. The three canonical one-parameter models cover both regimes.

Bernoulli $p$. The likelihood for $k$ successes in $n$ trials is $L(p) = p^k (1-p)^{n-k}$, and

$$\ell(p) \;=\; k \log p + (n - k) \log(1 - p), \qquad \hat p = k/n.$$

The LRT statistic at any $p$ is

$$2\bigl[\ell(\hat p) - \ell(p)\bigr] \;=\; 2k \log\!\bigl(\hat p / p\bigr) \;+\; 2(n - k) \log\!\bigl((1 - \hat p) / (1 - p)\bigr).$$

Set this equal to $\chi^2_{1, 1-\alpha}$ and solve numerically (bisection) for the two roots $p_L, p_U$. For $k = 3, n = 30, \alpha = 0.05$: $\hat p = 0.1$, $\chi^2 = 3.84$. Bisecting yields roughly $p_L \approx 0.026$ and $p_U \approx 0.239$, visibly ASYMMETRIC about $\hat p$ (the upper half-width $\approx 0.139$, the lower $\approx 0.074$). The corresponding 95% Wald CI is $0.1 \pm 1.96\sqrt{0.1 \cdot 0.9 / 30} = 0.1 \pm 0.107 = (-0.007, 0.207)$, which is symmetric and extends below 0 on the lower side (so it needs clipping). The LRT CI is the cleaner answer.
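The bisection for this example can be sketched as follows. The function names are illustrative, and the sketch assumes $0 < k < n$ so that both endpoints lie strictly inside $(0, 1)$; the boundary cases $k = 0$ and $k = n$ need the clamp discussed with the widget.

```python
from math import log, sqrt
from statistics import NormalDist

def bernoulli_loglik(p, k, n):
    """ell(p) = k log p + (n - k) log(1 - p), with 0 * log(0) treated as 0."""
    left = k * log(p) if k > 0 else 0.0
    right = (n - k) * log(1 - p) if k < n else 0.0
    return left + right

def bisect_crossing(f, inside, outside, tol=1e-10):
    """Locate where f switches from >= 0 (at `inside`) to < 0 (at `outside`)."""
    while abs(outside - inside) > tol:
        mid = (inside + outside) / 2
        if f(mid) >= 0:
            inside = mid
        else:
            outside = mid
    return (inside + outside) / 2

def bernoulli_lrt_ci(k, n, conf=0.95):
    """LRT interval for p; assumes 0 < k < n (interior MLE)."""
    z = NormalDist().inv_cdf((1 + conf) / 2)
    p_hat = k / n
    level = bernoulli_loglik(p_hat, k, n) - z * z / 2    # peak minus chi^2 / 2
    gap = lambda p: bernoulli_loglik(p, k, n) - level    # >= 0 exactly inside the CI
    eps = 1e-12
    return bisect_crossing(gap, p_hat, eps), bisect_crossing(gap, p_hat, 1 - eps)

def bernoulli_wald_ci(k, n, conf=0.95):
    z = NormalDist().inv_cdf((1 + conf) / 2)
    p_hat = k / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

lrt = bernoulli_lrt_ci(3, 30)
wald = bernoulli_wald_ci(3, 30)
print(lrt)    # roughly (0.026, 0.239)
print(wald)   # roughly (-0.007, 0.207)
```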

Exponential $\lambda$. The likelihood for $n$ iid samples with sum $T = \sum x_i$ is $L(\lambda) = \lambda^n e^{-\lambda T}$, and

$$\ell(\lambda) \;=\; n \log \lambda - \lambda T, \qquad \hat\lambda = n / T.$$

The LRT statistic at $\lambda$ is

$$2\bigl[\ell(\hat\lambda) - \ell(\lambda)\bigr] \;=\; 2n \bigl[\log(\hat\lambda / \lambda) - (1 - \lambda/\hat\lambda)\bigr].$$

The log-likelihood is RIGHT-SKEWED: for small $n$, the curve drops more slowly above $\hat\lambda$ than below. The LRT CI inherits the skew: it is wider on the right of $\hat\lambda$ than the left. The Wald CI $\hat\lambda \pm z \cdot \hat\lambda / \sqrt{n}$ is symmetric and under-allocates probability to the right tail. For $n = 10, T = 8$: $\hat\lambda = 1.25$. The 95% Wald CI is $(0.475, 2.025)$ (half-width $\approx 0.775$). The 95% LRT CI is roughly $(0.626, 2.192)$, asymmetric (lower half-width $\approx 0.624$, upper $\approx 0.942$) and shifted right. The asymmetry shrinks with $n$ and becomes invisible by $n \approx 200$.
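The $n = 10, T = 8$ numbers can be checked with a short bisection on the LRT statistic itself. This is a sketch with illustrative function names; the outer brackets (a tiny positive rate and 100 times the MLE) are assumptions chosen to be safely outside the interval.

```python
from math import log, sqrt
from statistics import NormalDist

def exp_lrt_stat(lam, n, total):
    """2[ell(lam_hat) - ell(lam)] = 2n[log(lam_hat/lam) - (1 - lam/lam_hat)]."""
    lam_hat = n / total
    return 2 * n * (log(lam_hat / lam) - (1 - lam / lam_hat))

def exp_lrt_ci(n, total, conf=0.95, tol=1e-10):
    cutoff = NormalDist().inv_cdf((1 + conf) / 2) ** 2   # chi^2_{1, conf}
    lam_hat = n / total

    def crossing(inside, outside):
        # bisection: `inside` has statistic <= cutoff, `outside` has statistic > cutoff
        while abs(outside - inside) > tol:
            mid = (inside + outside) / 2
            if exp_lrt_stat(mid, n, total) <= cutoff:
                inside = mid
            else:
                outside = mid
        return (inside + outside) / 2

    return crossing(lam_hat, lam_hat * 1e-9), crossing(lam_hat, lam_hat * 100)

def exp_wald_ci(n, total, conf=0.95):
    z = NormalDist().inv_cdf((1 + conf) / 2)
    lam_hat = n / total
    return lam_hat - z * lam_hat / sqrt(n), lam_hat + z * lam_hat / sqrt(n)

lrt = exp_lrt_ci(10, 8.0)
wald = exp_wald_ci(10, 8.0)
print(lrt)    # roughly (0.626, 2.192)
print(wald)   # roughly (0.475, 2.025)
```

Because the statistic depends on $\lambda$ only through the ratio $\lambda / \hat\lambda$, the interval endpoints scale with $\hat\lambda$: for a given $n$ the same two ratios apply to every data set.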

Normal $\mu$ with known $\sigma$. The likelihood is $L(\mu) \propto \exp\bigl(-\sum(x_i - \mu)^2 / (2\sigma^2)\bigr)$, and

$$\ell(\mu) \;=\; -\dfrac{n(\mu - \bar x)^2}{2\sigma^2} + \text{const}, \qquad \hat\mu = \bar x.$$

The log-likelihood is EXACTLY quadratic in $\mu$. The LRT statistic at $\mu$ is

$$2\bigl[\ell(\hat\mu) - \ell(\mu)\bigr] \;=\; n(\mu - \bar x)^2 / \sigma^2.$$

Setting this to $\chi^2_{1, 1-\alpha} = z_{1-\alpha/2}^2$ gives $(\mu - \bar x)^2 = z^2 \sigma^2 / n$, i.e. $\mu = \bar x \pm z \sigma / \sqrt n$. This is EXACTLY the Wald CI. The LRT CI and the Wald CI COINCIDE for the Normal-mean problem, because Wald is the LRT's parabolic approximation, and the Normal log-likelihood IS its own parabolic approximation. Every textbook proof of the Wald CI for the Normal mean is implicitly the LRT proof too.
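A numerical sanity check of the coincidence: evaluate the LRT statistic at the two Wald endpoints and confirm it equals $z^2$ there. The values of `xbar`, `sigma` and `n` below are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def normal_lrt_stat(mu, xbar, sigma, n):
    """2[ell(mu_hat) - ell(mu)] = n (mu - xbar)^2 / sigma^2 for known sigma."""
    return n * (mu - xbar) ** 2 / sigma ** 2

xbar, sigma, n = 4.2, 1.5, 25            # illustrative values, not from the text
z = NormalDist().inv_cdf(0.975)

wald_lo = xbar - z * sigma / sqrt(n)
wald_hi = xbar + z * sigma / sqrt(n)

# At the Wald endpoints the LRT statistic sits exactly on the chi^2_1 cutoff z^2,
# so the Wald endpoints are also the LRT endpoints.
stat_lo = normal_lrt_stat(wald_lo, xbar, sigma, n)
stat_hi = normal_lrt_stat(wald_hi, xbar, sigma, n)
print(stat_lo, stat_hi, z * z)
```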

The lrt-ci-builder widget

The first widget makes the $\chi^2/2$ cutoff visible by drawing the log-likelihood curve directly. Pick a model (Bernoulli, Exponential, Normal known $\sigma$), choose a sample size $n$ and a true parameter, draw one sample, and look at:

  • The actual log-likelihood $\ell(\theta)$ (solid green) on a 300-point grid through the parameter space.
  • The quadratic / Wald approximation $-\tfrac12 (\theta - \hat\theta)^2 / \widehat{\mathrm{SE}}^2$ (dashed yellow): the second-order Taylor expansion at $\hat\theta$.
  • The horizontal LRT cutoff at $y = -\chi^2_{1, 1-\alpha}/2$ (dashed red).
  • The LRT CI shaded as a green vertical band over the $\theta$-axis: the $\theta$ values where the log-likelihood is ABOVE the cutoff.
  • Two horizontal bars below the plot: Wald (yellow) and LRT (green), side by side for comparison.
  • A summary table with widths, asymmetry, and whether each CI covers the true parameter.

[Interactive figure: lrt-ci-builder widget]

Things to verify in the widget:

  • Start at Bernoulli, n = 30, true p = 0.10. The MLE often lands at $\hat p \approx 0.05$ to $0.15$, close to the boundary. The dashed-yellow parabola sits ABOVE the green log-likelihood near 0 (where the true curve plunges toward $-\infty$) and BELOW it past $\hat p$: the quadratic approximation is BAD at small $p$. The LRT cutoff intersects the green curve at two asymmetric points; the green-shaded LRT CI is wider on the right of $\hat p$ than the left, and the Wald CI sometimes clips at 0 (its lower endpoint goes negative, then we project back to 0). The two CI BARS at the bottom of the plot disagree visibly.
  • Bernoulli, n = 30, true p = 0.50. The MLE lands near 0.5; the log-likelihood is roughly symmetric in $p$ around 0.5 (the parameter space is far from either boundary). The dashed parabola tracks the green curve closely, and the LRT and Wald CIs are visually indistinguishable. Re-roll a few times; both bars stay tight and overlap.
  • Bernoulli, n = 5, k = 0 (re-roll until you hit k = 0, or set true p very small). The MLE is $\hat p = 0$. The Wald CI collapses: $\widehat{\mathrm{SE}} = 0$, so Wald gives $(0, 0)$, a degenerate point. The LRT CI does the right thing: lower endpoint clamps at 0 (the boundary), upper endpoint at $p_U$ where $2(n - k) \log(1/(1-p_U)) = \chi^2$, i.e. $p_U = 1 - \exp(-\chi^2 / (2n))$. For $n = 5, \chi^2 = 3.84$: $p_U \approx 1 - \exp(-0.384) \approx 0.319$. This is the LRT analogue of the $k = 0$ "rule of three" $(0, 3/n)$ from §3.1: LRT gives a SENSIBLE one-sided bound where Wald fails.
  • Switch to Exponential, n = 10. Re-roll a few times. The dashed-yellow parabola pulls above the green log-likelihood at small $\lambda$ and below at large $\lambda$: Wald over-allocates uncertainty to the left and under-allocates to the right. The LRT CI is visibly wider on the right of $\hat\lambda$; the Wald CI is symmetric and shifted left. The asymmetry in the LRT CI bar is the most important visual takeaway of the widget.
  • Switch to Normal known sigma, any n. The dashed parabola overlaps the green curve PERFECTLY — the log-likelihood IS a parabola for this model. The LRT and Wald CI bars are identical. This confirms the special-case equivalence and reassures the reader that LRT is NOT a sleight-of-hand; it agrees with Wald when Wald is exactly right and disagrees in the right direction when Wald is approximate.
  • For Bernoulli, slide n up to 200. The dashed parabola is essentially indistinguishable from the green log-likelihood. The LRT and Wald CIs match to 2-3 decimal places. The LRT advantage SHRINKS as $n \to \infty$; both CIs are first-order accurate asymptotically. The advantage lies in MODERATE-n and BOUNDARY behaviour, not in asymptotic improvement.
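The $k = 0$ case in the list above has a closed form, so it can be checked without any root-finding. A minimal sketch (the function name is hypothetical) that also prints the rule-of-three bound $3/n$ for comparison:

```python
from math import exp
from statistics import NormalDist

def lrt_upper_when_k_is_zero(n, conf=0.95):
    """With k = 0 the LRT statistic is -2n log(1 - p); setting it equal to chi^2
    gives the upper endpoint p_U = 1 - exp(-chi^2 / (2n))."""
    z = NormalDist().inv_cdf((1 + conf) / 2)
    return 1 - exp(-z * z / (2 * n))

for n in (5, 30, 100):
    print(n, round(lrt_upper_when_k_is_zero(n), 3), "rule of three:", round(3 / n, 3))
```

For $n = 5$ this gives about 0.319, matching the hand calculation.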

Profile likelihood: from one parameter to many

The single-parameter LRT generalises to MULTI-PARAMETER models via PROFILE LIKELIHOOD. With parameter vector $\theta = (\theta_1, \theta_2)$ where $\theta_1$ is the parameter of interest and $\theta_2$ is a NUISANCE parameter (or vector of nuisance parameters), the profile log-likelihood is

$$\ell_p(\theta_1) \;=\; \max_{\theta_2} \ell(\theta_1, \theta_2),$$

i.e. for each candidate value of $\theta_1$, REOPTIMISE $\theta_2$ to maximise the joint log-likelihood. The result is a one-dimensional curve over $\theta_1$ that captures the "best-case" behaviour of the model at each $\theta_1$.

The profile LRT CI for $\theta_1$ uses the same $\chi^2/2$ cutoff as the one-parameter case:

$$C^{\mathrm{prof}}_{1-\alpha} \;=\; \bigl\{\theta_1 \;:\; 2[\ell_p(\hat\theta_1) - \ell_p(\theta_1)] \le \chi^2_{1, 1-\alpha}\bigr\}.$$

This works because Wilks's theorem says the $\chi^2$ dimension equals the number of CONSTRAINED parameter directions, NOT the total number of parameters in the model. Profiling out $\theta_2$ leaves $\theta_1$ as the only constraint under $H_0: \theta_1 = \theta_1^*$, so the asymptotic distribution is still $\chi^2_1$. This is sometimes called the "Wilks ratio for profile likelihoods" (Pawitan 2001, In All Likelihood, §3.5).

The classical reference is VENZON AND MOOLGAVKAR (1988, Applied Statistics 37(1), 87–94), who gave an efficient algorithm for locating profile-likelihood interval endpoints. Profile-likelihood intervals of this kind are what confint() returns for a glm() fit in R and what most modern likelihood software offers. In outline, a step-and-reoptimise search for the endpoints runs as follows:

  • Start at $\hat\theta_1$ and a target log-likelihood reduction of $\chi^2/2$.
  • Take a step in $\theta_1$; at each step REOPTIMISE $\theta_2$ via Newton–Raphson on the joint log-likelihood with $\theta_1$ fixed.
  • Continue until the profile log-likelihood drops by exactly $\chi^2/2$ from its maximum.
  • Repeat in the opposite direction for the other endpoint.

Total cost: roughly $2 \times K \times O(\theta_2 \text{ optimisation})$ where $K \approx 20$ to $80$ is the number of bisection / Newton steps per endpoint. For a logistic regression with $p = 10$ coefficients, each inner optimisation is a 9-dimensional Newton–Raphson on the log-likelihood with $\theta_1$ held fixed, which is cheap if you use the full MLE (and its Hessian) as a warm start. The overall cost is a few hundred likelihood evaluations per CI endpoint, which is fine for moderate models and slow for very large ones. In R, the profile-likelihood CIs for a glm() fit are reported by confint(fit).
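To see profiling in action without an inner numerical optimiser, the sketch below uses a model that is an assumption of this example rather than one worked in the text: a Normal sample with $\mu$ of interest and $\sigma^2$ as the nuisance parameter. There the inner maximisation has the closed form $\hat\sigma^2(\mu) = \frac1n \sum (x_i - \mu)^2$, so $\ell_p(\mu) = -\tfrac{n}{2}\log\hat\sigma^2(\mu) + \text{const}$, and the bisection result can be compared with the closed-form endpoints $\bar x \pm s\sqrt{\exp(\chi^2/n) - 1}$. The data values and function names are illustrative; for a general model the body of `profile_loglik_mu` would be replaced by a numerical optimisation over the nuisance parameters.

```python
from math import exp, log, sqrt
from statistics import NormalDist, fmean

def profile_loglik_mu(mu, xs):
    """ell_p(mu): Normal log-likelihood maximised over sigma^2 (constants dropped).
    Inner maximiser in closed form: sigma2_hat(mu) = mean((x - mu)^2)."""
    n = len(xs)
    sigma2_hat = fmean((x - mu) ** 2 for x in xs)
    return -0.5 * n * log(sigma2_hat) - 0.5 * n

def profile_ci_mu(xs, conf=0.95, tol=1e-10):
    """Profile-LRT interval for mu; assumes len(xs) >= 2 and non-identical values."""
    z = NormalDist().inv_cdf((1 + conf) / 2)
    mu_hat = fmean(xs)
    level = profile_loglik_mu(mu_hat, xs) - z * z / 2    # peak minus chi^2 / 2
    spread = max(xs) - min(xs)

    def crossing(inside, outside):
        while abs(outside - inside) > tol:
            mid = (inside + outside) / 2
            if profile_loglik_mu(mid, xs) >= level:
                inside = mid
            else:
                outside = mid
        return (inside + outside) / 2

    return crossing(mu_hat, mu_hat - 10 * spread), crossing(mu_hat, mu_hat + 10 * spread)

xs = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.6]            # made-up data
lo, hi = profile_ci_mu(xs)

# Closed-form endpoints for this particular model, used as a cross-check.
n, m = len(xs), fmean(xs)
s2 = fmean((x - m) ** 2 for x in xs)
half = sqrt(s2 * (exp(NormalDist().inv_cdf(0.975) ** 2 / n) - 1))
print((lo, hi), (m - half, m + half))
```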

Why LRT CIs often beat Wald: three structural reasons

The LRT CI methodology has three structural advantages over the Wald CI, articulated across Cox & Hinkley (1974, Theoretical Statistics, §7.2), Pawitan (2001, §3.4–3.5), and Casella & Berger (2002, §9.2).

(1) No quadratic approximation. The Wald CI assumes $\ell(\theta) \approx \ell(\hat\theta) - \tfrac12 (\theta - \hat\theta)^2 / \widehat{\mathrm{SE}}^2$, the second-order Taylor expansion at the MLE. For finite $n$ this approximation is GOOD when the true log-likelihood is nearly quadratic (Normal mean, regression coefficients far from boundaries, large samples) and POOR otherwise (Bernoulli $p$ near 0 or 1, Exponential $\lambda$ at small $n$, variance components near 0). The LRT CI uses the ACTUAL log-likelihood, no Taylor expansion. When the log-likelihood is exactly quadratic (Normal mean with known $\sigma$) the two coincide; otherwise the LRT CI follows the true shape of the curve.

(2) Asymmetric CIs near boundaries. The Wald CI is forced symmetric $\hat\theta \pm z \cdot \widehat{\mathrm{SE}}$ by construction: both endpoints are equidistant from the MLE. This is wrong when the log-likelihood is asymmetric, e.g. Bernoulli $p$ near the boundary $p \approx 0.05$ where the log-likelihood drops fast to the left (heading to 0) and slow to the right (heading toward 1). The LRT CI follows the actual shape: it can be (and usually is) wider on the side where the log-likelihood drops more slowly. This shows up clearly in the Bernoulli example and in regression coefficients constrained near 0 by the data.

(3) Reparameterization invariance. Suppose $g$ is a smooth monotone reparameterisation $\phi = g(\theta)$. The LRT CI for $\theta$ and the LRT CI for $\phi$ are EQUIVARIANT under $g$: if $[\theta_L, \theta_U]$ is the LRT CI for $\theta$, then $[g(\theta_L), g(\theta_U)]$ is the LRT CI for $\phi$ (with the endpoints swapped if $g$ is decreasing). This is because the log-likelihood is invariant under reparameterisation: $\ell_\phi(\phi) = \ell_\theta(g^{-1}(\phi))$, and the $\chi^2/2$ cutoff is a level on the log-likelihood, not on the parameter axis.

The Wald CI, by contrast, is NOT reparameterization-invariant. The Wald CI for $\theta$ uses $\widehat{\mathrm{SE}}_\theta$; the Wald CI for $\phi$ uses $\widehat{\mathrm{SE}}_\phi = |g'(\hat\theta)| \cdot \widehat{\mathrm{SE}}_\theta$ (the delta-method approximation). Applying $g$ to the endpoints of the Wald CI on $\theta$ gives $[g(\hat\theta - z \cdot \widehat{\mathrm{SE}}_\theta), g(\hat\theta + z \cdot \widehat{\mathrm{SE}}_\theta)]$, which agrees with the Wald CI on $\phi$ only to first order in $z \cdot \widehat{\mathrm{SE}}_\theta$. In finite samples the two differ.

The classical illustration is the odds $\psi = p / (1 - p)$ vs the probability $p$. The Wald CI for $p$ can extend below 0, and so can the Wald CI for $\psi$ even though $\psi > 0$; the Wald CI for $\log \psi$ always back-transforms to positive odds, but it agrees with neither of the other two after back-transformation. The LRT CI for $p$, the LRT CI for $\psi$, and the LRT CI for $\log \psi$ all give the SAME interval after back-transformation. Reparameterization invariance is one of the strongest properties of any CI procedure and one that Wald notably lacks.
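The invariance claim is easy to check numerically. The sketch below swaps in the Exponential example ($n = 10$, $T = 8$) with $\phi = \log \lambda$ rather than the odds example, because its likelihood is a one-liner in both parameterisations; the function names and bracket values are illustrative. The Wald standard error on the $\phi$ scale is $1/\sqrt{n}$, which follows from $\ell_\phi''(\hat\phi) = -n$.

```python
from math import exp, log, sqrt
from statistics import NormalDist

n, total = 10, 8.0                        # Exponential example: n = 10, T = 8
z = NormalDist().inv_cdf(0.975)
lam_hat = n / total

def loglik_lam(lam):
    return n * log(lam) - lam * total

def loglik_phi(phi):                      # same likelihood, written in phi = log(lam)
    return n * phi - exp(phi) * total

def lrt_ci(loglik, mle, lo, hi, tol=1e-12):
    level = loglik(mle) - z * z / 2       # peak minus chi^2 / 2

    def crossing(inside, outside):
        while abs(outside - inside) > tol:
            mid = (inside + outside) / 2
            if loglik(mid) >= level:
                inside = mid
            else:
                outside = mid
        return (inside + outside) / 2

    return crossing(mle, lo), crossing(mle, hi)

lrt_lam = lrt_ci(loglik_lam, lam_hat, 1e-9, 100.0)
lrt_phi = lrt_ci(loglik_phi, log(lam_hat), log(1e-9), log(100.0))
lrt_phi_back = tuple(exp(e) for e in lrt_phi)             # map phi endpoints back to lambda

wald_lam = (lam_hat - z * lam_hat / sqrt(n), lam_hat + z * lam_hat / sqrt(n))
wald_phi_back = (lam_hat * exp(-z / sqrt(n)), lam_hat * exp(z / sqrt(n)))

print("LRT  on lambda:", lrt_lam, " via phi:", lrt_phi_back)     # agree
print("Wald on lambda:", wald_lam, " via phi:", wald_phi_back)   # disagree
```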

Coverage comparison: profile-vs-wald-coverage widget

The second widget moves from "one sample, two CIs side by side" to "across the parameter space, what is the actual COVERAGE of each CI?". For the Binomial $(n, p)$, the exact coverage can be computed by SUMMATION over $k = 0, \ldots, n$, with no Monte Carlo error. This is the same Brown–Cai–DasGupta (2001, Statistical Science) framework that §3.1 used for the Wald-vs-Wilson comparison, applied here to the profile-LRT CI that §3.3 develops.

For each true $p$, the coverage is

$$\mathrm{cov}(p) \;=\; \sum_{k=0}^{n} \mathbb{1}\bigl[p \in C(k)\bigr] \cdot \binom{n}{k} p^k (1-p)^{n-k}.$$

The widget computes this exactly for $p$ on a fine grid and plots the resulting coverage curve. Both Wald and profile-LRT curves are drawn; the Wilson curve (also from §3.1) is available as a toggle for context.
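The summation can be sketched directly. This is not the widget's code; it is a minimal stand-alone version with hypothetical function names, using the unclipped Wald interval (so $k = 0$ and $k = n$ give degenerate intervals) and testing membership $p \in C(k)$ for the LRT interval by checking whether the LRT statistic at $p$ is at most the cutoff, which avoids computing endpoints.

```python
from math import comb, log, sqrt
from statistics import NormalDist

def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)

def lrt_covers(p, k, n, cutoff):
    """True when p lies in the LRT interval built from k successes in n trials."""
    p_hat = k / n
    stat = 2 * (xlogy(k, p_hat) - xlogy(k, p)
                + xlogy(n - k, 1 - p_hat) - xlogy(n - k, 1 - p))
    return stat <= cutoff

def wald_covers(p, k, n, z):
    """True when p lies in the (unclipped) Wald interval p_hat +/- z * SE."""
    p_hat = k / n
    return abs(p - p_hat) <= z * sqrt(p_hat * (1 - p_hat) / n)

def exact_coverage(covers, p, n, *args):
    """cov(p) = sum over k of 1[p in C(k)] * C(n, k) p^k (1 - p)^(n - k)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if covers(p, k, n, *args))

z = NormalDist().inv_cdf(0.975)
for p in (0.05, 0.2, 0.5):
    wald = exact_coverage(wald_covers, p, 20, z)
    lrt = exact_coverage(lrt_covers, p, 20, z * z)
    print(f"p = {p}: Wald {wald:.3f}, LRT {lrt:.3f}")
```

Evaluating `exact_coverage` over a fine grid of $p$ values would trace out coverage curves of the kind the widget plots; the exact numbers depend on conventions such as whether the Wald interval is clipped or adjusted at $k = 0$.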

[Interactive figure: profile-vs-wald-coverage widget]

Things to verify in the widget:

  • Start at n = 20, 95% nominal. Look at the yellow Wald curve: it oscillates wildly between 89% and 99% across the $p$ range, with deep dips below 90% near $p = 0$ and $p = 1$ (the classical sawtooth). The green profile-LRT curve is much smoother and stays close to 95% across most of (0, 1). The summary table confirms: Wald min coverage ~ 85-89%, mean coverage ~ 93%; LRT min ~ 93%, mean ~ 95%.
  • Toggle ON the Wilson (cyan) curve for context. The LRT and Wilson curves often look nearly identical. This is because Wilson is the inversion of the SCORE test, and the score and likelihood-ratio tests are asymptotically equivalent under regularity. They coincide for the binomial to leading order in $1/n$. The point: profile-LRT is a generic recipe that REDISCOVERS the Wilson CI (and its excellent coverage) automatically, without needing to know the score-test machinery.
  • Slide n up to 100. The Wald sawtooth narrows; both curves move closer to 95%. By n = 200 the curves are visually indistinguishable. The LRT advantage is MODERATE-n boundary behaviour, not asymptotic. The widget makes this dependence on $n$ directly visible.
  • Drop n down to 10. The sawtooth is dramatic: Wald drops to 70–85% in places, with two pronounced deep dips, one on each side of $p = 0.5$. LRT stays in the 90–96% range. The verdict column flags Wald as "under-covers". For very small $n$ neither method is great near the boundary (both have unavoidable discreteness), but LRT remains demonstrably better.
  • Drag the point-eval slider to $p = 0.05$. The point-eval table reports exact coverage at that single $p$. Wald typically lands well below nominal (e.g. 87% at n = 20); LRT and Wilson typically within 1-2 percentage points of nominal. This is the boundary regime where Wald is famously bad and LRT is the cure.
  • Switch to 99% nominal. The chi-square cutoff grows to 6.63 (from 3.84) and the LRT CIs widen. Wald CIs widen likewise, but Wald's sawtooth structure is now more severe (more $k$ transitions land below 99%). LRT remains close to nominal.

Regularity and what happens when it fails

Wilks's theorem assumes the true parameter is in the INTERIOR of the parameter space. When that fails (variance component $\sigma^2 = 0$ in a mixed model, or testing $H_0: p = 0$ on the boundary), the asymptotic distribution of the LRT statistic is NOT $\chi^2_1$. Chernoff (1954, Annals of Mathematical Statistics 25(3), 573–578) worked out the correct distribution for the one-sided boundary case: a 50:50 MIXTURE of a point mass at 0 and a $\chi^2_1$. Self & Liang (1987) and Stram & Lee (1994) extended the result to multi-parameter boundary cases.

For the calculus-and-algebra-tutor context, the regularity conditions are typically satisfied: the Bernoulli, Exponential, and Normal parameters in the widget are all interior parameters when the user picks reasonable true values. The widget BOUNDARY-CLAMP code handles the $k = 0$ or $k = n$ Bernoulli edge cases (where the LRT lower / upper endpoint hits the boundary at 0 or 1); those still give SENSIBLE intervals, but the formal asymptotic theory shifts to Chernoff's mixture. In Part 4 regression and Part 8 GLMs, the regularity conditions are the standard ones (Fisher information positive-definite, interior parameter); the LRT CIs are the gold-standard recommendation there.

The reader should remember the conditions as a CAVEAT for boundary-parameter problems (variance components, proportion $p = 0$), not as a problem with LRT CIs in typical use. The chi-square calibration is a large-sample asymptotic; for very small $n$ even the interior-parameter LRT CI has finite-sample coverage error of order $n^{-1}$.
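For the one-sided boundary case, the practical consequence of the 50:50 mixture is a smaller critical value: solving $\tfrac12 P(\chi^2_1 > c) = \alpha$ gives $c = \chi^2_{1, 1-2\alpha} = z_{1-\alpha}^2$, which is about 2.71 at $\alpha = 0.05$ instead of 3.84. A minimal sketch with hypothetical function names:

```python
from statistics import NormalDist

def chi2_1_sf(x):
    """P(chi^2_1 > x) = 2 * (1 - Phi(sqrt(x))) for x >= 0."""
    return 2 * (1 - NormalDist().cdf(x ** 0.5))

def boundary_pvalue(lrt_stat):
    """p-value under the 50:50 mixture of a point mass at 0 and a chi^2_1."""
    return 1.0 if lrt_stat <= 0 else 0.5 * chi2_1_sf(lrt_stat)

def boundary_critical_value(alpha=0.05):
    """Solve 0.5 * P(chi^2_1 > c) = alpha, i.e. c = z_{1 - alpha}^2."""
    return NormalDist().inv_cdf(1 - alpha) ** 2

c = boundary_critical_value(0.05)
print(c, boundary_pvalue(c))   # about 2.706 and 0.05
```

Using the naive $\chi^2_1$ cutoff of 3.84 in this setting makes the test conservative: its actual size under the mixture is about 0.025 rather than 0.05.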

Try it

  • In the lrt-ci-builder, pick Bernoulli, n = 30, true p = 0.10. Re-roll the sample five times. For each, write down the LRT CI endpoints (lower / upper) and the Wald CI endpoints. Compute the LRT CI asymmetry $(\text{upper half-width} - \text{lower half-width}) / \text{total width}$ across the five rolls. Is it consistently positive (right-skewed) and what does that tell you about the log-likelihood?
  • Same widget. Pick Normal known sigma, any n. Re-roll a few times. Verify visually that the dashed-yellow parabola sits ON TOP of the green log-likelihood curve. Confirm: the LRT and Wald CI bars at the bottom are identical to the displayed precision. Argue why this is mathematically inevitable (the Normal log-likelihood is exactly quadratic in $\mu$).
  • Same widget. Pick Bernoulli, n = 5, then re-roll until you get $k = 0$. Look at the resulting CIs. The Wald CI collapses to a point $(0, 0)$. The LRT CI has lower endpoint at 0 (the boundary clamp) and a sensible upper endpoint. Compute $p_U = 1 - \exp(-\chi^2/(2n)) = 1 - \exp(-3.84 / 10) \approx 0.32$. Compare with the widget's reported $p_U$. Note that this is the LRT analogue of the "rule of three" $p_U \approx 3/n$, and that the LRT bound is noticeably tighter: for large $n$ it behaves like $1.92/n$ rather than $3/n$, because it uses the two-sided cutoff $\chi^2/2 \approx 1.92$ where the rule of three uses the one-sided exact cutoff $-\log 0.05 \approx 3.00$.
  • In the profile-vs-wald-coverage, set n = 20, 95% nominal. Toggle the Wilson curve ON. Compare the LRT (green) and Wilson (cyan) curves across the $p$ range. Argue why they look nearly identical: the score test (which Wilson inverts) and the likelihood-ratio test are asymptotically equivalent for the binomial. They differ at second order in $1/n$, which is invisible to the eye.
  • Same widget. Slide n from 10 up to 200 (the available discrete steps). Watch the Wald curve's sawtooth shrink and the worst-case under-coverage move toward nominal. Argue: this is the asymptotic regularity of Wald — eventually the Normal approximation becomes good enough — and it is the reason Wald is "fine for large samples" despite being bad in moderate ones.
  • Pen-and-paper. For the Exponential model with $n = 10, T = 8$, write the LRT equation $2n[\log(\hat\lambda/\lambda) - (1 - \lambda/\hat\lambda)] = 3.84$ with $\hat\lambda = 1.25$. Try $\lambda = 0.6$: the left side is $20 \cdot [\log(2.083) - (1 - 0.48)] = 20 \cdot [0.734 - 0.52] = 20 \cdot 0.214 = 4.28$, too high, so $\lambda = 0.6$ is outside the CI. Try $\lambda = 0.65$: $20 \cdot [\log(1.923) - 0.48] = 20 \cdot [0.654 - 0.48] = 3.48$, below the cutoff, so $\lambda = 0.65$ is inside. The lower endpoint is between 0.6 and 0.65; bisecting gives $\lambda_L \approx 0.63$. Compare with the Wald lower endpoint $1.25 - 1.96 \cdot 1.25/\sqrt{10} \approx 0.475$. Argue: the Wald lower endpoint is too low because the parabola ignores how steeply the log-likelihood falls off on the left.
  • Pen-and-paper. State Wilks's theorem and explain why it gives $\chi^2_1$ for a one-parameter test even when the model has many parameters (with the rest profiled out). Hint: the dimension is the number of CONSTRAINED directions in parameter space under $H_0$, not the total dimension of the parameter vector. Cite Venzon & Moolgavkar (1988) for an algorithm that computes the endpoints.
  • Pen-and-paper. Consider the reparameterisation $\phi = \log(\lambda)$ for the Exponential model. State why the LRT CI for $\lambda$ at $(\lambda_L, \lambda_U)$ corresponds to the LRT CI for $\phi$ at $(\log \lambda_L, \log \lambda_U)$ exactly, with no first-order delta-method approximation. Then argue why the Wald CI for $\lambda$ does NOT back-transform exactly to the Wald CI for $\phi$. This is the reparameterization-invariance advantage.

Pause and reflect: §3.3 has set out the third major CI methodology, after Wald (§3.1) and bootstrap (§3.2). The LRT confidence interval is the set of $\theta$ values where the log-likelihood is within $\chi^2_{1, 1-\alpha}/2$ of its peak; for 95% nominal, that's 1.92 log-likelihood units. The construction inverts the test calibrated by Wilks's 1938 theorem: the LRT statistic $2[\ell(\hat\theta) - \ell(\theta_0)]$ is asymptotically $\chi^2_1$ under $H_0: \theta = \theta_0$, and inverting the test gives the CI. For multi-parameter models, the PROFILE log-likelihood $\ell_p(\theta_1) = \max_{\theta_2} \ell(\theta_1, \theta_2)$ generalises the construction. LRT CIs are NON-SYMMETRIC, REPARAMETERIZATION-INVARIANT, and follow the actual log-likelihood shape: the three structural advantages over Wald. The cost is numerical: root-finding for the endpoints, with Venzon & Moolgavkar (1988) as the classical algorithmic reference. §3.4 picks up with PREDICTION INTERVALS, the uncertainty-about-a-future-observation analogue of the CI.

What you now know

You can state the LIKELIHOOD-RATIO STATISTIC $\Lambda(\theta) = L(\hat\theta)/L(\theta)$ and the log-form $2[\ell(\hat\theta) - \ell(\theta)]$, and you can recognise it as the test statistic for $H_0: \theta = \theta_0$. You can state WILKS'S THEOREM (Wilks 1938): under regularity, the LRT statistic is asymptotically $\chi^2_q$ where $q$ is the number of constrained parameters. You can invert the test to get the LRT confidence interval $C^{\mathrm{LRT}} = \{\theta : \ell(\theta) \ge \ell(\hat\theta) - \chi^2_{1, 1-\alpha}/2\}$.

You can derive the LRT CI for three canonical models: the Bernoulli $p$ (numerical bisection on $2k\log(\hat p / p) + 2(n-k)\log((1-\hat p)/(1-p)) = \chi^2_{1, 1-\alpha}$), the Exponential $\lambda$ (bisection on $2n[\log(\hat\lambda/\lambda) - (1 - \lambda/\hat\lambda)] = \chi^2_{1, 1-\alpha}$), and the Normal $\mu$ with known $\sigma$ (closed form, identical to Wald: $\bar x \pm z_{1-\alpha/2}\, \sigma/\sqrt n$). You know the Normal case is the special-case equivalence between Wald and LRT that confirms the quadratic-approximation interpretation.
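The Bernoulli bisection can be sketched in a few lines. The data ($k = 3$ successes in $n = 20$ trials) are hypothetical, chosen so that the Wald interval spills below zero; the helper names and the `eps` bracketing offset are assumptions made for this illustration.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi^2 with 1 df
Z_975 = 1.959963984540054      # 0.975 quantile of the standard normal


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bern_lrt_stat(p, k, n):
    """2[ell(p_hat) - ell(p)] for k successes in n Bernoulli trials, 0 < p < 1."""
    p_hat = k / n
    return 2.0 * (xlogy(k, p_hat / p) + xlogy(n - k, (1 - p_hat) / (1 - p)))


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def bern_lrt_ci(k, n, cutoff=CHI2_1_95, eps=1e-12):
    """LRT interval for p; an endpoint sits on the boundary when k is 0 or n."""
    p_hat = k / n

    def g(p):
        return bern_lrt_stat(p, k, n) - cutoff

    lower = 0.0 if k == 0 else bisect(g, eps, min(p_hat, 1.0 - eps))
    upper = 1.0 if k == n else bisect(g, max(p_hat, eps), 1.0 - eps)
    return lower, upper


k, n = 3, 20  # hypothetical data
p_hat = k / n
lrt_lo, lrt_hi = bern_lrt_ci(k, n)
half_width = Z_975 * math.sqrt(p_hat * (1 - p_hat) / n)
wald_lo, wald_hi = p_hat - half_width, p_hat + half_width

print(f"LRT  CI: ({lrt_lo:.4f}, {lrt_hi:.4f})")
print(f"Wald CI: ({wald_lo:.4f}, {wald_hi:.4f})")
```

With these numbers the Wald lower endpoint is negative, which is not a valid probability, while the LRT interval stays inside $(0, 1)$ and is longer on the right of $\hat p$ than on the left.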

You can state PROFILE LIKELIHOOD as the multi-parameter generalisation: $\ell_p(\theta_1) = \max_{\theta_2} \ell(\theta_1, \theta_2)$. You know the Venzon & Moolgavkar (1988) modified Newton–Raphson algorithm is the standard reference for computing the interval endpoints, and that calling confint() on a glm fit in R reports profile-likelihood intervals.
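As a small illustration of profiling, consider a Normal sample with both $\mu$ and $\sigma^2$ unknown, where $\mu$ is the parameter of interest. The inner maximisation has a closed form, $\hat\sigma^2(\mu) = \frac{1}{n}\sum_i (x_i - \mu)^2$, so the profile LRT statistic is $n \log\big(\hat\sigma^2(\mu)/\hat\sigma^2(\bar x)\big)$. The sketch below uses plain bisection on that statistic (not the Venzon & Moolgavkar iteration) and checks the result against the closed-form endpoints $\bar x \pm \hat\sigma \sqrt{e^{\chi^2_{1,1-\alpha}/n} - 1}$. The data and function names are hypothetical.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi^2 with 1 df


def profile_loglik_mu(mu, xs):
    """ell_p(mu): Normal log-likelihood maximised over sigma^2 (constants dropped)."""
    n = len(xs)
    sigma2_hat = sum((x - mu) ** 2 for x in xs) / n  # inner maximiser, closed form
    return -0.5 * n * math.log(sigma2_hat) - 0.5 * n


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def profile_lrt_ci(xs, cutoff=CHI2_1_95):
    """Profile LRT interval for mu by bisection on each side of x_bar."""
    x_bar = sum(xs) / len(xs)
    peak = profile_loglik_mu(x_bar, xs)

    def g(mu):
        return 2.0 * (peak - profile_loglik_mu(mu, xs)) - cutoff

    step = 1.0
    while g(x_bar - step) < 0 or g(x_bar + step) < 0:
        step *= 2.0  # widen until the cutoff is crossed on both sides
    return bisect(g, x_bar - step, x_bar), bisect(g, x_bar, x_bar + step)


xs = [4.1, 5.3, 4.8, 6.0, 5.5, 4.4, 5.1, 5.9]  # hypothetical sample
n = len(xs)
x_bar = sum(xs) / n
sigma_hat = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
half_width = sigma_hat * math.sqrt(math.exp(CHI2_1_95 / n) - 1.0)
closed_form = (x_bar - half_width, x_bar + half_width)
ci = profile_lrt_ci(xs)

print("bisection  :", ci)
print("closed form:", closed_form)
```

In models without a closed-form inner maximiser, the call inside `profile_loglik_mu` would be replaced by a numerical optimisation over the nuisance parameters at each trial value of the parameter of interest.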

You can articulate the THREE structural advantages of LRT over Wald: (1) no quadratic / Gaussian approximation needed; (2) automatically asymmetric for asymmetric log-likelihoods, especially near boundaries; (3) reparameterization invariance: the LRT CI for $\theta$ back-transforms cleanly to the LRT CI for $g(\theta)$ under any smooth monotone $g$.
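Advantage (3) can be checked numerically on the Exponential example with $n = 10$, $T = 8$ and $\phi = \log\lambda$. The sketch below computes the LRT interval once on the $\lambda$ scale and once on the $\phi$ scale, then does the same for Wald, using $\mathrm{se}(\hat\lambda) = \hat\lambda/\sqrt n$ and $\mathrm{se}(\hat\phi) = 1/\sqrt n$. Function names and bracket limits are assumptions for illustration.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi^2 with 1 df
Z_975 = 1.959963984540054      # 0.975 quantile of the standard normal

n, total = 10, 8.0
lam_hat = n / total
phi_hat = math.log(lam_hat)


def loglik_lam(lam):
    return n * math.log(lam) - lam * total


def loglik_phi(phi):
    return loglik_lam(math.exp(phi))  # same likelihood, relabelled


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def lrt_ci(loglik, mle, lo, hi, cutoff=CHI2_1_95):
    """Endpoints where 2[loglik(mle) - loglik(t)] crosses the cutoff."""
    def g(t):
        return 2.0 * (loglik(mle) - loglik(t)) - cutoff
    return bisect(g, lo, mle), bisect(g, mle, hi)


# LRT on each scale (brackets chosen wide enough for this example).
lam_lo, lam_hi = lrt_ci(loglik_lam, lam_hat, 1e-6, 25.0)
phi_lo, phi_hi = lrt_ci(loglik_phi, phi_hat, math.log(1e-6), math.log(25.0))

# Wald on each scale.
wald_lam = (lam_hat - Z_975 * lam_hat / math.sqrt(n),
            lam_hat + Z_975 * lam_hat / math.sqrt(n))
wald_phi = (phi_hat - Z_975 / math.sqrt(n), phi_hat + Z_975 / math.sqrt(n))

print("LRT  lambda CI      :", (lam_lo, lam_hi))
print("LRT  exp(phi CI)    :", (math.exp(phi_lo), math.exp(phi_hi)))
print("Wald lambda CI      :", wald_lam)
print("Wald exp(phi CI)    :", (math.exp(wald_phi[0]), math.exp(wald_phi[1])))
```

The two LRT lines agree up to the bisection tolerance, because the cutoff is applied to the same likelihood values under either labelling. The two Wald lines differ, because each is built from a different quadratic approximation.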

You can use the lrt-ci-builder widget to draw the log-likelihood curve, the $\chi^2/2$ cutoff, and both the LRT and Wald CIs as horizontal bars, seeing them coincide for the Normal mean and diverge for the Bernoulli near boundaries. You can use the profile-vs-wald-coverage widget to compare exact coverage curves: Wald's sawtooth oscillation near $p = 0, 1$ vs the smoother, closer-to-nominal LRT curve.
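The exact coverage curves that the second widget draws can be reproduced offline. For a fixed $n$ and true $p$, the coverage of a CI method is the total Binomial probability of the outcomes $k$ whose interval contains $p$. The sketch below is an independent illustration (it is not the widget's code); $n = 20$ and the grid of $p$ values are arbitrary choices.

```python
import math

CHI2_1_95 = 3.841458820694124  # 0.95 quantile of chi^2 with 1 df
Z_975 = 1.959963984540054      # 0.975 quantile of the standard normal


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lrt_covers(p, k, n):
    """True if p lies in the 95% LRT interval for k successes out of n."""
    p_hat = k / n
    stat = 2.0 * (xlogy(k, p_hat / p) + xlogy(n - k, (1 - p_hat) / (1 - p)))
    return stat <= CHI2_1_95


def wald_covers(p, k, n):
    """True if p lies in the 95% Wald interval for k successes out of n."""
    p_hat = k / n
    half_width = Z_975 * math.sqrt(p_hat * (1 - p_hat) / n)
    return abs(p_hat - p) <= half_width


def exact_coverage(covers, p, n):
    """Sum of Binomial(n, p) probabilities over outcomes whose CI contains p."""
    return sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if covers(p, k, n)
    )


n = 20
for p in (0.02, 0.05, 0.10, 0.30, 0.50):
    lrt_cov = exact_coverage(lrt_covers, p, n)
    wald_cov = exact_coverage(wald_covers, p, n)
    print(f"p = {p:.2f}  LRT coverage = {lrt_cov:.3f}  Wald coverage = {wald_cov:.3f}")
```

At $p = 0.05$ with $n = 20$, Wald coverage falls far below the nominal 95%, largely because $k = 0$ gives a zero-width interval, while the LRT interval covers in that case.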

You know the REGULARITY CONDITIONS for Wilks's theorem (interior parameter, smooth log-likelihood, identifiability, positive-definite Fisher information), and you know the Chernoff (1954) 50:50-mixture result for boundary parameters (variance components, $H_0: p = 0$).
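To see what the boundary correction does to a p-value, here is a minimal sketch for the single-parameter case, assuming the null distribution is an equal mixture of a point mass at zero and $\chi^2_1$. The function names are illustrative, and the $\chi^2_1$ tail is computed from the identity $\chi^2_1 = Z^2$ using only the standard library.

```python
import math


def chi2_1_sf(x):
    """P(chi^2_1 > x), using chi^2_1 = Z^2 for a standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lrt_pvalue(stat):
    """p-value under the 50:50 mixture of a point mass at 0 and chi^2_1."""
    if stat <= 0.0:
        return 1.0
    return 0.5 * chi2_1_sf(stat)


stat = 2.7055434540954  # 0.90 quantile of chi^2_1
naive_p = chi2_1_sf(stat)
mixture_p = boundary_lrt_pvalue(stat)
print(f"naive chi^2_1 p-value: {naive_p:.4f}")
print(f"50:50 mixture p-value: {mixture_p:.4f}")
```

For a positive statistic the mixture p-value is half the naive one, so the 5% critical value on the boundary is about 2.71 rather than 3.84.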

Where this lands in the rest of Part 3 and the textbook. §3.4 distinguishes PREDICTION INTERVALS (uncertainty about a future observation) from CONFIDENCE INTERVALS (uncertainty about a parameter); the LRT machinery extends but the chi-square dimension changes. §3.5 takes calibration seriously: empirical-coverage studies for LRT, Wald, and bootstrap CIs side by side. §3.6 closes Part 3 on the communication side. Part 4 (regression) uses PROFILE-LIKELIHOOD CIs as the DEFAULT for generalised linear models: in R, calling confint() on a glm() fit reports them, and they are the gold standard for coefficient inference in logistic regression, Poisson regression, and the full GLM family. The chi-square inversion you just learned is the same one used there.

References

  • Wilks, S.S. (1938). "The large-sample distribution of the likelihood ratio for testing composite hypotheses." Annals of Mathematical Statistics 9(1), 60–62. (The foundational paper. Proves the $\chi^2_q$ asymptotic distribution of the log-likelihood-ratio statistic under regularity, where $q$ is the number of constrained parameters.)
  • Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Section 7.2 develops the LRT-based CI and the $\chi^2_1$ calibration via Wilks 1938. The standard intermediate-graduate reference.)
  • Venzon, D.J., Moolgavkar, S.H. (1988). "A method for computing profile-likelihood-based confidence intervals." Applied Statistics 37(1), 87–94. (The standard ALGORITHM for profile-likelihood CIs: a modified Newton–Raphson iteration that solves a system of equations for the interval endpoint together with the nuisance parameters, instead of re-maximising at every trial value.)
  • Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. (Comprehensive treatment of likelihood-based inference. Chapter 3 on the LRT CI; §3.4–3.5 on profile likelihood; §10 on the multi-parameter Wilks ratio.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (§9.7 on the LRT and §10.6 on likelihood-based CIs; readable introductory treatment.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Section 9.2.1 develops the LRT-based CI as the inversion of the LRT, with worked examples for the Normal mean, the binomial proportion, and the exponential rate. The standard graduate-textbook treatment.)
  • Chernoff, H. (1954). "On the distribution of the likelihood ratio." Annals of Mathematical Statistics 25(3), 573–578. (The boundary-parameter correction: when the true parameter is on the boundary of the parameter space, the LRT statistic has a 50:50 mixture asymptotic distribution rather than $\chi^2_1$.)
  • Self, S.G., Liang, K.-Y. (1987). "Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions." JASA 82(398), 605–610. (Extension of Chernoff 1954 to multi-parameter boundary cases — the modern reference for variance-component testing in mixed models.)
  • Brown, L.D., Cai, T.T., DasGupta, A. (2001). "Interval estimation for a binomial proportion." Statistical Science 16(2), 101–117. (The exact-coverage comparison framework. §3.1 covered the Wald-vs-Wilson-vs-Clopper-Pearson verdict; §3.3 extends to the LRT method that this paper previewed.)
