Profile-likelihood and likelihood-ratio CIs

Part 3 — Confidence intervals and uncertainty

Learning objectives

State the LIKELIHOOD-RATIO statistic Lambda(theta) = L(theta_hat) / L(theta) and its log-form 2[ell(theta_hat) - ell(theta)], and recognise it as the test statistic for H_0: theta = theta_0 vs the unrestricted alternative
State WILKS'S THEOREM (1938, Annals of Mathematical Statistics): under regularity (interior parameter, smooth likelihood, identifiability), 2[ell(theta_hat) - ell(theta_0)] converges in distribution to chi^2_q under H_0, where q is the number of parameters being tested
Derive the LRT-based CI by inverting the test: C^LRT_{1-alpha} = { theta : 2[ell(theta_hat) - ell(theta)] <= chi^2_{1, 1-alpha} } = { theta : ell(theta) >= ell(theta_hat) - chi^2_{1, 1-alpha}/2 }. State the chi^2/2 cutoff (= 1.92 for 95% nominal, = 2.71/2 = 1.35 for 90%, = 6.63/2 = 3.32 for 99%)
Work the BERNOULLI p case: ell(p) = k log p + (n-k) log(1-p), p_hat = k/n. The LRT statistic at p is 2[k log(p_hat/p) + (n-k) log((1-p_hat)/(1-p))]. CI by numerical root-finding (bisection): the two p values where the LRT statistic crosses chi^2_{1, 1-alpha} (equivalently, where ell(p) has dropped chi^2/2 below its peak)
Work the EXPONENTIAL lambda case: ell(lambda) = n log lambda - lambda T (T = sum x_i), lambda_hat = n/T. The log-likelihood is asymmetric (right-skewed), so the LRT CI is wider on the right of lambda_hat than the left — exactly the asymmetry that Wald cannot capture
Work the NORMAL mu (known sigma) case: ell(mu) = -n(mu - x_bar)^2/(2 sigma^2). The log-likelihood is exactly quadratic, so the LRT and Wald CIs COINCIDE: x_bar +/- z_{1-alpha/2} * sigma/sqrt(n). This is the special-case equivalence that confirms Wald is the LRT's quadratic approximation
Define PROFILE LIKELIHOOD for multi-parameter models: with parameter vector theta = (theta_1, theta_2) where theta_1 is the parameter of interest and theta_2 is nuisance, the profile log-likelihood is ell_p(theta_1) = max_{theta_2} ell(theta_1, theta_2). The profile LRT CI is { theta_1 : 2[ell_p(theta_1_hat) - ell_p(theta_1)] <= chi^2_{1, 1-alpha} }
Recognise that the chi^2 distribution of the profile LRT statistic for a single parameter (with nuisance parameters profiled) is STILL chi^2_1 (Wilks 1938, Chernoff 1954): the dimension of the chi^2 equals the number of CONSTRAINED parameters under H_0, NOT the total number of parameters in the model
List the THREE reasons LRT CIs are PREFERRED over Wald CIs in finite samples: (1) no Gaussian / quadratic approximation needed — works for any log-concave likelihood; (2) automatically asymmetric when the log-likelihood is asymmetric — correctly allocates uncertainty near boundaries; (3) REPARAMETERIZATION INVARIANCE — the LRT CI for theta and the LRT CI for g(theta) commute via the monotone map g, while Wald CIs do not (a transformed Wald CI differs from the Wald CI of the transformation, by the delta-method approximation error)
Quantify the COMPUTATIONAL COST: LRT CIs require numerical root-finding (typically bisection) for the lower and upper endpoints because the equation 2[ell(theta_hat) - ell(theta)] = chi^2_{1, 1-alpha} has no closed form for most models. In one-parameter problems this is one bisection per endpoint, ~30-80 likelihood evaluations. In profile-likelihood problems the inner max_{theta_2} is itself a sub-optimisation per evaluation of ell_p, so the total cost is roughly (number of bisection iterations) x (cost of one profile-likelihood evaluation)
State the REGULARITY CONDITIONS for Wilks's theorem: (i) the true theta is in the INTERIOR of the parameter space; (ii) the likelihood is identifiable; (iii) the third-derivative-bounded smoothness assumption; (iv) the Fisher-information matrix is positive-definite. When (i) fails (boundary parameter, e.g. variance component sigma^2 = 0), the chi^2_1 calibration is wrong — the asymptotic distribution becomes a 50:50 mixture of a point mass at 0 and a chi^2_1 (Chernoff 1954, Self & Liang 1987, Stram & Lee 1994)
Recognise that PROFILE-LIKELIHOOD CIs are what confint() returns for glm fits in R (the confint.glm method) and are available in many maximum-likelihood packages; in Python, statsmodels' GLM results report Wald intervals from .conf_int(), so a profile interval has to be computed separately. Wald CIs are typically printed by .summary() for ease but the profile-LRT CIs are the gold standard for regression model inference

Section §3.1 set out the asymptotic-vs-exact dichotomy: Wald and Wilson CIs use the asymptotic Normal pivot, while Clopper–Pearson and Garwood invert exact discrete tests. Section §3.2 worked through the empirical, resampling-based bootstrap alternative — percentile, basic, BCa, bootstrap-t — that gives a CI for any functional of the empirical CDF without a parametric model. §3.3 turns to the THIRD major confidence-interval methodology: the PROFILE-LIKELIHOOD CI, also known as the LRT-based CI or the likelihood-ratio CI. This is the parametric, likelihood-based alternative that owns the middle ground between the rigid Wald form and the assumption-light bootstrap.

The starting point is the LIKELIHOOD-RATIO STATISTIC. For a parametric model with likelihood $L(\theta \mid X)$ , the LRT statistic at any candidate $\theta$ is

\Lambda(\theta) \;=\; \dfrac{L(\hat\theta)}{L(\theta)}, \qquad \text{or in log form:} \qquad 2\bigl[\ell(\hat\theta) - \ell(\theta)\bigr],

where $\hat\theta$ is the MLE and $\ell = \log L$ . The statistic is non-negative (the MLE maximises $\ell$ ), zero at $\theta = \hat\theta$ , and grows as $\theta$ moves away from the MLE. Large values are evidence AGAINST $H_0: \theta = \theta_0$ . The classical Neyman–Pearson LRT rejects $H_0$ when $\Lambda(\theta_0)$ exceeds a threshold. The LRT-based confidence interval INVERTS that test: it is the set of $\theta$ values that the LRT would NOT reject at level $\alpha$ .

The arc has nine stops. First, the LRT statistic and Wilks's theorem — the asymptotic $\chi^2_1$ calibration that makes the inversion work. Second, the $\chi^2/2$ cutoff in log-likelihood units and the inversion that defines the CI. Third, three worked examples: Bernoulli $p$ , Exponential $\lambda$ , Normal $\mu$ with known $\sigma$ . Fourth, the lrt-ci-builder widget that draws the log-likelihood curve and shows the LRT CI as the region above the cutoff. Fifth, PROFILE LIKELIHOOD for multi-parameter models — Venzon and Moolgavkar (1988) and Pawitan (2001). Sixth, the THREE structural advantages of LRT over Wald: no quadratic approximation, asymmetric CIs near boundaries, and reparameterization invariance. Seventh, the profile-vs-wald-coverage widget that compares exact coverage curves for the binomial. Eighth, the regularity conditions and what happens when they fail (boundary parameters, the Chernoff 1954 50:50 mixture). Ninth, the practical recommendation and the connection to Part 4 regression.

Wilks's theorem and the $\chi^2/2$ cutoff

The keystone result is WILKS'S THEOREM (Wilks 1938, Annals of Mathematical Statistics 9(1), 60–62). Under regularity — interior parameter, smooth likelihood, identifiability, positive-definite Fisher information — the log-likelihood-ratio statistic at the true parameter is asymptotically $\chi^2$ :

2\bigl[\ell(\hat\theta) - \ell(\theta)\bigr] \;\xrightarrow{d}\; \chi^2_q \qquad \text{as}\; n \to \infty,

where $q$ is the number of parameters being constrained under $H_0$ . For a one-parameter test ( $\theta$ is a scalar), $q = 1$ . For a multi-parameter test (e.g. testing $H_0: \beta_1 = \beta_2 = 0$ in a regression), $q = 2$ . The dimension is the number of FREE-vs-CONSTRAINED parameter directions.

For a CI on a single parameter, $q = 1$ and the cutoff is $\chi^2_{1, 1-\alpha}$ . Three values matter:

90% confidence: $\chi^2_{1, 0.90} \approx 2.706$ , so $\chi^2/2 \approx 1.353$ .
95% confidence: $\chi^2_{1, 0.95} \approx 3.841$ , so $\chi^2/2 \approx 1.921$ .
99% confidence: $\chi^2_{1, 0.99} \approx 6.635$ , so $\chi^2/2 \approx 3.317$ .

The relationship $\chi^2_{1, 1-\alpha} = z_{1-\alpha/2}^2$ (where $z$ is the standard-Normal quantile) is exact: 95% Wald uses $z = 1.96$ and 95% LRT uses $\chi^2 = 1.96^2 = 3.8416$ . They are the same cutoff written in two different scales — squared standard-Normal vs $\chi^2_1$ . The shift to half-cutoff in log-likelihood units is because the LRT is $2[\ell(\hat\theta) - \ell(\theta)]$ , with the factor of 2 absorbed into the $\chi^2$ calibration.
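The three cutoffs can be reproduced from the identity $\chi^2_{1, 1-\alpha} = z_{1-\alpha/2}^2$ using nothing but the standard library. The sketch below is illustrative only; the helper name chi2_1_quantile is made up for this example.

```python
from statistics import NormalDist


def chi2_1_quantile(conf_level: float) -> float:
    """chi^2_1 quantile at conf_level, computed as the squared Normal quantile."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


for level in (0.90, 0.95, 0.99):
    cutoff = chi2_1_quantile(level)
    # cutoff is on the LRT-statistic scale; cutoff / 2 is the drop in log-likelihood
    print(f"{level:.0%}: chi2 = {cutoff:.3f}, chi2/2 = {cutoff / 2:.3f}")
```

The same helper works for any confidence level, which is convenient when a table of $\chi^2$ quantiles is not at hand.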

The LRT-based confidence interval is the inversion:

\boxed{\;C^{\mathrm{LRT}}_{1-\alpha} \;=\; \bigl\{\theta \;:\; 2[\ell(\hat\theta) - \ell(\theta)] \le \chi^2_{1, 1-\alpha}\bigr\} \;=\; \bigl\{\theta \;:\; \ell(\theta) \ge \ell(\hat\theta) - \chi^2_{1, 1-\alpha}/2\bigr\}\;}

Read the second form geometrically: draw the log-likelihood curve $\ell(\theta)$ . Mark the peak at $\ell(\hat\theta)$ . Drop a horizontal line at height $\ell(\hat\theta) - \chi^2/2$ . The set of $\theta$ where the log-likelihood lies ABOVE that line is the LRT confidence interval. For a well-behaved unimodal log-likelihood the set is a single connected interval $[\theta_L, \theta_U]$ ; the endpoints are the two solutions of the equation $\ell(\theta) = \ell(\hat\theta) - \chi^2/2$ .

Three worked examples: Bernoulli, Exponential, Normal

The LRT CI shines on examples where the log-likelihood is clearly non-quadratic, and reduces to the Wald CI when the log-likelihood is exactly quadratic. The three canonical one-parameter models cover both regimes.

Bernoulli $p$ . The likelihood for $k$ successes in $n$ trials is $L(p) = p^k (1-p)^{n-k}$ , and

\ell(p) \;=\; k \log p + (n - k) \log(1 - p), \qquad \hat p = k/n.

The LRT statistic at any $p$ is

2\bigl[\ell(\hat p) - \ell(p)\bigr] \;=\; 2k \log\!\bigl(\hat p / p\bigr) \;+\; 2(n - k) \log\!\bigl((1 - \hat p) / (1 - p)\bigr).

Set this equal to $\chi^2_{1, 1-\alpha}$ and solve numerically (bisection) for the two roots $p_L, p_U$. For $k = 3, n = 30, \alpha = 0.05$: $\hat p = 0.1$, $\chi^2 = 3.84$. Bisecting yields roughly $p_L \approx 0.026$ and $p_U \approx 0.239$, visibly ASYMMETRIC about $\hat p$ (the upper half-width $\approx 0.139$, the lower $\approx 0.074$). The corresponding 95% Wald CI is $0.1 \pm 1.96\sqrt{0.1 \cdot 0.9 / 30} = 0.1 \pm 0.107 = (-0.007, 0.207)$, which escapes 0 on the lower side (needs clipping) and is symmetric. The LRT CI is the cleaner answer.
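The bisection for the Bernoulli endpoints can be sketched in a few lines. This is a minimal illustration, not a library routine: the function names are hypothetical, the $k = 0$ and $k = n$ cases are handled by clamping the corresponding endpoint to the boundary, and the bracket $[10^{-12}, 1 - 10^{-12}]$ is an assumption of the sketch.

```python
import math
from statistics import NormalDist


def bernoulli_lrt_stat(p, k, n):
    """2[l(p_hat) - l(p)] for k successes in n trials, taking 0*log(0) as 0."""
    p_hat = k / n
    stat = 0.0
    if k > 0:
        stat += 2 * k * math.log(p_hat / p)
    if k < n:
        stat += 2 * (n - k) * math.log((1 - p_hat) / (1 - p))
    return stat


def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bernoulli_lrt_ci(k, n, conf_level=0.95):
    """Invert the LRT: the p values whose statistic stays below the chi^2_1 cutoff."""
    cutoff = NormalDist().inv_cdf(0.5 + conf_level / 2) ** 2
    p_hat = k / n
    g = lambda p: bernoulli_lrt_stat(p, k, n) - cutoff
    eps = 1e-12
    lower = 0.0 if k == 0 else bisect_root(g, eps, p_hat)
    upper = 1.0 if k == n else bisect_root(g, p_hat, 1 - eps)
    return lower, upper


lower, upper = bernoulli_lrt_ci(3, 30)
print(f"95% LRT CI for k=3, n=30: ({lower:.3f}, {upper:.3f})")
```

Each endpoint is bracketed between the MLE (where the statistic is 0) and a point next to the boundary (where it is very large), so the sign change that bisection needs is guaranteed.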

Exponential $\lambda$ . The likelihood for $n$ iid samples with sum $T = \sum x_i$ is $L(\lambda) = \lambda^n e^{-\lambda T}$ , and

\ell(\lambda) \;=\; n \log \lambda - \lambda T, \qquad \hat\lambda = n / T.

The LRT statistic at $\lambda$ is

2\bigl[\ell(\hat\lambda) - \ell(\lambda)\bigr] \;=\; 2n \bigl[\log(\hat\lambda / \lambda) - (1 - \lambda/\hat\lambda)\bigr].

The log-likelihood is RIGHT-SKEWED: for small $n$, the curve drops more slowly above $\hat\lambda$ than below. The LRT CI inherits the skew: it is wider on the right of $\hat\lambda$ than the left. The Wald CI $\hat\lambda \pm z \cdot \hat\lambda / \sqrt{n}$ is symmetric and under-allocates width to the right side. For $n = 10, T = 8$: $\hat\lambda = 1.25$. The 95% Wald CI is $(0.475, 2.025)$ (half-width $\approx 0.775$). The 95% LRT CI is roughly $(0.626, 2.192)$, asymmetric and shifted right. The asymmetry shrinks with $n$ and becomes invisible by $n \approx 200$.
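The Exponential endpoints come from the same bisection, applied to the statistic $2n[\log(\hat\lambda/\lambda) - (1 - \lambda/\hat\lambda)]$. The sketch below computes the LRT and Wald intervals side by side; the function names and the search brackets (a factor of $10^{-6}$ below and 50 above the MLE) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def exp_lrt_stat(lam, n, total):
    """2[l(lam_hat) - l(lam)] for n Exponential observations summing to total."""
    lam_hat = n / total
    return 2 * n * (math.log(lam_hat / lam) - (1 - lam / lam_hat))


def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exp_intervals(n, total, conf_level=0.95):
    """Return (lrt_ci, wald_ci) for the Exponential rate."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2)
    cutoff = z * z
    lam_hat = n / total
    g = lambda lam: exp_lrt_stat(lam, n, total) - cutoff
    lrt = (bisect_root(g, lam_hat * 1e-6, lam_hat),
           bisect_root(g, lam_hat, lam_hat * 50))
    half = z * lam_hat / math.sqrt(n)  # Wald half-width, SE = lam_hat / sqrt(n)
    wald = (lam_hat - half, lam_hat + half)
    return lrt, wald


lrt, wald = exp_intervals(n=10, total=8.0)
print(f"LRT : ({lrt[0]:.3f}, {lrt[1]:.3f})")
print(f"Wald: ({wald[0]:.3f}, {wald[1]:.3f})")
```

Comparing the two half-widths of the LRT interval about $\hat\lambda$ makes the right skew explicit, whereas the Wald half-widths are equal by construction.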

Normal $\mu$ with known $\sigma$ . The likelihood is $L(\mu) \propto \exp\bigl(-\sum(x_i - \mu)^2 / (2\sigma^2)\bigr)$ , and

\ell(\mu) \;=\; -\dfrac{n(\mu - \bar x)^2}{2\sigma^2} + \text{const}, \qquad \hat\mu = \bar x.

The log-likelihood is EXACTLY quadratic in $\mu$ . The LRT statistic at $\mu$ is

2\bigl[\ell(\hat\mu) - \ell(\mu)\bigr] \;=\; n(\mu - \bar x)^2 / \sigma^2.

Setting this to $\chi^2_{1, 1-\alpha} = z_{1-\alpha/2}^2$ gives $(\mu - \bar x)^2 = z^2 \sigma^2 / n$ , i.e. $\mu = \bar x \pm z \sigma / \sqrt n$ . This is EXACTLY the Wald CI. The LRT CI and the Wald CI COINCIDE for the Normal-mean problem — because Wald is the LRT's parabolic approximation, and the Normal log-likelihood IS its own parabolic approximation. Every textbook proof of the Wald CI for the Normal mean is implicitly the LRT proof too.
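A quick numerical check of the coincidence: run the generic bisection on the Normal statistic $n(\mu - \bar x)^2/\sigma^2$ and compare with the closed-form Wald interval. The sample values and the known $\sigma$ below are invented for the example, and the helper names are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


xs = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6]  # made-up sample
sigma = 0.7                          # assumed known standard deviation
n = len(xs)
x_bar = fmean(xs)
z = NormalDist().inv_cdf(0.975)

# LRT statistic minus the chi^2_1 cutoff (which equals z^2)
g = lambda mu: n * (mu - x_bar) ** 2 / sigma ** 2 - z * z

lrt_ci = (bisect_root(g, x_bar - 10 * sigma, x_bar),
          bisect_root(g, x_bar, x_bar + 10 * sigma))
half = z * sigma / math.sqrt(n)
wald_ci = (x_bar - half, x_bar + half)

print("LRT :", lrt_ci)
print("Wald:", wald_ci)
```

The two intervals agree up to the bisection tolerance, as the algebra says they must.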

The first widget makes the $\chi^2/2$ cutoff visible by drawing the log-likelihood curve directly. Pick a model (Bernoulli, Exponential, Normal known $\sigma$ ), choose a sample size $n$ and a true parameter, draw one sample, and look at:

The actual log-likelihood $\ell(\theta)$ (solid green) on a 300-point grid through the parameter space.
The quadratic / Wald approximation $-\tfrac12 (\theta - \hat\theta)^2 / \widehat{\mathrm{SE}}^2$ (dashed yellow) — the second-order Taylor expansion at $\hat\theta$ .
The horizontal LRT cutoff at $y = -\chi^2_{1, 1-\alpha}/2$ (dashed red).
The LRT CI shaded as a green vertical band over the $\theta$ -axis: the $\theta$ values where the log-likelihood is ABOVE the cutoff.
Two horizontal bars below the plot: Wald (yellow) and LRT (green), side by side for comparison.
A summary table with widths, asymmetry, and whether each CI covers the true parameter.

Things to verify in the widget:

Start at Bernoulli, n = 30, true p = 0.10. The MLE often lands at $\hat p \approx 0.05$–$0.15$, close to the boundary. The dashed-yellow parabola sits ABOVE the green log-likelihood to the left of $\hat p$ (the true curve plunges to $-\infty$ as $p \to 0$) and BELOW it to the right of $\hat p$: the quadratic approximation is BAD at small $p$. The LRT cutoff intersects the green curve at two asymmetric points; the green-shaded LRT CI is wider on the right of $\hat p$ than the left, and the Wald CI sometimes clips at 0 (its lower endpoint goes negative, then we project back to 0). The two CI BARS at the bottom of the plot disagree visibly.
Bernoulli, n = 30, true p = 0.50. The MLE lands near 0.5; the log-likelihood is roughly symmetric in $p$ around 0.5 (the parameter space is far from either boundary). The dashed parabola tracks the green curve closely, and the LRT and Wald CIs are visually indistinguishable. Re-roll a few times; both bars stay tight and overlap.
Bernoulli, n = 5, k = 0 (re-roll until you hit k = 0, or set true p very small). The MLE is $\hat p = 0$. The Wald CI collapses: $\widehat{\mathrm{SE}} = 0$, so Wald gives $(0, 0)$, a degenerate point. The LRT CI does the right thing: the lower endpoint clamps at 0 (the boundary) and the upper endpoint is the $p_U$ where $2(n - k) \log(1/(1-p_U)) = \chi^2$, i.e. $p_U = 1 - \exp(-\chi^2 / (2n))$. For $n = 5, \chi^2 = 3.84$: $p_U \approx 1 - \exp(-0.384) \approx 0.319$. This is the LRT analogue of the $k = 0$ "rule of three" $(0, 3/n)$ from §3.1: LRT gives a SENSIBLE one-sided bound where Wald fails.
Switch to Exponential, n = 10. Re-roll a few times. The dashed-yellow parabola pulls above the green log-likelihood at small $\lambda$ and below at large $\lambda$ — Wald over-allocates uncertainty to the left and under-allocates to the right. The LRT CI is visibly wider on the right of $\hat\lambda$ ; the Wald CI is symmetric and shifted left. The asymmetry in the LRT CI bar is the most important visual takeaway of the widget.
Switch to Normal known sigma, any n. The dashed parabola overlaps the green curve PERFECTLY — the log-likelihood IS a parabola for this model. The LRT and Wald CI bars are identical. This confirms the special-case equivalence and reassures the reader that LRT is NOT a sleight-of-hand; it agrees with Wald when Wald is exactly right and disagrees in the right direction when Wald is approximate.
For Bernoulli, slide n up to 200. The dashed parabola is essentially indistinguishable from the green log-likelihood. The LRT and Wald CIs match to 2-3 decimal places. The LRT advantage SHRINKS as $n \to \infty$ ; both CIs are first-order accurate asymptotically. The advantage is MODERATE-n and BOUNDARY behaviour — not asymptotic improvement.
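The $k = 0$ case in the list above has a closed-form upper endpoint, $p_U = 1 - \exp(-\chi^2/(2n))$, which is easy to tabulate next to the rule-of-three bound $3/n$. A minimal sketch, with a hypothetical function name:

```python
import math
from statistics import NormalDist


def lrt_upper_bound_zero_successes(n, conf_level=0.95):
    """Upper LRT endpoint when k = 0: solves 2n*log(1/(1-p)) = chi^2_1 cutoff."""
    cutoff = NormalDist().inv_cdf(0.5 + conf_level / 2) ** 2
    return 1.0 - math.exp(-cutoff / (2 * n))


for n in (5, 10, 30, 100):
    p_upper = lrt_upper_bound_zero_successes(n)
    print(f"n={n:3d}  LRT upper = {p_upper:.3f}  rule of three = {3 / n:.3f}")
```

For large $n$ the LRT bound behaves like $1.92/n$, which is why it sits below $3/n$ throughout the table.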

Profile likelihood: from one parameter to many

The single-parameter LRT generalises to MULTI-PARAMETER models via PROFILE LIKELIHOOD. With parameter vector $\theta = (\theta_1, \theta_2)$ where $\theta_1$ is the parameter of interest and $\theta_2$ is a NUISANCE parameter (or vector of nuisance parameters), the profile log-likelihood is

\ell_p(\theta_1) \;=\; \max_{\theta_2} \ell(\theta_1, \theta_2),

i.e. for each candidate value of $\theta_1$ , REOPTIMISE $\theta_2$ to maximise the joint log-likelihood. The result is a one-dimensional curve over $\theta_1$ that captures the "best-case" behaviour of the model at each $\theta_1$ .

The profile LRT CI for $\theta_1$ uses the same $\chi^2/2$ cutoff as the one-parameter case:

C^{\mathrm{prof}}_{1-\alpha} \;=\; \bigl\{\theta_1 \;:\; 2[\ell_p(\hat\theta_1) - \ell_p(\theta_1)] \le \chi^2_{1, 1-\alpha}\bigr\}.

This works because Wilks's theorem says the $\chi^2$ dimension equals the number of CONSTRAINED parameter directions, NOT the total number of parameters in the model. Profiling out $\theta_2$ leaves $\theta_1$ as the only constraint under $H_0: \theta_1 = \theta_1^*$ , so the asymptotic distribution is still $\chi^2_1$ . This is sometimes called the "Wilks ratio for profile likelihoods" (Pawitan 2001, In All Likelihood, §3.5).

The classical reference is VENZON AND MOOLGAVKAR (1988, Applied Statistics 37(1), 87–94), who showed how to locate the profile-likelihood endpoints efficiently with a modified Newton–Raphson iteration. Profile-likelihood routines in modern likelihood software (for example, confint() applied to a glm fit in R) rest on the same profile-and-reoptimise idea. A simple step-and-reoptimise version of the search:

Start at $\hat\theta_1$ and a target log-likelihood reduction of $\chi^2/2$ .
Take a step in $\theta_1$ ; at each step REOPTIMISE $\theta_2$ via Newton–Raphson on the joint log-likelihood with $\theta_1$ fixed.
Continue until the profile log-likelihood drops by exactly $\chi^2/2$ from its maximum.
Repeat in the opposite direction for the other endpoint.

Total cost: roughly $2 \times K \times O(\theta_2 \text{ optimisation})$ where $K \approx 20$–$80$ is the number of bisection / Newton steps per endpoint. For a logistic regression with $p = 10$ coefficients, each inner optimisation is a 9-dimensional Newton–Raphson on the log-likelihood with $\theta_1$ held fixed, which is cheap if you warm-start from the full MLE and reuse its Hessian. The overall cost is a few hundred likelihood evaluations per CI endpoint, which is fine for moderate models and slow for very large ones. In R, calling confint(fit) on a glm fit reports profile-likelihood CIs of this kind.
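A small worked instance of profiling, under a simplifying assumption: for a Normal sample with unknown $\sigma^2$ as the nuisance parameter, the inner maximisation has the closed form $\hat\sigma^2(\mu) = \tfrac1n \sum (x_i - \mu)^2$, so no numerical inner optimiser is needed. The sketch below is not the Venzon–Moolgavkar iteration; it is plain bisection on the profile curve, with hypothetical function names and made-up data.

```python
import math
from statistics import NormalDist, fmean


def profile_loglik(mu, xs):
    """l_p(mu) = max over sigma^2 of the Normal log-likelihood (constants dropped)."""
    n = len(xs)
    sigma2_hat = sum((x - mu) ** 2 for x in xs) / n  # inner maximiser in closed form
    return -0.5 * n * math.log(sigma2_hat) - 0.5 * n


def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def profile_ci(xs, conf_level=0.95):
    """Profile-LRT CI for mu: where the profile curve is within chi^2/2 of its peak."""
    cutoff = NormalDist().inv_cdf(0.5 + conf_level / 2) ** 2
    mu_hat = fmean(xs)
    peak = profile_loglik(mu_hat, xs)
    g = lambda mu: 2 * (peak - profile_loglik(mu, xs)) - cutoff
    spread = max(xs) - min(xs)  # assumes the observations are not all equal
    return (bisect_root(g, mu_hat - 10 * spread, mu_hat),
            bisect_root(g, mu_hat, mu_hat + 10 * spread))


xs = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6]  # made-up sample
print("95% profile-LRT CI for mu:", profile_ci(xs))
```

In this model the answer can be checked by hand: the profile statistic is $n \log\bigl(1 + (\mu - \bar x)^2/\hat\sigma^2\bigr)$, so the endpoints are $\bar x \pm \hat\sigma \sqrt{\exp(\chi^2/n) - 1}$. In models without a closed-form inner maximiser, profile_loglik would call a numerical optimiser instead, which is where the cost discussed above comes from.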

Why LRT CIs often beat Wald: three structural reasons

The LRT CI methodology has three structural advantages over the Wald CI, articulated across Cox & Hinkley (1974, Theoretical Statistics, §7.2), Pawitan (2001, §3.4–3.5), and Casella & Berger (2002, §9.2).

(1) No quadratic approximation. The Wald CI assumes $\ell(\theta) \approx \ell(\hat\theta) - \tfrac12 (\theta - \hat\theta)^2 / \widehat{\mathrm{SE}}^2$ — the second-order Taylor expansion at the MLE. For finite $n$ this approximation is GOOD when the true log-likelihood is nearly quadratic (Normal mean, regression coefficients far from boundaries, large samples) and POOR otherwise (Bernoulli $p$ near 0 or 1, Exponential $\lambda$ at small $T$ , variance components near 0). The LRT CI uses the ACTUAL log-likelihood, no Taylor expansion. When the log-likelihood is exactly quadratic — Normal mean with known $\sigma$ — the two coincide; otherwise the LRT CI follows the true curvature.

(2) Asymmetric CIs near boundaries. The Wald CI is forced symmetric $\hat\theta \pm z \cdot \widehat{\mathrm{SE}}$ by construction — both endpoints are equidistant from the MLE in absolute value. This is wrong when the log-likelihood is asymmetric, e.g. Bernoulli $p$ near the boundary $p \approx 0.05$ where the log-likelihood drops fast to the left (heading to 0) and slow to the right (heading toward 1). The LRT CI follows the actual shape: it can be (and usually is) wider on the side where the log-likelihood drops more slowly. This shows up clearly in the Bernoulli example and in regression coefficients constrained near 0 by the data.

(3) Reparameterization invariance. Suppose $g$ is a smooth monotone reparameterisation $\phi = g(\theta)$ . The LRT CI for $\theta$ and the LRT CI for $\phi$ are EQUIVARIANT under $g$ : if $[\theta_L, \theta_U]$ is the LRT CI for $\theta$ , then $[g(\theta_L), g(\theta_U)]$ is the LRT CI for $\phi$ (with the endpoints possibly swapped if $g$ is decreasing). This is because the log-likelihood is invariant under reparameterisation: $\ell_\phi(\phi) = \ell_\theta(g^{-1}(\phi))$ , and the $\chi^2/2$ cutoff is a level on the log-likelihood, not on the parameter axis.

The Wald CI, by contrast, is NOT reparameterization-invariant. The Wald CI for $\theta$ uses $\widehat{\mathrm{SE}}_\theta$; the Wald CI for $\phi$ uses $\widehat{\mathrm{SE}}_\phi = |g'(\hat\theta)| \cdot \widehat{\mathrm{SE}}_\theta$ (the delta-method approximation). Applying the transformation $g$ to the Wald CI on $\theta$ gives $[g(\hat\theta - z \cdot \widehat{\mathrm{SE}}_\theta), g(\hat\theta + z \cdot \widehat{\mathrm{SE}}_\theta)]$, which agrees with the Wald CI on $\phi$ only to first order in $z \cdot \widehat{\mathrm{SE}}_\theta$. In finite samples the two differ.

The classical illustration is the odds $\psi = p / (1 - p)$ vs the probability $p$. The Wald CI for $p$ can extend below 0, and so can the Wald CI for $\psi$ (even though $\psi > 0$); the Wald CI for $\log \psi$ always stays inside its parameter space (the whole real line), but back-transforming it to $p$ gives yet another interval that agrees with neither of the other two. The LRT CI for $p$, the LRT CI for $\psi$, and the LRT CI for $\log \psi$ all give the SAME interval after back-transformation. Reparameterization invariance is one of the strongest properties of any CI procedure and one that Wald notably lacks.
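The invariance claim can be checked numerically. The sketch below reuses the Exponential example ($n = 10$, $T = 8$) with the reparameterisation $\phi = \log \lambda$: it computes the LRT CI on both scales and the Wald CI on both scales, then maps the $\phi$ intervals back to $\lambda$. The names and search brackets are assumptions of the sketch.

```python
import math
from statistics import NormalDist

N, TOTAL = 10, 8.0
Z = NormalDist().inv_cdf(0.975)
CUTOFF = Z * Z
LAM_HAT = N / TOTAL


def loglik_lambda(lam):
    return N * math.log(lam) - lam * TOTAL


def loglik_phi(phi):
    # same likelihood, written in terms of phi = log(lambda)
    return loglik_lambda(math.exp(phi))


def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lrt_ci(loglik, mle, lo, hi):
    peak = loglik(mle)
    g = lambda t: 2 * (peak - loglik(t)) - CUTOFF
    return bisect_root(g, lo, mle), bisect_root(g, mle, hi)


lrt_lam = lrt_ci(loglik_lambda, LAM_HAT, 1e-6, 50.0)
lrt_phi = lrt_ci(loglik_phi, math.log(LAM_HAT), math.log(1e-6), math.log(50.0))
lrt_phi_back = tuple(math.exp(e) for e in lrt_phi)

se_lam = LAM_HAT / math.sqrt(N)
wald_lam = (LAM_HAT - Z * se_lam, LAM_HAT + Z * se_lam)
se_phi = se_lam / LAM_HAT  # delta method with g'(lambda) = 1/lambda
wald_phi_back = (math.exp(math.log(LAM_HAT) - Z * se_phi),
                 math.exp(math.log(LAM_HAT) + Z * se_phi))

print("LRT  on lambda        :", lrt_lam)
print("LRT  on phi, mapped   :", lrt_phi_back)
print("Wald on lambda        :", wald_lam)
print("Wald on phi, mapped   :", wald_phi_back)
```

The two LRT lines agree up to the bisection tolerance, while the two Wald lines differ noticeably at this sample size.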

The second widget moves from "one sample, two CIs side by side" to "across the parameter space, what is the actual COVERAGE of each CI?". For the Binomial $(n, p)$, the coverage can be computed exactly by SUMMATION over $k = 0, \ldots, n$, with no Monte Carlo error. This is the same Brown–Cai–DasGupta (2001, Statistical Science) framework that §3.1 used for the Wald-vs-Wilson comparison, applied here to the profile-LRT CI that §3.3 develops.

For each true $p$ , the coverage is

\mathrm{cov}(p) \;=\; \sum_{k=0}^{n} \mathbb{1}\bigl[p \in C(k)\bigr] \cdot \binom{n}{k} p^k (1-p)^{n-k}.

The widget computes this exactly for $p$ on a fine grid and plots the resulting coverage curve. Both Wald and profile-LRT curves are drawn; the Wilson curve (also from §3.1) is available as a toggle for context.
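The coverage sum is short enough to compute directly. The following is a sketch of the calculation, not the widget's own code: function names are hypothetical, the Wald interval is clipped to $[0, 1]$, and the LRT interval clamps to the boundary when $k = 0$ or $k = n$.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)
CUTOFF = Z * Z


def lrt_stat(p, k, n):
    """Binomial LRT statistic, taking 0*log(0) as 0."""
    p_hat = k / n
    stat = 0.0
    if k > 0:
        stat += 2 * k * math.log(p_hat / p)
    if k < n:
        stat += 2 * (n - k) * math.log((1 - p_hat) / (1 - p))
    return stat


def bisect_root(f, lo, hi, tol=1e-10):
    """Bisection on [lo, hi]; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lrt_ci(k, n):
    g = lambda p: lrt_stat(p, k, n) - CUTOFF
    eps = 1e-12
    lower = 0.0 if k == 0 else bisect_root(g, eps, k / n)
    upper = 1.0 if k == n else bisect_root(g, k / n, 1 - eps)
    return lower, upper


def wald_ci(k, n):
    p_hat = k / n
    half = Z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


def exact_coverage(p, n, ci_fn):
    """Sum the Binomial(n, p) probabilities of every k whose interval contains p."""
    coverage = 0.0
    for k in range(n + 1):
        lower, upper = ci_fn(k, n)
        if lower <= p <= upper:
            coverage += math.comb(n, k) * p ** k * (1 - p) ** (n - k)
    return coverage


for p in (0.05, 0.10, 0.30, 0.50):
    print(f"p={p:.2f}  Wald={exact_coverage(p, 20, wald_ci):.3f}  "
          f"LRT={exact_coverage(p, 20, lrt_ci):.3f}")
```

Evaluating exact_coverage on a fine grid of $p$ values gives the sawtooth curves; the jumps occur where $p$ crosses an endpoint of one of the $n + 1$ possible intervals.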

Things to verify in the widget:

Start at n = 20, 95% nominal. Look at the yellow Wald curve: it oscillates wildly between 89% and 99% across the $p$ range, with deep dips below 90% near $p = 0$ and $p = 1$ — the classical sawtooth. The green profile-LRT curve is much smoother and stays close to 95% across most of (0, 1). The summary table confirms: Wald min coverage ~ 85-89%, mean coverage ~ 93%; LRT min ~ 93%, mean ~ 95%.
Toggle ON the Wilson (cyan) curve for context. The LRT and Wilson curves often look nearly identical — this is because Wilson is the inversion of the SCORE test, and the score and likelihood-ratio tests are asymptotically equivalent under regularity. They coincide for the binomial to leading order in $1/n$ . The point: profile-LRT is a generic recipe that REDISCOVERS the Wilson CI (and its excellent coverage) automatically — without needing to know the score-test machinery.
Slide n up to 100. The Wald sawtooth narrows; both curves move closer to 95%. By n = 200 the curves are visually indistinguishable. The LRT advantage is MODERATE-n boundary behaviour, not asymptotic. The widget makes this dependence on $n$ directly visible.
Drop n down to 10. The sawtooth is dramatic: Wald drops to 70–85% in places, with two pronounced deep dips, one on each side of $p = 0.5$. LRT stays in the 90–96% range. The verdict column flags Wald as "under-covers". For very small $n$ neither method is great near the boundary, since both have unavoidable discreteness, but LRT remains demonstrably better.
Drag the point-eval slider to $p = 0.05$. The point-eval table reports exact coverage at that single $p$. Wald lands far below nominal: at n = 20 the exact sum gives about 64%, because the $k = 0$ outcome has probability $0.95^{20} \approx 0.36$ and its zero-width Wald interval misses $p$. LRT and Wilson land much closer to nominal, within a few percentage points. This is the boundary regime where Wald is famously bad and LRT is the cure.
Switch to 99% nominal. The chi-square cutoff grows to 6.63 (from 3.84) and the LRT CIs widen. Wald likewise — but Wald's sawtooth structure is now more severe (more $k$ transitions land below 99%). LRT remains close to nominal.

Regularity and what happens when it fails

Wilks's theorem assumes the true parameter is in the INTERIOR of the parameter space. When that fails — variance component $\sigma^2 = 0$ in a mixed model, or testing $H_0: p = 0$ on the boundary — the asymptotic distribution of the LRT statistic is NOT $\chi^2_1$ . Chernoff (1954, Annals of Mathematical Statistics 25(3), 573–578) worked out the correct distribution for the one-sided boundary case: a 50:50 MIXTURE of a point mass at 0 and a $\chi^2_1$ . Self & Liang (1987) and Stram & Lee (1994) extended the result to multi-parameter boundary cases.
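One practical consequence of the 50:50 mixture is a smaller critical value. If the statistic is 0 with probability 1/2 and $\chi^2_1$ otherwise, a level-$\alpha$ test needs $\tfrac12 P(\chi^2_1 > c) = \alpha$, i.e. $c = z_{1-\alpha}^2$ rather than $z_{1-\alpha/2}^2$. A minimal sketch with hypothetical function names, valid for $\alpha < 0.5$:

```python
from statistics import NormalDist


def boundary_lrt_cutoff(alpha):
    """Critical value c with 0.5 * P(chi2_1 > c) = alpha, i.e. c = z_{1-alpha}^2."""
    return NormalDist().inv_cdf(1.0 - alpha) ** 2


def boundary_lrt_pvalue(stat):
    """P-value under the 50:50 mixture of a point mass at 0 and chi^2_1."""
    if stat <= 0:
        return 1.0
    chi2_1_tail = 2.0 * (1.0 - NormalDist().cdf(stat ** 0.5))
    return 0.5 * chi2_1_tail


print("interior cutoff (chi^2_1, 5%):", round(NormalDist().inv_cdf(0.975) ** 2, 3))
print("boundary cutoff (mixture, 5%):", round(boundary_lrt_cutoff(0.05), 3))
```

Using the interior $\chi^2_1$ cutoff on a boundary problem is therefore conservative: the p-value it reports is twice the mixture p-value.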

For the calculus-and-algebra-tutor context, the regularity conditions are typically satisfied: the Bernoulli, Exponential, and Normal parameters in the widget are all interior parameters when the user picks reasonable true values. The widget BOUNDARY-CLAMP code handles the $k = 0$ or $k = n$ Bernoulli edge cases (where the LRT lower / upper endpoint hits the boundary at 0 or 1) — those still give SENSIBLE intervals, but the formal asymptotic theory shifts to Chernoff's mixture. In Part 4 regression and Part 8 GLMs, the regularity conditions are the standard ones (Fisher information positive-definite, interior parameter); the LRT CIs are the gold-standard recommendation there.

The reader should remember the conditions as a CAVEAT for boundary-parameter problems (variance components, proportion $p = 0$ ), not as a problem with LRT CIs in typical use. The chi-square calibration is a large-sample asymptotic; for very small $n$ even the interior-parameter LRT CI has finite-sample coverage error of order $n^{-1}$ .

Try it

In the lrt-ci-builder, pick Bernoulli, n = 30, true p = 0.10. Re-roll the sample five times. For each, write down the LRT CI endpoints (lower / upper) and the Wald CI endpoints. Compute the LRT CI asymmetry $(\text{upper half-width} - \text{lower half-width}) / \text{total width}$ across the five rolls. Is it consistently positive (right-skewed) and what does that tell you about the log-likelihood?
Same widget. Pick Normal known sigma, any n. Re-roll a few times. Verify visually that the dashed-yellow parabola sits ON TOP of the green log-likelihood curve. Confirm: the LRT and Wald CI bars at the bottom are identical to the displayed precision. Argue why this is mathematically inevitable (the Normal log-likelihood is exactly quadratic in $\mu$ ).
Same widget. Pick Bernoulli, n = 5, then re-roll until you get $k = 0$. Look at the resulting CIs. The Wald CI collapses to a point $(0, 0)$. The LRT CI has lower endpoint at 0 (the boundary clamp) and a sensible upper endpoint. Compute $p_U = 1 - \exp(-\chi^2/(2n)) = 1 - \exp(-3.84 / 10) \approx 0.32$. Compare with the widget's reported $p_U$. Note that this is the LRT analogue of the "rule of three" $p_U \approx 3/n$, and that the LRT answer is noticeably tighter (0.32 versus $3/5 = 0.6$): the rule of three is a one-sided 95% bound built from $-\log 0.05 \approx 3.0$, whereas the two-sided LRT cutoff is only $\chi^2/2 \approx 1.92$ log-likelihood units.
In the profile-vs-wald-coverage, set n = 20, 95% nominal. Toggle the Wilson curve ON. Compare the LRT (green) and Wilson (cyan) curves across the $p$ range. Argue why they look nearly identical: the score test (which Wilson inverts) and the likelihood-ratio test are asymptotically equivalent for the binomial. They differ at second order in $1/n$ , which is invisible to the eye.
Same widget. Slide n from 10 up to 200 (the available discrete steps). Watch the Wald curve's sawtooth shrink and the worst-case under-coverage move toward nominal. Argue: this is the asymptotic regularity of Wald — eventually the Normal approximation becomes good enough — and it is the reason Wald is "fine for large samples" despite being bad in moderate ones.
Pen-and-paper. For the Exponential model with $n = 10, T = 8$, write the LRT equation $2n[\log(\hat\lambda/\lambda) - (1 - \lambda/\hat\lambda)] = 3.84$ with $\hat\lambda = 1.25$. Try $\lambda = 0.6$: the left side is $20 \cdot [\log(2.083) - (1 - 0.48)] = 20 \cdot [0.734 - 0.52] = 20 \cdot 0.214 = 4.28$, which is too high, so $\lambda = 0.6$ is outside the CI. Try $\lambda = 0.65$: $20 \cdot [\log(1.923) - 0.48] = 20 \cdot [0.654 - 0.48] = 3.48$, which is below the cutoff, so $\lambda = 0.65$ is inside. The lower endpoint is between 0.6 and 0.65; bisecting gives $\lambda_L \approx 0.626$. Compare with the Wald lower $1.25 - 1.96 \cdot 1.25/\sqrt{10} \approx 0.475$. Argue: Wald under-estimates the lower endpoint because it ignores the log-likelihood's left-side falloff.
Pen-and-paper. State Wilks's theorem and explain why it gives $\chi^2_1$ for a one-parameter test even when the model has many parameters (with the rest profiled out). Hint: the dimension is the number of CONSTRAINED directions in parameter space under $H_0$, not the total dimension of the parameter vector. Cite Venzon & Moolgavkar (1988) for the endpoint-finding algorithm.
Pen-and-paper. Consider the reparameterisation $\phi = \log(\lambda)$ for the Exponential model. State why the LRT CI for $\lambda$ at $(\lambda_L, \lambda_U)$ corresponds to the LRT CI for $\phi$ at $(\log \lambda_L, \log \lambda_U)$ exactly — no first-order delta-method approximation. Then argue why the Wald CI for $\lambda$ does NOT back-transform exactly to the Wald CI for $\phi$ . This is the reparameterization-invariance advantage.

Pause and reflect: §3.3 has set out the third major CI methodology, after Wald (§3.1) and bootstrap (§3.2). The LRT confidence interval is the set of $\theta$ values where the log-likelihood is within $\chi^2_{1, 1-\alpha}/2$ of its peak; for 95% nominal, that is 1.92 log-likelihood units. The construction inverts the test calibrated by Wilks's 1938 theorem: the LRT test statistic $2[\ell(\hat\theta) - \ell(\theta)]$ is asymptotically $\chi^2_1$ under $H_0: \theta = \theta_0$, and inverting the test gives the CI. For multi-parameter models, the PROFILE log-likelihood $\ell_p(\theta_1) = \max_{\theta_2} \ell(\theta_1, \theta_2)$ generalises the construction. LRT CIs are NON-SYMMETRIC, REPARAMETERIZATION-INVARIANT, and follow the actual log-likelihood shape; these are the three structural advantages over Wald. The cost is numerical: root-finding for the endpoints, with reoptimisation of the nuisance parameters at every step in the profile case (Venzon & Moolgavkar 1988 is the classical algorithmic reference). §3.4 picks up with PREDICTION INTERVALS, the uncertainty-about-a-future-observation analogue of the CI.

What you now know

You can state the LIKELIHOOD-RATIO STATISTIC $\Lambda(\theta) = L(\hat\theta)/L(\theta)$ and the log-form $2[\ell(\hat\theta) - \ell(\theta)]$, and you can recognise it as the test statistic for $H_0: \theta = \theta_0$. You can state WILKS'S THEOREM (Wilks 1938): under regularity, the LRT statistic is asymptotically $\chi^2_q$ where $q$ is the number of constrained parameters. You can invert the test to get the LRT confidence interval $C^{\mathrm{LRT}} = \{\theta : \ell(\theta) \ge \ell(\hat\theta) - \chi^2/2\}$.

You can derive the LRT CI for three canonical models: the Bernoulli $p$ (numerical bisection on $2k\log(\hat p / p) + 2(n-k)\log((1-\hat p)/(1-p)) = \chi^2$), the Exponential $\lambda$ (bisection on $2n[\log(\hat\lambda/\lambda) - (1 - \lambda/\hat\lambda)] = \chi^2$), and the Normal $\mu$ with known $\sigma$ (closed form, identical to Wald: $\bar x \pm z \sigma/\sqrt n$). You know the Normal case is the special-case equivalence between Wald and LRT that confirms the quadratic-approximation interpretation.

You can state PROFILE LIKELIHOOD as the multi-parameter generalisation: $\ell_p(\theta_1) = \max_{\theta_2} \ell(\theta_1, \theta_2)$. You know that the endpoints are found by stepping $\theta_1$ and reoptimising the nuisance parameters, that Venzon & Moolgavkar (1988) is the classical algorithmic reference, and that confint() in R returns profile-likelihood CIs for glm fits.

You can articulate the THREE structural advantages of LRT over Wald: (1) no quadratic / Gaussian approximation needed; (2) automatically asymmetric for asymmetric log-likelihoods, especially near boundaries; (3) reparameterization invariance: for any smooth monotone $g$, the image under $g$ of the LRT CI for $\theta$ is exactly the LRT CI for $g(\theta)$.
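The invariance claim in (3) is easy to check numerically. The sketch below, using made-up Exponential data ($n = 10$, $T = 20$), computes the LRT CI once in the rate parameterisation $\lambda$ and once in the mean parameterisation $\mu = 1/\lambda$, then does the same for Wald with the standard errors $\hat\lambda/\sqrt n$ and $\hat\mu/\sqrt n$. The helper names `solve` and `lrt_ci` are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)
CUT = Z * Z  # chi^2_{1, 0.95}


def solve(g, inside, outside, iters=200):
    """Bisection for a root of g, with g(inside) < 0 < g(outside)."""
    for _ in range(iters):
        mid = 0.5 * (inside + outside)
        if g(mid) < 0:
            inside = mid
        else:
            outside = mid
    return 0.5 * (inside + outside)


def lrt_ci(loglik, theta_hat):
    """95% LRT CI for a positive scalar parameter with MLE theta_hat."""
    g = lambda t: 2.0 * (loglik(theta_hat) - loglik(t)) - CUT
    lower = solve(g, theta_hat, theta_hat * 1e-9)
    outer = 2.0 * theta_hat
    while g(outer) < 0:  # expand until the bracket leaves the CI
        outer *= 2.0
    return lower, solve(g, theta_hat, outer)


n, total = 10, 20.0
lam_hat, mu_hat = n / total, total / n

rate_ll = lambda lam: n * math.log(lam) - lam * total  # ell(lambda)
mean_ll = lambda mu: -n * math.log(mu) - total / mu    # same model, mu = 1/lambda

rate_ci = lrt_ci(rate_ll, lam_hat)
mean_ci = lrt_ci(mean_ll, mu_hat)

half = Z / math.sqrt(n)
wald_rate = (lam_hat * (1 - half), lam_hat * (1 + half))
wald_mean = (mu_hat * (1 - half), mu_hat * (1 + half))

print("LRT  mean CI:", mean_ci, " 1/rate CI:", (1 / rate_ci[1], 1 / rate_ci[0]))
print("Wald mean CI:", wald_mean, " 1/rate CI:", (1 / wald_rate[1], 1 / wald_rate[0]))
```

The two LRT intervals agree up to root-finding tolerance once the rate endpoints are inverted, while the two Wald intervals do not.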

You can use the lrt-ci-builder widget to draw the log-likelihood curve, the $\chi^2/2$ cutoff, and both the LRT and Wald CIs as horizontal bars — seeing them coincide for the Normal mean and diverge for the Bernoulli near boundaries. You can use the profile-vs-wald-coverage widget to compare exact coverage curves: Wald's sawtooth oscillation near $p = 0, 1$ vs the smoother, closer-to-nominal LRT curve.
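The exact coverage curves can also be reproduced offline. A minimal sketch, assuming a Binomial model and 95% nominal level: coverage at a given $p$ is the total Binomial probability of those outcomes $k$ whose interval contains $p$. For the LRT interval no root-finding is needed, because $p$ is inside the interval exactly when the LRT statistic at $p$ is at most $\chi^2_{1, 0.95}$. The function names and the choice $n = 20$ are illustrative, not taken from the widget.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)
CUT = Z * Z  # chi^2_{1, 0.95}


def lrt_covers(p, k, n):
    """True when p lies in the LRT CI built from k successes in n trials."""
    p_hat = k / n
    stat = 0.0
    if k > 0:
        stat += k * math.log(p_hat / p)
    if k < n:
        stat += (n - k) * math.log((1 - p_hat) / (1 - p))
    return 2.0 * stat <= CUT


def wald_covers(p, k, n):
    """True when p lies in the Wald CI p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    p_hat = k / n
    half = Z * math.sqrt(p_hat * (1 - p_hat) / n)
    return abs(p_hat - p) <= half


def exact_coverage(p, n, covers):
    """Sum of Binomial(n, p) probabilities over outcomes whose CI contains p."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if covers(p, k, n)
    )


n = 20
for p in (0.02, 0.05, 0.10, 0.30, 0.50):
    wald = exact_coverage(p, n, wald_covers)
    lrt = exact_coverage(p, n, lrt_covers)
    print(f"p={p:.2f}  Wald={wald:.3f}  LRT={lrt:.3f}")
```

Evaluating `exact_coverage` on a fine grid of $p$ values gives the full curves; the coarse grid here is only to keep the output short.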

You know the REGULARITY CONDITIONS for Wilks's theorem (interior parameter, smooth log-likelihood, identifiability, positive-definite Fisher information), and you know the Chernoff (1954) result for a parameter on the boundary of the parameter space (variance components, $H_0: p = 0$): the LRT statistic is then asymptotically a 50:50 mixture of a point mass at zero and $\chi^2_1$, not $\chi^2_1$.
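The practical consequence of the boundary result is a different critical value and p-value. The sketch below assumes the single-parameter case, where the null distribution is a 50:50 mixture of a point mass at zero and $\chi^2_1$; the function names are hypothetical. Under that mixture the tail probability is half the $\chi^2_1$ tail, so the level-$\alpha$ critical value is $\chi^2_{1, 1-2\alpha}$ rather than $\chi^2_{1, 1-\alpha}$.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def chi2_1_sf(x):
    """P(chi^2_1 > x), using chi^2_1 = Z^2."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(math.sqrt(x)))


def boundary_p_value(lrt_stat):
    """P-value under the 50:50 mixture of a point mass at 0 and chi^2_1."""
    if lrt_stat <= 0:
        return 1.0
    return 0.5 * chi2_1_sf(lrt_stat)


def boundary_critical_value(alpha):
    """Solve 0.5 * P(chi^2_1 > c) = alpha for c (valid for alpha < 0.5)."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha)  # chi^2_{1, 1-2*alpha} = z_{1-alpha}^2
    return z * z


print(f"{boundary_critical_value(0.05):.2f}")  # 2.71, versus 3.84 for chi^2_1
```

Using the ordinary $\chi^2_1$ reference distribution at the boundary therefore doubles the p-value and makes the test conservative.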

Where this lands in the rest of Part 3 and the textbook. §3.4 distinguishes PREDICTION INTERVALS (uncertainty about a future observation) from CONFIDENCE INTERVALS (uncertainty about a parameter); the LRT machinery extends but the chi-square dimension changes. §3.5 takes calibration seriously: empirical-coverage studies for LRT, Wald, and bootstrap CIs side by side. §3.6 closes Part 3 on the communication side. Part 4 (regression) uses PROFILE-LIKELIHOOD CIs as the DEFAULT for generalised linear models: in R, calling confint() on a glm() fit dispatches to confint.glm(), which returns them, and they are the preferred choice for coefficient inference in logistic regression, Poisson regression, and the full GLM family. The chi-square inversion you just learned is the same one used there.

References

Wilks, S.S. (1938). "The large-sample distribution of the likelihood ratio for testing composite hypotheses." Annals of Mathematical Statistics 9(1), 60–62. (The foundational paper. Proves the $\chi^2_q$ asymptotic distribution of the log-likelihood-ratio statistic under regularity, where $q$ is the number of constrained parameters.)
Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Section 7.2 develops the LRT-based CI and the $\chi^2_1$ calibration via Wilks 1938. The standard intermediate-graduate reference.)
Venzon, D.J., Moolgavkar, S.H. (1988). "A method for computing profile-likelihood-based confidence intervals." Applied Statistics 37(1), 87–94. (The standard algorithmic reference for profile-likelihood CIs: a modified Newton–Raphson iteration that solves directly for the interval endpoints, avoiding a full re-maximisation over the nuisance parameters at every trial value.)
Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. (Comprehensive treatment of likelihood-based inference. Chapter 3 on the LRT CI; §3.4–3.5 on profile likelihood; §10 on the multi-parameter Wilks ratio.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (§9.7 on the LRT and §10.6 on likelihood-based CIs; readable introductory treatment.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Section 9.2.1 develops the LRT-based CI as the inversion of the LRT, with worked examples for the Normal mean, the binomial proportion, and the exponential rate. The standard graduate-textbook treatment.)
Chernoff, H. (1954). "On the distribution of the likelihood ratio." Annals of Mathematical Statistics 25(3), 573–578. (The boundary-parameter correction: when the true parameter is on the boundary of the parameter space, the LRT statistic is asymptotically a 50:50 mixture of a point mass at zero and $\chi^2_1$ rather than $\chi^2_1$.)
Self, S.G., Liang, K.-Y. (1987). "Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions." JASA 82(398), 605–610. (Extension of Chernoff 1954 to multi-parameter boundary cases — the modern reference for variance-component testing in mixed models.)
Brown, L.D., Cai, T.T., DasGupta, A. (2001). "Interval estimation for a binomial proportion." Statistical Science 16(2), 101–117. (The exact-coverage comparison framework. §3.1 covered the Wald-vs-Wilson-vs-Clopper-Pearson verdict; §3.3 extends to the LRT method that this paper previewed.)
