Consistency, asymptotics, and "large enough"

Part 1 — Estimation

Learning objectives

  • Define the three convergence modes for random variables: convergence in PROBABILITY ($\theta_n \to_p \theta$: for every $\varepsilon > 0$, $P(|\theta_n - \theta| > \varepsilon) \to 0$), ALMOST SURELY ($P(\theta_n \to \theta) = 1$), and IN DISTRIBUTION ($F_{\theta_n}(x) \to F_\theta(x)$ at every continuity point of $F_\theta$); and read the strict implications a.s. ⇒ in-probability ⇒ in-distribution, with none of the reverse implications true in general
  • Define CONSISTENCY as $\hat\theta_n \to_p \theta$ (weak) or $\hat\theta_n \to_{\text{a.s.}} \theta$ (strong), and recognise that consistency does NOT require unbiasedness: the MLE for $\sigma^2$ on Normal data, $(1/n) \sum (X_i - \bar X)^2$, is biased at finite $n$ but consistent
  • State ASYMPTOTIC NORMALITY as $\sqrt{n}(\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ where $\sigma^2$ is the asymptotic variance, recognise this as the CLT-style statement that powers every Gaussian-based confidence interval, and connect to §1.4: when the MLE achieves the CRLB, $\sigma^2 = 1 / I(\theta)$
  • State the DELTA METHOD: if $\sqrt n (\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$, then $\sqrt n (g(\hat\theta_n) - g(\theta)) \to_d \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$; apply to log-rate, log-odds, ratio of means
  • State SLUTSKY's theorem: if $X_n \to_d X$ and $Y_n \to_p c$ (constant), then $X_n Y_n \to_d c X$ and $X_n + Y_n \to_d X + c$; recognise it as the licence to replace a true SE by a consistent SE estimate when standardising $(\hat\theta - \theta) / \widehat{\text{SE}}$
  • Articulate the practical "large enough" rules: sample mean for finite-variance symmetric populations is essentially Gaussian by $n \approx 30$; right-skewed populations need $n \approx 100$; heavy-tailed-but-finite-variance populations (Lognormal, t with small df) may need $n \ge 1000$; sample variance needs much larger $n$; sample max is NEVER asymptotically Gaussian
  • State the BERRY-ESSÉEN bound $\sup_x |F_{Z_n}(x) - \Phi(x)| \le C \, \mathbb{E}|X - \mu|^3 / (\sigma^3 \sqrt n)$ for some absolute constant $C$ (the sharpest value quoted in this section is $C \le 0.4748$, Shevtsova 2010), and read it as the quantitative CLT rate: convergence is $O(1/\sqrt n)$ in the sup-distance, and the multiplier is the SKEWNESS-like ratio $\mathbb{E}|X - \mu|^3 / \sigma^3$ of $X$
  • Identify three CLT-failure regimes and the alternative tool each calls for: (i) infinite variance (Cauchy, certain Pareto) → use the median or another robust estimator (§1.8); (ii) finite variance but pathological skew at borderline $n$ → use the bootstrap (§1.7) or t-rather-than-Gaussian critical values; (iii) extreme order statistics (sample max, min) → extreme-value theory (Gumbel / Fréchet / Weibull limits, not Gaussian)
  • Recognise diagnostics for "is n large enough" — compare the sampling distribution to a Gaussian via a Q-Q plot, bootstrap a single sample and check the bootstrap distribution's normality, or run the convergence-modes widget for the problem at hand; treat the asymptotic Gaussian as a TOOL, not a religion

§1.1 through §1.8 built the estimator catalogue: bias and variance (§1.1), method of moments (§1.2), MLE (§1.3), the CRLB lower bound on variance (§1.4), the bias-variance trade-off frontier (§1.5), the sampling distribution as the central object (§1.6), the bootstrap as a sampling-distribution engine (§1.7), and M-estimators as the robust replacement when classical estimators break (§1.8). Every one of those sections appealed informally to a notion of "as $n$ grows large": bias goes to zero, the sampling distribution shrinks, the CLT pulls things toward Gaussian. §1.9 closes Part 1 by making that informal limit theory precise.

The §1.9 arc has six stops. First, the THREE CONVERGENCE MODES (in-probability, almost-sure, in-distribution) with their strict implication chain and the warning that none of the reverse arrows holds. Second, CONSISTENCY as the formal version of "more data eventually gets you to the truth" and the warning that consistency is a different property from unbiasedness. Third, ASYMPTOTIC NORMALITY as the foundation of every CLT-based CI and test downstream. Fourth, the DELTA METHOD as the workhorse for SEs of transformed parameters (log-odds in logistic regression, log-rate in Poisson, ratio of means). Fifth, SLUTSKY's THEOREM as the licence that lets you replace a true SE by a consistent SE estimate in the standardisation $(\hat\theta - \theta) / \widehat{\text{SE}}$ without breaking the Gaussian limit. Sixth, the PRACTICAL QUESTION (when is $n$ "large enough" for the asymptotic Gaussian to bite?) with concrete rules of thumb, the Berry-Esséen quantitative bound, and three explicit failure regimes where the answer is "never, use a different tool". Two widgets thread the section: the FIRST visualises all three convergence modes simultaneously; the SECOND makes the delta method tactile.

The three convergence modes

A sequence of random variables $\theta_1, \theta_2, \ldots$ can "converge to a limit $\theta$" in three formally distinct senses. The differences matter because every limit theorem you will see (LLN, CLT, asymptotic normality of the MLE) uses ONE of these modes specifically.

Convergence in probability ($\theta_n \to_p \theta$). For every $\varepsilon > 0$,

$$\lim_{n \to \infty} P\bigl(|\theta_n - \theta| > \varepsilon\bigr) \;=\; 0.$$

Reading: the probability that $\theta_n$ misses $\theta$ by more than $\varepsilon$ shrinks to zero as $n$ grows. This is the mode the WEAK LAW OF LARGE NUMBERS uses: for iid finite-variance $X_1, X_2, \ldots$ with mean $\mu$, $\bar X_n \to_p \mu$.

Almost-sure convergence ($\theta_n \to_{\text{a.s.}} \theta$). The event "$\theta_n$ converges to $\theta$ in the usual deterministic sense" has probability 1:

$$P\bigl(\lim_{n \to \infty} \theta_n = \theta\bigr) \;=\; 1.$$

Reading: pick a realisation of the entire infinite sequence; almost every realisation is one where $\theta_n$ eventually stays inside any neighbourhood of $\theta$. This is what the STRONG LAW OF LARGE NUMBERS gives: for iid $X_i$ with finite mean, $\bar X_n \to_{\text{a.s.}} \mu$.

Convergence in distribution ($\theta_n \to_d \theta$). The CDFs converge: $F_{\theta_n}(x) \to F_\theta(x)$ at every continuity point of $F_\theta$. Reading: the shape of the $\theta_n$-distribution stabilises onto the shape of $\theta$'s distribution, regardless of which realisation you happen to draw. This is the mode the CENTRAL LIMIT THEOREM uses: for iid finite-variance $X_i$ with mean $\mu$ and variance $\sigma^2$,

$$\sqrt n \, (\bar X_n - \mu) \;\to_d\; \mathcal{N}(0, \sigma^2).$$

The three modes are ordered by strength:

$$\theta_n \to_{\text{a.s.}} \theta \;\Longrightarrow\; \theta_n \to_p \theta \;\Longrightarrow\; \theta_n \to_d \theta.$$

None of the reverse arrows holds in general. You can construct sequences that converge in probability but not almost surely (the typewriter sequence; see Wasserman 2004 §5.5 or DasGupta 2008 Ch. 3) and sequences that converge in distribution but not in probability (independent draws all sharing the same limiting distribution). When the LIMIT is a CONSTANT $c$, however, $\to_d c$ and $\to_p c$ ARE equivalent, a useful special case in Slutsky-style proofs below.
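The in-probability definition can be checked numerically. The sketch below is a minimal Monte Carlo illustration, not part of the section's widgets: it assumes an Exponential(rate = 1) population (so $\mu = 1$), a tolerance $\varepsilon = 0.1$, and a hypothetical helper name `miss_probability`; the seed and replication count are arbitrary choices.

```python
import random

def miss_probability(n, eps=0.1, reps=1000, seed=0):
    """Monte Carlo estimate of P(|mean of n Exp(1) draws - 1| > eps)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - 1.0) > eps:
            misses += 1
    return misses / reps

# Estimated miss probability at three sample sizes; it should fall towards 0.
probs = {n: miss_probability(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:4d}   P(|mean - 1| > 0.1) is roughly {p:.3f}")
```

The estimated miss probability should drop sharply as $n$ grows, which is the weak law of large numbers in action for this one population and this one $\varepsilon$.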

The first widget makes the three modes simultaneously visible. The LEFT panel plots realisation paths of $\hat\theta_n = \bar X_n$ as $n$ grows from 1 to 5000 for 100 independent simulation runs; that is the in-probability / a.s. picture, with the 5-95% envelope of paths shaded as a band. The RIGHT panels plot the histogram of the standardised error $\sqrt n (\bar X_n - \theta)$ at three fixed sample sizes; that is the in-distribution picture, with the asymptotic Gaussian N(0, σ²) overlay. Reader picks the population: for Normal, Exponential, Uniform, Bernoulli, Lognormal you see the envelope collapse AND the histogram stabilise; for Cauchy you see NEITHER: the envelope refuses to shrink at the 1/√n rate, and the histogram stays heavy-tailed at every $n$ because Cauchy has no finite variance.

[Interactive figure: Convergence Modes widget]

Things to verify:

  • Normal: envelope width at $n = 100$ is about $2 \cdot 1.645 / \sqrt{100} \approx 0.33$; the right-panel histogram at $n = 10$ is already indistinguishable from N(0, 1). Convergence is essentially exact at every $n$.
  • Exponential: envelope shrinks at the 1/√n rate; the right-panel histogram at $n = 10$ is visibly right-skewed (echoing the population) but by $n = 1000$ it lies on top of N(0, 1).
  • Lognormal: same rate but slower visual convergence; even $n = 1000$ still shows a hint of right skew. This is the Berry-Esséen multiplier $C \cdot \mathbb{E}|X - \mu|^3 / (\sigma^3 \sqrt n)$ in action: large third absolute central moment ⇒ slow rate.
  • Bernoulli(0.3): the right-panel histogram at $n = 10$ is LUMPY because $\sqrt n (\hat p - p)$ takes only 11 values; by $n = 100$ the lumps blur into a smooth bell.
  • Cauchy: the envelope refuses to shrink. The path lines wander dramatically at every $n$. The three right-panel histograms have the SAME heavy-tailed shape regardless of $n$; the CLT machinery has no traction. The widget says so explicitly in the verdict line. The cure for Cauchy is the median (which IS asymptotically Normal, §1.8) or the bootstrap (§1.7).
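The envelope checks in the list above can be reproduced outside the widget. The following is a rough sketch under stated assumptions: it uses 1000 independent runs rather than the widget's 100, a fixed seed, and hypothetical helper names (`draw_normal`, `draw_cauchy`, `envelope_width`); it is not the widget's own code.

```python
import math
import random

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_cauchy(rng):
    # Inverse-CDF draw from the standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))

def envelope_width(draw, n, runs=1000, seed=1):
    """Width of the 5-95% band of the sample mean across independent runs."""
    rng = random.Random(seed)
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(runs))
    return means[runs * 95 // 100] - means[runs * 5 // 100]

# (Normal width, Cauchy width) at each sample size.
widths = {n: (envelope_width(draw_normal, n), envelope_width(draw_cauchy, n))
          for n in (10, 100, 1000)}
for n, (w_normal, w_cauchy) in widths.items():
    print(f"n = {n:4d}   Normal: {w_normal:.3f}   Cauchy: {w_cauchy:.3f}")
```

For Normal the width at $n = 100$ should land near the $0.33$ computed above and keep shrinking. For Cauchy the sample mean is itself Cauchy(0, 1) at every $n$, so the width should hover around $2 \tan(0.45\pi) \approx 12.6$ without shrinking.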

Consistency — and how it differs from unbiasedness

An estimator $\hat\theta_n$ is (weakly) consistent for $\theta$ if $\hat\theta_n \to_p \theta$; strongly consistent if $\hat\theta_n \to_{\text{a.s.}} \theta$. Reading: consistency is the formal version of "throw enough data at it and you eventually nail the right answer". It is a LARGE-SAMPLE property: it says nothing about how the estimator behaves at any specific finite $n$.

Consistency is a different property from unbiasedness. Here is the textbook example that drives the distinction home. For iid $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$, the MLE of $\sigma^2$ is

$$\hat\sigma^2_{\text{MLE}} \;=\; \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X_n)^2.$$

Direct calculation gives $\mathbb{E}[\hat\sigma^2_{\text{MLE}}] = \sigma^2 (n - 1)/n < \sigma^2$: BIASED downward, by a factor $(n - 1)/n$. The "Bessel-corrected" sample variance $s^2 = (1/(n-1)) \sum (X_i - \bar X)^2$ is unbiased. Yet both are CONSISTENT: as $n \to \infty$, the bias $-\sigma^2 / n \to 0$ and the variance also goes to zero, so by Chebyshev $\hat\sigma^2 \to_p \sigma^2$. The MLE's bias is "harmless asymptotically": consistency only asks that the FINAL answer be right, not that every intermediate finite-$n$ answer be unbiased.

The two properties are LOGICALLY INDEPENDENT in both directions:

  • An estimator can be UNBIASED but NOT CONSISTENT. Example: $T_n = X_1$ (use the first observation, ignore the rest). It is unbiased for the population mean $\mu$, but $T_n$ does not converge in probability to $\mu$: its distribution is the same at every $n$, so $P(|T_n - \mu| > \varepsilon)$ never shrinks. A "single observation" estimator never improves with more data.
  • An estimator can be CONSISTENT but NOT UNBIASED at every $n$. The MLE of $\sigma^2$ above is one example. Another is the MLE of $\theta$ for Uniform(0, θ): $\hat\theta = \max X_i$ has bias $-\theta/(n+1)$ but is consistent (and both its bias and its spread shrink at rate $1/n$, faster than the usual $1/\sqrt n$; see §1.4 for the regularity-failure context).
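The biased-but-consistent behaviour of the variance MLE is easy to see by simulation. This is a minimal sketch assuming Normal(0, 1) data (so $\sigma^2 = 1$), an arbitrary seed and replication count, and a hypothetical helper name `average_variance_estimates`.

```python
import random

def average_variance_estimates(n, reps=2000, seed=2):
    """Average the MLE (divide by n) and Bessel (divide by n - 1) variance
    estimates over many Normal(0, 1) samples of size n."""
    rng = random.Random(seed)
    total_mle = 0.0
    total_bessel = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)
        total_mle += ss / n
        total_bessel += ss / (n - 1)
    return total_mle / reps, total_bessel / reps

results = {n: average_variance_estimates(n) for n in (5, 50, 500)}
for n, (mle, bessel) in results.items():
    print(f"n = {n:3d}   mean MLE: {mle:.3f} (theory {(n - 1) / n:.3f})   "
          f"mean Bessel: {bessel:.3f} (theory 1.000)")
```

At $n = 5$ the MLE average should sit near $(n - 1)/n = 0.8$ while the Bessel version sits near 1; by $n = 500$ the two are practically indistinguishable, which is the sense in which the bias is asymptotically harmless.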

The practical heuristic: among consistent estimators, prefer the one with smaller asymptotic variance; among unbiased estimators, prefer the one with smaller variance. Consistency is a MINIMUM requirement for any estimator you intend to use with growing data; unbiasedness is a finite-sample property that may or may not matter depending on whether your $n$ is small enough for the $O(1/n)$ bias term to dominate the $O(1/\sqrt n)$ standard error.

Asymptotic normality — the foundation of CLT-based inference

Consistency tells you the estimator eventually gets the right answer; ASYMPTOTIC NORMALITY tells you the RATE and the SHAPE of the residual error. The standard statement is

$$\sqrt n \, (\hat\theta_n - \theta) \;\to_d\; \mathcal{N}(0, \sigma^2),$$

where $\sigma^2$ is the asymptotic variance. The $\sqrt n$ rescaling is what stops the limit from being a degenerate point mass at zero (the unrescaled $\hat\theta_n - \theta$ converges to 0 by consistency; rescaling magnifies the residual error to a non-trivial limit). Equivalently, for large $n$,

$$\hat\theta_n \;\approx\; \mathcal{N}\!\left(\theta,\; \sigma^2 / n\right),$$

which is the form every Gaussian-based confidence interval and Wald test relies on downstream (Part 2, Part 3). For the SAMPLE MEAN this is the classical CLT: $\sigma^2 = \operatorname{Var}(X_i)$. For the MLE under regularity, this is the asymptotic-normality-of-the-MLE theorem of §1.4: $\sigma^2 = 1 / I(\theta)$, the inverse Fisher information, which is also the Cramér-Rao lower bound, so the MLE is asymptotically EFFICIENT (achieves the CRLB asymptotically; see §1.4's crlb-vs-empirical widget for the live picture).

What the statement is NOT: it is NOT a claim that $\hat\theta_n$ is Gaussian for any specific $n$. The Gaussian limit is approached as $n \to \infty$; at finite $n$ the sampling distribution may be visibly skewed, lumpy, or otherwise non-Gaussian. The §1.6 widget made this point empirically; the §1.9 widget above adds the in-probability path view that lets you SEE the residual error stabilising at the $1/\sqrt n$ rate as the histograms stabilise on the Gaussian shape.

The delta method — SEs for transformed parameters

The DELTA METHOD is the linearisation trick that propagates the asymptotic-normality statement through a differentiable transformation $g$.

Theorem (univariate delta method). Suppose $\sqrt n (\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$. Then

$$\sqrt n \, \bigl(g(\hat\theta_n) - g(\theta)\bigr) \;\to_d\; \mathcal{N}\!\left(0,\; [g'(\theta)]^2 \, \sigma^2\right).$$

Equivalently, for large $n$,

$$g(\hat\theta_n) \;\approx\; \mathcal{N}\!\left(g(\theta),\; [g'(\theta)]^2 \, \sigma^2 / n\right).$$

Proof sketch (Taylor expansion + Slutsky). Expand $g$ around $\theta$: $g(\hat\theta_n) = g(\theta) + g'(\theta) (\hat\theta_n - \theta) + R_n$ where, if $g$ is twice differentiable near $\theta$, $R_n = O_p((\hat\theta_n - \theta)^2) = O_p(1/n)$. Multiply by $\sqrt n$: $\sqrt n (g(\hat\theta_n) - g(\theta)) = g'(\theta) \cdot \sqrt n (\hat\theta_n - \theta) + \sqrt n R_n$. The first term tends in distribution to $g'(\theta) \cdot \mathcal{N}(0, \sigma^2) = \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$. The second term is $O_p(1/\sqrt n) \to_p 0$. (With differentiability at $\theta$ alone, $R_n = o_p(|\hat\theta_n - \theta|)$, so $\sqrt n R_n \to_p 0$ still holds.) Slutsky's theorem assembles them.

The delta method is the workhorse for SEs of transformed parameters. Three canonical applications:

  • Log of a positive estimate. $g(\theta) = \log \theta$, $g'(\theta) = 1/\theta$. So $\text{SE}(\log \hat\theta) \approx \text{SE}(\hat\theta) / \hat\theta$. Used for log-rate parameters in Poisson regression, log-hazard in survival analysis, log of a count of events.
  • Log-odds (logit). $g(p) = \log(p / (1 - p))$, $g'(p) = 1 / (p(1 - p))$. So $\text{SE}(\text{logit}(\hat p)) \approx \text{SE}(\hat p) / (\hat p (1 - \hat p))$. With $\text{SE}(\hat p) = \sqrt{\hat p (1 - \hat p) / n}$ this equals $1 / \sqrt{n \hat p (1 - \hat p)}$, which agrees with the information-based SE of the intercept in an intercept-only logistic regression.
  • Ratio of two means. $g(\mu_1, \mu_2) = \mu_1 / \mu_2$, requires the multivariate delta method (next paragraph). Used for relative risk, fold-change in genomics, percent difference.

Multivariate delta method. If $\sqrt n (\hat{\boldsymbol\theta}_n - \boldsymbol\theta) \to_d \mathcal{N}_p(\mathbf 0, \Sigma)$ and $g: \mathbb{R}^p \to \mathbb{R}^q$ is differentiable at $\boldsymbol\theta$ with Jacobian $\mathbf{J} = \partial g / \partial \boldsymbol\theta$ of full row rank, then $\sqrt n (g(\hat{\boldsymbol\theta}_n) - g(\boldsymbol\theta)) \to_d \mathcal{N}_q(\mathbf 0, \mathbf{J} \Sigma \mathbf{J}^\top)$. Same trick, vectorised.
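The univariate delta method can be sanity-checked in a few lines. This sketch assumes the log transformation applied to the sample mean of Exponential(rate = 1) data at $n = 30$ (so $\theta = 1$, $\sigma = 1$, $g'(\theta) = 1$); the helper name `empirical_sd_of_log_mean`, the seed, and the replication count are illustrative choices only.

```python
import math
import random
import statistics

def empirical_sd_of_log_mean(n, reps=2000, seed=3):
    """Empirical SD of log(sample mean) over many Exp(1) samples of size n."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        values.append(math.log(xbar))
    return statistics.stdev(values)

n = 30
theta, sigma = 1.0, 1.0                               # mean and SD of Exp(rate = 1)
delta_sd = abs(1.0 / theta) * sigma / math.sqrt(n)    # |g'(theta)| * sigma / sqrt(n)
empirical_sd = empirical_sd_of_log_mean(n)
ratio = empirical_sd / delta_sd

print(f"delta-method SD: {delta_sd:.4f}")
print(f"empirical SD:    {empirical_sd:.4f}")
print(f"ratio:           {ratio:.3f}")
```

The delta-method SD is $1/\sqrt{30} \approx 0.183$; the empirical SD should come out within a few percent of it, with the small excess reflecting the curvature of the log that the linearisation ignores.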

The second widget makes the delta method tactile. Reader picks a transformation $g$ from a menu (identity, log, √, 1/θ, sigmoid, atan, exp, θ²) and a base estimator (sample mean of Normal, Exponential, or Bernoulli proportion). The widget simulates $R = 2000$ samples, computes $\hat\theta$ and $g(\hat\theta)$ on each, and overlays the delta-method Gaussian $\mathcal{N}(g(\theta), [g'(\theta)]^2 \sigma^2/n)$ on the right-panel histogram. When the orange curve hugs the histogram, the delta method works. When it doesn't, you have found a regime where the linearisation breaks, typically because $g'(\theta) \approx 0$ (e.g. $g(\theta) = \theta^2$ at $\theta = 0$) or because $g$ has too much curvature at the relevant scale.

[Interactive figure: Delta Method Demo widget]

Things to verify:

  • Identity: $g(\theta) = \theta$. The two panels are visually identical; the empirical SD ratio is 1.00 ± 0.02. Sanity check.
  • Log of Exp mean at θ = 1: $g'(1) = 1$, so the delta SD equals the direct SE. Empirical ratio ≈ 1. The right-panel histogram is centred at $\log 1 = 0$ instead of 1 and otherwise nearly indistinguishable from the left.
  • Sigmoid at the Normal(2, 1) mean: $g'(2) = \sigma(2)(1 - \sigma(2)) \approx 0.105$. Delta-method SD shrinks by that factor; empirical SD matches within 5%.
  • Reciprocal of Bernoulli(0.3) proportion: at $n = 10$ the empirical histogram of $1 / \hat p$ is wildly right-skewed because $\hat p$ can be very small. Delta method underestimates the spread. Increase $n$ to 200 and the delta-method Gaussian fits better.
  • Square of the Normal mean (Normal base with θ = 2, transformation θ²): at $\theta = 2$ the delta method works, since $g'(2) = 4 \neq 0$. The interesting case is $\theta = 0$, but the widget's bases all have nonzero $\theta$, so it cannot be selected directly. The CONCEPTUAL point, that $g'(\theta) = 2\theta = 0$ at $\theta = 0$ produces a non-Gaussian chi-squared-like distribution for $\hat\theta^2$, is described in the verdict line when you choose a base whose $\theta$ is near zero or pick a transformation with vanishing derivative.
  • exp of Normal(2, 1) mean: $g'(2) = e^2 \approx 7.39$. Delta SD blows up. At $n = 30$ the empirical histogram is visibly right-skewed (lognormal-like); the delta-method Gaussian is the right SCALE but the wrong SHAPE. Increase $n$ and the histogram concentrates and Gaussianises.
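The vanishing-derivative case that the widget cannot reach directly can be simulated by hand. The sketch below assumes Normal(0, 1) data (so $\theta = 0$) and $g(\theta) = \theta^2$; the function name, seed, and replication count are illustrative. For this population $n \bar X_n^2$ is exactly chi-squared with 1 degree of freedom, so the right scaling is $n$, not $\sqrt n$, and the limit is not Gaussian.

```python
import random

def scaled_squared_means(n, reps=4000, seed=4):
    """Draw n * (sample mean)^2 for many Normal(0, 1) samples of size n."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        out.append(n * xbar ** 2)
    return out

values = scaled_squared_means(n=50)
share_below_one = sum(v <= 1.0 for v in values) / len(values)
mean_value = sum(values) / len(values)

# A chi-squared(1) variable has mean 1 and P(value <= 1) of about 0.683.
print(f"mean of n * xbar^2:        {mean_value:.3f}")
print(f"share of values below 1:   {share_below_one:.3f}")
```

All simulated values are non-negative and pile up near zero, which no Gaussian centred at $g(\theta) = 0$ could reproduce.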

Slutsky's theorem — the licence to swap in a consistent SE estimate

SLUTSKY's THEOREM is the technical result that makes the practical standardisation $(\hat\theta - \theta) / \widehat{\text{SE}}$ work. The statement:

If $X_n \to_d X$ and $Y_n \to_p c$ for some CONSTANT $c$, then

$$X_n + Y_n \;\to_d\; X + c, \qquad X_n \cdot Y_n \;\to_d\; c \cdot X, \qquad X_n / Y_n \;\to_d\; X / c \;\;\text{(if } c \neq 0\text{).}$$

Slutsky says that in-distribution + in-probability-to-a-constant combine cleanly. Where it matters: a typical Wald CI is built from

$$\frac{\hat\theta - \theta}{\widehat{\text{SE}}(\hat\theta)} \;=\; \frac{\sqrt n (\hat\theta - \theta)}{\sqrt n \, \widehat{\text{SE}}(\hat\theta)}.$$

The numerator $\sqrt n (\hat\theta - \theta) \to_d \mathcal{N}(0, \sigma^2)$ by asymptotic normality. The denominator $\sqrt n \, \widehat{\text{SE}}(\hat\theta)$ is a consistent estimate of $\sigma$ (a finite positive constant). By Slutsky, the ratio $\to_d \mathcal{N}(0, \sigma^2) / \sigma = \mathcal{N}(0, 1)$. The standardised statistic is asymptotically standard Normal, so the Wald $95\%$ CI $\hat\theta \pm 1.96 \cdot \widehat{\text{SE}}$ has asymptotic coverage $95\%$.

Without Slutsky you would be stuck: you have a Gaussian limit when you divide by the TRUE $\sigma / \sqrt n$, but in practice you never know $\sigma$ and have to estimate it. Slutsky says replacing $\sigma$ by a CONSISTENT estimate $\hat\sigma$ does not break the Gaussian limit. The same machinery underlies the $t$-statistic, the score statistic, and every "studentised" quantity in classical inference.
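A quick simulation shows the swap Slutsky licenses. This is a sketch under assumptions not in the text: Exponential(rate = 1) data (true mean 1, true SD 1), $n = 400$, a fixed seed, and a hypothetical helper `wald_coverage` that builds the interval once with the true SE and once with the estimated SE.

```python
import math
import random

def wald_coverage(n, reps=2000, seed=5, z=1.96):
    """Coverage of mean +/- z * SE for Exp(1) data, with the true SE
    (sigma = 1 known) and with the estimated SE (sample SD plugged in)."""
    rng = random.Random(seed)
    hits_true = 0
    hits_est = 0
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(xs) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        hits_true += abs(xbar - 1.0) <= z * 1.0 / math.sqrt(n)
        hits_est += abs(xbar - 1.0) <= z * s / math.sqrt(n)
    return hits_true / reps, hits_est / reps

cov_true_se, cov_est_se = wald_coverage(n=400)
print(f"coverage with true SE:      {cov_true_se:.3f}")
print(f"coverage with estimated SE: {cov_est_se:.3f}")
```

Both coverages should land close to the nominal 0.95 at this sample size; replacing $\sigma$ by the sample SD costs very little once $n$ is large, which is the practical content of the theorem.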

A useful one-line warning: Slutsky requires $Y_n \to_p c$ to a CONSTANT. If $Y_n$ converges to a non-degenerate random variable, the conclusion fails. That is why bootstrap-of-bootstrap variance estimates do NOT plug into Slutsky cleanly without extra justification: the bootstrap variance is asymptotically a random functional of the original sampling distribution, not a constant.

"Large enough" — when does the asymptotic Gaussian actually bite?

All of the above is asymptotic. In practice you have a finite $n$ and need to decide whether the asymptotic Gaussian approximation is trustworthy. The honest answer is "it depends": on the estimator, the population, and what you want to do with the resulting CI or test. But several quantitative rules have been distilled in the textbook literature (Wasserman 2004 §5, DasGupta 2008 Ch. 3, Lehmann 1999):

Sample mean of finite-variance populations. The classical CLT applies. As a rule of thumb:

  • SYMMETRIC populations (Normal, Uniform, symmetric mixtures): $n \ge 30$ is generally fine. The Berry-Esséen bound below kicks in immediately because the third absolute moment is comparable to $\sigma^3$.
  • MODERATELY SKEWED populations (Exponential, Chi-squared with small df, lognormal-like): $n \ge 100$ for the central body of the sampling distribution. The tails may need more.
  • HEAVILY SKEWED or HEAVY-TAILED but finite-variance populations (Lognormal(0, 1), t with df = 5): $n \ge 1000$ for tight Gaussian approximation. The third absolute moment is large, so the Berry-Esséen rate is slow.
  • BIMODAL or otherwise structured populations: depends on the mode-separation. Rule of thumb is irrelevant; check the sampling distribution directly.

Sample variance. Needs much larger $n$ than the sample mean. The asymptotic variance of $s^2$ depends on the FOURTH central moment of the population, not just the second. For Gaussian populations $\text{Var}(s^2) = 2\sigma^4 / (n - 1)$; for heavy-tailed populations the constant is much larger. Practical rule: even $n = 200$ on a heavy-tailed population may not be enough for $s^2$'s asymptotic Gaussian to be useful.

Correlation. Sample correlation $r$ is asymptotically Gaussian by the delta method (applied to bivariate sample moments), but the convergence rate depends on the true $\rho$ and the bivariate distribution. Fisher's $z$-transformation $\tanh^{-1}(r)$, itself a delta-method application, is approximately Gaussian at much smaller $n$ than $r$ itself.

Sample max (and extreme order statistics). NEVER asymptotically Gaussian. The maximum of $n$ iid draws, suitably normalised, converges to one of three extreme-value distributions (Gumbel, Fréchet, Weibull) depending on the tail behaviour of the population. For Uniform(0, 1), $n(1 - \max X_i) \to_d \operatorname{Exponential}(1)$; for Exponential, the max converges to a Gumbel after subtracting $\log n$. The asymptotic Gaussian CI is the WRONG tool here; extreme-value theory (Coles 2001) is the right one.
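The Uniform(0, 1) limit needs no simulation, because the exact CDF is available: $P(n(1 - \max X_i) \le x) = 1 - (1 - x/n)^n$ for $0 \le x \le n$. The sketch below compares it with the Exponential(1) CDF on a grid; the function names and the grid are illustrative choices.

```python
import math

def scaled_gap_cdf(x, n):
    """Exact P(n * (1 - max(U_1..U_n)) <= x) for iid Uniform(0, 1) draws."""
    if x <= 0:
        return 0.0
    if x >= n:
        return 1.0
    return 1.0 - (1.0 - x / n) ** n

def exponential_cdf(x):
    return 1.0 - math.exp(-x) if x > 0 else 0.0

# Largest CDF discrepancy over the grid x = 0.0, 0.1, ..., 10.0.
gaps = {
    n: max(abs(scaled_gap_cdf(k / 10, n) - exponential_cdf(k / 10))
           for k in range(101))
    for n in (10, 100, 1000)
}
for n, gap in gaps.items():
    print(f"n = {n:4d}   max |exact CDF - Exponential(1) CDF| = {gap:.5f}")
```

The discrepancy shrinks roughly in proportion to $1/n$, and the limit it shrinks towards is an Exponential, not a Gaussian.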

Berry-Esséen — quantifying the CLT rate

The CLT says the standardised sample-mean distribution stabilises onto the standard Normal. The BERRY-ESSÉEN bound quantifies the rate:

$$\sup_x \left| F_{Z_n}(x) - \Phi(x) \right| \;\le\; \frac{C \, \mathbb{E}|X - \mu|^3}{\sigma^3 \, \sqrt n},$$

where $Z_n = \sqrt n (\bar X_n - \mu) / \sigma$, $\Phi$ is the standard-Normal CDF, and $C$ is an absolute constant. Berry (1941) proved the bound holds with some $C$; Esseen (1942) gave a sharper constant. The current sharpest known value for iid summands is $C \le 0.4748$ (Shevtsova 2010, building on Esseen's technique). The exact constant matters less than the SCALING: the sup-distance from the standard Normal decays at the $1/\sqrt n$ rate, with multiplier proportional to the SKEWNESS-like quantity $\mathbb{E}|X - \mu|^3 / \sigma^3$.

Reading the bound: a HIGHLY SKEWED population (large $\mathbb{E}|X - \mu|^3 / \sigma^3$) needs proportionally more sample size to reach the same sup-distance to Normal as a symmetric population. This is exactly the empirical phenomenon you saw in the convergence-modes widget: Lognormal converges visibly slower than Normal at the same $n$. Berry-Esséen is the theorem that says "yes, this is real, and here is the rate".

The bound is on the SUP-DISTANCE (Kolmogorov-Smirnov distance) between the standardised CDF and $\Phi$. It is GLOBAL: it controls the whole distribution. Tighter bounds exist for specific regions: for the central body of the distribution the rate is $1/\sqrt n$; for the TAILS (large deviations) the rate is exponential or slower, depending on the tail of $X$. See Petrov (1975, Ch. V) for the technical refinements.
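For a Bernoulli population the sup-distance can be computed exactly and set beside the bound. The sketch below assumes Bernoulli($p = 0.3$) summands, for which $\mathbb{E}|X - p|^3 = p(1-p)\,(p^2 + (1-p)^2)$ and $\sigma^3 = (p(1-p))^{3/2}$, and takes $C = 0.4748$ from the text; the function names are hypothetical. Because the standardised Binomial CDF is a step function, the supremum is attained at a jump point, approached from the left or the right.

```python
import math
from statistics import NormalDist

def exact_sup_distance(n, p):
    """Exact sup_x |F_Zn(x) - Phi(x)| for the standardised Binomial(n, p)."""
    q = 1.0 - p
    sd = math.sqrt(n * p * q)
    phi = NormalDist().cdf
    cdf_prev = 0.0
    worst = 0.0
    for k in range(n + 1):
        z = (k - n * p) / sd
        cdf_here = cdf_prev + math.comb(n, k) * p ** k * q ** (n - k)
        target = phi(z)
        # Check the step function just before and just after the jump at z.
        worst = max(worst, abs(cdf_prev - target), abs(cdf_here - target))
        cdf_prev = cdf_here
    return worst

def berry_esseen_bound(n, p, C=0.4748):
    q = 1.0 - p
    third_abs_moment = p * q * (p * p + q * q)   # E|X - p|^3 for Bernoulli(p)
    sigma_cubed = (p * q) ** 1.5
    return C * third_abs_moment / (sigma_cubed * math.sqrt(n))

table = {n: (exact_sup_distance(n, 0.3), berry_esseen_bound(n, 0.3))
         for n in (10, 100, 400)}
for n, (exact, bound) in table.items():
    print(f"n = {n:3d}   exact sup-distance: {exact:.4f}   bound: {bound:.4f}")
```

The exact distance should sit below the bound at every $n$ and both should shrink like $1/\sqrt n$; for a lattice population such as Bernoulli much of the distance comes from the jumps of the step function rather than from skewness alone.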

When the CLT fails — three regimes and their fixes

The CLT is not universal. Three classes of failure are worth naming explicitly:

1. Infinite variance. Cauchy is the textbook example. The population mean does not exist and the variance is infinite, so the $\sqrt n$ CLT rescaling does not give a non-trivial limit. The sample mean of $n$ iid Cauchy(0, 1) is STILL Cauchy(0, 1) at every $n$, as the convergence-modes widget shows empirically. The cure: don't use the sample mean. The MEDIAN of iid Cauchy(0, 1) IS asymptotically Normal (the Cauchy density at zero is $1/\pi$, so $\sqrt n \, (\hat m - 0) \to_d \mathcal{N}(0, \pi^2 / 4)$; see §1.8). For heavier tails than Cauchy the bootstrap (§1.7) and robust estimators (§1.8) are the working tools.

2. Finite variance but borderline $n$ with severe skew. Lognormal at small $n$. The CLT applies in the limit but at $n = 30$ the sampling distribution is still visibly right-skewed and the Gaussian CI undercovers on the right tail. The cure: bootstrap CIs (§1.7) make no Gaussian assumption and respect the skew; $t$-critical values give a small but nonzero correction for the unknown-variance case; profile-likelihood CIs (Part 3 §3.3) for parametric models can be much more accurate at moderate $n$ than Wald.

3. Extreme order statistics. Sample max, min, range, and quantile estimates at extreme $p$ (e.g. $p = 1/n$) are NEVER asymptotically Gaussian. The relevant limit theory is extreme-value theory (Coles 2001, Embrechts et al. 1997). The Gumbel/Fréchet/Weibull distributions are the three possible limits for the maximum of iid samples after suitable normalisation. For seismic or insurance applications where the maximum-event size matters, this is the right toolkit; the asymptotic-Gaussian CI is wrong by construction.
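Regime 1 and its cure can be seen side by side in a small simulation. This sketch assumes standard Cauchy data, an odd sample size $n = 101$ so the median is a single observation, a fixed seed, and hypothetical helper names; the spread of the sample mean is summarised by its interquartile range because its SD does not exist.

```python
import math
import random
import statistics

def draw_cauchy(rng):
    # Inverse-CDF draw from the standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))

def simulate_mean_and_median(n, reps=2000, seed=6):
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [draw_cauchy(rng) for _ in range(n)]
        means.append(sum(xs) / n)
        medians.append(statistics.median(xs))
    return means, medians

n = 101
means, medians = simulate_mean_and_median(n)

# sqrt(n) * median should have SD near pi / 2 (about 1.571).
sd_scaled_median = statistics.stdev([math.sqrt(n) * m for m in medians])

# The sample mean is still Cauchy(0, 1), whose interquartile range is 2.
sorted_means = sorted(means)
iqr_means = sorted_means[3 * len(means) // 4] - sorted_means[len(means) // 4]

print(f"SD of sqrt(n) * median: {sd_scaled_median:.3f}  (limit {math.pi / 2:.3f})")
print(f"IQR of sample mean:     {iqr_means:.3f}  (Cauchy(0, 1) IQR is 2)")
```

The median's rescaled SD should sit close to $\pi/2$, slightly above it at this finite $n$, while the sample mean is no more concentrated at $n = 101$ than a single observation would be.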

The honest meta-rule: the asymptotic Gaussian approximation is a tool, not a religion. When it works (most estimators, moderate $n$, finite variance, mild skew) it is the easiest and most natural choice. When it doesn't, use the right alternative: robust estimator (§1.8), bootstrap CI (§1.7), profile-likelihood CI (Part 3), or extreme-value theory (out of scope here). Knowing which regime you are in is the practical content of "asymptotic theory" in applied work.

Diagnostics for "is n large enough"

How do you know whether your $n$ is large enough for the asymptotic Gaussian to bite? Three diagnostics, in increasing order of effort:

  • Q-Q plot the bootstrap distribution. Bootstrap $B = 1000$ resamples of your data, compute $\hat\theta$ on each, and Q-Q-plot the $B$ values against a standard Normal. A straight line ⇒ Gaussian; visible curvature in the tails ⇒ skewness or kurtosis the Wald CI will get wrong. Cheap, model-free, exactly what §1.7 was set up to provide.
  • Compare Wald and bootstrap CIs side-by-side. If they agree (within 10% on width), the Wald CI is fine. If they diverge, especially asymmetrically (Wald symmetric, bootstrap visibly asymmetric), trust the bootstrap and report it instead.
  • Simulate at your $n$. If you have a working parametric model (e.g. from §1.3 MLE), draw simulated samples of size $n$ from the fitted model, compute the SE-based CI on each, and check empirical coverage. If a "95%" CI covers in 92% of simulations, you have a 3-percentage-point undercoverage problem: Berry-Esséen is biting and you should switch to bootstrap or profile-likelihood. This is the most expensive diagnostic but the most honest.

The convergence-modes widget is essentially a live version of diagnostic 3 with a Q-Q-flavoured shape comparison built in. Use it as a baseline for your intuition about which combinations of (population, estimator, $n$) need bootstrap-style backup and which are safe with the asymptotic Gaussian.
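Diagnostic 3 takes only a few lines to run. The sketch below is one possible version, assuming the working model is Lognormal(0, 1) (true mean $e^{1/2}$), $n = 30$, a nominal 95% Wald interval for the mean, and a fixed seed; the helper name `wald_coverage_lognormal` is hypothetical.

```python
import math
import random

def wald_coverage_lognormal(n, reps=4000, seed=7, z=1.96):
    """Empirical coverage of mean +/- z * SE for Lognormal(0, 1) samples,
    plus the share of intervals lying entirely above or below the truth."""
    rng = random.Random(seed)
    true_mean = math.exp(0.5)            # mean of Lognormal(0, 1)
    covered = 0
    interval_too_high = 0
    interval_too_low = 0
    for _ in range(reps):
        xs = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        se = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1) / n)
        if true_mean > xbar + z * se:
            interval_too_low += 1        # truth lies above the upper limit
        elif true_mean < xbar - z * se:
            interval_too_high += 1       # truth lies below the lower limit
        else:
            covered += 1
    return covered / reps, interval_too_low / reps, interval_too_high / reps

coverage, too_low, too_high = wald_coverage_lognormal(n=30)
print(f"empirical coverage:             {coverage:.3f}")
print(f"interval entirely below truth:  {too_low:.3f}")
print(f"interval entirely above truth:  {too_high:.3f}")
```

Coverage should come out clearly below the nominal 95%, and almost all misses should be intervals that fall short of the true mean: a sample without large draws has both a low mean and a small SE. That asymmetry is the cue to switch to a bootstrap or profile-likelihood interval.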

Where §1.9 fits in the textbook

§1.9 closes Part 1 by formalising the limit theory every earlier section appealed to. Concretely:

  • §1.1 (estimator properties). Consistency and asymptotic normality are the two large-sample properties of an estimator. Unbiasedness and finite-sample variance are the small-sample counterparts.
  • §1.3 (MLE). The MLE's standard large-sample story (consistency under regularity, $\sqrt n (\hat\theta_{\text{MLE}} - \theta) \to_d \mathcal{N}(0, 1/I(\theta))$) uses every machine in §1.9.
  • §1.4 (CRLB). The CRLB is the lower bound; asymptotic normality at $\sigma^2 = 1/I(\theta)$ says the MLE achieves it asymptotically. §1.4's crlb-vs-empirical widget visualises that achievement.
  • §1.6 (sampling distributions). The sampling distribution IS the object that converges. §1.6 looked at it empirically; §1.9 names the limit and the rate.
  • §1.7 (bootstrap). When the asymptotic Gaussian approximation is in doubt (heavy tails, small $n$, skewed estimator) the bootstrap is the universal fallback. §1.9 quantifies when "in doubt" applies.
  • §1.8 (robust / M-estimators). Robust M-estimators are asymptotically Normal with variance $\sigma^2 \, \mathbb{E}[\psi^2] / (\mathbb{E}[\psi'])^2$, the M-estimator analogue of the CLT, derivable using exactly the asymptotic-normality + Slutsky machinery developed here.

Looking ahead: Part 2 will use these results to build hypothesis-testing machinery (Wald, score, LR tests). Part 3 §3.1 builds asymptotic CIs using exactly $\hat\theta \pm z_{\alpha/2} \cdot \widehat{\text{SE}}$; §3.3 builds profile-likelihood CIs as the more accurate alternative at moderate $n$. Part 4 (linear regression) leans on the multivariate delta method for SEs of contrasts and transformed coefficients. Part 6 (GLMs) leans on the delta method for SEs on the response scale (e.g. predicted probabilities from a logistic regression). Part 8 (resampling) ties the asymptotic and bootstrap theories together via the bootstrap-of-asymptotic-Normal-statistic results (Singh 1981, Beran 1988).

Try it

  • In the convergence-modes widget, select Normal(0, 1). At n=1000n = 1000 what is the empirical envelope width? Compute by hand the theoretical Gaussian envelope width 21.645σ/n=21.645/10000.1042 \cdot 1.645 \cdot \sigma / \sqrt n = 2 \cdot 1.645 / \sqrt{1000} \approx 0.104 and confirm the widget reports a number close to that. Switch to Lognormal; the envelope width should be larger by (e1)e2.16\sqrt{(e - 1) e} \approx 2.16 — confirm.
  • Switch to Cauchy. Read the verdict line. Note that the 5–95% envelope does NOT shrink at the $1/\sqrt n$ rate as $n$ grows along the x-axis: the envelope width stays roughly constant. The right-panel histograms at $n = 10, 100, 1000$ all have the same heavy-tailed Cauchy shape, because the mean of $n$ iid Cauchy draws is itself Cauchy with the same scale. The CLT fails here: the Cauchy has no finite variance. The widget's verdict line recommends the median (§1.8) or bootstrap (§1.7); connect that to the sample-mean-vs-median story in §1.8.
  • Pen-and-paper: explain why the SAMPLE MAX of iid Uniform(0, 1) is NOT asymptotically Gaussian. (Hint: compute $P(n(1 - \max X_i) \le x)$ directly. The limit is $1 - e^{-x}$, which is Exponential(1), not Gaussian. The convergence is at rate $1/n$, not $1/\sqrt n$. This is the canonical extreme-value example.)
  • In the delta-method-demo widget, pick "X̄ of Exp(rate = 1)" as the base. Choose $g(\theta) = \log \theta$. At $n = 30$, read the empirical SD of $g(\hat\theta)$ and compare to the theoretical delta-method SD $|g'(1)| \cdot 1 / \sqrt{30} = 1 / \sqrt{30} \approx 0.183$. The ratio should be close to 1.
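  • Same base, switch to $g(\theta) = 1/\theta$. At $n = 30$ the empirical SD of $1/\hat\theta$ visibly exceeds the delta-method value $|{-}1/1^2| \cdot 1 / \sqrt{30} = 1 / \sqrt{30} \approx 0.183$: the exact SD is $n / ((n - 1)\sqrt{n - 2}) \approx 0.196$, about 7% larger, and the estimator is also biased upward since $\mathbb{E}[1/\hat\theta] = n/(n - 1)$. Increase $n$ to 500 and the ratio tightens to ≈ 1. The delta method works asymptotically; at borderline $n$ the linearisation residual is visible.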
  • Pick "p̂ of Bernoulli(0.3)" + sigmoid $g$. Note that the derivative value $0.3 \cdot 0.7 = 0.21$ would apply only if the base estimator were on the logit scale, where the sigmoid maps $\operatorname{logit}(0.3)$ to 0.3. Here the base $\hat p$ is already a probability, and we compose $g$ on top, so the relevant derivative is $g'(0.3) = g(0.3)\,(1 - g(0.3)) \approx 0.574 \cdot 0.426 \approx 0.244$. The widget reports the empirical and delta numbers; reconcile them. This is the same algebra that underlies the standard error of the predicted probability from a logistic regression on the response scale.
  • Pen-and-paper: SLUTSKY's theorem says that $X_n \to_d X$ and $Y_n \to_p c$ together imply $X_n + Y_n \to_d X + c$. Show by an explicit counterexample that this FAILS if $Y_n \to_d Y$ instead, where $Y$ is a non-degenerate random variable. (Hint: let $X_n = Z$ and $Y_n = -Z$ for a single standard Normal $Z$. Both sequences converge in distribution to $\mathcal{N}(0, 1)$, yet $X_n + Y_n = 0$ for every $n$, which is definitely not the $\mathcal{N}(0, 2)$ that a sum of independent limits would give.)
  • Berry-Esséen: for iid $X_i \sim \operatorname{Exp}(1)$, $\sigma^2 = 1$ and $\mathbb{E}|X - 1|^3 = 12/e - 2 \approx 2.41$. The bound says $\sup_x |F_{Z_n}(x) - \Phi(x)| \le 0.4748 \cdot 2.41 / \sqrt n$. At $n = 100$ this is $\approx 0.115$: a worst-case gap of about 11 percentage points. Confirm in the convergence-modes widget that at $n = 100$ the Exponential histogram still shows visible mismatch with the Gaussian overlay in the tails, then check at $n = 1000$ that the bound drops to $\approx 0.036$.
  • Pen-and-paper: an estimator can be BIASED at finite $n$ but CONSISTENT. Sketch how this is possible using the MLE for Normal variance $\hat\sigma^2 = (1/n) \sum (X_i - \bar X)^2$. Compute $\operatorname{Bias}(\hat\sigma^2) = -\sigma^2/n$ explicitly and show that both the bias and the variance go to 0 as $n \to \infty$.
  • Construct an estimator that is UNBIASED at every $n$ but NOT CONSISTENT. (Hint: $T_n = X_1$, the first observation. $\mathbb{E}[T_n] = \mu$ for every $n$. But $T_n$ does not converge to $\mu$: it just sits at $X_1$ forever, so its variance stays at $\sigma^2$ for every $n$.) Conclude that "unbiased" and "consistent" are logically independent properties; either can hold without the other.
  • Open question (not in the widgets): can the asymptotic Gaussian approximation be made finite-sample exact for a specific estimator? Answer in the affirmative: the sample mean of EXACTLY Normal iid data is exactly Normal at every $n$. This is one of the very few cases where asymptotic theory coincides with finite-sample exact theory.
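
The Uniform-max exercise in the list above can also be checked numerically. The following is a minimal sketch, not part of the widgets: it uses the exact finite-$n$ distribution $P(\max X_i \le m) = m^n$ and compares it with the Exponential(1) limit on a small grid. The function names and the grid are illustrative choices.

```python
import math


def exact_cdf_scaled_max(x: float, n: int) -> float:
    """P(n * (1 - max X_i) <= x) for n iid Uniform(0, 1) draws, valid for 0 <= x <= n."""
    # P(max <= m) = m**n, so P(max >= 1 - x/n) = 1 - (1 - x/n)**n.
    return 1.0 - (1.0 - x / n) ** n


def limit_cdf(x: float) -> float:
    """Exponential(1) CDF, the limiting distribution named in the hint."""
    return 1.0 - math.exp(-x)


# Largest gap between the exact and limiting CDFs over x in (0, 5].
grid = [0.1 * k for k in range(1, 51)]
max_gap = {
    n: max(abs(exact_cdf_scaled_max(x, n) - limit_cdf(x)) for x in grid)
    for n in (10, 100, 1000)
}

for n, gap in max_gap.items():
    print(f"n = {n:5d}   max |exact - limit| = {gap:.5f}")
```

The gap shrinks as $n$ grows, and the limiting shape is a one-sided Exponential rather than a symmetric bell, which is the point of the exercise.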

Pause and reflect: §1.9 has been a section about LIMITS. Yet every estimator you will ever compute is at a SPECIFIC finite $n$. What is the practical contract that asymptotic theory makes with finite-sample analysis? When $n$ is "large enough" the asymptotic Gaussian gives a tight, transparent CI machinery (Wald). When $n$ is borderline, the asymptotic Gaussian undercovers, but it tells you BY HOW MUCH (Berry-Esséen) and points you toward the alternative (bootstrap, profile likelihood, robust). When $n$ is in the failure regime (heavy tails, max), asymptotic Gaussian theory has the honesty to say "not me, use this other tool". The contract is not "trust the limit"; it is "the limit + its rate + its failure modes are themselves a usable toolkit".

What you now know

Three convergence modes (in-probability, almost-sure, in-distribution) with the strict implication chain a.s. ⇒ p ⇒ d and the warning that none of the reverses holds in general. Consistency is the formal statement $\hat\theta_n \to_p \theta$; it is a different property from unbiasedness, and biased estimators (the MLE for Gaussian variance) can be consistent. Asymptotic normality $\sqrt n (\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ is the rate-and-shape statement that powers every CLT-based CI and test; under MLE regularity $\sigma^2 = 1/I(\theta)$.
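
To make "biased but consistent" concrete, here is a small Monte Carlo sketch for the divide-by-$n$ variance estimator on Normal data. It is an illustration under assumptions, not the section's widget code: the helper names, the seed, the choice $\sigma = 2$, and the replication count are all arbitrary.

```python
import random


def mle_variance(sample):
    """Divide-by-n variance estimate (the Normal MLE for sigma^2)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def average_mle_variance(n, sigma, reps, rng):
    """Monte Carlo estimate of E[sigma_hat^2] at sample size n."""
    total = 0.0
    for _ in range(reps):
        total += mle_variance([rng.gauss(0.0, sigma) for _ in range(n)])
    return total / reps


rng = random.Random(0)  # fixed seed so the run is reproducible
sigma = 2.0             # illustrative choice; true variance is sigma**2 = 4
estimated_bias = {}
for n in (5, 50):
    avg = average_mle_variance(n, sigma, reps=20_000, rng=rng)
    estimated_bias[n] = avg - sigma**2
    theory = -sigma**2 / n  # Bias(sigma_hat^2) = -sigma^2 / n
    print(f"n = {n:3d}   simulated bias = {estimated_bias[n]:+.3f}   theory = {theory:+.3f}")
```

The simulated bias should track $-\sigma^2/n$ and shrink as $n$ grows, which is the finite-$n$ bias vanishing in the limit.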

The delta method propagates asymptotic normality through a smooth transformation: $\sqrt n (g(\hat\theta) - g(\theta)) \to_d \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$. It is the workhorse for SEs of log-rates, log-odds, and ratios. Slutsky's theorem licenses replacing the true SE with a consistent SE estimate inside a standardised statistic, so the Wald CI $\hat\theta \pm 1.96 \, \widehat{\text{SE}}$ has the right asymptotic coverage even when $\sigma$ itself is estimated.
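
The delta-method statement can be checked by simulation for the sample mean of Exp(1) data, where $\theta = 1$ and $\sigma = 1$. The sketch below is a hedged illustration: the function name and seed are hypothetical, and it relies on the fact that the mean of $n$ iid Exp(1) draws is distributed as Gamma(shape $n$, scale 1) divided by $n$, so one Gamma draw stands in for each simulated sample.

```python
import math
import random
import statistics


def sd_ratio(g, g_prime_at_theta, n, reps, rng):
    """Empirical SD of g(X-bar) divided by the delta-method SD, for X_i ~ Exp(1)."""
    # Shortcut (assumption stated in the lead-in): X-bar ~ Gamma(shape=n, scale=1) / n.
    values = [g(rng.gammavariate(n, 1.0) / n) for _ in range(reps)]
    empirical_sd = statistics.pstdev(values)
    delta_sd = abs(g_prime_at_theta) * 1.0 / math.sqrt(n)  # |g'(theta)| * sigma / sqrt(n)
    return empirical_sd / delta_sd


rng = random.Random(1)  # fixed seed for reproducibility
ratio_log_30 = sd_ratio(math.log, 1.0, n=30, reps=50_000, rng=rng)
ratio_inv_30 = sd_ratio(lambda t: 1.0 / t, -1.0, n=30, reps=50_000, rng=rng)
ratio_inv_500 = sd_ratio(lambda t: 1.0 / t, -1.0, n=500, reps=50_000, rng=rng)

print(f"log,  n = 30:  ratio = {ratio_log_30:.3f}")
print(f"1/x,  n = 30:  ratio = {ratio_inv_30:.3f}")
print(f"1/x,  n = 500: ratio = {ratio_inv_500:.3f}")
```

A ratio near 1 means the linearisation is adequate at that $n$; the more curved transformation $1/\theta$ needs a larger $n$ than $\log\theta$ before the ratio settles.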

"Large enough" practical rules: sample mean for finite-variance symmetric populations is essentially Gaussian by $n \approx 30$; right-skewed populations need $n \approx 100$; heavy-tailed-but-finite-variance ones need $n \ge 1000$; sample variance needs much more. The Berry-Esséen bound $\sup_x |F_{Z_n}(x) - \Phi(x)| \le C \, \mathbb{E}|X - \mu|^3 / (\sigma^3 \sqrt n)$ quantifies the $1/\sqrt n$ rate. Three classes of CLT failure: infinite variance (use robust estimators or bootstrap), borderline $n$ with severe skew (use bootstrap or profile likelihood), and extreme order statistics (use extreme-value theory). The asymptotic Gaussian is a tool, not a religion.

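As a worked instance of the Berry-Esséen bound, the sketch below evaluates it for Exp(1) data. It is illustrative only: the function names are hypothetical, the constant 0.4748 is the one quoted in this section, and the third absolute central moment is computed twice, once from the closed form $12/e - 2$ (obtained by splitting the integral at $x = 1$) and once by a plain midpoint rule as a cross-check.

```python
import math

C = 0.4748  # Berry-Esseen constant as quoted in this section


def abs_third_central_moment_exp1(steps=200_000, upper=60.0):
    """Midpoint-rule approximation of E|X - 1|^3 for X ~ Exponential(1)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += abs(x - 1.0) ** 3 * math.exp(-x) * h
    return total


def berry_esseen_bound(rho, sigma, n, c=C):
    """Upper bound on sup_x |F_{Z_n}(x) - Phi(x)| for the standardised mean."""
    return c * rho / (sigma**3 * math.sqrt(n))


rho_closed = 12.0 / math.e - 2.0          # closed form for Exp(1)
rho_numeric = abs_third_central_moment_exp1()

bounds = {n: berry_esseen_bound(rho_closed, sigma=1.0, n=n) for n in (100, 1000, 10_000)}
print(f"E|X - 1|^3: closed form = {rho_closed:.4f}, numeric = {rho_numeric:.4f}")
for n, b in bounds.items():
    print(f"n = {n:6d}   bound = {b:.4f}")
```

Because the bound scales as $1/\sqrt n$, a hundredfold increase in $n$ only tightens it by a factor of ten.
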
This section closes Part 1. The estimator catalogue is built; the asymptotic theory that organises it is in place. Part 2 begins. The opening sections (Neyman-Pearson framework, Type-I/II errors and power, the classical tests (t, chi-square, F) done by hand, what a $p$-value is and is not) use the §1.9 machinery without re-deriving it. Every test statistic in Part 2 has an asymptotic Gaussian or $\chi^2$ limit; every power calculation invokes the asymptotic-normality picture; every $p$-value interpretation rests on the sampling-distribution view §1.6 set up and §1.9 formalised.

References

  • Lehmann, E.L. (1999). Elements of Large-Sample Theory. Springer. (The standard graduate textbook on asymptotic statistics. Chapters 2-3 develop convergence modes, consistency, and asymptotic normality; Chapter 5 the delta method.)
  • van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. (The most-cited modern reference. Chapter 2 covers stochastic convergence; Chapter 3 the delta method; Chapter 5 M- and Z-estimator asymptotics; Chapter 8 efficiency of estimators.)
  • DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer. (Encyclopedic. Chapter 3 covers Berry-Esséen and CLT refinements; Chapter 6 the bootstrap from the asymptotic angle.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 5 is the cleanest one-chapter survey of convergence modes, LLN, and CLT for a stats audience; Chapter 9 covers the delta method and asymptotic theory of estimators.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 10 covers asymptotic evaluations of point estimators including consistency and asymptotic normality.)
  • Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. (Foundational text; the original modern treatment of asymptotic theory for MLE.)
  • Berry, A.C. (1941). "The accuracy of the Gaussian approximation to the sum of independent variates." Transactions of the American Mathematical Society 49(1), 122-136. (The Berry side of the Berry-Esséen theorem.)
  • Esseen, C.G. (1942). "On the Liapunoff limit of error in the theory of probability." Arkiv för Matematik, Astronomi och Fysik 28A, 1-19. (The Esseen side of the Berry-Esséen theorem, sharpening Berry's constant.)
  • Pratt, J.W., Gibbons, J.D. (1981). Concepts of Nonparametric Theory. Springer. (Useful complement on order-statistic asymptotics — the extreme-value regime where the CLT does not apply.)
