Consistency, asymptotics, and "large enough"

Part 1 — Estimation

Learning objectives

Define the three convergence modes for random variables — convergence in PROBABILITY ( $\theta_n \to_p \theta$ : for every $\varepsilon > 0$ , $P(|\theta_n - \theta| > \varepsilon) \to 0$ ), ALMOST SURELY ( $P(\theta_n \to \theta) = 1$ ), and IN DISTRIBUTION ( $F_{\theta_n}(x) \to F_\theta(x)$ at every continuity point of $F_\theta$ ) — and read the strict implications a.s. ⇒ in-probability ⇒ in-distribution, none of the reverses true in general
Define CONSISTENCY as $\hat\theta_n \to_p \theta$ (weak) or $\hat\theta_n \to_{\text{a.s.}} \theta$ (strong), and recognise that consistency does NOT require unbiasedness — the MLE for $\sigma^2$ on Normal data, $(1/n) \sum (X_i - \bar X)^2$ , is biased at finite $n$ but consistent
State ASYMPTOTIC NORMALITY as $\sqrt{n}(\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ where $\sigma^2$ is the asymptotic variance, recognise this as the CLT-style statement that powers every Gaussian-based confidence interval, and connect to §1.4: when the MLE achieves the CRLB, $\sigma^2 = 1 / I(\theta)$
State the DELTA METHOD: if $\sqrt n (\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$ , then $\sqrt n (g(\hat\theta_n) - g(\theta)) \to_d \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$ ; apply to log-rate, log-odds, ratio of means
State SLUTSKY's theorem: if $X_n \to_d X$ and $Y_n \to_p c$ (constant), then $X_n Y_n \to_d c X$ and $X_n + Y_n \to_d X + c$ ; recognise it as the licence to replace a true SE by a consistent SE estimate when standardising $(\hat\theta - \theta) / \widehat{\text{SE}}$
Articulate the practical "large enough" rules: sample mean for finite-variance symmetric populations is essentially Gaussian by $n \approx 30$ ; right-skewed populations need $n \approx 100$ ; heavy-tailed-but-finite-variance populations (Lognormal, t with small df) may need $n \ge 1000$ ; sample variance needs much larger $n$ ; sample max is NEVER asymptotically Gaussian
State the BERRY-ESSÉEN bound $\sup_x |F_{Z_n}(x) - \Phi(x)| \le C \, \mathbb{E}|X - \mu|^3 / (\sigma^3 \sqrt n)$ for some absolute constant $C$ (best known upper bound $C \le 0.4748$ , Shevtsova 2011), and read it as the quantitative CLT-rate: convergence is $O(1/\sqrt n)$ in the sup-distance, and the multiplier is the standardised third absolute moment of $X$ , which is large for skewed or heavy-tailed populations
Identify three CLT-failure regimes and the alternative tool each calls for: (i) infinite variance (Cauchy, certain Pareto) → use the median or another robust estimator (§1.8); (ii) finite variance but pathological skew at borderline $n$ → use the bootstrap (§1.7) or t-rather-than-Gaussian critical values; (iii) extreme order statistics (sample max, min) → extreme-value theory (Gumbel / Fréchet / Weibull limits, not Gaussian)
Recognise diagnostics for "is n large enough" — compare the sampling distribution to a Gaussian via a Q-Q plot, bootstrap a single sample and check the bootstrap distribution's normality, or run the convergence-modes widget for the problem at hand; treat the asymptotic Gaussian as a TOOL, not a religion

§1.1 through §1.8 built the estimator catalogue: bias and variance (§1.1), method of moments (§1.2), MLE (§1.3), the CRLB lower bound on variance (§1.4), the bias-variance trade-off frontier (§1.5), the sampling distribution as the central object (§1.6), the bootstrap as a sampling-distribution engine (§1.7), and M-estimators as the robust replacement when classical estimators break (§1.8). Every one of those sections appealed informally to a notion of "as $n$ grows large" — bias goes to zero, the sampling distribution shrinks, the CLT pulls things toward Gaussian. §1.9 closes Part 1 by making that informal limit theory precise.

The §1.9 arc has six stops. First, the THREE CONVERGENCE MODES — in-probability, almost-sure, in-distribution — with their strict implication chain and the warning that none of the reverse arrows holds. Second, CONSISTENCY as the formal version of "more data eventually gets you to the truth" and the warning that consistency is a different property from unbiasedness. Third, ASYMPTOTIC NORMALITY as the foundation of every CLT-based CI and test downstream. Fourth, the DELTA METHOD as the workhorse for SEs of transformed parameters (log-odds in logistic regression, log-rate in Poisson, ratio of means). Fifth, SLUTSKY's THEOREM as the licence that lets you replace a true SE by a consistent SE estimate in the standardisation $(\hat\theta - \theta) / \widehat{\text{SE}}$ without breaking the Gaussian limit. Sixth, the PRACTICAL QUESTION — when is $n$ "large enough" for the asymptotic Gaussian to bite? — with concrete rules of thumb, the Berry-Esséen quantitative bound, and three explicit failure regimes where the answer is "never, use a different tool". Two widgets thread the section: the FIRST visualises all three convergence modes simultaneously; the SECOND makes the delta method tactile.

The three convergence modes

A sequence of random variables $\theta_1, \theta_2, \ldots$ can "converge to a limit $\theta$ " in three formally distinct senses. The differences matter because every limit theorem you will see — LLN, CLT, asymptotic normality of the MLE — uses ONE of these modes specifically.

Convergence in probability ( $\theta_n \to_p \theta$ ). For every $\varepsilon > 0$ ,

\lim_{n \to \infty} P\bigl(|\theta_n - \theta| > \varepsilon\bigr) \;=\; 0.

Reading: the probability that $\theta_n$ misses $\theta$ by more than $\varepsilon$ shrinks to zero as $n$ grows. This is the mode the WEAK LAW OF LARGE NUMBERS uses: for iid finite-variance $X_1, X_2, \ldots$ with mean $\mu$ , $\bar X_n \to_p \mu$ .

Almost-sure convergence ( $\theta_n \to_{\text{a.s.}} \theta$ ). The event " $\theta_n$ converges to $\theta$ in the usual deterministic sense" has probability 1:

P\bigl(\lim_{n \to \infty} \theta_n = \theta\bigr) \;=\; 1.

Reading: pick a realisation of the entire infinite sequence; almost every realisation is one where $\theta_n$ eventually stays inside any neighbourhood of $\theta$ . This is what the STRONG LAW OF LARGE NUMBERS gives: for iid $X_i$ with finite mean, $\bar X_n \to_{\text{a.s.}} \mu$ .

Convergence in distribution ( $\theta_n \to_d \theta$ ). The CDFs converge: $F_{\theta_n}(x) \to F_\theta(x)$ at every continuity point of $F_\theta$ . Reading: the shape of the $\theta_n$ -distribution stabilises onto the shape of $\theta$ 's distribution, regardless of which realisation you happen to draw. This is the mode the CENTRAL LIMIT THEOREM uses: for iid finite-variance $X_i$ with mean $\mu$ and variance $\sigma^2$ ,

\sqrt n \, (\bar X_n - \mu) \;\to_d\; \mathcal{N}(0, \sigma^2).

The three modes are ordered by strength:

\theta_n \to_{\text{a.s.}} \theta \;\Longrightarrow\; \theta_n \to_p \theta \;\Longrightarrow\; \theta_n \to_d \theta.

None of the reverse arrows hold in general. You can construct sequences that converge in probability but not almost surely (the typewriter sequence; see Wasserman 2004 §5.5 or DasGupta 2008 Ch. 3) and sequences that converge in distribution but not in probability (independent draws all sharing the same limiting distribution). For a single LIMIT that is a CONSTANT, $\to_d c$ and $\to_p c$ ARE equivalent — a useful special case in Slutsky-style proofs below.
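The definition of convergence in probability can be checked numerically. The sketch below is a minimal illustration using only the Python standard library; the population (Exponential with rate 1, so $\mu = 1$ ), the tolerance $\varepsilon = 0.1$ , the replication count, and the function name `miss_probability` are all choices made for this example, not anything fixed by the text.

```python
import random

def miss_probability(n, eps=0.1, reps=500, seed=0):
    """Monte Carlo estimate of P(|mean of n Exp(1) draws - 1| > eps)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - 1.0) > eps:
            misses += 1
    return misses / reps

# The estimated miss probability should fall towards 0 as n grows.
miss_by_n = {n: miss_probability(n) for n in (10, 100, 1000)}
for n, p in miss_by_n.items():
    print(f"n = {n:5d}   estimated P(|mean - 1| > 0.1) = {p:.3f}")
```

Each printed value estimates $P(|\bar X_n - \mu| > \varepsilon)$ at one $n$ ; the weak law says this sequence tends to 0, which is what the decreasing estimates illustrate.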

The first widget makes the three modes simultaneously visible. The LEFT panel plots realisation paths of $\hat\theta_n = \bar X_n$ as $n$ grows from 1 to 5000 for 100 independent simulation runs — that is the in-probability / a.s. picture, with the 5-95% envelope of paths shaded as a band. The RIGHT panels plot the histogram of the standardised error $\sqrt n (\bar X_n - \theta)$ at three fixed sample sizes — that is the in-distribution picture, with the asymptotic Gaussian N(0, σ²) overlay. Reader picks the population: for Normal, Exponential, Uniform, Bernoulli, Lognormal you see the envelope collapse AND the histogram stabilise; for Cauchy you see NEITHER — the envelope refuses to shrink at the 1/√n rate, and the histogram stays heavy-tailed at every $n$ because Cauchy has no finite variance.

Things to verify:

Normal: envelope width at $n = 100$ is about $2 \cdot 1.645 / \sqrt{100} \approx 0.33$ ; the right-panel histogram at $n = 10$ is already indistinguishable from N(0, 1). Convergence is essentially exact at every $n$ .
Exponential: envelope shrinks at the 1/√n rate; the right-panel histogram at $n = 10$ is visibly right-skewed (echoing the population) but by $n = 1000$ it lies on top of N(0, 1).
Lognormal: same $1/\sqrt n$ rate but a larger constant, so visual convergence is slower; even $n = 1000$ still shows a hint of right skew. This is the Berry-Esséen bound $C \cdot \mathbb{E}|X - \mu|^3 / (\sigma^3 \sqrt n)$ in action: large standardised third absolute moment ⇒ large bound at any given $n$ .
Bernoulli(0.3): the right-panel histogram at $n = 10$ is LUMPY because $\sqrt n (\hat p - p)$ takes only 11 values; by $n = 100$ the lumps blur into a smooth bell.
Cauchy: the envelope refuses to shrink. The path lines wander dramatically at every $n$ . The three right-panel histograms have the SAME heavy-tailed shape regardless of $n$ — the CLT machinery has no traction. The widget says so explicitly in the verdict line. The cure for Cauchy is the median (which IS asymptotically Normal, §1.8) or the bootstrap (§1.7).
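The envelope behaviour in the checklist above can be reproduced outside the widget. This is a rough sketch, not the widget's own code: the helper names, the run count, and the seed are assumptions made for illustration, and the Cauchy draws use the inverse-CDF formula $\tan(\pi(U - 1/2))$ .

```python
import math
import random

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def draw_cauchy(rng):
    # Inverse-CDF draw from Cauchy(0, 1).
    return math.tan(math.pi * (rng.random() - 0.5))

def envelope_width(draw, n, runs=500, seed=1):
    """Width of the 5-95% band of the sample mean at size n across independent runs."""
    rng = random.Random(seed)
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(runs))
    lo = means[runs // 20]              # empirical 5% point
    hi = means[runs - 1 - runs // 20]   # empirical 95% point
    return hi - lo

widths = {
    (name, n): envelope_width(draw, n)
    for name, draw in (("normal", draw_normal), ("cauchy", draw_cauchy))
    for n in (100, 1000)
}
for key, width in widths.items():
    print(key, round(width, 3))
```

For the Normal population the width should be near $2 \cdot 1.645 / \sqrt n$ and shrink by about $\sqrt{10}$ between the two sample sizes; for Cauchy the width should stay of the same order at both sizes, since the sample mean is itself Cauchy(0, 1).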

Consistency — and how it differs from unbiasedness

An estimator $\hat\theta_n$ is (weakly) consistent for $\theta$ if $\hat\theta_n \to_p \theta$ ; strongly consistent if $\hat\theta_n \to_{\text{a.s.}} \theta$ . Reading: consistency is the formal version of "throw enough data at it and you eventually nail the right answer". It is a LARGE-SAMPLE property — it says nothing about how the estimator behaves at any specific finite $n$ .

Consistency is a different property from unbiasedness. Here is the textbook example that drives the distinction home. For iid $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ , the MLE of $\sigma^2$ is

\hat\sigma^2_{\text{MLE}} \;=\; \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X_n)^2.

Direct calculation gives $\mathbb{E}[\hat\sigma^2_{\text{MLE}}] = \sigma^2 (n - 1)/n < \sigma^2$ , so the MLE is BIASED downward by a factor $(n - 1)/n$ . The "Bessel-corrected" sample variance $s^2 = (1/(n-1)) \sum (X_i - \bar X)^2$ is unbiased. Yet both are CONSISTENT: as $n \to \infty$ , the bias $-\sigma^2 / n \to 0$ and the variance $2\sigma^4 (n - 1)/n^2 \to 0$ , so the mean squared error goes to zero and Chebyshev's inequality gives $\hat\sigma^2 \to_p \sigma^2$ . The MLE's bias is "harmless asymptotically": consistency only asks that the FINAL answer be right, not that every intermediate finite- $n$ answer be unbiased.

The two properties are LOGICALLY INDEPENDENT in both directions:

An estimator can be UNBIASED but NOT CONSISTENT. Example: $T_n = X_1$ (use the first observation, ignore the rest). It is unbiased for the population mean $\mu$ , but $T_n$ does not converge in probability to $\mu$ : $P(|T_n - \mu| > \varepsilon) = P(|X_1 - \mu| > \varepsilon)$ is the same positive number at every $n$ for a non-degenerate population and small enough $\varepsilon$ . A "single observation" estimator never improves with more data.
An estimator can be CONSISTENT but NOT UNBIASED at every $n$ . The MLE of $\sigma^2$ above is one example. Another is the MLE of $\theta$ for Uniform(0, θ): $\hat\theta = \max X_i$ has bias $-\theta/(n+1)$ but is consistent, and its error is $O_p(1/n)$ , faster than the usual $1/\sqrt n$ rate (see §1.4 for the regularity-failure context).

The practical heuristic: among consistent estimators, prefer the one with smaller asymptotic variance; among unbiased estimators, prefer the one with smaller variance. Consistency is a MINIMUM requirement for any estimator you intend to use with growing-data; unbiasedness is a finite-sample property that may or may not matter depending on whether your $n$ is small enough for the $O(1/n)$ bias term to dominate the $O(1/\sqrt n)$ standard error.
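The biased-but-consistent behaviour of the variance MLE is easy to see in simulation. The following is a minimal sketch assuming $\mathcal{N}(0, 1)$ data (so $\sigma^2 = 1$ ); the function name `summarise_mle_variance`, the sample sizes, and the replication count are illustrative choices.

```python
import math
import random

def summarise_mle_variance(n, reps=2000, seed=2):
    """Mean and SD, across replications, of (1/n) * sum (x_i - xbar)^2 on N(0, 1) samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        estimates.append(sum((x - xbar) ** 2 for x in xs) / n)
    mean = sum(estimates) / reps
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (reps - 1))
    return mean, sd

summary = {n: summarise_mle_variance(n) for n in (5, 20, 100)}
for n, (mean, sd) in summary.items():
    # Theory: mean = (n - 1) / n, SD = sqrt(2 * (n - 1)) / n for sigma^2 = 1.
    print(f"n = {n:3d}  mean = {mean:.3f} (theory {(n - 1) / n:.3f})  sd = {sd:.3f}")
```

The simulated mean should sit below 1 at every $n$ (the bias) while both the gap to 1 and the SD shrink as $n$ grows (the consistency).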

Asymptotic normality — the foundation of CLT-based inference

Consistency tells you the estimator eventually gets the right answer; ASYMPTOTIC NORMALITY tells you the RATE and the SHAPE of the residual error. The standard statement is

\sqrt n \, (\hat\theta_n - \theta) \;\to_d\; \mathcal{N}(0, \sigma^2),

where $\sigma^2$ is the asymptotic variance. The $\sqrt n$ rescaling is what stops the limit from being a degenerate point mass at zero (the unrescaled $\hat\theta_n - \theta$ converges to 0 by consistency; rescaling magnifies the residual error to a non-trivial limit). Equivalently, for large $n$ ,

\hat\theta_n \;\approx\; \mathcal{N}\!\left(\theta,\; \sigma^2 / n\right),

which is the form every Gaussian-based confidence interval and Wald test relies on downstream (Part 2, Part 3). For the SAMPLE MEAN this is the classical CLT: $\sigma^2 = \operatorname{Var}(X_i)$ . For the MLE under regularity, this is the asymptotic-normality-of-the-MLE theorem of §1.4: $\sigma^2 = 1 / I(\theta)$ , the inverse Fisher information — which is also the Cramér-Rao lower bound, so the MLE is asymptotically EFFICIENT (achieves the CRLB asymptotically; see §1.4's crlb-vs-empirical widget for the live picture).

What the statement is NOT: it is NOT a claim that $\hat\theta_n$ is Gaussian for any specific $n$ . The Gaussian limit is approached as $n \to \infty$ ; at finite $n$ the sampling distribution may be visibly skewed, lumpy, or otherwise non-Gaussian. The §1.6 widget made this point empirically; the §1.9 widget above adds the in-probability path view that lets you SEE the residual error stabilising at the $1/\sqrt n$ rate as the histograms stabilise on the Gaussian shape.

The delta method — SEs for transformed parameters

The DELTA METHOD is the linearisation trick that propagates the asymptotic-normality statement through a differentiable transformation $g$ .

Theorem (univariate delta method). Suppose $\sqrt n (\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta$ with $g'(\theta) \neq 0$ . Then

\sqrt n \, \bigl(g(\hat\theta_n) - g(\theta)\bigr) \;\to_d\; \mathcal{N}\!\left(0,\; [g'(\theta)]^2 \, \sigma^2\right).

Equivalently, for large $n$ ,

g(\hat\theta_n) \;\approx\; \mathcal{N}\!\left(g(\theta),\; [g'(\theta)]^2 \, \sigma^2 / n\right).

Proof sketch (Taylor expansion + Slutsky). Expand $g$ around $\theta$ : $g(\hat\theta_n) = g(\theta) + g'(\theta) (\hat\theta_n - \theta) + R_n$ where differentiability at $\theta$ gives $R_n = o_p(|\hat\theta_n - \theta|) = o_p(1/\sqrt n)$ (and $R_n = O_p(1/n)$ if $g$ is twice differentiable near $\theta$ ). Multiply by $\sqrt n$ : $\sqrt n (g(\hat\theta_n) - g(\theta)) = g'(\theta) \cdot \sqrt n (\hat\theta_n - \theta) + \sqrt n R_n$ . The first term tends in distribution to $g'(\theta) \cdot \mathcal{N}(0, \sigma^2) = \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$ . The second term is $o_p(1)$ , that is, it $\to_p 0$ . Slutsky's theorem assembles them.

The delta method is the workhorse for SEs of transformed parameters. Three canonical applications:

Log of a positive estimate. $g(\theta) = \log \theta$ , $g'(\theta) = 1/\theta$ . So $\text{SE}(\log \hat\theta) \approx \text{SE}(\hat\theta) / \hat\theta$ . Used for log-rate parameters in Poisson regression, log-hazard in survival analysis, log of a count of events.
Log-odds (logit). $g(p) = \log(p / (1 - p))$ , $g'(p) = 1 / (p(1 - p))$ . So $\text{SE}(\text{logit}(\hat p)) \approx \text{SE}(\hat p) / (\hat p (1 - \hat p)) = 1 / \sqrt{n \hat p (1 - \hat p)}$ . This matches the coefficient SE of an intercept-only logistic regression, whose MLE is $\text{logit}(\hat p)$ .
Ratio of two means. $g(\mu_1, \mu_2) = \mu_1 / \mu_2$ , requires the multivariate delta method (next paragraph). Used for relative risk, fold-change in genomics, percent difference.

Multivariate delta method. If $\sqrt n (\hat{\boldsymbol\theta}_n - \boldsymbol\theta) \to_d \mathcal{N}_p(\mathbf 0, \Sigma)$ and $g: \mathbb{R}^p \to \mathbb{R}^q$ is differentiable at $\boldsymbol\theta$ with Jacobian $\mathbf{J} = \partial g / \partial \boldsymbol\theta$ of full row rank, then $\sqrt n (g(\hat{\boldsymbol\theta}_n) - g(\boldsymbol\theta)) \to_d \mathcal{N}_q(\mathbf 0, \mathbf{J} \Sigma \mathbf{J}^\top)$ . Same trick, vectorised.
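For the ratio of two means the multivariate formula reduces to a short calculation. The sketch below assumes two INDEPENDENT samples of equal size $n$ , so that $\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2)$ ; the function names and the numerical values ( $\mu_1 = 4$ , $\mu_2 = 5$ , unit variances, $n = 50$ ) are hypothetical and chosen only to keep the denominator well away from zero.

```python
import math
import random

def delta_se_ratio(mu1, mu2, var1, var2, n):
    """Delta-method SE of xbar1 / xbar2 for two independent samples of size n."""
    grad = (1.0 / mu2, -mu1 / mu2 ** 2)   # Jacobian of g(m1, m2) = m1 / m2
    # J Sigma J^T with Sigma = diag(var1, var2), valid under independence.
    asym_var = grad[0] ** 2 * var1 + grad[1] ** 2 * var2
    return math.sqrt(asym_var / n)

def simulated_sd_ratio(mu1, mu2, n, reps=2000, seed=3):
    """Empirical SD of xbar1 / xbar2 on unit-variance Normal samples."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        m1 = sum(rng.gauss(mu1, 1.0) for _ in range(n)) / n
        m2 = sum(rng.gauss(mu2, 1.0) for _ in range(n)) / n
        ratios.append(m1 / m2)
    centre = sum(ratios) / reps
    return math.sqrt(sum((r - centre) ** 2 for r in ratios) / (reps - 1))

delta_se = delta_se_ratio(4.0, 5.0, 1.0, 1.0, n=50)
empirical_sd = simulated_sd_ratio(4.0, 5.0, n=50)
print(f"delta-method SE = {delta_se:.4f}, simulated SD = {empirical_sd:.4f}")
```

With correlated samples the off-diagonal terms of $\Sigma$ would enter $\mathbf{J} \Sigma \mathbf{J}^\top$ as well; the independence assumption here is what lets the two gradient terms be added separately.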

The second widget makes the delta method tactile. Reader picks a transformation $g$ from a menu (identity, log, √, 1/θ, sigmoid, atan, exp, θ²) and a base estimator (sample mean of Normal, Exponential, or Bernoulli proportion). The widget simulates $R = 2000$ samples, computes $\hat\theta$ and $g(\hat\theta)$ on each, and overlays the delta-method Gaussian $\mathcal{N}(g(\theta), [g'(\theta)]^2 \sigma^2/n)$ on the right-panel histogram. When the orange curve hugs the histogram, the delta method works. When it doesn't, you have found a regime where the linearisation breaks — typically because $g'(\theta) \approx 0$ (e.g. $g(\theta) = \theta^2$ at $\theta = 0$ ) or because $g$ has too much curvature at the relevant scale.

Things to verify:

Identity: $g(\theta) = \theta$ . The two panels are visually identical; the empirical SD ratio is 1.00 ± 0.02. Sanity check.
Log of Exp mean at θ = 1: $g'(1) = 1$ , so the delta SD equals the direct SE. Empirical ratio ≈ 1. The right-panel histogram is centred at $\log 1 = 0$ instead of at 1 and otherwise has the same spread as the left.
Sigmoid at the Normal(2, 1) mean: $g'(2) = \sigma(2)(1 - \sigma(2)) \approx 0.105$ . Delta-method SD shrinks by that factor; empirical SD matches within 5%.
Reciprocal of Bernoulli(0.3) proportion: at $n = 10$ the empirical histogram of $1 / \hat p$ is wildly right-skewed because $\hat p$ can be very small (it is exactly 0 with probability $0.7^{10} \approx 0.028$ , where $1/\hat p$ is undefined). The delta method underestimates the spread. Increase $n$ to 200 and the delta-method Gaussian fits better.
Square of Normal(2, 1) mean (keep θ = 2 in the Normal base, switch to θ²): at $\theta = 2$ , $g'(2) = 4 \neq 0$ and the delta method works. The widget's bases all have nonzero $\theta$ , so the degenerate case cannot be reached exactly. The CONCEPTUAL point is that at $\theta = 0$ the derivative $g'(\theta) = 2\theta$ vanishes and the first-order delta method degenerates: for a unit-variance Normal base, $n \hat\theta^2$ is exactly $\chi^2_1$ , not Gaussian. The verdict line describes this when you choose a base whose $\theta$ is near zero or pick a transformation with vanishing derivative.
exp of Normal(2, 1) mean: $g'(2) = e^2 \approx 7.39$ . Delta SD blows up. At $n = 30$ the empirical histogram is visibly right-skewed (lognormal-like) — the delta-method Gaussian is the right SCALE but the wrong SHAPE. Increase $n$ and the histogram concentrates and Gaussianises.
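The same empirical-versus-delta comparison can be run without the widget. This is a minimal sketch, assuming the base estimator is the mean of $n$ iid Exponential(rate 1) draws (so $\theta = 1$ and $\sigma = 1$ ); the function name `sd_ratio`, the two transformations, and the sample sizes are choices made for this example.

```python
import math
import random

def sd_ratio(g, g_prime, n, reps=1000, seed=4):
    """Empirical SD of g(xbar) divided by the delta-method SD |g'(1)| / sqrt(n).

    xbar is the mean of n iid Exp(1) draws, so theta = 1 and sigma = 1.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        values.append(g(xbar))
    centre = sum(values) / reps
    empirical_sd = math.sqrt(sum((v - centre) ** 2 for v in values) / (reps - 1))
    delta_sd = abs(g_prime(1.0)) / math.sqrt(n)
    return empirical_sd / delta_sd

ratios = {
    (name, n): sd_ratio(g, dg, n)
    for name, g, dg in (
        ("log", math.log, lambda t: 1.0 / t),
        ("reciprocal", lambda t: 1.0 / t, lambda t: -1.0 / t ** 2),
    )
    for n in (30, 300)
}
for key, ratio in ratios.items():
    print(key, round(ratio, 3))
```

A ratio near 1 means the linearisation describes the spread well. The ratios carry Monte Carlo noise of a few percent at this replication count, so small departures from 1 should not be over-read.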

Slutsky's theorem — the licence to swap in a consistent SE estimate

SLUTSKY's THEOREM is the technical result that makes the practical standardisation $(\hat\theta - \theta) / \widehat{\text{SE}}$ work. The statement:

If $X_n \to_d X$ and $Y_n \to_p c$ for some CONSTANT $c$ , then

X_n + Y_n \;\to_d\; X + c, \qquad X_n \cdot Y_n \;\to_d\; c \cdot X, \qquad X_n / Y_n \;\to_d\; X / c \;\;\text{(if } c \neq 0\text{).}

Slutsky says that in-distribution + in-probability-to-a-constant combine cleanly. Where it matters: a typical Wald CI is built from

\frac{\hat\theta - \theta}{\widehat{\text{SE}}(\hat\theta)} \;=\; \frac{\sqrt n (\hat\theta - \theta)}{\sqrt n \, \widehat{\text{SE}}(\hat\theta)}.

The numerator $\sqrt n (\hat\theta - \theta) \to_d \mathcal{N}(0, \sigma^2)$ by asymptotic normality. The denominator $\sqrt n \, \widehat{\text{SE}}(\hat\theta)$ is a consistent estimate of $\sigma$ (a finite positive constant). By Slutsky, the ratio $\to_d \mathcal{N}(0, \sigma^2) / \sigma = \mathcal{N}(0, 1)$ . The standardised statistic is asymptotically standard Normal, so the Wald $95\%$ CI $\hat\theta \pm 1.96 \cdot \widehat{\text{SE}}$ has asymptotic coverage $95\%$ .

Without Slutsky you would be stuck: you have a Gaussian limit when you divide by the TRUE $\sigma / \sqrt n$ , but in practice you never know $\sigma$ and have to estimate it. Slutsky says replacing $\sigma$ by a CONSISTENT estimate $\hat\sigma$ does not break the Gaussian limit. The same machinery underlies the $t$ -statistic, the score statistic, and every "studentised" quantity in classical inference.

A useful one-line warning: Slutsky requires $Y_n \to_p c$ to a CONSTANT. If $Y_n$ converges to a non-degenerate random variable, the conclusion fails. That is why an SE estimate must be shown to be consistent before it is plugged in: for infinite-variance data, for example, the bootstrap distribution of the sample mean has a random limit rather than a fixed one, so a bootstrap SE there does not satisfy Slutsky's condition without extra justification.
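Slutsky's practical content, that swapping the true $\sigma$ for a consistent estimate leaves the Gaussian limit intact, can be illustrated by simulation. The sketch below assumes Exponential(rate 1) data with $\mu = \sigma = 1$ and $n = 200$ ; the function name and all numerical settings are illustrative. Both calls use the same seed, so they standardise exactly the same samples in two different ways.

```python
import math
import random

def coverage_fraction(n, use_true_sigma, reps=2000, seed=5):
    """Fraction of Exp(1) samples whose standardised mean error lies in [-1.96, 1.96]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(n)]
        xbar = sum(xs) / n
        if use_true_sigma:
            sigma = 1.0
        else:
            # Sample standard deviation: a consistent estimate of sigma.
            sigma = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        z = math.sqrt(n) * (xbar - 1.0) / sigma
        if abs(z) <= 1.96:
            inside += 1
    return inside / reps

with_true_sigma = coverage_fraction(200, use_true_sigma=True)
with_estimated_sigma = coverage_fraction(200, use_true_sigma=False)
print(f"true sigma: {with_true_sigma:.3f}   estimated sigma: {with_estimated_sigma:.3f}")
```

Both fractions should land close to the nominal 0.95. Any remaining gap between them is a finite-sample effect that Slutsky says disappears as $n \to \infty$ .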

"Large enough" — when does the asymptotic Gaussian actually bite?

All of the above is asymptotic. In practice you have a finite $n$ and need to decide whether the asymptotic Gaussian approximation is trustworthy. The honest answer is "it depends" — on the estimator, the population, and what you want to do with the resulting CI or test. But several quantitative rules have been distilled in the textbook literature (Wasserman 2004 §5, DasGupta 2008 Ch. 3, Lehmann 1999):

Sample mean of finite-variance populations. The classical CLT applies. As a rule of thumb:

SYMMETRIC populations (Normal, Uniform, symmetric mixtures): $n \ge 30$ is generally fine. The Berry-Esséen multiplier $\mathbb{E}|X - \mu|^3 / \sigma^3$ is small here (about 1.6 for the Normal and 1.3 for the Uniform), so the bound below shrinks quickly.
MODERATELY SKEWED populations (Exponential, Chi-squared with small df, lognormal-like): $n \ge 100$ for the central body of the sampling distribution. The tails may need more.
HEAVILY SKEWED or heavy-tailed but finite-variance populations (Lognormal(0, 1), t with df = 5): $n \ge 1000$ for tight Gaussian approximation. The standardised third absolute moment is large, so the Berry-Esséen bound shrinks slowly.
BIMODAL or otherwise structured populations: depends on the mode-separation. Rule of thumb is irrelevant; check the sampling distribution directly.

Sample variance. Needs much larger $n$ than the sample mean. The asymptotic variance of $s^2$ depends on the FOURTH central moment of the population, not just the second. For Gaussian populations $\text{Var}(s^2) = 2\sigma^4 / (n - 1)$ ; for heavy-tailed populations the constant is much larger. Practical rule: even $n = 200$ on a heavy-tailed population may not be enough for $s^2$ 's asymptotic Gaussian to be useful.

Correlation. Sample correlation $r$ is asymptotically Gaussian by the delta method (applied to bivariate sample moments), but the convergence rate depends on the true $\rho$ and the bivariate distribution. Fisher's $z$ -transformation $\tanh^{-1}(r)$ — itself a delta-method application — is approximately Gaussian at much smaller $n$ than $r$ itself.

Sample max (and extreme order statistics). NEVER asymptotically Gaussian. The maximum of $n$ iid draws, suitably normalised, converges to one of three extreme-value distributions (Gumbel, Fréchet, Weibull) depending on the tail behaviour of the population. For Uniform(0, 1), $n(1 - \max X_i) \to_d \operatorname{Exponential}(1)$ ; for Exponential, the max converges to a Gumbel after subtracting $\log n$ . The asymptotic Gaussian CI is the WRONG tool here; extreme-value theory (Coles 2001) is the right one.
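The Uniform(0, 1) claim can be verified exactly, without simulation, because $P(\max X_i \le m) = m^n$ gives $P(n(1 - \max X_i) \le x) = 1 - (1 - x/n)^n$ for $0 \le x \le n$ . The sketch below compares that exact CDF with the Exponential(1) limit on a grid; the function names and the grid are choices made for this illustration.

```python
import math

def exact_cdf(x, n):
    """P(n * (1 - max of n Uniform(0, 1) draws) <= x) = 1 - (1 - x/n)^n, clipped for x > n."""
    if x <= 0:
        return 0.0
    return 1.0 - max(0.0, 1.0 - x / n) ** n

def limit_cdf(x):
    """Exponential(1) CDF."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

grid = [k / 100 for k in range(0, 1001)]   # x from 0 to 10 in steps of 0.01
max_gap = {
    n: max(abs(exact_cdf(x, n) - limit_cdf(x)) for x in grid)
    for n in (10, 100, 1000)
}
for n, gap in max_gap.items():
    print(f"n = {n:5d}   largest CDF gap on the grid = {gap:.5f}")
```

The gap shrinks roughly in proportion to $1/n$ , and the limit it shrinks towards is Exponential, not Gaussian.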

Berry-Esséen — quantifying the CLT rate

The CLT says the standardised sample-mean distribution stabilises onto the standard Normal. The BERRY-ESSÉEN bound quantifies the rate:

\sup_x \left| F_{Z_n}(x) - \Phi(x) \right| \;\le\; \frac{C \, \mathbb{E}|X - \mu|^3}{\sigma^3 \, \sqrt n},

where $Z_n = \sqrt n (\bar X_n - \mu) / \sigma$ , $\Phi$ is the standard-Normal CDF, and $C$ is an absolute constant. Berry (1941) proved the bound holds with some $C$ ; Esseen (1942) gave a sharper constant. The best known upper bound for iid summands is $C \le 0.4748$ (Shevtsova 2011, building on Esseen's technique). The exact constant matters less than the SCALING: the sup-distance from the standard Normal decays at the $1/\sqrt n$ rate, with multiplier proportional to the SKEWNESS-like quantity $\mathbb{E}|X - \mu|^3 / \sigma^3$ .

Reading the bound: a HIGHLY SKEWED population (large $\mathbb{E}|X - \mu|^3 / \sigma^3$ ) needs a larger sample to guarantee the same sup-distance to Normal as a symmetric population; because the bound scales as that ratio divided by $\sqrt n$ , doubling the ratio quadruples the required $n$ . This is exactly the empirical phenomenon you saw in the convergence-modes widget, where Lognormal converges visibly slower than Normal at the same $n$ . Berry-Esséen is the theorem that says "yes, this is real, and here is the rate".

The bound is on the SUP-DISTANCE (Kolmogorov-Smirnov distance) between the standardised CDF and $\Phi$ . It is GLOBAL — it controls the whole distribution. Tighter bounds exist for specific regions: for the central body of the distribution the rate is $1/\sqrt n$ ; for the TAILS (large deviations) the rate is exponential or slower, depending on the tail of $X$ . See Petrov (1975, Ch. V) for the technical refinements.
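To see how conservative the bound can be, it helps to put it next to the actual sup-distance in a case where the exact distribution is available. The sketch below assumes Exponential(rate 1) data, for which the sum of $n$ draws is Gamma( $n$ , 1) with the closed-form Erlang CDF, and for which direct integration gives $\mathbb{E}|X - 1|^3 = 12/e - 2 \approx 2.415$ . The function names and the $z$ grid are illustrative, and the sup is only approximated on that grid.

```python
import math
from statistics import NormalDist

C_BE = 0.4748                     # constant quoted in the text
RHO_EXP1 = 12.0 / math.e - 2.0    # E|X - 1|^3 for Exp(1); sigma = 1

def berry_esseen_bound(n):
    return C_BE * RHO_EXP1 / math.sqrt(n)

def standardised_mean_cdf(z, n):
    """Exact CDF of Z_n = sqrt(n) * (xbar - 1) for Exp(1) data, via the Erlang CDF."""
    s = n + z * math.sqrt(n)      # the event Z_n <= z is the event sum <= s
    if s <= 0:
        return 0.0
    log_s = math.log(s)
    # P(Gamma(n, 1) <= s) = 1 - sum_{k < n} exp(-s) * s^k / k!
    tail = sum(math.exp(-s + k * log_s - math.lgamma(k + 1)) for k in range(n))
    return 1.0 - tail

def actual_sup_gap(n):
    """Largest |F_{Z_n}(z) - Phi(z)| over a grid of z values in [-4, 4]."""
    phi = NormalDist()
    zs = [k / 100 for k in range(-400, 401)]
    return max(abs(standardised_mean_cdf(z, n) - phi.cdf(z)) for z in zs)

gaps = {n: (actual_sup_gap(n), berry_esseen_bound(n)) for n in (25, 100, 400)}
for n, (gap, bound) in gaps.items():
    print(f"n = {n:3d}   actual gap = {gap:.4f}   Berry-Esseen bound = {bound:.4f}")
```

The actual gap sits well below the bound at every $n$ , which is expected: the bound is a worst case over all distributions with the same standardised third absolute moment. Both columns shrink at the $1/\sqrt n$ rate.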

When the CLT fails — three regimes and their fixes

The CLT is not universal. Three classes of failure are worth naming explicitly:

1. Infinite variance. Cauchy is the textbook example. The population mean does not exist and the variance is infinite, so the $\sqrt n$ CLT rescaling does not give a non-trivial limit. The sample mean of $n$ iid Cauchy(0, 1) is STILL Cauchy(0, 1) at every $n$ , as the convergence-modes widget shows empirically. The cure: don't use the sample mean. The MEDIAN of iid Cauchy(0, 1) IS asymptotically Normal: the Cauchy density at zero is $1/\pi$ , so $\sqrt n \, (\hat m - 0) \to_d \mathcal{N}(0, \pi^2 / 4)$ (see §1.8). For heavier tails than Cauchy the bootstrap (§1.7) and robust estimators (§1.8) are the working tools.

2. Finite variance but borderline $n$ with severe skew. Lognormal at small $n$ . The CLT applies in the limit but at $n = 30$ the sampling distribution is still visibly right-skewed and the Gaussian CI undercovers on the right tail. The cure: bootstrap CIs (§1.7) make no Gaussian assumption and respect the skew; $t$ -critical-values give a small but nonzero correction for the unknown-variance case; profile-likelihood CIs (Part 3 §3.3) for parametric models can be much more accurate at moderate $n$ than Wald.

3. Extreme order statistics. Sample max, min, range, and quantile estimates at extreme $p$ (e.g. $p = 1/n$ ) are NEVER asymptotically Gaussian. The relevant limit theory is extreme-value theory (Coles 2001, Embrechts et al. 1997). The Gumbel/Fréchet/Weibull distributions are the three possible limits for the maximum of iid samples after suitable normalisation. For seismic or insurance applications where the maximum-event size matters, this is the right toolkit; the asymptotic-Gaussian CI is wrong by construction.

The honest meta-rule: the asymptotic Gaussian approximation is a tool, not a religion. When it works (most estimators, moderate $n$ , finite variance, mild skew) it is the easiest and most natural choice. When it doesn't, use the right alternative — robust estimator (§1.8), bootstrap CI (§1.7), profile-likelihood CI (Part 3), or extreme-value theory (out of scope here). Knowing which regime you are in is the practical content of "asymptotic theory" in applied work.
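The fix for the first regime can be checked directly: the sample median of Cauchy(0, 1) data should have $\sqrt n$ -scaled spread close to $\pi/2 \approx 1.571$ , the square root of the asymptotic variance $\pi^2/4$ . This is a minimal sketch; the function name, the odd sample size $n = 201$ , and the replication count are assumptions made for the example.

```python
import math
import random
import statistics

def scaled_median_sd(n, reps=1000, seed=6):
    """SD across replications of sqrt(n) * median of n iid Cauchy(0, 1) draws."""
    rng = random.Random(seed)
    scaled = []
    for _ in range(reps):
        # Inverse-CDF draws from Cauchy(0, 1).
        xs = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        scaled.append(math.sqrt(n) * statistics.median(xs))
    return statistics.stdev(scaled)

median_sd = scaled_median_sd(201)
print(f"simulated SD of sqrt(n) * median = {median_sd:.3f}, asymptotic value = {math.pi / 2:.3f}")
```

Running the same experiment with the sample mean in place of the median would not give a stable number at all, since the mean of Cauchy draws has no finite variance at any $n$ .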

Diagnostics for "is n large enough"

How do you know whether your $n$ is large enough for the asymptotic Gaussian to bite? Three diagnostics, in increasing order of effort:

Q-Q plot the bootstrap distribution. Bootstrap $B = 1000$ resamples of your data, compute $\hat\theta$ on each, and Q-Q-plot the $B$ values against a standard Normal. A straight line ⇒ Gaussian; visible curvature in the tails ⇒ skewness or kurtosis the Wald CI will get wrong. Cheap, model-free, exactly what §1.7 was set up to provide.
Compare Wald and bootstrap CIs side-by-side. If they agree (within 10% on width), the Wald CI is fine. If they diverge — especially asymmetrically (Wald symmetric, bootstrap visibly asymmetric) — trust the bootstrap and report it instead.
Simulate at your $n$ . If you have a working parametric model (e.g. from §1.3 MLE), draw simulated samples of size $n$ from the fitted model, compute the SE-based CI on each, and check empirical coverage. If a "95%" CI covers in 92% of simulations, you have a 3-percentage-point undercoverage problem: the finite-sample error of the Gaussian approximation is biting and you should switch to bootstrap or profile-likelihood. This is the most expensive diagnostic but the most honest.

The convergence-modes widget is essentially a live version of diagnostic 3 with a Q-Q-flavoured shape comparison built in. Use it as a baseline for your intuition about which combinations of (population, estimator, $n$ ) need bootstrap-style backup and which are safe with the asymptotic Gaussian.
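The third diagnostic, simulating coverage at your own $n$ , can be sketched in a few lines. The example below assumes a Lognormal(0, 1) population, whose true mean is $e^{1/2}$ , and a nominal 95% Wald interval $\bar X \pm 1.96 \, s / \sqrt n$ ; the function name, the two sample sizes, and the replication count are illustrative choices, and in real use the draws would come from your own fitted model.

```python
import math
import random

TRUE_MEAN = math.exp(0.5)   # mean of Lognormal(0, 1)

def wald_coverage(n, reps=1500, seed=7):
    """Fraction of nominal 95% Wald intervals for the mean that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        half_width = 1.96 * s / math.sqrt(n)
        if xbar - half_width <= TRUE_MEAN <= xbar + half_width:
            hits += 1
    return hits / reps

coverage = {n: wald_coverage(n) for n in (30, 300)}
for n, frac in coverage.items():
    print(f"n = {n:3d}   empirical coverage of the nominal 95% Wald CI = {frac:.3f}")
```

For a population this skewed the empirical coverage at $n = 30$ typically comes out noticeably below 0.95 and moves closer to nominal at $n = 300$ . With 1500 replications the Monte Carlo standard error of each coverage estimate is a little under one percentage point, so differences smaller than that should not be over-read.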

Where §1.9 fits in the textbook

§1.9 closes Part 1 by formalising the limit theory every earlier section appealed to. Concretely:

§1.1 (estimator properties). Consistency and asymptotic normality are the two large-sample properties of an estimator. Unbiasedness and finite-sample variance are the small-sample counterparts.
§1.3 (MLE). The MLE's standard large-sample story — consistency under regularity, $\sqrt n (\hat\theta_{\text{MLE}} - \theta) \to_d \mathcal{N}(0, 1/I(\theta))$ — uses every machine in §1.9.
§1.4 (CRLB). The CRLB is the lower bound; asymptotic normality at $\sigma^2 = 1/I(\theta)$ says the MLE achieves it asymptotically. §1.4's crlb-vs-empirical widget visualises that achievement.
§1.6 (sampling distributions). The sampling distribution IS the object that converges. §1.6 looked at it empirically; §1.9 names the limit and the rate.
§1.7 (bootstrap). When the asymptotic Gaussian approximation is in doubt — heavy tails, small $n$ , skewed estimator — the bootstrap is the universal fallback. §1.9 quantifies when "in doubt" applies.
§1.8 (robust / M-estimators). Robust M-estimators are asymptotically Normal with variance $\sigma^2 \, \mathbb{E}[\psi^2] / (\mathbb{E}[\psi'])^2$ , the M-estimator analogue of the CLT, derivable using exactly the asymptotic-normality + Slutsky machinery developed here.

Looking ahead: Part 2 will use these results to build hypothesis-testing machinery (Wald, score, LR tests). Part 3 §3.1 builds asymptotic CIs using exactly $\hat\theta \pm z_{\alpha/2} \cdot \widehat{\text{SE}}$ ; §3.3 builds profile-likelihood CIs as the more accurate alternative at moderate $n$ . Part 4 (linear regression) leans on the multivariate delta method for SEs of contrasts and transformed coefficients. Part 6 (GLMs) leans on the delta method for SEs on the response scale (e.g. predicted probabilities from a logistic regression). Part 8 (resampling) ties the asymptotic and bootstrap theories together via the bootstrap-of-asymptotic-Normal-statistic results (Singh 1981, Beran 1988).

Try it

In the convergence-modes widget, select Normal(0, 1). At $n = 1000$ what is the empirical envelope width? Compute by hand the theoretical Gaussian envelope width $2 \cdot 1.645 \cdot \sigma / \sqrt n = 2 \cdot 1.645 / \sqrt{1000} \approx 0.104$ and confirm the widget reports a number close to that. Switch to Lognormal; the envelope width should be larger by $\sqrt{(e - 1) e} \approx 2.16$ — confirm.
Switch to Cauchy. Read the verdict line. Note that the 5–95% envelope does NOT shrink at the 1/√n rate as $n$ grows along the x-axis — the envelope width stays roughly constant. The right-panel histograms at $n = 10, 100, 1000$ all have the same heavy-tailed Cauchy shape. The CLT fails. The widget's verdict line recommends the median (§1.8) or bootstrap (§1.7) — connect that to the sample-mean-vs-median story in §1.8.
Pen-and-paper: explain why the SAMPLE MAX of iid Uniform(0, 1) is NOT asymptotically Gaussian. (Hint: compute $P(n(1 - \max X_i) \le x)$ directly. The limit is $1 - e^{-x}$ — Exponential(1), not Gaussian. The convergence is at rate $1/n$ not $1/\sqrt n$ . This is the canonical extreme-value example.)
In the delta-method-demo widget, pick "X̄ of Exp(rate = 1)" as the base. Choose $g(\theta) = \log \theta$ . At $n = 30$ , read the empirical SD of $g(\hat\theta)$ and compare to the theoretical delta-method SD $|g'(1)| \cdot 1 / \sqrt{30} = 1 / \sqrt{30} \approx 0.183$ . The ratio should be close to 1.
Same base, switch to $g(\theta) = 1/\theta$ . At $n = 30$ the empirical SD of $1/\hat\theta$ exceeds the delta-method value $|{-}1/1^2| \cdot 1 / \sqrt{30} = 1 / \sqrt{30} \approx 0.183$ by roughly 7% (the exact SD is $\approx 0.196$ , because $1/\hat\theta$ is inverse-Gamma distributed here). Increase $n$ to 500 and the ratio tightens to ≈ 1. The delta method works asymptotically; at borderline $n$ the linearisation residual is visible.
Pick "p̂ of Bernoulli(0.3)" + sigmoid $g$ . Here the base $\hat p$ is already a probability and $g$ is composed on top of it, so the relevant derivative is $g'(0.3) = g(0.3)(1 - g(0.3)) \approx 0.574 \cdot 0.426 \approx 0.244$ ; the factor $0.3 \cdot 0.7 = 0.21$ would apply only if the base estimator were on the logit scale, with $g$ mapping it back to $p = 0.3$ . The widget reports the empirical and delta numbers; reconcile them. This is the same algebra that underlies the standard error of the predicted probability from a logistic regression on the response scale.
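Pen-and-paper: the SLUTSKY theorem says $X_n \to_d X$ AND $Y_n \to_p c$ ⇒ $X_n + Y_n \to_d X + c$ . Show by an explicit counterexample that this FAILS if $Y_n \to_d Y$ instead, where $Y$ is a non-degenerate random variable. (Hint: let $X_n = Z$ and $Y_n = -Z$ for a single standard Normal $Z$ . Both sequences converge in distribution to $\mathcal{N}(0, 1)$ , yet $X_n + Y_n = 0$ for every $n$ , which is definitely not the $\mathcal{N}(0, 2)$ law of a sum of two independent standard Normals. Marginal convergence in distribution does not determine the joint behaviour.)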
Berry-Esséen: for iid $X_i \sim \operatorname{Exp}(1)$ , $\sigma^2 = 1$ and $\mathbb{E}|X - 1|^3 = 12/e - 2 \approx 2.415$ . The bound says $\sup_x |F_{Z_n}(x) - \Phi(x)| \le 0.4748 \cdot 2.415 / \sqrt n$ . At $n = 100$ this is $\approx 0.115$ : a worst-case CDF gap of about 11 percentage points. Confirm in the convergence-modes widget that at $n = 100$ the Exponential histogram still shows visible mismatch with the Gaussian overlay in the tails, then check that at $n = 1000$ the bound drops to $\approx 0.036$ .
Pen-and-paper: an estimator can be BIASED at every finite $n$ yet CONSISTENT. Show how this is possible using the MLE for Normal variance $\hat\sigma^2 = (1/n) \sum (X_i - \bar X)^2$ . Compute $\operatorname{Bias}(\hat\sigma^2) = -\sigma^2/n$ explicitly and show that both the bias and the variance, $2(n - 1)\sigma^4/n^2$ , go to 0 as $n \to \infty$ , so the mean squared error vanishes and consistency follows from Chebyshev's inequality.
Construct an estimator that is UNBIASED at every $n$ but NOT CONSISTENT. (Hint: $T_n = X_1$ , the first observation. $\mathbb{E}[T_n] = \mu$ for every $n$ . But $T_n$ does not converge to $\mu$ : it sits at $X_1$ forever, so $P(|T_n - \mu| > \varepsilon) = P(|X_1 - \mu| > \varepsilon)$ does not shrink as $n$ grows.) Conclude that "unbiased" and "consistent" are logically independent properties; either can hold without the other.
Open question (not in the widgets): can the asymptotic Gaussian approximation be made finite-sample exact for a specific estimator? Answer in the affirmative: the sample mean of EXACTLY Normal iid data is exactly Normal at every $n$ . This is one of the very few cases where asymptotic theory coincides with finite-sample exact theory.
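
The sample-max exercise above can be checked numerically. The following is a minimal sketch, assuming iid Uniform(0, 1) draws: for $0 \le x \le n$ the exact finite-$n$ CDF is $P(n(1 - \max X_i) \le x) = 1 - (1 - x/n)^n$ , and the code compares it with the Exponential(1) limit $1 - e^{-x}$ . The function names, grid, and seed are illustrative choices, not part of the widgets.

```python
import math
import random


def exact_cdf(x, n):
    """Exact P(n * (1 - max X_i) <= x) for n iid Uniform(0, 1) draws, 0 <= x <= n."""
    # The event fails only when all n draws fall below 1 - x/n.
    return 1.0 - (1.0 - x / n) ** n


def limit_cdf(x):
    """Exponential(1) CDF, the claimed limit."""
    return 1.0 - math.exp(-x)


def max_gap(n, grid_step=0.01, x_max=8.0):
    """Largest |exact - limit| over a grid on [0, x_max]; requires n >= x_max."""
    steps = round(x_max / grid_step)
    return max(
        abs(exact_cdf(k * grid_step, n) - limit_cdf(k * grid_step))
        for k in range(steps + 1)
    )


def simulate_scaled_max(n, reps, seed=0):
    """Monte Carlo draws of n * (1 - max X_i)."""
    rng = random.Random(seed)
    return [n * (1.0 - max(rng.random() for _ in range(n))) for _ in range(reps)]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # n * gap staying roughly constant is the 1/n rate from the hint.
        print(f"n={n:5d}  max CDF gap={max_gap(n):.5f}  n*gap={n * max_gap(n):.3f}")
    draws = simulate_scaled_max(n=200, reps=2000)
    frac = sum(d <= 1.0 for d in draws) / len(draws)
    print(f"simulated P(n(1 - max) <= 1) = {frac:.3f}, limit = {limit_cdf(1.0):.3f}")
```

The gap between the exact and limiting CDFs shrinks roughly tenfold for each tenfold increase in $n$ , which is the $1/n$ rate rather than the $1/\sqrt n$ rate of the CLT.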

Pause and reflect: §1.9 has been a section about LIMITS. Yet every estimator you will ever compute is at a SPECIFIC finite $n$ . What is the practical contract that asymptotic theory makes with finite-sample analysis? When $n$ is "large enough" the asymptotic Gaussian gives a tight, transparent CI machinery (Wald). When $n$ is borderline, the asymptotic Gaussian undercovers — but it tells you BY HOW MUCH (Berry-Esséen) and points you toward the alternative (bootstrap, profile likelihood, robust). When $n$ is in the failure regime (heavy tails, max), asymptotic Gaussian theory has the honesty to say "not me — use this other tool". The contract is not "trust the limit"; it is "the limit + its rate + its failure modes are themselves a usable toolkit".
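
The undercoverage at borderline $n$ described above can be seen in a small simulation. This is a minimal sketch, assuming Exp(1) data and the nominal 95% Wald interval $\bar X \pm 1.96 \, s / \sqrt n$ ; the sample sizes, replication count, seed, and function names are illustrative assumptions.

```python
import math
import random


def wald_covers(sample, true_mean, z=1.96):
    """True if the Wald interval mean +/- z * SE contains true_mean."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # unbiased sample variance
    se = math.sqrt(var / n)
    return abs(mean - true_mean) <= z * se


def coverage(n, reps=3000, seed=1):
    """Monte Carlo coverage of the nominal 95% Wald interval for the mean of Exp(1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        hits += wald_covers(sample, true_mean=1.0)
    return hits / reps


if __name__ == "__main__":
    for n in (10, 30, 200):
        print(f"n={n:4d}  coverage={coverage(n):.3f}  (nominal 0.950)")
```

Running it for a skewed population at small $n$ and again at larger $n$ shows the coverage moving toward the nominal level, which is the "large enough" contract in operational form.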

What you now know

Three convergence modes (in-probability, almost-sure, in-distribution) with the strict implication chain a.s. ⇒ p ⇒ d and the warning that none of the reverse implications holds in general. Consistency is the formal statement $\hat\theta_n \to_p \theta$ ; it is a different property from unbiasedness, and biased estimators (the MLE for Gaussian variance) can be consistent. Asymptotic normality $\sqrt n (\hat\theta_n - \theta) \to_d \mathcal{N}(0, \sigma^2)$ is the rate-and-shape statement that powers every CLT-based CI and test; under MLE regularity $\sigma^2 = 1/I(\theta)$ .
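
The distinction between unbiasedness and consistency can be made concrete with a simulation. This is a minimal sketch, assuming Normal data with $\mu = 0$ and $\sigma = 2$ (arbitrary illustrative values): it tracks the biased-but-consistent variance MLE alongside the unbiased-but-inconsistent estimator $T_n = X_1$ . All function names are hypothetical.

```python
import random
import statistics


def mle_variance(sample):
    """(1/n) * sum (x_i - xbar)^2: biased at finite n, consistent."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample) / len(sample)


def first_observation(sample):
    """T_n = X_1: unbiased for the mean at every n, not consistent."""
    return sample[0]


def replicate(estimator, n, reps=2000, mu=0.0, sigma=2.0, seed=0):
    """Values of `estimator` over `reps` independent Normal(mu, sigma^2) samples of size n."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(reps)]


if __name__ == "__main__":
    sigma2 = 4.0
    for n in (5, 50, 500):
        var_hat = replicate(mle_variance, n)
        x1 = replicate(first_observation, n)
        print(
            f"n={n:4d}  mean(var_hat)={statistics.fmean(var_hat):.3f} "
            f"(theory {(n - 1) / n * sigma2:.3f})  "
            f"sd(var_hat)={statistics.pstdev(var_hat):.3f}  "
            f"sd(X_1)={statistics.pstdev(x1):.3f}"
        )
```

The average of the variance MLE sits below $\sigma^2$ by roughly $\sigma^2 / n$ while its spread shrinks with $n$ ; the spread of $X_1$ stays near $\sigma$ no matter how large $n$ becomes.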

The delta method propagates asymptotic normality through a smooth transformation: $\sqrt n (g(\hat\theta) - g(\theta)) \to_d \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$ . The workhorse for SEs of log-rates, log-odds, ratios. Slutsky's theorem licenses the swap of a true SE for a consistent SE estimate inside a standardised statistic, so the Wald CI $\hat\theta \pm 1.96 \, \widehat{\text{SE}}$ has the right asymptotic coverage even when $\sigma$ itself is estimated.
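
The delta-method formula can be compared against simulation directly. This is a minimal sketch, assuming the base estimator is the mean of $n$ iid Exp(rate = 1) draws (so $\theta = 1$ and $\sigma = 1$ ) and two illustrative transformations, $\log\theta$ and $1/\theta$ ; the helper names and replication counts are assumptions, not the delta-method-demo widget's internals.

```python
import math
import random
import statistics


def sample_mean_exp(n, rng):
    """Mean of n iid Exp(rate=1) draws; theta = 1 and sigma = 1."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n


def delta_sd(g_prime_at_theta, sigma, n):
    """Delta-method standard deviation |g'(theta)| * sigma / sqrt(n)."""
    return abs(g_prime_at_theta) * sigma / math.sqrt(n)


def sd_ratio(g, g_prime_at_theta, n, reps=2000, seed=0):
    """Empirical SD of g(X-bar) divided by the delta-method SD."""
    rng = random.Random(seed)
    values = [g(sample_mean_exp(n, rng)) for _ in range(reps)]
    return statistics.stdev(values) / delta_sd(g_prime_at_theta, 1.0, n)


if __name__ == "__main__":
    transforms = {
        "log(theta)": (math.log, 1.0),           # g'(1) = 1
        "1/theta": (lambda t: 1.0 / t, -1.0),    # g'(1) = -1
    }
    for name, (g, g_prime) in transforms.items():
        for n in (30, 500):
            ratio = sd_ratio(g, g_prime, n)
            print(f"{name:10s} n={n:3d}  empirical/delta SD ratio={ratio:.3f}")
```

A ratio near 1 means the linearisation is adequate at that $n$ ; a ratio that drifts away from 1 at small $n$ and returns as $n$ grows is the finite-sample residual of the first-order Taylor step.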

"Large enough" practical rules: sample mean for finite-variance symmetric populations is essentially Gaussian by $n \approx 30$ ; right-skewed populations need $n \approx 100$ ; heavy-tailed-but-finite-variance ones need $n \ge 1000$ ; sample variance needs much more. The Berry-Esséen bound $\sup_x |F_{Z_n}(x) - \Phi(x)| \le C , \mathbb{E}|X - \mu|^3 / (\sigma^3 \sqrt n)$ quantifies the $1/\sqrt n$ rate. Three classes of CLT failure: infinite variance (use robust estimators or bootstrap), borderline $n$ with severe skew (use bootstrap or profile likelihood), and extreme order statistics (use extreme-value theory). The asymptotic Gaussian is a tool, not a religion.

This section closes Part 1. The estimator catalogue is built; the asymptotic theory that organises it is in place. Part 2 begins. The opening sections — Neyman-Pearson framework, Type-I/II errors and power, the classical tests (t, chi-square, F) done by hand, what a $p$ -value is and is not — use the §1.9 machinery without re-deriving it. Every test statistic in Part 2 has an asymptotic Gaussian or $\chi^2$ limit; every power calculation invokes the asymptotic-normality picture; every p-value interpretation rests on the sampling-distribution view §1.6 set up and §1.9 formalised.

References

Lehmann, E.L. (1999). Elements of Large-Sample Theory. Springer. (The standard graduate textbook on asymptotic statistics. Chapters 2-3 develop convergence modes, consistency, and asymptotic normality; Chapter 5 the delta method.)
van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press. (The most-cited modern reference. Chapter 2 covers stochastic convergence; Chapter 3 the delta method; Chapter 5 M- and Z-estimator asymptotics; Chapter 8 efficiency of estimators.)
DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer. (Encyclopedic. Chapter 3 covers Berry-Esséen and CLT refinements; Chapter 6 the bootstrap from the asymptotic angle.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 5 is the cleanest one-chapter survey of convergence modes, LLN, and CLT for a stats audience; Chapter 9 covers the delta method and asymptotic theory of estimators.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 10 covers asymptotic evaluations of point estimators including consistency and asymptotic normality.)
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. (Foundational text; the original modern treatment of asymptotic theory for MLE.)
Berry, A.C. (1941). "The accuracy of the Gaussian approximation to the sum of independent variates." Transactions of the American Mathematical Society 49(1), 122-136. (The Berry side of the Berry-Esséen theorem.)
Esseen, C.G. (1942). "On the Liapounoff limit of error in the theory of probability." Arkiv för Matematik, Astronomi och Fysik 28A, 1-19. (The Esseen side of the Berry-Esséen theorem, sharpening Berry's constant.)
Pratt, J.W., Gibbons, J.D. (1981). Concepts of Nonparametric Theory. Springer. (Useful complement on order-statistic asymptotics — the extreme-value regime where the CLT does not apply.)