The central limit theorem
Learning objectives
- State the classical CLT and recognise its assumptions (finite variance, i.i.d.)
- Compute the asymptotic distribution N(μ, σ²/n) of the sample mean
- Recognise distributions where CLT fails (Cauchy, infinite-variance Pareto)
- Apply CLT to justify Normal approximations in t-tests, CIs, and ML estimators
- Distinguish CLT (sample mean → Normal) from LLN (sample mean → μ constant)
The Central Limit Theorem is statistics' second big idea: not only does the sample mean X̄_n converge to μ (the LLN), but its fluctuations around μ take a UNIVERSAL Gaussian shape. The DISTRIBUTION of X̄_n becomes approximately Normal, regardless of the parent distribution, as long as the variance is finite. This makes Normal-based inference apply to almost ANY data, provided n is large enough.
Classical CLT
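Let X_1, …, X_n be i.i.d. with finite mean μ and finite variance σ² > 0. Then, as n → ∞:
√n (X̄_n − μ) / σ → N(0, 1) in distribution.
Equivalently, X̄_n is approximately N(μ, σ²/n) for large n. The SHAPE of the parent doesn't matter for the limit; only μ and σ² do. This is the universality property.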
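The convergence of the standardised mean √n (X̄_n − μ) / σ to N(0, 1) can be checked by simulation. The sketch below is only an illustration: the helper name `standardized_means`, the Exponential(1) parent, the seed, and the sizes n = 200 and 20,000 trials are all choices made for this example, not part of the theorem.

```python
import numpy as np

def standardized_means(draw, mu, sigma, n, trials, rng):
    """Return sqrt(n) * (sample mean - mu) / sigma for `trials` samples of size n."""
    samples = draw(rng, (trials, n))      # one row per simulated sample
    means = samples.mean(axis=1)          # one sample mean per row
    return np.sqrt(n) * (means - mu) / sigma

rng = np.random.default_rng(0)

# Parent: Exponential(1), which is right-skewed with mu = 1 and sigma = 1.
z = standardized_means(lambda r, size: r.exponential(1.0, size),
                       mu=1.0, sigma=1.0, n=200, trials=20_000, rng=rng)

# If the CLT approximation is good, z should look like N(0, 1):
# mean near 0, SD near 1, and about 95% of values inside [-1.96, 1.96].
coverage = np.mean(np.abs(z) <= 1.96)
print(f"mean={z.mean():.3f}  sd={z.std():.3f}  coverage={coverage:.3f}")
```

Swapping the `draw` function for another finite-variance parent (with its own μ and σ) should give similar summaries, which is the universality property in action.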
Why does it work?
The intuition: when many independent small noise terms are added, the details of each term's distribution wash out, and only the mean and variance survive in the limit. Lyapunov and Lindeberg proved the CLT under various conditions. The crucial requirement is FINITE VARIANCE; heavy-tailed distributions can break this. The classical CLT also assumes independent, identically distributed (i.i.d.) terms, but this can be relaxed: the Lindeberg-Feller CLT allows independent, non-identically distributed terms provided no single term dominates the total variance.
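One standard way to make the intuition precise is the moment-generating-function argument, sketched here under the extra assumption that the MGF of the standardised variable exists near 0 (the general proof uses characteristic functions instead). The symbols Z_i, S_n, and M are notation introduced for this sketch only.

```latex
Z_i = \frac{X_i - \mu}{\sigma}, \qquad
\mathbb{E}[Z_i] = 0, \quad \mathbb{E}[Z_i^2] = 1, \qquad
S_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i
    = \frac{\sqrt{n}\,(\bar{X}_n - \mu)}{\sigma}

M_{S_n}(t)
  = \left[ M_Z\!\left( \tfrac{t}{\sqrt{n}} \right) \right]^{n}
  = \left[ 1 + \frac{t^2}{2n} + o\!\left( \tfrac{1}{n} \right) \right]^{n}
  \;\longrightarrow\; e^{t^2/2} \quad (n \to \infty)
```

The limit e^{t²/2} is the MGF of N(0, 1). Only the first two moments of Z enter the expansion, which is why the parent's shape drops out, and the expansion is unavailable when the variance is infinite.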
CLT failure: when σ² is infinite
For Cauchy(0, 1) the mean is undefined and the variance is infinite, so the CLT doesn't apply. The sum of n i.i.d. Cauchy(0, 1) variables is Cauchy(0, n), so X̄_n is again Cauchy(0, 1): the same distribution as a single sample. No Gaussian shape ever emerges, however large n is.
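A quick way to see this numerically is to track the spread of the sample mean as n grows. The sketch below uses the interquartile range (IQR) as the spread measure, since the SD is meaningless for the Cauchy; the helper name `iqr_of_sample_means`, the seed, and the trial count are assumptions of this example. For a Normal parent the IQR of X̄_n should shrink like 1/√n, while for Cauchy(0, 1) it should stay near 2 (the quartiles of Cauchy(0, 1) are ±1).

```python
import numpy as np

def iqr_of_sample_means(draw, n, trials, rng):
    """Interquartile range of the sample mean over `trials` samples of size n."""
    means = draw(rng, (trials, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

def cauchy(r, size):
    return r.standard_cauchy(size)    # Cauchy(0, 1) draws

def normal(r, size):
    return r.standard_normal(size)    # N(0, 1) draws, for comparison

rng = np.random.default_rng(1)
for n in (1, 10, 100):
    c = iqr_of_sample_means(cauchy, n, 20_000, rng)
    g = iqr_of_sample_means(normal, n, 20_000, rng)
    print(f"n={n:>3}  Cauchy IQR={c:.2f}  Normal IQR={g:.3f}")
```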
For Pareto(α) with shape α < 2, the variance is also infinite. Suitably normalised sums and means converge to STABLE DISTRIBUTIONS (Lévy 1925), a generalisation of the CLT that allows non-Normal limits when the variance is infinite.
Sample size: how big is "large"?
Rule of thumb: n ≥ 30 for moderate skewness, much larger for heavy-tailed distributions. For Bernoulli(p) with small p, the rule is np(1-p) > 9. For an Exponential, n = 30 is usually enough but n = 100 is safer. The parent's shape does not change the limit, but it does set the speed of convergence: lighter-tailed parents (like the uniform) converge fast; heavier-tailed or more skewed parents converge slowly.
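The Bernoulli rule np(1-p) > 9 can be turned into a minimum sample size for a given p. The function below is a small sketch; its name `min_n_for_bernoulli` and the `threshold` parameter are made up for this example, and the rule itself is only a heuristic.

```python
import math

def min_n_for_bernoulli(p, threshold=9.0):
    """Smallest integer n with n * p * (1 - p) > threshold."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    # n must exceed threshold / (p * (1 - p)), so take the floor and add 1.
    # Note: when that ratio is a whole number in exact arithmetic,
    # floating-point rounding can shift the answer by 1.
    return math.floor(threshold / (p * (1.0 - p))) + 1

for p in (0.5, 0.3, 0.05, 0.01):
    print(f"p={p}: need n >= {min_n_for_bernoulli(p)}")
```

The required n grows quickly as p moves away from 0.5, which matches the warning that small p needs much more data before the Normal approximation is reasonable.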
Why CLT is the workhorse of inference
- t-tests rely on X̄_n being approximately Normal even when the raw data aren't.
- Confidence intervals use the CLT to justify the critical value 1.96 from N(0, 1) for a 95% interval.
- The 'asymptotic Normality' of maximum likelihood estimators (§1.3) is a CLT application.
- The bootstrap (§3.2) relies on CLT-like behaviour of empirical-distribution averages.
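As a concrete instance of the confidence-interval item, the sketch below builds a CLT-based interval for a mean using only the Python standard library. The function names `z_critical` and `normal_ci` and the toy data are assumptions for illustration; the interval is only as good as the Normal approximation for X̄_n.

```python
import math
from statistics import NormalDist, fmean, stdev

def z_critical(level=0.95):
    """Two-sided critical value from N(0, 1); about 1.96 for level = 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

def normal_ci(data, level=0.95):
    """CLT-based interval: mean +/- z * s / sqrt(n), with s the sample SD."""
    n = len(data)
    mean = fmean(data)
    se = stdev(data) / math.sqrt(n)   # estimated standard error of the mean
    z = z_critical(level)
    return mean - z * se, mean + z * se

toy_data = list(range(1, 101))        # made-up data: the integers 1..100
low, high = normal_ci(toy_data)
print(f"z={z_critical():.4f}  95% CI=({low:.2f}, {high:.2f})")
```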
Try it
- Uniform(0, 1) at n = 1. The histogram of "means" is just the parent distribution itself — flat. Move n to 5: visible bell shape already. n = 30: near-perfect match to the green Normal curve. CLT is extremely fast for light-tailed distributions.
- Exponential(1) at n = 5. Distribution of X̄_5 is still visibly right-skewed (the skewness of X̄_n is 2/√n, about 0.89 at n = 5). At n = 30 it's much more bell-shaped. At n = 100 it's nearly indistinguishable from Normal by eye.
- Bernoulli(0.3) at n = 10. Distribution of X̄_10 is discrete with 11 atoms — clearly not Normal. n = 50: still discrete but the envelope is bell-shaped. n = 200: near-continuous bell.
- Cauchy(0, 1) at any n. The histogram NEVER converges to Normal: even at n = 200, the tails extend far beyond what the green Normal predicts. This is the CLT's failure case, because the finite-variance assumption is violated (σ² = ∞).
- Match empirical SD vs theoretical σ/√n in the readout. For Uniform at n = 30: theoretical SD = √(1/12)/√30 ≈ 0.053. Empirical SD over M = 2000 trials should match to within 5%.
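The last check in the list can also be reproduced offline. The sketch below mirrors the Uniform(0, 1), n = 30, M = 2000 setting; the seed and variable names are arbitrary choices for this example, and the exact empirical value will vary with the seed.

```python
import numpy as np

rng = np.random.default_rng(123)
n, M = 30, 2000                       # sample size and number of trials

# M independent samples of size n from Uniform(0, 1), one mean per sample.
means = rng.uniform(0.0, 1.0, size=(M, n)).mean(axis=1)

theoretical_sd = np.sqrt(1.0 / 12.0) / np.sqrt(n)   # sigma / sqrt(n)
empirical_sd = means.std(ddof=1)                    # SD of the M sample means
rel_err = abs(empirical_sd - theoretical_sd) / theoretical_sd

print(f"theoretical={theoretical_sd:.4f}  empirical={empirical_sd:.4f}  "
      f"relative error={rel_err:.1%}")
```

With M = 2000 the empirical SD itself has a relative sampling error of roughly 1/√(2M), or about 1.6%, so agreement within 5% is what one would expect to see.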
A clinical trial reports an average treatment effect of 2.3 ± 0.4 (SE) with n = 100. The data distribution is highly skewed (long right tail). Should the researchers trust a Normal-based 95% CI of [1.5, 3.1] — or run a bootstrap CI instead? Justify with a sample-size argument.
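To explore this question numerically, the sketch below first reproduces the reported Normal interval from the summary numbers, then computes a percentile bootstrap interval on SYNTHETIC data. The trial's raw data are not available here, so the lognormal sample, the seed, and the function names `normal_ci_from_summary` and `bootstrap_percentile_ci` are all stand-ins invented for illustration; nothing below is a result about the actual trial.

```python
import numpy as np

def normal_ci_from_summary(estimate, se, z=1.96):
    """Normal-based interval from a point estimate and its standard error."""
    return estimate - z * se, estimate + z * se

def bootstrap_percentile_ci(data, n_boot=5000, level=0.95, rng=None):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    # Resample row indices with replacement: one bootstrap sample per row.
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    boot_means = data[idx].mean(axis=1)
    alpha = 1.0 - level
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Reported summary: 2.3 +/- 1.96 * 0.4, which rounds to [1.5, 3.1].
print(normal_ci_from_summary(2.3, 0.4))

# Synthetic right-skewed sample with n = 100 (NOT the trial data).
rng = np.random.default_rng(42)
fake_effects = rng.lognormal(mean=0.0, sigma=1.0, size=100)
print(bootstrap_percentile_ci(fake_effects, rng=rng))
```

If the two kinds of interval agree closely on data with skewness like the trial's, n = 100 is probably large enough for the Normal interval; a clearly asymmetric bootstrap interval would suggest otherwise.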
What you now know
For finite-variance distributions, the sample mean is approximately Normal for large n, regardless of parent shape. This justifies almost all classical statistical inference. The CLT fails when the variance is infinite and converges slowly for strongly skewed or heavy-tailed data; bootstrap and robust methods are the usual alternatives there. §0.8 (transformations) handles non-linear functions of random variables; §0.9 (MGF) reveals why CLT works algebraically; §0.10 (simulation) makes all of this empirically verifiable.
References
- Wasserman, L. (2004). All of Statistics. Springer. (Chapter 5.4 — CLT.)
- Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley. (Chapter 27 — Lindeberg-Feller CLT.)
- Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. (Chapter 8 — CLT proofs.)
- Hoeffding, W. (1948). "A class of statistics with asymptotically normal distribution." Annals of Math. Stat. 19, 293-325. (U-statistics CLT.)
- Petrov, V.V. (1995). Limit Theorems of Probability Theory. Oxford. (Comprehensive CLT generalisations.)