The central limit theorem

Probability from zero

Learning objectives

State the classical CLT and recognise its assumptions (finite variance, i.i.d.)
Compute the asymptotic distribution N(μ, σ²/n) of the sample mean
Recognise distributions where CLT fails (Cauchy, infinite-variance Pareto)
Apply CLT to justify Normal approximations in t-tests, CIs, and ML estimators
Distinguish CLT (sample mean → Normal) from LLN (sample mean → μ constant)

The Central Limit Theorem is statistics' second big idea: not only does the sample mean converge to μ (the LLN), but its fluctuations around μ take a UNIVERSAL Gaussian shape. The DISTRIBUTION of $\bar{X}_n$ becomes approximately Normal whatever the parent distribution, as long as the parent has finite variance. This is why Normal-based inference applies to almost ANY data, provided n is large enough.

Classical CLT

Let $X_1, \ldots, X_n$ be i.i.d. with finite mean $\mu$ and finite variance $\sigma^2 > 0$ . Then:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2).$$

Informally, $\bar{X}_n \approx \mathcal{N}(\mu, \sigma^2 / n)$ for large n. The SHAPE of the parent doesn't matter for the limit, only μ and σ². This is the universality property.
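The approximation can be checked by simulation. The sketch below is illustrative only: it assumes NumPy is available, uses Exponential(1) as an arbitrary non-Normal parent (μ = 1, σ² = 1), and the names `simulate_sample_means`, `n` and `trials` are made up for this example.

```python
import numpy as np

def simulate_sample_means(n, trials, rng):
    """Return `trials` independent sample means, each over n Exponential(1) draws."""
    # Exponential(1) has mu = 1 and sigma^2 = 1, and is clearly right-skewed.
    samples = rng.exponential(scale=1.0, size=(trials, n))
    return samples.mean(axis=1)

rng = np.random.default_rng(0)
n, trials = 100, 20_000
mu, sigma = 1.0, 1.0

means = simulate_sample_means(n, trials, rng)
se = sigma / np.sqrt(n)  # standard deviation of the mean predicted by the CLT
coverage = np.mean(np.abs(means - mu) <= 1.96 * se)

print(f"mean of sample means: {means.mean():.4f} (theory {mu})")
print(f"SD of sample means:   {means.std(ddof=1):.4f} (theory {se:.4f})")
print(f"share within mu +/- 1.96*se: {coverage:.3f} (Normal predicts 0.950)")
```

If the approximation holds, the three printed quantities should land close to their theoretical values; swapping in another finite-variance parent should change little beyond the values of μ and σ.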

Why does it work?

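The intuition: when many independent small noise terms are summed and standardised, only their means and variances survive in the limit; finer features of the individual shapes (skewness, kurtosis) are averaged away, leaving a Gaussian. Lyapunov and Lindeberg proved the CLT under successively weaker conditions. The crucial requirement is FINITE VARIANCE: heavy-tailed distributions with infinite variance break it. The classical statement also assumes the draws are independent and identically distributed (i.i.d.), but this can be relaxed: the Lindeberg-Feller CLT allows independent, non-identically distributed terms, provided no single term dominates the total variance.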

CLT failure: when σ² is infinite

For Cauchy(0, 1) the mean is undefined and the variance is infinite, so the CLT doesn't apply. The sum of n i.i.d. Cauchy(0, 1) variables is Cauchy(0, n), so $\bar{X}_n$ is Cauchy(0, 1), the same as a single draw. Averaging gains nothing, and no Gaussian shape ever emerges.
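A quick simulation makes the contrast visible. This is a sketch under stated assumptions, not a definitive experiment: it assumes NumPy, the helper `iqr_of_sample_means` is a name invented here, and spread is measured by the interquartile range (IQR) because a standard deviation is meaningless for Cauchy data.

```python
import numpy as np

def iqr_of_sample_means(sampler, n, trials):
    """Interquartile range of `trials` sample means, each over n draws."""
    means = sampler((trials, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

rng = np.random.default_rng(1)
trials = 5_000

for n in (1, 10, 100, 1000):
    # Cauchy(0, 1) has quartiles at -1 and +1, so its IQR is 2 for every n.
    cauchy_iqr = iqr_of_sample_means(rng.standard_cauchy, n, trials)
    # For N(0, 1) data the IQR of the mean shrinks like 1.349 / sqrt(n).
    normal_iqr = iqr_of_sample_means(rng.standard_normal, n, trials)
    print(f"n={n:5d}  Cauchy IQR={cauchy_iqr:6.3f}  Normal IQR={normal_iqr:6.3f}")
```

The Cauchy column should hover around 2 at every n, while the Normal column shrinks as n grows.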

For Pareto(α) with shape α < 2, the variance is also infinite. Suitably rescaled sums and means converge to STABLE DISTRIBUTIONS (Lévy, 1925), the limits in a generalised CLT that allows non-Normal limits when the variance is infinite.
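One way to see the difference numerically is to look at the spread of √n · X̄_n, which settles down when the CLT holds and keeps growing when the variance is infinite. The sketch below assumes NumPy; the function name `scaled_iqr_of_means` and the choice of α = 1.5 versus α = 5 are illustrative. Note that NumPy's `pareto` draws the Lomax form, so 1 is added to get the classical Pareto with minimum 1.

```python
import numpy as np

def scaled_iqr_of_means(alpha, n, trials, rng):
    """IQR of sqrt(n) * (sample mean) over `trials` Pareto(alpha) samples of size n."""
    # rng.pareto() samples the Lomax form; adding 1 gives classical Pareto, x_m = 1.
    samples = 1.0 + rng.pareto(alpha, size=(trials, n))
    scaled = np.sqrt(n) * samples.mean(axis=1)
    # The IQR ignores shifts, so there is no need to subtract the mean first.
    q25, q75 = np.percentile(scaled, [25, 75])
    return q75 - q25

rng = np.random.default_rng(2)
trials = 4_000
results = {}

for alpha in (1.5, 5.0):  # 1.5: infinite variance; 5.0: finite variance
    for n in (10, 100, 1000):
        results[(alpha, n)] = scaled_iqr_of_means(alpha, n, trials, rng)
        print(f"alpha={alpha}  n={n:5d}  IQR of sqrt(n)*mean = {results[(alpha, n)]:.3f}")
```

For α = 5 the scaled IQR should level off as n grows; for α = 1.5 it should keep increasing, which is the signature of a non-Normal stable limit with a different normalisation.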

Sample size: how big is "large"?

Rule of thumb: n ≥ 30 for moderate skewness, much larger for heavy-tailed distributions. For Bernoulli(p) with small p, the rule is np(1-p) > 9. For an Exponential, n = 30 is usually enough but n = 100 is safer. The shape of the parent matters: symmetric, light-tailed parents (like the uniform) converge fast; skewed or heavy-tailed parents converge slowly.
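The Bernoulli rule translates directly into a minimum sample size. The helper below is a hypothetical convenience function (the name `min_n_for_normal_approx` is not from any library); it uses exact rational arithmetic so that boundary cases such as p = 0.1 are not spoiled by floating-point rounding.

```python
from fractions import Fraction
import math

def min_n_for_normal_approx(p, threshold=9):
    """Smallest integer n with n * p * (1 - p) > threshold."""
    # Pass p as a decimal string (e.g. "0.1") or a Fraction to keep it exact.
    p = Fraction(p)
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    bound = Fraction(threshold) / (p * (1 - p))
    # The inequality is strict, so step to the next integer above the bound.
    return math.floor(bound) + 1

for p in ("0.5", "0.3", "0.1", "0.01"):
    print(f"p = {p:<5} minimum n = {min_n_for_normal_approx(p)}")
# p = 0.5 -> 37, p = 0.3 -> 43, p = 0.1 -> 101, p = 0.01 -> 910
```

The required n grows quickly as p moves away from 0.5, which is the "small p" caveat in numbers.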

Why CLT is the workhorse of inference

t-tests rely on $\bar{X}_n$ being approximately Normal even when raw data isn't.
Confidence intervals $\bar{X}_n \pm 1.96\, s/\sqrt{n}$ use the CLT to justify taking the multiplier 1.96 from N(0, 1).
The asymptotic Normality of maximum likelihood estimators (§1.3) is a CLT application.
The bootstrap (§3.2) relies on CLT-like behaviour of empirical-distribution averages.
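As a concrete instance of the confidence-interval item above, here is a minimal sketch of the CLT-based interval. It assumes NumPy; `normal_ci_for_mean` is a name chosen for this example, and the Gamma data are synthetic stand-ins for skewed measurements.

```python
import numpy as np

def normal_ci_for_mean(data, z=1.96):
    """CLT-based interval: sample mean +/- z * s / sqrt(n)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    if n < 2:
        raise ValueError("need at least two observations to estimate s")
    mean = data.mean()
    se = data.std(ddof=1) / np.sqrt(n)  # s / sqrt(n), with the n - 1 divisor in s
    return mean - z * se, mean + z * se

rng = np.random.default_rng(3)
data = rng.gamma(shape=2.0, scale=1.0, size=200)  # skewed parent, true mean 2

low, high = normal_ci_for_mean(data)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

The interval is only as good as the Normal approximation behind it, so for small n or strongly skewed data the nominal 95% coverage is approximate.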

Try it

Uniform(0, 1) at n = 1. The histogram of "means" is just the parent distribution itself, flat. Move n to 5: visible bell shape already. n = 30: near-perfect match to the green Normal curve. CLT is extremely fast for light-tailed distributions.
Exponential(1) at n = 5. Distribution of X̄_5 is still visibly right-skewed (the skewness of X̄_n is 2/√n, about 0.89 at n = 5). At n = 30 it's much more bell-shaped. At n = 100 it's visually indistinguishable from Normal.
Bernoulli(0.3) at n = 10. Distribution of X̄_10 is discrete with 11 atoms, clearly not Normal. n = 50: still discrete but the envelope is bell-shaped. n = 200: near-continuous bell.
Cauchy(0, 1) at any n. The histogram NEVER converges to Normal: even at n = 200, the tails extend far beyond what the green Normal curve predicts. This is the CLT's failure case: the finite-variance assumption is violated (σ² = ∞).
Compare the empirical SD with the theoretical σ/√n in the readout. For Uniform at n = 30: theoretical SD = √(1/12)/√30 ≈ 0.053. Empirical SD over M = 2000 trials should match to within 5%.
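The last check can also be reproduced outside the interactive widget. The sketch below assumes NumPy and mirrors the Uniform(0, 1), n = 30, M = 2000 setting; the variable names are illustrative, and the exact empirical value depends on the random seed.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 30, 2000  # sample size and number of simulated means (M)

# Each row is one sample of size n; each row mean is one simulated sample mean.
means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)

theoretical_sd = np.sqrt(1.0 / 12.0) / np.sqrt(n)  # sigma / sqrt(n) for Uniform(0, 1)
empirical_sd = means.std(ddof=1)
rel_error = abs(empirical_sd - theoretical_sd) / theoretical_sd

print(f"theoretical SD: {theoretical_sd:.4f}")
print(f"empirical SD:   {empirical_sd:.4f}")
print(f"relative error: {100 * rel_error:.1f}%")
```

With 2000 trials the sampling error of the empirical SD is roughly 1.6% of its value, so a discrepancy within a few percent is expected.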

A clinical trial reports an average treatment effect of 2.3 ± 0.4 (SE) with n = 100. The data distribution is highly skewed (long right tail). Should the researchers trust a Normal-based 95% CI of [1.5, 3.1], or run a bootstrap CI instead? Justify with a sample-size argument.
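To explore the question empirically, one option is to compute both intervals on skewed data of the same size and compare them. The sketch below is not the trial's data or method: it assumes NumPy, uses a lognormal sample as a hypothetical stand-in for right-skewed effects, and `bootstrap_percentile_ci` is a plain percentile bootstrap written for this example.

```python
import numpy as np

def bootstrap_percentile_ci(data, n_boot=5000, level=0.95, rng=None):
    """Percentile bootstrap interval for the mean."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n = data.size
    # Resample n observations with replacement, n_boot times.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = data[idx].mean(axis=1)
    alpha = 1.0 - level
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

rng = np.random.default_rng(7)
effects = rng.lognormal(mean=0.0, sigma=1.0, size=100)  # hypothetical skewed data

mean = effects.mean()
se = effects.std(ddof=1) / np.sqrt(effects.size)
normal_ci = (mean - 1.96 * se, mean + 1.96 * se)
boot_ci = bootstrap_percentile_ci(effects, rng=rng)

print(f"Normal-based CI: [{normal_ci[0]:.3f}, {normal_ci[1]:.3f}]")
print(f"Bootstrap CI:    [{boot_ci[0]:.3f}, {boot_ci[1]:.3f}]")
```

If the two intervals nearly coincide, the Normal approximation is probably adequate at this n; a bootstrap interval that is noticeably asymmetric around the mean suggests the skewness has not yet averaged out.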

What you now know

For finite-variance distributions, the sample mean is approximately Normal regardless of parent shape, provided n is large enough. This justifies almost all classical statistical inference. The CLT fails when the variance is infinite; robust methods are the usual fallback there, while the bootstrap helps mainly with skewed, finite-variance data at moderate n. §0.8 (transformations) handles non-linear functions of random variables; §0.9 (MGF) reveals why CLT works algebraically; §0.10 (simulation) makes all of this empirically verifiable.

References

Wasserman, L. (2004). All of Statistics. Springer. (Chapter 5.4, CLT.)
Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley. (Chapter 27, Lindeberg-Feller CLT.)
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. 2, 2nd ed. Wiley. (Chapter 8, CLT proofs.)
Hoeffding, W. (1948). "A class of statistics with asymptotically normal distribution." Annals of Math. Stat. 19, 293-325. (U-statistics CLT.)
Petrov, V.V. (1995). Limit Theorems of Probability Theory. Oxford. (Comprehensive CLT generalisations.)