Limit Theorems and the Central Limit Theorem

Part 15, Chapter 15: Combinatorics and Probability

Learning objectives

  • State the Central Limit Theorem precisely, including the standardisation
  • Distinguish the CLT from the Law of Large Numbers
  • Apply the CLT to compute approximate probabilities for sample means
  • Build normal-approximation confidence intervals $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$
  • Recognise when the CLT is unsafe (heavy tails, dependence, small samples)

The Central Limit Theorem is arguably the most consequential single result in probability and statistics. It explains why a bell curve appears so often: in heights, in measurement error, in the diffusion of pollutants, and in the long-run behaviour of many of the estimators we use. It also tells us the scale on which sample averages fluctuate around their true mean, of order $1/\sqrt{n}$, and so provides the inferential machinery (confidence intervals, hypothesis tests, p-values) that underpins much of empirical science. If you understand the CLT, you understand why "n large" is the magic word.

The Law of Large Numbers (first)

Before the CLT, the Law of Large Numbers says: if $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$, then $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu$ as $n \to \infty$ (in probability under the weak law, almost surely under the strong law). This is a FIRST-order statement: the sample average converges to the true mean.

But it leaves the next question wide open: HOW FAST does $\bar{X}_n$ approach $\mu$? And what does the fluctuation around $\mu$ look like at finite $n$? Those are the questions the CLT answers.
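To make the first-order statement concrete, here is a minimal simulation sketch of the Law of Large Numbers for fair-die rolls. The function name `running_means`, the checkpoints, and the fixed seed are illustrative choices, not part of the chapter.

```python
import random

def running_means(n_rolls, checkpoints, seed=0):
    """Return {n: mean of the first n fair-die rolls} for each checkpoint n."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is reproducible
    wanted = set(checkpoints)
    total = 0
    means = {}
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)  # one fair die roll, faces 1..6
        if i in wanted:
            means[i] = total / i
    return means

means = running_means(100_000, [10, 100, 1_000, 10_000, 100_000])
for n, m in means.items():
    # The true mean of one die is 3.5; the error should tend to shrink as n grows.
    print(f"n = {n:>7}: sample mean = {m:.4f}, |error| = {abs(m - 3.5):.4f}")
```

The printout shows that the error shrinks, but not on what scale or with what shape; that is the gap the CLT fills.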

Statement of the CLT

Theorem (Lindeberg-Lévy CLT). Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and FINITE variance $\sigma^{2}$. Define the standardised sample mean

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}.$$

Then $Z_n$ converges in distribution to a standard normal: $Z_n \xrightarrow{d} N(0, 1)$. Equivalently, for any real $a < b$,

$$P(a \leq Z_n \leq b) \to \Phi(b) - \Phi(a) \quad \text{as } n \to \infty,$$

where $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2} \, dt$ is the standard-normal CDF.

Reading this in plain language: for large $n$, the sample mean is approximately normal with mean $\mu$ and variance $\sigma^{2}/n$. The remarkable feature is that this is true REGARDLESS of the distribution of the $X_i$: uniform, exponential, Bernoulli, Poisson, or any other finite-variance distribution. The bell curve emerges from the sum, not from the summands.
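The theorem can be checked empirically. The sketch below draws standardised sample means from a deliberately non-normal distribution (Exponential with rate 1, chosen here only for illustration) and compares the observed frequency of $a \leq Z_n \leq b$ with $\Phi(b) - \Phi(a)$. The helper names, $n = 50$, the number of trials, and the seed are all assumptions of this sketch.

```python
import math
import random
from statistics import NormalDist

def standardised_mean(rng, n, rate=1.0):
    """Draw n Exponential(rate) values and return Z_n = (mean - mu) / (sigma / sqrt(n))."""
    mu = sigma = 1.0 / rate  # an Exponential(rate) has mean and sd both equal to 1/rate
    xbar = sum(rng.expovariate(rate) for _ in range(n)) / n
    return (xbar - mu) / (sigma / math.sqrt(n))

def empirical_interval_prob(a, b, n, trials=10_000, seed=1):
    """Fraction of simulated Z_n values that land in [a, b]."""
    rng = random.Random(seed)
    hits = sum(a <= standardised_mean(rng, n) <= b for _ in range(trials))
    return hits / trials

a, b = -1.0, 1.0
phi = NormalDist().cdf          # standard-normal CDF
limit = phi(b) - phi(a)         # the CLT limit Phi(b) - Phi(a)
estimate = empirical_interval_prob(a, b, n=50)
print(f"simulated P({a} <= Z_n <= {b}) = {estimate:.3f}, CLT limit = {limit:.3f}")
```

Swapping `rng.expovariate` for another finite-variance sampler (with the matching `mu` and `sigma`) is a quick way to see the "regardless of the distribution" claim in action.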

A worked example: 100 dice

Roll 100 fair dice and average the faces. The mean of one die is $\mu = 3.5$; the variance is $\sigma^{2} = 35/12 \approx 2.917$, so $\sigma \approx 1.708$. By the CLT:

$$\bar{X}_{100} \approx N\!\left(3.5, \frac{2.917}{100}\right) = N(3.5, 0.02917),$$

so the standard error of the sample mean is $\sqrt{0.02917} \approx 0.171$. The probability that the sample mean exceeds 3.7 is approximately

$$P(\bar{X}_{100} > 3.7) = P\!\left(Z > \frac{3.7 - 3.5}{0.171}\right) = P(Z > 1.17) \approx 0.121.$$

The original dice distribution is discrete uniform on $\{1, 2, 3, 4, 5, 6\}$, with nothing bell-curve-like about it. But the AVERAGE of 100 of them is, to a good approximation, a normal random variable.
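Because a die has only six faces, the exact distribution of the sum of 100 dice can be computed by repeated convolution and set beside the normal approximation from the worked example. This is a sketch; `dice_sum_pmf` is a hypothetical helper written for this comparison.

```python
import math
from statistics import NormalDist

def dice_sum_pmf(n_dice):
    """Exact pmf of the sum of n_dice fair dice, as {total: probability}."""
    pmf = {0: 1.0}
    for _ in range(n_dice):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                # Each face has probability 1/6; accumulate into the new total.
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

n = 100
mu, var = 3.5, 35 / 12
se = math.sqrt(var / n)                         # standard error, about 0.171
normal_tail = 1 - NormalDist(mu, se).cdf(3.7)   # CLT approximation to P(mean > 3.7)

pmf = dice_sum_pmf(n)
# A sample mean above 3.7 is the same event as a total above 370.
exact_tail = sum(p for total, p in pmf.items() if total > 370)

print(f"normal approximation: {normal_tail:.4f}")
print(f"exact (convolution):  {exact_tail:.4f}")
```

Any remaining gap between the two numbers is mostly a discreteness effect: the sample mean moves in steps of 0.01, while the normal curve is continuous.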

Plot the standard-normal density $\varphi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^{2}/2}$ and notice the iconic bell shape with mass concentrated within $\pm 3$. Memorise the 68-95-99.7 rule: probabilities $\approx 0.68, 0.95, 0.997$ lie within $1, 2, 3$ standard deviations of the mean.
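The 68-95-99.7 rule is just $\Phi(k) - \Phi(-k)$ for $k = 1, 2, 3$, which can be checked in a few lines with Python's standard library (a quick sketch, not a plotting routine):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"P(|Z| <= {k}) = {p:.4f}")
```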

Confidence intervals

The CLT gives the standard normal-approximation confidence interval for an unknown mean $\mu$:

$$\bar{X}_n \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}.$$

For 95% confidence, $z_{\alpha/2} = 1.96$. The half-width $1.96\,\sigma/\sqrt{n}$ is the margin of error; it shrinks like $1/\sqrt{n}$, which is the source of the slogan "four times the data buys you half the error."

When $\sigma$ is unknown (always, in practice), it is replaced by the sample standard deviation $s$, and for small $n$ the normal quantile $z$ is replaced by a Student-$t$ quantile to account for the extra uncertainty in estimating $\sigma$.
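A minimal sketch of the interval in code, assuming $n$ is large enough that the normal quantile is acceptable and using the sample standard deviation $s$ in place of $\sigma$. The function names and the simulated data (400 draws from a normal with mean 100 and standard deviation 5) are hypothetical, chosen only to have something to run.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def margin_of_error(sd, n, confidence=0.95):
    """Half-width z_{alpha/2} * sd / sqrt(n) of the normal-approximation interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return z * sd / math.sqrt(n)

def normal_ci(sample, confidence=0.95):
    """Normal-approximation CI for the mean, with s standing in for the unknown sigma."""
    xbar = mean(sample)
    half = margin_of_error(stdev(sample), len(sample), confidence)
    return xbar - half, xbar + half

rng = random.Random(42)                            # fixed seed, illustrative only
sample = [rng.gauss(100, 5) for _ in range(400)]   # hypothetical measurements
low, high = normal_ci(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")

# "Four times the data buys you half the error": compare n = 100 with n = 400.
ratio = margin_of_error(5, 100) / margin_of_error(5, 400)
print(f"margin(n=100) / margin(n=400) = {ratio:.3f}")
```

For small samples, the same structure applies with the normal quantile swapped for a Student-$t$ quantile; the standard library has no $t$ distribution, so that variant would need an external package.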

Where this shows up
  • Statistical inference everywhere: Most t-tests, confidence intervals, p-values, and standard errors reported in empirical science rest on the CLT, either directly or through a related large-sample argument. The normal distribution's dominance in inference is its consequence.
  • Polling and survey research: A poll of $n = 1000$ voters reports a "margin of error of 3 percentage points"; that is $1.96 \sqrt{0.25/1000} \approx 0.031$, the worst-case CLT bound for a Bernoulli sample mean.
  • Quality control, control charts: Manufacturing processes plot $\bar{X}_n$ over time and trigger alarms when it crosses the $\pm 3\sigma/\sqrt{n}$ control limits. Pure CLT.
  • Insurance and risk: The capital requirement of an insurer holding $n$ independent policies scales as $\sigma \sqrt{n}$ (one CLT-standard-deviation of loss), not as the total expected loss, which grows like $n$. This is why pooling makes insurance feasible.
  • Brownian motion and diffusion: Take the CLT limit of a random walk over fine time-steps and you get continuous-time Brownian motion, the foundation of the Black-Scholes model and most stochastic differential equations in physics.
  • Pause and think: The CLT requires FINITE variance. Cauchy random variables have no finite variance (or even a well-defined mean), and their sample mean does NOT converge to a constant; in fact, $\bar{X}_n$ has the same Cauchy distribution as a single observation. Why does the CLT machinery fail in this case? (Hint: where does $\sigma^{2}/n$ even live when $\sigma^{2} = \infty$?)
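Two of the applications in the list above reduce to one-line formulas, sketched here. The polling function uses the worst case $p(1-p) = 0.25$ from the text; the per-policy mean and standard deviation in the pooling example (1000 and 5000) are made-up numbers used only to show the scaling.

```python
import math

def worst_case_poll_margin(n, z=1.96):
    """95% margin of error for a Bernoulli sample mean, worst case p(1 - p) = 0.25."""
    return z * math.sqrt(0.25 / n)

def pooled_risk_ratio(n, mu, sigma):
    """sd of total loss over expected total loss, for n independent policies."""
    return (sigma * math.sqrt(n)) / (n * mu)

print(f"poll of 1000: margin = {worst_case_poll_margin(1000):.3f}")

# Hypothetical per-policy loss: mean 1000, standard deviation 5000.
for n in (100, 10_000):
    print(f"n = {n:>6}: sd/expected total = {pooled_risk_ratio(n, 1000, 5000):.3f}")
```

The ratio in the second loop falls like $1/\sqrt{n}$: a hundred times more policies makes the relative fluctuation ten times smaller.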

    Try it

    • A factory produces resistors with mean resistance $\mu = 100\,\Omega$ and standard deviation $\sigma = 5\,\Omega$. You sample 25 resistors. What is the approximate probability that the sample mean exceeds $101\,\Omega$? (Answer: $P(Z > 1) \approx 0.16$.)
    • Compute the 95% confidence interval for the mean voter preference in a Bernoulli$(p)$ poll of $n = 400$ where $\hat{p} = 0.52$. (Hint: $\text{SE} = \sqrt{\hat{p}(1-\hat{p})/n} \approx 0.025$; interval $\approx 0.52 \pm 0.049$.)
    • You need a margin of error $\leq 0.01$ for a Bernoulli$(p)$ poll. What is the smallest $n$ guaranteed to suffice, regardless of $p$? (Use worst-case $p(1-p) = 0.25$.)
    • Roll a fair die 1000 times. Approximate the probability that the total exceeds 3600. (Compute $\mu_{\text{total}} = 3500$, $\sigma_{\text{total}} = \sqrt{1000 \cdot 35/12} \approx 54$; $P(Z > 1.85) \approx 0.032$.)
    • Distinguish the LLN (the sample mean converges to $\mu$) from the CLT (the standardised sample mean converges in distribution to $N(0, 1)$). Which one would you cite when justifying that simulation estimates eventually become exact, vs. when computing a confidence interval?
    A trap to watch for

      The CLT is a LIMIT theorem: it guarantees nothing at any fixed small $n$, and even at moderate $n$ it can be a poor approximation for HEAVY-TAILED distributions (large variance contribution from rare events) or HIGHLY SKEWED distributions. A common heuristic is "n at least 30", but this is folklore, not a theorem; for Bernoulli$(p)$ with small $p$, the usual rule of thumb asks for at least $np \geq 5$ AND $n(1-p) \geq 5$ before trusting the normal approximation. Also: the CLT as stated assumes INDEPENDENCE. For positively correlated data (time series, network data, repeated measurements on the same subject), the effective sample size is smaller than $n$, so the naive CLT interval is too narrow and understates the uncertainty. Time-series statisticians spend their careers correcting for this.
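The Bernoulli rule of thumb can be probed directly, since the exact binomial CDF is cheap to compute. The sketch below reports the largest gap between the exact CDF and its normal approximation for three $(n, p)$ pairs picked for illustration. It applies a continuity correction (evaluating the normal CDF at $k + 0.5$), which the text above does not discuss; that is an assumption of this sketch, made so the gap reflects skewness rather than discreteness.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def max_cdf_error(n, p):
    """Largest |exact CDF - normal approximation| over k = 0..n (continuity-corrected)."""
    approx = NormalDist(n * p, math.sqrt(n * p * (1 - p)))
    exact_cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        exact_cdf += binom_pmf(k, n, p)  # running exact P(X <= k)
        worst = max(worst, abs(exact_cdf - approx.cdf(k + 0.5)))
    return worst

# (n, p) pairs chosen for illustration: np = 0.6, np = 10, and a symmetric case.
for n, p in [(30, 0.02), (500, 0.02), (30, 0.5)]:
    print(f"n = {n:>3}, p = {p}: np = {n * p:>4.1f}, max CDF error = {max_cdf_error(n, p):.4f}")
```

With $n = 30$ and $p = 0.02$ the sample size meets the "n at least 30" folklore but $np = 0.6$ falls far short of 5, and the error is correspondingly large compared with the symmetric case at the same $n$.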

      What you now know

      You can apply the CLT to compute approximate probabilities for sample means, build normal-approximation confidence intervals, recognise that the convergence rate is $1/\sqrt{n}$, and spot the failure modes (heavy tails, dependence, small $n$). You now have the toolkit to read and write empirical claims with quantitative uncertainty, which is the working language of every scientific discipline downstream. The next chapter pivots from probability to algorithms, where the rules of the game change from "what happens by chance" to "how fast can a procedure run."


