The law of large numbers

Probability from zero

Learning objectives

State the strong and weak LLN
Recognise the 1/√n convergence rate: with finite variance, typical deviations of the sample mean from μ shrink like σ/√n
Identify when LLN fails: distributions with infinite or undefined mean (certain Pareto, Cauchy)
Apply LLN to justify sample means as estimators
Distinguish almost-sure (strong) from in-probability (weak) convergence

The Law of Large Numbers is statistics' first big idea: the SAMPLE AVERAGE of i.i.d. observations converges to the POPULATION MEAN as the sample size grows. It is what makes statistical inference possible: because of it, sample-based estimators tell us something about populations.

Two forms: weak and strong

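Let $X_1, X_2, \ldots$ be i.i.d. with finite mean $\mu = E[X_i]$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean of the first $n$ observations.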

Weak LLN: $\bar{X}_n \xrightarrow{P} \mu$, convergence in probability. For every $\varepsilon > 0$, $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$.
Strong LLN: $\bar{X}_n \xrightarrow{a.s.} \mu$, convergence almost surely. With probability 1, the path $\bar{X}_n(\omega)$ converges to $\mu$.

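SLLN ⇒ WLLN, but not vice versa: almost-sure convergence implies convergence in probability, and the converse can fail. The SLLN gives the stronger guarantee: ALMOST EVERY sample path eventually settles near μ and stays there. Convergence in probability alone would allow a path to keep making occasional large excursions, so long as the probability of being far from μ at any given n drops to 0.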
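The weak-law statement can be checked numerically. The following is a minimal sketch, not part of the original lesson: it assumes NumPy, uses Exponential(1) (so μ = 1) as an arbitrary test distribution, and the function name `tail_probability`, the tolerance ε = 0.1 and the replication count are illustrative choices. It estimates $P(|\bar{X}_n - \mu| > \varepsilon)$ by repeating the experiment many times at each n.

```python
import numpy as np

def tail_probability(n, eps=0.1, reps=2000, seed=0):
    """Monte Carlo estimate of P(|mean of n draws - mu| > eps).

    Assumption for this sketch: draws are Exponential(1), so mu = 1.
    """
    rng = np.random.default_rng(seed)
    # Each row is one independent experiment of n observations.
    samples = rng.exponential(scale=1.0, size=(reps, n))
    sample_means = samples.mean(axis=1)
    # Fraction of experiments whose sample mean missed mu by more than eps.
    return float(np.mean(np.abs(sample_means - 1.0) > eps))

probs = {n: tail_probability(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:5d}   estimated P(|mean - 1| > 0.1) = {p:.3f}")
```

The estimated probability should fall towards 0 as n grows, which is exactly what convergence in probability asserts for this fixed ε.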

Convergence rate: the 1/√n rule

For distributions with finite variance $\sigma^2$, independence gives $\mathrm{Var}(\bar{X}_n) = \sigma^2 / n$, so $\mathrm{SD}(\bar{X}_n) = \sigma / \sqrt{n}$. The sample mean therefore concentrates around μ at rate 1/√n. To halve the error, you need 4× the data. To shrink it by 10×, you need 100× the data.

This is WHY large studies (clinical trials, surveys) need large n: precision is expensive.
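A quick simulation can confirm the σ/√n formula. This is a sketch under stated assumptions rather than anything from the lesson itself: it uses NumPy, a Normal population with an arbitrarily chosen σ = 2, and a hypothetical helper `empirical_sd_of_mean`. Each 4× increase in n should roughly halve the spread of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0  # assumed population SD, chosen only for illustration

def empirical_sd_of_mean(n, reps=4000):
    """SD of the sample mean across `reps` independent samples of size n."""
    draws = rng.normal(loc=0.0, scale=sigma, size=(reps, n))
    return float(draws.mean(axis=1).std(ddof=1))

results = {}
for n in (25, 100, 400):
    # Store (simulated SD of the mean, theoretical sigma / sqrt(n)).
    results[n] = (empirical_sd_of_mean(n), sigma / np.sqrt(n))
    print(f"n = {n:4d}   simulated SD = {results[n][0]:.4f}   "
          f"sigma/sqrt(n) = {results[n][1]:.4f}")
```

The theoretical column is 0.4, 0.2, 0.1; the simulated column should track it closely, with small Monte Carlo noise.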

When LLN fails: pathological distributions

The LLN requires finite mean. For distributions with INFINITE OR UNDEFINED mean:

Cauchy(0, 1): the PDF $f(x) = 1/(\pi(1 + x^2))$ has tails so heavy that $E[|X|]$ diverges, so the mean is undefined. Sample means do NOT converge: for every n, $\bar{X}_n$ has the SAME distribution as a single sample.
Pareto with shape α ≤ 1: the tails are heavy enough that the first moment is infinite, so sample means grow without bound instead of settling.
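The Cauchy claim is easy to test empirically. The sketch below assumes NumPy and a hypothetical helper `iqr_of_sample_means`; it measures spread with the interquartile range, because the standard deviation does not exist for a Cauchy. The quartiles of a standard Cauchy are at ±1, so its IQR is 2, and if the sample mean really has the same distribution as a single draw, the IQR of the sample means should stay near 2 however large n gets.

```python
import numpy as np

rng = np.random.default_rng(2)

def iqr_of_sample_means(n, reps=2000):
    """IQR of the sample mean of n standard-Cauchy draws, over `reps` repeats."""
    draws = rng.standard_cauchy(size=(reps, n))
    means = draws.mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return float(q75 - q25)

iqrs = {n: iqr_of_sample_means(n) for n in (1, 100, 1000)}
for n, iqr in iqrs.items():
    print(f"n = {n:5d}   IQR of sample means = {iqr:.2f}")
```

For a finite-variance distribution the same experiment would show the IQR shrinking like 1/√n; here it should not shrink at all.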

These aren't just curiosities: real heavy-tailed phenomena (financial returns, network packet sizes, city sizes) can have distributions for which the empirical mean is unstable. Robust statistics (§4.5) becomes essential.
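As one illustration of why robust estimators help, the sample median (one of the robust estimators this section names in its summary) can be tried on the same Cauchy(0, 1) data. This is a minimal sketch assuming NumPy; the helper name `typical_median_error` and the choice of "typical error" (the median absolute distance from the true centre 0) are assumptions made for the example, not the treatment given in §4.5.

```python
import numpy as np

rng = np.random.default_rng(3)

def typical_median_error(n, reps=2000):
    """Median absolute error of the sample median of n standard-Cauchy draws."""
    draws = rng.standard_cauchy(size=(reps, n))
    sample_medians = np.median(draws, axis=1)
    # The Cauchy(0, 1) centre is 0, so |sample median| is the estimation error.
    return float(np.median(np.abs(sample_medians)))

errs = {n: typical_median_error(n) for n in (10, 100, 1000)}
for n, err in errs.items():
    print(f"n = {n:5d}   typical error of the sample median = {err:.3f}")
```

Unlike the sample mean, the sample median's error should shrink steadily as n grows, even though the underlying distribution has no mean.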

Try it

Default Normal(0, 1). Three independent traces (different seeds) ALL approach the red μ = 0 line as n grows. By n = 1000 the standard deviation of the mean is σ/√n = 1/√1000 ≈ 0.032, so the three paths typically sit within about ±2 SD (≈ ±0.06) of 0.
Switch to Exponential(1). Mean = 1 and σ = 1, so the SD of the mean is the same 1/√n as for Normal(0, 1). Skewness does not slow the convergence, though early paths can be pulled upward by occasional large draws. By n = 1000, paths typically sit within about ±0.06 of μ = 1.
Switch to Cauchy(0, 1). Spread across paths does NOT shrink with n; sometimes the running mean jumps wildly even at n = 5000. This is LLN's failure case. The Cauchy is heavy-tailed enough that a single extreme sample can knock the running mean far off course at any n.
Switch to Bernoulli(0.3). Discrete, with only two possible values, 0 and 1. The sample mean converges to 0.3 cleanly. One SD of the mean is √(0.3 × 0.7 / n): about 0.05 at n = 100 and about 0.015 at n = 1000, so paths typically sit within roughly those distances of 0.3.
Slide n_max from 100 to 5000 with Normal. Verify the convergence visually: at n = 100 paths still drift; at n = 5000 they hug μ tightly. The SD of the mean at n = 5000 is 1/√(5000/100) = 1/√50 ≈ 14% of its value at n = 100, so 50× the data gives only ~7× better precision.
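The traces in the experiment can be reproduced offline. The following is a sketch, not the code behind the interactive widget: it assumes NumPy, the seeds 0, 1, 2 are arbitrary, and `running_mean` is a hypothetical helper that computes the running mean of one path with a cumulative sum.

```python
import numpy as np

def running_mean(samples):
    """Running mean: entry k is the mean of the first k + 1 samples."""
    samples = np.asarray(samples, dtype=float)
    return np.cumsum(samples) / np.arange(1, samples.size + 1)

n_max = 5000
paths = []
for seed in (0, 1, 2):  # three independent traces, one per seed
    rng = np.random.default_rng(seed)
    paths.append(running_mean(rng.normal(0.0, 1.0, size=n_max)))
paths = np.vstack(paths)  # shape (3, n_max)

for n in (100, 1000, 5000):
    # Column n - 1 holds each path's running mean after n observations.
    print(f"n = {n:4d}   running means: {np.round(paths[:, n - 1], 3)}")
```

To try the other cases, replace the `rng.normal(...)` draw with `rng.exponential(1.0, size=n_max)`, `rng.standard_cauchy(size=n_max)` or `rng.binomial(1, 0.3, size=n_max)`.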

A pollster reports a survey of 1000 voters with margin of error ±3%. A follow-up survey doubles n to 2000. What new margin of error should they report (and why is it > ±1.5%)?

What you now know

Sample means converge to population means whenever the mean is finite, and they do so at rate 1/√n when the variance is finite too. Pathological heavy-tailed distributions (Cauchy) break the law entirely. Robust estimators (median, trimmed means) can remain stable for heavy-tailed distributions where the sample mean is not. §0.7 takes the next step: the CLT tells us how the sample mean is distributed around μ, approximately bell-shaped with width σ/√n.

References

Wasserman, L. (2004). All of Statistics. Springer. (Chapter 5, convergence of random variables.)
Billingsley, P. (1995). Probability and Measure, 3rd ed. Wiley. (Chapter 6, strong LLN.)
Etemadi, N. (1981). "An elementary proof of the strong law of large numbers." Zeitschrift für Wahrscheinlichkeitstheorie 55, 119-122.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Vol. 1. Wiley. (Chapter 10.)
Tukey, J.W. (1960). "A survey of sampling from contaminated distributions." (Robust statistics motivation when LLN works but slowly under contamination.)