Sampling distributions and the standard error
Learning objectives
- Define the sampling distribution of an estimator θ̂ as the distribution of θ̂ across independent samples of size n from a fixed population
- Define the standard error SE(θ̂) as the standard deviation of the sampling distribution, and distinguish it carefully from the population SD
- Derive and apply the textbook rule SE(X̄) = σ/√n for the sample mean, and recognise that the √n scaling is the DEFAULT but not the only possibility
- Know the SE formulas for sample median, sample variance, correlation, and other common non-mean estimators — and where each formula breaks
- State the CLT in operational form: for finite-variance populations the sample mean's sampling distribution is approximately Gaussian for large n
- Identify three settings where the CLT does NOT help: heavy-tailed populations with infinite variance (Cauchy), non-asymptotic estimators (sample max), and very small samples
- Use Monte Carlo simulation to estimate any sampling distribution empirically, even when no closed-form formula exists — the foundation of §1.7 (bootstrap)
- Apply the plug-in principle to compute SEs from unknown population quantities (e.g. SÊ(X̄) = s/√n), and understand why plug-in SEs are themselves random and can be wrong in small samples
- Connect SE to the approximate 95% confidence interval θ̂ ± 1.96·SE(θ̂), and recognise when this approximation is honest (Gaussian sampling distribution) vs misleading (skewed, heavy-tailed, or non-asymptotic)
You have spent five sections building estimators: method-of-moments, maximum likelihood, Fisher-information-aware versions of both. Every one of them is a function of the data. Every one of them is, therefore, a random variable. And every one of them has, sitting behind it, a probability distribution — a sampling distribution — that describes how the estimator varies if you redrew the data many times. §1.6 makes that distribution the explicit object of study, defines the standard error as its natural one-number summary, and shows when the textbook √n rule applies and when it does not.
This is the conceptual hinge between Part 1 and Part 3. Part 1 has been about point estimates: a single number θ̂ for θ. Part 3 will be about interval estimates: a confidence statement around θ̂. The standard error is the bridge. To get from a bare point estimate to "with 95% confidence θ lies in [1.71, 1.95]" you need a measure of how much θ̂ would wobble across independent samples, and that is exactly what SE measures. The bootstrap (§1.7) is the universal engine for estimating SEs without closed-form formulas; this section lays the conceptual groundwork.
The sampling distribution
Fix a population with parameter θ: say a Normal with mean μ and known variance σ², or a Bernoulli with success probability p, or anything else with a parameter you would like to estimate. Now imagine the following experiment, performed once on planet A:
- Draw an IID sample X₁, …, Xₙ from the population.
- Compute the estimator θ̂ = θ̂(X₁, …, Xₙ): sample mean, sample median, MLE, whatever.
On planet A you get one number, θ̂_A. On parallel planet B, the same population produces a different sample and so a different estimate θ̂_B. Across all the parallel planets (infinitely many independent draws of size n from the same population) the estimator θ̂ has a distribution. That is the sampling distribution.
Formally: if θ̂ = θ̂(X₁, …, Xₙ) is a measurable function of X₁, …, Xₙ and the Xᵢ are IID from a population indexed by θ, then the sampling distribution of θ̂ is the probability distribution of the random variable θ̂(X₁, …, Xₙ), typically written P_θ(θ̂ ∈ A) for sets A.
The sampling distribution depends on three things:
- The population: Normal, Exponential, Cauchy, Lognormal, Uniform — different populations give different sampling distributions even for the same estimator.
- The sample size n: larger n usually means a tighter sampling distribution (smaller SE).
- The estimator: sample mean, median, max, variance — different estimators have completely different sampling distributions, with different shapes, biases, and SEs.
The sampling distribution is what every claim of the form "my estimate is roughly correct to within ±x" is implicitly about. It is what is being summarised when a paper reports an "SE" or builds a 95% CI. Until you can see the sampling distribution clearly, those reports are opaque.
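The parallel-planets experiment can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper names `sampling_distribution` and `normal_sample`, the Normal(0, 1) population, n = 25 and the 2000 replicates are all assumptions chosen for the example, not anything prescribed by the text.

```python
import random
import statistics


def sampling_distribution(draw_sample, estimator, n, reps, seed=0):
    """Return `reps` values of `estimator`, each computed on a fresh sample of size n."""
    rng = random.Random(seed)
    return [estimator(draw_sample(rng, n)) for _ in range(reps)]


def normal_sample(rng, n, mu=0.0, sigma=1.0):
    """One IID sample of size n from Normal(mu, sigma^2) (illustrative population)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Each entry of `estimates` plays the role of one "planet".
estimates = sampling_distribution(normal_sample, statistics.fmean, n=25, reps=2000)

print("centre of the sampling distribution:", round(statistics.fmean(estimates), 3))
print("spread of the sampling distribution:", round(statistics.stdev(estimates), 3))
```

Swapping `statistics.fmean` for `statistics.median` or `max`, or `normal_sample` for another generator, changes the estimator or the population while leaving the recipe untouched.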
The standard error
The sampling distribution is a full probability distribution: it has a mean, a variance, a skewness, possibly heavy tails. For a one-number summary of its spread, the natural choice is the standard deviation of the sampling distribution. Statisticians call this the standard error of the estimator:
SE(θ̂) = SD(θ̂) = √Var(θ̂), with the variance taken over the sampling distribution of θ̂.
Two-sentence definition. Worth memorising.
The standard error is not the standard deviation of the population, and not the standard deviation of a single observation. It is the standard deviation of the random variable θ̂ as it varies across independent samples. The two are routinely confused; the distinction is everything in inference. Concretely:
- SD of a single observation: how spread out are individual values? For Normal(μ, σ²) the SD is σ. The SD of one data point in your sample is just σ (estimated by s if you do not know σ).
- SE of the sample mean: how spread out are X̄ values across independent samples of size n? This is σ/√n, which is smaller than σ for every n > 1.
Doubling n cuts the SE by a factor of √2; quadrupling n halves it. The population SD does not depend on n at all: it is a fixed property of the population.
The textbook rule: SE(X̄) = σ/√n
For the IID sample mean X̄ = (1/n) Σᵢ Xᵢ with Var(Xᵢ) = σ²:
Var(X̄) = (1/n²) Σᵢ Var(Xᵢ) = σ²/n, so SE(X̄) = σ/√n.
Two-line derivation: variance is additive for independent random variables, and dividing by n divides the variance by n². The √n scaling is the single most important rate in elementary statistics. It explains why doubling the sample size only modestly tightens an estimate, and why polls of 1000 voters give SEs of at most about 1.6 percentage points (√(0.25/1000) ≈ 0.016) regardless of country population.
If the population variance is unknown, we substitute the sample standard deviation s to form the plug-in SE:
SÊ(X̄) = s/√n.
Note the hat over SE: it signals that the SE is now itself an estimate, with its own sampling variability. For large n this plug-in is excellent; for small n, the discrepancy between s and σ contributes nontrivial extra uncertainty, which is what motivates the Student t-distribution corrections you will meet in Part 3.
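A small simulation can make both points concrete: the SD of many sample means matches σ/√n, and the plug-in value s/√n itself varies from sample to sample. The sketch below is illustrative only; σ = 2, n = 10 and 4000 replicates are assumed values, not taken from the text.

```python
import math
import random
import statistics

rng = random.Random(1)
sigma, n, reps = 2.0, 10, 4000  # illustrative values (assumptions)

sample_means = []
plug_in_ses = []
for _ in range(reps):
    x = [rng.gauss(0.0, sigma) for _ in range(n)]
    sample_means.append(statistics.fmean(x))
    plug_in_ses.append(statistics.stdev(x) / math.sqrt(n))  # s / sqrt(n) for this sample

true_se = sigma / math.sqrt(n)                   # textbook rule
empirical_se = statistics.stdev(sample_means)    # SD of the sampling distribution
plug_in_spread = statistics.stdev(plug_in_ses)   # how much s/sqrt(n) itself wobbles

print(f"sigma/sqrt(n)       = {true_se:.3f}")
print(f"SD of sample means  = {empirical_se:.3f}")
print(f"plug-in SE: average = {statistics.fmean(plug_in_ses):.3f}, SD = {plug_in_spread:.3f}")
```

At n = 10 the plug-in SE scatters noticeably around σ/√n, which is the extra uncertainty the t-distribution accounts for.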
SE for other estimators
The sample mean is the headline example, but every estimator has its own SE formula. Four you should know:
Sample median (for continuous symmetric populations). Let m denote the population median and f(m) the density at m. Then asymptotically
SE(median) ≈ 1 / (2 f(m) √n).
For Normal(μ, σ²) data, f(m) = 1/(σ√(2π)) at m = μ, giving SE(median) ≈ √(π/2)·σ/√n ≈ 1.2533 σ/√n. That is about 25% larger than the SE of the mean. So the mean is more efficient when Normality holds, but the median wins handily once outliers or heavy tails enter the picture (Part 1 §1.8 develops robust estimation in detail).
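The 25% figure can be checked empirically. The following sketch compares the SD of sample medians with the SD of sample means on the same Normal samples; n = 101 and 4000 replicates are assumed, illustrative settings, and the ratio at finite n is only expected to be near the asymptotic value √(π/2).

```python
import math
import random
import statistics

rng = random.Random(2)
sigma, n, reps = 1.0, 101, 4000  # illustrative values (assumptions)

means, medians = [], []
for _ in range(reps):
    x = [rng.gauss(0.0, sigma) for _ in range(n)]
    means.append(statistics.fmean(x))
    medians.append(statistics.median(x))

se_mean = statistics.stdev(means)
se_median = statistics.stdev(medians)
ratio = se_median / se_mean
asymptotic_ratio = math.sqrt(math.pi / 2)  # the 1.2533 constant in the text

print(f"empirical SE(median)/SE(mean) = {ratio:.3f}")
print(f"asymptotic value sqrt(pi/2)   = {asymptotic_ratio:.4f}")
```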
Sample variance (for Gaussian populations).
SE(s²) = σ² √(2/(n − 1)).
Same rate as the mean, but the formula holds only for Gaussian populations. For non-Gaussian populations the SE depends on the population's fourth central moment μ₄, and can be much larger than the Gaussian formula suggests. Lognormal populations have particularly heavy fourth moments: the estimator there is much noisier than the formula predicts at any practical n.
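To see how badly the Gaussian formula can understate the noise, one can compare it with an empirical SE of s² for a Normal and a Lognormal population. This is a hedged sketch: the function names, n = 50, 3000 replicates and the Lognormal(0, 1) choice are assumptions for illustration, and the Lognormal result is itself noisy because of the heavy right tail.

```python
import math
import random
import statistics


def empirical_se_of_variance(draw, n, reps, rng):
    """SD of the sample variance s^2 across `reps` independent samples of size n."""
    variances = [statistics.variance([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return statistics.stdev(variances)


def gaussian_formula(sigma2, n):
    """SE(s^2) under Normality: sigma^2 * sqrt(2 / (n - 1))."""
    return sigma2 * math.sqrt(2.0 / (n - 1))


rng = random.Random(3)
n, reps = 50, 3000  # illustrative values (assumptions)

normal_emp = empirical_se_of_variance(lambda r: r.gauss(0.0, 1.0), n, reps, rng)
lognormal_emp = empirical_se_of_variance(lambda r: r.lognormvariate(0.0, 1.0), n, reps, rng)

lognormal_sigma2 = (math.e - 1.0) * math.e  # population variance of Lognormal(0, 1)
normal_ratio = normal_emp / gaussian_formula(1.0, n)
lognormal_ratio = lognormal_emp / gaussian_formula(lognormal_sigma2, n)

print(f"Normal:    empirical / Gaussian formula = {normal_ratio:.2f}")
print(f"Lognormal: empirical / Gaussian formula = {lognormal_ratio:.2f}")
```

A ratio near 1 for the Normal case and well above 1 for the Lognormal case is the pattern the paragraph above describes.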
Sample correlation coefficient r (for bivariate Normal data, asymptotically).
SE(r) ≈ (1 − ρ²)/√n.
Highly nonlinear in ρ: when ρ is near zero the SE is roughly 1/√n, but as |ρ| approaches 1 the SE shrinks toward zero. The dependence on the true ρ is also why Fisher's z-transform is the standard tool for confidence intervals on correlations (Part 3 covers this in detail).
Sample skewness (for Gaussian populations).
SE(skewness) ≈ √(6/n).
This SE is large at any practical n: for n = 100, for example, it is about 0.24, so a sample skewness of a few tenths is barely distinguishable from zero. Tests of normality based on sample skewness have low power at moderate n for this reason.
See sampling distributions in action
Reading SE formulas is one thing; seeing a sampling distribution materialise from Monte Carlo simulation is another. The widget below lets you pick a population (Normal, Exponential, Cauchy, Lognormal, Uniform), an estimator (mean, median, variance, max, log of the mean), and a sample size n. We draw 2000 independent samples of size n, compute the estimator on each, and histogram the result. The Normal overlay has the same mean and SE as the empirical histogram — when the CLT applies, the overlay fits; when it does not, the mismatch is obvious.
Things to verify:
- Sample mean, Normal: Gaussian-looking at every n. The overlay always fits. SE shrinks like 1/√n.
- Sample mean, Cauchy: the histogram is Cauchy-shaped at every n — heavy tails, slow decay. The Normal overlay never fits. SE does not shrink with n; "empirical SE" reported by the widget keeps wandering because Cauchy has no finite variance. This is the CLT failure mode.
- Sample mean, Lognormal, small n: at n = 5 the histogram is visibly right-skewed; the Normal overlay does not fit the tail. At n = 100 the histogram is much closer to Gaussian: the CLT is finally winning over Lognormal's skewness.
- Sample median, Cauchy: even though the sample mean fails for Cauchy, the sample median IS asymptotically Normal. The histogram is Gaussian-shaped from about n = 20 onward.
- Sample max, Uniform(0,1): bunched against the upper bound with an exponential-shaped left tail. The Normal overlay does not fit. SE shrinks like 1/n, NOT 1/√n; the second widget will make this visible.
- Sample variance, Lognormal: highly skewed; takes much larger n to look Gaussian than the sample mean. The empirical SE is much larger than the Gaussian-based formula predicts.
The CLT side door
The Central Limit Theorem (Part 0 §0.7) gave a precise statement: if X₁, …, Xₙ are IID with mean μ and finite variance σ², then
√n (X̄ₙ − μ)/σ → N(0, 1) in distribution as n → ∞.
In §1.6's language: the sampling distribution of the sample mean is approximately Normal for large n, regardless of the population distribution — provided the population has finite variance. That last clause is the conditional clause many introductory textbooks gloss over, and it is exactly where the widget shows the CLT failing for Cauchy (no finite variance) and shining for Normal/Exponential/Lognormal (finite variance, however skewed).
This is the side door into honest inference. Even if your population is highly skewed (lognormal income data, exponential survival times, log-uniform geological measurements), the sample mean of a reasonably-sized sample will have a sampling distribution that looks Gaussian. That is why SE-based 95% confidence intervals work even when the population is not Normal — for the mean, and only for the mean, and only at large enough n. (How large is "large enough" depends on the population skewness; rule of thumb 30 for mild skew, 300 for serious lognormal-style skew. The widget shows you exactly.)
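How quickly the skewness washes out can be measured directly. The sketch below estimates the skewness of the sampling distribution of the mean for an Exponential(1) population, whose own skewness is 2, at two sample sizes. The population, the sizes n = 5 and n = 100, the replicate count and the helper `skewness` are assumptions made for illustration.

```python
import random
import statistics


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    m = statistics.fmean(values)
    m2 = statistics.fmean([(v - m) ** 2 for v in values])
    m3 = statistics.fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


rng = random.Random(4)
reps = 4000  # illustrative replicate count (assumption)

skew_by_n = {}
for n in (5, 100):
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    skew_by_n[n] = skewness(means)

for n, value in skew_by_n.items():
    # For a sample mean, theory gives (population skewness) / sqrt(n).
    print(f"n = {n:3d}: skewness of sampling distribution = {value:.2f}, theory = {2 / n ** 0.5:.2f}")
```

The skewness of the mean's sampling distribution falls like 1/√n, so a more skewed population needs a proportionally larger n before the Gaussian overlay fits.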
When the CLT does NOT help
Three classes of failure to keep in mind:
- Heavy-tailed populations. Cauchy is the canonical example: it has no mean and no variance. The CLT does not apply; the sample mean stays Cauchy-distributed at every n. Inference on the mean is, frankly, hopeless: there is no "average Cauchy variable". The median IS asymptotically Normal here, which is why robust methods (§1.8) replace the mean with the median for heavy-tailed data. Stable-distribution machinery (Mandelbrot, Fama 1965) generalises the CLT to populations whose tails decay like power laws, but for everyday statistics the rule is simple: if your population does not have finite variance, do not summarise it with a mean.
- Non-asymptotic estimators. The sample max of Uniform(0, 1) data is the textbook example. Its sampling distribution is Beta(n, 1) on [0, 1]; it converges to the upper bound 1 at rate 1/n (NOT 1/√n); and the limiting distribution after rescaling is exponential, not Gaussian. The CLT simply does not apply to extreme-value estimators; the relevant theory is extreme-value theory (Fisher-Tippett-Gnedenko 1928, 1943).
- Very small samples. Even with finite variance, n = 5 is not "asymptotic". Skewness in the population persists in the sampling distribution of X̄ at small n; the Gaussian overlay sits asymmetrically over the histogram. Confidence intervals based on the Normal approximation will under- or over-cover by visible margins. Rule of thumb: the skewness of X̄ is the population skewness divided by √n, so trust the Gaussian approximation when n is large relative to the squared population skewness (n / skewness² ≫ 1).
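The first failure mode is easy to reproduce. The sketch below draws Cauchy samples and tracks the interquartile range (IQR) of the sample mean and of the sample median across replicates; IQR is used instead of SD because the SD of a Cauchy mean does not exist. The helpers `cauchy` and `iqr`, the sizes n = 25 and n = 400, and the 1000 replicates are illustrative assumptions.

```python
import math
import random
import statistics


def cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Interquartile range, a spread measure that exists even without a variance."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


rng = random.Random(5)
reps = 1000  # illustrative replicate count (assumption)

spread = {}
for n in (25, 400):
    means, medians = [], []
    for _ in range(reps):
        x = [cauchy(rng) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    spread[n] = (iqr(means), iqr(medians))

for n, (iqr_mean, iqr_median) in spread.items():
    print(f"n = {n:3d}: IQR of sample mean = {iqr_mean:.2f}, IQR of sample median = {iqr_median:.3f}")
```

The IQR of the mean stays near 2 (the IQR of a single standard Cauchy draw) at both sample sizes, while the IQR of the median shrinks as n grows.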
How fast does SE shrink with n?
The √n rule says SE ∝ 1/√n, i.e. slope −1/2 on log-log axes. That is the DEFAULT scaling, the one you should expect for finite-variance sample means. But it is not the only possibility, and seeing the alternatives sharply is the cleanest way to drive the rule home.
The widget below plots log₁₀(SE) versus log₁₀(n) for several (estimator, population) pairs. The slope of each line is reported in the legend.
Things to verify:
- Sample mean | Normal: slope ≈ −0.5. The textbook 1/√n rule, exact for Normal data.
- Sample mean | Lognormal: slope ≈ −0.5 at large n, with measurable departures at small n (the line bends because Lognormal skewness leaks into the sampling distribution before the CLT bites).
- Sample mean | Cauchy: slope ≈ 0. The line is flat. SE does not shrink. CLT failure made visible by the slope.
- Sample median | Normal: slope ≈ −0.5, but the line sits ABOVE the mean's line — the median's SE is about 25% larger than the mean's at every n.
- Sample max | Uniform(0,1): slope ≈ −1.0. The line is twice as steep as the textbook rule. This is the canonical SUPER-EFFICIENT estimator — faster than 1/√n.
- Toggle the sample-variance series on: for Normal data the slope is ≈ −0.5, like the mean, but the prefactor is √2 σ² rather than σ, so for a unit-variance population the line is shifted up by about 0.15 on the log10-SE axis (a factor of ≈ √2 in linear units).
The takeaway: the √n rule is the headline default, but it is conditional on (a) finite population variance, (b) the estimator being a sample mean or a similarly "regular" statistic. Once you step outside those conditions, the rate can be slower (Cauchy mean — no convergence), faster (Uniform max — rate 1/n), or unchanged-but-with-a-different-prefactor (Normal median — same rate, larger SE).
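The slopes can be reproduced without the widget. This sketch estimates SE by simulation at a few sample sizes and fits a least-squares line to log₁₀(SE) against log₁₀(n), for the Normal mean and the Uniform(0, 1) max. The grid of n values, the replicate count and the helper names are assumptions for illustration, not the widget's actual settings.

```python
import math
import random
import statistics


def empirical_se(estimator, draw, n, reps, rng):
    """SD of `estimator` across `reps` independent samples of size n."""
    values = [estimator([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return statistics.stdev(values)


def loglog_slope(ns, ses):
    """Least-squares slope of log10(SE) against log10(n)."""
    xs = [math.log10(n) for n in ns]
    ys = [math.log10(se) for se in ses]
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den


rng = random.Random(6)
ns, reps = [10, 30, 100, 300], 1500  # illustrative grid (assumption)

se_mean = [empirical_se(statistics.fmean, lambda r: r.gauss(0.0, 1.0), n, reps, rng) for n in ns]
se_max = [empirical_se(max, lambda r: r.random(), n, reps, rng) for n in ns]

slope_mean = loglog_slope(ns, se_mean)
slope_max = loglog_slope(ns, se_max)

print(f"sample mean | Normal  : slope = {slope_mean:.2f}")
print(f"sample max  | Uniform : slope = {slope_max:.2f}")
```

On this modest grid the fitted slope for the max tends to come out slightly shallower than −1, because the exact SE only approaches 1/n as n grows.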
The empirical SE — and a preview of the bootstrap
Both widgets above estimate SE by Monte Carlo: draw R independent samples from a KNOWN generator, compute θ̂ on each, take the SD of the R values. This gives you the empirical SE, a direct, formula-free estimate of SE(θ̂).
That recipe — generate, compute, summarise — is the foundation of the bootstrap, which §1.7 develops in full. The trick the bootstrap pulls is to replace the "draw from the true population" step with "resample with replacement from the observed data". You never need to know the true population; the empirical CDF of your sample is treated AS IF it were the population. The bootstrap SE is then exactly what the widget computes — but with synthetic resamples instead of synthetic populations.
The result, due to Efron (1979), is that SE for ANY estimator — including ones with no closed-form formula — can be estimated from a single dataset, no parametric assumptions required. The exact algorithm and its theoretical underpinning land in §1.7; §1.6's point is to motivate it: every SE formula in this section can be replaced by an empirical SE from a Monte Carlo simulation, and the bootstrap is the version of that Monte Carlo that uses your data instead of a known generator.
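As a preview only, the resampling idea can be sketched as follows. This is not the full algorithm of §1.7, just a minimal nonparametric bootstrap under assumed settings: the function name `bootstrap_se`, the 2000 resamples, and the synthetic Exponential dataset of size 50 standing in for "your data" are all hypothetical choices.

```python
import math
import random
import statistics


def bootstrap_se(data, estimator, n_boot=2000, seed=0):
    """SD of `estimator` over resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    replicates = [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


# One observed dataset (synthetic here, purely for illustration).
data_rng = random.Random(7)
data = [data_rng.expovariate(1.0) for _ in range(50)]

se_mean_boot = bootstrap_se(data, statistics.fmean)
se_mean_plugin = statistics.stdev(data) / math.sqrt(len(data))  # s / sqrt(n) for comparison
se_median_boot = bootstrap_se(data, statistics.median)          # no simple plug-in formula needed

print(f"bootstrap SE of the mean   = {se_mean_boot:.3f}")
print(f"plug-in s/sqrt(n)          = {se_mean_plugin:.3f}")
print(f"bootstrap SE of the median = {se_median_boot:.3f}")
```

For the mean, the bootstrap SE should land close to s/√n, which serves as a sanity check before trusting it on estimators with no formula.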
From SE to confidence interval — a sneak preview
The standard error is the natural one-number summary of a sampling distribution's spread. It is not, on its own, a confidence interval, but it is the input to one. The simplest CI construction is the Wald interval:
θ̂ ± z_{1−α/2} · SE(θ̂),
where z_{1−α/2} is the standard-Normal quantile (e.g. 1.96 for a 95% CI). The reasoning: IF the sampling distribution of θ̂ is approximately Normal with mean θ and SD SE(θ̂), then the interval θ̂ ± 1.96·SE(θ̂) covers θ with probability about 95% across independent samples. Plug in SÊ(θ̂) for the unknown SE(θ̂) and you have the working recipe.
The recipe is honest exactly when the assumption is honest: when the sampling distribution actually looks Gaussian. For the sample mean with finite-variance populations and moderate n, this is fine. For skewed sampling distributions at small n, for heavy-tailed populations, or for non-asymptotic estimators like the sample max, the Wald interval can systematically under- or over-cover. Part 3 develops bootstrap intervals (BCa, percentile) that handle these cases more honestly. For now: SE is the input; CI is the construction; the Wald form is the simplest version of that construction.
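Coverage can be checked by simulation: build the Wald interval for the mean on many samples and count how often it contains the true mean. The sketch below does this for a Normal population at n = 50 and a Lognormal(0, 1) population at n = 10. The function name `wald_coverage`, the populations, the sample sizes and the 2000 replicates are assumptions chosen to contrast an easy case with a hard one.

```python
import math
import random
import statistics

Z95 = statistics.NormalDist().inv_cdf(0.975)  # standard-Normal quantile, about 1.96


def wald_coverage(draw, true_mean, n, reps, rng):
    """Fraction of samples whose interval mean +/- Z95 * s/sqrt(n) contains true_mean."""
    hits = 0
    for _ in range(reps):
        x = [draw(rng) for _ in range(n)]
        centre = statistics.fmean(x)
        half_width = Z95 * statistics.stdev(x) / math.sqrt(n)  # plug-in SE
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / reps


rng = random.Random(8)
cov_normal = wald_coverage(lambda r: r.gauss(0.0, 1.0), 0.0, n=50, reps=2000, rng=rng)
cov_lognormal = wald_coverage(
    lambda r: r.lognormvariate(0.0, 1.0), math.exp(0.5), n=10, reps=2000, rng=rng
)

print(f"Normal, n = 50:    coverage = {cov_normal:.3f}")
print(f"Lognormal, n = 10: coverage = {cov_lognormal:.3f}")
```

The nominal level is 95% in both cases; the skewed, small-sample case falls visibly short of it.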
The plug-in principle
Most SE formulas in this section depend on unknown population quantities: σ in σ/√n, the density at the median in the median's SE, the population fourth moment in the sample variance's SE. The plug-in principle says: estimate the unknown quantity from the data and substitute it into the SE formula.
For the mean: plug in s for σ to get SÊ(X̄) = s/√n.
For the median: estimate the density at the median using kernel density estimation (Part 8 §8.5), then substitute. (In practice you almost always just use the bootstrap here.)
For the sample variance: the Gaussian-based formula plugs in s² for σ² AND assumes Normality. Non-Gaussian populations require either the fourth-central-moment plug-in or, much more commonly, the bootstrap.
The principle is sound asymptotically: as n → ∞, plug-in SEs converge to the truth at rate 1/√n. But at small n, plug-in SEs are themselves random and can be biased. The bootstrap (§1.7) is in some ways a more honest plug-in: it substitutes the entire empirical CDF for the unknown population CDF, instead of substituting just a single moment.
Honest scope
§1.6 is deliberately conceptual. The mechanics of producing actual confidence intervals — Wald intervals, profile-likelihood intervals, bootstrap intervals (percentile, basic, BCa) — are in Part 3. The bootstrap and jackknife as resampling engines are in §1.7. Asymptotic delta-method derivations for smooth transformations of estimators are in §1.9. The robust-estimator alternatives that make heavy-tailed populations tractable are in §1.8.
What §1.6 owes you is the conceptual foundation: a sampling distribution is the distribution of an estimator across independent samples; standard error is the SD of that distribution; for finite-variance sample means the SE scales as 1/√n; for many other estimators or populations the scaling and the shape are different; and you can ALWAYS estimate the sampling distribution by Monte Carlo when no formula is available.
Try it
- In the sampling-dist simulator, pick Normal(0,1) and the sample mean. Slide n from 5 to 1000 and watch the Normal overlay tighten while always fitting the histogram. Note the "Ratio empirical/theoretical" row — it stays near 1.0 across all n.
- Same widget, switch to Cauchy(0,1) with the sample mean. Crank n up to 1000. The Normal overlay still does not fit; the empirical SE refuses to shrink. The verdict line reports "CLT FAILS". This is the canonical CLT failure.
- Switch the estimator to MEDIAN while keeping the population at Cauchy(0,1). The histogram is suddenly Gaussian-shaped even at n = 20. The median rescues you when the mean fails — that is why robust estimators exist.
- Switch to Lognormal(0,1) with the sample mean. At n = 5 the histogram is visibly right-skewed; at n = 100 it is much closer to Gaussian; at n = 500 the overlay fits well. This is the CLT in motion — slow on heavy right tails.
- Switch to Uniform(0,1) and the sample MAX. The histogram is bunched against 1 with an exponential left tail. The Normal overlay does not fit at any n. Note the verdict: "EXTREME-VALUE LIMIT".
- In the SE vs n widget, leave all five default series on and look at the fitted slopes. Sample mean | Normal should be near −0.50. Sample mean | Cauchy should be near 0.00 (line flat). Sample max | Uniform should be near −1.00 (line twice as steep). Press "Re-run simulations" a few times and watch the slopes wobble within ±0.05 of the expected values.
- Toggle the sample-variance | Normal series on. Slope ≈ −0.5 like the mean, but the line is shifted up (larger SE prefactor). This is the formula made visible.
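- Pen-and-paper: derive Var(X̄) = σ²/n from Var(Xᵢ) = σ² and the independence of the Xᵢ. Two lines.
- Pen-and-paper: the textbook says SE(median) ≈ 1.2533 σ/√n for Normal data. Derive it from the general formula SE(median) ≈ 1/(2 f(m) √n) with f(m) = 1/(σ√(2π)) at m = μ.
- Pen-and-paper, harder: for Uniform(0, θ) data, derive the SE of the sample max, SE(max) = θ·√(n / ((n + 1)²(n + 2))). Show that SE(max) ≈ θ/n for large n: slope −1 on log-log, not −1/2.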
Pause and reflect: the sampling distribution lives entirely in your imagination — you never actually have many independent samples in real research; you have one. So in what sense is the SE a real, measurable quantity? Where in the actual workflow does the sampling distribution enter — and where, for that matter, does the bootstrap rescue you from never being able to draw independent replicates?
What you now know
The sampling distribution of an estimator θ̂ is the distribution of θ̂ across independent samples of size n from a fixed population. It depends on the population, the sample size, and the estimator. The standard error is its standard deviation, the natural one-number summary of how much θ̂ would wobble across replicates. SE and population SD are different quantities and must not be confused.
For the sample mean of a finite-variance population, SE(X̄) = σ/√n. This √n scaling is the headline default but not the only possibility. The Central Limit Theorem extends the Gaussian shape to the sample mean of any finite-variance population at large enough n, which is what makes the standard Wald confidence interval work even for non-Gaussian data. It fails for heavy-tailed populations (Cauchy: no finite variance, no convergence), non-asymptotic estimators (Uniform max: rate 1/n, not 1/√n), and very small samples.
Empirical SE — drawing many synthetic samples, computing the estimator on each, taking the SD — works for ANY estimator and ANY population. §1.7 develops the bootstrap as the practical version of this idea using only the observed data. The plug-in principle (substitute sample estimates for unknown population quantities in SE formulas) gives closed-form SEs when they exist. §1.6 has laid the conceptual foundation; §1.7 builds the engine; Part 3 turns the engine into honest confidence intervals.
References
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 5 develops the CLT operationally; Chapter 7 covers standard errors and asymptotic normality for general estimators.)
- Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 5: "Properties of a Random Sample" — the foundational chapter on sampling distributions, plus the canonical statements of the CLT and the delta method.)
- Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Chapter 9 covers asymptotic distributions of estimators and standard-error machinery in depth.)
- Efron, B., Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall. (The standard reference for the bootstrap; the §1.7 preview here points at Chapters 1-6, which develop the SE-via-resampling story this section motivates.)
- DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer. (For readers who want the rigorous version of "the CLT for the sample mean" and its extensions, including stable-distribution generalisations for heavy-tailed populations.)
- Lehmann, E.L. (1999). Elements of Large-Sample Theory. Springer. (A graduate-level companion to Cox-Hinkley; particularly clear on plug-in estimators and the delta method as a general SE machine.)
- Efron, B. (1979). "Bootstrap methods: another look at the jackknife." Ann. Stat. 7(1), 1-26. (The bootstrap's founding paper; cited here because the §1.7 preview rests on it.)