Estimators and their properties

Part 1 — Estimation

Learning objectives

Define an estimator as a function of the sample, and recognise it as a random variable
State the three classical desiderata — unbiasedness, consistency, efficiency — and what each one rules out
Decompose mean squared error as bias² + variance and use the decomposition to compare estimators
Explain why a biased estimator can dominate an unbiased one in MSE, and preview the James–Stein result
Connect §1.1 to the rest of Part 1: MoM (§1.2), MLE (§1.3), Fisher information (§1.4), bias–variance (§1.5)

You spent Part 0 learning the rules a probability has to obey and the distributions you will meet in the wild. Part 1 turns the question around. The textbook so far assumed you knew the distribution that generated your data — that you knew its mean, its variance, its parameters. In real research you almost never know those. You have data, and you have a parametric model for the data, and you have to estimate the parameters of that model from the data alone. This section gives you the vocabulary and the three quality criteria for that job, plus a hard-headed warning about which of them actually matters.

By the end of this section you will know what an estimator is, why it is more useful to think of an estimator as an entire distribution than as a single number, and why the three properties most introductory courses tell you to chase — unbiasedness, consistency, and efficiency — trade off against each other in ways that often make biased estimators the right answer.

What is an estimator?

You have a sample $X_1, X_2, \ldots, X_n$ — n independent draws from some distribution. The distribution has a parameter you care about, call it $\theta$ (the mean, the variance, the rate of an exponential, the proportion that responds to treatment, whatever you want). An estimator of $\theta$ is any function of the data:

\hat{\theta} = T(X_1, X_2, \ldots, X_n).

The notation is intentionally permissive: anything that takes the sample as input and produces a guess for $\theta$ as output is an estimator. The sample mean $\bar{X} = \frac{1}{n} \sum X_i$ is an estimator. The sample median is an estimator. The first observation $X_1$, used alone, is an estimator. The constant function that ignores the data and always returns 17 is an estimator (a terrible one). Whether an estimator is any good is a separate question; the definition is deliberately broad so that the criteria you apply afterward do the discrimination.

The single most important thing to internalise about $\hat{\theta}$ is this: it is a random variable. The data $X_1, \ldots, X_n$ are random variables, so any function of them is a random variable too. Two different samples from the same population give two different values of $\hat{\theta}$. That means $\hat{\theta}$ has its own distribution, the sampling distribution, with its own mean, its own variance, and its own shape. Almost every later concept in this textbook is a statement about that sampling distribution.
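A minimal sketch of that idea in Python, using only the standard library. The function names, the Normal(0, 1) population, the sample size of 30 and the seed are illustrative assumptions, not part of the text above. Each estimator is just a function of the sample, and two samples from the same population give two different values (except for the constant estimator).

```python
import random
import statistics

# Four estimators of a population mean: each maps a sample to one number.
def sample_mean(xs):
    return statistics.fmean(xs)

def sample_median(xs):
    return statistics.median(xs)

def first_observation(xs):
    return xs[0]

def always_17(xs):
    return 17.0  # ignores the data entirely

def draw_sample(rng, n, mu=0.0, sigma=1.0):
    """Draw n independent Normal(mu, sigma^2) observations (assumed population)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_a = draw_sample(rng, 30)
sample_b = draw_sample(rng, 30)

# Same estimator, two samples: the two printed values differ.
for estimator in (sample_mean, sample_median, first_observation, always_17):
    print(estimator.__name__, estimator(sample_a), estimator(sample_b))
```

Only always_17 prints the same value twice: it has a degenerate sampling distribution with zero variance, which is exactly why low variance alone does not make an estimator good.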

An estimator is a distribution

The widget below makes that statement physical. Pick a population (we will estimate its mean). Pick two estimators to compare. Pick a sample size n. Click Run 1500 replicates: the widget draws 1500 independent samples of size n from your chosen population and computes both estimators on each. The two histograms are the resulting sampling distributions.

Three things to look for. First, the orange dashed line marks the truth $\theta$, the parameter value the estimators are trying to hit. Second, the blue solid line marks the mean of each estimator across the 1500 replicates. The horizontal distance from blue to orange is the (estimated) bias. Third, the width of each histogram reflects the variance: the spread you see is the standard deviation, the square root of the variance. An estimator that puts most of its mass on the truth (narrow histogram centred on the orange line) is doing well; an estimator that is centred elsewhere (orange line in the tail) is biased; an estimator that is wide is high-variance, even if it averages to the right answer.
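If you want to reproduce the widget's numbers offline, the following is a rough sketch under stated assumptions. It is not the widget's actual code, and the helper names simulate and summarise, the Normal(0, 1) population and the seed are invented for illustration. It draws 1500 samples, applies an estimator to each, and summarises the resulting sampling distribution by its empirical bias, variance and MSE.

```python
import random
import statistics

def simulate(estimator, n, replicates=1500, mu=0.0, sigma=1.0, seed=1):
    """Return the estimator's value on each of `replicates` Normal samples."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(replicates)]

def summarise(estimates, truth):
    """Empirical bias, variance and MSE of a list of estimates."""
    centre = statistics.fmean(estimates)                  # the blue line
    bias = centre - truth                                 # blue minus orange
    variance = statistics.pvariance(estimates, centre)    # spread of the histogram
    mse = statistics.fmean((e - truth) ** 2 for e in estimates)
    return bias, variance, mse

truth = 0.0
for name, est in [("mean", statistics.fmean), ("median", statistics.median)]:
    bias, variance, mse = summarise(simulate(est, n=30), truth)
    print(f"{name:>6}: bias={bias:+.4f} var={variance:.4f} mse={mse:.4f}")
```

Because both estimators are run with the same seed, they see the same 1500 samples, so any difference between the two rows comes from the estimators rather than from sampling luck.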

The three classical desiderata

Textbooks usually list three properties an estimator should have. Each one is a statement about the sampling distribution you just saw.

Unbiasedness. An estimator $\hat{\theta}$ is unbiased for $\theta$ if its expected value equals the truth:

E[\hat{\theta}] = \theta.

In the widget: the blue line sits exactly on the orange line. Three standard examples: the sample mean is unbiased for the population mean (since $E[\bar{X}] = \mu$); the sample median is unbiased for the median of a symmetric distribution; and the sample variance $\frac{1}{n-1}\sum (X_i - \bar{X})^2$ is unbiased for $\sigma^2$ (the $n-1$ correction is exactly what makes it so). Unbiasedness rules out estimators that systematically over- or underestimate the truth, a property worth having.
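The effect of the $n-1$ correction is easy to check numerically. The sketch below rests on illustrative assumptions: the function name, a Normal(0, 1) population so that $\sigma^2 = 1$, a deliberately small n = 5, and an arbitrary seed. It averages both versions of the variance estimator over many samples.

```python
import random

def average_variance_estimates(n, replicates=20000, sigma=1.0, seed=2):
    """Average the 1/(n-1) and 1/n variance estimators over many samples."""
    rng = random.Random(seed)
    total_unbiased = 0.0
    total_plugin = 0.0
    for _ in range(replicates):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        sum_sq = sum((x - mean) ** 2 for x in xs)
        total_unbiased += sum_sq / (n - 1)   # sample variance with the correction
        total_plugin += sum_sq / n           # same sum of squares, divided by n
    return total_unbiased / replicates, total_plugin / replicates

unbiased, plugin = average_variance_estimates(n=5)
print(f"1/(n-1) version averages {unbiased:.3f}; 1/n version averages {plugin:.3f}")
```

With $\sigma^2 = 1$ and n = 5, theory says the corrected version should average close to 1 and the uncorrected version close to $(n-1)/n = 0.8$, a systematic underestimate.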

Consistency. An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the truth as $n \to \infty$:

\hat{\theta}_n \xrightarrow{P} \theta \quad \text{as} \quad n \to \infty.

In the widget: as you increase n from 10 to 30 to 100 to 300, both histograms should tighten around the truth, provided the estimators are consistent. The "first observation" estimator never tightens: it uses a single data point, so the information it carries does not grow with n. Consistency is the law-of-large-numbers promise that more data eventually pays off. It is a large-sample property and says nothing about what happens at the n you actually have.
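A small simulation makes the contrast concrete. This is a sketch under assumptions: Normal(0, 1) data, 1000 replicates, and the helper names spread and first_observation are chosen here for illustration. It tracks the spread of each estimator's sampling distribution as n grows.

```python
import random
import statistics

def spread(estimator, n, replicates=1000, seed=3):
    """Empirical standard deviation of an estimator over Normal(0, 1) samples."""
    rng = random.Random(seed)
    estimates = [estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
                 for _ in range(replicates)]
    return statistics.pstdev(estimates)

def first_observation(xs):
    return xs[0]  # uses one data point, whatever n is

for n in (10, 30, 100, 300):
    sd_mean = spread(statistics.fmean, n)
    sd_first = spread(first_observation, n)
    print(f"n={n:>3}: sd(mean)={sd_mean:.3f}  sd(first obs)={sd_first:.3f}")
```

The sample mean's spread should fall roughly like $1/\sqrt{n}$, while the first-observation column stays near the population standard deviation of 1 at every n.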

Efficiency. Among unbiased estimators, the efficient one has the smallest variance. The narrowest histogram, among those centred on the orange line, wins. §1.4 will give you a lower bound on variance among unbiased estimators (the Cramér–Rao bound), which makes "as efficient as possible" a precise claim.
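As an illustration of efficiency, the sketch below compares the variances of two unbiased estimators of a normal mean, the sample mean and the sample median. The population, n = 30, the replicate count and the helper name variance_of are assumptions made for the example.

```python
import math
import random
import statistics

def variance_of(estimator, n=30, replicates=5000, seed=4):
    """Empirical variance of an estimator's sampling distribution (Normal(0, 1) data)."""
    rng = random.Random(seed)
    estimates = [estimator([rng.gauss(0.0, 1.0) for _ in range(n)])
                 for _ in range(replicates)]
    return statistics.pvariance(estimates)

var_mean = variance_of(statistics.fmean)
var_median = variance_of(statistics.median)
ratio = var_median / var_mean

print(f"var(mean)   = {var_mean:.4f}  (theory: 1/n = {1 / 30:.4f})")
print(f"var(median) = {var_median:.4f}")
print(f"ratio       = {ratio:.2f}  (large-n theory: pi/2 = {math.pi / 2:.2f})")
```

The ratio should land in the neighbourhood of $\pi/2 \approx 1.57$. That figure is a large-sample approximation, so do not expect an exact match at n = 30.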

All three properties are real, all three are useful, and a beginner is told to chase all three. The next subsection is the warning.

Unbiasedness is overrated

The three properties trade off. They do not all point in the same direction, and pursuing one can hurt the others. The most useful single fact in this entire section is the bias–variance decomposition of mean squared error:

\text{MSE}(\hat{\theta}) = E\!\left[(\hat{\theta} - \theta)^2\right] = \text{bias}(\hat{\theta})^2 + \text{var}(\hat{\theta}).

(Derivation: write $\hat{\theta} - \theta = (\hat{\theta} - E[\hat{\theta}]) + (E[\hat{\theta}] - \theta)$, square, and take expectations. The cross term vanishes because $E[\hat{\theta}] - \theta$ is a constant and $E[\hat{\theta} - E[\hat{\theta}]] = 0$.) MSE is the most common single-number summary of how good an estimator is: it is the expected squared distance from the truth, and it splits into exactly two pieces, bias squared and variance. They are the two ways an estimator can be wrong.
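Written out line by line, with the shorthand $m = E[\hat{\theta}]$ (a symbol introduced here only for brevity), the derivation reads:

```latex
\begin{aligned}
(\hat{\theta} - \theta)^2
  &= (\hat{\theta} - m)^2 + 2(\hat{\theta} - m)(m - \theta) + (m - \theta)^2, \\
E\!\left[(\hat{\theta} - \theta)^2\right]
  &= E\!\left[(\hat{\theta} - m)^2\right] + 2(m - \theta)\,E[\hat{\theta} - m] + (m - \theta)^2 \\
  &= \text{var}(\hat{\theta}) + 0 + \text{bias}(\hat{\theta})^2.
\end{aligned}
```

The factor $(m - \theta)$ comes out of the expectation because it is not random, and $E[\hat{\theta} - m] = 0$ by the definition of $m$.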

So if your goal is to minimise MSE (the squared distance from the truth, averaged over samples), then a biased estimator can dominate an unbiased one whenever the squared bias it introduces is smaller than the variance it saves. That is not a hypothetical: it is the norm in small samples, and it is what the second widget makes visible.

The bias–variance trade-off, geometrically

Consider the simplest possible non-trivial estimation problem: you observe one noisy measurement $\bar{X} \sim \text{Normal}(\theta, \sigma^2/n)$ of an unknown $\theta$. (Think of $\bar{X}$ as the sample mean of n iid observations with population variance $\sigma^2$.) The natural estimator is $\hat{\theta} = \bar{X}$: unbiased and easy to compute. Now consider the family of shrinkage estimators

\hat{\theta}(\alpha) = \alpha \bar{X}, \qquad \alpha \in [0, 1.2].

For $\alpha = 1$ this is the sample mean. For $\alpha = 0$ it is the constant 0 (zero variance, but a bias of $-\theta$, which is large whenever $\theta$ is far from 0). For $0 < \alpha < 1$ it is the sample mean pulled toward 0: biased (unless $\theta = 0$), but smaller in absolute value, and so smaller in variance.

The math is short enough to do in your head. Bias $= E[\alpha \bar{X}] - \theta = (\alpha - 1)\theta$. Variance $= \alpha^2 \sigma^2 / n$. MSE $= (\alpha - 1)^2 \theta^2 + \alpha^2 \sigma^2 / n$. Differentiate with respect to $\alpha$ and set the derivative to zero, $2(\alpha - 1)\theta^2 + 2\alpha \sigma^2 / n = 0$, which gives:

\alpha^* = \frac{\theta^2}{\theta^2 + \sigma^2/n} < 1.

Whenever there is any sampling noise at all (whenever $\sigma^2/n > 0$) the MSE-optimal shrinkage factor is strictly less than 1, which means that for any $\theta \neq 0$ the MSE-optimal member of this family is biased. The unbiased sample mean is provably suboptimal in MSE within the family. One caveat: $\alpha^*$ depends on the unknown $\theta$, so it is a benchmark rather than an estimator you can compute from data. The widget shows how much MSE you save as a function of how much you shrink, and the percentage saved at the optimum gets larger as the sampling noise grows relative to $\theta^2$.
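The formulas above are simple enough to check directly. In the sketch below the function names are invented for illustration and noise stands for $\sigma^2/n$. It evaluates the MSE curve on a grid of $\alpha$ values and compares the grid minimum with the closed-form $\alpha^*$. It treats $\theta$ as known, which is only possible in a thought experiment.

```python
def mse(alpha, theta, noise):
    """MSE of alpha * X-bar when X-bar ~ Normal(theta, noise), noise = sigma^2 / n."""
    bias = (alpha - 1.0) * theta
    variance = alpha ** 2 * noise
    return bias ** 2 + variance

def optimal_alpha(theta, noise):
    """Closed-form minimiser of mse() over alpha (requires knowing theta)."""
    return theta ** 2 / (theta ** 2 + noise)

theta, noise = 2.0, 1.0
grid = [i / 100 for i in range(0, 121)]               # alpha from 0 to 1.2
best_on_grid = min(grid, key=lambda a: mse(a, theta, noise))
alpha_star = optimal_alpha(theta, noise)

# Fraction of the sample mean's MSE saved by shrinking optimally.
saving = 1.0 - mse(alpha_star, theta, noise) / mse(1.0, theta, noise)
print(best_on_grid, alpha_star, saving)
```

For $\theta = 2$ and $\sigma^2/n = 1$ the optimum is $\alpha^* = 0.8$ and the saving is 20%. In general the minimum MSE works out to $\alpha^* \sigma^2/n$, so the fraction saved is $1 - \alpha^*$.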

This widget is a one-dimensional illustration of a much more famous result. In 1956 Charles Stein showed that for estimating the mean of a multivariate normal in dimension 3 or higher, the maximum likelihood estimator (the sample mean) is inadmissible: there exists an estimator whose total MSE, summed over the coordinates, is strictly lower at every value of the true mean. The James–Stein shrinkage estimator (1961) is the simplest example, and unlike $\alpha^*$ above it is computed from the data alone. The result is so counterintuitive that it is sometimes called "Stein's paradox": it forced statisticians to give up the idea that "always unbiased" is even a sensible goal. We will see the full James–Stein machinery in §1.5.
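As a preview only, here is a rough simulation of the phenomenon, not the treatment §1.5 will give. It assumes a single observation $X \sim \text{Normal}(\theta, I_p)$ with known unit variance, uses the plain James–Stein form $\left(1 - (p-2)/\lVert X \rVert^2\right) X$ that shrinks toward the origin, and picks an arbitrary 10-dimensional $\theta$. All names and values in the code are illustrative.

```python
import random

def james_stein(x, sigma2=1.0):
    """Plain James-Stein shrinkage of x ~ Normal(theta, sigma2 * I) toward 0."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    factor = 1.0 - (p - 2) * sigma2 / norm2   # data-driven shrinkage factor
    return [factor * v for v in x]

def total_squared_error(estimate, theta):
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))

def compare(theta, replicates=5000, seed=5):
    """Average total squared error of the raw observation (MLE) and of James-Stein."""
    rng = random.Random(seed)
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(replicates):
        x = [rng.gauss(t, 1.0) for t in theta]
        mle_loss += total_squared_error(x, theta)
        js_loss += total_squared_error(james_stein(x), theta)
    return mle_loss / replicates, js_loss / replicates

# Illustrative truth in p = 10 dimensions (assumed, not from the text).
theta = [1.0, -0.5, 2.0, 0.0, 0.5, -1.0, 1.5, 0.0, -2.0, 0.5]
mle_risk, js_risk = compare(theta)
print(f"MLE total MSE ~ {mle_risk:.2f} (theory: p = {len(theta)}); James-Stein ~ {js_risk:.2f}")
```

The shrinkage factor here is computed from the observed $\lVert X \rVert^2$, not from the unknown $\theta$, which is what makes this an estimator rather than a benchmark.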

Try it

In the estimator simulator, pick population = Normal(0, 1), sample size = 30, estimator A = Sample mean, estimator B = Sample median. Both are unbiased, but one is more efficient. Which? Read it off the empirical MSE column. (The classic answer is the sample mean: for normal data, its variance is $\sigma^2/n$, while the median's variance is roughly $\pi \sigma^2 / (2n)$ for large n, about 57% larger, which makes its histogram about 25% wider.)
Switch the population to Exponential(rate = 1). Re-run. Look at the sample mean vs the first-observation estimator. Both have the same expected value (both are unbiased for the mean), but their variances differ by a factor of n. Verify this in the empirical variances column.
Set estimator A = Sample mean and estimator B = Shrinkage toward 0 (α = 0.5). Pick population = Normal(0, 1) first. Which estimator wins on MSE? (Shrinkage: the truth is 0, so shrinking toward 0 adds no bias and cuts the variance to a quarter.) Now switch population to Exponential(1), where truth = 1. Re-run. Has the verdict flipped? Why?
In the bias–variance trade-off widget, set θ = 2 and σ²/n = 1. Where is the MSE minimum? (Should be at $\alpha^* = 4/5 = 0.8$.) Now drop σ²/n to 0.05. Does α* move toward 1? (Yes: as the data gets less noisy, the unbiased mean is harder to beat.)
In the bias–variance trade-off widget, set σ²/n = 4 (very noisy data, small n). What MSE percentage does the optimal biased estimator save compared to the unbiased sample mean? Notice how big the saving gets when data is scarce — this is when biased estimators matter most.
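If you do not have the widgets to hand, the shrinkage exercise can be approximated with a short simulation. This is a sketch under assumptions, not the simulator's own code: n = 30, 1500 replicates, and the helper names empirical_mse and shrunk_mean are choices made here.

```python
import random
import statistics

def empirical_mse(estimator, draw, truth, n=30, replicates=1500, seed=6):
    """Average squared error of an estimator over simulated samples."""
    rng = random.Random(seed)
    errors = [(estimator([draw(rng) for _ in range(n)]) - truth) ** 2
              for _ in range(replicates)]
    return statistics.fmean(errors)

def shrunk_mean(xs, alpha=0.5):
    return alpha * statistics.fmean(xs)   # sample mean pulled halfway toward 0

# Each population: (function drawing one observation, true mean).
populations = {
    "Normal(0, 1)": (lambda rng: rng.gauss(0.0, 1.0), 0.0),
    "Exponential(1)": (lambda rng: rng.expovariate(1.0), 1.0),
}

results = {}
for name, (draw, truth) in populations.items():
    results[name] = (empirical_mse(statistics.fmean, draw, truth),
                     empirical_mse(shrunk_mean, draw, truth))
    print(f"{name:>15}: mean MSE={results[name][0]:.4f}  shrunk MSE={results[name][1]:.4f}")
```

The theory to compare against: the sample mean has MSE $\sigma^2/n = 1/30$ in both populations, while the shrunk mean has MSE $(\alpha - 1)^2\theta^2 + \alpha^2\sigma^2/n$, which is $0.25/30$ when $\theta = 0$ and $0.25 + 0.25/30$ when $\theta = 1$.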

Pause and reflect: the bias–variance decomposition assumes you have committed to a loss function — squared error. If you genuinely cared about absolute error instead (|θ̂ − θ| rather than (θ̂ − θ)²) or about correctly bounding $\theta$ from above (a one-sided loss), would your favourite estimator change? Where in research workflows do you actually face squared loss, versus the alternatives?

What you now know

You have a definition: an estimator is any function of the sample, and as a function of random variables it is itself a random variable, with a distribution called the sampling distribution. You have three desiderata: unbiased means the estimator averages to the truth, consistent means it converges to the truth as n grows, efficient means it has the smallest variance among unbiased estimators. And you have the disclaimer that the rest of Part 1 will keep using: MSE = bias² + variance, and the MSE-minimising estimator is almost always biased.

The four sections to come build on this scaffolding. §1.2 (Method of Moments) and §1.3 (Maximum Likelihood) are the two great recipes for actually constructing an estimator from a parametric model — they tell you what $\hat{\theta}$ to write down. §1.4 (Fisher information and the Cramér–Rao bound) gives you a lower bound on variance among unbiased estimators, which makes "as efficient as possible" precise. §1.5 (Bias–variance, MSE, and the efficiency frontier) revisits the trade-off in full and gives you James–Stein in its proper multidimensional form. After that, §1.6 turns to sampling distributions directly (standard error), §1.7 introduces the bootstrap (estimate the sampling distribution from your one sample), §1.8 hardens the methods against outliers, and §1.9 makes the large-sample behaviour rigorous.

References

Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 6, "Models, Statistical Inference and Learning".)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 7, "Point Estimation".)
Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. (The reference text for the formal theory.)
Stein, C. (1956). "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution." Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 197–206. (The original inadmissibility paper that motivates the bias–variance message.)
Efron, B., Morris, C. (1977). "Stein's paradox in statistics." Scientific American 236(5), 119–127. (An accessible exposition of the James–Stein shrinkage estimator and why it beats the MLE.)