Method of moments

Part 1 — Estimation

Learning objectives

State the method of moments (MoM) idea: equate the first k sample moments to the first k theoretical moments and solve for the k parameters
Derive the MoM estimators for Normal, Exponential, Gamma, and Beta, and recognise the closed-form solutions
Identify when MoM fails badly — the Uniform(0, θ) case where the MoM estimate can be physically impossible
Compare MoM to MLE on consistency, variance, and efficiency, and know that MoM is generally consistent and asymptotically normal but rarely fully efficient
Place MoM in modern practice: a fast first-pass recipe, a starting value for iterative MLE, or a fallback when the likelihood is intractable

§1.1 told you what an estimator is — a function of the sample, a random variable, with its own sampling distribution and three classical quality criteria. It did not tell you how to build one for a specific problem. If your data are gamma-distributed with unknown shape and scale, or beta-distributed with unknown α and β, or exponential with unknown rate, what estimator do you write down? §1.2 gives you the first answer: the method of moments (MoM). §1.3 will give you the second (maximum likelihood), and §1.4 will tell you how close either gets to the theoretical lower bound on variance.

The method of moments is the older of the two recipes. Karl Pearson introduced it in 1894, nearly three decades before Fisher's likelihood machinery. It is also the simpler. The whole idea fits in one line: equate sample moments to theoretical moments, and solve for the parameters. No optimisation, no likelihood, no calculus beyond what is needed to invert a couple of simultaneous equations. Whenever you can do that algebra, you have an estimator.

The recipe

You have a sample $X_1, \ldots, X_n$ drawn from a parametric distribution with parameter vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$ . The j-th theoretical moment is

\mu_j(\boldsymbol{\theta}) = E[X^j],

and the j-th sample moment is

\hat{m}_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j.

The method of moments writes down $k$ equations and solves them simultaneously:

\mu_1(\boldsymbol{\theta}) = \hat{m}_1, \quad \mu_2(\boldsymbol{\theta}) = \hat{m}_2, \quad \ldots, \quad \mu_k(\boldsymbol{\theta}) = \hat{m}_k.

Whatever values of $\boldsymbol{\theta}$ make those equalities hold are the MoM estimators $\hat{\boldsymbol{\theta}}_{\text{MoM}}$. That is the entire prescription. The justification is the law of large numbers from §0.6: $\hat{m}_j \xrightarrow{P} \mu_j(\boldsymbol{\theta}_0)$ as $n \to \infty$ (where $\boldsymbol{\theta}_0$ is the true parameter), so the $\boldsymbol{\theta}$ you back out of the sample moments converges to the $\boldsymbol{\theta}_0$ that generated the data. Under mild conditions (the first $k$ moments are finite, and the map from parameters to moments has a continuous inverse near $\boldsymbol{\theta}_0$) this gives a consistent estimator.

Four worked examples

Normal $N(\mu, \sigma^2)$ . Two parameters, so we need two equations. The first two theoretical moments are $\mu_1 = \mu$ and $\mu_2 = \mu^2 + \sigma^2$ . Setting these equal to the sample moments:

\hat{\mu} = \bar{X}, \qquad \hat{\mu}^2 + \hat{\sigma}^2 = \frac{1}{n}\sum X_i^2 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2.

Note the divisor: the MoM estimator of variance is the biased form with $1/n$, not the unbiased $1/(n-1)$ that statistics textbooks usually call "the sample variance." The MoM recipe does not chase unbiasedness; it just inverts moment identities. (The two differ by a factor of $n/(n-1)$, which tends to 1 as $n$ grows, so this rarely matters in practice. On a midterm exam, it matters.)
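A minimal Python sketch of the Normal case, using only the standard library, may make the divisor point concrete. The helper names `sample_moment` and `normal_mom`, the seed, and the true values $\mu = 5$, $\sigma = 2$ are illustrative assumptions, not part of the recipe itself.

```python
import random
from statistics import fmean, pvariance, variance

def sample_moment(xs, j):
    """j-th raw sample moment: (1/n) * sum of x**j."""
    return fmean(x ** j for x in xs)

def normal_mom(xs):
    """MoM estimates (mu_hat, sigma2_hat) for N(mu, sigma^2)."""
    m1 = sample_moment(xs, 1)
    m2 = sample_moment(xs, 2)
    # Solve mu = m1 and mu^2 + sigma^2 = m2 for the two parameters.
    return m1, m2 - m1 ** 2

rng = random.Random(0)  # arbitrary seed, for reproducibility only
xs = [rng.gauss(5.0, 2.0) for _ in range(10_000)]  # true mu = 5, sigma^2 = 4

mu_hat, sigma2_hat = normal_mom(xs)
print("MoM mean:           ", mu_hat)
print("MoM variance (1/n): ", sigma2_hat)
print("pvariance (1/n):    ", pvariance(xs))  # same quantity as the MoM estimate
print("variance (1/(n-1)): ", variance(xs))   # the unbiased form, slightly larger
```

`statistics.pvariance` uses the $1/n$ divisor and `statistics.variance` uses $1/(n-1)$, so the MoM estimate should match the former up to rounding.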

Exponential $\text{Exp}(\lambda)$ . One parameter, one equation. $E[X] = 1/\lambda$ , so

1/\hat{\lambda} = \bar{X} \;\Rightarrow\; \hat{\lambda}_{\text{MoM}} = \frac{1}{\bar{X}}.

Here MoM coincides with MLE (we will derive that in §1.3), as it also does for the Normal above. The estimator is consistent, asymptotically normal, and (asymptotically) fully efficient. Agreement of this kind is common when the model has a single parameter tied directly to the mean, but it is not the general rule, as the Gamma and Beta cases below show.

Gamma $\text{Gamma}(\alpha, \beta)$ . Two parameters. Using the scale parameterisation where $E[X] = \alpha\beta$ and $\text{Var}(X) = \alpha\beta^2$ :

\alpha\beta = \bar{X}, \qquad \alpha\beta^2 = s^2 \;\Rightarrow\; \hat{\alpha}_{\text{MoM}} = \frac{\bar{X}^2}{s^2}, \quad \hat{\beta}_{\text{MoM}} = \frac{s^2}{\bar{X}},

where $s^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$ . The MLE for Gamma requires iterative solution of a transcendental equation (Newton-Raphson on the digamma function) — so MoM is the natural starting value, and in practice often the only estimator you bother to compute.
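The Gamma formulas translate almost line for line into code. The sketch below assumes the scale parameterisation used in the text; the function name `gamma_mom`, the seed, and the true values $\alpha = 2$, $\beta = 3$ are hypothetical choices for illustration.

```python
import random
from statistics import fmean, pvariance

def gamma_mom(xs):
    """MoM estimates (alpha_hat, beta_hat) for Gamma with shape alpha, scale beta."""
    m = fmean(xs)
    v = pvariance(xs, mu=m)  # 1/n divisor, matching s^2 in the text
    return m * m / v, v / m

rng = random.Random(1)  # arbitrary seed
# random.gammavariate(alpha, beta) takes shape then scale, so E[X] = alpha * beta.
xs = [rng.gammavariate(2.0, 3.0) for _ in range(20_000)]

alpha_hat, beta_hat = gamma_mom(xs)
print("alpha_hat:", alpha_hat)
print("beta_hat: ", beta_hat)
print("alpha_hat * beta_hat vs sample mean:", alpha_hat * beta_hat, fmean(xs))
```

By construction the fitted mean $\hat{\alpha}\hat{\beta}$ reproduces $\bar{X}$ exactly, which is a quick sanity check on any implementation.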

Beta $\text{Beta}(\alpha, \beta)$ . Two parameters. Let $m = \bar{X}$ and $v = s^2$ . Using $E[X] = \alpha/(\alpha+\beta)$ and $\text{Var}(X) = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$ , define $\xi = m(1-m)/v - 1$ . Then

\hat{\alpha}_{\text{MoM}} = m \xi, \qquad \hat{\beta}_{\text{MoM}} = (1-m) \xi.

Again the MLE requires iteration; MoM gives a closed form. The first widget below lets you watch all four cases.
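The Beta recipe is equally short in code. At the population level $m(1-m)/v = \alpha + \beta + 1$, so $\xi$ is an estimate of $\alpha + \beta$, and the two estimators split it in the proportions $m$ and $1 - m$. The sketch below is illustrative only: `beta_mom`, the seed, and the true values $\alpha = 2$, $\beta = 5$ are assumptions.

```python
import random
from statistics import fmean, pvariance

def beta_mom(xs):
    """MoM estimates (alpha_hat, beta_hat) for Beta(alpha, beta)."""
    m = fmean(xs)
    v = pvariance(xs, mu=m)       # 1/n divisor, as in the text
    xi = m * (1.0 - m) / v - 1.0  # estimates alpha + beta
    return m * xi, (1.0 - m) * xi

rng = random.Random(2)  # arbitrary seed
xs = [rng.betavariate(2.0, 5.0) for _ in range(20_000)]

alpha_hat, beta_hat = beta_mom(xs)
print("alpha_hat:", alpha_hat)
print("beta_hat: ", beta_hat)
```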

Consistency, made visible

The widget below makes one promise inescapable. Pick a distribution, set its true parameters, slide $n$ from 10 to 1000, and watch the green MoM-fit PDF collapse onto the orange true PDF. That is the law of large numbers — sample moments converge to population moments, so the MoM estimator converges to the truth.

Things to notice. At $n = 10$ the green fit can be wildly off — the histogram is sparse, sample moments are noisy, and MoM dutifully inverts that noise into a wrong-looking PDF. By $n = 100$ the fit is usually visually indistinguishable from the truth for Normal or Exponential, and broadly correct for Gamma and Beta. By $n = 1000$ all four agree. Consistency is not an asymptotic abstraction in this widget — you can see it converge under your finger.

What is harder to see in the picture but easy to measure: the MoM estimator is itself a random variable, so it has a sampling distribution, and that distribution has variance. For "regular" cases (Normal, Exponential, Gamma, Beta) the asymptotic variance shrinks like $O(1/n)$ — that is, the standard error is $O(1/\sqrt{n})$ , the same rate as the sample mean, which we will pin down in §1.6. But the constant in front of $1/n$ depends on the distribution and on the estimator. For some distributions, MoM's constant is much larger than MLE's — MoM is consistent but inefficient. §1.4 (Fisher information and CRLB) will give you the lower bound MLE asymptotically achieves.
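The $O(1/\sqrt{n})$ claim can be checked by simulation without any widget. The sketch below repeatedly draws Exponential samples, computes $\hat{\lambda}_{\text{MoM}} = 1/\bar{X}$ each time, and reports the empirical standard deviation of those estimates. The function names, the seed, the replicate count, and $\lambda = 1$ are arbitrary assumptions for illustration.

```python
import random
from statistics import fmean, pstdev

def exp_rate_mom(xs):
    """MoM estimate of the Exponential rate: 1 / sample mean."""
    return 1.0 / fmean(xs)

def empirical_se(n, reps, lam, rng):
    """Standard deviation of the MoM estimate across `reps` samples of size n."""
    estimates = [
        exp_rate_mom([rng.expovariate(lam) for _ in range(n)])
        for _ in range(reps)
    ]
    return pstdev(estimates)

rng = random.Random(3)  # arbitrary seed
se_by_n = {n: empirical_se(n, reps=1000, lam=1.0, rng=rng) for n in (10, 100, 1000)}

for n, se in se_by_n.items():
    # If the SE scales like 1/sqrt(n), the last column should level off.
    print(f"n={n:5d}  SE={se:.4f}  SE*sqrt(n)={se * n ** 0.5:.3f}")
```

If the standard error really behaves like a constant over $\sqrt{n}$, the rescaled column should settle near that constant for the larger sample sizes; at $n = 10$ it is expected to sit somewhat higher because the asymptotic approximation has not yet kicked in.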

What MoM gives you, and what it does not

Generic properties of MoM, under mild regularity (moments exist, the moment-to-parameter mapping is continuous and invertible):

Consistency. $\hat{\boldsymbol{\theta}}_{\text{MoM}} \xrightarrow{P} \boldsymbol{\theta}_0$ as $n \to \infty$ . (LLN on the sample moments, plus continuity of the inversion.)
Asymptotic normality. $\sqrt{n}(\hat{\boldsymbol{\theta}}_{\text{MoM}} - \boldsymbol{\theta}_0) \xrightarrow{d} N(0, V)$ for a covariance $V$ that depends on the distribution's higher moments. (Delta method on top of the CLT applied to the sample moments.)
Closed-form computation. No optimisation, no iteration, no priors. For the four examples above the formulas are one-liners.
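As a worked instance of the delta-method step, take the Exponential example, where $\hat{\lambda}_{\text{MoM}} = g(\bar{X})$ with $g(m) = 1/m$. The derivation below is a sketch under the usual regularity assumptions; the symbol $g$ is introduced here only for illustration.

```latex
\begin{aligned}
\sqrt{n}\,(\bar{X} - 1/\lambda) &\xrightarrow{d} N(0,\ 1/\lambda^2)
  && \text{(CLT, since } \text{Var}(X) = 1/\lambda^2\text{)} \\
g'(m) = -1/m^2 \;&\Rightarrow\; g'(1/\lambda) = -\lambda^2 \\
\sqrt{n}\,(\hat{\lambda}_{\text{MoM}} - \lambda) &\xrightarrow{d}
  N\!\big(0,\ [g'(1/\lambda)]^2 \cdot 1/\lambda^2\big) = N(0,\ \lambda^2)
\end{aligned}
```

So $V = \lambda^2$ in this case, and the standard error of $\hat{\lambda}_{\text{MoM}}$ is approximately $\lambda/\sqrt{n}$.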

What MoM does not give you:

Efficiency. In general MoM does not achieve the Cramér–Rao lower bound on asymptotic variance. MLE usually does (under regularity). When MLE is feasible, it usually has strictly smaller asymptotic variance than MoM. For the Gamma example, MLE's asymptotic variance can be substantially smaller than MoM's when the true $\alpha$ is small (heavily skewed).
Robustness. MoM relies on sample moments, which are not robust. A few outliers can move $\bar{X}$ or $s^2$ arbitrarily far, and MoM propagates that movement into the parameter estimates. §1.8 introduces robust alternatives.
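A guarantee that the estimate is in the parameter space. For Beta the formula $\xi = m(1-m)/v - 1$ is negative whenever $v > m(1-m)$, in which case it returns $\hat{\alpha}, \hat{\beta} < 0$, outside the parameter space $(0, \infty)$. With every observation in $[0, 1]$ and the $1/n$ variance this cannot happen, because $X_i^2 \leq X_i$ forces $v \leq m(1-m)$; it can happen with the $1/(n-1)$ variance in small, spread-out samples, or when the data stray outside $[0, 1]$. MoM does not police itself. The widget clamps these cases, but the underlying recipe is silent. MLE, by contrast, is constrained by construction to lie in the parameter space.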
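The parameter-space point is easy to probe directly. The sketch below wraps the Beta formulas with an explicit validity flag and tries a deliberately extreme two-point sample under both variance divisors. The function `beta_mom_checked`, its `unbiased` switch, and the sample are hypothetical, chosen only to show the recipe returning values it has no business returning.

```python
from statistics import fmean, pvariance, variance

def beta_mom_checked(xs, unbiased=False):
    """Return (alpha_hat, beta_hat, valid) for the Beta MoM formulas.

    `valid` is False when the estimates fall outside (0, inf); the
    formulas themselves raise no error in that case.
    """
    m = fmean(xs)
    v = variance(xs, xbar=m) if unbiased else pvariance(xs, mu=m)
    xi = m * (1.0 - m) / v - 1.0
    alpha_hat, beta_hat = m * xi, (1.0 - m) * xi
    return alpha_hat, beta_hat, (alpha_hat > 0 and beta_hat > 0)

xs = [0.05, 0.95]  # tiny, spread-out sample inside [0, 1]
print("1/n variance:    ", beta_mom_checked(xs))
print("1/(n-1) variance:", beta_mom_checked(xs, unbiased=True))
```

With the $1/n$ divisor the estimates stay positive for this sample, while switching to $1/(n-1)$ pushes $\xi$ below zero. A practical implementation would likely want a check of this kind rather than silent clamping.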

Uniform(0, θ): when MoM goes badly wrong

The most memorable example of MoM's limitations is the simplest. You have $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and you want to estimate the upper bound $\theta$ . The first moment of $\text{Uniform}(0, \theta)$ is $\theta/2$ , so MoM hands you

\hat{\theta}_{\text{MoM}} = 2\bar{X}.

It is unbiased — $E[2\bar{X}] = 2(\theta/2) = \theta$ — so it satisfies the most-loved §1.1 desideratum. But its variance is $\text{Var}(2\bar{X}) = 4 \cdot \text{Var}(\bar{X}) = 4 \cdot (\theta^2/12)/n = \theta^2/(3n)$ .

The MLE is different. The likelihood $L(\theta) = (1/\theta)^n \cdot \mathbb{1}\{\theta \geq \max_i X_i\}$ is maximised at the smallest $\theta$ that still covers the data, namely

\hat{\theta}_{\text{MLE}} = \max_i X_i.

This is biased low, since $E[\max_i X_i] = \frac{n}{n+1}\theta < \theta$, but its variance is $\text{Var}(\max_i X_i) = \theta^2 \cdot \frac{n}{(n+1)^2(n+2)} \approx \theta^2 / n^2$. The MLE's variance shrinks at rate $1/n^2$, not $1/n$, so its typical error is of order $1/n$ rather than $1/\sqrt{n}$. In MSE, the MLE beats MoM by a factor that grows linearly in $n$.
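The MSE comparison follows from the bias and variance formulas already stated. Written out, with MSE taken as variance plus squared bias:

```latex
\begin{aligned}
\text{MSE}(\hat{\theta}_{\text{MoM}}) &= \text{Var}(2\bar{X}) = \frac{\theta^2}{3n}, \\
\text{MSE}(\hat{\theta}_{\text{MLE}}) &=
  \underbrace{\frac{n\,\theta^2}{(n+1)^2(n+2)}}_{\text{variance}}
  + \underbrace{\frac{\theta^2}{(n+1)^2}}_{\text{bias}^2}
  = \frac{2\theta^2}{(n+1)(n+2)}, \\
\frac{\text{MSE}(\hat{\theta}_{\text{MoM}})}{\text{MSE}(\hat{\theta}_{\text{MLE}})}
  &= \frac{(n+1)(n+2)}{6n} \approx \frac{n}{6}.
\end{aligned}
```

The ratio does not depend on $\theta$, and for large $n$ it grows like $n/6$.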

But the deeper problem is qualitative. The MoM estimator $\hat{\theta}_{\text{MoM}} = 2\bar{X}$ can be smaller than the largest observed value $\max_i X_i$ . That is an estimate of an upper bound that is below an observation. Impossible.

The widget simulates 2000 replicates of size $n$ and shows both sampling distributions. The MoM histogram is roughly symmetric around the true $\theta$ (unbiased!) but wide. A non-trivial fraction of its mass falls below $\max_i X_i$ — the widget counts and reports those cases. The MLE histogram is concentrated just below $\theta$ , with a long left tail truncated at zero. It is biased, but it never gives an impossible answer.
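For readers without the widget, a simulation along the same lines can be sketched in a few lines of Python. This is not the widget's actual code: the function name `simulate_uniform`, the seed, and the sample sizes are assumptions. It reports the fraction of replicates in which $2\bar{X}$ falls below the sample maximum, and the empirical ratio of MoM to MLE mean squared error.

```python
import random
from statistics import fmean

def simulate_uniform(theta, n, reps, rng):
    """Return (fraction of replicates with MoM < max, MSE_MoM / MSE_MLE)."""
    below = 0
    sq_err_mom = 0.0
    sq_err_mle = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        mom = 2.0 * fmean(xs)  # method-of-moments estimate
        mle = max(xs)          # maximum-likelihood estimate
        below += mom < mle     # an "impossible" estimate of the upper bound
        sq_err_mom += (mom - theta) ** 2
        sq_err_mle += (mle - theta) ** 2
    return below / reps, sq_err_mom / sq_err_mle

rng = random.Random(4)  # arbitrary seed
results = {n: simulate_uniform(1.0, n, reps=2000, rng=rng) for n in (10, 100)}

for n, (frac_below, mse_ratio) in results.items():
    print(f"n={n:4d}  P(MoM < max) ~ {frac_below:.3f}  MSE ratio ~ {mse_ratio:.2f}")
```

Comparing the two rows shows how both quantities move with $n$; the MSE ratio can be set against the theoretical value $(n+1)(n+2)/(6n)$ implied by the bias and variance formulas above.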

The lesson: MoM's recipe ignores the structural information in the data. For Uniform(0, θ), the maximum $\max_i X_i$ carries far more information about $\theta$ than $\bar{X}$ does — and yet MoM only consults $\bar{X}$ . Whenever the support of the distribution is bounded by a parameter (Uniform, shifted exponential, truncated normals), the order statistics matter and the moment-based recipe is fundamentally weaker than the likelihood-based one.

Where MoM lives in modern practice

Karl Pearson introduced MoM in his 1894 paper on mixtures of normal distributions, where he wanted to estimate the five parameters of a two-component mixture (two means, two variances, one mixing proportion). Maximum likelihood was not yet a paradigm and there was no general theory of optimisation for such problems; Pearson wrote down the first five moments, set them equal to their sample versions, and reduced the system to a single ninth-degree polynomial equation. The method dominated parametric estimation for the first quarter of the twentieth century until Fisher's 1922 paper on likelihood theory began to displace it.

Today MoM is rarely the final estimator in a research paper, but it lives in three places worth knowing:

When MLE is intractable. Some likelihoods do not admit closed-form maxima and even Newton-style iteration is unstable. MoM is fast and at least gets you a consistent answer.
As starting values for iterative MLE. Numerical maximum-likelihood routines need an initial guess. MoM provides a sane initialisation in a single function call, often inside the basin of attraction of the global maximum. Most statistical software does this internally.
In the generalised method of moments (GMM). Hansen's 1982 paper generalises Pearson's idea to over-identified systems (more moment conditions than parameters) by minimising a quadratic form in the moment residuals. GMM is a workhorse in econometrics for instrumental-variable estimation and time-series models. Whenever a problem can be cast as $E[g(X, \boldsymbol{\theta})] = 0$ for a vector-valued $g$, GMM applies. §6.5 (instrumental variables) will return to GMM in earnest.
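The starting-value use case in the list above can be sketched for the Gamma shape parameter. The MLE solves $\log\alpha - \psi(\alpha) = \log\bar{X} - \overline{\log X}$, where $\psi$ is the digamma function, and Newton-Raphson needs a first guess, which the MoM estimate supplies. Everything below is an illustrative assumption: the standard library has no digamma, so `digamma` and `trigamma` are hand-rolled approximations (a library routine such as `scipy.special.digamma` would normally be preferred), and the halving safeguard is one simple choice among many.

```python
import math
import random
from statistics import fmean, pvariance

def digamma(x):
    """Approximate psi(x): shift x up to >= 6, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    r = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - r * (1 / 12 - r * (1 / 120 - r / 252))

def trigamma(x):
    """Approximate psi'(x) by the same shift-then-series approach."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    r = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * r + (r / x) * (1 / 6 - r * (1 / 30 - r / 42))

def gamma_shape_mle(xs, tol=1e-12, max_iter=100):
    """Newton-Raphson for the Gamma shape MLE, started at the MoM estimate."""
    m = fmean(xs)
    alpha_mom = m * m / pvariance(xs, mu=m)           # MoM starting value
    c = math.log(m) - fmean(math.log(x) for x in xs)  # positive unless all x are equal
    alpha = alpha_mom
    for _ in range(max_iter):
        score = math.log(alpha) - digamma(alpha) - c
        slope = 1.0 / alpha - trigamma(alpha)
        proposal = alpha - score / slope
        if proposal <= 0.0:  # safeguard: a plain Newton step can overshoot zero
            proposal = alpha / 2.0
        converged = abs(proposal - alpha) <= tol * alpha
        alpha = proposal
        if converged:
            break
    return alpha_mom, alpha, m / alpha  # MoM shape, MLE shape, MLE scale

rng = random.Random(5)  # arbitrary seed
xs = [rng.gammavariate(2.0, 3.0) for _ in range(5000)]
alpha_mom, alpha_mle, beta_mle = gamma_shape_mle(xs)
print("MoM start:", alpha_mom)
print("MLE shape:", alpha_mle, " MLE scale:", beta_mle)
```

Starting from the MoM value, the iteration typically needs only a handful of steps, which is the practical sense in which MoM sits inside the basin of attraction.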

Try it

In the MoM explorer, pick Exponential(λ = 1) with $n = 10$ . Resample several times — note how widely $\hat{\lambda}_{\text{MoM}} = 1/\bar{X}$ jumps. Now bump $n$ to 100, then 1000. Watch the estimate stabilise. This is the $1/\sqrt{n}$ shrinkage of the standard error.
In the MoM explorer, pick Gamma(α = 2, β = 1) and look at the green MoM fit. Now drop $\alpha$ to 0.7 (very skewed). At $n = 30$ , does the MoM fit reproduce the spike near zero well, or does it badly miss it? MLE would do better here, but at the cost of iterative computation.
In the MoM explorer, pick Beta(α = 0.5, β = 0.5) — the "smile" U-shape. At $n = 30$ , what does the MoM fit look like? Try resampling a few times. The MoM formula can produce small or even negative $\xi$ (which the widget clamps) — the recipe gives no warning when it strays outside the parameter space.
In the MoM-vs-MLE Uniform widget, set $\theta = 1$ and $n = 10$. What percentage of MoM estimates fall below $\max_i X_i$ (the "impossible" region)? Now bump $n$ to 100. Does the percentage shrink? It does not: it grows towards 50%. The noise in $\bar{X}$ does shrink, but only at rate $1/\sqrt{n}$, while $\max_i X_i$ closes in on $\theta$ at rate $1/n$, so for large $n$ the maximum sits almost exactly at $\theta$ and roughly half of the MoM estimates land below it.
In the same widget, look at the MSE ratio (reported in the status panel). The theory says MLE's MSE is smaller by a factor that grows like $n$: exactly $(n+1)(n+2)/(6n)$, or roughly $n/6$. Verify roughly: at $n = 30$ you should see a ratio of about 5.5; at $n = 100$, about 17; at $n = 200$, about 34.

Pause and reflect: the Uniform(0, θ) example shows MoM losing because it ignores the maximum. What information does MoM exploit that MLE does not? (Hint: it uses every observation symmetrically, while MLE here uses only one — the maximum.) Are there problems where that symmetric use is actually an advantage, perhaps because it is more robust to model misspecification at the tails?

What you now know

The method of moments is your first recipe for constructing an estimator: write down $k$ equations setting sample moments equal to population moments, solve. Under mild conditions you get a consistent, asymptotically normal estimator in closed form. You have seen the formulas for Normal, Exponential, Gamma, and Beta, and you have seen the cautionary Uniform(0, θ) example where MoM is so weak it can produce physically impossible estimates.

§1.3 introduces the second great recipe — maximum likelihood — which maximises the likelihood of the observed data over the parameter space. It is typically more efficient than MoM (often achieving the Cramér–Rao bound), but it requires solving an optimisation problem and the solution may need to be found iteratively. §1.4 will give you the Cramér–Rao lower bound on variance among unbiased estimators — the yardstick that measures both MoM and MLE against the best possible. §1.5 revisits bias-variance in full and shows when even MLE can be improved upon (Stein's paradox). After that Part 1 turns from constructing estimators to characterising their sampling distributions, with the bootstrap and asymptotic theory.

References

Pearson, K. (1894). "Contributions to the mathematical theory of evolution." Philosophical Transactions of the Royal Society of London A 185, 71–110. (The original MoM paper, for a mixture of two normals.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 9.2 covers the method of moments.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Section 10.1.1.)
Hansen, L.P. (1982). "Large sample properties of generalized method of moments estimators." Econometrica 50(4), 1029–1054. (The paper that turned moment-matching into a general estimation framework for over-identified systems.)
Hall, A.R. (2005). Generalized Method of Moments. Oxford University Press. (Modern textbook treatment of GMM and its asymptotic theory.)