Maximum likelihood

Part 1 — Estimation

Learning objectives

State the maximum likelihood principle: θ̂_MLE = argmax_θ L(θ; X) = argmax_θ Σ log f(Xᵢ; θ)
Distinguish the likelihood as a function of θ (data fixed) from the PDF as a function of x (parameter fixed)
Derive closed-form MLEs for Bernoulli, Exponential, Normal, and Uniform(0, θ), and recognise when boundary behaviour breaks the differentiation recipe
State the asymptotic properties: consistency, asymptotic normality with variance I(θ)⁻¹, and the invariance property for transformations
Compare MLE to MoM in efficiency, computability, and finite-sample bias, and place MLE as the modern default with honest caveats (misspecification, optimisation failure)

§1.2 gave you the first recipe for constructing an estimator: equate sample moments to population moments, solve. It works, it gives you closed forms, and the Uniform(0, θ) example showed where it can fail badly. §1.3 introduces the second recipe, due to Ronald Fisher in his 1922 paper that founded modern statistics: maximum likelihood. The idea is one of the cleanest in all of statistical theory — write down the probability that the data you actually observed would arise as a function of the unknown parameters, and pick the parameters that maximise that probability. You will use MLE more often than any other estimator in your research career; in default settings of most statistical software, "fit this model" means "find the MLE."

This section gives you the definition, the algebra of the log-likelihood that makes the recipe practical, the closed-form MLEs for four workhorse distributions (Bernoulli, Exponential, Normal, and Uniform(0, θ), the last being the boundary counterexample where the textbook derivation fails), the three theoretical properties that make MLE special (consistency, asymptotic normality, invariance), and the honest list of caveats: biased in finite samples, sensitive to misspecification, occasionally hard to compute. §1.4 will sharpen the asymptotic-normality statement by giving you the Cramér–Rao bound and Fisher information; §1.5 will revisit bias-variance and show when even MLE can be improved upon by shrinkage.

The likelihood, and the flip

You have iid data $X_1, X_2, \ldots, X_n$ drawn from a parametric family with density $f(x; \theta)$. The joint density of the observed sample is the product

f(X_1, X_2, \ldots, X_n; \theta) = \prod_{i=1}^{n} f(X_i; \theta).

As a function of the data $X$ with $\theta$ held fixed, this is the joint probability density — the standard thing from Part 0. The likelihood function is the same algebraic expression viewed the other way around: data held fixed, parameter $\theta$ varying.

L(\theta; X_1, \ldots, X_n) = \prod_{i=1}^{n} f(X_i; \theta).

The notational distinction matters. The PDF $f(x; \theta)$ is a function of $x$ for fixed $\theta$ — it integrates to 1. The likelihood $L(\theta; X)$ is a function of $\theta$ for fixed data — it does not integrate to 1 (in $\theta$ ), and it is not a probability density over $\theta$ at all. It is just a relative scoring of how "compatible" each candidate $\theta$ is with the observed data. The flip — same algebra, different argument — is the conceptual leap you need to internalise. Read the likelihood out loud as "the probability that THIS data would arise IF the parameter were θ" and you have the right mental model.
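The claim that the likelihood is not a density over the parameter can be checked numerically. The sketch below uses a hypothetical data set (3 ones in 10 Bernoulli trials) and a simple midpoint rule; the function names are illustrative, not part of the course widgets. Summing the Bernoulli probability over the outcomes gives 1, while integrating the likelihood over $p$ gives a much smaller number.

```python
def bernoulli_likelihood(p, k, n):
    """Likelihood L(p) for k ones in n Bernoulli trials (data fixed, p varying)."""
    return p**k * (1.0 - p) ** (n - k)


def integrate_over_p(k, n, steps=100_000):
    """Midpoint-rule approximation of the integral of L(p) over p in (0, 1)."""
    h = 1.0 / steps
    return sum(bernoulli_likelihood((i + 0.5) * h, k, n) for i in range(steps)) * h


k, n = 3, 10  # hypothetical data: 3 ones in 10 trials

# As a function of p (data fixed), the area under the curve is far from 1.
area_in_p = integrate_over_p(k, n)

# As a function of the outcome x (p fixed at 0.3, one trial), the total is 1.
pmf_total = sum(bernoulli_likelihood(0.3, x, 1) for x in (0, 1))

print(area_in_p, pmf_total)
```

For these numbers the exact area is the Beta function value $B(4, 8) = 1/1320$, which is why only ratios of likelihood values, not their absolute size, carry meaning.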

The log-likelihood and why you always use it

Three things go wrong with $L(\theta; X)$ as written. First, it is a product of $n$ numbers each between 0 and 1 (or, for continuous densities, each on whatever scale $f$ produces) — for moderate $n$ it underflows to floating-point zero. Second, derivatives of products are messy. Third, intuition about products is bad and intuition about sums is good. All three problems are solved by taking the logarithm. The log-likelihood

\ell(\theta) = \log L(\theta; X) = \sum_{i=1}^{n} \log f(X_i; \theta)

is a sum of $n$ terms (numerically stable), is differentiable term by term, and shares the same maximiser as $L$ because the logarithm is a monotone increasing function. Always work with the log-likelihood. The MLE is defined indifferently in either form:

\hat{\theta}_{\text{MLE}} = \underset{\theta}{\arg\max} \; L(\theta; X) = \underset{\theta}{\arg\max} \; \ell(\theta).
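The underflow problem is easy to reproduce. The following sketch assumes a hypothetical sample of 2000 standard Normal draws and evaluates both the raw likelihood and the log-likelihood at $\mu = 0$; the names are illustrative.

```python
import math
import random

random.seed(0)
n = 2000
# Hypothetical sample: n draws from N(0, 1).
xs = [random.gauss(0.0, 1.0) for _ in range(n)]


def normal_logpdf(x, mu):
    """Log density of N(mu, 1) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (x - mu) ** 2


# Raw likelihood: a product of n densities, each below 0.4.
likelihood = 1.0
for x in xs:
    likelihood *= math.exp(normal_logpdf(x, 0.0))

# Log-likelihood: a sum of n moderate negative numbers.
log_likelihood = sum(normal_logpdf(x, 0.0) for x in xs)

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # a finite number in the low negative thousands
```

Because every factor is below 0.4, the product is smaller than $0.4^{2000}$, far below the smallest positive double, while the sum of logs stays comfortably representable.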

The score function is the derivative of the log-likelihood, $\partial \ell / \partial \theta$. Setting it to zero gives a candidate maximum:

\frac{\partial \ell}{\partial \theta}\bigg|_{\theta = \hat{\theta}} = 0.

(Check second-order conditions or compare interior critical points to boundary values to confirm you have a maximum, not a minimum or saddle.) For nicely-behaved single-parameter problems this gives the MLE in one line of algebra. For most other problems (multi-parameter models, transcendental score equations) you solve the score equation numerically. When the support depends on the parameter, the score equation may have no solution at all, and you have to inspect the likelihood directly.
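To make "solve it numerically" concrete, here is a minimal Newton-Raphson sketch on the score of an Exponential model with density $\lambda e^{-\lambda x}$, where the answer $n / \sum X_i$ is known in closed form and can be used as a check. The data, the function names, and the starting value $1/\max_i X_i$ are assumptions made for illustration; the starting value is chosen because it lies below the root, where Newton's method is well behaved for this particular score.

```python
import random


def exp_score(lam, xs):
    """Score of the Exponential log-likelihood: n / lam - sum(x)."""
    return len(xs) / lam - sum(xs)


def exp_score_derivative(lam, xs):
    """Second derivative of the log-likelihood: -n / lam^2 (always negative)."""
    return -len(xs) / lam**2


def newton_mle(xs, start, tol=1e-12, max_iter=100):
    """Newton-Raphson iteration for a root of the score equation."""
    lam = start
    for _ in range(max_iter):
        step = exp_score(lam, xs) / exp_score_derivative(lam, xs)
        lam -= step
        if abs(step) < tol:
            break
    return lam


random.seed(1)
xs = [random.expovariate(2.0) for _ in range(200)]  # hypothetical data, true rate 2

lam_newton = newton_mle(xs, start=1.0 / max(xs))
lam_closed = len(xs) / sum(xs)
print(lam_newton, lam_closed)
```

The negative second derivative at the solution is the second-order condition confirming a maximum.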

MLE as peak-finding on a curve

The widget below makes the geometry of MLE physical. Pick a distribution (Bernoulli, Exponential, Normal with σ fixed at 1, or Uniform(0, θ)) and a sample size $n$ . The widget draws $n$ iid samples from the true distribution, then plots $\ell(\theta)$ as a function of $\theta$ on a slider. The slider lets you scrub $\theta$ by hand and watch $\ell(\theta)$ go up and down. The MLE θ̂ (green vertical line) is the argmax of the curve. The true $\theta$ (orange dashed) is the value that generated the data. Click Freeze new sample to draw new data and see the landscape shift.

Four things to look for. First, the curve always has its maximum at the closed-form MLE — drag the slider toward it to make the value $\ell(\theta_{\text{slider}})$ approach $\ell(\hat{\theta})$ . Second, the peak moves around with the data: hit Freeze new sample and watch θ̂ wander while the true $\theta$ stays put. That wandering is the sampling distribution of θ̂. Third, the peak gets sharper as $n$ grows from 5 to 300 — the curvature at the peak is what §1.4 will call Fisher information, and asymptotic variance scales like one over it. Fourth, the Uniform(0, θ) curve is qualitatively different: a vertical cliff at $\theta = \max(X_i)$ on the left (likelihood is zero, so log-likelihood is $-\infty$ ) and a smooth decay to the right. The MLE is the cliff edge. You cannot find this peak by setting a derivative to zero — the recipe fails for the Uniform because the support of the distribution depends on the parameter. We will come back to this.

Four closed-form MLEs

Bernoulli $\text{Bern}(p)$. Each $X_i \in \{0, 1\}$ with $P(X_i = 1) = p$. Let $k = \sum X_i$ be the count of 1s. Then

\ell(p) = k \log p + (n - k) \log(1 - p), \qquad \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p}.

Setting the score to zero and solving:

\hat{p}_{\text{MLE}} = \frac{k}{n} = \bar{X}.

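The sample proportion. The same answer MoM gives. (This is typical of one-parameter exponential families written in terms of their mean, such as Bernoulli and Poisson: the score equation reduces to matching $\bar{X}$ to the mean, so MoM and MLE coincide. It is not automatic for every family whose parameter equals its mean.)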

Exponential $\text{Exp}(\lambda)$. Density $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \geq 0$.

\ell(\lambda) = n \log \lambda - \lambda \sum X_i, \qquad \frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum X_i.

Setting to zero: $\hat{\lambda}_{\text{MLE}} = n / \sum X_i = 1 / \bar{X}$. Again the same as MoM. The peak of $\ell$ at $1/\bar{X}$ is what you scrub through with the slider in the widget.

Normal $N(\mu, \sigma^2)$ (both parameters). The log-likelihood is

\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum (X_i - \mu)^2.

Taking partial derivatives:

\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum(X_i - \mu), \qquad \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum (X_i - \mu)^2.

Setting both to zero:

\hat{\mu}_{\text{MLE}} = \bar{X}, \qquad \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum (X_i - \bar{X})^2.

Look at the variance MLE: it is the biased form with $1/n$, identical to the MoM estimator from §1.2. The unbiased form $\frac{1}{n-1}\sum (X_i - \bar{X})^2$, the one statistics textbooks call "the sample variance", is not the MLE. The MLE is biased low: its expectation is $\frac{n-1}{n}\sigma^2$. This is a recurring lesson: MLE in finite samples can be biased, sometimes substantially so. Asymptotically the bias vanishes (the factor goes to 1 as $n \to \infty$), in line with the consistency property stated below.
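The $(n-1)/n$ bias factor can be seen in a small simulation. This is a sketch under assumed settings ($n = 5$, $\sigma^2 = 4$, 40,000 replicates, all chosen for illustration): it averages the variance MLE over many samples and compares the average with $(n-1)\sigma^2/n$.

```python
import random


def var_mle(xs):
    """Normal-variance MLE: mean squared deviation with a 1/n divisor."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


random.seed(2)
n, sigma2, reps = 5, 4.0, 40_000  # hypothetical settings

estimates = [
    var_mle([random.gauss(0.0, sigma2**0.5) for _ in range(n)])
    for _ in range(reps)
]

mean_mle = sum(estimates) / reps            # close to (n - 1) / n * sigma2 = 3.2
expected_mle = (n - 1) / n * sigma2
mean_unbiased = mean_mle * n / (n - 1)      # rescaling to the 1/(n-1) form recovers sigma2

print(mean_mle, expected_mle, mean_unbiased)
```

With $n = 5$ the MLE undershoots the true variance by 20% on average, which is large enough to matter; at $n = 100$ the same factor is 0.99.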

Uniform $U(0, \theta)$. This is the case where the differentiation recipe breaks. The density is

f(x; \theta) = \frac{1}{\theta} \, \mathbb{1}\{0 \leq x \leq \theta\}.

The likelihood is $L(\theta) = \theta^{-n}$ if $\theta \geq \max_i X_i$ , and $L(\theta) = 0$ if $\theta < \max_i X_i$ (because at least one $X_i$ would fall outside the support, giving zero density there). Equivalently $\ell(\theta) = -n \log \theta$ on $[\max_i X_i, \infty)$ and $-\infty$ otherwise. The function is monotonically decreasing on its support, so the maximum is at the LEFT boundary:

\hat{\theta}_{\text{MLE}} = \max_i X_i.

You cannot get this by setting a derivative to zero. The derivative $-n / \theta$ is never zero. The maximum is at a non-differentiable boundary point — at the smallest $\theta$ that the data does not rule out. This is the boundary case that §1.4 will identify as a violation of the "regularity conditions" needed for the CRLB and asymptotic normality results.
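The shape of the Uniform log-likelihood is easy to write down in code. The sketch below uses a hypothetical sample of 20 draws from Uniform(0, 2) and an illustrative grid; it shows the value $-\infty$ to the left of $\max_i X_i$, the decreasing branch to the right, and a brute-force grid search landing on the first grid point at or above the sample maximum.

```python
import math
import random


def uniform_loglik(theta, xs):
    """Log-likelihood of Uniform(0, theta): -inf if any observation exceeds theta."""
    if theta <= 0 or theta < max(xs):
        return -math.inf
    return -len(xs) * math.log(theta)


random.seed(3)
xs = [random.uniform(0.0, 2.0) for _ in range(20)]  # hypothetical data, true theta = 2

theta_hat = max(xs)  # the closed-form MLE

# Brute-force check on a grid from 0.01 to 5.00 in steps of 0.01.
grid = [0.01 * i for i in range(1, 501)]
best_on_grid = max(grid, key=lambda t: uniform_loglik(t, xs))

print(theta_hat, best_on_grid)
print(uniform_loglik(0.999 * theta_hat, xs))  # -inf: ruled out by the data
```

No derivative is used anywhere: the maximiser is found by comparing values, which is exactly what the boundary argument in the text does analytically.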

Three big properties of the MLE

Under regularity conditions — the parameter space is open, the density is differentiable, the support does not depend on $\theta$ , expectations of certain derivatives exist and can be interchanged with integrals — the MLE has three remarkable large-sample properties. They are the reason MLE is the default in modern statistics.

Consistency. As the sample size $n$ grows, the MLE converges in probability to the true parameter:

\hat{\theta}_{\text{MLE}, n} \xrightarrow{P} \theta_0 \quad \text{as} \quad n \to \infty.

Intuitively: the log-likelihood divided by $n$ converges (by the LLN) to the expected log-likelihood $E_{\theta_0}[\log f(X; \theta)]$ , and that expected log-likelihood is maximised at $\theta = \theta_0$ (by Jensen / non-negativity of KL divergence — see Cox & Hinkley 1974 for the rigorous argument). So the empirical maximum converges to the population maximum.
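The Jensen step can be spelled out in one line. This is a sketch, assuming the densities share a common support (one of the regularity conditions above) so that the ratio is well defined. For any candidate $\theta$:

E_{\theta_0}[\log f(X; \theta)] - E_{\theta_0}[\log f(X; \theta_0)] = E_{\theta_0}\!\left[\log \frac{f(X; \theta)}{f(X; \theta_0)}\right] \leq \log E_{\theta_0}\!\left[\frac{f(X; \theta)}{f(X; \theta_0)}\right] = \log \int f(x; \theta)\, dx = \log 1 = 0.

The inequality is Jensen's, applied to the concave logarithm. The left-hand side is minus the Kullback–Leibler divergence from $f(\cdot\,; \theta_0)$ to $f(\cdot\,; \theta)$, so the expected log-likelihood at any $\theta$ never exceeds its value at $\theta_0$.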

Asymptotic normality. More precisely, the MLE has a normal limiting distribution with variance equal to the inverse Fisher information:

\sqrt{n}\,(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N(0, \, I(\theta_0)^{-1}).

where $I(\theta) = -E[\partial^2 \ell_1 / \partial \theta^2]$ is the Fisher information per observation, with $\ell_1(\theta) = \log f(X_1; \theta)$ the log-likelihood of a single observation (§1.4 will define it carefully). This is one of the great results in statistical theory. In finite samples the MLE has whatever sampling distribution it has; asymptotically, it concentrates around the truth at the canonical $1/\sqrt{n}$ rate, with a variance (a covariance matrix, for vector $\theta$) you can write down from a single expectation. It is also asymptotically efficient: its asymptotic variance achieves the Cramér–Rao lower bound, the smallest variance possible among (regular) unbiased estimators. §1.4 is the section that makes this efficiency claim precise.
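A simulation makes the asymptotic-normality statement tangible. The sketch below assumes an Exponential model with true rate 2, for which the per-observation Fisher information is $1/\lambda^2$ and the limiting variance is therefore $\lambda^2$. The sample size, replicate count, and names are illustrative choices.

```python
import math
import random

random.seed(4)
lam_true, n, reps = 2.0, 200, 4000  # hypothetical settings

scaled_errors = []
for _ in range(reps):
    total = sum(random.expovariate(lam_true) for _ in range(n))
    lam_hat = n / total  # closed-form Exponential MLE
    scaled_errors.append(math.sqrt(n) * (lam_hat - lam_true))

mean_err = sum(scaled_errors) / reps
emp_var = sum((e - mean_err) ** 2 for e in scaled_errors) / (reps - 1)

asym_var = lam_true**2  # inverse Fisher information for the Exponential
half_width = 1.96 * math.sqrt(asym_var)
# Fraction of replicates inside the central 95% band of the limiting Normal.
coverage = sum(abs(e) <= half_width for e in scaled_errors) / reps

print(emp_var, asym_var, coverage)
```

The empirical variance should sit close to 4 and the coverage close to 0.95; the small remaining gap is finite-sample behaviour that shrinks as $n$ grows.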

Invariance. If $\hat{\theta}$ is the MLE of $\theta$ and $g$ is any function, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ . In symbols:

\widehat{g(\theta)}_{\text{MLE}} = g(\hat{\theta}_{\text{MLE}}).

This is the easy-to-prove and constantly-used property. If you have the MLE of a Bernoulli $p$ , then the MLE of the log-odds $\log(p / (1-p))$ is just $\log(\hat{p} / (1 - \hat{p}))$ — no separate optimisation required. If you have the MLE of the variance $\sigma^2$ , the MLE of the standard deviation is $\sqrt{\hat{\sigma}^2}$ . Invariance is what makes MLE a clean estimator across reparameterisations; it does not depend on how you choose to write down the parameter. (MoM does not share this — its result depends on which moments you choose to match.)
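Invariance can be verified by brute force: maximise the Bernoulli log-likelihood directly in the log-odds parameter and compare with the transformed $\hat{p}$. The sketch assumes hypothetical data (7 ones in 20 trials) and uses a plain ternary search, which is adequate here because the log-likelihood is concave in the log-odds; the helper names are illustrative.

```python
import math

k, n = 7, 20  # hypothetical data: 7 ones in 20 trials


def loglik_in_log_odds(eta):
    """Bernoulli log-likelihood written in eta = log(p / (1 - p))."""
    return k * eta - n * math.log1p(math.exp(eta))


def argmax_concave(f, lo, hi, iters=200):
    """Ternary search for the maximiser of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Route 1: optimise directly over the log-odds.
eta_direct = argmax_concave(loglik_in_log_odds, -10.0, 10.0)

# Route 2: transform the closed-form MLE of p.
p_hat = k / n
eta_invariance = math.log(p_hat / (1.0 - p_hat))

print(eta_direct, eta_invariance)
```

The two routes agree to numerical precision, and the second needs no optimisation at all.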

The Uniform(0, θ) counterexample to asymptotic normality

The asymptotic-normality statement above is true under regularity. The Uniform(0, θ) MLE is a case where regularity fails — the support depends on $\theta$ , so the density is not differentiable in $\theta$ at the boundary $\theta = X_i$ for each observation. The MLE $\hat{\theta} = \max_i X_i$ still exists and is still consistent (the maximum of $n$ iid Uniform(0, $\theta$ ) variables converges to $\theta$ as $n \to \infty$ ), but its asymptotic distribution is not normal, and its rate of convergence is $1/n$ rather than $1/\sqrt{n}$ :

n(\theta - \hat{\theta}_{\text{MLE}}) \xrightarrow{d} \text{Exp}(1/\theta) \quad \text{as} \quad n \to \infty.

This is a faster rate than the canonical $1/\sqrt{n}$ , which makes the Uniform MLE "super-efficient" — it concentrates around the truth more rapidly than any regular estimator can. The price is that the standard asymptotic-normality machinery does not apply: you cannot read off variance from inverse Fisher information; the limiting distribution is exponential, not normal; and the CRLB (which assumes regularity) gives a lower bound the MLE blows past. The Uniform example is the textbook reminder that "MLE is asymptotically normal" requires regularity, and that boundary cases are genuinely different.
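The $1/n$ rate and the exponential limit can be checked by simulation. The sketch below assumes $\theta = 2$, $n = 100$, and 5000 replicates (illustrative values). If the limit is exponential with rate $1/\theta$, the scaled gap $n(\theta - \hat{\theta})$ should have mean near $\theta$ and should exceed $\theta$ with probability near $e^{-1}$.

```python
import math
import random

random.seed(5)
theta, n, reps = 2.0, 100, 5000  # hypothetical settings

scaled_gaps = [
    n * (theta - max(random.uniform(0.0, theta) for _ in range(n)))
    for _ in range(reps)
]

mean_gap = sum(scaled_gaps) / reps                          # limit: theta
tail_fraction = sum(g > theta for g in scaled_gaps) / reps  # limit: exp(-1)
limit_tail = math.exp(-1.0)

print(mean_gap, tail_fraction, limit_tail)
```

Every scaled gap is non-negative, since the sample maximum can never exceed $\theta$; a one-sided limit like this cannot be Normal.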

MLE vs MoM: efficiency, made visible

§1.2 ended with the Uniform widget where MLE absolutely crushed MoM. That was a special case. The general claim is milder: for "regular" problems (smooth densities, support not depending on the parameter) the MLE's asymptotic variance is never larger than MoM's, and is sometimes smaller by a meaningful factor. The widget below quantifies that for the Gamma distribution, where MoM has a closed form and MLE needs Newton iteration.

For $X_i \sim \text{Gamma}(\alpha, \beta)$ (scale parameterisation, mean $\alpha\beta$ , variance $\alpha\beta^2$ ): the MoM estimator of $\alpha$ is $\hat{\alpha}_{\text{MoM}} = \bar{X}^2 / s^2$ (closed-form, one line). The MLE is the $\alpha$ that satisfies the score equation $\log \alpha - \psi(\alpha) = \log \bar{X} - \overline{\log X}$ , where $\psi$ is the digamma function. There is no closed form; Newton-Raphson on this equation converges in 3–5 iterations. The widget runs 1500 replicates of size $n$ , computes both estimators on each, and plots the two empirical sampling distributions of $\hat{\alpha}$ side by side.

Two things to look for. First, the MLE histogram (green) is narrower than the MoM histogram (red). The variance ratio Var(MoM) / Var(MLE) is reported in the status panel, typically 1.5× to 4× depending on $\alpha$. Second, the asymptotic-CRLB approximation $\alpha / [n (\alpha \psi'(\alpha) - 1)]$, where $\psi'$ is the trigamma function (§1.4 will derive this for you), is shown alongside the empirical variance of the MLE; they should agree closely once $n$ is moderate (say 30+). That agreement is the cash value of "MLE is asymptotically efficient": it tells you in advance what variance to expect for the MLE without simulating, just from the model.
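For readers who want to reproduce the Gamma calculation outside the widget, here is a minimal standard-library sketch. Everything in it is an assumption for illustration rather than the widget's actual code: the digamma and trigamma helpers use the standard recurrence plus a truncated asymptotic series, Newton's method starts from the MoM estimate, and a simple halving guard keeps the iterate positive.

```python
import math
import random


def digamma(x):
    """psi(x): shift x up to >= 6 by recurrence, then a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """psi'(x): same strategy as digamma, with the matching series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + inv + 0.5 * inv2 + series


def gamma_shape_mle(xs, tol=1e-10, max_iter=100):
    """Solve log(a) - psi(a) = log(mean x) - mean(log x) by Newton-Raphson."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.log(mean) - sum(math.log(x) for x in xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    alpha = mean * mean / var  # MoM estimate as the starting value
    for _ in range(max_iter):
        g = math.log(alpha) - digamma(alpha) - s
        g_prime = 1.0 / alpha - trigamma(alpha)
        new = alpha - g / g_prime
        if new <= 0.0:  # guard: Newton overshot past zero, so halve instead
            new = alpha / 2.0
        converged = abs(new - alpha) < tol * alpha
        alpha = new
        if converged:
            break
    return alpha


def crlb_variance(alpha, n):
    """Asymptotic variance of the shape MLE: alpha / [n (alpha psi'(alpha) - 1)]."""
    return alpha / (n * (alpha * trigamma(alpha) - 1.0))


random.seed(7)
# random.gammavariate takes (shape, scale), matching the parameterisation in the text.
xs = [random.gammavariate(2.0, 1.0) for _ in range(2000)]
alpha_hat = gamma_shape_mle(xs)
print(alpha_hat, crlb_variance(2.0, 30))
```

In real work a library routine for the digamma and trigamma functions would normally replace the hand-written series; they are written out here only to keep the example self-contained.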

Where MLE lives in practice, and where it bites

Three honest caveats follow the three big properties:

Finite-sample bias. MLE is not generally unbiased. The Normal-variance MLE $(1/n)\sum (X_i - \bar{X})^2$ is biased low by a factor $(n-1)/n$; this is why textbooks define "the sample variance" with $1/(n-1)$. The Uniform MLE $\max_i X_i$ is biased low by a factor $n/(n+1)$. Many GLM and mixed-model MLEs carry non-negligible bias in small samples (for the variance components in random-effects models, REML is preferred for exactly this reason). In regular problems with a fixed number of parameters these biases shrink to zero as $n$ grows, but small samples can be problematic.
Model misspecification. If the model is wrong, the MLE still converges, but to the "least false" parameter $\theta^*$ that minimises the Kullback–Leibler divergence from the true distribution to the assumed model, not to anything you actually wanted. Worse, the asymptotic-normality variance formula breaks: the "sandwich" variance $A^{-1} B A^{-1}$ (with $A = -E[\partial^2 \ell_1 / \partial \theta^2]$ and $B = E[(\partial \ell_1 / \partial \theta)^2]$, where $\ell_1$ is the log-likelihood of a single observation, both evaluated at $\theta^*$) replaces $I(\theta)^{-1}$. Under correct specification $A = B$ and the sandwich collapses to $A^{-1} = I(\theta)^{-1}$; under misspecification it does not. Robust standard errors (Huber, Eicker, White) implement the sandwich correction; they are crucial in practice and are revisited in §1.8 and §4.5.
Optimisation pathology. Closed forms only exist for a few distributions. Most modern MLE problems require numerical optimisation (Newton-Raphson, BFGS, Fisher scoring, EM for missing-data and mixture problems). The log-likelihood can be non-concave (mixtures, neural networks) with multiple local optima, ridges where the curvature collapses, or boundaries where the optimisation stalls. Identifiability failures (the model is invariant to relabelling components in a mixture, for example) produce the famous "label-switching" pathology. Robust starting values — often from MoM — and multi-start optimisation help. There is no silver bullet.
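The sandwich correction from the misspecification caveat can be illustrated with a deliberately wrong model. In this sketch the scenario is hypothetical: the data come from a Gamma distribution with shape 2 and scale 1, but an Exponential model with rate $\lambda$ is fitted. The fitted rate converges to $1/E[X] = 0.5$, the model-based variance of $\sqrt{n}(\hat{\lambda} - \lambda^*)$ is $1/A = \lambda^2$, and the sandwich variance is $B / A^2$.

```python
import random

random.seed(6)
n = 20_000
# Hypothetical misspecification: true data are Gamma(shape 2, scale 1).
xs = [random.gammavariate(2.0, 1.0) for _ in range(n)]

lam_hat = n / sum(xs)  # Exponential MLE; converges to the "least false" rate 0.5

# Per-observation score of the Exponential model at lam_hat: 1 / lam - x.
scores = [1.0 / lam_hat - x for x in xs]

A = 1.0 / lam_hat**2                 # minus the second derivative of log f
B = sum(s * s for s in scores) / n   # average squared score

naive_var = 1.0 / A                  # what the model claims (valid only if correct)
sandwich_var = B / (A * A)           # A^{-1} B A^{-1} for a scalar parameter

print(lam_hat, naive_var, sandwich_var)
```

Here the sandwich variance (about 0.125) is half the model-based one (about 0.25), and it matches what the delta method gives for $1/\bar{X}$ under the true Gamma distribution. In other misspecified settings the correction can go in the opposite direction.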

Despite the caveats, MLE is the modern default for parametric estimation in research statistics. It is the estimator the software gives you when you fit a linear regression (least squares is the MLE under iid Normal errors), a logistic regression (iteratively reweighted least squares is Fisher scoring on the binomial likelihood), or any GLM. The three properties (consistency, asymptotic normality with its attendant asymptotic efficiency, and invariance) combine to make MLE a one-size-fits-most recipe with a coherent theory of inference attached. §1.4 makes the efficiency claim precise; §3.3 gives you likelihood-ratio confidence intervals (often better-calibrated than Wald CIs based on $I(\hat{\theta})^{-1}$); §4.7 gives you AIC and BIC (model-selection criteria built directly on the log-likelihood). MLE is not just one estimator; it is a framework.

Try it

In the likelihood-landscape widget, pick Bernoulli(p) with $n = 30$ . Drag the slider for $p$ and watch $\ell(p)$ . Where is the peak? Read it off the green MLE marker, then verify by computing $\bar{X}$ from the empirical numbers reported. Now click Freeze new sample a few times — does the peak move around the true $p = 0.4$ ? That movement is the sampling distribution of $\hat{p}_{\text{MLE}}$ .
Same widget, switch to Exponential(λ) with $n = 30$ and the true λ = 1. Slide θ from 0.1 up to 5 and watch ℓ(θ). The curve is unimodal and smooth. Now drop $n$ to 5. How much wider is the peak? (You should see the slider has to move much further before ℓ drops noticeably — small $n$ means little curvature, which §1.4 will identify as low Fisher information.)
Switch to Uniform(0, θ). Slide θ from 0.05 up to 3 and look at the shape of ℓ(θ). What happens on the LEFT side of the MLE θ̂ = max(Xᵢ)? (Sharp cliff to −∞.) Try to maximise ℓ(θ) using only the slider — note that you cannot get above the green line because anywhere right of max(Xᵢ) is sub-optimal and anywhere left is impossible.
In the MLE-vs-MoM-multi widget, set α = 2, β = 1, n = 30. Re-run replicates a couple of times. What is the empirical variance ratio Var(MoM) / Var(MLE)? Now bump α up to 6 (closer to Normal). Does the ratio shrink? (Yes — for high-α Gamma the distribution becomes near-Normal and MoM is nearly as good as MLE; for low α the MLE wins by more.)
Same widget, set α = 0.7 (very skewed Gamma), β = 1, n = 30. Look at the empirical Var(MLE) and the asymptotic-CRLB value $\alpha / [n (\alpha \psi'(\alpha) - 1)]$. Are they close? Now bump $n$ to 100; the two should be closer. This is asymptotic efficiency landing on a finite-sample estimate.
Pen-and-paper: derive the Bernoulli MLE. Show that the log-likelihood $\ell(p) = k \log p + (n - k) \log(1 - p)$ has its maximum at $\hat{p} = k/n$ by setting the score to zero. Then apply invariance: write down the MLE of the log-odds $\log(p / (1 - p))$ without doing any new optimisation.

Pause and reflect: the likelihood $L(\theta; X)$ is the joint density of the data viewed as a function of the parameter. It is not a probability distribution over $\theta$ — frequentist statistics treats $\theta$ as fixed but unknown. Bayesian statistics (§7) treats $\theta$ as random and multiplies the likelihood by a prior $\pi(\theta)$ to get a posterior $\pi(\theta \mid X) \propto L(\theta; X) \pi(\theta)$ . What does setting $\pi(\theta) = \text{const}$ (the "uniform prior") do to the relationship between the posterior mode and the MLE? When does that uniform-prior choice make sense, and when does it depend on the parameterisation in a way the MLE itself does not?

What you now know

Maximum likelihood is your second great recipe for constructing an estimator: write down the joint density of the data as a function of $\theta$ , take logs, find the $\theta$ that maximises the resulting log-likelihood. You have the closed-form MLEs for Bernoulli ( $\bar{X}$ ), Exponential ( $1/\bar{X}$ ), Normal ( $\bar{X}$ and the biased $(1/n)\sum(X_i - \bar{X})^2$ ), and Uniform(0, θ) ( $\max_i X_i$ , the boundary case where the derivative recipe breaks). You have the three big properties under regularity — consistency, asymptotic normality with inverse-Fisher-information variance, and invariance. And you have the three honest caveats — finite-sample bias, sensitivity to misspecification, optimisation pathology — and a sense of when MLE is the right default and when MoM, robust alternatives, or Bayesian methods do better.

§1.4 turns "asymptotically efficient" into a calculation by introducing the Cramér–Rao lower bound and Fisher information. You will be able to compute the lower bound on variance among unbiased estimators and verify that MLE asymptotically achieves it. §1.5 revisits bias-variance in full and shows when even MLE can be beaten by shrinkage — the James–Stein result you previewed in §1.1. After that Part 1 turns from constructing estimators to characterising them: §1.6 makes "standard error" precise, §1.7 introduces the bootstrap (a way to estimate the sampling distribution without leaning on asymptotic theory), §1.8 hardens the methods against outliers, and §1.9 makes the large-sample machinery rigorous.

References

Fisher, R.A. (1922). "On the mathematical foundations of theoretical statistics." Philosophical Transactions of the Royal Society of London A 222, 309–368. (The foundational paper. Fisher introduces the likelihood, the MLE, and the concepts of consistency, efficiency, and sufficiency in one extraordinary essay.)
Fisher, R.A. (1925). "Theory of statistical estimation." Proceedings of the Cambridge Philosophical Society 22(5), 700–725. (The follow-up that pins down asymptotic efficiency and introduces Fisher information explicitly.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 9, "Parametric Inference"; the modern textbook presentation of MLE and its asymptotics.)
Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 7, "Point Estimation"; a thorough and careful treatment with worked examples.)
Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. (A book-length argument for putting the likelihood at the centre of applied statistics; especially useful for the philosophical questions about what likelihood is and is not.)