Maximum likelihood

Part 1 — Estimation

Learning objectives

  • State the maximum likelihood principle: θ̂_MLE = argmax_θ L(θ; X) = argmax_θ Σ log f(Xᵢ; θ)
  • Distinguish the likelihood as a function of θ (data fixed) from the PDF as a function of x (parameter fixed)
  • Derive closed-form MLEs for Bernoulli, Exponential, Normal, and Uniform(0, θ), and recognise when boundary behaviour breaks the differentiation recipe
  • State the asymptotic properties: consistency, asymptotic normality with variance I(θ)⁻¹, and the invariance property for transformations
  • Compare MLE to MoM in efficiency, computability, and finite-sample bias, and place MLE as the modern default with honest caveats (misspecification, optimisation failure)

§1.2 gave you the first recipe for constructing an estimator: equate sample moments to population moments, solve. It works, it gives you closed forms, and the Uniform(0, θ) example showed where it can fail badly. §1.3 introduces the second recipe, due to Ronald Fisher in his 1922 paper that founded modern statistics: maximum likelihood. The idea is one of the cleanest in all of statistical theory — write down the probability that the data you actually observed would arise as a function of the unknown parameters, and pick the parameters that maximise that probability. You will use MLE more often than any other estimator in your research career; in default settings of most statistical software, "fit this model" means "find the MLE."

This section gives you the definition, the algebra of the log-likelihood that makes the recipe practical, the closed-form MLEs for four workhorse distributions (three single-parameter families plus the two-parameter Normal), the boundary counterexample where the textbook derivation fails, the three theoretical properties that make MLE special (consistency, asymptotic normality, invariance), and the honest list of caveats: biased in finite samples, sensitive to misspecification, occasionally hard to compute. §1.4 will sharpen the asymptotic-normality statement by giving you the Cramér–Rao bound and Fisher information; §1.5 will revisit bias-variance and show when even MLE can be improved upon by shrinkage.

The likelihood, and the flip

You have iid data $X_1, X_2, \ldots, X_n$ drawn from a parametric family with density $f(x; \theta)$. The joint density of the observed sample is the product

$$f(X_1, X_2, \ldots, X_n; \theta) = \prod_{i=1}^{n} f(X_i; \theta).$$

As a function of the data $X$ with $\theta$ held fixed, this is the joint probability density, the standard thing from Part 0. The likelihood function is the same algebraic expression viewed the other way around: data held fixed, parameter $\theta$ varying.

$$L(\theta; X_1, \ldots, X_n) = \prod_{i=1}^{n} f(X_i; \theta).$$

The notational distinction matters. The PDF $f(x; \theta)$ is a function of $x$ for fixed $\theta$, and it integrates to 1. The likelihood $L(\theta; X)$ is a function of $\theta$ for fixed data; it need not integrate to 1 (in $\theta$), and it is not a probability density over $\theta$ at all. It is just a relative scoring of how "compatible" each candidate $\theta$ is with the observed data. The flip (same algebra, different argument) is the conceptual leap you need to internalise. Read the likelihood out loud as "the probability that THIS data would arise IF the parameter were $\theta$" and you have the right mental model.
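To make the flip concrete, here is a minimal sketch that integrates the same Exponential expression two ways. The observation `x_obs`, the parameter value `lam_fixed`, and the helper `midpoint_integral` are illustrative choices, not part of the course material.

```python
import math

def exp_density(x, lam):
    """Exponential density f(x; lam) = lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x)

def midpoint_integral(func, lo, hi, steps=100_000):
    """Plain midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(func(lo + (i + 0.5) * width) for i in range(steps))

x_obs = 0.5      # one fixed observation (made-up value)
lam_fixed = 2.0  # one fixed parameter value (made-up value)

# PDF view: vary x with lam fixed. The area is approximately 1.
area_over_x = midpoint_integral(lambda x: exp_density(x, lam_fixed), 0.0, 40.0)

# Likelihood view: vary lam with x fixed. The area is approximately
# 1 / x_obs**2, which is 4 here, so this is not a density over lam.
area_over_lam = midpoint_integral(lambda lam: exp_density(x_obs, lam), 0.0, 200.0)

print(area_over_x, area_over_lam)
```

The algebraic expression is identical in both integrals; only the argument being integrated over changes.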

The log-likelihood and why you always use it

Three things go wrong with $L(\theta; X)$ as written. First, it is a product of $n$ numbers each between 0 and 1 (or, for continuous densities, each on whatever scale $f$ produces), and once $n$ reaches the hundreds or thousands it typically underflows to floating-point zero. Second, derivatives of products are messy. Third, intuition about products is bad and intuition about sums is good. All three problems are solved by taking the logarithm. The log-likelihood

$$\ell(\theta) = \log L(\theta; X) = \sum_{i=1}^{n} \log f(X_i; \theta)$$

is a sum of $n$ terms (numerically stable), is differentiable term by term, and shares the same maximiser as $L$ because the logarithm is a strictly increasing function. Always work with the log-likelihood. The MLE is defined equivalently in either form:

$$\hat{\theta}_{\text{MLE}} = \underset{\theta}{\arg\max} \; L(\theta; X) = \underset{\theta}{\arg\max} \; \ell(\theta).$$
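The underflow problem is easy to reproduce. The sketch below assumes 2000 simulated standard-Normal observations (the sample size and seed are arbitrary) and evaluates the likelihood both as a raw product and as a sum of logs.

```python
import math
import random

def normal_density(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Raw product: every factor is below 0.4, so 2000 of them underflow to 0.0.
likelihood = 1.0
for x in data:
    likelihood *= normal_density(x, 0.0, 1.0)

# Sum of logs: an ordinary finite negative number.
log_likelihood = sum(math.log(normal_density(x, 0.0, 1.0)) for x in data)

print(likelihood, log_likelihood)
```

Once the product has collapsed to zero, every candidate parameter looks equally bad, so an optimiser working on the raw likelihood has nothing to climb.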

The score function is the derivative of the log-likelihood, $\partial \ell / \partial \theta$. Setting it to zero gives a candidate maximum:

$$\frac{\partial \ell}{\partial \theta}\bigg|_{\theta = \hat{\theta}} = 0.$$

(Check second-order conditions or compare interior critical points to boundary values to confirm you have a maximum, not a minimum or saddle.) For nicely-behaved single-parameter problems this gives the MLE in one line of algebra. For everything else — multi-parameter problems, transcendental score equations, boundary-supported distributions — you solve it numerically.
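As a sketch of what "solve it numerically" can look like, the following applies Newton-Raphson to the score of an Exponential sample, a case where the closed-form answer $n / \sum X_i$ is available as a check. The function name, starting value, tolerance, and the halving guard are all illustrative assumptions.

```python
import random

def newton_exponential_mle(data, lam_start, tol=1e-12, max_iter=50):
    """Newton-Raphson on the score of an Exponential(rate) sample."""
    n = len(data)
    total = sum(data)
    lam = lam_start
    for _ in range(max_iter):
        score = n / lam - total       # first derivative of the log-likelihood
        curvature = -n / lam ** 2     # second derivative, negative for lam > 0
        lam_next = lam - score / curvature
        if lam_next <= 0.0:           # guard added here: stay inside lam > 0
            lam_next = lam / 2.0
        if abs(lam_next - lam) < tol:
            return lam_next
        lam = lam_next
    return lam

rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(500)]  # true rate 2.0 (made-up)

lam_newton = newton_exponential_mle(data, lam_start=0.1)
lam_closed_form = len(data) / sum(data)
print(lam_newton, lam_closed_form)
```

The negative curvature at every step is the second-order condition in action: the critical point the iteration lands on is a maximum.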

MLE as peak-finding on a curve

The widget below makes the geometry of MLE physical. Pick a distribution (Bernoulli, Exponential, Normal with σ fixed at 1, or Uniform(0, θ)) and a sample size $n$. The widget draws $n$ iid samples from the true distribution, then plots $\ell(\theta)$ as a function of $\theta$ on a slider. The slider lets you scrub $\theta$ by hand and watch $\ell(\theta)$ go up and down. The MLE θ̂ (green vertical line) is the argmax of the curve. The true $\theta$ (orange dashed) is the value that generated the data. Click Freeze new sample to draw new data and see the landscape shift.

[Interactive figure: Likelihood Landscape]

Four things to look for. First, the curve always has its maximum at the closed-form MLE: drag the slider toward it to make the value $\ell(\theta_{\text{slider}})$ approach $\ell(\hat{\theta})$. Second, the peak moves around with the data: hit Freeze new sample and watch θ̂ wander while the true $\theta$ stays put. That wandering is the sampling distribution of θ̂. Third, the peak gets sharper as $n$ grows from 5 to 300; the curvature at the peak is what §1.4 will call Fisher information, and asymptotic variance scales like one over it. Fourth, the Uniform(0, θ) curve is qualitatively different: a vertical cliff at $\theta = \max(X_i)$ on the left (likelihood is zero, so log-likelihood is $-\infty$) and a smooth decay to the right. The MLE is the cliff edge. You cannot find this peak by setting a derivative to zero; the recipe fails for the Uniform because the support of the distribution depends on the parameter. We will come back to this.

Four closed-form MLEs

Bernoulli $\text{Bern}(p)$. Each $X_i \in \{0, 1\}$ with $P(X_i = 1) = p$. Let $k = \sum X_i$ be the count of 1s. Then

$$\ell(p) = k \log p + (n - k) \log(1 - p), \qquad \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p}.$$

Setting the score to zero and solving:

$$\hat{p}_{\text{MLE}} = \frac{k}{n} = \bar{X}.$$

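The sample proportion. The same answer MoM gives. (This is no accident: in a one-parameter exponential family whose sufficient statistic is $\sum X_i$, the likelihood equation reduces to $E_\theta[X] = \bar{X}$, which is exactly the first-moment equation. It is not a general rule, though. The Laplace location parameter equals the mean, yet its MLE is the sample median.)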
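The algebra behind the Bernoulli result, spelled out for the interior case $0 < k < n$ (the case assumption is added here so that both terms of the score are defined):

```latex
\begin{aligned}
\frac{k}{\hat{p}} - \frac{n-k}{1-\hat{p}} &= 0 \\
k\,(1-\hat{p}) &= (n-k)\,\hat{p} \\
k &= n\,\hat{p} \\
\hat{p} &= \frac{k}{n}, \\[6pt]
\frac{\partial^2 \ell}{\partial p^2}
  &= -\frac{k}{p^2} - \frac{n-k}{(1-p)^2} \;<\; 0
  \quad \text{for } 0 < p < 1 .
\end{aligned}
```

The last line is the second-order check: the log-likelihood is strictly concave on $(0, 1)$, so the critical point is the unique maximum.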

Exponential $\text{Exp}(\lambda)$. Density $f(x; \lambda) = \lambda e^{-\lambda x}$.

$$\ell(\lambda) = n \log \lambda - \lambda \sum X_i, \qquad \frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum X_i.$$

Setting to zero: $\hat{\lambda}_{\text{MLE}} = n / \sum X_i = 1 / \bar{X}$. Again the same as MoM. The peak of $\ell$ at $1/\bar{X}$ is what you scrub through with the slider in the widget.

Normal $N(\mu, \sigma^2)$ (both parameters). The log-likelihood is

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum (X_i - \mu)^2.$$

Taking partial derivatives:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum(X_i - \mu), \qquad \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum (X_i - \mu)^2.$$

Setting both to zero:

$$\hat{\mu}_{\text{MLE}} = \bar{X}, \qquad \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum (X_i - \bar{X})^2.$$

Look at the variance MLE: it is the biased form with $1/n$, identical to the MoM estimator from §1.2. The unbiased form $\frac{1}{n-1}\sum (X_i - \bar{X})^2$, the one statistics textbooks call "the sample variance", is not the MLE. The MLE is biased low, by a factor of $(n-1)/n$. This is a recurring lesson: MLE in finite samples can be biased, sometimes substantially so. Asymptotically the bias vanishes (the factor goes to 1 as $n \to \infty$), which is in line with the consistency property stated below.
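The $(n-1)/n$ factor shows up clearly in simulation. This sketch assumes a small sample size ($n = 5$), a true variance of 4, and 40,000 replicates; all three are arbitrary illustrative choices.

```python
import random

def variance_mle(sample):
    """Normal-variance MLE: squared deviations from the sample mean, divided by n."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

rng = random.Random(2)
n, sigma2, replicates = 5, 4.0, 40_000

estimates = [
    variance_mle([rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)])
    for _ in range(replicates)
]
average_mle = sum(estimates) / replicates
predicted_mean = (n - 1) / n * sigma2  # what the bias factor predicts

print(average_mle, predicted_mean, sigma2)
```

The average of the MLEs should sit near the predicted value rather than near the true variance; multiplying each estimate by $n/(n-1)$ would remove the gap.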

Uniform $U(0, \theta)$. This is the case where the differentiation recipe breaks. The density is

$$f(x; \theta) = \frac{1}{\theta} \, \mathbb{1}\{0 \leq x \leq \theta\}.$$

The likelihood is $L(\theta) = \theta^{-n}$ if $\theta \geq \max_i X_i$, and $L(\theta) = 0$ if $\theta < \max_i X_i$ (because at least one $X_i$ would fall outside the support, giving zero density there). Equivalently $\ell(\theta) = -n \log \theta$ on $[\max_i X_i, \infty)$ and $-\infty$ otherwise. The function is monotonically decreasing on its support, so the maximum is at the LEFT boundary:

$$\hat{\theta}_{\text{MLE}} = \max_i X_i.$$

You cannot get this by setting a derivative to zero. The derivative $-n / \theta$ is never zero. The maximum is at a non-differentiable boundary point, the smallest $\theta$ that the data does not rule out. This is the boundary case that §1.4 will identify as a violation of the "regularity conditions" needed for the CRLB and asymptotic normality results.
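A few lines of code are enough to see the cliff. The data values below are made up for illustration, and `uniform_log_likelihood` is a hypothetical helper name.

```python
import math

def uniform_log_likelihood(theta, data):
    """Log-likelihood of Uniform(0, theta): -n*log(theta) if theta covers the data, else -inf."""
    if theta <= 0.0 or min(data) < 0.0 or theta < max(data):
        return -math.inf
    return -len(data) * math.log(theta)

data = [0.31, 1.72, 0.95, 1.18, 0.44]  # made-up sample
theta_hat = max(data)                  # the MLE is the largest observation

just_below = uniform_log_likelihood(theta_hat - 0.01, data)  # ruled out by the data
at_mle = uniform_log_likelihood(theta_hat, data)             # the cliff edge
just_above = uniform_log_likelihood(theta_hat + 0.01, data)  # allowed, but lower

print(just_below, at_mle, just_above)
```

A gradient-based optimiser started to the right of the cliff would keep walking left until it hit the boundary, which is the numerical face of "the derivative is never zero".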

Three big properties of the MLE

Under regularity conditions (the parameter space is open, the density is differentiable, the support does not depend on $\theta$, expectations of certain derivatives exist and can be interchanged with integrals) the MLE has three remarkable large-sample properties. They are the reason MLE is the default in modern statistics.

Consistency. As the sample size $n$ grows, the MLE converges in probability to the true parameter:

$$\hat{\theta}_{\text{MLE}, n} \xrightarrow{P} \theta_0 \quad \text{as} \quad n \to \infty.$$

Intuitively: the log-likelihood divided by $n$ converges (by the LLN) to the expected log-likelihood $E_{\theta_0}[\log f(X; \theta)]$, and that expected log-likelihood is maximised at $\theta = \theta_0$ (by Jensen / non-negativity of KL divergence; see Cox & Hinkley 1974 for the rigorous argument). So the empirical maximum converges to the population maximum.
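The Jensen step in that intuition can be written out in three lines. This is a sketch of the standard argument, assuming the expectations exist; the integral in the last line runs over the support of $f(\cdot\,; \theta_0)$, which is why it is at most 1.

```latex
\begin{aligned}
E_{\theta_0}\!\left[\log f(X;\theta)\right] - E_{\theta_0}\!\left[\log f(X;\theta_0)\right]
  &= E_{\theta_0}\!\left[\log \frac{f(X;\theta)}{f(X;\theta_0)}\right] \\
  &\le \log E_{\theta_0}\!\left[\frac{f(X;\theta)}{f(X;\theta_0)}\right]
     \qquad \text{(Jensen: } \log \text{ is concave)} \\
  &= \log \int_{\{x \,:\, f(x;\theta_0) > 0\}} f(x;\theta)\, dx \;\le\; \log 1 = 0 .
\end{aligned}
```

The left-hand side is minus the KL divergence from the true density to the candidate, so no $\theta$ can beat $\theta_0$ in expected log-likelihood.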

Asymptotic normality. More precisely, the MLE has a normal limiting distribution with variance equal to the inverse Fisher information:

$$\sqrt{n}\,(\hat{\theta}_{\text{MLE}} - \theta_0) \xrightarrow{d} N(0, \, I(\theta_0)^{-1}),$$

where $I(\theta) = -E[\partial^2 \ell_1 / \partial \theta^2]$ is the Fisher information per observation, with $\ell_1$ the log-likelihood of a single observation (§1.4 will define it carefully). This is one of the great results in statistical theory. In a finite sample the MLE has whatever variance it has; asymptotically, it concentrates around the truth at the canonical $1/\sqrt{n}$ rate, with a covariance matrix you can write down from a single integral. It is also asymptotically efficient: its asymptotic variance achieves the Cramér–Rao lower bound, the smallest variance possible among (regular) unbiased estimators. §1.4 is the section that makes this efficiency claim precise.
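A quick simulation can check the variance claim for the Exponential model, where the second derivative of the single-observation log-likelihood is $-1/\lambda^2$, so $I(\lambda) = 1/\lambda^2$ and the predicted limiting variance is $\lambda^2$. The true rate, sample size, and replicate count below are arbitrary illustrative choices.

```python
import random
import statistics

rng = random.Random(3)
lam_true, n, replicates = 1.5, 200, 4000

scaled_errors = []
for _ in range(replicates):
    total = sum(rng.expovariate(lam_true) for _ in range(n))
    lam_hat = n / total                                  # closed-form MLE, 1 / sample mean
    scaled_errors.append(n ** 0.5 * (lam_hat - lam_true))

empirical_var = statistics.pvariance(scaled_errors)
predicted_var = lam_true ** 2                            # inverse Fisher information

print(empirical_var, predicted_var)
```

The two numbers should be close but not identical: the limit is a statement about $n \to \infty$, and at $n = 200$ a small finite-sample excess remains.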

Invariance. If $\hat{\theta}$ is the MLE of $\theta$ and $g$ is any function, then $g(\hat{\theta})$ is the MLE of $g(\theta)$. In symbols:

$$\widehat{g(\theta)}_{\text{MLE}} = g(\hat{\theta}_{\text{MLE}}).$$

This is the easy-to-prove and constantly-used property. If you have the MLE of a Bernoulli $p$, then the MLE of the log-odds $\log(p / (1-p))$ is just $\log(\hat{p} / (1 - \hat{p}))$, with no separate optimisation required. If you have the MLE of the variance $\sigma^2$, the MLE of the standard deviation is $\sqrt{\hat{\sigma}^2}$. Invariance is what makes MLE a clean estimator across reparameterisations; it does not depend on how you choose to write down the parameter. (MoM does not share this: its result depends on which moments you choose to match.)
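Invariance can be checked by brute force. The sketch below assumes 12 successes in 30 trials (made-up counts), computes the log-odds MLE by plugging in $\hat{p}$, and compares it with a direct grid search over the log-odds parameter. The grid range and spacing are arbitrary.

```python
import math

def bernoulli_log_likelihood_in_logit(eta, k, n):
    """Bernoulli log-likelihood written in terms of the log-odds eta = log(p / (1 - p))."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

k, n = 12, 30  # made-up counts: 12 successes in 30 trials

# Route 1: invariance. Transform the MLE of p.
p_hat = k / n
eta_by_invariance = math.log(p_hat / (1.0 - p_hat))

# Route 2: maximise directly over eta on a grid from -5 to 5 in steps of 0.001.
grid = [-5.0 + i * 0.001 for i in range(10_001)]
eta_by_search = max(grid, key=lambda eta: bernoulli_log_likelihood_in_logit(eta, k, n))

print(eta_by_invariance, eta_by_search)
```

The two routes agree up to the grid spacing, which is all invariance promises: reparameterising relabels the horizontal axis of the likelihood curve without moving its peak.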

The Uniform(0, θ) counterexample to asymptotic normality

The asymptotic-normality statement above is true under regularity. The Uniform(0, θ) MLE is a case where regularity fails: the support depends on $\theta$, so the density is not differentiable in $\theta$ at the boundary $\theta = X_i$ for each observation. The MLE $\hat{\theta} = \max_i X_i$ still exists and is still consistent (the maximum of $n$ iid Uniform(0, $\theta$) variables converges to $\theta$ as $n \to \infty$), but its asymptotic distribution is not normal, and its rate of convergence is $1/n$ rather than $1/\sqrt{n}$:

$$n(\theta - \hat{\theta}_{\text{MLE}}) \xrightarrow{d} \text{Exp}(1/\theta) \quad \text{as} \quad n \to \infty.$$

Here $\text{Exp}(1/\theta)$ uses the rate parameterisation from above, so the limit has mean $\theta$. This is a faster rate than the canonical $1/\sqrt{n}$, which makes the Uniform MLE "super-efficient": it concentrates around the truth more rapidly than any regular estimator can. The price is that the standard asymptotic-normality machinery does not apply: you cannot read off variance from inverse Fisher information; the limiting distribution is exponential, not normal; and the CRLB (which assumes regularity) gives a lower bound the MLE blows past. The Uniform example is the textbook reminder that "MLE is asymptotically normal" requires regularity, and that boundary cases are genuinely different.
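The exponential limit can be seen in a short simulation. An exponential law with mean $\theta$ has standard deviation $\theta$ and puts probability $e^{-1} \approx 0.368$ above its mean, so those are the three summaries the sketch reports. The values of `theta`, `n`, and `replicates` are arbitrary.

```python
import random
import statistics

rng = random.Random(4)
theta, n, replicates = 2.0, 200, 5000

scaled_gaps = []
for _ in range(replicates):
    theta_hat = max(rng.uniform(0.0, theta) for _ in range(n))  # Uniform MLE
    scaled_gaps.append(n * (theta - theta_hat))                 # note n, not sqrt(n)

mean_gap = sum(scaled_gaps) / replicates
sd_gap = statistics.pstdev(scaled_gaps)
share_above_theta = sum(gap > theta for gap in scaled_gaps) / replicates

print(mean_gap, sd_gap, share_above_theta)
```

Every scaled gap is non-negative because the maximum can never exceed $\theta$, so the sampling distribution is one-sided, something a normal limit could not reproduce.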

MLE vs MoM: efficiency, made visible

§1.2 ended with the Uniform widget where MLE absolutely crushed MoM. That was a special case. The general claim is milder: for "regular" problems (smooth densities, parameter not at the support boundary) the MLE's asymptotic variance is no larger than MoM's, and is sometimes smaller by a meaningful factor. The widget below quantifies that for the Gamma distribution, where MoM has a closed form and MLE needs Newton iteration.

For $X_i \sim \text{Gamma}(\alpha, \beta)$ (scale parameterisation, mean $\alpha\beta$, variance $\alpha\beta^2$): the MoM estimator of $\alpha$ is $\hat{\alpha}_{\text{MoM}} = \bar{X}^2 / s^2$ (closed-form, one line). The MLE is the $\alpha$ that satisfies the score equation $\log \alpha - \psi(\alpha) = \log \bar{X} - \overline{\log X}$, where $\psi$ is the digamma function. There is no closed form; Newton-Raphson on this equation converges in 3–5 iterations. The widget runs 1500 replicates of size $n$, computes both estimators on each, and plots the two empirical sampling distributions of $\hat{\alpha}$ side by side.
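Here is one way the two Gamma shape estimators could be computed with only the standard library. It is a sketch, not the widget's actual code: the `digamma` and `trigamma` helpers use a recurrence plus a truncated asymptotic series, $s^2$ is taken with the $1/(n-1)$ divisor (an assumption), Newton starts from the MoM value, and the halving guard is an added safety measure.

```python
import math
import random

def digamma(x):
    """psi(x), x > 0: shift up via psi(x) = psi(x + 1) - 1/x, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """psi'(x), x > 0: shift up via psi'(x) = psi'(x + 1) + 1/x^2, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def gamma_shape_estimates(data, tol=1e-10, max_iter=100):
    """Return (MoM estimate, MLE) of the Gamma shape parameter alpha."""
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)   # assumption: 1/(n-1) divisor
    alpha_mom = mean ** 2 / s2
    target = math.log(mean) - sum(math.log(x) for x in data) / n
    alpha = alpha_mom                                   # MoM as the starting value
    for _ in range(max_iter):
        g = math.log(alpha) - digamma(alpha) - target   # score equation residual
        g_prime = 1.0 / alpha - trigamma(alpha)         # always negative
        alpha_next = alpha - g / g_prime
        if alpha_next <= 0.0:                           # added guard: keep alpha > 0
            alpha_next = alpha / 2.0
        if abs(alpha_next - alpha) < tol:
            return alpha_mom, alpha_next
        alpha = alpha_next
    return alpha_mom, alpha

rng = random.Random(5)
data = [rng.gammavariate(2.0, 1.0) for _ in range(5000)]  # shape 2, scale 1 (made-up)
alpha_mom, alpha_mle = gamma_shape_estimates(data)
print(alpha_mom, alpha_mle)
```

Wrapping this function in a loop over many simulated samples and comparing the spread of the two estimates is the experiment the widget runs.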

[Interactive figure: MLE vs MoM Multi]

Two things to look for. First, the MLE histogram (green) is narrower than the MoM histogram (red). The variance ratio is reported in the status panel, typically 1.5× to 4× depending on $\alpha$. Second, the asymptotic-CRLB approximation $\alpha / [n (\alpha \psi'(\alpha) - 1)]$ (§1.4 will derive this for you) is shown alongside the empirical variance of the MLE; they should agree closely once $n$ is moderate (say 30+). That agreement is the cash value of "MLE is asymptotically efficient": it tells you in advance what variance to expect for the MLE without simulating, just from the model.

Where MLE lives in practice, and where it bites

Three honest caveats follow the three big properties:

  • Finite-sample bias. MLE is not generally unbiased. The Normal-variance MLE $(1/n)\sum (X_i - \bar{X})^2$ is biased low by a factor $(n-1)/n$; this is why textbooks define "the sample variance" with $1/(n-1)$. The Uniform MLE $\max_i X_i$ is biased low by a factor $n/(n+1)$. Many GLM and mixed-model MLEs carry finite-sample bias that is not negligible (for the variance components in random-effects models, REML is preferred for exactly this reason). In regular problems with a fixed number of parameters these biases shrink as $n$ grows, but small samples can be problematic.
  • Model misspecification. If the model is wrong, the MLE still converges, but to the "least false" parameter $\theta^*$ that minimises the Kullback–Leibler divergence from the true distribution to the assumed model, not to anything you actually wanted. Worse, the asymptotic-normality variance formula breaks: the "sandwich" estimator $A^{-1} B A^{-1}$ (with $A = -E[\partial^2 \ell]$ and $B = E[(\partial \ell)^2]$) replaces $I(\theta)^{-1}$. Under correct specification $A = B$ and the sandwich collapses to $A^{-1} = I(\theta)^{-1}$; under misspecification it does not. Robust standard errors (Huber, Eicker, White) implement the sandwich correction; they are crucial in practice and are revisited in §1.8 and §4.5.
  • Optimisation pathology. Closed forms only exist for a few distributions. Most modern MLE problems require numerical optimisation (Newton-Raphson, BFGS, Fisher scoring, EM for missing-data and mixture problems). The log-likelihood can be non-concave (mixtures, neural networks) with multiple local optima, ridges where the curvature collapses, or boundaries where the optimisation stalls. Identifiability failures (the model is invariant to relabelling components in a mixture, for example) produce the famous "label-switching" pathology. Robust starting values — often from MoM — and multi-start optimisation help. There is no silver bullet.
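The sandwich formula from the misspecification caveat can be illustrated with a deliberately wrong model. In this sketch an Exponential model is fitted to Gamma data with shape 4 (an illustrative choice), and the model-based standard error $\sqrt{A^{-1}/n}$ is compared with the sandwich standard error $\sqrt{A^{-1} B A^{-1}/n}$, using sample averages in place of the expectations. The function name is hypothetical.

```python
import random

def exponential_fit_with_standard_errors(data):
    """Fit Exponential(rate) by MLE; return (rate, model-based SE, sandwich SE)."""
    n = len(data)
    lam_hat = n / sum(data)
    # A: minus the second derivative of the per-observation log-likelihood, 1 / lam^2.
    a_hat = 1.0 / lam_hat ** 2
    # B: average squared per-observation score (1/lam - x)^2, evaluated at lam_hat.
    b_hat = sum((1.0 / lam_hat - x) ** 2 for x in data) / n
    model_se = (1.0 / (n * a_hat)) ** 0.5
    sandwich_se = (b_hat / (n * a_hat ** 2)) ** 0.5
    return lam_hat, model_se, sandwich_se

rng = random.Random(6)

# Misspecified case: data are Gamma(shape 4, scale 0.25), not Exponential.
wrong_model_data = [rng.gammavariate(4.0, 0.25) for _ in range(2000)]
_, wrong_model_se, wrong_sandwich_se = exponential_fit_with_standard_errors(wrong_model_data)

# Correctly specified case: data really are Exponential with rate 1.
right_model_data = [rng.expovariate(1.0) for _ in range(2000)]
_, right_model_se, right_sandwich_se = exponential_fit_with_standard_errors(right_model_data)

print(wrong_sandwich_se / wrong_model_se, right_sandwich_se / right_model_se)
```

When the model is right the two standard errors roughly coincide; when it is wrong they can differ by a large factor, and here the model-based one is the misleading figure.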

Despite the caveats, MLE is the modern default for parametric estimation in research statistics. It is the estimator the software gives you when you fit a linear regression (least squares is the MLE under iid Normal errors), a logistic regression (iteratively reweighted least squares is Fisher scoring on the binomial likelihood), or any GLM. The three properties (consistency, asymptotic normality with efficient variance, invariance) combine to make MLE a one-size-fits-most recipe with a coherent theory of inference attached. §1.4 makes the efficiency claim precise; §3.3 gives you likelihood-ratio confidence intervals (often better-calibrated than Wald CIs based on $I(\hat{\theta})^{-1}$); §4.7 gives you AIC and BIC (model-selection criteria built directly on the log-likelihood). MLE is not just one estimator; it is a framework.

Try it

  • In the likelihood-landscape widget, pick Bernoulli(p) with $n = 30$. Drag the slider for $p$ and watch $\ell(p)$. Where is the peak? Read it off the green MLE marker, then verify by computing $\bar{X}$ from the empirical numbers reported. Now click Freeze new sample a few times: does the peak move around the true $p = 0.4$? That movement is the sampling distribution of $\hat{p}_{\text{MLE}}$.
  • Same widget, switch to Exponential(λ) with $n = 30$ and the true λ = 1. Slide θ from 0.1 up to 5 and watch ℓ(θ). The curve is unimodal and smooth. Now drop $n$ to 5. How much wider is the peak? (You should see the slider has to move much further before ℓ drops noticeably; small $n$ means little curvature, which §1.4 will identify as low Fisher information.)
  • Switch to Uniform(0, θ). Slide θ from 0.05 up to 3 and look at the shape of ℓ(θ). What happens on the LEFT side of the MLE θ̂ = max(Xᵢ)? (Sharp cliff to −∞.) Try to maximise ℓ(θ) using only the slider — note that you cannot get above the green line because anywhere right of max(Xᵢ) is sub-optimal and anywhere left is impossible.
  • In the MLE-vs-MoM-multi widget, set α = 2, β = 1, n = 30. Re-run replicates a couple of times. What is the empirical variance ratio Var(MoM) / Var(MLE)? Now bump α up to 6 (closer to Normal). Does the ratio shrink? (Yes — for high-α Gamma the distribution becomes near-Normal and MoM is nearly as good as MLE; for low α the MLE wins by more.)
  • Same widget, set α = 0.7 (very skewed Gamma), β = 1, n = 30. Look at the empirical Var(MLE) and the asymptotic-CRLB value $\alpha / [n (\alpha \psi'(\alpha) - 1)]$. Are they close? Now bump $n$ to 100. Should be closer. This is asymptotic efficiency landing on a finite-sample estimate.
  • Pen-and-paper: derive the Bernoulli MLE. Show that the log-likelihood $\ell(p) = k \log p + (n - k) \log(1 - p)$ has its maximum at $\hat{p} = k/n$ by setting the score to zero. Then apply invariance: write down the MLE of the log-odds $\log(p / (1 - p))$ without doing any new optimisation.

Pause and reflect: the likelihood $L(\theta; X)$ is the joint density of the data viewed as a function of the parameter. It is not a probability distribution over $\theta$; frequentist statistics treats $\theta$ as fixed but unknown. Bayesian statistics (§7) treats $\theta$ as random and multiplies the likelihood by a prior $\pi(\theta)$ to get a posterior $\pi(\theta \mid X) \propto L(\theta; X) \pi(\theta)$. What does setting $\pi(\theta) = \text{const}$ (the "uniform prior") do to the relationship between the posterior mode and the MLE? When does that uniform-prior choice make sense, and when does it depend on the parameterisation in a way the MLE itself does not?

What you now know

Maximum likelihood is your second great recipe for constructing an estimator: write down the joint density of the data as a function of $\theta$, take logs, find the $\theta$ that maximises the resulting log-likelihood. You have the closed-form MLEs for Bernoulli ($\bar{X}$), Exponential ($1/\bar{X}$), Normal ($\bar{X}$ and the biased $(1/n)\sum(X_i - \bar{X})^2$), and Uniform(0, θ) ($\max_i X_i$, the boundary case where the derivative recipe breaks). You have the three big properties under regularity: consistency, asymptotic normality with inverse-Fisher-information variance, and invariance. And you have the three honest caveats (finite-sample bias, sensitivity to misspecification, optimisation pathology) and a sense of when MLE is the right default and when MoM, robust alternatives, or Bayesian methods do better.

§1.4 turns "asymptotically efficient" into a calculation by introducing the Cramér–Rao lower bound and Fisher information. You will be able to compute the lower bound on variance among unbiased estimators and verify that MLE asymptotically achieves it. §1.5 revisits bias-variance in full and shows when even MLE can be beaten by shrinkage — the James–Stein result you previewed in §1.1. After that Part 1 turns from constructing estimators to characterising them: §1.6 makes "standard error" precise, §1.7 introduces the bootstrap (a way to estimate the sampling distribution without leaning on asymptotic theory), §1.8 hardens the methods against outliers, and §1.9 makes the large-sample machinery rigorous.

References

  • Fisher, R.A. (1922). "On the mathematical foundations of theoretical statistics." Philosophical Transactions of the Royal Society of London A 222, 309–368. (The foundational paper. Fisher introduces the likelihood, the MLE, and the concepts of consistency, efficiency, and sufficiency in one extraordinary essay.)
  • Fisher, R.A. (1925). "Theory of statistical estimation." Proceedings of the Cambridge Philosophical Society 22(5), 700–725. (The follow-up that pins down asymptotic efficiency and introduces Fisher information explicitly.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 9, "Parametric Inference"; the modern textbook presentation of MLE and its asymptotics.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 7, "Point Estimation"; a thorough but careful treatment with worked examples.)
  • Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. (A book-length argument for putting the likelihood at the centre of applied statistics; especially useful for the philosophical questions about what likelihood is and is not.)
