Maximum likelihood
Learning objectives
- State the maximum likelihood principle: θ̂_MLE = argmax_θ L(θ; X) = argmax_θ Σ log f(Xᵢ; θ)
- Distinguish the likelihood as a function of θ (data fixed) from the PDF as a function of x (parameter fixed)
- Derive closed-form MLEs for Bernoulli, Exponential, Normal, and Uniform(0, θ), and recognise when boundary behaviour breaks the differentiation recipe
- State the asymptotic properties: consistency, asymptotic normality with variance I(θ)⁻¹, and the invariance property for transformations
- Compare MLE to MoM in efficiency, computability, and finite-sample bias, and place MLE as the modern default with honest caveats (misspecification, optimisation failure)
§1.2 gave you the first recipe for constructing an estimator: equate sample moments to population moments, solve. It works, it gives you closed forms, and the Uniform(0, θ) example showed where it can fail badly. §1.3 introduces the second recipe, due to Ronald Fisher in his 1922 paper that founded modern statistics: maximum likelihood. The idea is one of the cleanest in all of statistical theory — write down the probability that the data you actually observed would arise as a function of the unknown parameters, and pick the parameters that maximise that probability. You will use MLE more often than any other estimator in your research career; in default settings of most statistical software, "fit this model" means "find the MLE."
This section gives you the definition, the algebra of the log-likelihood that makes the recipe practical, the closed-form MLEs for four workhorse distributions, the boundary counterexample where the textbook derivation fails, the three theoretical properties that make MLE special (consistency, asymptotic normality, invariance), and the honest list of caveats: biased in finite samples, sensitive to misspecification, occasionally hard to compute. §1.4 will sharpen the asymptotic-normality statement by giving you the Cramér–Rao bound and Fisher information; §1.5 will revisit bias-variance and show when even MLE can be improved upon by shrinkage.
The likelihood, and the flip
You have iid data $X_1, \dots, X_n$ drawn from a parametric family with density $f(x; \theta)$. The joint density of the observed sample is the product
$$f(x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta).$$
As a function of the data $x_1, \dots, x_n$ with $\theta$ held fixed, this is the joint probability density, the standard thing from Part 0. The likelihood function is the same algebraic expression viewed the other way around: data held fixed, parameter varying.
$$L(\theta; X) = \prod_{i=1}^{n} f(X_i; \theta).$$
The notational distinction matters. The PDF $f(x; \theta)$ is a function of $x$ for fixed $\theta$; it integrates to 1. The likelihood $L(\theta; X)$ is a function of $\theta$ for fixed data; it does not integrate to 1 (in $\theta$), and it is not a probability density over $\theta$ at all. It is just a relative scoring of how "compatible" each candidate $\theta$ is with the observed data. The flip (same algebra, different argument) is the conceptual leap you need to internalise. Read the likelihood out loud as "the probability that THIS data would arise IF the parameter were θ" and you have the right mental model.
The log-likelihood and why you always use it
Three things go wrong with $L(\theta)$ as written. First, it is a product of $n$ numbers each between 0 and 1 (or, for continuous densities, each on whatever scale $f$ produces); for moderate $n$ it underflows to floating-point zero. Second, derivatives of products are messy. Third, intuition about products is bad and intuition about sums is good. All three problems are solved by taking the logarithm. The log-likelihood
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta)$$
is a sum of $n$ terms (numerically stable), is differentiable term by term, and shares the same maximiser as $L(\theta)$ because the logarithm is a monotone increasing function. Always work with the log-likelihood. The MLE is defined indifferently in either form:
$$\hat\theta_{\text{MLE}} = \arg\max_\theta L(\theta) = \arg\max_\theta \ell(\theta).$$
The score function is the derivative of the log-likelihood, $S(\theta) = \partial \ell(\theta) / \partial \theta$. Setting it to zero gives a candidate maximum:
$$S(\hat\theta) = \left.\frac{\partial \ell(\theta)}{\partial \theta}\right|_{\theta = \hat\theta} = 0.$$
(Check second-order conditions or compare interior critical points to boundary values to confirm you have a maximum, not a minimum or saddle.) For nicely-behaved single-parameter problems this gives the MLE in one line of algebra. For everything else — multi-parameter problems, transcendental score equations, boundary-supported distributions — you solve it numerically.
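The underflow problem described above is easy to reproduce. The following is a minimal sketch, assuming an Exponential(1) sample of size 2000 purely for illustration; the function names are hypothetical and only the Python standard library is used.

```python
import math
import random

def likelihood_product(xs, lam):
    """Raw likelihood: product of Exponential(lam) densities lam * exp(-lam * x)."""
    out = 1.0
    for x in xs:
        out *= lam * math.exp(-lam * x)
    return out

def log_likelihood(xs, lam):
    """Log-likelihood: sum of log densities, log(lam) - lam * x per observation."""
    return sum(math.log(lam) - lam * x for x in xs)

rng = random.Random(0)                          # fixed seed, illustrative only
xs = [rng.expovariate(1.0) for _ in range(2000)]

raw = likelihood_product(xs, 1.0)   # product of 2000 factors, each below 1
ll = log_likelihood(xs, 1.0)        # sum of 2000 moderate numbers
print(raw, ll)
```

The product collapses to floating-point zero long before 2000 factors have been multiplied, while the sum stays a perfectly ordinary number that an optimiser can work with.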
MLE as peak-finding on a curve
The widget below makes the geometry of MLE physical. Pick a distribution (Bernoulli, Exponential, Normal with σ fixed at 1, or Uniform(0, θ)) and a sample size $n$. The widget draws $n$ iid samples from the true distribution, then plots $\ell(\theta)$ as a function of $\theta$, with $\theta$ on a slider. The slider lets you scrub $\theta$ by hand and watch $\ell(\theta)$ go up and down. The MLE θ̂ (green vertical line) is the argmax of the curve. The true $\theta_0$ (orange dashed) is the value that generated the data. Click Freeze new sample to draw new data and see the landscape shift.
Four things to look for. First, the curve always has its maximum at the closed-form MLE: drag the slider toward it to make the value approach $\ell(\hat\theta)$. Second, the peak moves around with the data: hit Freeze new sample and watch θ̂ wander while the true $\theta_0$ stays put. That wandering is the sampling distribution of θ̂. Third, the peak gets sharper as $n$ grows from 5 to 300; the curvature at the peak is what §1.4 will call Fisher information, and asymptotic variance scales like one over it. Fourth, the Uniform(0, θ) curve is qualitatively different: a vertical cliff at $\theta = \max X_i$ on the left (likelihood is zero, so log-likelihood is $-\infty$) and a smooth decay to the right. The MLE is the cliff edge. You cannot find this peak by setting a derivative to zero; the recipe fails for the Uniform because the support of the distribution depends on the parameter. We will come back to this.
Four closed-form MLEs
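Bernoulli$(p)$. Each $X_i \in \{0, 1\}$ with $P(X_i = 1) = p$. Let $k = \sum_i X_i$ be the count of 1s. Then
$$\ell(p) = k \log p + (n - k) \log(1 - p), \qquad \frac{\partial \ell}{\partial p} = \frac{k}{p} - \frac{n - k}{1 - p}.$$
Setting the score to zero and solving:
$$\hat p = \frac{k}{n} = \bar X.$$
The sample proportion. The same answer MoM gives. (This is typical of exponential-family distributions whose parameter is the mean; it is not guaranteed in general. For the Laplace location model, for example, the parameter is the mean but the MLE is the sample median.)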
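As a quick numerical check of the closed form, one can evaluate the Bernoulli log-likelihood on a fine grid and confirm that the grid argmax lands on $k/n$. This is only a sketch: the counts $k = 7$, $n = 20$ and the grid spacing are illustrative assumptions.

```python
import math

def bernoulli_loglik(p, k, n):
    """Log-likelihood of k successes in n Bernoulli(p) trials (constant dropped)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

k, n = 7, 20                                   # illustrative data: 7 ones out of 20
grid = [i / 1000 for i in range(1, 1000)]      # p = 0.001, ..., 0.999
p_grid = max(grid, key=lambda p: bernoulli_loglik(p, k, n))
p_closed = k / n                               # closed-form MLE from the score equation

print(p_grid, p_closed)
```

A grid search is crude, but it makes no use of the derivative, so agreement with $k/n$ is an independent check on the algebra.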
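Exponential$(\lambda)$. Density $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$.
$$\ell(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} X_i, \qquad \frac{\partial \ell}{\partial \lambda} = \frac{n}{\lambda} - \sum_{i=1}^{n} X_i.$$
Setting to zero: $\hat\lambda = n / \sum_i X_i = 1 / \bar X$. Again the same as MoM. The peak of $\ell(\lambda)$ at $\hat\lambda$ is what you scrub through with the slider in the widget.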
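The Exponential case is a convenient place to see that a derivative-free numerical maximiser recovers the closed form. The sketch below uses a ternary search, which is valid here because $\ell(\lambda)$ is strictly concave; the search interval, seed, sample size, and function names are assumptions made for the example.

```python
import math
import random

def exp_loglik(lam, n, total):
    """Exponential log-likelihood n*log(lam) - lam*sum(x), given n and sum(x)."""
    return n * math.log(lam) - lam * total

def ternary_max(f, lo, hi, iters=200):
    """Maximise a unimodal function f on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1          # the maximum cannot lie left of m1
        else:
            hi = m2          # the maximum cannot lie right of m2
    return (lo + hi) / 2.0

rng = random.Random(1)
xs = [rng.expovariate(2.0) for _ in range(200)]   # true rate 2, illustrative
n, total = len(xs), sum(xs)

lam_closed = n / total                            # 1 / sample mean
lam_numeric = ternary_max(lambda lam: exp_loglik(lam, n, total), 1e-6, 50.0)
print(lam_closed, lam_numeric)
```

Note that the log-likelihood depends on the data only through $n$ and $\sum X_i$, which is why the helper takes those two numbers rather than the whole sample.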
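Normal$(\mu, \sigma^2)$ (both parameters). The log-likelihood is
$$\ell(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2.$$
Taking partial derivatives:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \mu), \qquad \frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (X_i - \mu)^2.$$
Setting both to zero:
$$\hat\mu = \bar X, \qquad \hat\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2.$$
Look at the variance MLE: it is the biased form with $1/n$, identical to the MoM estimator from §1.2. The unbiased form with $1/(n-1)$, the one statistics textbooks call "the sample variance", is not the MLE. The MLE is biased low, by a factor of $(n-1)/n$. This is a recurring lesson: MLE in finite samples can be biased, sometimes substantially so. Asymptotically the bias vanishes (the factor goes to 1 as $n \to \infty$), which is in line with the consistency property discussed below.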
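The $(n-1)/n$ bias factor can be seen directly in simulation. A minimal sketch, assuming $n = 5$, $\sigma^2 = 4$, and 20000 replicates (all illustrative choices):

```python
import random

def var_mle(xs):
    """Normal-variance MLE: mean squared deviation about the sample mean (1/n form)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(2)
n, reps, sigma2 = 5, 20000, 4.0
sigma = sigma2 ** 0.5

estimates = [var_mle([rng.gauss(0.0, sigma) for _ in range(n)]) for _ in range(reps)]
avg_mle = sum(estimates) / reps            # should sit near (n-1)/n * sigma2 = 3.2
avg_unbiased = avg_mle * n / (n - 1)       # rescaling to the 1/(n-1) form

print(avg_mle, avg_unbiased)
```

With $n = 5$ the MLE underestimates the variance by 20% on average, which is far from negligible; rerunning with a larger $n$ shows the gap closing.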
Uniform$(0, \theta)$. This is the case where the differentiation recipe breaks. The density is
$$f(x; \theta) = \begin{cases} 1/\theta, & 0 \le x \le \theta, \\ 0, & \text{otherwise.} \end{cases}$$
The likelihood is $\theta^{-n}$ if $\theta \ge \max_i X_i$, and $0$ if $\theta < \max_i X_i$ (because at least one $X_i$ would fall outside the support, giving zero density there). Equivalently $L(\theta) = \theta^{-n}$ on $[\max_i X_i, \infty)$ and $0$ otherwise. The function $\theta^{-n}$ is monotonically decreasing on its support, so the maximum is at the LEFT boundary:
$$\hat\theta = \max_i X_i = X_{(n)}.$$
You cannot get this by setting a derivative to zero. The derivative is never zero. The maximum is at a non-differentiable boundary point, at the smallest $\theta$ that the data does not rule out. This is the boundary case that §1.4 will identify as a violation of the "regularity conditions" needed for the CRLB and asymptotic normality results.
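The cliff-and-decay shape is simple to encode. The sketch below writes the Uniform(0, θ) log-likelihood with its $-\infty$ region made explicit; the five data values are made up for illustration.

```python
import math

def uniform_loglik(theta, xs):
    """Log-likelihood of Uniform(0, theta): -n*log(theta) if theta covers the data."""
    if theta < max(xs):
        return -math.inf           # some observation lies outside [0, theta]
    return -len(xs) * math.log(theta)

xs = [0.8, 2.1, 1.4, 0.3, 1.9]     # illustrative sample
theta_hat = max(xs)                # MLE: the smallest theta the data allow

for theta in (2.0, theta_hat, 2.2, 3.0):
    print(theta, uniform_loglik(theta, xs))
```

Evaluating just left of `theta_hat` gives $-\infty$, and every value to the right is strictly lower than the value at `theta_hat`, so the maximum sits exactly on the cliff edge.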
Three big properties of the MLE
Under regularity conditions (the parameter space is open, the density is differentiable, the support does not depend on $\theta$, expectations of certain derivatives exist and can be interchanged with integrals) the MLE has three remarkable properties, the first two of which are large-sample results. They are the reason MLE is the default in modern statistics.
Consistency. As the sample size grows, the MLE converges in probability to the true parameter:
$$\hat\theta_n \xrightarrow{\;p\;} \theta_0 \quad \text{as } n \to \infty.$$
Intuitively: the log-likelihood divided by $n$ converges (by the LLN) to the expected log-likelihood $\mathbb{E}_{\theta_0}[\log f(X; \theta)]$, and that expected log-likelihood is maximised at $\theta = \theta_0$ (by Jensen / non-negativity of KL divergence; see Cox & Hinkley 1974 for the rigorous argument). So the empirical maximum converges to the population maximum.
Asymptotic normality. More precisely, the MLE has a normal limiting distribution with variance equal to the inverse Fisher information:
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \xrightarrow{\;d\;} \mathcal{N}\!\left(0,\; I(\theta_0)^{-1}\right),$$
where $I(\theta)$ is the Fisher information per observation (§1.4 will define it carefully). This is one of the great results in statistical theory. For a single observation the MLE has whatever variance it has; asymptotically, it concentrates around the truth at the canonical $1/\sqrt{n}$ rate, with a covariance matrix you can write down from a single integral. It is also asymptotically efficient: its asymptotic variance achieves the Cramér–Rao lower bound, the smallest variance possible among (regular) unbiased estimators. §1.4 is the section that makes this efficiency claim precise.
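To see the asymptotic-normality statement at work, one can simulate the Exponential MLE $\hat\lambda = 1/\bar X$, for which the per-observation Fisher information is $I(\lambda) = 1/\lambda^2$ and the limiting variance of $\sqrt{n}(\hat\lambda - \lambda)$ is therefore $\lambda^2$. The sketch below is illustrative only: $\lambda = 2$, $n = 200$, and 3000 replicates are assumptions, and the helper names are hypothetical.

```python
import math
import random

def exp_mle(xs):
    """Exponential-rate MLE: n / sum(x) = 1 / sample mean."""
    return len(xs) / sum(xs)

def scaled_errors(lam, n, reps, seed=0):
    """Draw `reps` samples of size n and return sqrt(n) * (lam_hat - lam) for each."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xs = [rng.expovariate(lam) for _ in range(n)]
        out.append(math.sqrt(n) * (exp_mle(xs) - lam))
    return out

def sample_variance(zs):
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / (len(zs) - 1)

lam_true = 2.0
zs = scaled_errors(lam_true, n=200, reps=3000)
empirical_var = sample_variance(zs)
asymptotic_var = lam_true ** 2        # inverse Fisher information, 1 / I(lam)

print(empirical_var, asymptotic_var)
```

The two numbers should be close but not identical: the finite-sample variance of $\hat\lambda$ is slightly larger than the asymptotic value, and the simulation itself carries Monte Carlo noise.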
Invariance. If $\hat\theta$ is the MLE of $\theta$ and $g$ is any function, then $g(\hat\theta)$ is the MLE of $g(\theta)$. In symbols:
$$\widehat{g(\theta)}_{\text{MLE}} = g(\hat\theta_{\text{MLE}}).$$
This is the easy-to-prove and constantly-used property. If you have the MLE of a Bernoulli $p$, then the MLE of the log-odds $\log\{p/(1-p)\}$ is just $\log\{\hat p/(1-\hat p)\}$, with no separate optimisation required. If you have the MLE of the variance $\hat\sigma^2$, the MLE of the standard deviation is $\sqrt{\hat\sigma^2}$. Invariance is what makes MLE a clean estimator across reparameterisations; it does not depend on how you choose to write down the parameter. (MoM does not share this; its result depends on which moments you choose to match.)
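Invariance can be checked numerically by maximising the Bernoulli likelihood directly in the log-odds parameterisation and comparing with the plug-in value. A minimal sketch, again assuming the illustrative counts $k = 7$, $n = 20$ and a grid over $\eta \in [-5, 5]$:

```python
import math

def loglik_in_logodds(eta, k, n):
    """Bernoulli log-likelihood written in terms of the log-odds eta = log(p/(1-p))."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

k, n = 7, 20
p_hat = k / n
eta_plugin = math.log(p_hat / (1.0 - p_hat))     # invariance: just transform p_hat

# Direct maximisation over eta on a grid with spacing 0.001.
grid = [-5.0 + 0.001 * i for i in range(10001)]
eta_direct = max(grid, key=lambda eta: loglik_in_logodds(eta, k, n))

print(eta_plugin, eta_direct)
```

The two answers agree up to the grid spacing, which is the content of the invariance property: reparameterising moves the axis labels, not the location of the peak.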
The Uniform(0, θ) counterexample to asymptotic normality
The asymptotic-normality statement above is true under regularity. The Uniform(0, θ) MLE is a case where regularity fails: the support depends on $\theta$, so the density is not differentiable in $\theta$ at the boundary for each observation. The MLE still exists and is still consistent (the maximum of $n$ iid Uniform(0, $\theta$) variables converges to $\theta$ as $n \to \infty$), but its asymptotic distribution is not normal, and its rate of convergence is $1/n$ rather than $1/\sqrt{n}$:
$$n\,(\theta - \hat\theta_n) \xrightarrow{\;d\;} \text{Exponential with mean } \theta.$$
This is a faster rate than the canonical $1/\sqrt{n}$, which makes the Uniform MLE "super-efficient": it concentrates around the truth more rapidly than any regular estimator can. The price is that the standard asymptotic-normality machinery does not apply: you cannot read off variance from inverse Fisher information; the limiting distribution is exponential, not normal; and the CRLB (which assumes regularity) gives a lower bound the MLE blows past. The Uniform example is the textbook reminder that "MLE is asymptotically normal" requires regularity, and that boundary cases are genuinely different.
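The exponential limit is easy to check by simulation. The sketch below draws repeated Uniform(0, θ) samples, scales the gap $\theta - \max X_i$ by $n$, and compares the result with what an exponential distribution with mean θ would give; θ = 3, $n = 100$, and 4000 replicates are illustrative assumptions.

```python
import random

def scaled_gap(theta, n, rng):
    """Return n * (theta - max X_i) for one Uniform(0, theta) sample of size n."""
    sample_max = max(rng.uniform(0.0, theta) for _ in range(n))
    return n * (theta - sample_max)

rng = random.Random(3)
theta_true, n, reps = 3.0, 100, 4000
gaps = [scaled_gap(theta_true, n, rng) for _ in range(reps)]

mean_gap = sum(gaps) / reps                              # exponential limit has mean theta
frac_above = sum(g > theta_true for g in gaps) / reps    # limit: exp(-1), about 0.368

print(mean_gap, frac_above)
```

Every scaled gap is non-negative because the sample maximum can never exceed θ, which is another way of seeing that the limit cannot be a symmetric normal distribution.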
MLE vs MoM: efficiency, made visible
§1.2 ended with the Uniform widget where MLE absolutely crushed MoM. That was a special case. The general claim is milder: for "regular" problems (smooth densities, parameter not at the support boundary) MLE has lower variance than MoM, sometimes by a meaningful factor. The widget below quantifies that for the Gamma distribution, where MoM has a closed form and MLE needs Newton iteration.
For Gamma$(\alpha, \beta)$ (scale parameterisation, mean $\alpha\beta$, variance $\alpha\beta^2$): the MoM estimator of $\alpha$ is $\hat\alpha_{\text{MoM}} = \bar X^2 / \hat\sigma^2$ (closed-form, one line). The MLE is the $\alpha$ that satisfies the score equation $\log \alpha - \psi(\alpha) = \log \bar X - \overline{\log X}$, where $\psi$ is the digamma function and $\overline{\log X}$ is the sample mean of $\log X_i$. There is no closed form; Newton-Raphson on this equation converges in 3–5 iterations. The widget runs 1500 replicates of size $n$, computes both estimators on each, and plots the two empirical sampling distributions of $\hat\alpha$ side by side.
Two things to look for. First, the MLE histogram (green) is narrower than the MoM histogram (red). The variance ratio is reported in the status panel, typically 1.5× to 4× depending on $\alpha$. Second, the asymptotic-CRLB approximation (§1.4 will derive this for you) is shown alongside the empirical variance of the MLE; they should agree closely once $n$ is moderate (say 30+). That agreement is the cash value of "MLE is asymptotically efficient": it tells you in advance what variance to expect for the MLE without simulating, just from the model.
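The Newton iteration for the Gamma shape can be sketched in a few lines. What follows is not the widget's code, only a standard-library illustration under stated assumptions: `digamma` and `trigamma` are hand-rolled approximations (upward recurrence followed by a truncated asymptotic series), the MoM estimate is used as the starting value, and a halving guard is added in case a Newton step overshoots below zero. In practice a library routine such as SciPy's `scipy.special.digamma` would normally replace the hand-rolled versions.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) for x > 0: recurrence up to x >= 6, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """Approximate psi'(x) for x > 0 by the same recurrence-plus-series idea."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return acc + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def gamma_shape_mom(xs):
    """MoM shape estimate: mean^2 / variance, with the 1/n variance."""
    n = len(xs)
    xbar = sum(xs) / n
    var = sum((x - xbar) ** 2 for x in xs) / n
    return xbar * xbar / var

def gamma_shape_mle(xs, tol=1e-10, max_iter=50):
    """Newton-Raphson on g(alpha) = log(alpha) - psi(alpha) - s, started at MoM."""
    n = len(xs)
    s = math.log(sum(xs) / n) - sum(math.log(x) for x in xs) / n
    alpha = gamma_shape_mom(xs)
    for _ in range(max_iter):
        g = math.log(alpha) - digamma(alpha) - s
        g_prime = 1.0 / alpha - trigamma(alpha)
        new_alpha = alpha - g / g_prime
        if new_alpha <= 0.0:              # guard against overshooting below zero
            new_alpha = alpha / 2.0
        converged = abs(new_alpha - alpha) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha

rng = random.Random(5)
xs = [rng.gammavariate(2.0, 1.5) for _ in range(2000)]   # shape 2, scale 1.5
alpha_mom = gamma_shape_mom(xs)
alpha_mle = gamma_shape_mle(xs)
beta_mle = (sum(xs) / len(xs)) / alpha_mle               # scale MLE: mean / shape
print(alpha_mom, alpha_mle, beta_mle)
```

Starting Newton from the MoM value is the "robust starting value" idea in miniature: the cheap estimator gets you into the right neighbourhood and the likelihood does the fine adjustment.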
Where MLE lives in practice, and where it bites
Three honest caveats follow the three big properties:
- Finite-sample bias. MLE is not generally unbiased. The Normal-variance MLE is biased low by a factor $(n-1)/n$; this is why textbooks define "the sample variance" with $n-1$. The Uniform MLE is biased low by a factor $n/(n+1)$. Many GLM and mixed-model MLEs carry finite-sample bias that can be substantial when the number of parameters is large relative to $n$ (for the variance components in random-effects models, REML is preferred for exactly this reason). Under standard asymptotics these biases shrink as $n$ grows, but small samples can be problematic.
- Model misspecification. If the model is wrong, the MLE still converges, but to the "least false" parameter that minimises the Kullback–Leibler divergence between the assumed model and the true distribution, not to anything you actually wanted. Worse, the asymptotic-normality variance formula breaks: the "sandwich" estimator $A^{-1} B A^{-1}$ (with $A = -\mathbb{E}[\partial^2 \log f / \partial \theta^2]$ and $B = \mathbb{E}[(\partial \log f / \partial \theta)^2]$) replaces $I(\theta)^{-1}$. Under correct specification $A = B$ and they cancel; under misspecification they do not. Robust standard errors (Huber, Eicker, White) implement the sandwich correction; they are crucial in practice and are revisited in §1.8 and §4.5.
- Optimisation pathology. Closed forms only exist for a few distributions. Most modern MLE problems require numerical optimisation (Newton-Raphson, BFGS, Fisher scoring, EM for missing-data and mixture problems). The log-likelihood can be non-concave (mixtures, neural networks) with multiple local optima, ridges where the curvature collapses, or boundaries where the optimisation stalls. Identifiability failures (the model is invariant to relabelling components in a mixture, for example) produce the famous "label-switching" pathology. Robust starting values — often from MoM — and multi-start optimisation help. There is no silver bullet.
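The misspecification caveat above can be made concrete with a one-parameter example. The sketch below deliberately fits an Exponential(λ) model to data that actually come from a Gamma distribution with shape 4 and scale 0.25 (mean 1, variance 0.25), then compares the model-based variance $1/(nA)$ with the sandwich variance $B/(nA^2)$. Here $A$ is the average negative second derivative of the per-observation log-density and $B$ is the average squared score, both evaluated at $\hat\lambda$. The notation, the data-generating choice, and the function name are assumptions for illustration.

```python
import random

def exp_fit_variances(xs):
    """Fit Exponential(lam) by MLE; return (lam_hat, model-based var, sandwich var)."""
    n = len(xs)
    lam_hat = n / sum(xs)
    # Per-observation log-density: log(lam) - lam * x
    #   score            = 1/lam - x
    #   second derivative = -1/lam^2
    A = 1.0 / lam_hat ** 2
    B = sum((1.0 / lam_hat - x) ** 2 for x in xs) / n
    model_var = 1.0 / (n * A)            # inverse Fisher information, trusts the model
    sandwich_var = B / (n * A * A)       # A^{-1} B A^{-1} / n
    return lam_hat, model_var, sandwich_var

rng = random.Random(4)
xs = [rng.gammavariate(4.0, 0.25) for _ in range(5000)]   # NOT exponential data
lam_hat, model_var, sandwich_var = exp_fit_variances(xs)
ratio = sandwich_var / model_var

print(lam_hat, model_var, sandwich_var, ratio)
```

In this setup the data are less dispersed than an exponential with the same mean, so the model-based variance overstates the uncertainty by roughly a factor of four. Feeding the function genuinely exponential data instead brings the ratio back to about 1, which is the $A = B$ cancellation.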
Despite the caveats, MLE is the modern default for parametric estimation in research statistics. It is the estimator the software gives you when you fit a linear regression (least squares is the MLE under iid Normal errors), a logistic regression (iteratively reweighted least squares is Fisher scoring on the binomial likelihood), or any GLM. The large-sample properties (consistency, asymptotic normality, asymptotic efficiency), together with invariance, combine to make MLE a one-size-fits-most recipe with a coherent theory of inference attached. §1.4 makes the efficiency claim precise; §3.3 gives you likelihood-ratio confidence intervals (often better-calibrated than Wald CIs based on $I(\hat\theta)^{-1}$); §4.7 gives you AIC and BIC (model-selection criteria built directly on the log-likelihood). MLE is not just one estimator; it is a framework.
Try it
- In the likelihood-landscape widget, pick Bernoulli(p) and a sample size $n$. Drag the slider for $p$ and watch $\ell(p)$. Where is the peak? Read it off the green MLE marker, then verify by computing $k/n$ from the empirical numbers reported. Now click Freeze new sample a few times. Does the peak move around the true $p$? That movement is the sampling distribution of $\hat p$.
- Same widget, switch to Exponential(λ) with a large $n$ and the true λ = 1. Slide θ from 0.1 up to 5 and watch ℓ(θ). The curve is unimodal and smooth. Now drop $n$ to 5. How much wider is the peak? (You should see the slider has to move much further before ℓ drops noticeably: small $n$ means little curvature, which §1.4 will identify as low Fisher information.)
- Switch to Uniform(0, θ). Slide θ from 0.05 up to 3 and look at the shape of ℓ(θ). What happens on the LEFT side of the MLE θ̂ = max(Xᵢ)? (Sharp cliff to −∞.) Try to maximise ℓ(θ) using only the slider — note that you cannot get above the green line because anywhere right of max(Xᵢ) is sub-optimal and anywhere left is impossible.
- In the MLE-vs-MoM-multi widget, set α = 2, β = 1, n = 30. Re-run replicates a couple of times. What is the empirical variance ratio Var(MoM) / Var(MLE)? Now bump α up to 6 (closer to Normal). Does the ratio shrink? (Yes — for high-α Gamma the distribution becomes near-Normal and MoM is nearly as good as MLE; for low α the MLE wins by more.)
- Same widget, set α = 0.7 (very skewed Gamma), β = 1, n = 30. Look at the empirical Var(MLE) and the asymptotic-CRLB value reported alongside it. Are they close? Now bump $n$ to 100. They should be closer. This is asymptotic efficiency landing on a finite-sample estimate.
- Pen-and-paper: derive the Bernoulli MLE. Show that the log-likelihood $\ell(p) = k \log p + (n-k) \log(1-p)$ has its maximum at $\hat p = k/n$ by setting the score to zero. Then apply invariance: write down the MLE of the log-odds $\log\{p/(1-p)\}$ without doing any new optimisation.
Pause and reflect: the likelihood is the joint density of the data viewed as a function of the parameter. It is not a probability distribution over $\theta$; frequentist statistics treats $\theta$ as fixed but unknown. Bayesian statistics (§7) treats $\theta$ as random and multiplies the likelihood by a prior $\pi(\theta)$ to get a posterior $\pi(\theta \mid X) \propto L(\theta; X)\,\pi(\theta)$. What does setting $\pi(\theta) \propto 1$ (the "uniform prior") do to the relationship between the posterior mode and the MLE? When does that uniform-prior choice make sense, and when does it depend on the parameterisation in a way the MLE itself does not?
What you now know
Maximum likelihood is your second great recipe for constructing an estimator: write down the joint density of the data as a function of $\theta$, take logs, find the $\theta$ that maximises the resulting log-likelihood. You have the closed-form MLEs for Bernoulli ($\hat p = \bar X$), Exponential ($\hat\lambda = 1/\bar X$), Normal ($\hat\mu = \bar X$ and the biased $\hat\sigma^2$ with $1/n$), and Uniform(0, θ) ($\hat\theta = \max_i X_i$, the boundary case where the derivative recipe breaks). You have the three big properties under regularity: consistency, asymptotic normality with inverse-Fisher-information variance, and invariance. And you have the three honest caveats (finite-sample bias, sensitivity to misspecification, optimisation pathology) and a sense of when MLE is the right default and when MoM, robust alternatives, or Bayesian methods do better.
§1.4 turns "asymptotically efficient" into a calculation by introducing the Cramér–Rao lower bound and Fisher information. You will be able to compute the lower bound on variance among unbiased estimators and verify that MLE asymptotically achieves it. §1.5 revisits bias-variance in full and shows when even MLE can be beaten by shrinkage — the James–Stein result you previewed in §1.1. After that Part 1 turns from constructing estimators to characterising them: §1.6 makes "standard error" precise, §1.7 introduces the bootstrap (a way to estimate the sampling distribution without leaning on asymptotic theory), §1.8 hardens the methods against outliers, and §1.9 makes the large-sample machinery rigorous.
References
- Fisher, R.A. (1922). "On the mathematical foundations of theoretical statistics." Philosophical Transactions of the Royal Society of London A 222, 309–368. (The foundational paper. Fisher introduces the likelihood, the MLE, and the concepts of consistency, efficiency, and sufficiency in one extraordinary essay.)
- Fisher, R.A. (1925). "Theory of statistical estimation." Proceedings of the Cambridge Philosophical Society 22(5), 700–725. (The follow-up that pins down asymptotic efficiency and introduces Fisher information explicitly.)
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (Chapter 9, "Parametric Inference"; the modern textbook presentation of MLE and its asymptotics.)
- Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 7, "Point Estimation"; a thorough and careful treatment with worked examples.)
- Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press. (A book-length argument for putting the likelihood at the centre of applied statistics; especially useful for the philosophical questions about what likelihood is and is not.)