Fisher information and the Cramér–Rao bound
Learning objectives
- Define the score U(θ) = ∂ℓ/∂θ as the gradient of the log-likelihood and recognise it as a random variable with mean zero at the true θ
- Define Fisher information two ways — as the variance of the score and as the negative expected Hessian — and connect both to the curvature of ℓ at the MLE
- State the Cramér–Rao lower bound Var(θ̂) ≥ 1/[n·I(θ)] for unbiased estimators and explain when it can be achieved
- Compute I(θ) and CRLB for Bernoulli, Exponential, and Normal (mean and variance) by hand
- Distinguish observed Fisher info -ℓ''(θ̂) from expected info I(θ), and use observed info to report standard errors in practice
- Recognise the regularity conditions and identify the Uniform(0, θ) violation; know the generalised bound for biased estimators
§1.3 closed with three big properties of the maximum likelihood estimator — consistency, asymptotic normality, and asymptotic efficiency — and a single unanswered question: efficient compared to what bound? §1.4 is the answer. Ronald Fisher's 1925 follow-up paper introduces an information quantity that turns out to be the natural unit for measuring how much a sample tells you about a parameter, and a lower bound — derived independently by Harald Cramér in 1946 and Calyampudi Rao in 1945, and now called the Cramér–Rao lower bound — that pins down the smallest variance any unbiased estimator can have. The MLE, under regularity, achieves this bound asymptotically. That is the precise content of "asymptotically efficient."
This section gives you the score function (the gradient of the log-likelihood, whose zero is the MLE), Fisher information defined two equivalent ways (variance of the score, or negative expected Hessian), the CRLB statement and proof sketch, worked examples for four workhorse single-parameter cases (Bernoulli, Exponential, Normal mean, Normal variance), the regularity conditions and the Uniform(0, θ) violation, the practical observed-versus-expected-information distinction every applied statistician uses for standard errors, the multivariate extension, and the generalised bound for biased estimators that §1.5 will use to set up the bias-variance trade-off properly.
The score function
You met the log-likelihood and its derivative in §1.3. Recall: for iid data X₁, …, Xₙ with density f(x; θ), the log-likelihood is ℓ(θ) = Σᵢ log f(Xᵢ; θ), and the MLE is the θ̂ that maximises it. Setting the derivative to zero gives the score equation. The score function is the derivative itself, viewed as a quantity in its own right:
U(θ) = ∂ℓ(θ)/∂θ = Σᵢ ∂/∂θ log f(Xᵢ; θ).
By construction, the MLE satisfies U(θ̂) = 0. The score crosses zero at the peak of the log-likelihood; that is just calculus.
What is new in §1.4 is to treat U(θ) as a random variable, the way §1.1 told you to treat estimators. The data X₁, …, Xₙ are random, so U(θ) is a function of random variables, so U(θ) has its own distribution (its own mean, its own variance) depending on the parameter θ and on the sample size n.
The first useful fact: under regularity (densities differentiable in θ, expectations and derivatives interchangeable; technical conditions made precise below), the expected value of the score at the true parameter is zero:
E_θ[U(θ)] = 0.
Sketch: ∫ f(x; θ) dx = 1 for every θ; differentiate under the integral sign and write ∂f/∂θ = f · ∂/∂θ log f, which says E_θ[∂/∂θ log f(X; θ)] = 0. Sum over n iid observations and the same identity holds for U(θ). The score has mean zero at the truth. This is sometimes called the first Bartlett identity.
Read that sentence twice. It says: if you knew the true θ and averaged the gradient of the log-likelihood across all possible samples, you would land on zero. The MLE, defined by the score being zero on the actual sample, is therefore the value of θ that makes the actual-sample gradient agree with the population gradient at the truth. Consistency of the MLE (§1.3) is exactly this argument made rigorous: as n → ∞ the empirical score (scaled by 1/n) converges to its expectation, so the root of the empirical score converges to the root of the population score, which is the truth.
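The mean-zero property can be checked directly. The sketch below assumes a Bernoulli(p) model; the helper names (`bernoulli_score`, `expected_score`, `simulated_mean_score`) and the values p = 0.3, n = 50 are illustrative choices, not part of the course widgets. It computes the expected per-observation score exactly by summing over the two outcomes, then repeats the check by simulation for the total score.

```python
import random

def bernoulli_score(x, p):
    """Per-observation score d/dp log f(x; p) for an outcome x in {0, 1}."""
    return x / p - (1 - x) / (1 - p)

def expected_score(p_true, p_eval):
    """Exact E[score evaluated at p_eval] when the data come from Bernoulli(p_true)."""
    return (p_true * bernoulli_score(1, p_eval)
            + (1 - p_true) * bernoulli_score(0, p_eval))

def simulated_mean_score(p_true, p_eval, n, reps, seed=0):
    """Monte Carlo average of the total score U(p_eval) over `reps` samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += sum(bernoulli_score(1 if rng.random() < p_true else 0, p_eval)
                     for _ in range(n))
    return total / reps

exact_at_truth = expected_score(0.3, 0.3)    # zero, up to rounding
exact_off_truth = expected_score(0.3, 0.5)   # not zero away from the truth
mc_at_truth = simulated_mean_score(0.3, 0.3, n=50, reps=2000)

print(exact_at_truth, exact_off_truth, mc_at_truth)
```

Evaluating the score away from the true parameter gives a non-zero mean, which is why the identity has to be stated "at the true θ".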
Fisher information: the variance of the score
The score has mean zero — fine. What is its variance?
For a single observation, define
I(θ) = Var_θ[∂/∂θ log f(X; θ)] = E_θ[(∂/∂θ log f(X; θ))²].
(The last equality uses that the mean of the per-observation score is zero, so variance equals expected square.) This is the Fisher information per observation. For n iid observations the total Fisher information is
Var_θ[U(θ)] = n·I(θ),
because the score is a sum of n iid mean-zero terms whose variances add.
Fisher information has units of 1/θ² and grows linearly in n. Intuition: each new data point adds the same expected amount of "information" about θ. Doubling the sample doubles the information. This is the basic asymptotic-statistics fact behind the √n rates of convergence seen throughout regular models: variance scales like 1/n, so standard error scales like 1/√n.
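The additivity claim can be illustrated by simulation. This is a minimal sketch assuming an Exponential model with rate λ = 2, for which the per-observation score is 1/λ − x and I(λ) = 1/λ²; the function names, seed, and replicate count are arbitrary choices made for the illustration.

```python
import random
import statistics

def total_score_exponential(sample, lam):
    """Total score U(lam) = sum_i (1/lam - x_i) for an Exponential(rate=lam) sample."""
    return sum(1.0 / lam - x for x in sample)

def score_variance(lam, n, reps, seed=1):
    """Empirical variance of the total score at the true rate, across `reps` samples."""
    rng = random.Random(seed)
    scores = [
        total_score_exponential([rng.expovariate(lam) for _ in range(n)], lam)
        for _ in range(reps)
    ]
    return statistics.variance(scores)

lam = 2.0
# Map n -> (empirical Var[U], theoretical n * I(lam) = n / lam^2)
results = {n: (score_variance(lam, n, reps=4000), n / lam**2) for n in (10, 20, 40)}

for n, (empirical, theoretical) in results.items():
    print(n, round(empirical, 3), theoretical)
```

Doubling n should roughly double the empirical variance of the score, up to Monte Carlo noise.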
The other formula: negative expected Hessian
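There is a second formula for I(θ) that is often easier to compute and is the one numerical software actually uses:
I(θ) = −E_θ[∂²/∂θ² log f(X; θ)].
Same quantity, totally different-looking expression. Sketch: differentiate the identity ∫ (∂/∂θ log f) f dx = 0 once more with respect to θ. After the product rule and a bit of bookkeeping (the second Bartlett identity) you get
E_θ[∂²/∂θ² log f(X; θ)] + E_θ[(∂/∂θ log f(X; θ))²] = 0,
which gives I(θ) = E_θ[(∂/∂θ log f)²] = −E_θ[∂²/∂θ² log f]. Two equivalent characterisations.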
The negative-expected-Hessian form has a clean geometric reading. The Hessian at the MLE measures the curvature of the log-likelihood at its peak. Sharp peak ⇒ large negative second derivative ⇒ large Fisher info ⇒ small variance for the MLE. Flat peak ⇒ small Fisher info ⇒ wide variance ⇒ data tells you little. The widget below makes this exact statement visible.
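The two characterisations can be compared numerically without doing any calculus by hand. The sketch below assumes a Bernoulli(p) model and uses central finite differences for the derivatives of the log-density; the step sizes and the helper names are illustrative assumptions.

```python
import math

def log_f(x, p):
    """Bernoulli log-density for x in {0, 1}."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)

def d1(x, p, h=1e-5):
    """Central-difference approximation to d/dp log f(x; p)."""
    return (log_f(x, p + h) - log_f(x, p - h)) / (2 * h)

def d2(x, p, h=1e-4):
    """Central-difference approximation to d^2/dp^2 log f(x; p)."""
    return (log_f(x, p + h) - 2 * log_f(x, p) + log_f(x, p - h)) / h**2

def info_both_ways(p):
    """Return (E[score^2], -E[second derivative]) by exact summation over x in {0, 1}."""
    probs = {1: p, 0: 1 - p}
    var_form = sum(w * d1(x, p) ** 2 for x, w in probs.items())
    hess_form = -sum(w * d2(x, p) for x, w in probs.items())
    return var_form, hess_form

var_form, hess_form = info_both_ways(0.3)
closed_form = 1 / (0.3 * 0.7)   # 1 / (p (1 - p))

print(var_form, hess_form, closed_form)
```

Both numerical values should agree with 1/(p(1 − p)) up to finite-difference error.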
Curvature is information: scrub the score and the Hessian
The widget below draws three stacked plots for a one-parameter family of your choice (Bernoulli, Exponential, or Normal mean with σ = 1). All three plots share the same horizontal axis (θ); each shows a different function of θ computed on a single fixed sample. TOP: the log-likelihood ℓ(θ). MIDDLE: the score U(θ) = ℓ'(θ). BOTTOM: the negative second derivative −ℓ''(θ).
Five things to verify:
- The score (middle) crosses zero exactly at the MLE (green vertical line in all three plots). That is the definition of the MLE: the value of θ where the gradient vanishes.
- The negative second derivative (bottom plot) at the MLE θ̂ is the observed Fisher information. The status panel reports it as "−ℓ''(θ̂)" — and reports its reciprocal as the estimated asymptotic variance of θ̂.
- The status panel also reports the theoretical Fisher information per observation: I(p) = 1/(p(1 − p)) for Bernoulli, I(λ) = 1/λ² for Exponential, I(μ) = 1 for the Normal mean with σ = 1 fixed. Compare this to −ℓ''(θ̂)/n; they agree up to sample noise.
- The CRLB = 1/(n·I(θ)) is the variance the MLE achieves asymptotically. The status panel reports it explicitly. For Bernoulli p(1 − p)/n, for Exponential λ²/n, for the Normal mean 1/n.
- The peak gets sharper as n grows from 5 to 300. Curvature scales linearly in n; the bottom plot rises linearly with n; the CRLB drops as 1/n. Drag the n slider and watch.
The widget makes the equivalence concrete: Fisher information is curvature, the CRLB is one-over-curvature, asymptotic variance is the CRLB.
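The three curves are easy to reproduce offline. The following sketch assumes the Exponential family with true rate λ = 2 and evaluates the log-likelihood, score, and negative second derivative in closed form on one simulated sample; the function names, seed, and sample sizes are assumptions for illustration and do not describe how the widget itself is implemented.

```python
import math
import random

def exp_curves(sample, lam):
    """Return (log-likelihood, score, negative second derivative) at rate lam."""
    n, s = len(sample), sum(sample)
    loglik = n * math.log(lam) - lam * s
    score = n / lam - s
    neg_hess = n / lam**2
    return loglik, score, neg_hess

def summarise(n, true_lam=2.0, seed=3):
    """Simulate one sample of size n and report the quantities a status panel would show."""
    rng = random.Random(seed)
    sample = [rng.expovariate(true_lam) for _ in range(n)]
    mle = n / sum(sample)                      # closed-form MLE: 1 / sample mean
    _, score_at_mle, obs_info = exp_curves(sample, mle)
    return {
        "n": n,
        "mle": mle,
        "score_at_mle": score_at_mle,          # zero up to rounding
        "obs_info": obs_info,                  # -l''(theta_hat)
        "est_var": 1 / obs_info,               # estimated asymptotic variance
        "expected_info_at_truth": n / true_lam**2,
    }

rows = [summarise(n) for n in (5, 50, 300)]
for row in rows:
    print(row)
```

As n grows the observed information rises roughly in proportion to n and the estimated variance falls accordingly.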
Four worked examples
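Bernoulli(p). Single-observation log-density log f(x; p) = x log p + (1 − x) log(1 − p). First derivative x/p − (1 − x)/(1 − p); second derivative −x/p² − (1 − x)/(1 − p)². Taking the negative expectation under p (so E[X] = p):
I(p) = p/p² + (1 − p)/(1 − p)² = 1/p + 1/(1 − p) = 1/(p(1 − p)).
Total info n/(p(1 − p)) and CRLB = p(1 − p)/n. The MLE p̂ = X̄ has variance p(1 − p)/n exactly, at every n: it achieves the CRLB exactly, not just asymptotically. Bernoulli is one of the rare cases where the bound is hit for every sample size.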
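Exponential(λ). Single-observation log-density log f(x; λ) = log λ − λx. Second derivative −1/λ² (deterministic, no x dependence). Negative expectation:
I(λ) = 1/λ².
Total info n/λ² and CRLB = λ²/n. The MLE λ̂ = 1/X̄ is biased but asymptotically efficient: n·Var(λ̂) → λ² as n → ∞.
Normal mean (σ² known). Single-observation log-density log f(x; μ) = −½ log(2πσ²) − (x − μ)²/(2σ²). Second derivative in μ: −1/σ². Negative expectation:
I(μ) = 1/σ².
CRLB = σ²/n. The MLE μ̂ = X̄ has variance σ²/n exactly, so it achieves the bound for every n, just like Bernoulli.
Normal variance (μ known). Let v = σ². Log-density log f(x; v) = −½ log(2πv) − (x − μ)²/(2v). Differentiate twice in v: first derivative −1/(2v) + (x − μ)²/(2v²); second derivative 1/(2v²) − (x − μ)²/v³. Negative expectation using E[(X − μ)²] = v:
I(v) = −1/(2v²) + v/v³ = 1/(2v²) = 1/(2σ⁴).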
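CRLB = 2σ⁴/n. With μ known, the MLE σ̂² = (1/n) Σᵢ (Xᵢ − μ)² is unbiased and has variance 2σ⁴/n exactly, so it attains the bound at every n. When μ is unknown and replaced by X̄, the MLE (1/n) Σᵢ (Xᵢ − X̄)² is biased (factor (n − 1)/n relative to the unbiased sample variance) and has variance 2(n − 1)σ⁴/n², so n·Var(σ̂²) → 2σ⁴. Asymptotically efficient.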
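The closed forms for the four cases fit in a small lookup table, and any one of them can be spot-checked by simulation. The sketch below is an illustration under assumed values (μ = 1, σ² = 4, n = 25); the dictionary `CRLB`, the function name, and the seed are hypothetical. It simulates the known-mean variance estimator (1/n) Σ (xᵢ − μ)² and compares its empirical mean and variance with σ² and 2σ⁴/n.

```python
import random
import statistics

# Cramér-Rao bounds for the four worked cases; the first argument is the
# parameter named in each key's comment, the second is the sample size n.
CRLB = {
    "bernoulli": lambda p, n: p * (1 - p) / n,                 # parameter p
    "exponential": lambda lam, n: lam**2 / n,                  # rate lambda
    "normal_mean": lambda sigma2, n: sigma2 / n,               # known variance sigma^2
    "normal_variance": lambda sigma2, n: 2 * sigma2**2 / n,    # variance sigma^2, mean known
}

def simulate_known_mean_variance_mle(mu, sigma2, n, reps, seed=4):
    """Mean and variance, across replicates, of (1/n) * sum (x_i - mu)^2."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    estimates = [
        sum((rng.gauss(mu, sd) - mu) ** 2 for _ in range(n)) / n
        for _ in range(reps)
    ]
    return statistics.fmean(estimates), statistics.variance(estimates)

mean_est, var_est = simulate_known_mean_variance_mle(mu=1.0, sigma2=4.0, n=25, reps=4000)
bound = CRLB["normal_variance"](4.0, 25)   # 2 * 16 / 25

print(mean_est, var_est, bound)
```

The empirical mean should sit near σ² = 4 and the empirical variance near the bound, up to Monte Carlo noise.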
The Cramér–Rao lower bound
The statement, for iid data and a single parameter, with regularity assumed:
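For any unbiased estimator θ̂ of θ,
Var_θ(θ̂) ≥ 1 / (n·I(θ)).
Two-line sketch of the proof (this is the cleanest derivation in classical statistics). Let θ̂ be unbiased: E_θ[θ̂] = θ. Differentiate both sides in θ, moving the derivative inside the integral against the joint density:
E_θ[θ̂ · U(θ)] = 1.
(The score U has mean zero, so this is also the covariance Cov_θ(θ̂, U).) Now apply Cauchy–Schwarz: Cov(θ̂, U)² ≤ Var(θ̂) · Var(U), i.e. 1 ≤ Var(θ̂) · n·I(θ), so Var(θ̂) ≥ 1/(n·I(θ)). Done.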
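Each ingredient of the proof can be computed exactly in a small case. This sketch assumes n Bernoulli(p) trials, so the number of successes K is Binomial(n, p), the estimator is p̂ = K/n, and the total score is U = K/p − (n − K)/(1 − p); the function names and the values n = 12, p = 0.3 are illustrative. It sums over all outcomes to obtain Cov(p̂, U), Var(p̂), and Var(U).

```python
import math

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def crlb_ingredients(n, p):
    """Exact Cov(p_hat, U), Var(p_hat), Var(U) by summing over k = 0..n."""
    e_hat = e_u = e_hat_u = e_hat2 = e_u2 = 0.0
    for k in range(n + 1):
        w = binomial_pmf(k, n, p)
        p_hat = k / n
        u = k / p - (n - k) / (1 - p)   # total score at the true p
        e_hat += w * p_hat
        e_u += w * u
        e_hat_u += w * p_hat * u
        e_hat2 += w * p_hat**2
        e_u2 += w * u**2
    cov = e_hat_u - e_hat * e_u
    var_hat = e_hat2 - e_hat**2
    var_u = e_u2 - e_u**2
    return cov, var_hat, var_u

cov, var_hat, var_u = crlb_ingredients(n=12, p=0.3)

print("Cov(p_hat, U) =", cov)               # the differentiated unbiasedness identity
print("Var(p_hat) * Var(U) =", var_hat * var_u)
print("CRLB 1 / Var(U) =", 1 / var_u, "vs Var(p_hat) =", var_hat)
```

Because p̂ is a linear function of the score in this model, Cauchy–Schwarz holds with equality, which is why the Bernoulli MLE attains the bound exactly.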
The CRLB sits at the foundation of every standard-error calculation. Three things to internalise:
- It is a lower bound. An unbiased estimator can match it or exceed it; it cannot beat it. If you compute the variance of your MLE and find it equals 1/(n·I(θ)), you have hit the bound. If you find it higher, you are leaving information on the table.
- The bound is achieved exactly for some estimators (Bernoulli p̂ = X̄, Normal mean with σ known). For most others, including most MLEs, the bound is achieved only asymptotically: n·Var(θ̂) → 1/I(θ). That is the precise statement of "asymptotic efficiency" §1.3 referenced.
- The bound is only for unbiased estimators. Biased estimators can, and routinely do, have variance below the CRLB. The Normal-variance MLE with μ estimated by X̄ is biased low and has variance 2(n − 1)σ⁴/n², below the CRLB-for-unbiased 2σ⁴/n; the James–Stein shrinkage estimator (which §1.5 will revisit) is biased and beats the unbiased estimator in MSE. Trading bias for variance can pay off, and the next section is exactly about that trade-off.
Observed versus expected Fisher information
In theory you compute n·I(θ): an expectation over X, evaluated at a specific θ. Two things make this awkward in practice:
- You do not know the true θ. You have only an estimate θ̂.
- The expectation involves an integral that may have no closed form even when the per-observation Hessian does.
Both problems are solved by the observed Fisher information: just take the negative Hessian of the actual log-likelihood at the actual MLE, on the actual sample, without taking expectations.
For the Normal mean and many other families the observed and expected information agree algebraically (the expectation just collapses because the Hessian is non-random in X). For Bernoulli the Hessian depends on the data, so the two differ at a generic θ, although they coincide when both are evaluated at the MLE; for models such as the Cauchy location family they differ even there. Both −ℓ''(θ̂)/n and I(θ̂) are consistent estimates of I(θ); both produce asymptotically valid Wald standard errors, SE(θ̂) = 1/√(−ℓ''(θ̂)) or 1/√(n·I(θ̂)). In modern practice the observed form is often preferred: it is what a Newton–Raphson optimiser computes anyway (it is the Hessian from the final iteration), and Efron and Hinkley (1978) argued that the observed information generally gives a more relevant variance estimate than the expected version when the Hessian is itself random.
This is what your statistical software does. When you call glm() in R, statsmodels in Python, or any GLM/MLE fitter, the "Std. Error" column in the output is SE(θ̂ⱼ) = √([(−∇²ℓ(θ̂))⁻¹]ⱼⱼ): observed Fisher information at the fitted MLE, inverted, the j-th diagonal element square-rooted. (Fitters built on Fisher scoring, such as iteratively reweighted least squares for GLMs, use the expected information instead; the two coincide for canonical links.) The Wald 95% CI in the output is θ̂ⱼ ± 1.96·SE(θ̂ⱼ). That whole pipeline is just §1.4 implemented automatically. (§3.1–§3.3 will revisit when the Wald approximation can be poor and what to do then.)
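The pipeline from Hessian to standard error can be sketched in a few lines. The example below is an illustration only, assuming an Exponential model and a small made-up sample; it runs Newton–Raphson on the log-likelihood, reads off the observed information at convergence, and forms a Wald interval. It does not reproduce the internals of glm() or statsmodels, and the function name and starting value are assumptions.

```python
import math

def newton_mle_exponential(sample, start=1.0, tol=1e-12, max_iter=50):
    """Newton-Raphson for the Exponential rate; returns (MLE, observed information)."""
    n, s = len(sample), sum(sample)
    lam = start
    for _ in range(max_iter):
        score = n / lam - s          # l'(lam)
        hessian = -n / lam**2        # l''(lam)
        step = score / hessian
        lam -= step                  # Newton update: lam - l'(lam) / l''(lam)
        if abs(step) < tol:
            break
    observed_info = n / lam**2       # -l''(lam_hat)
    return lam, observed_info

# Hypothetical data, chosen only to make the example deterministic.
sample = [0.2, 0.5, 1.1, 0.3, 0.9, 0.4, 0.7, 0.1]
lam_hat, obs_info = newton_mle_exponential(sample)
se = 1 / math.sqrt(obs_info)
wald_ci = (lam_hat - 1.96 * se, lam_hat + 1.96 * se)

print("MLE:", lam_hat, "closed form:", len(sample) / sum(sample))
print("SE:", se, "Wald 95% CI:", wald_ci)
```

For this model the Newton iterate converges to the closed-form MLE 1/x̄ and the standard error reduces to λ̂/√n. Note that this particular iteration needs a starting value below 2/x̄ to converge.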
CRLB on simulated MLEs
The first widget showed you Fisher info as curvature on a single sample. The second widget shows the bound being approached in distribution. We simulate 1500 replicates of size n, compute the MLE θ̂ on each, histogram the resulting values, and overlay the asymptotic Gaussian N(θ, 1/(n·I(θ))), the distribution the MLE should converge to.
Three things to do:
- Start with Bernoulli or Normal-mean. Even at moderate n the empirical histogram already hugs the orange Gaussian almost exactly. These are the exact-efficiency cases.
- Switch to Exponential and start at a small n. The empirical histogram is visibly shifted RIGHT of the truth: the MLE λ̂ = 1/X̄ is biased high for small samples (since E[X̄] = 1/λ and Jensen pushes E[1/X̄] above 1/E[X̄] = λ). Push n to 100 or 400 and the histogram centres on the truth and matches the Gaussian. That is consistency + asymptotic efficiency in action.
- Switch to Uniform(0, θ). The empirical histogram is sharply LEFT-SKEWED and concentrated AT or BELOW the truth (since θ̂ = max Xᵢ ≤ θ always). The Normal overlay does NOT match: the limiting distribution of n(θ − θ̂) is Exponential, and the rate of convergence is 1/n, not 1/√n. This is the regularity violation we will discuss next.
The status panel reports the empirical variance of θ̂ across replicates, the theoretical CRLB, and their ratio. Under regularity the ratio should be near 1 for moderate n. For the Uniform case the ratio does not settle near 1, and the comparison is meaningless because the CRLB derivation assumed regularity.
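The same experiment can be run outside the widget. This is a rough sketch assuming the Exponential model with true rate λ = 2 and 1500 replicates; the function names, seed, and the two sample sizes are illustrative choices. It reports the empirical mean of λ̂ = 1/x̄ and the ratio of empirical variance to the CRLB λ²/n.

```python
import random
import statistics

def exponential_mle_replicates(lam, n, reps=1500, seed=5):
    """MLE 1 / (sample mean) computed on `reps` independent samples of size n."""
    rng = random.Random(seed)
    return [n / sum(rng.expovariate(lam) for _ in range(n)) for _ in range(reps)]

def summarise(lam, n):
    """Empirical mean and variance of the MLE, plus the variance / CRLB ratio."""
    estimates = exponential_mle_replicates(lam, n)
    crlb = lam**2 / n
    variance = statistics.variance(estimates)
    return {
        "n": n,
        "mean": statistics.fmean(estimates),
        "variance": variance,
        "crlb": crlb,
        "ratio": variance / crlb,
    }

small = summarise(lam=2.0, n=10)
large = summarise(lam=2.0, n=200)

print(small)   # mean above 2 (biased high), ratio noticeably above 1
print(large)   # mean close to 2, ratio close to 1
```

At small n the estimator is both biased high and more variable than the bound; both effects fade as n grows.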
Regularity conditions, and the Uniform pathology
Every theorem in this section — score has mean zero, the two formulas for Fisher info agree, CRLB holds, MLE is asymptotically normal — depends on a list of regularity conditions. The standard short version (Wasserman 2004 §9.4, Casella–Berger §10.6):
- The parameter space is an open set.
- The density f(x; θ) is differentiable in θ for almost every x.
- The support of f(x; θ) does not depend on θ.
- You can interchange ∂/∂θ with ∫ dx (dominated convergence applies).
- The Fisher information exists and is finite.
For Bernoulli, Exponential, Normal, Gamma, Poisson, Binomial (all the standard families) these hold. For Uniform(0, θ), the third condition fails: the support (0, θ) depends on θ. As a consequence:
- The score ∂/∂θ log f(x; θ) = −1/θ on the support and is undefined at the boundary; it does NOT have mean zero in any useful sense.
- The negative second derivative is −1/θ², which is not even positive. Interchanging differentiation and integration fails at the boundary x = θ, so this is not the same as the variance of the (non-existent) score.
- The CRLB derivation (the Cauchy–Schwarz step) requires the score-mean-zero property and breaks here.
- The MLE θ̂ = max Xᵢ still exists and is consistent, but its asymptotic distribution is Exponential, not Normal, and it converges at rate 1/n. In particular Var(θ̂) is of order θ²/n², which beats any 1/n scaling, and with it the "naive" CRLB. The MLE is "super-efficient" but irregularly so.
The widget makes the regularity violation visible: the asymptotic Normal overlay simply does not fit the empirical histogram for Uniform. The whole CRLB/asymptotic-Normal apparatus presumes regularity. When the support depends on the parameter — Uniform, truncated distributions, shifted Exponential — you need different tools (extreme-value asymptotics for boundary cases, profile likelihood for general boundary-supported problems). §1.9 makes this comparison precise.
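The Uniform case can be checked against exact formulas. The sketch below assumes θ = 1 and n = 50, and uses the standard results E[max Xᵢ] = nθ/(n + 1) and Var(max Xᵢ) = nθ²/((n + 1)²(n + 2)); the variable `naive_bound` is a hypothetical label for the θ²/n value that a 1/n-type bound would suggest, and the function name and seed are arbitrary.

```python
import random
import statistics

def uniform_mle_replicates(theta, n, reps=2000, seed=6):
    """MLE max(x_i) on `reps` independent Uniform(0, theta) samples of size n."""
    rng = random.Random(seed)
    return [theta * max(rng.random() for _ in range(n)) for _ in range(reps)]

theta, n = 1.0, 50
estimates = uniform_mle_replicates(theta, n)

emp_mean = statistics.fmean(estimates)
emp_var = statistics.variance(estimates)
exact_mean = n * theta / (n + 1)
exact_var = n * theta**2 / ((n + 1) ** 2 * (n + 2))
naive_bound = theta**2 / n          # 1/n-type scaling, for comparison only

# n * (theta - theta_hat) is approximately Exponential with mean theta for large n.
scaled_gap = statistics.fmean(n * (theta - e) for e in estimates)

print(emp_mean, exact_mean)
print(emp_var, exact_var, naive_bound)
print(scaled_gap)
```

Every replicate lies below θ, and the variance is far smaller than anything of order 1/n, which is the 1/n² behaviour described above.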
Generalised CRLB and the multivariate version
Two refinements you should know about, even if §1.4 does not derive them in detail.
Biased estimators. The CRLB statement above is for unbiased θ̂. If θ̂ is biased with bias function b(θ) = E_θ[θ̂] − θ, the same Cauchy–Schwarz argument gives
Var_θ(θ̂) ≥ [1 + b'(θ)]² / (n·I(θ)).
The numerator is the squared derivative of E_θ[θ̂] = θ + b(θ) with respect to θ. For unbiased θ̂, b'(θ) = 0 and we recover 1/(n·I(θ)). For shrinkage estimators (Stein, ridge) the bias derivative b'(θ) is negative and can drop the numerator dramatically, so the bound on variance shrinks, and the actual variance shrinks with it, often enough to beat unbiased estimators in MSE. §1.5 returns to this with explicit examples.
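A simple case makes the generalised bound concrete. Assume a Normal mean with known σ² and the shrinkage estimator c·X̄ with a fixed constant 0 < c < 1 (a toy example chosen for this illustration, not one taken from §1.5). Its bias is b(μ) = (c − 1)μ, so b'(μ) = c − 1, and its variance is c²σ²/n. The function name and the numbers below are assumptions.

```python
def shrinkage_summary(c, mu, sigma2, n):
    """Generalised CRLB, variance, and MSE for the estimator c * sample_mean."""
    bias_derivative = c - 1.0                  # b'(mu) for b(mu) = (c - 1) * mu
    total_info = n / sigma2                    # n * I(mu) with I(mu) = 1 / sigma^2
    generalised_bound = (1.0 + bias_derivative) ** 2 / total_info
    variance = c**2 * sigma2 / n               # Var(c * X_bar)
    bias = (c - 1.0) * mu
    return {
        "generalised_bound": generalised_bound,
        "variance": variance,
        "mse": variance + bias**2,
        "unbiased_crlb": 1.0 / total_info,     # bound (and MSE) of X_bar itself
    }

summary = shrinkage_summary(c=0.8, mu=1.0, sigma2=4.0, n=10)
print(summary)
```

Here the shrinkage estimator meets the generalised bound exactly, has variance below the unbiased CRLB, and for these particular values also has smaller MSE than X̄. The MSE advantage depends on μ: for large |μ| the squared bias dominates.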
Multivariate. For a parameter vector θ = (θ₁, …, θ_d), the Fisher information becomes a d × d matrix with entries
[I(θ)]ⱼₖ = E_θ[(∂/∂θⱼ log f(X; θ)) · (∂/∂θₖ log f(X; θ))] = −E_θ[∂²/∂θⱼ∂θₖ log f(X; θ)].
The multivariate CRLB says that for any unbiased estimator θ̂ of θ,
Cov_θ(θ̂) ⪰ [n·I(θ)]⁻¹,
where ⪰ means "is greater than or equal to in the positive-semidefinite ordering": Cov_θ(θ̂) − [n·I(θ)]⁻¹ is positive semidefinite. In particular every diagonal entry of Cov_θ(θ̂) (every individual coordinate's variance) is bounded below by the corresponding diagonal entry of [n·I(θ)]⁻¹, AND every linear combination aᵀθ̂ has variance bounded below by aᵀ[n·I(θ)]⁻¹a.
The asymptotic distribution of the multivariate MLE is multivariate Normal:
√n (θ̂ − θ) → N(0, I(θ)⁻¹) in distribution.
This is the result that powers every default standard-error and confidence-interval column in regression and GLM output. The reported covariance matrix of the coefficient estimates is [n·I(θ̂)]⁻¹, or its observed-information counterpart [−∇²ℓ(θ̂)]⁻¹. Standard errors are diagonal entries' square roots. 95% Wald CIs are coefficient ± 1.96·SE. Even the joint hypothesis tests (likelihood-ratio, Wald, score) you will meet in Part 2 are built directly on Fisher information of the MLE.
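A two-parameter example shows the matrix version at work. The sketch assumes a Normal model with both μ and σ² unknown, for which the per-observation information matrix is diagonal with entries 1/σ² and 1/(2σ⁴); the sample values and function names are made up for illustration. It plugs the MLE into the total information, inverts the 2 × 2 matrix, and reads off standard errors.

```python
import math

def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def normal_mle_with_se(sample):
    """MLE of (mu, sigma^2), plug-in covariance [n I(theta_hat)]^{-1}, and standard errors."""
    n = len(sample)
    mu_hat = sum(sample) / n
    var_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    total_info = [[n / var_hat, 0.0],
                  [0.0, n / (2 * var_hat**2)]]
    cov = invert_2x2(total_info)
    se = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
    return (mu_hat, var_hat), cov, se

def linear_combination_variance(a, cov):
    """a^T cov a for a length-2 vector a."""
    return sum(a[j] * cov[j][k] * a[k] for j in range(2) for k in range(2))

# Hypothetical measurements, used only to make the example deterministic.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4, 5.0, 5.0]
(mu_hat, var_hat), cov, se = normal_mle_with_se(sample)
wald_mu = (mu_hat - 1.96 * se[0], mu_hat + 1.96 * se[0])

print("MLE:", mu_hat, var_hat)
print("SEs:", se)
print("Wald 95% CI for mu:", wald_mu)
print("Var bound for mu + sigma^2:", linear_combination_variance([1.0, 1.0], cov))
```

Because the information matrix is diagonal here, the standard errors reduce to √(σ̂²/n) and σ̂²·√(2/n); in models with correlated parameters the off-diagonal entries matter and the full inverse is needed.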
Try it
- In the score-fisher explorer, pick Bernoulli and choose a sample size n. Slide the parameter from 0.05 up to 0.95. Where does the score cross zero? Verify that it matches the MLE marker (green vertical line) and that this MLE equals the sample mean x̄ on the displayed sample. Now click "Freeze new sample" five times and watch the green line wander. That is the sampling distribution of p̂; its theoretical variance is exactly p(1 − p)/n.
- Same widget, switch to Exponential. Slide n from 5 up to 300. What happens to −ℓ''(θ̂)? (It scales LINEARLY with n.) What does that imply about the CRLB? (Linear in 1/n: variance shrinks like 1/n.) Read the status panel's "n · I(true θ)" and "CRLB" fields and verify they track that scaling.
- Same widget, switch to Normal mean with σ = 1. The bottom plot is a HORIZONTAL LINE at n/σ² = n. Why? (Because for fixed σ, the second derivative is −n/σ², independent of both μ and the data.) That is unusual: most distributions have an information that depends on the parameter. Bernoulli (try it again) does: I(p) = 1/(p(1 − p)) is highest near 0 and 1.
- In the CRLB vs empirical widget, pick Exponential and set a small n. Note the empirical mean of λ̂ in the status panel: it is biased high. Now push n to 100, then 400. The bias shrinks (consistency); the variance ratio (empirical / CRLB) tightens toward 1 (asymptotic efficiency).
- Same widget, switch to Uniform(0, 1). Note that the histogram is concentrated below 1 (the truth), and the Normal overlay does not fit. Compute the empirical mean of θ̂ from the status panel and verify it is approximately n/(n + 1) (the Uniform MLE is biased low by 1/(n + 1)). Push n to 200: the histogram contracts MUCH faster than the Normal overlay does, because the true rate of convergence is 1/n, not 1/√n.
- Pen-and-paper: derive I(p) = 1/(p(1 − p)) for the Bernoulli. Then write down the CRLB. Then verify that Var(X̄) equals the CRLB exactly. Now do the same for Exponential: derive I(λ) = 1/λ², CRLB = λ²/n, and show that the asymptotic variance of λ̂ = 1/X̄ is λ²/n using the delta method.
- Pen-and-paper, harder: compute the Fisher information for the Poisson distribution Poisson(λ). Show that I(λ) = 1/λ. Verify that the MLE λ̂ = X̄ achieves the CRLB exactly (variance λ/n). This is a third exact-efficiency case alongside Bernoulli and Normal-mean.
Pause and reflect: Fisher information measures how much a sample tells you about a parameter. Two different parameterisations of the same model produce two different Fisher informations: for the Bernoulli, I(p) = 1/(p(1 − p)); for the same model parameterised by log-odds η = log(p/(1 − p)), what is I(η)? (Hint: use the chain rule. I(η) = I(p) · (dp/dη)².) Why does this matter when you build confidence intervals on p versus on η?
What you now know
The score function is the gradient of the log-likelihood; it has mean zero at the true parameter, crosses zero at the MLE, and its variance is the Fisher information n·I(θ). Fisher information per observation can be computed either way, I(θ) = E_θ[(∂/∂θ log f)²] = −E_θ[∂²/∂θ² log f] (the second Bartlett identity). High Fisher info means a sharply curved log-likelihood at its peak; the Cramér–Rao bound Var(θ̂) ≥ 1/(n·I(θ)) holds for any unbiased estimator, and the MLE achieves this bound asymptotically (and sometimes exactly, as in Bernoulli, Poisson, and Normal-mean cases).
You know the four worked examples in closed form, you have seen the regularity conditions and the Uniform pathology where they fail, you understand the practical observed-vs-expected-information distinction every statistical-software package implements, and you have seen the multivariate extension that underlies every standard-error column in every regression output. The CRLB and Fisher information are not optional theory — they are the substrate the rest of inferential statistics is built on. §1.5 turns the bias-variance trade-off into its own section and makes the case for trading some bias for a lot of variance (the shrinkage estimators §1.1 previewed). §1.6 makes "standard error" precise in the language of sampling distributions. §3 builds confidence intervals on top of all of this; §4.5 and §5 build regression and GLM inference; Part 2 builds hypothesis testing. Everything downstream uses what you just learned.
References
- Fisher, R.A. (1925). "Theory of statistical estimation." Proceedings of the Cambridge Philosophical Society 22(5), 700–725. (Fisher introduces information and pins down asymptotic efficiency for the MLE.)
- Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. (The form of the bound that bears Cramér's name; the textbook that consolidated the early classical estimation theory.)
- Rao, C.R. (1945). "Information and the accuracy attainable in the estimation of statistical parameters." Bulletin of the Calcutta Mathematical Society 37, 81–91. (Rao's independent derivation of the bound, reaching the same conclusion via a slightly different route.)
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (§9.4–9.7 cover Fisher info, CRLB, and asymptotic normality of the MLE in a compact modern presentation.)
- Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 7 on point estimation, with careful proofs of Bartlett identities and CRLB; ideal companion to this section.)
- Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. (The deeper reference; Chapter 2 develops information and efficiency from first principles, including the multivariate matrix CRLB.)
- Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Classic. Especially good on the observed-versus-expected information distinction and on regularity-condition failures.)