Fisher information and the Cramér–Rao bound

Part 1 — Estimation

Learning objectives

  • Define the score U(θ) = ∂ℓ/∂θ as the gradient of the log-likelihood and recognise it as a random variable with mean zero at the true θ
  • Define Fisher information two ways — as the variance of the score and as the negative expected Hessian — and connect both to the curvature of ℓ at the MLE
  • State the Cramér–Rao lower bound Var(θ̂) ≥ 1/[n·I(θ)] for unbiased estimators and explain when it can be achieved
  • Compute I(θ) and CRLB for Bernoulli, Exponential, and Normal (mean and variance) by hand
  • Distinguish observed Fisher info -ℓ''(θ̂) from expected info I(θ), and use observed info to report standard errors in practice
  • Recognise the regularity conditions and identify the Uniform(0, θ) violation; know the generalised bound for biased estimators

§1.3 closed with three big properties of the maximum likelihood estimator — consistency, asymptotic normality, and asymptotic efficiency — and a single unanswered question: efficient compared to what bound? §1.4 is the answer. Ronald Fisher's 1925 follow-up paper introduces an information quantity that turns out to be the natural unit for measuring how much a sample tells you about a parameter, and a lower bound — derived independently by Harald Cramér in 1946 and Calyampudi Rao in 1945, and now called the Cramér–Rao lower bound — that pins down the smallest variance any unbiased estimator can have. The MLE, under regularity, achieves this bound asymptotically. That is the precise content of "asymptotically efficient."

This section gives you the score function (the gradient of the log-likelihood, whose zero is the MLE), Fisher information defined two equivalent ways (variance of the score, or negative expected Hessian), the CRLB statement and proof sketch, the worked examples for the four workhorse single-parameter families, the regularity conditions and the Uniform(0, θ) violation, the practical observed-versus-expected-information distinction every applied statistician uses for standard errors, the multivariate extension, and the generalised bound for biased estimators that §1.5 will use to set up bias-variance trade-off properly.

The score function

You met the log-likelihood and its derivative in §1.3. Recall: for iid data $X_1, \ldots, X_n$ with density $f(x; \theta)$, the log-likelihood is $\ell(\theta) = \sum_i \log f(X_i; \theta)$, and the MLE $\hat{\theta}$ is the $\theta$ that maximises it. Setting the derivative to zero gives the score equation. The score function is the derivative itself, viewed as a quantity in its own right:

$$U(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \log f(X_i; \theta)}{\partial \theta}.$$

By construction, the MLE satisfies $U(\hat{\theta}) = 0$. The score crosses zero at the peak of the log-likelihood; that is just calculus.

What is new in §1.4 is to treat $U(\theta)$ as a random variable, the way §1.1 told you to treat estimators. The data $X_i$ are random, so $U(\theta)$ is a function of random variables and has its own distribution (its own mean, its own variance), depending on the parameter $\theta$ and on the sample size $n$.

The first useful fact: under regularity (densities differentiable in $\theta$, expectations and derivatives interchangeable; technical conditions made precise below), the expected value of the score at the true parameter $\theta_0$ is zero:

$$E_{\theta_0}\!\left[U(\theta_0)\right] = 0.$$

Sketch: $1 = \int f(x; \theta) \, dx$ for every $\theta$; differentiate both sides under the integral sign to get $0 = \frac{\partial}{\partial \theta} \int f = \int \frac{\partial f}{\partial \theta} = \int \frac{\partial \log f}{\partial \theta} \cdot f$, which says $E_{\theta}[\partial \log f / \partial \theta] = 0$. Sum over $n$ iid observations and the same identity holds for $U(\theta) = \sum_i \partial \log f(X_i; \theta) / \partial \theta$. The score has mean zero at the truth. This is sometimes called the first Bartlett identity.

Read that sentence twice. It says: if you knew the true $\theta_0$ and averaged the gradient of the log-likelihood over all possible samples, you would get zero. The MLE, defined by the score being zero on the actual sample, is therefore the value of $\theta$ that makes the actual-sample gradient agree with the population gradient. Consistency of the MLE (§1.3) is exactly this argument made rigorous: as $n \to \infty$ the empirical score (divided by $n$) converges to its expectation, so the root of the empirical score converges to the root of the population score, which is the truth.
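The mean-zero property can be checked numerically. The following is a minimal sketch, assuming an Exponential model with rate 2 (an arbitrary choice), for which the score is U(λ) = n/λ − Σxᵢ. The helper name `exponential_score`, the seed, and the simulation sizes are illustrative assumptions, not part of the text.

```python
import random
import statistics

def exponential_score(sample, lam):
    """Score U(lam) = n/lam - sum(x) for an Exponential(rate=lam) sample."""
    return len(sample) / lam - sum(sample)

random.seed(0)                      # arbitrary seed, for reproducibility only
true_lam, n, reps = 2.0, 25, 20000  # illustrative values

scores_at_truth = []
scores_off_truth = []
for _ in range(reps):
    sample = [random.expovariate(true_lam) for _ in range(n)]
    scores_at_truth.append(exponential_score(sample, true_lam))
    scores_off_truth.append(exponential_score(sample, 3.0))

mean_at_truth = statistics.fmean(scores_at_truth)
mean_off_truth = statistics.fmean(scores_off_truth)
print(f"mean score at the true lambda: {mean_at_truth:.4f}")
print(f"mean score at lambda = 3.0   : {mean_off_truth:.4f}")
```

Evaluated at the true rate, the average score should sit close to zero; evaluated at any other value it settles on n/λ − n/λ₀, which is nonzero. That asymmetry is what lets the root of the score locate the truth.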

Fisher information: the variance of the score

The score has mean zero — fine. What is its variance?

For a single observation, define

$$I(\theta) = \operatorname{Var}_{\theta}\!\left[\frac{\partial \log f(X; \theta)}{\partial \theta}\right] = E_{\theta}\!\left[\left(\frac{\partial \log f(X; \theta)}{\partial \theta}\right)^2\right].$$

(The last equality uses that the mean of the per-observation score is zero, so variance equals expected square.) This is the Fisher information per observation. For $n$ iid observations the total Fisher information is

$$I_n(\theta) = \operatorname{Var}_{\theta}[U(\theta)] = n \cdot I(\theta),$$

because the score is a sum of $n$ iid mean-zero terms whose variances add.

Fisher information has units of $1/\theta^2$ and grows linearly in $n$. Intuition: each new data point adds the same expected amount of "information" about $\theta$. Doubling the sample doubles the information. This is why $\sqrt{n}$-rates of convergence are the norm for regular models: variance scales like $1/I_n = 1/(n \cdot I)$, so standard error scales like $1/\sqrt{n}$.
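The linear growth of the score's variance in n can be seen in simulation. This is a sketch under stated assumptions: a Bernoulli model with p = 0.3 (arbitrary), the per-observation information I(p) = 1/[p(1−p)] that the worked examples derive, and an illustrative helper `bernoulli_score`.

```python
import random
import statistics

def bernoulli_score(sample, p):
    """Score U(p) = sum_i [x_i/p - (1 - x_i)/(1 - p)] for a 0/1 sample."""
    successes = sum(sample)
    return successes / p - (len(sample) - successes) / (1 - p)

random.seed(1)                      # arbitrary seed, for reproducibility only
p, reps = 0.3, 20000                # illustrative values
info_per_obs = 1 / (p * (1 - p))    # I(p) for a single Bernoulli draw

ratio_by_n = {}
for n in (10, 20, 40):
    scores = []
    for _ in range(reps):
        sample = [1 if random.random() < p else 0 for _ in range(n)]
        scores.append(bernoulli_score(sample, p))
    score_variance = statistics.variance(scores)
    ratio_by_n[n] = score_variance / (n * info_per_obs)
    print(f"n={n:3d}  Var[U] = {score_variance:8.2f}   n*I(p) = {n * info_per_obs:8.2f}")
```

Each doubling of n should roughly double the simulated variance of the score, so the ratios stored in `ratio_by_n` should all be close to 1.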

The other formula: negative expected Hessian

There is a second formula for $I(\theta)$ that is often easier to compute and is the one numerical software actually uses:

$$I(\theta) = -E_{\theta}\!\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right].$$

Same quantity, totally different-looking expression. Sketch: differentiate $E_{\theta}[\partial \log f / \partial \theta] = \int \frac{\partial \log f}{\partial \theta} f \, dx = 0$ once more with respect to $\theta$. After the product rule and a bit of bookkeeping (this is the second Bartlett identity) you get

$$E_{\theta}\!\left[\frac{\partial^2 \log f}{\partial \theta^2}\right] + E_{\theta}\!\left[\left(\frac{\partial \log f}{\partial \theta}\right)^2\right] = 0,$$

which gives $I(\theta) = E[(\partial \log f / \partial \theta)^2] = -E[\partial^2 \log f / \partial \theta^2]$. Two equivalent characterisations.
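For a discrete model the two characterisations can be compared exactly, with no simulation, by summing over the support. The sketch below does this for a single Bernoulli draw; the function name `bernoulli_information_two_ways` is a hypothetical helper chosen for illustration.

```python
def bernoulli_information_two_ways(p):
    """Return (E[score^2], -E[second derivative]) for one Bernoulli(p) draw."""
    expected_score_sq = 0.0
    neg_expected_hessian = 0.0
    for x in (0, 1):
        prob = p if x == 1 else 1 - p
        score = x / p - (1 - x) / (1 - p)               # d/dp of log f(x; p)
        hessian = -x / p ** 2 - (1 - x) / (1 - p) ** 2  # d^2/dp^2 of log f(x; p)
        expected_score_sq += prob * score ** 2
        neg_expected_hessian += prob * (-hessian)
    return expected_score_sq, neg_expected_hessian

for p in (0.1, 0.3, 0.5, 0.8):
    via_score, via_hessian = bernoulli_information_two_ways(p)
    print(f"p={p:.1f}  E[score^2]={via_score:.6f}  "
          f"-E[hessian]={via_hessian:.6f}  1/[p(1-p)]={1 / (p * (1 - p)):.6f}")
```

All three columns should agree to floating-point precision for every p strictly between 0 and 1.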

The negative-expected-Hessian form has a clean geometric reading. The Hessian $\partial^2 \ell / \partial \theta^2$ at the MLE measures the curvature of the log-likelihood at its peak. Sharp peak ⇒ large negative second derivative ⇒ large Fisher info ⇒ small variance for the MLE. Flat peak ⇒ small Fisher info ⇒ large variance ⇒ data tells you little. The widget below makes this exact statement visible.

Curvature is information: scrub the score and the Hessian

The widget below draws three stacked plots for a one-parameter family of your choice (Bernoulli, Exponential, or Normal mean with σ = 1). All three plots share the same horizontal axis ($\theta$); each shows a different function of $\theta$ computed on a single fixed sample. TOP: the log-likelihood $\ell(\theta)$. MIDDLE: the score $U(\theta) = \ell'(\theta)$. BOTTOM: the negative second derivative $-\ell''(\theta)$.

[Interactive figure: Score Fisher Explorer]

Five things to verify:

  • The score (middle) crosses zero exactly at the MLE (green vertical line in all three plots). That is the definition of the MLE: the value of $\theta$ where the gradient vanishes.
  • The negative second derivative $-\ell''(\theta)$ (bottom plot) at the MLE θ̂ is the observed Fisher information. The status panel reports it as "−ℓ''(θ̂)" and reports its reciprocal as the estimated asymptotic variance of θ̂.
  • The status panel also reports the theoretical Fisher information per observation: $I(\theta_0) = 1/[p(1-p)]$ for Bernoulli, $1/\lambda^2$ for Exponential, $1/\sigma^2 = 1$ for the Normal mean with σ = 1 fixed. Compare this to $-\ell''(\hat{\theta}) / n$; they agree up to sample noise.
  • The CRLB, $1/[n \cdot I(\theta_0)]$, is the variance the MLE achieves asymptotically. The status panel reports it explicitly: $p(1-p)/n$ for Bernoulli, $\lambda^2 / n$ for Exponential, $1/n$ for the Normal mean.
  • The peak gets sharper as $n$ grows from 5 to 300. Curvature scales linearly in $n$; the bottom plot rises linearly with $n$; the CRLB drops as $1/n$. Drag the n slider and watch.

The widget makes the equivalence concrete: Fisher information is curvature, the CRLB is one-over-curvature, asymptotic variance is the CRLB.

Four worked examples

Bernoulli $\text{Bern}(p)$. Single-observation log-density $\log f(x; p) = x \log p + (1 - x) \log(1 - p)$. First derivative $x/p - (1-x)/(1-p)$; second derivative $-x/p^2 - (1-x)/(1-p)^2$. Taking the negative expectation under $X \sim \text{Bern}(p)$ (so $E[X] = p$):

$$I(p) = \frac{p}{p^2} + \frac{1 - p}{(1-p)^2} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}.$$

Total info $I_n(p) = n/[p(1-p)]$ and CRLB = $p(1-p)/n$. The MLE $\hat{p} = \bar{X}$ has variance $\operatorname{Var}(\bar{X}) = p(1-p)/n$ exactly, at every $n$: it achieves the CRLB exactly, not just asymptotically. Bernoulli is one of the few cases where the bound is hit for every sample size.

Exponential $\text{Exp}(\lambda)$. Single-observation log-density $\log f(x; \lambda) = \log \lambda - \lambda x$. Second derivative $-1/\lambda^2$ (deterministic, with no $X$ dependence). Negative expectation:

$$I(\lambda) = \frac{1}{\lambda^2}.$$

Total info $n / \lambda^2$ and CRLB = $\lambda^2 / n$. The MLE $\hat{\lambda} = 1/\bar{X}$ is biased but asymptotically efficient: $\sqrt{n}(\hat{\lambda} - \lambda) \to_d N(0, \lambda^2)$.

Normal mean $N(\mu, \sigma^2)$ (σ² known). Single-observation log-density $-\tfrac{1}{2}\log(2\pi\sigma^2) - (x - \mu)^2/(2\sigma^2)$. Second derivative $-1/\sigma^2$. Negative expectation:

$$I(\mu) = \frac{1}{\sigma^2}.$$

CRLB = $\sigma^2 / n$. The MLE $\hat{\mu} = \bar{X}$ has variance $\sigma^2 / n$ exactly, so it achieves the bound for every $n$, just like Bernoulli.

Normal variance $N(\mu, \sigma^2)$ (μ known). Let $\eta = \sigma^2$. Log-density $-\tfrac{1}{2}\log(2\pi\eta) - (x - \mu)^2/(2\eta)$. Differentiate twice in $\eta$: first derivative $-1/(2\eta) + (x - \mu)^2 / (2\eta^2)$; second derivative $1/(2\eta^2) - (x - \mu)^2 / \eta^3$. Negative expectation using $E[(X - \mu)^2] = \eta$:

$$I(\sigma^2) = -\left(\frac{1}{2\eta^2} - \frac{\eta}{\eta^3}\right) = \frac{1}{2\eta^2} = \frac{1}{2\sigma^4}.$$

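CRLB = $2\sigma^4 / n$. With μ known, the MLE $\hat{\sigma}^2 = (1/n)\sum (X_i - \mu)^2$ is unbiased, and because $\sum (X_i - \mu)^2 / \sigma^2 \sim \chi^2_n$ its variance is exactly $2\sigma^4 / n$: the bound is achieved at every $n$. (When μ is unknown and replaced by $\bar{X}$, the MLE $(1/n)\sum (X_i - \bar{X})^2$ is biased by the factor $(n-1)/n$ relative to the unbiased sample variance, and is efficient only asymptotically.) In both cases $\sqrt{n}(\hat{\sigma}^2 - \sigma^2) \to_d N(0, 2\sigma^4)$.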
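The Normal-variance calculation is the most algebra-heavy of the four, so a Monte Carlo sanity check is worth having. The sketch below estimates the per-observation information both ways (expected squared score and negative expected second derivative) and compares them with 1/(2σ⁴). The values μ = 1 and σ² = 2, the seed, and the number of draws are arbitrary assumptions.

```python
import random
import statistics

random.seed(2)                 # arbitrary seed, for reproducibility only
mu, eta = 1.0, 2.0             # eta = sigma^2; illustrative values
draws = 200000

squared_scores = []
negative_hessians = []
for _ in range(draws):
    x = random.gauss(mu, eta ** 0.5)
    sq_dev = (x - mu) ** 2
    score = -1 / (2 * eta) + sq_dev / (2 * eta ** 2)   # d/d(eta) of log f
    hessian = 1 / (2 * eta ** 2) - sq_dev / eta ** 3   # d^2/d(eta)^2 of log f
    squared_scores.append(score ** 2)
    negative_hessians.append(-hessian)

info_from_score = statistics.fmean(squared_scores)
info_from_hessian = statistics.fmean(negative_hessians)
info_closed_form = 1 / (2 * eta ** 2)

print(f"E[score^2]   ~ {info_from_score:.4f}")
print(f"-E[hessian]  ~ {info_from_hessian:.4f}")
print(f"1/(2 eta^2)  = {info_closed_form:.4f}")
```

Both Monte Carlo averages should land near the closed-form value, up to simulation noise.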

The Cramér–Rao lower bound

The statement, for iid data and a single parameter, with regularity assumed:

For any unbiased estimator $T(X_1, \ldots, X_n)$ of $\theta$,

$$\operatorname{Var}_{\theta}(T) \;\geq\; \frac{1}{n \cdot I(\theta)}.$$

Two-line sketch of the proof (one of the cleanest derivations in classical statistics). Let $T$ be unbiased: $E_{\theta}[T] = \theta$. Differentiate both sides in $\theta$, writing $f(x; \theta)$ here for the joint density of the whole sample $x = (x_1, \ldots, x_n)$, so that $\partial \log f / \partial \theta$ is the full score $U(\theta)$:

$$1 = \frac{\partial}{\partial \theta} \int T(x) f(x; \theta) \, dx = \int T(x) \frac{\partial \log f}{\partial \theta} f(x; \theta) \, dx = E_{\theta}[T \cdot U(\theta)].$$

(The score $U$ has mean zero, so this is also the covariance: $\operatorname{Cov}_{\theta}(T, U(\theta)) = 1$.) Now apply Cauchy–Schwarz: $\operatorname{Cov}(T, U)^2 \leq \operatorname{Var}(T) \cdot \operatorname{Var}(U)$, i.e. $1 \leq \operatorname{Var}(T) \cdot I_n(\theta)$, so $\operatorname{Var}(T) \geq 1/I_n(\theta) = 1/[n \cdot I(\theta)]$. Done.
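The key step of the proof, Cov(T, U) = 1 for every unbiased T, can be checked by simulation. The sketch below assumes Normal data with known σ = 1 and compares two unbiased estimators of μ: the sample mean and the sample median (unbiased here by symmetry). The helper `sample_covariance`, the seed, and the sizes are illustrative choices.

```python
import random
import statistics

def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

random.seed(4)                          # arbitrary seed, for reproducibility only
mu, sigma, n, reps = 0.0, 1.0, 25, 20000

means, medians, scores = [], [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))
    scores.append(sum(x - mu for x in sample) / sigma ** 2)   # U(mu) at the truth

crlb = sigma ** 2 / n
cov_mean_score = sample_covariance(means, scores)
cov_median_score = sample_covariance(medians, scores)
var_mean = statistics.variance(means)
var_median = statistics.variance(medians)

print(f"Cov(mean, U)   ~ {cov_mean_score:.3f}    Var(mean)   ~ {var_mean:.4f}")
print(f"Cov(median, U) ~ {cov_median_score:.3f}    Var(median) ~ {var_median:.4f}")
print(f"CRLB = {crlb:.4f}")
```

Both covariances should be close to 1, as the proof requires. The mean is perfectly correlated with the score, so Cauchy–Schwarz holds with equality and its variance sits on the bound; the median is only partly correlated with the score, so its variance must be strictly larger.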

The CRLB sits at the foundation of every standard-error calculation. Three things to internalise:

  • It is a lower bound. An unbiased estimator can match it or exceed it; it cannot beat it. If you compute the variance of your unbiased estimator and find it equals $1/[nI(\theta)]$, you have hit the bound. If you find it higher, you are leaving information on the table.
  • The bound is achieved exactly for some estimators (Bernoulli $\hat{p} = \bar{X}$, Normal mean $\hat{\mu} = \bar{X}$ with σ known). For most others, including most MLEs, the bound is achieved only asymptotically: $\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta) \to_d N(0, 1/I(\theta))$. That is the precise statement of the "asymptotic efficiency" that §1.3 referenced.
  • The bound is only for unbiased estimators. Biased estimators can, and routinely do, have variance below the CRLB. When μ is unknown, the Normal-variance MLE $(1/n)\sum (X_i - \bar{X})^2$ is biased low and has variance $2(n-1)\sigma^4/n^2$, below the unbiased-estimator bound $2\sigma^4/n$; the James–Stein shrinkage estimator (which §1.5 will revisit) is biased and beats the unbiased estimator in MSE when three or more means are estimated jointly. Trading bias for variance can pay off, and the next section is exactly about that trade-off.

Observed versus expected Fisher information

In theory you compute $I(\theta) = -E_{\theta}[\partial^2 \log f / \partial \theta^2]$: an expectation over $X$, evaluated at a specific $\theta$. Two things make this awkward in practice:

  • You do not know the true $\theta$. You have only an estimate $\hat{\theta}$.
  • The expectation involves an integral that may have no closed form even when the per-observation Hessian does.

Both problems are solved by the observed Fisher information: just take the negative Hessian of the actual log-likelihood at the actual MLE, on the actual sample, without taking expectations.

$$I_{\text{obs}}(\hat{\theta}) = -\frac{\partial^2 \ell(\theta)}{\partial \theta^2}\bigg|_{\theta = \hat{\theta}}.$$

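For the Normal mean the observed and expected information agree algebraically (the Hessian is non-random in $X$, so the expectation collapses). More generally, in an exponential family with its canonical parameterisation the two agree, and for Bernoulli in the $p$ parameterisation they agree at the MLE: $-\ell''(\hat{p}) = n/[\hat{p}(1-\hat{p})]$. Outside these cases (the Cauchy location model is the standard example) they differ in finite samples. Under regularity, both $I_{\text{obs}}(\hat{\theta})/n$ and $I(\hat{\theta})$ are consistent estimates of $I(\theta_0)$, and both produce asymptotically valid Wald standard errors $\widehat{\operatorname{SE}}(\hat{\theta}_j) = \sqrt{[I_{\text{obs}}^{-1}]_{jj}}$. In modern practice the observed form is often preferred: it is what a Newton-Raphson optimiser computes anyway (it is the negative of the Hessian at convergence), and Efron and Hinkley (1978) argued that the inverse observed information is usually the better variance estimate because it tracks the precision of the sample actually observed rather than an average over hypothetical samples.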

This is what your statistical software does. When you call glm() in R, fit a model with statsmodels in Python, or use any other GLM/MLE fitter, the "Std. Error" column in the output is $\sqrt{[\hat{I}^{-1}]_{jj}}$: an information matrix evaluated at the fitted MLE, inverted, with the $j$th diagonal element square-rooted. Whether $\hat{I}$ is the observed or the expected information depends on the fitter (Fisher-scoring routines such as R's glm() work with the expected form, and for canonical-link GLMs the two coincide). The Wald 95% CI is $\hat{\theta}_j \pm 1.96 \cdot \widehat{\operatorname{SE}}(\hat{\theta}_j)$. That whole pipeline is just §1.4 implemented automatically. (§3.1–§3.3 will revisit when the Wald approximation can be poor and what to do then.)
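To make the pipeline concrete, here is a minimal sketch that computes the observed information by a central finite difference on the log-likelihood, then turns it into a standard error and a Wald interval. It assumes an Exponential model on simulated data; the function names `exp_loglik` and `observed_information`, the step size `h`, and the data-generating rate are illustrative assumptions, and this is not how any particular package implements the calculation.

```python
import math
import random

def exp_loglik(lam, sample):
    """Exponential(rate=lam) log-likelihood: n*log(lam) - lam*sum(x)."""
    return len(sample) * math.log(lam) - lam * sum(sample)

def observed_information(loglik, theta_hat, sample, h=1e-3):
    """Central-difference approximation to -l''(theta_hat)."""
    second_derivative = (
        loglik(theta_hat + h, sample)
        - 2 * loglik(theta_hat, sample)
        + loglik(theta_hat - h, sample)
    ) / h ** 2
    return -second_derivative

random.seed(3)                                   # arbitrary seed
sample = [random.expovariate(1.5) for _ in range(200)]

lam_hat = len(sample) / sum(sample)              # MLE: 1 / sample mean
info_obs = observed_information(exp_loglik, lam_hat, sample)
info_analytic = len(sample) / lam_hat ** 2       # -l''(lam) = n / lam^2 at the MLE

std_error = 1 / math.sqrt(info_obs)
wald_ci = (lam_hat - 1.96 * std_error, lam_hat + 1.96 * std_error)

print(f"lambda_hat = {lam_hat:.4f}")
print(f"observed info: numeric {info_obs:.3f}, analytic {info_analytic:.3f}")
print(f"SE = {std_error:.4f}, 95% Wald CI = ({wald_ci[0]:.4f}, {wald_ci[1]:.4f})")
```

For this model the numerical and analytic observed information should agree closely, and the standard error reduces to λ̂/√n. In models without a closed-form Hessian only the numerical route (or automatic differentiation) is available.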

CRLB on simulated MLEs

The first widget showed you Fisher info as curvature on a single sample. The second widget shows the bound being approached in distribution. We simulate 1500 replicates of size $n$, compute the MLE on each, histogram the resulting $\hat{\theta}$ values, and overlay the asymptotic Gaussian $N(\theta_0, \, 1/[n I(\theta_0)])$, the distribution the MLE should converge to.

[Interactive figure: CRLB vs Empirical]

Three things to do:

  • Start with Bernoulli or Normal-mean. At $n = 20$ the empirical histogram already hugs the orange Gaussian almost exactly. These are the exact-efficiency cases.
  • Switch to Exponential and start at $n = 10$. The empirical histogram is visibly shifted RIGHT of the truth: the MLE is biased high for small samples (since $\hat{\lambda} = 1/\bar{X}$ and Jensen pushes $E[1/\bar{X}]$ above $1/E[\bar{X}]$). Push $n$ to 100 or 400 and the histogram centres on the truth and matches the Gaussian. That is consistency + asymptotic efficiency in action.
  • Switch to Uniform(0, θ). The empirical histogram is sharply LEFT-SKEWED and concentrated AT or BELOW the truth (since $\hat{\theta} = \max_i X_i \leq \theta$ always). The Normal overlay does NOT match: the limiting distribution of $n(\theta - \hat{\theta})$ is Exponential, and the rate of convergence is $1/n$, not $1/\sqrt{n}$. This is the regularity violation we will discuss next.

The status panel reports the empirical variance $\widehat{\operatorname{Var}}(\hat{\theta})$, the theoretical CRLB, and their ratio. Under regularity the ratio should be near 1 for moderate $n$. For the Uniform case the ratio drifts far from 1 (the MLE's variance shrinks like $1/n^2$ while the nominal bound shrinks like $1/n$), and the comparison is meaningless anyway because the CRLB derivation assumed regularity.
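The same experiment can be reproduced offline. The following is a rough sketch of the widget's Exponential case, not its actual code: the function name `simulate_exponential_mle`, the seed, the replicate count, and the two sample sizes are assumptions made for illustration.

```python
import random
import statistics

def simulate_exponential_mle(true_rate, n, reps, rng):
    """Return `reps` MLEs (1 / sample mean) from Exponential(true_rate) samples of size n."""
    return [n / sum(rng.expovariate(true_rate) for _ in range(n)) for _ in range(reps)]

rng = random.Random(5)          # arbitrary seed, for reproducibility only
true_rate, reps = 1.0, 5000     # illustrative values

summary = {}
for n in (10, 200):
    mles = simulate_exponential_mle(true_rate, n, reps, rng)
    crlb = true_rate ** 2 / n                     # 1 / [n * I(lambda)]
    summary[n] = {
        "mean": statistics.fmean(mles),
        "var_ratio": statistics.variance(mles) / crlb,
    }
    print(f"n={n:3d}  mean(lambda_hat)={summary[n]['mean']:.4f}  "
          f"empirical var / CRLB = {summary[n]['var_ratio']:.3f}")
```

At n = 10 the mean of the simulated MLEs should sit visibly above the true rate and the variance ratio well above 1; at n = 200 both should be close to their limiting values.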

Regularity conditions, and the Uniform pathology

Every theorem in this section — score has mean zero, the two formulas for Fisher info agree, CRLB holds, MLE is asymptotically normal — depends on a list of regularity conditions. The standard short version (Wasserman 2004 §9.4, Casella–Berger §10.6):

  • The parameter space $\Theta$ is an open set.
  • The density $f(x; \theta)$ is differentiable in $\theta$ for almost every $x$.
  • The support of $f(\cdot; \theta)$ does not depend on $\theta$.
  • You can interchange $\partial / \partial \theta$ with $\int dx$ (dominated convergence applies).
  • The Fisher information $I(\theta)$ exists and is finite.

For Bernoulli, Exponential, Normal, Gamma, Poisson, Binomial, and the other standard families, these hold. For Uniform(0, θ), the third condition fails: the support $[0, \theta]$ depends on $\theta$. As a consequence:

  • The score $\partial \log f / \partial \theta = -1/\theta$ is constant on the support and undefined at the boundary. Its mean is $-1/\theta$, not zero.
  • The second derivative of $\log f$ is $+1/\theta^2$, so the negative-expected-Hessian formula returns $-1/\theta^2$, a negative "information", while the expected squared score is $+1/\theta^2$. The two formulas disagree because interchanging differentiation and integration fails at the boundary $x = \theta$.
  • The CRLB derivation (the Cauchy–Schwarz step) requires the score-mean-zero property and breaks here.
  • The MLE $\hat{\theta} = \max_i X_i$ still exists and is consistent, but $n(\theta - \hat{\theta})$ converges to an Exponential distribution, not a Normal one, so the error shrinks at rate $1/n$. In particular $\operatorname{Var}(\hat{\theta}) = n\theta^2 / [(n+1)^2(n+2)] \approx \theta^2 / n^2$, which beats any $\theta^2/n$ scaling, the "naive" CRLB. The MLE is "super-efficient" but irregularly so.

The widget makes the regularity violation visible: the asymptotic Normal overlay simply does not fit the empirical histogram for Uniform. The whole CRLB/asymptotic-Normal apparatus presumes regularity. When the support depends on the parameter — Uniform, truncated distributions, shifted Exponential — you need different tools (extreme-value asymptotics for boundary cases, profile likelihood for general boundary-supported problems). §1.9 makes this comparison precise.
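A short simulation shows the irregular behaviour directly. This sketch assumes θ = 1 and n = 20 (both arbitrary) and checks three things stated above: the MLE never exceeds θ, its mean is about n/(n+1)·θ, and its variance is on the θ²/n² scale rather than the θ²/n scale.

```python
import random
import statistics

rng = random.Random(6)               # arbitrary seed, for reproducibility only
theta, n, reps = 1.0, 20, 20000      # illustrative values

# MLE for Uniform(0, theta) is the sample maximum.
mles = [max(rng.uniform(0.0, theta) for _ in range(n)) for _ in range(reps)]

mean_mle = statistics.fmean(mles)
var_mle = statistics.variance(mles)
scaled_gaps = [n * (theta - m) for m in mles]    # n * (theta - theta_hat)
scaled_gap_mean = statistics.fmean(scaled_gaps)

print(f"mean(theta_hat)          = {mean_mle:.4f}   n/(n+1)*theta = {n / (n + 1) * theta:.4f}")
print(f"Var(theta_hat)           = {var_mle:.6f}")
print(f"theta^2 / n   (1/n scale)   = {theta ** 2 / n:.6f}")
print(f"theta^2 / n^2 (1/n^2 scale) = {theta ** 2 / n ** 2:.6f}")
print(f"mean of n*(theta - theta_hat) = {scaled_gap_mean:.4f}")
```

The rescaled gap n(θ − θ̂) stays of order θ as n changes, which is the 1/n rate in action; a regular MLE would need the weaker √n rescaling.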

Generalised CRLB and the multivariate version

Two refinements you should know about, even if §1.4 does not derive them in detail.

Biased estimators. The CRLB statement above is for unbiased $T$. If $T$ is biased with bias function $b(\theta) = E_{\theta}[T] - \theta$, the same Cauchy–Schwarz argument gives

$$\operatorname{Var}_{\theta}(T) \geq \frac{[1 + b'(\theta)]^2}{n \cdot I(\theta)}.$$

The numerator $(1 + b')^2$ is the squared derivative of $E[T]$ with respect to $\theta$. For unbiased $T$, $b \equiv 0$ and we recover $1/[nI]$. For shrinkage estimators (Stein, ridge) $b'$ is negative, which can shrink the numerator dramatically, so the bound on variance shrinks, and the actual variance shrinks with it, often enough to beat unbiased estimators in MSE. §1.5 returns to this with explicit examples.
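A small worked case makes the generalised bound tangible. Take the shrunk mean T = c·X̄ for Normal data with known σ: its bias is b(μ) = (c − 1)μ, so b′(μ) = c − 1 and the bound becomes c²σ²/n, which T attains exactly. The sketch below tabulates variance, bound, and MSE for a few values of c; the function name `shrunk_mean_summary` and the values μ = 1, σ = 2, n = 10 are illustrative assumptions.

```python
def shrunk_mean_summary(c, mu, sigma, n):
    """Variance, generalised bound, and MSE of T = c * Xbar for Normal(mu, sigma^2) data."""
    info_per_obs = 1 / sigma ** 2            # I(mu) for the Normal mean
    bias_derivative = c - 1                  # b(mu) = (c - 1) * mu, so b'(mu) = c - 1
    variance = c ** 2 * sigma ** 2 / n       # Var(c * Xbar)
    bound = (1 + bias_derivative) ** 2 / (n * info_per_obs)
    mse = variance + ((c - 1) * mu) ** 2     # variance + bias^2
    return variance, bound, mse

mu, sigma, n = 1.0, 2.0, 10                  # illustrative values
unbiased_crlb = sigma ** 2 / n
results = {}
for c in (1.0, 0.9, 0.7, 0.5):
    results[c] = shrunk_mean_summary(c, mu, sigma, n)
    variance, bound, mse = results[c]
    print(f"c={c:.1f}  Var={variance:.3f}  bound={bound:.3f}  "
          f"MSE={mse:.3f}  (unbiased CRLB={unbiased_crlb:.3f})")
```

For these particular values every c below 1 in the table gives both a variance and an MSE under the unbiased bound σ²/n. The caveat is that the MSE-optimal c depends on the unknown μ, so this is an illustration of the bound rather than a usable estimator.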

Multivariate. For a parameter vector $\theta = (\theta_1, \ldots, \theta_k)$, the Fisher information becomes a $k \times k$ matrix with entries

$$I(\theta)_{jl} = E_{\theta}\!\left[\frac{\partial \log f}{\partial \theta_j} \cdot \frac{\partial \log f}{\partial \theta_l}\right] = -E_{\theta}\!\left[\frac{\partial^2 \log f}{\partial \theta_j \, \partial \theta_l}\right].$$

The multivariate CRLB says that for any unbiased estimator $T(X_1, \ldots, X_n)$ of $\theta$,

$$\operatorname{Cov}_{\theta}(T) \succeq \frac{1}{n} I(\theta)^{-1},$$

where $\succeq$ means "is greater than or equal to in the positive-semidefinite ordering": $\operatorname{Cov}(T) - I(\theta)^{-1}/n$ is positive semidefinite. In particular every diagonal entry of $\operatorname{Cov}(T)$ (every individual coordinate's variance) is bounded below by $[I^{-1}]_{jj}/n$, AND every linear combination $a^T T$ has variance bounded below by $a^T I^{-1} a / n$.

The asymptotic distribution of the multivariate MLE is multivariate Normal:

$$\sqrt{n}(\hat{\theta}_{\text{MLE}} - \theta_0) \to_d N_k(\mathbf{0}, \, I(\theta_0)^{-1}).$$

This is the result that powers every default standard-error and confidence-interval column in regression and GLM output. The reported covariance matrix of the coefficient estimates is an inverse information matrix evaluated at the fit, $I_{\text{obs}}(\hat{\theta})^{-1}$ or its expected-information counterpart. Standard errors are the square roots of its diagonal entries. 95% Wald CIs are coefficient ± 1.96·SE. Even the joint hypothesis tests (likelihood-ratio, Wald, score) you will meet in Part 2 are built directly on the Fisher information.
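As a two-parameter illustration, the sketch below builds the observed information matrix for a Normal model with both μ and η = σ² unknown, inverts it, and reads off the standard errors. The entries come from differentiating the Normal log-likelihood twice; the function names `normal_observed_information` and `invert_2x2` and the simulated data are assumptions made for the example.

```python
import math
import random

def normal_observed_information(sample):
    """MLE (mu_hat, eta_hat) and observed information matrix for (mu, eta = sigma^2)."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sq_sum = sum((x - mu_hat) ** 2 for x in sample)
    eta_hat = sq_sum / n
    resid_sum = sum(x - mu_hat for x in sample)      # zero up to rounding
    # Negative second derivatives of the log-likelihood at (mu_hat, eta_hat).
    i_mu_mu = n / eta_hat
    i_mu_eta = resid_sum / eta_hat ** 2
    i_eta_eta = -n / (2 * eta_hat ** 2) + sq_sum / eta_hat ** 3
    return (mu_hat, eta_hat), [[i_mu_mu, i_mu_eta], [i_mu_eta, i_eta_eta]]

def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

random.seed(7)                                       # arbitrary seed
sample = [random.gauss(10.0, 3.0) for _ in range(500)]

(mu_hat, eta_hat), info = normal_observed_information(sample)
cov = invert_2x2(info)
se_mu = math.sqrt(cov[0][0])
se_eta = math.sqrt(cov[1][1])

print(f"mu_hat  = {mu_hat:.3f}  SE = {se_mu:.3f}")
print(f"eta_hat = {eta_hat:.3f}  SE = {se_eta:.3f}")
```

At the MLE the off-diagonal entry vanishes, so the matrix is diagonal and the standard errors reduce to √(σ̂²/n) for the mean and σ̂²·√(2/n) for the variance, matching the single-parameter results from the worked examples.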

Try it

  • In the score-fisher explorer, pick Bernoulli with $p = 0.4$ and $n = 30$. Slide $p$ from 0.05 up to 0.95. Where does the score $U(p)$ cross zero? Verify that it matches the MLE marker (the green vertical line) and that this MLE equals $\bar{X}$ on the displayed sample. Now click "Freeze new sample" five times and watch the green line wander. That is the sampling distribution of $\hat{p}_{\text{MLE}}$; its theoretical variance is exactly $p(1-p)/n$.
  • Same widget, switch to Exponential with $\lambda = 1$. Slide $n$ from 5 up to 300. What happens to $-\ell''(\lambda)$? (It scales LINEARLY with $n$.) What does that imply about the CRLB? (It is linear in $1/n$, so variance shrinks like $1/n$.) Read the status panel's "n · I(true θ)" and "CRLB" fields and verify they track that scaling.
  • Same widget, switch to Normal mean with $\mu = 0.5$, $n = 30$. The bottom plot $-\ell''(\mu)$ is a HORIZONTAL LINE at $n$. Why? (Because for fixed σ, the second derivative is $-n/\sigma^2 = -n$, independent of both $\mu$ and the data.) That is unusual: most distributions have an information that depends on the parameter. Bernoulli (try it again) does: $I(p) = 1/[p(1-p)]$ is highest near 0 and 1.
  • In the CRLB vs empirical widget, pick Exponential and set $n = 10$. Note the empirical mean of $\hat{\lambda}$ in the status panel: it is biased high. Now push $n$ to 100, then 400. The bias shrinks (consistency); the variance ratio (empirical / CRLB) tightens toward 1 (asymptotic efficiency).
  • Same widget, switch to Uniform(0, 1). Note that the histogram is concentrated below 1 (the truth), and the Normal overlay does not fit. Read the empirical mean of $\hat{\theta}$ from the status panel and verify it is approximately $n/(n+1) \cdot \theta$ (the known expectation of the Uniform MLE, which is biased low by $\theta/(n+1)$). Push $n$ to 200: the histogram contracts MUCH faster than the Normal overlay does, because the true rate of convergence is $1/n$, not $1/\sqrt{n}$.
  • Pen-and-paper: derive $I(p) = 1/[p(1-p)]$ for the Bernoulli. Then write down the CRLB. Then verify that $\operatorname{Var}(\bar{X}) = p(1-p)/n$ equals the CRLB exactly. Now do the same for Exponential: derive $I(\lambda) = 1/\lambda^2$, CRLB = $\lambda^2/n$, and show that the asymptotic variance of $\hat{\lambda} = 1/\bar{X}$ is $\lambda^2/n$ using the delta method.
  • Pen-and-paper, harder: compute the Fisher information for the Poisson distribution $P(X = k; \lambda) = e^{-\lambda} \lambda^k / k!$. Show that $I(\lambda) = 1/\lambda$. Verify that the MLE $\hat{\lambda} = \bar{X}$ achieves the CRLB exactly (variance $\lambda/n$). This is a third exact-efficiency case alongside Bernoulli and Normal-mean.

Pause and reflect: Fisher information measures how much a sample tells you about a parameter. Two different parameterisations of the same model produce two different Fisher informations. For the Bernoulli, $I(p) = 1/[p(1-p)]$; for the same model parameterised by log-odds $\eta = \log[p/(1-p)]$, what is $I(\eta)$? (Hint: use the chain rule. $I(\eta) = I(p) \cdot (dp/d\eta)^2$.) Why does this matter when you build confidence intervals on $p$ versus on $\eta$?

What you now know

The score function $U(\theta) = \partial \ell / \partial \theta$ is the gradient of the log-likelihood; it has mean zero at the true parameter, crosses zero at the MLE, and its variance is the Fisher information $I_n(\theta) = n \cdot I(\theta)$. Fisher information per observation $I(\theta) = E[(\partial \log f / \partial \theta)^2] = -E[\partial^2 \log f / \partial \theta^2]$ can be computed either way (the second Bartlett identity). High Fisher info means a sharply curved log-likelihood at its peak; the Cramér–Rao bound $\operatorname{Var}(\hat{\theta}) \geq 1/[n \cdot I(\theta)]$ holds for any unbiased estimator, and the MLE achieves this bound asymptotically (and sometimes exactly, as in Bernoulli, Poisson, and Normal-mean cases).

You know the four worked examples in closed form, you have seen the regularity conditions and the Uniform pathology where they fail, you understand the practical observed-vs-expected-information distinction every statistical-software package implements, and you have seen the multivariate extension that underlies every standard-error column in every regression output. The CRLB and Fisher information are not optional theory; they are the substrate the rest of inferential statistics is built on. §1.5 turns the bias-variance trade-off into its own section and makes the case for accepting a little bias in exchange for a large reduction in variance (the shrinkage estimators §1.1 previewed). §1.6 makes "standard error" precise in the language of sampling distributions. §3 builds confidence intervals on top of all of this; §4.5 and §5 build regression and GLM inference; Part 2 builds hypothesis testing. Everything downstream uses what you just learned.

References

  • Fisher, R.A. (1925). "Theory of statistical estimation." Proceedings of the Cambridge Philosophical Society 22(5), 700–725. (Fisher introduces information and pins down asymptotic efficiency for the MLE.)
  • Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press. (The form of the bound that bears Cramér's name; the textbook that consolidated the early classical estimation theory.)
  • Rao, C.R. (1945). "Information and the accuracy attainable in the estimation of statistical parameters." Bulletin of the Calcutta Mathematical Society 37, 81–91. (Rao's independent derivation of the bound, reaching the same conclusion via a slightly different route.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer. (§9.4–9.7 cover Fisher info, CRLB, and asymptotic normality of the MLE in a compact modern presentation.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. (Chapter 7 on point estimation, with careful proofs of Bartlett identities and CRLB; ideal companion to this section.)
  • Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. (The deeper reference; Chapter 2 develops information and efficiency from first principles, including the multivariate matrix CRLB.)
  • Cox, D.R., Hinkley, D.V. (1974). Theoretical Statistics. Chapman & Hall. (Classic. Especially good on the observed-versus-expected information distinction and on regularity-condition failures.)
