OLS as geometry

Part 4 — Linear regression, done seriously

Learning objectives

  • State the LINEAR-REGRESSION SETUP in matrix form: Y = Xβ + ε with Y ∈ ℝⁿ, X ∈ ℝⁿˣᵖ (design matrix; rows = observations, columns = predictors, typically including a column of 1s for the intercept), β ∈ ℝᵖ (coefficient vector), ε ∈ ℝⁿ (error vector). Identify the four objects' dimensions and note that OLS is well-posed when X has full column rank (which requires p ≤ n), with p ≪ n the comfortable regime
  • Recognise that OLS minimises ‖Y − Xβ‖² and that the GEOMETRIC interpretation — Ŷ = Xβ̂ is the orthogonal projection of Y onto col(X), the column space of X — is more fundamental than the calculus derivation. The proof requires no calculus; the closest-point theorem in Euclidean space (Hilbert space) gives the projection
  • Define the RESIDUAL e = Y − Ŷ and the ORTHOGONALITY PROPERTY: e ⊥ every column of X, equivalently X′e = 0. Recognise this as the SAME condition as the first-order condition from minimising Σ residuals² with calculus — geometry and calculus give the same answer
  • State the HAT MATRIX H = X(X′X)⁻¹X′. Verify its properties: (i) symmetric (H′ = H); (ii) idempotent (H² = H); (iii) Ŷ = HY; (iv) the residual-maker M = I − H is also symmetric and idempotent with e = MY; (v) trace(H) = p, the column-space dimension. Recognise H as THE matrix of the orthogonal projection onto col(X)
  • Derive the NORMAL EQUATIONS X′Xβ̂ = X′Y by setting X′e = 0 and substituting e = Y − Xβ̂. Recognise that when X has full column rank, X′X is p × p and invertible, giving the closed-form OLS estimator β̂ = (X′X)⁻¹X′Y. State that the estimator is LINEAR in Y — β̂ = (X′X)⁻¹X′·Y is a linear map from ℝⁿ to ℝᵖ — which Gauss-Markov uses
  • Read off the GEOMETRIC R² FORMULA: after centring Y and Ŷ around Ȳ, R² = ‖Ŷ − Ȳ𝟏‖² / ‖Y − Ȳ𝟏‖² = cos² of the angle between (Y − Ȳ𝟏) and col(X) projected onto the orthogonal complement of 𝟏. State that R² ∈ [0, 1] is the fraction of variation in Y explained by the projection. Note its three failure modes: (i) it never decreases when predictors are added; (ii) it is not a model-quality measure on its own; (iii) adjusted R² and AIC/BIC (§4.7) are the responsible alternatives for model comparison
  • Define LEVERAGE h_ii = (H)_ii, the i-th diagonal entry of the hat matrix. State the identity Σ h_ii = trace(H) = p, hence the AVERAGE LEVERAGE = p/n. State the Belsley-Kuh-Welsch (1980) RULE OF THUMB: h_ii > 2p/n flags a point worth a second look. Interpret geometrically: high leverage = the row of X is far from the centroid in covariate space, in the design's metric
  • State the MULTICOLLINEARITY DIAGNOSTIC as the geometric statement: when two or more columns of X are nearly parallel (nearly linearly dependent), X′X is nearly singular and (X′X)⁻¹ has very large entries, making β̂ unstable (small changes in Y produce large changes in β̂). Recognise that this is structural, not a small-sample artefact; §4.3 develops VIFs / condition indices
  • State the FIVE GAUSS-MARKOV ASSUMPTIONS: (1) LINEARITY — 𝔼[Y | X] = Xβ; (2) EXOGENEITY / strict exogeneity — 𝔼[ε | X] = 0; (3) HOMOSCEDASTICITY — Var(ε_i | X) = σ² constant; (4) NO AUTOCORRELATION — Cov(ε_i, ε_j | X) = 0 for i ≠ j; (5) NO PERFECT MULTICOLLINEARITY — rank(X) = p. State the GAUSS-MARKOV THEOREM: under these five assumptions, the OLS estimator β̂ is BLUE (Best Linear Unbiased Estimator) — minimum-variance among all linear unbiased estimators of β. Note that BLUE concerns LINEAR unbiased estimators only; biased estimators (ridge, §4.7) and nonlinear estimators (LASSO, §9.2) can beat OLS in MSE
  • Recognise the optional SIXTH ASSUMPTION (NORMALITY): ε | X ~ 𝒩(0, σ²I). Under Normality, β̂ is exactly Normal in finite samples, σ̂²(X′X)⁻¹ gives exact small-sample standard errors, and (n − p)σ̂²/σ² ~ χ²_{n−p} — yielding the exact small-sample t- and F-tests of §4.3. Without Normality, the CLT carries the same inference asymptotically. Distinguish the five Gauss-Markov assumptions (about MOMENTS) from the Normality assumption (about the DISTRIBUTION)
  • Articulate why GEOMETRY FIRST helps: the projection picture makes diagnostics (§4.3), GLS (§4.4 — change the inner product), robust regression (§4.5 — change the loss away from ‖·‖²), ridge / lasso (Part 9 — add a penalty whose level sets carve the projection differently), and the causal-interpretation warnings (§4.8 — the geometry says nothing about cause) all snap into place as different perturbations of the same geometric object
  • Read the catalogue of seminal references: Gauss (1809) and Legendre (1805) for the historical origin; Plackett (1972) for the priority history; Hastie-Tibshirani-Friedman (2009) ch.3 and James-Witten-Hastie-Tibshirani (2021) ch.3 as the modern textbook treatments; Greene (2018) as the econometric reference; Belsley-Kuh-Welsch (1980) for leverage and diagnostics; Wasserman (2004) ch.13 and Casella-Berger (2002) ch.11 for the mathematical-statistics treatment

Part 3 closed with the empirical-testability backbone (calibration) and the communication side of UNIVARIATE inference. Part 4 turns to the keystone tool of applied statistics across every science: linear regression, the procedure that takes a response $Y$ and a vector of predictors $X$ and asks "what linear combination of the predictors best explains the response?" The simplicity is deceptive: when its assumptions hold, OLS is the unique optimum among an entire family of estimators; when they fail, the literature on what to do next fills shelves.

§4.1 sets the foundation. The angle we take is GEOMETRIC. Most introductory presentations open with calculus: write $\|Y - X\beta\|^2$, take the derivative with respect to $\beta$, set it to zero, solve. That works and gives the right formula, but it hides the underlying object. The MORE FUNDAMENTAL view (and the one that pays for itself in every later section) is that OLS computes the ORTHOGONAL PROJECTION of $Y$ onto the column space of $X$. Once you see the picture, the hat matrix, the normal equations, leverage, R², the Gauss-Markov theorem, multicollinearity, ridge regression, robust regression, and generalised least squares all reveal themselves as geometric statements about projections: sometimes onto col(X), sometimes onto a tilted or penalised version of it.

The §4.1 arc has eight stops. First, the matrix-form setup $Y = X\beta + \varepsilon$ and the dimensions to keep in mind. Second, the geometric closest-point argument that defines $\hat\beta$. Third, the orthogonality of residuals to columns of X, expressed as $X^\top e = 0$. Fourth, the hat matrix $H = X(X^\top X)^{-1}X^\top$ and its three structural properties (symmetric, idempotent, trace = p). Fifth, the normal equations and the closed-form OLS estimator. Sixth, geometric implications: R² as $\cos^2$, leverage as the diagonal of $H$, multicollinearity as near-parallel columns. Seventh, the Gauss-Markov theorem (BLUE under five assumptions) and the role of the optional Normality assumption. Eighth, the two widgets: ols-projection-geometry (a 3-D view of the projection) and hat-matrix-leverage (a leverage scatter where you drag points and watch h_ii update).

The setup: Y = Xβ + ε in matrix form

The data live in four objects of fixed shape:

  • $Y \in \mathbb{R}^n$: the response (outcome) vector. One number per observation. In a study of $n$ patients, $Y_i$ might be patient $i$'s recovery time. The whole-vector view treats $Y$ as a single point in $n$-dimensional space.
  • $X \in \mathbb{R}^{n \times p}$: the design matrix. Row $i$ holds the $p$ predictors for observation $i$; column $j$ holds predictor $j$'s values across all $n$ observations. The first column is typically a vector of $1$s (the intercept). For two predictors plus intercept, row $i$ is $(1, x_{i1}, x_{i2})$, so $p = 3$.
  • $\beta \in \mathbb{R}^p$: the coefficient vector. One number per predictor. In population terms, the unknown parameter we want to estimate.
  • $\varepsilon \in \mathbb{R}^n$: the error vector. One number per observation, the part of $Y_i$ not explained by the linear combination $X_i \beta$ (with $X_i$ the $i$-th row of $X$). In population terms, $\varepsilon_i = Y_i - X_i\beta$.

The MODEL is the matrix equation

$$Y \;=\; X\beta \;+\; \varepsilon.$$

Read it slowly. The left side is observed ($Y$). The right side decomposes that observation into (i) a SYSTEMATIC piece $X\beta$, a linear combination of the columns of $X$, and (ii) a RANDOM piece $\varepsilon$. The estimation problem is: given $Y, X$, find a $\hat\beta$ such that $X\hat\beta$ explains as much of $Y$ as possible while the unexplained residual $e = Y - X\hat\beta$ behaves like the errors should under whatever assumptions are appropriate. The criterion that defines OLS is "make the residual as short as possible in Euclidean length":

$$\hat\beta \;=\; \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|^2.$$

The square is convenient (it makes the calculus tractable), but the choice of Euclidean norm is what locks OLS to its projection-geometry interpretation. Change the norm (e.g., to $\|\cdot\|_{\Omega^{-1}}$ for a known covariance $\Omega$) and you get GLS; change $\|\cdot\|^2$ to a bounded loss and you get M-estimators / robust regression. The keystone is the Euclidean choice. §4.1 keeps it.
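To make the four shapes and the least-squares criterion concrete, here is a minimal NumPy sketch. The simulated data, the seed, and the names `beta_true` and `rss` are illustrative assumptions, not part of the course material; `np.linalg.lstsq` is used as a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, for reproducibility only
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Design matrix: intercept column plus two predictors, so p = 3.
X = np.column_stack([np.ones(n), x1, x2])
beta_true = np.array([1.0, 2.0, -0.5])     # hypothetical population coefficients
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least-squares solution of min_beta ||Y - X beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

def rss(beta):
    """Squared Euclidean length of the residual Y - X beta."""
    r = Y - X @ beta
    return float(r @ r)

# Any small perturbation of beta_hat should lengthen the residual.
perturbed = [rss(beta_hat + 0.01 * rng.normal(size=3)) for _ in range(5)]

print("shapes:", X.shape, Y.shape, beta_hat.shape)
print("RSS at beta_hat:", rss(beta_hat), " smallest perturbed RSS:", min(perturbed))
```

The perturbation check is only a spot check of the minimum, not a proof; the geometric argument in the next section supplies the proof.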

OLS is the orthogonal projection of Y onto col(X)

Here is the geometric argument, which uses NO calculus. Let $\mathrm{col}(X) = \{X\beta : \beta \in \mathbb{R}^p\}$ be the $p$-dimensional subspace of $\mathbb{R}^n$ spanned by the columns of $X$. The minimisation $\min_\beta \|Y - X\beta\|^2$ asks: over all points $X\beta$ in $\mathrm{col}(X)$, which one is closest (in Euclidean distance) to $Y$? The CLOSEST-POINT THEOREM in Euclidean space (or more generally in any Hilbert space) gives the answer:

The closest point to $Y$ in a closed subspace $\mathcal{S}$ is the orthogonal projection of $Y$ onto $\mathcal{S}$. It is characterised uniquely by the condition that the residual $Y - \hat Y$ is orthogonal to every vector in $\mathcal{S}$.

Apply this with $\mathcal{S} = \mathrm{col}(X)$. The closest point in $\mathrm{col}(X)$ to $Y$ is the orthogonal projection; call it $\hat Y = X\hat\beta$. The residual $e = Y - \hat Y$ is perpendicular to every column of $X$:

$$\langle e, x_j \rangle \;=\; x_j^\top e \;=\; 0 \qquad \text{for every column } x_j \text{ of } X,$$

or, stacking the $p$ equations into a single matrix equation,

$$\boxed{\;X^\top e \;=\; 0.\;}$$

This is the GEOMETRIC OLS condition. Substituting $e = Y - X\hat\beta$ gives $X^\top(Y - X\hat\beta) = 0$, i.e.

$$X^\top X \,\hat\beta \;=\; X^\top Y.$$

These are the NORMAL EQUATIONS. When $X$ has full column rank (the no-perfect-multicollinearity assumption), the $p \times p$ matrix $X^\top X$ is invertible and the unique OLS estimator is

$$\boxed{\;\hat\beta \;=\; (X^\top X)^{-1} X^\top Y.\;}$$

Two things to absorb. First, the calculus version (minimise $\|Y - X\beta\|^2$ by differentiating in $\beta$ and setting the gradient to zero) gives $2X^\top(X\beta - Y) = 0$, i.e. exactly the same normal equations. Geometry and calculus agree, but the geometric argument needs neither vector calculus nor convex optimisation, only the closest-point theorem. Second, the estimator $\hat\beta = (X^\top X)^{-1} X^\top Y$ is LINEAR IN Y: it is the matrix $(X^\top X)^{-1} X^\top \in \mathbb{R}^{p \times n}$ applied to the $n$-vector $Y$. The Gauss-Markov theorem, below, uses this linearity in an essential way.
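The normal equations and the orthogonality condition can be checked numerically. The sketch below uses made-up random data; it solves $X^\top X \hat\beta = X^\top Y$ with `np.linalg.solve` (rather than forming the inverse explicitly, which is a numerical-practice choice and not something the text prescribes) and compares against a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)             # arbitrary seed
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)                     # any Y works: the identities are algebraic

# Normal equations: (X'X) beta_hat = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Residual and the geometric OLS condition X'e = 0.
e = Y - X @ beta_hat
print("X'e =", X.T @ e)                    # p numbers, each zero up to rounding

# A generic least-squares solver should give the same coefficients.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("agrees with lstsq:", np.allclose(beta_hat, beta_lstsq))
```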

The hat matrix H = X(X′X)⁻¹X′

Multiply both sides of $\hat\beta = (X^\top X)^{-1} X^\top Y$ on the left by $X$:

$$\hat Y \;=\; X\hat\beta \;=\; X(X^\top X)^{-1} X^\top \, Y \;=\; H\,Y, \qquad H \;\equiv\; X(X^\top X)^{-1} X^\top.$$

The matrix $H$, universally called the HAT MATRIX (because it "puts the hat on" $Y$), is $n \times n$ and depends ONLY on $X$, not on $Y$. It is THE matrix representation of the orthogonal projection from $\mathbb{R}^n$ onto $\mathrm{col}(X)$. Three structural properties make everything in Parts 4–5 click:

  • Symmetric: $H^\top = H$. Direct from the definition: $H^\top = \bigl(X(X^\top X)^{-1}X^\top\bigr)^\top = X((X^\top X)^{-1})^\top X^\top = X(X^\top X)^{-1} X^\top = H$, using $(A^{-1})^\top = (A^\top)^{-1}$ for symmetric $A = X^\top X$.
  • Idempotent: $H^2 = H$. $H^2 = X(X^\top X)^{-1}X^\top \cdot X(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}X^\top = H$. Applying the projection twice gives the same result as applying it once; the geometric reading is that projecting an already-projected vector leaves it unchanged.
  • trace($H$) = $p$. $\mathrm{tr}(H) = \mathrm{tr}(X(X^\top X)^{-1}X^\top) = \mathrm{tr}((X^\top X)^{-1}X^\top X) = \mathrm{tr}(I_p) = p$, using the cyclic property of trace. This identity is the heart of degrees-of-freedom counting in regression: the $p$ columns of $X$ consume $p$ degrees of freedom; the residual sits in an $(n - p)$-dimensional subspace.

The RESIDUAL-MAKER matrix is $M = I - H$. It is also symmetric ($M^\top = I - H^\top = I - H = M$) and idempotent ($M^2 = (I - H)(I - H) = I - 2H + H^2 = I - 2H + H = I - H = M$), so $M$ is the orthogonal projection onto the $(n - p)$-dimensional orthogonal complement of $\mathrm{col}(X)$. The residual is $e = Y - \hat Y = (I - H)Y = MY$. The fitted values and residuals decompose $Y$ orthogonally:

$$Y \;=\; HY + (I - H)Y \;=\; \hat Y + e, \qquad \hat Y^\top e = (HY)^\top (MY) = Y^\top H M Y = 0$$

because $HM = H(I - H) = H - H^2 = 0$. The fitted values and the residuals are PERPENDICULAR vectors in $\mathbb{R}^n$, and their squared lengths add by Pythagoras:

$$\|Y\|^2 \;=\; \|\hat Y\|^2 \;+\; \|e\|^2.$$

The widget below makes this concrete: as you drag $Y$ around, watch $\|\hat Y\|^2 + \|e\|^2$ track $\|Y\|^2$ to within rounding. That identity is not a sample fact; it is a geometric necessity of orthogonal projection.
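The structural properties of $H$ and $M$ are easy to confirm on a small example. The following is a sketch on random data (seed, sizes, and the `checks` dictionary are arbitrary illustrative choices); it forms the full $n \times n$ hat matrix, which is fine for a demonstration but would be avoided for large $n$.

```python
import numpy as np

rng = np.random.default_rng(2)             # arbitrary seed
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X' and residual-maker M = I - H.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H
Y_hat, e = H @ Y, M @ Y

checks = {
    "symmetric":      np.allclose(H, H.T),
    "idempotent":     np.allclose(H @ H, H),
    "trace_equals_p": np.isclose(np.trace(H), p),
    "HM_is_zero":     np.allclose(H @ M, 0.0),
    "pythagoras":     np.isclose(Y @ Y, Y_hat @ Y_hat + e @ e),
}
for name, ok in checks.items():
    print(f"{name:15s} {ok}")
```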

The projection-geometry widget

The first widget makes the picture inhabitable. To keep the visualisation in 3-D, we fix $n = 3$ and $p = 2$: $Y$ is a vector in $\mathbb{R}^3$, the column space $\mathrm{col}(X)$ is a 2-D plane through the origin, and the projection $\hat Y$ is the foot of the perpendicular from $Y$ to that plane. Drag the canvas to rotate the view; move $Y$ with the three sliders; or "Re-roll columns" to draw a fresh random orthonormal pair so the plane visibly tilts.

[Interactive figure: ols-projection-geometry]

Things to verify in the widget:

  • The vector $Y$ (orange) starts off the plane. $\hat Y$ (cyan) sits on the plane. The residual $e = Y - \hat Y$ (red dashed) is the perpendicular from $Y$ to the plane. The right-angle tick mark at $\hat Y$ is not decoration; it is the geometric statement $e \perp \mathrm{col}(X)$.
  • The status table reports $\langle e, x_1 \rangle$ and $\langle e, x_2 \rangle$, the inner products of the residual with each column. They are both numerically $0$ (to within $10^{-15}$ floating-point error). This is the $X^\top e = 0$ statement, verified numerically.
  • The table also reports $\|Y\|^2$, $\|\hat Y\|^2$, $\|e\|^2$, and the SUM $\|\hat Y\|^2 + \|e\|^2$. The sum matches $\|Y\|^2$ exactly: Pythagoras for orthogonal projection.
  • Click "Snap Y onto plane". The widget animates $Y$ onto $\mathrm{col}(X)$ and back. When $Y$ sits inside the plane, the residual collapses to zero, $R^2 \to 1$, and the fit is exact. This is the "lucky" or "trivial" case where the data lie exactly in the assumed model.
  • Slide $Y$ so it is roughly perpendicular to the plane. $\hat Y$ shrinks toward the origin; $R^2 \to 0$. The columns of $X$ carry no information about the direction of $Y$; the projection is (nearly) zero. This is the "the predictors do not predict $Y$" case.
  • Click "Re-roll columns" to draw a fresh orthonormal pair. The plane tilts; $\hat Y$ jumps to the new projection; the residual reorients. The numeric $R^2$ changes: different column spaces explain different fractions of $\|Y\|^2$.
  • Read the "geometric R²" row of the table: $R^2 = \cos^2 \theta$, where $\theta$ is the angle between $Y$ and the plane. Small $\theta$ (Y close to the plane) gives R² close to 1; large $\theta$ (Y nearly orthogonal to the plane) gives R² close to 0. This is the GEOMETRIC INTERPRETATION OF R², independent of any sample statistic.
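For readers without the interactive figure, the quantities in the widget's status table can be reproduced offline. This is a sketch under assumptions: the widget's internals are not shown in the text, so the orthonormal pair is drawn here via a QR factorisation, the vector `Y` is a made-up example, and R² is taken to be the uncentred ratio $\|\hat Y\|^2 / \|Y\|^2$ (there is no intercept column in this $n = 3$, $p = 2$ picture).

```python
import numpy as np

rng = np.random.default_rng(3)                 # arbitrary seed; change it to "re-roll"
Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))   # 3x2 matrix with orthonormal columns
x1, x2 = Q[:, 0], Q[:, 1]

Y = np.array([0.8, -0.4, 1.2])                 # hypothetical slider values

# With orthonormal columns, X'X = I, so the hat matrix reduces to H = Q Q'.
Y_hat = Q @ (Q.T @ Y)
e = Y - Y_hat

cos2 = (Y_hat @ Y_hat) / (Y @ Y)               # uncentred R^2 = cos^2(theta)
theta_deg = np.degrees(np.arccos(np.sqrt(cos2)))

print("<e, x1> =", e @ x1, "  <e, x2> =", e @ x2)
print("|Y|^2 =", Y @ Y, "  |Y_hat|^2 + |e|^2 =", Y_hat @ Y_hat + e @ e)
print("R^2 =", cos2, "  theta (degrees) =", theta_deg)
```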

R² is cos² of an angle

The widget exposes a formula that introductory regression texts often state without context: $R^2 = \cos^2 \theta$. Here is the full statement. Let $\bar Y$ be the sample mean of $Y$ and $\mathbf{1}$ the $n$-vector of ones. The CENTRED response is $Y - \bar Y \mathbf{1}$ and the CENTRED fitted values are $\hat Y - \bar Y \mathbf{1}$. When the design includes an intercept (so $\mathbf{1} \in \mathrm{col}(X)$), we have $\sum_i e_i = 0$ (the residuals sum to zero), hence $\bar{\hat Y} = \bar Y$, and the standard decomposition holds:

$$\underbrace{\|Y - \bar Y \mathbf{1}\|^2}_{\text{SST (total)}} \;=\; \underbrace{\|\hat Y - \bar Y \mathbf{1}\|^2}_{\text{SSR (regression)}} \;+\; \underbrace{\|e\|^2}_{\text{SSE (error)}}.$$

The coefficient of determination is

$$R^2 \;=\; \frac{\mathrm{SSR}}{\mathrm{SST}} \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.$$

Geometrically: after centring, both $Y - \bar Y \mathbf{1}$ and $\hat Y - \bar Y \mathbf{1}$ are vectors orthogonal to $\mathbf{1}$, and $\hat Y - \bar Y \mathbf{1}$ is the orthogonal projection of $Y - \bar Y \mathbf{1}$ onto $\mathrm{col}(X) \cap \mathbf{1}^\perp$. By the definition of orthogonal projection, the cosine of the angle $\theta$ between $Y - \bar Y \mathbf{1}$ and its projection equals $\|\hat Y - \bar Y \mathbf{1}\| / \|Y - \bar Y \mathbf{1}\|$. Squaring gives $R^2 = \cos^2 \theta$. The widget displays both $R^2$ and $\theta$ in the table; the relationship is exact, not an approximation.

Three honest caveats about R². First, R² is MONOTONE in $p$: adding a column to $X$ enlarges $\mathrm{col}(X)$, so the projection cannot move further from $Y$, and R² cannot decrease. R² alone is therefore not a model-quality measure (of two nested models, the one with more predictors never has the lower R²; that does not make it a better model). Adjusted R² (penalises $p$) and information criteria (AIC, BIC; §4.7) are the responsible alternatives for model comparison. Second, R² says nothing about whether the assumed linear functional form is correct: a fit with a high R² can still leave systematic residual patterns when the true relationship is badly nonlinear (§4.3). Third, on a designed experiment where the predictors are fixed and the response is the random object, R² has a clean variance-decomposition interpretation; on an observational study where both are random, R² is the squared sample correlation between fitted and observed values, a different object that requires the §4.8 causal-warnings care.
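Both the cosine identity and the first caveat can be seen in a few lines. In this sketch the data are simulated, the helper `r_squared` is a hypothetical name, and the extra column is pure noise by construction, so any rise in R² from adding it reflects the enlarged column space and nothing else.

```python
import numpy as np

rng = np.random.default_rng(4)             # arbitrary seed
n = 60
x = rng.normal(size=n)
Y = 1.0 + 0.8 * x + rng.normal(size=n)     # made-up linear signal plus noise

def r_squared(X, Y):
    """Return (SSR/SST, cos^2 of the angle between centred Y and centred fit)."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Yc = Y - Y.mean()                      # centred response
    Yhc = X @ beta - Y.mean()              # centred fitted values (design has an intercept)
    r2 = (Yhc @ Yhc) / (Yc @ Yc)
    cos_theta = (Yc @ Yhc) / (np.linalg.norm(Yc) * np.linalg.norm(Yhc))
    return r2, cos_theta ** 2

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, rng.normal(size=n)])   # add an irrelevant column

r2_small, cos2_small = r_squared(X_small, Y)
r2_big, cos2_big = r_squared(X_big, Y)

print("R^2 vs cos^2 (small model):", r2_small, cos2_small)
print("R^2 small -> big:", r2_small, "->", r2_big)
```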

Leverage h_ii = diag(H)

Each diagonal entry of $H$ is called the LEVERAGE of observation $i$:

$$h_{ii} \;=\; (H)_{ii} \;=\; x_i^\top (X^\top X)^{-1} x_i,$$

where $x_i \in \mathbb{R}^p$ is the $i$-th row of $X$, written as a column vector. Three facts to internalise. First, $h_{ii} \in [0, 1]$ for every row, and $h_{ii} \ge 1/n$ when the model has an intercept. The $[0, 1]$ bound comes from symmetry and idempotency: $H^\top = H$ and $H^2 = H$ imply $h_{ii} = \sum_j h_{ij}^2 \ge h_{ii}^2$, hence $h_{ii}(1 - h_{ii}) \ge 0$. Second, $\sum_{i=1}^n h_{ii} = \mathrm{tr}(H) = p$, so the AVERAGE LEVERAGE is exactly $p/n$. Third, $h_{ii}$ measures how much observation $i$'s own value $Y_i$ contributes to its own fitted value $\hat Y_i$: from $\hat Y = HY$,

$$\hat Y_i \;=\; h_{ii} \, Y_i \;+\; \sum_{j \ne i} h_{ij}\, Y_j.$$

When $h_{ii} = 1$, $\hat Y_i = Y_i$ exactly: the fit is forced to pass through observation $i$; the point has total leverage and the fitted line is at its mercy. When $h_{ii} = p/n$ (the average), observation $i$ contributes its FAIR SHARE of the $p$-dimensional explanatory budget. When $h_{ii} > 2p/n$ (the BELSLEY-KUH-WELSCH (1980) RULE OF THUMB), observation $i$ has unusually high leverage and is worth a second look. For simple regression with intercept plus one slope ($p = 2$), the average leverage is $2/n$ and the threshold is $4/n$.

Geometrically, $h_{ii}$ measures how far row $i$ of $X$ sits from the centroid of the design in COVARIATE SPACE, in the metric $(X^\top X / n)^{-1}$. For simple regression with $X = [\mathbf{1}, x]$, the formula simplifies to

$$h_{ii} \;=\; \frac{1}{n} \;+\; \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}.$$

Points far from $\bar x$ in $x$-coordinates have high leverage. The second widget makes this immediate.
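The leverage identities above can be checked directly. This sketch uses a made-up design with one deliberately distant point; the names `h_formula` and `flagged` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)             # arbitrary seed
n = 30
x = rng.uniform(0.0, 10.0, size=n)
x[0] = 25.0                                # one design point far from the rest (hypothetical)

X = np.column_stack([np.ones(n), x])       # simple regression: p = 2
p = X.shape[1]

# Leverages are the diagonal of the hat matrix.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Closed form for simple regression: 1/n + (x_i - xbar)^2 / Sxx.
h_formula = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Rule-of-thumb flag: h_ii > 2p/n.
flagged = np.flatnonzero(h > 2 * p / n)

print("sum of leverages:", h.sum(), " (p =", p, ")")
print("matches closed form:", np.allclose(h, h_formula))
print("flagged indices:", flagged, " leverage of point 0:", h[0])
```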

The hat-matrix-leverage widget

The second widget shows a 2-D scatter of $n = 30$ points generated from a clean linear trend. Each point is sized and colour-coded by its leverage $h_{ii}$. The OLS line is overlaid, and a vertical dashed line marks $\bar x$ (where leverage is at its minimum). The right panel shows a sorted bar chart of the leverages, with vertical reference lines at the average $p/n$ and the Belsley-Kuh-Welsch threshold $2p/n$.

[Interactive figure: hat-matrix-leverage]

Things to verify in the widget:

  • On a clean sample, leverages are all near $p/n = 2/30 \approx 0.067$. The status panel reports $\mathrm{tr}(H) = \sum h_{ii} = 2.000$ (within rounding): the trace-equals-p identity, verified numerically.
  • Drag a point far to the right or left of the cloud, well past $\bar x$, and watch its leverage spike. It glows orange (above $2p/n$), then red (above $3p/n$). The other points' leverages adjust slightly because $\bar x$ and $\sum(x_j - \bar x)^2$ both move when you drag.
  • Click "Add high-leverage point". A new point is dropped near the right edge AT THE OLS LINE'S CURRENT $y$ VALUE. The point has very high leverage but a residual of approximately zero, so the OLS line BARELY MOVES. Now drag that point vertically away from the line. The OLS line ROTATES dramatically. This is the textbook high-leverage / high-residual scenario where OLS gets pulled around. §4.3 turns this into Cook's distance and DFFITS, the influence diagnostics that combine leverage and residual size.
  • Right-click any point to remove it. Re-roll the sample to start over.
  • Verify the sum-of-leverages identity: as you add or remove points, the status panel's "trace(H) = Σ h_ii" line always reads close to $2.00$. The trace is a STRUCTURAL property of any orthogonal projection onto a 2-D subspace: it does not depend on the specific points, only on the dimension $p$.
  • Observation: the leverage formula $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum(x_j - \bar x)^2$ has NO $y_i$ in it. Leverage is a property of the DESIGN matrix $X$ alone, not of the response. This is why high leverage is "potential to influence", not influence itself: a high-leverage point with a tiny residual barely affects the fit. The §4.3 influence diagnostics combine leverage with residual size to identify points that ACTUALLY influence the fit.
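The "Add high-leverage point" experiment can be imitated without the widget. The sketch below is an approximation of that experiment under assumed numbers (the sample, the position `x_new = 30`, and the vertical shift of 10 are all made up): a far-out point placed exactly on the fitted line leaves the fit unchanged, while the same point moved off the line rotates it.

```python
import numpy as np

rng = np.random.default_rng(6)             # arbitrary seed
x = rng.uniform(0.0, 10.0, size=30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=30)   # hypothetical clean linear trend

def fit(x, y):
    """OLS intercept and slope for simple regression."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b0 = fit(x, y)                             # baseline fit

x_new = 30.0                               # far to the right of the cloud
y_on = b0[0] + b0[1] * x_new               # exactly on the current OLS line

b_on = fit(np.append(x, x_new), np.append(y, y_on))           # zero residual
b_off = fit(np.append(x, x_new), np.append(y, y_on + 10.0))   # large residual

print("baseline slope:        ", b0[1])
print("slope, point on line:  ", b_on[1])
print("slope, point off line: ", b_off[1])
```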

Multicollinearity as a geometric condition

When two or more columns of $X$ are nearly parallel (nearly linearly dependent), the $p \times p$ matrix $X^\top X$ becomes nearly singular. Its smallest eigenvalue approaches zero; its condition number $\kappa(X^\top X) = \lambda_{\max}/\lambda_{\min}$ explodes; the inverse $(X^\top X)^{-1}$ has very large entries. The OLS formula $\hat\beta = (X^\top X)^{-1} X^\top Y$ then propagates SMALL changes in $Y$ into LARGE changes in $\hat\beta$: the estimator is UNSTABLE.

The geometric reading is direct. If columns $x_1, x_2$ are nearly parallel, the 2-D parallelogram they span is nearly DEGENERATE, almost a 1-D line. To express the projection of $Y$ as a linear combination of $x_1$ and $x_2$, the coefficients $\hat\beta_1, \hat\beta_2$ typically have to be enormous (and of opposite signs), with most of the magnitude cancelling. Small perturbations of $Y$ shift which "side" of the cancellation wins, so $\hat\beta$ jumps wildly.

The §4.3 diagnostics, VIFs (variance inflation factors) and condition indices, are quantitative versions of "how nearly parallel are the columns?" The standard formulas come from the same geometry: $\mathrm{VIF}(x_j) = 1/(1 - R_j^2)$ where $R_j^2$ is the R² from regressing $x_j$ on the OTHER predictors. A large VIF means a large fraction of $x_j$'s variation is explained by the other columns, i.e., $x_j$ is nearly in the span of the rest. The remedies are also geometric: drop a redundant column (reduce $p$), centre and orthogonalise the predictors, or move to ridge regression (Part 9 §9.2), which adds $\lambda I$ to $X^\top X$ to push the smallest eigenvalue safely away from zero, a geometric modification of the projection itself.
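A small simulation shows the instability. Everything here is an illustrative assumption: the near-duplicate column `x2`, the noise scales, and the seed. The sketch computes the condition number of $X^\top X$, the VIF of `x2` from its auxiliary regression, and the coefficients before and after a tiny perturbation of $Y$.

```python
import numpy as np

rng = np.random.default_rng(7)             # arbitrary seed
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)        # nearly parallel to x1 (hypothetical)
X = np.column_stack([np.ones(n), x1, x2])

cond = np.linalg.cond(X.T @ X)             # lambda_max / lambda_min for this symmetric PD matrix

# VIF of x2: regress x2 on the other columns (intercept and x1).
others = X[:, :2]
g, *_ = np.linalg.lstsq(others, x2, rcond=None)
resid = x2 - others @ g
r2_j = 1.0 - (resid @ resid) / np.sum((x2 - x2.mean()) ** 2)
vif = 1.0 / (1.0 - r2_j)

# Fit twice: original Y and a slightly perturbed Y.
Y = 1.0 + x1 + x2 + rng.normal(scale=0.1, size=n)
b_a, *_ = np.linalg.lstsq(X, Y, rcond=None)
b_b, *_ = np.linalg.lstsq(X, Y + 0.01 * rng.normal(size=n), rcond=None)

print("condition number of X'X:", cond)
print("VIF(x2):", vif)
print("coefficients on (x1, x2), original: ", b_a[1:])
print("coefficients on (x1, x2), perturbed:", b_b[1:])
print("sum of the two coefficients:", b_a[1] + b_a[2], "vs", b_b[1] + b_b[2])
```

The individual coefficients can swing substantially between the two fits, while their sum (the well-determined direction) stays essentially fixed.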

The Gauss-Markov theorem: OLS is BLUE

So far the discussion has been purely about the SAMPLE FACTS: the geometry of one observed $(X, Y)$. To talk about properties like UNBIASEDNESS and MINIMUM VARIANCE we need an assumed sampling-distribution model. The Gauss-Markov (1809–1821, formalised by Markov 1900) framework places five assumptions on the error vector $\varepsilon$:

  • Linearity: $\mathbb{E}[Y \mid X] = X\beta$. The conditional mean of $Y$ is exactly the linear combination $X\beta$. No omitted nonlinear terms.
  • Strict exogeneity: $\mathbb{E}[\varepsilon \mid X] = 0$. The errors have mean zero given the predictors. (Equivalently, the predictors carry no information about the errors' direction; they are uncorrelated, in a stronger conditional-mean sense.)
  • Homoscedasticity: $\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$ for every $i$. The errors have the same variance regardless of the predictor values.
  • No autocorrelation: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \ne j$. The errors are uncorrelated across observations.
  • No perfect multicollinearity: $\mathrm{rank}(X) = p$. The columns of $X$ are linearly independent, so $X^\top X$ is invertible and $\hat\beta$ is well-defined.

Under these five assumptions, the GAUSS-MARKOV THEOREM (Gauss 1809; Markov 1900; cf. Lehmann-Casella 1998 ch.3) states:

The OLS estimator $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top Y$ is the BLUE (Best Linear Unbiased Estimator) of $\beta$. That is: among all estimators that (i) are LINEAR in $Y$ and (ii) are UNBIASED for $\beta$ (have $\mathbb{E}[\tilde\beta] = \beta$ for every value of $\beta$), the OLS estimator has the minimum variance.

"Linear in $Y$" means $\tilde\beta = AY + a$ for some non-random matrix $A$ and vector $a$; the natural example is $\tilde\beta = AY$ with $A = (X^\top X)^{-1} X^\top$ for OLS, or any other $A$ for a different linear estimator. "Minimum variance" is in the matrix sense: $\mathrm{Var}(\hat\beta_{\mathrm{OLS}}) \preceq \mathrm{Var}(\tilde\beta)$ as $p \times p$ matrices (i.e., the difference is positive semi-definite).

Three caveats worth keeping in mind. First, BLUE concerns LINEAR UNBIASED estimators only. BIASED estimators (ridge, lasso) can have lower mean-squared error than OLS (this is the §1.5 bias-variance trade-off again; Part 9 develops it for regression). NONLINEAR estimators (median regression, M-estimators) can also beat OLS in MSE when the error distribution has heavy tails. Gauss-Markov says only that among the linear-unbiased class, OLS wins. Second, the theorem requires assumptions (3) and (4), homoscedasticity and no autocorrelation. When they fail, OLS is still unbiased but no longer BLUE: GLS (§4.4) is the BLUE in that broader setting. Third, the theorem says nothing about the DISTRIBUTION of $\hat\beta$. For exact small-sample inference (t-tests, F-tests, confidence intervals) we need the optional sixth assumption.
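The matrix inequality in the theorem can be checked exactly for one competing estimator, with no simulation. The sketch below is an illustration, not a proof: the competitor is a weighted least-squares estimator $\tilde\beta = (X^\top W X)^{-1} X^\top W Y$ with arbitrary made-up positive weights, chosen because it is linear in $Y$ and satisfies $AX = I$ (hence unbiased under the first two assumptions). Under homoscedastic, uncorrelated errors, $\mathrm{Var}(AY \mid X) = \sigma^2 A A^\top$.

```python
import numpy as np

rng = np.random.default_rng(8)             # arbitrary seed
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
sigma2 = 1.0                               # assumed common error variance

# OLS as a linear map: A_ols = (X'X)^{-1} X'.
A_ols = np.linalg.solve(X.T @ X, X.T)

# A competing linear unbiased estimator: weighted LS with arbitrary weights.
w = rng.uniform(0.2, 5.0, size=n)          # hypothetical weights, unrelated to the errors
Xw = w[:, None] * X                        # W X with W = diag(w)
A_alt = np.linalg.solve(X.T @ Xw, Xw.T)    # (X'WX)^{-1} X'W

unbiased = np.allclose(A_alt @ X, np.eye(p))   # A X = I  =>  E[A Y] = beta

# Under Var(eps | X) = sigma^2 I, Var(A Y | X) = sigma^2 A A'.
var_ols = sigma2 * A_ols @ A_ols.T
var_alt = sigma2 * A_alt @ A_alt.T

# Gauss-Markov: var_alt - var_ols should be positive semi-definite.
gap_eigs = np.linalg.eigvalsh(var_alt - var_ols)

print("competitor unbiased:", unbiased)
print("eigenvalues of Var(alt) - Var(OLS):", gap_eigs)
```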

The optional Normality assumption

If we add the sixth assumption

$$\varepsilon \mid X \;\sim\; \mathcal{N}(0, \sigma^2 I_n),$$

two exact small-sample consequences follow. First, $\hat\beta = (X^\top X)^{-1} X^\top Y$ is a linear function of the Normal vector $Y = X\beta + \varepsilon$, hence is itself Normal:

$$\hat\beta \;\sim\; \mathcal{N}\!\left(\beta, \;\sigma^2 (X^\top X)^{-1}\right).$$

Second, the residual sum of squares scales to a chi-square:

$$\frac{(n - p)\,\hat\sigma^2}{\sigma^2} \;=\; \frac{\|e\|^2}{\sigma^2} \;\sim\; \chi^2_{n - p},$$

independent of $\hat\beta$ (Cochran's theorem, 1934, on independence of quadratic forms in Normal vectors). These two facts give the EXACT small-sample t-distribution for testing single coefficients ($(\hat\beta_j - \beta_j) / \widehat{\mathrm{SE}}(\hat\beta_j) \sim t_{n-p}$) and the EXACT F-distribution for testing joint hypotheses about multiple coefficients (§4.3 develops both).

Without the Normality assumption, the CLT (§0.7) takes over asymptotically. As $n \to \infty$ with $p$ fixed and the design well-behaved (so that $X^\top X / n \to Q$ for some positive-definite $Q$), $\sqrt n(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, \sigma^2 Q^{-1})$, and the t- and F-tests are APPROXIMATELY valid for large $n$. The §4.3 diagnostics include Q-Q plots of the standardised residuals (§3.5) precisely to check whether the Normality assumption is empirically defensible for any given dataset; the more departure, the more we rely on $n$ being large.
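The chi-square claim under Normality can be spot-checked by simulation. This is a rough Monte Carlo sketch with assumed values for $n$, $p$, $\sigma$, and the number of replications; it compares the simulated mean and variance of $\|e\|^2/\sigma^2$ with the $\chi^2_{n-p}$ values $n - p$ and $2(n - p)$. It uses $e = MY = M\varepsilon$, since $M$ annihilates $X\beta$.

```python
import numpy as np

rng = np.random.default_rng(9)             # arbitrary seed
n, p, sigma = 10, 3, 2.0                   # hypothetical small-sample setting
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H                          # residual-maker

reps = 20000
eps = rng.normal(scale=sigma, size=(reps, n))   # Normal errors, one row per replication
E = eps @ M                                # each row is e = M eps (M is symmetric)
stat = np.sum(E ** 2, axis=1) / sigma ** 2      # ||e||^2 / sigma^2

print("simulated mean:", stat.mean(), " theory:", n - p)
print("simulated var: ", stat.var(), " theory:", 2 * (n - p))
```

Agreement of two moments is consistent with the chi-square law but does not establish it; a Q-Q plot against $\chi^2_{n-p}$ quantiles would be a sharper check.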

Why "geometry first" pays off downstream

The geometric picture pays for itself in every later section of Part 4:

  • §4.2 (assumptions fail). Each Gauss-Markov assumption corresponds to a GEOMETRIC property of the projection. Homoscedasticity together with no autocorrelation says $\mathrm{Var}(\varepsilon) = \sigma^2 I$: the errors are SPHERICALLY symmetric about the column space. Heteroscedasticity or autocorrelation distorts the sphere into an ellipsoid. Endogeneity tilts the column space relative to the truth. Each failure has a geometric signature.
  • §4.3 (diagnostics). Residuals, leverage, and influence are all read off the hat matrix $H$. Cook's distance, DFFITS, and DFBETAS are scalar summaries of the SAME projection geometry.
  • §4.4 (GLS). When $\mathrm{Var}(\varepsilon) = \sigma^2 \Omega$ for known $\Omega \ne I$, the BLUE is obtained by projecting $\Omega^{-1/2} Y$ onto $\mathrm{col}(\Omega^{-1/2} X)$. Same picture, different inner product.
  • §4.5 (robust regression). Replace $\|\cdot\|^2$ with a bounded loss $\rho(\cdot)$. The "projection" is no longer Euclidean orthogonal, but the same closest-point logic recovers the M-estimator.
  • §4.6 (interactions). Adding an interaction term enlarges $\mathrm{col}(X)$ by one dimension. The projection moves into the bigger subspace.
  • §4.7 (model selection: AIC/BIC/CV). Selection chooses which columns to include, i.e., which sub-spaces of $\mathbb{R}^n$ to project onto. The trade-off is between fitting $Y$ closely (big $\mathrm{col}(X)$) and over-fitting noise (the bias-variance frontier of §1.5).
  • §4.8 (causal warnings). The projection geometry tells you the linear combination of predictors that best matches $Y$. It tells you NOTHING about which predictors CAUSE $Y$. Confounders, instruments, mediators, colliders: all of those concepts (Part 6) live at the population / causal level, not in the geometry of one sample.
  • Part 9 (ridge / lasso). Ridge adds $\lambda I$ to $X^\top X$, geometrically SHRINKING the projection toward the origin (already seen in §1.5's shrinkage-cv-tuner widget). Lasso uses a non-Euclidean penalty whose level sets have CORNERS, so the projection lands on the boundary with some coefficients set to zero. Both are geometric modifications of OLS.
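As one instance of "same picture, different inner product", the GLS item in the list above can be sketched numerically. The covariance `Omega` below is an assumed AR(1)-style matrix and the data are simulated. The whitening uses the Cholesky factor $L$ of $\Omega$ (so $\Omega = LL^\top$) instead of the symmetric square root $\Omega^{-1/2}$; both give the same estimator, because in either case the whitened cross-products equal $X^\top \Omega^{-1} X$ and $X^\top \Omega^{-1} Y$.

```python
import numpy as np

rng = np.random.default_rng(10)            # arbitrary seed
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Hypothetical known error covariance: Omega_ij = rho^|i-j|.
rho = 0.6
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

Y = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(n), Omega)

# Whiten with the Cholesky factor, then run plain OLS on the whitened data.
L = np.linalg.cholesky(Omega)              # Omega = L L'
Xw = np.linalg.solve(L, X)                 # L^{-1} X
Yw = np.linalg.solve(L, Y)                 # L^{-1} Y
beta_whitened, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)

# Direct GLS formula: (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.
Oinv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ Y)

print("OLS on whitened data:", beta_whitened)
print("direct GLS formula:  ", beta_gls)
```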

Try it

  • In the ols-projection-geometry widget, leave the defaults and rotate the view. Confirm that $\hat Y$ (cyan) sits inside the blue plane and that the red dashed residual $e = Y - \hat Y$ is perpendicular to the plane (the right-angle tick at $\hat Y$ marks this). Read off $\langle e, x_1 \rangle$ and $\langle e, x_2 \rangle$; both should be 0 to within $10^{-15}$.
  • Same widget. Verify $\|Y\|^2 = \|\hat Y\|^2 + \|e\|^2$ to 3 decimal places. Slide $Y_3$ from $1.2$ to $-1.2$ and watch the three squared norms re-balance. The sum $\|\hat Y\|^2 + \|e\|^2$ always equals $\|Y\|^2$: Pythagoras for orthogonal projection.
  • Same widget. Click "Snap Y onto plane". Watch the residual collapse to zero, $R^2 \to 1$, and $\theta \to 0$. State why: when $Y \in \mathrm{col}(X)$, the projection is $Y$ itself and the fit is exact.
  • Same widget. Click "Re-roll columns" repeatedly. Notice that $R^2 = \cos^2 \theta$ can change dramatically across different column choices. State why: different column choices give different $\mathrm{col}(X)$, and the angle between a fixed $Y$ and different planes varies.
  • In the hat-matrix-leverage widget, observe that on a clean sample of $n = 30$ points, the leverage bar chart shows all bars below the $2p/n$ threshold and most bars near $p/n = 0.067$. Confirm: $\sum_i h_{ii} = \mathrm{tr}(H) = 2.000$ (within rounding).
  • Same widget. Drag a point far to the right of the cloud ($x \approx 14$). Read off its new leverage $h_{ii}$. Verify by hand: $h_{ii} = 1/30 + (14 - \bar x)^2 / \sum_j (x_j - \bar x)^2 \approx 1/30 + 70/270 \approx 0.29$ (precise numbers depend on the sample). It should be well above $2p/n = 0.133$. The bar chart highlights it in orange / red.
  • Same widget. Click "Add high-leverage point". The point lands ON the OLS line, so its residual is approximately zero and the line barely moves. NOW drag that point vertically away from the line. The OLS line ROTATES dramatically. Conclude: leverage is the POTENTIAL to influence; realised influence also requires a non-zero residual. The §4.3 Cook's distance combines both.
  • Pen-and-paper. With $X = [\mathbf{1}, x]$ for simple regression and the formula $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$, verify $\sum_{i=1}^n h_{ii} = p = 2$. Hint: sum the formula over $i$, term by term. The $1/n$ terms sum to $n \cdot 1/n = 1$; the $(x_i - \bar x)^2$ terms sum to $\sum_i (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2 = 1$. Total $= 2 = p$. ✓
  • Pen-and-paper. Prove that the OLS residuals $e$ are orthogonal to every column $x_j$ of $X$. Hint: differentiate $\|Y - X\beta\|^2$ in $\beta$; setting the gradient to zero gives $X^\top(Y - X\hat\beta) = 0$, i.e. $X^\top e = 0$. The $j$-th row says $x_j^\top e = 0$. Geometric statement of the same fact: the residual is perpendicular to the column space, hence to every spanning vector.
  • Pen-and-paper. State the Gauss-Markov theorem precisely. List the five assumptions. State what BLUE means (Best Linear Unbiased Estimator: minimum variance among the linear-unbiased class). State two regimes where OLS can be IMPROVED on in MSE: (i) by adding bias (ridge, lasso) when the bias-variance trade-off favours it; (ii) by using nonlinear estimators (M-estimators) when the error distribution is heavy-tailed.

Pause and reflect: §4.1 has reframed OLS as GEOMETRY. The data live in $\mathbb{R}^n$. The columns of $X$ span a $p$-dimensional subspace $\mathrm{col}(X)$. OLS computes the orthogonal projection of $Y$ onto that subspace; that is the WHOLE algorithm. The hat matrix $H = X(X^\top X)^{-1} X^\top$ is the matrix of the projection; its diagonal entries $h_{ii}$ are the leverages; $\mathrm{tr}(H) = p$ counts the degrees of freedom consumed by the $p$ columns. The normal equations and the closed-form $\hat\beta = (X^\top X)^{-1} X^\top Y$ fall out of the geometric orthogonality condition $X^\top e = 0$ (or equivalently from setting the gradient of $\|Y - X\beta\|^2$ to zero; the two routes agree). Under the five Gauss-Markov assumptions, OLS is BLUE. Adding Normality buys exact small-sample t and F tests. Every later section of Part 4, and a substantial chunk of Part 9, perturbs this geometric picture in one specific way. §4.2 next explores what happens when each assumption fails one at a time.

What you now know

You can write the linear-regression model in matrix form $Y = X\beta + \varepsilon$ with explicit dimensions $(n, p)$. You can state the OLS criterion as the minimisation of $\|Y - X\beta\|^2$ and re-cast it geometrically: $\hat Y = X\hat\beta$ is the orthogonal projection of $Y$ onto the column space $\mathrm{col}(X)$, characterised uniquely by the condition that the residual $e = Y - \hat Y$ is perpendicular to every column of $X$, written compactly as $X^\top e = 0$.

You can derive the NORMAL EQUATIONS $X^\top X \hat\beta = X^\top Y$ by substituting $e = Y - X\hat\beta$ into $X^\top e = 0$. You know the closed-form solution $\hat\beta = (X^\top X)^{-1} X^\top Y$ when $X$ has full column rank, and that this estimator is LINEAR in $Y$: it is the matrix $(X^\top X)^{-1} X^\top$ applied to the response vector.
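
As a quick numerical companion to this paragraph, the following sketch solves the normal equations on simulated data and compares the result with NumPy's least-squares routine. The data, the seed, and the variable names are illustrative assumptions; only `np.linalg.solve` and `np.linalg.lstsq` are real library calls.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3

# Simulated design with an intercept column (illustrative data only).
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=n)

# Normal equations: solve (X'X) beta = X'Y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Reference answer from a generic least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Orthogonality condition X'e = 0 that produced the normal equations.
e = Y - X @ beta_normal
orthogonality = X.T @ e

# Linearity in Y: A = (X'X)^{-1} X' is a fixed p x n matrix once X is given.
A = np.linalg.solve(X.T @ X, X.T)
Y_other = rng.normal(size=n)
lhs = A @ (2.0 * Y + 3.0 * Y_other)
rhs = 2.0 * (A @ Y) + 3.0 * (A @ Y_other)

print(beta_normal, beta_lstsq)
print(orthogonality)
```

Solving the linear system is used here instead of inverting $X^\top X$ explicitly, a common numerical preference; the estimator it computes is the same.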

You can state the HAT MATRIX $H = X(X^\top X)^{-1} X^\top$ and verify its three structural properties: (i) symmetric, (ii) idempotent, (iii) trace $= p$. You know that $\hat Y = HY$ and $e = (I - H)Y$, that fitted values and residuals are orthogonal vectors in $\mathbb{R}^n$, and that they decompose $Y$ by Pythagoras: $\|Y\|^2 = \|\hat Y\|^2 + \|e\|^2$. The ols-projection-geometry widget makes this picture inhabitable in 3-D.
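
The hat-matrix properties listed in this paragraph are easy to confirm on a small random design. The sketch below is one way to do it under assumed, simulated data; the sizes `n = 12`, `p = 3` and all names are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 3

# Small full-column-rank design (illustrative) and an arbitrary response.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

# Hat matrix H = X (X'X)^{-1} X' and residual-maker M = I - H.
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H

Y_hat = H @ Y   # fitted values
e = M @ Y       # residuals

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
trace_H = np.trace(H)

# Fitted values and residuals are orthogonal, so Pythagoras holds.
fitted_dot_resid = Y_hat @ e
pythagoras_gap = Y @ Y - (Y_hat @ Y_hat + e @ e)

print(is_symmetric, is_idempotent, trace_H)
print(fitted_dot_resid, pythagoras_gap)
```

Forming the full $n \times n$ matrix is fine at this scale; for large $n$ one would normally compute only the quantities needed (for example the diagonal) rather than all of $H$.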

You can read R² as a GEOMETRIC QUANTITY: after centring, $R^2 = \cos^2 \theta$ where $\theta$ is the angle between the centred response $Y - \bar Y \mathbf{1}$ and the centred fitted vector $\hat Y - \bar Y \mathbf{1}$. You know the three honest caveats (it never decreases when predictors are added, it is not a model-quality measure on its own, and it says nothing about functional-form correctness).
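
A short numerical check of the cosine reading of R² follows. It is a sketch under assumptions: the data are simulated, the model includes an intercept, and $\theta$ is taken to be the angle between the centred response and the centred fitted vector.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25

# Simple regression with an intercept (simulated, illustrative data).
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Y = 1.0 + 0.8 * x + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat

# Centre both vectors around the sample mean of Y.
Yc = Y - Y.mean()
Yhat_c = Y_hat - Y.mean()

# R^2 as a ratio of squared lengths, and as 1 - SSE/SST.
r2_ratio = (Yhat_c @ Yhat_c) / (Yc @ Yc)
r2_resid = 1.0 - ((Y - Y_hat) @ (Y - Y_hat)) / (Yc @ Yc)

# Cosine of the angle between the two centred vectors.
cos_theta = (Yc @ Yhat_c) / (np.linalg.norm(Yc) * np.linalg.norm(Yhat_c))

print(r2_ratio, r2_resid, cos_theta ** 2)
```

The three printed numbers agree up to rounding. The agreement relies on the intercept column: without it the residual need not be orthogonal to $\mathbf{1}$ and the centred decomposition no longer holds.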

You can define LEVERAGE $h_{ii} = (H)_{ii} = x_i^\top(X^\top X)^{-1} x_i$ and recall: (i) $h_{ii} \in [0, 1]$; (ii) $\sum_i h_{ii} = p$; (iii) average leverage $= p/n$; (iv) Belsley-Kuh-Welsch (1980) threshold $= 2p/n$. For simple regression $X = [\mathbf{1}, x]$ the closed form is $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$. The hat-matrix-leverage widget lets you drag points and watch $h_{ii}$ update; it also demonstrates that leverage is the POTENTIAL to influence, while actual influence requires non-zero residuals too.
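
The leverage facts in this paragraph can be reproduced without the widget. The sketch below is a hypothetical stand-in: it uses an evenly spaced sample of $n = 30$ points on $[0, 10]$ (an assumption, not the widget's data), compares the hat-matrix diagonal with the closed form, and then moves one point to $x = 14$. The helper names `leverages` and `leverages_closed_form` are invented for this example.

```python
import numpy as np

def leverages(x):
    """Diagonal of the hat matrix for simple regression X = [1, x]."""
    X = np.column_stack([np.ones_like(x), x])
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)

def leverages_closed_form(x):
    """Closed form h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2."""
    xbar = x.mean()
    return 1.0 / x.size + (x - xbar) ** 2 / np.sum((x - xbar) ** 2)

n, p = 30, 2
threshold = 2 * p / n  # the 2p/n rule of thumb

# ASSUMPTION: an evenly spaced "clean" sample standing in for the widget data.
x = np.linspace(0.0, 10.0, n)
h = leverages(x)

# Move the last point far to the right of the cloud, as in the exercise.
x_moved = x.copy()
x_moved[-1] = 14.0
h_moved = leverages(x_moved)

print(h.sum(), h.max(), threshold)
print(h_moved[-1])
```

On the clean sample every leverage stays under the threshold; after the move, the displaced point's leverage rises well above it while the total still sums to $p = 2$.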

You can state MULTICOLLINEARITY as the geometric statement that near-parallel columns of $X$ make $X^\top X$ nearly singular, so that $(X^\top X)^{-1}$ has very large entries and small $Y$-perturbations propagate into large $\hat\beta$-jumps. You know that §4.3 develops VIFs and condition indices, and that ridge regression (Part 9 §9.2) is one geometric remedy (add $\lambda I$ to $X^\top X$).
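
To make the instability concrete, the sketch below builds two designs from the same simulated columns: one with unrelated predictors and one whose third column is almost parallel to the second. Everything here is an illustrative assumption, including the $10^{-3}$ separation and the perturbation `dY`, which is deliberately chosen along the direction that separates the two near-parallel columns (a worst-case direction, not a typical one).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
ones = np.ones(n)
x1 = rng.normal(size=n)
z = rng.normal(size=n)

# Well-conditioned design versus a nearly collinear one.
X_far = np.column_stack([ones, x1, z])
X_near = np.column_stack([ones, x1, x1 + 1e-3 * z])

cond_far = np.linalg.cond(X_far.T @ X_far)
cond_near = np.linalg.cond(X_near.T @ X_near)

def ols(X, Y):
    """OLS coefficients via NumPy's least-squares solver."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta

Y = 1.0 + x1 + rng.normal(scale=0.1, size=n)

# Small perturbation of Y along z (worst case for the near-collinear design).
dY = 0.01 * z

shift_far = ols(X_far, Y + dY) - ols(X_far, Y)
shift_near = ols(X_near, Y + dY) - ols(X_near, Y)

print(cond_far, cond_near)
print(shift_far, shift_near)
```

Because `dY` equals $10\,x_2 - 10\,x_1$ in the near-collinear design, the coefficients on the two near-parallel columns move by about $-10$ and $+10$, while the same perturbation moves the well-conditioned fit by only $0.01$ in one coefficient.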

You can state the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) and the GAUSS-MARKOV THEOREM: under those five, OLS is BLUE, that is, minimum-variance among linear unbiased estimators of $\beta$. You know the three caveats: (i) BLUE is about the linear-unbiased class, so biased and nonlinear estimators can beat OLS in MSE; (ii) GLS (§4.4) is BLUE under heteroscedasticity / autocorrelation; (iii) BLUE says nothing about the distribution of $\hat\beta$, so exact small-sample inference needs the optional Normality assumption.

You can articulate the optional NORMALITY assumption $\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I)$ and its two consequences: (i) conditional on $X$, $\hat\beta \sim \mathcal{N}(\beta, \sigma^2(X^\top X)^{-1})$ exactly; (ii) $(n - p)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p}$, independently of $\hat\beta$. Together they give the exact small-sample t and F tests of §4.3. Without Normality the CLT carries the same conclusions asymptotically.

You can articulate WHY GEOMETRY FIRST PAYS OFF: every later section of Part 4 (assumptions failing, diagnostics, GLS, robust regression, interactions, model selection, causal warnings) and substantial parts of Part 9 (ridge, lasso) perturb the projection geometry in one specific way. The picture you built in §4.1 is the foundation on which the rest of regression rests.

Where this lands. §4.2 takes each Gauss-Markov assumption in turn and asks "what breaks when this fails?": linearity becomes nonlinear regression; homoscedasticity becomes WLS / sandwich estimators; no autocorrelation becomes time-series corrections; etc. §4.3 develops the diagnostic toolkit (residual plots, Q-Q plots, Cook's distance, DFFITS, DFBETAS, VIFs, leverage-residual plots, every one of which is a function of $H$ and $e$). §4.4 covers GLS for known and unknown covariance. §4.5 develops robust regression (M-estimators, S-estimators, MM-estimators) for heavy-tailed errors. §4.6 handles interactions, polynomials, and basis expansions as ways to enlarge $\mathrm{col}(X)$. §4.7 covers model selection via AIC, BIC, and cross-validation: picking which columns to include. §4.8 closes Part 4 with the causal-interpretation warnings: regression is not causation, and the geometry does not care whether $X$ causes $Y$ or $Y$ causes $X$. Part 5 generalises to GLMs (logistic, Poisson, mixed-effects). Part 9 §9.2 develops ridge and lasso as biased estimators that beat OLS in MSE when the bias-variance trade-off favours them.

References

  • Gauss, C.F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Hamburg: Perthes & Besser. (The first formal derivation of OLS in the context of orbital determination of comets and asteroids. Gauss introduces the squared-error criterion as the principle most consistent with the Normal error distribution.)
  • Legendre, A.M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Paris: Firmin Didot. (Independent publication of the method of least squares, predating Gauss's 1809 book by four years. Legendre coined the name "méthode des moindres carrés" — least squares.)
  • Plackett, R.L. (1972). "Studies in the history of probability and statistics. XXIX: The discovery of the method of least squares." Biometrika 59(2), 239–251. (The definitive priority study of the Legendre-Gauss dispute. Concludes that Legendre published first but Gauss had used the method privately since 1795.)
  • Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 develops linear regression with the geometric / projection emphasis that §4.1 follows, and continues into ridge / lasso in section 3.4. The canonical modern reference for statistical learning theory.)
  • James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 is the gentler companion to ESL's chapter 3 — same geometric emphasis, less mathematical machinery, more applied worked examples in R / Python. Widely used as the undergraduate / first-graduate textbook.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer. (Chapter 13 covers linear regression in the mathematical-statistics tradition. Compact derivations of the normal equations, Gauss-Markov, and inference under Normality.)
  • Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. Pacific Grove: Duxbury. (Chapter 11 develops linear regression with the Gauss-Markov theorem as the central result. The graduate-level mathematical-statistics reference for the theoretical structure of §4.1.)
  • Greene, W.H. (2018). Econometric Analysis, 8th ed. New York: Pearson. (The standard graduate-level econometrics reference. Chapter 3 covers OLS, chapter 4 the assumptions and Gauss-Markov, chapter 5 onwards generalised methods and finite-sample inference. The applied-econometrics complement to ESL / ISL.)
  • Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (The foundational reference for leverage $h_{ii}$, Cook's distance, DFFITS, DFBETAS, VIFs, and condition indices. The $2p/n$ leverage threshold used in §4.1 and the diagnostics in §4.3 come from this book.)
  • Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation, 2nd ed. New York: Springer. (Chapter 3 contains the rigorous statement and proof of the Gauss-Markov theorem in the linear-model framework, and the broader theory of best linear unbiased estimators. The reference for the BLUE concept and its precise scope.)
  • Cochran, W.G. (1934). "The distribution of quadratic forms in a normal system, with applications to the analysis of variance." Proceedings of the Cambridge Philosophical Society 30(2), 178–191. (Cochran's theorem on the independence of quadratic forms in Normal vectors — the foundational result that gives the exact t and F distributions under the Normality assumption.)
