OLS as geometry

Part 4 — Linear regression, done seriously

Learning objectives

State the LINEAR-REGRESSION SETUP in matrix form: Y = Xβ + ε with Y ∈ ℝⁿ, X ∈ ℝⁿˣᵖ (design matrix; rows = observations, columns = predictors, typically including a column of 1s for the intercept), β ∈ ℝᵖ (coefficient vector), ε ∈ ℝⁿ (error vector). Identify the four objects' dimensions and how p ≪ n is the regime where OLS is well-posed
Recognise that OLS minimises ‖Y − Xβ‖² and that the GEOMETRIC interpretation — Ŷ = Xβ̂ is the orthogonal projection of Y onto col(X), the column space of X — is more fundamental than the calculus derivation. The proof requires no calculus; the closest-point theorem in Euclidean space (Hilbert space) gives the projection
Define the RESIDUAL e = Y − Ŷ and the ORTHOGONALITY PROPERTY: e ⊥ every column of X, equivalently X′e = 0. Recognise this as the SAME condition as the first-order condition from minimising Σ residuals² with calculus — geometry and calculus give the same answer
State the HAT MATRIX H = X(X′X)⁻¹X′. Verify its properties: (i) symmetric (H′ = H); (ii) idempotent (H² = H); (iii) Ŷ = HY; (iv) the residual-maker M = I − H is also symmetric and idempotent with e = MY; (v) trace(H) = p, the column-space dimension. Recognise H as THE matrix of the orthogonal projection onto col(X)
Derive the NORMAL EQUATIONS X′Xβ̂ = X′Y by setting X′e = 0 and substituting e = Y − Xβ̂. Recognise that when X has full column rank, X′X is p × p and invertible, giving the closed-form OLS estimator β̂ = (X′X)⁻¹X′Y. State that the estimator is LINEAR in Y — β̂ = (X′X)⁻¹X′·Y is a linear map from ℝⁿ to ℝᵖ — which Gauss-Markov uses
Read off the GEOMETRIC R² FORMULA: after centring Y and Ŷ around Ȳ, R² = ‖Ŷ − Ȳ𝟏‖² / ‖Y − Ȳ𝟏‖² = cos² of the angle between (Y − Ȳ𝟏) and col(X) projected onto the orthogonal complement of 𝟏. State that R² ∈ [0, 1] is the fraction of variation in Y explained by the projection. Note its three failure modes: (i) it always increases when adjusting predictors; (ii) it is not a model-quality measure on its own; (iii) adjusted R² and AIC/BIC (§4.7) are the responsible alternatives for model comparison
Define LEVERAGE h_ii = (H)_ii, the i-th diagonal entry of the hat matrix. State the identity Σ h_ii = trace(H) = p, hence the AVERAGE LEVERAGE = p/n. State the Belsley-Kuh-Welsch (1980) RULE OF THUMB: h_ii > 2p/n flags a point worth a second look. Interpret geometrically: high leverage = the row of X is far from the centroid in covariate space, in the design's metric
State the MULTICOLLINEARITY DIAGNOSTIC as the geometric statement: when two or more columns of X are nearly parallel (nearly linearly dependent), X′X is nearly singular and (X′X)⁻¹ has very large entries, making β̂ unstable (small changes in Y produce large changes in β̂). Recognise that this is structural, not a small-sample artefact; §4.3 develops VIFs / condition indices
State the FIVE GAUSS-MARKOV ASSUMPTIONS: (1) LINEARITY — 𝔼[Y | X] = Xβ; (2) EXOGENEITY / strict exogeneity — 𝔼[ε | X] = 0; (3) HOMOSCEDASTICITY — Var(ε_i | X) = σ² constant; (4) NO AUTOCORRELATION — Cov(ε_i, ε_j | X) = 0 for i ≠ j; (5) NO PERFECT MULTICOLLINEARITY — rank(X) = p. State the GAUSS-MARKOV THEOREM: under these five assumptions, the OLS estimator β̂ is BLUE (Best Linear Unbiased Estimator) — minimum-variance among all linear unbiased estimators of β. Note that BLUE concerns LINEAR unbiased estimators only; biased estimators (ridge, §4.7) and nonlinear estimators (LASSO, §9.2) can beat OLS in MSE
Recognise the optional SIXTH ASSUMPTION (NORMALITY): ε | X ~ 𝒩(0, σ²I). Under Normality, β̂ is exactly Normal in finite samples, σ̂²(X′X)⁻¹ gives exact small-sample standard errors, and (n − p)σ̂²/σ² ~ χ²_{n−p} — yielding the exact small-sample t- and F-tests of §4.3. Without Normality, the CLT carries the same inference asymptotically. Distinguish the five Gauss-Markov assumptions (about MOMENTS) from the Normality assumption (about the DISTRIBUTION)
Articulate why GEOMETRY FIRST helps: the projection picture makes diagnostics (§4.3), GLS (§4.4 — change the inner product), robust regression (§4.5 — change the loss away from ‖·‖²), ridge / lasso (Part 9 — add a penalty whose level sets carve the projection differently), and the causal-interpretation warnings (§4.8 — the geometry says nothing about cause) all snap into place as different perturbations of the same geometric object
Read the catalogue of seminal references: Gauss (1809) and Legendre (1805) for the historical origin; Plackett (1972) for the priority history; Hastie-Tibshirani-Friedman (2009) ch.3 and James-Witten-Hastie-Tibshirani (2021) ch.3 as the modern textbook treatments; Greene (2018) as the econometric reference; Belsley-Kuh-Welsch (1980) for leverage and diagnostics; Wasserman (2004) ch.13 and Casella-Berger (2002) ch.11 for the mathematical-statistics treatment

Part 3 closed with the empirical-testability backbone (calibration) and the communication side of UNIVARIATE inference. Part 4 turns to the keystone tool of applied statistics across every science: linear regression, the procedure that takes a response $Y$ and a vector of predictors $X$ and asks "what linear combination of the predictors best explains the response?" The simplicity is deceptive — when its assumptions hold, OLS is the unique optimum among an entire family of estimators; when they fail, the literature on what to do next fills shelves.

§4.1 sets the foundation. The angle we take is GEOMETRIC. Most introductory presentations open with calculus: write $|Y - X\beta|^2$ , take the derivative with respect to $\beta$ , set it to zero, solve. That works and gives the right formula, but it hides the underlying object. The MORE FUNDAMENTAL view — and the one that pays for itself in every later section — is that OLS computes the ORTHOGONAL PROJECTION of $Y$ onto the column space of $X$ . Once you see the picture, the hat matrix, the normal equations, leverage, R², the Gauss-Markov theorem, multicollinearity, ridge regression, robust regression, and generalised least squares all reveal themselves as geometric statements about projections — sometimes onto col(X), sometimes onto a tilted or penalised version of it.

The §4.1 arc has eight stops. First, the matrix-form setup $Y = X\beta + \varepsilon$ and the dimensions to keep in mind. Second, the geometric closest-point argument that defines $\hat\beta$ . Third, the orthogonality of residuals to columns of X, expressed as $X^\top e = 0$ . Fourth, the hat matrix $H = X(X^\top X)^{-1}X^\top$ and its three structural properties (symmetric, idempotent, trace = p). Fifth, the normal equations and the closed-form OLS estimator. Sixth, geometric implications: R² as $\cos^2$ , leverage as the diagonal of $H$ , multicollinearity as near-parallel columns. Seventh, the Gauss-Markov theorem (BLUE under five assumptions) and the role of the optional Normality assumption. Eighth, the two widgets — ols-projection-geometry (a 3-D view of the projection) and hat-matrix-leverage (a leverage scatter where you drag points and watch h_ii update).

The setup: Y = Xβ + ε in matrix form

The data live in four objects of fixed shape:

$Y \in \mathbb{R}^n$ — the response (outcome) vector. One number per observation. In a study of $n$ patients, $Y_i$ might be patient $i$ 's recovery time. The whole-vector view treats $Y$ as a single point in $n$ -dimensional space.
$X \in \mathbb{R}^{n \times p}$ — the design matrix. Row $i$ holds the $p$ predictors for observation $i$ ; column $j$ holds predictor $j$ 's values across all $n$ observations. The first column is typically a vector of $1$ s (the intercept). For two predictors plus intercept: row $i = (1, x_{i1}, x_{i2})$ , so $p = 3$ .
$\beta \in \mathbb{R}^p$ — the coefficient vector. One number per predictor. In population terms, the unknown parameter we want to estimate.
$\varepsilon \in \mathbb{R}^n$ — the error vector. One number per observation, the part of $Y_i$ not explained by the linear combination $X_i \beta$ . In population terms, $\varepsilon_i = Y_i - X_i\beta$ .

The MODEL is the matrix equation

Y \;=\; X\beta \;+\; \varepsilon.

Read it slowly. The left side is observed ( $Y$ ). The right side decomposes that observation into (i) a SYSTEMATIC piece $X\beta$ , a linear combination of the columns of $X$ , and (ii) a RANDOM piece $\varepsilon$ . The estimation problem is: given $Y, X$ , find a $\hat\beta$ such that $X\hat\beta$ explains as much of $Y$ as possible while the unexplained residual $e = Y - X\hat\beta$ behaves like the errors should under whatever assumptions are appropriate. The criterion that defines OLS is "make the residual as short as possible in Euclidean length":

\hat\beta \;=\; \arg\min_{\beta \in \mathbb{R}^p} \|Y - X\beta\|^2.

The square is convenient — it makes the calculus tractable — but the choice of Euclidean norm is what locks OLS to its projection-geometry interpretation. Change the norm (e.g., to $|\cdot|_{\Omega^{-1}}$ for a known covariance $\Omega$ ) and you get GLS; change $|\cdot|^2$ to a bounded loss and you get M-estimators / robust regression. The keystone is the Euclidean choice. §4.1 keeps it.

OLS is the orthogonal projection of Y onto col(X)

Here is the geometric argument, which uses NO calculus. Let $\mathrm{col}(X) = {X\beta : \beta \in \mathbb{R}^p}$ be the $p$ -dimensional subspace of $\mathbb{R}^n$ spanned by the columns of $X$ . The minimisation $\min_\beta |Y - X\beta|^2$ asks: over all points $X\beta$ in $\mathrm{col}(X)$ , which one is closest (in Euclidean distance) to $Y$ ? The CLOSEST-POINT THEOREM in Euclidean space (or more generally in any Hilbert space) gives the answer:

The closest point to $Y$ in a closed subspace $\mathcal{S}$ is the orthogonal projection of $Y$ onto $\mathcal{S}$ . It is characterised uniquely by the condition that the residual $Y - \hat Y$ is orthogonal to every vector in $\mathcal{S}$ .

Apply this with $\mathcal{S} = \mathrm{col}(X)$ . The closest point in $\mathrm{col}(X)$ to $Y$ is the orthogonal projection — call it $\hat Y = X\hat\beta$ . The residual $e = Y - \hat Y$ is perpendicular to every column of $X$ :

\langle e, x_j \rangle \;=\; x_j^\top e \;=\; 0 \qquad \text{for every column } x_j \text{ of } X,

or, stacking the $p$ equations into a single matrix equation,

\boxed{\;X^\top e \;=\; 0.\;}

This is the GEOMETRIC OLS condition. Substituting $e = Y - X\hat\beta$ gives $X^\top(Y - X\hat\beta) = 0$ , i.e.

X^\top X \,\hat\beta \;=\; X^\top Y.

These are the NORMAL EQUATIONS. When $X$ has full column rank (the no-perfect-multicollinearity assumption), the $p \times p$ matrix $X^\top X$ is invertible and the unique OLS estimator is

\boxed{\;\hat\beta \;=\; (X^\top X)^{-1} X^\top Y.\;}

Two things to absorb. First, the calculus version — minimise $|Y - X\beta|^2$ by differentiating in $\beta$ and setting the gradient to zero — gives $2X^\top(X\beta - Y) = 0$ , i.e. exactly the same normal equations. Geometry and calculus agree, but the geometric argument needs neither vector calculus nor convex optimisation — only the closest-point theorem. Second, the estimator $\hat\beta = (X^\top X)^{-1} X^\top Y$ is LINEAR IN Y: it is the matrix $(X^\top X)^{-1} X^\top \in \mathbb{R}^{p \times n}$ applied to the $n$ -vector $Y$ . The Gauss-Markov theorem, below, uses this linearity in an essential way.

The hat matrix H = X(X′X)⁻¹X′

Multiply both sides of $\hat\beta = (X^\top X)^{-1} X^\top Y$ on the left by $X$ :

\hat Y \;=\; X\hat\beta \;=\; X(X^\top X)^{-1} X^\top \, Y \;=\; H\,Y, \qquad H \;\equiv\; X(X^\top X)^{-1} X^\top.

The matrix $H$ — universally called the HAT MATRIX (because it "puts the hat on" $Y$ ) — is $n \times n$ and depends ONLY on $X$ , not on $Y$ . It is THE matrix representation of the orthogonal projection from $\mathbb{R}^n$ onto $\mathrm{col}(X)$ . Three structural properties make everything in Parts 4–5 click:

Symmetric: $H^\top = H$ . Direct from the definition: $H^\top = \bigl(X(X^\top X)^{-1}X^\top\bigr)^\top = X((X^\top X)^{-1})^\top X^\top = X(X^\top X)^{-1} X^\top = H$ , using $(A^{-1})^\top = (A^\top)^{-1}$ for symmetric $A = X^\top X$ .
Idempotent: $H^2 = H$ . $H^2 = X(X^\top X)^{-1}X^\top \cdot X(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}(X^\top X)(X^\top X)^{-1}X^\top = X(X^\top X)^{-1}X^\top = H$ . Applying the projection twice gives the same result as applying it once — the geometric reading is that projecting an already-projected vector leaves it unchanged.
trace( $H$ ) = $p$ . $\mathrm{tr}(H) = \mathrm{tr}(X(X^\top X)^{-1}X^\top) = \mathrm{tr}((X^\top X)^{-1}X^\top X) = \mathrm{tr}(I_p) = p$ , using the cyclic property of trace. This identity is the heart of degrees-of-freedom counting in regression: the $p$ columns of $X$ consume $p$ degrees of freedom; the residual sits in an $(n - p)$ -dimensional subspace.

The RESIDUAL-MAKER matrix is $M = I - H$ . It is also symmetric ( $M^\top = I - H^\top = I - H = M$ ) and idempotent ( $M^2 = (I - H)(I - H) = I - 2H + H^2 = I - 2H + H = I - H = M$ ), so $M$ is the orthogonal projection onto the $(n - p)$ -dimensional orthogonal complement of $\mathrm{col}(X)$ . The residual is $e = Y - \hat Y = (I - H)Y = MY$ . The fitted values and residuals decompose $Y$ orthogonally:

Y \;=\; HY + (I - H)Y \;=\; \hat Y + e, \qquad \hat Y^\top e = (HY)^\top (MY) = Y^\top H M Y = 0

because $HM = H(I - H) = H - H^2 = 0$ . The fitted values and the residuals are PERPENDICULAR vectors in $\mathbb{R}^n$ , and their squared lengths add by Pythagoras:

\|Y\|^2 \;=\; \|\hat Y\|^2 \;+\; \|e\|^2.

The widget below makes this concrete: as you drag $Y$ around, watch $|\hat Y|^2 + |e|^2$ track $|Y|^2$ to within rounding. That identity is not a sample fact — it is a geometric necessity of orthogonal projection.

The first widget makes the picture inhabitable. To keep the visualisation in 3-D, we fix $n = 3$ and $p = 2$ : $Y$ is a vector in $\mathbb{R}^3$ , the column space $\mathrm{col}(X)$ is a 2-D plane through the origin, and the projection $\hat Y$ is the foot of the perpendicular from $Y$ to that plane. Drag the canvas to rotate the view; move $Y$ with the three sliders; or "Re-roll columns" to draw a fresh random orthonormal pair so the plane visibly tilts.

Things to verify in the widget:

The vector $Y$ (orange) starts off the plane. $\hat Y$ (cyan) sits on the plane. The residual $e = Y - \hat Y$ (red dashed) is the perpendicular from $Y$ to the plane. The right-angle tick mark at $\hat Y$ is not decoration — it is the geometric statement $e \perp \mathrm{col}(X)$ .
The status table reports $\langle e, x_1 \rangle$ and $\langle e, x_2 \rangle$ — the inner products of the residual with each column. They are both numerically $0$ (to within $10^{-15}$ floating-point error). This is the $X^\top e = 0$ statement, verified numerically.
The table also reports $|Y|^2$ , $|\hat Y|^2$ , $|e|^2$ , and the SUM $|\hat Y|^2 + |e|^2$ . The sum matches $|Y|^2$ exactly — Pythagoras for orthogonal projection.
Click "Snap Y onto plane". The widget animates $Y$ onto $\mathrm{col}(X)$ and back. When $Y$ sits inside the plane, the residual collapses to zero, $R^2 \to 1$ , and the fit is exact. This is the "lucky" or "trivial" case where the data lie exactly in the assumed model.
Slide $Y$ so it is roughly perpendicular to the plane. $\hat Y$ shrinks toward the origin; $R^2 \to 0$ . The columns of $X$ carry no information about the direction of $Y$ ; the projection is (nearly) zero. This is the "the predictors do not predict $Y$ " case.
Click "Re-roll columns" to draw a fresh orthonormal pair. The plane tilts; $\hat Y$ jumps to the new projection; the residual reorients. The numeric $R^2$ changes — different column spaces explain different fractions of $|Y|^2$ .
Read the "geometric R²" row of the table: $R^2 = \cos^2 \theta$ , where $\theta$ is the angle between $Y$ and the plane. Small $\theta$ (Y close to the plane) gives R² close to 1; large $\theta$ (Y nearly orthogonal to the plane) gives R² close to 0. This is the GEOMETRIC INTERPRETATION OF R², independent of any sample statistic.

R² is cos² of an angle

The widget exposes a formula that introductory regression texts often state without context: $R^2 = \cos^2 \theta$ . Here is the full statement. Let $\bar Y$ be the sample mean of $Y$ and $\mathbf{1}$ the $n$ -vector of ones. The CENTRED response is $Y - \bar Y \mathbf{1}$ and the CENTRED fitted values are $\hat Y - \bar Y \mathbf{1}$ . When the design includes an intercept (so $\mathbf{1} \in \mathrm{col}(X)$ ), we have $\sum_i e_i = 0$ (the residuals sum to zero), hence $\bar{\hat Y} = \bar Y$ , and the standard decomposition holds:

\underbrace{\|Y - \bar Y \mathbf{1}\|^2}_{\text{SST (total)}} \;=\; \underbrace{\|\hat Y - \bar Y \mathbf{1}\|^2}_{\text{SSR (regression)}} \;+\; \underbrace{\|e\|^2}_{\text{SSE (error)}}.

The coefficient of determination is

R^2 \;=\; \frac{\mathrm{SSR}}{\mathrm{SST}} \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}.

Geometrically: after centring, both $Y - \bar Y \mathbf{1}$ and $\hat Y - \bar Y \mathbf{1}$ are vectors orthogonal to $\mathbf{1}$ , and $\hat Y - \bar Y \mathbf{1}$ is the orthogonal projection of $Y - \bar Y \mathbf{1}$ onto $\mathrm{col}(X) \cap \mathbf{1}^\perp$ . By the definition of orthogonal projection, the cosine of the angle $\theta$ between $Y - \bar Y \mathbf{1}$ and its projection equals $|\hat Y - \bar Y \mathbf{1}| / |Y - \bar Y \mathbf{1}|$ . Squaring gives $R^2 = \cos^2 \theta$ . The widget displays both $R^2$ and $\theta$ in the table; the relationship is exact, not an approximation.

Three honest caveats about R². First, R² is MONOTONE in $p$ : adding a column to $X$ enlarges $\mathrm{col}(X)$ , so the projection cannot move further from $Y$ — R² can only increase. R² alone is therefore not a model-quality measure (a model with more predictors always has a higher R²; that does not make it a better model). Adjusted R² (penalises $p$ ) and information criteria (AIC, BIC; §4.7) are the responsible alternatives for model comparison. Second, R² says nothing about whether the assumed linear functional form is correct — a high R² with badly nonlinear true relationship can still produce systematic residual patterns (§4.3). Third, on a designed experiment where the predictors are fixed and the response is the random object, R² has a clean variance-decomposition interpretation; on an observational study where both are random, R² is the squared sample correlation between fitted and observed values — a different object that requires the §4.8 causal-warnings care.

Leverage h_ii = diag(H)

Each diagonal entry of $H$ is called the LEVERAGE of observation $i$ :

h_{ii} \;=\; (H)_{ii} \;=\; x_i^\top (X^\top X)^{-1} x_i,

where $x_i \in \mathbb{R}^p$ is the $i$ -th row of $X$ . Three facts to internalise. First, $h_{ii} \in [0, 1]$ for any row in a model with an intercept (an inequality that comes from idempotency: $H^2 = H$ implies $h_{ii} = \sum_j h_{ij}^2 \ge h_{ii}^2$ , hence $h_{ii}(1 - h_{ii}) \ge 0$ ). Second, $\sum_{i=1}^n h_{ii} = \mathrm{tr}(H) = p$ , so the AVERAGE LEVERAGE is exactly $p/n$ . Third, $h_{ii}$ measures how much observation $i$ 's own value $Y_i$ contributes to its own fitted value $\hat Y_i$ : from $\hat Y = HY$ ,

\hat Y_i \;=\; h_{ii} \, Y_i \;+\; \sum_{j \ne i} h_{ij}\, Y_j.

When $h_{ii} = 1$ , $\hat Y_i = Y_i$ exactly — the fit is forced to pass through observation $i$ ; the point has total leverage and the fitted line is at its mercy. When $h_{ii} = p/n$ (the average), observation $i$ contributes its FAIR SHARE of the $p$ -dimensional explanatory budget. When $h_{ii} > 2p/n$ — the BELSLEY-KUH-WELSCH (1980) RULE OF THUMB — observation $i$ has unusually high leverage and is worth a second look. For simple regression with intercept $+$ one slope ( $p = 2$ ), the average leverage is $2/n$ and the threshold is $4/n$ .

Geometrically, $h_{ii}$ measures how far row $i$ of $X$ sits from the centroid of the design in COVARIATE SPACE, in the metric $(X^\top X / n)^{-1}$ . For simple regression with $X = [\mathbf{1}, x]$ , the formula simplifies to

h_{ii} \;=\; \frac{1}{n} \;+\; \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}.

Points far from $\bar x$ in $x$ -coordinates have high leverage. The second widget makes this immediate.

The second widget shows a 2-D scatter of $n = 30$ points generated from a clean linear trend. Each point is sized and colour-coded by its leverage $h_{ii}$ . The OLS line is overlaid, and a vertical dashed line marks $\bar x$ (where leverage is at its minimum). The right panel shows a sorted bar chart of the leverages, with vertical reference lines at the average $p/n$ and the Belsley-Kuh-Welsch threshold $2p/n$ .

Things to verify in the widget:

On a clean sample, leverages are all near $p/n = 2/30 \approx 0.067$ . The status panel reports $\mathrm{tr}(H) = \sum h_{ii} = 2.000$ (within rounding): the trace-equals-p identity, verified numerically.
Drag a point far to the right or left of the cloud — well past $\bar x$ — and watch its leverage spike. It glows orange (above $2p/n$ ), then red (above $3p/n$ ). The other points' leverages adjust slightly because $\bar x$ and $\sum(x_j - \bar x)^2$ both move when you drag.
Click "Add high-leverage point". A new point is dropped near the right edge AT THE OLS LINE'S CURRENT $y$ VALUE. The point has very high leverage but a residual of approximately zero — so the OLS line BARELY MOVES. Now drag that point vertically away from the line. The OLS line ROTATES dramatically. This is the textbook high-leverage / high-residual scenario where OLS gets pulled around. §4.3 turns this into Cook's distance and DFFITS, the influence diagnostics that combine leverage and residual size.
Right-click any point to remove it. Re-roll the sample to start over.
Verify the sum-of-leverages identity: as you add or remove points, the status panel's "trace(H) = Σ h_ii" line always reads close to $2.00$ . The trace is a STRUCTURAL property of any orthogonal projection onto a 2-D subspace — it does not depend on the specific points, only on the dimension $p$ .
Observation: the leverage formula $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum(x_j - \bar x)^2$ has NO $y_i$ in it. Leverage is a property of the DESIGN matrix $X$ alone, not of the response. This is why high leverage is "potential to influence", not influence itself — a high-leverage point with a tiny residual barely affects the fit. The §4.3 influence diagnostics combine leverage with residual size to identify points that ACTUALLY influence the fit.

Multicollinearity as a geometric condition

When two or more columns of $X$ are nearly parallel (nearly linearly dependent), the $p \times p$ matrix $X^\top X$ becomes nearly singular. Its smallest eigenvalue approaches zero; its condition number $\kappa(X^\top X) = \lambda_{\max}/\lambda_{\min}$ explodes; the inverse $(X^\top X)^{-1}$ has very large entries. The OLS formula $\hat\beta = (X^\top X)^{-1} X^\top Y$ then propagates SMALL changes in $Y$ into LARGE changes in $\hat\beta$ — the estimator is UNSTABLE.

The geometric reading is direct. If columns $x_1, x_2$ are nearly parallel, the 2-D parallelogram they span is nearly DEGENERATE — almost a 1-D line. To express the projection of $Y$ as a linear combination of $x_1$ and $x_2$ , the coefficients $\hat\beta_1, \hat\beta_2$ must be enormous (and of opposite signs), with most of the magnitude cancelling. Small perturbations of $Y$ shift which "side" of the cancellation wins, so $\hat\beta$ jumps wildly.

The §4.3 diagnostics — VIFs (variance inflation factors) and condition indices — are quantitative versions of "how nearly parallel are the columns?" The standard formulas come from the same geometry: $\mathrm{VIF}(x_j) = 1/(1 - R_j^2)$ where $R_j^2$ is the R² from regressing $x_j$ on the OTHER predictors. A large VIF means a large fraction of $x_j$ 's variation is explained by the other columns — i.e., $x_j$ is nearly in the span of the rest. The remedies are also geometric: drop a redundant column (reduce $p$ ), centre and orthogonalise the predictors, or move to ridge regression (Part 9 §9.2), which adds $\lambda I$ to $X^\top X$ to push the smallest eigenvalue safely away from zero — a geometric modification of the projection itself.

The Gauss-Markov theorem: OLS is BLUE

So far the discussion has been purely about the SAMPLE FACTS — the geometry of one observed $(X, Y)$ . To talk about properties like UNBIASEDNESS and MINIMUM VARIANCE we need an assumed sampling-distribution model. The Gauss-Markov (1809–1821, formalised by Markov 1900) framework places five assumptions on the error vector $\varepsilon$ :

Linearity: $\mathbb{E}[Y \mid X] = X\beta$ . The conditional mean of $Y$ is exactly the linear combination $X\beta$ . No omitted nonlinear terms.
Strict exogeneity: $\mathbb{E}[\varepsilon \mid X] = 0$ . The errors have mean zero given the predictors. (Equivalently, the predictors carry no information about the errors' direction — they are uncorrelated, in a stronger conditional-mean sense.)
Homoscedasticity: $\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$ for every $i$ . The errors have the same variance regardless of the predictor values.
No autocorrelation: $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \ne j$ . The errors are uncorrelated across observations.
No perfect multicollinearity: rank $(X) = p$ . The columns of $X$ are linearly independent, so $X^\top X$ is invertible and $\hat\beta$ is well-defined.

Under these five assumptions, the GAUSS-MARKOV THEOREM (Gauss 1809; Markov 1900; cf. Lehmann-Casella 1998 ch.3) states:

The OLS estimator $\hat\beta_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top Y$ is the BLUE — Best Linear Unbiased Estimator — of $\beta$ . That is: among all estimators that (i) are LINEAR in $Y$ and (ii) are UNBIASED for $\beta$ (have $\mathbb{E}[\tilde\beta] = \beta$ for every value of $\beta$ ), the OLS estimator has the minimum variance.

"Linear in $Y$ " means $\tilde\beta = AY + a$ for some non-random matrix $A$ and vector $a$ ; the natural example is $\tilde\beta = AY$ with $A = (X^\top X)^{-1} X^\top$ for OLS, or any other $A$ for a different linear estimator. "Minimum variance" is in the matrix sense: $\mathrm{Var}(\hat\beta_{\mathrm{OLS}}) \preceq \mathrm{Var}(\tilde\beta)$ as $p \times p$ matrices (i.e., the difference is positive semi-definite).

Three caveats worth keeping in mind. First, BLUE concerns LINEAR UNBIASED estimators only. BIASED estimators (ridge, lasso) can have lower mean-squared error than OLS (this is the §1.5 bias-variance trade-off again; Part 9 develops it for regression). NONLINEAR estimators (median regression, M-estimators) can also beat OLS in MSE when the error distribution has heavy tails. Gauss-Markov says only that among the linear-unbiased class, OLS wins. Second, the theorem requires assumptions (3) and (4) — homoscedasticity and no autocorrelation. When they fail, OLS is still unbiased but no longer BLUE: GLS (§4.4) is the BLUE in that broader setting. Third, the theorem says nothing about the DISTRIBUTION of $\hat\beta$ . For exact small-sample inference (t-tests, F-tests, confidence intervals) we need the optional sixth assumption.

The optional Normality assumption

If we add the sixth assumption

\varepsilon \mid X \;\sim\; \mathcal{N}(0, \sigma^2 I_n),

two exact small-sample consequences follow. First, $\hat\beta = (X^\top X)^{-1} X^\top Y$ is a linear function of the Normal vector $Y = X\beta + \varepsilon$ , hence is itself Normal:

\hat\beta \;\sim\; \mathcal{N}\!\left(\beta, \;\sigma^2 (X^\top X)^{-1}\right).

Second, the residual sum of squares scales to a chi-square:

\frac{(n - p)\,\hat\sigma^2}{\sigma^2} \;=\; \frac{\|e\|^2}{\sigma^2} \;\sim\; \chi^2_{n - p},

independent of $\hat\beta$ (Cochran's theorem, 1934, on independence of quadratic forms in Normal vectors). These two facts give the EXACT small-sample t-distribution for testing single coefficients ( $(\hat\beta_j - \beta_j) / \widehat{\mathrm{SE}}(\hat\beta_j) \sim t_{n-p}$ ) and the EXACT F-distribution for testing joint hypotheses about multiple coefficients (§4.3 develops both).

Without the Normality assumption, the CLT (§0.7) takes over asymptotically. As $n \to \infty$ with $p$ fixed and the design well-behaved, $\sqrt n(\hat\beta - \beta) \xrightarrow{d} \mathcal{N}(0, \sigma^2 (X^\top X / n)^{-1})$ , and the t- and F-tests are APPROXIMATELY valid for large $n$ . The §4.3 diagnostics include Q-Q plots of the standardised residuals (§3.5) precisely to check whether the Normality assumption is empirically defensible for any given dataset; the more departure, the more we rely on $n$ being large.

Why "geometry first" pays off downstream

The geometric picture pays for itself in every later section of Part 4:

§4.2 (assumptions fail). Each Gauss-Markov assumption corresponds to a GEOMETRIC property of the projection. Homoscedasticity says $\mathrm{Var}(\varepsilon) = \sigma^2 I$ — the errors are SPHERICALLY symmetric about the column space. Autocorrelation distorts the sphere into an ellipsoid. Endogeneity tilts the column space relative to the truth. Each failure has a geometric signature.
§4.3 (diagnostics). Residuals, leverage, and influence are all read off the hat matrix $H$ . Cook's distance, DFFITS, and DFBETAS are scalar summaries of the SAME projection geometry.
§4.4 (GLS). When $\mathrm{Var}(\varepsilon) = \sigma^2 \Omega$ for known $\Omega \ne I$ , the BLUE is obtained by projecting $\Omega^{-1/2} Y$ onto $\mathrm{col}(\Omega^{-1/2} X)$ . Same picture, different inner product.
§4.5 (robust regression). Replace $|\cdot|^2$ with a bounded loss $\rho(\cdot)$ . The "projection" is no longer Euclidean orthogonal, but the same closest-point logic recovers the M-estimator.
§4.6 (interactions). Adding an interaction term enlarges $\mathrm{col}(X)$ by one dimension. The projection moves into the bigger subspace.
§4.7 (model selection — AIC/BIC/CV). Selection chooses which columns to include — i.e., which sub-spaces of $\mathbb{R}^n$ to project onto. The trade-off is between fitting $Y$ closely (big $\mathrm{col}(X)$ ) and over-fitting noise (the bias-variance frontier of §1.5).
§4.8 (causal warnings). The projection geometry tells you the linear combination of predictors that best matches $Y$ . It tells you NOTHING about which predictors CAUSE $Y$ . Confounders, instruments, mediators, colliders — all of those concepts (Part 6) live at the population / causal level, not in the geometry of one sample.
Part 9 (ridge / lasso). Ridge adds $\lambda I$ to $X^\top X$ , geometrically SHRINKING the projection toward the origin (already seen in §1.5's shrinkage-cv-tuner widget). Lasso uses a non-Euclidean penalty whose level sets have CORNERS, so the projection lands on the boundary with some coefficients set to zero. Both are geometric modifications of OLS.

Try it

In the ols-projection-geometry, leave the defaults and rotate the view. Confirm that $\hat Y$ (cyan) sits inside the blue plane and that the red dashed residual $e = Y - \hat Y$ is perpendicular to the plane (the right-angle tick at $\hat Y$ marks this). Read off $\langle e, x_1 \rangle$ and $\langle e, x_2 \rangle$ — both should be 0 to within $10^{-15}$ .
Same widget. Verify $|Y|^2 = |\hat Y|^2 + |e|^2$ to 3 decimal places. Slide $Y_3$ from $1.2$ to $-1.2$ and watch the three squared norms re-balance. The sum $|\hat Y|^2 + |e|^2$ always equals $|Y|^2$ — Pythagoras for orthogonal projection.
Same widget. Click "Snap Y onto plane". Watch the residual collapse to zero, R² → 1, and $\theta \to 0$ . State why: when $Y \in \mathrm{col}(X)$ , the projection is $Y$ itself and the fit is exact.
Same widget. Click "Re-roll columns" repeatedly. Notice that $R^2 = \cos^2 \theta$ can change dramatically across different column choices. State why: different column choices give different $\mathrm{col}(X)$ , and the angle between a fixed $Y$ and different planes varies.
In the hat-matrix-leverage, observe that on a clean sample of $n = 30$ points, the leverage bar chart shows all bars below the $2p/n$ threshold and most bars near $p/n = 0.067$ . Confirm: $\sum h_{ii} = \mathrm{tr}(H) = 2.000$ (within rounding).
Same widget. Drag a point far to the right of the cloud (x ≈ 14). Read off its new leverage $h_{ii}$ . Verify by hand: $h_{ii} = 1/30 + (14 - \bar x)^2 / \sum(x_j - \bar x)^2 \approx 1/30 + 70/270 \approx 0.29$ (precise numbers depend on the sample). Should be well above $2p/n = 0.133$ . The bar chart highlights it in orange / red.
Same widget. Click "Add high-leverage point". The point lands ON the OLS line, so its residual is approximately zero and the line barely moves. NOW drag that point vertically away from the line. The OLS line ROTATES dramatically. Conclude: leverage is the POTENTIAL to influence, realised influence requires also a non-zero residual — the §4.3 Cook's distance combines both.
Pen-and-paper. With $X = [\mathbf{1}, x]$ for simple regression and the $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum(x_j - \bar x)^2$ formula, verify $\sum_{i=1}^n h_{ii} = p = 2$ . Hint: expand $(x_i - \bar x)^2$ , sum over $i$ , and use $\sum(x_i - \bar x) = 0$ . The 1/n terms sum to $n \cdot 1/n = 1$ ; the (x_i - x̄)² terms sum to $\sum(x_i - \bar x)^2 / \sum(x_j - \bar x)^2 = 1$ . Total = 2 = p. ✓
Pen-and-paper. Prove that the OLS residuals $e$ are orthogonal to every column $x_j$ of $X$ . Hint: differentiate $|Y - X\beta|^2$ in $\beta$ ; setting the gradient to zero gives $X^\top(Y - X\hat\beta) = 0$ , i.e. $X^\top e = 0$ . The $j$ -th row says $x_j^\top e = 0$ . Geometric statement of the same fact: the residual is perpendicular to the column space, hence to every spanning vector.
Pen-and-paper. State the Gauss-Markov theorem precisely. List the five assumptions. State what BLUE means (Best Linear Unbiased Estimator: minimum variance among the linear-unbiased class). State two regimes where OLS can be IMPROVED on in MSE: (i) by adding bias (ridge, lasso) when the bias-variance trade-off favours it; (ii) by using nonlinear estimators (M-estimators) when the error distribution is heavy-tailed.

Pause and reflect: §4.1 has reframed OLS as GEOMETRY. The data live in $\mathbb{R}^n$ . The columns of $X$ span a $p$ -dimensional subspace $\mathrm{col}(X)$ . OLS computes the orthogonal projection of $Y$ onto that subspace — that is the WHOLE algorithm. The hat matrix $H = X(X^\top X)^{-1} X^\top$ is the matrix of the projection; its diagonal entries $h_{ii}$ are the leverages; trace( $H$ ) = $p$ counts the degrees of freedom consumed by the $p$ columns. The normal equations and the closed-form $\hat\beta = (X^\top X)^{-1} X^\top Y$ fall out of the geometric orthogonality condition $X^\top e = 0$ (or equivalently from setting the gradient of $|Y - X\beta|^2$ to zero — they agree). Under five Gauss-Markov assumptions, OLS is BLUE. Adding Normality buys exact small-sample t and F. Every later section of Part 4 — and a substantial chunk of Part 9 — perturbs this geometric picture in one specific way. §4.2 next explores what happens when each assumption fails one at a time.

What you now know

You can write the linear-regression model in matrix form $Y = X\beta + \varepsilon$ with explicit dimensions $(n, p)$ . You can state the OLS criterion as the minimisation of $|Y - X\beta|^2$ and re-cast it geometrically: $\hat Y = X\hat\beta$ is the orthogonal projection of $Y$ onto the column space $\mathrm{col}(X)$ , characterised uniquely by the condition that the residual $e = Y - \hat Y$ is perpendicular to every column of $X$ , written compactly as $X^\top e = 0$ .

You can derive the NORMAL EQUATIONS $X^\top X \hat\beta = X^\top Y$ by substituting $e = Y - X\hat\beta$ into $X^\top e = 0$ . You know the closed-form solution $\hat\beta = (X^\top X)^{-1} X^\top Y$ when $X$ has full column rank, and that this estimator is LINEAR in $Y$ — the matrix $(X^\top X)^{-1} X^\top$ applied to the response vector.

You can state the HAT MATRIX $H = X(X^\top X)^{-1} X^\top$ and verify its three structural properties: (i) symmetric, (ii) idempotent, (iii) trace = p. You know that $\hat Y = HY$ and $e = (I - H)Y$ , that fitted values and residuals are orthogonal vectors in $\mathbb{R}^n$ , and that they decompose $Y$ by Pythagoras: $|Y|^2 = |\hat Y|^2 + |e|^2$ . The ols-projection-geometry widget makes this picture inhabitable in 3-D.

You can read R² as a GEOMETRIC QUANTITY: after centring, $R^2 = \cos^2 \theta$ where $\theta$ is the angle between $Y - \bar Y \mathbf{1}$ and the projection-residual decomposition. You know the three honest caveats (monotone in p, not a model-quality measure on its own, says nothing about functional-form correctness).

You can define LEVERAGE $h_{ii} = (H)$ and recall: (i) $h$ {ii} \in [0, 1] $h_{ii} \in [0, 1]$ ; (ii) $\sum h_{ii} = p$ ; (iii) average leverage $= p/n$ ; (iv) Belsley-Kuh-Welsch (1980) threshold $= 2p/n$ . For simple regression $X = [\mathbf{1}, x]$ the closed form is $h_{ii} = 1/n + (x_i - \bar x)^2 / \sum(x_j - \bar x)^2$ . The hat-matrix-leverage widget lets you drag points and watch h_ii update; it also demonstrates that leverage is the POTENTIAL to influence — actual influence requires non-zero residuals too.

You can state MULTICOLLINEARITY as the geometric statement that near-parallel columns of $X$ make $X^\top X$ nearly singular and $(X^\top X)^{-1}$ have very large entries, propagating small $Y$ -perturbations into large $\hat\beta$ -jumps. You know that §4.3 develops VIFs and condition indices, and that ridge regression (Part 9 §9.2) is one geometric remedy (add $\lambda I$ to $X^\top X$ ).

You can state the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) and the GAUSS-MARKOV THEOREM: under those five, OLS is BLUE — minimum-variance among linear unbiased estimators of $\beta$ . You know the three caveats: (i) BLUE is about the linear-unbiased class — biased and nonlinear estimators can beat OLS in MSE; (ii) GLS (§4.4) is BLUE under heteroscedasticity / autocorrelation; (iii) BLUE says nothing about the distribution of $\hat\beta$ — exact small-sample inference needs the optional Normality assumption.

You can articulate the optional NORMALITY assumption $\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I)$ and its two consequences: (i) $\hat\beta \sim \mathcal{N}(\beta, \sigma^2(X^\top X)^{-1})$ exactly; (ii) $(n - p)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p}$ independently. Together they give the exact small-sample t and F tests of §4.3. Without Normality the CLT carries the same conclusions asymptotically.

You can articulate WHY GEOMETRY FIRST PAYS OFF: every later section of Part 4 (assumptions failing, diagnostics, GLS, robust regression, interactions, model selection, causal warnings) and substantial parts of Part 9 (ridge, lasso) perturb the projection geometry in one specific way. The picture you built in §4.1 is the foundation on which the rest of regression rests.

Where this lands. §4.2 takes each Gauss-Markov assumption in turn and asks "what breaks when this fails?" — linearity becomes nonlinear regression; homoscedasticity becomes WLS / sandwich estimators; no autocorrelation becomes time-series corrections; etc. §4.3 develops the diagnostic toolkit (residual plots, Q-Q plots, Cook's distance, DFFITS, DFBETAS, VIFs, leverage-residual plots — every one of which is a function of $H$ and $e$ ). §4.4 covers GLS for known and unknown covariance. §4.5 develops robust regression — M-estimators, S-estimators, MM-estimators — for heavy-tailed errors. §4.6 handles interactions, polynomials, and basis expansions as ways to enlarge $\mathrm{col}(X)$ . §4.7 covers model selection via AIC, BIC, and cross-validation — picking which columns to include. §4.8 closes Part 4 with the causal-interpretation warnings: regression is not causation, and the geometry doesn't care which way $X$ causes $Y$ . Part 5 generalises to GLMs (logistic, Poisson, mixed-effects). Part 9 §9.2 develops ridge and lasso as biased estimators that beat OLS in MSE when the bias-variance trade-off favours them.

References

Gauss, C.F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Hamburg: Perthes & Besser. (The first formal derivation of OLS in the context of orbital determination of comets and asteroids. Gauss introduces the squared-error criterion as the principle most consistent with the Normal error distribution.)
Legendre, A.M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Paris: Firmin Didot. (Independent publication of the method of least squares, predating Gauss's 1809 book by four years. Legendre coined the name "méthode des moindres carrés" — least squares.)
Plackett, R.L. (1972). "Studies in the history of probability and statistics. XXIX: The discovery of the method of least squares." Biometrika 59(2), 239–251. (The definitive priority study of the Legendre-Gauss dispute. Concludes that Legendre published first but Gauss had used the method privately since 1795.)
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 develops linear regression with the geometric / projection emphasis that §4.1 follows, and continues into ridge / lasso in chapter 3.4. The canonical modern reference for statistical learning theory.)
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 is the gentler companion to ESL's chapter 3 — same geometric emphasis, less mathematical machinery, more applied worked examples in R / Python. Widely used as the undergraduate / first-graduate textbook.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer. (Chapter 13 covers linear regression in the mathematical-statistics tradition. Compact derivations of the normal equations, Gauss-Markov, and inference under Normality.)
Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. Pacific Grove: Duxbury. (Chapter 11 develops linear regression with the Gauss-Markov theorem as the central result. The graduate-level mathematical-statistics reference for the theoretical structure of §4.1.)
Greene, W.H. (2018). Econometric Analysis, 8th ed. New York: Pearson. (The standard graduate-level econometrics reference. Chapter 3 covers OLS, chapter 4 the assumptions and Gauss-Markov, chapter 5 onwards generalised methods and finite-sample inference. The applied-econometrics complement to ESL / ISL.)
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (The foundational reference for leverage $h_{ii}$ , Cook's distance, DFFITS, DFBETAS, VIFs, and condition indices. The 2p/n leverage threshold used in §4.1 and the diagnostics in §4.3 come from this book.)
Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation, 2nd ed. New York: Springer. (Chapter 3 contains the rigorous statement and proof of the Gauss-Markov theorem in the linear-model framework, and the broader theory of best linear unbiased estimators. The reference for the BLUE concept and its precise scope.)
Cochran, W.G. (1934). "The distribution of quadratic forms in a normal system, with applications to the analysis of variance." Proceedings of the Cambridge Philosophical Society 30(2), 178–191. (Cochran's theorem on the independence of quadratic forms in Normal vectors — the foundational result that gives the exact t and F distributions under the Normality assumption.)