OLS as geometry
Learning objectives
- State the LINEAR-REGRESSION SETUP in matrix form: Y = Xβ + ε with Y ∈ ℝⁿ, X ∈ ℝⁿˣᵖ (design matrix; rows = observations, columns = predictors, typically including a column of 1s for the intercept), β ∈ ℝᵖ (coefficient vector), ε ∈ ℝⁿ (error vector). Identify the four objects' dimensions, and recognise that OLS is well-posed when X has full column rank (which requires p ≤ n), with p ≪ n the comfortable regime
- Recognise that OLS minimises ‖Y − Xβ‖² and that the GEOMETRIC interpretation — Ŷ = Xβ̂ is the orthogonal projection of Y onto col(X), the column space of X — is more fundamental than the calculus derivation. The proof requires no calculus; the closest-point theorem in Euclidean space (Hilbert space) gives the projection
- Define the RESIDUAL e = Y − Ŷ and the ORTHOGONALITY PROPERTY: e ⊥ every column of X, equivalently X′e = 0. Recognise this as the SAME condition as the first-order condition from minimising Σ residuals² with calculus — geometry and calculus give the same answer
- State the HAT MATRIX H = X(X′X)⁻¹X′. Verify its properties: (i) symmetric (H′ = H); (ii) idempotent (H² = H); (iii) Ŷ = HY; (iv) the residual-maker M = I − H is also symmetric and idempotent with e = MY; (v) trace(H) = p, the column-space dimension. Recognise H as THE matrix of the orthogonal projection onto col(X)
- Derive the NORMAL EQUATIONS X′Xβ̂ = X′Y by setting X′e = 0 and substituting e = Y − Xβ̂. Recognise that when X has full column rank, X′X is p × p and invertible, giving the closed-form OLS estimator β̂ = (X′X)⁻¹X′Y. State that the estimator is LINEAR in Y — β̂ = (X′X)⁻¹X′·Y is a linear map from ℝⁿ to ℝᵖ — which Gauss-Markov uses
- Read off the GEOMETRIC R² FORMULA: after centring Y and Ŷ around Ȳ, R² = ‖Ŷ − Ȳ𝟏‖² / ‖Y − Ȳ𝟏‖² = cos² of the angle between the centred response (Y − Ȳ𝟏) and the part of col(X) orthogonal to 𝟏. State that R² ∈ [0, 1] is the fraction of variation in Y explained by the projection. Note its three caveats: (i) it never decreases when predictors are added; (ii) it is not a model-quality measure on its own; (iii) it says nothing about whether the linear functional form is correct. Adjusted R² and AIC/BIC (§4.7) are the responsible alternatives for model comparison
- Define LEVERAGE h_ii = (H)_ii, the i-th diagonal entry of the hat matrix. State the identity Σ h_ii = trace(H) = p, hence the AVERAGE LEVERAGE = p/n. State the Belsley-Kuh-Welsch (1980) RULE OF THUMB: h_ii > 2p/n flags a point worth a second look. Interpret geometrically: high leverage = the row of X is far from the centroid in covariate space, in the design's metric
- State the MULTICOLLINEARITY DIAGNOSTIC as the geometric statement: when two or more columns of X are nearly parallel (nearly linearly dependent), X′X is nearly singular and (X′X)⁻¹ has very large entries, making β̂ unstable (small changes in Y produce large changes in β̂). Recognise that this is structural, not a small-sample artefact; §4.3 develops VIFs / condition indices
- State the FIVE GAUSS-MARKOV ASSUMPTIONS: (1) LINEARITY: 𝔼[Y | X] = Xβ; (2) EXOGENEITY / strict exogeneity: 𝔼[ε | X] = 0; (3) HOMOSCEDASTICITY: Var(ε_i | X) = σ² constant; (4) NO AUTOCORRELATION: Cov(ε_i, ε_j | X) = 0 for i ≠ j; (5) NO PERFECT MULTICOLLINEARITY: rank(X) = p. State the GAUSS-MARKOV THEOREM: under these five assumptions, the OLS estimator β̂ is BLUE (Best Linear Unbiased Estimator), i.e. minimum-variance among all linear unbiased estimators of β. Note that BLUE concerns LINEAR unbiased estimators only; biased estimators (ridge and lasso, Part 9 §9.2) and nonlinear estimators (M-estimators, §4.5) can beat OLS in MSE
- Recognise the optional SIXTH ASSUMPTION (NORMALITY): ε | X ~ 𝒩(0, σ²I). Under Normality, β̂ is exactly Normal in finite samples, σ̂²(X′X)⁻¹ gives exact small-sample standard errors, and (n − p)σ̂²/σ² ~ χ²_{n−p} — yielding the exact small-sample t- and F-tests of §4.3. Without Normality, the CLT carries the same inference asymptotically. Distinguish the five Gauss-Markov assumptions (about MOMENTS) from the Normality assumption (about the DISTRIBUTION)
- Articulate why GEOMETRY FIRST helps: the projection picture makes diagnostics (§4.3), GLS (§4.4 — change the inner product), robust regression (§4.5 — change the loss away from ‖·‖²), ridge / lasso (Part 9 — add a penalty whose level sets carve the projection differently), and the causal-interpretation warnings (§4.8 — the geometry says nothing about cause) all snap into place as different perturbations of the same geometric object
- Read the catalogue of seminal references: Gauss (1809) and Legendre (1805) for the historical origin; Plackett (1972) for the priority history; Hastie-Tibshirani-Friedman (2009) ch.3 and James-Witten-Hastie-Tibshirani (2021) ch.3 as the modern textbook treatments; Greene (2018) as the econometric reference; Belsley-Kuh-Welsch (1980) for leverage and diagnostics; Wasserman (2004) ch.13 and Casella-Berger (2002) ch.11 for the mathematical-statistics treatment
Part 3 closed with the empirical-testability backbone (calibration) and the communication side of UNIVARIATE inference. Part 4 turns to the keystone tool of applied statistics across every science: linear regression, the procedure that takes a response and a vector of predictors and asks "what linear combination of the predictors best explains the response?" The simplicity is deceptive — when its assumptions hold, OLS is the unique optimum among an entire family of estimators; when they fail, the literature on what to do next fills shelves.
§4.1 sets the foundation. The angle we take is GEOMETRIC. Most introductory presentations open with calculus: write the sum of squared residuals Σ_i (Y_i − x_i′β)², take the derivative with respect to β, set it to zero, solve. That works and gives the right formula, but it hides the underlying object. The MORE FUNDAMENTAL view, and the one that pays for itself in every later section, is that OLS computes the ORTHOGONAL PROJECTION of Y onto the column space of X. Once you see the picture, the hat matrix, the normal equations, leverage, R², the Gauss-Markov theorem, multicollinearity, ridge regression, robust regression, and generalised least squares all reveal themselves as geometric statements about projections: sometimes onto col(X), sometimes onto a tilted or penalised version of it.
The §4.1 arc has eight stops. First, the matrix-form setup and the dimensions to keep in mind. Second, the geometric closest-point argument that defines Ŷ = Xβ̂. Third, the orthogonality of residuals to columns of X, expressed as X′e = 0. Fourth, the hat matrix H = X(X′X)⁻¹X′ and its three structural properties (symmetric, idempotent, trace = p). Fifth, the normal equations and the closed-form OLS estimator. Sixth, geometric implications: R² as cos²θ, leverage as the diagonal of H, multicollinearity as near-parallel columns. Seventh, the Gauss-Markov theorem (BLUE under five assumptions) and the role of the optional Normality assumption. Eighth, the two widgets: ols-projection-geometry (a 3-D view of the projection) and hat-matrix-leverage (a leverage scatter where you drag points and watch h_ii update).
The setup: Y = Xβ + ε in matrix form
The data live in four objects of fixed shape:
- Y ∈ ℝⁿ: the response (outcome) vector. One number per observation. In a study of n patients, Y_i might be patient i's recovery time. The whole-vector view treats Y as a single point in n-dimensional space.
- X ∈ ℝⁿˣᵖ: the design matrix. Row i holds the predictors for observation i; column j holds predictor j's values across all observations. The first column is typically a vector of 1s (the intercept). For two predictors plus intercept: row i = (1, x_i1, x_i2), so p = 3.
- β ∈ ℝᵖ: the coefficient vector. One number per predictor. In population terms, the unknown parameter we want to estimate.
- ε ∈ ℝⁿ: the error vector. One number per observation, the part of Y not explained by the linear combination Xβ. In population terms, ε = Y − Xβ.
The MODEL is the matrix equation
Y = Xβ + ε.
Read it slowly. The left side is observed (Y). The right side decomposes that observation into (i) a SYSTEMATIC piece Xβ, a linear combination of the columns of X, and (ii) a RANDOM piece ε. The estimation problem is: given (Y, X), find a β̂ such that Xβ̂ explains as much of Y as possible while the unexplained residual behaves like the errors should under whatever assumptions are appropriate. The criterion that defines OLS is "make the residual as short as possible in Euclidean length":
β̂ = argmin over β of ‖Y − Xβ‖² = argmin over β of Σ_i (Y_i − x_i′β)².
The square is convenient (it makes the calculus tractable), but the choice of Euclidean norm is what locks OLS to its projection-geometry interpretation. Change the norm (e.g., to (Y − Xβ)′Ω⁻¹(Y − Xβ) for a known covariance Ω) and you get GLS; change to a bounded loss and you get M-estimators / robust regression. The keystone is the Euclidean choice. §4.1 keeps it.
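The setup and the criterion can be sketched numerically. The following is a minimal illustration, assuming NumPy is available; the sample size, the two predictors, the coefficient values, and the helper name `ols_criterion` are all hypothetical choices made for this sketch, not part of the text above.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the sketch is reproducible

n = 6                                    # number of observations (illustrative)
x1 = rng.normal(size=n)                  # hypothetical predictor 1
x2 = rng.normal(size=n)                  # hypothetical predictor 2
X = np.column_stack([np.ones(n), x1, x2])  # n x p design matrix, intercept column first
beta = np.array([1.0, 2.0, -0.5])        # assumed "true" coefficient vector (p = 3)
eps = rng.normal(scale=0.3, size=n)      # error vector
Y = X @ beta + eps                       # the model Y = X beta + eps

def ols_criterion(b):
    """Squared Euclidean length of the residual Y - X b."""
    r = Y - X @ b
    return float(r @ r)

print(X.shape, beta.shape, Y.shape, eps.shape)   # (6, 3) (3,) (6,) (6,)
print(ols_criterion(beta), ols_criterion(np.zeros(3)))
```

Evaluating the criterion at the assumed true β returns exactly ‖ε‖², which is a quick sanity check that the four objects have compatible shapes.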
OLS is the orthogonal projection of Y onto col(X)
Here is the geometric argument, which uses NO calculus. Let col(X) be the p-dimensional subspace of ℝⁿ spanned by the columns of X. The minimisation asks: over all points Xβ in col(X), which one is closest (in Euclidean distance) to Y? The CLOSEST-POINT THEOREM in Euclidean space (or more generally in any Hilbert space) gives the answer:
The closest point to Y in a closed subspace S is the orthogonal projection of Y onto S. It is characterised uniquely by the condition that the residual (Y minus its projection) is orthogonal to every vector in S.
Apply this with S = col(X). The closest point in col(X) to Y is the orthogonal projection; call it Ŷ = Xβ̂. The residual e = Y − Ŷ is perpendicular to every column x_j of X:
x_j′e = 0 for j = 1, …, p,
or, stacking the p equations into a single matrix equation,
X′e = 0.
This is the GEOMETRIC OLS condition. Substituting e = Y − Xβ̂ gives X′(Y − Xβ̂) = 0, i.e.
X′Xβ̂ = X′Y.
These are the NORMAL EQUATIONS. When X has full column rank (the no-perfect-multicollinearity assumption), the p × p matrix X′X is invertible and the unique OLS estimator is
β̂ = (X′X)⁻¹X′Y.
Two things to absorb. First, the calculus version (minimise ‖Y − Xβ‖² by differentiating in β and setting the gradient to zero) gives −2X′(Y − Xβ̂) = 0, i.e. exactly the same normal equations. Geometry and calculus agree, but the geometric argument needs neither vector calculus nor convex optimisation, only the closest-point theorem. Second, the estimator is LINEAR IN Y: it is the p × n matrix (X′X)⁻¹X′ applied to the n-vector Y. The Gauss-Markov theorem, below, uses this linearity in an essential way.
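The normal equations and the orthogonality condition can be checked numerically. This is a minimal sketch assuming NumPy and simulated data (the sizes and coefficients are illustrative assumptions); it solves X′Xβ̂ = X′Y with `np.linalg.solve` instead of forming an explicit inverse, and compares against `np.linalg.lstsq` as an independent solver.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 2 predictors
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

# Normal equations X'X b = X'Y, solved as a linear system
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Independent check: a least-squares solver that never forms X'X
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

e = Y - X @ beta_hat                      # residual vector
print(np.allclose(beta_hat, beta_lstsq))  # both routes give the same estimator
print(np.max(np.abs(X.T @ e)))            # X'e is zero up to floating-point error

# Closest-point property: any perturbed coefficient vector gives a longer residual
b_other = beta_hat + np.array([0.1, -0.2, 0.05])
print(e @ e < (Y - X @ b_other) @ (Y - X @ b_other))
```

Solving the linear system is generally preferred to computing (X′X)⁻¹ explicitly; the closed-form expression is best read as a formula, not as a numerical recipe.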
The hat matrix H = X(X′X)⁻¹X′
Multiply both sides of β̂ = (X′X)⁻¹X′Y on the left by X:
Ŷ = Xβ̂ = X(X′X)⁻¹X′Y = HY.
The matrix H = X(X′X)⁻¹X′, universally called the HAT MATRIX (because it "puts the hat on" Y), is n × n and depends ONLY on X, not on Y. It is THE matrix representation of the orthogonal projection from ℝⁿ onto col(X). Three structural properties make everything in Parts 4–5 click:
- Symmetric: H′ = H. Direct from the definition: H′ = [X(X′X)⁻¹X′]′ = X[(X′X)⁻¹]′X′ = X(X′X)⁻¹X′ = H, using (A⁻¹)′ = A⁻¹ for symmetric A.
- Idempotent: H² = H. H² = X(X′X)⁻¹X′X(X′X)⁻¹X′ = X(X′X)⁻¹X′ = H. Applying the projection twice gives the same result as applying it once; the geometric reading is that projecting an already-projected vector leaves it unchanged.
- trace(H) = p. trace(H) = trace(X(X′X)⁻¹X′) = trace((X′X)⁻¹X′X) = trace(I_p) = p, using the cyclic property of trace. This identity is the heart of degrees-of-freedom counting in regression: the p columns of X consume p degrees of freedom; the residual sits in an (n − p)-dimensional subspace.
The RESIDUAL-MAKER matrix is M = I − H. It is also symmetric (M′ = M) and idempotent (M² = M), so M is the orthogonal projection onto the (n − p)-dimensional orthogonal complement of col(X). The residual is e = MY. The fitted values and residuals decompose Y orthogonally:
Y = HY + MY = Ŷ + e, with Ŷ′e = Y′HMY = 0,
because HM = H(I − H) = H − H² = 0. The fitted values and the residuals are PERPENDICULAR vectors in ℝⁿ, and their squared lengths add by Pythagoras:
‖Y‖² = ‖Ŷ‖² + ‖e‖².
The widget below makes this concrete: as you drag Y around, watch ‖Ŷ‖² + ‖e‖² track ‖Y‖² to within rounding. That identity is not a sample fact; it is a geometric necessity of orthogonal projection.
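The hat-matrix properties above are easy to verify on a small example. The sketch below assumes NumPy and an arbitrary simulated design (n = 8, p = 3 are illustrative); note that Y is drawn at random with no model behind it, since the identities depend only on X.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)                    # any response vector will do here

H = X @ np.linalg.solve(X.T @ X, X.T)     # H = X (X'X)^{-1} X'
M = np.eye(n) - H                         # residual-maker M = I - H
Y_hat, e = H @ Y, M @ Y                   # fitted values and residuals

checks = {
    "H symmetric": np.allclose(H, H.T),
    "H idempotent": np.allclose(H @ H, H),
    "trace(H) = p": np.isclose(np.trace(H), p),
    "M idempotent": np.allclose(M @ M, M),
    "HM = 0": np.allclose(H @ M, 0.0),
    "Pythagoras": np.isclose(Y @ Y, Y_hat @ Y_hat + e @ e),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Forming the full n × n matrix H is fine for a demonstration of this size; for large n one would typically work with a factorisation of X instead.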
The projection-geometry widget
The first widget makes the picture inhabitable. To keep the visualisation in 3-D, we fix n = 3 and p = 2: Y is a vector in ℝ³, the column space col(X) is a 2-D plane through the origin, and the projection Ŷ is the foot of the perpendicular from Y to that plane. Drag the canvas to rotate the view; move Y with the three sliders; or "Re-roll columns" to draw a fresh random orthonormal pair (x_1, x_2) so the plane visibly tilts.
Things to verify in the widget:
- The vector Y (orange) starts off the plane. Ŷ (cyan) sits on the plane. The residual e (red dashed) is the perpendicular from Y to the plane. The right-angle tick mark at Ŷ is not decoration: it is the geometric statement e ⊥ col(X).
- The status table reports x_1′e and x_2′e, the inner products of the residual with each column. They are both numerically 0 (to within floating-point error). This is the X′e = 0 statement, verified numerically.
- The table also reports ‖Y‖², ‖Ŷ‖², ‖e‖², and the SUM ‖Ŷ‖² + ‖e‖². The sum matches ‖Y‖² exactly: Pythagoras for orthogonal projection.
- Click "Snap Y onto plane". The widget animates Y onto Ŷ and back. When Y sits inside the plane, the residual collapses to zero, R² = 1, and the fit is exact. This is the "lucky" or "trivial" case where the data lie exactly in the assumed model.
- Slide Y so it is roughly perpendicular to the plane. Ŷ shrinks toward the origin; R² ≈ 0. The columns of X carry no information about the direction of Y; the projection is (nearly) zero. This is the "the predictors do not predict Y" case.
- Click "Re-roll columns" to draw a fresh orthonormal pair. The plane tilts; Ŷ jumps to the new projection; the residual reorients. The numeric R² changes: different column spaces explain different fractions of ‖Y‖².
- Read the "geometric R²" row of the table: R² = cos²θ, where θ is the angle between Y and the plane. Small θ (Y close to the plane) gives R² close to 1; large θ (Y nearly orthogonal to the plane) gives R² close to 0. This is the GEOMETRIC INTERPRETATION OF R², independent of any sample statistic.
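The checks in the list above can be reproduced offline. The following is a rough stand-in for the widget, not its actual implementation: it assumes NumPy, draws a random orthonormal pair via a QR factorisation (one possible way to "re-roll columns"), and uses an arbitrary Y chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random orthonormal pair in R^3: Q is 3 x 2 with orthonormal columns
Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))
Y = np.array([1.0, 2.0, 2.0])             # illustrative response vector in R^3

Y_hat = Q @ (Q.T @ Y)                     # with orthonormal columns, H = Q Q'
e = Y - Y_hat                             # residual, perpendicular to the plane

inner = Q.T @ e                           # (x_1'e, x_2'e), both ~0
r2_geo = (Y_hat @ Y_hat) / (Y @ Y)        # uncentred "geometric R^2"
cos_theta = (Y @ Y_hat) / (np.linalg.norm(Y) * np.linalg.norm(Y_hat))

print(inner)
print(Y @ Y, Y_hat @ Y_hat + e @ e)       # Pythagoras
print(r2_geo, cos_theta ** 2)             # the two agree

# "Snap Y onto plane": a vector already in col(X) projects to itself
Y_snap = Y_hat.copy()
e_snap = Y_snap - Q @ (Q.T @ Y_snap)
print(np.linalg.norm(e_snap))             # ~0, so the fit is exact
```

Because n = 3 and there is no intercept column here, the R² in this sketch is the uncentred ratio ‖Ŷ‖²/‖Y‖², which is the quantity the angle to the plane measures directly.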
R² is cos² of an angle
The widget exposes a formula that introductory regression texts often state without context: R² = cos²θ. Here is the full statement. Let Ȳ be the sample mean of Y and 𝟏 the n-vector of ones. The CENTRED response is Y_c = Y − Ȳ𝟏 and the CENTRED fitted values are Ŷ_c = Ŷ − Ȳ𝟏. When the design includes an intercept (so 𝟏 ∈ col(X)), we have 𝟏′e = 0 (the residuals sum to zero), hence the mean of Ŷ equals Ȳ, and the standard decomposition holds:
‖Y_c‖² = ‖Ŷ_c‖² + ‖e‖² (total = explained + residual sum of squares).
The coefficient of determination is
R² = ‖Ŷ_c‖² / ‖Y_c‖² = 1 − ‖e‖² / ‖Y_c‖².
Geometrically: after centring, both Y_c and Ŷ_c are vectors orthogonal to 𝟏, and Ŷ_c is the orthogonal projection of Y_c onto the part of col(X) orthogonal to 𝟏. By the definition of orthogonal projection, the cosine of the angle θ between Y_c and its projection equals ‖Ŷ_c‖ / ‖Y_c‖. Squaring gives R² = cos²θ. The widget displays both R² and cos²θ in the table; the relationship is exact, not an approximation.
Three honest caveats about R². First, R² is MONOTONE in p: adding a column to X enlarges col(X), so the projection cannot move further from Y, and R² can never decrease. R² alone is therefore not a model-quality measure (a model with more predictors never has a lower R² than a smaller model nested inside it; that does not make it a better model). Adjusted R² (which penalises p) and information criteria (AIC, BIC; §4.7) are the responsible alternatives for model comparison. Second, R² says nothing about whether the assumed linear functional form is correct: a fit with a high R² can still show systematic residual patterns when the true relationship is badly nonlinear (§4.3). Third, on a designed experiment where the predictors are fixed and the response is the random object, R² has a clean variance-decomposition interpretation; on an observational study where both are random, R² is the squared sample correlation between fitted and observed values, a different object that requires the §4.8 causal-warnings care.
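Both the cos² identity and the first caveat can be checked on simulated data. This is a minimal sketch assuming NumPy; the helper name `r_squared`, the data-generating coefficients, and the pure-noise extra column are illustrative assumptions.

```python
import numpy as np

def r_squared(X, Y):
    """Return (R^2, cos^2 of the angle between centred Y and centred fit).

    Assumes X contains an intercept column, so the fitted values have mean Y-bar.
    """
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ beta_hat
    Yc = Y - Y.mean()                     # centred response
    Yhc = Y_hat - Y.mean()                # centred fitted values
    r2 = (Yhc @ Yhc) / (Yc @ Yc)
    cos2 = (Yc @ Yhc) ** 2 / ((Yc @ Yc) * (Yhc @ Yhc))
    return r2, cos2

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
Y = 1.0 + 0.8 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
noise = rng.normal(size=n)                # a column unrelated to Y by construction
X_big = np.column_stack([X_small, noise])

r2_small, cos2_small = r_squared(X_small, Y)
r2_big, _ = r_squared(X_big, Y)
print(r2_small, cos2_small)               # identical up to rounding
print(r2_big >= r2_small)                 # adding a column never lowers R^2
```

The second printout is the monotonicity caveat in action: even a predictor generated as pure noise cannot reduce R², which is why R² alone cannot rank models of different size.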
Leverage h_ii = diag(H)
Each diagonal entry of H is called the LEVERAGE of observation i:
h_ii = (H)_ii = x_i′(X′X)⁻¹x_i,
where x_i′ is the i-th row of X. Three facts to internalise. First, 1/n ≤ h_ii ≤ 1 for any row in a model with an intercept (the bounds 0 ≤ h_ii ≤ 1 come from idempotency: H² = H implies h_ii = Σ_j h_ij² ≥ h_ii², hence h_ii(1 − h_ii) ≥ 0; the intercept raises the lower bound to 1/n). Second, Σ_i h_ii = trace(H) = p, so the AVERAGE LEVERAGE is exactly p/n. Third, h_ii measures how much observation i's own value Y_i contributes to its own fitted value Ŷ_i: from Ŷ = HY,
Ŷ_i = h_ii Y_i + Σ_{j≠i} h_ij Y_j.
When h_ii = 1, Ŷ_i = Y_i exactly: the fit is forced to pass through observation i; the point has total leverage and the fitted line is at its mercy. When h_ii = p/n (the average), observation i contributes its FAIR SHARE of the p-dimensional explanatory budget. When h_ii > 2p/n (the BELSLEY-KUH-WELSCH (1980) RULE OF THUMB), observation i has unusually high leverage and is worth a second look. For simple regression with an intercept plus one slope (p = 2), the average leverage is 2/n and the threshold is 4/n.
Geometrically, h_ii measures how far row i of X sits from the centroid of the design in COVARIATE SPACE, in the metric defined by (X′X)⁻¹. For simple regression with an intercept, the formula simplifies to
h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)².
Points far from x̄ in x-coordinates have high leverage. The second widget makes this immediate.
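The leverage facts can be confirmed directly. The sketch below assumes NumPy and a small simulated simple-regression design (n = 12 and the uniform x-range are illustrative); it compares the diagonal of H with the closed form 1/n + (x_i − x̄)²/Σ_j (x_j − x̄)² and applies the 2p/n rule of thumb.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 12
x = rng.uniform(0, 10, size=n)            # illustrative predictor values
X = np.column_stack([np.ones(n), x])      # simple regression: intercept + slope
p = X.shape[1]

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                            # leverages h_ii

# Closed form for simple regression with an intercept
h_closed = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print(np.allclose(h, h_closed))           # the two formulas agree
print(h.sum(), p)                         # sum of leverages equals p
print(h.min() >= 1 / n, h.max() <= 1)     # bounds with an intercept
flagged = np.flatnonzero(h > 2 * p / n)   # indices above the 2p/n threshold
print(flagged)
```

Nothing in this computation uses a response vector, which is the numerical counterpart of the statement that leverage is a property of the design alone.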
The hat-matrix-leverage widget
The second widget shows a 2-D scatter of (x, y) points generated from a clean linear trend. Each point is sized and colour-coded by its leverage h_ii. The OLS line is overlaid, and a vertical dashed line marks x = x̄ (where leverage is at its minimum). The right panel shows a sorted bar chart of the leverages, with vertical reference lines at the average p/n = 2/n and the Belsley-Kuh-Welsch threshold 2p/n = 4/n.
Things to verify in the widget:
- On a clean sample, leverages are all near p/n = 2/n. The status panel reports Σ h_ii = 2 (within rounding): the trace-equals-p identity, verified numerically.
- Drag a point far to the right or left of the cloud, well away from x̄, and watch its leverage spike. It glows orange, then red, as it crosses the widget's leverage thresholds. The other points' leverages adjust slightly because x̄ and Σ_j (x_j − x̄)² both move when you drag.
- Click "Add high-leverage point". A new point is dropped near the right edge AT THE OLS LINE'S CURRENT VALUE. The point has very high leverage but a residual of approximately zero — so the OLS line BARELY MOVES. Now drag that point vertically away from the line. The OLS line ROTATES dramatically. This is the textbook high-leverage / high-residual scenario where OLS gets pulled around. §4.3 turns this into Cook's distance and DFFITS, the influence diagnostics that combine leverage and residual size.
- Right-click any point to remove it. Re-roll the sample to start over.
- Verify the sum-of-leverages identity: as you add or remove points, the status panel's "trace(H) = Σ h_ii" line always reads close to 2. The trace is a STRUCTURAL property of any orthogonal projection onto a 2-D subspace: it does not depend on the specific points, only on the dimension p = 2.
- Observation: the leverage formula has NO Y in it. Leverage is a property of the DESIGN matrix alone, not of the response. This is why high leverage is "potential to influence", not influence itself: a high-leverage point with a tiny residual barely affects the fit. The §4.3 influence diagnostics combine leverage with residual size to identify points that ACTUALLY influence the fit.
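The "Add high-leverage point" experiment can be imitated in a few lines. This is a sketch under stated assumptions, not the widget's code: it assumes NumPy, a simulated cloud of 20 points, a hypothetical far-right location x = 30, and a vertical displacement of 10 units; the helper `fit_line` is introduced only for this illustration.

```python
import numpy as np

def fit_line(x, y):
    """OLS fit of y on (1, x); returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=20)
y = 2.0 + 0.5 * x + rng.normal(scale=0.5, size=20)
b0, b1 = fit_line(x, y)                   # baseline fit on the clean cloud

x_far = 30.0                              # hypothetical point far right of the cloud
on_line = b0 + b1 * x_far                 # its y-value if dropped ON the current line

# Case 1: high leverage, zero residual -> the fit does not move
b_on = fit_line(np.append(x, x_far), np.append(y, on_line))
# Case 2: same x, dragged 10 units off the line -> the fit rotates
b_off = fit_line(np.append(x, x_far), np.append(y, on_line + 10.0))

print("baseline :", b0, b1)
print("on line  :", b_on)
print("off line :", b_off)
```

In the first case the old coefficients still satisfy the normal equations of the enlarged sample (the new residual is zero), so the fit is unchanged; in the second the same point drags the slope noticeably, which is leverage turning into influence.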
Multicollinearity as a geometric condition
When two or more columns of X are nearly parallel (nearly linearly dependent), the matrix X′X becomes nearly singular. Its smallest eigenvalue approaches zero; its condition number explodes; the inverse (X′X)⁻¹ has very large entries. The OLS formula β̂ = (X′X)⁻¹X′Y then propagates SMALL changes in Y into LARGE changes in β̂: the estimator is UNSTABLE.
The geometric reading is direct. If columns x_1, x_2 are nearly parallel, the 2-D parallelogram they span is nearly DEGENERATE, almost a 1-D line. To express the projection of Y as a linear combination of x_1 and x_2, the coefficients can be enormous (and of opposite signs), with most of the magnitude cancelling. Small perturbations of Y shift which "side" of the cancellation wins, so β̂ jumps wildly.
The §4.3 diagnostics, VIFs (variance inflation factors) and condition indices, are quantitative versions of "how nearly parallel are the columns?" The standard formulas come from the same geometry: VIF_j = 1 / (1 − R_j²), where R_j² is the R² from regressing x_j on the OTHER predictors. A large VIF means a large fraction of x_j's variation is explained by the other columns, i.e., x_j is nearly in the span of the rest. The remedies are also geometric: drop a redundant column (reduce p), centre and orthogonalise the predictors, or move to ridge regression (Part 9 §9.2), which adds λI to X′X to push the smallest eigenvalue safely away from zero, a geometric modification of the projection itself.
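The effect of near-parallel columns on the condition number and on the VIF can be seen in a small simulation. The sketch assumes NumPy; the helper `diagnostics` and the noise scale `delta` controlling how parallel the two predictors are, are hypothetical devices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)

def diagnostics(delta):
    """Condition number of X'X and VIF of x2 when x2 = x1 + delta * noise."""
    x2 = x1 + delta * rng.normal(size=n)  # smaller delta -> more nearly parallel
    X = np.column_stack([np.ones(n), x1, x2])
    cond = np.linalg.cond(X.T @ X)
    # VIF for x2: regress x2 on the other columns (intercept and x1)
    Z = X[:, :2]
    g, *_ = np.linalg.lstsq(Z, x2, rcond=None)
    resid = x2 - Z @ g
    r2_j = 1 - (resid @ resid) / np.sum((x2 - x2.mean()) ** 2)
    return cond, 1 / (1 - r2_j)

cond_loose, vif_loose = diagnostics(1.0)    # moderately correlated columns
cond_tight, vif_tight = diagnostics(0.01)   # nearly parallel columns

print(cond_loose, vif_loose)
print(cond_tight, vif_tight)
```

Both numbers grow by orders of magnitude as `delta` shrinks, which is the quantitative face of the degenerate parallelogram described above.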
The Gauss-Markov theorem: OLS is BLUE
So far the discussion has been purely about the SAMPLE FACTS: the geometry of one observed (Y, X). To talk about properties like UNBIASEDNESS and MINIMUM VARIANCE we need an assumed sampling-distribution model. The Gauss-Markov (1809–1821, formalised by Markov 1900) framework places five assumptions on the model and its error vector ε:
- Linearity: 𝔼[Y | X] = Xβ. The conditional mean of Y is exactly the linear combination Xβ. No omitted nonlinear terms.
- Strict exogeneity: 𝔼[ε | X] = 0. The errors have mean zero given the predictors. (Equivalently, the predictors carry no information about the errors' direction; they are uncorrelated, in a stronger conditional-mean sense.)
- Homoscedasticity: Var(ε_i | X) = σ² for every i. The errors have the same variance regardless of the predictor values.
- No autocorrelation: Cov(ε_i, ε_j | X) = 0 for i ≠ j. The errors are uncorrelated across observations.
- No perfect multicollinearity: rank(X) = p. The columns of X are linearly independent, so X′X is invertible and β̂ is well-defined.
Under these five assumptions, the GAUSS-MARKOV THEOREM (Gauss 1809; Markov 1900; cf. Lehmann-Casella 1998 ch.3) states:
The OLS estimator β̂ is the BLUE (Best Linear Unbiased Estimator) of β. That is: among all estimators β̃ that (i) are LINEAR in Y and (ii) are UNBIASED for β (have 𝔼[β̃ | X] = β for every value of β), the OLS estimator has the minimum variance.
"Linear in Y" means β̃ = CY for some p × n matrix C that is non-random given X; the natural example is C = (X′X)⁻¹X′ for OLS, or any other C for a different linear estimator. "Minimum variance" is in the matrix sense: Var(β̃ | X) ⪰ Var(β̂ | X) as p × p matrices (i.e., the difference Var(β̃ | X) − Var(β̂ | X) is positive semi-definite).
Three caveats worth keeping in mind. First, BLUE concerns LINEAR UNBIASED estimators only. BIASED estimators (ridge, lasso) can have lower mean-squared error than OLS (this is the §1.5 bias-variance trade-off again; Part 9 develops it for regression). NONLINEAR estimators (median regression, M-estimators) can also beat OLS in MSE when the error distribution has heavy tails. Gauss-Markov says only that among the linear-unbiased class, OLS wins. Second, the theorem requires assumptions (3) and (4), homoscedasticity and no autocorrelation. When they fail, OLS is still unbiased but no longer BLUE: GLS (§4.4) is the BLUE in that broader setting. Third, the theorem says nothing about the DISTRIBUTION of ε. For exact small-sample inference (t-tests, F-tests, confidence intervals) we need the optional sixth assumption.
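The matrix-sense "minimum variance" claim can be illustrated without any simulation. The sketch below assumes NumPy and compares OLS with one competing linear unbiased estimator, a weighted least-squares rule with arbitrary positive weights (the weights and the choice of competitor are illustrative assumptions). Under Var(ε | X) = σ²I, any linear estimator CY has covariance σ²CC′, so the comparison reduces to checking that a difference of two matrices is positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# OLS as a linear map: C_ols = (X'X)^{-1} X'
C_ols = np.linalg.solve(X.T @ X, X.T)

# A competing linear estimator: C_alt = (X'WX)^{-1} X'W with arbitrary weights
w = rng.uniform(0.2, 5.0, size=n)         # illustrative positive weights, W = diag(w)
XtW = X.T * w                             # X'W via broadcasting over columns
C_alt = np.linalg.solve(XtW @ X, XtW)

# Both are unbiased for every beta because C X = I_p
print(np.allclose(C_ols @ X, np.eye(p)), np.allclose(C_alt @ X, np.eye(p)))

# Under homoscedastic, uncorrelated errors: Var(C Y | X) = sigma^2 C C'
sigma2 = 1.0
V_ols = sigma2 * C_ols @ C_ols.T          # equals sigma^2 (X'X)^{-1}
V_alt = sigma2 * C_alt @ C_alt.T

gap_eigs = np.linalg.eigvalsh(V_alt - V_ols)
print(gap_eigs)                           # all >= 0 up to rounding: OLS is no worse
```

This checks the theorem against a single competitor only; the theorem itself covers every matrix C with CX = I. If the errors were heteroscedastic with variances proportional to 1/w, the ranking would reverse, which is the GLS story of §4.4.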
The optional Normality assumption
If we add the sixth assumption
ε | X ~ 𝒩(0, σ²I),
two exact small-sample consequences follow. First, β̂ is a linear function of the Normal vector Y, hence is itself Normal:
β̂ | X ~ 𝒩(β, σ²(X′X)⁻¹).
Second, the residual sum of squares scales to a chi-square:
(n − p)σ̂²/σ² = ‖e‖²/σ² ~ χ²_{n−p},
independent of β̂ (Cochran's theorem, 1934, on independence of quadratic forms in Normal vectors). These two facts give the EXACT small-sample t-distribution for testing single coefficients ((β̂_j − β_j) / se(β̂_j) ~ t_{n−p}) and the EXACT F-distribution for testing joint hypotheses about multiple coefficients (§4.3 develops both).
Without the Normality assumption, the CLT (§0.7) takes over asymptotically. As n → ∞ with p fixed and the design well-behaved, β̂ is approximately 𝒩(β, σ²(X′X)⁻¹), and the t- and F-tests are APPROXIMATELY valid for large n. The §4.3 diagnostics include Q-Q plots of the standardised residuals (§3.5) precisely to check whether the Normality assumption is empirically defensible for any given dataset; the more departure, the more we rely on n being large.
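The quantities that feed the t-tests can be computed in a few lines. This is a minimal sketch assuming NumPy and simulated Normal errors; the sample size and coefficients are illustrative, and the sketch stops at the t statistics (turning them into p-values would need t-distribution quantiles from a statistics library, which is left out here).

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.5, 0.0])     # assumed coefficients for the simulation
Y = X @ beta_true + rng.normal(scale=1.0, size=n)

XtX_inv = np.linalg.inv(X.T @ X)          # needed explicitly for the variances
beta_hat = XtX_inv @ X.T @ Y
e = Y - X @ beta_hat

sigma2_hat = (e @ e) / (n - p)            # residual variance estimate, n - p d.o.f.
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # standard errors of each coefficient
t_stats = beta_hat / se                   # t statistics for H0: beta_j = 0

for j in range(p):
    print(f"beta_{j}: est={beta_hat[j]: .3f}  se={se[j]:.3f}  t={t_stats[j]: .2f}")
```

Under the Normality assumption each statistic is compared with a t distribution on n − p = 22 degrees of freedom; without it, the same numbers are read against the large-sample Normal approximation.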
Why "geometry first" pays off downstream
The geometric picture pays for itself in every later section of Part 4:
- §4.2 (assumptions fail). Each Gauss-Markov assumption corresponds to a GEOMETRIC property of the projection. Homoscedasticity together with no autocorrelation says Var(ε | X) = σ²I: the error covariance is SPHERICAL. Heteroscedasticity or autocorrelation distorts the sphere into an ellipsoid. Endogeneity tilts the column space relative to the truth. Each failure has a geometric signature.
- §4.3 (diagnostics). Residuals, leverage, and influence are all read off the hat matrix H. Cook's distance, DFFITS, and DFBETAS are scalar summaries of the SAME projection geometry.
- §4.4 (GLS). When Var(ε | X) = σ²Ω for known Ω, the BLUE is obtained by projecting Y onto col(X) in the inner product ⟨u, v⟩ = u′Ω⁻¹v. Same picture, different inner product.
- §4.5 (robust regression). Replace ‖·‖² with a bounded loss ρ. The "projection" is no longer Euclidean orthogonal, but the same closest-point logic recovers the M-estimator.
- §4.6 (interactions). Adding an interaction term x_1·x_2 enlarges col(X) by one dimension. The projection moves into the bigger subspace.
- §4.7 (model selection: AIC/BIC/CV). Selection chooses which columns to include, i.e., which sub-spaces of ℝⁿ to project onto. The trade-off is between fitting Y closely (big p) and avoiding over-fitting noise (the bias-variance frontier of §1.5).
- §4.8 (causal warnings). The projection geometry tells you the linear combination of predictors that best matches Y. It tells you NOTHING about which predictors CAUSE Y. Confounders, instruments, mediators, colliders: all of those concepts (Part 6) live at the population / causal level, not in the geometry of one sample.
- Part 9 (ridge / lasso). Ridge adds λI to X′X, geometrically SHRINKING the projection toward the origin (already seen in §1.5's shrinkage-cv-tuner widget). Lasso uses a non-Euclidean (ℓ₁) penalty whose level sets have CORNERS, so the projection lands on the boundary with some coefficients set to zero. Both are geometric modifications of OLS.
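Two of the perturbations in the list above, GLS and ridge, fit in one short sketch. It assumes NumPy and simulated data; the diagonal Ω, the penalty value λ = 5, and the choice to penalise the intercept along with the slope (a simplification made only to mirror the "add λI to X′X" statement) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# --- GLS: same projection picture, inner product <u, v> = u' Omega^{-1} v ---
omega_diag = rng.uniform(0.5, 4.0, size=n)     # assumed known diagonal Omega
Oinv = np.diag(1 / omega_diag)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ Y)

# Equivalent route: whiten by Omega^{-1/2}, then do an ordinary projection
s = 1 / np.sqrt(omega_diag)
beta_white, *_ = np.linalg.lstsq(X * s[:, None], Y * s, rcond=None)
print(np.allclose(beta_gls, beta_white))       # the two routes agree

# --- Ridge: add lambda * I to X'X before solving ---
lam = 5.0                                      # illustrative penalty strength
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ Y)
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))  # ridge is shorter
```

The whitening route makes the "change the inner product" reading literal: after rescaling rows by Ω^{-1/2}, GLS is an ordinary orthogonal projection again. The ridge coefficient vector is always shorter than the OLS one for λ > 0, which is the shrinkage toward the origin.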
Try it
- In the ols-projection-geometry widget, leave the defaults and rotate the view. Confirm that Ŷ (cyan) sits inside the blue plane and that the red dashed residual is perpendicular to the plane (the right-angle tick at Ŷ marks this). Read off x_1′e and x_2′e; both should be 0 to within floating-point error.
- Same widget. Verify ‖Y‖² = ‖Ŷ‖² + ‖e‖² to 3 decimal places. Move one of the Y sliders across its range and watch the three squared norms re-balance. The sum always equals ‖Y‖²: Pythagoras for orthogonal projection.
- Same widget. Click "Snap Y onto plane". Watch the residual collapse to zero, R² → 1, and Ŷ = Y. State why: when Y ∈ col(X), the projection is Y itself and the fit is exact.
- Same widget. Click "Re-roll columns" repeatedly. Notice that R² can change dramatically across different column choices. State why: different column choices give different col(X), and the angle between a fixed Y and different planes varies.
- In the hat-matrix-leverage widget, observe that on a clean sample of n points, the leverage bar chart shows all bars below the 4/n threshold and most bars near 2/n. Confirm: Σ h_ii = 2 (within rounding).
- Same widget. Drag a point far to the right of the cloud (x ≈ 14). Read off its new leverage h_ii. Verify by hand: h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)² (precise numbers depend on the sample). It should be well above 4/n. The bar chart highlights it in orange / red.
- Same widget. Click "Add high-leverage point". The point lands ON the OLS line, so its residual is approximately zero and the line barely moves. NOW drag that point vertically away from the line. The OLS line ROTATES dramatically. Conclude: leverage is the POTENTIAL to influence; realised influence also requires a non-zero residual. The §4.3 Cook's distance combines both.
- Pen-and-paper. With p = 2 for simple regression and the closed-form h_ii formula, verify Σ_i h_ii = 2. Hint: expand h_ii = 1/n + (x_i − x̄)² / S_xx with S_xx = Σ_j (x_j − x̄)², sum over i, and use Σ_i (x_i − x̄)² = S_xx. The 1/n terms sum to 1; the (x_i − x̄)² terms sum to 1. Total = 2 = p. ✓
- Pen-and-paper. Prove that the OLS residuals are orthogonal to every column of X. Hint: differentiate ‖Y − Xβ‖² in β; setting the gradient to zero gives −2X′(Y − Xβ̂) = 0, i.e. X′e = 0. The j-th row says x_j′e = 0. Geometric statement of the same fact: the residual is perpendicular to the column space, hence to every spanning vector.
- Pen-and-paper. State the Gauss-Markov theorem precisely. List the five assumptions. State what BLUE means (Best Linear Unbiased Estimator: minimum variance among the linear-unbiased class). State two regimes where OLS can be IMPROVED on in MSE: (i) by adding bias (ridge, lasso) when the bias-variance trade-off favours it; (ii) by using nonlinear estimators (M-estimators) when the error distribution is heavy-tailed.
Pause and reflect: §4.1 has reframed OLS as GEOMETRY. The data live in ℝⁿ. The columns of X span a p-dimensional subspace col(X). OLS computes the orthogonal projection of Y onto that subspace: that is the WHOLE algorithm. The hat matrix H = X(X′X)⁻¹X′ is the matrix of the projection; its diagonal entries h_ii are the leverages; trace(H) = p counts the degrees of freedom consumed by the columns. The normal equations X′Xβ̂ = X′Y and the closed-form β̂ = (X′X)⁻¹X′Y fall out of the geometric orthogonality condition X′e = 0 (or equivalently from setting the gradient of ‖Y − Xβ‖² to zero; they agree). Under five Gauss-Markov assumptions, OLS is BLUE. Adding Normality buys exact small-sample t and F. Every later section of Part 4, and a substantial chunk of Part 9, perturbs this geometric picture in one specific way. §4.2 next explores what happens when each assumption fails one at a time.
What you now know
You can write the linear-regression model Y = Xβ + ε in matrix form with explicit dimensions (Y ∈ ℝⁿ, X ∈ ℝⁿˣᵖ, β ∈ ℝᵖ, ε ∈ ℝⁿ). You can state the OLS criterion as the minimisation of ‖Y − Xβ‖² and re-cast it geometrically: Ŷ = Xβ̂ is the orthogonal projection of Y onto the column space col(X), characterised uniquely by the condition that the residual e = Y − Ŷ is perpendicular to every column of X, written compactly as X′e = 0.
You can derive the NORMAL EQUATIONS X′Xβ̂ = X′Y by substituting e = Y − Xβ̂ into X′e = 0. You know the closed-form solution β̂ = (X′X)⁻¹X′Y when X has full column rank, and that this estimator is LINEAR in Y: the p × n matrix (X′X)⁻¹X′ applied to the response vector.
You can state the HAT MATRIX H = X(X′X)⁻¹X′ and verify its three structural properties: (i) symmetric, (ii) idempotent, (iii) trace = p. You know that Ŷ = HY and e = (I − H)Y = MY, that fitted values and residuals are orthogonal vectors in ℝⁿ, and that they decompose by Pythagoras: ‖Y‖² = ‖Ŷ‖² + ‖e‖². The ols-projection-geometry widget makes this picture inhabitable in 3-D.
You can read R² as a GEOMETRIC QUANTITY: after centring, R² = cos²θ where θ is the angle between the centred response Y − Ȳ𝟏 and its projection Ŷ − Ȳ𝟏. You know the three honest caveats (monotone in p, not a model-quality measure on its own, says nothing about functional-form correctness).
You can define LEVERAGE h_ii = (H)_ii and recall: (i) 1/n ≤ h_ii ≤ 1 in a model with an intercept; (ii) Σ_i h_ii = p; (iii) average leverage p/n; (iv) Belsley-Kuh-Welsch (1980) threshold 2p/n. For simple regression the closed form is h_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)². The hat-matrix-leverage widget lets you drag points and watch h_ii update; it also demonstrates that leverage is the POTENTIAL to influence: actual influence requires non-zero residuals too.
You can state MULTICOLLINEARITY as the geometric statement that near-parallel columns of X make X′X nearly singular and (X′X)⁻¹ have very large entries, propagating small Y-perturbations into large β̂-jumps. You know that §4.3 develops VIFs and condition indices, and that ridge regression (Part 9 §9.2) is one geometric remedy (add λI to X′X).
You can state the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) and the GAUSS-MARKOV THEOREM: under those five, OLS is BLUE, i.e. minimum-variance among linear unbiased estimators of β. You know the three caveats: (i) BLUE is about the linear-unbiased class, so biased and nonlinear estimators can beat OLS in MSE; (ii) GLS (§4.4) is BLUE under heteroscedasticity / autocorrelation; (iii) BLUE says nothing about the distribution of ε, so exact small-sample inference needs the optional Normality assumption.
You can articulate the optional NORMALITY assumption and its two consequences: (i) β̂ | X ~ 𝒩(β, σ²(X′X)⁻¹) exactly; (ii) (n − p)σ̂²/σ² ~ χ²_{n−p} independently. Together they give the exact small-sample t and F tests of §4.3. Without Normality the CLT carries the same conclusions asymptotically.
You can articulate WHY GEOMETRY FIRST PAYS OFF: every later section of Part 4 (assumptions failing, diagnostics, GLS, robust regression, interactions, model selection, causal warnings) and substantial parts of Part 9 (ridge, lasso) perturb the projection geometry in one specific way. The picture you built in §4.1 is the foundation on which the rest of regression rests.
Where this lands. §4.2 takes each Gauss-Markov assumption in turn and asks "what breaks when this fails?": linearity becomes nonlinear regression; homoscedasticity becomes WLS / sandwich estimators; no autocorrelation becomes time-series corrections; etc. §4.3 develops the diagnostic toolkit (residual plots, Q-Q plots, Cook's distance, DFFITS, DFBETAS, VIFs, leverage-residual plots, every one of which is a function of H and e). §4.4 covers GLS for known and unknown covariance. §4.5 develops robust regression (M-estimators, S-estimators, MM-estimators) for heavy-tailed errors. §4.6 handles interactions, polynomials, and basis expansions as ways to enlarge col(X). §4.7 covers model selection via AIC, BIC, and cross-validation: picking which columns to include. §4.8 closes Part 4 with the causal-interpretation warnings: regression is not causation, and the geometry doesn't care which way causation runs between X and Y. Part 5 generalises to GLMs (logistic, Poisson, mixed-effects). Part 9 §9.2 develops ridge and lasso as biased estimators that beat OLS in MSE when the bias-variance trade-off favours them.
References
- Gauss, C.F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Hamburg: Perthes & Besser. (The first formal derivation of OLS in the context of orbital determination of comets and asteroids. Gauss introduces the squared-error criterion as the principle most consistent with the Normal error distribution.)
- Legendre, A.M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Paris: Firmin Didot. (Independent publication of the method of least squares, predating Gauss's 1809 book by four years. Legendre coined the name "méthode des moindres carrés" — least squares.)
- Plackett, R.L. (1972). "Studies in the history of probability and statistics. XXIX: The discovery of the method of least squares." Biometrika 59(2), 239–251. (The definitive priority study of the Legendre-Gauss dispute. Concludes that Legendre published first but Gauss had used the method privately since 1795.)
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 develops linear regression with the geometric / projection emphasis that §4.1 follows, and continues into ridge / lasso in section 3.4. The canonical modern reference for statistical learning theory.)
- James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 is the gentler companion to ESL's chapter 3: same geometric emphasis, less mathematical machinery, more applied worked examples in R. Widely used as the undergraduate / first-graduate textbook.)
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer. (Chapter 13 covers linear regression in the mathematical-statistics tradition. Compact derivations of the normal equations, Gauss-Markov, and inference under Normality.)
- Casella, G., Berger, R.L. (2002). Statistical Inference, 2nd ed. Pacific Grove: Duxbury. (Chapter 11 develops linear regression with the Gauss-Markov theorem as the central result. The graduate-level mathematical-statistics reference for the theoretical structure of §4.1.)
- Greene, W.H. (2018). Econometric Analysis, 8th ed. New York: Pearson. (The standard graduate-level econometrics reference. Chapter 3 covers OLS, chapter 4 the assumptions and Gauss-Markov, chapter 5 onwards generalised methods and finite-sample inference. The applied-econometrics complement to ESL / ISL.)
- Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (The foundational reference for leverage h_ii, Cook's distance, DFFITS, DFBETAS, VIFs, and condition indices. The 2p/n leverage threshold used in §4.1 and the diagnostics in §4.3 come from this book.)
- Lehmann, E.L., Casella, G. (1998). Theory of Point Estimation, 2nd ed. New York: Springer. (Chapter 3 contains the rigorous statement and proof of the Gauss-Markov theorem in the linear-model framework, and the broader theory of best linear unbiased estimators. The reference for the BLUE concept and its precise scope.)
- Cochran, W.G. (1934). "The distribution of quadratic forms in a normal system, with applications to the analysis of variance." Proceedings of the Cambridge Philosophical Society 30(2), 178–191. (Cochran's theorem on the independence of quadratic forms in Normal vectors — the foundational result that gives the exact t and F distributions under the Normality assumption.)