Assumptions and what breaks when they fail

Part 4 — Linear regression, done seriously

Learning objectives

Recite the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) plus the optional sixth (Normality) as a checklist, and state for each one which residual-plot panel reveals the violation, what is broken in the OLS estimator or its standard errors when it fails, and what fix is appropriate
Read the FOUR DIAGNOSTIC PANELS — residuals vs fitted (linearity + mean structure), Q-Q plot of standardised residuals (Normality), residuals vs index/order (independence), scale-location √|standardised residual| vs fitted (homoscedasticity) — recognise each panel's specific signature, and state why these four together cover most assumption violations
State the LINEARITY-FAILURE signature: curvature (U-shape or arch) in residuals vs fitted. Consequence: β̂ is BIASED — it estimates a population-weighted average slope, not any local slope. Fix: add polynomial / spline terms (§4.6), basis expansions, or transform Y (e.g. log Y when the truth is multiplicative)
State the HETEROSCEDASTICITY signature: a fan / funnel in residuals vs fitted and a rising trend in the scale-location plot. The Breusch-Pagan test (Breusch-Pagan 1979) makes this quantitative: regress squared residuals on fitted values; the resulting test statistic ≈ χ²₁ under H₀ of constant variance. Consequence: β̂ remains UNBIASED but SE(β̂) is wrong — textbook formulas understate variance where x is dispersed. Fix: White (1980) heteroscedasticity-consistent ("sandwich") standard errors; weighted least squares / GLS (§4.4); transformation
State the AUTOCORRELATION signature: snaking / runs in residuals vs index (when the index is meaningful, e.g. time-ordered observations); residual autocorrelation function with significant lag-1 spike. The Durbin-Watson statistic (Durbin-Watson 1950) DW = Σ(eᵢ − eᵢ₋₁)² / Σeᵢ² is ≈ 2 under H₀ of independence, < 1.5 under positive AR(1) autocorrelation, > 2.5 under negative. Consequence: β̂ unbiased but SE understated; effective sample size shrinks. Fix: Newey-West (1987) HAC standard errors; explicit ARMA error model; first-differencing; cluster-robust SEs
State the OUTLIER / HIGH-INFLUENCE-POINT signature: one or two points sit far off the residuals-vs-fitted cloud and curl the Q-Q plot's tail. Consequence: β̂ is BIASED toward the outlier, especially when the outlier has high LEVERAGE (§4.1) so its Cook's distance (§4.3) is large. Fix: robust regression (§4.5) — M-estimators (Huber, Tukey biweight) downweight outliers; or investigate the outlier with subject-matter justification before deciding whether to correct or remove it
State the NON-NORMALITY-OF-ERRORS signature: Q-Q plot of standardised residuals shows tails curling away from the diagonal (heavy tails) or asymmetry (skewness). Critically — the assumption is about the conditional distribution of ε given X, NOT the marginal distribution of Y. Consequence: point estimates and SEs are still consistent (CLT); SMALL-n exact t and F intervals are off. Fix: rely on the CLT for large n; bootstrap CIs (§1.7, §3.2); robust regression (§4.5) for efficiency on heavy-tailed errors
State the MULTICOLLINEARITY signature: two or more predictors highly correlated, sample correlation matrix has near-singular eigenvalues, design matrix X has nearly linearly dependent columns. Make this quantitative with the VARIANCE INFLATION FACTOR $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ where $R_j^2$ is the R² of regressing predictor j on the OTHER predictors. State the customary thresholds: VIF ≤ 5 is fine; 5 < VIF ≤ 10 is a mild concern; VIF > 10 is severe (some authors prefer VIF > 4 / VIF > 8 as analogous thresholds). Consequence: β̂_j on correlated predictors is UNSTABLE — small data perturbations swing signs and magnitudes; SEs blow up; CIs widen. Fix: drop a redundant predictor; combine via PCA on the design; ridge regression (Part 9 §9.2)
Articulate the HONEST CAVEATS that distinguish good practice from cargo-cult diagnostics: (1) Diagnostics are visual FIRST and statistical SECOND — always plot the residuals before invoking a test; (2) Multiple violations can mask each other — curvature can look like heteroscedasticity, an autocorrelated AR(1) series can look like a trend; (3) The Normality assumption is about ERRORS conditional on X, not the marginal distribution of Y; (4) Statistical tests for assumption violations (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) have LOW POWER in small n — a non-rejection at n = 30 is much weaker evidence of "assumptions hold" than a non-rejection at n = 1000
Memorise the FIX MATRIX as a one-screen reference: linearity → polynomials / splines / log transforms (§4.6); heteroscedasticity → White SEs / WLS-GLS (§4.4); autocorrelation → Newey-West SEs / ARMA model; outliers / heavy tails → robust regression (§4.5) / bootstrap CIs; multicollinearity → drop variables / PCA / ridge (Part 9). Recognise that each fix preserves the GEOMETRIC picture of §4.1 — sometimes by changing the inner product (GLS), sometimes by changing the loss (robust), sometimes by adding a penalty (ridge), sometimes by enlarging col(X) (splines)
Read the catalogue of seminal references: White (1980) for heteroscedasticity-robust standard errors; Breusch-Pagan (1979) for the homoscedasticity test; Durbin-Watson (1950) for the autocorrelation test; Newey-West (1987) for HAC standard errors; Belsley-Kuh-Welsch (1980) and Cook-Weisberg (1982) for the residual-diagnostic toolkit; Greene (2018) chs. 4 and 9 for the econometric summary; Hastie-Tibshirani-Friedman (2009) ch. 3 for the modern-stat-learning treatment; Wasserman (2004) ch. 13 for the compact mathematical-statistics version

§4.1 set OLS up as the orthogonal projection of $Y$ onto $\mathrm{col}(X)$ and stated the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) plus the optional sixth (Normality). Under those five OLS is BLUE; under all six it has exact small-sample t- and F-inference. §4.2 takes each assumption in turn and asks the practical question: what does its failure LOOK LIKE, in what diagnostic panel does the failure show, what is broken in OLS as a result, and what is the appropriate fix?

The framing is deliberately picture-driven. Diagnostics are visual FIRST and statistical SECOND. The four canonical residual-plot panels — residuals vs fitted, Q-Q plot of residuals, residuals vs index, scale-location — each have a specific signature for a specific violation. The two §4.2 widgets make these signatures inhabitable: the assumption-diagnostics-suite lets the reader pick a scenario (clean OLS, curvature, heteroscedasticity, autocorrelation, outlier, non-Normal errors, near-collinearity) and see all four panels populate; the vif-multicollinearity widget animates how the VIF rises and the CIs swell as the predictor correlation marches toward 1.

The §4.2 arc has seven stops, one per assumption + the wrap-up of the fix matrix. For each: the SIGNATURE in the residual plots, the CONSEQUENCE for OLS estimates and standard errors, the appropriate FIX in the regression toolkit. The geometry from §4.1 carries through every section: each fix is a specific perturbation of the projection — GLS changes the inner product, robust regression changes the loss, ridge changes the penalty, polynomial / spline terms enlarge $\mathrm{col}(X)$ .

The four canonical residual-plot panels

The diagnostic stack is built on four plots. Each one is a scatter of a residual-derived quantity against another scalar, and each has a specific tell. Let $e = Y - \hat Y$ be the residual vector and let $r_i = e_i / (\hat\sigma \sqrt{1 - h_{ii}})$ be the $i$ -th internally studentised ("standardised") residual — the residual scaled by its own estimated standard deviation under the model (the $\sqrt{1 - h_{ii}}$ factor comes from the fact that $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$ under the model, where $h_{ii}$ is the leverage from §4.1).

Residuals vs fitted ( $e_i$ vs $\hat Y_i$ ). Under all assumptions, this is a featureless cloud centred on 0. Curvature (U or arch) signals that the true regression function is nonlinear and the linear fit is misspecified — LINEARITY fails. Fanning (vertical spread grows with $\hat Y$ ) signals that the error variance depends on the fitted value — HOMOSCEDASTICITY fails. A small smoother line through the cloud (the red line in the widget) helps the eye separate signal from noise.
Q-Q plot of standardised residuals (ordered $r_i$ vs theoretical N(0, 1) quantiles). Under all six assumptions including Normality, the points lie close to a straight line of slope 1 through the origin. Tail-heaviness (the two ends curl AWAY from the diagonal in opposite directions: lower tail below the line, upper tail above it) signals heavy-tailed errors. Asymmetry (both ends bend to the SAME side of the diagonal, giving a bowed curve) signals skewness. Single off-line points at the extremes signal outliers.
Residuals vs index/order ( $e_i$ vs $i$ , when the index is meaningful — typically time-ordering for time-series data, or spatial ordering for spatial data). Under all assumptions this is a featureless cloud. Snaking — long runs of same-sign residuals — signals positive autocorrelation between adjacent errors. Alternation — residuals flip sign every step — signals negative autocorrelation. INDEPENDENCE (no autocorrelation) fails.
Scale-location plot ( $\sqrt{|r_i|}$ vs $\hat Y_i$ ). Same data as the residuals-vs-fitted panel, but the absolute-deviation transform removes sign cancellation, so fanning becomes a monotone TREND in the plot. Under homoscedasticity, this hovers around the constant $2^{1/4}\,\Gamma(3/4) / \sqrt{\pi} \approx 0.82$ , the mean of $|Z|^{1/2}$ for $Z \sim N(0, 1)$ (the nearby value $\sqrt{2 / \pi} \approx 0.80$ is the mean of $|Z|$ itself). A rising or falling trend is the cleanest visual diagnostic for heteroscedasticity, more sensitive than the residuals-vs-fitted panel.

These four panels — sometimes assembled into a 2×2 grid by statistical software (R's plot(lm.fit), statsmodels' diagnostic-plot helpers) — are the diagnostic backbone. Cook-Weisberg (1982) and Belsley-Kuh-Welsch (1980) extended the toolkit with leverage-vs-residual scatter and Cook's distance (§4.3). The §4.2 widget renders all four panels live; §4.3 dives deeper into the per-observation diagnostics (leverage, Cook's distance, DFFITS, DFBETAS) that supplement them.
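The quantities behind the four panels can be computed directly from the definitions above. The following is a minimal sketch, not the widget's implementation: the function name `panel_quantities`, the sample size, and the plotting positions $(i - 0.5)/n$ for the Q-Q panel are illustrative assumptions. It also checks numerically that $\sqrt{|r_i|}$ averages close to $\mathbb{E}|Z|^{1/2} = 2^{1/4}\,\Gamma(3/4)/\sqrt{\pi} \approx 0.82$ on clean data.

```python
import math
from statistics import NormalDist

import numpy as np


def panel_quantities(X, y):
    """Compute the (x, y) pairs that each of the four residual panels plots."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    fitted = X @ beta
    e = y - fitted                                  # raw residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # leverages h_ii = x_i^T (X^T X)^{-1} x_i
    sigma_hat = math.sqrt(e @ e / (n - p))          # residual standard error
    r = e / (sigma_hat * np.sqrt(1.0 - h))          # standardised residuals
    probs = (np.arange(1, n + 1) - 0.5) / n         # assumed plotting positions
    theo = np.array([NormalDist().inv_cdf(q) for q in probs])
    return {
        "resid_vs_fitted": (fitted, e),
        "qq": (theo, np.sort(r)),
        "resid_vs_index": (np.arange(n), e),
        "scale_location": (fitted, np.sqrt(np.abs(r))),
        "leverage": h,
    }


# Clean scenario: Y = 1 + 0.75 x + eps, eps ~ N(0, 0.9^2); n is large so averages are stable.
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1 + 0.75 * x + rng.normal(0, 0.9, n)

panels = panel_quantities(X, y)
scale_loc_mean = panels["scale_location"][1].mean()
baseline = 2 ** 0.25 * math.gamma(0.75) / math.sqrt(math.pi)   # E|Z|^(1/2), Z ~ N(0, 1)
print(round(scale_loc_mean, 3), round(baseline, 3))
```

Each dictionary entry is one panel's scatter; passing the pairs to any plotting library reproduces the 2×2 grid.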

The first widget gives the reader a switchboard of scenarios. Pick one, and the widget generates $n = 80$ observations from the corresponding data-generating process, fits the (mis)specified OLS model, and renders all four residual panels. The seven scenarios are:

Clean OLS — linear truth $Y = 1 + 0.75x + \varepsilon$ , $\varepsilon \sim \mathcal{N}(0, 0.9^2)$ . All four panels look featureless; this is the baseline you compare every violation against.
Linearity fails (curvature) — the truth is quadratic, the fit is straight-line. The residuals-vs-fitted panel shows a clear U-shape.
Homoscedasticity fails (fanning variance) — error variance grows with $x$ . The residuals-vs-fitted panel fans; the scale-location plot rises monotonically; Breusch-Pagan flags above the $\chi^2_{1, 0.05} = 3.84$ critical value.
Independence fails (AR(1) autocorrelation) — $\varepsilon_t = 0.75 \varepsilon_{t-1} + \eta_t$ . The residuals-vs-index panel shows long same-sign runs; Durbin-Watson drops well below 2.
Single high-influence outlier — clean data plus one point at large $x$ with a large vertical deviation. The Q-Q plot shows a single off-line tail point; residuals-vs-fitted reveals the lone outlier.
Normality fails (heavy-tailed errors) — errors drawn from Student-t with df = 3. The Q-Q plot curls at both ends; the residuals-vs-fitted panel looks ordinary; skewness ≈ 0 but excess kurtosis is large and positive.
Near-multicollinearity — two highly correlated predictors $x_1, x_2$ but we only fit $Y$ on $x_1$ ; $x_2$ 's contribution leaks into the residuals. (The structural collinearity story is the second widget.)

Things to verify in the widget:

On the "clean" scenario, all four panels are featureless; Breusch-Pagan and Durbin-Watson stay in their nominal ranges; skewness and excess kurtosis hover near 0. Click "New sample" repeatedly — every sample looks roughly the same. This is the BASELINE.
Switch to "curvature". The residuals-vs-fitted panel shows a clear U-shape: residuals are positive at the ends, negative in the middle. The Q-Q plot may look only mildly disturbed. The lesson: linearity failures show up in the residuals-vs-fitted panel, not the Q-Q plot.
Switch to "heteroscedasticity". The residuals-vs-fitted panel fans open with $x$ ; the scale-location plot rises monotonically. The Breusch-Pagan statistic typically lands above $\chi^2_{1, 0.05} \approx 3.84$ , formally flagging the heteroscedasticity. Note σ̂ in the status panel — the SINGLE-NUMBER residual SE is a misleading summary when variance changes with $x$ .
Switch to "autocorrelated". The residuals-vs-index panel shows a SLOW SNAKE through the index — runs of 5-10 same-sign residuals. Durbin-Watson drops well below 2 (often into the 0.5-1.0 range). The Q-Q plot looks normal-ish because the marginal distribution of an AR(1) is still Normal — autocorrelation hides in the ORDERED structure, which is exactly what the residuals-vs-index panel exposes.
Switch to "outlier". One point sits at $x \approx 13.5, y \approx -2.5$ — far from the trend line. The residuals-vs-fitted panel shows a lone point well off the cloud; the Q-Q plot has a single tail point curling away. The fitted slope $\hat\beta_1$ is visibly pulled DOWN compared to the clean baseline — the outlier biases the estimate.
Switch to "non-normal". The visual scale of residuals matches "clean" because we matched the variance scale, but the Q-Q plot now curls at BOTH ends: heavy tails. Excess kurtosis in the status panel jumps into the 1.5-3 range. The other three panels still look fine; only the Q-Q reveals the violation.
Switch to "multicollinear". The widget fits $Y$ on $x_1$ alone but the true generative model uses $x_1, x_2$ with $\rho(x_1, x_2) \approx 0.98$ ; the omitted-but-correlated $x_2$ leaks into the residuals as extra variance. The structural collinearity story — VIF explosion, coefficient instability — is the territory of the second widget.
Click "New sample" while looking at the autocorrelated scenario. The Durbin-Watson statistic jitters from sample to sample but stays well below 2. With heavy-tailed errors, sample-to-sample variation in excess kurtosis is large: a small sample ( $n = 80$ ) combined with heavy tails gives high-variance estimates of higher moments.

Linearity fails: curvature in the residuals

If the true conditional mean is $\mathbb{E}[Y \mid X] = f(X)$ for some nonlinear $f$ , but we fit the linear model $\hat Y = X\hat\beta$ , OLS picks the linear combination of columns of $X$ that best approximates $f(X)$ in $L^2$ . The residuals $e_i = Y_i - \hat Y_i$ then carry the SHAPE of $f$ minus its best linear projection. For a quadratic $f$ , the leftover is approximately quadratic; residuals plotted against $\hat Y_i$ show a U or arch.

Consequences:

β̂ is BIASED. $\hat\beta_j$ does NOT estimate any local slope of $f$ ; it estimates a population-weighted average slope. If you care about the marginal effect of $x_j$ at a specific value, OLS on the misspecified linear model is not the right answer.
R² can still look respectable. A quadratic shape projected onto a line still explains a fraction of the variation; R² of 0.6-0.8 is common even with serious curvature. R² alone never diagnoses misspecification.
Standard errors are off too. The model is wrong, so the variance formula $\hat\sigma^2 (X^\top X)^{-1}$ does not give the actual sampling variance of $\hat\beta$ .

Fixes — three families:

Polynomial / spline terms (§4.6). Add $x^2, x^3, \ldots$ as columns of $X$ ; the column space $\mathrm{col}(X)$ grows to include the nonlinear basis functions. Geometrically: enlarge $\mathrm{col}(X)$ so the projection captures more of $f$ . Splines (piecewise polynomial bases) are a more disciplined generalisation.
Variable transformation. If the true relationship is multiplicative ( $Y = a \cdot x^b \cdot \varepsilon$ ), take $\log Y = \log a + b \log x + \log \varepsilon$ — the transformed model IS linear. Common transforms: $\log$ , $\sqrt{\cdot}$ , Box-Cox.
Generalised additive model (GAM, beyond §4.6). Fit $\hat Y = \sum_j f_j(x_j)$ with smooth per-predictor functions $f_j$ estimated nonparametrically. The §9.3 trees / random forests and §8.5 KDE-based nonparametric fits are further along the same axis.
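The curvature signature and the polynomial fix can be reproduced in a few lines. This is a rough sketch under assumed settings: the quadratic coefficient 0.2, the seed, and the helper name `ols` are illustrative choices, not the widget's data-generating process. The correlation between the residuals and $(x - \bar x)^2$ is used here as a simple numerical stand-in for "the U-shape is visible".

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


rng = np.random.default_rng(1)
n = 80
x = rng.uniform(0, 10, n)
# Assumed quadratic truth: the straight-line model is misspecified.
y = 1 + 0.75 * x + 0.2 * (x - 5) ** 2 + rng.normal(0, 0.9, n)

X_lin = np.column_stack([np.ones(n), x])             # misspecified design
X_quad = np.column_stack([np.ones(n), x, x ** 2])    # enlarged col(X)

_, e_lin = ols(X_lin, y)
_, e_quad = ols(X_quad, y)

bend = (x - x.mean()) ** 2                           # a U-shaped probe
curv_lin = np.corrcoef(e_lin, bend)[0, 1]            # large: residuals carry the U
curv_quad = np.corrcoef(e_quad, bend)[0, 1]          # ~0: the probe lies in col(X_quad)
print(round(curv_lin, 2), round(curv_quad, 2))
```

After adding the $x^2$ column the probe lies inside $\mathrm{col}(X)$, so the residuals are orthogonal to it by construction; that is the "enlarge the column space" fix in numerical form.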

Homoscedasticity fails: heteroscedastic ("fanning") variance

Homoscedasticity says $\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$ — the same for every observation. When this fails, $\mathrm{Var}(\varepsilon_i \mid X) = \sigma_i^2$ varies across observations, typically as a function of $X_i$ . The classic shape is "fanning": variance grows with the fitted value, so residuals form a wedge in the residuals-vs-fitted plot.

Quantitative diagnostic: the Breusch-Pagan test (Breusch-Pagan 1979). Regress the squared (scaled) residuals on a vector of variance-determinants (often the fitted values), compute the auxiliary $R^2_{\text{aux}}$ , and form $\mathrm{BP} = n \cdot R^2_{\text{aux}}$ . Under H₀ of homoscedasticity, BP is asymptotically $\chi^2_{k}$ , with $k$ equal to the number of variance-determinants. The widget uses the simplest variant with a single regressor (the fitted value), so BP ≈ $\chi^2_1$ ; the 5% critical value is $\chi^2_{1, 0.05} = 3.84$ .
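The $n \cdot R^2_{\text{aux}}$ recipe with the fitted value as the single variance-determinant can be sketched as follows. The function name `breusch_pagan`, the variance function (error SD proportional to $x$), and the sample size are assumptions made for illustration; a larger $n$ than the widget's 80 is used so the outcome does not hinge on one lucky draw. Rescaling the squared residuals does not change $R^2_{\text{aux}}$, so the scaling step is omitted.

```python
import numpy as np


def breusch_pagan(x, y):
    """BP = n * R^2 of the auxiliary regression of e^2 on the fitted values."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    e2 = (y - fitted) ** 2                       # squared residuals
    Z = np.column_stack([np.ones(n), fitted])    # auxiliary design
    g, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    ss_res = np.sum((e2 - Z @ g) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot)


rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, n)

y_clean = 1 + 0.75 * x + rng.normal(0, 0.9, n)       # constant variance
y_fan = 1 + 0.75 * x + rng.normal(0, 1.0, n) * 0.3 * x   # assumed fanning: SD grows with x

bp_clean = breusch_pagan(x, y_clean)
bp_fan = breusch_pagan(x, y_fan)
print(round(bp_clean, 2), round(bp_fan, 2))          # compare each with 3.84
```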

Consequences:

β̂ remains UNBIASED. The unbiasedness of OLS does not depend on the variance structure; it depends on $\mathbb{E}[\varepsilon \mid X] = 0$ . So OLS still estimates $\beta$ on average correctly. The point estimates are not what's wrong.
SE(β̂) is WRONG. The OLS variance formula $\hat\sigma^2 (X^\top X)^{-1}$ assumes constant variance. Under heteroscedasticity it typically UNDERSTATES the variance where $x$ is dispersed and OVERSTATES it where $x$ is tight. Confidence intervals and p-values built on this formula are off.
β̂ is no longer BLUE. Gauss-Markov assumed homoscedasticity. Under heteroscedasticity, OLS is still LINEAR and UNBIASED, but it is no longer the minimum-variance such estimator. GLS / WLS (§4.4) IS the new BLUE.

Fixes — three families:

White (1980) heteroscedasticity-consistent ("sandwich") SEs. Replace $\hat\sigma^2 (X^\top X)^{-1}$ with $(X^\top X)^{-1} \bigl(\sum_i e_i^2 \, x_i x_i^\top \bigr) (X^\top X)^{-1}$ . This keeps the OLS point estimate and corrects the variance. Standard in modern econometrics and supported by the major regression packages (R: vcovHC in the sandwich package; Python statsmodels: cov_type="HC0"/"HC3"; Stata: robust). Generalises to clustered SEs when observations cluster.
Weighted least squares (WLS) / GLS (§4.4). If the variance structure is known (or can be estimated from auxiliary regression on $e_i^2$ ), weight each observation inversely to its variance. The resulting estimator IS the BLUE under heteroscedasticity (the analogue of Gauss-Markov for GLS).
Variable transformation. If the variance grows with the mean (Poisson-like), $\sqrt{Y}$ or $\log Y$ can stabilise the variance. Often combined with a sensible scientific reframing (e.g., model $\log\text{Income}$ instead of $\text{Income}$ ).
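The sandwich formula is short enough to write out by hand, which makes it easy to see where the classical formula goes wrong. The sketch below implements the plain HC0 version in NumPy; the function name `ols_with_se`, the variance function (error SD proportional to $x$), and the large $n$ are illustrative assumptions. In practice a library routine such as statsmodels' `cov_type="HC3"` would normally be preferred.

```python
import numpy as np


def ols_with_se(X, y):
    """OLS coefficients with classical and White HC0 standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    # Classical: sigma_hat^2 (X^T X)^{-1}
    se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - p))
    # Sandwich: (X^T X)^{-1} [sum_i e_i^2 x_i x_i^T] (X^T X)^{-1}
    meat = (X * e[:, None] ** 2).T @ X
    se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_hc0


rng = np.random.default_rng(3)
n = 20000                                   # large n so the comparison is stable
x = rng.uniform(0, 10, n)
y = 1 + 0.75 * x + rng.normal(0, 1.0, n) * 0.3 * x   # assumed fanning errors
X = np.column_stack([np.ones(n), x])

beta, se_classical, se_hc0 = ols_with_se(X, y)
print("classical:", se_classical.round(4))
print("HC0:      ", se_hc0.round(4))
```

In this assumed design the classical formula overstates the intercept's SE (the intercept is pinned down by low-variance points near $x = 0$) and understates the slope's SE, while the point estimate is identical under both.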

Independence fails: autocorrelation in the errors

No-autocorrelation says $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \ne j$ . Failures come most often when observations have an INTRINSIC ORDER — time-series, spatial, hierarchical. An AR(1) error process $\varepsilon_t = \rho \varepsilon_{t-1} + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma^2)$ and $|\rho| < 1$ is the canonical example.

Quantitative diagnostic — Durbin-Watson (Durbin-Watson 1950): $\mathrm{DW} = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2}$ . Under H₀ of no autocorrelation, DW ≈ 2. Under AR(1) with autocorrelation $\rho$ , DW ≈ 2(1 − $\rho$ ), so positive autocorrelation ( $\rho > 0$ ) drives DW below 2, negative $\rho$ drives DW above 2. The exact distribution depends on the design $X$ ; standard tables give upper and lower bounds for the DW critical region.
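The approximation DW ≈ 2(1 − $\rho$ ) is easy to check by simulation. The sketch below is illustrative only: the helper names `durbin_watson` and `ar1_errors`, the seed, and the long series length (chosen so that the statistic sits close to its limit) are assumptions, not the widget's settings.

```python
import numpy as np


def durbin_watson(e):
    """DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


def ar1_errors(n, rho, rng, sigma=1.0):
    """Stationary AR(1): eps_t = rho * eps_{t-1} + eta_t."""
    eta = rng.normal(0, sigma, n)
    eps = np.empty(n)
    eps[0] = eta[0] / np.sqrt(1 - rho ** 2)   # draw the first value from the stationary law
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + eta[t]
    return eps


def ols_residuals(x, y):
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(0, 10, n)

dw_white = durbin_watson(ols_residuals(x, 1 + 0.75 * x + rng.normal(0, 0.9, n)))
dw_ar1 = durbin_watson(ols_residuals(x, 1 + 0.75 * x + ar1_errors(n, 0.75, rng)))
print(round(dw_white, 2), round(dw_ar1, 2))   # compare with 2 and 2 * (1 - 0.75) = 0.5
```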

Consequences:

β̂ remains UNBIASED. Same story as heteroscedasticity — unbiasedness needs only $\mathbb{E}[\varepsilon \mid X] = 0$ .
SE(β̂) is typically UNDERSTATED. Positive autocorrelation means observations carry overlapping information; the effective sample size is much smaller than $n$ . The OLS variance formula treats the $n$ observations as independent, so it understates the actual sampling variance, sometimes by factors of 2-5×.
OLS is no longer BLUE. GLS with the correct covariance $\Omega$ would be BLUE; OLS is suboptimal.

Fixes — three families:

Newey-West (1987) HAC standard errors. Heteroscedasticity-and-Autocorrelation-Consistent: a generalisation of White (1980) sandwich SEs that includes lagged cross-products with kernel-weighted decay (Bartlett kernel by default). Keep the OLS point estimate, replace the SEs. Standard in modern time-series regression.
Explicit ARMA error model. Fit $Y = X\beta + u$ where $u$ follows an ARMA(p, q) process; $\beta$ and the ARMA parameters are estimated jointly by maximum likelihood. R's arima / auto.arima, Python's statsmodels.tsa.arima.model.ARIMA. Recovers BLUE-like efficiency when the ARMA model is correct.
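First-differencing or cluster-robust SEs. When the correlation has a known cluster structure (panel data, repeated measures), cluster-robust SEs treat each cluster as one independent unit. First-differencing removes the autocorrelation only when $\rho$ is close to 1 (near-random-walk errors); for a general AR(1), quasi-differencing $Y_t - \rho Y_{t-1}$ (the Cochrane-Orcutt transformation) is what turns the errors into white noise.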
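The Newey-West idea, a White sandwich whose middle term also includes Bartlett-weighted lagged cross-products, can be sketched directly. This is a bare-bones illustration: the function name `hac_se`, the lag choice of 20, and the design (an AR(1) regressor with AR(1) errors, both with coefficient 0.75) are assumptions, and no small-sample correction is applied.

```python
import numpy as np


def hac_se(X, e, lags):
    """Newey-West SEs with Bartlett weights w_l = 1 - l / (lags + 1)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    u = X * e[:, None]                    # row t is x_t * e_t
    S = u.T @ u                           # lag-0 term: the White (HC0) filling
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)
        G = u[l:].T @ u[:-l]              # sum_t (x_t e_t)(x_{t-l} e_{t-l})^T
        S = S + w * (G + G.T)
    return np.sqrt(np.diag(XtX_inv @ S @ XtX_inv))


def ar1(n, rho, rng):
    z = rng.normal(0, 1, n)
    out = np.empty(n)
    out[0] = z[0] / np.sqrt(1 - rho ** 2)
    for t in range(1, n):
        out[t] = rho * out[t - 1] + z[t]
    return out


rng = np.random.default_rng(5)
n = 2000
x = ar1(n, 0.75, rng)                      # assumed: a persistent regressor
y = 1 + 0.75 * x + ar1(n, 0.75, rng)       # AR(1) errors
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - X.shape[1]))
se_lag0 = hac_se(X, e, lags=0)             # reduces to White HC0
se_hac = hac_se(X, e, lags=20)
print(se_classical.round(4), se_lag0.round(4), se_hac.round(4))
```

With zero lags the estimator collapses to the heteroscedasticity-only sandwich; the lagged terms are what pick up the serial dependence.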

Outliers and heavy-tailed errors

Two related-but-distinct issues. OUTLIERS are individual observations that are inconsistent with the bulk of the data — far from the residual cloud, sometimes with high leverage (§4.1). HEAVY-TAILED ERRORS are a distributional property — the conditional distribution of $\varepsilon \mid X$ has more probability mass in the tails than Normal (e.g., Student-t with df = 3, contaminated Normal). Heavy tails generate MANY moderate outliers; a single outlier might come from a Normal distribution by sheer chance, or from a data-entry error.

Diagnostic — both show in the Q-Q plot, but with different signatures. A single outlier puts ONE point far off the diagonal at one tail. Heavy tails curl BOTH tails away from the diagonal. The §4.3 leverage-vs-residual diagnostic + Cook's distance (Cook-Weisberg 1982) supplements the Q-Q with per-observation impact measures.

Consequences:

For a single high-influence outlier: β̂ is BIASED toward the outlier, sometimes severely. The Belsley-Kuh-Welsch (1980) leverage threshold (§4.1) catches outliers with large $h_{ii}$ ; Cook's distance (§4.3) combines leverage with residual size to flag outliers that actually influence the fit.
For heavy-tailed errors: β̂ is still consistent and asymptotically Normal by the CLT, so point estimates and SEs are still valid for large $n$ . Small-n exact inference (t and F) is off, but the CLT-based asymptotics carry through. Efficiency suffers, however: under heavy tails the OLS estimator has higher variance than alternatives.

Fixes:

Robust regression (§4.5). Replace the squared-error loss $\sum e_i^2$ with a loss $\sum \rho(e_i)$ that grows more slowly in the tails. Huber (linear tails), Tukey biweight (bounded), and similar M-estimators downweight extreme residuals; the resulting estimator is much less sensitive to outliers and more efficient under heavy tails.
Bootstrap confidence intervals (§1.7, §3.2). Resample the residuals or the (X, Y) pairs and refit; the empirical distribution of $\hat\beta$ across bootstrap samples gives a CI that does not need Normality. Especially valuable for small $n$ .
Quantile regression (§8.6). Estimate the conditional MEDIAN (or another quantile) of $Y$ given $X$ instead of the mean. The median is the $L^1$ analogue of the $L^2$ mean and is far more robust to outliers and heavy tails.
Subject-matter investigation. An outlier may be a data-entry error (drop after correction), a genuine extreme observation (keep but report robust analysis alongside), or a flag for an unmodelled subgroup (model it explicitly). Statistical software gives diagnostics; only the analyst can decide what the outliers MEAN scientifically.
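How an M-estimator downweights an outlier can be seen in a small iteratively-reweighted-least-squares loop. The sketch below uses the Huber weight function with the conventional tuning constant 1.345 and a MAD scale estimate; the function name `huber_irls`, the iteration cap, and the simulated data (the clean model plus one point at $x = 13.5, y = -2.5$) are illustrative assumptions rather than the §4.5 implementation.

```python
import numpy as np


def huber_irls(X, y, c=1.345, n_iter=50, tol=1e-10):
    """Huber M-estimate via iteratively reweighted least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)            # start from OLS
    for _ in range(n_iter):
        e = y - X @ beta
        s = np.median(np.abs(e - np.median(e))) / 0.6745    # robust (MAD) scale
        a = np.abs(e) / s
        w = np.minimum(1.0, c / np.maximum(a, 1e-12))       # Huber weights: 1 inside, c/|e/s| outside
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta


rng = np.random.default_rng(6)
n = 80
x = rng.uniform(0, 10, n)
y = 1 + 0.75 * x + rng.normal(0, 0.9, n)
x = np.append(x, 13.5)                     # one high-leverage outlier
y = np.append(y, -2.5)
X = np.column_stack([np.ones(n + 1), x])

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_huber = huber_irls(X, y)
print("OLS slope:  ", round(beta_ols[1], 3))
print("Huber slope:", round(beta_huber[1], 3))
```

Huber weights act on residual size only, so a point with very high leverage can still pull the fit; that case is usually handled with bounded-influence or redescending estimators.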

Multicollinearity and the variance inflation factor

No-perfect-multicollinearity is the §4.1 algebraic condition $\mathrm{rank}(X) = p$ — without it, $X^\top X$ is singular and $\hat\beta$ is not defined. The PRACTICAL problem is NEAR-multicollinearity: columns of $X$ are not literally linearly dependent but are highly correlated, so $X^\top X$ is invertible but ill-conditioned. The §4.1 geometric reading: nearly-parallel columns span a nearly-degenerate parallelogram; coefficients on those columns must be large and opposite-signed to fit any specific $Y$ ; small perturbations of $Y$ swing the coefficients dramatically.

Quantitative diagnostic — the VARIANCE INFLATION FACTOR $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ where $R_j^2$ is the R² from regressing predictor $j$ on all the OTHER predictors. Interpretation: VIF_j is the factor by which $\mathrm{Var}(\hat\beta_j)$ is INFLATED relative to the no-collinearity case (where $R_j^2 = 0$ gives VIF = 1). Customary thresholds: VIF ≤ 5 is fine; 5 < VIF ≤ 10 is a mild concern; VIF > 10 is severe. (Some authors use VIF > 4 and VIF > 8 instead; the exact threshold is a convention, not a mathematical theorem.)

For the two-predictor case with sample correlation $r$ between predictors, both VIFs collapse to $1 / (1 - r^2)$ . As $|r| \to 1$ , both VIFs diverge. The second widget animates exactly this.
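The definition translates into code almost word for word: regress each predictor on the others and invert $1 - R_j^2$. The sketch below is illustrative (the function name `vif`, the seed, and the target correlation 0.9 are assumptions); it also confirms the two-predictor shortcut $1 / (1 - r^2)$.

```python
import numpy as np


def vif(P):
    """VIF_j = 1 / (1 - R_j^2) for each column of the predictor matrix P (no intercept column)."""
    n, k = P.shape
    out = np.empty(k)
    for j in range(k):
        target = P[:, j]
        others = np.column_stack([np.ones(n), np.delete(P, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(7)
rho = 0.9
P = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=80)

vifs = vif(P)
r = np.corrcoef(P[:, 0], P[:, 1])[0, 1]
shortcut = 1.0 / (1.0 - r ** 2)
print(vifs.round(3), round(shortcut, 3))
```

An equivalent route is to read the VIFs off the diagonal of the inverse of the predictors' sample correlation matrix.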

The second widget gives the reader a direct knob on the predictor correlation. Slide $\rho$ from 0 to 0.99 and watch the sample correlation $r(x_1, x_2)$ track it, the VIF rise (on a log scale, so VIF = 100 fits on screen), and the 95% confidence intervals for $\hat\beta_1, \hat\beta_2$ stretch ever wider — even though the TRUE coefficients stay fixed at $\beta_1 = \beta_2 = 0.5$ . Click "Rerun sample" to draw a new sample at the current $\rho$ ; the BETA POINT ESTIMATES jump around much more at high $\rho$ than at low $\rho$ . That sample-to-sample instability IS variance inflation in action.

Things to verify in the widget:

At $\rho = 0$ , the sample $r(x_1, x_2) \approx 0$ , both VIFs are very near 1, and the 95% CIs for $\hat\beta_1, \hat\beta_2$ are tight around 0.5. This is the BASELINE.
Slide $\rho$ up to 0.50. The sample $r$ tracks; both VIFs land near $1 / (1 - 0.5^2) = 1.33$ . The CIs are slightly wider but still tight. Multicollinearity is not biting yet.
Slide $\rho$ to 0.90. Sample $r \approx 0.90$ ; VIFs jump to $1 / (1 - 0.81) \approx 5.26$ . The CIs are visibly wider. We are in the "mild concern" zone.
Slide $\rho$ to 0.95. VIFs ≈ $1 / (1 - 0.9025) \approx 10.26$ — across the "severe" threshold. The CIs are now markedly wider; the bar chart shows the VIF bars crossing the red-line threshold. The status table flags the SEVERE warning.
Slide $\rho$ to 0.99. VIFs ≈ $1 / (1 - 0.9801) \approx 50$ . The CIs sprawl across most of the plot — neither coefficient is identifiable with any precision on its own. The point estimates are nearly meaningless individually; only their SUM $\hat\beta_1 + \hat\beta_2 \approx 1.0$ is stable (because that's what the data actually constrain).
At any high $\rho$ , click "Rerun sample" repeatedly. The point estimates $\hat\beta_1, \hat\beta_2$ DANCE WILDLY around their true 0.5: sometimes one is near 0.9 and the other near 0.1; sometimes both are near 0.5; sometimes one is NEGATIVE. The CIs still cover 0.5 in roughly 95% of reruns, but their width and the wild point estimates make individual interpretation impossible. At low $\rho$ , the same "Rerun" produces small jitter around 0.5.
Read the formula $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ off the widget: with only two predictors $R_j^2 = r^2$ where $r$ is the pairwise sample correlation. VIF goes to infinity as $r \to 1$ ; the "matrix singular" annotation in the widget marks $r > 0.99$ where numerical issues start to bite.
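The "Rerun sample" experiment can be imitated offline by refitting on many simulated samples and comparing the spread of the estimates. This is a sketch under assumed settings: unit error variance, $n = 80$, 400 reruns, and the function name `simulate` are all illustrative, since the widget's exact noise level is not stated here.

```python
import numpy as np


def simulate(rho, n=80, reps=400, seed=0):
    """Refit Y = 0.5 x1 + 0.5 x2 + eps on many samples; return the (beta1_hat, beta2_hat) draws."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    est = np.empty((reps, 2))
    for i in range(reps):
        P = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 0.5 * P[:, 0] + 0.5 * P[:, 1] + rng.normal(0, 1.0, n)
        X = np.column_stack([np.ones(n), P])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        est[i] = b[1:]
    return est


est_lo = simulate(0.0)
est_hi = simulate(0.99)

sd_lo = est_lo[:, 0].std()                 # spread of beta1_hat with uncorrelated predictors
sd_hi = est_hi[:, 0].std()                 # spread of beta1_hat at rho = 0.99
sd_sum_hi = est_hi.sum(axis=1).std()       # spread of beta1_hat + beta2_hat at rho = 0.99
print(round(sd_lo, 3), round(sd_hi, 3), round(sd_sum_hi, 3))
```

The individual coefficient is far noisier at high $\rho$, while the sum stays about as tight as a single coefficient at $\rho = 0$.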

Fixes for multicollinearity:

Drop a redundant predictor. If $x_1$ and $x_2$ are measuring nearly the same construct, including both adds noise without adding information. Pick the one with cleaner measurement / better theoretical grounding.
Combine via PCA on the design. Project $(x_1, x_2)$ onto its leading principal component; use that single combined variable. The PCA component captures most of the variation; the orthogonal direction (which is what individual $\hat\beta$ s try to estimate) is the unstable part.
Ridge regression (Part 9 §9.2). Replace $\hat\beta = (X^\top X)^{-1} X^\top Y$ with $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top Y$ for some $\lambda > 0$ . The ridge penalty pushes the smallest eigenvalue of $X^\top X + \lambda I$ safely away from zero, stabilising the inverse. The trade-off is bias (toward zero) for variance reduction; cross-validation picks $\lambda$ .
Centring / re-parameterisation. When the collinearity arises from polynomial terms (e.g., $x$ and $x^2$ are correlated when $x$ doesn't straddle zero), centring $x$ at its mean before squaring reduces the VIF dramatically. Same algebra; better-conditioned design.
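The effect of centring on polynomial collinearity shows up even on a toy grid. In this sketch the grid (50 evenly spaced points on [10, 20]) and the helper name `vif_pair` are assumptions chosen for illustration; with two predictors the VIF is $1 / (1 - r^2)$.

```python
import numpy as np


def vif_pair(a, b):
    """Two-predictor VIF from the pairwise sample correlation."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r ** 2)


x = np.linspace(10, 20, 50)                # x does not straddle zero

vif_raw = vif_pair(x, x ** 2)              # x and x^2 are almost collinear

xc = x - x.mean()                          # centre before squaring
vif_centred = vif_pair(xc, xc ** 2)        # symmetric grid: correlation is essentially zero

print(round(vif_raw, 1), round(vif_centred, 4))
```

On an asymmetric sample the centred correlation would not vanish exactly, but it is typically far smaller than the uncentred one.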

Honest caveats: diagnostics are not algorithms

The four-panel residual workflow is the standard apparatus, but four caveats keep practice honest:

Plot first, test second. Statistical tests for assumption violations (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) have LOW POWER in small $n$ . A non-rejection at $n = 30$ is much weaker evidence of "assumptions hold" than a non-rejection at $n = 1000$ . Residual plots stay informative across sample sizes, whereas formal tests mislead in both directions: they miss real violations at small $n$ and flag practically negligible ones at very large $n$ .
Multiple violations can mask each other. Curvature can LOOK like heteroscedasticity (the U-shape spreads residuals more in the wings). Autocorrelation can LOOK like a trend in residuals vs index. An outlier can dominate the Q-Q plot in a way that hides the underlying heavy tails. Diagnose one assumption at a time, ideally on a refit after addressing the most obvious violation.
"Normality of residuals" ≠ "Normality of Y". The Gauss-Markov+Normality assumption is about the CONDITIONAL distribution of $\varepsilon$ given $X$ , NOT the marginal distribution of $Y$ . If $Y$ is bimodal because the population has two subgroups with different means, and the predictor $X$ codes for subgroup, then $\varepsilon \mid X$ can be unimodal and Normal even though $Y$ is bimodal. Always inspect the RESIDUAL Q-Q, not the raw $Y$ Q-Q.
Small samples lie. At $n = 20$ , a residual plot can look perfectly clean even when assumptions are violated, and conversely a plot can look bad by chance even when assumptions hold. The §0.7 CLT and §1.9 asymptotics give the regime where diagnostics become RELIABLE; below $n \approx 30-50$ , visual inspection is necessary but not sufficient. Bootstrap CIs (§3.2) are often the right safety net.
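The low-power caveat can be made concrete by estimating how often Breusch-Pagan rejects at two sample sizes under the same mild violation. The sketch below is a rough Monte Carlo under assumed settings: the error SD $1 + 0.1x$, 300 replications, and the function names are illustrative, and the statistic is the simple $n \cdot R^2_{\text{aux}}$ form with the fitted value as the only regressor.

```python
import numpy as np


def bp_stat(x, y):
    """n * R^2 from regressing squared OLS residuals on the fitted values."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    e2 = (y - fitted) ** 2
    Z = np.column_stack([np.ones(n), fitted])
    g, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    return n * (1.0 - np.sum((e2 - Z @ g) ** 2) / np.sum((e2 - e2.mean()) ** 2))


def rejection_rate(n, reps=300, seed=0, crit=3.84):
    """Share of simulated samples in which BP exceeds the 5% chi-square(1) critical value."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.uniform(0, 10, n)
        y = 1 + 0.75 * x + rng.normal(0, 1.0, n) * (1 + 0.1 * x)   # mild fanning
        hits += bp_stat(x, y) > crit
    return hits / reps


power_small = rejection_rate(30)
power_large = rejection_rate(1000)
print(power_small, power_large)
```

The same violation that is flagged almost every time at $n = 1000$ slips through in a large share of samples at $n = 30$.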

Each fix preserves the §4.1 geometry

One unifying theme — every fix in this section can be read as a specific perturbation of the §4.1 projection picture:

Polynomials / splines / transformations ENLARGE $\mathrm{col}(X)$ by adding columns. The projection lands in the bigger subspace; $\hat Y$ can track more of the true conditional mean.
GLS / WLS CHANGES THE INNER PRODUCT. Instead of orthogonal projection in the Euclidean norm $\|\cdot\|$ , project in the weighted norm $\|\cdot\|_{\Omega^{-1}}$ . Same geometric structure, different metric.
Robust regression (M-estimators, §4.5) CHANGES THE LOSS from $\|\cdot\|^2$ to $\sum_i \rho(e_i)$ with a slower-growing $\rho(\cdot)$ . The closest-point logic still applies, but in a non-Euclidean sense; the $\hat\beta$ that minimises the new loss downweights outliers.
Ridge (Part 9 §9.2) ADDS A PENALTY. The new objective is $\|Y - X\beta\|^2 + \lambda \|\beta\|^2$ ; geometrically, the level sets carve the projection differently and the estimator shrinks toward the origin.
White / Newey-West SEs KEEP THE PROJECTION but change the variance estimate. Same $\hat\beta$; replace the sandwich filling so SEs are robust to the actual covariance structure of $\varepsilon$.
Drop a predictor / PCA SHRINKS $\mathrm{col}(X)$. Project $Y$ onto a lower-dimensional subspace; coefficients become identifiable again.
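The "change the inner product" reading of WLS in the list above can be checked numerically. The sketch below assumes NumPy and a diagonal, known $\Omega$ (the simplest case); the simulated data and variable names are illustrative only. It shows that solving the weighted normal equations and running plain OLS on rows rescaled by $\Omega^{-1/2}$ give the same $\hat\beta$, and that the residual is orthogonal to $\mathrm{col}(X)$ in the weighted inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
sigma = 0.3 * x                        # assumed-known error SD, growing with x
y = 2.0 + 1.5 * x + sigma * rng.standard_normal(n)

w = 1.0 / sigma**2                     # diagonal of Omega^{-1}

# Route 1: normal equations in the Omega^{-1} inner product.
XtWX = X.T @ (w[:, None] * X)
XtWy = X.T @ (w * y)
beta_wls = np.linalg.solve(XtWX, XtWy)

# Route 2: rescale each row by 1/sigma_i, then ordinary least squares.
Xs = X / sigma[:, None]
ys = y / sigma
beta_whitened = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# The WLS residual is orthogonal to col(X) in the weighted inner product.
resid = y - X @ beta_wls
weighted_orthogonality = X.T @ (w * resid)

print("normal equations:", beta_wls)
print("whitened OLS:    ", beta_whitened)
print("X' W e:          ", weighted_orthogonality)
```

With an estimated rather than known $\Omega$ (feasible GLS, §4.4) the same algebra applies, but the weights carry their own sampling error.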

The geometric picture is the SAME backbone; the fixes are different perturbations of it. That is why §4.1 invested so much in the projection-first framing: every later section of Part 4 and large chunks of Parts 5 and 9 reduce to "perturb the projection in a specific way."

Try it

In the assumption-diagnostics-suite, start on "clean" and click "New sample" five times. Confirm: all four panels stay featureless; BP stays under 3.84; DW stays between 1.5 and 2.5; excess kurtosis stays under ~0.7 in magnitude. An occasional excursion is expected, since a 5%-level test rejects in about 1 of 20 clean samples. This is the no-violation baseline pattern.
Same widget, switch to "curvature". Identify the U-shape in the residuals-vs-fitted panel. Read off the σ̂ value and note that it summarises a SYSTEMATIC misspecification with a SINGLE NUMBER, which is misleading. State the fix: add $x^2$ as a column of $X$, which would let the projection capture the curvature.
Same widget, switch to "heteroscedastic". Identify the fan in residuals-vs-fitted AND the rising trend in scale-location. Read off the Breusch-Pagan statistic; it should land well above $\chi^2_{1, 0.05} = 3.84$. State two fixes: one that keeps $\hat\beta$ unchanged (White SEs) and one that changes $\hat\beta$ to a more efficient estimator (WLS / GLS).
Same widget, switch to "autocorrelated". Identify the snake in residuals-vs-index: long same-sign runs. Read off the Durbin-Watson statistic; it should drop well below 1.5 (typical: 0.7 to 1.2). State why the Q-Q plot still looks roughly Normal: AR(1) errors driven by Normal innovations are MARGINALLY Normal; the autocorrelation hides in the JOINT distribution of consecutive errors, which is what residuals-vs-index exposes.
Same widget, switch to "outlier". Locate the single off-cloud point in residuals-vs-fitted; locate the single off-line point in the Q-Q tails. Note that DW is unaffected (outliers are not autocorrelation) and BP may or may not be affected. State the fix: robust regression (§4.5) downweights the outlier so $\hat\beta$ moves back toward the truth.
Same widget, switch to "non-normal". Observe: the other three panels look healthy; only the Q-Q plot reveals the heavy tails (both ends curl AWAY from the diagonal). State the consequence: large-$n$ inference is fine by the CLT; small-$n$ exact t/F intervals are off. State two fixes: bootstrap CIs (preserves $\hat\beta$; corrects intervals) and robust regression (changes $\hat\beta$ to a more efficient estimator under heavy tails).
In the vif-multicollinearity widget, slide $\rho$ from 0 to 0.99 step by step. Verify: VIF tracks the formula $1 / (1 - r^2)$ where $r$ is the sample correlation. At $\rho = 0.9$, VIF ≈ 5; at $\rho = 0.95$, VIF ≈ 10; at $\rho = 0.99$, VIF ≈ 50. The CIs widen monotonically.
Same widget, set $\rho = 0.95$ and click "Rerun sample" eight times. Record the $\hat\beta_1, \hat\beta_2$ values. Note the wild scatter: sometimes one is near 0.9 while the other is near 0.1; sometimes both are 0.5. State why their SUM is much more stable: the data pin down $\beta_1 + \beta_2$ (the direction of the design's long axis), but not the individual coefficients (the direction the design barely covers).
Pen-and-paper. Derive $\mathrm{VIF}$ from scratch for the two-predictor case. Hint: $\mathrm{Var}(\hat\beta_j) = \sigma^2 \cdot [(X^\top X)^{-1}]_{jj}$; expand $(X^\top X)^{-1}$ for the centred 2×2 case and factor out the no-collinearity baseline. The remaining factor IS $1 / (1 - r^2)$ where $r$ is the predictor-predictor correlation.
Pen-and-paper. Write the FIX MATRIX as a 2×6 table: rows = "what is biased" / "what has wrong SEs"; columns = linearity-fail / heteroscedasticity / autocorrelation / multicollinearity / outliers / non-Normal-errors. For each cell, name the recommended fix. This is the one-screen reference an analyst keeps next to the keyboard.
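The $\rho = 0.95$ rerun exercise above can also be reproduced offline. The sketch below assumes NumPy and makes up its own data-generating process (true coefficients 0.5 and 0.5, unit-variance Normal noise, $n = 50$); the widget's actual settings are not stated in the text, so treat these numbers as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, reps = 50, 0.95, 2000
b1 = np.empty(reps)
b2 = np.empty(reps)

for k in range(reps):
    # Two predictors with population correlation rho.
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    # Assumed truth for this illustration: beta_1 = beta_2 = 0.5.
    y = 0.5 * x1 + 0.5 * x2 + rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x1, x2])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    b1[k], b2[k] = beta[1], beta[2]

sd_b1 = b1.std()
sd_sum = (b1 + b2).std()
corr_b = np.corrcoef(b1, b2)[0, 1]

print("SD of beta1_hat:           ", sd_b1)
print("SD of beta1_hat + beta2_hat:", sd_sum)
print("corr(beta1_hat, beta2_hat): ", corr_b)
```

The two estimates are strongly negatively correlated across samples, so their errors largely cancel in the sum. That is the numerical counterpart of the "long axis versus barely covered direction" picture.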

Pause and reflect: §4.2 has built the diagnostic stack on top of the §4.1 geometry. Each Gauss-Markov assumption has a SPECIFIC RESIDUAL-PLOT SIGNATURE — curvature in residuals-vs-fitted for linearity failure; fan in residuals-vs-fitted plus rising trend in scale-location for heteroscedasticity; snake in residuals-vs-index for autocorrelation; off-line tails in Q-Q for non-Normality; off-cloud points in residuals-vs-fitted plus off-line tail point in Q-Q for outliers; pair-wise sample-correlation $r$ near 1 (VIF > 10) for multicollinearity. Each failure has a CONSEQUENCE for OLS: linearity failure BIASES $\hat\beta$ ; heteroscedasticity / autocorrelation / heavy tails make SEs unreliable but leave $\hat\beta$ unbiased; outliers BIAS $\hat\beta$ ; multicollinearity makes $\hat\beta$ UNSTABLE (large CIs). And each failure has a FIX, all of which preserve the §4.1 projection geometry: enlarge col(X) for linearity, change the inner product for GLS, change the loss for robust regression, change the variance estimator for sandwich SEs, shrink col(X) for collinearity, add a penalty for ridge. The four-panel diagnostic + VIF check is the operational checklist; §4.3 dives deeper into per-observation diagnostics (Cook's distance, DFFITS, DFBETAS) that supplement these panels for influence detection.

What you now know

You can RECITE the five Gauss-Markov assumptions plus the optional Normality assumption, and for each one you can state which residual-plot panel reveals its violation, what is broken in OLS as a result (biased $\hat\beta$ ? wrong SEs? both?), and what fix is appropriate.

You can READ the four canonical residual-plot panels: residuals vs fitted (linearity + mean structure), Q-Q plot of standardised residuals (Normality), residuals vs index (independence), scale-location √|standardised residual| vs fitted (homoscedasticity). You recognise each panel's specific signature and you can read off the Breusch-Pagan statistic (≈ $\chi^2_1$ under H₀ of constant variance), the Durbin-Watson statistic (≈ 2 under no autocorrelation), and the skewness / excess kurtosis of standardised residuals.
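The three numeric read-outs named above can be computed directly from OLS residuals. This is a minimal sketch, assuming NumPy; the function name `diagnostics` and the three simulated scenarios are illustrative and are not the widget's own code. The Breusch-Pagan line uses the studentised $n R^2$ form with the fitted values as the single variance determinant, which is one common variant of the test.

```python
import numpy as np

def diagnostics(X, y):
    """Return (BP, DW, skewness, excess kurtosis) for an OLS fit of y on X."""
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    e = y - fitted

    # Studentised Breusch-Pagan: n * R^2 from regressing e^2 on the fitted values.
    Z = np.column_stack([np.ones(n), fitted])
    e2 = e**2
    g = np.linalg.lstsq(Z, e2, rcond=None)[0]
    ss_res = np.sum((e2 - Z @ g) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    bp = n * (1.0 - ss_res / ss_tot)

    # Durbin-Watson: squared successive differences over the residual sum of squares.
    dw = np.sum(np.diff(e) ** 2) / np.sum(e**2)

    # Moment-based shape summaries of the standardised residuals.
    z = (e - e.mean()) / e.std()
    skew = np.mean(z**3)
    ex_kurt = np.mean(z**4) - 3.0
    return bp, dw, skew, ex_kurt

rng = np.random.default_rng(0)
n = 200
x = np.linspace(1, 10, n)
X = np.column_stack([np.ones(n), x])

y_clean = 1 + 2 * x + rng.standard_normal(n)
y_fan = 1 + 2 * x + 0.5 * x * rng.standard_normal(n)   # error SD proportional to x

# AR(1) errors with coefficient 0.7 (an assumed value for the illustration).
u = rng.standard_normal(n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]
y_ar = 1 + 2 * x + eps

for name, y in [("clean", y_clean), ("fan", y_fan), ("AR(1)", y_ar)]:
    bp, dw, skew, ek = diagnostics(X, y)
    print(f"{name:6s} BP={bp:7.2f}  DW={dw:4.2f}  skew={skew:5.2f}  ex.kurt={ek:5.2f}")
```

Compare BP with the $\chi^2_1$ 5% critical value 3.84 and DW with the rough 1.5 to 2.5 band. The exact Durbin-Watson bounds depend on $n$ and the design, so the band is a rule of thumb rather than a formal test.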

You can state the LINEARITY-FAILURE story: curvature in residuals-vs-fitted; β̂ is biased; fix with polynomial / spline terms (§4.6) or transform Y. You can state the HETEROSCEDASTICITY story: fan in residuals-vs-fitted, rising scale-location, BP statistic flags it; β̂ unbiased, SEs wrong; fix with White (1980) sandwich SEs, WLS / GLS (§4.4), or transformation. You can state the AUTOCORRELATION story: snake in residuals-vs-index, DW drops well below 2 (under about 1.5 for positive autocorrelation); β̂ unbiased, SEs understated; fix with Newey-West (1987) HAC SEs, explicit ARMA model, or cluster-robust SEs.
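The "β̂ unbiased, SEs wrong" part of the heteroscedasticity story can be seen in a small simulation. The sketch below assumes NumPy and implements the basic White (1980) estimator (often labelled HC0) by hand; the function name `ols_with_ses` and the error model (SD proportional to $x^2$) are hypothetical choices for the example.

```python
import numpy as np

def ols_with_ses(X, y):
    """OLS coefficients with classical and White (HC0) standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta

    # Classical: sigma^2 (X'X)^{-1}, valid under constant variance.
    sigma2 = e @ e / (n - p)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # White (1980): (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}.
    meat = X.T @ (e[:, None] ** 2 * X)
    se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_white

rng = np.random.default_rng(0)
n, reps = 200, 400
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])

slopes, se_c, se_w = [], [], []
for _ in range(reps):
    # Assumed error model: SD grows like x^2, a strong fan.
    y = 1.0 + 0.5 * x + 0.1 * x**2 * rng.standard_normal(n)
    beta, sc, sw = ols_with_ses(X, y)
    slopes.append(beta[1])
    se_c.append(sc[1])
    se_w.append(sw[1])

empirical_sd = np.std(slopes, ddof=1)      # actual sampling SD of the slope
mean_se_classical = np.mean(se_c)
mean_se_white = np.mean(se_w)

print("mean slope estimate:", np.mean(slopes))
print("empirical SD of slope:", empirical_sd)
print("mean classical SE:    ", mean_se_classical)
print("mean White SE:        ", mean_se_white)
```

The slope estimates centre on the true 0.5 in both cases; only the reported uncertainty differs. HC0 is the simplest member of the family, and small-sample corrections (HC1 to HC3) exist and are often preferred at modest $n$.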

You can distinguish OUTLIERS (a single high-influence point) from HEAVY-TAILED ERRORS (a distributional property). Outliers bias $\hat\beta$ ; heavy tails preserve consistency but degrade efficiency and break small-n exact inference. Fixes: robust regression (§4.5) for both; bootstrap CIs (§1.7, §3.2) for the inference side of heavy tails; subject-matter investigation for any single suspect point.
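The "outliers bias $\hat\beta$, robust regression downweights them" claim can be illustrated with a hand-rolled Huber M-estimator. This is only a sketch, assuming NumPy: it uses iteratively reweighted least squares with a MAD scale estimate and the conventional tuning constant 1.345, and the function name `huber_irls` and the contaminated data set are hypothetical. §4.5 develops the method properly.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares (illustrative sketch)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust scale: median absolute deviation, rescaled to match sigma under Normality.
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        u = np.abs(r) / scale
        # Huber weights: 1 inside the threshold, c/|u| outside it.
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.standard_normal(n)   # assumed truth: intercept 1.0, slope 0.5
y[-1] += 40.0                                # one gross outlier at the right edge of the design

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = huber_irls(X, y)

print("OLS slope:  ", beta_ols[1])
print("Huber slope:", beta_huber[1])
```

Huber weights limit the pull of a large residual but not of extreme leverage in $x$, which is one reason §4.5 also covers S- and MM-estimators.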

You can state MULTICOLLINEARITY quantitatively: VIF_j = 1 / (1 − R_j²); thresholds VIF ≤ 5 fine, 5 < VIF ≤ 10 mild concern, VIF > 10 severe. For two predictors, both VIFs collapse to 1 / (1 − r²) where r is the sample correlation. Consequence: β̂ is unstable across samples; SEs blow up; CIs swell. Fix: drop a redundant predictor, combine via PCA, or use ridge (Part 9 §9.2).
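The VIF definition above translates directly into one auxiliary regression per predictor. The sketch below assumes NumPy; the function name `vif` and the three-predictor design are illustrative. It also cross-checks two standard identities: the VIFs equal the diagonal of the inverse sample correlation matrix, and with two predictors both VIFs reduce to $1 / (1 - r^2)$.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fit = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        r2 = 1.0 - np.sum((target - fit) ** 2) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 300
x1 = rng.standard_normal(n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.standard_normal(n)   # strongly tied to x1
x3 = rng.standard_normal(n)                                      # unrelated third predictor
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
vifs_from_corr = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))

# Two-predictor special case: both VIFs collapse to 1 / (1 - r^2).
r = np.corrcoef(x1, x2)[0, 1]
two_pred = vif(X[:, :2])

print("VIFs via auxiliary regressions:", vifs)
print("VIFs via inverse correlation:  ", vifs_from_corr)
print("two-predictor VIFs:", two_pred, " formula:", 1.0 / (1.0 - r**2))
```

The unrelated predictor keeps a VIF near 1 while the correlated pair is inflated, which is why VIF is reported per coefficient rather than per model.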

You can articulate the HONEST CAVEATS: diagnostics are visual first and statistical second; multiple violations can mask each other; the Normality assumption concerns the RESIDUALS conditional on X, not Y marginally; statistical tests for assumption violations have low power in small n. The assumption-diagnostics-suite widget operationalises the four-panel workflow; the vif-multicollinearity widget makes VIF inflation and CI swelling tangible.

You can articulate why the §4.1 GEOMETRY survives every fix: each fix is a specific perturbation of the projection picture — enlarge col(X) for linearity, change the inner product for GLS, change the loss for robust regression, change the variance estimator for sandwich SEs, shrink col(X) for multicollinearity, add a penalty for ridge. The projection-first framing of §4.1 is the unifying backbone for the rest of Part 4 and major chunks of Parts 5 and 9.

Where this lands. §4.3 dives deeper into per-observation diagnostics: leverage $h_{ii}$ (already introduced in §4.1), Cook's distance, DFFITS, DFBETAS, partial-residual plots, added-variable plots, and the formal hypothesis tests for individual and joint coefficient significance. §4.4 develops GLS and WLS for known and estimated covariance $\Omega$. §4.5 develops robust regression (M-estimators, S-estimators, MM-estimators) for heavy-tailed errors and outlier robustness. §4.6 handles interactions, polynomial terms, and basis expansions as concrete ways to enlarge $\mathrm{col}(X)$. §4.7 covers model selection (AIC, BIC, cross-validation). §4.8 closes Part 4 with the causal-interpretation warnings. Part 9 §9.2 develops ridge and lasso as principled fixes to multicollinearity and over-fitting.

References

White, H. (1980). "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity." Econometrica 48(4), 817–838. (The original "sandwich" estimator. Replaces $\hat\sigma^2 (X^\top X)^{-1}$ with $(X^\top X)^{-1} \bigl(\sum_i e_i^2 x_i x_i^\top\bigr) (X^\top X)^{-1}$ ; consistent for the true variance of $\hat\beta$ under heteroscedasticity. The starting point of modern robust-SE practice.)
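Breusch, T.S., Pagan, A.R. (1979). "A simple test for heteroscedasticity and random coefficient variation." Econometrica 47(5), 1287–1294. (The BP test: regress $e_i^2 / \hat\sigma^2$ on a vector of $k$ variance-determinants. The original statistic is half the explained sum of squares of that auxiliary regression; the widely used studentised variant $n R^2_{\text{aux}}$, from regressing $e_i^2$ on the same variables, is due to Koenker (1981). Both are asymptotically $\chi^2_{k}$ under H₀ of homoscedasticity. The standard formal test.)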
Durbin, J., Watson, G.S. (1950). "Testing for serial correlation in least squares regression: I." Biometrika 37(3-4), 409–428. (The DW statistic for first-order autocorrelation. Sequel papers (1951, 1971) extend the distribution theory. Standard tables give exact critical bounds depending on $n$ and the design.)
Newey, W.K., West, K.D. (1987). "A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix." Econometrica 55(3), 703–708. (HAC standard errors. Generalises White (1980) by adding lagged cross-products with a Bartlett (triangular) kernel weight, ensuring positive semi-definite estimates even with long-range autocorrelation.)
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (The foundational reference on residual diagnostics, leverage, Cook's distance, DFFITS, DFBETAS, VIFs, and condition indices. The VIF > 10 threshold used in §4.2 and the $h_{ii} > 2p/n$ threshold from §4.1 come from this book.)
Cook, R.D., Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman & Hall. (Companion volume to Belsley-Kuh-Welsch focused on residual and influence diagnostics. Cook's distance, partial-residual plots, added-variable plots, and the full diagnostic toolkit that §4.3 builds on come from this book.)
Greene, W.H. (2018). Econometric Analysis, 8th ed. New York: Pearson. (Standard graduate-level econometrics reference. Chapter 4 covers Gauss-Markov assumptions and consequences of their failure; Chapter 9 covers heteroscedasticity in depth (including White SEs and FGLS); Chapter 20 covers serial correlation and Newey-West. The applied-econometrics complement to ESL.)
Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 develops linear regression with residual diagnostics integrated into the geometric / projection presentation. Section 3.4 introduces ridge regression as the principled response to multicollinearity. The canonical modern reference for statistical learning theory.)
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer. (Chapter 13 covers linear regression in the mathematical-statistics tradition. The compact derivation of residual diagnostics and consequences of assumption failure complements the applied-econometrics treatment in Greene.)
