Assumptions and what breaks when they fail

Part 4 — Linear regression, done seriously

Learning objectives

  • Recite the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) plus the optional sixth (Normality) as a checklist, and state for each one which residual-plot panel reveals the violation, what is broken in the OLS estimator or its standard errors when it fails, and what fix is appropriate
  • Read the FOUR DIAGNOSTIC PANELS — residuals vs fitted (linearity + mean structure), Q-Q plot of standardised residuals (Normality), residuals vs index/order (independence), scale-location √|standardised residual| vs fitted (homoscedasticity) — recognise each panel's specific signature, and state why these four together cover most assumption violations
  • State the LINEARITY-FAILURE signature: curvature (U-shape or arch) in residuals vs fitted. Consequence: β̂ is BIASED — it estimates a population-weighted average slope, not any local slope. Fix: add polynomial / spline terms (§4.6), basis expansions, or transform Y (e.g. log Y when the truth is multiplicative)
  • State the HETEROSCEDASTICITY signature: a fan / funnel in residuals vs fitted and a rising trend in the scale-location plot. The Breusch-Pagan test (Breusch-Pagan 1979) makes this quantitative: regress squared residuals on fitted values; the resulting test statistic ≈ χ²₁ under H₀ of constant variance. Consequence: β̂ remains UNBIASED but SE(β̂) is wrong — textbook formulas understate variance where x is dispersed. Fix: White (1980) heteroscedasticity-consistent ("sandwich") standard errors; weighted least squares / GLS (§4.4); transformation
  • State the AUTOCORRELATION signature: snaking / runs in residuals vs index (when the index is meaningful, e.g. time-ordered observations); residual autocorrelation function with significant lag-1 spike. The Durbin-Watson statistic (Durbin-Watson 1950) DW = Σ(eᵢ − eᵢ₋₁)² / Σeᵢ² is ≈ 2 under H₀ of independence, < 1.5 under positive AR(1) autocorrelation, > 2.5 under negative. Consequence: β̂ unbiased but SE understated; effective sample size shrinks. Fix: Newey-West (1987) HAC standard errors; explicit ARMA error model; first-differencing; cluster-robust SEs
  • State the OUTLIER / HIGH-INFLUENCE-POINT signature: one or two points sit far off the residuals-vs-fitted cloud and curl the Q-Q plot's tail. Consequence: β̂ is BIASED toward the outlier, especially when the outlier has high LEVERAGE (§4.1) so its Cook's distance (§4.3) is large. Fix: robust regression (§4.5) — M-estimators (Huber, Tukey biweight) downweight outliers; or investigate the outlier with subject-matter justification before deciding whether to correct or remove it
  • State the NON-NORMALITY-OF-ERRORS signature: Q-Q plot of standardised residuals shows tails curling away from the diagonal (heavy tails) or asymmetry (skewness). Critically — the assumption is about the conditional distribution of ε given X, NOT the marginal distribution of Y. Consequence: point estimates and SEs are still consistent (CLT); SMALL-n exact t and F intervals are off. Fix: rely on the CLT for large n; bootstrap CIs (§1.7, §3.2); robust regression (§4.5) for efficiency on heavy-tailed errors
  • State the MULTICOLLINEARITY signature: two or more predictors highly correlated, sample correlation matrix has near-zero eigenvalues, design matrix X has nearly linearly dependent columns. Make this quantitative with the VARIANCE INFLATION FACTOR $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the R² of regressing predictor j on the OTHER predictors. State the customary thresholds: VIF ≤ 5 is fine; 5 < VIF ≤ 10 is a mild concern; VIF > 10 is severe (some authors prefer VIF > 4 / VIF > 8 as analogous thresholds). Consequence: β̂_j on correlated predictors is UNSTABLE: small data perturbations swing signs and magnitudes; SEs blow up; CIs widen. Fix: drop a redundant predictor; combine via PCA on the design; ridge regression (Part 9 §9.2)
  • Articulate the HONEST CAVEATS that distinguish good practice from cargo-cult diagnostics: (1) Diagnostics are visual FIRST and statistical SECOND — always plot the residuals before invoking a test; (2) Multiple violations can mask each other — curvature can look like heteroscedasticity, an autocorrelated AR(1) series can look like a trend; (3) The Normality assumption is about ERRORS conditional on X, not the marginal distribution of Y; (4) Statistical tests for assumption violations (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) have LOW POWER in small n — a non-rejection at n = 30 is much weaker evidence of "assumptions hold" than a non-rejection at n = 1000
  • Memorise the FIX MATRIX as a one-screen reference: linearity → polynomials / splines / log transforms (§4.6); heteroscedasticity → White SEs / WLS-GLS (§4.4); autocorrelation → Newey-West SEs / ARMA model; outliers / heavy tails → robust regression (§4.5) / bootstrap CIs; multicollinearity → drop variables / PCA / ridge (Part 9). Recognise that each fix preserves the GEOMETRIC picture of §4.1 — sometimes by changing the inner product (GLS), sometimes by changing the loss (robust), sometimes by adding a penalty (ridge), sometimes by enlarging col(X) (splines)
  • Read the catalogue of seminal references: White (1980) for heteroscedasticity-robust standard errors; Breusch-Pagan (1979) for the homoscedasticity test; Durbin-Watson (1950) for the autocorrelation test; Newey-West (1987) for HAC standard errors; Belsley-Kuh-Welsch (1980) and Cook-Weisberg (1982) for the residual-diagnostic toolkit; Greene (2018) chs. 4 and 9 for the econometric summary; Hastie-Tibshirani-Friedman (2009) ch. 3 for the modern-stat-learning treatment; Wasserman (2004) ch. 13 for the compact mathematical-statistics version

§4.1 set OLS up as the orthogonal projection of $Y$ onto $\mathrm{col}(X)$ and stated the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) plus the optional sixth (Normality). Under those five, OLS is BLUE; under all six it has exact small-sample t- and F-inference. §4.2 takes each assumption in turn and asks the practical question: what does its failure LOOK LIKE, in what diagnostic panel does the failure show, what is broken in OLS as a result, and what is the appropriate fix?

The framing is deliberately picture-driven. Diagnostics are visual FIRST and statistical SECOND. The four canonical residual-plot panels — residuals vs fitted, Q-Q plot of residuals, residuals vs index, scale-location — each have a specific signature for a specific violation. The two §4.2 widgets make these signatures inhabitable: the assumption-diagnostics-suite lets the reader pick a scenario (clean OLS, curvature, heteroscedasticity, autocorrelation, outlier, non-Normal errors, near-collinearity) and see all four panels populate; the vif-multicollinearity widget animates how the VIF rises and the CIs swell as the predictor correlation marches toward 1.

The §4.2 arc has seven stops, one per assumption + the wrap-up of the fix matrix. For each: the SIGNATURE in the residual plots, the CONSEQUENCE for OLS estimates and standard errors, the appropriate FIX in the regression toolkit. The geometry from §4.1 carries through every section, because each fix is a specific perturbation of the projection: GLS changes the inner product, robust regression changes the loss, ridge adds a penalty, and polynomial / spline terms enlarge $\mathrm{col}(X)$.

The four canonical residual-plot panels

The diagnostic stack is built on four plots. Each one is a scatter of a residual-derived quantity against another scalar, and each has a specific tell. Let $e = Y - \hat Y$ be the residual vector and let $r_i = e_i / (\hat\sigma \sqrt{1 - h_{ii}})$ be the $i$-th internally studentised ("standardised") residual: the residual scaled by its own estimated standard deviation under the model. The $\sqrt{1 - h_{ii}}$ factor comes from the fact that $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$ under the model, where $h_{ii}$ is the leverage from §4.1.

  • Residuals vs fitted ($e_i$ vs $\hat Y_i$). Under all assumptions, this is a featureless cloud centred on 0. Curvature (U or arch) signals that the true regression function is nonlinear and the linear fit is misspecified: LINEARITY fails. Fanning (vertical spread grows with $\hat Y$) signals that the error variance depends on the fitted value: HOMOSCEDASTICITY fails. A small smoother line through the cloud (the red line in the widget) helps the eye separate signal from noise.
  • Q-Q plot of standardised residuals (ordered $r_i$ on the vertical axis vs theoretical N(0, 1) quantiles on the horizontal axis). Under all six assumptions including Normality, the points fall close to a straight line of slope 1 through the origin. Tail-heaviness (the lower end dips BELOW the diagonal while the upper end rises ABOVE it, an S-shape) signals heavy-tailed errors. Asymmetry (both ends bend to the SAME side of the diagonal, a bow shape) signals skewness. Single off-line points at the extremes signal outliers.
  • Residuals vs index/order ($e_i$ vs $i$, when the index is meaningful, typically time-ordering for time-series data or spatial ordering for spatial data). Under all assumptions this is a featureless cloud. Snaking (long runs of same-sign residuals) signals positive autocorrelation between adjacent errors. Alternation (residuals flip sign at almost every step) signals negative autocorrelation. In both cases INDEPENDENCE (no autocorrelation) fails.
  • Scale-location plot ($\sqrt{|r_i|}$ vs $\hat Y_i$). Same data as the residuals-vs-fitted panel, but the absolute-value transform removes sign cancellation, so fanning becomes a monotone TREND in the plot. Under homoscedasticity with Normal errors, the points hover around the constant $\mathbb{E}\sqrt{|Z|} = 2^{1/4}\,\Gamma(3/4)/\sqrt{\pi} \approx 0.82$ for $Z \sim N(0, 1)$. (The nearby value $\sqrt{2/\pi} \approx 0.80$ is the mean of $|Z|$ itself, not of its square root.) A rising or falling trend is the cleanest visual diagnostic for heteroscedasticity, more sensitive than the residuals-vs-fitted panel.

These four panels — sometimes assembled into a 2×2 grid by statistical software (R's plot(lm.fit), statsmodels' diagnostic-plot helpers) — are the diagnostic backbone. Cook-Weisberg (1982) and Belsley-Kuh-Welsch (1980) extended the toolkit with leverage-vs-residual scatter and Cook's distance (§4.3). The §4.2 widget renders all four panels live; §4.3 dives deeper into the per-observation diagnostics (leverage, Cook's distance, DFFITS, DFBETAS) that supplement them.
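The quantities behind the four panels are short to compute by hand. The following is a minimal NumPy sketch, not the widget's code: the data-generating process, the seed, and the large sample size are illustrative assumptions chosen so that the scale-location baseline is easy to see.

```python
import math
import numpy as np

def standardised_residuals(X, y):
    """Internally studentised residuals r_i = e_i / (sigma_hat * sqrt(1 - h_ii))."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    # Leverages h_ii are the diagonal of the hat matrix; with X = QR they are
    # the squared row norms of Q.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    sigma_hat = math.sqrt(e @ e / (n - p))
    return e / (sigma_hat * np.sqrt(1.0 - h)), h

# Illustrative clean-OLS data (assumed DGP, large n to reduce noise).
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1 + 0.75 * x + rng.normal(0, 0.9, n)
r, h = standardised_residuals(X, y)

# Theoretical scale-location level: E sqrt|Z| = 2^(1/4) * Gamma(3/4) / sqrt(pi).
baseline = 2 ** 0.25 * math.gamma(0.75) / math.sqrt(math.pi)
observed = float(np.mean(np.sqrt(np.abs(r))))
print(f"baseline {baseline:.3f}, observed mean of sqrt|r| {observed:.3f}")
```

Plotting `r` against the fitted values, the sorted `r` against Normal quantiles, `r` against the index, and `np.sqrt(np.abs(r))` against the fitted values reproduces the four panels.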

Widget 1: the diagnostic suite

The first widget gives the reader a switchboard of scenarios. Pick one, and the widget generates $n = 80$ observations from the corresponding data-generating process, fits the (mis)specified OLS model, and renders all four residual panels. The seven scenarios are:

  • Clean OLS: linear truth $Y = 1 + 0.75x + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, 0.9^2)$. All four panels look featureless; this is the baseline you compare every violation against.
  • Linearity fails (curvature): the truth is quadratic, the fit is straight-line. The residuals-vs-fitted panel shows a clear U-shape.
  • Homoscedasticity fails (fanning variance): error variance grows with $x$. The residuals-vs-fitted panel fans; the scale-location plot rises monotonically; the Breusch-Pagan statistic lands above the $\chi^2_{1, 0.05} = 3.84$ critical value.
  • Independence fails (AR(1) autocorrelation): $\varepsilon_t = 0.75 \varepsilon_{t-1} + \eta_t$. The residuals-vs-index panel shows long same-sign runs; Durbin-Watson drops well below 2.
  • Single high-influence outlier: clean data plus one point at large $x$ with a large vertical deviation. The Q-Q plot shows a single off-line tail point; residuals-vs-fitted reveals the lone outlier.
  • Normality fails (heavy-tailed errors): errors drawn from Student-t with df = 3. The Q-Q plot curls at both ends; the residuals-vs-fitted panel looks ordinary; skewness ≈ 0 but excess kurtosis is large and positive.
  • Near-multicollinearity: two highly correlated predictors $x_1, x_2$, but we only fit $Y$ on $x_1$; $x_2$'s contribution leaks into the residuals. (The structural collinearity story is the second widget.)

[Interactive figure: Assumption Diagnostics Suite]

Things to verify in the widget:

  • On the "clean" scenario, all four panels are featureless; Breusch-Pagan and Durbin-Watson stay in their nominal ranges; skewness and excess kurtosis hover near 0. Click "New sample" repeatedly — every sample looks roughly the same. This is the BASELINE.
  • Switch to "curvature". The residuals-vs-fitted panel shows a clear U-shape: residuals are positive at the ends, negative in the middle. The Q-Q plot may look only mildly disturbed. The lesson: linearity failures show up in the residuals-vs-fitted panel, not the Q-Q plot.
  • Switch to "heteroscedasticity". The residuals-vs-fitted panel fans open with $x$; the scale-location plot rises monotonically. The Breusch-Pagan statistic typically lands above $\chi^2_{1, 0.05} \approx 3.84$, formally flagging the heteroscedasticity. Note σ̂ in the status panel: the SINGLE-NUMBER residual SE is a misleading summary when variance changes with $x$.
  • Switch to "autocorrelated". The residuals-vs-index panel shows a SLOW SNAKE through the index, with runs of 5-10 same-sign residuals. Durbin-Watson drops well below 2 (often into the 0.5-1.0 range). The Q-Q plot looks normal-ish because the marginal distribution of an AR(1) process with Normal innovations is still Normal; autocorrelation hides in the ORDERED structure, which is exactly what the residuals-vs-index panel exposes.
  • Switch to "outlier". One point sits at $x \approx 13.5$, $y \approx -2.5$, far from the trend line. The residuals-vs-fitted panel shows a lone point well off the cloud; the Q-Q plot has a single tail point curling away. The fitted slope $\hat\beta_1$ is visibly pulled DOWN compared to the clean baseline: the outlier distorts the estimate.
  • Switch to "non-normal". The visual scale of residuals matches "clean" because we matched the variance scale, but the Q-Q plot now curls at BOTH ends: heavy tails. Excess kurtosis in the status panel jumps to the 1.5-3 range. The other three panels still look fine; only the Q-Q plot reveals the violation.
  • Switch to "multicollinear". The widget fits $Y$ on $x_1$ alone but the true generative model uses $x_1, x_2$ with $\rho(x_1, x_2) \approx 0.98$; the omitted-but-correlated $x_2$ leaks into the residuals as extra variance. The structural collinearity story (VIF explosion, coefficient instability) is the territory of the second widget.
  • Click "New sample" while looking at the autocorrelated scenario. The Durbin-Watson statistic jitters from sample to sample but stays well below 2; the AR(1) approximation $2(1 - \rho)$ with $\rho = 0.75$ puts its centre near 0.5, and finite-sample estimates typically sit a little higher. With heavy-tailed errors, sample-to-sample variation in excess kurtosis is large: small $n = 80$ + heavy tails = high-variance higher-moment estimates.

Linearity fails: curvature in the residuals

If the true conditional mean is $\mathbb{E}[Y \mid X] = f(X)$ for some nonlinear $f$, but we fit the linear model $\hat Y = X\hat\beta$, OLS picks the linear combination of columns of $X$ that best approximates $f(X)$ in $L^2$. The residuals $e_i = Y_i - \hat Y_i$ then carry the SHAPE of $f$ minus its best linear projection. For a quadratic $f$, the leftover is approximately quadratic; residuals plotted against $\hat Y_i$ show a U or arch.

Consequences:

  • β̂ is BIASED. $\hat\beta_j$ does NOT estimate any local slope of $f$; it estimates a population-weighted average slope. If you care about the marginal effect of $x_j$ at a specific value, OLS on the misspecified linear model is not the right answer.
  • R² can still look respectable. A quadratic shape projected onto a line still explains a fraction of the variation; R² of 0.6-0.8 is common even with serious curvature. R² alone never diagnoses misspecification.
  • Standard errors are off too. The model is wrong, so the variance formula $\hat\sigma^2 (X^\top X)^{-1}$ does not give the actual sampling variance of $\hat\beta$.

Fixes — three families:

  • Polynomial / spline terms (§4.6). Add $x^2, x^3, \ldots$ as columns of $X$; the column space $\mathrm{col}(X)$ grows to include the nonlinear basis functions. Geometrically: enlarge $\mathrm{col}(X)$ so the projection captures more of $f$. Splines (piecewise polynomial bases) are a more disciplined generalisation.
  • Variable transformation. If the true relationship is multiplicative ($Y = a \cdot x^b \cdot \varepsilon$), take $\log Y = \log a + b \log x + \log \varepsilon$, so the transformed model IS linear. Common transforms: $\log$, $\sqrt{\cdot}$, Box-Cox.
  • Generalised additive model (GAM, beyond §4.6). Fit $\hat Y = \sum_j f_j(x_j)$ with smooth per-predictor functions $f_j$ estimated nonparametrically. The §9.3 trees / random forests and §8.5 KDE-based nonparametric fits are further along the same axis.
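The first fix, enlarging the column space with a squared term, can be checked numerically. This is a minimal sketch under assumed values (a quadratic truth, the seed, and the sample size are all illustrative): the straight-line residuals correlate strongly with a centred squared term, and that correlation vanishes once $x^2$ is a column of the design.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, n)
# Assumed quadratic truth; the straight-line model below is misspecified.
y = 1 + 0.2 * (x - 5) ** 2 + rng.normal(0, 0.9, n)

def residuals(X, y):
    """OLS residuals e = y - X beta_hat."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

X_lin = np.column_stack([np.ones(n), x])            # straight-line design
X_quad = np.column_stack([np.ones(n), x, x ** 2])   # enlarged col(X)
e_lin, e_quad = residuals(X_lin, y), residuals(X_quad, y)

# Curvature check: correlation of the residuals with a centred squared term.
curv = (x - x.mean()) ** 2
corr_lin = np.corrcoef(e_lin, curv)[0, 1]
corr_quad = np.corrcoef(e_quad, curv)[0, 1]
rss_lin, rss_quad = float(e_lin @ e_lin), float(e_quad @ e_quad)
print(f"corr with curvature: linear fit {corr_lin:.2f}, quadratic fit {corr_quad:.1e}")
print(f"RSS: linear fit {rss_lin:.1f}, quadratic fit {rss_quad:.1f}")
```

The quadratic-fit correlation is zero up to rounding because the centred squared term lies inside the enlarged column space, and OLS residuals are orthogonal to everything in that space.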

Homoscedasticity fails: heteroscedastic ("fanning") variance

Homoscedasticity says $\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$, the same for every observation. When this fails, $\mathrm{Var}(\varepsilon_i \mid X) = \sigma_i^2$ varies across observations, typically as a function of $X_i$. The classic shape is "fanning": variance grows with the fitted value, so residuals form a wedge in the residuals-vs-fitted plot.

Quantitative diagnostic: the Breusch-Pagan test (Breusch-Pagan 1979). Regress the squared residuals on a vector of variance-determinants (often the fitted values), compute the auxiliary $R^2_{\text{aux}}$, and form $\mathrm{BP} = n \cdot R^2_{\text{aux}}$. (This $n R^2$ form is Koenker's studentised variant; the original 1979 statistic is half the explained sum of squares from the same auxiliary regression on scaled squared residuals, and the two agree asymptotically under Normal errors.) Under H₀ of homoscedasticity, BP is asymptotically $\chi^2_k$, with $k$ equal to the number of variance-determinants. The widget uses the simplest variant with a single regressor (the fitted value), so BP is approximately $\chi^2_1$; the 5% critical value is $\chi^2_{1, 0.05} = 3.84$.
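The $n R^2$ recipe with the fitted value as the single variance-determinant fits in a few lines. The sketch below is hand-rolled NumPy under assumed data (the variance function, seed, and $n = 400$ are illustrative choices, larger than the widget's 80 so the contrast is unambiguous); the helper name `breusch_pagan_nr2` is hypothetical.

```python
import numpy as np

def breusch_pagan_nr2(X, y):
    """n * R^2 from regressing squared OLS residuals on [1, fitted values]."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    e2 = (y - fitted) ** 2
    Z = np.column_stack([np.ones(n), fitted])        # auxiliary design
    gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
    ss_res = np.sum((e2 - Z @ gamma) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot)

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y_homo = 1 + 0.75 * x + rng.normal(0, 0.9, n)                    # constant variance
y_hetero = 1 + 0.75 * x + rng.normal(0, 1, n) * (0.2 + 0.3 * x)  # sd grows with x

bp_homo = breusch_pagan_nr2(X, y_homo)
bp_hetero = breusch_pagan_nr2(X, y_hetero)
print(f"BP homoscedastic: {bp_homo:.2f}, BP heteroscedastic: {bp_hetero:.2f}")
# Compare each against the chi-square(1) 5% critical value, 3.84.
```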

Consequences:

  • β̂ remains UNBIASED. The unbiasedness of OLS does not depend on the variance structure; it depends on $\mathbb{E}[\varepsilon \mid X] = 0$. So OLS is still right about $\beta$ on average. The point estimates are not what's wrong.
  • SE(β̂) is WRONG. The OLS variance formula $\hat\sigma^2 (X^\top X)^{-1}$ assumes constant variance. When the error variance is larger at $x$-values far from the centre of the design, the formula UNDERSTATES the variance of the slope; when the error variance is larger near the centre, it OVERSTATES it. Confidence intervals and p-values built on this formula are off.
  • β̂ is no longer BLUE. Gauss-Markov assumed homoscedasticity. Under heteroscedasticity, OLS is still LINEAR and UNBIASED, but it is no longer the minimum-variance such estimator. GLS / WLS (§4.4) IS the new BLUE.

Fixes — three families:

  • White (1980) heteroscedasticity-consistent ("sandwich") SEs. Replace $\hat\sigma^2 (X^\top X)^{-1}$ with $(X^\top X)^{-1} \bigl(\sum_i e_i^2 \, x_i x_i^\top \bigr) (X^\top X)^{-1}$. Keeps the OLS point estimate; corrects the variance. Standard in modern econometrics and widely supported (R: vcovHC; Python statsmodels: cov_type="HC0"/"HC3"; Stata: robust). Generalises to clustered SEs when observations cluster.
  • Weighted least squares (WLS) / GLS (§4.4). If the variance structure is known (or can be estimated from an auxiliary regression on $e_i^2$), weight each observation inversely to its variance. With the correct weights, the resulting estimator IS the BLUE under heteroscedasticity (the analogue of Gauss-Markov for GLS).
  • Variable transformation. If the variance grows with the mean (Poisson-like), $\sqrt{Y}$ or $\log Y$ can stabilise the variance. Often combined with a sensible scientific reframing (e.g., model $\log \text{Income}$ instead of $\text{Income}$).
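The HC0 sandwich formula is a direct matrix computation. Here is a minimal NumPy sketch on assumed fanning data (the variance function, seed, and sample size are illustrative; the function name is hypothetical). It returns the classical and the White standard errors side by side for the same point estimate.

```python
import numpy as np

def ols_classical_and_white(X, y):
    """OLS estimate with classical and White (HC0) standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    # Classical: sigma_hat^2 (X'X)^{-1}.
    se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - p))
    # Sandwich: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}.
    meat = (X * (e ** 2)[:, None]).T @ X
    se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_classical, se_hc0

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
# Assumed fanning errors: sd is tiny near x = 0 and grows linearly with x.
y = 1 + 0.75 * x + rng.normal(0, 1, n) * (0.05 + 0.3 * x)

beta, se_classical, se_hc0 = ols_classical_and_white(X, y)
print("beta        ", np.round(beta, 3))
print("classical SE", np.round(se_classical, 4))
print("White HC0 SE", np.round(se_hc0, 4))
```

In this design the intercept is determined mostly by the low-variance observations near $x = 0$, so the classical formula, which pools all the noise into one $\hat\sigma^2$, overstates the intercept's standard error; the sandwich estimate is visibly smaller there.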

Independence fails: autocorrelation in the errors

No-autocorrelation says $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \ne j$. Failures come most often when observations have an INTRINSIC ORDER: time-series, spatial, hierarchical. An AR(1) error process $\varepsilon_t = \rho \varepsilon_{t-1} + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma^2)$ and $|\rho| < 1$ is the canonical example.

Quantitative diagnostic: Durbin-Watson (Durbin-Watson 1950), $\mathrm{DW} = \frac{\sum_{t=2}^n (e_t - e_{t-1})^2}{\sum_{t=1}^n e_t^2}$. Under H₀ of no autocorrelation, DW ≈ 2. Under AR(1) with autocorrelation $\rho$, DW ≈ $2(1 - \rho)$, so positive autocorrelation ($\rho > 0$) drives DW below 2 and negative $\rho$ drives DW above 2. The exact distribution depends on the design $X$; standard tables give upper and lower bounds for the DW critical region.
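The approximation DW ≈ $2(1 - \rho)$ is easy to check by simulation. The sketch below uses assumed settings (seed, a long series of $n = 5000$ so the approximation is tight, and a simple trend regressor); `ar1_errors` is a hypothetical helper, not a library function.

```python
import numpy as np

def durbin_watson(e):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def ar1_errors(n, rho, rng, sigma=1.0):
    """eps_t = rho * eps_{t-1} + eta_t, started from the stationary distribution."""
    eps = np.empty(n)
    eps[0] = rng.normal(0, sigma / np.sqrt(1 - rho ** 2))
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + rng.normal(0, sigma)
    return eps

rng = np.random.default_rng(2)
n = 5000
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])

dw_by_rho = {}
for rho in (0.0, 0.75, -0.5):
    y = 1 + 0.75 * x + ar1_errors(n, rho, rng)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    dw_by_rho[rho] = durbin_watson(y - X @ beta)
    print(f"rho = {rho:+.2f}: DW = {dw_by_rho[rho]:.2f}, 2(1 - rho) = {2 * (1 - rho):.2f}")
```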

Consequences:

  • β̂ remains UNBIASED. Same story as heteroscedasticity: unbiasedness needs only $\mathbb{E}[\varepsilon \mid X] = 0$.
  • SE(β̂) is UNDERSTATED under positive autocorrelation (the common case). Autocorrelation means observations carry overlapping information; the effective sample size is much smaller than $n$. The OLS variance formula treats the $n$ observations as independent, so it understates the actual sampling variance, sometimes by factors of 2-5×.
  • OLS is no longer BLUE. GLS with the correct covariance $\Omega$ would be BLUE; OLS is suboptimal.

Fixes — three families:

  • Newey-West (1987) HAC standard errors. Heteroscedasticity-and-Autocorrelation-Consistent: a generalisation of White (1980) sandwich SEs that includes lagged cross-products with kernel-weighted decay (Bartlett kernel by default). Keep the OLS point estimate, replace the SEs. Standard in modern time-series regression.
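  • Explicit ARMA error model. Fit $Y = X\beta + u$ where $u$ follows an ARMA(p, q) process, with $\beta$ and the ARMA parameters estimated jointly by maximum likelihood. R's arima / auto.arima, Python's statsmodels.tsa.arima.model.ARIMA. Recovers BLUE-like efficiency when the ARMA model is correct.
  • First-differencing or cluster-robust SEs. When the correlation has a known cluster structure (panel data, repeated measures), cluster-robust SEs treat each cluster as one independent unit. First-differencing whitens the errors only when $\rho$ is close to 1 (random-walk-like errors); for a general AR(1) with $|\rho| < 1$, the whitening transform is quasi-differencing, regressing $Y_t - \rho Y_{t-1}$ on $X_t - \rho X_{t-1}$ with $\rho$ estimated from the residuals (the Cochrane-Orcutt / Prais-Winsten approach).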
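The Newey-West recipe (White's lag-0 term plus Bartlett-weighted lagged cross-products) can be sketched directly. This is a hand-rolled NumPy illustration, not a library implementation: the function name, the lag choice of 20, the seed, and the trending regressor are all assumptions made for the example.

```python
import numpy as np

def newey_west_se(X, y, lags):
    """OLS with classical and Newey-West (Bartlett-kernel HAC) standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    se_classical = np.sqrt(np.diag(XtX_inv) * (e @ e) / (n - p))
    g = X * e[:, None]                  # rows are x_t * e_t
    S = g.T @ g                         # lag-0 term: White's "meat"
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)      # Bartlett weight, decays linearly in the lag
        G = g[l:].T @ g[:-l]            # sum_t (x_t e_t)(x_{t-l} e_{t-l})'
        S += w * (G + G.T)
    se_hac = np.sqrt(np.diag(XtX_inv @ S @ XtX_inv))
    return beta, se_classical, se_hac

# Assumed data: trend regressor with AR(1) errors, rho = 0.75.
rng = np.random.default_rng(8)
n, rho = 2000, 0.75
eps = np.empty(n)
eps[0] = rng.normal(0, 1 / np.sqrt(1 - rho ** 2))
for t in range(1, n):
    eps[t] = rho * eps[t - 1] + rng.normal()
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = 1 + 0.75 * x + eps

beta, se_classical, se_hac = newey_west_se(X, y, lags=20)
print("classical SE ", np.round(se_classical, 4))
print("Newey-West SE", np.round(se_hac, 4))
```

With `lags=0` the loop is skipped and the function returns the White HC0 standard errors, which is one way to see Newey-West as a generalisation of the sandwich estimator.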

Outliers and heavy-tailed errors

Two related-but-distinct issues. OUTLIERS are individual observations that are inconsistent with the bulk of the data: far from the residual cloud, sometimes with high leverage (§4.1). HEAVY-TAILED ERRORS are a distributional property: the conditional distribution of $\varepsilon \mid X$ has more probability mass in the tails than Normal (e.g., Student-t with df = 3, contaminated Normal). Heavy tails generate MANY moderate outliers; a single outlier might come from a Normal distribution by sheer chance, or from a data-entry error.

Diagnostic — both show in the Q-Q plot, but with different signatures. A single outlier puts ONE point far off the diagonal at one tail. Heavy tails curl BOTH tails away from the diagonal. The §4.3 leverage-vs-residual diagnostic + Cook's distance (Cook-Weisberg 1982) supplements the Q-Q with per-observation impact measures.

Consequences:

  • For a single high-influence outlier: β̂ is BIASED toward the outlier, sometimes severely. The Belsley-Kuh-Welsch (1980) leverage threshold (§4.1) catches outliers with large $h_{ii}$; Cook's distance (§4.3) combines leverage with residual size to flag outliers that actually influence the fit.
  • For heavy-tailed errors: β̂ is still consistent and asymptotically Normal by the CLT, so point estimates and SEs are still valid for large $n$. Small-n exact inference (t and F) is off, but the CLT-based asymptotics carry through. Efficiency suffers, however: under heavy tails the OLS estimator has higher variance than alternatives.

Fixes:

  • Robust regression (§4.5). Replace the squared-error loss $\sum e_i^2$ with a loss $\sum \rho(e_i)$ that grows more slowly in the tails. Huber's loss is quadratic near zero and linear beyond a threshold; Tukey's biweight is bounded outright. Both M-estimators downweight extreme residuals; the resulting estimator is much less sensitive to outliers and more efficient under heavy tails.
  • Bootstrap confidence intervals (§1.7, §3.2). Resample the residuals or the (X, Y) pairs and refit; the empirical distribution of $\hat\beta$ across bootstrap samples gives a CI that does not need Normality. Especially valuable for small $n$.
  • Quantile regression (§8.6). Estimate the conditional MEDIAN (or another quantile) of $Y$ given $X$ instead of the mean. The median is the $L^1$ analogue of the $L^2$ mean and is far more robust to outliers and heavy tails.
  • Subject-matter investigation. An outlier may be a data-entry error (drop after correction), a genuine extreme observation (keep but report robust analysis alongside), or a flag for an unmodelled subgroup (model it explicitly). Statistical software gives diagnostics; only the analyst can decide what the outliers MEAN scientifically.
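To see the downweighting at work, here is a minimal Huber M-estimator fitted by iteratively reweighted least squares. It is a sketch under stated assumptions: the tuning constant 1.345 and the MAD-based scale are common conventions rather than anything fixed by this section, the function name is hypothetical, and the data imitate the outlier scenario (clean line plus one point near $x = 13.5$, $y = -2.5$) with an arbitrary seed.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimate of beta via iteratively reweighted least squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)                 # start from OLS
    for _ in range(n_iter):
        e = y - X @ beta
        scale = np.median(np.abs(e - np.median(e))) / 0.6745    # MAD estimate of sigma
        # Huber weights: 1 inside the threshold, c * scale / |e| beyond it.
        w = 1.0 / np.maximum(np.abs(e) / (c * scale), 1.0)
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(7)
n = 80
x = rng.uniform(0, 10, n)
y = 1 + 0.75 * x + rng.normal(0, 0.9, n)
x_out = np.append(x, 13.5)       # one added point at large x ...
y_out = np.append(y, -2.5)       # ... far below the trend line
X_out = np.column_stack([np.ones(n + 1), x_out])

beta_ols, *_ = np.linalg.lstsq(X_out, y_out, rcond=None)
beta_huber = huber_irls(X_out, y_out)
print("true slope 0.75 | OLS slope", round(float(beta_ols[1]), 3),
      "| Huber slope", round(float(beta_huber[1]), 3))
```

Huber weights protect against large residuals but not against extreme leverage on its own, which is one reason the leverage and Cook's distance diagnostics of §4.3 remain useful alongside robust fits.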

Multicollinearity and the variance inflation factor

No-perfect-multicollinearity is the §4.1 algebraic condition $\mathrm{rank}(X) = p$: without it, $X^\top X$ is singular and $\hat\beta$ is not defined. The PRACTICAL problem is NEAR-multicollinearity: columns of $X$ are not literally linearly dependent but are highly correlated, so $X^\top X$ is invertible but ill-conditioned. The §4.1 geometric reading: nearly-parallel columns span a nearly-degenerate parallelogram; coefficients on those columns must be large and opposite-signed to fit any specific $Y$; small perturbations of $Y$ swing the coefficients dramatically.

Quantitative diagnostic: the VARIANCE INFLATION FACTOR $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the R² from regressing predictor $j$ on all the OTHER predictors. Interpretation: VIF_j is the factor by which $\mathrm{Var}(\hat\beta_j)$ is INFLATED relative to the no-collinearity case (where $R_j^2 = 0$ gives VIF = 1). Customary thresholds: VIF ≤ 5 is fine; 5 < VIF ≤ 10 is a mild concern; VIF > 10 is severe. (Some authors use VIF > 4 and VIF > 8 instead; the exact threshold is a convention, not a mathematical theorem.)

For the two-predictor case with sample correlation $r$ between predictors, both VIFs collapse to $1 / (1 - r^2)$. As $|r| \to 1$, both VIFs diverge. The second widget animates exactly this.
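Two equivalent ways to compute the VIFs are sketched below: the definition (regress each predictor on the others) and the shortcut that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The function names, the seed, and the three-predictor example are illustrative assumptions.

```python
import numpy as np

def vif_by_regression(P):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, k = P.shape
    out = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(P, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, P[:, j], rcond=None)
        resid = P[:, j] - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((P[:, j] - P[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def vif_from_correlation(P):
    """Same VIFs, read off the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(P, rowvar=False)))

rng = np.random.default_rng(9)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)   # strongly tied to x1
x3 = rng.normal(size=n)                                      # unrelated predictor
P = np.column_stack([x1, x2, x3])

vif_reg = vif_by_regression(P)
vif_cor = vif_from_correlation(P)
print("VIF by regression      ", np.round(vif_reg, 2))
print("VIF from corr. inverse ", np.round(vif_cor, 2))

# Two-predictor special case: both VIFs equal 1 / (1 - r^2).
r = np.corrcoef(x1, x2)[0, 1]
vif_two = vif_by_regression(P[:, :2])
print("two-predictor VIFs", np.round(vif_two, 2), "vs 1/(1 - r^2) =", round(1 / (1 - r ** 2), 2))
```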

Widget 2: VIF and CI swelling

The second widget gives the reader a direct knob on the predictor correlation. Slide $\rho$ from 0 to 0.99 and watch the sample correlation $r(x_1, x_2)$ track it, the VIF rise (on a log scale, so VIF = 100 fits on screen), and the 95% confidence intervals for $\hat\beta_1, \hat\beta_2$ stretch ever wider, even though the TRUE coefficients stay fixed at $\beta_1 = \beta_2 = 0.5$. Click "Rerun sample" to draw a new sample at the current $\rho$; the BETA POINT ESTIMATES jump around much more at high $\rho$ than at low $\rho$. That sample-to-sample instability IS variance inflation in action.

[Interactive figure: VIF Multicollinearity]

Things to verify in the widget:

  • At $\rho = 0$, the sample $r(x_1, x_2) \approx 0$, both VIFs are very near 1, and the 95% CIs for $\hat\beta_1, \hat\beta_2$ are tight around 0.5. This is the BASELINE.
  • Slide $\rho$ up to 0.50. The sample $r$ tracks; both VIFs land near $1 / (1 - 0.5^2) = 1.33$. The CIs are slightly wider but still tight. Multicollinearity is not biting yet.
  • Slide $\rho$ to 0.90. Sample $r \approx 0.90$; VIFs jump to $1 / (1 - 0.81) \approx 5.26$. The CIs are visibly wider. We are in the "mild concern" zone.
  • Slide $\rho$ to 0.95. VIFs ≈ $1 / (1 - 0.9025) \approx 10.26$, across the "severe" threshold. The CIs are now markedly wider; the bar chart shows the VIF bars crossing the red-line threshold. The status table flags the SEVERE warning.
  • Slide $\rho$ to 0.99. VIFs ≈ $1 / (1 - 0.9801) \approx 50$. The CIs sprawl across most of the plot: neither coefficient is identifiable with any precision on its own. The point estimates are nearly meaningless individually; only their SUM $\hat\beta_1 + \hat\beta_2 \approx 1.0$ is stable (because that's what the data actually constrain).
  • At any high $\rho$, click "Rerun sample" repeatedly. The point estimates $\hat\beta_1, \hat\beta_2$ DANCE WILDLY around their true 0.5: sometimes one is near 0.9, the other near 0.1; sometimes both are near 0.5; sometimes one is NEGATIVE. The CIs still cover 0.5 in roughly 95% of reruns, as they should, but their width and the wild point estimates make individual interpretation impossible. At low $\rho$, the same "Rerun" produces small jitter around 0.5.
  • Read the formula $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ off the widget: with only two predictors $R_j^2 = r^2$, where $r$ is the pairwise sample correlation. VIF goes to infinity as $|r| \to 1$; the "matrix singular" annotation in the widget marks $r > 0.99$, where numerical issues start to bite.
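The "Rerun sample" experiment can be imitated offline with a small Monte Carlo. This sketch is not the widget's code: the noise standard deviation of 1, the 500 replications, the seed, and the function name are assumptions; only $n = 80$ and $\beta_1 = \beta_2 = 0.5$ are taken from the text.

```python
import numpy as np

def simulate_betas(rho, n=80, reps=500, seed=0):
    """Draw `reps` samples at predictor correlation rho; return the (beta1, beta2) estimates."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    betas = np.empty((reps, 2))
    for k in range(reps):
        P = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        X = np.column_stack([np.ones(n), P])
        y = 0.5 * P[:, 0] + 0.5 * P[:, 1] + rng.normal(0, 1, n)   # assumed noise sd = 1
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        betas[k] = coef[1:]
    return betas

b_low = simulate_betas(rho=0.0)
b_high = simulate_betas(rho=0.95)

sd_b1_low = float(b_low[:, 0].std())
sd_b1_high = float(b_high[:, 0].std())
sd_sum_high = float(b_high.sum(axis=1).std())

print(f"sd(beta1_hat) at rho = 0.00: {sd_b1_low:.3f}")
print(f"sd(beta1_hat) at rho = 0.95: {sd_b1_high:.3f}")
print(f"sd(beta1_hat + beta2_hat) at rho = 0.95: {sd_sum_high:.3f}")
print(f"sd ratio {sd_b1_high / sd_b1_low:.2f} vs sqrt(VIF) = {np.sqrt(1 / (1 - 0.95 ** 2)):.2f}")
```

The spread of each individual coefficient grows roughly like $\sqrt{\mathrm{VIF}}$, while the spread of the sum stays small: the data pin down the sum, not the split between the two correlated predictors.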

Fixes for multicollinearity:

  • Drop a redundant predictor. If $x_1$ and $x_2$ are measuring nearly the same construct, including both adds noise without adding information. Pick the one with cleaner measurement / better theoretical grounding.
  • Combine via PCA on the design. Project $(x_1, x_2)$ onto its leading principal component; use that single combined variable. The PCA component captures most of the variation; the orthogonal direction (which is what the individual $\hat\beta$s try to estimate) is the unstable part.
  • Ridge regression (Part 9 §9.2). Replace $\hat\beta = (X^\top X)^{-1} X^\top Y$ with $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top Y$ for some $\lambda > 0$. The ridge penalty pushes the smallest eigenvalue of $X^\top X + \lambda I$ safely away from zero, stabilising the inverse. The trade-off is bias (toward zero) for variance reduction; cross-validation picks $\lambda$.
  • Centring / re-parameterisation. When the collinearity arises from polynomial terms (e.g., $x$ and $x^2$ are correlated when $x$ doesn't straddle zero), centring $x$ at its mean before squaring reduces the VIF dramatically. Same algebra; better-conditioned design.
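The centring fix is quick to verify. In this sketch the range of $x$ (10 to 20, so it does not straddle zero), the seed, the response model, and the helper name are illustrative assumptions.

```python
import numpy as np

def vif_pair(a, b):
    """VIF for a two-predictor design: 1 / (1 - r^2)."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r ** 2)

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(10, 20, n)          # x stays well away from zero
y = 2 + 0.3 * x + 0.05 * x ** 2 + rng.normal(0, 1, n)

vif_raw = vif_pair(x, x ** 2)       # x and x^2 are almost collinear
xc = x - x.mean()
vif_centred = vif_pair(xc, xc ** 2) # centred x and its square are nearly uncorrelated
print(f"VIF raw: {vif_raw:.1f}, VIF centred: {vif_centred:.2f}")

# "Same algebra": both designs span the same column space, so fitted values agree.
X_raw = np.column_stack([np.ones(n), x, x ** 2])
X_cen = np.column_stack([np.ones(n), xc, xc ** 2])
fit_raw = X_raw @ np.linalg.lstsq(X_raw, y, rcond=None)[0]
fit_cen = X_cen @ np.linalg.lstsq(X_cen, y, rcond=None)[0]
print("fitted values identical:", np.allclose(fit_raw, fit_cen))
```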

Honest caveats: diagnostics are not algorithms

The four-panel residual workflow is the standard apparatus, but four caveats keep practice honest:

  • Plot first, test second. Statistical tests for assumption violations (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) have LOW POWER in small $n$. A non-rejection at $n = 30$ is much weaker evidence of "assumptions hold" than a non-rejection at $n = 1000$. At very large $n$ the opposite problem appears: tests reject for departures too small to matter. A residual PLOT shows the size and shape of the departure directly, so it degrades more gracefully across sample sizes than a formal test does.
  • Multiple violations can mask each other. Curvature can LOOK like heteroscedasticity (the U-shape spreads residuals more in the wings). Autocorrelation can LOOK like a trend in residuals vs index. An outlier can dominate the Q-Q plot in a way that hides the underlying heavy tails. Diagnose one assumption at a time, ideally on a refit after addressing the most obvious violation.
  • "Normality of residuals" ≠ "Normality of Y". The Gauss-Markov+Normality assumption is about the CONDITIONAL distribution of $\varepsilon$ given $X$, NOT the marginal distribution of $Y$. If $Y$ is bimodal because the population has two subgroups with different means, and the predictor $X$ codes for subgroup, then $\varepsilon \mid X$ can be unimodal and Normal even though $Y$ is bimodal. Always inspect the RESIDUAL Q-Q plot, not the raw $Y$ Q-Q plot.
  • Small samples lie. At $n = 20$, a residual plot can look perfectly clean even when assumptions are violated, and conversely a plot can look bad by chance even when assumptions hold. The §0.7 CLT and §1.9 asymptotics give the regime where diagnostics become RELIABLE; below $n \approx 30$ to $50$, visual inspection is necessary but not sufficient. Bootstrap CIs (§3.2) are often the right safety net.

Each fix preserves the §4.1 geometry

One unifying theme — every fix in this section can be read as a specific perturbation of the §4.1 projection picture:

  • Polynomials / splines / transformations ENLARGE $\mathrm{col}(X)$ by adding columns. The projection lands in the bigger subspace; $\hat Y$ can track more of the true conditional mean.
  • GLS / WLS CHANGES THE INNER PRODUCT. Instead of orthogonal projection in Euclidean $\|\cdot\|_2$, project in $\|\cdot\|_{\Omega^{-1}}$. Same geometric structure, different metric.
  • Robust regression (M-estimators, §4.5) CHANGES THE LOSS from $\|\cdot\|^2$ to a slower-growing $\rho(\cdot)$ (linear or bounded in the tails). The closest-point logic still applies, but in a non-Euclidean sense; the $\hat\beta$ that minimises the new loss downweights outliers.
  • Ridge (Part 9 §9.2) ADDS A PENALTY. The new objective is $\|Y - X\beta\|^2 + \lambda \|\beta\|^2$; geometrically, the level sets carve the projection differently and the estimator shrinks toward the origin.
  • White / Newey-West SEs KEEP THE PROJECTION but change the variance estimate. Same $\hat\beta$; replace the sandwich filling so SEs are robust to the actual covariance structure of $\varepsilon$.
  • Drop a predictor / PCA SHRINKS $\mathrm{col}(X)$. Project $Y$ onto a lower-dimensional subspace; coefficients become identifiable again.

The geometric picture is the SAME backbone; the fixes are different perturbations of it. That is why §4.1 invested so much in the projection-first framing: every later section of Part 4 and large chunks of Parts 5 and 9 reduce to "perturb the projection in a specific way."
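
Three of these perturbations can be verified in a few lines. The following is a sketch under assumed settings: a simulated quadratic truth, an error standard deviation that grows with |x|, known WLS weights, and an arbitrary ridge penalty of 50 that, for simplicity, also penalises the intercept. All names are illustrative rather than taken from §4.1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
sigma = 0.2 + 0.5 * np.abs(x)   # assumed error sd, growing with |x|
y = 1.0 + 0.5 * x + 0.8 * x**2 + sigma * rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients: orthogonal projection of y onto col(X)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# 1. Enlarge col(X): adding an x^2 column can only lower the residual sum of squares.
X_lin = np.column_stack([np.ones(n), x])
X_quad = np.column_stack([np.ones(n), x, x**2])
rss_lin = np.sum((y - X_lin @ ols(X_lin, y)) ** 2)
rss_quad = np.sum((y - X_quad @ ols(X_quad, y)) ** 2)

# 2. Change the inner product: WLS with weights w_i = 1 / sigma_i^2
#    equals OLS after rescaling every row by sqrt(w_i).
w = 1.0 / sigma**2
beta_wls = np.linalg.solve(X_quad.T @ (w[:, None] * X_quad), X_quad.T @ (w * y))
sw = np.sqrt(w)
beta_rescaled = ols(sw[:, None] * X_quad, sw * y)

# 3. Add a penalty: ridge solves (X'X + lambda I) beta = X'y and shrinks the norm.
lam = 50.0   # arbitrary illustrative penalty
beta_ols = ols(X_quad, y)
beta_ridge = np.linalg.solve(X_quad.T @ X_quad + lam * np.eye(3), X_quad.T @ y)

print("RSS linear vs quadratic:", rss_lin, rss_quad)
print("WLS equals rescaled OLS:", np.allclose(beta_wls, beta_rescaled))
print("coefficient norms OLS vs ridge:", np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The WLS check is the inner-product statement in concrete form: rescaling rows by the square root of the weights turns the weighted projection back into an ordinary Euclidean one.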

Try it

  • In the assumption-diagnostics-suite, start on "clean" and click "New sample" five times. Confirm: all four panels stay featureless; BP stays under 3.84; DW stays between 1.5 and 2.5; excess kurtosis stays under ~0.7 in magnitude. This is the no-violation baseline pattern.
  • Same widget, switch to "curvature". Identify the U-shape in the residuals-vs-fitted panel. Read off the σ̂ value; note that it summarises a SYSTEMATIC misspecification with a SINGLE NUMBER, which is misleading. State the fix: add $x^2$ as a column of $X$, which would let the projection capture the curvature.
  • Same widget, switch to "heteroscedastic". Identify the fan in residuals-vs-fitted AND the rising trend in scale-location. Read off the Breusch-Pagan statistic; it should land well above $\chi^2_{1, 0.05} = 3.84$. State one fix that keeps $\hat\beta$ unchanged (White SEs) and one that changes $\hat\beta$ to a more efficient estimator (WLS / GLS).
  • Same widget, switch to "autocorrelated". Identify the snake in residuals-vs-index — long same-sign runs. Read off the Durbin-Watson statistic; it should drop well below 1.5 (typical: 0.7-1.2). State why the Q-Q plot still looks roughly Normal: AR(1) errors are MARGINALLY Normal — the autocorrelation hides in the JOINT distribution of consecutive errors, which is what residuals-vs-index exposes.
  • Same widget, switch to "outlier". Locate the single off-cloud point in residuals-vs-fitted; locate the single off-line point in the Q-Q tails. Note that DW is unaffected (outliers are not autocorrelation) and BP may or may not be affected. State the fix: robust regression (§4.5) downweights the outlier so $\hat\beta$ converges back toward the truth.
  • Same widget, switch to "non-normal". Observe: the first three panels look healthy; only the Q-Q plot reveals the heavy tails (both ends curl AWAY from the diagonal). State the consequence: large-$n$ inference is fine by the CLT; small-$n$ exact t/F intervals are off. State two fixes: bootstrap CIs (preserves $\hat\beta$; corrects intervals) and robust regression (changes $\hat\beta$ to a higher-efficiency estimator under heavy tails).
  • In the vif-multicollinearity widget, slide $\rho$ from 0 to 0.99 step by step. Verify: VIF tracks the formula $1 / (1 - r^2)$ where $r$ is the sample correlation. At $\rho = 0.9$, VIF ≈ 5; at $\rho = 0.95$, VIF ≈ 10; at $\rho = 0.99$, VIF ≈ 50. The CIs widen monotonically.
  • Same widget, set $\rho = 0.95$ and click "Rerun sample" eight times. Record the $\hat\beta_1, \hat\beta_2$ values. Note the wild scatter: sometimes one is near 0.9 while the other is near 0.1; sometimes both are 0.5. State why their SUM is much more stable: the data identify $\hat\beta_1 + \hat\beta_2$ (the direction of the design's long axis), but not the individual coefficients (the direction the design barely covers).
  • Pen-and-paper. Derive $\mathrm{VIF}_j = 1 / (1 - R_j^2)$ from scratch for the two-predictor case. Hint: $\mathrm{Var}(\hat\beta_j) = \sigma^2 \cdot (X^\top X)^{-1}_{jj}$; expand $(X^\top X)^{-1}$ for the centred 2×2 case and factor out the no-collinearity baseline. The remaining factor IS $1 / (1 - r^2)$ where $r$ is the predictor-predictor correlation.
  • Pen-and-paper. Write the FIX MATRIX as a 2×6 table: rows = "what is biased" / "what has wrong SEs"; columns = linearity-fail / heteroscedasticity / autocorrelation / multicollinearity / outliers / non-Normal-errors. For each cell, name the recommended fix. This is the one-screen reference an analyst keeps next to the keyboard.
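
For readers without the widget to hand, the "Rerun sample" exercise at a predictor correlation of 0.95 can be approximated offline. This is a rough sketch under assumptions that may differ from the widget: true coefficients (0.5, 0.5), unit error variance, n = 100 per sample, zero-mean predictors fitted without an intercept, and 500 reruns instead of eight.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, n_reps = 0.95, 100, 500
cov = np.array([[1.0, rho], [rho, 1.0]])
beta_true = np.array([0.5, 0.5])   # assumed truth, not read from the widget

estimates = np.empty((n_reps, 2))
for rep in range(n_reps):
    # Fresh sample of two highly correlated, zero-mean predictors.
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    y = X @ beta_true + rng.normal(size=n)
    estimates[rep] = np.linalg.lstsq(X, y, rcond=None)[0]

sd_b1 = estimates[:, 0].std()
sd_b2 = estimates[:, 1].std()
sd_sum = estimates.sum(axis=1).std()   # spread of beta1_hat + beta2_hat

print(f"sd of beta1_hat:             {sd_b1:.3f}")
print(f"sd of beta2_hat:             {sd_b2:.3f}")
print(f"sd of beta1_hat + beta2_hat: {sd_sum:.3f}")
```

The individual estimates scatter several times more widely than their sum, which matches the long-axis versus short-axis explanation in the exercise.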

Pause and reflect: §4.2 has built the diagnostic stack on top of the §4.1 geometry. Each Gauss-Markov assumption has a SPECIFIC RESIDUAL-PLOT SIGNATURE: curvature in residuals-vs-fitted for linearity failure; fan in residuals-vs-fitted plus rising trend in scale-location for heteroscedasticity; snake in residuals-vs-index for autocorrelation; off-line tails in Q-Q for non-Normality; off-cloud points in residuals-vs-fitted plus off-line tail point in Q-Q for outliers; pair-wise sample-correlation $r$ near 1 (VIF > 10) for multicollinearity. Each failure has a CONSEQUENCE for OLS: linearity failure BIASES $\hat\beta$; heteroscedasticity / autocorrelation / heavy tails make SEs unreliable but leave $\hat\beta$ unbiased; outliers BIAS $\hat\beta$; multicollinearity makes $\hat\beta$ UNSTABLE (large CIs). And each failure has a FIX, all of which preserve the §4.1 projection geometry: enlarge col(X) for linearity, change the inner product for GLS, change the loss for robust regression, change the variance estimator for sandwich SEs, shrink col(X) for collinearity, add a penalty for ridge. The four-panel diagnostic + VIF check is the operational checklist; §4.3 dives deeper into per-observation diagnostics (Cook's distance, DFFITS, DFBETAS) that supplement these panels for influence detection.

What you now know

You can RECITE the five Gauss-Markov assumptions plus the optional Normality assumption, and for each one you can state which residual-plot panel reveals its violation, what is broken in OLS as a result (biased $\hat\beta$? wrong SEs? both?), and what fix is appropriate.

You can READ the four canonical residual-plot panels: residuals vs fitted (linearity + mean structure), Q-Q plot of standardised residuals (Normality), residuals vs index (independence), scale-location √|standardised residual| vs fitted (homoscedasticity). You recognise each panel's specific signature and you can read off the Breusch-Pagan statistic (≈ $\chi^2_1$ under H₀ of constant variance), the Durbin-Watson statistic (≈ 2 under no autocorrelation), and the skewness / excess kurtosis of standardised residuals.

You can state the LINEARITY-FAILURE story: curvature in residuals-vs-fitted; β̂ is biased; fix with polynomial / spline terms (§4.6) or transform Y. You can state the HETEROSCEDASTICITY story: fan in residuals-vs-fitted, rising scale-location, BP statistic flags it; β̂ unbiased, SEs wrong; fix with White (1980) sandwich SEs, WLS / GLS (§4.4), or transformation. You can state the AUTOCORRELATION story: snake in residuals-vs-index, DW drops below 2; β̂ unbiased, SEs understated; fix with Newey-West (1987) HAC SEs, explicit ARMA model, or cluster-robust SEs.
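
The two test statistics and the sandwich standard errors named above can be computed directly from residuals. The sketch below uses simulated data under assumed settings (error sd proportional to x for the heteroscedastic case, AR(1) coefficient 0.7 for the autocorrelated case). It uses the n·R² form of the Breusch-Pagan statistic with fitted values as the single auxiliary regressor, and the HC0 variant of White's estimator. The helper `fit` is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

def fit(X, y):
    """Return OLS coefficients and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta

# Heteroscedastic errors: sd proportional to x (assumed form).
y_het = 2.0 + 1.5 * x + x * rng.normal(size=n)
beta_het, e_het = fit(X, y_het)

# Breusch-Pagan, n * R^2 form: regress squared residuals on fitted values.
fitted = X @ beta_het
u = e_het**2
_, aux_resid = fit(np.column_stack([np.ones(n), fitted]), u)
r2_aux = 1.0 - np.sum(aux_resid**2) / np.sum((u - u.mean()) ** 2)
bp_stat = n * r2_aux   # compare with the chi^2_1 critical value 3.84

# Classical vs White (HC0) standard errors; beta_het is the same in both.
XtX_inv = np.linalg.inv(X.T @ X)
se_classical = np.sqrt(np.diag(XtX_inv) * np.sum(e_het**2) / (n - 2))
meat = X.T @ (e_het[:, None] ** 2 * X)          # sum_i e_i^2 x_i x_i'
se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# AR(1) errors, then Durbin-Watson: sum (e_i - e_{i-1})^2 / sum e_i^2.
eps = np.zeros(n)
innov = rng.normal(size=n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + innov[t]
_, e_ar = fit(X, 2.0 + 1.5 * x + eps)
dw_stat = np.sum(np.diff(e_ar) ** 2) / np.sum(e_ar**2)

print(f"BP = {bp_stat:.1f}, DW = {dw_stat:.2f}")
print("classical SEs:", se_classical, "White SEs:", se_white)
```

In practice a statistics library would supply these quantities; writing them out once shows that each is a short function of the residual vector.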

You can distinguish OUTLIERS (a single high-influence point) from HEAVY-TAILED ERRORS (a distributional property). Outliers bias $\hat\beta$; heavy tails preserve consistency but degrade efficiency and break small-n exact inference. Fixes: robust regression (§4.5) for both; bootstrap CIs (§1.7, §3.2) for the inference side of heavy tails; subject-matter investigation for any single suspect point.
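
The downweighting idea can be previewed before §4.5 with a small Huber M-estimator fitted by iteratively reweighted least squares. This is a minimal sketch, not the §4.5 development: the data are simulated with an assumed true slope of 2 and a single gross outlier in Y, the tuning constant 1.345 and the MAD-based scale are conventional choices, and `huber_irls` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = np.linspace(0.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[-1] += 40.0   # one gross outlier in Y at the right-hand end (assumed contamination)
X = np.column_stack([np.ones(n), x])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # start from OLS
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale: median absolute deviation, rescaled for Normal errors.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        z = np.abs(resid) / scale
        # Huber weights: 1 inside the threshold, c / |z| beyond it.
        w = np.minimum(1.0, c / np.maximum(z, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return beta

beta_huber = huber_irls(X, y)
print("OLS slope:  ", beta_ols[1])
print("Huber slope:", beta_huber[1])
```

The single contaminated point drags the OLS slope away from 2, while the reweighted fit stays close to it. Huber weights limit the influence of large residuals only, so a badly placed high-leverage point may need the other estimators mentioned for §4.5.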

You can state MULTICOLLINEARITY quantitatively: VIF_j = 1 / (1 − R_j²); thresholds VIF ≤ 5 fine, 5 < VIF ≤ 10 mild concern, VIF > 10 severe. For two predictors, both VIFs collapse to 1 / (1 − r²) where r is the sample correlation. Consequence: β̂ is unstable across samples; SEs blow up; CIs swell. Fix: drop a redundant predictor, combine via PCA, or use ridge (Part 9 §9.2).
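
The two-predictor identity can be confirmed numerically. In this sketch (simulated predictors with an assumed population correlation of 0.95; `vif` is an illustrative helper, not a library function), each VIF is computed from the auxiliary regression of one predictor on the other and compared with 1 / (1 − r²).

```python
import numpy as np

rng = np.random.default_rng(5)
n, rho = 500, 0.95
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=n)

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns plus an intercept."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef = np.linalg.lstsq(others, target, rcond=None)[0]
    resid = target - others @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]   # sample correlation
vif_1, vif_2 = vif(X, 0), vif(X, 1)
vif_formula = 1.0 / (1.0 - r**2)

print(f"r = {r:.3f}, VIF_1 = {vif_1:.2f}, VIF_2 = {vif_2:.2f}, formula = {vif_formula:.2f}")
```

With more than two predictors the same `vif` helper still applies, but the shortcut through the pairwise correlation no longer does.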

You can articulate the HONEST CAVEATS: diagnostics are visual first and statistical second; multiple violations can mask each other; the Normality assumption concerns the RESIDUALS conditional on X, not Y marginally; statistical tests for assumption violations have low power in small n; and small-sample plots can mislead in either direction. The assumption-diagnostics-suite widget operationalises the four-panel workflow; the vif-multicollinearity widget makes VIF inflation and CI swelling tangible.

You can articulate why the §4.1 GEOMETRY survives every fix: each fix is a specific perturbation of the projection picture — enlarge col(X) for linearity, change the inner product for GLS, change the loss for robust regression, change the variance estimator for sandwich SEs, shrink col(X) for multicollinearity, add a penalty for ridge. The projection-first framing of §4.1 is the unifying backbone for the rest of Part 4 and major chunks of Parts 5 and 9.

Where this lands. §4.3 dives deeper into per-observation diagnostics: leverage $h_{ii}$ (already introduced in §4.1), Cook's distance, DFFITS, DFBETAS, partial-regression plots, added-variable plots, and the formal hypothesis tests for individual and joint coefficient significance. §4.4 develops GLS and WLS for known and estimated covariance $\Omega$. §4.5 develops robust regression (M-estimators, S-estimators, MM-estimators) for heavy-tailed errors and outlier robustness. §4.6 handles interactions, polynomial terms, and basis expansions as concrete ways to enlarge $\mathrm{col}(X)$. §4.7 covers model selection (AIC, BIC, cross-validation). §4.8 closes Part 4 with the causal-interpretation warnings. Part 9 §9.2 develops ridge and lasso as principled fixes to multicollinearity and over-fitting.

References

  • White, H. (1980). "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity." Econometrica 48(4), 817–838. (The original "sandwich" estimator. Replaces $\hat\sigma^2 (X^\top X)^{-1}$ with $(X^\top X)^{-1} \bigl(\sum_i e_i^2 x_i x_i^\top\bigr) (X^\top X)^{-1}$; consistent for the true variance of $\hat\beta$ under heteroscedasticity. The starting point of modern robust-SE practice.)
  • Breusch, T.S., Pagan, A.R. (1979). "A simple test for heteroscedasticity and random coefficient variation." Econometrica 47(5), 1287–1294. (The BP statistic: regress $e_i^2 / \hat\sigma^2$ on a vector of variance-determinants; $\mathrm{BP} = n R^2_{\text{aux}}$ is asymptotically $\chi^2_{k}$ under H₀ of homoscedasticity. The standard formal test.)
  • Durbin, J., Watson, G.S. (1950). "Testing for serial correlation in least squares regression: I." Biometrika 37(3-4), 409–428. (The DW statistic for first-order autocorrelation. Sequel papers (1951, 1971) extend the distribution theory. Standard tables give exact critical bounds depending on $n$ and the design.)
  • Newey, W.K., West, K.D. (1987). "A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix." Econometrica 55(3), 703–708. (HAC standard errors. Generalises White (1980) by adding lagged cross-products with a Bartlett (triangular) kernel weight, ensuring positive semi-definite estimates even with long-range autocorrelation.)
  • Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (The foundational reference on residual diagnostics, leverage, Cook's distance, DFFITS, DFBETAS, VIFs, and condition indices. The VIF > 10 threshold used in §4.2 and the $h_{ii} > 2p/n$ threshold from §4.1 come from this book.)
  • Cook, R.D., Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman & Hall. (Companion volume to Belsley-Kuh-Welsch focused on residual and influence diagnostics. Cook's distance, partial-residual plots, added-variable plots, and the full diagnostic toolkit that §4.3 builds on come from this book.)
  • Greene, W.H. (2018). Econometric Analysis, 8th ed. New York: Pearson. (Standard graduate-level econometrics reference. Chapter 4 covers Gauss-Markov assumptions and consequences of their failure; Chapter 9 covers heteroscedasticity in depth (including White SEs and FGLS); Chapter 20 covers serial correlation and Newey-West. The applied-econometrics complement to ESL.)
  • Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 develops linear regression with residual diagnostics integrated into the geometric / projection presentation. Chapter 3.4 introduces ridge regression as the principled response to multicollinearity. The canonical modern reference for statistical learning theory.)
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer. (Chapter 13 covers linear regression in the mathematical-statistics tradition. The compact derivation of residual diagnostics and consequences of assumption failure complements the applied-econometrics treatment in Greene.)
