Assumptions and what breaks when they fail
Learning objectives
- Recite the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) plus the optional sixth (Normality) as a checklist, and state for each one which residual-plot panel reveals the violation, what is broken in the OLS estimator or its standard errors when it fails, and what fix is appropriate
- Read the FOUR DIAGNOSTIC PANELS — residuals vs fitted (linearity + mean structure), Q-Q plot of standardised residuals (Normality), residuals vs index/order (independence), scale-location √|standardised residual| vs fitted (homoscedasticity) — recognise each panel's specific signature, and state why these four together cover most assumption violations
- State the LINEARITY-FAILURE signature: curvature (U-shape or arch) in residuals vs fitted. Consequence: β̂ is BIASED — it estimates a population-weighted average slope, not any local slope. Fix: add polynomial / spline terms (§4.6), basis expansions, or transform Y (e.g. log Y when the truth is multiplicative)
- State the HETEROSCEDASTICITY signature: a fan / funnel in residuals vs fitted and a rising trend in the scale-location plot. The Breusch-Pagan test (Breusch-Pagan 1979) makes this quantitative: regress squared residuals on fitted values; the resulting test statistic ≈ χ²₁ under H₀ of constant variance. Consequence: β̂ remains UNBIASED but SE(β̂) is wrong — textbook formulas understate variance where x is dispersed. Fix: White (1980) heteroscedasticity-consistent ("sandwich") standard errors; weighted least squares / GLS (§4.4); transformation
- State the AUTOCORRELATION signature: snaking / runs in residuals vs index (when the index is meaningful, e.g. time-ordered observations); residual autocorrelation function with significant lag-1 spike. The Durbin-Watson statistic (Durbin-Watson 1950) DW = Σ(eᵢ − eᵢ₋₁)² / Σeᵢ² is ≈ 2 under H₀ of independence, < 1.5 under positive AR(1) autocorrelation, > 2.5 under negative. Consequence: β̂ unbiased but SE understated; effective sample size shrinks. Fix: Newey-West (1987) HAC standard errors; explicit ARMA error model; first-differencing; cluster-robust SEs
- State the OUTLIER / HIGH-INFLUENCE-POINT signature: one or two points sit far off the residuals-vs-fitted cloud and curl the Q-Q plot's tail. Consequence: β̂ is BIASED toward the outlier, especially when the outlier has high LEVERAGE (§4.1) so its Cook's distance (§4.3) is large. Fix: robust regression (§4.5) — M-estimators (Huber, Tukey biweight) downweight outliers; or investigate the outlier with subject-matter justification before deciding whether to correct or remove it
- State the NON-NORMALITY-OF-ERRORS signature: Q-Q plot of standardised residuals shows tails curling away from the diagonal (heavy tails) or asymmetry (skewness). Critically — the assumption is about the conditional distribution of ε given X, NOT the marginal distribution of Y. Consequence: point estimates and SEs are still consistent (CLT); SMALL-n exact t and F intervals are off. Fix: rely on the CLT for large n; bootstrap CIs (§1.7, §3.2); robust regression (§4.5) for efficiency on heavy-tailed errors
- State the MULTICOLLINEARITY signature: two or more predictors highly correlated, sample correlation matrix has near-zero eigenvalues, design matrix X has nearly linearly dependent columns. Make this quantitative with the VARIANCE INFLATION FACTOR $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the R² of regressing predictor j on the OTHER predictors. State the customary thresholds: VIF ≤ 5 is fine; 5 < VIF ≤ 10 is a mild concern; VIF > 10 is severe (some authors prefer VIF > 4 / VIF > 8 as analogous thresholds). Consequence: β̂_j on correlated predictors is UNSTABLE: small data perturbations swing signs and magnitudes; SEs blow up; CIs widen. Fix: drop a redundant predictor; combine via PCA on the design; ridge regression (Part 9 §9.2)
- Articulate the HONEST CAVEATS that distinguish good practice from cargo-cult diagnostics: (1) Diagnostics are visual FIRST and statistical SECOND — always plot the residuals before invoking a test; (2) Multiple violations can mask each other — curvature can look like heteroscedasticity, an autocorrelated AR(1) series can look like a trend; (3) The Normality assumption is about ERRORS conditional on X, not the marginal distribution of Y; (4) Statistical tests for assumption violations (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) have LOW POWER in small n — a non-rejection at n = 30 is much weaker evidence of "assumptions hold" than a non-rejection at n = 1000
- Memorise the FIX MATRIX as a one-screen reference: linearity → polynomials / splines / log transforms (§4.6); heteroscedasticity → White SEs / WLS-GLS (§4.4); autocorrelation → Newey-West SEs / ARMA model; outliers / heavy tails → robust regression (§4.5) / bootstrap CIs; multicollinearity → drop variables / PCA / ridge (Part 9). Recognise that each fix preserves the GEOMETRIC picture of §4.1 — sometimes by changing the inner product (GLS), sometimes by changing the loss (robust), sometimes by adding a penalty (ridge), sometimes by enlarging col(X) (splines)
- Read the catalogue of seminal references: White (1980) for heteroscedasticity-robust standard errors; Breusch-Pagan (1979) for the homoscedasticity test; Durbin-Watson (1950) for the autocorrelation test; Newey-West (1987) for HAC standard errors; Belsley-Kuh-Welsch (1980) and Cook-Weisberg (1982) for the residual-diagnostic toolkit; Greene (2018) chs. 4 and 9 for the econometric summary; Hastie-Tibshirani-Friedman (2009) ch. 3 for the modern-stat-learning treatment; Wasserman (2004) ch. 13 for the compact mathematical-statistics version
§4.1 set OLS up as the orthogonal projection of $Y$ onto $\mathrm{col}(X)$ and stated the FIVE GAUSS-MARKOV ASSUMPTIONS (linearity, exogeneity, homoscedasticity, no autocorrelation, no perfect multicollinearity) plus the optional sixth (Normality). Under those five OLS is BLUE; under all six it has exact small-sample t- and F-inference. §4.2 takes each assumption in turn and asks the practical question: what does its failure LOOK LIKE, in what diagnostic panel does the failure show, what is broken in OLS as a result, and what is the appropriate fix?
The framing is deliberately picture-driven. Diagnostics are visual FIRST and statistical SECOND. The four canonical residual-plot panels — residuals vs fitted, Q-Q plot of residuals, residuals vs index, scale-location — each have a specific signature for a specific violation. The two §4.2 widgets make these signatures inhabitable: the assumption-diagnostics-suite lets the reader pick a scenario (clean OLS, curvature, heteroscedasticity, autocorrelation, outlier, non-Normal errors, near-collinearity) and see all four panels populate; the vif-multicollinearity widget animates how the VIF rises and the CIs swell as the predictor correlation marches toward 1.
The §4.2 arc has seven stops, one per assumption + the wrap-up of the fix matrix. For each: the SIGNATURE in the residual plots, the CONSEQUENCE for OLS estimates and standard errors, the appropriate FIX in the regression toolkit. The geometry from §4.1 carries through every section: each fix is a specific perturbation of the projection. GLS changes the inner product, robust regression changes the loss, ridge adds a penalty, polynomial / spline terms enlarge $\mathrm{col}(X)$.
The four canonical residual-plot panels
The diagnostic stack is built on four plots. Each one is a scatter of a residual-derived quantity against another scalar, and each has a specific tell. Let $e = Y - \hat{Y}$ be the residual vector and let $r_i = e_i / (\hat\sigma \sqrt{1 - h_{ii}})$ be the $i$-th internally studentised ("standardised") residual: the residual scaled by its own estimated standard deviation under the model (the factor $\sqrt{1 - h_{ii}}$ comes from the fact that $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$ under the model, where $h_{ii}$ is the leverage from §4.1).
- Residuals vs fitted ($e_i$ vs $\hat{Y}_i$). Under all assumptions, this is a featureless cloud centred on 0. Curvature (U or arch) signals that the true regression function is nonlinear and the linear fit is misspecified: LINEARITY fails. Fanning (vertical spread grows with $\hat{Y}_i$) signals that the error variance depends on the fitted value: HOMOSCEDASTICITY fails. A small smoother line through the cloud (the red line in the widget) helps the eye separate signal from noise.
- Q-Q plot of standardised residuals (ordered standardised residuals vs theoretical N(0, 1) quantiles). Under all six assumptions including Normality, this is a straight line at slope 1 through the origin. Tail-heaviness (both ends curl AWAY from the diagonal, the lower end below it and the upper end above it) signals heavy-tailed errors. Asymmetry (the points bend into a single arc, so both ends sit on the same side of the diagonal) signals skewness. Single off-line points at the extremes signal outliers.
- Residuals vs index/order ($e_i$ vs $i$, when the index is meaningful: typically time-ordering for time-series data, or spatial ordering for spatial data). Under all assumptions this is a featureless cloud. Snaking (long runs of same-sign residuals) signals positive autocorrelation between adjacent errors. Alternation (residuals flip sign every step) signals negative autocorrelation. In either case INDEPENDENCE (no autocorrelation) fails.
- Scale-location plot ($\sqrt{|r_i|}$ vs $\hat{Y}_i$, with $r_i$ the standardised residual). Same data as the residuals-vs-fitted panel, but the absolute-deviation transform removes sign cancellation, so fanning becomes a monotone TREND in the plot. Under homoscedasticity, this hovers around the constant $E\sqrt{|Z|} \approx 0.82$, the mean of $\sqrt{|Z|}$ for $Z \sim N(0, 1)$. A rising or falling trend is the cleanest visual diagnostic for heteroscedasticity, more sensitive than the residuals-vs-fitted panel.
These four panels — sometimes assembled into a 2×2 grid by statistical software (R's plot(lm.fit), statsmodels' diagnostic-plot helpers) — are the diagnostic backbone. Cook-Weisberg (1982) and Belsley-Kuh-Welsch (1980) extended the toolkit with leverage-vs-residual scatter and Cook's distance (§4.3). The §4.2 widget renders all four panels live; §4.3 dives deeper into the per-observation diagnostics (leverage, Cook's distance, DFFITS, DFBETAS) that supplement them.
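As a rough illustration of what the four panels plot, the sketch below computes the coordinates of each panel by hand. It assumes NumPy is available and uses a simulated clean data set; the coefficients, seed, and variable names are illustrative choices, not the widget's actual settings.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)   # hypothetical clean linear truth

X = np.column_stack([np.ones(n), x])               # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
resid = y - fitted

# Leverages h_ii: diagonal of the hat matrix X (X'X)^{-1} X'
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Internally studentised residuals: e_i / (sigma_hat * sqrt(1 - h_ii))
p = X.shape[1]
sigma_hat = np.sqrt(resid @ resid / (n - p))
std_resid = resid / (sigma_hat * np.sqrt(1.0 - leverage))

# Theoretical N(0, 1) quantiles at plotting positions (i - 0.5) / n
probs = (np.arange(1, n + 1) - 0.5) / n
theo_q = np.array([NormalDist().inv_cdf(q) for q in probs])

panels = {
    "residuals_vs_fitted": (fitted, resid),
    "qq": (theo_q, np.sort(std_resid)),
    "residuals_vs_index": (np.arange(n), resid),
    "scale_location": (fitted, np.sqrt(np.abs(std_resid))),
}
```

Each entry of `panels` is an (x, y) pair ready to hand to any scatter-plot routine. On clean data the mean of the scale-location values should sit near 0.82 and the Q-Q points should lie close to the diagonal.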
Widget 1: the diagnostic suite
The first widget gives the reader a switchboard of scenarios. Pick one, and the widget generates $n$ observations from the corresponding data-generating process, fits the (mis)specified OLS model, and renders all four residual panels. The seven scenarios are:
- Clean OLS: linear truth $Y = \beta_0 + \beta_1 x + \varepsilon$ with i.i.d. Normal errors $\varepsilon \sim N(0, \sigma^2)$. All four panels look featureless; this is the baseline you compare every violation against.
- Linearity fails (curvature) — the truth is quadratic, the fit is straight-line. The residuals-vs-fitted panel shows a clear U-shape.
- Homoscedasticity fails (fanning variance): error variance grows with $x$. The residuals-vs-fitted panel fans; the scale-location plot rises monotonically; the Breusch-Pagan statistic lands above the critical value.
- Independence fails (AR(1) autocorrelation): $\varepsilon_t = \phi\,\varepsilon_{t-1} + u_t$ with $\phi > 0$ and white-noise $u_t$. The residuals-vs-index panel shows long same-sign runs; Durbin-Watson drops well below 2.
- Single high-influence outlier: clean data plus one point at large $x$ with a large vertical deviation. The Q-Q plot shows a single off-line tail point; residuals-vs-fitted reveals the lone outlier.
- Normality fails (heavy-tailed errors) — errors drawn from Student-t with df = 3. The Q-Q plot curls at both ends; the residuals-vs-fitted panel looks ordinary; skewness ≈ 0 but excess kurtosis is large and positive.
- Near-multicollinearity: two highly correlated predictors $x_1$ and $x_2$, but we only fit on $x_1$; $x_2$'s contribution leaks into the residuals. (The structural collinearity story is the second widget.)
Things to verify in the widget:
- On the "clean" scenario, all four panels are featureless; Breusch-Pagan and Durbin-Watson stay in their nominal ranges; skewness and excess kurtosis hover near 0. Click "New sample" repeatedly — every sample looks roughly the same. This is the BASELINE.
- Switch to "curvature". The residuals-vs-fitted panel shows a clear U-shape: residuals are positive at the ends, negative in the middle. The Q-Q plot may look only mildly disturbed. The lesson: linearity failures show up in the residuals-vs-fitted panel, not the Q-Q plot.
- Switch to "heteroscedasticity". The residuals-vs-fitted panel fans open with $\hat{Y}$; the scale-location plot rises monotonically. The Breusch-Pagan statistic typically lands above the 5% critical value of 3.84, formally flagging the heteroscedasticity. Note σ̂ in the status panel: the SINGLE-NUMBER residual SE is a misleading summary when variance changes with $x$.
- Switch to "autocorrelated". The residuals-vs-index panel shows a SLOW SNAKE through the index — runs of 5-10 same-sign residuals. Durbin-Watson drops well below 2 (often into the 0.5-1.0 range). The Q-Q plot looks normal-ish because the marginal distribution of an AR(1) is still Normal — autocorrelation hides in the ORDERED structure, which is exactly what the residuals-vs-index panel exposes.
- Switch to "outlier". One point sits far from the trend line. The residuals-vs-fitted panel shows a lone point well off the cloud; the Q-Q plot has a single tail point curling away. The fitted slope is visibly pulled DOWN compared to the clean baseline: the outlier biases the estimate.
- Switch to "non-normal". The visual scale of residuals matches "clean" because we matched the variance scale, but the Q-Q plot now curls at BOTH ends: heavy tails. Excess kurtosis in the status panel jumps to the 1.5-3 range. The other three panels still look fine; only the Q-Q reveals the violation.
- Switch to "multicollinear". The widget fits on $x_1$ alone but the true generative model uses both $x_1$ and $x_2$, with the two strongly correlated; the omitted-but-correlated $x_2$ leaks into the residuals as extra variance. The structural collinearity story (VIF explosion, coefficient instability) is the territory of the second widget.
- Click "New sample" while looking at the autocorrelated scenario. The Durbin-Watson statistic jitters around its near-1 mean but stays well below 2. With heavy-tailed errors, sample-to-sample variation in excess kurtosis is large: small $n$ + heavy tails = high-variance higher-moment estimates.
Linearity fails: curvature in the residuals
If the true conditional mean is $E[Y \mid x] = f(x)$ for some nonlinear $f$, but we fit the linear model $Y = \beta_0 + \beta_1 x + \varepsilon$, OLS picks the linear combination of columns of $X$ that best approximates $f$ in the least-squares sense. The residuals then carry the SHAPE of $f$ minus its best linear projection. For a quadratic $f$, the leftover is approximately quadratic; residuals plotted against $\hat{Y}$ show a U or arch.
Consequences:
- β̂ is BIASED. $\hat\beta_1$ does NOT estimate any local slope of $f$; it estimates a population-weighted average slope. If you care about the marginal effect of $x$ at a specific value, OLS on the misspecified linear model is not the right answer.
- R² can still look respectable. A quadratic shape projected onto a line still explains a fraction of the variation; R² of 0.6-0.8 is common even with serious curvature. R² alone never diagnoses misspecification.
- Standard errors are off too. The model is wrong, so the variance formula $\hat\sigma^2 (X^\top X)^{-1}$ does not give the actual sampling variance of $\hat\beta$.
Fixes — three families:
- Polynomial / spline terms (§4.6). Add $x^2, x^3, \dots$ as columns of $X$; the column space grows to include the nonlinear basis functions. Geometrically: enlarge $\mathrm{col}(X)$ so the projection captures more of $f$. Splines (piecewise polynomial bases) are a more disciplined generalisation.
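- Variable transformation. If the true relationship is multiplicative (for example $Y = \alpha x^{\beta} \eta$ with multiplicative noise $\eta$), take logs: $\log Y = \log\alpha + \beta \log x + \log\eta$, so the transformed model IS linear. Common transforms: $\log Y$, $\sqrt{Y}$, Box-Cox.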
- Generalised additive model (GAM, beyond §4.6). Fit $Y = \beta_0 + \sum_j f_j(x_j) + \varepsilon$ with smooth per-predictor functions $f_j$ estimated nonparametrically. The §9.3 trees / random forests and §8.5 KDE-based nonparametric fits are further along the same axis.
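The first fix (enlarging the column space) can be sketched numerically. The example below assumes NumPy and a hypothetical quadratic truth; the coefficients and noise level are made up for illustration. It measures leftover curvature as the correlation between the residuals and $x^2$, before and after adding $x^2$ as a column.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
# Hypothetical quadratic truth: a straight-line fit is misspecified here
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(0.0, 0.5, size=n)

def ols_residuals(X, y):
    """Residuals of the least-squares projection of y onto col(X)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

X_lin = np.column_stack([np.ones(n), x])           # straight-line design
X_quad = np.column_stack([np.ones(n), x, x**2])    # enlarged column space

e_lin = ols_residuals(X_lin, y)
e_quad = ols_residuals(X_quad, y)

# Leftover curvature: correlation of the residuals with x^2
curv_lin = np.corrcoef(e_lin, x**2)[0, 1]
curv_quad = np.corrcoef(e_quad, x**2)[0, 1]
print(f"curvature left in residuals: linear fit {curv_lin:.2f}, quadratic fit {curv_quad:.2e}")
```

After the refit the residuals are orthogonal to every column of the enlarged design, so the second correlation is zero up to floating-point error. That is the projection picture at work rather than a property of this particular sample.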
Homoscedasticity fails: heteroscedastic ("fanning") variance
Homoscedasticity says $\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$, the same $\sigma^2$ for every observation. When this fails, $\sigma_i^2 = \mathrm{Var}(\varepsilon_i \mid X)$ varies across observations, typically as a function of $x_i$. The classic shape is "fanning": variance grows with the fitted value, so residuals form a wedge in the residuals-vs-fitted plot.
Quantitative diagnostic: the Breusch-Pagan test (Breusch-Pagan 1979). Regress the squared (scaled) residuals on a vector of variance-determinants (often the fitted values), compute the auxiliary R²_aux, and form $\mathrm{BP} = n R^2_{\mathrm{aux}}$. Under H₀ of homoscedasticity, BP asymptotically follows a $\chi^2$ distribution with degrees of freedom equal to the number of variance-determinants. The widget uses the simplest variant with a single regressor (the fitted value), so BP ≈ $\chi^2_1$ under H₀; the 5% critical value is $\chi^2_{1, 0.95} = 3.84$.
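A minimal sketch of the single-regressor variant, computing the statistic as n times the auxiliary R² from regressing squared residuals on the fitted values. NumPy is assumed, and the data-generating process (error standard deviation proportional to x) is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
# Hypothetical fanning errors: standard deviation grows linearly with x
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3 * x)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
e2 = (y - fitted) ** 2                      # squared OLS residuals

# Auxiliary regression: squared residuals on an intercept and the fitted values
Z = np.column_stack([np.ones(n), fitted])
gamma, *_ = np.linalg.lstsq(Z, e2, rcond=None)
ss_res = np.sum((e2 - Z @ gamma) ** 2)
ss_tot = np.sum((e2 - e2.mean()) ** 2)
r2_aux = 1.0 - ss_res / ss_tot

bp = n * r2_aux                             # compare with chi-square(1): 3.84 at the 5% level
print(f"auxiliary R^2 = {r2_aux:.3f}, BP = {bp:.1f}")
```

Rescaling the squared residuals by a constant leaves the auxiliary R² unchanged, so the "scaled" and unscaled versions give the same n·R² statistic.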
Consequences:
- β̂ remains UNBIASED. The unbiasedness of OLS does not depend on the variance structure; it depends on $E[\varepsilon \mid X] = 0$. So OLS still estimates $\beta$ correctly on average. The point estimates are not what's wrong.
- SE(β̂) is WRONG. The OLS variance formula $\hat\sigma^2 (X^\top X)^{-1}$ assumes constant variance. Under heteroscedasticity it typically UNDERSTATES the variance where $x$ is dispersed and OVERSTATES it where $x$ is tight. Confidence intervals and p-values built on this formula are off.
- β̂ is no longer BLUE. Gauss-Markov assumed homoscedasticity. Under heteroscedasticity, OLS is still LINEAR and UNBIASED, but it is no longer the minimum-variance such estimator. GLS / WLS (§4.4) IS the new BLUE.
Fixes — three families:
- White (1980) heteroscedasticity-consistent ("sandwich") SEs. Replace $\hat\sigma^2 (X^\top X)^{-1}$ with $(X^\top X)^{-1} X^\top \mathrm{diag}(e_i^2)\, X (X^\top X)^{-1}$. Keeps the OLS point estimate; corrects the variance. Standard in modern econometrics and supported by every major regression package (R: vcovHC; Python statsmodels: cov_type="HC0"/"HC3"; Stata: robust). Generalises to clustered SEs when observations cluster.
- Weighted least squares (WLS) / GLS (§4.4). If the variance structure is known (or can be estimated from an auxiliary regression on $X$), weight each observation inversely to its variance. The resulting estimator IS the BLUE under heteroscedasticity (the analogue of Gauss-Markov for GLS).
- Variable transformation. If the variance grows with the mean (Poisson-like), $\log Y$ or $\sqrt{Y}$ can stabilise the variance. Often combined with a sensible scientific reframing (e.g., model $\log Y$ instead of $Y$).
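The sandwich idea in the first fix can be sketched directly in NumPy. The example compares the classical covariance, residual variance times the inverse of X'X, with the HC0 form that puts the squared residuals in the middle. The simulated design and error scale are hypothetical, and no small-sample correction (HC1 to HC3) is applied.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
# Hypothetical strongly heteroscedastic errors: sd proportional to x^2
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.05 * x**2)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                # OLS point estimate (unchanged by the fix)
e = y - X @ beta_hat

# Classical covariance: sigma_hat^2 (X'X)^{-1}
sigma2_hat = e @ e / (n - X.shape[1])
cov_classical = sigma2_hat * XtX_inv

# White HC0 sandwich: (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}
meat = (X * (e**2)[:, None]).T @ X
cov_hc0 = XtX_inv @ meat @ XtX_inv

se_classical = np.sqrt(np.diag(cov_classical))
se_hc0 = np.sqrt(np.diag(cov_hc0))
print("classical SEs:", se_classical)
print("HC0 SEs:      ", se_hc0)
```

In this setup the noisiest observations also sit far from the mean of x, so the classical slope SE comes out too small relative to the sandwich SE. In practice a library routine (for instance the statsmodels cov_type option named above) would normally be used rather than hand-rolled algebra.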
Independence fails: autocorrelation in the errors
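No-autocorrelation says $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for $i \neq j$. Failures come most often when observations have an INTRINSIC ORDER: time-series, spatial, hierarchical. An AR(1) error process $\varepsilon_t = \phi\,\varepsilon_{t-1} + u_t$ with $|\phi| < 1$ and white-noise $u_t$ is the canonical example.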
Quantitative diagnostic: Durbin-Watson (Durbin-Watson 1950), $\mathrm{DW} = \sum_{i=2}^{n} (e_i - e_{i-1})^2 / \sum_{i=1}^{n} e_i^2$. Under H₀ of no autocorrelation, DW ≈ 2. Under AR(1) with lag-1 autocorrelation $\phi$, DW ≈ 2(1 − $\phi$), so positive autocorrelation ($\phi > 0$) drives DW below 2 and negative autocorrelation drives DW above 2. The exact distribution depends on the design $X$; standard tables give upper and lower bounds for the DW critical region.
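A short sketch of the statistic and of the "DW is roughly 2(1 − lag-1 autocorrelation)" approximation, assuming NumPy. The AR(1) coefficient of 0.6 and the trend regressor are hypothetical values chosen only to make the effect visible.

```python
import numpy as np

def durbin_watson(e):
    """Sum of squared successive differences over the sum of squared residuals."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)
n = 2000
phi = 0.6                                   # hypothetical AR(1) coefficient
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0]
for t in range(1, n):
    eps[t] = phi * eps[t - 1] + u[t]        # AR(1) errors

x = np.linspace(0.0, 10.0, n)               # time-ordered regressor
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

dw = durbin_watson(e)
r1 = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)   # lag-1 residual autocorrelation
print(f"DW = {dw:.3f}, 2(1 - r1) = {2 * (1 - r1):.3f}")
```

The two printed numbers differ only by an end-effect term involving the first and last residuals, which is negligible at this sample size.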
Consequences:
- β̂ remains UNBIASED. Same story as heteroscedasticity: unbiasedness needs only $E[\varepsilon \mid X] = 0$.
- SE(β̂) is UNDERSTATED. Positive autocorrelation means observations carry overlapping information; the effective sample size is much smaller than $n$. The OLS variance formula treats the observations as independent, so it understates the actual sampling variance, sometimes by factors of 2-5×.
- OLS is no longer BLUE. GLS with the correct covariance would be BLUE; OLS is suboptimal.
Fixes — three families:
- Newey-West (1987) HAC standard errors. Heteroscedasticity-and-Autocorrelation-Consistent: a generalisation of White (1980) sandwich SEs that includes lagged cross-products with kernel-weighted decay (Bartlett kernel by default). Keep the OLS point estimate, replace the SEs. Standard in modern time-series regression.
- Explicit ARMA error model. Fit $Y_t = x_t^\top \beta + \varepsilon_t$ where $\varepsilon_t$ follows an ARMA(p, q) process, with $\beta$ and the ARMA parameters estimated jointly by maximum likelihood. R's arima / auto.arima, Python's statsmodels.tsa.arima.model.ARIMA. Recovers BLUE-like efficiency when the ARMA model is correct.
- First-differencing or cluster-robust SEs. When the correlation has a known cluster structure (panel data, repeated measures), cluster-robust SEs treat each cluster as one independent unit. First-differencing removes the serial correlation when the AR(1) coefficient is close to 1; for smaller coefficients it over-differences and induces negative autocorrelation.
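The HAC idea from the first fix can be sketched as a White sandwich whose filling also includes Bartlett-weighted lagged cross-products. This is a bare-bones illustration assuming NumPy, with a hand-picked lag of 10 and no small-sample correction; the function name and the simulated AR(1) setup are hypothetical.

```python
import numpy as np

def newey_west_cov(X, e, max_lag):
    """HAC covariance of OLS coefficients with Bartlett kernel weights."""
    XtX_inv = np.linalg.inv(X.T @ X)
    Xe = X * e[:, None]                     # row t holds x_t * e_t
    S = Xe.T @ Xe                           # lag-0 term: the White HC0 filling
    for lag in range(1, max_lag + 1):
        w = 1.0 - lag / (max_lag + 1.0)     # Bartlett weight, decays linearly to 0
        G = Xe[lag:].T @ Xe[:-lag]          # lagged cross-products
        S += w * (G + G.T)
    return XtX_inv @ S @ XtX_inv

rng = np.random.default_rng(5)
n = 1000
x = np.arange(n) / n                        # slowly varying (trend) regressor
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0]
for t in range(1, n):
    eps[t] = 0.6 * eps[t - 1] + u[t]        # hypothetical AR(1) errors

y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

se_hc0 = np.sqrt(np.diag(newey_west_cov(X, e, max_lag=0)))
se_hac = np.sqrt(np.diag(newey_west_cov(X, e, max_lag=10)))
print("HC0 SEs:", se_hc0)
print("HAC SEs:", se_hac)
```

With `max_lag=0` the function reduces to the HC0 sandwich, which makes the "generalisation of White" relationship explicit. With positive autocorrelation and a persistent regressor, the HAC standard errors come out noticeably larger.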
Outliers and heavy-tailed errors
Two related-but-distinct issues. OUTLIERS are individual observations that are inconsistent with the bulk of the data: far from the residual cloud, sometimes with high leverage (§4.1). HEAVY-TAILED ERRORS are a distributional property: the conditional distribution of $\varepsilon$ given $X$ has more probability mass in the tails than Normal (e.g., Student-t with df = 3, contaminated Normal). Heavy tails generate MANY moderate outliers; a single outlier might come from a Normal distribution by sheer chance, or from a data-entry error.
Diagnostic — both show in the Q-Q plot, but with different signatures. A single outlier puts ONE point far off the diagonal at one tail. Heavy tails curl BOTH tails away from the diagonal. The §4.3 leverage-vs-residual diagnostic + Cook's distance (Cook-Weisberg 1982) supplements the Q-Q with per-observation impact measures.
Consequences:
- For a single high-influence outlier: β̂ is BIASED toward the outlier, sometimes severely. The Belsley-Kuh-Welsch (1980) leverage threshold (§4.1) catches outliers with large $h_{ii}$; Cook's distance (§4.3) combines leverage with residual size to flag outliers that actually influence the fit.
- For heavy-tailed errors: β̂ is still consistent and asymptotically Normal by the CLT (provided the error variance is finite), so point estimates and SEs are still valid for large $n$. Small-n exact inference (t and F) is off, but the CLT-based asymptotics carry through. Efficiency suffers, however: under heavy tails the OLS estimator has higher variance than alternatives.
Fixes:
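- Robust regression (§4.5). Replace the squared-error loss $\sum_i e_i^2$ with $\sum_i \rho(e_i)$ for a loss $\rho$ that grows more slowly than quadratically in the tails (linearly for Huber, levelling off for Tukey biweight). Huber, Tukey biweight, and similar M-estimators downweight extreme residuals; the resulting estimator is much less sensitive to outliers and more efficient under heavy tails.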
- Bootstrap confidence intervals (§1.7, §3.2). Resample the residuals or the (X, Y) pairs and refit; the empirical distribution of $\hat\beta^*$ across bootstrap samples gives a CI that does not need Normality. Especially valuable for small $n$.
- Quantile regression (§8.6). Estimate the conditional MEDIAN (or another quantile) of $Y$ given $X$ instead of the mean. The median is the absolute-error ($L_1$) analogue of the mean and is far more robust to outliers and heavy tails.
- Subject-matter investigation. An outlier may be a data-entry error (drop after correction), a genuine extreme observation (keep but report robust analysis alongside), or a flag for an unmodelled subgroup (model it explicitly). Statistical software gives diagnostics; only the analyst can decide what the outliers MEAN scientifically.
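To make the downweighting in robust regression concrete, here is a small sketch of a Huber M-estimator fitted by iteratively reweighted least squares. It assumes NumPy; the tuning constant 1.345, the MAD scale estimate, the fixed iteration count, and the planted outlier are illustrative choices, not a reproduction of §4.5.

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber M-estimate via iteratively reweighted least squares (fixed iteration count)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745   # robust scale (MAD)
        u = np.abs(r) / (c * scale)
        # Weight 1 inside the threshold, c*scale/|r| outside it
        w = np.where(u <= 1.0, 1.0, 1.0 / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(6)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.5, size=n)       # hypothetical clean truth, slope 0.5
x[-1], y[-1] = 10.0, -15.0                             # one planted high-leverage outlier

X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_huber = huber_irls(X, y)
print("OLS slope:  ", beta_ols[1])
print("Huber slope:", beta_huber[1])
```

The OLS slope is dragged toward the planted point, while the Huber fit gives that point a small weight and stays close to the slope used to simulate the data. Whether the point should be downweighted at all remains the subject-matter question raised in the last bullet.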
Multicollinearity and the variance inflation factor
No-perfect-multicollinearity is the §4.1 algebraic condition: without it, $X^\top X$ is singular and $\hat\beta = (X^\top X)^{-1} X^\top Y$ is not defined. The PRACTICAL problem is NEAR-multicollinearity: columns of $X$ are not literally linearly dependent but are highly correlated, so $X^\top X$ is invertible but ill-conditioned. The §4.1 geometric reading: nearly-parallel columns span a nearly-degenerate parallelogram; coefficients on those columns must be large and opposite-signed to fit any specific $Y$; small perturbations of $Y$ swing the coefficients dramatically.
Quantitative diagnostic: the VARIANCE INFLATION FACTOR $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ is the R² from regressing predictor $j$ on all the OTHER predictors. Interpretation: VIF_j is the factor by which $\mathrm{Var}(\hat\beta_j)$ is INFLATED relative to the no-collinearity case (where $R_j^2 = 0$ gives VIF = 1). Customary thresholds: VIF ≤ 5 is fine; 5 < VIF ≤ 10 is a mild concern; VIF > 10 is severe. (Some authors use VIF > 4 and VIF > 8 instead; the exact threshold is a convention, not a mathematical theorem.)
For the two-predictor case with sample correlation $r_{12}$ between predictors, both VIFs collapse to $1/(1 - r_{12}^2)$. As $|r_{12}| \to 1$, both VIFs diverge. The second widget animates exactly this.
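A quick numerical check of the two-predictor collapse, assuming NumPy. The helper `vif` below is a hypothetical function written for this sketch: it regresses each predictor on the others (with an intercept) and returns 1/(1 − R²), which for two predictors should match the closed form in the pairwise sample correlation.

```python
import numpy as np

def vif(P):
    """VIF of each column of the predictor matrix P (n x k, no intercept column)."""
    n, k = P.shape
    out = np.empty(k)
    for j in range(k):
        target = P[:, j]
        others = np.column_stack([np.ones(n), np.delete(P, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(7)
n = 500
rho = 0.9                                    # target predictor correlation (illustrative)
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
P = np.column_stack([x1, x2])

r12 = np.corrcoef(x1, x2)[0, 1]
print("regression-based VIFs:", vif(P))
print("closed form 1/(1 - r12^2):", 1.0 / (1.0 - r12**2))
```

Both VIFs agree with the closed form because, with a single other predictor, the auxiliary R² is exactly the squared sample correlation.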
Widget 2: VIF and CI swelling
The second widget gives the reader a direct knob on the predictor correlation. Slide $\rho$ from 0 to 0.99 and watch the sample correlation track it, the VIF rise (on a log scale, so VIF = 100 fits on screen), and the 95% confidence intervals for $\beta_1$ and $\beta_2$ stretch ever wider, even though the TRUE coefficients stay fixed at $\beta_1 = \beta_2 = 0.5$. Click "Rerun sample" to draw a new sample at the current $\rho$; the BETA POINT ESTIMATES jump around much more at high $\rho$ than at low $\rho$. That sample-to-sample instability IS variance inflation in action.
Things to verify in the widget:
- At $\rho = 0$, the sample correlation $r_{12} \approx 0$, both VIFs are very near 1, and the 95% CIs for $\beta_1$ and $\beta_2$ are tight around 0.5. This is the BASELINE.
- Slide $\rho$ up to 0.50. The sample $r_{12}$ tracks; both VIFs land near $1/(1 - 0.25) \approx 1.33$. The CIs are slightly wider but still tight. Multicollinearity is not biting yet.
- Slide to 0.90. Sample $r_{12} \approx 0.90$; VIFs jump to $1/(1 - 0.81) \approx 5.3$. The CIs are visibly wider. We are in the "mild concern" zone.
- Slide to 0.95. VIFs ≈ $1/(1 - 0.9025) \approx 10.3$, across the "severe" threshold. The CIs are now markedly wider; the bar chart shows the VIF bars crossing the red-line threshold. The status table flags the SEVERE warning.
- Slide to 0.99. VIFs ≈ $1/(1 - 0.9801) \approx 50$. The CIs sprawl across most of the plot: neither coefficient is identifiable with any precision on its own. The point estimates are nearly meaningless individually; only their SUM is stable (because that's what the data actually constrain).
- At any high $\rho$, click "Rerun sample" repeatedly. The point estimates DANCE WILDLY around their true 0.5: sometimes one is near 0.9, the other near 0.1; sometimes both are near 0.5; sometimes one is NEGATIVE. The CIs usually cover 0.5, but their width and the wild point estimates make individual interpretation impossible. At low $\rho$, the same "Rerun" produces small jitter around 0.5.
- Read the formula off the widget: with only two predictors, $\mathrm{VIF} = 1/(1 - r_{12}^2)$, where $r_{12}$ is the pairwise sample correlation. VIF goes to infinity as $|r_{12}| \to 1$; the "matrix singular" annotation in the widget marks where numerical issues start to bite.
Fixes for multicollinearity:
- Drop a redundant predictor. If $x_1$ and $x_2$ are measuring nearly the same construct, including both adds noise without adding information. Pick the one with cleaner measurement / better theoretical grounding.
- Combine via PCA on the design. Project $(x_1, x_2)$ onto its leading principal component; use that single combined variable. The PCA component captures most of the variation; the orthogonal direction (which is what the individual $\hat\beta_j$s try to estimate) is the unstable part.
- Ridge regression (Part 9 §9.2). Replace $(X^\top X)^{-1}$ with $(X^\top X + \lambda I)^{-1}$ for some $\lambda > 0$. The ridge penalty pushes the smallest eigenvalue of $X^\top X + \lambda I$ safely away from zero, stabilising the inverse. The trade-off is bias (toward zero) for variance reduction; cross-validation picks $\lambda$.
- Centring / re-parameterisation. When the collinearity arises from polynomial terms (e.g., $x$ and $x^2$ are correlated when $x$ doesn't straddle zero), centring $x$ at its mean before squaring reduces the VIF dramatically. Same algebra; better-conditioned design.
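The centring fix in the last bullet is easy to check numerically. The sketch below assumes NumPy and a hypothetical predictor drawn uniformly on [5, 15], a range that does not straddle zero. It compares the two-predictor VIF for (x, x²) before and after centring.

```python
import numpy as np

def two_predictor_vif(a, b):
    """VIF shared by two predictors: 1 / (1 - r^2), r = sample correlation."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r**2)

rng = np.random.default_rng(8)
x = rng.uniform(5.0, 15.0, size=300)        # hypothetical predictor, all values positive

vif_raw = two_predictor_vif(x, x**2)        # x and x^2 nearly collinear on this range

xc = x - x.mean()                           # centre before squaring
vif_centred = two_predictor_vif(xc, xc**2)

print(f"VIF with raw x and x^2:     {vif_raw:.1f}")
print(f"VIF with centred x and x^2: {vif_centred:.2f}")
```

The fitted curve is the same quadratic in both parameterisations; only the coordinates used to describe it change, which is why the conditioning improves without altering the fit.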
Honest caveats: diagnostics are not algorithms
The four-panel residual workflow is the standard apparatus, but four caveats keep practice honest:
- Plot first, test second. Statistical tests for assumption violations (Breusch-Pagan, Durbin-Watson, Shapiro-Wilk) have LOW POWER in small $n$. A non-rejection at $n = 30$ is much weaker evidence of "assumptions hold" than a non-rejection at $n = 1000$. The residual PLOTS stay informative across sample sizes, whereas formal tests under-reject in small samples and flag practically trivial departures in large ones.
- Multiple violations can mask each other. Curvature can LOOK like heteroscedasticity (the U-shape spreads residuals more in the wings). Autocorrelation can LOOK like a trend in residuals vs index. An outlier can dominate the Q-Q plot in a way that hides the underlying heavy tails. Diagnose one assumption at a time, ideally on a refit after addressing the most obvious violation.
- "Normality of residuals" ≠ "Normality of Y". The Gauss-Markov+Normality assumption is about the CONDITIONAL distribution of $\varepsilon$ given $X$, NOT the marginal distribution of $Y$. If $Y$ is bimodal because the population has two subgroups with different means, and the predictor codes for subgroup, then $\varepsilon$ can be unimodal and Normal even though $Y$ is bimodal. Always inspect the RESIDUAL Q-Q, not the raw $Y$ Q-Q.
- Small samples lie. At small $n$, a residual plot can look perfectly clean even when assumptions are violated, and conversely a plot can look bad by chance even when assumptions hold. The §0.7 CLT and §1.9 asymptotics give the regime where diagnostics become RELIABLE; below that regime, visual inspection is necessary but not sufficient. Bootstrap CIs (§3.2) are often the right safety net.
Each fix preserves the §4.1 geometry
One unifying theme — every fix in this section can be read as a specific perturbation of the §4.1 projection picture:
- Polynomials / splines / transformations ENLARGE $\mathrm{col}(X)$ by adding columns. The projection lands in the bigger subspace; $\hat{Y}$ can track more of the true conditional mean.
- GLS / WLS CHANGES THE INNER PRODUCT. Instead of orthogonal projection in Euclidean $\mathbb{R}^n$, project in $\mathbb{R}^n$ with the inner product $\langle u, v \rangle = u^\top \Sigma^{-1} v$, where $\Sigma$ is the error covariance. Same geometric structure, different metric.
- Robust regression (M-estimators, §4.5) CHANGES THE LOSS from $\|Y - X\beta\|^2$ to $\sum_i \rho(Y_i - x_i^\top \beta)$ with a slower-growing $\rho$. The closest-point logic still applies, but in a non-Euclidean sense; the β̂ that minimises the new loss downweights outliers.
- Ridge (Part 9 §9.2) ADDS A PENALTY. The new objective is $\|Y - X\beta\|^2 + \lambda \|\beta\|^2$; geometrically, the level sets carve the projection differently and the estimator shrinks toward the origin.
- White / Newey-West SEs KEEP THE PROJECTION but change the variance estimate. Same β̂; replace the sandwich filling so SEs are robust to the actual covariance structure of $\varepsilon$.
- Drop a predictor / PCA SHRINKS $\mathrm{col}(X)$. Project onto a lower-dimensional subspace; coefficients become identifiable again.
The geometric picture is the SAME backbone; the fixes are different perturbations of it. That is why §4.1 invested so much in the projection-first framing: every later section of Part 4 and large chunks of Parts 5 and 9 reduce to "perturb the projection in a specific way."
Try it
- In the assumption-diagnostics-suite, start on "clean" and click "New sample" five times. Confirm: all four panels stay featureless; BP stays under 3.84; DW stays between 1.5 and 2.5; excess kurtosis stays under ~0.7 in magnitude. This is the no-violation baseline pattern.
- Same widget, switch to "curvature". Identify the U-shape in the residuals-vs-fitted panel. Read off the σ̂ value and note that it summarises a SYSTEMATIC misspecification with a SINGLE NUMBER, which is misleading. State the fix: add $x^2$ as a column of $X$, which would let the projection capture the curvature.
- Same widget, switch to "heteroscedastic". Identify the fan in residuals-vs-fitted AND the rising trend in scale-location. Read off the Breusch-Pagan statistic; it should land well above 3.84. State one fix that keeps β̂ unchanged (White SEs) and one that changes β̂ to a better estimator (WLS / GLS).
- Same widget, switch to "autocorrelated". Identify the snake in residuals-vs-index — long same-sign runs. Read off the Durbin-Watson statistic; it should drop well below 1.5 (typical: 0.7-1.2). State why the Q-Q plot still looks roughly Normal: AR(1) errors are MARGINALLY Normal — the autocorrelation hides in the JOINT distribution of consecutive errors, which is what residuals-vs-index exposes.
- Same widget, switch to "outlier". Locate the single off-cloud point in residuals-vs-fitted; locate the single off-line point in the Q-Q tails. Note that DW is unaffected (outliers are not autocorrelation) and BP may or may not be affected. State the fix: robust regression (§4.5) downweights the outlier so converges back toward the truth.
- Same widget, switch to "non-normal". Observe: the first three panels look healthy; only the Q-Q plot reveals the heavy tails (both ends curl AWAY from the diagonal). State the consequence: large- inference is fine by the CLT; small- exact t/F intervals are off. State two fixes: bootstrap CIs (preserves ; corrects intervals) and robust regression (changes to a higher-efficiency estimator under heavy tails).
- In the vif-multicollinearity widget, slide from 0 to 0.99 step by step. Verify: VIF tracks the formula where is the sample correlation. At , VIF ≈ 5; at , VIF ≈ 10; at , VIF ≈ 50. The CIs widen monotonically.
- Same widget, set and click "Rerun sample" eight times. Record the values. Note the wild scatter — sometimes one is near 0.9 while the other is near 0.1; sometimes both are 0.5. State why their SUM is much more stable: the data identify (the direction of the design's long axis), but not the individual coefficients (the direction the design barely covers).
- Pen-and-paper. Derive VIF = 1 / (1 − r²) from scratch for the two-predictor case. Hint: Var(β̂) = σ²(XᵀX)⁻¹; expand (XᵀX)⁻¹ for the centred 2×2 case and factor out the no-collinearity baseline. The remaining factor IS 1 / (1 − r²), where r is the predictor-predictor correlation.
- Pen-and-paper. Write the FIX MATRIX as a 2×6 table: rows = "what is biased" / "what has wrong SEs"; columns = linearity-fail / heteroscedasticity / autocorrelation / multicollinearity / outliers / non-Normal-errors. For each cell, name the recommended fix. This is the one-screen reference an analyst keeps next to the keyboard.
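The "Rerun sample" exercise can also be imitated offline. The sketch below is a minimal simulation, not the widget's own code. It assumes true coefficients β₁ = β₂ = 0.5 (suggested by the values quoted in the exercise), a population correlation of 0.99, n = 200 and unit noise variance; the function names `fit_two_predictors` and `rerun_sample` are hypothetical.

```python
import random
import statistics


def fit_two_predictors(x1, x2, y):
    """OLS slopes for y ~ 1 + x1 + x2 via the centred 2x2 normal equations."""
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    c1 = [a - m1 for a in x1]
    c2 = [b - m2 for b in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 * s12  # shrinks toward 0 as x1 and x2 become collinear
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def rerun_sample(rng, n=200, rho=0.99, beta=(0.5, 0.5), sigma=1.0):
    """Draw one sample with corr(x1, x2) = rho (assumed values) and fit it."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rho * a + (1 - rho**2) ** 0.5 * rng.gauss(0, 1) for a in x1]
    y = [beta[0] * a + beta[1] * b + rng.gauss(0, sigma) for a, b in zip(x1, x2)]
    return fit_two_predictors(x1, x2, y)


rng = random.Random(0)
fits = [rerun_sample(rng) for _ in range(200)]
sd_b1 = statistics.stdev([b1 for b1, _ in fits])
sd_b2 = statistics.stdev([b2 for _, b2 in fits])
sd_sum = statistics.stdev([b1 + b2 for b1, b2 in fits])
print(f"sd(b1) = {sd_b1:.3f}  sd(b2) = {sd_b2:.3f}  sd(b1 + b2) = {sd_sum:.3f}")
```

Under these assumptions the spread of the sum should come out several times smaller than the spread of either coefficient, which is the numerical face of "the data identify β₁ + β₂ but not the individual coefficients".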
Pause and reflect: §4.2 has built the diagnostic stack on top of the §4.1 geometry. Each Gauss-Markov assumption has a SPECIFIC RESIDUAL-PLOT SIGNATURE: curvature in residuals-vs-fitted for linearity failure; fan in residuals-vs-fitted plus rising trend in scale-location for heteroscedasticity; snake in residuals-vs-index for autocorrelation; off-line tails in Q-Q for non-Normality; off-cloud points in residuals-vs-fitted plus off-line tail point in Q-Q for outliers; pair-wise sample correlation near 1 (VIF > 10) for multicollinearity. Each failure has a CONSEQUENCE for OLS: linearity failure BIASES β̂; heteroscedasticity / autocorrelation / heavy tails make SEs unreliable but leave β̂ unbiased; outliers BIAS β̂; multicollinearity makes β̂ UNSTABLE (large CIs). And each failure has a FIX, all of which preserve the §4.1 projection geometry: enlarge col(X) for linearity, change the inner product for GLS, change the loss for robust regression, change the variance estimator for sandwich SEs, shrink col(X) for collinearity, add a penalty for ridge. The four-panel diagnostic + VIF check is the operational checklist; §4.3 dives deeper into per-observation diagnostics (Cook's distance, DFFITS, DFBETAS) that supplement these panels for influence detection.
What you now know
You can RECITE the five Gauss-Markov assumptions plus the optional Normality assumption, and for each one you can state which residual-plot panel reveals its violation, what is broken in OLS as a result (biased β̂? wrong SEs? both?), and what fix is appropriate.
You can READ the four canonical residual-plot panels: residuals vs fitted (linearity + mean structure), Q-Q plot of standardised residuals (Normality), residuals vs index (independence), scale-location √|standardised residual| vs fitted (homoscedasticity). You recognise each panel's specific signature and you can read off the Breusch-Pagan statistic (≈ χ²₁ under H₀ of constant variance), the Durbin-Watson statistic (≈ 2 under no autocorrelation), and the skewness / excess kurtosis of standardised residuals.
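Both statistics are short enough to compute by hand. The sketch below is illustrative only: it uses the studentised n·R² form of the Breusch-Pagan statistic (regress squared residuals on the fitted values), which is an assumption and may differ from the form the widget reports, and it simulates three hypothetical scenarios with made-up parameters. All function names are hypothetical.

```python
import random
import statistics


def ols_simple(x, y):
    """Intercept and slope of the least-squares line y ~ a + b x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def r_squared(x, y):
    """R^2 of the simple regression of y on x."""
    a, b = ols_simple(x, y)
    my = statistics.fmean(y)
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / sum((yi - my) ** 2 for yi in y)


def breusch_pagan(fitted, resid):
    """Studentised BP: n * R^2 from regressing e_i^2 on the fitted values."""
    return len(resid) * r_squared(fitted, [e * e for e in resid])


def durbin_watson(resid):
    """DW = sum (e_i - e_{i-1})^2 / sum e_i^2, residuals in index order."""
    num = sum((e1 - e0) ** 2 for e0, e1 in zip(resid, resid[1:]))
    return num / sum(e * e for e in resid)


def diagnostics(x, y):
    a, b = ols_simple(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return breusch_pagan(fitted, resid), durbin_watson(resid)


rng = random.Random(1)
n = 400
x = [rng.uniform(1, 10) for _ in range(n)]

# Scenario 1: constant-variance, independent errors.
y_ok = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
# Scenario 2: error sd grows with x, which produces a fan.
y_het = [2 + 0.5 * xi + rng.gauss(0, 0.3 * xi) for xi in x]
# Scenario 3: AR(1) errors with coefficient 0.6 in index order.
ar_err, prev = [], 0.0
for _ in range(n):
    prev = 0.6 * prev + rng.gauss(0, 1)
    ar_err.append(prev)
y_ar = [2 + 0.5 * xi + e for xi, e in zip(x, ar_err)]

scenarios = {"healthy": y_ok, "heteroscedastic": y_het, "autocorrelated": y_ar}
results = {label: diagnostics(x, y) for label, y in scenarios.items()}
for label, (bp, dw) in results.items():
    print(f"{label:16s} BP = {bp:7.2f}   DW = {dw:.2f}")
```

In the heteroscedastic scenario BP should sit far above the χ²₁ 5% critical value of 3.84 while DW stays near 2; in the autocorrelated scenario DW should fall near 2(1 − 0.6) = 0.8.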
You can state the LINEARITY-FAILURE story: curvature in residuals-vs-fitted; β̂ is biased; fix with polynomial / spline terms (§4.6) or transform Y. You can state the HETEROSCEDASTICITY story: fan in residuals-vs-fitted, rising scale-location, BP statistic flags it; β̂ unbiased, SEs wrong; fix with White (1980) sandwich SEs, WLS / GLS (§4.4), or transformation. You can state the AUTOCORRELATION story: snake in residuals-vs-index, DW drops below 2; β̂ unbiased, SEs understated; fix with Newey-West (1987) HAC SEs, explicit ARMA model, or cluster-robust SEs.
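The "β̂ unbiased, SEs wrong" claim for heteroscedasticity can be checked by simulation. The following is a minimal sketch for simple regression only, using the HC0 form of White's estimator for the slope, Σ(xᵢ − x̄)²eᵢ² / (Σ(xᵢ − x̄)²)². The design (error standard deviation equal to |x|) and all names are assumptions chosen to make the effect visible, not anything taken from the widgets.

```python
import random
import statistics


def slope_with_ses(x, y):
    """OLS slope of y ~ a + b x with classical and White (HC0) standard errors."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    dx = [xi - mx for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - my) for d, yi in zip(dx, y)) / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    # Classical formula: one pooled error variance s^2 for every observation.
    s2 = sum(e * e for e in resid) / (n - 2)
    se_classical = (s2 / sxx) ** 0.5
    # HC0 sandwich: each squared residual stands in for its own error variance.
    meat = sum(d * d * e * e for d, e in zip(dx, resid))
    se_hc0 = meat ** 0.5 / sxx
    return b, se_classical, se_hc0


rng = random.Random(3)
slopes, classical, robust = [], [], []
for _ in range(300):
    x = [rng.gauss(0, 1) for _ in range(200)]
    # Assumed design: error sd equals |x|, so noise is largest far from the mean of x.
    y = [1.0 + 2.0 * xi + rng.gauss(0, abs(xi)) for xi in x]
    b, se_c, se_r = slope_with_ses(x, y)
    slopes.append(b)
    classical.append(se_c)
    robust.append(se_r)

mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)        # Monte Carlo sd of the slope estimate
mean_classical = statistics.fmean(classical)
mean_hc0 = statistics.fmean(robust)
print(f"mean slope          = {mean_slope:.3f} (true value 2.0)")
print(f"Monte Carlo sd      = {sd_slope:.3f}")
print(f"mean classical SE   = {mean_classical:.3f}")
print(f"mean White (HC0) SE = {mean_hc0:.3f}")
```

Under this design the average slope should stay close to the true value, the classical SE should understate the Monte Carlo spread, and the HC0 SE should track it much more closely.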
You can distinguish OUTLIERS (a single high-influence point) from HEAVY-TAILED ERRORS (a distributional property). Outliers bias β̂; heavy tails preserve consistency but degrade efficiency and break small-n exact inference. Fixes: robust regression (§4.5) for both; bootstrap CIs (§1.7, §3.2) for the inference side of heavy tails; subject-matter investigation for any single suspect point.
You can state MULTICOLLINEARITY quantitatively: VIF_j = 1 / (1 − R_j²); thresholds VIF ≤ 5 fine, 5 < VIF ≤ 10 mild concern, VIF > 10 severe. For two predictors, both VIFs collapse to 1 / (1 − r²) where r is the sample correlation. Consequence: β̂ is unstable across samples; SEs blow up; CIs swell. Fix: drop a redundant predictor, combine via PCA, or use ridge (Part 9 §9.2).
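As a quick numerical check of the two-predictor formula and the thresholds just quoted, the sketch below tabulates VIF against r and confirms on simulated predictors that the auxiliary-regression definition 1 / (1 − R_j²) and the correlation form 1 / (1 − r²) agree. The simulated data and the function names are hypothetical.

```python
import random
import statistics


def vif_from_r(r):
    """Two-predictor VIF as a function of the predictor correlation r."""
    return 1.0 / (1.0 - r * r)


def vif_from_auxiliary_regression(target, other):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing target on other."""
    mt, mo = statistics.fmean(target), statistics.fmean(other)
    soo = sum((o - mo) ** 2 for o in other)
    sto = sum((t - mt) * (o - mo) for t, o in zip(target, other))
    stt = sum((t - mt) ** 2 for t in target)
    slope = sto / soo
    ss_res = sum((t - mt - slope * (o - mo)) ** 2 for t, o in zip(target, other))
    r2 = 1.0 - ss_res / stt
    return 1.0 / (1.0 - r2)


def severity(vif):
    """Thresholds quoted in the text: <= 5 fine, 5 to 10 mild concern, > 10 severe."""
    if vif <= 5:
        return "fine"
    return "mild concern" if vif <= 10 else "severe"


for r in (0.0, 0.5, 0.9, 0.95, 0.99):
    vif = vif_from_r(r)
    print(f"r = {r:4.2f}  VIF = {vif:6.2f}  {severity(vif)}")

# Cross-check on simulated predictors with population correlation 0.9 (assumed).
rng = random.Random(2)
x1 = [rng.gauss(0, 1) for _ in range(500)]
x2 = [0.9 * a + (1 - 0.9**2) ** 0.5 * rng.gauss(0, 1) for a in x1]
m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
r_sample = s12 / (s11 * s22) ** 0.5
vif_corr = vif_from_r(r_sample)
vif_aux = vif_from_auxiliary_regression(x1, x2)
print(f"sample r = {r_sample:.3f}  VIF (correlation) = {vif_corr:.4f}  VIF (auxiliary) = {vif_aux:.4f}")
```

The table gives 5.26 at r = 0.9, 10.26 at r = 0.95 and 50.25 at r = 0.99, so each value sits just above the round number it is usually quoted as.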
You can articulate the HONEST CAVEATS: diagnostics are visual first and statistical second; multiple violations can mask each other; the Normality assumption concerns the RESIDUALS conditional on X, not Y marginally; statistical tests for assumption violations have low power in small n. The assumption-diagnostics-suite widget operationalises the four-panel workflow; the vif-multicollinearity widget makes VIF inflation and CI swelling tangible.
You can articulate why the §4.1 GEOMETRY survives every fix: each fix is a specific perturbation of the projection picture — enlarge col(X) for linearity, change the inner product for GLS, change the loss for robust regression, change the variance estimator for sandwich SEs, shrink col(X) for multicollinearity, add a penalty for ridge. The projection-first framing of §4.1 is the unifying backbone for the rest of Part 4 and major chunks of Parts 5 and 9.
Where this lands. §4.3 dives deeper into per-observation diagnostics: leverage (already introduced in §4.1), Cook's distance, DFFITS, DFBETAS, partial-residual plots, added-variable plots, and the formal hypothesis tests for individual and joint coefficient significance. §4.4 develops GLS and WLS for known and estimated error covariance matrices. §4.5 develops robust regression (M-estimators, S-estimators, MM-estimators) for heavy-tailed errors and outlier robustness. §4.6 handles interactions, polynomial terms, and basis expansions as concrete ways to enlarge col(X). §4.7 covers model selection (AIC, BIC, cross-validation). §4.8 closes Part 4 with the causal-interpretation warnings. Part 9 §9.2 develops ridge and lasso as principled fixes to multicollinearity and over-fitting.
References
- White, H. (1980). "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity." Econometrica 48(4), 817–838. (The original "sandwich" estimator. Replaces σ̂²(XᵀX)⁻¹ with (XᵀX)⁻¹ Xᵀ diag(eᵢ²) X (XᵀX)⁻¹; consistent for the true variance of β̂ under heteroscedasticity. The starting point of modern robust-SE practice.)
- Breusch, T.S., Pagan, A.R. (1979). "A simple test for heteroscedasticity and random coefficient variation." Econometrica 47(5), 1287–1294. (The BP statistic: regress the squared residuals eᵢ² on a vector of variance-determinants; the resulting Lagrange-multiplier statistic is asymptotically χ² under H₀ of homoscedasticity, with one degree of freedom per variance-determinant. The standard formal test.)
- Durbin, J., Watson, G.S. (1950). "Testing for serial correlation in least squares regression: I." Biometrika 37(3-4), 409–428. (The DW statistic for first-order autocorrelation. Sequel papers (1951, 1971) extend the distribution theory. Standard tables give exact critical bounds depending on n and the design.)
- Newey, W.K., West, K.D. (1987). "A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix." Econometrica 55(3), 703–708. (HAC standard errors. Generalises White (1980) by adding lagged cross-products with a Bartlett (triangular) kernel weight, ensuring positive semi-definite estimates even with long-range autocorrelation.)
- Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (The foundational reference on residual diagnostics, leverage, Cook's distance, DFFITS, DFBETAS, VIFs, and condition indices. The VIF > 10 threshold used in §4.2 and the leverage threshold from §4.1 come from this book.)
- Cook, R.D., Weisberg, S. (1982). Residuals and Influence in Regression. London: Chapman & Hall. (Companion volume to Belsley-Kuh-Welsch focused on residual and influence diagnostics. Cook's distance, partial-residual plots, added-variable plots, and the full diagnostic toolkit that §4.3 builds on come from this book.)
- Greene, W.H. (2018). Econometric Analysis, 8th ed. New York: Pearson. (Standard graduate-level econometrics reference. Chapter 4 covers Gauss-Markov assumptions and consequences of their failure; Chapter 9 covers heteroscedasticity in depth (including White SEs and FGLS); Chapter 20 covers serial correlation and Newey-West. The applied-econometrics complement to ESL.)
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. New York: Springer. (Chapter 3 develops linear regression with residual diagnostics integrated into the geometric / projection presentation. Section 3.4 introduces ridge regression as the principled response to multicollinearity. The canonical modern reference for statistical learning theory.)
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. New York: Springer. (Chapter 13 covers linear regression in the mathematical-statistics tradition. The compact derivation of residual diagnostics and consequences of assumption failure complements the applied-econometrics treatment in Greene.)