Diagnostics: residuals, leverage, influence

Part 4 — Linear regression, done seriously

Learning objectives

  • Distinguish outlier, leverage, and influence as three SEPARATE properties of an observation
  • Compute and interpret standardized vs externally studentized residuals
  • Apply Cook's distance, DFFITS, and DFBETAS to identify high-influence points
  • Read the canonical four-panel regression-diagnostic display
  • Decide when a high-influence point should be kept, investigated, or replaced by a robust fit

§4.2 catalogued how each Gauss–Markov assumption can fail. §4.3 narrows in on a single but commonly confused diagnostic: identifying observations that materially distort the fit. The trap is that "outlier", "high-leverage", and "high-influence" are three DIFFERENT properties, often conflated by software defaults and by the casual term outlier. This section makes the distinctions sharp and lays out the canonical OLS-diagnostic toolkit.

Three concepts, three positions in the (X, Y) plane

  • Outlier: an observation whose $Y_i$ value lies far from the line implied by the rest of the data (large residual after fitting).
  • Leverage: an observation whose $\mathbf{x}_i$ lies far from the bulk of the design (large $h_{ii}$ from the hat matrix in §4.1).
  • Influence: an observation whose removal materially changes $\hat{\boldsymbol{\beta}}$. Influence is approximately the product of how outlying Y is and how far X is from the bulk: roughly $\text{influence} \approx \text{residual} \times \text{leverage}$.

A point can have leverage without being influential (large $h_{ii}$ but on the same line as everyone else, so it just anchors the fit). It can be a residual outlier without being influential (large $e_i$ but small $h_{ii}$, so the rest of the data has enough weight to ignore it). The DANGEROUS case is high leverage AND large residual, where a single point bends the entire fit.
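The three cases can be checked numerically. The sketch below is a minimal illustration in plain Python for simple regression (one predictor plus intercept), where the hat value has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The data, the helper names `fit_line`, `leverage`, and `add_point`, and the offsets are all invented for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the OLS line for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def leverage(x, i):
    """Hat value h_ii in simple regression: 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return 1 / n + (x[i] - xbar) ** 2 / sxx

# Hypothetical base data: 20 points close to y = 2 + 0.5 x.
base_x = [float(i) for i in range(20)]
base_y = [2 + 0.5 * xi + 0.1 * ((7 * i) % 5 - 2) for i, xi in enumerate(base_x)]
b0, b1 = fit_line(base_x, base_y)  # fit WITHOUT the extra point

def add_point(x_new, offset):
    """Append one point at x_new, `offset` units above the base-data line."""
    x = base_x + [x_new]
    y = base_y + [b0 + b1 * x_new + offset]
    i = len(x) - 1
    a0, a1 = fit_line(x, y)  # fit WITH the extra point
    return {
        "leverage": leverage(x, i),
        "residual": y[i] - (a0 + a1 * x[i]),
        "slope_shift": a1 - b1,
        "intercept_shift": a0 - b0,
    }

cases = {
    "high leverage, on the line": add_point(40.0, 0.0),
    "low leverage, large residual": add_point(9.5, 8.0),
    "high leverage, large residual": add_point(40.0, -15.0),
}
for name, c in cases.items():
    print(f"{name:32s} h={c['leverage']:.3f}  e={c['residual']:+.2f}  "
          f"slope shift={c['slope_shift']:+.3f}  "
          f"intercept shift={c['intercept_shift']:+.3f}")
```

In this toy setup the on-line point leaves both coefficients unchanged despite its large hat value. The central outlier leaves the slope unchanged and nudges only the intercept. Only the third point moves the slope appreciably.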

Residual types

  • Raw residual $e_i = Y_i - \hat{Y}_i$. Even under correct OLS, $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$, so the variance is smaller at high-leverage points. Comparing raw residuals across observations is misleading.
  • Standardized (internally studentized) residual $r_i = \dfrac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}$. Under a correct model, $r_i \sim N(0, 1)$ approximately. Reweights so the variance is comparable across all observations.
  • Externally studentized residual $t_i = \dfrac{e_i}{\hat{\sigma}_{(-i)} \sqrt{1 - h_{ii}}}$, where $\hat{\sigma}_{(-i)}$ is the residual standard error from the regression EXCLUDING observation i. Under $H_0$ (the model fits observation i too), $t_i \sim t_{n-p-1}$, where p counts the estimated coefficients, intercept included. Better for outlier hypothesis testing because $\hat{\sigma}_{(-i)}$ is not contaminated by i.

Rules of thumb: $|r_i| > 2$ is moderately large; $|r_i| > 3$ deserves attention; $|t_i| > 2$ tested against $t_{n-p-1}$ gives a (rough, multiplicity-unadjusted) outlier flag.
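As a hedged illustration of the two studentized residuals, the sketch below computes $r_i$ from the full fit and $t_i$ by literally refitting without observation i. It then checks the result against the standard identity $t_i = r_i \sqrt{(n-p-1)/(n-p-r_i^2)}$, which avoids the n refits. It assumes simple regression (p = 2) and made-up data with one planted outlier; `fit_line` and `residual_se` are illustrative helper names.

```python
import math

def fit_line(x, y):
    """OLS intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def residual_se(x, y, p=2):
    """sigma-hat = sqrt(SSE / (n - p)); p = 2 coefficients here."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - p))

# Hypothetical data: y = 1 + 2x plus deterministic noise, one bumped point.
x = [float(i) for i in range(12)]
y = [1 + 2 * xi + 0.5 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
y[6] += 6.0  # plant a residual outlier near the centre of the design

n, p = len(x), 2
b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]      # raw residuals
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]      # leverages
s = residual_se(x, y)

# Internally studentized: sigma-hat comes from the full fit.
r = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]

# Externally studentized: same e_i and h_ii, sigma-hat from a refit without i.
t = []
for i in range(n):
    s_wo = residual_se(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    t.append(e[i] / (s_wo * math.sqrt(1 - h[i])))

# Closed form that avoids the refits.
t_short = [ri * math.sqrt((n - p - 1) / (n - p - ri ** 2)) for ri in r]

for i in range(n):
    print(f"i={i:2d}  e={e[i]:+6.2f}  r={r[i]:+6.2f}  t={t[i]:+6.2f}")
```

For the planted point the externally studentized value is the larger of the two, because dropping that point shrinks the estimate of $\sigma$.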

Influence diagnostics

The three standard influence measures answer slightly different questions.

Cook's distance (Cook 1977):

$$D_i = \dfrac{e_i^2 \, h_{ii}}{p \, \hat{\sigma}^2 (1 - h_{ii})^2} = \dfrac{r_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}}.$$

A single number summarising how much $\hat{\boldsymbol{\beta}}$ would change if observation i were removed. Standard threshold: $D_i > 4/n$ warrants investigation; $D_i > 1$ is a strong red flag.

DFFITS:

$$\mathrm{DFFITS}_i = t_i \sqrt{\dfrac{h_{ii}}{1 - h_{ii}}}.$$

The change in $\hat{Y}_i$ when observation i is removed, standardized by the SE of $\hat{Y}_i$ (computed with $\hat{\sigma}_{(-i)}$, i.e. $\hat{\sigma}_{(-i)} \sqrt{h_{ii}}$). Threshold: $|\mathrm{DFFITS}_i| > 2\sqrt{p/n}$.
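A small sketch can confirm that the formula and the verbal description agree. It computes DFFITS as $t_i \sqrt{h_{ii}/(1-h_{ii})}$ and again as the refit-based shift in the i-th fitted value divided by $\hat{\sigma}_{(-i)} \sqrt{h_{ii}}$. The data and the helper `fit_line` are invented for illustration, and simple regression (p = 2) is assumed.

```python
import math

def fit_line(x, y):
    """OLS intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: ten points near y = 1 + 2x, one far-out point below it.
x = [float(i) for i in range(10)] + [25.0]
y = [1 + 2 * xi + 0.5 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
y[-1] -= 12.0

n, p = len(x), 2
b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
fitted = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fitted)]
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

dffits_formula, dffits_refit = [], []
for i in range(n):
    x_wo, y_wo = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit_line(x_wo, y_wo)
    sse_wo = sum((yj - a0 - a1 * xj) ** 2 for xj, yj in zip(x_wo, y_wo))
    s_wo = math.sqrt(sse_wo / (n - 1 - p))         # sigma-hat without i
    t_i = e[i] / (s_wo * math.sqrt(1 - h[i]))      # externally studentized
    dffits_formula.append(t_i * math.sqrt(h[i] / (1 - h[i])))
    # Shift in the i-th fitted value, in units of its leave-one-out SE.
    dffits_refit.append((fitted[i] - (a0 + a1 * x[i])) / (s_wo * math.sqrt(h[i])))

cutoff = 2 * math.sqrt(p / n)
flagged = [i for i, d in enumerate(dffits_formula) if abs(d) > cutoff]
print("cutoff:", round(cutoff, 3), "flagged indices:", flagged)
```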

$\mathrm{DFBETAS}_{i,j}$ for each coefficient j:

$$\mathrm{DFBETAS}_{i,j} = \dfrac{\hat{\beta}_j - \hat{\beta}_{j,(-i)}}{\mathrm{SE}(\hat{\beta}_{j,(-i)})}.$$

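Per-coefficient influence: which specific coefficients does observation i pull, and by how much (in SE units)? In the Belsley, Kuh and Welsch definition the denominator is computed as $\hat{\sigma}_{(-i)} \sqrt{[(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}$, i.e. the usual SE of $\hat{\beta}_j$ with $\hat{\sigma}$ replaced by $\hat{\sigma}_{(-i)}$. Threshold: $|\mathrm{DFBETAS}_{i,j}| > 2/\sqrt{n}$.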
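The per-coefficient view can be sketched by brute force: refit without each observation and scale the coefficient shifts. The scaling below follows the Belsley, Kuh and Welsch convention (leave-one-out $\hat{\sigma}_{(-i)}$ combined with the full-data $(\mathbf{X}^\top \mathbf{X})^{-1}$ diagonal), which is stated here as an assumption. The data and the helper `fit_line` are invented, and the design is simple regression with columns [1, x].

```python
import math

def fit_line(x, y):
    """OLS intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: ten points near y = 1 + 2x, one far-out point below it.
x = [float(i) for i in range(10)] + [25.0]
y = [1 + 2 * xi + 0.5 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
y[-1] -= 12.0

n, p = len(x), 2
b0, b1 = fit_line(x, y)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sum_x2 = sum(xi ** 2 for xi in x)

# Diagonal of (X'X)^{-1} for the design [1, x]: (intercept, slope).
c_diag = (sum_x2 / (n * sxx), 1 / sxx)

dfbetas = []  # one (intercept, slope) pair per observation
for i in range(n):
    x_wo, y_wo = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
    a0, a1 = fit_line(x_wo, y_wo)
    sse_wo = sum((yj - a0 - a1 * xj) ** 2 for xj, yj in zip(x_wo, y_wo))
    s_wo = math.sqrt(sse_wo / (n - 1 - p))  # sigma-hat without i
    dfbetas.append((
        (b0 - a0) / (s_wo * math.sqrt(c_diag[0])),
        (b1 - a1) / (s_wo * math.sqrt(c_diag[1])),
    ))

cutoff = 2 / math.sqrt(n)
flagged = [(i, j) for i, row in enumerate(dfbetas)
           for j, v in enumerate(row) if abs(v) > cutoff]
for i, (d0, d1) in enumerate(dfbetas):
    print(f"i={i:2d}  DFBETAS intercept={d0:+7.2f}  slope={d1:+7.2f}")
print("(observation, coefficient) pairs over 2/sqrt(n):", flagged)
```

The signs carry information that Cook's distance discards: here the far-out low point drags the slope down and the intercept up.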

The four-panel diagnostic display

Calling plot() on a fitted lm object in R produces the canonical 4-panel display, also reproduced by most modern regression libraries:

  • Residuals vs fitted: checks linearity and homoscedasticity. Patterns indicate model misspecification; fanning indicates heteroscedasticity.
  • Q–Q plot of standardized residuals: checks Normality. Straight diagonal = Normal; a one-sided bow = skewness; an S-shape whose ends bend away from the diagonal (below it on the left, above it on the right) = heavy-tailed errors.
  • Scale–location ($\sqrt{|r_i|}$ vs fitted): a sharper view of heteroscedasticity than panel 1. A trend in the running average reveals that the variance changes with the prediction.
  • Leverage vs standardized residual with Cook's distance contours overlaid: the influence diagnostic. Points lying beyond the $D_i = 0.5$ contour, toward the right-hand corners of the panel, are the influence cases.
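The quantities behind the four panels are simple to assemble by hand. The sketch below returns the coordinates each panel would plot, without any plotting library, and derives the Cook's-distance contour by solving $D = (r^2/p) \cdot h/(1-h)$ for $|r|$. It assumes simple regression (p = 2), uses $(k - 0.5)/n$ plotting positions for the Q–Q panel (one common choice; software may differ), and runs on invented data. The names `diagnostic_panels` and `cook_contour` are hypothetical.

```python
import math
from statistics import NormalDist

def fit_line(x, y):
    """OLS intercept and slope for one predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def diagnostic_panels(x, y):
    """Return (horizontal, vertical) coordinates for each of the four panels."""
    n, p = len(x), 2
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    fitted = [b0 + b1 * xi for xi in x]
    e = [yi - fi for yi, fi in zip(y, fitted)]
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    s = math.sqrt(sum(ei ** 2 for ei in e) / (n - p))
    r = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]
    # Theoretical Normal quantiles at plotting positions (k - 0.5) / n.
    q = [NormalDist().inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return {
        "residuals_vs_fitted": (fitted, e),
        "qq": (q, sorted(r)),
        "scale_location": (fitted, [math.sqrt(abs(ri)) for ri in r]),
        "residuals_vs_leverage": (h, r),
    }

def cook_contour(level, h, p=2):
    """|r| at which Cook's distance equals `level` for leverage h."""
    return math.sqrt(level * p * (1 - h) / h)

# Hypothetical data: ten points near y = 1 + 2x, one far-out point below it.
x = [float(i) for i in range(10)] + [25.0]
y = [1 + 2 * xi + 0.5 * ((7 * i) % 5 - 2) for i, xi in enumerate(x)]
y[-1] -= 12.0

panels = diagnostic_panels(x, y)
lev, res = panels["residuals_vs_leverage"]
beyond_half = [i for i, (hi, ri) in enumerate(zip(lev, res))
               if abs(ri) > cook_contour(0.5, hi)]
print("points beyond the D = 0.5 contour:", beyond_half)
```

Because the contour is a function of leverage alone for a given level, a point can sit beyond it with a modest standardized residual, provided its leverage is high enough.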

[Interactive figure: Influence Quadrant Explorer]

[Interactive figure: Diagnostic Plots]

Try it

  • In the influence-quadrant explorer, drag a point ALONG the regression line at high leverage. Watch Cook's D stay tiny even as leverage climbs. Then pull the same point VERTICALLY off the line: Cook's D explodes. Confirm: influence requires BOTH leverage and residual.
  • Drag a point with low leverage (near the X centroid) far up the Y axis. Standardized residual grows; leverage stays small; Cook's D stays modest. Confirm: a residual outlier is not the same thing as influential.
  • In the diagnostic-plots widget, switch to the curvature scenario. Read each of the 4 panels in turn. Identify which panel reveals curvature (panel 1) vs heteroscedasticity (panel 3) vs Normality failures (panel 2) vs leverage (panel 4).
  • Switch to the heteroscedastic scenario. Note that the Q–Q plot can look acceptable even when scale–location is clearly fanning. Why? (Q–Q measures Normality of the MARGINAL distribution of standardized residuals; it does not check whether the conditional variance is constant.)
  • Switch to the high-influence-outlier scenario. Compare standardized vs externally studentized residual for the influential point. Which is larger in magnitude, and why? (The externally studentized version is larger because $\hat{\sigma}_{(-i)}$ is smaller without the outlier inflating it.)

If a point has $h_{ii} = 0.4$ but lies exactly on the OLS line through the rest of the data, what is its Cook's D? What does this say about removing high-leverage points by default?

What you now know

Outlier, leverage, and influence are three distinct properties of an observation, with influence arising only when leverage and residual are both large. The diagnostic 4-panel display reveals each assumption failure with a distinct visual signature. Cook's distance compresses the combined leverage×residual concern into one number, and the standardized vs externally studentized residual distinction matters for inference about whether a point is genuinely outlying versus inflating its own residual variance estimate. §4.4 takes the diagnosis seriously: when residuals show heteroscedasticity, refit by GLS or with sandwich SEs. §4.5 generalises further: if you have outliers you don't want to manually exclude, switch to a robust regression estimator with a bounded influence function.

References

  • Cook, R.D. (1977). "Detection of influential observations in linear regression." Technometrics 19(1), 15–18. (Introduces Cook's distance.)
  • Cook, R.D., Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman & Hall. (Book-length development of the influence framework.)
  • Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (Introduces DFFITS and DFBETAS; the canonical source for influence measures and thresholds.)
  • Hoaglin, D.C., Welsch, R.E. (1978). "The hat matrix in regression and ANOVA." The American Statistician 32(1), 17–22. (The accessible introduction to leverage and the hat matrix.)
  • Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models, 3rd ed. Thousand Oaks: Sage. (Modern applied treatment of regression diagnostics with R worked examples.)
