Diagnostics: residuals, leverage, influence

Part 4 — Linear regression, done seriously

Learning objectives

Distinguish outlier, leverage, and influence as three SEPARATE properties of an observation
Compute and interpret standardized vs externally studentized residuals
Apply Cook's distance, DFFITS, and DFBETAS to identify high-influence points
Read the canonical four-panel regression-diagnostic display
Decide when a high-influence point should be kept, investigated, or replaced by a robust fit

§4.2 catalogued how each Gauss–Markov assumption can fail. §4.3 narrows in on a single, commonly confused diagnostic task: identifying observations that materially distort the fit. The trap is that "outlier", "high-leverage", and "high-influence" are three DIFFERENT properties, often conflated by software defaults and by casual use of the word "outlier". This section makes the distinctions sharp and lays out the canonical OLS-diagnostic toolkit.

Three concepts, three positions in the (X, Y) plane

Outlier — an observation whose $Y_i$ value lies far from the line implied by the rest of the data (large residual after fitting).
Leverage — an observation whose $\mathbf{x}$ lies far from the bulk of the design (large $h_{ii}$ from the hat matrix in §4.1).
Influence — an observation whose removal materially changes $\hat{\boldsymbol{\beta}}$. Influence is approximately the product of how outlying Y is and how far X is from the bulk: roughly $\text{influence} \approx \text{residual} \times \text{leverage}$.

A point can have leverage without being influential (large $h_{ii}$ but on the same line as everyone else, so it just anchors the fit). It can be a residual outlier without being influential (large $e_i$ but small $h_{ii}$, so the rest of the data carries enough weight to override it). The DANGEROUS case is high leverage AND large residual, where a single point bends the entire fit.
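The three cases can be checked numerically. The sketch below is illustrative only: the data are made up, the helper names (`fit_line`, `leverage_of_last`) are hypothetical, and the slope shift is used as a crude stand-in for influence. It assumes a one-predictor regression, where the hat value has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def leverage_of_last(x):
    """Hat value h_ii of the last point in a one-predictor regression."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return 1 / n + (x[-1] - xbar) ** 2 / sxx


# Made-up base data lying close to y = 2x + 1.
base_x = [1, 2, 3, 4, 5, 6, 7, 8]
base_y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

# One extra (hypothetical) point per scenario.
scenarios = {
    "leverage only": (20.0, 41.0),  # far out in x, on the line
    "outlier only": (4.5, 25.0),    # at the x centroid, far off the line
    "both": (20.0, 20.0),           # far out in x AND far off the line
}

_, base_slope = fit_line(base_x, base_y)
summary = {}
for name, (x_new, y_new) in scenarios.items():
    x, y = base_x + [x_new], base_y + [y_new]
    b0, b1 = fit_line(x, y)
    summary[name] = {
        "leverage": leverage_of_last(x),
        "residual": y_new - (b0 + b1 * x_new),
        "slope_shift": b1 - base_slope,  # change caused by adding the point
    }

for name, row in summary.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

With these numbers, only the "both" point moves the slope appreciably. Its raw residual is also smaller in magnitude than that of the "outlier only" point, because it has dragged the line toward itself, which is one reason raw residuals alone are a poor screen for influential points.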

Residual types

Raw residual $e_i = Y_i - \hat{Y}_i$. Even under a correctly specified OLS model, $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$, so the variance is smaller at high-leverage points. Comparing raw residuals across observations is therefore misleading.
Standardized (internally studentized) residual $r_i = \dfrac{e_i}{\hat{\sigma} \sqrt{1 - h_{ii}}}$. Under a correctly specified model, $r_i \sim N(0, 1)$ approximately. The rescaling makes the variance comparable across all observations.
Externally studentized residual $t_i = \dfrac{e_i}{\hat{\sigma}_{(-i)} \sqrt{1 - h_{ii}}}$, where $\hat{\sigma}_{(-i)}$ is the residual standard error from the regression EXCLUDING observation $i$. Under $H_0$ (the model fits observation $i$ too), $t_i \sim t_{n-p-1}$. Better for outlier hypothesis testing because $\hat{\sigma}_{(-i)}$ is not contaminated by observation $i$.

Rules of thumb: $|r_i| > 2$ is moderately large; $|r_i| > 3$ deserves attention; comparing $|t_i|$ with a $t_{n-p-1}$ critical value (about 2 at the 5% level for moderate $n$) gives a rough outlier flag that is not adjusted for multiplicity.
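A minimal sketch of the three residual types for a one-predictor fit, on made-up data in which the observation at index 5 has been pushed about five units off the line. The function names are hypothetical, and $\hat{\sigma}_{(-i)}$ is obtained by literally refitting without observation $i$, which is slow but mirrors the definition.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def sse(x, y, b0, b1):
    """Sum of squared residuals of the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def residual_table(x, y):
    """Return a list of (e_i, h_ii, r_i, t_i), one tuple per observation."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    sigma = math.sqrt(sse(x, y, b0, b1) / (n - p))
    rows = []
    for i in range(n):
        e = y[i] - (b0 + b1 * x[i])               # raw residual
        h = 1 / n + (x[i] - xbar) ** 2 / sxx      # leverage
        r = e / (sigma * math.sqrt(1 - h))        # standardized
        # Refit without observation i to get the leave-one-out sigma.
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        c0, c1 = fit_line(x_rest, y_rest)
        sigma_loo = math.sqrt(sse(x_rest, y_rest, c0, c1) / (n - 1 - p))
        t = e / (sigma_loo * math.sqrt(1 - h))    # externally studentized
        rows.append((e, h, r, t))
    return rows


# Made-up data near y = 2x + 1; index 5 is the planted outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 18.0, 15.2, 16.8, 19.1, 20.9]
rows = residual_table(x, y)

for i, (e, h, r, t) in enumerate(rows):
    print(i, round(e, 3), round(h, 3), round(r, 3), round(t, 3))
```

In a sample this small the standardized residual cannot exceed $\sqrt{n - p}$ in magnitude, so the planted outlier looks only moderately large on the $r_i$ scale, while its $t_i$ is far larger because $\hat{\sigma}_{(-i)}$ is computed without it. The explicit refit can be avoided with the identity $t_i = r_i \sqrt{(n - p - 1)/(n - p - r_i^2)}$.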

Influence diagnostics

The three standard influence measures answer slightly different questions.

Cook's distance (Cook 1977):

D_i = \dfrac{e_i^2 \, h_{ii}}{p \, \hat{\sigma}^2 (1 - h_{ii})^2} = \dfrac{r_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}}.

A single number summarising how much $\hat{\boldsymbol{\beta}}$ would change if observation i were removed. Standard threshold: $D_i > 4/n$ warrants investigation; $D_i > 1$ is a strong red flag.
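To see that the closed form really measures the effect of deletion, the sketch below computes $D_i$ twice on made-up data: once from the formula above, and once by refitting without observation $i$ and summing the squared shifts in all $n$ fitted values, divided by $p \hat{\sigma}^2$. The function names are hypothetical and the code assumes a one-predictor model.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def cooks_distance(x, y):
    """Cook's D per observation: (closed form, explicit leave-one-out refit)."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / (n - p)
    closed, refit = [], []
    for i in range(n):
        h = 1 / n + (x[i] - xbar) ** 2 / sxx
        closed.append(resid[i] ** 2 * h / (p * sigma2 * (1 - h) ** 2))
        # Drop observation i, refit, and measure how far every fitted
        # value moves (evaluated at all n original x values).
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum(((b0 + b1 * xj) - (c0 + c1 * xj)) ** 2 for xj in x)
        refit.append(shift / (p * sigma2))
    return closed, refit


# Made-up data near y = 2x + 1, plus one high-leverage point far below it.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 20.0]
closed, refit = cooks_distance(x, y)

flagged = [i for i, d in enumerate(closed) if d > 4 / len(x)]
print([round(d, 3) for d in closed])
print("flagged by D > 4/n:", flagged)
```

The two columns agree to rounding error, so no refit is needed in practice. The last observation is flagged by the $4/n$ rule and also exceeds the $D_i > 1$ red-flag level.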

DFFITS:

\mathrm{DFFITS}_i = t_i \sqrt{\dfrac{h_{ii}}{1 - h_{ii}}}.

The change in $\hat{Y}_i$ when observation i is removed, standardized by the SE of $\hat{Y}_i$ . Threshold: $|\mathrm{DFFITS}_i| > 2\sqrt{p/n}$ .
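A sketch of DFFITS under the same assumptions (one predictor, made-up data, hypothetical function names). It evaluates the formula above and, as a cross-check, the literal definition: the shift in observation $i$'s own fitted value when $i$ is dropped, divided by $\hat{\sigma}_{(-i)} \sqrt{h_{ii}}$.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def dffits(x, y):
    """DFFITS per observation: (via t_i and h_ii, via explicit refit)."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    by_formula, by_refit = [], []
    for i in range(n):
        h = 1 / n + (x[i] - xbar) ** 2 / sxx
        e = y[i] - (b0 + b1 * x[i])
        # Leave-one-out fit and its residual standard error.
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        c0, c1 = fit_line(x_rest, y_rest)
        sse_rest = sum((yj - (c0 + c1 * xj)) ** 2
                       for xj, yj in zip(x_rest, y_rest))
        sigma_loo = math.sqrt(sse_rest / (n - 1 - p))
        t = e / (sigma_loo * math.sqrt(1 - h))  # externally studentized
        by_formula.append(t * math.sqrt(h / (1 - h)))
        # Definition: change in the fitted value at x_i, in SE units.
        shift = (b0 + b1 * x[i]) - (c0 + c1 * x[i])
        by_refit.append(shift / (sigma_loo * math.sqrt(h)))
    return by_formula, by_refit


# Made-up data near y = 2x + 1, plus one high-leverage point far below it.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 20.0]
by_formula, by_refit = dffits(x, y)

cutoff = 2 * math.sqrt(2 / len(x))  # 2 * sqrt(p / n)
flagged = [i for i, d in enumerate(by_formula) if abs(d) > cutoff]
print([round(d, 3) for d in by_formula])
print("flagged:", flagged)
```

Unlike Cook's distance, DFFITS keeps its sign: a negative value here means that including the observation pulls its own fitted value downward.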

$\mathrm{DFBETAS}_{i,j}$, for each coefficient $j$:

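\mathrm{DFBETAS}_{i,j} = \dfrac{\hat{\beta}_j - \hat{\beta}_{j,(-i)}}{\hat{\sigma}_{(-i)} \sqrt{[(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}}.

The denominator is the standard error of $\hat{\beta}_j$ with $\hat{\sigma}$ replaced by the leave-one-out estimate $\hat{\sigma}_{(-i)}$.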

Per-coefficient influence: which specific coefficients does observation i pull, and by how much (in SE units)? Threshold: $|\mathrm{DFBETAS}_{i,j}| > 2/\sqrt{n}$ .
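A sketch of DFBETAS by brute-force refitting, again for a one-predictor model on made-up data with hypothetical function names. It takes the standard error in the denominator to be the Belsley, Kuh and Welsch scaling $\hat{\sigma}_{(-i)} \sqrt{[(\mathbf{X}^\top \mathbf{X})^{-1}]_{jj}}$, computed from the full design; for the design $[1, x]$ those diagonal entries are $1/n + \bar{x}^2 / S_{xx}$ (intercept) and $1 / S_{xx}$ (slope).

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def dfbetas(x, y):
    """Return one (DFBETAS_intercept, DFBETAS_slope) pair per observation."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Diagonal of (X^T X)^{-1} for the full design [1, x].
    c_intercept = 1 / n + xbar ** 2 / sxx
    c_slope = 1 / sxx
    b0, b1 = fit_line(x, y)
    out = []
    for i in range(n):
        # Leave-one-out coefficients and residual standard error.
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a0, a1 = fit_line(x_rest, y_rest)
        sse_rest = sum((yj - (a0 + a1 * xj)) ** 2
                       for xj, yj in zip(x_rest, y_rest))
        sigma_loo = math.sqrt(sse_rest / (n - 1 - p))
        out.append((
            (b0 - a0) / (sigma_loo * math.sqrt(c_intercept)),
            (b1 - a1) / (sigma_loo * math.sqrt(c_slope)),
        ))
    return out


# Made-up data near y = 2x + 1, plus one high-leverage point far below it.
x = [1, 2, 3, 4, 5, 6, 7, 8, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 20.0]
table = dfbetas(x, y)

cutoff = 2 / math.sqrt(len(x))  # 2 / sqrt(n)
for i, (d_int, d_slope) in enumerate(table):
    flag = "*" if max(abs(d_int), abs(d_slope)) > cutoff else " "
    print(i, flag, round(d_int, 3), round(d_slope, 3))
```

For the last observation the two entries have opposite signs: including it lowers the slope and raises the intercept, which is the per-coefficient detail that a single summary such as Cook's distance hides.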

The four-panel diagnostic display

Calling plot() on a fitted lm object in R produces the canonical 4-panel display, which most modern regression libraries also reproduce:

Residuals vs fitted — checks linearity and homoscedasticity. Patterns indicate model misspecification; fanning indicates heteroscedasticity.
Q–Q plot of standardized residuals — checks Normality. Points hugging the reference line are consistent with Normal errors; a one-sided bow (both ends bending the same way) suggests skewness; an S-shape, with the two ends departing from the line in opposite directions, suggests tails heavier or lighter than Normal.
Scale–location ( $\sqrt{|r_i|}$ vs fitted) — a sharper view of heteroscedasticity than panel 1. A trend in the smoothed curve indicates that the residual variance changes with the fitted value.
Leverage vs standardized residual with Cook's distance contours overlaid — the influence diagnostic. Points in the upper-right or lower-right corner that fall beyond the $D_i = 0.5$ contour are the influential cases.
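The contours in panel 4 follow directly from the Cook's distance formula. Writing $h$ for leverage and $r$ for the standardized residual, and solving $D = \frac{r^2}{p} \cdot \frac{h}{1-h}$ for $r$ at a fixed level $D$, gives the curve along which Cook's distance is constant. The numerical values below are an illustration for $p = 2$ only.

```latex
D = \frac{r^2}{p} \cdot \frac{h}{1 - h}
\quad\Longrightarrow\quad
r = \pm \sqrt{\frac{D \, p \, (1 - h)}{h}}

% Illustration with p = 2 and the D = 0.5 contour:
%   h = 0.2:  r = \pm \sqrt{0.5 \cdot 2 \cdot 0.8 / 0.2} = \pm 2
%   h = 0.5:  r = \pm \sqrt{0.5 \cdot 2 \cdot 0.5 / 0.5} = \pm 1
```

The contour closes in on the horizontal axis as $h$ grows, so at high leverage even a modest standardized residual crosses it, while at low leverage a point needs a very large residual to do so.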

Try it

In the influence-quadrant explorer, drag a point ALONG the regression line at high leverage. Watch Cook's D stay tiny even as leverage climbs. Then pull the same point VERTICALLY off the line: Cook's D explodes. Confirm: influence requires BOTH leverage and residual.
Drag a point with low leverage (near the X centroid) far up the Y axis. Standardized residual grows; leverage stays small; Cook's D stays modest. Confirm: a residual outlier is not the same thing as influential.
In the diagnostic-plots widget, switch to the curvature scenario. Read each of the 4 panels in turn. Identify which panel reveals curvature (panel 1) vs heteroscedasticity (panel 3) vs Normality failures (panel 2) vs leverage (panel 4).
Switch to the heteroscedastic scenario. Note that the Q–Q plot can look acceptable even when scale–location is clearly fanning. Why? (Q–Q measures Normality of the MARGINAL distribution of standardized residuals; it does not check whether the conditional variance is constant.)
Switch to the high-influence-outlier scenario. Compare standardized vs externally studentized residual for the influential point. Which is larger in magnitude, and why? (The externally studentized version is larger because $\hat{\sigma}_{(-i)}$ is smaller without the outlier inflating it.)

If a point has $h_{ii} = 0.4$ but lies exactly on the OLS line through the rest of the data, what is its Cook's D? What does this say about removing high-leverage points by default?

What you now know

Outlier, leverage, and influence are three distinct properties of an observation: a point can be an outlier or have high leverage without being influential, and influence needs both. The diagnostic 4-panel display reveals each assumption failure with a distinct visual signature. Cook's distance compresses the combined leverage×residual concern into one number, and the standardized vs externally studentized residual distinction matters for inference about whether a point is genuinely outlying versus inflating its own residual variance estimate. §4.4 takes the diagnosis seriously: when residuals show heteroscedasticity, refit by GLS or with sandwich SEs. §4.5 generalises further: if you have outliers you don't want to manually exclude, switch to a robust regression estimator with a bounded influence function.

References

Cook, R.D. (1977). "Detection of influential observations in linear regression." Technometrics 19(1), 15–18. (Introduces Cook's distance.)
Cook, R.D., Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman & Hall. (Book-length development of the influence framework.)
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. (Introduces DFFITS and DFBETAS; the canonical source for influence measures and thresholds.)
Hoaglin, D.C., Welsch, R.E. (1978). "The hat matrix in regression and ANOVA." The American Statistician 32(1), 17–22. (The accessible introduction to leverage and the hat matrix.)
Fox, J. (2016). Applied Regression Analysis and Generalized Linear Models, 3rd ed. Thousand Oaks: Sage. (Modern applied treatment of regression diagnostics with R worked examples.)