Robust regression
Learning objectives
- Diagnose when outliers are contaminating an OLS fit (vs when heteroscedasticity is the issue)
- Apply Huber M-regression: bounded influence on residuals, ~95% Gaussian efficiency at k=1.345·σ
- Apply MM-estimators: 50% breakdown point + high Gaussian efficiency
- Choose between robust regression vs OLS + sandwich SEs vs outlier removal
- Recognise that robust regression is NOT 'OLS but ignore outliers' — it has a coherent statistical theory
§1.8 introduced robust estimators and M-estimators for the location problem. §4.5 extends them to regression: when OLS coefficients are pulled by outliers (high residual + high leverage from §4.3), robust regression uses a bounded influence function to limit the damage. It is the natural follow-up to §4.4: where §4.4 responded to NON-CONSTANT variance, §4.5 responds to OUTLIERS.
The problem with OLS under contamination
OLS minimises $\sum_i (y_i - x_i^\top\beta)^2$. The quadratic loss makes residuals in the tails count far more than residuals near the centre: a single outlier with residual 4σ contributes 16× as much to the sum as a typical residual at 1σ. The fit is pulled toward the outlier to reduce its squared cost, sometimes dramatically.
Specifically: OLS has BREAKDOWN POINT 0 (1/n in a finite sample). A single observation, moved far enough, drags the fit arbitrarily far. This is why outliers + high leverage = catastrophic damage.
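A minimal sketch of the breakdown argument, using made-up data that lie exactly on y = 1 + 2x (the data, the helper name `ols_line` and the shift sizes are illustrative assumptions, not part of the lesson's dataset). One response at the largest x is moved further and further away, and the OLS slope follows it without limit.

```python
def ols_line(xs, ys):
    """Closed-form OLS for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical clean data: ten points exactly on y = 1 + 2x.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]

# Move ONE response (at the largest x, i.e. the highest leverage) further away.
slopes = []
for shift in (0.0, 10.0, 100.0, 1000.0):
    ys_bad = ys[:-1] + [ys[-1] + shift]
    slopes.append(ols_line(xs, ys_bad)[1])
    print(f"shift = {shift:7.1f}   OLS slope = {slopes[-1]:8.3f}")

# Squared-loss arithmetic from the text: a 4-sigma residual vs a 1-sigma residual.
cost_ratio = (4.0 ** 2) / (1.0 ** 2)
print("cost ratio:", cost_ratio)
```

Because the OLS slope is a linear function of the responses, the slope error grows in direct proportion to the shift; nothing in the estimator caps it.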
The robust regression idea
Replace the quadratic loss with one that GROWS LESS THAN QUADRATICALLY for large residuals. The estimator solves
$$\hat\beta = \arg\min_\beta \sum_i \rho(y_i - x_i^\top\beta)$$
where $\rho$ is the loss function. The influence function $\psi = \rho'$ controls how much each residual pulls the fit:
- OLS: $\rho(r) = r^2/2$, $\psi(r) = r$. Unbounded: a residual's pull on the fit grows without limit.
- L1 / median regression: $\rho(r) = |r|$, $\psi(r) = \operatorname{sign}(r)$. Bounded by ±1. Resistant but only ~64% efficient at Gaussian errors.
- Huber M-regression: $\rho(r) = r^2/2$ for $|r| \le k$, $\rho(r) = k|r| - k^2/2$ for $|r| > k$. Quadratic near 0 (efficient), linear in tails (bounded influence).
- Tukey biweight: $\rho$ saturates entirely beyond $|r| = c$. Influence drops to ZERO for extreme outliers, so they are completely ignored.
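The four losses can be written out directly, which makes the bounded versus unbounded behaviour easy to inspect. This is a sketch on standardised residuals. The function names are illustrative, and the 1/2 factor on the quadratic and the c²/6 scaling of the biweight are common conventions assumed here rather than taken from the text.

```python
import math

def rho_ols(r):
    return 0.5 * r * r

def psi_ols(r):
    return r                                   # unbounded in r

def rho_l1(r):
    return abs(r)

def psi_l1(r):
    return math.copysign(1.0, r) if r != 0 else 0.0   # bounded by +/- 1

def rho_huber(r, k=1.345):
    a = abs(r)
    return 0.5 * r * r if a <= k else k * a - 0.5 * k * k

def psi_huber(r, k=1.345):
    return max(-k, min(k, r))                  # clipped at +/- k

def rho_tukey(r, c=4.685):
    if abs(r) > c:
        return c * c / 6.0                     # loss saturates beyond c
    u = 1.0 - (r / c) ** 2
    return c * c / 6.0 * (1.0 - u ** 3)

def psi_tukey(r, c=4.685):
    if abs(r) > c:
        return 0.0                             # influence redescends to zero
    return r * (1.0 - (r / c) ** 2) ** 2

for r in (0.5, 2.0, 10.0):
    print(f"r = {r:5.1f}   psi: OLS = {psi_ols(r):6.2f}   L1 = {psi_l1(r):5.2f}   "
          f"Huber = {psi_huber(r):5.3f}   Tukey = {psi_tukey(r):5.3f}")
```

At r = 10 the OLS pull is 10, the Huber pull is capped at k, and the Tukey pull is exactly zero.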
Huber M-regression
The workhorse robust regression. Tuning constant $k = 1.345\,\hat\sigma$ (with $\hat\sigma$ a robust scale estimate such as MAD/0.6745) gives ~95% efficiency at exactly-Gaussian errors and good resistance up to ~10-15% contamination. Solved via Iteratively Reweighted Least Squares (IRLS):
- Initial fit (e.g., LS or LAD).
- Compute residuals $r_i = y_i - x_i^\top\hat\beta$ and weights $w_i = \psi(r_i)/r_i$, which for Huber is $\min(1, k/|r_i|)$.
- Re-fit by WLS with these weights.
- Iterate until convergence (typically 3-10 iterations).
R: MASS::rlm. statsmodels: RLM in statsmodels.api.
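The IRLS steps above can be sketched in plain Python for a single predictor. This is an illustration under stated assumptions, not how MASS::rlm or statsmodels' RLM are implemented. The helper names, the simulated data (true line y = 1 + 2x, four responses shifted by +30) and the choice to re-estimate the MAD scale at every iteration are all assumptions made for the example.

```python
import random
import statistics

def wls_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mad_scale(res):
    """Robust scale estimate: MAD / 0.6745."""
    med = statistics.median(res)
    return statistics.median([abs(r - med) for r in res]) / 0.6745

def huber_irls(xs, ys, c=1.345, tol=1e-8, max_iter=100):
    a, b = wls_line(xs, ys, [1.0] * len(xs))            # step 1: initial LS fit
    for _ in range(max_iter):
        res = [y - (a + b * x) for x, y in zip(xs, ys)]  # step 2: residuals
        k = c * mad_scale(res)                           # threshold k = 1.345 * sigma-hat
        if k == 0.0:                                     # degenerate scale: stop
            break
        ws = [min(1.0, k / abs(r)) if r != 0.0 else 1.0 for r in res]
        a_new, b_new = wls_line(xs, ys, ws)              # step 3: WLS refit
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:                                    # step 4: iterate to convergence
            break
    return a, b

# Simulated data (assumption): n = 40, true line y = 1 + 2x, unit Gaussian noise,
# with the last four responses shifted upward by 30.
rng = random.Random(0)
xs = [0.25 * i for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
for i in range(36, 40):
    ys[i] += 30.0

ols_a, ols_b = wls_line(xs, ys, [1.0] * len(xs))
hub_a, hub_b = huber_irls(xs, ys)
print(f"OLS slope   = {ols_b:.3f}")
print(f"Huber slope = {hub_b:.3f}   (true slope 2.0)")
```

Points inside the threshold keep weight 1, so on clean data the procedure reduces to ordinary least squares; only residuals beyond k are down-weighted, in proportion to how far out they sit.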
MM-estimators: high breakdown + high efficiency
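Huber has good Gaussian efficiency but resists only ~10-15% contamination, and because it bounds influence in the residual only, outliers at high-leverage X positions can still break it. Yohai (1987) proposed MM-estimators: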
- First stage: S-estimator gives a 50%-breakdown initial fit (highly resistant but inefficient).
- Second stage: refit with a smooth bounded influence function (Tukey biweight, tuning for 95% Gaussian efficiency).
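Result: 50% breakdown AND ~95% Gaussian efficiency. The modern default for robust regression. R: robustbase::lmrob. Python: statsmodels' RLM and scikit-learn's sklearn.linear_model.HuberRegressor are M-estimators rather than MM-estimators; they lack the 50%-breakdown S-estimate first stage.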
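The two-stage idea can be illustrated with a simplified, MM-style sketch for a single predictor. It is NOT Yohai's estimator: the first stage here is an exhaustive least-median-of-squares search over lines through pairs of points, used as a stand-in for the S-estimator, and the scale is the normalised median absolute residual from that stage rather than an M-scale. Function names and the simulated data (30% of responses shifted by +25) are assumptions for the example.

```python
import itertools
import random
import statistics

def wls_line(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def resistant_start(xs, ys):
    """Stage 1 stand-in: the line through two data points that minimises the
    median absolute residual (an LMS-type fit, not a true S-estimator)."""
    best = None
    for i, j in itertools.combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        crit = statistics.median([abs(y - (a + b * x)) for x, y in zip(xs, ys)])
        if best is None or crit < best[0]:
            best = (crit, a, b)
    return best

def tukey_weight(u, c=4.685):
    """Biweight IRLS weight psi(u)/u: zero beyond c."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) < c else 0.0

def mm_style_fit(xs, ys, c=4.685, n_iter=50):
    crit, a, b = resistant_start(xs, ys)
    s = crit / 0.6745                      # scale fixed from the resistant stage
    if s == 0.0:                           # at least half the data fit exactly
        return a, b
    for _ in range(n_iter):                # stage 2: biweight IRLS, scale held fixed
        ws = [tukey_weight((y - (a + b * x)) / s, c) for x, y in zip(xs, ys)]
        a, b = wls_line(xs, ys, ws)
    return a, b

# Simulated data (assumption): n = 40, true line y = 1 + 2x, 12 of 40 responses shifted.
rng = random.Random(1)
xs = [0.25 * i for i in range(40)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
for i in range(28, 40):
    ys[i] += 25.0

ols_a, ols_b = wls_line(xs, ys, [1.0] * len(xs))
mm_a, mm_b = mm_style_fit(xs, ys)
print(f"OLS slope      = {ols_b:.3f}")
print(f"MM-style slope = {mm_b:.3f}   (true slope 2.0)")
```

The design point carried over from the real method is the ordering: a resistant but inefficient fit decides which points are outliers and fixes the scale, and only then does the efficient redescending refit run. For real analyses, robustbase::lmrob implements the actual S and MM stages.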
Choosing the right tool
- OLS + sandwich SEs (§4.4): heteroscedastic data, no outlier concerns.
- Huber M-regression: light-to-moderate contamination (5-15%), Normal-tailed otherwise.
- MM-regression: severe contamination (up to 50%), or unknown contamination level.
- OLS without sandwich, after outlier removal: BIAS-INDUCING and brittle. Avoid.
Honest caveats
- Robust regression assumes the bulk of the data follows a single model with some outliers. If the data is a MIXTURE of two regimes, robust regression fits the larger regime and ignores the other — which may not be what you want.
- Tuning constants ARE tunable. Defaults (k=1.345 Huber, c=4.685 Tukey) achieve 95% Gaussian efficiency. Lower values are more resistant but less efficient.
- Inference (SEs, CIs) for robust regression is less standardised than OLS. R's rlm uses asymptotic SEs based on a scaled $(X'X)^{-1}$; lmrob uses sandwich-style SEs. Bootstrap CIs are often the safer choice.
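The 95% figure quoted for the Huber default can be checked from the standard asymptotic formula for an M-estimator of location, efficiency = (E[ψ'(Z)])² / E[ψ(Z)²] with Z standard Gaussian. The sketch below assumes unit-variance Gaussian errors and a known scale; the function name is illustrative.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def huber_efficiency(k):
    """Asymptotic Gaussian efficiency of the Huber M-estimator with threshold k."""
    p_in = math.erf(k / math.sqrt(2.0))          # P(|Z| <= k) = E[psi'(Z)]
    # E[psi(Z)^2] = E[Z^2; |Z| <= k] + k^2 * P(|Z| > k)
    e_psi2 = (p_in - 2.0 * k * normal_pdf(k)) + k * k * (1.0 - p_in)
    return p_in ** 2 / e_psi2

for k in (1.0, 1.345, 2.0):
    print(f"k = {k:5.3f}   efficiency = {huber_efficiency(k):.4f}")
```

Lowering k trades efficiency for resistance, and as k shrinks toward 0 the value approaches 2/π ≈ 0.64, the Gaussian efficiency of the L1 (median-type) estimator.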
Try it
- Start with clean Gaussian data. Confirm OLS and Huber give nearly identical estimates (within ~0.01) and nearly identical SEs (within ~5%).
- Add a single high-leverage outlier (drag a point to a far X position and pull it vertically off the line). Watch the OLS line rotate sharply; Huber barely moves.
- Add 5 outliers to a sample of n = 40. Now OLS is severely biased; Huber is still close to truth; MM is essentially unbiased.
- Push contamination to 30% (12 of 40 points are outliers). Huber starts to break; MM is still fine.
- Push to 55% contamination. Even MM breaks — by definition, the breakdown point is 50% and the "majority" the estimator follows has flipped.
A colleague says "I removed 3 outliers from my dataset by eye, then ran OLS. The conclusion held; I'll report that." What are the TWO methodological objections — and what would you advise them to do instead, using the §4.5 toolkit?
What you now know
Robust regression replaces quadratic loss with a bounded-influence loss, sacrificing some Gaussian efficiency in exchange for resistance to outliers. Huber M-regression is the moderate-contamination default; MM-estimators give 50% breakdown with 95% Gaussian efficiency. Robust regression is the principled alternative to manual outlier removal — coherent theory, reproducible, no p-hacking risk. §4.6 turns to a related question: what if the underlying RELATIONSHIP is nonlinear or has interactions OLS doesn't capture?
References
- Huber, P.J. (1964). "Robust estimation of a location parameter." Annals of Math. Stat. 35(1), 73–101. (The foundational M-estimator paper.)
- Huber, P.J. (1981). Robust Statistics. Wiley. (First edition; the canonical book-length treatment.)
- Yohai, V.J. (1987). "High breakdown-point and high efficiency robust estimates for regression." Annals of Statistics 15(2), 642–656. (The MM-estimator paper.)
- Rousseeuw, P.J., Leroy, A.M. (1987). Robust Regression and Outlier Detection. Wiley. (The applied-regression treatment, including LMS, LTS, S-estimators.)
- Maronna, R.A., Martin, R.D., Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley. (The modern comprehensive reference.)