Robust and M-estimators

Part 1 — Estimation

Learning objectives

  • State the contamination model — the gross-error / ε-contamination mixture F=(1ε)F0+εHF = (1 - \varepsilon) F_0 + \varepsilon H — and explain why the sample mean has breakdown point 0: a single observation with arbitrary value can drag the estimate to ±∞
  • Define the breakdown point of an estimator and rank the classical and robust estimators: mean = 0, sample max = 0, OLS = 0, α-trimmed mean = α, median = 50%, MAD = 50%, LMS = 50%, MM-estimator = 50%
  • Define the influence function IF(x;T,F)=limε0[T((1ε)F+εδx)T(F)]/ε\operatorname{IF}(x; T, F) = \lim_{\varepsilon \to 0} [T((1-\varepsilon) F + \varepsilon \delta_x) - T(F)] / \varepsilon and read 'bounded IF ⇔ robust estimator' as the working criterion
  • Define the M-estimator family by θ^=argminθiρ((Xiθ)/σ)\hat\theta = \arg\min_\theta \sum_i \rho((X_i - \theta) / \sigma) and recognise that ψ(u) = ρ'(u) IS (proportional to) the influence function at standardised residual u
  • Write down the Huber loss ρH(u)=u2/2\rho_H(u) = u^2/2 for uk|u| \le k and ρH(u)=kuk2/2\rho_H(u) = k|u| - k^2/2 otherwise, derive ψH(u)=max(k,min(k,u))\psi_H(u) = \max(-k, \min(k, u)) (a clipped identity), and identify the canonical tuning k = 1.345 for 95% Gaussian efficiency
  • Recognise the absolute-loss case ρ(u) = |u| as the median and the squared-loss case ρ(u) = u²/2 as the mean — both M-estimators with the corresponding influence functions, sitting at the endpoints of the robustness / efficiency trade-off
  • Recognise the Tukey biweight ρT(u)=(c2/6)[1(1(u/c)2)3]\rho_T(u) = (c^2/6)\,[1 - (1 - (u/c)^2)^3] for uc|u| \le c as a redescending loss (ψ(u) → 0 as |u| → c) with canonical c = 4.685 for 95% Gaussian efficiency; understand the multiple-root caveat
  • State the IRLS (Iteratively Reweighted Least Squares) algorithm: at each iteration compute residuals ri=yixiβ^r_i = y_i - \mathbf{x}_i^\top \hat{\boldsymbol\beta}, scale σ^=MAD(r)1.4826\hat\sigma = \operatorname{MAD}(r) \cdot 1.4826, weights wi=ψ(ri/σ^)/(ri/σ^)w_i = \psi(r_i / \hat\sigma) / (r_i / \hat\sigma), and re-solve the weighted normal equations
  • Connect to OLS: OLS solves the M-estimator score equation with ψ(u)=u\psi(u) = u (squared loss). Because ψ is unbounded, a single high-leverage outlier rotates the OLS hyperplane arbitrarily — OLS breakdown is 0 in the regression context
  • Recognise the high-breakdown regression estimators — LMS (Rousseeuw 1984, 50% breakdown but only n^(-1/3) consistent), LTS (Rousseeuw 1984, 50% breakdown, n^(-1/2)), S-estimators (Rousseeuw & Yohai 1984), and MM-estimators (Yohai 1987, 50% breakdown AND high Gaussian efficiency) — as the modern defaults when contamination cannot be assumed small
  • Articulate three honest caveats: (i) robust ≠ outlier removal (don't pre-filter and then run OLS on the residual; the resulting estimator is no longer robust), (ii) the 'right' tuning constant depends on the contamination model you anticipate, (iii) above 50% contamination NO equivariant estimator can recover the clean signal — robust statistics have a fundamental ceiling

§1.7 ended with an honest disclaimer: "the bootstrap is plug-in, not magic — it does not fix a fundamentally broken estimator." §1.8 is the response. We will take the broken estimator and replace it. The classical sample mean and the classical OLS regression both have breakdown point 00 — one contaminated observation suffices to drag the answer to ±∞. The bootstrap on a broken estimator reports the broken estimator's sampling distribution honestly; what we want is a NEW estimator with positive breakdown and bounded influence. M-estimators, introduced by Huber (1964), are the tunable family that delivers exactly that.

The §1.8 arc is: the contamination model (ε-contamination, gross-error); the breakdown point as a quantitative ranking; the influence function as the local "leverage per perturbation" derivative; the M-estimator family as the unifying framework that includes the mean (squared loss), the median (absolute loss), Huber (clipped), Tukey biweight (redescending), and Hampel (three-part redescending); the IRLS algorithm that solves them; the move from location to regression (OLS as the unbounded-influence baseline, Huber M-regression as the immediate bounded fix, LMS / S / MM as high-breakdown alternatives); and three honest caveats that delimit what robust statistics can and cannot do. The two widgets make ρ, ψ, and the regression fits tactile.

The contamination model: where outliers come from

Robust statistics starts by formalising "what could be wrong with the data". The standard framework is Huber's (1964) gross-error model: the observed sample is drawn from a mixture

F  =  (1ε)F0  +  εH,F \;=\; (1 - \varepsilon)\, F_0 \;+\; \varepsilon\, H,

where F0F_0 is the assumed (typically Gaussian) target distribution, HH is an arbitrary contaminating distribution, and ε[0,0.5]\varepsilon \in [0, 0.5] is the contamination fraction. We do not know which observations come from F0F_0 and which from HH. We only know that a fraction ε\varepsilon are corrupted, and we want estimators that perform well — that is, return answers close to the target functional of F0F_0 — uniformly over the choice of HH.

This framework formalises a sentiment that is otherwise vague. "The data has outliers" becomes "the data comes from a contamination mixture with ε > 0". "How robust is this estimator?" becomes "how does the estimator behave as we vary ε and H within stated bounds?". The framework is parsimonious — one number ε\varepsilon summarises the contamination level — and it captures most of what working scientists mean by outliers without committing to a specific outlier-generating mechanism.

The breakdown point: how much contamination is too much?

The breakdown point of an estimator TT at a sample of size nn is the smallest fraction m/nm/n of observations that can be changed (to arbitrary values) such that the resulting T(contaminated sample)T(\text{contaminated sample}) can be made arbitrarily different from T(original sample)T(\text{original sample}). Lower is worse; the worst possible is 0 (a single observation suffices); the best possible is (n+1)/2/n0.5\lfloor (n+1)/2 \rfloor / n \to 0.5 (an equivariant estimator cannot survive majority contamination).

The sample mean has breakdown point 0. The proof is one line: a single contamination at value voutv_{\text{out}} moves the mean by (voutXˉ)/n(v_{\text{out}} - \bar{X})/n; send voutv_{\text{out}} \to \infty and Xˉ\bar{X} follows. The same holds for the sample standard deviation (quadratically), the sample maximum (trivially), and, crucially, OLS regression: a single high-leverage outlier can rotate the OLS hyperplane to any orientation.

The classical robust counterparts have positive breakdown:

  • Median: breakdown 50% — the maximum possible.
  • MAD (median absolute deviation from the median): breakdown 50%.
  • α-trimmed mean: breakdown α. The 10% trimmed mean has breakdown 10%.
  • α-Winsorised mean: breakdown α. Same as trimmed but observations are clipped instead of dropped.

For regression, the modern high-breakdown estimators — LMS (Rousseeuw 1984), S-estimators (Rousseeuw & Yohai 1984), MM-estimators (Yohai 1987) — all reach the 50% ceiling. M-regression with a bounded ψ has positive but lower breakdown in the regression context, because high-leverage points can still defeat a bounded-residual influence function. We return to this point below.

The influence function: local leverage per perturbation

The breakdown point is a coarse summary. The influence function (Hampel 1974) is the local complement: it measures how much a single observation at value xx moves the estimator, per unit infinitesimal contamination at that point. Formally, for a statistical functional T(F)T(F) at the target distribution FF,

IF(x;T,F)  =  limε0+T((1ε)F+εδx)T(F)ε,\operatorname{IF}(x;\, T,\, F) \;=\; \lim_{\varepsilon \to 0^+} \frac{T\bigl((1 - \varepsilon)\, F + \varepsilon\, \delta_x\bigr) - T(F)}{\varepsilon},

where δx\delta_x is a point mass at xx. The influence function is the Gâteaux derivative of TT in the direction of point-mass contamination. Read as a plot — IF on the yy-axis, xx on the xx-axis — it answers the question: "if I add one more observation at value xx, how much does the estimator move?".

For the sample mean, IF(x;mean,F)=xμF\operatorname{IF}(x;, \text{mean},, F) = x - \mu_F. Linear, unbounded. Large xx produces large IF — exactly why the mean has breakdown 0.

For the sample median, IF(x;median,F)=sign(xmF)/(2fF(mF))\operatorname{IF}(x;, \text{median},, F) = \operatorname{sign}(x - m_F) / (2 f_F(m_F)) where fFf_F is the density at the median. Bounded — any single observation contributes at most 1/(2fF(mF))1/(2 f_F(m_F)) in absolute value. Exactly why the median has breakdown 50%.

The working criterion that emerges: an estimator is robust if and only if its influence function is bounded. Bounded IF ⇔ no single observation can have arbitrary leverage ⇔ the estimator survives some positive fraction of contamination. The unbounded-IF estimators (mean, OLS, sample variance) are all breakdown-0; the bounded-IF estimators (median, MAD, Huber M-estimator, Tukey biweight, LMS) all have positive breakdown.

M-estimators: the unifying family

Huber (1964) defined the M-estimator of location as the minimiser

θ^  =  argminθi=1nρ ⁣(Xiθσ)\hat\theta \;=\; \arg\min_{\theta} \sum_{i=1}^{n} \rho\!\left(\frac{X_i - \theta}{\sigma}\right)

for a chosen loss function ρ:RR0\rho: \mathbb{R} \to \mathbb{R}_{\ge 0} and a chosen scale σ>0\sigma > 0 (typically estimated from the same data by a robust method like MAD1.4826\operatorname{MAD} \cdot 1.4826). Setting /θ=0\partial / \partial \theta = 0 gives the score equation

i=1nψ ⁣(Xiθ^σ)  =  0,ψ(u)  =  ρ(u).\sum_{i=1}^{n} \psi\!\left(\frac{X_i - \hat\theta}{\sigma}\right) \;=\; 0, \qquad \psi(u) \;=\; \rho'(u).

The function ψ\psi is, up to scaling, the influence function: a point at standardised residual uu contributes ψ(u)\psi(u) to the score equation. Bounded ψ ⇔ bounded influence ⇔ robust estimator. That is the whole framework in one line.

Three choices of ρ\rho recover the three canonical estimators:

  • Squared loss ρ(u)=u2/2\rho(u) = u^2/2. Then ψ(u)=u\psi(u) = u — linear, unbounded. The minimiser is the sample mean. Gaussian efficiency 100%, breakdown 0.
  • Absolute loss ρ(u)=u\rho(u) = |u|. Then ψ(u)=sign(u)\psi(u) = \operatorname{sign}(u) — bounded, discontinuous at 0. The minimiser is the sample median. Gaussian efficiency 2/π63.7%2/\pi \approx 63.7%, breakdown 50%.
  • Huber loss ρH(u)=u2/2\rho_H(u) = u^2/2 for uk|u| \le k and ρH(u)=kuk2/2\rho_H(u) = k|u| - k^2/2 for u>k|u| > k. Then ψH(u)=u\psi_H(u) = u on uk|u| \le k and ψH(u)=ksign(u)\psi_H(u) = k,\operatorname{sign}(u) outside. Mean-like on the bulk, median-like on the tails. The canonical default k=1.345k = 1.345 delivers 95% asymptotic Gaussian efficiency while keeping ψ bounded at ±k.

Huber's loss is the minimax-optimal choice in the gross-error model under the constraint that ψ\psi be bounded (Huber 1964 §3). It is the textbook starting default. A second standard choice is the redescending family:

  • Tukey biweight ρT(u)=(c2/6)[1(1(u/c)2)3]\rho_T(u) = (c^2/6),[1 - (1 - (u/c)^2)^3] for uc|u| \le c and ρT(u)=c2/6\rho_T(u) = c^2/6 for u>c|u| > c. The influence function ψT(u)=u(1(u/c)2)2\psi_T(u) = u (1 - (u/c)^2)^2 on uc|u| \le c is REDESCENDING — it returns to zero as uc|u| \to c. Outliers beyond uc|u| \ge c are IGNORED entirely, not merely clipped. Canonical c=4.685c = 4.685 for 95% Gaussian efficiency.
  • Hampel three-part redescending (Hampel 1974, with knots (a,b,r)(a, b, r)): ψ(u)=u\psi(u) = u on ua|u| \le a, ψ(u)=asign(u)\psi(u) = a,\operatorname{sign}(u) on a<uba < |u| \le b, linearly returning to zero on b<urb < |u| \le r, and zero beyond. Three knots, finer control over the rejection region.

The redescending estimators reject the most extreme outliers entirely; their score equations have multiple roots and IRLS must be initialised from a robust starting point (typically the median or Huber). The first widget makes this visible.

M Estimator LossInteractive figure — enable JavaScript to interact.

Things to verify:

  • Squared: ρ is a parabola, ψ is the identity. Unbounded — every observation contributes proportional to its distance. This is the mean. Gaussian efficiency 100%, breakdown 0.
  • Absolute: ρ is the absolute-value V, ψ is the sign step. Bounded but discontinuous at 0. This is the median. Gaussian efficiency 2/π ≈ 64%, breakdown 50%.
  • Huber at k = 1.345: ρ is the smooth quadratic-then-linear blend; ψ is the clipped identity, flat at ±k for |u| > k. Bounded and continuous. Reports ~95% Gaussian efficiency.
  • Huber at k = 0.5: ρ is almost linear; ψ clips very early. The estimator behaves nearly like the median; efficiency drops toward 64%.
  • Tukey at c = 4.685: ρ saturates at c2/6c^2/6; ψ rises, peaks near u=c/5u = c/\sqrt{5}, and returns to zero by u=c|u| = c. Reports ~95% efficiency; observations beyond ±c have ZERO influence.
  • Hampel preset: ψ is identity → constant plateau → linear ramp-down → zero. Three knots give three explicit zones (full influence, capped influence, rejection ramp, full rejection). The shape is piecewise-linear so the numerical efficiency reads slightly lower than the Tukey smooth, but the qualitative behaviour is comparable.

How M-estimators are computed: IRLS

The score equation iψ((Xiθ^)/σ)=0\sum_i \psi((X_i - \hat\theta)/\sigma) = 0 does not have a closed-form solution for any of the non-trivial losses. The standard algorithm is Iteratively Reweighted Least Squares (IRLS), which rewrites the score equation as a weighted sum: define wi=ψ(ui)/uiw_i = \psi(u_i) / u_i (with the limit wi=ψ(0)=1w_i = \psi'(0) = 1 at ui=0u_i = 0), and observe that

i=1nwi(Xiθ^)  =  0    θ^  =  iwiXiiwi.\sum_{i=1}^{n} w_i \, (X_i - \hat\theta) \;=\; 0 \;\Longleftrightarrow\; \hat\theta \;=\; \frac{\sum_i w_i X_i}{\sum_i w_i}.

That is a weighted mean. The weights depend on θ^\hat\theta through ui=(Xiθ^)/σu_i = (X_i - \hat\theta) / \sigma, but if we hold the weights fixed at the current iterate we get a one-step update. The algorithm: initialise θ^(0)\hat\theta^{(0)} (typically the median); compute residuals ri=Xiθ^(0)r_i = X_i - \hat\theta^{(0)}; estimate scale σ^=MAD(r)1.4826\hat\sigma = \operatorname{MAD}(r) \cdot 1.4826; compute weights wi=ψ(ri/σ^)/(ri/σ^)w_i = \psi(r_i / \hat\sigma) / (r_i / \hat\sigma); update θ^(1)=(iwiXi)/(iwi)\hat\theta^{(1)} = (\sum_i w_i X_i) / (\sum_i w_i); iterate. For Huber the algorithm converges to the global minimum in a few iterations from any sensible start. For redescending losses (Tukey, Hampel) the starting point matters because the score equation has multiple roots; warm-starting from Huber is standard practice.

The weights wiw_i are the working summary of the algorithm: clean observations get wi1w_i \approx 1 (full contribution); marginal observations get wi(0,1)w_i \in (0, 1) (downweighted); extreme outliers under a redescending loss get wi=0w_i = 0 (rejected). Reading the final weights diagnostically tells you which observations the M-estimator treated as outliers — without any pre-filtering.

From location to regression: OLS breakdown and Huber M-regression

The same machinery extends to linear regression. OLS minimises i(yixiβ)2\sum_i (y_i - \mathbf{x}_i^\top \boldsymbol\beta)^2 — squared loss on the residuals. The M-regression estimator minimises

β^  =  argminβi=1nρ ⁣(yixiβσ),\hat{\boldsymbol\beta} \;=\; \arg\min_{\boldsymbol\beta} \sum_{i=1}^{n} \rho\!\left(\frac{y_i - \mathbf{x}_i^\top \boldsymbol\beta}{\sigma}\right),

solved by IRLS with weights wi=ψ(ri/σ^)/(ri/σ^)w_i = \psi(r_i / \hat\sigma) / (r_i / \hat\sigma) and the weighted normal equations (XWX)β^=XWy(\mathbf{X}^\top \mathbf{W} \mathbf{X}), \hat{\boldsymbol\beta} = \mathbf{X}^\top \mathbf{W} \mathbf{y}. Each iteration is one standard weighted least-squares solve. From an OLS warm start, Huber M-regression typically converges in 5-20 iterations.

OLS breakdown in regression is 0. The proof is the regression analogue of the location proof: a single point (x,y)(x^, y^) with arbitrarily large yy^ (or with xx^ pulled far from the bulk and yy^* off the trend line) can rotate the OLS hyperplane to any orientation. The "high-leverage outlier" — a point that is extreme in xx AND off the trend in yy — is the canonical adversary. The second widget makes this concrete: a 30-point clean linear trend, the reader clicks the plot to inject outliers, and the OLS line visibly rotates while Huber M-regression stays close to the clean slope.

Robust Regression Vs OlsInteractive figure — enable JavaScript to interact.

Things to verify:

  • Clean data: OLS and Huber agree closely. Gauss-Markov gives OLS minimum variance among linear unbiased estimators on clean data; Huber pays a small efficiency tax (~5% with k = 1.345) but the slopes are visibly the same.
  • Add one vertical outlier (mode = vertical-only): click in the plot near the middle x range, far above the trend. OLS rotates noticeably; Huber barely moves. The single point is downweighted by Huber's IRLS once its residual exceeds kMADk \cdot \operatorname{MAD}.
  • Add one high-leverage outlier (mode = free 2-D, click far right and far below): OLS rotates dramatically. Huber resists but the leverage shows — bounded-residual ψ does not fully protect against extreme xx-leverage. Toggle on LMS: it stays close to truth because its 50% breakdown applies to ANY contamination, including high leverage.
  • Multiple outliers (5-10 of n = 30, ~20-30% contamination): OLS is wildly wrong. Huber recovers a sensible slope on vertical-only contamination; with high-leverage clusters, Huber bends and LMS pulls ahead. The Tukey biweight (toggle on) goes further still — redescending ψ rejects the cluster entirely.
  • Push contamination past 50%: click many outliers until they outnumber the clean cluster. Even LMS now describes the contamination rather than the clean trend. The 50% breakdown ceiling is fundamental — no equivariant estimator can recover a clean signal that is the minority of the data.

High-breakdown regression: LMS, S, and MM

Huber M-regression has bounded ψ on the residual scale, which gives it positive breakdown — but the breakdown in the regression context is below 50%, because high-leverage points can pull the fit while keeping their residuals small. The high-breakdown regression literature (Rousseeuw 1984, Yohai 1987) developed estimators that reach the 50% ceiling.

  • LMS (Least Median of Squares, Rousseeuw 1984): β^LMS=argminmedianiri2\hat{\boldsymbol\beta}_{\text{LMS}} = \arg\min \operatorname{median}_i, r_i^2. Breakdown 50%; consistency rate only n1/3n^{-1/3}. Used in practice as an initialiser for higher-efficiency methods, not as a final estimator.
  • LTS (Least Trimmed Squares, Rousseeuw 1984): minimise the sum of the hh smallest squared residuals, where h0.5nh \approx 0.5 n. Breakdown 50%; consistency rate n1/2n^{-1/2}. The standard high-breakdown estimator.
  • S-estimators (Rousseeuw & Yohai 1984): minimise a robust scale estimate of the residuals. Breakdown 50%; better than LTS at clean Gaussian data but still well below OLS efficiency.
  • MM-estimators (Yohai 1987): a three-stage procedure — S-estimator for high-breakdown initial scale, then redescending M-regression with that scale held fixed. Breakdown 50% AND Gaussian efficiency tunable up to 95%. The modern default when you need both robustness and efficiency.

Part 4 §4.5 develops the regression-side detail with practical examples. §1.8's job is the location-and-pedagogical scaffolding: see why OLS breaks, see Huber M-regression bound the influence, see LMS push the breakdown to 50%, and recognise MM-estimators as the modern compromise.

Three honest caveats

Robust statistics is genuinely useful. It is not magic, and it has well-defined limits. Three caveats every practitioner should internalise:

1. Robust statistics is NOT outlier removal. Identify-and-remove is a different workflow: spot suspect observations (Tukey's 1.5 × IQR fences, Mahalanobis distance, leverage diagnostics), delete them, and run OLS on the cleaned data. The resulting estimator is NOT robust in the formal sense — it is "OLS conditional on a pre-screening". The pre-screening step has its own (often implicit) breakdown point, and the conditional inference downstream (SEs, p-values, CIs) does not account for the pre-screening uncertainty. Worse, the pre-screening can systematically remove real geological features that look like outliers but are not. The robust-statistics philosophy is the opposite: keep all the data, use an estimator with bounded influence, and let the IRLS weights tell you ex post which observations were treated as outliers.

2. The "right" tuning constant depends on the contamination model you anticipate. Huber's k=1.345k = 1.345 delivers 95% Gaussian efficiency at zero contamination — a clean-data optimality criterion. Under heavy contamination you may want smaller kk (more aggressive clipping, lower clean-data efficiency, higher protection); under suspected light contamination you may want larger kk (closer to OLS). The same applies to Tukey's c=4.685c = 4.685. The canonical defaults are good starting points; they are not universal optima. The first widget lets you wiggle the constants and watch the trade-off.

3. Above 50% contamination, no equivariant estimator can recover the clean signal. This is a hard theorem (Rousseeuw 1984): an estimator that respects the symmetries of the problem (location-equivariant for location, regression-equivariant for regression) cannot have breakdown point above (n+1)/2/n\lfloor (n+1)/2 \rfloor / n. Beyond that, the "clean" data is the minority and no algorithm can tell which subset is which. The second widget reaches this regime — pile outliers until they outnumber the clean cluster and even LMS describes the contamination. The lesson is structural: if you genuinely have >50% contamination, the data does not identify the clean signal and you need EXTERNAL information (prior knowledge, a different experiment, a stratification variable) to recover it. Robust statistics is not a substitute for that.

Where §1.8 fits in Part 1

§1.8 sits between the bootstrap (§1.7) and asymptotic theory (§1.9), and it touches every earlier section:

  • §1.1 introduced bias, variance, MSE. M-estimators trade Gaussian efficiency (variance) for protection against contamination (bias). The 5% efficiency tax of Huber k = 1.345 is an explicit bias-variance trade.
  • §1.2 (MoM) and §1.3 (MLE) produced classical estimators that are typically NOT robust — sample moments and likelihood gradients both have unbounded influence under contamination. §1.8 gives the robust replacements.
  • §1.4's CRLB describes the lower bound on variance for UNBIASED estimators under a CLEAN model. Under contamination the CRLB is irrelevant; the relevant benchmark is the M-estimator's minimax variance under the contamination class.
  • §1.5's bias-variance trade-off appears here in a different costume: classical estimators are efficient at the clean model but biased under contamination; robust estimators are mildly inefficient at the clean model but BOUNDED under contamination. The choice between them depends on how much contamination you anticipate.
  • §1.6's sampling-distribution machinery transfers directly to M-estimators: under regularity the M-estimator is asymptotically Normal with variance Var=σ2E[ψ2]/(E[ψ])2\operatorname{Var}_\infty = \sigma^2 , \mathbb{E}[\psi^2] / (\mathbb{E}[\psi'])^2. The widget reports the resulting asymptotic Gaussian efficiency exactly.
  • §1.7's bootstrap is the natural SE engine for M-estimators. The classical M-estimator SE formula above requires knowledge of ψ\psi and of σ\sigma; the bootstrap dispenses with that and works directly on the data. §1.7 was the prerequisite for §1.8 by design.

Part 4 §4.5 takes the regression-side story further (LMS, MM, robust SEs, leverage diagnostics in robust regression). §3.6 of the GST book did the analogous spatial-statistics story (Cressie-Hawkins and Genton robust variograms). The geo-textbook §1.4 introduced robust DESCRIPTIVE statistics — median, MAD, trimmed mean — as the entry-level toolkit; §1.8 here is the formal M-estimator machinery that organises that family and extends it to regression.

Try it

  • In the m-estimator loss widget, pick Huber. Slide kk from 0.5 to 3.0 and watch the Gaussian efficiency reading. Find the kk that maximises efficiency on a quick eyeball — does it match the canonical k=1.345k = 1.345? Now read the breakdown row: it stays at "50%" across all kk. Huber's clean-data efficiency varies with kk; its breakdown does not.
  • Switch to Tukey. With c=4.685c = 4.685 confirm the efficiency reads ~95%. Drop cc to 2.0 and the efficiency falls fast — at c=2c = 2 the loss starts rejecting observations at standardised residuals beyond 2σ, which loses too much clean-data information.
  • Switch to Absolute. The widget reports Gaussian efficiency = 63.7% — exactly 2/π2/\pi. This is the price of the median's 50% breakdown at clean Gaussian data. Pen-and-paper: derive 2/π2/\pi from Var=1/[4f(0)2]\operatorname{Var}_\infty = 1 / [4 f(0)^2] for unit-variance Gaussian, where f(0)=1/2πf(0) = 1/\sqrt{2\pi}.
  • In the robust regression vs OLS widget, start with the clean sample. Confirm OLS and Huber agree visibly. Click once in the middle of the x range, well above the trend, in vertical-only mode. The OLS line tilts; the Huber line barely moves. Add a second vertical outlier, then a third. Read the slope drift row: OLS drift grows roughly linearly with each outlier; Huber drift stays small until ~5 outliers in 30 (~17%) — at which point the residual scale shifts and Huber starts giving the outliers more weight than its IRLS would on the clean fit.
  • Click Clear outliers. Switch to free 2-D mode. Click ONCE in the bottom-right corner (high x, low y). OLS visibly rotates anticlockwise; Huber rotates a little less but is still pulled. Toggle on the LMS overlay — it stays nearly aligned with the true line. This is the difference between bounded-influence M-regression and high-breakdown LMS at high leverage.
  • Click Add 1 extreme outlier five times to drop five copies of an extreme high-leverage point at the bottom-right corner. With ~14% contamination there OLS is completely off; Huber is still off; LMS is close to truth. Tukey biweight (toggle) tracks LMS more closely. Now ramp contamination past 50% — click 20+ outliers — and watch even LMS abandon the clean line.
  • Without coding: an M-estimator solves iψ(ri/σ)=0\sum_i \psi(r_i / \sigma) = 0. For ψ(u) = sign(u) the equation reduces to #{i:ri>0}=#{i:ri<0}#{i : r_i > 0} = #{i : r_i < 0} — the median condition. Verify on paper for a small sample of 5 numbers that the median is the unique θ̂ satisfying this. Repeat for ψ(u) = u (squared loss) — the equation becomes iri=0\sum_i r_i = 0, i.e. θ^=Xˉ\hat\theta = \bar{X}.
  • Pen-and-paper: the canonical Huber tuning k=1.345k = 1.345 is chosen to give 95% Gaussian efficiency. Using the M-estimator variance formula Var=σ2EZ[ψ2]/EZ[ψ]2\operatorname{Var}_\infty = \sigma^2 , \mathbb{E}_Z[\psi^2] / \mathbb{E}Z[\psi']^2 with ZN(0,1)Z \sim \mathcal{N}(0,1) and the Huber ψk(z)=max(k,min(k,z))\psi_k(z) = \max(-k, \min(k, z)), set up (do not necessarily evaluate analytically) the equation 1/Var(k)=0.951 / \operatorname{Var}\infty(k) = 0.95 and explain why it produces a unique positive kk.
  • The first widget reports breakdown 50% even for Huber. Yet under HIGH-LEVERAGE contamination in REGRESSION, Huber M-regression has lower than 50% breakdown. Reconcile: what is the difference between breakdown of an M-estimator of LOCATION and breakdown of an M-estimator of REGRESSION? (Hint: in location every observation contributes through ψ(residual). In regression every observation contributes through ψ(residual)·x_i. The factor x_i can be arbitrarily large.)
  • The second widget shows OLS rotating dramatically with one outlier; the slope drift |Δb| grows. Compute on paper: for n = 30 clean observations with xˉ=5\bar{x} = 5 and sx28.3s_x^2 \approx 8.3 and slope β^OLS=1.5\hat\beta_{\text{OLS}} = 1.5, adding one outlier at (x,y)=(10,10)(x^, y^) = (10, -10) changes the OLS slope by approximately (xxˉ)(y(α^+β^x))/(nsx2)(x^* - \bar{x})(y^* - (\hat\alpha + \hat\beta x^*)) / (n \cdot s_x^2). Estimate the order of magnitude and compare to what the widget reports.

Pause and reflect: the influence function is a derivative; ψ(u) = ρ'(u). The breakdown point is a global, discontinuous threshold (how much contamination breaks the estimator entirely). One is local, one is global. They are not interchangeable. Can you have a bounded influence function AND breakdown 0? Can you have unbounded IF AND breakdown 50%? (Hint: both are possible if you decouple ρ and the algorithm. The classical link — solve the score equation ψ=0\sum \psi = 0 — is what fuses IF and breakdown for M-estimators specifically.)

What you now know

The contamination model — gross-error mixture F=(1ε)F0+εHF = (1 - \varepsilon) F_0 + \varepsilon H — formalises "data with outliers" with one parameter. The breakdown point counts the fraction of contamination an estimator can survive; the influence function is its local derivative-style complement, "how much one observation moves the answer". Bounded influence function ⇔ positive breakdown ⇔ robust estimator. The classical mean and OLS have unbounded influence and breakdown 0; they break on a single outlier.

M-estimators are the unifying family: minimise iρ((Xiθ)/σ)\sum_i \rho((X_i - \theta)/\sigma) for a chosen loss. Squared loss recovers the mean (efficient but breakdown 0); absolute loss recovers the median (breakdown 50%, efficiency 64%); Huber's clipped-quadratic gives bounded ψ and 95% Gaussian efficiency at k=1.345k = 1.345; Tukey biweight goes redescending (rejects extreme outliers); Hampel's three-part adds finer rejection control. All are computed by IRLS, which iteratively reweights observations by wi=ψ(ui)/uiw_i = \psi(u_i)/u_i from a robust scale estimate.

In regression, OLS has breakdown 0 against high-leverage outliers. Huber M-regression solves the same IRLS recipe and gives bounded-influence regression. High-breakdown regression — LMS (Rousseeuw 1984), LTS, S-estimators, MM-estimators (Yohai 1987) — reaches the 50% breakdown ceiling; MM combines that with high Gaussian efficiency and is the modern default. Three caveats: robust statistics is NOT outlier removal; the tuning constant depends on your contamination model; above 50% contamination no equivariant estimator can recover the clean signal.

§1.9 closes Part 1 with consistency, asymptotic theory, and the delta method — the formal machinery that says M-estimator standard errors converge to σE[ψ2]/(E[ψ])2/n\sigma \sqrt{\mathbb{E}[\psi^2] / (\mathbb{E}[\psi'])^2 ,/, n} under regularity and what "regularity" actually means in finite samples. Part 4 §4.5 builds robust regression operationally; Part 3 §3.2 lifts the bootstrap CI machinery onto M-estimators; Part 8 §8.1 develops the bootstrap-of-M-estimator theory.

References

  • Huber, P.J. (1964). "Robust estimation of a location parameter." Annals of Mathematical Statistics 35(1), 73-101. (The foundational paper. Defines the M-estimator framework, proves Huber's loss is minimax in the gross-error model, and is still the cleanest entry-point to the theory.)
  • Huber, P.J. (1981). Robust Statistics. Wiley. (The first-edition textbook. Chapter 4 develops M-estimators of location; Chapter 7 derives the canonical tuning constants including k=1.345k = 1.345 for 95% Gaussian efficiency.)
  • Hampel, F.R. (1974). "The influence curve and its role in robust estimation." Journal of the American Statistical Association 69(346), 383-393. (Introduces the influence function as the central diagnostic for robustness; introduces the three-part redescending estimator that bears Hampel's name.)
  • Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley. (The standard graduate reference. The IF framework that organises the M-estimator family is developed here in full.)
  • Rousseeuw, P.J., Leroy, A.M. (1987). Robust Regression and Outlier Detection. Wiley. (The standard robust-regression textbook. Develops LMS, LTS, and the regression-specific breakdown theory.)
  • Maronna, R.A., Martin, R.D., Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley. (The modern textbook treatment with emphasis on practical computation. MM-estimators, modern IRLS, scale-equivariant estimators, and standard tuning tables are all here.)
  • Tukey, J.W. (1960). "A survey of sampling from contaminated distributions." In I. Olkin (ed.), Contributions to Probability and Statistics. Stanford University Press. (Tukey's original survey introducing the contamination-mixture viewpoint and motivating later M-estimator developments.)

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.