Robust descriptive statistics for clustered data

Part 1 — Univariate statistics for geo-data

Learning objectives

  • State the contamination problem in earth-science data: mining drillholes hitting unusually rich veins, fracture-corridor permeability spikes, environmental contamination hot-spots — single observations at 10–1000× the typical value that arrive routinely in geo-datasets
  • Show that the sample mean and standard deviation have BREAKDOWN POINT 0 — a single contaminated observation moves the sample mean by (outlier − mean)/n and the variance by (outlier − mean)²/(n−1), so the classical summaries are arbitrarily wrong under arbitrarily small contamination
  • Define median, IQR, MAD, trimmed mean, and Winsorised mean as robust replacements with breakdown points 50%, 25%, 50%, α, α respectively, and read the MAD × 1.4826 calibration as the consistent estimator of σ under Gaussian data
  • Sketch the M-estimator framework that unifies these — Huber's ρ function is mean-like for small residuals and median-like for large ones — and recognise the bias-robustness trade-off (the 10% trimmed mean keeps 96% efficiency at the Gaussian while gaining a 10% breakdown point)
  • Distinguish robust statistics (cure for OUTLIERS / CONTAMINATION — single-point pathologies) from declustering (cure for SAMPLING BIAS — many-point spatial structure). The two problems are orthogonal; the cures compose
  • Preview robust variogram estimators (§3.6, Cressie–Hawkins, Genton) as the spatial-side analogue: the classical variogram squares differences and so inherits the same outlier-sensitivity that wrecks the classical mean

§1.3 corrected the sample histogram for spatial-sampling bias — preferential drilling, monitoring stations clustered near sources, wells in productive zones — by switching from equal weights to declustering weights. That fixes the SHAPE of the histogram. But a corrected histogram is still vulnerable to a different pathology: a handful of values much bigger (or smaller) than the bulk can wreck the mean and standard deviation by themselves, regardless of how cleanly the rest of the histogram was estimated. The mining literature calls these ultra-rich drillholes, the petroleum literature calls them fracture-corridor spikes, the environmental literature calls them hot spots. They are routine in geo-data, and they break classical descriptive statistics in a particular, formally diagnosable way.

This section introduces robust descriptive statistics: alternative summaries — median, IQR, MAD, trimmed mean, Winsorised mean — that survive a fraction of contaminated observations without breaking. The vocabulary is precise. We will define the breakdown point (the fraction of contamination the estimator can survive), use it to rank classical and robust estimators on the same scale, and run two widgets that make the difference concrete. Then we will preview where this enters the spatial workflow (§3.6 robust variograms) and close with an honest scope warning: robust statistics handle outliers, declustering handles clustering, and a realistic geo-dataset usually needs both.

The contamination problem in earth science

Three canonical examples, each from a domain where geostatistics is heavily used.

  • Mining: an ultra-rich drillhole. A 200-sample drillout of a polymetallic deposit produces grades that fluctuate around a few g/t. One particular hole intersects a previously-unknown narrow vein and assays at 100 g/t — ten or more times the typical sample. The vein may be a real feature worth chasing, or it may be a contamination from a poorly-cleaned sample preparation rig, or it may be a chance intersection with material that does not extend volumetrically. Treating the one assay as a "normal" sample contributes 100/200 = 0.5 g/t to the overall mean — pushing a 4 g/t deposit estimate to 4.5 g/t, which for a 50 Mt deposit at 50/gtadds50/g·t adds1.25 B of nominal value. The classical mean is, by definition, this sensitive.
  • Petroleum: a fracture-corridor permeability spike. A tight matrix rock with a typical permeability of 1 mD is sampled at 200 core plugs. Five plugs happen to intersect a previously unrecognised fracture corridor and measure permeabilities between 50 mD and 100 mD. Geologically the fractures are real and the high-permeability values are real, but they are a fundamentally different population (fracture flow), not the rock-matrix property the rest of the sample characterises. The arithmetic mean of permeability is dominated by these five plugs — even though they are 2.5% of the sample, they contribute roughly (5 × 75 mD) / 200 = 1.9 mD to the mean, while the matrix samples contribute ~1.0 mD. The classical mean reports 2.9 mD — neither the matrix value nor the fracture value. Worse, in geometric units (which is what reservoir engineers actually use for permeability) the arithmetic mean is the wrong functional anyway. The §1.1 dist-viewer's heavy-tail perm preset is this exact case.
  • Environmental: a contamination hot spot. An aquifer-contamination survey has 100 wells with concentrations of an analyte that range from 0.01 to 0.5 mg/L. One well — sited on the historic spill location, before any remediation — reports 50 mg/L. That single observation contributes 50/100 = 0.5 mg/L to the arithmetic mean of the survey. The "average aquifer concentration" reported to the regulator is now indistinguishable from a survey that had no hot spot at all but had ten 5 mg/L wells — yet the spatial truth is wildly different. Reporting the mean conflates a single localised problem with a diffuse uniform-contamination scenario.

The common thread is that classical summary statistics treat every observation equally and let the magnitude of a value carry as much weight as the magnitude says. For geo-data, where rare extreme values are routine — geological, biological, and engineering systems produce them as a matter of physics — that treatment is wrong. The fix is to use summary statistics that limit how much any single observation can move the answer.

Breakdown point: a number for "how broken does this get?"

The breakdown point of an estimator is the fraction of contaminated observations above which the estimator can be pushed to an arbitrary value by changing only the contaminated observations' values. Lower is worse; the worst possible is 0 (a single contamination suffices) and the best possible is 50% (more than half contaminated would make the estimator describe the contamination rather than the data, so 50% is the ceiling).

For the sample mean, the breakdown point is 0. A single contamination at value voutv_{\text{out}} moves the mean by (voutxˉ)/n(v_{\text{out}} - \bar{x})/n. Send voutv_{\text{out}} to infinity and the mean follows. One observation suffices to make the classical mean arbitrarily bad.

For the sample standard deviation and variance, the breakdown point is also 0, for an even more violent reason. Variance is

s2=1n1i=1n(XiXˉ)2.s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.

A single contamination contributes (voutXˉ)2/(n1)(v_{\text{out}} - \bar{X})^2 / (n-1). Send voutv_{\text{out}} to infinity and the variance follows — quadratically, not linearly. One observation suffices to make the variance not just wrong but explosive.

For the sample minimum or maximum, the breakdown point is also 0 — trivially, since the max of a set is determined by any single value pushed above all others.

Now the robust counterparts.

  • Median: breakdown point 50%. The median is the middle of the sorted data; corrupting up to half the observations can push the median by at most the distance to the next-uncorrupted-order-statistic, which stays bounded. Corrupting more than half pushes it into the corruption itself.
  • Interquartile range (IQR), defined as Q3Q1Q_3 - Q_1: breakdown point 25%. Corrupting up to 25% of the data leaves the Q1Q_1 and Q3Q_3 both bracketed by uncorrupted observations; once corruption exceeds 25% on one side, that quartile is in the corruption.
  • Median absolute deviation (MAD), defined as MAD=medianiXimedian(X)\text{MAD} = \text{median}_i |X_i - \text{median}(X)|: breakdown point 50%. A robust analogue of the standard deviation — the median deviation from the median.
  • α-trimmed mean, defined by dropping the smallest αn\alpha n and largest αn\alpha n observations and averaging what remains: breakdown point α. Setting α=0\alpha = 0 recovers the mean (bp 0); α=0.5\alpha = 0.5 recovers the median (bp 50%). Practical choice: 10% on each side, breakdown 10%, far more robust than the mean.
  • α-Winsorised mean, defined by replacing the smallest αn\alpha n observations with the α\alpha-quantile and the largest αn\alpha n with the (1α)(1-\alpha)-quantile, then averaging: breakdown point α. Similar to the trimmed mean but less wasteful — every observation contributes, just clipped at the tails.

One calibration is worth memorising. For data that is exactly Gaussian, the MAD has an expectation of σΦ1(0.75)0.6745σ\sigma \cdot \Phi^{-1}(0.75) \approx 0.6745\sigma. Multiplying the sample MAD by 1/0.67451.48261/0.6745 \approx 1.4826 produces an estimator of σ\sigma that has the same expected value as the sample standard deviation for clean Gaussian data, but breakdown point 50% rather than 0. This MAD × 1.4826 formula is the standard way to report a "robust standard deviation" on geo-data; it is widely tabulated and almost as easy to compute as the classical SD.

Robust Vs ClassicalInteractive figure — enable JavaScript to interact.

The widget instantiates three earth-science datasets and a contamination slider. The clean lognormal preset is the ideal-distribution case where classical and robust statistics agree (give or take small-sample fluctuation). Slide the contamination knob from 0% up: the classical mean and SD jump immediately, while the median, IQR, MAD, and 10% trimmed mean stay within a few percent of their clean-data values. The drift readout reports the percentage change against the same draw at zero contamination, so the breakdown is visible in numerical terms. The mining grade preset starts at 2% contamination by default — one ultra-rich-drillhole-equivalent in 200 samples — and the table shows the classical mean already overstating the central tendency. The heavy-tail permeability preset is the §1.1 dist-viewer dataset replayed, with the fracture-corridor cluster baked in.

Watching the breakdown happen

A small dataset — n = 20 standard-normal samples — makes the discrete-step nature of breakdown points easy to see. The second widget lets you click on individual samples to flip them to outliers (value = +30, far above the bulk). As you contaminate samples one at a time, the eight estimators on the right panel report their current value and the drift from clean-data baseline (measured in clean-data σ units). Each row also displays its theoretical breakdown threshold: the minimum number of contaminated samples needed before the estimator is theoretically broken.

Breakdown Point DemoInteractive figure — enable JavaScript to interact.

The breakdown pattern is precise. The mean, SD, and sample max break at k = 1 — flip the first sample and they are already wrong. The 10% trimmed mean and 10% Winsorised mean break at k = ⌊0.10 × 20⌋ + 1 = 3: at exactly two contaminations the trimming boundary still catches them, at three contaminations one of them slips past the trim. The IQR breaks at k = ⌊0.25 × 20⌋ + 1 = 6. The median and MAD hold all the way to k = ⌊0.50 × 20⌋ + 1 = 11. The drift bars stay green (in spec), turn yellow (drifting but functional), and finally red (broken) at these thresholds. The same pattern with the same proportions appears in any sample size; only the absolute counts scale.

The bias-robustness trade-off

Robust estimators give up something for their resistance to contamination. At a perfectly clean Gaussian distribution — the "ideal" — the sample mean is the most efficient estimator of the population mean. The median is asymptotically efficient at only 2/π63.7%2/\pi \approx 63.7%: an n = 200 Gaussian sample reports a median that fluctuates as widely around the truth as the mean of an n ≈ 127 Gaussian sample. Trimmed means recover most of the gap: a 10% trimmed mean is about 96% efficient at the Gaussian. The MAD is similarly less efficient than the SD on clean data.

The trade-off is summarised by what statisticians call the influence function: a plot of how much the estimator moves per unit shift at a single point. The classical mean has a linear, unbounded influence function (any single point can pull the estimate arbitrarily). The median has a bounded influence function — corruptions beyond the bulk are essentially ignored. The trimmed mean splits the difference: linear in the middle, completely insensitive outside the trimming boundary. Huber's M-estimator framework gives a unified treatment: define the estimator as the minimiser of iρ(Xiθ)\sum_i \rho(X_i - \theta) for a loss function ρ\rho. Setting ρ(u)=u2/2\rho(u) = u^2/2 recovers the mean (linear influence, breakdown 0). Setting ρ(u)=u\rho(u) = |u| recovers the median (bounded influence, breakdown 50%). Huber's recommended in-between loss is ρ(u)=u2/2\rho(u) = u^2/2 for u<k|u| < k and ρ(u)=kuk2/2\rho(u) = k|u| - k^2/2 for uk|u| \ge k: mean-like for residuals inside [k,k][-k, k], median-like outside. The result is bounded influence and breakdown above 0, with efficiency close to 100% at clean Gaussian data when kk is set to its standard value of 1.345σ1.345\sigma. The M-estimator machinery is treated formally in §1.8; for §1.4 the framework is mainly there to organise the trimmed-mean / Winsorised-mean / median family in your head.

Why this matters in the spatial workflow

The univariate descriptive-statistics application — replacing mean / SD with median / MAD when reporting "the typical value" and "the spread" of a geo-dataset — is immediate and almost trivially valuable in mining, petroleum, and environmental reports. It is also the simplest entry-point to a much larger spatial-statistics issue.

The classical variogram is non-robust by design. Part 3 builds the experimental variogram

γ^(h)  =  12N(h)(i,j)N(h)(Z(si)Z(sj))2,\hat{\gamma}(h) \;=\; \frac{1}{2 N(h)} \sum_{(i,j) \in N(h)} \bigl(Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\bigr)^2,

which is half the average squared difference between value pairs at lag hh. The same quadratic dependence that makes the sample variance explode under one outlier here makes the variogram explode at every lag the outlier participates in. A single contaminated sample paired with neighbours at twenty different lags raises γ^(h)\hat{\gamma}(h) at all twenty lags simultaneously — and a sample with very high value is paired with more neighbours than a typical sample because pair counts go like the number of points within the lag bin. The downstream consequence is a kriging variance that is uniformly inflated, a kriged map that is over-smoothed, and a sequential Gaussian simulation Part 7 that has an over-large nugget effect.

The fix on the spatial side is a robust variogram estimator. The two most common are:

  • Cressie–Hawkins (1980): replace squared differences with the fourth root of the average fourth-root of squared differences, with a small-sample correction. The fourth-root transform tames outliers in the same way the median tames the mean.
  • Genton (1998): use the median (or MAD) of pairwise squared differences rather than the mean. Breakdown point 50% (rather than 0 for the classical variogram).

Both are treated in detail in §3.6. For now, the connection is the lesson of §1.4 applied to the spatial side: the univariate observation that "squared distances make outliers explosive" carries over, lag by lag, into the spatial autocorrelation estimator. If you find yourself reaching for robust descriptive statistics in §1.1's diagnostic toolkit, you will need robust variograms in Part 3.

Beyond the variogram, robust weights also enter variogram model fitting in Part 4. Weighted least squares with outlier-resistant weights (Iteratively Reweighted Least Squares with a Huber loss) is the modern default when the experimental variogram's residuals are not Gaussian — which is, again, the realistic geo-data setting.

Robust statistics are NOT declustering

A practical confusion to clear up. §1.3 introduced declustering: a fix for samples that are spatially over- or under-represented because of preferential location. §1.4 introduces robust statistics: a fix for samples that contain single-point pathologies — outliers, contamination — regardless of where they are located in space. The two problems are orthogonal:

  • A uniformly-sampled dataset with one extreme outlier needs robust statistics but not declustering — the spatial coverage is fine.
  • A clustered (preferentially-sampled) but clean dataset needs declustering but not robust statistics — the values themselves are well-behaved.
  • A real geo-dataset — say, a mining drillout — usually needs both: it is preferentially located AND contains contamination from rare ultra-rich intersections. The two cures compose. Decluster first to get the right weights, then compute robust summaries with those weights (a weighted median, weighted IQR, etc., which Goovaerts 1997 §4.1 develops in detail).

Putting the two together is the practical workflow: declustering corrects the histogram shape so that each sample's contribution is right-weighted; robust summaries then ensure that each sample's value cannot push the summary further than physical reasonableness allows.

What §1.4 deliberately leaves to later sections

This section introduces robust DESCRIPTIVE statistics — the summaries you report for a univariate sample. Two natural extensions belong to later sections:

  • §1.8 — robust and M-estimators in detail. The full M-estimator machinery, the influence function as a formal object, and Huber's minimax optimality result are properly treated there. §1.4 has shown you that the median is the M-estimator with ρ(u)=u\rho(u) = |u| and the trimmed mean is the M-estimator that drops the tails; §1.8 makes that explicit.
  • §3.6 — robust variogram estimators. The spatial-side application that the univariate analogy here motivates: Cressie–Hawkins (1980), Genton (1998), and friends.
  • Robust outlier identification. §1.4 has presented robust SUMMARIES that survive outliers. It has NOT presented methods to identify and remove outliers — Tukey's 1.5 × IQR fences, the modified Z-score, multivariate Mahalanobis-distance methods. Some practitioners prefer to identify-and-remove (an explicit decision); others prefer to leave-and-downweight (using robust summaries directly). Both have honest use cases. Geo-data practitioners typically combine them: identify candidates with 1.5 × IQR or Q-Q plot deviations, decide case-by-case whether each is a real geological feature or a measurement artefact, then report ROBUST summaries on the cleaned sample. The combination is treated in Goovaerts 1997 §2.

One subtle reminder. Robust statistics do not "remove" outliers; they limit the outliers' contribution to the summary. If the outliers themselves are real, geologically meaningful features — fracture corridors, ore shoots, hot spots — you do not want them removed. You want them characterised separately, and you want your overall summary not to be dominated by them. Robust descriptive statistics let you describe the bulk without losing track of the tail.

Try it

  • In the robust-vs-classical widget, start on the Clean lognormal preset with the contamination slider at 0%. Confirm that the mean ≈ median × (1 + skewness-driven offset) and the SD ≈ MAD × 1.4826, give or take small-sample fluctuation. Now slide contamination up to 5%. By how much does the mean drift (in percent)? By how much does the median drift? The ratio of these two drifts is roughly the ratio of breakdown rates near the contamination level.
  • Switch to the Mining grade preset. Default starts at 2% contamination. Slide it down to 0% and read the four classical stats and four robust stats. Slide back to 2%, then up to 5%. At what contamination does the classical mean cross a 10% drift from its clean-data value? At what contamination does the median cross a 10% drift? What does this tell you about reporting a "mean grade" for a deposit that has any ultra-rich drillholes at all?
  • In the breakdown-point widget, contaminate samples one at a time (click them to flip green→red). After k = 1, what colour are the mean / SD rows? After k = 2 (10% contamination), watch the 10% trimmed mean and 10% Winsorised mean rows; predict at k = 3 (15%) what colour they will be. Confirm. Continue up to k = 10 (50% contamination) and observe the breakdown sequence: which is the LAST estimator to break?
  • Still in the breakdown widget, click Reset then contaminate just two samples but choose them to be ADJACENT in sample-index order (e.g. samples 1 and 2). Now click Reset and contaminate samples 1 and 10 instead. Is the resulting drift the same in both cases, or does it depend on WHICH two samples were contaminated? For which estimators does it matter?
  • Without coding: a petroleum engineer has 500 core permeability samples and reports an arithmetic mean of 12 mD and an SD of 60 mD. The Q-Q plot against a lognormal reference shows the bulk straight, with the top ten points lifting off the line — fracture corridor intersections. What do you predict the median, MAD × 1.4826, and 10% trimmed mean would be? Why would you NOT use the arithmetic mean to report "the typical matrix permeability" of this reservoir?

Pause and reflect: a robust estimator IGNORES outliers; a declustering weight REWEIGHTS them. For a hot-spot environmental survey where the hot spot is real and is exactly the thing you want to characterise, neither approach by itself is right — the robust median would hide it, and declustering would weight it less than its information content suggests. What would a sensible reporting strategy be in that case? (Hint: report SEPARATE summaries for the hot-spot zone and the background zone — geological stratification, treated in §1.5 and §2.4 in different ways.)

What you now know

You have the contamination problem: in mining, petroleum, and environmental data, single extreme values arrive routinely and break the classical mean and standard deviation immediately (breakdown point 0). You have the robust toolkit: median (bp 50%), IQR (bp 25%), MAD with the × 1.4826 Gaussian calibration (bp 50%), 10% trimmed mean (bp 10%), 10% Winsorised mean (bp 10%). You have the breakdown-point concept as a quantitative ranking and you have seen the breakdowns happen on a 20-sample dataset where the thresholds are visible as clicks. You have the bias-robustness trade-off framed via efficiency at the clean Gaussian (median ~64%, trim10 ~96%), and the M-estimator framework that unifies the family.

You have the geo-data context: classical variograms inherit the same outlier-sensitivity that wrecks the classical mean, motivating the robust variograms of §3.6. And you have the orthogonality between robust statistics (cure for OUTLIERS) and declustering (cure for SAMPLING BIAS) — both are needed in realistic geo-data.

§1.5 closes Part 1 by composing what you have learned into the move from GLOBAL summaries (a single mean / median for the whole study area) to LOCAL summaries (block- or panel-level averages over neighbourhoods). The local-statistics move is the conceptual bridge to kriging's search ellipsoid in Part 5 and the change-of-support story in Part 5.5. Robust descriptive statistics enter that story exactly where you would expect: a local block summary computed from a few biased and possibly contaminated samples needs both declustering and robust summaries to be honest.

References

  • Huber, P.J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1), 73–101. (The foundational M-estimator paper. Defines the Huber loss, proves the minimax optimality result, and is still the technical entry-point to the modern robust-statistics literature.)
  • Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley. (The book that put the median, trimmed mean, IQR, and box plot into routine statistical practice. The robust-EDA philosophy used throughout this section is Tukey's.)
  • Cressie, N., Hawkins, D.M. (1980). Robust estimation of the variogram. Journal of the International Association for Mathematical Geology, 12(2), 115–125. (The first robust variogram estimator — the fourth-root transform that §3.6 develops. Connects univariate robust statistics to the spatial-statistics workflow.)
  • Rousseeuw, P.J., Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424), 1273–1283. (Develops the Sn and Qn statistics — robust scale estimators with breakdown 50% and better efficiency than the MAD at the Gaussian. The natural successors when the MAD's 37% efficiency is too low.)
  • Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley. (The standard graduate reference. The influence-function framework that organises the M-estimator family is developed here in full.)
  • Maronna, R.A., Martin, R.D., Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley. (The modern textbook treatment with emphasis on practical computation. The 10% trimmed mean default and the MAD × 1.4826 calibration are tabulated and discussed.)
  • Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§2.2 on univariate statistics in the geostatistical context — including the robust alternatives discussed here and how they interact with declustering weights.)
  • Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapter 2 builds the descriptive-statistics toolkit for geo-data with practical pointers toward robust alternatives when the dataset contains outliers — the classical applied-geostatistics treatment.)

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.