Robust variogram estimators

Part 3 — Experimental variograms

Learning objectives

Identify the non-robustness of the classical Matheron estimator $\hat\gamma(h) = (1/(2N(h))) \sum (Z_j - Z_i)^2$ : squared increments are quartic in the magnitude of any contaminated value, and a single 4-5σ outlier paired with $N(h)$ other samples per bin can inflate $\hat\gamma$ by 50-100% at the affected lags
Write the CRESSIE-HAWKINS 1980 robust estimator $\hat\gamma_{CH}(h) = \tfrac{1}{2}\,\bigl[N(h)^{-1} \sum |Z_j - Z_i|^{1/2}\bigr]^4 / (0.457 + 0.494/N(h))$ : the fourth root inside / fourth power outside damps outliers because a value 10× a typical magnitude contributes $\sqrt{10} \approx 3.16\times$ the typical $|\cdot|^{1/2}$ , not $100\times$ as the squared increment would. The $0.457 + 0.494/N$ denominator is the small-sample bias correction so that the estimator is unbiased for Gaussian increments
Write the GENTON 1998 $Q_n$ estimator $\hat\gamma_G(h) = \tfrac{1}{2}\bigl[2.2191 \cdot Q_n(\{Z_j - Z_i\})\bigr]^2$ where $Q_n$ is the Rousseeuw-Croux pairwise-difference order statistic, and recognise it as the variogram-domain analog of the §1.4 robust univariate scale estimator
State the EFFICIENCY-ROBUSTNESS trade-off: classical Matheron is 100% efficient at Gaussian increments but has breakdown point 0; Cressie-Hawkins is ~85% efficient and robust to 10-15% contamination; Genton $Q_n$ is ~70% efficient and robust to 50% contamination. The estimator with the highest breakdown point has the noisiest σ=0 baseline, by an irreducible mathematical trade-off (Maronna et al. 2006 §3-4)
Apply the field-standard practical recommendation: ALWAYS compute at least classical + Cressie-Hawkins on every dataset; if they agree (within ~10% at most lags), the data is effectively clean and the classical estimator is fine; if they DISAGREE substantially, the data has outliers and the robust estimator is the defensible one for downstream model fitting
Recognise the connection to §1.4 robust univariate statistics: §1.4 introduced robust scale and location estimators (median, MAD, trimmed mean) for outlier-resistant DESCRIPTIVE statistics; §3.6 ports the same philosophy to the SPATIAL-PAIR-DIFFERENCE domain. The Matheron γ̂ is the variogram-side analog of the sample mean — same breakdown point of 0
Know when NOT to use robust variograms: truly clean Gaussian data (classical is best), very small sample sizes (n < 50, where all estimators are noisy and the efficiency loss of the robust ones outweighs their robustness gain), and short-range bursts that may be REAL signal rather than contamination (the robust estimator will mistake them for outliers and mis-fit the short-lag structure)
Honestly locate §3.6 in Part 3: the section closes the experimental-variogram chapter. After §3.6 you have an empirical $\hat\gamma(h)$ that is directional, anisotropic, indicator-aware, and outlier-resistant. Part 4 will fit a permissible MODEL to whichever $\hat\gamma$ you trust; Part 5 kriging consumes the fitted model. The choice between classical and robust $\hat\gamma$ feeds downstream.

§3.2 introduced the Matheron 1962 empirical variogram

\hat\gamma(h) \;=\; \frac{1}{2\,N(h)} \sum_{(i,j) \in N(h)} \bigl(Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\bigr)^2,

as the Part 3 workhorse: a sample mean of squared spatial increments, one mean per lag bin. §3.3 made it directional, §3.4 anisotropic, §3.5 categorical via the indicator transform. All four sections kept the squared-increment kernel $(Z_j - Z_i)^2$ at the heart of the estimator. §3.6 admits the cost of that choice. Squaring is the high-leverage operation of classical statistics: it makes large residuals dominate the sum, and it makes the sample mean and sample variance the textbook examples of estimators with breakdown point 0. The variogram inherits that pathology directly. A single contaminated value pairs with $N(h)$ other samples in every lag bin where its separations land — and at each of those pair contributions, its squared difference $(Z_{\text{out}} - Z_j)^2$ is quadratic in its magnitude. A 4σ outlier in a clean dataset can inflate $\hat\gamma(h)$ by 50-100% at the lags where its pairs sit. The classical Matheron estimator is, by construction, this fragile.

§3.6 is the variogram-side analog of §1.4. There, in the univariate setting, the sample mean and sample standard deviation were replaced by robust counterparts — median, IQR, MAD, trimmed mean — that survive a fraction of contaminated observations without breaking. The breakdown point graduated from 0 (classical) to 25-50% (robust), in exchange for a modest loss of efficiency under truly Gaussian data. Now we walk the same path one level up. Cressie and Hawkins (1980) developed a robust variogram estimator that takes the fourth root of the absolute increment, averages, and raises to the fourth power — a smoothed median-of-magnitude operation that damps outliers without throwing them away. Genton (1998) went further with $Q_n$ , the Rousseeuw-Croux pairwise-difference order statistic, achieving the maximum theoretical breakdown point (50%) at the cost of further Gaussian efficiency. Both estimators are field-standard and both are available in modern geostatistical software (R's gstat, GSLIB's gamv, SGeMS's variogram routines).

The §3.6 widgets stage the comparison directly. The first overlays all three estimators (Matheron, Cressie-Hawkins, Genton) on the same lag axis alongside the TRUE γ(h), and lets the reader inject 0-10 outliers at variable magnitude to SEE the divergence. The second fixes a lag and sweeps outlier σ from 0 to 5, plotting the three breakdown curves on a single axis — a one-look summary of how each estimator degrades under increasing contamination. By the end of §3.6 you will be able to (i) recognise when an experimental variogram has been corrupted by outliers, (ii) compute and compare the three estimators on your own data, and (iii) make a defensible choice between classical and robust $\hat\gamma$ for downstream Part 4 model fitting.

Why the classical Matheron estimator breaks under contamination

The arithmetic is direct. Suppose a clean dataset of $n$ samples has true variogram $\gamma(h) = 1$ at some lag $h$ . With $N(h)$ pairs in the lag bin, the Matheron estimator has expectation $E[\hat\gamma(h)] = 1$ and standard deviation $\hat\gamma(h)\sqrt{2/N(h)} \approx \sqrt{2/N(h)}$ — the classical sampling-distribution result for the sample mean of $N(h)$ squared Gaussian increments. Now replace ONE sample at position $\mathbf{s}_k$ by a value $Z_k + \Delta$ where $\Delta$ is the contamination magnitude. The sample $\mathbf{s}_k$ appears in roughly $\bar{n}_k \approx N(h) \cdot 2/n$ pairs within that lag bin (each sample participates in about $2 N(h)/n$ pairs on average, two endpoints per pair). The pair contributions from $\mathbf{s}_k$ 's pairs were originally $(Z_k - Z_j)^2 \approx 2\gamma(h)$ each; they become $(Z_k + \Delta - Z_j)^2 \approx 2\gamma(h) + 2\Delta(Z_k - Z_j) + \Delta^2$ . The cross-term sums roughly to zero (the $(Z_k - Z_j)$ values are zero-mean), but the $\Delta^2$ term contributes $\Delta^2$ to every one of $\mathbf{s}_k$ 's pair entries.

The bias added to $\hat\gamma(h)$ by a single $\Delta$ contamination is therefore

\text{Bias}_{\text{Matheron}}(h) \;\approx\; \frac{\bar{n}_k \cdot \Delta^2}{2\,N(h)} \;=\; \frac{\Delta^2}{n}.

With $n = 150$ samples and a contamination of $\Delta = 4\sigma = 4$ , that bias is $16/150 \approx 0.107$ — about a 10% upward shift in $\hat\gamma(h)$ at every lag bin the outlier participates in. With $\Delta = 10\sigma$ (a serious laboratory-prep error or a transcription mistake), the bias jumps to $100/150 \approx 0.67$ — a 67% inflation. The growth is quadratic in $\Delta$ : the larger the outlier, the more disproportionately it dominates. This is the canonical breakdown-point-zero pathology, ported from §1.4 into the variogram. A single contamination of arbitrary magnitude moves $\hat\gamma$ to an arbitrary value.

Real-data manifestations of this pathology are routine. A mining exploration drillhole that intercepts an unexpected vein contributes one ultra-rich assay to the dataset. A permeability core plug that intersects a fracture corridor returns 50-100 mD against a matrix typical of 1 mD. An environmental sampling well sitting on a historical spill location reports 50 mg/L against a regional background of 0.1 mg/L. None of these are necessarily DATA ERRORS — they may be perfectly real measurements of a process that is locally distinct from the regional bulk. But for the purpose of estimating the stationary $\gamma(h)$ of the regional bulk, they act as contamination. The classical Matheron estimator includes them with full quadratic weight; the robust estimators do not.

The Cressie-Hawkins 1980 estimator

The first practical robust variogram estimator was proposed by Cressie and Hawkins (1980), working from a simple observation: if the increments $Z_j - Z_i$ are approximately Gaussian (as second-order stationarity often implies), their fourth root absolute values $|Z_j - Z_i|^{1/2}$ follow a much heavier-tailed distribution whose mean is far less sensitive to extreme values than the mean of squared increments. By taking the fourth-root mean, then raising the result to the fourth power, you recover an estimator that targets the same variogram but with much higher resistance to outlying pairs.

The Cressie-Hawkins estimator is

\hat\gamma_{CH}(h) \;=\; \dfrac{1}{2}\, \dfrac{\Bigl[ N(h)^{-1} \sum_{(i,j) \in N(h)} |Z_j - Z_i|^{1/2} \Bigr]^{4}}{0.457 \;+\; 0.494 / N(h)}.

The numerator is the fourth power of the mean of the square-roots of absolute increments. The denominator $(0.457 + 0.494/N(h))$ is the small-sample bias correction: under Gaussian increments, the fourth-power-of-mean-of-square-roots is biased by a calculable factor, and dividing by $(0.457 + 0.494/N(h))$ recovers an estimator that is approximately unbiased for $\gamma(h)$ at finite $N$ . As $N(h) \to \infty$ , the correction converges to $1/0.457 \approx 2.19$ .

To see why this is robust, compare a single outlier's contribution. With the classical estimator, a contaminated pair with $|Z_j - Z_i| = 10$ (say, a 10σ excursion) contributes $100$ units to the sum; the $N(h) - 1$ uncontaminated pairs contribute roughly $2\gamma(h) = 2$ each. The outlier dominates $(100)/(N(h)\cdot 2 + 100) \approx 100/300 = 33%$ of the sum for $N(h) = 100$ . With Cressie-Hawkins, the contaminated pair contributes $10^{1/2} \approx 3.16$ ; the uncontaminated pairs contribute $\sqrt{2}\approx 1.41$ on average. The outlier now dominates $3.16 / (100 \cdot 1.41 + 3.16) \approx 2.2%$ of the sum — a 15-fold reduction in leverage. After the fourth-power step, the smoothed-out outlier influence is propagated forward, but the FOURTH-ROOT step already removed the leverage upstream. The fourth power on the back end ensures the units are right (back to a variance / variogram scale).

Cressie and Hawkins originally derived the estimator by analogy with location estimators: the fourth root of a Gaussian-squared variate is approximately Gaussian, and so the mean-of-fourth-roots is a robust analog of the mean-of-squares. The technical machinery is in Cressie & Hawkins 1980 and Cressie 1993 §2.4. The practical implementation requires only a single pass through the pair set per lag bin — no sorting, no iteration — making it computationally identical in cost to Matheron. Most modern software computes both side by side as a matter of routine.

What does Cressie-Hawkins COST relative to classical? Under truly Gaussian increments with no contamination, its variance is about $1/0.85$ times the classical variance — equivalently, its Gaussian efficiency is ~85%. So if you have 100 truly-Gaussian-increment pairs and use Matheron, the standard error of $\hat\gamma$ is 100% (by convention); if you use Cressie-Hawkins on the same data, the standard error is roughly $\sqrt{1/0.85} \approx 108%$ . An 8% loss in efficiency. In exchange you get an estimator that absorbs ~10-15% contamination without noticeable bias. For most field datasets, this is an excellent trade.

The Genton 1998 $Q_n$ estimator

Where Cressie-Hawkins damps outliers through a smooth fourth-root transform, Genton (1998) took a more aggressive route: replace the sum-of-squares-or-roots entirely with an ORDER STATISTIC of the pairwise differences. The estimator is

\hat\gamma_G(h) \;=\; \dfrac{1}{2}\,\bigl[\, 2.2191 \cdot Q_n\bigl(\{Z_j - Z_i\}_{(i,j) \in N(h)}\bigr)\bigr]^{2}

where $Q_n$ is the Rousseeuw-Croux 1993 robust scale estimator applied to the sample of pair-differences in the lag bin. Q_n is defined as

Q_n(\{x_1, \ldots, x_N\}) \;=\; \bigl\{ |x_i - x_j|\,;\; i < j \bigr\}_{(k)}

where $(k)$ is the $k$ -th order statistic of the $\binom{N}{2}$ pairwise absolute differences, and $k$ is chosen as $\binom{h}{2}$ with $h = \lfloor N/2 \rfloor + 1$ . The constant $2.2191$ makes $Q_n$ a consistent estimator of $\sigma$ for Gaussian samples. Effectively, Q_n is the first-quartile-like order statistic of pairwise differences — a robust analog of the standard deviation that uses ALL pairs of observations rather than just the deviations from a single central reference. It has breakdown point 50% — the maximum theoretically possible.

Applying Q_n to the in-bin pair-differences ${Z_j - Z_i}$ recovers a robust estimate of the increment standard deviation $\sigma_{\Delta Z}(h)$ . Squaring gives the increment variance $2\gamma(h)$ , dividing by 2 gives $\gamma(h)$ . So $\hat\gamma_G(h) = \tfrac{1}{2}(2.2191 \cdot Q_n)^2$ . The key insight is that Q_n is essentially a MEDIAN — and the median of any sample of pairwise differences is barely affected by a few extreme entries.

What does Genton COST? Under truly Gaussian increments with no contamination, the Genton estimator is about $0.7$ times as efficient as Matheron — i.e., its standard error is roughly $\sqrt{1/0.7} \approx 1.20\times$ the classical standard error. A 20% loss in efficiency. In exchange you get an estimator that can tolerate up to 50% contamination (any more and the order statistic itself would be drawn from the contamination). The trade is significantly steeper than Cressie-Hawkins's, but for datasets where contamination is suspected or known to be heavy, Genton is the safer bet.

The computational cost of Q_n is higher than Cressie-Hawkins. Computing all pairwise differences within each lag bin is $O(N(h)^2)$ — for $N(h) = 100$ pairs, that's 4950 pairwise comparisons. The widget below subsamples to keep it fast, but in production code on small datasets the full pairwise computation is the standard. Rousseeuw and Croux 1993 give an $O(N \log N)$ algorithm that is used in the R robustbase package and equivalent C/Fortran libraries.

Side-by-side: three estimators on a contaminated field

The first §3.6 widget runs all three estimators on the same 150-sample Gaussian field with a known Spherical variogram (range 0.30, sill 1.0). Use the two sliders to inject 0-10 outliers at magnitudes between 0 and 6σ, and watch the three empirical-variogram curves diverge.

Three things to do with this widget. First, set both sliders to zero (no contamination). All three estimators should track the true Spherical γ(h) — gold (Matheron), blue (Cressie-Hawkins), and green (Genton) lines overlapping closely with the grey dashed truth curve. There is some scatter because of finite sampling; Matheron is the least scattered (highest efficiency at Gaussian), Genton the most scattered. The three "MAE vs truth" readout pills below the canvas should agree to within a factor of two of each other.

Second, slide the outlier count to 2 at magnitude 4σ. The Matheron curve will lift above truth at the lags where the outlier pairs sit — the effect is most pronounced near and beyond the range, where many pairs cross the affected samples. The Cressie-Hawkins curve barely moves. The Genton curve hugs truth even more tightly. The Matheron MAE jumps to 2-4× what it was at zero contamination; the others change by ~10-20%. This is the regime that motivates routine use of robust estimators: just a few outliers at moderate magnitude are enough to make classical Matheron quantitatively wrong, even though the field "looks fine" visually.

Third, push the slider to 10 outliers at 5σ. The Matheron curve is now wildly off — typically 2-4× the true sill at some lags. Cressie-Hawkins is bent upward but stays in the right neighbourhood. Genton tracks truth surprisingly closely. Click "New realisation" to see this behaviour is stable across resamples — it's a structural property of the estimators, not a feature of one particular draw. The widget's verdict message reports the MAE ratios and provides the diagnostic interpretation: the Matheron-vs-CH gap itself is a powerful indicator that the data is contaminated, and the practical recommendation is to ALWAYS compute and compare at least these two.

The efficiency-robustness trade-off

The three estimators occupy three points on a fundamental statistical trade-off curve. Under truly Gaussian increments with no contamination, the Matheron estimator is the minimum-variance unbiased estimator for $\gamma(h)$ — by definition, no other unbiased estimator can be more efficient. Cressie-Hawkins gives up ~15% of that efficiency in exchange for ~10-15% contamination tolerance. Genton gives up ~30% of efficiency in exchange for ~50% contamination tolerance. This is not a coincidence: the Hampel theorem from robust statistics (Maronna et al. 2006 §3-4) shows that any estimator's efficiency at the assumed model and its breakdown point are linked by a mathematical inequality. You cannot have both maximum efficiency and maximum breakdown point in a single estimator.

	Matheron 1962	Cressie-Hawkins 1980	Genton 1998 Q_n
Form	arithmetic mean of $(\Delta Z)^2$	fourth power of mean of $\|\Delta Z\|^{1/2}$	squared order-statistic of pairwise diffs
Gaussian efficiency	100%	~85%	~70%
Breakdown point	0	~10-15%	~50%
Computational cost	O(N(h)) per bin	O(N(h)) per bin	O(N(h)²) per bin
Bias correction	none needed	denominator (0.457 + 0.494/N)	multiplier 2.2191
Default availability	everywhere	GSLIB, R gstat, SGeMS	R robustbase + variogram packages

The TABLE summarises the trade-off. Note that the breakdown points are FRACTIONS of the sample — for $N(h) = 100$ pairs in a bin, Matheron breaks under any contamination, Cressie-Hawkins under ~10-15 contaminated pairs (~10-15 outliers per bin), Genton under ~50 contaminated pairs. The cost-per-bin column matters in practice: Matheron and Cressie-Hawkins are linear-time and add a few microseconds per bin even for large datasets; Genton is quadratic and can be 100-1000× slower for typical bin sizes, though the Rousseeuw-Croux 1993 $O(N \log N)$ algorithm is usually deployed.

Breakdown curves: γ̂(h*) as a function of outlier σ

The second §3.6 widget visualises the breakdown trade-off directly. It fixes a lag at $h^* = 0.15$ — inside the variogram's rising leg, where Matheron is most sensitive to contamination — and sweeps the outlier magnitude from 0 to 5σ. For each σ value it runs 6 independent realisations with the chosen number of outliers (1, 3, 5, or 10) and reports the trial-mean $\hat\gamma(h^*)$ for each estimator. The three curves trace the breakdown behaviour cleanly.

What to observe. With 1 outlier (1% of $N = 100$ samples), the Matheron curve already rises VISIBLY above truth by σ = 1-2 and accelerates from there — the breakdown-point-zero pathology in action. Cressie-Hawkins rises more slowly but does accelerate at very large σ (3-5σ outliers). Genton is essentially flat — at 1% contamination, the order-statistic-based estimator is barely affected, regardless of how extreme any single outlier becomes.

Switch to 5 outliers (5% contamination). Matheron rises steeply across the entire σ axis — by σ = 5 it has typically drifted 2-3× the true γ. Cressie-Hawkins still bends upward but stays within shouting distance of truth. Genton remains very flat — well below its 50% breakdown threshold.

Switch to 10 outliers (10% contamination). This is right at the Cressie-Hawkins breakdown threshold. The blue curve now climbs noticeably at large σ — its protective effect is exhausting. Genton still holds because it has plenty of breakdown budget left.

The pedagogical takeaway is the SHAPE of these curves. Each estimator has its own breakdown ceiling: 0% (Matheron), ~10-15% (Cressie-Hawkins), ~50% (Genton). Below the ceiling, the estimator is stable; at and above the ceiling, the estimator starts to inherit the contamination. The slider lets you find each estimator's personal cliff edge under the controlled conditions of the synthetic field, and that intuition transfers directly to real data: classical Matheron will fall off ITS cliff with the very first outlier, Cressie-Hawkins around 1-in-10 contamination, Genton around 1-in-2.

The practical comparison workflow

The single most useful practical guidance from §3.6 is: compute Matheron and Cressie-Hawkins on EVERY new variogram, side-by-side, always. The cost is negligible (a few extra microseconds per bin), and the comparison serves three roles simultaneously.

Diagnostic for contamination. If Matheron and Cressie-Hawkins agree to within ~10% across all kept lag bins, the data is effectively clean. Trust either estimator; use Matheron for downstream fitting (it has the highest efficiency at Gaussian, so smaller error bars). If the two estimators DISAGREE substantially at one or more lags, the data has outliers and the disagreement IS the diagnostic. Investigate the high-magnitude pairs in the offending bins — are they real geological / engineering features (in which case maybe a different stationarity domain is warranted), or are they instrumental glitches (in which case trim them)?
Robust baseline for downstream fitting. Even if the data appears clean, having the Cressie-Hawkins curve in hand gives you a sanity-check baseline against which to compare any Part 4 model fit. If the fitted model agrees with Matheron but conflicts with Cressie-Hawkins at the rising leg, that's a signal worth investigating before trusting the kriging predictions in Part 5.
Compatibility with industry-standard software. GSLIB's gamv, R's gstat::variogram(), and SGeMS's variogram routines all compute both estimators natively. Comparing the two is the field-standard exploratory step, and the reviewers / collaborators downstream will expect to see it.

Genton enters the workflow as the THIRD option when the data is suspected to be heavily contaminated — historical drillhole assays from poorly-curated archives, multi-source environmental datasets with inconsistent QA, or any setting where the operator does not have control over the data-collection process. In those settings, computing Genton alongside Matheron and Cressie-Hawkins gives a three-way comparison that resolves more ambiguity than the two-way comparison alone. The cost is computational time and (more importantly) a more complex story to explain to stakeholders.

Connection to robust univariate statistics in §1.4

§3.6 and §1.4 are the same philosophy, applied to different statistical objects. In §1.4, the OBJECT was the univariate distribution of $Z$ values — and the classical mean, standard deviation, and variance all had breakdown point 0 because their definitions involve summing values weighted by linear or quadratic functions of magnitude. The robust counterparts (median, MAD, trimmed mean) limit the influence of any single observation through bounded or ordinal transformations.

In §3.6, the OBJECT is the variogram γ(h) — and the classical Matheron estimator has breakdown point 0 for the same reason: it sums squared spatial-pair increments, weighted quadratically by their magnitudes. The robust counterparts (Cressie-Hawkins, Genton) limit the influence of any single outlier through bounded transformations of the increments. Cressie-Hawkins is the variogram-side analog of the trimmed mean — both involve a smooth damping transform that softens the contribution of extreme values without throwing them away. Genton $Q_n$ is the variogram-side analog of the MAD — both involve a robust scale estimator built from pairwise differences.

Object	Classical (bp 0)	Robust trimming-style	Robust order-statistic-style (bp 50%)
Univariate location (§1.4)	sample mean	10% trimmed mean	median
Univariate scale (§1.4)	standard deviation	Winsorized SD	MAD, Q_n on raw data
Variogram (§3.6)	Matheron 1962	Cressie-Hawkins 1980	Genton 1998 Q_n

The map is exact at the conceptual level. Practitioners who have internalised §1.4 already know §3.6 — they just need to translate the vocabulary: "the classical sample mean has breakdown point 0" becomes "the classical Matheron variogram estimator has breakdown point 0"; "the trimmed mean is a smooth robust analog" becomes "Cressie-Hawkins is a smooth robust analog"; "the median is the maximum-breakdown-point location estimator" becomes "Genton Q_n is the maximum-breakdown-point variogram estimator".

When NOT to use robust variograms

The robust estimators are not free lunches. There are situations where the classical Matheron is the right choice and the robust alternatives mislead.

Truly clean Gaussian data. If you have laboratory-grade data with no contamination — well-characterised analytical errors, no instrument glitches, no transcription mistakes — Matheron is the minimum-variance estimator and you should use it. The robust estimators will give you systematically larger error bars on $\hat\gamma$ for no benefit.
Very small datasets ( $n < 50$ ). All three estimators are noisy at this sample size. The bias-vs-variance trade-off shifts: with few samples, the efficiency loss of the robust estimators may outweigh their robustness gain. In this regime the practical advice is to use Matheron and fit the variogram model with a heavy regularisation prior, rather than relying on $\hat\gamma$ alone.
Short-range bursts that are real signal. A common pathology of robust variograms is to interpret legitimate short-range, high-amplitude spatial bursts (turbulent gas plumes, fracture-flow corridors, micro-fault offsets) as "outliers" and damp them. The result is a robust $\hat\gamma$ that under-estimates short-range structure — exactly the regime where Part 4 kriging models depend most heavily on accurate variogram values. Diagnostic: if the robust estimator agrees with the classical one at long lags but is systematically LOWER at short lags, suspect this pathology. The fix is to investigate the high-amplitude short-range pairs case-by-case rather than letting the estimator decide.
Multimodal data with two equally-real populations. If your dataset is genuinely a mixture of two stationarity domains (e.g., shale matrix + fracture network, or background contamination + plume core), neither classical nor robust variograms are the right tool — you need to either (i) declare two separate domains and compute $\hat\gamma$ on each, or (ii) use indicator variograms from §3.5 to capture the cross-domain connectivity structure. The robust estimator on the joint dataset will damp one population in favour of the other and give a misleading single curve.

The honest summary is that robust variograms are the right default for routine field data — not a magic bullet, not always required, but cheap to compute and rarely actively harmful. The single decision rule that captures most of the practical wisdom is: compute classical + Cressie-Hawkins as a matter of routine; investigate when they disagree; use the robust one for downstream fitting only when the diagnostic shows contamination.

Try it

In the robust-variogram-comparison widget, set both sliders to 0 (no contamination). Look at the three "MAE vs truth" readout pills. Which is smallest? Which is largest? Why? (Hint: at zero contamination, Matheron has the highest efficiency at Gaussian, so its MAE should be smallest. Genton has the lowest efficiency, so its MAE should be largest. The ratio should be roughly 1:1.1:1.3 — efficient → less efficient.)
Same widget. Set outliers = 3, magnitude = 4σ. Look at the three curves. The Matheron curve should sit visibly above truth at most lags; Cressie-Hawkins should sit close to truth; Genton should hug truth even more tightly. Read off the MAE values from the pills. The Matheron MAE is now SEVERAL TIMES the Cressie-Hawkins MAE. This is the regime where the diagnostic VALUE of the comparison really shows.
Same widget. Push outliers to 10 at 6σ. The Matheron curve is now wildly above truth across the full lag axis. Cressie-Hawkins is also bent upward — its breakdown threshold is being exceeded. Genton stays low. What does the relative bias of CH-vs-Genton tell you here? (Hint: at 10/150 = 6.7% contamination CH should still be in its comfort zone, but with σ = 6 the magnitude is extreme enough that the fourth-root damping is showing its limits. Genton's order-statistic approach is the more robust of the two.)
Same widget. Hold outliers = 5 at σ = 5, then click "New realisation" 4-5 times. The three curves jitter from realisation to realisation. Which estimator's curve is MOST stable across draws? (Hint: it should still be Matheron in terms of LOW-MAGNITUDE variability — Matheron has the highest Gaussian efficiency. But it is also CONSISTENTLY BIASED upward, which is a different problem. Robust estimators trade higher variance for lower bias.)
In the outlier-injection-progressive widget, set outliers = 1. The Matheron curve still rises across the σ axis — even one outlier is enough to inflate it. Slide outliers to 10. The Matheron curve now rises much more steeply. How does the Matheron slope scale with outlier count? (Hint: the bias added is roughly $n_{out} \cdot \sigma^2 / N$ , so doubling outlier count roughly doubles the slope.)
Same outlier-injection widget. With 5 outliers, find the σ value at which the Cressie-Hawkins curve crosses 2× truth (roughly 2.0 at this lag). Now do the same for Genton — find the σ at which it crosses 2× truth. The Cressie-Hawkins critical σ is much lower than Genton's. This is the breakdown trade-off, made quantitative.
Without coding: a colleague hands you experimental variograms from a porosity dataset. Their Matheron estimator gives $\hat\gamma$ values ~ 50% higher than their Cressie-Hawkins estimator at most lags. What is the most likely diagnosis? What should they investigate? (Hint: the disagreement IS the diagnostic. 50% gap between classical and robust strongly suggests contamination. Investigate the high-magnitude sample pairs — measurement errors, transcription mistakes, or rare-but-real geological features that warrant separate stationarity treatment.)
Without coding: you have a dataset of 30 samples — small. The Matheron $\hat\gamma$ is very noisy (large error bars on every bin). Your colleague proposes switching to Cressie-Hawkins on the grounds that "robust is always better". Is this advice sound? (Hint: NO. With only 30 samples, the small-sample efficiency loss of robust estimators DOMINATES the robustness gain. Sticking with Matheron and using Part 4's weighted-least-squares fitting with a strong prior on the variogram model is the better strategy. Robust estimators help most when $n$ is large enough that the efficiency loss is affordable.)

Pause and reflect: the classical Matheron estimator is the variogram-side analog of the sample mean. Its breakdown point is zero. The single most useful practical habit you can adopt from §3.6 is to compute Matheron AND Cressie-Hawkins on every new variogram you make, side-by-side, always — exactly as you would automatically compute both the sample mean and the sample median in §1.4. The comparison costs nothing and tells you something important: how much your variogram is being moved around by a handful of outlying pairs you may or may not know about. Which datasets in your domain (mining, petroleum, environmental, hydrology) are most likely to need this comparison, and why?

What you now know — and the close of Part 3

You understand WHY the classical Matheron estimator is non-robust: it sums squared increments, and squared increments are quadratic in the magnitude of any contaminated value. A single 4σ outlier in a 150-sample dataset can shift $\hat\gamma$ by 10-20% at every lag bin the outlier participates in; a 10σ outlier can shift it by 60-70%. The breakdown point is 0. You know the §1.4 connection: the same pathology that hits the sample mean and standard deviation in univariate statistics hits the Matheron variogram in the spatial domain, for the same reason, and is fixed with the same family of techniques.

You can write the Cressie-Hawkins 1980 estimator — fourth root inside, fourth power outside, with the $(0.457 + 0.494/N)$ small-sample bias correction — and explain why the fourth-root transform damps outliers without throwing them away. You can write the Genton 1998 $Q_n$ estimator as a Rousseeuw-Croux pairwise-difference order statistic with the $2.2191$ Gaussian-consistency multiplier. You know the efficiency-robustness trade-off and can read it off the practical table: 100/0, 85/15, 70/50 for the efficiency/breakdown of Matheron / Cressie-Hawkins / Genton respectively.

You can run the field-standard comparison workflow: compute Matheron and Cressie-Hawkins side-by-side on every new variogram; if they agree to within ~10% at most lags the data is effectively clean and Matheron is the better choice for downstream fitting; if they disagree substantially the data has outliers and the disagreement IS the diagnostic. You know when NOT to use the robust estimators — truly clean Gaussian data, small $n < 50$ , short-range bursts that are real signal, mixtures of two equally-real populations. And you know how Genton enters as the third option when contamination is suspected to be heavy.

This closes Part 3. Six sections in, you have a complete experimental-variogram toolkit. §3.1 defined γ(h) and identified covariance / correlogram / variogram as three views of the same object. §3.2 gave you the Matheron empirical estimator with lag binning, the rule-of-30, and the half-domain rule. §3.3 added directional variography. §3.4 added the anisotropy ellipse and the geometric / zonal distinction. §3.5 added the indicator-transform extension for categorical and quantile-thresholded analyses. §3.6 hardened the squared-increment estimator against outliers. You can take a real dataset — irregularly sampled, possibly preferentially clustered, possibly anisotropic, possibly contaminated — and produce a defensible experimental description $\hat\gamma(\mathbf{h}; z_c)$ ready for Part 4 model fitting and Part 5 kriging.

Part 4 picks up here. Where Part 3 PRODUCED an experimental $\hat\gamma$ , Part 4 FITS a permissible model $\gamma(\mathbf{h}; \boldsymbol\theta)$ — a continuous function from a family (Spherical, Exponential, Gaussian, Matérn, nested combinations) that is conditionally negative definite, so the kriging matrices in Part 5 are guaranteed positive definite. The fitting routine (typically weighted least squares) automatically down-weights low- $N$ bins; the choice of variogram family is informed by the empirical shape; and the fitted model is what Part 5 actually uses for predictions. The robust $\hat\gamma$ you computed in §3.6 feeds directly into that fitting step, where it produces a robust variogram model and ultimately robust kriging predictions.

References

Cressie, N., Hawkins, D.M. (1980). Robust estimation of the variogram: I. Journal of the International Association for Mathematical Geology, 12(2), 115–125. (The original paper introducing the fourth-power-of-mean-of-fourth-roots estimator that bears their names. Derives the small-sample bias correction $(0.457 + 0.494/N(h))$ from the Gaussian-increment assumption and demonstrates the estimator's robustness on synthetic and real datasets.)
Cressie, N. (1993). Statistics for Spatial Data (revised ed.). Wiley. (§2.4 covers robust variogram estimation in the context of the broader mathematical-statistics treatment of $\hat\gamma$ . Contains the canonical derivation of the Cressie-Hawkins estimator's bias and variance, and the comparison to the classical Matheron form.)
Genton, M.G. (1998). Highly robust variogram estimation. Mathematical Geology, 30(2), 213–221. (The original paper introducing the $Q_n$ -based variogram estimator. Demonstrates its 50% breakdown point and ~70% Gaussian efficiency, and shows it dominates Cressie-Hawkins under heavy contamination in simulation studies.)
Rousseeuw, P.J., Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424), 1273–1283. (The foundational reference for the $Q_n$ scale estimator that Genton 1998 ports into the variogram domain. Includes the $O(N \log N)$ algorithm used in production code, the $2.2191$ Gaussian-consistency constant, and the comparison to other robust scale estimators.)
Hawkins, D.M., Cressie, N. (1984). Robust kriging — a proposal. Mathematical Geology, 16(1), 3–18. (The natural sequel to Cressie & Hawkins 1980: how the robust variogram feeds into a robust kriging predictor. Argues that downstream prediction inherits the upstream robustness of $\hat\gamma_{CH}$ , and develops the formal framework.)
Maronna, R.A., Martin, R.D., Yohai, V.J. (2006). Robust Statistics: Theory and Methods. Wiley. (The modern reference for robust statistics including the breakdown-point and efficiency machinery used throughout §3.6. Chapters 2-4 cover the location-scale robust estimators (median, MAD, M-estimators, trimmed mean) that §1.4 introduces; chapter 5 discusses Hampel's theorem on the fundamental efficiency-robustness trade-off.)
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§4.3 covers experimental variograms including the Cressie-Hawkins alternative. The practitioner-level treatment of how to compute and compare robust and classical estimators in a realistic mining / reservoir workflow.)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapter 7 introduces the empirical variogram including a discussion of outlier sensitivity. The most readable practitioner-level introduction to the issues §3.6 addresses formally.)