Why your sample histogram is biased

Part 1, Univariate statistics for geo-data

Learning objectives

Name the three canonical preferential-sampling settings in geo-data, mining drillouts targeting visible mineralisation, petroleum wells drilled into structural highs, environmental stations placed near suspected contamination, and explain why each produces a sample biased toward high values
State precisely what 'biased' means in this context: the sample histogram Fₙ(z) is no longer a consistent estimator of the population CDF F(z); the unweighted sample mean is no longer a consistent estimator of the population mean
Explain why a 5-10% bias in the mean of a property like ore grade or porosity is a large effect in money, not a small effect in statistics, and propagates into every downstream geostatistical product
Write down the declustering weights as a normalised probability mass on the sample (Σ wᵢ = 1, wᵢ > 0), and recognise that the resulting weighted empirical CDF F_n^{decluster}(z) = Σ wᵢ 1{Zᵢ ≤ z} is the cure that Part 2 builds explicitly via cell declustering and polygonal declustering
Preview the two declustering recipes you will meet in Part 2 (cell, polygonal), and the honest scope warning: declustering corrects for spatial-sampling bias only, it cannot fix missing populations or fundamentally wrong sampling designs

§1.1 and §1.2 share a hidden assumption: when they speak of "the empirical CDF" $F_n(z) = (1/n) \sum_i \mathbf{1}{Z_i \le z}$ as a faithful summary of the data, and when they hand you the recipe $Y_i = \Phi^{-1}(F_n(Z_i))$ to rewrite the sample on standard-normal margins, both quietly assume that the sample is representative of the underlying population, that $F_n$ converges to $F$ as $n$ grows, that the sample mean converges to the population mean, that quantiles read off the sorted data are quantiles of the right distribution. For most domains where statistics is taught, that assumption holds because sampling is done with care: simple random samples in a survey, stratified samples in a clinical trial, exchangeable replicates in a lab. For geo-data, that assumption almost never holds.

This section is Part 1's pivot. The diagnostic toolkit you built in §1.1 and the transform you built in §1.2 still work, they are valid descriptions of whatever sample you hand them. What they describe, though, is not the field. It is your sample of the field, viewed through the particular lens of how those samples got collected. Make the assumption explicit. Then break it. Then preview the cure.

Why geo-data is preferentially sampled

Three canonical examples, drawn from the three domains where geostatistics gets the most use:

Mining exploration. Drillholes are expensive. Companies do not drill them on a uniform grid. They drill where the geology, the surface geochemistry, the magnetic-and-gravity anomaly maps, and earlier reconnaissance pits all converge on "this is where the ore is". By design the resulting sample is concentrated on the high-grade zones of the deposit, that is the point. The drill programme has succeeded if the sample yields a positive intersection with the orebody, and the orebody has not been "missed". But the resulting sample of grade measurements is heavily over-represented in the high tail of the global grade distribution.
Petroleum exploration. Wells are drilled into structural highs, into seismic-anomaly bright spots, into formations that earlier wells flagged as productive. Low-porosity / low-saturation regions of the same reservoir are typically not drilled, because they are not economic targets, even though they are part of the reservoir geology you need to characterise. The result: a porosity sample biased toward the productive zones, a saturation sample biased toward the parts of the reservoir where hydrocarbons accumulated. Net-pay maps drawn from these biased samples can over-estimate the productive volume by 10-20%.
Environmental monitoring. Soil-contamination sampling stations cluster near roads, near factories, near known historical sources of the contaminant. Background-level stations in pristine zones are under-sampled because nobody had a reason to put a sampler there. The resulting contamination distribution looks worse than the actual site-wide distribution; remediation budgets, environmental-impact assessments, and regulatory triggers all depend on this number.

In every case the sampling rule is non-random with respect to the value being measured. High values are over-sampled because high-value zones are where people put their samplers. The technical term is preferential sampling, and it is the default in any field where sample collection has a cost and the cost-benefit of one location vs. another is known before the sample is taken.

What "biased histogram" means, precisely

Let $F(z) = P(Z(\mathbf{s}) \le z)$ be the true marginal CDF of the regionalised variable $Z$ , the value the population would yield if you could measure $Z$ at every point $\mathbf{s}$ in the study area. Let $F_n(z) = (1/n) \sum_i \mathbf{1}{Z(\mathbf{s}_i) \le z}$ be the empirical CDF computed from the $n$ sample locations $\mathbf{s}_1, \ldots, \mathbf{s}_n$ .

For a sample drawn uniformly at random from the study area (or equivalently, for a sample placed on a regular space-filling grid in a stationary field), the Glivenko-Cantelli theorem guarantees $F_n \to F$ uniformly as $n \to \infty$ . The sample CDF converges to the population CDF. Sample means converge to population means. Sample quantiles converge to population quantiles. All the §1.1 machinery is justified.

For a preferentially-located sample, that guarantee evaporates. Each sample location $\mathbf{s}_i$ has been chosen with some inclusion probability $\pi(\mathbf{s}_i)$ that depends on $\mathbf{s}_i$ and (implicitly) on the kind of values $Z(\mathbf{s}_i)$ that part of the field tends to take. The unbiased estimator of the population CDF in that case is not $F_n$ but the Horvitz-Thompson weighted empirical CDF

F_n^{\text{HT}}(z) \;=\; \sum_{i=1}^{n} w_i \, \mathbf{1}\{Z_i \le z\}, \qquad w_i \;\propto\; \frac{1}{\pi(\mathbf{s}_i)}, \quad \sum_i w_i = 1.

Each sample carries a weight inversely proportional to its inclusion probability. Over-sampled locations get a smaller weight per sample because they each carry less "new information" about the field; under-sampled locations get a larger weight per sample to compensate for being rare. The geostatistical literature usually calls this concept declustering weights, and the bridging insight is that the equal-weight $1/n$ used in the unweighted $F_n$ is just the special case of declustering weights when the inclusion probability happens to be constant across the field, that is, when sampling is uniform.

The first widget below shows the failure of the unweighted estimator in a controlled simulation.

The widget holds the synthetic field fixed and lets you change only the sampling rule. Three buttons: Uniform places 100 samples on a jittered grid (the representative case); Clustered (preferential) draws 100 samples with probability proportional to $(z(\mathbf{s}) - z_{\min})^3$ , mimicking a mining drillout aimed at the high zones; Anti-clustered rings the high zone with 100 samples but stays out of it, mimicking the environmental case where stations cluster around a suspected source but the source itself is restricted-access. Compare the coloured sample histogram to the grey true-field histogram, and compare the sample mean (coloured vertical line) to the true mean (grey dashed line). For the clustered rule the sample mean drifts above the true mean by 20-30%; for the anti-clustered rule it drifts below. The redraw button gives you a fresh realisation, and the bias direction holds: it is structural, not noise.

Why a 5-10% biased mean is a large effect

A 5-10% bias in the global mean of a property sounds small, in the way a 5 mph wind sounds small. In the geostatistical workflow, that small effect chains into bigger ones at every later step:

Mean ore grade in reserves estimation. A mining feasibility study computes recoverable ore tonnage by integrating the local grade estimates over the orebody volume. A 5% bias in the mean grade translates directly to a 5% bias in the tonnage, and the dollar value depends on tonnage × grade × price. For a deposit with $1B in nominal in-ground value, a 5% over-estimate is$ 50M of "phantom ore", ore that does not exist, that the mine plan was built around, that will not be there when the shovel arrives. This is not a theoretical concern; it is the modal way mining projects go over budget in the first three years of production.
Kriging weights and kriged maps. Kriging (Part 5) chooses estimation weights to minimise $E[(Z^* - Z)^2]$ , the expected squared prediction error. Both the variogram model (Part 3, Part 4) and the local-search-window selection (Part 5.6) inherit the biased view of the field. The kriged map is then a smoothed, slightly-up-shifted version of the field, with a kriging variance that is also miscalibrated. The decision metrics computed on the map (above-cutoff probability, expected resource within a block) all carry the bias.
N-score transforms. §1.2 already flagged this: the empirical N-score recipe $Y = \Phi^{-1}(F_n(Z))$ is faithful only if $F_n$ is a good estimate of $F$ . For a clustered sample, the rank of the smallest values is too high (because the low tail is under-sampled), and the rank of the largest values is too low (because the high tail is over-sampled). The resulting N-scores systematically rank the sample wrong, the simulator built on top in Part 7 then draws conditioned realisations that propagate the same wrong ranking back into the simulated field.
Variograms. The experimental variogram (Part 3) averages squared differences $(Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2$ over sample pairs at each lag $h$ . Clustered samples produce many pairs within the high zone (where values are similar) and few pairs across the field (where values differ more). The result is a variogram with an artificially low sill, a too-short apparent range, and a possible spurious nugget. Robust variogram estimators (§3.6) help; declustered pair counts (an active research area) help more.

Each of these is a downstream amplification of a bias that, on its own, looked manageable. The cure is to act early, fix the sample, not the downstream products.

Declustering as a tunable correction

The fix is to replace the equal-weight $1/n$ in $F_n$ with a non-uniform set of weights $w_i$ that downweight clustered samples and upweight isolated ones. The weighted empirical CDF

F_n^{\text{decluster}}(z) \;=\; \sum_{i=1}^{n} w_i \, \mathbf{1}\{Z_i \le z\}, \qquad \sum_i w_i = 1, \quad w_i > 0,

is a closer estimator of the population CDF when the weights are chosen well. Two recipes for choosing them are taken up in Part 2.

Cell declustering (§2.1, after Deutsch 1989's GSLIB DECLUS program). Superimpose a regular grid of cells on the study area. Each occupied cell gets total weight $1/n_{\text{occ}}$ (where $n_{\text{occ}}$ is the count of cells containing at least one sample), and that weight is divided equally among the samples in the cell: $w_i = 1/(n_{\text{occ}} \cdot n_{\text{cell}, i})$ . Trivial to implement. One free parameter, the cell size, that the analyst must tune.
Polygonal declustering (§2.2). Tessellate the study area into Voronoi polygons around each sample; the weight of sample $i$ is proportional to the area of its Voronoi cell. Less arbitrary than cell declustering, no free parameter, but works best when sample locations are not on the convex-hull boundary of the study area (otherwise the boundary polygons are unbounded and need padding).

The second widget below previews cell declustering in action. It samples a clustered draw of 100 points from the same synthetic field, then applies the cell-declustering recipe with an adjustable cell size. Watch what happens to the corrected histogram as you sweep the cell size.

Three observations from sweeping the cell-size slider. (1) At very small cell sizes, small enough that almost every cell holds at most one sample, the per-sample weight collapses to $1/n_{\text{occ}} \approx 1/n$ , the same as the unweighted estimator. The declustered histogram collapses onto the biased histogram. (2) At very large cell sizes, large enough that almost every sample sits in the same cell, the weights are again roughly uniform, and the declustered histogram is once again the biased histogram. (3) In between is a sweet spot where the cell is large enough to bracket multiple clustered samples (each of which then shares the cell's weight, so each one is down-weighted) but small enough that isolated samples sit alone in their own cell (and so get the full cell weight). At the sweet spot, the declustered mean tracks the true mean within 1-2%, while the biased mean is off by 20-30%. This U-shape, too small does nothing, too large does nothing, in the middle works, is fundamental to cell declustering, and §2.1 describes the heuristic for choosing the cell size in practice (broadly: match the typical spacing of the clustered samples).

What declustering can and cannot fix

Declustering corrects for spatial-sampling bias: the bias that comes from samples being preferentially located in space. It does this well, for a clustered sample, declustering can routinely cut the mean bias by 5-10×, and it costs almost nothing computationally. But it is not a panacea, and the literature on declustering is honest about three limitations:

It cannot fix missing populations. If all your samples were taken in the oxidised zone of an orebody and none in the underlying reduced zone, no amount of declustering recovers the reduced-zone grade distribution, those samples simply do not exist. Geological context, not numerical weights, is the cure: if the sampling design has missed a fundamentally different population entirely, you need additional samples or you need to characterise the missing population separately. The same logic applies to seismic-amplitude bright-spot bias (bright spots over-sampled, "dim" reservoirs under-sampled) and to environmental campaigns that never sampled a particular land-use class.
Declustering is approximate, not exact. The Horvitz-Thompson $w_i \propto 1/\pi(\mathbf{s}_i)$ is the mathematically right thing to compute, but you do not know $\pi$ , that is the whole problem. Cell declustering approximates $\pi$ by local sample density; polygonal declustering approximates it by Voronoi-cell area. Both are heuristics, both depend on parameters or assumptions, both produce reasonable corrections in practice but not exact ones. You should always validate the declustered estimate against secondary information (well logs, geological boundaries, declustering-method cross-checks). §2.3 is a side-by-side comparison of the methods on the same dataset.
The biased histogram is still useful for pattern recognition. Even when $F_n$ is a biased estimator of $F$ , it remains a faithful description of the patterns in the sample: outliers, modality, dynamic range, the presence of a heavy tail, evidence of two populations. §1.1's entire toolkit, histograms, Q-Q plots, sorted-values plots, is still valid for diagnosing what is happening in the sample you have. The bias arises only when you take a sample statistic (especially the mean and the quantiles) and read it as a population statistic. Use the diagnostic tools to inspect your data. Use declustering to estimate population summaries.

One subtle warning. The sampling bias is rarely binary, "clustered or not, on or off". It is a continuum: every real geo-dataset is somewhat preferentially located, somewhere on the spectrum from "almost uniform" (regional gas-monitoring stations placed by regulation) to "extreme" (a brownfield where every sample is within 5 m of the suspected leak). The declustering correction is also a continuum: it can reduce a 25% bias to 1%, or a 5% bias to 0.3%, but it does not flip a sample from "biased" to "unbiased" with a single threshold. The right mental model is "decluster, then check how much the corrected statistics moved against the uncorrected ones". A small move suggests sampling was reasonably uniform; a large move says the sampling was strongly biased and the corrected estimates should be trusted over the raw ones.

Try it

In the biased-sampling simulator, start on Uniform. Note the sample mean, it should be very close to the true mean (within a fraction of a percent). Click Redraw realisation a few times; the sample mean drifts a little around the true mean but stays close. Now switch to Clustered (preferential). How much does the sample mean overshoot the true mean? Redraw a few times, does the direction of the bias change, or only the magnitude?
Switch to Anti-clustered. Predict the direction of the bias before looking at the readout, should it be positive or negative? Confirm. Why does this case sometimes UNDER-shoot the true mean by more than you would expect? (Hint: think about how much of the field, by area, the anti-clustered ring covers, and which value range that ring sits in.)
In the declustering preview widget, set the cell size very small (say 0.05). Read the declustered mean. Now set it very large (say 0.50). Read the declustered mean again. In both extremes, the declustered mean should be close to the biased (un-declustered) mean. Why? Now find the cell size that brings the declustered mean closest to the true mean. What is special about that scale relative to the sample-cluster spacing?
Still in the declustering widget, click Redraw clustered realisation a few times with the cell-size slider at the same setting. The biased mean and the declustered mean both jump around (each new realisation samples a different 100 points), but does the declustered mean stay closer to the true mean on average? If you ran this 1000 times, what would the distribution of (declustered − true) and (biased − true) look like?
Without coding anything: a mining engineer has 200 drill samples from a polymetallic deposit. The drill programme targeted the visible mineralisation, so the samples are clustered in the high-grade zones. The engineer reports a "raw" mean grade of 4.2 g/t and a cell-declustered mean of 3.5 g/t. The deposit has 50 Mt of ore. At a price of $50 per gram-tonne, what is the difference in nominal in-ground value between the two estimates, and which estimate would a competent persons report (NI 43-101, JORC) use?

Pause and reflect: the biased histogram is described above as "still useful for pattern recognition, outliers, modality, dynamic range, even when its quantiles are wrong as population estimators". Is that always true? Can you construct a sampling design where the biased histogram is misleading about the PATTERNS too, not just the QUANTILES, for example, a sampling rule that makes a unimodal field look bimodal in the sample, or that introduces a heavy tail that isn't there in the population? Under what assumptions does declustering recover such patterns, and under what assumptions does it fail?

What you now know

You have the mechanism: in mining, petroleum, and environmental practice, samples are placed preferentially, high-value zones get over-sampled because the sampling cost-benefit was known before the sample was taken. You have the consequence: the unweighted empirical CDF $F_n$ is no longer a consistent estimator of the population CDF $F$ , and the unweighted sample mean is no longer a consistent estimator of the population mean. You have the financial framing: a 5-10% mean-grade bias translates to millions of dollars of phantom ore on a real mining project, and the same magnitude bias in a petroleum porosity propagates into a 5-10% net-pay error. You have the cure preview: a weighted empirical CDF $F_n^{\text{decluster}}$ with weights that downweight clustered samples and upweight isolated ones, computed via cell declustering (§2.1) or polygonal declustering (§2.2). And you have the honest scope: declustering corrects for spatial-sampling bias, it is approximate, it cannot fix missing populations or fundamentally wrong sampling designs, and the biased sample histogram itself is still useful for diagnosing patterns.

§1.4 extends the §1.1 toolkit to be ROBUST against the heavy tails and outliers that are typical of geo-data, declustering corrects the histogram shape; robust statistics protect the descriptive summaries from being thrown off by a handful of extreme values. §1.5 then moves from global summaries to LOCAL summaries, block-and-panel averages over neighbourhoods, which preview the kriging system in Part 5 and the change-of-support story in Part 5.5. Part 2 develops the two declustering recipes in full and shows their downstream effect on variograms and kriged maps. By the time you reach Part 5's kriging system, you will have learned to do all the preparation work, distribution inspection, transform, declustering, robust summaries, local statistics, that the kriging system depends on but does not enforce.

References

Journel, A.G. (1983). Nonparametric estimation of spatial distributions. Mathematical Geology, 15(3), 445-468. (The foundational paper that named and formalised the sample-histogram-bias problem for spatial data, with the original argument that the unweighted Fₙ should be replaced by a declustered weighted CDF.)
Deutsch, C.V. (1989). DECLUS: a Fortran 77 program for determining optimum spatial declustering weights. Computers & Geosciences, 15(3), 325-332. (The canonical cell-declustering implementation in the GSLIB suite. The cell-size selection heuristic in §2.1 is the one introduced here.)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapters 4 and 10, univariate description WITH declustering, and the "preferential sampling and its consequences" framing used throughout this section.)
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§4.1, "Declustering and Debiasing", is the modern textbook treatment of both cell and polygonal declustering with worked examples on the Walker Lake dataset.)
Pyrcz, M.J., Deutsch, C.V. (2014). Geostatistical Reservoir Modeling (2nd ed.). Oxford University Press. (Reservoir-engineering-focused treatment, including the financial framing of declustering bias used in this section.)