Spatial sampling and its biases

Spatial data fundamentals

Learning objectives

Distinguish RANDOM (design-based) from CLUSTERED / PREFERENTIAL spatial sampling
Recognise the BIAS introduced when samples cluster in regions of high (or low) values
Apply CELL DECLUSTERING to correct the raw mean / histogram
Diagnose preferential sampling from sample density alone
Match sampling design to estimation objective

Statistical inference assumes that samples are REPRESENTATIVE of the population. In spatial settings this assumption is frequently violated: wells are drilled near other producing wells (the OIL FIELD EFFECT), mineral exploration concentrates in high-grade zones, and environmental monitoring focuses on suspected hotspots. The resulting samples are CLUSTERED in space, so they over-represent one part of the value range, and raw estimates of population statistics can be strongly biased.

Three sampling regimes

Random (design-based): locations chosen WITHOUT regard to expected values. Raw arithmetic mean is unbiased; standard inference applies.
Clustered: locations concentrate in regions of (presumed) interest. Common in mining (exploration on high-grade zones), oil & gas (delineation around discoveries), environmental (around suspected pollution). RAW MEAN over-represents the cluster.
Preferential: locations chosen ALONG transects, roads, or boreholes for logistical reasons. Reservoir well placements, seismic 2D lines, ice cores. RAW MEAN biased by the line geometry.
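To make the three regimes concrete, here is a minimal sketch that generates sample locations under each of them. Everything specific in it is an assumption for illustration: the function name `sample_locations`, the square domain of side 50, the central circular hot-spot of radius 8, the 70% hot-spot share, and the three horizontal transects at one quarter, one half and three quarters of the domain height.

```python
import numpy as np

def sample_locations(scheme, n=80, side=50.0, hot_radius=8.0, hot_frac=0.7, seed=0):
    """Return an (n, 2) array of sample coordinates for one sampling regime."""
    rng = np.random.default_rng(seed)
    centre = np.array([side / 2, side / 2])
    if scheme == "random":
        # Design-based: uniform over the domain, blind to expected values.
        return rng.uniform(0, side, size=(n, 2))
    if scheme == "clustered":
        n_hot = int(round(hot_frac * n))
        # Uniform points in a disc: radius R*sqrt(u), angle uniform.
        r = hot_radius * np.sqrt(rng.uniform(size=n_hot))
        theta = rng.uniform(0, 2 * np.pi, size=n_hot)
        hot = centre + np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        background = rng.uniform(0, side, size=(n - n_hot, 2))
        return np.vstack([hot, background])
    if scheme == "transects":
        # Samples restricted to three horizontal lines.
        ys = rng.choice([0.25 * side, 0.5 * side, 0.75 * side], size=n)
        xs = rng.uniform(0, side, size=n)
        return np.column_stack([xs, ys])
    raise ValueError(f"unknown scheme: {scheme!r}")

pts = sample_locations("clustered")
print(pts.shape)
```

Note that in the clustered case the background points are drawn over the whole domain, so a few of them may also land inside the hot-spot.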

How bad can the bias be?

Example: a deposit has true mean grade 1.5 g/t. Exploration drills concentrate 70% of holes in a 1 km² hot-spot (mean 4 g/t), 30% in the background (mean 0.5 g/t). Raw mean = 0.7×4 + 0.3×0.5 = 2.95 g/t — nearly DOUBLE the truth.
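The same numbers also show how far the drilling departs from the geometry. As a sketch, assume the deposit consists of exactly these two zones, that 4 g/t and 0.5 g/t are their true mean grades, and write $f$ (a symbol introduced here) for the fraction of the deposit area occupied by the hot-spot. The true mean is the area-weighted average, so

$$
1.5 = 4f + 0.5\,(1 - f) \quad\Rightarrow\quad f = \frac{1.5 - 0.5}{4 - 0.5} = \frac{2}{7} \approx 0.29
$$

Under these assumptions the hot-spot covers about 29% of the area but receives 70% of the holes. The raw mean weights the zones by hole count rather than by area, which gives a relative error of

$$
\frac{2.95 - 1.5}{1.5} \approx 0.97
$$

that is, an overestimate of roughly 97%.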

This is not a small effect. Mining-prospect valuations, reservoir OOIP (original oil in place) estimates and environmental site assessments can all be off by 50% or more when raw means are computed from clustered data.

The CELL-DECLUSTERING fix

Divide the domain into a grid of cells. Each sample gets weight $1/n_c$ where $n_c$ is the number of samples in its cell. Normalise so weights sum to 1. The weighted mean:

$$\hat{m}_{\text{decluster}} = \sum_i w_i z_i$$

removes the over-representation of clustered cells. Larger cells average more samples (smoother); smaller cells track local structure more. The CELL SIZE is a tunable parameter — common choice: ~½ the variogram range.
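The weighting rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the `origin` parameter (the lower-left corner of the grid) are assumptions, and production codes typically also average over several grid origins.

```python
from collections import Counter
import numpy as np

def cell_decluster_weights(x, y, cell_size, origin=(0.0, 0.0)):
    """Weights 1/n_c per sample (n_c = samples in its cell), normalised to sum to 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Integer column/row index of the grid cell containing each sample.
    ix = np.floor((x - origin[0]) / cell_size).astype(int)
    iy = np.floor((y - origin[1]) / cell_size).astype(int)
    keys = list(zip(ix.tolist(), iy.tolist()))
    counts = Counter(keys)  # n_c for every occupied cell
    w = np.array([1.0 / counts[k] for k in keys])
    return w / w.sum()

def declustered_mean(x, y, z, cell_size):
    """Weighted mean sum_i w_i z_i using cell-declustering weights."""
    w = cell_decluster_weights(x, y, cell_size)
    return float(np.sum(w * np.asarray(z, dtype=float)))

# Three clustered high-grade samples share one cell; one background sample is alone.
x = [0.5, 0.6, 0.4, 5.5]
y = [0.5, 0.4, 0.6, 5.5]
z = [4.0, 4.0, 4.0, 0.5]
print(f"raw {np.mean(z):.3f}  declustered {declustered_mean(x, y, z, cell_size=1.0):.3f}")
# raw 3.125  declustered 2.250
```

In the toy data the three clustered samples each end up with weight 1/6 and the isolated sample with weight 1/2, so the two occupied cells contribute equally to the mean.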

Polygonal declustering uses Voronoi tessellation as an alternative — each sample weighted by its Voronoi cell area. More principled in irregular geometries but heavier computationally.
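One way to get a feel for polygonal weights without building the tessellation explicitly is to approximate each Voronoi cell area by counting the nodes of a fine grid that lie closest to each sample. The sketch below swaps in that discrete approximation for an exact Voronoi computation; the function name, the rectangular `bounds` clipping the domain, and the grid resolution are all assumptions for illustration.

```python
import numpy as np

def polygonal_weights_approx(x, y, bounds, nodes_per_side=100):
    """Approximate Voronoi-area weights by nearest-sample counting on a fine grid.

    bounds = (xmin, xmax, ymin, ymax) clips the otherwise unbounded outer cells.
    """
    xmin, xmax, ymin, ymax = bounds
    # Cell-centred grid nodes covering the domain.
    gx = xmin + (np.arange(nodes_per_side) + 0.5) * (xmax - xmin) / nodes_per_side
    gy = ymin + (np.arange(nodes_per_side) + 0.5) * (ymax - ymin) / nodes_per_side
    nx, ny = np.meshgrid(gx, gy)
    nodes = np.column_stack([nx.ravel(), ny.ravel()])
    pts = np.column_stack([np.asarray(x, dtype=float), np.asarray(y, dtype=float)])
    # Squared distance from every node to every sample; nearest sample per node.
    d2 = ((nodes[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # Node counts are proportional to the clipped Voronoi cell areas.
    area_share = np.bincount(nearest, minlength=len(pts)).astype(float)
    return area_share / area_share.sum()

w = polygonal_weights_approx([1.0, 2.0, 8.0], [5.0, 5.0, 5.0], bounds=(0, 10, 0, 10))
print(np.round(w, 2))  # approximately [0.15, 0.35, 0.50]
```

The two close samples on the left split the left half of the domain between them, while the isolated sample on the right keeps the other half. Memory use grows with (number of nodes) × (number of samples), so this brute-force version only suits small examples.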

When NOT to decluster

If samples are truly random (random-design survey): declustering INCREASES variance for no bias gain. Don't do it.
If the sampling reflects the population (e.g., environmental monitoring where the hotspots ARE the population of interest): the raw mean of clustered data IS the relevant estimate.

The decision: is the bias REAL (sampling is unrepresentative) or APPARENT (sampling represents the underlying distribution)? Decluster only in the former case.
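The variance cost of declustering a truly random sample can be quantified under a simplifying assumption. If the sample values were independent with a common variance $\sigma^2$ (spatial data are usually correlated, so treat this as a rough guide only), the variance of a weighted mean with weights summing to 1 is $\sigma^2 \sum_i w_i^2$, which is smallest when all weights are equal. The reciprocal $1/\sum_i w_i^2$ is the Kish effective sample size. The sketch below computes it; the function name and the example weights are illustrative.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size 1 / sum(w_i^2), after normalising weights to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

equal = np.full(80, 1.0)                              # raw mean: every sample counts equally
uneven = np.r_[np.full(40, 1.0), np.full(40, 0.5)]    # half the samples down-weighted

print(f"{effective_sample_size(equal):.1f}")   # 80.0
print(f"{effective_sample_size(uneven):.1f}")  # 72.0
```

Any departure from equal weights lowers the effective sample size, so unequal weights are worth their cost only when they remove a real bias.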

Try it

RANDOM scheme. Raw mean is close to the true mean (≈ within ±0.2 for n = 80). Declustered mean is similar but with slightly higher variance — declustering is unnecessary here.
CLUSTERED scheme. 70% of samples lie inside the central hot-spot. Raw mean is biased HIGH by ~0.5 to 1.0. Declustered mean (cell size ≈ 8) corrects most of the bias; the bias-reduction percentage is shown.
PREFERENTIAL (transects). Samples lie along 3 horizontal lines. Raw mean is biased by whatever values lie under those lines. Declustering helps modestly — the underlying issue is that transects don't sample 2D space uniformly.
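Vary the decluster cell size for the clustered case. At very large cells (16), declustering smooths too much and may over-correct (drift toward background). At very small cells (2), nearly every sample is alone in its cell, so all weights are equal and the declustered mean collapses back to the raw mean; the same happens in the opposite limit, when a single cell covers the whole domain.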
For each scheme, re-seed several times. Note: raw bias for the clustered/preferential cases is CONSISTENT across seeds (it's structural, not random). Random scheme has random sampling error but no consistent bias.
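The clustered experiment and the cell-size sweep can also be reproduced offline. The following is a rough sketch under stated assumptions, not the code behind the interactive panel: the synthetic surface (background 0.5 plus a Gaussian bump), the domain side of 50, the hot-spot radius of 8 and all function names are hypothetical.

```python
from collections import Counter
import numpy as np

SIDE, HOT_R = 50.0, 8.0  # hypothetical domain side and hot-spot radius
CENTRE = np.array([SIDE / 2, SIDE / 2])

def field(xy):
    """Synthetic grade surface: background 0.5 plus a Gaussian bump at the centre."""
    d2 = ((xy - CENTRE) ** 2).sum(axis=1)
    return 0.5 + 3.5 * np.exp(-d2 / (2 * 5.0 ** 2))

def clustered_sample(n, rng, hot_frac=0.7):
    """hot_frac of the points uniform in the hot-spot disc, the rest uniform in the domain."""
    n_hot = int(round(hot_frac * n))
    r = HOT_R * np.sqrt(rng.uniform(size=n_hot))
    t = rng.uniform(0, 2 * np.pi, size=n_hot)
    hot = CENTRE + np.column_stack([r * np.cos(t), r * np.sin(t)])
    return np.vstack([hot, rng.uniform(0, SIDE, size=(n - n_hot, 2))])

def declustered_mean(xy, z, cell_size):
    """Cell-declustered mean with weights 1/n_c, normalised to sum to 1."""
    keys = [tuple(k) for k in np.floor(xy / cell_size).astype(int).tolist()]
    counts = Counter(keys)
    w = np.array([1.0 / counts[k] for k in keys])
    return float(np.sum(w / w.sum() * z))

# Reference mean from a dense regular grid over the whole domain.
g = (np.arange(500) + 0.5) * SIDE / 500
gx, gy = np.meshgrid(g, g)
true_mean = field(np.column_stack([gx.ravel(), gy.ravel()])).mean()

rng = np.random.default_rng(0)
xy = clustered_sample(80, rng)
z = field(xy)
raw_mean = z.mean()

print(f"true {true_mean:.2f}  raw {raw_mean:.2f}")
for cell in (1e-6, 2, 4, 8, 16, 100):
    print(f"cell {cell:>6}: declustered {declustered_mean(xy, z, cell):.2f}")
```

Under these assumptions the two extreme cell sizes (1e-6 and 100) both return the raw mean, because every sample then carries the same weight, while intermediate sizes pull the estimate toward the reference mean. Changing the seed passed to `default_rng` mimics re-seeding.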

A mining exploration team has 50 cores: 35 from a known high-grade zone, 15 from background. Without declustering, what is the systematic error in the resource estimate? Suggest the cell size you would choose — and explain how you'd pick it from the variogram.

What you now know

Spatial sampling is rarely truly random. Clustered and preferential sampling produce biased raw means; declustering corrects them. Cell declustering is the simplest practical method; polygonal declustering is the principled alternative. The CELL SIZE is the key tuning parameter. Match the declustering to the underlying spatial scale: ~½ the variogram range is a sensible default. Part 2 of the textbook (Declustering) develops this in full. §0.6 closes Part 0 with exploratory data analysis tools for spatial data.
