What makes data spatial?

Spatial data fundamentals

Learning objectives

State what distinguishes a spatial dataset from a classical (aspatial) one — that every observation is paired with a location
Recognise Tobler's First Law of Geography and explain why it justifies distance-weighted prediction
Explain why classical IID assumptions break for spatially autocorrelated data, and what that does to the effective sample size
Preview the three things spatial methods give you that classical statistics cannot: location-specific estimates, calibrated uncertainty, and multiple realisations
Form a realistic expectation of the road ahead: variograms (Part 3) come before kriging (Part 5), which comes before simulation (Part 7)

Most of the statistics you have met so far treats observations as a bag of numbers: a sample $x_1, x_2, \ldots, x_n$ drawn independently from some distribution. The order does not matter. The neighbours do not matter. The histogram captures everything. This section is about what happens when that picture is wrong — when each observation comes attached to a location, and the locations carry information about the values.

This is the founding shift of geostatistics. The rest of the textbook — variograms, kriging, sequential Gaussian simulation, indicator methods, multipoint statistics — is machinery for working with that fact rigorously. But before any of that lands, you need a felt sense of what "spatial data" actually means, and why the ordinary statistical tools cannot describe it on their own. The two widgets in this section make that gap visible.

The defining property: an observation has a location

A spatial observation is not a number. It is a pair $(\mathbf{s}, z)$ where $\mathbf{s} \in \mathbb{R}^d$ is a location (often $d = 2$ for a map, sometimes $d = 3$ for a 3-D volume, occasionally $d = 4$ when time is included) and $z$ is the measured value at that location.

We write this as $Z(\mathbf{s})$ , read as "the random variable that produces the measurement at location $\mathbf{s}$ ". $Z$ is a random function: one realisation gives you a value at every point $\mathbf{s}$ in your study area, and a different draw of the random function would give you a different map. Your data ${z(\mathbf{s}_1), z(\mathbf{s}_2), \ldots, z(\mathbf{s}_n)}$ is one realisation of $Z$ at $n$ specific sampling locations — usually wells, boreholes, gauges, pixels — and you want to say something about $Z(\mathbf{s}^$ at an unsampled location $\mathbf{s}^$ $s^{*}$ . That is the geostatistical question.

Classical statistics ignores the $\mathbf{s}$ entirely. The bag-of-numbers view treats your sample as exchangeable: shuffle the indices and nothing changes. For spatial data that shuffle is a fatal operation — the very thing you need to make a useful prediction at $\mathbf{s}^*$ is the location pattern you just destroyed.

The same values, two different stories

Here is the most efficient way to feel that destruction. The widget below carries the SAME 64 numeric values in both panels — drawn once from a normal distribution. The "Aspatial" panel scatters those values at random locations across the unit square. The "Spatial" panel sorts the values onto a smooth gradient so high values cluster in one corner and low values in the opposite corner. Toggle between the two and watch the bottom histogram.

The histogram is identical. So are the mean, the variance, and the naive standard error — by construction, since the values themselves are unchanged. A classical analyst handed only those summary statistics could not tell these two datasets apart. But click a query point in each panel: in the spatial layout the local average is a much better predictor at that location than the global mean; in the aspatial layout the local average and global mean are nearly the same. Spatial structure lives in the arrangement, not in the values. That is the single most important sentence in §0.1.

Tobler's First Law of Geography

The arrangement is not just incidental — it almost always carries the predictive signal. Waldo Tobler, in a 1970 paper modeling Detroit's urban growth, distilled the regularity into one sentence that the entire field would later call Tobler's First Law of Geography:

"Everything is related to everything else, but near things are more related than distant things." — Tobler (1970)

This is not a theorem; it is an empirical observation that holds across porosity in a reservoir, ore grade in a deposit, soil contamination in a brownfield, rainfall in a watershed, temperature on a grid, elevation on a topographic surface, prices in a city, and species abundance in an ecosystem. It is the regularity that makes geostatistics possible: if values at nearby locations were unrelated, no amount of clever interpolation would help you predict at an unsampled point. You would be stuck with the global mean forever.

Because Tobler's law is so universal, it suggests an obvious estimator. To predict the value at an unsampled location $\mathbf{s}^*$ , take a weighted average of the surrounding samples, where the weights decay with distance. The simplest version uses an exponential kernel:

\hat{Z}(\mathbf{s}^*) \;=\; \frac{\sum_{i=1}^{n} w_i \, z(\mathbf{s}_i)}{\sum_{i=1}^{n} w_i}, \qquad w_i = \exp\!\left(-\frac{\|\mathbf{s}_i - \mathbf{s}^*\|}{r}\right).

Here $r$ — the range — sets the distance scale over which the law operates. Short $r$ means only the very closest neighbours matter; long $r$ means even distant samples contribute. Choosing $r$ well is half the geostatistical game. The widget below makes the choice visible.

Drag the range slider and watch the weighted estimate move. At a very short range, the predictor latches onto one or two near samples — sensitive to noise, and the "effective number" of contributing samples drops near 1. At a very long range, the weights flatten out and the estimate collapses to the global mean — Tobler's law is effectively switched off. Between those extremes is a sweet spot that respects the data's actual correlation length. The whole machinery of variograms in Part 3 and kriging in Part 5 is the principled way to find that sweet spot from the data itself, rather than by guessing with a slider.

Why "spatial" demands different statistics

Classical confidence intervals assume your samples are independent. The standard error of the mean is $\mathrm{SE} = \sigma / \sqrt{n}$ — divide by the square root of $n$ because each independent observation contributes one full "unit of information". For spatially autocorrelated data that division is wrong, often by a lot.

If two samples are taken half a metre apart on a porosity log that has a 100-metre correlation range, those two samples carry essentially the same information — they are not two independent looks at the field. The effective sample size is closer to 1 than to 2. Across a 1000-sample dataset with strong spatial autocorrelation, the effective sample size can easily drop below 100. The 95% confidence intervals computed under the IID assumption are anywhere from twice to ten times too narrow. The downstream consequence is overconfidence: the analyst reports a tight range for the mean property of the reservoir, the next field season punches a well, and the well is well outside the reported interval. The mathematics did not lie — the assumption of independence did.

That overconfidence is one of the three things spatial methods exist to fix. The others are:

Location-specific estimates. Classical statistics produces one number — a global mean, a global variance. Geostatistics produces a map: a separate estimate $\hat{Z}(\mathbf{s})$ and a separate uncertainty at every location in the study area. Kriging, in Part 5, formalises this estimator and chooses its weights to minimise the prediction variance subject to an unbiasedness constraint.
Multiple equiprobable realisations. A kriged map is smooth — it averages out the small-scale variability that the data tell us must be there. For risk decisions (will this reservoir flow above the economic cutoff? does this contaminant plume reach the well?), one smooth map is the wrong question. You want a stack of realisations, each consistent with the data and the spatial statistics, and you want to compute the decision metric on every realisation in the stack. That is sequential Gaussian simulation, in Part 7.

Each of these capabilities rests on the same foundation: a quantitative description of how the value at one location depends on the value at another, as a function of the separation between them. That description is the variogram, and it is the object that Parts 3 and 4 of this textbook spend a great deal of time building.

An honest scope warning

If you came in expecting a quick fit-a-model-on-coordinates exercise, geostatistics will surprise you. The covariance-as-function-of-distance machinery — the variogram — is the actual workhorse, and the first reliable kriging or simulation result is four or five parts away. Specifically:

Parts 0–2 set up the data and clean it (this section, plus stationarity, support, declustering, EDA).
Parts 3–4 build the variogram — the empirical estimator and the permissible model families.
Part 5 uses the variogram inside the kriging system to produce estimates.
Parts 6–7 validate the estimates and convert kriging into simulation.

None of those steps can be skipped. Skipping ahead to "I will just krige my data" without understanding the variogram is the modal way kriging projects fail in practice — Part 5's section on kriging pathologies catalogues five or six varieties of how that goes wrong. The reward for the slow build is that the kriging and simulation results you produce at the end are calibrated: their uncertainty bands are correct, their confidence intervals cover at the nominal rate, and downstream decisions can rely on them.

Try it

In the spatial-vs-aspatial widget, set the local-average window radius small (say 0.10) and drop a query point near the top-right of the spatial layout. Compare the local average to the global mean. Now do the same in the aspatial layout — what changes? Now widen the radius to 0.30. What changes in each layout, and why?
In the Tobler-law widget, click somewhere near the edge of the unit square. Slide the range from 0.05 to 0.80 and watch the weighted estimate move. At what range does the estimate stop changing? What does that distance tell you about the data's natural correlation length?
Still in the Tobler-law widget, find a query point where the weighted estimate at small $r$ and the global mean at large $r$ are nearly the same. What does that say about the local neighbourhood of that point? (Hint: think about what the histogram looks like locally vs. globally there.)
Without coding anything: a reservoir engineer has 1000 porosity measurements taken every 1 m along a borehole, in a formation whose porosity has a vertical correlation length of about 5 m. Roughly, what is the effective sample size? If the analyst computes a 95% confidence interval for the mean porosity using the naive $\sigma / \sqrt{n}$ formula, how badly will they understate the width?

Pause and reflect: Tobler's law is an empirical regularity, not a theorem. What kinds of phenomena would you expect to violate it — situations where nearby values are not more similar than distant values? Are those phenomena common in your domain, or rare? And how would you know, just from looking at a dataset, whether the law actually applies?

What you now know

You have the working definition of a spatial dataset: an observation is a pair $(\mathbf{s}, z)$ , and the value is a sample from a random function $Z(\mathbf{s})$ . You have Tobler's First Law, the empirical regularity that nearby values are more related than distant ones — and the simplest predictor it suggests, an exponential-kernel weighted average with a range parameter you control. You have a felt understanding of why classical IID-based confidence intervals are wrong for spatially autocorrelated data, and a preview of the three things geostatistics gives you that classical methods cannot: location-specific estimates, calibrated uncertainty, and multiple realisations. Section §0.2 picks up from here and asks the next obvious question: under what assumptions about $Z$ can we actually estimate the spatial covariance from a single realisation of the field? That is the question of stationarity and ergodicity, and it is the bridge between the abstract random function and the empirical variogram you will compute in Part 3.

References

Tobler, W.R. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2), 234–240. (The original statement of the First Law of Geography.)
Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58(8), 1246–1266. (The foundational paper that defined the discipline and introduced the regionalised-variable formalism $Z(\mathbf{s})$ .)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapter 1, "Why are spatial statistics different?" — the canonical introductory treatment for the questions in this section.)
Cressie, N. (1993). Statistics for Spatial Data (revised ed.). Wiley. (Chapter 1; the formal mathematical-statistics reference for the $Z(\mathbf{s})$ framework.)
Chilès, J.-P., Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty (2nd ed.). Wiley. (Modern comprehensive reference; the introduction sets up exactly the road map previewed in this section.)