h-scatterplots and lag binning

Part 3 — Experimental variograms

Learning objectives

Read the h-scatterplot as the pair-level picture of γ(h): for ordered pairs at separation h, the cloud of $(Z(\mathbf{s}_i), Z(\mathbf{s}_j))$ hugs the 45° line when γ(h) is small (high correlation) and disperses to the full bivariate scatter when γ(h) is large (low correlation)
Write the Matheron 1962 empirical-variogram estimator $\hat{\gamma}(h) = (1/(2N(h))) \sum_{(i,j) \in N(h)} (Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2$ , understand each piece (squared increments, the pair set N(h), the 1/2 factor), and connect it to the §3.1 identity $\gamma(h) = (1/2)\,\mathrm{Var}(Z(\mathbf{s}+\mathbf{h}) - Z(\mathbf{s}))$
Bin irregularly-spaced data by lag distance: equal-width bins $[k\Delta, (k+1)\Delta]$ with a chosen bin width $\Delta$ , and (looking ahead to §3.3) a directional tolerance $\pm\theta$ for anisotropic variograms
Choose $\Delta$ : too narrow ⇒ each bin's $N(h)$ is small ⇒ $\hat{\gamma}$ is noisy; too wide ⇒ structural features (the knee at the range, the nugget origin) are smoothed away. Default rule of thumb: $\Delta \approx$ minimum sample spacing
Apply the two practical rules of thumb that protect against the modal failure modes of experimental variography: (i) trust $\hat{\gamma}(h)$ only for bins with $N(h) \ge 30$ , since the estimator's variance scales like $1/N(h)$ ; (ii) cap $h_{\max}$ at roughly half the domain size so edge effects do not dominate the long-lag tail
Carry the Part 2 declustering weights through to the variogram with the weighted estimator $\hat{\gamma}_w(h) = (\sum w_i w_j (Z_j - Z_i)^2) / (2 \sum w_i w_j)$ for pairs at lag $h$ . Without this step, short-lag bins over-represent dense regions and γ̂ is biased exactly where the variogram model needs the data most
Diagnose common artefacts: bumpy long-lag tail = small $N(h)$ (mask, do not model), monotonic decline at long lag = drift / non-stationarity (consider detrending, §0.2), inflated nugget = measurement error OR short-range structure unresolvable at the sampling spacing (§4.2 dissects)
Locate §3.2 honestly in Parts 3-4: this section is 'how to compute $\hat{\gamma}(h)$ '. §3.3-§3.6 extend the estimator (directional, anisotropic, indicator, robust). §4 fits a permissible MODEL to $\hat{\gamma}(h)$ that Part 5 kriging can actually use

§3.1 defined the variogram as a theoretical object: a continuous function $\gamma(h)$ on a stationary population, capturing the half-variance of the increment $Z(\mathbf{s}+\mathbf{h}) - Z(\mathbf{s})$ as a function of separation $h$ . The widgets in that section let you slide the parameters of an idealised model and watch the curve respond — but they took the variogram as given. §3.2 picks up the next question: how do you actually compute $\gamma(h)$ from data?

The answer in one line is the Matheron 1962 estimator,

\hat{\gamma}(h) \;=\; \frac{1}{2\,N(h)} \sum_{(i,j) \in N(h)} \!\left(Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\right)^2,

where $N(h)$ is the set of all unordered sample pairs whose separation falls in a lag-bin around $h$ , and $|N(h)|$ — also written just $N(h)$ by mild abuse of notation — is its size. It is the population variance-of-difference under second-order stationarity, estimated by an average of squared differences over the data pairs that fall in each bin. It is the Part 3 workhorse: §3.3 makes it directional, §3.4 makes it anisotropic, §3.5 makes it categorical via indicator transforms, §3.6 makes it robust to outliers. But all four of those extensions sit on top of the same basic binning machinery this section develops.

The basic machinery has three moving parts and a small handful of artefacts. The reader who finishes §3.2 should be able to compute $\hat{\gamma}(h)$ from any reasonable dataset, choose a defensible bin width, know which bins to trust and which to mask, and recognise the common ways an empirical variogram lies. The two widgets below stage the journey: the first connects $\gamma(h)$ to the pair-level picture (the h-scatterplot), the second walks the full binning pipeline end-to-end with a known generating process so you can SEE how well $\hat{\gamma}$ recovers it. By the end of §3.2 you have an experimental variogram. Part 4 will fit a permissible model to it.

The h-scatterplot — γ(h) made visible at the pair level

Before the average over pairs, look at the pairs themselves. The h-scatterplot at lag $h$ is the 2-D scatter of points $(Z(\mathbf{s}_i), Z(\mathbf{s}_j))$ for every ordered pair of samples whose separation $|\mathbf{s}_j - \mathbf{s}_i|$ is approximately $h$ . It is the bivariate joint distribution of the field at two locations a distance $h$ apart, sampled empirically from the data.

Its shape tells you $\gamma(h)$ visually. Drop in two facts from §3.1:

If $\gamma(h)$ is SMALL — pairs at lag $h$ have nearly the same value — the cloud hugs the 45° identity line $Z(\mathbf{s}_j) = Z(\mathbf{s}_i)$ .
If $\gamma(h)$ is LARGE — pairs at lag $h$ are nearly uncorrelated — the cloud spreads to the full bivariate scatter, looking like a 2-D blob of size set by $\sigma^2 = C(0)$ .
In between, the cloud is a tilted ellipse — narrow along the 45° direction (small variance of the increment) and wider perpendicular to it.

The first widget for §3.2 shows the connection directly. Drag the lag slider and the cloud tightens or disperses; read off $\hat{\gamma}(h)$ as the cloud's spread perpendicular to the 45° line, since for any pair $(Z_i, Z_j)$ the perpendicular distance to the 45° line is $|Z_j - Z_i|/\sqrt{2}$ , and the variogram is the half-variance of $Z_j - Z_i$ .

The left panel shows the 80 samples on a unit square, drawn from a known exponential-variogram field (range $a = 0.30$ , sill 1.0, no nugget). Up to three example pairs at the current lag are highlighted as line segments so you can see which pairs the right panel is computing on. The right panel is the h-scatterplot itself: each point is an ordered pair at separation $|\mathbf{s}_j - \mathbf{s}_i|$ in the band $[h - \Delta/2,, h + \Delta/2]$ . The 45° line is overlaid. The numeric readout reports $N(h)$ (the unordered pair count in the bin), $\hat{\gamma}(h)$ (computed live), and the true $\gamma(h)$ for the generating process so you can compare.

Three things to do with this widget. Set the lag short (say $h = 0.05$ ) and watch the cloud collapse onto the 45° line — pairs that close are almost the same value, so $\hat{\gamma}$ is small. Slide the lag up past the range ( $h \ge 0.30$ ) and watch the cloud spread to roughly circular dispersion — the variogram has saturated at the sill. Finally, drop the tolerance $\Delta/2$ down to its minimum (0.01); the cloud becomes sparse and noisy because $N(h)$ collapses to a handful of pairs — this is the pair-level version of the noise problem that motivates lag binning in the first place. Wider $\Delta$ packs more pairs into each bin and tightens the estimate at the cost of lag resolution; the trade-off is the central practical decision in experimental variography, and the second widget below stages it explicitly.

The empirical variogram estimator: from pairs to γ̂(h)

Average the h-scatterplot cloud across one lag bin and you have a single value of $\hat{\gamma}$ . Do that across many bins and you have the empirical variogram. The Matheron 1962 estimator is

\hat{\gamma}(h_k) \;=\; \frac{1}{2\,N(h_k)} \!\!\sum_{(i,j) \in N(h_k)} \!\!\left(Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\right)^2,

where $N(h_k)$ is the set of unordered pairs $(i, j)$ with separation $|\mathbf{s}_j - \mathbf{s}_i| \in [k\Delta,, (k+1)\Delta]$ for bin index $k = 0, 1, 2, \ldots$ , and the bin centre is $h_k = (k + \tfrac{1}{2})\Delta$ . The estimator is computed at one or a few dozen bin centres; the points are then plotted against $h_k$ to produce the familiar empirical-variogram curve. Each point is a sample mean of squared increments — it is unbiased for $\gamma(h_k)$ under second-order stationarity (Matheron 1962; Cressie 1993 §2.4), and its variance scales like $1 / N(h_k)$ (the standard variance-of-a-sample-mean result, applied to the squared-increment population).

The factor of 2 in the denominator — easy to forget — comes from the $\gamma(h) = \tfrac{1}{2}\mathrm{Var}(\cdot)$ definition itself. The naïve plug-in estimator $(1/N(h))\sum (Z_i - Z_j)^2$ estimates the full increment variance $2\gamma(h)$ , so dividing by $2 N(h)$ gives $\hat{\gamma}$ . If your computed empirical variogram is twice as high as the §3.1 model curves are telling you to expect, this is the first place to look.

Lag binning in practice — equal-width bins, with directional tolerance to come

Real spatial data is irregularly spaced. With 100 samples scattered across a study area, no two pairs have exactly the same separation. So you cannot compute $\hat{\gamma}$ at a single value of $h$ — you bin pairs into lag intervals and treat the bin centre as a single representative $h_k$ .

The standard scheme is equal-width binning: choose a bin width $\Delta$ and define bins $[0, \Delta],, [\Delta, 2\Delta],, \ldots,, [k\Delta, (k+1)\Delta], \ldots$ . Any pair $(i, j)$ with separation in that interval contributes to bin $k$ . The number of bins is set by the chosen maximum lag $h_{\max}$ as $n_{\text{bins}} = h_{\max} / \Delta$ . Both $\Delta$ and $h_{\max}$ are reader choices, and both have rules of thumb attached.

Looking ahead to §3.3: when the field is anisotropic (e.g. fluvial channels with along-channel continuity longer than across-channel), the binning gains a SECOND dimension. Pairs are kept only if their separation vector $\mathbf{h}_{ij} = \mathbf{s}_j - \mathbf{s}$ falls within $\pm\theta$ of a target azimuth, and they go into a one-dimensional binning by $|\mathbf{h}$ {ij}| $∣ h_{ij} ∣$ for that direction. The result is one empirical variogram per direction. §3.2 stays isotropic — every pair counts regardless of direction — but the same binning machinery applies once you turn the angular filter on.

End-to-end: from samples to γ̂(h) with error bars

The second widget for §3.2 runs the full pipeline on 150 samples drawn from a known Spherical variogram (range $a = 0.30$ , structural sill $c = 0.95$ , nugget $c_0 = 0.05$ , total sill 1.00). The top panel shows $\hat{\gamma}(h_k)$ as points with error bars of size $\pm\hat{\gamma}(h_k) \sqrt{2 / N(h_k)}$ — the large-sample standard deviation of the Matheron estimator (Cressie 1985, 1993 §2.4). The TRUE $\gamma(h)$ is overlaid in gold so you can see how well $\hat{\gamma}$ recovers it. The bottom panel is the $N(h)$ histogram — one bar per lag bin — with the chosen cut-off threshold drawn as a horizontal red line.

Slide the bin width $\Delta$ from its minimum to its maximum and watch the tension between two opposite failure modes play out. At narrow $\Delta = 0.02$ you get many bins, each with a small $N(h_k)$ — points scatter wildly around the gold curve, error bars are large, and the long-lag tail is essentially noise. At wide $\Delta = 0.10$ you get few bins, each with many pairs — points hug the gold curve closely, error bars shrink, but the SHARP knee at the range smooths into a slow rounded shoulder and small structural features are washed out. The sweet spot is somewhere in the middle, and it depends on your data's sample spacing: a common rule of thumb (Goovaerts 1997 §4.3; Isaaks & Srivastava 1989 ch. 7) is to choose $\Delta$ on the order of the MINIMUM sample spacing, so each bin has the resolution of one sampling interval and roughly twenty to fifty pairs per bin in the structural range.

N(h) is the noise control — the rule-of-30 and the half-domain rule

Two very practical rules govern when you should TRUST a bin and when you should ignore it.

The rule of 30. Each $\hat{\gamma}(h_k)$ is a sample mean of $N(h_k)$ squared increments. By the same argument that gives the $1/\sqrt{n}$ scaling of the classical standard error, the noise in $\hat{\gamma}$ scales like $1 / \sqrt{N(h_k)}$ ; the coefficient of variation is roughly $\sqrt{2 / N(h_k)}$ for a Gaussian field (Cressie 1985, 1993 §2.4). Below $N(h_k) \approx 30$ this CV is above 25% — the binned estimate at that lag is at least a quarter off in either direction, often more, and the visual artefacts (jagged dips, spikes, monotonicity violations) become indistinguishable from real structural features. The widely-used (Goovaerts 1997, Isaaks & Srivastava 1989) convention is to plot but DO NOT MODEL bins with $N(h_k) < 30$ . The widget dims those bins to make the cut-off visible.

The half-domain rule. A second source of long-lag noise is edge effects: at lags approaching the size of the study area, only a few pairs CAN exist (the ones connecting opposite corners of the domain), and those pairs disproportionately sample whatever non-stationarity is present at the largest scales. The conventional cap (Journel & Huijbregts 1978 ch. 3; Deutsch & Journel 1998 §II.1) is $h_{\max} \le L/2$ where $L$ is the domain size — half the study-area extent. Beyond that, the empirical variogram is no longer estimating the spatial structure of a stationary process; it is showing you the consequences of finite sampling on a bounded domain. The widget marks the half-domain line at $h = 0.5$ on a unit square so you can see where the trustworthy region ends. Slide $h_{\max}$ past it and notice that the long-lag bins go red — under the cut-off — almost everywhere.

Declustering revisited — weights carry through to γ̂

Part 2 introduced cell and polygonal declustering as a fix for the preferential-sampling bias in the histogram and the mean: clustered samples in high-grade zones inflate the global statistics, declustering weights pull them down to a defensible estimate. The same bias contaminates the empirical variogram. Short-lag bins are dominated by pairs from densely-sampled regions, so they over-represent whatever structure is present in those regions. If the dense clusters are in high-grade hotspots, the short-lag $\hat{\gamma}$ is artificially LOW (high-grade is locally smooth); if the dense clusters are at facies boundaries, the short-lag $\hat{\gamma}$ is artificially HIGH (facies contacts are locally rough). Either way, the variogram model fitted to a naïve $\hat{\gamma}$ will be biased exactly where the kriging system depends on it most — the short-lag region.

The fix is the same as for the histogram: weight the pairs. The declustering-weighted variogram estimator is

\hat{\gamma}_w(h) \;=\; \frac{\sum_{(i,j) \in N(h)} w_i w_j (Z_j - Z_i)^2}{2 \sum_{(i,j) \in N(h)} w_i w_j}

where $w_i$ is the Part 2 declustering weight of sample $i$ . A pair (cluster, sparse) gets weight $w_{\text{cl}} \cdot w_{\text{sp}}$ , which is lower than the all-cluster pair weight $w_{\text{cl}}^2$ ; the dense-region pairs are systematically downweighted in the variogram exactly as they were in the histogram. Both Goovaerts 1997 §4.4 and Deutsch & Journel 1998's GSLIB program gamv implement this weighting natively. The naïve estimator drops out as the special case $w_i = 1$ for all $i$ .

When does the weighting matter? For uniformly-spaced samples (e.g., a regular grid), declustering weights are all equal and the weighted estimator collapses back to Matheron. For preferentially-sampled data — which is most reservoir-characterisation, mining-exploration, and contamination-survey datasets — the unweighted and weighted estimators can differ by 30–50% at short lags. Part 4's fitting routines and Part 5's kriging system inherit that bias if it is not corrected here. The declustering pipeline started in Part 2; this is where it actually pays off.

Common artefacts: what an empirical variogram tells you when it lies

An experimental variogram in the wild rarely looks like the smooth clean curves of §3.1. Three artefacts come up often enough that diagnosing them is part of the working routine.

Bumpy long-lag tail. Beyond about half the domain, $N(h)$ drops and the variance of $\hat{\gamma}$ explodes. The bumps and dips you see are noise, not structure. The fix is the cut-off and the half-domain rule, both above. DO NOT fit your variogram model to that tail — Part 4.5 develops the weighted-least-squares fitting routine that down-weights low- $N$ bins automatically (Cressie 1985 is the foundational reference).
Monotonic decline at long lag. If $\hat{\gamma}$ rises to a peak and then falls back down at large $h$ , you are looking at a stationarity violation, not a stationary process with a quirky variogram. The most common cause is a regional trend (a smoothly varying mean across the domain — porosity decreasing east-to-west, ore grade higher in one corner). The textbook fix is to FIT the trend, subtract it, and compute $\hat{\gamma}$ on the residuals. Universal kriging in Part 5.3 and KED handle the same problem at prediction time, but pre-residualising for variogram estimation gives you a cleaner picture. §0.2 introduced the stationarity ladder; this is where it bites.
Inflated nugget. A $\hat{\gamma}$ whose lowest-lag bin sits well above zero is showing you a nugget effect — but the SOURCE of the nugget is not directly diagnosable from the variogram alone. It could be measurement error in the assay; it could be real structural variability at scales shorter than your closest sample spacing; or it could be a bin-width effect (a very narrow short-lag bin can have so few pairs that the noise alone inflates the estimate). The fix is to compare independent re-measurements of the same sample if you have any — that gives you the measurement-error component directly — and to compare $\hat{\gamma}$ at multiple bin widths to rule out the bin-width artefact. §4.2 takes up nugget dissection in detail.

Where §3.2 ends and §3.3-§3.6 + §4 begin

Be precise about scope. §3.2 gave you the basic isotropic Matheron estimator $\hat{\gamma}(h_k)$ with equal-width lag bins, the noise-management rules (cut-off, half-domain), and the declustering correction. The next four sections of Part 3 extend the same estimator in four orthogonal directions, each addressing a different way that real data departs from the basic isotropic Gaussian-field assumption:

§3.3 — directional variograms. Drop the isotropy assumption. Compute $\hat{\gamma}(h)$ separately along each compass direction with an angular tolerance $\pm\theta$ . The output is several variograms — one per direction — which often differ in range and sill.
§3.4 — anisotropy modelling. Bundle the directional variograms into a 2-D (or 3-D) anisotropic structure: range as an ellipse with major / minor axes, "geometric" anisotropy when only the range varies, "zonal" anisotropy when the sill also varies with direction.
§3.5 — indicator variograms. Threshold the continuous variable $Z$ at a cut-off $z_c$ to produce a 0/1 indicator $I(\mathbf{s}; z_c) = \mathbf{1}{Z(\mathbf{s}) \le z_c}$ . Compute the empirical variogram of $I$ . This is the bridge to indicator kriging (Part 8) for categorical or quantile-thresholded data.
§3.6 — robust variogram estimators. Replace the $(Z_i - Z_j)^2$ in the sum by something less sensitive to outliers: the Cressie–Hawkins 1980 estimator is the classic. The §1.4 robust-statistics ideas (M-estimators, breakdown points) port directly into the variogram domain.

After Part 3 you have a defensible empirical variogram. Part 4 fits a PERMISSIBLE model to it — a continuous function from a family (Spherical, Exponential, Gaussian, Matérn, nested) that is conditionally negative definite, so the kriging matrices in Part 5 are guaranteed positive definite. Part 4.5 covers the fitting routines (weighted least squares, manual fitting, automated cross-validation). The Part 5 kriging system finally consumes the fitted model. Everything in Parts 3 and 4 exists so that Part 5's linear system has the right entries.

Try it

In the h-scatter-explorer, set the lag $h = 0.05$ (very short) and the tolerance $\Delta/2 = 0.025$ . Look at the cloud: does it hug the 45° line, or is it spread? Note the value of $\hat{\gamma}(h)$ in the readout. Now slide $h$ to 0.50 (long, beyond the field's range of 0.30). What happens to the cloud? Why does $\hat{\gamma}(h)$ change so much?
Still in the h-scatter-explorer, hold $h = 0.20$ (mid-range) and slide $\Delta/2$ from its minimum (0.005) to its maximum (0.10). At small $\Delta/2$ , $N(h)$ collapses to a handful of pairs and the cloud is sparse and noisy. At large $\Delta/2$ , $N(h)$ is large and the cloud is well-populated, but it now mixes pairs at very different separations (0.10 to 0.30). What is the trade-off you are making?
In the empirical-variogram-builder, set $\Delta = 0.020$ (narrow) and look at the top panel. Many bins; many error bars; long-lag tail jagged and noisy. Now set $\Delta = 0.080$ (wide). Few bins; small error bars; the knee at the range smooths into a slow shoulder. Where between these two would you set $\Delta$ to capture the spherical knee without drowning in noise? (Hint: try $\Delta = 0.04$ - $0.05$ .)
In the empirical-variogram-builder, set $h_{\max} = 0.80$ and look at the $N(h)$ histogram. How many bins are above the cut-off line? How many are below (red)? Now slide $h_{\max}$ down to 0.50 — the half-domain rule. Which long-lag bins disappeared, and were they trustworthy? (Hint: compare to where the half-domain dashed-red line was.)
Click "New realisation" four or five times in the empirical-variogram-builder, holding all the sliders fixed. The same GENERATING variogram produces empirical variograms that wiggle around the gold curve from run to run. Estimate by eye how much each bin wiggles between realisations. Bins with $N(h) \ge 30$ wiggle less than the small-N ones — this is the $1/\sqrt{N(h)}$ noise scaling, made visible.
Without coding: a colleague brings you 200 porosity samples on a roughly-square 1 km × 1 km domain, with the closest pair 20 m apart and an expected range around 150 m. What bin width $\Delta$ would you suggest? What $h_{\max}$ ? Why? (Hints: $\Delta \approx$ minimum spacing = 20 m gives 7 or 8 bins inside the range — enough to resolve the knee; $h_{\max} \approx 500$ m = half the diagonal stays safely inside the half-domain rule.)
Without coding: the same colleague reports that they ran $\hat{\gamma}$ both unweighted and with cell-declustering weights, and the short-lag value of $\hat{\gamma}$ dropped from 0.018 to 0.012 once weights were applied. What does this suggest about the sampling pattern? Should they use the weighted variogram for downstream kriging? (Hint: a 33% drop at short lag means dense clusters were in locally-smooth regions — high-grade hotspots most likely. Yes, use the weighted version; the unweighted one is biased low for nearby pairs.)
Without coding: an experimental variogram peaks at $h = 250$ m with $\hat{\gamma} = 0.08$ and then DECLINES back to 0.04 at $h = 500$ m. The bins at 250 m and 500 m both have $N(h) > 100$ . Is this consistent with a stationary process, or does it suggest a violation? What is the most likely cause? (Hint: declining $\hat{\gamma}$ at long lag indicates a smoothly-varying mean across the domain — a trend. Stationary processes can only have $\hat{\gamma}$ flat or rising at long lag.)

Pause and reflect: the empirical variogram is a sample statistic, and you have just inherited all the small-sample diagnostic discipline that classical statistics requires for any sample statistic — confidence intervals, sample-size requirements, outlier sensitivity. Where in your domain (reservoir, mine, watershed, brownfield, weather grid) would a 200-sample dataset be considered "lots of data" for variogram estimation, and where would it be considered tight? What single rule of thumb (the rule-of-30, the half-domain, declustering) would you defend most strongly to a colleague who wants to skip it for speed?

What you now know — and what §3.3 onward will build on it

You have the Matheron 1962 empirical-variogram estimator $\hat{\gamma}(h_k) = (1/(2N(h_k)))\sum (Z_i - Z_j)^2$ , with the bin centre $h_k$ and the pair set $N(h_k)$ . You have the pair-level intuition for it: the h-scatterplot, where short-lag pairs hug the 45° line and long-lag pairs disperse to the full bivariate scatter. You have the lag-binning machinery: equal-width bins $[k\Delta, (k+1)\Delta]$ controlled by the choice of $\Delta$ (resolution vs noise), and a maximum lag $h_{\max}$ controlled by the half-domain rule. You have the two practical rules of thumb that govern when to trust a bin: $N(h_k) \ge 30$ for the noise floor, and $h_{\max} \le L/2$ for the edge-effects ceiling. You have the declustering-weighted variant $\hat{\gamma}_w(h) = (\sum w_i w_j (Z_j - Z_i)^2) / (2 \sum w_i w_j)$ that carries the Part 2 declustering weights into Parts 3 and 5. And you have a working diagnostic vocabulary for the three artefacts that an empirical variogram presents in real data: noisy long-lag tails (small $N$ ), monotonic declines (drift / non-stationarity), and inflated nuggets (measurement error vs unresolved short scale).

§3.3 lifts the isotropy assumption. Real fields have direction: along the geological strike, along a fluvial channel axis, along an east-west drainage gradient. Variograms computed in different compass directions look different — different ranges, sometimes different sills. The §3.3 widget will stage the directional binning with a tolerance $\pm\theta$ on the separation azimuth, and you will see the same data produce dramatically different empirical variograms once you slice by direction. §3.4 then bundles those directional curves into an anisotropy ellipsoid that Part 4 can fit a model to. §3.5 swaps the continuous $Z$ for an indicator and shows how categorical/quantile-thresholded data fit the same machinery. §3.6 replaces the squared-difference statistic with a robust counterpart that survives a few wild outliers. By the end of Part 3 you have not just an isotropic $\hat{\gamma}$ but a full experimental description of the spatial structure — direction, anisotropy, categorical thresholds, robust under contamination. Part 4 then turns that experimental description into a permissible model that Part 5 kriging can consume.

The two widgets above are the bedrock. Every later widget in this textbook that says "variogram" — and there are many — runs the §3.2 binning machinery under the hood. Understanding what those bins represent, what makes them noisy, and how to choose a defensible $\Delta$ is the single most important practical skill in geostatistical estimation.

References

Matheron, G. (1962). Traité de géostatistique appliquée, Tome I. Editions Technip, Paris. (The foundational reference for the empirical variogram estimator that bears Matheron's name. Defines the binning machinery and the $\tfrac{1}{2N(h)}\sum(Z_i - Z_j)^2$ form.)
Cressie, N. (1985). Fitting variogram models by weighted least squares. Mathematical Geology, 17(5), 563–586. (The standard reference for the standard-error scaling $\hat{\gamma}(h) \sqrt{2/N(h)}$ that this section's widget uses for the error bars. Also the foundational paper for the WLS fitting routine used in Part 4.5.)
Cressie, N. (1993). Statistics for Spatial Data (revised ed.). Wiley. (§2.4, "Estimation of the Variogram" — the definitive mathematical-statistics treatment of the empirical estimator: bias, variance, asymptotic distribution, robust alternatives.)
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§4.3 on experimental variograms, including the bin-width and minimum-pair rules of thumb. §4.4 on declustering-weighted variograms.)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapter 7, "Spatial continuity" — the most readable practitioner-level treatment of the h-scatterplot, lag binning, and the standard cut-off / half-domain rules.)
Deutsch, C.V., Journel, A.G. (1998). GSLIB: Geostatistical Software Library and User's Guide (2nd ed.). Oxford University Press. (The gam and gamv programs that implement the regular-grid and irregular-grid empirical estimators with all the directional and declustering options developed in this textbook.)
Journel, A.G., Huijbregts, C.J. (1978). Mining Geostatistics. Academic Press. (Chapter 3, "Experimental Variogram". The classic mining-geostatistics statement of the binning rules and the half-domain convention; the source of much of the practical wisdom in §3.2.)
Chilès, J.-P., Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty (2nd ed.). Wiley. (Chapter 2 on the variogram; §2.7 on robust estimators that §3.6 will pick up. The modern comprehensive reference.)