Fitting strategies: by eye, by WLS, by likelihood

Part 4, Variogram modeling

Learning objectives

State the THREE established fitting strategies for a permissible variogram model $\gamma(h; \boldsymbol\theta)$ given an empirical $\hat\gamma(h_k)$ : (1) BY-EYE, manual overlay of a permissible-family curve on the binned points, adjusting parameters until the fit looks convincing; (2) WEIGHTED LEAST SQUARES (Cressie 1985), minimise $\sum_k w_k\,(\hat\gamma(h_k) - \gamma(h_k;\boldsymbol\theta))^2$ with bin-count weights; (3) MAXIMUM LIKELIHOOD / REML (Mardia & Marshall 1984), maximise the data's Gaussian likelihood directly. By-eye is the dominant production approach; WLS is the reproducible field standard; REML is the rigorous statistical procedure
Apply BY-EYE FITTING as a structured workflow: (a) inspect $\hat\gamma$ for nugget, structures, anisotropy; (b) sketch the canonical family that matches the shape; (c) slide $c_0, c, a$ until the curve traces the binned points within their visible uncertainty. Strengths: incorporates domain knowledge (a geologist's intuition for ranges and anisotropy); robust to noisy bins (the eye ignores spurious bumps); standard in mining and reservoir production workflows since Krige and Matheron. Weakness: subjective, different analysts produce different numbers
Apply WEIGHTED LEAST SQUARES with CRESSIE'S WEIGHTS $w_k \propto N(h_k)/\gamma(h_k;\boldsymbol\theta)^2$ : bins with MORE PAIRS contribute proportionally more (less sampling noise), and bins where the MODEL $\gamma$ is SMALL contribute more (additive sampling noise is a larger fraction of large γ values). The weights themselves depend on the model, iterate. Strengths: reproducible (every analyst gets the same numbers); computationally cheap (low-dimensional non-linear optimisation); accounts for noise inflation at large h. Weakness: doesn't capture domain priors, and is biased downward by under-fitting near the origin
Apply MAXIMUM LIKELIHOOD / REML under the GAUSSIAN-FIELD assumption: write the joint density of $Z(\mathbf{s}_1), \ldots, Z(\mathbf{s}_n)$ as multivariate normal with covariance $C(h; \boldsymbol\theta)$ , then maximise that likelihood over $\boldsymbol\theta$ . REML (Patterson & Thompson 1971; Mardia & Marshall 1984) projects out the mean to give a likelihood for the variance parameters alone. Strengths: rigorous statistical theory; provides standard errors and confidence intervals on the parameter estimates; handles nugget and anisotropy simultaneously. Weaknesses: requires the Gaussian-field assumption (often weak in practice); computationally expensive for large $n$ (involves an $n \times n$ matrix inverse); over-confidence when the Gaussian assumption breaks
Recognise the CANONICAL FITTING WORKFLOW that real practitioners follow: (1) inspect $\hat\gamma$ to diagnose nugget, structure, anisotropy; (2) by-eye initial fit to set parameter starting values; (3) WLS refinement using those starting values; (4) cross-validation check (leave-one-out kriging); (5) iterate if CV diagnostics flag miscalibration. The three methods are COMPLEMENTARY, by-eye sets priors, WLS produces a reproducible point estimate, REML quantifies uncertainty when the Gaussian assumption is defensible, and cross-validation tests the result. Documented in Goovaerts 1997 §4.2; Chilès & Delfiner 2012 §2.6
Use CROSS-VALIDATION as the EMPIRICAL FIT-QUALITY TEST. Leave-one-out: at each sample location $\mathbf{s}_i$ , hold the value $z_i$ out, krige $\hat z(\mathbf{s}_i)$ from the remaining $n-1$ samples using the candidate variogram, and compute the prediction error $r_i = z_i - \hat z(\mathbf{s}_i)$ and its kriging standard deviation $\sigma_K(\mathbf{s}_i)$ . The STANDARDISED RESIDUAL $z_i^\star = r_i / \sigma_K(\mathbf{s}_i)$ should be $N(0, 1)$ if the variogram is correctly fitted. Mean $\approx 0$ : unbiased prediction. SD $\approx 1$ : correctly calibrated uncertainty. SD $> 1$ means under-fit (kriging variance under-estimates error); SD $< 1$ means over-fit. Goovaerts 1997 §6.5; Chilès & Delfiner 2012 §3.6
Apply the COMMON PITFALLS catalogue: (a) FITTING NOISE rather than signal, too many components, $K \ge 4$ nested structures; (b) IGNORING ANISOTROPY, using an isotropic model when $\hat\gamma$ is clearly directional; (c) TRUSTING REML WITHOUT GAUSSIAN-ASSUMPTION CHECK, heavy-tailed or skewed fields invalidate the likelihood; (d) RELYING ON ANY ONE METHOD ALONE, convergent results from by-eye + WLS + REML are far more defensible than any single method; (e) SKIPPING CROSS-VALIDATION, without it, you have no empirical evidence the fit calibrates kriging variance correctly
Locate §4.5 HONESTLY in Part 4. §4.5 is the PROCEDURAL CLOSE, it gives you the recipes to pin parameters down, but the recipes don't tell you which permissible family to use (§4.1), whether to add a nugget (§4.2), how to handle anisotropy (§4.3), or how many nested components (§4.4). All of those choices feed into the fitting workflow, and a bad upstream choice cannot be rescued by a sophisticated downstream fit. Part 5 plugs the fitted model into kriging; Part 6 develops cross-validation and calibration into the full QC apparatus
Read the THREE METHODS AS COMPLEMENTARY, NOT COMPETING. They give SIMILAR but not IDENTICAL fits on the same data. The differences carry information: by-eye is biased toward what the analyst expects; WLS is biased toward minimising squared error in noisy bins; REML is biased toward whatever maximises a Gaussian likelihood. When the three agree, the fit is robust; when they disagree by more than the data uncertainty, something upstream needs revisiting. The DEFENSIBLE PRACTICE: report all three and document the disagreement

By §4.4 you have a parametric variogram model with $1 + 2K$ free parameters, nugget $c_0$ plus per-component sills $c_k$ and ranges $a_k$ , with optional per-component anisotropy adding up to six more parameters per component in 3-D. §4.4 fixed the FORM of the model (permissible families, sum-of-permissibles, nested structures, anisotropic wrap). It did NOT fix the NUMERICAL VALUES of the parameters. The shape of the empirical variogram $\hat\gamma(h_k)$ , those 12-25 binned points produced by Part 3, has to constrain the parameter vector $\boldsymbol\theta = (c_0, c_1, a_1, c_2, a_2, \ldots)$ . §4.5 is the chapter on HOW that constraining is done in practice.

The §4.5 question is: given a binned $\hat\gamma(h_k)$ and a chosen functional form, how do you find the $\boldsymbol\theta$ that best fits? Three answers have stood the test of fifty years of operational geostatistics. The FIRST is by EYE, overlay the candidate curve on the binned points and adjust the sliders until the fit looks convincing. The SECOND is by WEIGHTED LEAST SQUARES (WLS), minimise the weighted sum of squared residuals, with weights that account for unequal bin reliability. The THIRD is by MAXIMUM LIKELIHOOD (ML) or its restricted form REML, write down the Gaussian likelihood for the data and maximise it over the parameters. Each method has different strengths, different failure modes, and different acceptance in different sub-fields. The CANONICAL WORKFLOW combines them: by-eye to set priors, WLS to produce a reproducible point estimate, REML when the Gaussian assumption is defensible to quantify uncertainty, and CROSS-VALIDATION as the empirical test that the resulting fit calibrates kriging variance correctly.

This section develops all three methods, the canonical workflow that combines them, the cross-validation diagnostic that closes the loop, and the catalogue of common pitfalls. The first widget puts the three methods side by side on a single empirical variogram with a known truth, so the reader can compare the fits and see how each method weights the bins differently. The second widget runs leave-one-out cross-validation under three candidate variograms, an under-fit, the truth, and an over-fit, and shows how the standardised-residual distribution diagnoses each. By the end you can take any empirical variogram, fit a permissible model defensibly, and verify that the fit calibrates the kriging system it will feed into Part 5.

By eye, the production workhorse

The by-eye method is the oldest. An experienced geostatistician plots $\hat\gamma(h_k)$ on the same axis as a candidate permissible-family curve, slides the parameter sliders, and stops when the curve looks like it traces the points within their visible uncertainty. The output is a parameter vector $\boldsymbol\theta$ that the analyst can defend on inspection. The full algorithm:

Inspect $\hat\gamma$ for global features. Where does it intercept the $\gamma$ axis? (nugget $c_0$ .) Does it reach a plateau, and at what height? (sill $c_0 + c$ .) At what lag? (range $a$ .) Is there a change of slope before the plateau? (nested-structure shoulder, §4.4.) Are there directional differences? (anisotropy, §4.3.)
Sketch the canonical permissible family that matches the shape. Linear origin and finite range ⇒ Spherical (§4.1). Linear origin and asymptotic approach ⇒ Exponential. Parabolic origin ⇒ Gaussian (rare and with caveats). Multi-knee shape ⇒ nested structures (§4.4).
Set initial parameter values from the visual diagnosis. Read $c_0$ from the smallest-lag bins (with the §4.2 caveat that the smallest bin underreads true nugget by sampling noise). Read total sill from the plateau height. Read range from where the curve effectively reaches the plateau.
Slide each parameter in turn and watch the curve overlay. Stop when the curve traces the bins through every region, near-origin, transition zone, plateau, within the binned-pair uncertainty.
Document the fit. Report parameter values, the chosen family, the visible features of $\hat\gamma$ that drove the choice, and (for nested models) the rationale for the number of components.

By-eye fitting is the DOMINANT approach in mining and petroleum production geostatistics, Krige and Matheron used it on South African gold deposits in the 1960s, and Goovaerts 1997 §4.2, Chilès & Delfiner 2012 §2.6, Isaaks & Srivastava 1989 ch. 16, and Pyrcz & Deutsch 2014 §5 all describe it as the practitioner default. Its strengths are real. The eye can incorporate DOMAIN KNOWLEDGE that no algorithm captures: a structural geologist knows roughly what range to expect for a given lithology, a reservoir engineer knows the depositional facies scales, an environmental scientist knows the contamination plume size. By-eye fitting starts from those priors. The eye is also ROBUST TO NOISE: a single high bin at h = 0.8 the eye will ignore as a sampling outlier; a least-squares algorithm gives it full weight. And the by-eye output is REPRODUCIBLE IN SPIRIT even if not in numerical detail, two trained analysts will typically agree on family choice, rough range, and rough sill.

The weakness is SUBJECTIVITY. Two trained analysts produce different parameter numbers (often by 5-15% on each parameter) even when they agree on family and order of magnitude. For a regulated reserve estimate or a peer-reviewed scientific result, that level of analyst-to-analyst variation is a documented problem. The fix is to combine by-eye with WLS or REML to produce reproducible numbers, while letting the by-eye step set defensible starting values.

Weighted least squares, Cressie 1985 is the field standard

The fully automated alternative is WEIGHTED LEAST SQUARES (WLS): pick the parameter vector that minimises a weighted sum of squared residuals between the model curve and the binned points. The objective is

\boldsymbol{\hat\theta}_{\mathrm{WLS}} \;=\; \arg\min_{\boldsymbol\theta} \sum_{k=1}^{K_{\mathrm{bins}}} w_k\,\bigl(\hat\gamma(h_k) - \gamma(h_k; \boldsymbol\theta)\bigr)^2,

and the choice of weights $w_k$ is everything. Cressie 1985, the standard reference for variogram-fitting WLS, recommends

w_k \;\propto\; \dfrac{N(h_k)}{\gamma(h_k; \boldsymbol\theta)^2},

where $N(h_k)$ is the number of pairs in bin $k$ . The factor $N(h_k)$ in the numerator gives MORE weight to bins with more pairs (and therefore less sampling noise in $\hat\gamma(h_k)$ ). The factor $\gamma(h_k; \boldsymbol\theta)^2$ in the denominator gives MORE weight to bins where the MODEL γ is SMALL, at small lags, the bin γ values are themselves small, and the absolute residual $\hat\gamma - \gamma$ scales with γ, so dividing by $\gamma^2$ rescales to a roughly constant relative residual across the lag range. The combined Cressie weights account for both pair-count heterogeneity and the variance-mean relationship of γ̂ under a Gaussian field assumption (Cressie 1993 §2.4).

The weights depend on the model (through $\gamma(h_k; \boldsymbol\theta)^2$ ), so the optimisation iterates: start with constant weights $w_k = 1$ , fit, recompute weights from the current model, re-fit, until convergence. Two or three iterations are typically enough. The numerical optimisation is a low-dimensional non-linear least-squares problem, Levenberg-Marquardt or coordinate-descent both work; the parameter space is small enough that the choice rarely matters.

WLS strengths: REPRODUCIBLE (every analyst running the same algorithm on the same data gets the same numbers); COMPUTATIONALLY CHEAP (seconds even for nested models); RIGOROUS in its weighting (the Cressie weights are not arbitrary, they emerge from a Gaussian-field analysis of the sampling variance of $\hat\gamma(h_k)$ ); EXTENSIBLE to anisotropic and nested models without algorithmic change. WLS is the dominant field-standard fitting routine: GSLIB (Deutsch & Journel 1998 §III.4), gstat (Pebesma 2004), GeoStatTrace, Leapfrog Geo, and Petrel all default to a WLS variant.

WLS weaknesses: SENSITIVE to outlier bins (a single noisy bin can pull the fit if the weights happen to give it leverage); doesn't capture DOMAIN PRIORS (the algorithm sees only the bins, not the geologist's intuition for ranges); biased toward UNDER-FITTING the near-origin region when the Cressie weights give small-γ bins large weight without enough corresponding small-γ pairs (Pyrcz & Deutsch 2014 §5 documents this systematic bias). The practitioner remedy: combine WLS with by-eye sanity-check and CV calibration.

Maximum likelihood and REML, the rigorous-statistics route

The THIRD strategy treats variogram fitting as a parametric estimation problem under an explicit probability model. Assume the field is a Gaussian random field with mean $\mu$ (constant or trend-modelled) and covariance $C(h; \boldsymbol\theta)$ . Then the joint distribution of the $n$ sample values is multivariate normal:

\mathbf{z} \sim \mathcal{N}\bigl(\mathbf{1}\,\mu, \, \boldsymbol\Sigma(\boldsymbol\theta)\bigr), \qquad \Sigma_{ij}(\boldsymbol\theta) = C(\|\mathbf{s}_i - \mathbf{s}_j\|; \boldsymbol\theta).

The negative log-likelihood, up to constants, is

-2 \log L(\boldsymbol\theta, \mu) \;=\; \log |\boldsymbol\Sigma| \;+\; (\mathbf{z} - \mathbf{1}\mu)^{\top} \boldsymbol\Sigma^{-1} (\mathbf{z} - \mathbf{1}\mu).

MAXIMUM LIKELIHOOD (ML) estimation maximises this jointly over $(\boldsymbol\theta, \mu)$ . The ML estimator of the mean $\mu$ is a generalised-least-squares average of the data, and the profile likelihood for $\boldsymbol\theta$ is what you optimise after profiling out the mean. The full machinery is in Mardia & Marshall 1984 (the original geostatistics paper) and Diggle & Ribeiro 2007 §6 (the standard modern textbook reference).

ML has a well-known small-sample bias: when $n$ is modest the variance parameters are systematically under-estimated because the likelihood doesn't account for the degrees-of-freedom cost of estimating the mean. The fix is RESTRICTED MAXIMUM LIKELIHOOD (REML), projet $\mathbf{z}$ onto the orthogonal complement of the mean subspace and apply ML to those residual contrasts (Patterson & Thompson 1971; Cressie 1993 §2.6.1). REML eliminates the mean-estimation bias and gives unbiased variance parameter estimates under correct model specification. Mardia & Marshall 1984 develop REML for spatial covariance models specifically.

ML / REML strengths: RIGOROUS STATISTICAL THEORY (asymptotic normality of estimators, likelihood-ratio tests for nested model comparison, profile-likelihood confidence intervals); STANDARD ERRORS on the estimated parameters from the observed information matrix; HANDLES NUGGET AND ANISOTROPY simultaneously in one optimisation (no separate fitting steps); EXTENDS NATURALLY to model-based geostatistics with non-Gaussian likelihoods (Diggle & Ribeiro 2007).

ML / REML weaknesses: REQUIRES THE GAUSSIAN-FIELD ASSUMPTION, heavy-tailed or skewed fields invalidate the likelihood, and the estimator becomes biased in ways that are hard to diagnose without further tests; COMPUTATIONALLY EXPENSIVE, the $n \times n$ matrix inverse in the likelihood is $O(n^3)$ for direct factorisation, prohibitive for $n > 5{,}000$ samples without sparse-approximation tricks (Vecchia 1988; Stein, Chi & Welty 2004); the OVER-CONFIDENCE problem when the Gaussian assumption breaks (the SEs reported under the wrong likelihood are too tight). Practitioners reach for REML when $n$ is moderate (a few hundred to a few thousand samples), the field is plausibly Gaussian after a transformation, and uncertainty quantification matters as much as the point estimate.

The first widget for §4.5 puts all three methods on a single empirical variogram with a known truth. The reader sets the truth (a Spherical model with nugget), the widget generates a binned γ̂(h) with sampling noise, and three fits are computed: BY-EYE (the reader's manual sliders), WLS (Cressie weights, coordinate descent), and REML (Gaussian-likelihood proxy applied to the binned residuals). All three fits and the truth curve are overlaid on the bins; the readout panel reports each method's (c₀, c, a) triple and the SSE-to-truth of the fitted curve.

Three things to do with this widget. First, set the truth to (c₀=0.10, c=0.70, a=0.50) and adjust your by-eye fit to match the bins as well as you can. Toggle "Reveal truth", your by-eye fit is probably within 0.05 of the truth on each parameter, but not identical. Compare to the WLS and REML fits: they should also be close to the truth, with their own characteristic deviations. Note the SSE-to-truth numbers: typically 0.001-0.01 for a well-converged fit. The three methods do NOT give identical numbers, but they all sit in a small neighbourhood around the truth.

Second, click "Resample noise" several times at the same truth. Watch the three fits jump around realisation by realisation. On any single noise realisation, one method might be closer to the truth than the others; on the next realisation, a different method wins. Over many realisations the three methods AGREE ON AVERAGE but disagree on any single dataset by an amount proportional to the binned-empirical uncertainty. This is the sampling-variability cost of fitting a noisy γ̂(h) with finite data, and the empirical reason to report multiple methods rather than trusting any single point estimate.

Third, set the truth to a very-low-noise configuration (large nPairs, easy to read off the binned plateau) versus a high-noise configuration. With clean data the three methods converge tightly; with noisy data they spread out. The DEFENSIBLE PRACTICE, convergent results from by-eye + WLS + REML, is a quality criterion: when the three methods agree, the fit is robust to method choice; when they disagree, the data is the bottleneck and reporting a confidence band on the parameters is the honest move.

The canonical workflow, combine all three, then cross-validate

No practitioner relies on a single fitting method in isolation. The CANONICAL WORKFLOW combines them, with each method playing a different role at a different stage. Goovaerts 1997 §4.2; Chilès & Delfiner 2012 §2.6; Pyrcz & Deutsch 2014 §5 all describe variants of the following procedure:

Inspect $\hat\gamma$ , visual diagnosis. Plot the empirical variogram in two or three directions. Identify nugget, structures (single or nested), anisotropy. This is a §3.2-§3.4 task; §4.5 inherits the output. Choose a candidate family from §4.1 and, if needed, nested-structure decomposition from §4.4.
By-eye initial fit. Slide the parameter sliders to a defensible starting point. This step incorporates DOMAIN KNOWLEDGE that no algorithm captures, what range is geologically reasonable, what nugget makes sense given the sampling protocol, what anisotropy ratio matches the depositional setting.
WLS refinement. Run weighted least squares with the by-eye values as starting points. The output is a REPRODUCIBLE NUMERICAL ESTIMATE, the same data + the same WLS implementation gives the same numbers every time. This is what goes into the production reservoir / mining / environmental model.
Optional REML for uncertainty quantification. If $n$ is moderate and the Gaussian assumption is defensible (check via normal-score transform, §1.2), run REML to get standard errors on the parameters. Report the SE bands alongside the point estimates.
Cross-validation check. Leave-one-out: at each sample location, krige the held-out value from the rest using the fitted variogram, compute the prediction error, normalise by the kriging standard deviation. The histogram of standardised residuals should be $N(0, 1)$ . If it isn't, the fit is miscalibrated, see §6 for the full diagnostic suite.
Iterate. If CV flags miscalibration, return to step 2 with the CV evidence as a constraint. Often the fix is adjusting the nugget (CV miscalibration in the small-prediction-variance regime points to wrong nugget) or adding / removing a nested component (CV miscalibration in particular lag bands points to under-resolved structure).

The workflow is iterative, each method's output informs the others. By-eye sets priors that constrain WLS's starting point; WLS produces reproducible numbers that REML refines with uncertainty bounds; CV provides empirical evidence that the fit calibrates kriging. The output is a parameter vector $\boldsymbol\theta$ with a documented derivation, a defensible numerical estimate, optional uncertainty bands, and an empirical fit-quality check.

Cross-validation, the empirical test for variogram fit

The fit-quality test that bridges Part 4 to Part 6 is LEAVE-ONE-OUT CROSS-VALIDATION. The procedure: for each sample location $\mathbf{s}_i$ , remove the value $z_i$ from the dataset, krige $\hat z(\mathbf{s}_i)$ from the remaining $n - 1$ samples using the candidate variogram, and compute the prediction error

r_i \;=\; z_i - \hat z(\mathbf{s}_i), \qquad \sigma_K(\mathbf{s}_i) \;=\; \text{kriging standard deviation at } \mathbf{s}_i.

The STANDARDISED RESIDUAL is

z_i^\star \;=\; \dfrac{r_i}{\sigma_K(\mathbf{s}_i)}.

If the candidate variogram correctly describes the field, then under the Gaussian-field assumption the standardised residuals are independently $N(0, 1)$ distributed. The empirical histogram of ${z_i^\star}_{i=1}^{n}$ should be approximately the standard-normal density curve. Three diagnostic numbers (Goovaerts 1997 §6.5):

$\text{mean}(z^\star) \approx 0$ , predictions are unbiased. A non-zero mean diagnoses systematic bias (often a wrong drift assumption, Part 5.3).
$\text{SD}(z^\star) \approx 1$ , kriging variance is correctly CALIBRATED. SD > 1 means the kriging variance under-estimates actual prediction error (the variogram is UNDER-FIT, too small a range or too small a sill); SD < 1 means the kriging variance over-estimates actual error (the variogram is OVER-FIT, too large a range relative to true field structure).
$\text{Normal shape}$ , heavy tails diagnose non-Gaussianity in the field (consider normal-score transform, §1.2); skewed shapes diagnose mean-trend issues.

Cross-validation is the EMPIRICAL FIT-QUALITY TEST. Two variogram models that look equally plausible by visual inspection of $\hat\gamma$ may give very different CV diagnostics, and the one with $\text{SD}(z^\star) \approx 1$ is the one that actually calibrates the downstream kriging. CV is what closes the loop: by-eye + WLS + REML give you a candidate fit; CV tells you whether that candidate works on the data it will be deployed on.

The second widget for §4.5 makes the cross-validation test concrete. A synthetic 16-sample dataset is drawn from a known Spherical-with-nugget truth (Cholesky-conditioned so the samples have realistic spatial correlation, not independent draws). Three candidate variograms, UNDER-FIT (range too small), TRUTH (matches generator), OVER-FIT (range too large), are each scored by leave-one-out cross-validation. The widget reports for each candidate the predicted-vs-observed scatter, the standardised-residual histogram against the $N(0, 1)$ reference curve, RMSE, mean(z), and SD(z).

Three things to do with this widget. First, click each of the three candidate buttons (Under-fit, Truth, Over-fit) and watch the histogram change. The TRUTH candidate gives SD(z) close to 1, the histogram bars match the blue $N(0, 1)$ curve. The UNDER-FIT candidate gives SD(z) substantially above 1, the histogram is wider than the normal curve because the kriging variance is too small (the variogram's short range under-estimates actual covariance). The OVER-FIT candidate gives SD(z) substantially below 1, the histogram is narrower than the normal curve because the kriging variance is too large.

Second, click "Resample" several times to see the CV diagnostics under different noise realisations. The TRUTH candidate consistently gives SD(z) near 1, that's the signature of a correctly calibrated fit. The UNDER-FIT and OVER-FIT candidates consistently miscalibrate in their respective directions. The diagnostic is not sensitive to individual realisations, it captures a structural property of the candidate variogram against the field.

Third, look at the predicted-vs-observed scatter for each candidate. The cloud of dots should fall along the y = x reference line; if the dots systematically deviate from y = x at certain value ranges, the variogram is biased in those regions. The error bars are $\pm 1\sigma_K$ : under the TRUTH candidate they bracket the dots approximately correctly; under UNDER-FIT they're too short; under OVER-FIT they're too long. CV is the empirical apparatus that diagnoses these miscalibrations.

Common pitfalls, what defensible fits avoid

The §4.5 honest catalogue of how variogram fits go wrong:

Fitting noise rather than signal. Real $\hat\gamma$ has 12-25 noisy bins. Fitting a permissible family with 3 parameters (c₀, c, a) leaves $\sim 20$ degrees of freedom; fitting a two-structure nested model with 5 parameters leaves $\sim 18$ ; fitting a three-structure model with 7 parameters leaves $\sim 16$ ; fitting a four-structure model with 9 parameters leaves $\sim 14$ . The rule of thumb: STOP at 2-3 nested components (§4.4 caveats). Each additional component eats degrees of freedom and lets the fit chase noise bumps. Cross-validation catches this, over-fit models give SD(z) < 1, the over-confidence signature.

Ignoring anisotropy. Real fields are almost always anisotropic to some degree. Using an isotropic model on an anisotropic dataset compromises the fit: the isotropic variogram averages out the directional differences, producing a model that overstates correlation in the short-range direction and understates it in the long-range. The directional CV diagnostic, compute SD(z) separately for sample-pairs in each directional sector, flags this. The §4.3 anisotropic-ellipsoid machinery is the fix.

Trusting REML without checking the Gaussian assumption. The REML likelihood is built on the multivariate-normal assumption for the data. Real fields are often skewed (positive grades, strictly non-negative concentrations) or heavy-tailed. Run a normal-score transform (§1.2) BEFORE REML, fit on the transformed data, and back-transform the model parameters at the end. Skipping this step produces a Gaussian-likelihood-best estimate on non-Gaussian data, the SEs reported are too tight, and the parameter estimates are biased.

Relying on any single method alone. By-eye is subjective. WLS is sensitive to outlier bins. REML requires the Gaussian assumption. Each method has a failure mode. CONVERGENT RESULTS across the three methods is a quality criterion: when by-eye, WLS, and REML agree to within the data uncertainty, the fit is robust to method choice. When they disagree by more than the data uncertainty, the data is the bottleneck and reporting a confidence band is honest.

Skipping cross-validation. A fit that has not been cross-validated is a fit that has not been tested. CV is computationally cheap (leave-one-out on $n$ samples is $n$ kriging solves at small kernels, affordable even for $n = 10{,}000$ ) and informationally rich. Defensible practice: every reported variogram fit comes with CV diagnostics, mean(z), SD(z), histogram against $N(0, 1)$ .

Honest scope, §4.5 is the procedural close

§4.5 is the procedural close to Part 4. It gives you the recipes to pin parameters down, but the recipes don't tell you which family to use (§4.1), whether to add a nugget (§4.2), how to handle anisotropy (§4.3), or how many nested components (§4.4). All of those choices feed into the fitting workflow, and a bad upstream choice cannot be rescued by a sophisticated downstream fit. If §4.1's family choice is wrong, no amount of WLS refinement will save the fit. If §4.2's nugget is set to zero on data that has real measurement noise, the resulting kriging system will interpolate noise as if it were signal. If §4.3's anisotropy is ignored on directional data, the fit averages out the directional structure and the kriged map blurs the asymmetry. §4.5's machinery only refines the parameter values; it does not fix structural model errors.

Part 5 takes the fitted variogram model and plugs it into the kriging system: the per-pair γ values become the kriging matrix entries, the weights are computed from the matrix inverse, and the predictions inherit whatever fit quality §4.5 produced. A well-fit variogram gives well-calibrated kriging with SD(z) ≈ 1 in cross-validation; a poorly-fit variogram gives miscalibrated kriging that fails its CV check. Part 6 develops cross-validation and calibration into the full QC apparatus, accuracy plots, calibration diagrams, conditional-bias diagnostics, jackknife estimators. The §4.5 leave-one-out diagnostic is the entry point; Part 6 is the full toolbox.

Try it

In variogram-fit-strategies, set the truth to (c₀=0.10, c=0.70, a=0.50). Use the by-eye sliders to fit the bins as well as you can WITHOUT looking at the truth (don't toggle "Reveal truth" yet). Then toggle "Reveal truth" and compare your by-eye fit to the true generator. How far off were you? The SSE-to-truth in the readout panel quantifies the gap. Typical by-eye SSE-to-truth on this generator: 0.001-0.005.
Still in variogram-fit-strategies, compare your by-eye fit to the WLS and REML fits in the readout. The three fits usually sit close to each other, typically within 0.05 on each parameter, but rarely identical. Document the differences. Click "Resample noise" five times and re-record the three (c₀, c, a) triples each time. The WLS fit is reproducible across realisations only if the bins are; same for REML. By-eye varies across analysts but is stable for one analyst.
In variogram-fit-strategies, set the truth to (c₀=0.05, c=0.90, a=0.30), a higher signal-to-nugget ratio, a smaller range. Fit by-eye again. WLS and REML recompute. Compare the SSE-to-truth across the three: typically WLS is closest to truth because the Cressie weights favour high-pair-count bins, but on a high-noise realisation REML can win. Note that the THREE METHODS GIVE SIMILAR BUT NOT IDENTICAL FITS, the spread quantifies sampling-variance cost.
In variogram-fit-strategies, deliberately set a wildly wrong by-eye fit (c₀=0.30, c=0.30, a=1.00). Watch the by-eye SSE-to-truth blow up. Now slide each parameter toward the truth one at a time and watch the SSE drop with each adjustment. This is what an analyst does mentally when fitting by-eye: each parameter change is followed by a check that the fit improved.
In variogram-crossval, click the TRUTH candidate. Read SD(z), it should be close to 1.00. Note the histogram of standardised residuals fits the blue $N(0, 1)$ curve. RMSE should be modest (consistent with the noise level). This is the calibrated case.
In variogram-crossval, click the UNDER-FIT candidate (range too small, a=0.18 vs truth a=0.40). SD(z) is substantially above 1 (typically 1.4-2.0). The histogram is wider than the $N(0, 1)$ curve, kriging variance under-estimates actual error. The predicted-vs-observed scatter shows the error bars are too short to bracket the dots. This is the under-fit signature.
In variogram-crossval, click the OVER-FIT candidate (range too large, a=0.90). SD(z) is substantially below 1 (typically 0.6-0.8). The histogram is narrower than the $N(0, 1)$ curve, kriging variance over-estimates actual error. The scatter error bars are too long. This is the over-fit signature.
Without coding: a reservoir-characterisation team reports a Spherical fit (c₀=0.04, c=0.20, a=80m) with CV diagnostics mean(z)=0.02, SD(z)=1.18. Diagnose. Is the fit calibrated? Is the mean acceptable? Is the SD acceptable? What direction would you adjust the variogram parameters to improve calibration? (Hint: SD(z) > 1 means under-fit, try a larger sill or a longer range.)

Pause and reflect: variogram fitting is not a single algorithm but a workflow. By-eye sets priors, WLS produces reproducible numbers, REML quantifies uncertainty under the Gaussian assumption, and cross-validation tests the result on the empirical data. Each method has a different failure mode; the defensible practice combines all of them. What would you most want to document in a variogram report so that a downstream geomodeller can verify the fit was justified by the data, the parameter values, the family choice, the cross-validation SD(z), the comparison across by-eye / WLS / REML?

What you now know, and the close of Part 4

You can state the THREE established variogram-fitting strategies, by eye, by weighted least squares with Cressie's $w_k \propto N(h_k)/\gamma_k^2$ weights, and by maximum likelihood / REML under a Gaussian-field assumption. You know which method dominates which subfield, by-eye in mining and reservoir production workflows, WLS in software-standard reproducible workflows (GSLIB, gstat, Petrel, Leapfrog), REML in research-statistics and uncertainty-quantification contexts. You can explain the strengths and weaknesses of each method: by-eye is subjective but incorporates domain priors; WLS is reproducible but doesn't capture priors; REML is rigorous but requires the Gaussian-field assumption and is computationally expensive.

You can describe the CANONICAL WORKFLOW that combines all three methods. Inspect $\hat\gamma$ visually; by-eye initial fit; WLS refinement; optional REML for SEs; cross-validation check; iterate if CV flags miscalibration. The workflow is what a competent practitioner does in a production setting; the methods are tools used at different stages, not competing algorithms.

You can compute LEAVE-ONE-OUT CROSS-VALIDATION diagnostics, mean(z) (should be ≈ 0 for unbiasedness), SD(z) (should be ≈ 1 for kriging-variance calibration), and the standardised-residual histogram against the $N(0, 1)$ reference. You know what SD(z) > 1 diagnoses (under-fit, variogram too short-ranged or too small a sill), what SD(z) < 1 diagnoses (over-fit, variogram too long-ranged), and what a normal-shaped histogram with mean ≈ 0 and SD ≈ 1 means (correctly calibrated variogram).

You can apply the COMMON PITFALLS catalogue: stop at 2-3 nested components; check anisotropy; transform to normal scores before REML; report all three methods; always cross-validate. You know that defensible variogram reports include the parameter values, the family choice, the workflow used, the cross-validation diagnostics, and an honest discussion of the residual uncertainty.

This closes PART 4, VARIOGRAM MODELING. Five sections in, you have a complete variogram-modelling toolkit: permissible isotropic families with linear / parabolic origin diagnostic (§4.1); nugget for measurement noise and microscale variability (§4.2); anisotropic ellipsoids for directional bias (§4.3); nested structures for multi-scale physical processes (§4.4); and three formal fitting strategies plus the cross-validation calibration test (§4.5). You can take any empirical variogram $\hat\gamma(\mathbf{h}; z_c)$ produced by Part 3 and fit a defensible permissible model $\gamma(\mathbf{h}; \boldsymbol\theta)$ to it, choice of family informed by the near-origin behaviour and overall shape, parameters pinned down by a combination of by-eye, WLS, and REML, calibration verified by leave-one-out CV.

Part 5 picks up here. Where Part 4 PRODUCED a fitted variogram model, Part 5 USES that model to do kriging, the BLUP that takes the variogram and the sample data and produces predictions at unsampled locations with associated kriging variances. Simple kriging (§5.1) plugs the model into the basic predictor; ordinary kriging (§5.2) adds the unbiasedness constraint; universal kriging and KED (§5.3) handle drift; block kriging (§5.5) handles support changes; cokriging (§5.7) handles secondary data; the kriging variance (§5.4) and its calibration (§6.3) close the uncertainty-quantification loop. The §4.5 fitted variogram is what the kriging system in Part 5 actually reads, get the fit right and the kriged map will be well-calibrated; get it wrong and the kriged map will silently inherit the miscalibration. The cross-validation diagnostic developed in §4.5 is the entry-level fit-quality check; Part 6 develops it into the full QC apparatus.

References

Cressie, N. (1985). Fitting variogram models by weighted least squares. Mathematical Geology, 17(5), 563-586. (The standard reference for variogram-fitting WLS. Derives the recommended weights $w_k \propto N(h_k) / \gamma(h_k)^2$ from the small-sample variance of $\hat\gamma(h_k)$ under a Gaussian-field assumption. The basis for the WLS implementations in GSLIB, gstat, and every modern variogram-fitting routine.)
Mardia, K.V., Marshall, R.J. (1984). Maximum likelihood estimation of models for residual covariance in spatial regression. Biometrika, 71(1), 135-146. (The foundational paper for ML and REML estimation of spatial covariance models. Develops the profile-likelihood machinery for fitting variogram parameters under a Gaussian-field assumption and derives the asymptotic distribution of the estimators. The cited reference for REML in geostatistics.)
Patterson, H.D., Thompson, R. (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 58(3), 545-554. (The original paper introducing restricted maximum likelihood (REML) in the context of mixed-effects models. The technique projects the data onto the orthogonal complement of the mean subspace and applies ML to those residual contrasts, eliminating the small-sample bias of ML for variance parameters. Mardia & Marshall 1984 carry REML over to spatial covariance models.)
Diggle, P.J., Ribeiro, P.J. Jr. (2007). Model-based Geostatistics. Springer. (The standard modern textbook reference for likelihood-based geostatistics. §6 develops ML and REML for stationary Gaussian fields, the profile-likelihood machinery, parameter standard errors via the observed information matrix, and the diagnostic tools for checking the Gaussian assumption.)
Cressie, N. (1993). Statistics for Spatial Data (revised ed.). Wiley. (§2.6 covers ML and REML estimation of variogram models. §2.4 derives the variance of $\hat\gamma(h_k)$ under a Gaussian-field assumption, which justifies Cressie's WLS weighting. The standard mathematical-statistics treatment of variogram fitting at a graduate level.)
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§4.2 develops the practitioner-oriented variogram-fitting workflow combining by-eye and WLS. §6.5 develops leave-one-out cross-validation as the fit-quality diagnostic. The standard practitioner-textbook reference for the §4.5 material at a graduate-school level.)
Chilès, J.-P., Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty (2nd ed.). Wiley. (§2.6 covers fitting permissible variogram models with both WLS and ML / REML at mathematical-statistics rigour. §3.6 covers cross-validation diagnostics including the standardised-residual interpretation. The cleanest derivation of why SD(z) calibration corresponds to correct variogram fit.)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapter 16 introduces variogram-fitting at the practitioner-pedagogy level. Emphasises by-eye fitting with WLS refinement as the dominant production workflow. The most readable entry-level reference for the §4.5 material.)
Pyrcz, M.J., Deutsch, C.V. (2014). Geostatistical Reservoir Modeling (2nd ed.). Oxford University Press. (§5 documents the variogram-fitting workflow used in reservoir-characterisation production. Emphasises the by-eye + WLS combination with cross-validation as the calibration test, and catalogues the common fitting pitfalls observed across hundreds of field projects.)
Deutsch, C.V., Journel, A.G. (1998). GSLIB: Geostatistical Software Library and User's Guide (2nd ed.). Oxford University Press. (§III.4 documents the variogram-fitting routines in GSLIB, VARFIT for WLS fitting, the iterative weight update, and the I/O conventions for nested models. The canonical software-side reference; reading it is how to verify that a WLS fit produced in one package can be re-implemented in another.)
Stein, M.L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer. (§6 covers ML estimation of covariance parameters in spatial models, the asymptotic theory of estimators under increasing-domain and infill asymptotics, and the conditions under which the Gaussian-field assumption is robust to mild non-Gaussianity. The reference textbook for the mathematical-statistics underpinnings of REML in spatial models.)
Vecchia, A.V. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society B, 50(2), 297-312. (The original paper introducing approximations to the full Gaussian likelihood that make ML / REML tractable for large $n$ . The Vecchia approximation truncates the covariance dependency at each location to its nearest neighbours, reducing the $O(n^3)$ cost to $O(n \log n)$ , and is the basis for modern large-sample ML / REML implementations in spBayes, INLA, and related packages.)