Simple kriging from first principles

Part 5 — Kriging

Learning objectives

State the KRIGING PROBLEM: given $N$ samples $\{(\mathbf{s}_i, z_i)\}_{i=1}^N$ from a regionalised variable $Z(\mathbf{s})$ and a target location $\mathbf{s}_0$ , find a predictor $Z^*(\mathbf{s}_0)$ that is OPTIMAL in some defensible sense — minimum mean-squared error under an unbiasedness constraint. The LINEAR predictor $Z^*(\mathbf{s}_0) = \sum_{i=1}^N w_i Z(\mathbf{s}_i)$ is the workhorse; choosing the weights $w_i$ is the geostatistical problem
Apply the SIMPLE KRIGING ASSUMPTION: the field mean $m = E[Z(\mathbf{s})]$ is KNOWN — typically the global declustered mean from §1.4-§1.5, sometimes a regional reference value from prior studies. Reformulate the predictor on RESIDUALS $Y(\mathbf{s}) = Z(\mathbf{s}) - m$ : predict $Y^*(\mathbf{s}_0) = \sum_i w_i Y(\mathbf{s}_i)$ , then back-transform $Z^*(\mathbf{s}_0) = m + Y^*(\mathbf{s}_0)$ . The residual reformulation centres the problem at zero mean and lets the kriging system stay clean
Derive the SIMPLE KRIGING SYSTEM by MINIMISING THE MEAN-SQUARED ERROR $E[(Z^*(\mathbf{s}_0) - Z(\mathbf{s}_0))^2]$ . Differentiate w.r.t. each $w_i$ , set the gradient to zero, and you get the LINEAR SYSTEM $\sum_{j=1}^N w_j C(\mathbf{s}_i, \mathbf{s}_j) = C(\mathbf{s}_i, \mathbf{s}_0)$ for $i = 1, \ldots, N$ . In matrix form: $\mathbf{K} \mathbf{w} = \mathbf{k}$ where $\mathbf{K}$ is the sample-to-sample covariance matrix (entries $K_{ij} = C(\mathbf{s}_i, \mathbf{s}_j)$ ) and $\mathbf{k}$ is the sample-to-target covariance vector (entries $k_i = C(\mathbf{s}_i, \mathbf{s}_0)$ ). Solve by CHOLESKY or LU decomposition — $\mathbf{K}$ is symmetric positive-definite (permissibility from §4.1 guarantees this)
Compute the KRIGING VARIANCE $\sigma_K^2(\mathbf{s}_0) = C(0) - \mathbf{w}^\top \mathbf{k} = \sigma^2 - \sum_i w_i C(\mathbf{s}_i, \mathbf{s}_0)$ . This is ALWAYS $\ge 0$ because the covariance matrix is PSD — that's the §4.1 permissibility constraint cashed out. The variance DECREASES as more well-placed samples are added (more $w_i k_i$ to subtract) and INCREASES as the target moves farther from samples (smaller $k_i$ ). At zero nugget, $\sigma_K^2 = 0$ exactly at sample locations (kriging is an EXACT INTERPOLATOR there); with nugget, $\sigma_K^2 \to c_0$ at samples (the smoothed estimate differs from the data)
State the BLUE PROPERTY: under the simple-kriging assumption, the kriging predictor is the BEST LINEAR UNBIASED ESTIMATOR. BEST = minimum MSE; LINEAR = of the form $\sum_i w_i Z(\mathbf{s}_i)$ ; UNBIASED = $E[Z^*(\mathbf{s}_0)] = m$ . "Best" is conditional on the assumed variogram — wrong variogram, wrong kriging. "Linear" is a restriction: nonlinear predictors can do better when the field is non-Gaussian, but the linear class admits closed-form solutions and is what the kriging system gives you
Recognise the SCREEN EFFECT — kriging weights can be NEGATIVE. Two samples on the same side of the target: the one farther away gets a smaller (or negative) weight because the closer one already "covers" that direction's information. The kriging system handles this automatically through the off-diagonal terms of $\mathbf{K}$ . Negative weights are not a bug — they are the SK system's way of avoiding redundant information from clustered samples (Isaaks & Srivastava 1989 §13.1; Goovaerts 1997 §5.3)
Identify the SMOOTHING property: a kriged map $Z^*(\mathbf{s})$ is SMOOTHER than the true field. The variance of $Z^*$ over the domain is LESS than $\sigma^2$ — kriging averages out small-scale variability that the variogram model attributes to short-range structure or nugget. This is why "one kriged map" is the wrong question for risk decisions: you want REALISATIONS that preserve the field's roughness (Part 7's sequential Gaussian simulation), not a smoothed best-estimate. The kriging variance is a CALIBRATED measure of how badly the smooth map under-represents the true variability
Apply the COINCIDENT-SAMPLE handling: duplicate samples at the same location make $\mathbf{K}$ SINGULAR (two identical rows). Practical remedies: (a) COMBINE duplicates by averaging their values and treating the result as one sample; (b) PERTURB sample locations by a tiny jitter (smaller than half the closest non-coincident pair distance), which is a hack; (c) RAISE the nugget $c_0$ slightly so the diagonal entries dominate over off-diagonals (the regularisation route). A non-zero nugget improves the conditioning of $\mathbf{K}$ (more so the larger it is); this is one of the practical reasons many fits include a small nugget even when not strictly required by the data
Walk through a CONCRETE WORKED EXAMPLE in 1-D. Five samples on $[0, 1]$ with values, a Spherical variogram $c_0 = 0, c = 1, a = 0.5$ (in units of $x$). Target $s_0 = 0.55$. Compute the $5 \times 5$ covariance matrix $\mathbf{K}$, the $5\times 1$ vector $\mathbf{k}$, solve for weights by Cholesky, compute $Z^*(0.55)$ and $\sigma_K^2(0.55)$. Verify by hand the signs and rough magnitudes against your intuition for which samples should contribute most. Repeat with a shorter range $a = 0.20$: only the closest samples keep any weight and the variance rises
Apply the HONEST CAVEATS catalogue. (a) Simple kriging requires KNOWING $m$ ; if you ESTIMATE $m$ from the data and treat it as known, the variance $\sigma_K^2$ is UNDER-ESTIMATED — that's why ordinary kriging (§5.2) drops the known-mean assumption. (b) "BEST" means MSE-best under the ASSUMED variogram — a wrong variogram gives systematically wrong kriging (under-fit → over-confident; over-fit → under-confident — see §6.3 calibration of the kriging variance). (c) The LINEAR restriction is real — non-linear predictors like indicator kriging (§8.2) extract additional information for non-Gaussian fields. (d) Coincident-sample handling matters; the §5.1 framework assumes distinct locations
Locate §5.1 HONESTLY in Part 5. §5.1 is the FOUNDATION. §5.2 (ordinary kriging) drops the known-mean assumption by adding a Lagrange constraint. §5.3 (universal kriging, KED) adds a deterministic drift in addition to the random component. §5.4 (kriging variance) cashes out what $\sigma_K^2$ does and does not mean. §5.5 (block kriging) extends to averaging support changes. §5.6 (neighbourhood selection) restricts the $N$ samples used. §5.7 (cokriging) brings in secondary variables. §5.8 (pathologies) catalogues the modal failure modes. Every section extends or stresses the §5.1 system $\mathbf{K} \mathbf{w} = \mathbf{k}$
Use the WORKFLOW for a defensible simple-kriging report: (1) verify the §1.4-§1.5 declustered mean to use as $m$ ; (2) confirm the variogram model from Parts 3-4 is the right family with sensible nugget and range; (3) build $\mathbf{K}$ and $\mathbf{k}$ at each target location; (4) solve via Cholesky with jitter regularisation; (5) compute $Z^*(\mathbf{s}_0)$ and $\sigma_K^2(\mathbf{s}_0)$ ; (6) verify against cross-validation (§4.5 leave-one-out, Part 6 full QC). Report the predictor, the variance map, and the CV diagnostics that calibrate them

Part 4 ended with a fitted variogram model $\gamma(\mathbf{h}; \boldsymbol\theta)$ — a defensible permissible-family curve with a nugget $c_0$ , sill $c_0 + c$ , range $a$ , optional nested components, optional anisotropic wrap, and a cross-validation pass that confirmed the fit calibrates. That variogram is the input. §5.1 is the section where every page of Parts 3 and 4 finally pays off. We take $\gamma$ , plug it into a linear system whose entries are the covariance values $C(\mathbf{h}) = \sigma^2 - \gamma(\mathbf{h})$ at the data-to-data and data-to-target separations, solve for a vector of weights, and produce a PREDICTION at the unsampled location plus an honest VARIANCE for that prediction. That is kriging.

The word "kriging" honours the South African mining engineer Danie Krige, whose 1951 paper on gold-grade estimation in the Witwatersrand fields anticipated the optimal-weighting framework that Georges Matheron would formalise mathematically a decade later. Matheron 1963 christened the technique with Krige's name and proved its core property: under a known-mean assumption, the linear predictor whose weights solve a particular linear system minimises the mean-squared prediction error subject to unbiasedness. That predictor is what §5.1 calls SIMPLE KRIGING.

Simple kriging is the FOUNDATION of every later kriging variant. Ordinary kriging in §5.2 drops the known-mean assumption by adding a Lagrange constraint. Universal kriging and kriging with external drift in §5.3 add a deterministic mean function. Block kriging in §5.5 extends predictions to spatial averages over a target block. Cokriging in §5.7 brings in secondary variables. Every extension reuses the §5.1 derivation as its starting point and modifies the linear system in one specific, well-understood way. Understanding §5.1 thoroughly is the prerequisite for the rest of Part 5 — and for Parts 6 (validation), 7 (simulation), and 8 (indicator methods), which all rest on the kriging machinery built here.

This section develops the simple-kriging system from first principles. We state the prediction problem, justify the linear-predictor restriction, impose the known-mean assumption, derive the system $\mathbf{K} \mathbf{w} = \mathbf{k}$ by minimising MSE, give the closed-form expression for the kriging variance, walk through a five-sample worked example by hand, explore the properties (BLUE, exact interpolation, smoothing, the screen effect, coincident-sample handling), then catalogue the honest caveats. Two widgets bring the abstractions down to earth — a 1-D step-by-step that exposes every part of the kriging matrix system, and a 2-D estimate/variance side-by-side that teaches the variance map as a "trust map" of the sample layout. By the end you can take any fitted variogram and any sample dataset, build the kriging system, solve it, and produce calibrated predictions at unsampled locations.

The kriging problem

The set-up. You have $N$ samples $\{(\mathbf{s}_i, z_i)\}_{i=1}^N$, where each $\mathbf{s}_i \in \mathbb{R}^d$ is a known location and each $z_i = Z(\mathbf{s}_i)$ is the measured value at that location. The underlying random function $Z$ has covariance $C(\mathbf{h}) = \sigma^2 - \gamma(\mathbf{h})$ pinned down by the Parts 3-4 fitting workflow. You want to predict the value $Z(\mathbf{s}_0)$ at an UNSAMPLED location $\mathbf{s}_0$. Geometrically: you have a scattered set of dots in a 2-D or 3-D domain, each carrying a number, and you want to assign a number, plus an uncertainty, to every other point in the domain.

The PREDICTOR is some function of the data, call it $Z^*(\mathbf{s}_0)$ . There are infinitely many possible predictors; the kriging family restricts to a particular class and picks the optimal member.

The first restriction is LINEARITY. We insist that $Z^*(\mathbf{s}_0)$ depends LINEARLY on the data values:

Z^*(\mathbf{s}_0) \;=\; \sum_{i=1}^N w_i \, Z(\mathbf{s}_i) \;+\; w_0,

where $\{w_i\}_{i=1}^N$ and $w_0$ are coefficients we must choose. The class of LINEAR predictors is large enough to capture most geostatistical applications and small enough to admit a closed-form optimal solution. Nonlinear predictors (e.g. indicator kriging in §8.2, or non-parametric methods) can sometimes do better, particularly for heavy-tailed or strongly bimodal fields, but the linear class is the workhorse.

The second restriction is OPTIMALITY in the sense of MINIMUM MEAN-SQUARED ERROR subject to UNBIASEDNESS. Among the linear predictors, choose the $\{w_i\}, w_0$ that:

Make the predictor UNBIASED: $E[Z^*(\mathbf{s}_0) - Z(\mathbf{s}_0)] = 0$ .
Minimise the MEAN-SQUARED ERROR (MSE): $E\!\bigl[(Z^*(\mathbf{s}_0) - Z(\mathbf{s}_0))^2\bigr]$.

That joint criterion — Best Linear Unbiased Estimator, BLUE — defines the kriging family. Different KRIGING VARIANTS differ in what they assume about the mean of $Z$ , which is what determines the form of the unbiasedness constraint. Simple kriging assumes the MEAN IS KNOWN. Ordinary kriging assumes the mean is CONSTANT BUT UNKNOWN. Universal kriging assumes the mean is a SUM OF KNOWN BASIS FUNCTIONS WITH UNKNOWN COEFFICIENTS. §5.1 — this section — develops the SIMPLE KRIGING case. The other cases follow in §5.2 and §5.3.

The simple-kriging assumption — known mean

SIMPLE KRIGING (SK) makes the strong assumption that the mean of $Z$ is KNOWN:

E[Z(\mathbf{s})] \;=\; m, \qquad \text{constant in } \mathbf{s}, \text{ known a priori}.

"Known a priori" usually means: estimated from the §1.4-§1.5 DECLUSTERED MEAN of the same dataset, then PROMOTED to a known constant for the kriging run. The declustering machinery in Part 2 was designed for exactly this — to produce a defensible estimate of the field mean that is not biased by clustered sampling. Sometimes $m$ comes from REGIONAL PRIOR INFORMATION (a similar reservoir, a regulatory mean concentration, a depositional facies reference value); in those cases $m$ is genuinely known independently of the data being kriged.

The cleanest way to handle a known mean is to REFORMULATE on RESIDUALS. Define the residual random function

Y(\mathbf{s}) \;=\; Z(\mathbf{s}) - m, \qquad E[Y(\mathbf{s})] = 0.

$Y$ is a zero-mean random function with the SAME covariance as $Z$ : $\text{Cov}(Y(\mathbf{s}_i), Y(\mathbf{s}_j)) = \text{Cov}(Z(\mathbf{s}_i), Z(\mathbf{s}_j)) = C(\mathbf{s}_i, \mathbf{s}_j)$ . Kriging the residual is mathematically equivalent to kriging $Z$ with the known-mean correction baked in. We predict $Y$ at the target:

Y^*(\mathbf{s}_0) \;=\; \sum_{i=1}^N w_i \, Y(\mathbf{s}_i) \;=\; \sum_{i=1}^N w_i (Z(\mathbf{s}_i) - m),

and then back-transform:

Z^*(\mathbf{s}_0) \;=\; m + Y^*(\mathbf{s}_0) \;=\; m + \sum_{i=1}^N w_i (Z(\mathbf{s}_i) - m).

Equivalently, written in terms of $Z$ directly (this is the form you will see in code):

Z^*(\mathbf{s}_0) \;=\; \Bigl(1 - \sum_{i=1}^N w_i\Bigr) m \;+\; \sum_{i=1}^N w_i \, Z(\mathbf{s}_i).

Notice that the SK predictor has $w_0 = (1 - \sum_i w_i) m$: the constant offset is determined automatically by the weights and the known mean. There are only $N$ free coefficients $\{w_i\}_{i=1}^N$ to choose. Crucially, the SK weights do NOT have to sum to 1; they sum to whatever the kriging system says, which is typically less than 1. The sum-to-1 constraint is the signature of ORDINARY KRIGING (§5.2); SK gives it up in exchange for a known-mean assumption.
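The equivalence of the residual form and the direct form is easy to confirm numerically. The snippet below is a minimal check in Python with NumPy. The weights are illustrative placeholders (an assumption, not the output of any solver), and the mean and data values are borrowed from the worked example later in this section.

```python
import numpy as np

# Known mean and data values (taken from the worked example in this section).
m = 2.1
z = np.array([1.0, 2.0, 3.5, 2.5, 1.5])

# Illustrative weights only: any weight vector would demonstrate the identity.
w = np.array([-0.04, -0.03, 0.60, 0.40, -0.06])

# Residual form: krige Y = Z - m, then add the mean back.
z_star_residual = m + w @ (z - m)

# Direct form: constant offset w0 plus the weighted data.
w0 = (1.0 - w.sum()) * m
z_star_direct = w0 + w @ z

print(z_star_residual, z_star_direct, w.sum(), w0)
```

Both forms return the same number. The offset `w0` grows as the weight sum falls below 1, which is how simple kriging pulls the estimate toward $m$ when the data carry little information about the target.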

Deriving the kriging system from MSE minimisation

Here is the core derivation. Skip on first read if you trust the result; revisit when you need to debug a custom kriging implementation. The MSE we minimise is:

\text{MSE}(\mathbf{w}) \;=\; E\!\bigl[(Z^*(\mathbf{s}_0) - Z(\mathbf{s}_0))^2\bigr] \;=\; E\!\Bigl[\Bigl(\sum_i w_i (Z(\mathbf{s}_i) - m) - (Z(\mathbf{s}_0) - m)\Bigr)^2\Bigr].

Expanding the square and using linearity of expectation, the MSE decomposes into three covariance sums:

\text{MSE}(\mathbf{w}) \;=\; \sum_{i=1}^N \sum_{j=1}^N w_i w_j \, C(\mathbf{s}_i, \mathbf{s}_j) \;-\; 2 \sum_{i=1}^N w_i \, C(\mathbf{s}_i, \mathbf{s}_0) \;+\; C(0).

The first term is the variance of the predicted residual $Y^*(\mathbf{s}_0)$: a quadratic form in $\mathbf{w}$ with the sample-to-sample covariance matrix. The second is minus twice the covariance between the predictor and the truth. The third is the variance of $Y(\mathbf{s}_0)$ at the target, which doesn't depend on $\mathbf{w}$ at all. Minimising MSE means taking the partial derivative with respect to each $w_k$ and setting it to zero:

\frac{\partial \, \text{MSE}}{\partial w_k} \;=\; 2 \sum_{j=1}^N w_j \, C(\mathbf{s}_k, \mathbf{s}_j) \;-\; 2 \, C(\mathbf{s}_k, \mathbf{s}_0) \;=\; 0, \quad k = 1, \ldots, N.

Dividing by 2 gives the SIMPLE KRIGING SYSTEM, one equation per sample:

\sum_{j=1}^N w_j \, C(\mathbf{s}_i, \mathbf{s}_j) \;=\; C(\mathbf{s}_i, \mathbf{s}_0), \qquad i = 1, \ldots, N.

In matrix form:

\boxed{\;\mathbf{K} \, \mathbf{w} \;=\; \mathbf{k}\;}

where

$\mathbf{K}$ is the $N \times N$ SAMPLE-TO-SAMPLE COVARIANCE MATRIX with entries $K_{ij} = C(\mathbf{s}_i, \mathbf{s}_j)$. Each entry is the covariance between sample $i$ and sample $j$: a value of the variogram (via $C = \sigma^2 - \gamma$) at the pairwise separation distance. The diagonal entries are $K_{ii} = C(0) = c_0 + c$ (the total sill).
$\mathbf{w}$ is the $N \times 1$ vector of KRIGING WEIGHTS, the unknowns.
$\mathbf{k}$ is the $N \times 1$ SAMPLE-TO-TARGET COVARIANCE VECTOR with entries $k_i = C(\mathbf{s}_i, \mathbf{s}_0)$ .
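As a sketch of how $\mathbf{K}$ and $\mathbf{k}$ might be assembled in practice, the snippet below builds both from a Spherical covariance in 1-D and checks that the solution of the system zeroes the MSE gradient derived above. The function names (`spherical_cov`, `build_system`, `mse`), the sample locations and the variogram parameters are assumptions chosen for illustration; `np.linalg.solve` stands in for whatever solver a production code would use.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=0.5):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)    # zero at and beyond the range
    return np.where(h == 0.0, c0 + c, cov)    # total sill at zero separation

def build_system(x, x0, **vario):
    """Return (K, k) for 1-D sample coordinates x and a target x0."""
    x = np.asarray(x, dtype=float)
    K = spherical_cov(np.abs(x[:, None] - x[None, :]), **vario)
    k = spherical_cov(np.abs(x - x0), **vario)
    return K, k

def mse(w, K, k, sill):
    """MSE(w) = w'Kw - 2 w'k + C(0)."""
    return w @ K @ w - 2.0 * w @ k + sill

x = np.array([0.0, 0.3, 0.45, 0.8])           # hypothetical sample locations
K, k = build_system(x, x0=0.5, c0=0.0, c=1.0, a=0.5)
w = np.linalg.solve(K, k)

grad = 2.0 * K @ w - 2.0 * k                  # MSE gradient at the solution
rng = np.random.default_rng(0)
w_perturbed = w + 0.05 * rng.standard_normal(w.size)

print(np.abs(grad).max())
print(mse(w, K, k, 1.0), mse(w_perturbed, K, k, 1.0))
```

The gradient vanishes to rounding error at the solution, and any perturbed weight vector gives a larger MSE because the quadratic form in $\mathbf{K}$ is positive-definite.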

The matrix $\mathbf{K}$ is SYMMETRIC (by symmetry of the covariance) and POSITIVE-DEFINITE (the §4.1 permissibility constraint cashed out — a covariance must be a positive-definite function, and the resulting Gram matrix of any finite sample is therefore PSD; strict PD when the locations are distinct and the variogram has a nugget or a smoothly varying part). Symmetric positive-definite linear systems solve cleanly by CHOLESKY DECOMPOSITION: factor $\mathbf{K} = \mathbf{L} \mathbf{L}^\top$ where $\mathbf{L}$ is lower-triangular, then solve $\mathbf{L} \mathbf{y} = \mathbf{k}$ by forward substitution and $\mathbf{L}^\top \mathbf{w} = \mathbf{y}$ by back-substitution. For modest $N$ (a few hundred), Cholesky is fast and numerically stable. For larger $N$ — block kriging, simulation conditioning, or large-domain estimation — see §5.6 on neighbourhood selection.

For numerical robustness, real implementations add a tiny JITTER to the diagonal: $\mathbf{K}' = \mathbf{K} + \epsilon \, \mathbf{I}$ with $\epsilon \sim 10^{-9}$. This handles the rare cases where samples are nearly coincident (almost-singular $\mathbf{K}$) without destabilising the well-conditioned bulk. The widget implementations in this section use jitter regularisation.
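A minimal sketch of the Cholesky route with jitter, assuming NumPy only. The helper names are hypothetical, the substitution loops are written out for clarity rather than speed, and the 3×3 system is a made-up example; a library routine for triangular solves would normally replace the explicit loops.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for lower-triangular L."""
    y = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_substitution(U, y):
    """Solve U w = y for upper-triangular U."""
    w = np.zeros_like(y, dtype=float)
    for i in reversed(range(len(y))):
        w[i] = (y[i] - U[i, i + 1:] @ w[i + 1:]) / U[i, i]
    return w

def solve_sk(K, k, jitter=1e-9):
    """Solve (K + jitter * I) w = k by Cholesky factorisation."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    y = forward_substitution(L, k)       # L y = k
    return back_substitution(L.T, y)     # L' w = y

# Hypothetical 3-sample system (symmetric, unit diagonal).
K = np.array([[1.000, 0.432, 0.000],
              [0.432, 1.000, 0.208],
              [0.000, 0.208, 1.000]])
k = np.array([0.3125, 0.8505, 0.3125])

w = solve_sk(K, k)
print(w, np.abs(K @ w - k).max())
```

With $\epsilon = 10^{-9}$ the weights differ from the unregularised solution only far beyond the precision that matters for a kriging estimate, which is the point of choosing the jitter so small.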

The kriging variance

Solving $\mathbf{K} \mathbf{w} = \mathbf{k}$ gives the optimal weights. Plugging back into the MSE expression — and using the optimality condition $\mathbf{K} \mathbf{w} = \mathbf{k}$ to simplify the quadratic-form term — yields the SIMPLE-KRIGING VARIANCE:

\boxed{\;\sigma_K^2(\mathbf{s}_0) \;=\; C(0) \,-\, \mathbf{w}^\top \mathbf{k} \;=\; \sigma^2 \,-\, \sum_{i=1}^N w_i \, C(\mathbf{s}_i, \mathbf{s}_0).\;}

This is the prediction-error variance at the target location $\mathbf{s}_0$ , conditional on the data layout, the variogram, and the SK assumption. Three things to notice:

Always non-negative. $\sigma_K^2 \ge 0$ for any permissible variogram (§4.1): this is the permissibility constraint paying off. Two equivalent arguments make the point. First, $\sigma_K^2$ is the minimised MSE, the expectation of a squared error, so it cannot be negative. Second, with $\mathbf{w} = \mathbf{K}^{-1} \mathbf{k}$ we have $\mathbf{w}^\top \mathbf{k} = \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{k}$, and $C(0) - \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{k}$ is the Schur complement of $\mathbf{K}$ in the $(N+1) \times (N+1)$ covariance matrix of the samples plus the target, which is PSD by permissibility. The maximum value $\mathbf{w}^\top \mathbf{k}$ can reach is $C(0)$, attained when $\mathbf{s}_0$ coincides with a sample.
Bounded above by the total sill. When the target $\mathbf{s}_0$ is far from every sample, $\mathbf{k} \to \mathbf{0}$ , the kriging system gives $\mathbf{w} \to \mathbf{0}$ , and the variance approaches $C(0) = c_0 + c$ . This is the SK variance ceiling — kriging beyond all samples reverts to the prior variance $\sigma^2$ and the predictor reverts to $m$ .
Function of the layout and variogram only. The kriging variance does NOT depend on the data VALUES $\{z_i\}$, only on the sample LOCATIONS $\{\mathbf{s}_i\}$ and the variogram model. This is the basis for KRIGING-VARIANCE-DRIVEN sampling design: you can compute $\sigma_K^2$ at a candidate sample location BEFORE drilling that hole, and pick the location that reduces variance the most. The §5.4 chapter elaborates this point with all its caveats (the kriging variance is a function of layout and variogram, but the calibration of THE NUMBER to actual error depends on the variogram being right; see Part 6).
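The three observations above can be checked with a short sketch. The function `sk_variance` below is a hypothetical helper that takes only locations and variogram parameters (no data values enter its signature, which is the third point made concrete); the sample layout and Spherical parameters are assumptions borrowed from the worked example in this section.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=0.5):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

def sk_variance(x, x0, c0=0.0, c=1.0, a=0.5):
    """Simple-kriging variance at x0; note that no data values are needed."""
    x = np.asarray(x, dtype=float)
    K = spherical_cov(np.abs(x[:, None] - x[None, :]), c0, c, a)
    k = spherical_cov(np.abs(x - x0), c0, c, a)
    w = np.linalg.solve(K, k)
    return (c0 + c) - w @ k

x = np.array([0.10, 0.25, 0.45, 0.70, 0.90])   # sample locations only
grid = np.linspace(0.0, 1.0, 101)

var_on_grid = np.array([sk_variance(x, g) for g in grid])
var_far = sk_variance(x, 5.0)                  # target far beyond every sample

print(var_on_grid.min(), var_on_grid.max(), var_far)
```

Inside the sampled interval the variance stays between zero and the sill, and at the far target every entry of $\mathbf{k}$ is zero, so the variance equals the sill $c_0 + c$.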

Properties of simple kriging

The simple-kriging predictor has four properties worth committing to memory, plus one practical matter (coincident samples) that every implementation has to handle. They follow from the derivation above and from inspection of the closed-form expression for $\mathbf{w}$.

BLUE: Best Linear Unbiased Estimator. The kriging weights minimise MSE among linear unbiased predictors of $Z(\mathbf{s}_0)$ under the SK assumption. "BEST" is conditional on the assumed variogram: wrong variogram, wrong kriging, but the predictor IS optimal once the variogram is fixed. This is the cleanest mathematical justification for kriging over other linear interpolators (inverse-distance-weighted, Tobler-kernel, splines): those alternatives are not, in general, MSE-optimal under the fitted covariance model; kriging is.

Exact interpolator (in the no-nugget case). When the variogram has $c_0 = 0$ and $\mathbf{s}_0$ coincides with sample location $\mathbf{s}_i$, the kriging system reduces to $\mathbf{K} \mathbf{w} = \mathbf{K} \mathbf{e}_i$ (the $i$-th column of $\mathbf{K}$), so $\mathbf{w} = \mathbf{e}_i$. All weight concentrates on the one matching sample, and $Z^*(\mathbf{s}_i) = z_i$. The kriged surface PASSES THROUGH the data points exactly. The kriging variance is zero there: $\sigma_K^2(\mathbf{s}_i) = 0$. With a nonzero nugget $c_0 > 0$, this exactness BREAKS: the diagonal $C(0) = c_0 + c$ is strictly greater than $\lim_{h \to 0^+} C(h) = c$, so the kriged value at a sample differs slightly from the data, and the variance there is $\sigma_K^2(\mathbf{s}_i) = c_0$. The nugget reflects measurement noise plus unresolved microscale variability (§4.2), and the SMOOTHED-AT-SAMPLES behaviour is the right answer when those are real. Choosing whether to honour the data exactly is a §4.2 modelling choice that propagates to §5.1 via the kriging-system diagonal.
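The zero-nugget case can be verified directly. In the sketch below the target is placed on the third sample of the worked-example layout (an arbitrary choice), and the solver should return the unit vector $\mathbf{e}_3$; the helper name `spherical_cov` and the parameter values are assumptions for illustration.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=0.5):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

x = np.array([0.10, 0.25, 0.45, 0.70, 0.90])
z = np.array([1.0, 2.0, 3.5, 2.5, 1.5])
m = 2.1
i = 2                                    # target sits exactly on sample 3

K = spherical_cov(np.abs(x[:, None] - x[None, :]))
k = spherical_cov(np.abs(x - x[i]))      # equals the i-th column of K
w = np.linalg.solve(K, k)

z_star = m + w @ (z - m)
variance = 1.0 - w @ k                   # sill c0 + c = 1 with zero nugget

print(np.round(w, 12), z_star, variance)
```

Up to rounding, all weight lands on the matching sample, the estimate reproduces its value, and the variance is zero.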

Smoothing. A KRIGED MAP is SMOOTHER than the underlying field. Specifically, the variance of $Z^*$ over the domain is LESS than $\sigma^2 = c_0 + c$. Goovaerts 1997 §5.3 derives this: $\text{Var}(Z^*) = \sigma^2 - \overline{\sigma_K^2}$, where $\overline{\sigma_K^2}$ is the average kriging variance over the domain. Kriging produces a "best guess" surface that averages out short-range variability the variogram model attributes to short-range structure or nugget. The resulting surface is appropriate for ESTIMATION (the most likely value at each location) but underrepresents the true field roughness. This is why one kriged map is the WRONG question for risk assessment, where the spread of plausible field values around the smooth estimate is exactly what matters; the answer there is REALISATIONS from Part 7's sequential Gaussian simulation, each of which preserves the field roughness.
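A pointwise version of the smoothing relation can be checked without simulating a field. Under the model, the variance of the SK estimator at a single target is $\mathbf{w}^\top \mathbf{K} \mathbf{w}$, and because $\mathbf{K} \mathbf{w} = \mathbf{k}$ this equals $\sigma^2 - \sigma_K^2(\mathbf{s}_0)$. The sketch below evaluates both sides on a 1-D grid; the layout and variogram parameters are assumptions reused from the worked example.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=0.5):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

x = np.array([0.10, 0.25, 0.45, 0.70, 0.90])
grid = np.linspace(0.0, 1.0, 101)
sill = 1.0

K = spherical_cov(np.abs(x[:, None] - x[None, :]))
k = spherical_cov(np.abs(x[:, None] - grid[None, :]))   # one column per target
W = np.linalg.solve(K, k)                               # one weight column per target

var_estimator = np.sum(W * (K @ W), axis=0)   # w'Kw: model variance of Z* at each target
var_kriging = sill - np.sum(W * k, axis=0)    # sigma_K^2 at each target

print(var_estimator.mean(), var_kriging.mean(), sill)
```

The two averages add up to the sill: whatever variance the estimator gives up relative to $\sigma^2$ is exactly the average kriging variance.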

The screen effect. Kriging weights can be NEGATIVE. Consider three samples in a row at $x = 0, 0.4, 1.0$ and a target at $\mathbf{s}_0 = 0.5$ . Naive intuition says weight should fall off with distance, so the closest sample at 0.4 gets the most, the 0.0-sample gets less, the 1.0-sample gets the least — all positive. The kriging system gives a different answer: the 0.4-sample gets a LARGE POSITIVE weight, the 1.0-sample gets a MODEST POSITIVE weight, but the 0.0-sample gets a slightly NEGATIVE weight. Why? The 0.4-sample sits BETWEEN the 0.0-sample and the target — it already conveys whatever spatial information the 0.0-sample would have contributed. The 0.0-sample is SCREENED OUT by the intervening 0.4-sample, and the kriging system's response is to assign it a small negative weight that corrects for the residual redundancy. This is the SCREEN EFFECT (Isaaks & Srivastava 1989 §13.1; Chilès & Delfiner 2012 §3.4.2). Negative weights are not a bug — they are the kriging system handling redundant clustered information automatically. The first widget below makes the screen effect visible in 1-D.
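The three-sample configuration can be solved directly. The text does not fix a variogram for this example, so the sketch below assumes a Spherical model with $c_0 = 0$, $c = 1$ and range $a = 1.0$; other choices change the magnitudes and can change the sign of the screened weight.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=1.0):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

x = np.array([0.0, 0.4, 1.0])    # three samples in a row
x0 = 0.5                         # target just right of the middle sample

K = spherical_cov(np.abs(x[:, None] - x[None, :]))
k = spherical_cov(np.abs(x - x0))
w = np.linalg.solve(K, k)

for xi, wi in zip(x, w):
    print(f"sample at x = {xi:.1f}: weight {wi:+.3f}")
```

Under these assumptions the 0.4-sample takes most of the weight, the 1.0-sample a modest positive share, and the 0.0-sample a small negative weight, even though it is exactly as far from the target as the 1.0-sample. The only difference between the two outer samples is that one of them has a closer sample standing between it and the target.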

Coincident-sample handling. If two samples are at the same location, the corresponding rows of $\mathbf{K}$ are identical and the matrix is SINGULAR. Cholesky decomposition fails. Practical remedies, in order of cleanliness: (a) COMBINE the duplicates by averaging (or by majority vote for categorical data) and treat the result as one sample at that location — this is the cleanest fix; (b) RAISE the nugget $c_0$ slightly so the diagonal dominates over the off-diagonals — the matrix becomes well-conditioned (this is essentially what jitter regularisation does numerically); (c) PERTURB the locations by a tiny jitter smaller than half the closest non-coincident pair distance — but this is a hack and changes the implied covariance. In practice, real datasets occasionally have coincident or near-coincident samples (re-sampling at the same well; multiple labs on the same core split). The widget implementations in this section use jitter regularisation to handle near-coincident cases gracefully.
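Remedy (a) is straightforward to sketch for 1-D data. The helper `merge_duplicates` below is hypothetical and only merges locations that match exactly; near-coincident samples would need a distance tolerance instead. The data and variogram parameters are made up for illustration.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=0.5):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

def cov_matrix(x):
    """Sample-to-sample covariance matrix for 1-D locations x."""
    return spherical_cov(np.abs(x[:, None] - x[None, :]))

def merge_duplicates(x, z):
    """Average the values of samples that share exactly the same location."""
    x_unique, inverse = np.unique(x, return_inverse=True)
    z_mean = np.bincount(inverse, weights=z) / np.bincount(inverse)
    return x_unique, z_mean

x = np.array([0.10, 0.40, 0.40, 0.90])   # hypothetical data, two samples coincide
z = np.array([1.0, 2.0, 3.0, 4.0])

rank_raw = np.linalg.matrix_rank(cov_matrix(x))   # rank-deficient: two equal rows

x_merged, z_merged = merge_duplicates(x, z)
L = np.linalg.cholesky(cov_matrix(x_merged))      # factorises once duplicates are merged

print(rank_raw, x_merged, z_merged)
```

Averaging keeps the covariance model untouched, which is why it ranks above the jitter and perturbation routes in the list above.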

A worked five-sample example by hand

Concrete numbers help. Consider five samples on the unit interval $[0, 1]$ with values:

\begin{array}{c|ccccc} i & 1 & 2 & 3 & 4 & 5 \\ \hline x_i & 0.10 & 0.25 & 0.45 & 0.70 & 0.90 \\ z_i & 1.0 & 2.0 & 3.5 & 2.5 & 1.5 \end{array}

Let $m = 2.1$ (the data mean), variogram $\gamma_{\text{Sph}}$ with $c_0 = 0, c = 1, a = 0.5$ . Target $\mathbf{s}_0 = 0.55$ .

Step 1: build $\mathbf{K}$ . Compute $C(|x_i - x_j|) = (c_0 + c) - \gamma(|x_i - x_j|)$ for every pair. With Spherical $\gamma(h) = c (1.5 (h/a) - 0.5 (h/a)^3)$ for $h < a$ and $\gamma = c$ otherwise. The $5\times 5$ matrix has diagonal entries $C(0) = 1$ (sill) and off-diagonals $C(h)$ that drop with separation. For example $h_{12} = |0.10 - 0.25| = 0.15$ , so $\gamma_{12} = (1.5 \cdot 0.3 - 0.5 \cdot 0.027) = 0.4365$ and $C_{12} = 1 - 0.4365 = 0.5635$ . The full $\mathbf{K}$ (rounded to 3 dp) is approximately:

\mathbf{K} \approx \begin{pmatrix} 1.000 & 0.564 & 0.122 & 0.000 & 0.000 \\ 0.564 & 1.000 & 0.432 & 0.014 & 0.000 \\ 0.122 & 0.432 & 1.000 & 0.313 & 0.014 \\ 0.000 & 0.014 & 0.313 & 1.000 & 0.432 \\ 0.000 & 0.000 & 0.014 & 0.432 & 1.000 \end{pmatrix}

(Pairs separated by the range $a = 0.5$ or more have exactly zero covariance under the Spherical model, which is where the zero entries come from.)

Step 2: build $\mathbf{k}$. The distances from the five samples to $\mathbf{s}_0 = 0.55$ are $0.45, 0.30, 0.10, 0.15, 0.35$, so:

\mathbf{k} = \begin{pmatrix} C(0.45) \\ C(0.30) \\ C(0.10) \\ C(0.15) \\ C(0.35) \end{pmatrix} \approx \begin{pmatrix} 0.014 \\ 0.208 \\ 0.704 \\ 0.564 \\ 0.122 \end{pmatrix}

Step 3: solve $\mathbf{K} \mathbf{w} = \mathbf{k}$. Cholesky gives $\mathbf{w} \approx (-0.04,\, -0.03,\, 0.60,\, 0.40,\, -0.06)$. The two closest samples ($x = 0.45$ and $x = 0.70$) take the lion's share (0.60 + 0.40 = 1.00); the outer samples have small NEGATIVE weights, which is the screen effect. The weights sum to about 0.87, not 1.

Step 4: compute the estimate. $Z^*(0.55) = m + \sum_i w_i (z_i - m) = 2.1 + (-0.04)(-1.1) + (-0.03)(-0.1) + 0.60(1.4) + 0.40(0.4) + (-0.06)(-0.6) \approx 2.1 + 0.044 + 0.003 + 0.840 + 0.160 + 0.036 \approx 3.18$.

Step 5: compute the variance. $\sigma_K^2(0.55) = 1.0 - \mathbf{w}^\top \mathbf{k} \approx 1.0 - 0.633 \approx 0.37$ (so $\sigma_K \approx 0.61$). The estimate is between the two closest sample values (3.5 and 2.5, weighted toward 3.5 because it's slightly closer) and the variance is well below the prior sill of 1.0 because we have two samples within the range.
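The hand calculation can be cross-checked with a few lines of NumPy. This is a minimal sketch, not the widget code: the function name `simple_krige_1d` is hypothetical, and the two triangular solves use the general-purpose `np.linalg.solve` for brevity where a dedicated triangular solver would normally be used.

```python
import numpy as np

def spherical_cov(h, c0, c, a):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

def simple_krige_1d(x, z, m, x0, c0=0.0, c=1.0, a=0.5):
    """Return (weights, estimate, variance) for simple kriging at x0."""
    K = spherical_cov(np.abs(x[:, None] - x[None, :]), c0, c, a)   # Step 1
    k = spherical_cov(np.abs(x - x0), c0, c, a)                    # Step 2
    L = np.linalg.cholesky(K)                                      # Step 3: K = L L'
    y = np.linalg.solve(L, k)                                      #   L y = k
    w = np.linalg.solve(L.T, y)                                    #   L' w = y
    estimate = m + w @ (z - m)                                     # Step 4
    variance = (c0 + c) - w @ k                                    # Step 5
    return w, estimate, variance

x = np.array([0.10, 0.25, 0.45, 0.70, 0.90])
z = np.array([1.0, 2.0, 3.5, 2.5, 1.5])

w, z_star, var = simple_krige_1d(x, z, m=2.1, x0=0.55)
print(np.round(w, 3), round(float(z_star), 2), round(float(var), 2))
```

The range is a keyword argument, so the same call with a different `a` reruns the whole example under another variogram.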

Now repeat with a SHORTER range $a = 0.20$. The covariance $C$ drops off faster, so $\mathbf{K}$ is almost the identity (the only non-zero off-diagonal is $C(0.15) \approx 0.086$ between samples 1 and 2), and $\mathbf{k}$ is zero everywhere except for the sample at $x = 0.45$ (distance $0.10 < 0.20$, $C \approx 0.313$) and the sample at $x = 0.70$ (distance $0.15 < 0.20$, $C \approx 0.086$). Solving again, only the two closest samples keep any weight, and those weights shrink ($\mathbf{w} \approx (0.00, 0.00, 0.31, 0.09, 0.00)$). The estimate falls back toward the mean ($Z^*(0.55) \approx 2.57$) and the variance climbs ($\sigma_K^2 \approx 0.89$) because the short-range variogram says even the nearest samples carry little information about the target. The same data, the same target, different variograms: different kriging answers. This is why the §4.5 fitting workflow matters: the variogram you fit IS the kriging system.

The first widget for §5.1 makes the kriging system explorable in 1-D. Six samples on $[0, 1]$ with a known mean $\mu$ , a Spherical variogram with adjustable $c_0, c, a$ . The reader drags the query point $\mathbf{s}_0$ along the x-axis; the widget recomputes the kriging system at each position and shows the result in three panels.

The TOP panel shows the samples plus the KRIGED CURVE $Z^*$ sweeping across the domain with a ±1 $\sigma_K$ shaded envelope, plus the current query point as a diamond. Watch what the kriged curve does: it threads the samples (or comes close to them, modulo nugget) and the variance envelope BULGES in regions far from samples. The MIDDLE panel is the bar chart of the six WEIGHTS $\{w_i\}$ at the current query point. Some bars are POSITIVE (the closest samples contribute toward the prediction); some can be NEGATIVE (the screen effect: an intervening sample shielding a more distant one). The BOTTOM panel reports the numerical solve: $Z^*(\mathbf{s}_0)$, $\sigma_K^2(\mathbf{s}_0)$, $\sigma_K(\mathbf{s}_0)$, $\Sigma w_i$, and the sill $c_0 + c$.

Three things to do with this widget. FIRST, drag the query between sample 3 and sample 4 (around $x = 0.5$ ). Watch the weights: samples 3 and 4 dominate, while samples 1, 5, and 6 sit near zero or slightly negative. The estimate is essentially the midpoint of samples 3 and 4, weighted toward whichever is closer. The variance is small — both nearest samples are well within the range. Now slide the query far to the right ( $x = 0.97$ ). Sample 6 takes most of the weight; the variance starts to climb because the prior variance $c_0 + c$ kicks in as $\mathbf{k}$ thins out.

SECOND, set the range $a$ very small (0.10). The covariance kernel becomes spiky — only the closest sample matters. Drag the query and watch the weights now concentrate on a SINGLE sample at a time. As you slide past each sample, the dominant weight transfers cleanly between them; the kriged curve has a piecewise nearest-neighbour shape; the variance is high everywhere except right at samples. Now set the range $a$ very large (1.50). The kernel is essentially constant — all samples are "near". Weights spread out, the kriged curve becomes a slowly-varying near-constant, and the variance is uniformly low. The variogram's range controls how strongly local neighbours dominate.

THIRD, raise the nugget $c_0$ to 0.30. The kriging system's diagonal grows relative to the off-diagonals, so the system "trusts" each sample less. The kriged curve no longer passes through the samples — exact interpolation breaks at non-zero nugget. The variance at sample locations doesn't drop to zero; instead it bottoms out at $c_0$ . This is the §4.2 nugget at work in the kriging system: data with measurement noise gets smoothed, and the smoothing is calibrated by the nugget.

The kriging variance as a "trust map"

In 2-D the kriging variance becomes a SECOND MAP overlaid on the estimate map. Where the estimate map tells you "what" the field looks like, the variance map tells you "how trustworthy" each location is — low $\sigma_K^2$ means the prediction is well-constrained by nearby data, high $\sigma_K^2$ means the prediction relies heavily on the prior (the mean $m$ and the variogram). Crucially, the variance map is a function of the SAMPLE LAYOUT and the VARIOGRAM ONLY — it doesn't depend on the data VALUES at the samples.

This data-independence has a profound consequence: you can compute the kriging variance at any candidate sample location BEFORE drilling that hole. The variance reduction expected from a proposed new sample is computable from layout alone. This is the basis of KRIGING-DRIVEN SAMPLING DESIGN — choose the new sample location that maximises the predicted variance reduction, then drill there. The §5.4 chapter develops this in full with appropriate caveats; §6.3 develops the calibration check that confirms the variance NUMBERS are right.
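A greedy version of this design loop fits in a few lines. The sketch below is an illustration under stated assumptions, not the §5.4 procedure: it works in 1-D, scores each candidate by the domain-average kriging variance after adding it, and uses a made-up layout with a deliberate gap. The names `mean_sk_variance`, `existing`, `candidates` and the Spherical parameters are all hypothetical.

```python
import numpy as np

def spherical_cov(h, c0=0.0, c=1.0, a=0.3):
    """Covariance C(h) = (c0 + c) - gamma(h) for a Spherical variogram."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)
    cov = c * (1.0 - 1.5 * r + 0.5 * r**3)
    return np.where(h == 0.0, c0 + c, cov)

def mean_sk_variance(x, targets, c0=0.0, c=1.0, a=0.3):
    """Average simple-kriging variance over the targets (no data values used)."""
    x = np.asarray(x, dtype=float)
    K = spherical_cov(np.abs(x[:, None] - x[None, :]), c0, c, a)
    k = spherical_cov(np.abs(x[:, None] - targets[None, :]), c0, c, a)
    W = np.linalg.solve(K, k)                     # one weight column per target
    return ((c0 + c) - np.sum(W * k, axis=0)).mean()

existing = np.array([0.05, 0.15, 0.25, 0.90])     # layout with a gap in the middle
targets = np.linspace(0.0, 1.0, 101)              # where the variance is evaluated
candidates = np.linspace(0.0, 1.0, 26)            # step 0.04, never on an existing sample

baseline = mean_sk_variance(existing, targets)
scores = np.array([mean_sk_variance(np.append(existing, cand), targets)
                   for cand in candidates])
best = candidates[np.argmin(scores)]

print(baseline, scores.min(), best)
```

No data values appear anywhere in the calculation, so the ranking of candidates is available before any new sample is collected. It is only as trustworthy as the variogram that produced it.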

The second widget for §5.1 makes the variance map explicit by drawing the estimate and the variance side by side.

Three layouts are pre-built. CLUSTERED puts ten samples in a tight cluster in one corner of the unit square plus three scattered outliers. UNIFORM places samples on a jittered 4×4 grid. RANDOM places about fourteen samples by Poisson-disk rejection sampling, so the points are random but keep a minimum separation. For each layout the widget computes the kriged estimate $Z^*(\mathbf{s})$ and the kriging variance $\sigma_K^2(\mathbf{s})$ on a 60×60 grid and renders both as heatmaps. The reader controls the Spherical variogram parameters and the known mean $\mu$ (the same known mean written $m$ in the derivation).

Three things to do with this widget. FIRST, pick the CLUSTERED layout. The estimate map (left) shows a smooth surface; the variance map (right) shows a LOW-VARIANCE BLUE PATCH in the cluster region (where information is dense) and HIGH-VARIANCE RED everywhere else. The variance approaches the sill $(c_0 + c)$ where the data have no leverage. Now switch to UNIFORM. The variance map spreads evenly — every grid cell has a nearby sample, so no region is starved. Now switch to RANDOM. Variance lows mirror the sample positions in a more irregular pattern; highs sit in the gaps between samples.

SECOND, slide the known mean $\mu$ up and down. Watch the estimate map shift (the prior reference value moves, so the kriged map shifts) but the VARIANCE MAP DOES NOT CHANGE. This is the data-independence of the variance: it depends only on the sample layout and the variogram, not on the data values or the known mean. This is the property that makes kriging-driven sampling design work — you can plan where to drill next based on the layout alone.
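The same point can be read off the algebra: the weights and the variance are computed without ever touching the mean, and the estimate can be written as $\sum_i w_i z_i + (1 - \sum_i w_i)\,\mu$ , so a change in the mean moves the estimate by $(1 - \sum_i w_i)$ times that change. The sketch below checks this on a hypothetical 1-D configuration; the sample positions, values and variogram parameters are illustrative assumptions.

```python
import numpy as np

def spherical_corr(h, a):
    """Spherical correlation: 1 at h = 0, decaying to 0 for h >= a."""
    r = np.minimum(np.abs(h) / a, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

def sk_weights_and_variance(x0, xs, c0, c, a):
    """Weights and variance depend on geometry and variogram only (no mean, no z)."""
    K = c * spherical_corr(xs[:, None] - xs[None, :], a) + c0 * np.eye(len(xs))
    k = c * spherical_corr(xs - x0, a)
    w = np.linalg.solve(K, k)
    return w, (c0 + c) - w @ k

# Hypothetical samples and variogram.
xs = np.array([0.10, 0.30, 0.45, 0.55, 0.80])
zs = np.array([1.2, 0.4, 0.9, 1.1, 0.3])
w, variance = sk_weights_and_variance(0.65, xs, c0=0.0, c=1.0, a=0.40)

# Only the back-transform uses the mean.
estimates = {mean: mean + w @ (zs - mean) for mean in (0.0, 1.0)}
shift = estimates[1.0] - estimates[0.0]
print(f"variance (same for every mean): {variance:.4f}")
print(f"estimate shift per unit of mean: {shift:.4f} = 1 - sum(w) = {1.0 - w.sum():.4f}")
```

Close to samples the weights sum to nearly 1 and the estimate barely responds to the mean; far from samples the weights shrink and the estimate follows the mean almost one-for-one.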

THIRD, vary the variogram range $a$ . Short range (a = 0.10) and the variance map has tight low-variance bubbles around each sample with steep gradients to the sill elsewhere — only very nearby data matters. Long range (a = 0.80) and the variance lows MERGE into a broader region of "well-covered" space; the high-variance regions are confined to the corners. The variogram range encodes how far a sample's information reaches — and the kriging variance map visualises that reach.
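A small numerical check of this "reach" effect is sketched below. It is an illustration under stated assumptions, not the widget's implementation: the layout is a regular (unjittered) 4×4 grid, the nugget is zero, the sill is 1, and "well covered" is arbitrarily defined as kriging variance below half the sill.

```python
import numpy as np

def spherical_corr(h, a):
    """Spherical correlation: 1 at h = 0, decaying to 0 for h >= a."""
    r = np.minimum(h / a, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

def sk_variance(samples, targets, c0, c, a):
    """Simple kriging variance at each target location."""
    d_ss = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    d_st = np.linalg.norm(samples[:, None, :] - targets[None, :, :], axis=-1)
    K = c * spherical_corr(d_ss, a) + c0 * np.eye(len(samples))
    k = c * spherical_corr(d_st, a)
    W = np.linalg.solve(K, k)
    return (c0 + c) - np.sum(W * k, axis=0)

# Hypothetical regular 4x4 layout on the unit square.
side = np.array([0.125, 0.375, 0.625, 0.875])
samples = np.array([(x, y) for x in side for y in side])
axis = np.linspace(0.0, 1.0, 30)
grid = np.array([(x, y) for x in axis for y in axis])

sill = 1.0
coverage = {}
for a in (0.10, 0.80):
    var = sk_variance(samples, grid, 0.0, sill, a)
    coverage[a] = np.mean(var < 0.5 * sill)   # fraction of "well-covered" cells
    print(f"a = {a:.2f}: {100 * coverage[a]:.0f}% of cells below half the sill")
```

At the short range only cells almost on top of a sample qualify; at the long range the low-variance regions merge and cover most of the square.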

Honest caveats — what simple kriging assumes

SK is the foundation of the kriging family, but it rests on assumptions that any honest practitioner should keep visible.

The mean is KNOWN. If you estimate $m$ from the data and promote that estimate to a known constant, the kriging variance $\sigma_K^2$ UNDER-ESTIMATES the actual prediction error. The reason: the SK system treats $m$ as a fixed constant, but the estimate $\hat m$ has its own sampling variance which is not propagated into $\sigma_K^2$ . The fix is ORDINARY KRIGING (§5.2), which drops the known-mean assumption and enforces unbiasedness through a sum-to-one constraint on the weights, imposed with a Lagrange multiplier. In practice, almost all production geostatistics workflows use ordinary kriging by default for exactly this reason; simple kriging is reserved for cases where $m$ is genuinely known from prior information (regulatory references, regional means, depositional facies priors) rather than estimated from the data being kriged.

"Best" is conditional on the variogram. Kriging is BLUE under the assumed variogram. If the variogram is wrong, the kriging is wrong, usually silently. An UNDER-FIT variogram (too small a sill or nugget, or too long a range, so the model claims less variability or more continuity than the field has) gives a kriging variance that UNDER-ESTIMATES actual error: the kriged map looks well-constrained but isn't. An OVER-FIT variogram (too large a sill or nugget, or too short a range) gives $\sigma_K^2$ that OVER-ESTIMATES actual error: the predictions are conservative beyond what the data support. The §4.5 cross-validation diagnostic (SD of standardised residuals near 1) is the empirical test that the variogram is right. §6.3 develops the full calibration apparatus.

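The standardised-residual check can be sketched as a leave-one-out loop. The example below is a hedged illustration on synthetic data: the point layout, the mean and the Spherical parameters are all hypothetical, and the field is simulated from the assumed model so that the "right" variogram is known. It then repeats the cross-validation with the sill and nugget both divided by four, which leaves the weights unchanged but quarters every kriging variance.

```python
import numpy as np

def spherical_corr(h, a):
    """Spherical correlation: 1 at h = 0, decaying to 0 for h >= a."""
    r = np.minimum(h / a, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

def cov_matrix(pts, c0, c, a):
    """Sample-to-sample covariance with the nugget on the diagonal."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return c * spherical_corr(d, a) + c0 * np.eye(len(pts))

def loo_standardised_residuals(pts, zs, m, c0, c, a):
    """Leave each sample out, krige it from the rest, standardise the error."""
    C = cov_matrix(pts, c0, c, a)
    n = len(pts)
    z = np.empty(n)
    for i in range(n):
        idx = np.delete(np.arange(n), i)
        w = np.linalg.solve(C[np.ix_(idx, idx)], C[idx, i])
        estimate = m + w @ (zs[idx] - m)
        variance = C[i, i] - w @ C[idx, i]
        z[i] = (zs[i] - estimate) / np.sqrt(variance)
    return z

# Synthetic field drawn from the assumed model (all values hypothetical).
rng = np.random.default_rng(42)
pts = rng.uniform(size=(60, 2))
m, c0, c, a = 0.18, 0.1, 0.9, 0.3
L = np.linalg.cholesky(cov_matrix(pts, c0, c, a))
zs = m + L @ rng.standard_normal(len(pts))

z_right = loo_standardised_residuals(pts, zs, m, c0, c, a)
z_small = loo_standardised_residuals(pts, zs, m, 0.25 * c0, 0.25 * c, a)
print(f"SD(z), generating variogram: {z_right.std():.2f}")
print(f"SD(z), sill and nugget / 4:  {z_small.std():.2f}")
```

With the sill and nugget understated by a factor of four, SD(z) is exactly twice its value under the generating model: the variance numbers are too small, which is the signature of an over-confident kriging variance.
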
The LINEAR restriction. Kriging weights are constants — the predictor is linear in the data values. Non-linear predictors (indicator kriging in §8.2, kernel methods, generative models) can extract additional information for non-Gaussian fields. A heavy-tailed dataset where the upper tail matters disproportionately — high ore grades, contamination hotspots — is often better served by indicator methods that target specific cumulative-probability cutoffs rather than the conditional mean. SK gives you a calibrated mean-and-variance map; it doesn't give you the conditional probability of exceeding a threshold.

Coincident or near-coincident samples. The system $\mathbf{K} \mathbf{w} = \mathbf{k}$ becomes singular when two samples share a location and the nugget is zero (two rows of $\mathbf{K}$ are then identical), and ill-conditioned when samples nearly coincide. Combine duplicates by averaging, raise the nugget slightly, or use jitter regularisation. Production-quality kriging codes (GSLIB's kt3d, gstat, Petrel's kriging) all handle this defensively.
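Two of these remedies can be sketched in a few lines. The example below is illustrative only: the points, values, target, mean and variogram are hypothetical, the nugget is set to zero so that the duplicate really does make $\mathbf{K}$ rank-deficient, and the merge step handles exact duplicates only (near-coincident samples would need a distance tolerance).

```python
import numpy as np

def spherical_corr(h, a):
    """Spherical correlation: 1 at h = 0, decaying to 0 for h >= a."""
    r = np.minimum(h / a, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

def cov_matrix(pts, c, a):
    """Zero-nugget covariance: coincident samples give identical rows."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return c * spherical_corr(d, a)

# Hypothetical data with one exact duplicate location.
pts = np.array([[0.2, 0.2], [0.5, 0.5], [0.5, 0.5], [0.8, 0.3]])
zs = np.array([1.0, 2.0, 2.4, 0.5])
target = np.array([0.45, 0.40])
m, c, a = 1.0, 1.0, 0.5

rank = np.linalg.matrix_rank(cov_matrix(pts, c, a))
print(f"rank of K: {rank} for {len(pts)} samples")   # rank-deficient

def sk_estimate(pts, zs, jitter=0.0):
    K = cov_matrix(pts, c, a) + jitter * np.eye(len(pts))
    k = c * spherical_corr(np.linalg.norm(pts - target, axis=1), a)
    w = np.linalg.solve(K, k)
    return m + w @ (zs - m)

# Remedy 1: merge exact duplicates by averaging their values.
uniq, inverse = np.unique(pts, axis=0, return_inverse=True)
inverse = inverse.ravel()                 # keep 1-D across NumPy versions
z_merged = np.bincount(inverse, weights=zs) / np.bincount(inverse)
est_merged = sk_estimate(uniq, z_merged)

# Remedy 2: keep all samples and add a tiny jitter to the diagonal.
est_jitter = sk_estimate(pts, zs, jitter=1e-8 * c)
print(f"merged: {est_merged:.6f}   jittered: {est_jitter:.6f}")
```

The two estimates agree closely because the jittered system splits the merged sample's weight evenly between the duplicates.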

Stationarity. The kriging derivation assumes second-order stationarity of $Z$ : constant mean and translation-invariant covariance. Real fields often have TRENDS (mean varying with location) that violate this. §5.3's universal kriging and kriging-with-external-drift handle this by modelling the trend explicitly. Simple kriging applied to a trending field will produce systematic bias, visible in cross-validation as a mean standardised residual, mean(z), significantly different from zero.

Computational scaling. Solving the $N \times N$ system by direct Cholesky factorisation costs $O(N^3)$ . For a few hundred samples this is fast; for tens of thousands it is prohibitive. §5.6 develops neighbourhood selection (using only the $k$ nearest samples to the target), which reduces the cost to $O(k^3)$ per target and is the practical default for production kriging at scale.
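A minimal sketch of the neighbourhood idea follows, ahead of the fuller treatment in §5.6. The function name `sk_local`, the parameter `n_near`, the brute-force neighbour search and all data are assumptions for illustration; the sample values are unstructured placeholders, since only the mechanics matter here. A production code would typically use a spatial index for the search.

```python
import numpy as np

def spherical_corr(h, a):
    """Spherical correlation: 1 at h = 0, decaying to 0 for h >= a."""
    r = np.minimum(h / a, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

def sk_local(target, pts, zs, m, c0, c, a, n_near):
    """Simple kriging using only the n_near samples closest to the target."""
    dist = np.linalg.norm(pts - target, axis=1)
    near = np.argsort(dist)[:n_near]                 # brute-force neighbour search
    sub = pts[near]
    d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
    K = c * spherical_corr(d, a) + c0 * np.eye(len(sub))
    k = c * spherical_corr(dist[near], a)
    w = np.linalg.solve(K, k)                        # n_near x n_near system
    return m + w @ (zs[near] - m), (c0 + c) - w @ k

# Hypothetical layout; values are placeholders with no spatial structure.
rng = np.random.default_rng(7)
pts = rng.uniform(size=(200, 2))
zs = rng.normal(1.0, 0.3, size=200)
target = np.array([0.5, 0.5])
m, c0, c, a = 1.0, 0.05, 0.95, 0.30

est_all, var_all = sk_local(target, pts, zs, m, c0, c, a, n_near=len(pts))
est_loc, var_loc = sk_local(target, pts, zs, m, c0, c, a, n_near=12)
print(f"all 200 samples: estimate {est_all:.4f}, variance {var_all:.4f}")
print(f"12 nearest only: estimate {est_loc:.4f}, variance {var_loc:.4f}")
```

The local variance can never fall below the global one, because discarding samples cannot add information; the gap is what the neighbourhood trades for speed.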

§5.1 in the architecture of Part 5

§5.1 is the FOUNDATION. The next seven sections each extend or stress the system $\mathbf{K} \mathbf{w} = \mathbf{k}$ in one specific direction. §5.2 (ordinary kriging) drops the known-mean assumption by introducing a Lagrange multiplier — the matrix grows by one row and column, the right-hand side gets one extra entry, and the weights sum to 1. §5.3 (universal kriging, KED) generalises the mean to a sum of basis functions with unknown coefficients — additional constraint rows for each basis function. §5.4 cashes out what the kriging variance does and does not tell you, with calibration caveats. §5.5 (block kriging) replaces the point target $\mathbf{s}_0$ with a target BLOCK and integrates the covariance over that block — the right-hand side $\mathbf{k}$ becomes a vector of POINT-TO-BLOCK average covariances. §5.6 (neighbourhood selection) restricts which samples enter the kriging system, trading off computational cost against estimation quality. §5.7 (cokriging) brings in secondary variables — the system becomes block-structured with cross-covariance entries. §5.8 catalogues common kriging pathologies and how to spot them — wrong nugget, wrong range, wrong drift, neighbourhood artefacts, screen-effect surprises.

Every later section starts from the §5.1 derivation and modifies the linear system. The pattern is the same throughout: state the prediction problem; write down the linear-predictor form; impose the right unbiasedness constraint; minimise MSE; solve the resulting linear system; report the predictor and the variance. Once §5.1 is internalised, the rest of Part 5 is a tour of the variations.

Parts 6 and 7 then build on the kriging foundation. Part 6 develops the cross-validation and calibration apparatus that confirms a kriged map is actually trustworthy — the SD(z) ≈ 1 standardised-residual diagnostic from §4.5 generalised to the full kriging output, with accuracy plots, calibration diagrams, conditional-bias diagnostics. Part 7 takes kriging from a "best guess" estimator to a STOCHASTIC SIMULATION — sequential Gaussian simulation draws conditional realisations from the kriged conditional distribution, giving you the stack of plausible field realisations that risk-decision workflows actually require. Both rest on §5.1.

Try it

In simple-kriging-step, drag the query point to $x = 0.30$ . Read the dominant weight from the middle panel — sample 2 should take the lion's share. Now drag to $x = 0.50$ . The dominant weight shifts: samples 3 and 4 share most of it. Watch how the kriged curve in the top panel smoothly traces this changing weight pattern.
In simple-kriging-step, set range $a = 0.10$ (the kernel is narrow). Drag the query slowly from x = 0 to x = 1. Each sample takes over the weight as the query passes near it; the curve looks piecewise nearest-neighbour. Now set $a = 1.50$ (the kernel is wide). The weights spread evenly, the curve becomes almost constant, and the variance is nearly uniform. This is the variogram range encoded as kriging behaviour.
In simple-kriging-step, drag the query to $x = 0.97$ (far right). One or two samples take all the weight; the variance climbs because $\mathbf{k}$ thins out. Read the variance ratio in the message box — it should be a substantial fraction of the sill. Now drag back to $x = 0.45$ (right between samples 3 and 4). Variance drops to a small fraction of the sill. This is the trust map paying off.
In simple-kriging-step, find a query location where at least one weight is NEGATIVE. Try $x = 0.10$ with the default variogram. The negative weight is the screen effect — an intervening sample shields a more distant one. Note that negative weights are NOT a bug; they reflect redundant clustered information.
In simple-kriging-step, raise the nugget to $c_0 = 0.30$ . The kriged curve no longer passes exactly through the sample points — exact interpolation breaks at non-zero nugget. At sample locations the variance is non-zero (it bottoms at $c_0 = 0.30$ ). This is the §4.2 nugget propagating into the kriging system.
In kriging-variance-map, pick the CLUSTERED layout. Note the variance map (right): low blue patch around the cluster, high red everywhere else. Slide the known mean $\mu$ from 0.0 to 1.0. The estimate map shifts; the variance map DOES NOT CHANGE. This is the data-independence of the kriging variance.
In kriging-variance-map, compare CLUSTERED, UNIFORM, and RANDOM layouts at the same variogram. Read the "peak variance % of sill" pill for each. CLUSTERED has the highest peak (most empty regions); UNIFORM has the lowest (most evenly covered); RANDOM is in between. Sampling design matters.
In kriging-variance-map, set the range $a$ to 0.10 (short) and then 0.80 (long). At short range, variance lows are tight blue pockets around samples; at long range, the lows merge into broader well-covered regions. The variogram range visualises as the "reach" of each sample's information.
Without coding: a reservoir engineer has 50 well samples with porosity values. The global declustered mean is 0.18. They want to predict porosity at a new well location using simple kriging with a fitted Spherical variogram $c_0 = 0.002, c = 0.005, a = 200\,\text{m}$ . The nearest 5 wells are within 80 m of the proposed location; the next 5 are 120-300 m away. Roughly, what would you expect for: (a) the kriging variance at the proposed location (compared to the sill); (b) the sum of the weights (less than 1, equal to 1, or greater than 1); (c) the dominant samples by weight (closest few or distributed widely)?
Without coding: the same engineer's cross-validation gives SD(z) = 1.45 on the simple-kriging predictions. What does this diagnose, and what would you change about the simple kriging set-up (specifically: the assumption about the mean, the variogram, or both)?

Pause and reflect: the kriging variance $\sigma_K^2(\mathbf{s}_0)$ depends on the sample layout and the variogram model, but not on the data values. This is the basis for kriging-driven sampling design: you can compute where the next sample should go BEFORE drilling. But the variance NUMBERS — the actual size of $\sigma_K^2$ — are only as good as the variogram. What three pieces of evidence would you want to see, before trusting a kriging-variance map for a high-stakes sampling-design decision, that the variogram itself is well-fitted to the data?

What you now know — and the open of Part 5

You can state the KRIGING PROBLEM: given $N$ samples and a variogram, predict $Z$ at an unsampled location with calibrated uncertainty. You know the LINEAR-PREDICTOR restriction, $Z^*(\mathbf{s}_0) = \sum_i w_i Z(\mathbf{s}_i) + w_0$ (under simple kriging the constant is $w_0 = m\,(1 - \sum_i w_i)$ ), and the BLUE optimality criterion that picks the weights. You can apply the SIMPLE-KRIGING ASSUMPTION (known mean $m$ ) and reformulate the predictor on residuals $Y(\mathbf{s}) = Z(\mathbf{s}) - m$ .

You can DERIVE the SK system $\mathbf{K} \mathbf{w} = \mathbf{k}$ from MSE minimisation, where $K_{ij} = C(\mathbf{s}_i, \mathbf{s}_j)$ is the sample-to-sample covariance matrix and $k_i = C(\mathbf{s}_i, \mathbf{s}_0)$ is the sample-to-target vector. You can SOLVE the system via Cholesky decomposition (with jitter regularisation for numerical robustness) and you can compute the SIMPLE-KRIGING VARIANCE $\sigma_K^2(\mathbf{s}_0) = C(0) - \mathbf{w}^\top \mathbf{k}$ .

You can recognise the four PROPERTIES of simple kriging — BLUE optimality, exact interpolation at samples (in the no-nugget case), smoothing of the kriged map relative to the field, and the screen effect (negative weights for clustered/redundant samples). You can handle coincident-sample SINGULARITY of $\mathbf{K}$ by averaging duplicates, raising the nugget, or using jitter regularisation.

You can WALK THROUGH a worked example: build $\mathbf{K}$ and $\mathbf{k}$ from a Spherical variogram, solve for weights, compute the prediction and variance. You can interpret the WIDGET output — the kriging weights, the kriged curve, the variance envelope, the side-by-side estimate-and-variance maps. You can identify the KRIGING VARIANCE as a DATA-INDEPENDENT TRUST MAP: it depends only on sample layout and variogram, which makes it usable for sampling design.
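For reference, the whole walk-through fits in a short script. The sketch below is one possible arrangement under stated assumptions, not the section's worked example: the five sample locations, their values, the target, the mean and the Spherical parameters are hypothetical, the nugget is placed on the diagonal of $\mathbf{K}$ only, and NumPy's general solver stands in for a dedicated triangular solver such as `scipy.linalg.cho_solve`.

```python
import numpy as np

def spherical_corr(h, a):
    """Spherical correlation: 1 at h = 0, decaying to 0 for h >= a."""
    r = np.minimum(h / a, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

# Hypothetical samples, target, known mean and variogram.
pts = np.array([[0.2, 0.3], [0.6, 0.2], [0.5, 0.7], [0.9, 0.8], [0.3, 0.9]])
zs = np.array([1.4, 0.7, 1.1, 0.5, 1.3])
target = np.array([0.5, 0.5])
m, c0, c, a = 1.0, 0.1, 0.9, 0.6

# Step 1: sample-to-sample matrix K and sample-to-target vector k.
d_ss = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
K = c * spherical_corr(d_ss, a) + c0 * np.eye(len(pts))
k = c * spherical_corr(np.linalg.norm(pts - target, axis=1), a)

# Step 2: Cholesky solve, with a small jitter for numerical robustness.
jitter = 1e-10 * (c0 + c)
L = np.linalg.cholesky(K + jitter * np.eye(len(pts)))
y = np.linalg.solve(L, k)        # forward substitution: L y = k
w = np.linalg.solve(L.T, y)      # back substitution:    L^T w = y

# Step 3: prediction on residuals, then back-transform; variance.
estimate = m + w @ (zs - m)
variance = (c0 + c) - y @ y      # y @ y equals w @ k

print("weights:", np.round(w, 4), " sum:", round(float(w.sum()), 4))
print(f"estimate: {estimate:.4f}   kriging variance: {variance:.4f}")
```

Computing the variance from `y @ y` reuses the forward-substitution result, so the variance costs nothing extra once the weights have been solved for.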

You know the HONEST CAVEATS — known-mean assumption matters (§5.2 fixes); BLUE is conditional on the variogram being right (§4.5 cross-validation + Part 6 calibration test); LINEAR is a restriction (§8.2 indicator methods extend); stationarity matters (§5.3 universal kriging handles trends); coincident samples need defensive handling.

This OPENS PART 5 — KRIGING. The next sections extend the §5.1 foundation in specific directions. §5.2 drops the known-mean assumption. §5.3 generalises the mean to a basis-function trend. §5.4 cashes out what the kriging variance does and does not mean. §5.5 generalises to block targets. §5.6 develops neighbourhood selection. §5.7 brings in secondary data. §5.8 catalogues the modal kriging pathologies. Each builds on the linear system $\mathbf{K} \mathbf{w} = \mathbf{k}$ you just derived. Parts 6 and 7 then validate and stochastically extend the kriging machinery to give you calibrated uncertainty and conditional realisations. The variogram models from Parts 3-4 finally do useful work here — every term in the kriging system is a covariance value, hence a variogram value, hence directly produced by the §4.5 fitting workflow. Bad variogram fit → bad kriging; good variogram fit → calibrated estimation and trustworthy uncertainty maps. §5.1 is the foundation; the rest of Part 5 is its working out.

References

Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58(8), 1246–1266. (The foundational paper. Defines the regionalised-variable framework, the variogram, and the kriging estimator. Names the technique after Krige and develops the BLUE derivation. The original cited reference for everything in §5.1.)
Matheron, G. (1971). The Theory of Regionalized Variables and Its Applications. Cahiers du Centre de Morphologie Mathématique, École des Mines de Paris, No. 5. (The expanded mathematical treatment of the 1963 paper. Goes through the kriging system derivation in full rigour, with the connection between permissibility of the variogram and PSD-ness of the kriging matrix made explicit. The reference for the mathematical-statistics underpinnings.)
Krige, D.G. (1951). A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6), 119–139. (The original empirical paper that anticipated the optimal-weighting approach to ore-grade estimation. Krige observed that small samples in a high-grade area systematically over-estimated the grade — the conditional-bias phenomenon — and proposed a corrective regression. Matheron 1963 formalised the underlying mathematics and named the technique after Krige.)
Cressie, N. (1993). Statistics for Spatial Data (revised ed.). Wiley. (§3 covers the kriging family at mathematical-statistics rigour. §3.4 develops simple kriging from the BLUE criterion under known-mean stationarity. The reference textbook for the formal derivation and the asymptotic properties of the kriging predictor.)
Chilès, J.-P., Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty (2nd ed.). Wiley. (Chapter 3 is the comprehensive reference for kriging — simple, ordinary, universal, KED, block, neighbourhood — with derivations, properties, and the screen effect. §3.4.2 specifically treats negative weights and the screen effect. The standard modern graduate-school reference.)
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§5.3 develops simple kriging from the practitioner perspective. §5.7 catalogues the kriging properties — BLUE, exact interpolation, smoothing — with worked examples on the WL data. The practitioner-textbook reference for §5.1 material at graduate-school level.)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapters 12–14 develop the kriging family at the practitioner-pedagogy level. Chapter 13 specifically treats the screen effect, negative weights, and the geometric intuition for the kriging system. The most readable entry-level reference for §5.1.)
Deutsch, C.V., Journel, A.G. (1998). GSLIB: Geostatistical Software Library and User's Guide (2nd ed.). Oxford University Press. (§IV.1 documents the kt3d program — the canonical GSLIB simple/ordinary/universal kriging implementation. Reading the source is how to verify that a custom kriging implementation matches a reference. The §5.1 widgets in this section implement the same algorithm at smaller scale.)
Pyrcz, M.J., Deutsch, C.V. (2014). Geostatistical Reservoir Modeling (2nd ed.). Oxford University Press. (§4 documents the kriging workflow as used in reservoir-characterisation production. Emphasises the variogram-to-kriging connection — every term in the kriging system is a variogram value, hence the §4.5 fitting workflow is the input to §5.1. Catalogues the practical setup steps for production kriging runs.)
Stein, M.L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer. (The rigorous mathematical-statistics treatment of kriging. Develops the asymptotic theory under increasing-domain and infill asymptotics, the connection between kriging and best linear prediction in general Hilbert spaces, and the conditions under which kriging predictors are consistent. The reference for the theoretical foundations.)
Wackernagel, H. (2003). Multivariate Geostatistics (3rd ed.). Springer. (Chapter 11 develops simple kriging within the broader multivariate framework. Useful for seeing how the §5.7 cokriging system specialises to §5.1 simple kriging when there is only one variable.)