Universal kriging and kriging with external drift

Part 5 — Kriging

Learning objectives

Diagnose the LIMITATION OF ORDINARY KRIGING with respect to non-stationary means. OK assumes $m$ is constant ACROSS THE KRIGING NEIGHBOURHOOD. Real fields often have a smoothly varying mean — porosity declining with depth, temperature decreasing with elevation, water-table depth tracking ground-surface elevation, ore grade following a depositional axis. With a GLOBAL neighbourhood, OK averages this trend into a single number and silently underestimates the data-trend leverage. With a tight LOCAL neighbourhood, OK can absorb a slow trend; with a fast trend, even local OK becomes biased. UNIVERSAL KRIGING (Matheron 1969) handles the trend EXPLICITLY by promoting the mean to a basis-function expansion, replacing the OK constraint $\sum w = 1$ with one constraint per basis function.
State the UNIVERSAL KRIGING ASSUMPTION: the field mean $m(\mathbf{s})$ is a known LINEAR COMBINATION of $K$ basis functions $f_k(\mathbf{s})$ with UNKNOWN coefficients $\beta_k$ : $m(\mathbf{s}) = \sum_{k=0}^{K-1} \beta_k f_k(\mathbf{s})$ . The basis functions are KNOWN — typically POLYNOMIAL in the coordinates. Common bases: zero-order (constant: $f_0 = 1$ , which recovers OK as a special case); first-order ( $1, x, y$ — linear trend in 2-D); second-order ( $1, x, y, x^2, y^2, xy$ — quadratic trend in 2-D). The coefficients $\boldsymbol\beta = (\beta_0, \ldots, \beta_{K-1})^\top$ are NOT known a priori — UK estimates them IMPLICITLY through the augmented kriging system, just as OK estimates the single constant mean through $\sum w = 1$ .
Derive the UK UNBIASEDNESS CONSTRAINTS. Take the expectation of the linear predictor $E[Z^*_{UK}(\mathbf{s}_0)] = \sum_i w_i E[Z(\mathbf{s}_i)] = \sum_i w_i \sum_k \beta_k f_k(\mathbf{s}_i) = \sum_k \beta_k \sum_i w_i f_k(\mathbf{s}_i)$ . For this to equal $E[Z(\mathbf{s}_0)] = \sum_k \beta_k f_k(\mathbf{s}_0)$ regardless of the unknown values of the $\beta_k$ 's, the coefficients of each $\beta_k$ must match individually: $\sum_{i=1}^N w_i f_k(\mathbf{s}_i) = f_k(\mathbf{s}_0)$ for each $k = 0, \ldots, K-1$ . This is $K$ linear constraints on the weights — one per basis function. The OK constraint $\sum w = 1$ is recovered as the $k = 0$ case when $f_0 \equiv 1$ .
Write the UK AUGMENTED SYSTEM. With $\mathbf{K}$ the $N \times N$ sample-to-sample covariance matrix, $\mathbf{k}$ the $N \times 1$ sample-to-target vector, $\mathbf{F}$ the $N \times K$ matrix of basis functions evaluated at the samples (row $i$ : $F_{ik} = f_k(\mathbf{s}_i)$ ), and $\mathbf{f}_0$ the $K \times 1$ vector of basis functions at the target (entry $k$ : $(\mathbf{f}_0)_k = f_k(\mathbf{s}_0)$ ), the UK system is $\begin{pmatrix} \mathbf{K} & \mathbf{F} \\ \mathbf{F}^\top & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{w} \\ \boldsymbol\mu \end{pmatrix} = \begin{pmatrix} \mathbf{k} \\ \mathbf{f}_0 \end{pmatrix}$ . The augmented matrix is $(N+K) \times (N+K)$ . The bottom-right $K \times K$ block is zero (same reason as in OK: there is no $\mu_p\mu_q$ coupling in the Lagrangian). The matrix is SYMMETRIC INDEFINITE — solve by LU with partial pivoting, the same way OK's $(N+1) \times (N+1)$ system is solved.
Compute the UK kriging variance. Same Lagrangian-substitution argument as in OK: $\sigma_{UK}^2(\mathbf{s}_0) = C(0) - \mathbf{w}^\top \mathbf{k} + \boldsymbol\mu^\top \mathbf{f}_0 = C(0) - \sum_i w_i C(\mathbf{s}_i, \mathbf{s}_0) + \sum_k \mu_k f_k(\mathbf{s}_0)$ (sign convention matching Cressie 1993 / Chilès & Delfiner 2012 — for OK with the single constant basis, $\mu_0 \ge 0$ ; with multiple bases, individual $\mu_k$ may take either sign while the AGGREGATE $\boldsymbol\mu^\top \mathbf{f}_0$ remains non-negative, preserving $\sigma_{UK}^2 \ge 0$ ). The aggregate Lagrange term $\boldsymbol\mu^\top \mathbf{f}_0$ adds a non-negative quantity to the variance — UK's variance EXCEEDS OK's variance EXCEEDS SK's variance at the same target, because each successive variant absorbs more estimation uncertainty into the price (mean, then mean + trend coefficients).
State the KRIGING WITH EXTERNAL DRIFT (KED) variant. KED is structurally identical to UK with $K = 2$ basis functions, but the second basis $f_1(\mathbf{s})$ is NOT a polynomial of coordinates — it is a known SECONDARY VARIABLE $\zeta(\mathbf{s})$ defined DENSELY across the kriging domain. Examples: temperature with elevation as drift ( $\zeta(\mathbf{s}) =$ elevation), groundwater table with ground-surface elevation, ore grade with depth, soil moisture with NDVI, rainfall with elevation. The KED system is the UK system with $\mathbf{F} = [\mathbf{1}, \boldsymbol\zeta]$ where $\boldsymbol\zeta$ is the $N \times 1$ vector of secondary values at samples and $\mathbf{f}_0 = (1, \zeta(\mathbf{s}_0))^\top$ . The drift variable $\zeta$ MUST be defined at every kriging target — that is the key requirement; KED CANNOT be used if the secondary is only sampled at a subset of locations (in that case use cokriging — §5.7).
Apply the PRACTICAL EXAMPLES. (a) MINING — ore grade versus depth in a layered deposit. KED with depth as the drift variable extracts the systematic vertical gradient; ordinary kriging on the residuals captures the lateral variability. (b) METEOROLOGY — air temperature interpolation with elevation as drift. The MAUP-style stations are sparse on mountains, dense in valleys; temperature drops about 6.5 °C per kilometre of elevation gain (the environmental lapse rate). KED with elevation as drift, available everywhere from a DEM, gives much better interpolation than OK alone. (c) HYDROLOGY — groundwater-table depth with surface elevation as drift; KED captures the topography-driven head gradient. (d) ENVIRONMENTAL — rainfall with elevation as drift, soil contamination with distance-from-source as drift.
Use the PRAGMATIC ALTERNATIVE: TREND + RESIDUAL DECOMPOSITION. Step 1 — fit a TREND model $\hat m(\mathbf{s})$ to the data by least-squares regression on the chosen basis (polynomial coordinates for UK-style; secondary variable for KED-style). Step 2 — compute RESIDUALS $r(\mathbf{s}_i) = Z(\mathbf{s}_i) - \hat m(\mathbf{s}_i)$ . Step 3 — fit a VARIOGRAM on the residuals (NOT on the raw data, because the trend contaminates $\gamma(h)$ — see next point). Step 4 — KRIGE the residuals using OK on $r$ . Step 5 — ADD the trend back at every kriging target: $Z^*(\mathbf{s}_0) = \hat m(\mathbf{s}_0) + r^*(\mathbf{s}_0)$ . This is sometimes called REGRESSION KRIGING (Hengl, Heuvelink & Stein 2004 *Geoderma*; Pyrcz & Deutsch 2014 §4.6). It is mathematically NOT identical to UK — UK propagates the trend uncertainty into the kriging variance via the Lagrange multipliers, but regression kriging treats the fitted trend as fixed and so under-reports variance by an amount related to the regression-coefficient uncertainty. Use regression kriging when its simplicity outweighs the slight under-reporting; use full UK/KED when you need calibrated variance.
Recognise the RESIDUAL-VARIOGRAM trap. Under UK or KED, the variogram should be computed on the RESIDUALS $r(\mathbf{s}) = Z(\mathbf{s}) - \hat m(\mathbf{s})$ , NOT the raw data. A variogram computed on data with a strong trend looks like an unbounded power-function — $\gamma(h)$ grows without reaching a sill, because the trend contributes a deterministic $m(\mathbf{s})^2$ -type contamination at every lag. This pseudo-sill-less variogram is the classic SIGNATURE OF UNDETECTED TREND in a dataset and a flag that UK/KED is the right tool. In practice, fit a low-order polynomial (or regress on the secondary), subtract it, then compute the variogram on residuals; the result should plateau at a sensible sill and look like a clean §3/§4 fittable shape.
Identify the HONEST CAVEATS — what UK and KED ASSUME, and how the assumptions can fail. (a) TREND SPECIFICATION: UK assumes the basis is correct. Over-fitting the trend (too many polynomial terms) leaves NOTHING for kriging to do — the residuals are pure noise, the variogram has only nugget, and the kriging weights effectively become trend coefficients with no spatial-correlation contribution. Under-fitting (too few terms) leaves trend IN the residuals — the variogram is contaminated, the kriging is biased. The order of the polynomial should be SMALL (typically 1, sometimes 2). (b) EXTRAPOLATION: kriging beyond the data range with UK is DANGEROUS — the polynomial extrapolates wildly, and outside the data envelope $m^*(\mathbf{s}_0)$ can be hugely off. UK is safe interpolation INSIDE the data range; it is NOT safe extrapolation outside. (c) KED DRIFT VARIABLE: requires $\zeta(\mathbf{s}_0)$ at every target — sometimes a strong assumption (elevation is OK because DEMs are dense; depth in a borehole-only dataset is NOT defined laterally between holes).
Distinguish UK FROM KED in three respects. (i) BASIS FUNCTIONS: UK uses POLYNOMIALS of coordinates $(x, y, z)$ ; KED uses an EXTERNAL secondary variable $\zeta(\mathbf{s})$ . (ii) INFORMATION REQUIRED: UK needs only the sample data and the target coordinates; KED additionally needs $\zeta$ defined densely everywhere. (iii) WHEN TO USE EACH: UK when the trend has a CLEAR GEOMETRIC structure with the coordinates (e.g., gravity sliding on a smooth dipping plane). KED when there is a PHYSICAL EXPLANATORY VARIABLE that drives $Z$ via a known relationship (temperature ~ elevation; ore grade ~ depth in a layered deposit). KED is preferred over UK whenever a defensible explanatory variable exists, because it grounds the trend in physics rather than coordinates.
Walk through a CONCRETE WORKED EXAMPLE for KED. Setup: 8 air-temperature samples at known (x, elevation) locations along a transect. Spherical residual variogram with $c_0 = 0.5, c = 4.0, a = 200\text{ km}$ (typical for residuals after detrending by elevation). KED basis $\mathbf{F} = [\mathbf{1}, \boldsymbol{\zeta}]$ where $\boldsymbol\zeta$ is the elevation column. Target: predict temperature at $(x_0, z_0) = (150\text{ km}, 1800\text{ m})$ . Step 1 — build the $8 \times 8$ covariance $\mathbf{K}$ from sample separations. Step 2 — augment with an $8 \times 2$ $\mathbf{F}$ block of [1, elevation]. Step 3 — build the $8 \times 1$ $\mathbf{k}$ from sample-to-target separations. Step 4 — set $\mathbf{f}_0 = (1, 1800)^\top$ . Step 5 — solve the $10 \times 10$ system by LU. Step 6 — predict $Z^*_{KED}(\mathbf{s}_0) = \sum_i w_i z_i$ . The result combines a temperature pulled toward nearby stations (kriging part) with a correction for the elevation difference (drift part).
Use the WORKFLOW for a defensible UK / KED report. (1) DECIDE which variant is appropriate: examine the data for a coordinate-driven trend (favours UK) versus an external-variable-driven trend (favours KED). (2) FIT the trend by regression — for UK choose the polynomial order (typically 1, rarely 2); for KED identify the secondary variable. (3) COMPUTE residuals and fit the variogram on them (NOT on the raw data — see the residual-variogram trap). (4) BUILD the augmented $(N+K) \times (N+K)$ kriging matrix at each target; in KED, $K = 2$ usually. (5) SOLVE by LU with partial pivoting. (6) VERIFY the unbiasedness constraints on the solved weights — for each $k$ , confirm $\sum_i w_i f_k(\mathbf{s}_i) \approx f_k(\mathbf{s}_0)$ . (7) PREDICT $Z^*(\mathbf{s}_0)$ and compute $\sigma_{UK}^2(\mathbf{s}_0)$ . (8) CROSS-VALIDATE (leave-one-out) — SD of standardised UK / KED residuals should be near 1, just as for OK. (9) REPORT the predictor, the variance map, the trend coefficients (in the regression-kriging variant), and the diagnostics.

§5.2 dropped the simple-kriging known-mean assumption and replaced it with a single unbiasedness constraint $\sum w = 1$ . That constraint says the predictor is a true weighted average — every sample contributes a share, the shares sum to 1, and the mean $m$ never appears explicitly because the constraint absorbs it. Ordinary kriging is the practical default of geostatistics for exactly this reason. But the OK assumption is still STRONG: the mean must be constant — at least across the kriging neighbourhood.

Real fields often have a smoothly varying mean. Porosity declines with depth in a sedimentary reservoir. Air temperature drops with elevation. The groundwater-table depth tracks the surface elevation. Ore grade follows a depositional axis from proximal to distal facies. Soil moisture correlates with NDVI. Pollutant concentrations decay with distance from a source. These are not nuisance variability layered on top of a stationary background — they are PHYSICAL TRENDS, predictable in form, available as either a coordinate-driven polynomial pattern or an external explanatory variable. Forcing OK to handle them as if they were stationary is professional malpractice.

UNIVERSAL KRIGING (Matheron 1969) handles the trend explicitly. The mean is promoted from a constant to a known LINEAR COMBINATION of known BASIS FUNCTIONS — typically polynomial in the coordinates — with unknown coefficients. The OK constraint $\sum w = 1$ generalises to ONE CONSTRAINT PER BASIS FUNCTION. The kriging system grows from $(N+1)\times(N+1)$ to $(N+K)\times(N+K)$ . Solve by LU; predict, compute the kriging variance, report. KRIGING WITH EXTERNAL DRIFT (KED) is the same machinery but the basis is not a polynomial — it is a known SECONDARY VARIABLE defined densely across the kriging domain. UK is the right tool when the trend has a geometric structure in coordinates; KED is the right tool when there is a defensible explanatory variable.

This section develops UK and KED from the same first principles as §5.1 and §5.2. We diagnose the OK limitation with respect to non-stationary means, state the UK assumption (mean as basis-function expansion with unknown coefficients), derive the $K$ unbiasedness constraints, set up the augmented kriging system, solve it, and compute the kriging variance. Two widgets bring the abstractions down to earth — a 1-D field with a clear linear trend that compares OK against UK, and a 1-D field driven by a secondary variable that compares OK against KED. The teaching arc: OK + trend = UK; OK + secondary variable = KED; both rest on the same augmented kriging system that the OK derivation produced.

OK assumes the mean is constant ACROSS THE KRIGING NEIGHBOURHOOD. With a small local neighbourhood, this is mild — even a slowly varying trend is approximately constant over 15-30 nearby samples. With a global neighbourhood and a clear trend, OK is BIASED — the constraint $\sum w = 1$ forces the implicit mean estimate to be a single number across the entire domain.

Concrete example. Suppose the true field is $Z(\mathbf{s}) = 10 + 0.5, x + \epsilon(\mathbf{s})$ where $\epsilon$ is zero-mean stationary noise and $x \in [0, 10]$ . The true mean at $x = 0$ is 10; at $x = 10$ it is 15. If you run OK with a global neighbourhood, the implicit mean estimate is the global average $\approx 12.5$ . The OK estimate at $x = 0$ will be pulled UP toward 12.5 (since the data nearest $x = 0$ are biased down from the global mean by the trend) and the OK estimate at $x = 10$ will be pulled DOWN toward 12.5. The kriged map will SHRINK the actual gradient — the trend will be visible but ATTENUATED.

The diagnostic is simple. A scatterplot of OK residuals $(z_i - z^*_{OK}(\mathbf{s}_i))$ versus a candidate trend coordinate (e.g. $x$ ) will show a CLEAR LINEAR pattern. Cross-validation residuals will not be unbiased — the mean residual at low $x$ will be positive (OK over-predicts) and at high $x$ will be negative (OK under-predicts). This is the signature of an UNDETECTED TREND, and the cleanest fix is to MODEL THE TREND EXPLICITLY rather than try to hide it in a tighter neighbourhood.

The universal-kriging assumption — mean as basis-function expansion

UNIVERSAL KRIGING assumes the mean $m(\mathbf{s})$ is a KNOWN LINEAR COMBINATION of $K$ KNOWN basis functions $f_k(\mathbf{s})$ with UNKNOWN coefficients $\beta_k$ :

m(\mathbf{s}) \;=\; \sum_{k=0}^{K-1} \beta_k \, f_k(\mathbf{s}).

The basis functions $f_k$ are KNOWN — you choose them based on what trend you believe is present. Common choices in 2-D:

Zero-order (constant): $K = 1$ , $f_0(\mathbf{s}) \equiv 1$ . This recovers ORDINARY KRIGING as a special case — the UK constraint $\sum w_i f_0(\mathbf{s}_i) = f_0(\mathbf{s}_0)$ becomes $\sum w_i = 1$ .
First-order (linear): $K = 3$ , basis $1, x, y$ . The mean is a planar function of the coordinates — a constant tilt across the domain.
Second-order (quadratic): $K = 6$ , basis $1, x, y, x^2, y^2, xy$ . The mean is a quadratic surface — can capture a dome, a saddle, or a valley.

The coefficients $\boldsymbol\beta = (\beta_0, \ldots, \beta_{K-1})^\top$ are UNKNOWN. UK does NOT estimate them explicitly — it estimates them IMPLICITLY through the augmented kriging system, just as OK estimates the single constant mean through $\sum w = 1$ . There is no separate regression step in pure UK; the kriging system handles trend estimation and prediction in one solve.

The polynomial order should be SMALL. Most production UK uses order 1 or 2; higher orders over-fit. A useful heuristic from Goovaerts 1997 §5.5: keep the number of trend coefficients $K$ much less than the sample size $N$ (a factor of 5-10 minimum), and use cross-validation to confirm the chosen order produces well-calibrated residuals.

The UK unbiasedness constraints — Σ w_i f_k(s_i) = f_k(s₀)

With the predictor $Z^*_{UK}(\mathbf{s}_0) = \sum_i w_i Z(\mathbf{s}_i)$ and $E[Z(\mathbf{s}_i)] = m(\mathbf{s}_i) = \sum_k \beta_k f_k(\mathbf{s}_i)$ , take the expectation:

E[Z^*_{UK}(\mathbf{s}_0)] \;=\; \sum_i w_i \sum_k \beta_k f_k(\mathbf{s}_i) \;=\; \sum_k \beta_k \, \Bigl(\sum_i w_i f_k(\mathbf{s}_i)\Bigr).

For UNBIASEDNESS — $E[Z^*_{UK}(\mathbf{s}_0)] = m(\mathbf{s}_0) = \sum_k \beta_k f_k(\mathbf{s}_0)$ — REGARDLESS OF the unknown values of $\beta_k$ , the coefficient of each $\beta_k$ must match individually:

\boxed{\;\sum_{i=1}^N w_i \, f_k(\mathbf{s}_i) \;=\; f_k(\mathbf{s}_0), \quad k = 0, 1, \ldots, K-1.\;}

This is $K$ linear constraints on the weights — one per basis function. The OK constraint $\sum w = 1$ is recovered as the $k = 0$ case when $f_0 \equiv 1$ : $\sum w_i \cdot 1 = 1$ . For a linear basis ${1, x, y}$ , we get three constraints: $\sum w_i = 1$ , $\sum w_i x_i = x_0$ , $\sum w_i y_i = y_0$ . The second and third say the WEIGHTED MEAN of the sample x-coordinates (with weights $w_i$ ) must equal the target x-coordinate, and similarly for y — the weights have to "balance" so that the implicit basis-coefficient estimates are unbiased.

The universal-kriging system — augmented (N+K)×(N+K)

Set up the constrained optimisation. Minimise the MSE $E[(Z^*_{UK}(\mathbf{s}_0) - Z(\mathbf{s}_0))^2]$ subject to the $K$ constraints $\sum_i w_i f_k(\mathbf{s}_i) = f_k(\mathbf{s}$ . Use $K$ Lagrange multipliers $\mu_0, \mu_1, \ldots, \mu$ {K-1} $μ_{0}, μ_{1}, \dots, μ_{K - 1}$ — one per constraint. Form the Lagrangian, differentiate with respect to each $w_i$ and each $\mu_k$ , set to zero. The result writes cleanly in BLOCK MATRIX form.

Let $\mathbf{K}$ be the $N \times N$ sample-to-sample covariance matrix as in OK. Let $\mathbf{F}$ be the $N \times K$ design matrix of basis-function values evaluated at the samples (row $i$ : $F_{ik} = f_k(\mathbf{s}_i)$ ). Let $\mathbf{k}$ be the $N \times 1$ sample-to-target covariance vector. Let $\mathbf{f}_0$ be the $K \times 1$ vector of basis-function values evaluated at the target ( $(\mathbf{f}_0)_k = f_k(\mathbf{s}_0)$ ). Let $\mathbf{w}$ be the $N \times 1$ unknown weight vector and $\boldsymbol\mu$ the $K \times 1$ Lagrange-multiplier vector. The UK system is:

\boxed{\;\begin{pmatrix} \mathbf{K} & \mathbf{F} \\ \mathbf{F}^\top & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{w} \\ \boldsymbol\mu \end{pmatrix} \;=\; \begin{pmatrix} \mathbf{k} \\ \mathbf{f}_0 \end{pmatrix}.\;}

The augmented matrix is $(N+K) \times (N+K)$ . The upper-left $N \times N$ block is the same SPD covariance matrix as in OK. The lower-right $K \times K$ block is ZERO. The off-diagonal blocks are $\mathbf{F}$ and $\mathbf{F}^\top$ — they tie the weights to the trend constraints. Like OK, the augmented matrix is SYMMETRIC INDEFINITE (not positive-definite — the zero block kills PD-ness), so the solver is LU DECOMPOSITION WITH PARTIAL PIVOTING. The matrix is INVERTIBLE whenever (a) $\mathbf{K}$ is SPD (permissibility from §4.1 plus distinct sample locations) and (b) the $K$ columns of $\mathbf{F}$ are LINEARLY INDEPENDENT and there are at least $K$ DISTINCT sample locations among the $N$ samples.

The right-hand side has the same structure as OK extended to $K$ constraints: the upper $N \times 1$ block $\mathbf{k}$ is the same sample-to-target covariance vector as in OK, and the lower $K \times 1$ block $\mathbf{f}_0$ carries the $K$ constraint right-hand sides $f_k(\mathbf{s}_0)$ . Solve once per target by forward/back-substitution on the LU decomposition (which depends only on samples + basis + variogram, not on target).

Once solved, the UK predictor uses the weights directly:

Z^*_{UK}(\mathbf{s}_0) \;=\; \sum_{i=1}^N w_i \, Z(\mathbf{s}_i),

and the constraints $\sum_i w_i f_k(\mathbf{s}_i) = f_k(\mathbf{s}_0)$ are verifiable on the output as a sanity check, one per basis function.

The universal-kriging variance

Plugging the optimal $\mathbf{w}$ and $\boldsymbol\mu$ back into the Lagrangian and using the UK system to simplify gives the UNIVERSAL-KRIGING VARIANCE:

\boxed{\;\sigma_{UK}^2(\mathbf{s}_0) \;=\; C(0) \,-\, \sum_{i=1}^N w_i \, C(\mathbf{s}_i, \mathbf{s}_0) \,+\, \sum_{k=0}^{K-1} \mu_k \, f_k(\mathbf{s}_0) \;=\; C(0) \,-\, \mathbf{w}^\top \mathbf{k} \,+\, \boldsymbol\mu^\top \mathbf{f}_0.\;}

(Sign convention matching Cressie 1993 / Chilès-Delfiner 2012 §3.5: the AGGREGATE $\boldsymbol\mu^\top \mathbf{f}$ is non-negative — which keeps $\sigma$ {UK}^2 \ge 0 $σ_{U K}^{2} \geq 0$ — but individual $\mu_k$ may take either sign when the target's basis values $f_k(\mathbf{s}_0)$ depart strongly from the data-weighted average of $f_k(\mathbf{s}_i)$ . Some references absorb the overall sign into $\boldsymbol\mu$ and write $-\boldsymbol\mu^\top \mathbf{f}_0$ ; physical content is identical.)

Compare to OK: $\sigma_{OK}^2 = C(0) - \mathbf{w}^\top \mathbf{k} + \mu \cdot 1$ . The UK variance has the SAME structure as OK but with $K$ Lagrange-multiplier terms instead of one. Each $\mu_k$ has dimensions of variance and represents the HONEST COST of estimating the corresponding trend coefficient $\beta_k$ . The variance at a matched target satisfies $\sigma_{UK}^2 \ge \sigma_{OK}^2 \ge \sigma_{SK}^2$ : each successive variant absorbs more unknown structure into the price.

The variance has the same THREE properties as OK's. (i) NON-NEGATIVE: the PSD-ness of the $\mathbf{K}$ block plus the basis constraints guarantees $\sigma_{UK}^2 \ge 0$ for permissible variograms. (ii) BOUNDED ABOVE: when the target is far from every sample and the basis values $\mathbf{f}_0$ stay within their data-derived range, the variance approaches a bounded limit. (iii) FUNCTION OF LAYOUT + VARIOGRAM + BASIS — independent of the data VALUES. The kriging variance map remains a "trust map" that can be computed from layout, variogram, and chosen basis alone, supporting kriging-driven sampling design.

The first widget for §5.3 shows UK on a 1-D field with a clear linear trend. Six samples on $[0, 1]$ with values drawn from $Z(x) = m_0 + m_1, x + \epsilon$ where $m_1$ is an adjustable TREND SLOPE and $\epsilon$ is small Gaussian noise. The reader controls the trend slope, the Spherical (residual) variogram, and the kriging target.

The TOP panel plots the samples (yellow), the TRUE GENERATING CURVE $m_0 + m_1 x$ as a dotted reference, the OK kriged curve (dashed grey, with constant-mean assumption), and the UK kriged curve (solid green, with linear-trend basis). The DRAGGABLE TARGET sits as a diamond on the UK curve with its ±1 $\sigma_{UK}$ envelope. The MIDDLE panel shows the OK and UK weights at the current target — bars for UK, grey ticks for OK. The BOTTOM panel reports the numerical solve: $Z^$ , and the trend strength in dimensionless "trend / sill" units.

Three things to do with this widget. FIRST, with the trend slope at zero (constant true mean), the OK and UK curves overlap closely — UK with $\beta_1 \approx 0$ degenerates to OK. UK pays a small extra $\mu_1$ price for the linear-coefficient constraint even when there is no trend to fit, but the predictor and variance are very close to OK's. SECOND, raise the trend slope to a moderate value (e.g. $m_1 = 1.0$ ). The dotted reference curve tilts up across the domain. The OK kriged curve TRIES to follow the tilt but FLATTENS — it pulls toward the global mean and underestimates both ends. The UK kriged curve tracks the dotted reference much more closely. The OK estimate at x = 0 is biased UP; at x = 1 it is biased DOWN; UK has neither bias.

THIRD, raise the trend slope to a strong value (e.g. $m_1 = 2.0$ ) and move the target to $x = 0.95$ . Watch the OK curve: it barely lifts above the global average. The UK curve climbs to track the dotted reference — but notice that the UK variance grows at the right edge. The Lagrange-multiplier term $\mu_1 \cdot x_0 = \mu_1 \cdot 0.95$ is now a large contribution to the variance: UK is HONEST about the extra uncertainty introduced by the trend coefficient estimate, especially at locations where the basis $x$ takes extreme values. EXTRAPOLATION BEYOND the data range would make this term explode — the dotted UK curve would still climb, but the variance would balloon — UK's honest "I don't trust this extrapolation" signal.

Kriging with external drift — when a secondary variable is known

UNIVERSAL KRIGING handles trends that are smooth functions of the coordinates. Sometimes the trend has a PHYSICAL EXPLANATORY VARIABLE instead. Air temperature drops with elevation (an atmospheric physics relationship). Groundwater-table depth tracks the ground-surface elevation (a hydrology relationship). Soil moisture correlates with NDVI from satellite remote-sensing. Ore grade in a layered deposit varies with depth (a depositional-history relationship). For each of these, there is a known SECONDARY VARIABLE $\zeta(\mathbf{s})$ that drives $Z(\mathbf{s})$ via a linear relationship.

KRIGING WITH EXTERNAL DRIFT (KED) is mathematically identical to UK, but the basis is the secondary variable instead of a polynomial of coordinates. The mean becomes

m(\mathbf{s}) \;=\; \beta_0 \,+\, \beta_1 \, \zeta(\mathbf{s}),

with unknown coefficients $\beta_0, \beta_1$ . The basis is $\mathbf{F} = [\mathbf{1}, \boldsymbol\zeta]$ — the constant column plus the secondary-variable column. The KED system is the UK system with this specific basis:

\begin{pmatrix} \mathbf{K} & \mathbf{F} \\ \mathbf{F}^\top & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{w} \\ \boldsymbol\mu \end{pmatrix} \;=\; \begin{pmatrix} \mathbf{k} \\ \begin{pmatrix} 1 \\ \zeta(\mathbf{s}_0) \end{pmatrix} \end{pmatrix}, \qquad \mathbf{F} \;=\; \begin{pmatrix} 1 & \zeta(\mathbf{s}_1) \\ \vdots & \vdots \\ 1 & \zeta(\mathbf{s}_N) \end{pmatrix}.

This is a $(N+2)\times(N+2)$ system. The two unbiasedness constraints are $\sum w_i = 1$ (the OK constraint, recovered for $f_0 = 1$ ) and $\sum w_i \zeta(\mathbf{s}_i) = \zeta(\mathbf{s}_0)$ (the new KED constraint — the weighted mean of the secondary at samples must match the secondary at the target).

The CRITICAL REQUIREMENT for KED: the secondary variable $\zeta(\mathbf{s})$ must be KNOWN DENSELY across the kriging domain — at every sample location AND at every kriging target. Examples where this is satisfied:

Temperature with elevation as drift — elevation is known everywhere from a DEM.
Groundwater table with surface elevation as drift — same.
Rainfall with elevation as drift — same.
Soil contamination with distance from a known source as drift — distance is computable everywhere.
Ore grade with depth as drift in a flat layered deposit — depth is computable from the layer geometry.

Examples where KED is NOT applicable: a secondary variable only sampled at a subset of locations (use cokriging instead — §5.7). The secondary must be defined LITERALLY EVERYWHERE the kriging target can be — that is the hard constraint that distinguishes KED from cokriging.

The second widget for §5.3 shows KED in action. Six samples on $[0, 1]$ where each sample value is the sum of a SECONDARY-DRIVEN component $\beta_1 \zeta(x_i)$ (linear in a known secondary variable, plotted alongside) plus stationary residual noise. The secondary $\zeta(x)$ is shown densely across the domain — this is the "external drift" KED uses.

The TOP panel shows the secondary curve $\zeta(x)$ (blue dashed) and the sample data (yellow) on the same axes. The MIDDLE panel shows the kriged predictions: OK ignores the secondary (dashed grey) and predicts a smoothed mean; KED with the secondary as drift (solid green) tracks the secondary-driven variation closely. The BOTTOM panel reports $Z^$ , and the strength $\beta_1$ of the secondary relationship.

Three things to do with this widget. FIRST, set the secondary-coupling strength $\beta_1$ to ZERO. The data become pure stationary noise — neither method has an advantage. OK and KED produce nearly identical predictions; KED pays a small $\mu_1$ price but otherwise behaves like OK. SECOND, raise $\beta_1$ to a moderate value. The data now visibly follow the secondary curve. OK still predicts a smooth average. KED tracks the secondary closely — every wiggle in the secondary curve is mirrored in the KED prediction (modulated by the KED weights). The KED estimate near the data is much closer to the truth than OK's smoothed estimate.

THIRD, drag the target into a region where the secondary takes an EXTREME value. KED predicts a value that reflects the secondary at that location; OK predicts something close to the global mean. The KED variance grows because the basis value $\zeta(s_0)$ multiplies the $\mu_1$ Lagrange penalty — but the prediction itself is informed by the secondary. This is the KED trade-off: it uses external knowledge to predict more accurately, and it honestly reports more variance when the basis is far from typical sample values.

The pragmatic alternative — trend + residual decomposition

Pure UK and KED solve the trend and the kriging in ONE system. A common pragmatic alternative is to DECOMPOSE the work into two steps:

Fit the trend by least-squares regression on the chosen basis. For UK: $\hat\beta = (\mathbf{F}^\top \mathbf{F})^{-1} \mathbf{F}^\top \mathbf{z}$ . Compute the fitted trend $\hat m(\mathbf{s}_i) = \mathbf{f}(\mathbf{s}_i)^\top \hat\beta$ at every sample.
Compute residuals $r(\mathbf{s}_i) = Z(\mathbf{s}_i) - \hat m(\mathbf{s}_i)$ .
Fit the variogram on the residuals, not on the raw data — see the residual-variogram trap below.
Krige the residuals using ordinary kriging on $r$ . Call the result $r^*(\mathbf{s}_0)$ .
Add the trend back at the kriging target: $Z^$ .

This is sometimes called REGRESSION KRIGING (Hengl, Heuvelink & Stein 2004; Pyrcz & Deutsch 2014 §4.6). It works well in practice, is much easier to implement than full UK, and is the workhorse for many "kriging with covariates" production workflows in soil-property mapping and digital-soil-mapping more broadly.

The mathematical difference between regression kriging and full UK is in the VARIANCE. Full UK propagates the trend-coefficient uncertainty through the Lagrange multipliers — $\sigma_{UK}^2$ includes the $\sum \mu_k f_k(\mathbf{s}$ term that accounts for it. Regression kriging treats the fitted $\hat\beta$ as FIXED in step 4 and so does not propagate its uncertainty into $\sigma$ {RK}^2 $σ_{R K}^{2}$ . The under-reporting is small in well-sampled regions (where the trend fit is precise) but grows where the trend basis is extreme (extrapolation regions). For high-stakes uncertainty quantification, prefer full UK; for routine production mapping, regression kriging is usually adequate.

The residual-variogram trap

One subtle but important point. Under UK or KED, the variogram $\gamma(h)$ should be computed on the RESIDUALS $r(\mathbf{s}) = Z(\mathbf{s}) - \hat m(\mathbf{s})$ , NOT on the raw data $Z(\mathbf{s})$ .

Why? A variogram computed on data with a strong trend is CONTAMINATED. Consider $Z(\mathbf{s}) = m(\mathbf{s}) + \epsilon(\mathbf{s})$ where $\epsilon$ is the zero-mean stationary residual. The "raw" variogram is

2 \gamma_{\text{raw}}(\mathbf{h}) \;=\; E[(Z(\mathbf{s} + \mathbf{h}) - Z(\mathbf{s}))^2] \;=\; (m(\mathbf{s} + \mathbf{h}) - m(\mathbf{s}))^2 \;+\; 2 \gamma_\epsilon(\mathbf{h}).

The first term is the TREND CONTRIBUTION — it grows without bound as $|\mathbf{h}|$ increases (the trend separates two distant locations by a deterministic amount). The result is a variogram that LOOKS LIKE AN UNBOUNDED POWER FUNCTION — $\gamma_{\text{raw}}(h) \sim h^2$ for a linear trend at large lags — instead of a clean Spherical / Exponential / Gaussian with a clear sill.

This pseudo-sill-less variogram is the CLASSIC SIGNATURE of an undetected trend. If you compute the empirical variogram from your data and it looks unbounded — no sill in sight, climbing without plateau — you should suspect a trend. The diagnostic test is to fit a low-order trend, subtract it, and recompute the variogram on the residuals. If the trend was the cause, the residual variogram now reaches a sensible sill at a reasonable range and looks like a clean §3 / §4 fittable shape. If the trend wasn't the cause, you have a power-law variogram and you need to consider a different model family (fractional-Brownian field, intrinsic random function).

In production UK / KED, the workflow is iterative: fit a tentative trend, subtract it, compute the variogram on residuals, refit the variogram, optionally refit the trend conditional on the residual variogram (generalised least squares), iterate until convergence. For simple linear trends with a clean residual structure, one pass is usually enough. See Cressie 1993 §3.5.2 for the iterative refitting algorithm.

Honest caveats — what UK and KED assume

The trend specification must be correct. UK and KED assume you have CORRECTLY identified the basis. Over-fitting (too many polynomial terms or too many KED secondaries) leaves NOTHING for kriging to do — the residuals are pure noise, the variogram has only nugget, the kriging weights become regression coefficients with no spatial-correlation contribution. The kriged map is the trend surface and the variance is the regression variance. Under-fitting (missing terms that are really there) leaves trend IN the residuals — the variogram is contaminated, the kriging is biased, the cross-validation residuals show a systematic pattern. The remedy in both directions is the same: keep the trend order LOW (typically 1 in 1-D, 1-2 in 2-D), examine the variogram of residuals to confirm it looks stationary, and use cross-validation to verify the SD of standardised residuals is near 1.

Extrapolation is dangerous. Universal kriging beyond the data range is unsafe. The polynomial basis $f_k(\mathbf{s})$ extrapolates wildly outside the convex hull of the sample locations — a linear trend keeps climbing forever; a quadratic trend turns into a balloon. The kriged prediction outside the data range is dominated by the EXTRAPOLATED TREND, not by the kriging system's data-driven contribution. The kriging variance grows correspondingly — the Lagrange-multiplier terms $\sum \mu_k f_k(\mathbf{s}_0)$ explode when $f_k(\mathbf{s}_0)$ takes extreme values. UK's honest variance is your safeguard, but the point estimate alone is misleading at extrapolation targets. Cap the kriging extent at the convex hull of the data, or report extrapolation predictions only with a "use with caution" flag.

KED requires the drift variable defined everywhere. If $\zeta$ is only known at sample locations, KED is impossible — you cannot evaluate $f_1(\mathbf{s}_0) = \zeta(\mathbf{s}_0)$ at every kriging target. Use COKRIGING (§5.7) instead, which handles sparsely-sampled secondaries within a multivariate framework. The requirement " $\zeta$ defined densely" is more common than it sounds because many natural explanatory variables come from continuous remote-sensing or DEM products: elevation, NDVI, distance-from-feature, slope, aspect, soil-type from a regional map.

Linear basis assumption. UK and KED both assume the trend is LINEAR in the chosen basis functions (with unknown coefficients). A non-linear relationship — e.g. $Z$ depends on $\log \zeta$ or $\sin \zeta$ — has to be linearised by transforming the basis ( $f_1 = \log \zeta$ instead of $f_1 = \zeta$ ) before plugging into KED. Failure to linearise leaves residual structure in the data that the kriging cannot capture.

Cross-validation is mandatory. UK and KED stress the kriging system harder than OK — more constraints, more Lagrange multipliers, more places for an over-fit or under-fit specification to bias the result. The §4.5 leave-one-out cross-validation is essential: SD of standardised residuals near 1, no spatial pattern in residuals, no systematic dependence of residuals on the basis variables. If the SD is too small the trend is over-fit; if too large the trend is under-fit; if the residuals show a pattern on the basis variable the basis specification is wrong.

§5.3 in the architecture of Part 5

§5.3 generalises the OK constraint to a basis-function expansion. §5.4 (kriging variance) cashes out what $\sigma_{UK}^2$ does and does not mean — the calibration apparatus from Part 6 applies the same way it does for OK and SK, with the additional dimensions of trend-coefficient uncertainty. §5.5 (block kriging) generalises the target from a point to a block; the same UK / KED machinery extends with point-to-block average covariances on the right-hand side and basis-function block averages on the basis-constraint rows.

§5.6 (neighbourhood selection) interacts with UK / KED in a subtle way: a local neighbourhood limits the data the kriging sees, and the basis must remain identifiable within that neighbourhood. With a tight neighbourhood and a high-order basis, the $\mathbf{F}$ matrix can become near-singular — the local samples don't span the basis well enough to estimate all $K$ coefficients. The remedy is to use a SMALLER basis with local neighbourhoods (typically just the constant) and let the local-stationarity of OK absorb slow trends within each neighbourhood. Full UK / KED with high-order bases needs global or near-global neighbourhoods.

§5.7 (cokriging) takes UK / KED a step further: when the secondary variable is SAMPLED rather than dense, the cokriging system handles it within a multivariate framework with cross-covariances. KED is the special case of cokriging when the secondary is dense; cokriging handles the more general case.

§5.8 (pathologies) catalogues the common UK / KED failure modes: trend specification too high-order (cross-validation overcoverage); residual variogram still trending (under-fitted trend); extrapolation runaway (Lagrange-multiplier variance explosion at the boundary); KED drift undefined at some targets (system unsolvable — practical fix is to flag those targets and fall back to OK locally).

Try it

In universal-kriging-trend, set the TREND SLOPE $m_1$ to ZERO. Note that OK and UK kriged curves overlap closely — UK with a zero linear coefficient degenerates to OK. The UK variance is slightly larger because $\mu_1$ still costs something even when $\beta_1 \approx 0$ . Read the bottom-row pills: the trend / sill ratio is ≈ 0; the kriging curves are nearly identical.
In universal-kriging-trend, raise the trend slope $m_1$ to 1.5. The dotted true-mean reference now climbs across the domain. OK's kriged curve TRIES to follow but FLATTENS — it pulls toward the global average. UK's kriged curve tracks the dotted line. Read the OK prediction at $x = 0.0$ versus the dotted true mean at $x = 0.0$ : OK is biased UP. At $x = 1.0$ , OK is biased DOWN.
In universal-kriging-trend, drag the target to $x = 0.95$ with a strong trend. The UK variance grows because $\mu_1 \cdot 0.95$ adds substantially to $\sigma_{UK}^2$ . UK is HONEST: the right edge of the domain has more uncertainty about the trend coefficient than the centre does. Compare with $x = 0.50$ where the $\mu_1 \cdot 0.50$ term is half as large.
In universal-kriging-trend, raise the trend slope to a STRONG value ( $m_1 = 2.0$ ) and confirm: OK's cross-validation would show systematic over-prediction at low $x$ and under-prediction at high $x$ . UK's would not. This is why diagnosing a trend matters — the choice of kriging variant changes the predictions and the diagnostics.
In ked-with-secondary, set the secondary coupling $\beta_1$ to ZERO. Data are stationary noise around a constant. OK and KED produce nearly identical predictions. KED still pays a small $\mu_1$ price but the predictions barely differ from OK.
In ked-with-secondary, raise $\beta_1$ to a STRONG positive value. The sample data now visibly follow the secondary curve. OK predicts a smooth average that misses the wiggles. KED tracks the secondary much more closely. Drag the target into a peak of the secondary — KED predicts a high value; OK predicts close to the global mean.
In ked-with-secondary, set the secondary to a NEGATIVE coupling. KED tracks the inverse relationship — every peak in the secondary becomes a trough in the prediction. The KED Lagrange multiplier $\mu_1$ adapts accordingly. This is why KED needs a defensible PHYSICAL relationship: it does not care about the sign or the strength, only that the relationship is linear in the basis.
Without coding: a meteorologist has 30 air-temperature stations across a mountainous region with a digital elevation model available everywhere. They want to predict temperature at unmonitored locations. Which kriging variant — OK, UK with $x, y$ linear trend, or KED with elevation as drift — is most defensible, and why? What residual-variogram check should they perform first?
Without coding: a soil scientist fits ordinary kriging to their data and the empirical variogram looks like $\gamma(h) \propto h^{1.8}$ — unbounded, no sill in sight. What does this most likely diagnose, and what should they do next?
Without coding: a mining engineer runs KED on their ore-grade dataset with depth as the drift. Their leave-one-out cross-validation shows SD of standardised residuals = 0.62. What does this indicate (over-fit, under-fit, or well-calibrated trend), and what would you change about the model specification?

Pause and reflect: the UK kriging variance $\sigma_{UK}^2 = C(0) - \mathbf{w}^\top \mathbf{k} + \sum_k \mu_k f_k(\mathbf{s}_0)$ includes ONE term per Lagrange multiplier. The $k = 0$ term $\mu_0 \cdot 1$ is the OK penalty. The higher-order terms $\mu_k \cdot f_k(\mathbf{s}$ depend on the BASIS VALUE at the target. Why does this make UK's variance EXPLODE at extrapolation targets (outside the convex hull of the data), even when the predictor $Z^*$ {UK} $Z_{U K}^{*}$ may look reasonable? What does this tell you about how to USE the kriging variance map as a guardrail against extrapolation?

What you now know — and the open of §5.4

You can DIAGNOSE the limitation of OK on a trending field — the $\sum w = 1$ constraint forces a single implicit mean estimate across the kriging neighbourhood, biased toward the global average when a clear trend is present. You can RECOGNISE the empirical signature of an undetected trend: an unbounded raw variogram with no apparent sill, plus systematic cross-validation residual patterns dependent on a candidate coordinate or external variable.

You can STATE the UNIVERSAL KRIGING ASSUMPTION (mean is a known linear combination of $K$ known basis functions with unknown coefficients) and DERIVE the $K$ unbiasedness constraints $\sum_i w_i f_k(\mathbf{s}_i) = f_k(\mathbf{s}_0)$ from the requirement of unbiasedness regardless of the unknown coefficient values.

You can WRITE the UK AUGMENTED SYSTEM

\begin{pmatrix} \mathbf{K} & \mathbf{F} \\ \mathbf{F}^\top & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{w} \\ \boldsymbol\mu \end{pmatrix} = \begin{pmatrix} \mathbf{k} \\ \mathbf{f}_0 \end{pmatrix}

— a $(N+K)\times(N+K)$ symmetric-indefinite linear system. You can SOLVE it by LU decomposition with partial pivoting, the same solver as for OK's augmented system. You can COMPUTE the UK variance $\sigma_{UK}^2 = C(0) - \mathbf{w}^\top \mathbf{k} + \boldsymbol\mu^\top \mathbf{f}_0$ (with $\mu_k \ge 0$ convention).

You can RECOGNISE KED as UK with a basis $\mathbf{F} = [\mathbf{1}, \boldsymbol\zeta]$ where $\boldsymbol\zeta$ is a known SECONDARY VARIABLE defined densely across the kriging domain. You know the four canonical KED examples (temperature ~ elevation, groundwater table ~ surface elevation, ore grade ~ depth, soil moisture ~ NDVI) and the critical requirement that the drift be defined everywhere.

You can use the PRAGMATIC ALTERNATIVE of trend + residual decomposition (regression kriging): fit the trend by least-squares, compute residuals, fit a variogram on residuals, krige residuals with OK, add back the trend. You know this UNDER-REPORTS variance by an amount related to the trend-coefficient uncertainty, and that full UK / KED is the calibrated alternative.

You recognise the RESIDUAL-VARIOGRAM TRAP: a raw variogram on trending data looks unbounded; only the residuals' variogram has a sensible structure. You know the HONEST CAVEATS — trend specification matters (over-fit destroys the kriging, under-fit biases the kriging), extrapolation beyond the data range is dangerous (basis explodes), KED requires the drift defined everywhere, the basis must be linear (transform if needed), cross-validation diagnoses over- and under-fit.

You can READ the widget output — comparing OK (constant mean) against UK (linear trend basis) on a trending field, and OK against KED (secondary as drift) on a secondary-driven field. You can interpret the Lagrange multipliers $\mu_k$ as the HONEST COST of estimating each trend coefficient.

This OPENS §5.4 — the kriging variance, what it means and what it doesn't. The next section cashes out $\sigma_K^2$ in detail: the calibration to actual error (Part 6), the data-independence property and its consequence for sampling design, the failure modes that masquerade as honest variance reports. The kriging-variance interpretation applies uniformly across SK, OK, UK, and KED — every variant produces a $\sigma_K^2(\mathbf{s}_0)$ that lives in the same conceptual space and has the same dependencies on layout, variogram, and (where applicable) basis. §5.4 makes the interpretation explicit.

References

Matheron, G. (1969). Le krigeage universel. Cahiers du Centre de Morphologie Mathématique, 1, École des Mines de Paris. (The original formal paper introducing universal kriging. Derives the augmented system from the constrained-MSE framework with polynomial basis functions, develops the Lagrange-multiplier apparatus for the trend coefficients, and connects to the intrinsic random function theory. The reference for the mathematical foundations of UK.)
Goldberger, A.S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association, 57(298), 369–375. (The foundational paper on best linear unbiased prediction with a known mean function. Develops the prediction theory that underlies UK's derivation in a regression framework. The connection between BLUP in statistics and UK in geostatistics is documented in Stein 1999 §1.5.)
Matheron, G. (1971). The Theory of Regionalized Variables and Its Applications. Cahiers du Centre de Morphologie Mathématique, École des Mines de Paris, No. 5. (Chapter 4 expands the universal-kriging derivation with full mathematical rigour. Discusses the connection between universal kriging and intrinsic random functions of order k — the formal mathematical framework that handles non-stationary means.)
Cressie, N. (1993). Statistics for Spatial Data (revised ed.). Wiley. (§3.5 develops UK at mathematical-statistics rigour, including the iterative refitting algorithm for the trend + residual-variogram joint estimation. §3.5.2 specifically treats the residual-variogram problem and the generalised-least-squares refitting. The reference textbook for the formal UK derivation.)
Chilès, J.-P., Delfiner, P. (2012). Geostatistics: Modeling Spatial Uncertainty (2nd ed.). Wiley. (Chapter 3.5-3.6 is the comprehensive modern reference for UK and KED. Develops both variants with detailed treatment of basis choice, the residual-variogram problem, the extrapolation pathology, and the connection to intrinsic random functions. The standard graduate-school reference for §5.3 material.)
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (§5.5 develops universal kriging from the practitioner perspective with worked examples on the WL dataset and on a coal-seam thickness application. §5.6 develops KED with elevation as the drift in a soil-property mapping example. The practitioner-textbook reference for §5.3 at graduate-school level.)
Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (§16 develops universal kriging at the entry-level practitioner-pedagogy level. Includes the geometric interpretation of the basis constraints — the weighted mean of sample coordinates must equal the target coordinate, etc. — that makes the UK system intuitive.)
Deutsch, C.V., Journel, A.G. (1998). GSLIB: Geostatistical Software Library and User's Guide (2nd ed.). Oxford University Press. (§IV.1.5 documents the universal-kriging mode of the kt3d program — kriging type itype = 2 (UK with polynomial drift) and itype = 3,4,5 (KED with one to three external drift variables). The §5.3 widgets in this section implement the same algorithm at smaller scale.)
Hengl, T., Heuvelink, G.B.M., Stein, A. (2004). A generic framework for spatial prediction of soil variables based on regression-kriging. Geoderma, 120(1-2), 75–93. (The reference paper for REGRESSION KRIGING — the pragmatic trend + residual decomposition that approximates full UK / KED. Develops the workflow, compares to full UK on a soil-property mapping problem, and discusses the variance under-reporting issue. Foundational for digital soil mapping.)
Pyrcz, M.J., Deutsch, C.V. (2014). Geostatistical Reservoir Modeling (2nd ed.). Oxford University Press. (§4.5-4.6 documents UK / KED and regression kriging in production reservoir-characterisation workflows. Practical guidance on basis choice, neighbourhood compatibility, and the residual-variogram refitting loop.)
Stein, M.L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer. (The rigorous theoretical-statistics treatment of UK. Develops the asymptotic theory under increasing-domain and infill asymptotics, and the connection to BLUP in the generalised-linear-model framework — the link to Goldberger 1962.)
Wackernagel, H. (2003). Multivariate Geostatistics (3rd ed.). Springer. (Chapter 12 develops UK and KED within the multivariate framework. Useful for seeing how KED specialises from cokriging (§5.7) when the secondary variable is densely defined, and for the relationship between KED and the cokriging system.)

Universal kriging and kriging with external drift

Learning objectives

The universal-kriging assumption — mean as basis-function expansion

The UK unbiasedness constraints — Σ w_i f_k(s_i) = f_k(s₀)

The universal-kriging system — augmented (N+K)×(N+K)

The universal-kriging variance

Universal kriging — the trend widget

Kriging with external drift — when a secondary variable is known

KED with a secondary variable — second widget

The pragmatic alternative — trend + residual decomposition

The residual-variogram trap

Honest caveats — what UK and KED assume

§5.3 in the architecture of Part 5

Try it

What you now know — and the open of §5.4

References

Learning objectives

The problem with OK on a trending field

The universal-kriging assumption — mean as basis-function expansion

The UK unbiasedness constraints — Σ w_i f_k(s_i) = f_k(s₀)

The universal-kriging system — augmented (N+K)×(N+K)

The universal-kriging variance

Universal kriging — the trend widget

Kriging with external drift — when a secondary variable is known

KED with a secondary variable — second widget

The pragmatic alternative — trend + residual decomposition

The residual-variogram trap

Honest caveats — what UK and KED assume

§5.3 in the architecture of Part 5

Try it

What you now know — and the open of §5.4

References