Jackknife and split-sample validation

Part 6 — Cross-validation and QC

Learning objectives

Define JACKKNIFE as the leave-one-out resampling estimator of bias and variance
Distinguish LOO-CV (each point once as test) from SPLIT-SAMPLE (random hold-out fraction)
Recognise the BIAS-VARIANCE trade-off between the two approaches
Apply k-FOLD CV as a compromise between LOO and split-sample for large datasets
Choose the appropriate validation scheme given dataset size and computational budget

Cross-validation has several variants in geostatistics. §6.1 introduced leave-one-out cross-validation (LOO-CV). §6.4 develops the JACKKNIFE perspective and contrasts it with SPLIT-SAMPLE validation. Each has trade-offs in bias, variance, and computational cost.

The jackknife (Quenouille 1956, Tukey 1958)

The jackknife is a precursor to the bootstrap. For an estimator $T(y)$ :

For each i: compute $T_{(-i)} = T(y_{-i})$ (estimator on data without point i).
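The N leave-one-out replicates $T_{(-i)}$ (called pseudo-values in this section; Tukey's original pseudo-values are $N T - (N - 1) T_{(-i)}$) provide bias and variance estimates.
Jackknife bias estimate: $\widehat{\text{Bias}} = (N - 1)(T_{(\cdot)} - T)$, where $T = T(y)$ is the full-sample estimate and $T_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} T_{(-i)}$ is the mean of the pseudo-values.
Jackknife variance: $\hat{V}(T) = \frac{N - 1}{N} \sum_{i=1}^{N} (T_{(-i)} - T_{(\cdot)})^2$.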
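The two formulas above can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical helper named `jackknife`; the test data are simulated and purely illustrative. It applies the jackknife to the plug-in variance (which divides by N and is therefore biased) and to the sample mean.

```python
import numpy as np

def jackknife(y, estimator):
    """Return (bias_estimate, variance_estimate) from leave-one-out values."""
    y = np.asarray(y, dtype=float)
    n = y.size
    t_full = estimator(y)
    # T_(-i): the estimator recomputed with point i removed.
    t_loo = np.array([estimator(np.delete(y, i)) for i in range(n)])
    t_dot = t_loo.mean()
    bias = (n - 1) * (t_dot - t_full)
    var = (n - 1) / n * np.sum((t_loo - t_dot) ** 2)
    return bias, var

rng = np.random.default_rng(0)
y = rng.normal(size=30)  # simulated data, for illustration only

# Plug-in variance divides by N; subtracting the jackknife bias estimate
# gives the familiar N-1 version.
bias_hat, _ = jackknife(y, np.var)
corrected = np.var(y) - bias_hat

# For the sample mean, the jackknife variance equals s^2 / N.
_, var_mean = jackknife(y, np.mean)
```

Both results are algebraic identities rather than approximations: `corrected` matches `np.var(y, ddof=1)` and `var_mean` matches `np.var(y, ddof=1) / len(y)` up to floating-point rounding.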

For kriging, the jackknife IS leave-one-out cross-validation: the pseudo-values are the kriging predictions $\hat{z}_{(-i)}$, and the residuals $z_i - \hat{z}_{(-i)}$ provide the basis for variance and bias estimates.
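As a rough illustration of the leave-one-out residuals for kriging, the sketch below implements ordinary kriging with NumPy. Everything in it is an assumption made for the example: the exponential covariance, its sill and range, the simulated field, and the function names `exp_cov`, `ordinary_kriging` and `loo_residuals`. In practice the covariance model would come from a fitted variogram.

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=0.3):
    """Exponential covariance; sill and corr_range are illustrative values."""
    return sill * np.exp(-h / corr_range)

def ordinary_kriging(xy_train, z_train, xy_target):
    """Ordinary kriging prediction at a single target location."""
    n = len(z_train)
    d = np.linalg.norm(xy_train[:, None, :] - xy_train[None, :, :], axis=-1)
    # Kriging system with a Lagrange row/column forcing weights to sum to 1.
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = exp_cov(d)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = exp_cov(np.linalg.norm(xy_train - xy_target, axis=1))
    weights = np.linalg.solve(a, b)[:n]
    return weights @ z_train

def loo_residuals(xy, z):
    """Residuals z_i - zhat_(-i); each prediction is made without point i."""
    n = len(z)
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        res[i] = z[i] - ordinary_kriging(xy[keep], z[keep], xy[i])
    return res

rng = np.random.default_rng(1)
xy = rng.uniform(size=(40, 2))  # simulated sample locations
z = np.sin(4 * xy[:, 0]) + np.cos(3 * xy[:, 1]) + 0.1 * rng.normal(size=40)

res = loo_residuals(xy, z)
loo_mse = np.mean(res ** 2)     # accuracy of the leave-one-out predictions
loo_mean_error = np.mean(res)   # should be near zero for an unbiased predictor
```

The loop solves one kriging system per data point, which is the N × kriging cost of LOO-CV.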

Split-sample validation

An alternative: randomly partition the data into training (say 70%) and test (30%). Fit on training, predict on test, compute MSE. Repeat with different random partitions and average.

Advantages: simpler to interpret; computationally cheap; each test point gets a fresh, unbiased prediction.
Disadvantages: smaller training set (predictions slightly worse than LOO); MSE estimate is biased upward (training on fewer than N-1 points); high variance across splits unless many partitions are averaged.
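The repeated random partition described above can be sketched as follows. To keep the example short, inverse-distance weighting stands in for the kriging predictor; that substitution, the simulated data, and the names `idw_predict` and `split_sample_mse` are assumptions of this sketch, not part of the method itself.

```python
import numpy as np

def idw_predict(xy_train, z_train, xy_test, power=2.0):
    """Inverse-distance weighting, used here as a stand-in for kriging."""
    d = np.linalg.norm(xy_test[:, None, :] - xy_train[None, :, :], axis=-1)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return (w @ z_train) / w.sum(axis=1)

def split_sample_mse(xy, z, test_frac=0.3, n_splits=20, seed=0):
    """Mean and standard deviation of the test MSE over random partitions."""
    rng = np.random.default_rng(seed)
    n = len(z)
    n_test = int(round(test_frac * n))
    mses = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        pred = idw_predict(xy[train], z[train], xy[test])
        mses.append(np.mean((z[test] - pred) ** 2))
    return np.mean(mses), np.std(mses, ddof=1)

rng = np.random.default_rng(2)
xy = rng.uniform(size=(40, 2))  # simulated sample locations
z = np.sin(4 * xy[:, 0]) + np.cos(3 * xy[:, 1])

mean_mse, sd_mse = split_sample_mse(xy, z)
```

The spread `sd_mse` across partitions is the split-to-split variance mentioned among the disadvantages; averaging over more partitions reduces the uncertainty of `mean_mse`.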

k-fold CV as a compromise

k-fold CV (k = 5 or 10): partition the data into k folds; for each fold, train on the other k-1 folds and test on it; average the k test errors. Special cases: k = 2 corresponds to split-sample with a 50% hold-out (with the two halves swapping roles); k = N is LOO-CV.

k=5 is a typical choice: balance between LOO (k=N, low bias high variance) and split-sample (k=2, higher bias lower variance). For VERY LARGE datasets, k=5 is computationally tractable while LOO is not.
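The fold construction, and the fact that k = N reduces to LOO-CV, can be seen from the index bookkeeping alone. This is a minimal sketch assuming NumPy; `kfold_indices` is a hypothetical helper name.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index arrays for k random folds of n points."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for j in range(k):
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        yield train, folds[j]

n = 40
sizes_k5 = [(len(tr), len(te)) for tr, te in kfold_indices(n, 5)]
sizes_loo = [(len(tr), len(te)) for tr, te in kfold_indices(n, n)]
```

With N = 40 and k = 5, every fold trains on 32 points and tests on 8. With k = N, every test fold is a single point and the training set has N-1 = 39 points, which is LOO-CV.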

The bias-variance trade-off

Method	Training size	Bias	Variance	Cost
LOO-CV	N-1	Minimal	High	N kriging runs
k-fold (k=10)	~0.9N	Small	Moderate	10 kriging runs
k-fold (k=5)	~0.8N	Modest	Lower	5 kriging runs
Split-sample (30%)	0.7N	Largest	Lowest	1 kriging run per split

For geostatistical datasets (typically N=50–500), LOO-CV is feasible and gives the least biased error estimate of the schemes above. For VERY LARGE datasets (>10⁵ points), k=5 or k=10 is more tractable.

Spatial cross-validation: a critical caveat

For SPATIAL data, a critical issue: random k-fold partitioning may put POINTS NEAR EACH OTHER in different folds. Spatially-adjacent points are correlated, so the model "cheats" by learning correlated information from nearby training points. This makes the CV MSE artificially OPTIMISTIC.

Fix: SPATIAL k-fold CV. Partition the data into k spatially-CONTIGUOUS blocks. Each block forms a test fold; the training set is the rest. This forces predictions to be made at locations FAR from the training data, which is a more honest assessment of model generalisation. This approach is standard in the modern spatial-statistics literature; see Roberts et al. (2017), "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure".
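One simple way to build contiguous blocks is a regular grid over the study area. The sketch below assumes NumPy, a 2 × 2 grid, and simulated locations; `spatial_block_folds` is a hypothetical helper, and other blocking schemes (for example clustering the coordinates) would serve equally well.

```python
import numpy as np

def spatial_block_folds(xy, n_blocks_x=2, n_blocks_y=2):
    """Label each point with the rectangular grid block it falls in."""
    x, y = xy[:, 0], xy[:, 1]
    # Interior block edges; np.digitize maps each coordinate to a block index.
    edges_x = np.linspace(x.min(), x.max(), n_blocks_x + 1)[1:-1]
    edges_y = np.linspace(y.min(), y.max(), n_blocks_y + 1)[1:-1]
    ix = np.digitize(x, edges_x)
    iy = np.digitize(y, edges_y)
    return ix * n_blocks_y + iy

rng = np.random.default_rng(3)
xy = rng.uniform(size=(200, 2))  # simulated sample locations
labels = spatial_block_folds(xy)

# One (train, test) pair per block: test on the block, train on the rest.
splits = [
    (np.flatnonzero(labels != fold), np.flatnonzero(labels == fold))
    for fold in np.unique(labels)
]
```

Each pair in `splits` can be passed to the same fit-and-predict loop used for random k-fold CV; only the way the folds are drawn changes.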

Try it

Defaults: N=40, split=30%. LOO and split-sample give similar MSE estimates with comparable SEs.
Raise the split (hold-out) fraction to 50%. The split-sample MSE estimate increases (training on 20 points instead of 28). LOO is unaffected: it always uses N-1=39 training points.
Increase N to 100. LOO stays deterministic (there is no random partition to vary); the split-sample MSE estimate stabilises (less variance across splits, more data in each split).
Re-sample multiple times. LOO MSE is more stable across resamples than split-sample (which averages 20 random splits).
The takeaway: for moderate N (50–500), LOO is unambiguously preferred for geostat applications. Split-sample is for cases where LOO is prohibitively expensive (e.g., N > 10⁵).

For a spatial-prediction model with N = 200 clustered samples, why might random 5-fold CV give an over-optimistic MSE estimate, and what spatial-CV alternative would you recommend?

What you now know

Jackknife = LOO-CV mathematically. Split-sample is a faster, biased alternative. k-fold CV interpolates between them. Spatial CV partitions into contiguous blocks to defeat the spatial-correlation cheat. Modern geostat best practice: LOO-CV for moderate N; spatial k-fold for spatial datasets where spatially-random CV would over-estimate model quality. §6.5 closes Part 6 with debiasing checks and conditional bias.

References

Quenouille, M.H. (1956). "Notes on bias in estimation." Biometrika 43, 353–360. (Original jackknife.)
Tukey, J.W. (1958). "Bias and confidence in not-quite large samples." Annals of Mathematical Statistics 29, 614. (Jackknife extension.)
Roberts, D.R., et al. (2017). "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure." Ecography 40, 913–929.
Brenning, A. (2012). "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing." IEEE IGARSS.
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press.