Jackknife and split-sample validation
Learning objectives
- Define JACKKNIFE as the leave-one-out resampling estimator of bias and variance
- Distinguish LOO-CV (each point once as test) from SPLIT-SAMPLE (random hold-out fraction)
- Recognise the BIAS-VARIANCE trade-off between the two approaches
- Apply k-FOLD CV as a compromise between LOO and split-sample for large datasets
- Choose the appropriate validation scheme given dataset size and computational budget
Cross-validation has several variants for geostatistics. §6.1 introduced LOO-CV (leave-one-out). §6.4 develops the JACKKNIFE perspective and contrasts it with SPLIT-SAMPLE validation. Each variant has trade-offs in bias, variance, and computational cost.
The jackknife (Quenouille 1956, Tukey 1958)
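The jackknife is a precursor to the bootstrap. For an estimator $\hat{\theta}$ computed from $N$ data points:
- For each $i = 1, \dots, N$: compute $\hat{\theta}_{(-i)}$ (the estimator on the data without point $i$).
- The $N$ pseudo-values $\hat{\theta}_{(-1)}, \dots, \hat{\theta}_{(-N)}$ provide bias and variance estimates.
- Jackknife bias estimate: $\widehat{\mathrm{bias}}_{\mathrm{jack}} = (N-1)\,(\bar{\theta}_{(\cdot)} - \hat{\theta})$, where $\bar{\theta}_{(\cdot)} = \frac{1}{N}\sum_{i=1}^{N} \hat{\theta}_{(-i)}$ is the mean of the pseudo-values.
- Jackknife variance: $\widehat{\mathrm{var}}_{\mathrm{jack}} = \frac{N-1}{N}\sum_{i=1}^{N} \left(\hat{\theta}_{(-i)} - \bar{\theta}_{(\cdot)}\right)^2$.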
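The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it uses the standard jackknife formulas (bias = (N-1) × (mean of leave-one-out values minus full-sample estimate), variance = (N-1)/N × sum of squared deviations of the leave-one-out values). The function name `jackknife`, the synthetic sample, and the choice of the plug-in variance as the estimator are all assumptions made for the example.

```python
import numpy as np

def jackknife(data, estimator):
    """Return (theta_hat, bias_jack, var_jack) for a 1-D sample.

    Hypothetical helper for illustration only.
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    theta_hat = estimator(data)
    # Leave-one-out values: the estimator recomputed without point i.
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])
    loo_mean = loo.mean()
    bias = (n - 1) * (loo_mean - theta_hat)
    var = (n - 1) / n * np.sum((loo - loo_mean) ** 2)
    return theta_hat, bias, var

# Synthetic sample (assumed values, not from the lesson).
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=30)

# The plug-in variance (divide by N) is biased downward.
theta_hat, bias, var = jackknife(sample, lambda x: x.var(ddof=0))
corrected = theta_hat - bias  # jackknife bias-corrected estimate
```

For this particular estimator the correction is exact: subtracting the jackknife bias from the plug-in variance recovers the usual unbiased sample variance (`sample.var(ddof=1)`), which makes it a convenient sanity check.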
For kriging, the jackknife IS leave-one-out cross-validation: the pseudo-values are the kriging predictions at each removed location (made from the remaining N-1 points), and the residuals (observed minus predicted) provide the basis for variance and bias estimates.
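As a rough sketch of what this looks like in practice, the block below runs leave-one-out cross-validation with a bare-bones ordinary kriging predictor. Everything specific here is an assumption for illustration: the exponential covariance with its sill and range, the synthetic data, and the helper names `exp_cov`, `ok_predict` and `loo_residuals`. A real workflow would use the variogram model fitted earlier in the course.

```python
import numpy as np

def exp_cov(h, sill=1.0, range_=30.0):
    """Exponential covariance; sill and range_ are illustrative assumptions."""
    return sill * np.exp(-3.0 * h / range_)

def ok_predict(xy_train, z_train, xy_new):
    """Ordinary kriging prediction at a single location xy_new."""
    n = len(z_train)
    d = np.linalg.norm(xy_train[:, None, :] - xy_train[None, :, :], axis=-1)
    # Kriging system with a Lagrange multiplier enforcing sum(weights) = 1.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_cov(d)
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = exp_cov(np.linalg.norm(xy_train - xy_new, axis=1))
    weights = np.linalg.solve(A, b)[:n]
    return weights @ z_train

def loo_residuals(xy, z):
    """Residual z_i minus the prediction at point i from the other N-1 points."""
    n = len(z)
    res = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        res[i] = z[i] - ok_predict(xy[keep], z[keep], xy[i])
    return res

# Synthetic data (assumed): 40 random locations with a smooth trend plus noise.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 100, size=(40, 2))
z = np.sin(xy[:, 0] / 20.0) + 0.1 * rng.normal(size=40)

res = loo_residuals(xy, z)
loo_mse = np.mean(res ** 2)   # accuracy summary
mean_error = res.mean()       # bias summary; near zero is the goal
```

The loop solves one kriging system per left-out point, which is why the cost of LOO-CV scales with N.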
Split-sample validation
An alternative: randomly partition the data into training (say 70%) and test (30%). Fit on training, predict on test, compute MSE. Repeat with different random partitions and average.
Advantages: simpler to interpret; computationally cheap; each test point gets a fresh, unbiased prediction. Disadvantages: smaller training set (predictions slightly worse than LOO); MSE estimate is biased upward (training on fewer than N-1 points); high variance across splits unless many partitions are averaged.
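A repeated split-sample estimate might be sketched as follows. To keep the block short, inverse-distance weighting stands in for the kriging predictor; that substitution, the helper names `idw_predict` and `split_sample_mse`, the synthetic data, and the default of 20 random splits are all assumptions for illustration.

```python
import numpy as np

def idw_predict(xy_train, z_train, xy_test, power=2.0):
    """Inverse-distance weighting; a stand-in for the kriging predictor."""
    d = np.linalg.norm(xy_test[:, None, :] - xy_train[None, :, :], axis=-1)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    return (w @ z_train) / w.sum(axis=1)

def split_sample_mse(xy, z, test_frac=0.3, n_splits=20, seed=0):
    """Return one test-set MSE per random train/test partition."""
    rng = np.random.default_rng(seed)
    n = len(z)
    n_test = int(round(test_frac * n))
    mses = np.empty(n_splits)
    for s in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        pred = idw_predict(xy[train], z[train], xy[test])
        mses[s] = np.mean((z[test] - pred) ** 2)
    return mses

# Synthetic data (assumed): 40 random locations with a smooth trend plus noise.
rng = np.random.default_rng(2)
xy = rng.uniform(0, 100, size=(40, 2))
z = np.sin(xy[:, 0] / 20.0) + 0.1 * rng.normal(size=40)

mses = split_sample_mse(xy, z)
mean_mse = mses.mean()        # the reported split-sample estimate
spread = mses.std(ddof=1)     # variability across random partitions
```

Inspecting `spread` alongside `mean_mse` shows how much a single split can differ from the average, which is the reason for repeating the partition.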
k-fold CV as a compromise
k-fold CV (k = 5 or 10): partition the data into k folds; for each fold, train on the other k-1 folds and test on it; average the results. Special cases: k=2 corresponds to split-sample with a 50% hold-out (with each half serving once as the test set); k=N is LOO-CV.
k=5 is a typical choice: a balance between LOO (k=N, low bias, high variance) and split-sample (k=2, higher bias, lower variance). For VERY LARGE datasets, k=5 is computationally tractable while LOO is not.
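The fold bookkeeping can be sketched with a small generator. The name `kfold_indices` is hypothetical, and the random (non-spatial) assignment of points to folds is an assumption of this sketch.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal random folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        yield train, test

n = 40
five_fold = list(kfold_indices(n, k=5))   # 5 fits, each on 32 training points
loo = list(kfold_indices(n, k=n))         # k = N reproduces LOO-CV: 40 fits on 39 points
```

Every point appears in exactly one test fold, so averaging the per-fold errors uses each observation once as a test case, whichever k is chosen.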
The bias-variance trade-off
| Method | Training size | Bias | Variance | Cost |
|---|---|---|---|---|
| LOO-CV | N-1 | Minimal | High | N × kriging |
| k-fold (k=10) | ~0.9N | Small | Moderate | k × kriging |
| k-fold (k=5) | ~0.8N | Modest | Lower | k × kriging |
| Split-sample (30% hold-out) | 0.7N | Largest | Lowest when averaged over many splits; high for a single split | 1 kriging per split |
For geostatistical datasets (typically N=50–500), LOO-CV is both feasible and the least biased of these options. For VERY LARGE datasets (>10⁵ points), k=5 or k=10 is more tractable.
Spatial cross-validation: a critical caveat
For SPATIAL data there is a critical issue: random k-fold partitioning may put POINTS NEAR EACH OTHER in different folds. Spatially adjacent points are correlated, so each test point is predicted largely from close, highly correlated training neighbours and the model effectively "cheats". This makes the CV MSE artificially OPTIMISTIC as a measure of performance away from the sampled locations.
Fix: SPATIAL k-fold CV. Partition the data into k spatially-CONTIGUOUS blocks. Each block forms a test fold; the training set is the rest. This forces predictions to be made at locations FAR from the training data, which gives a more honest assessment of model generalisation. This approach is standard in the modern spatial-statistics literature; see Roberts et al. (2017) "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure".
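One simple way to build contiguous blocks is a regular grid over the study area, sketched below. The 2 × 2 grid, the helper names `spatial_block_folds` and `mean_nearest_train_distance`, and the synthetic coordinates are assumptions for illustration; other blocking schemes (for example clustering the coordinates) are equally valid.

```python
import numpy as np

def spatial_block_folds(xy, n_blocks_x=2, n_blocks_y=2):
    """Assign each point a fold id from a regular grid of rectangular blocks."""
    x, y = xy[:, 0], xy[:, 1]
    # Interior block edges; np.digitize maps each coordinate to a block index.
    x_edges = np.linspace(x.min(), x.max(), n_blocks_x + 1)[1:-1]
    y_edges = np.linspace(y.min(), y.max(), n_blocks_y + 1)[1:-1]
    ix = np.digitize(x, x_edges)
    iy = np.digitize(y, y_edges)
    return ix * n_blocks_y + iy

def mean_nearest_train_distance(xy, fold):
    """Average distance from each test point to its nearest training point."""
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    out = np.empty(len(xy))
    for f in np.unique(fold):
        test = fold == f
        out[test] = d[np.ix_(test, ~test)].min(axis=1)
    return out.mean()

# Synthetic coordinates (assumed): 200 points in a 100 x 100 area.
rng = np.random.default_rng(3)
xy = rng.uniform(0, 100, size=(200, 2))

block_fold = spatial_block_folds(xy)           # 4 contiguous blocks
random_fold = rng.permutation(len(xy)) % 4     # 4 random folds for comparison

block_dist = mean_nearest_train_distance(xy, block_fold)
random_dist = mean_nearest_train_distance(xy, random_fold)
```

Comparing `block_dist` with `random_dist` makes the caveat concrete: under random folds the nearest training point is typically very close to each test point, whereas blocking pushes the training data away.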
Try it
- Defaults: N=40, split=30%. LOO and split-sample give similar MSE estimates with comparable SEs.
- Raise the hold-out fraction to 50%. The split-sample MSE estimate increases (training on 20 points instead of 28). LOO is unaffected: it always uses N-1=39 training points.
- Crank N to 100. LOO involves no random partition, so its MSE is fixed for a given sample; the split-sample MSE estimate stabilises (less variance across splits, more data in each split).
- Re-sample multiple times. LOO MSE is more stable across resamples than split-sample (which averages 20 random splits).
- The takeaway: for moderate N (50–500), LOO is unambiguously preferred for geostat applications. Split-sample is for cases where LOO is prohibitively expensive (e.g., N > 10⁵).
For a spatial-prediction model with N = 200 clustered samples, why might random 5-fold CV give an over-optimistic MSE estimate, and what spatial-CV alternative would you recommend?
What you now know
Jackknife = LOO-CV mathematically. Split-sample is a faster, biased alternative. k-fold CV interpolates between them. Spatial CV partitions into contiguous blocks to defeat the spatial-correlation cheat. Modern geostat best practice: LOO-CV for moderate N; spatial k-fold for spatial datasets where spatially-random CV would over-estimate model quality. §6.5 closes Part 6 with debiasing checks and conditional bias.
References
- Quenouille, M.H. (1956). "Notes on bias in estimation." Biometrika 43, 353–360. (Original jackknife.)
- Tukey, J.W. (1958). "Bias and confidence in not-quite large samples." Annals of Mathematical Statistics 29, 614. (Jackknife extension.)
- Roberts, D.R., et al. (2017). "Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure." Ecography 40, 913–929.
- Brenning, A. (2012). "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing." IEEE IGARSS.
- Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford.