Exploratory data analysis for spatial data
Learning objectives
- Apply the standard EDA toolkit (location map, histogram, Q-Q plot, boxplot) to spatial data
- Diagnose distribution shape (Normal, skewed, multi-modal, contaminated by outliers)
- Recognise spatial trend visually via bubble maps and moving-window profiles
- Decide when to log-transform or N-score-transform before variogram analysis
- Identify outliers and decide whether to remove, retain, or Winsorize
Before fitting any variogram or running kriging, carry out exploratory data analysis (EDA). EDA reveals distribution shape, spatial trends, outliers, suspicious clusters, and any other issue that would invalidate downstream geostatistics. This is the closing section of Part 0; everything from Part 1 onward assumes you have done EDA first.
The standard EDA toolkit
- Bubble / location map: scatter of sample locations with bubble size or colour proportional to value. Reveals spatial trends, clusters, and outliers (giant bubbles).
- Histogram: distribution shape (Normal? skewed? bimodal?). Watch for outliers as far-right or far-left bars.
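- Q-Q plot vs Normal: deviations from a straight reference line signal departures from Normality. With sample quantiles on the vertical axis, right-skewed data bend upward away from the line at the upper end; heavy tails give an S-shape (below the line at the lower end, above it at the upper end); outliers appear as isolated points off the line.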
- Boxplot: quartiles + whiskers. Outliers shown as separate dots.
- Moving-mean / moving-variance profile: along a transect or coordinate axis. Reveals trends and heteroscedasticity.
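Two of the tools above reduce to short computations, sketched here with only the Python standard library. The function names, the plotting position (i - 0.5)/n, and the window settings are illustrative assumptions, not a prescribed method; plotting the returned pairs is left to whichever graphics library you use.

```python
from statistics import NormalDist, mean, pvariance


def qq_points(values):
    """Return (theoretical, sample) quantile pairs for a Normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # standard Normal: mean 0, SD 1
    # Plotting position (i - 0.5) / n keeps probabilities strictly inside (0, 1).
    return [(std_normal.inv_cdf((i - 0.5) / n), z)
            for i, z in enumerate(ordered, start=1)]


def moving_window_profile(coords, values, width, step):
    """Mean and variance of values in windows sliding along one coordinate axis.

    Each window covers [centre - width/2, centre + width/2); windows holding
    fewer than two samples are skipped.
    """
    profile = []
    lo, hi = min(coords), max(coords)
    centre = lo + width / 2
    while centre - width / 2 <= hi:
        inside = [v for c, v in zip(coords, values)
                  if centre - width / 2 <= c < centre + width / 2]
        if len(inside) >= 2:
            profile.append((centre, mean(inside), pvariance(inside)))
        centre += step
    return profile


if __name__ == "__main__":
    # Hypothetical transect: values rise steadily with the x coordinate.
    xs = list(range(100))
    zs = [0.5 * x for x in xs]
    for centre, m, v in moving_window_profile(xs, zs, width=20, step=20):
        print(f"x = {centre:5.1f}  mean = {m:6.2f}  variance = {v:6.2f}")
```

A window mean that rises or falls steadily along the axis suggests a trend; a window variance that changes along the axis suggests heteroscedasticity.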
Decisions from EDA
Different EDA outcomes lead to different downstream decisions:
- Approximately Normal histogram + flat moving mean: proceed with standard kriging directly.
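- Strongly right-skewed (lognormal-like): log-transform OR Normal-score transform Z̃ = Φ⁻¹(F̂(Z)). The transformed variable is approximately Normal, so the experimental variogram is less dominated by a few high values and kriging behaves better; back-transform the results at the end.
- Visible trend in moving-mean profile: account for the trend before variogram analysis, either by computing the variogram on detrended residuals or by using a method that models the trend explicitly (universal kriging or kriging with external drift, KED).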
- Outliers detected: investigate them (data error or genuine extreme value?). Decide: remove, retain, Winsorize. ALWAYS document.
- Bimodal histogram: probably two facies / zones / regimes. Separate them and analyse each on its own rather than pooling.
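The Normal-score transform Z̃ = Φ⁻¹(F̂(Z)) mentioned above can be sketched as a rank-based mapping plus a lookup table for the back-transform. This is a minimal illustration, assuming F̂ is estimated by the plotting position (rank - 0.5)/n, ties are broken by input order, and back-transformed values outside the observed score range are clamped to the data extremes; production implementations usually add declustering weights and tail models, which are omitted here. All function names are hypothetical.

```python
from bisect import bisect_left
from statistics import NormalDist


def normal_scores(values):
    """Rank-based Normal-score transform: z -> Phi^{-1}(F_hat(z))."""
    n = len(values)
    std_normal = NormalDist()
    # Indices of the data in ascending order of value (stable for ties).
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Assumed estimate of the cumulative probability for this rank.
        scores[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return scores


def build_table(values):
    """Sorted (score, value) pairs used for the back-transform."""
    return sorted(zip(normal_scores(values), values))


def back_transform(score, table):
    """Map a Normal score back to data units by linear interpolation."""
    ys = [s for s, _ in table]
    zs = [z for _, z in table]
    if score <= ys[0]:
        return zs[0]   # clamp below the smallest observed score
    if score >= ys[-1]:
        return zs[-1]  # clamp above the largest observed score
    j = bisect_left(ys, score)  # ys[j-1] < score <= ys[j]
    t = (score - ys[j - 1]) / (ys[j] - ys[j - 1])
    return zs[j - 1] + t * (zs[j] - zs[j - 1])


if __name__ == "__main__":
    # Hypothetical right-skewed sample.
    z = [0.5, 1.2, 0.8, 15.0, 2.3, 0.9, 4.1]
    y = normal_scores(z)
    table = build_table(z)
    for zi, yi in zip(z, y):
        print(f"z = {zi:5.1f}  score = {yi:6.3f}  back = {back_transform(yi, table):5.1f}")
```

The transform preserves the rank order of the data, so the largest value still maps to the largest score; only the spacing between values changes.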
Outlier handling — the hardest call
An extreme observation could be:
- A DATA ERROR (typo, unit conversion, instrument fault). Trace it back; correct or remove.
- A GENUINE EXTREME VALUE (one of nature's right-tail draws). Keep it — your model needs to represent the population including its tail.
- A POINT FROM A DIFFERENT POPULATION (a different facies, a different zone). Either model it as a mixture or exclude it from the current zone's variogram.
Always investigate before deciding. Removing 'outliers' thoughtlessly is one of the most common sources of bias in applied geostatistics.
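As a screening aid for the investigation step, the sketch below flags candidates with Tukey's boxplot fences and shows one way to Winsorize. The multiplier k = 1.5 is the conventional boxplot choice; the 5th and 95th percentile limits and the function names are illustrative assumptions. A flag is a prompt to check the sample, not a decision to remove it, and for strongly skewed data the fences will often flag genuine tail values.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Indices of observations outside the fences (candidates to investigate)."""
    lo, hi = tukey_fences(values, k)
    return [i for i, v in enumerate(values) if v < lo or v > hi]


def winsorize(values, lower_pct=5, upper_pct=95):
    """Pull values beyond the chosen percentiles back to those percentiles."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 percentile cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


if __name__ == "__main__":
    # Hypothetical sample with one suspicious high value at index 9.
    grades = [1.1, 0.9, 1.4, 1.0, 1.3, 1.2, 0.8, 1.5, 1.0, 12.0]
    print("fences:", tukey_fences(grades))
    print("flagged indices:", flag_outliers(grades))
    print("winsorized:", winsorize(grades))
```

Winsorizing keeps the sample at its location, which matters for spatial data, while limiting its influence on the mean and on the variogram; whichever option you choose, record it alongside the original value.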
Try it
- Gaussian scenario. Bubble map shows orderly spatial trend (increases left-to-right and bottom-to-top). Histogram is symmetric. Q-Q plot near the diagonal. Skewness near 0. Recommended action: proceed.
- Skewed scenario. Bubble map still shows trend but with a few very-large bubbles. Histogram is right-skewed. Q-Q plot CURVES UP at the upper end. Skewness substantially > 1. Recommended: log-transform or N-score before variogram.
- Outlier scenario. Bubble map has ONE giant bubble at a single location. Histogram has a far-right bar. Q-Q plot has the topmost point WAY off the line. Outlier-present flag is YES. Recommended: investigate the outlier.
- Compare summary statistics across scenarios — note how MEAN and SD respond to outliers vs MEDIAN and IQR (which are robust).
- Increase the number of samples n. EDA diagnostics become more reliable as n grows. At n = 30, distinguishing skewed from outlier-contaminated is hard; at n = 300 it's easy.
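The comparison of summary statistics suggested above can also be reproduced outside the interactive scenarios. The sketch below uses a small hypothetical sample and replaces one value with an extreme one; the moment-based skewness formula (third central moment divided by the 1.5 power of the second) is one common definition among several.

```python
from statistics import mean, median, quantiles, stdev


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def summary(values):
    """Classical (mean, SD, skewness) and robust (median, IQR) summaries."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "skew": skewness(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


if __name__ == "__main__":
    # Hypothetical data: the second sample swaps one value for an extreme one.
    clean = [1.1, 0.9, 1.4, 1.0, 1.3, 1.2, 0.8, 1.5, 1.0, 1.2]
    contaminated = clean[:-1] + [12.0]
    for name, data in (("clean", clean), ("contaminated", contaminated)):
        stats = summary(data)
        print(name, {key: round(val, 3) for key, val in stats.items()})
```

In this example the mean, SD, and skewness all move sharply when the single extreme value is introduced, while the median and IQR barely change.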
Your data has skewness 2.5 (strongly right-skewed) AND one observation that is 6 SDs above the mean. Should you treat this as one problem (the outlier's contribution to skewness) or two (genuine skew + a separate outlier)? What single plot best disentangles them?
What you now know — and Part 0 closes for both books
EDA is the first step of every spatial analysis. Bubble maps, histograms, Q-Q plots, boxplots, and moving-window profiles give the key diagnostics. EDA outcomes lead to direct decisions: log-transform skewed data, detrend trended data, investigate outliers, separate zones. Part 0 of the Geostatistics textbook is now COMPLETE: spatial data fundamentals (§0.1), stationarity (§0.2), support and scale (§0.3), coordinate systems (§0.4), sampling biases (§0.5), and EDA (§0.6). Part 1 builds on these foundations with univariate statistics for geo-data.