Histograms, CDFs, and quantile-quantile plots

Part 1 — Univariate statistics for geo-data

Learning objectives

  • State the empirical CDF Fₙ(z) = (1/n) Σ 1{Zᵢ ≤ z} and explain why it is a lossless summary of the marginal distribution
  • Read a histogram, an empirical CDF, and a sorted-values plot as three views of the same data, and recognise what each one emphasises
  • Use a Q-Q plot to test whether a dataset is consistent with a reference distribution family (Normal, Lognormal, Exponential, Uniform)
  • Spot right-skewness, bimodality, and outliers from Q-Q-plot curvature in the appropriate tail
  • Place §1.1's marginal-only inspection in the wider Part-1 arc: it is necessary but ignores spatial arrangement (that comes back in Part 3 via the variogram)

Part 0 introduced the central distinction between values and locations. Part 1 looks at the values — and only the values — through three classical lenses: the histogram, the cumulative distribution function, and the quantile-quantile plot. The deal is honest. We will deliberately throw away the locations for the duration of this part, so we can ask sharp questions about the marginal distribution of the data: is it symmetric, skewed, bimodal, heavy-tailed, contaminated by outliers? In Part 3, when the variogram arrives, the spatial arrangement comes back. For now, the marginal is the target.

This first section of Part 1 builds the inspection toolkit the next four sections need. §1.2 (normal-score transform) turns any marginal into a standard normal; you cannot evaluate whether the transform succeeded without a Q-Q plot. §1.3 (why your sample histogram is biased) shows that the histogram of your collected samples is not the histogram of the underlying field, because samples cluster preferentially in high-grade zones; you need declustering. §1.4 (robust descriptive statistics) replaces mean and standard deviation with median and IQR when the data is heavy-tailed or contaminated — and §1.1 is where you spot heavy tails and contamination. So although §1.1 is gentle on the math, almost everything later cites it.

The empirical CDF is the data

Given a sample $Z_1, Z_2, \ldots, Z_n$ (forget the spatial coordinates for now), the empirical cumulative distribution function is

$$F_n(z) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Z_i \le z\}.$$

For any threshold $z$, $F_n(z)$ gives the fraction of your sample that lies at or below $z$. As $z$ sweeps from $-\infty$ to $+\infty$, $F_n(z)$ rises from 0 to 1, stepping up by $1/n$ at each observed value (or by $k/n$ where $k$ observations tie). It is a non-decreasing step function. Two observations are worth pausing on.
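The definition translates almost word for word into code. The following is a minimal sketch using only the Python standard library; the helper name `make_ecdf` and the porosity values are hypothetical, chosen purely for illustration.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return F_n as a callable: F_n(z) = (number of Z_i <= z) / n."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def ecdf(z):
        # bisect_right returns how many sorted values are <= z
        return bisect_right(ordered, z) / n

    return ecdf

# Hypothetical porosity fractions, for illustration only
porosity = [0.12, 0.18, 0.18, 0.21, 0.25, 0.31, 0.34, 0.52]
F_n = make_ecdf(porosity)

print(F_n(0.10))  # below every observation
print(F_n(0.18))  # at a tied value
print(F_n(0.60))  # above every observation
```

Because two observations tie at 0.18, the step there is 2/8 rather than 1/8, which is why the function rises from 0.125 just below that value to 0.375 at it.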

First, $F_n$ is sufficient: every piece of marginal information about the sample lives in it. The mean, the median, the variance, every quantile, every percentile, the histogram (over any bin grid), and every other statistic that does not depend on the order of the observations can be computed from $F_n$ alone. Throwing away anything else loses nothing about the marginal. The Glivenko–Cantelli theorem adds a separate guarantee about the population: if the $Z_i$ are independent draws from a true CDF $F$, then $\sup_z |F_n(z) - F(z)| \to 0$ almost surely as $n \to \infty$. The empirical CDF converges uniformly to the true CDF, so for large $n$ your data stands in for the population. The theorem itself gives no rate of approach; the Dvoretzky–Kiefer–Wolfowitz inequality, $P(\sup_z |F_n(z) - F(z)| > \varepsilon) \le 2e^{-2n\varepsilon^2}$, supplies one.
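The uniform convergence can be watched in a small simulation. This sketch assumes independent Uniform(0, 1) draws, so the true CDF is simply $F(z) = z$; the function name and the fixed seed are illustrative choices, not part of the course material.

```python
import random

def sup_distance_uniform(sample):
    """sup_z |F_n(z) - F(z)| for F = Uniform(0, 1), where F(z) = z on [0, 1]."""
    ordered = sorted(sample)
    n = len(ordered)
    worst = 0.0
    for i, z in enumerate(ordered, start=1):
        # F_n jumps from (i - 1)/n to i/n at z, so check both sides of the jump
        worst = max(worst, abs(i / n - z), abs((i - 1) / n - z))
    return worst

rng = random.Random(0)  # fixed seed so the run is repeatable
distances = {}
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    distances[n] = sup_distance_uniform(sample)
    print(n, round(distances[n], 4))
```

The largest gap between $F_n$ and $F$ should shrink roughly like $1/\sqrt{n}$, so the second printed distance is expected to be about ten times smaller than the first.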

Second, the quantile is just the inverse of $F_n$. The $p$-th quantile of the sample is the smallest $z$ with $F_n(z) \ge p$ (or, under the interpolating conventions that most software uses, a value interpolated between neighbouring sorted observations). The median is $F_n^{-1}(0.5)$. The interquartile range is $F_n^{-1}(0.75) - F_n^{-1}(0.25)$. The 95th percentile is $F_n^{-1}(0.95)$. Everything you care about is a level set of the empirical CDF.
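The "smallest $z$ with $F_n(z) \ge p$" rule amounts to picking the $\lceil np \rceil$-th sorted value. A minimal sketch follows; the function name `ecdf_quantile` and the data are hypothetical, and the interpolating median from the standard library is printed only for contrast.

```python
import math
from statistics import median

def ecdf_quantile(sample, p):
    """Smallest z with F_n(z) >= p, i.e. the ceil(n * p)-th sorted value."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    ordered = sorted(sample)
    k = math.ceil(len(ordered) * p)  # 1-based rank of the answer
    return ordered[k - 1]

# Hypothetical porosity fractions, for illustration only
porosity = [0.12, 0.18, 0.19, 0.21, 0.25, 0.31, 0.34, 0.52]

q25 = ecdf_quantile(porosity, 0.25)
q50 = ecdf_quantile(porosity, 0.50)
q75 = ecdf_quantile(porosity, 0.75)
print(q25, q50, q75, round(q75 - q25, 2))  # quartiles and the IQR
print(median(porosity))  # interpolating convention, for comparison
```

On these eight values the smallest-$z$ rule returns 0.21 for the median, while the interpolating convention returns the midpoint of 0.21 and 0.25. Both are legitimate readings of the same step function; the difference shrinks as $n$ grows.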

Three views of the same data

The histogram, the empirical CDF, and the sorted-values plot are three pictures of the same $F_n$. They emphasise different things.

  • Histogram. The values are bucketed into $k$ bins and the bin counts are plotted. Easy to read at a glance, terrible to read carefully: the shape depends visibly on the bin count and bin edges. Coarse bins hide multiple modes; fine bins look like noise. A good practice is to try a few bin counts (the widget below lets you slide from 6 to 30 and see the picture move) and trust shapes that survive the slide.
  • Empirical CDF. A step function with no tuning knobs: it is the data. Slope steeper where the data is dense, slope flatter where it is sparse. Reading quantiles is direct: draw a horizontal line at $p = 0.5$ and find the median where it crosses $F_n$. Reading shape takes more practice than reading a histogram, but the picture does not depend on a bin choice.
  • Sorted-values plot. Plot the $i$-th smallest value $Z_{(i)}$ on the y-axis against $i$ on the x-axis. This is just the CDF turned on its side: x and y axes are swapped, so the order-statistic plot is the inverse CDF. Why bother? Because this exact picture is what the Q-Q plot does next, just against a reference instead of a uniform rank axis.
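The bin-count sensitivity of the histogram, and the axis swap between the sorted-values plot and the empirical CDF, can both be seen on a tiny dataset. This is a sketch under stated assumptions: equal-width bins spanning the data range, a hypothetical helper `histogram_counts`, and made-up values with two clusters and one high value.

```python
def histogram_counts(sample, n_bins):
    """Counts in n_bins equal-width bins spanning [min(sample), max(sample)]."""
    lo, hi = min(sample), max(sample)
    if hi == lo:
        raise ValueError("sample needs at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for z in sample:
        # the maximum falls on the last edge, so clamp it into the last bin
        k = min(int((z - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts

# Hypothetical values: two clusters plus one high value
sample = [1.9, 2.0, 2.1, 2.2, 3.9, 4.0, 4.1, 4.2, 10.0]
n = len(sample)

coarse = histogram_counts(sample, 2)  # two bins merge both clusters
fine = histogram_counts(sample, 9)    # nine bins separate them
print(coarse)
print(fine)

# Sorted-values plot: (i, Z_(i)). Empirical CDF: (Z_(i), i/n). Same points, axes swapped.
ranked = list(enumerate(sorted(sample), start=1))
ecdf_points = [(z, i / n) for i, z in ranked]
```

With two bins the clusters collapse into a single bar; with nine they separate. The sorted-values pairs and the CDF pairs carry identical information and involve no bin choice at all.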

[Interactive figure: distribution viewer]

Try the three presets. The lognormal porosity dataset is the textbook reservoir-engineering case: most samples cluster between 0.15 and 0.30, but a small tail stretches above 0.5. The mean is pulled above the median by the tail, which is the canonical right-skew signature. The bimodal facies dataset has two clearly separated humps where most observations live, with the mean (computed as if the distribution were unimodal) sitting in the gap where almost no data exists. The heavy-tailed perm dataset hides five outliers near 80 mD that you have to hunt for in the histogram but that jump out as an isolated cluster at the top of the sorted-values plot.
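The same two signatures can be reproduced by hand. The numbers below are invented for illustration and are not the widget's presets; they only mimic a right-skewed sample and a two-hump sample.

```python
from statistics import fmean, median

# Hypothetical values: a right-skewed sample and a two-hump sample
right_skewed = [0.15, 0.17, 0.19, 0.20, 0.22, 0.24, 0.27, 0.30, 0.55]
bimodal = [0.08, 0.09, 0.10, 0.11, 0.12, 0.28, 0.29, 0.30, 0.31, 0.32]

for name, data in (("right-skewed", right_skewed), ("bimodal", bimodal)):
    print(name, round(fmean(data), 3), round(median(data), 3))

# Distance from the bimodal mean to the nearest actual observation
gap = min(abs(z - fmean(bimodal)) for z in bimodal)
print(round(gap, 3))
```

In the skewed sample the single high value pulls the mean above the median. In the two-hump sample the mean lands where no observation sits, so it describes neither population.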

The Q-Q plot: a rigorous shape diagnostic

The question "is this distribution roughly Normal?" matters constantly in geostatistics, because §1.2 (normal-score transform) and Part 7 (sequential Gaussian simulation) both lean on a normality assumption. The histogram is too dependent on bin count to answer the question reliably. The quantile-quantile plot answers it cleanly.

The recipe is short. Pick $n$ probability points $p_i = (i - 0.5)/n$ for $i = 1, \ldots, n$ (the so-called plotting positions; other conventions exist, but they all look the same for moderate $n$). For each $p_i$, compute the empirical quantile $\hat{q}_i = F_n^{-1}(p_i)$, which is just the $i$-th sorted value $Z_{(i)}$, and the theoretical quantile of the reference distribution $q_i^{\text{ref}} = F_{\text{ref}}^{-1}(p_i)$. Plot the pair $(q_i^{\text{ref}}, \hat{q}_i)$. If the data really is from the reference family (up to a location-scale transform), the points lie on a straight line. Deviations from straightness are diagnostic:

  • Straight line. Reference family fits. Slope and intercept tell you the location and scale.
  • Banana-curve, high tail above the line. The data has a heavier upper tail than the reference. Classic right-skew signature against a Normal reference.
  • Banana-curve, low tail below the line. Heavier lower tail than the reference.
  • S-shape. Both tails heavier than the reference (Student-t against Normal, for instance).
  • Inverted S-shape. Both tails lighter than the reference. Truncation.
  • Step or kink. Bimodality, censoring, or a mixed-population dataset.
  • Isolated points off the top or bottom of an otherwise-straight line. Outliers. The bulk fits, but a handful of values do not.
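The recipe above can be sketched in a few lines against a standard Normal reference. The sketch uses `statistics.NormalDist.inv_cdf` from the Python standard library for the reference quantiles; the function names, the synthetic Normal sample, and the use of a least-squares line with its correlation as a straightness score are illustrative assumptions, not the widget's implementation.

```python
import random
from statistics import NormalDist, fmean

def qq_points_normal(sample):
    """Return (reference quantile, sample quantile) pairs against N(0, 1)."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist()  # standard Normal: mean 0, sd 1
    # plotting positions p_i = (i - 0.5) / n, paired with the i-th sorted value
    return [(reference.inv_cdf((i - 0.5) / n), z)
            for i, z in enumerate(ordered, start=1)]

def fit_line(points):
    """Least-squares slope, intercept, and correlation of the Q-Q points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is repeatable
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]  # synthetic Normal data
points = qq_points_normal(sample)
slope, intercept, r = fit_line(points)
print(round(slope, 2), round(intercept, 2), round(r, 4))
```

For data drawn from a Normal with mean 10 and standard deviation 2, the fitted intercept should land near 10 and the slope near 2, which is the sense in which slope and intercept recover location and scale. A correlation close to 1 is a numerical stand-in for "the points lie on a straight line".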

[Interactive figure: Q-Q plot]

The widget tries the same three presets against four reference families. The lognormal-porosity dataset against a Normal reference shows the canonical banana: the high tail floats above the line. Switch to a Lognormal reference and the points snap onto the line. Now toggle log-transform: the dataset is internally replaced by $\log Z$, and the Q-Q plot against Normal now snaps straight. That last move is the seed of the normal-score transform that §1.2 generalises: pick any monotone transform that straightens the Q-Q plot, and you have a dataset whose marginal you can treat as Gaussian for downstream geostatistics. The heavy-tailed perm dataset shows the outlier-cluster behaviour: a straight line for the bulk, then a sudden departure at the top where the fracture-corridor cluster sits.
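The straightening effect of the log transform can be checked numerically. The sketch below draws a synthetic lognormal sample (not the widget's preset) and scores Q-Q straightness against a Normal reference by the correlation of the Q-Q points, which is an illustrative choice of score rather than anything the widget is known to compute.

```python
import math
import random
from statistics import NormalDist, fmean

def qq_correlation_normal(sample):
    """Correlation between sorted values and standard Normal quantiles."""
    ordered = sorted(sample)
    n = len(ordered)
    reference = NormalDist()
    xs = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    x_bar, y_bar = fmean(xs), fmean(ordered)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ordered)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ordered))
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)  # fixed seed so the sketch is repeatable
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(300)]  # synthetic right-skewed data
logged = [math.log(z) for z in raw]  # the log-transform toggle

r_raw = qq_correlation_normal(raw)
r_log = qq_correlation_normal(logged)
print(round(r_raw, 3), round(r_log, 3))
```

The raw values give a visibly curved Q-Q plot and a correlation well below 1; after the log transform the correlation should sit very close to 1.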

Why earth-science variables are usually right-skewed

It is not an accident. Porosity, permeability, ore grade, hydraulic conductivity, contaminant concentration, fracture density, and many other earth-science variables are products of many small multiplicative effects — depositional energy times grain-size effect times diagenetic dissolution times compaction, and so on. The central limit theorem applied to logs of independent contributions then implies that the variable itself is approximately lognormal: its log is Normal, but the variable is right-skewed in the original units. This is a physical regularity, not a mathematical convenience. It is why your downstream geostatistics will almost always need a normal-score transform (§1.2) before you can lean on Gaussian assumptions.
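The multiplicative argument is easy to simulate. In this sketch each synthetic value is the product of 20 independent positive factors; the factor distribution, the counts, and the seed are arbitrary assumptions made only to illustrate the mechanism, not a model of any real rock property.

```python
import math
import random
import statistics

def skewness(values):
    """Moment-based sample skewness (third standardised moment)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples, n_factors = 5000, 20

products = []
for _ in range(n_samples):
    value = 1.0
    for _ in range(n_factors):
        # one small multiplicative effect, between about 0.74 and 1.35
        value *= math.exp(rng.uniform(-0.3, 0.3))
    products.append(value)

logs = [math.log(v) for v in products]
print(round(skewness(products), 2))  # strongly positive: right-skewed
print(round(skewness(logs), 2))      # near zero: roughly symmetric
```

The log of each product is a sum of 20 independent terms, so the central limit theorem pushes it towards a Normal shape, while the product itself keeps a long right tail.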

A single outlier can wreck your variogram. Part 3 builds the variogram, the spatial autocorrelation function that drives kriging and simulation. The variogram is computed as $\hat{\gamma}(h) = \tfrac{1}{2N(h)} \sum_{(i,j) \in N(h)} (z_i - z_j)^2$: half the average squared difference between value pairs separated by lag $h$, where $N(h)$ denotes the set of such pairs in the sum and their count in the prefactor. Squared differences. A single outlier paired with normal-range neighbours produces enormous $(z_i - z_j)^2$ values and inflates the variogram at every lag the outlier participates in. The Q-Q plot is how you find that outlier before it ruins the variogram. Spotting outliers in §1.1 is cheap; finding out they have wrecked your Part 5 kriging is expensive.
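A small numerical check makes the inflation concrete. The sketch assumes equally spaced one-dimensional data so that a lag is just an index offset; the function name and the ten values are hypothetical, and one value is then replaced by an outlier.

```python
def semivariogram_1d(values, lag):
    """Experimental semivariogram at an integer lag for equally spaced 1-D data."""
    pairs = [(values[i], values[i + lag]) for i in range(len(values) - lag)]
    # half the average squared difference over all pairs at this lag
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Hypothetical well-behaved series
clean = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 1.0, 1.2]

# Same series with one value replaced by an outlier
contaminated = list(clean)
contaminated[5] = 50.0

for lag in (1, 2, 3):
    g_clean = semivariogram_1d(clean, lag)
    g_out = semivariogram_1d(contaminated, lag)
    print(lag, round(g_clean, 4), round(g_out, 1))
```

One bad value out of ten raises the estimate by several thousand times at each of the three lags, because the outlier enters two squared differences at every lag.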

What §1.1 deliberately ignores

This section throws away the spatial coordinates. That is on purpose — the marginal-only view is exactly what you need to choose a transform and to spot outliers, and it would be confused by the spatial signal. But the marginal view is fundamentally limited:

  • Two datasets with identical marginal distributions can have wildly different spatial structures. §0.1's spatial-vs-aspatial widget made this point: the histogram is the same; everything spatial is invisible.
  • Two datasets with identical spatial structures can have wildly different marginal distributions. Two reservoirs with the same variogram and the same kriging weights but different histograms tell very different economic stories.
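Both bullets can be demonstrated on a toy transect. The sketch assumes twelve equally spaced locations and a lag-1 semivariogram as the only spatial summary; all three arrangements and the helper name are hypothetical.

```python
def semivariogram_lag1(values):
    """Half the average squared difference between neighbouring locations."""
    diffs = [(values[i + 1] - values[i]) ** 2 for i in range(len(values) - 1)]
    return sum(diffs) / (2 * len(diffs))

# Same twelve values, two spatial arrangements
smooth = [float(v) for v in range(1, 13)]  # steady trend along the transect
scattered = [1.0, 12.0, 2.0, 11.0, 3.0, 10.0, 4.0, 9.0, 5.0, 8.0, 6.0, 7.0]

# Same arrangement as `smooth`, different values (shifted by a constant)
shifted = [v + 100.0 for v in smooth]

print(sorted(smooth) == sorted(scattered))  # identical marginal
print(semivariogram_lag1(smooth), semivariogram_lag1(scattered))
print(semivariogram_lag1(smooth) == semivariogram_lag1(shifted))  # identical structure
```

The first pair has the same sorted values, hence the same empirical CDF and histogram, yet very different lag-1 semivariograms. The second pair has the same semivariogram but marginals that do not overlap at all.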

Both kinds of information are needed. §1.1 to §1.5 handle the marginal; Part 3 onward handles the spatial. Neither alone is enough.

Try it

  • In the distribution viewer, set the preset to Porosity (lognormal). Slide the bin count from 6 to 30. At what bin count does the right-skewed tail first become obvious? At what bin count does the picture start to look like noise? What does this tell you about reading a histogram?
  • Still in the distribution viewer, switch to the Bimodal facies preset. Locate where the mean (red dashed line) and median (blue dashed line) sit. They should not coincide with either mode. Why is "report the mean" the wrong summary for a bimodal dataset, and what would you report instead?
  • In the Q-Q plot widget, set preset = Porosity (lognormal), reference = Normal. Describe the shape of the deviation from the red reference line. Now switch reference to Lognormal. What changed? Now toggle log-transform and switch reference back to Normal. What does that tell you about the normal-score transform you will meet in §1.2?
  • Switch to the Perm + outliers preset (still in the Q-Q widget) against a Lognormal reference. The bulk should be straight; the top few points lift off the line. Roughly how many of the 200 points are outliers? What numerical diagnostic on the right panel grows large because of them?
  • Without coding: a colleague has 500 measurements of a contaminant concentration and reports the sample mean and a 95% confidence interval using $\sigma / \sqrt{n}$. The histogram is dramatically right-skewed and one observation is 50× the median. What does the Q-Q plot against a Lognormal reference probably look like, and why is the reported confidence interval untrustworthy even if you trust the IID assumption?

Pause and reflect: the Q-Q plot tells you whether a reference family fits. It does not tell you whether the choice of reference matters for your downstream analysis. For which geostatistical operations does the marginal shape actually affect the answer, and for which does it not? (Hint: kriging is unbiased under very weak assumptions; sequential Gaussian simulation lives or dies by a Gaussian marginal.)

What you now know

You have the empirical CDF $F_n(z)$ as the lossless representation of the marginal distribution, the Glivenko–Cantelli convergence result that justifies treating it as the population CDF in the limit, and the three views (histogram, CDF, sorted values) that the same $F_n$ produces. You have the Q-Q plot recipe, the standard library of curvatures and what each one means, and the geostatistical-specific reasons (right-skew of porosity and permeability; outliers wrecking variograms) that the Q-Q plot is the diagnostic of choice. You have an honest scope warning: §1.1 ignores the spatial arrangement on purpose, and Part 3 is where it returns.

§1.2 turns the Q-Q-plot diagnostic into an operation: pick a monotone transform that makes the Q-Q plot against Normal straight — the normal-score transform — and the rest of geostatistics gets a Gaussian-marginal dataset to work with. §1.3 takes a hard turn and explains why the histogram you computed in §1.1 is, in general, biased: samples in earth-science datasets are not laid down uniformly across the study area, they cluster in interesting zones, and the resulting histogram over-represents those zones; declustering (Part 2) is the fix. §1.4 hardens the descriptive-statistics machinery against the outliers you spotted in §1.1, and §1.5 moves from global summaries to local (block-by-block) ones that downstream kriging needs.

References

  • Wilk, M.B., Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55(1), 1–17. (The original Q-Q plot paper. Highly readable and still the definitive reference.)
  • Isaaks, E.H., Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press. (Chapter 2, "Univariate Description" — the classical geostatistical treatment of histograms, CDFs, quantiles, and Q-Q plots, with the right-skew arguments worked through on real datasets.)
  • Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press. (Chapter 2, exploratory data analysis for geostatistical purposes — including outlier diagnosis with Q-Q plots and its downstream consequences.)
  • Deutsch, C.V., Journel, A.G. (1998). GSLIB: Geostatistical Software Library and User's Guide (2nd ed.). Oxford University Press. (The GSLIB histplt, probplt, and nscore programs are the reference implementations of the operations in this section.)
  • Cleveland, W.S. (1985). The Elements of Graphing Data. Wadsworth. (Chapter 3 on quantile plots; remains the cleanest visual-graphics treatment of why Q-Q plots beat histograms for shape diagnosis.)
