Background Statistics

Chapter 2: Statistical Foundations for ML

Learning objectives

  • Compute descriptive statistics: mean, median, mode, variance, and standard deviation
  • Interpret percentiles, quartiles, and box plots
  • Distinguish populations from samples and understand hypothesis testing
  • Apply the Central Limit Theorem
  • Compute correlation and covariance and interpret their geoscience meaning

Why Statistics Matters in Geoscience

Every geoscience dataset—whether it is a suite of well-log measurements, a grid of seismic amplitudes, or a collection of grain-size analyses—contains variability. Statistics gives us the language and tools to summarise that variability, quantify uncertainty, and make defensible decisions from data.

1. Descriptive Statistics

Descriptive statistics condense a dataset into a few key numbers.

Measures of Central Tendency

Mean (arithmetic average):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The mean is the "balance point" of the data. For well-log porosity values of 12%, 15%, 18%, 14%, and 16%, the mean is $\bar{x} = \frac{12+15+18+14+16}{5} = 15\%$.

Median: The middle value when data are sorted. If $n$ is even, the median is the average of the two middle values. The median is robust to outliers: a single anomalously high permeability reading will not pull it far from the bulk of the data.

Mode: The most frequently occurring value (or the peak of a histogram). In grain-size analysis, the mode tells us the dominant grain size in a sediment sample.
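The three measures above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the porosity values come from the worked example, while the grain-size classes and the anomalous reading of 60% are hypothetical values chosen purely for illustration.

```python
import statistics

# Porosity values (%) from the worked example in the text.
porosity = [12, 15, 18, 14, 16]

mean_porosity = statistics.mean(porosity)      # (12 + 15 + 18 + 14 + 16) / 5
median_porosity = statistics.median(porosity)  # middle of the sorted values

# Hypothetical grain-size classes, used only to illustrate the mode.
grain_classes = ["fine", "medium", "medium", "coarse", "medium", "fine"]
dominant_class = statistics.mode(grain_classes)

# Robustness check: replace one reading with a hypothetical anomalous value.
with_outlier = [12, 15, 18, 14, 60]
mean_outlier = statistics.mean(with_outlier)      # pulled upwards by the outlier
median_outlier = statistics.median(with_outlier)  # unchanged by the outlier

print(mean_porosity, median_porosity, dominant_class)
print(mean_outlier, median_outlier)
```

The single anomalous reading moves the mean from 15 to 23.8 while the median stays at 15, which is the robustness property described above.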

Measures of Spread

Variance measures how far values deviate from the mean. The sample variance uses $n-1$ (Bessel's correction) to give an unbiased estimate:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Standard deviation is the square root of the variance: $s = \sqrt{s^2}$. It is in the same units as the data, making it easier to interpret than the variance.

Range: $\text{Range} = x_{\max} - x_{\min}$. Simple but sensitive to outliers.

Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. The IQR captures the spread of the central 50% of the data and is robust to outliers.
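As a rough illustration, the spread measures can be computed for the same five porosity values with Python's standard `statistics` module. Quartile conventions differ between software packages; the `method="inclusive"` setting used here is one choice among several and is an assumption of this sketch, not something the chapter prescribes.

```python
import statistics

porosity = [12, 15, 18, 14, 16]  # % values from the mean example

sample_var = statistics.variance(porosity)   # divides by n - 1 (Bessel's correction)
sample_std = statistics.stdev(porosity)      # square root of the sample variance
pop_var = statistics.pvariance(porosity)     # divides by n, shown for comparison
data_range = max(porosity) - min(porosity)

# quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(porosity, n=4, method="inclusive")
iqr = q3 - q1

print(sample_var, round(sample_std, 3), pop_var, data_range, iqr)
```

The squared deviations sum to 20, so the sample variance is 20/4 = 5 while dividing by $n$ would give 4; the gap between the two shrinks as $n$ grows.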

Percentiles, Quartiles, and Box Plots

The $k$-th percentile $P_k$ is the value below which $k\%$ of the data fall. Special percentiles:

  • $Q_1 = P_{25}$: first quartile
  • $Q_2 = P_{50}$: median (second quartile)
  • $Q_3 = P_{75}$: third quartile

A box plot (box-and-whisker diagram) displays the five-number summary: minimum, $Q_1$, median, $Q_3$, maximum. Points more than $1.5 \times \text{IQR}$ below $Q_1$ or above $Q_3$ are plotted individually as potential outliers, in which case the whiskers stop at the most extreme values inside those limits. Box plots are excellent for comparing porosity distributions across different geological formations.
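The quantities a box plot draws can be computed directly. The sketch below uses hypothetical porosity readings with one anomalously high value, and again assumes the "inclusive" quartile convention; a plotting library may use a slightly different rule.

```python
import statistics

# Hypothetical porosity readings (%), including one anomalously high value.
values = sorted([8, 11, 12, 13, 14, 15, 15, 16, 17, 18, 31])

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles separate potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
inliers = [v for v in values if lower_fence <= v <= upper_fence]

# Whiskers extend to the most extreme values that are not outliers.
whisker_low, whisker_high = min(inliers), max(inliers)

print(q1, median, q3, iqr)
print(outliers, whisker_low, whisker_high)
```

Here the fences fall at 6.5 and 22.5, so only the reading of 31 is flagged and the upper whisker ends at 18.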

2. Inferential Statistics

Populations vs. Samples

A population is the complete set of items we wish to study (e.g., all porosity values in a reservoir). A sample is a subset drawn from the population (e.g., the 200 core-plug measurements we actually made). We use sample statistics ($\bar{x}$, $s$) to estimate population parameters ($\mu$, $\sigma$).

Hypothesis Testing

Hypothesis testing lets us make decisions about populations from samples. The procedure is:

  • State hypotheses: Null hypothesis $H_0$ (no effect / no difference) vs. alternative $H_1$.
  • Choose significance level $\alpha$ (commonly 0.05).
  • Compute a test statistic and its p-value.
  • Decide: If $p < \alpha$, reject $H_0$ in favour of $H_1$.

The p-value is the probability of observing data as extreme as (or more extreme than) our sample, assuming $H_0$ is true. A small p-value means the data are unlikely under $H_0$.

Confidence Intervals

A $(1-\alpha)\times 100\%$ confidence interval for the population mean is:

$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (known $\sigma$) or $\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$ (unknown $\sigma$)

A 95% CI means: if we repeated sampling many times, 95% of the calculated intervals would contain the true population mean.
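To make the known-$\sigma$ formula concrete, the sketch below builds a 95% interval for the five porosity values used earlier. The value $\sigma = 2$ is a hypothetical assumption made only so the $z$-based formula applies; with unknown $\sigma$ the $t$ critical value would be needed instead, which the Python standard library does not provide.

```python
import math
import statistics

porosity = [12, 15, 18, 14, 16]   # % values from the mean example
sigma = 2.0                       # assumed known population std dev (hypothetical)
alpha = 0.05                      # gives a 95% confidence interval

n = len(porosity)
x_bar = statistics.mean(porosity)

# z_{alpha/2}: the standard normal quantile leaving alpha/2 in the upper tail.
z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)

half_width = z_crit * sigma / math.sqrt(n)
ci_low, ci_high = x_bar - half_width, x_bar + half_width

print(round(z_crit, 3), round(ci_low, 2), round(ci_high, 2))
```

Because the half-width scales with $1/\sqrt{n}$, quadrupling the number of measurements halves the width of the interval.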

The t-Test

The one-sample t-test checks whether a sample mean differs significantly from a hypothesised value $\mu_0$:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

The two-sample t-test compares means of two independent groups. For example: "Is the mean porosity of Formation A significantly different from Formation B?"
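The one-sample statistic is simple enough to compute by hand. The following sketch evaluates it for the five porosity values against a hypothetical $\mu_0 = 13\%$; the helper name `one_sample_t` is illustrative only. Converting $t$ to a p-value requires the Student t distribution, which is outside the standard library (SciPy's `scipy.stats.ttest_1samp` is one option).

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: mean == mu0."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)              # sample std dev (n - 1 denominator)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1

porosity = [12, 15, 18, 14, 16]               # % values from the mean example
t_stat, dof = one_sample_t(porosity, mu0=13)  # mu0 = 13 is a hypothetical value

print(round(t_stat, 3), dof)
```

With 4 degrees of freedom the two-sided critical value at $\alpha = 0.05$ is about 2.776, so $t = 2.0$ would not lead us to reject $H_0$ for this small sample.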

3. Probability Distributions

Normal (Gaussian) Distribution

The most important distribution in statistics, characterised by two parameters: mean $\mu$ and standard deviation $\sigma$.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Properties: bell-shaped, symmetric about $\mu$. About 68% of values lie within $\mu \pm \sigma$, about 95% within $\mu \pm 2\sigma$, and about 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule).

The standard normal has $\mu = 0$, $\sigma = 1$. Any normal variable can be standardised via the z-score:

$$z = \frac{x - \mu}{\sigma}$$
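The 68-95-99.7 rule and the z-score can both be verified numerically. This is a small sketch using `statistics.NormalDist` from the standard library; the formation parameters ($\mu = 15\%$, $\sigma = 2\%$) and the helper names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """Probability that a normal variable lies within k std devs of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1, within_2, within_3 = (prob_within(k) for k in (1, 2, 3))

def z_score(x, mu, sigma):
    """Standardise x relative to a normal distribution N(mu, sigma^2)."""
    return (x - mu) / sigma

# Hypothetical example: an 18% porosity reading where mu = 15% and sigma = 2%.
z = z_score(18.0, mu=15.0, sigma=2.0)

print(round(within_1, 4), round(within_2, 4), round(within_3, 4), z)
```

The exact probabilities are roughly 0.6827, 0.9545, and 0.9973, which is where the rounded rule comes from.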

Uniform Distribution

Every value in the interval $[a, b]$ is equally likely: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$, and $f(x) = 0$ otherwise. Mean: $\frac{a+b}{2}$, variance: $\frac{(b-a)^2}{12}$. Useful as a prior in Bayesian analysis when we have no preference among values.

Binomial Distribution

Models the number of successes in $n$ independent trials, each with success probability $p$:

$$P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$$

Mean: $np$, variance: $np(1-p)$. Example: if a well has a 30% chance of being economic ($p = 0.3$), the probability that exactly 3 of 10 wells ($n = 10$, $k = 3$) succeed is $\binom{10}{3}(0.3)^3(0.7)^7 \approx 0.267$.
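The well example can be reproduced with a direct implementation of the probability mass function. This is a minimal sketch; `binomial_pmf` is an illustrative helper name, not a library function.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_wells, p_economic = 10, 0.3

# Probability that exactly 3 of 10 wells are economic.
p_three = binomial_pmf(3, n_wells, p_economic)

# Sanity checks: probabilities sum to 1 and the mean equals n * p.
total = sum(binomial_pmf(k, n_wells, p_economic) for k in range(n_wells + 1))
mean = sum(k * binomial_pmf(k, n_wells, p_economic) for k in range(n_wells + 1))

print(round(p_three, 3), round(total, 6), round(mean, 6))
```

Summing the probability mass function over a range of $k$ gives cumulative probabilities, for example the chance that at least 3 wells succeed.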

4. Central Limit Theorem (CLT)

Statement

If we draw random samples of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$, then as $n$ increases, the distribution of the sample mean $\bar{X}$ approaches a normal distribution:

$$\bar{X} \;\overset{\text{approx.}}{\sim}\; N\left(\mu,\; \frac{\sigma^2}{n}\right)$$

As a rule of thumb, $n \geq 30$ is often large enough for the approximation to be good, although strongly skewed or heavy-tailed populations may need larger samples. This is why the normal distribution is so central to hypothesis testing: even if individual well-log readings are not normally distributed, the average of many readings will be approximately normal.
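A quick simulation makes the theorem tangible. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1), chosen here only because it is strongly right-skewed, and checks that the means of samples of size 30 cluster around $\mu$ with spread close to $\sigma/\sqrt{n}$. The seed and the number of repetitions are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential population (mean 1, std dev 1)."""
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

n = 30
means = [sample_mean(n) for _ in range(5000)]

mean_of_means = statistics.mean(means)   # should be close to mu = 1
std_of_means = statistics.stdev(means)   # should be close to sigma / sqrt(n)
expected_std = 1.0 / n ** 0.5

print(round(mean_of_means, 3), round(std_of_means, 3), round(expected_std, 3))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the individual draws are far from normal.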

5. Correlation and Covariance

Covariance

Covariance measures the joint variability of two variables:

$$\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Positive covariance: as $X$ increases, $Y$ tends to increase. Negative: they move in opposite directions.

Pearson Correlation Coefficient

Normalised covariance, bounded between $-1$ and $+1$:

$$r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}}$$

$r = +1$: perfect positive linear relationship. $r = -1$: perfect negative. $r = 0$: no linear relationship (but there could still be a nonlinear one!).

Geoscience example: Porosity and permeability in sandstones often show a positive correlation ($r \approx 0.6$ to $0.9$). Porosity and depth of burial tend to show a negative correlation.
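Both formulas translate directly into code. The sketch below implements them from the definitions; all data values are hypothetical, and using log10 permeability rather than raw permeability is an assumption of this example, not a claim from the text.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical core-plug data.
porosity = [10, 12, 14, 16, 18]          # %
log_perm = [0.5, 1.1, 1.4, 2.1, 2.4]     # log10 of permeability in mD
depth_km = [3.0, 2.6, 2.1, 1.5, 1.0]     # burial depth

cov_poro_perm = covariance(porosity, log_perm)
r_poro_perm = pearson_r(porosity, log_perm)
r_poro_depth = pearson_r(porosity, depth_km)

print(round(cov_poro_perm, 3), round(r_poro_perm, 3), round(r_poro_depth, 3))
```

Note that `covariance(x, x)` is just the sample variance of `x`, which is why the correlation of any variable with itself is exactly 1.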

[Refs: Wackerly et al., Mathematical Statistics; Davis, Statistics and Data Analysis in Geology]

