Background Statistics

Chapter 2: Statistical Foundations for ML

Learning objectives

Compute descriptive statistics: mean, median, mode, variance, and standard deviation
Interpret percentiles, quartiles, and box plots
Distinguish populations from samples and understand hypothesis testing
Apply the Central Limit Theorem
Compute correlation and covariance and interpret their geoscience meaning

Why Statistics Matters in Geoscience

Every geoscience dataset—whether it is a suite of well-log measurements, a grid of seismic amplitudes, or a collection of grain-size analyses—contains variability. Statistics gives us the language and tools to summarise that variability, quantify uncertainty, and make defensible decisions from data.

1. Descriptive Statistics

Descriptive statistics condense a dataset into a few key numbers.

Measures of Central Tendency

Mean (arithmetic average):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$$

The mean is the "balance point" of the data. For well-log porosity values of 12%, 15%, 18%, 14%, 16%, the mean is $\bar{x} = \frac{12+15+18+14+16}{5} = 15\%$.

Median: The middle value when data are sorted. If $n$ is even, the median is the average of the two middle values. The median is robust to outliers—a single anomalously high permeability reading will not pull it far from the bulk of the data.

Mode: The most frequently occurring value (or the peak of a histogram). In grain-size analysis, the mode tells us the dominant grain size in a sediment sample.
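The three measures above can be checked with a short sketch that uses only Python's standard `statistics` module. The porosity values are the ones from the worked example; the grain-size classes and the anomalous reading of 80% are hypothetical, added only to illustrate the mode and the robustness of the median.

```python
import statistics

# Porosity values (%) from the worked example in the text.
porosity = [12, 15, 18, 14, 16]

mean_porosity = statistics.mean(porosity)      # sum of values divided by n
median_porosity = statistics.median(porosity)  # middle value of the sorted data

# Hypothetical grain-size classes, used only to illustrate the mode.
grain_classes = ["fine", "medium", "medium", "coarse", "medium"]
dominant_class = statistics.mode(grain_classes)

# Hypothetical robustness check: swap the 18% reading for an anomalous 80%.
with_outlier = [12, 15, 80, 14, 16]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print("mean:", mean_porosity, "median:", median_porosity, "mode:", dominant_class)
print("with outlier -> mean:", mean_with_outlier, "median:", median_with_outlier)
```

In this sketch the anomalous reading moves the mean well away from 15 while the median stays where it was, which is the robustness property described above.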

Measures of Spread

Variance measures how far values deviate from the mean. The sample variance uses $n-1$ (Bessel's correction) to give an unbiased estimate:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Standard deviation is the square root of the variance: $s = \sqrt{s^2}$. It is in the same units as the data, making it easier to interpret than the variance.

Range: $\text{Range} = x_{\max} - x_{\min}$. Simple but sensitive to outliers.

Interquartile Range (IQR): $\text{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. The IQR captures the spread of the central 50% of the data and is robust to outliers.
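As a rough illustration, the spread measures can be computed for the same five porosity values used earlier. This is a sketch using the standard `statistics` module. The choice of the "inclusive" quantile method (linear interpolation between order statistics) is an assumption, and other quartile conventions can give slightly different values on small samples.

```python
import math
import statistics

porosity = [12, 15, 18, 14, 16]  # % values from the earlier worked example

# statistics.variance divides by n - 1 (Bessel's correction).
sample_variance = statistics.variance(porosity)
sample_std = math.sqrt(sample_variance)
data_range = max(porosity) - min(porosity)

# Quartiles by linear interpolation ("inclusive" method, an assumed convention).
q1, q2, q3 = statistics.quantiles(porosity, n=4, method="inclusive")
iqr = q3 - q1

print("variance:", sample_variance)
print("std dev:", round(sample_std, 3))
print("range:", data_range, "IQR:", iqr)
```

Here the squared deviations from the mean of 15 are 9, 0, 9, 1 and 1, which sum to 20, so $s^2 = 20/4 = 5$ and $s = \sqrt{5} \approx 2.24$ porosity units.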

Percentiles, Quartiles, and Box Plots

The $k$-th percentile $P_k$ is the value below which $k\%$ of the data fall. Special percentiles:

$Q_1 = P_{25}$ — first quartile
$Q_2 = P_{50}$ — median (second quartile)
$Q_3 = P_{75}$ — third quartile

A box plot (box-and-whisker diagram) displays the five-number summary: minimum, $Q_1$, median, $Q_3$, maximum. Points lying more than $1.5 \times \text{IQR}$ below $Q_1$ or above $Q_3$ are plotted individually as potential outliers, and the whiskers then stop at the most extreme values inside those limits. Box plots are excellent for comparing porosity distributions across different geological formations.
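The numbers behind a box plot can be sketched without any plotting library. The readings below are hypothetical, with one deliberately high value. The names `lower_fence` and `upper_fence` are illustrative, and the quartile method is again the assumed "inclusive" convention.

```python
import statistics

# Hypothetical porosity readings (%), with one suspiciously high value.
readings = [12, 14, 15, 16, 18, 35]

q1, median, q3 = statistics.quantiles(readings, n=4, method="inclusive")
iqr = q3 - q1

# Outlier limits: 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in readings if x < lower_fence or x > upper_fence]
inliers = [x for x in readings if lower_fence <= x <= upper_fence]

# Whiskers extend to the most extreme readings that are not flagged.
whisker_low, whisker_high = min(inliers), max(inliers)

print("Q1, median, Q3:", q1, median, q3)
print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high, "outliers:", outliers)
```

With these values only the 35% reading falls outside the limits, so it would be drawn as an individual point.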

2. Inferential Statistics

Populations vs. Samples

A population is the complete set of items we wish to study (e.g., all porosity values in a reservoir). A sample is a subset drawn from the population (e.g., the 200 core-plug measurements we actually made). We use sample statistics ($\bar{x}$, $s$) to estimate population parameters ($\mu$, $\sigma$).

Hypothesis Testing

Hypothesis testing lets us make decisions about populations from samples. The procedure is:

State hypotheses: null hypothesis $H_0$ (no effect / no difference) vs. alternative $H_1$.
Choose significance level $\alpha$ (commonly 0.05).
Compute a test statistic and its p-value.
Decide: if $p < \alpha$, reject $H_0$ in favour of $H_1$; otherwise, fail to reject $H_0$.

The p-value is the probability of observing data as extreme as (or more extreme than) our sample, assuming $H_0$ is true. A small p-value means the data are unlikely under $H_0$; it is not the probability that $H_0$ itself is true.

Confidence Intervals

A $(1-\alpha)\times 100\%$ confidence interval for the population mean is:

$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ (known $\sigma$) or $\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$ (unknown $\sigma$)

A 95% CI means: if we repeated sampling many times, 95% of the calculated intervals would contain the true population mean.
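Both interval formulas can be tried on the five porosity values from Section 1. This is a minimal sketch. The critical value 2.776 is the standard tabulated two-sided $t$ value for $\alpha = 0.05$ with 4 degrees of freedom, and `sigma_known = 2.0` is an assumed value used only to show the known-$\sigma$ case.

```python
import math
import statistics

porosity = [12, 15, 18, 14, 16]  # % values from the Section 1 example
n = len(porosity)
mean = statistics.mean(porosity)
s = statistics.stdev(porosity)          # sample standard deviation (n - 1)
standard_error = s / math.sqrt(n)

# Unknown sigma: t interval. 2.776 is the tabulated two-sided critical
# value for alpha = 0.05 and n - 1 = 4 degrees of freedom.
t_crit = 2.776
ci_t = (mean - t_crit * standard_error, mean + t_crit * standard_error)

# Known sigma: z interval. sigma_known is an assumed value for illustration.
sigma_known = 2.0
z_crit = statistics.NormalDist().inv_cdf(1 - 0.05 / 2)  # about 1.96
half_width = z_crit * sigma_known / math.sqrt(n)
ci_z = (mean - half_width, mean + half_width)

print("95% t interval:", tuple(round(v, 3) for v in ci_t))
print("95% z interval:", tuple(round(v, 3) for v in ci_z))
```

Because $s = \sqrt{5}$ and $n = 5$, the standard error here is 1, so the $t$ interval is simply $15 \pm 2.776$. With so few measurements the $t$ interval is noticeably wider than a $z$ interval with a similar standard deviation would be.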

The t-Test

The one-sample t-test checks whether a sample mean differs significantly from a hypothesised value $\mu_0$:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

The two-sample t-test compares means of two independent groups. For example: "Is the mean porosity of Formation A significantly different from Formation B?"
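The one-sample case can be sketched with the standard library alone, following the four-step procedure from the hypothesis-testing section. The hypothesised mean `mu_0 = 12.0` is an assumption chosen for illustration. Comparing $|t|$ with the tabulated critical value 2.776 (two-sided, $\alpha = 0.05$, 4 degrees of freedom) is equivalent to checking $p < \alpha$ for a two-sided test.

```python
import math
import statistics

porosity = [12, 15, 18, 14, 16]  # sample (%), from the Section 1 example
mu_0 = 12.0                      # hypothesised mean (assumed for illustration)
alpha = 0.05                     # significance level

# Test statistic: t = (x_bar - mu_0) / (s / sqrt(n))
n = len(porosity)
x_bar = statistics.mean(porosity)
s = statistics.stdev(porosity)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Tabulated two-sided critical value for alpha = 0.05 and df = n - 1 = 4.
t_crit = 2.776
reject_h0 = abs(t_stat) > t_crit

print("t statistic:", round(t_stat, 3))
print("reject H0 at alpha = 0.05:", reject_h0)
```

Here $t = (15 - 12)/1 = 3$, which exceeds 2.776, so under these assumptions $H_0: \mu = 12$ would be rejected at the 5% level. A statistics library that provides the $t$ distribution would also return the p-value directly.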

3. Probability Distributions

Normal (Gaussian) Distribution

The normal distribution is the most widely used distribution in statistics. It is characterised by two parameters: mean $\mu$ and standard deviation $\sigma$.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Properties: bell-shaped, symmetric about $\mu$. About 68% of data lie within $\mu \pm \sigma$, about 95% within $\mu \pm 2\sigma$, and about 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule).

The standard normal has $\mu = 0, \sigma = 1$. Any normal variable can be standardised via the z-score:

$$z = \frac{x - \mu}{\sigma}$$
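The 68-95-99.7 rule and the z-score can be checked numerically with `statistics.NormalDist` from the standard library. In this sketch the helper `coverage` and the values of `mu`, `sigma` and `x` are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within = {k: coverage(k) for k in (1, 2, 3)}

# z-score of a single reading; mu, sigma and x are hypothetical values.
mu, sigma = 15.0, 2.0
x = 19.0
z = (x - mu) / sigma

for k, prob in within.items():
    print(f"within {k} sigma: {prob:.4f}")
print("z-score:", z)
```

The computed coverages are close to, but not exactly, 68%, 95% and 99.7%, which is why the rule is stated with "about".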

Uniform Distribution

Every value in the interval $[a, b]$ is equally likely: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$, and $0$ elsewhere. Mean: $\frac{a+b}{2}$, variance: $\frac{(b-a)^2}{12}$. Useful as a prior in Bayesian analysis when we have no preference among values.
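A quick simulation gives a sanity check on the mean and variance formulas. This is only a sketch: the bounds `a` and `b`, the seed and the sample count are arbitrary assumptions, and the sample statistics will match the formulas only approximately.

```python
import random
import statistics

a, b = 5.0, 25.0          # hypothetical interval bounds
rng = random.Random(42)   # fixed seed so the run is repeatable
samples = [rng.uniform(a, b) for _ in range(50_000)]

theoretical_mean = (a + b) / 2
theoretical_var = (b - a) ** 2 / 12

sample_mean = statistics.fmean(samples)
sample_var = statistics.variance(samples)

print("mean:     theory", theoretical_mean, "simulated", round(sample_mean, 3))
print("variance: theory", round(theoretical_var, 3), "simulated", round(sample_var, 3))
```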

Binomial Distribution

Models the number of successes in $n$ independent trials, each with success probability $p$:

$$P(X = k) = \binom{n}{k}p^k(1-p)^{n-k}$$

Mean: $np$, variance: $np(1-p)$. Example: if a well has a 30% chance of being economic ($p=0.3$), the probability that exactly 3 of 10 wells ($n=10, k=3$) succeed is $\binom{10}{3}(0.3)^3(0.7)^7 \approx 0.267$.
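The well example can be reproduced directly from the formula. The function name `binomial_pmf` is illustrative. The sketch also checks that the probabilities over all possible outcomes sum to 1 and that the mean and variance match $np$ and $np(1-p)$.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
p_exactly_3 = binomial_pmf(3, n, p)

# Distribution over all possible numbers of economic wells (0 to 10).
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("P(X = 3):", round(p_exactly_3, 4))
print("sum of pmf:", round(total, 6))
print("mean:", round(mean, 6), "variance:", round(variance, 6))
```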

4. Central Limit Theorem (CLT)

Statement

If we draw independent random samples of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$, then as $n$ increases, the distribution of the sample mean $\bar{X}$ approaches a normal distribution:

$$\bar{X} \sim N\left(\mu,\; \frac{\sigma^2}{n}\right)$$

In practice, $n \geq 30$ is a common rule of thumb for the approximation to be good, although strongly skewed or heavy-tailed populations can need larger samples. This is why the normal distribution is so central to hypothesis testing: even if individual well-log readings are not normally distributed, the average of many readings will be approximately normal.
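The theorem can be illustrated with a small simulation. The sketch below assumes an exponential population with rate 1 (so $\mu = 1$ and $\sigma = 1$), picked only because it is strongly skewed. The seed, the sample size and the number of repetitions are likewise arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30                  # size of each sample
num_samples = 5000      # number of sample means to collect

# Assumed population: exponential with rate 1 (right-skewed, mu = 1, sigma = 1).
mu, sigma = 1.0, 1.0
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

standard_error = sigma / math.sqrt(n)   # CLT prediction for the spread of X-bar
mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)

# Share of sample means within two standard errors of mu (about 95% if normal).
within_2se = sum(abs(m - mu) <= 2 * standard_error for m in sample_means) / num_samples

print("mean of sample means:", round(mean_of_means, 3))
print("std of sample means:", round(std_of_means, 3), "predicted:", round(standard_error, 3))
print("fraction within 2 standard errors:", round(within_2se, 3))
```

Although individual draws are far from normal, the sample means should cluster around $\mu$ with a spread close to $\sigma/\sqrt{n}$, as the theorem predicts.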

5. Correlation and Covariance

Covariance

Covariance measures the joint variability of two variables:

$$\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Positive covariance: as $X$ increases, $Y$ tends to increase. Negative: they move in opposite directions.

Pearson Correlation Coefficient

The Pearson correlation coefficient is the covariance normalised by the two standard deviations, so it is bounded between $-1$ and $+1$:

$$r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}}$$

$r = +1$: perfect positive linear relationship. $r = -1$: perfect negative. $r = 0$: no linear relationship (but there could still be a nonlinear one!).

Geoscience example: Porosity and permeability in sandstones often show a positive correlation ($r \approx 0.6$ to $0.9$). Porosity and depth of burial tend to show a negative correlation.
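Both formulas in this section can be written out in a few lines. In the sketch below the porosity values are the ones used in Section 1, while the permeability values are hypothetical and chosen only to give a positive relationship. The function names are illustrative.

```python
import math

porosity = [12, 15, 18, 14, 16]        # %
permeability = [20, 60, 150, 45, 80]   # mD, hypothetical values

def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # Cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

cov_xy = sample_covariance(porosity, permeability)
r = pearson_r(porosity, permeability)

print("covariance:", cov_xy)
print("Pearson r:", round(r, 3))
```

The covariance carries units (here % times mD), so its size is hard to judge on its own. Dividing by the two standard deviations removes the units, which is what makes $r$ comparable across variable pairs.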

[Refs: Wackerly et al., Mathematical Statistics; Davis, Statistics and Data Analysis in Geology]
