Background Statistics
Learning objectives
- Compute descriptive statistics: mean, median, mode, variance, and standard deviation
- Interpret percentiles, quartiles, and box plots
- Distinguish populations from samples and understand hypothesis testing
- Apply the Central Limit Theorem
- Compute correlation and covariance and interpret their geoscience meaning
Why Statistics Matters in Geoscience
Every geoscience dataset—whether it is a suite of well-log measurements, a grid of seismic amplitudes, or a collection of grain-size analyses—contains variability. Statistics gives us the language and tools to summarise that variability, quantify uncertainty, and make defensible decisions from data.
1. Descriptive Statistics
Descriptive statistics condense a dataset into a few key numbers.
Measures of Central Tendency
Mean (arithmetic average):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The mean is the "balance point" of the data. For well-log porosity values of 12%, 15%, 18%, 14%, 16%, the mean is $\bar{x} = (12 + 15 + 18 + 14 + 16)/5 = 15\%$.
Median: The middle value when data are sorted. If $n$ is even, the median is the average of the two middle values. The median is robust to outliers: a single anomalously high permeability reading will not pull it far from the bulk of the data.
Mode: The most frequently occurring value (or the peak of a histogram). In grain-size analysis, the mode tells us the dominant grain size in a sediment sample.
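The three measures can be checked with Python's standard `statistics` module. This is a minimal sketch: the porosity values are the ones from the worked example, while the grain-size classes and the 60% outlier are invented purely for illustration.

```python
import statistics

# Porosity values (%) from the worked example in the text.
porosity = [12, 15, 18, 14, 16]

mean_phi = statistics.mean(porosity)      # sum of the values divided by their count
median_phi = statistics.median(porosity)  # middle value of the sorted data

# Hypothetical grain-size classes, invented to illustrate the mode.
grain_classes = ["fine", "medium", "medium", "coarse", "medium"]
dominant_class = statistics.mode(grain_classes)

# Hypothetical anomalous reading: it moves the mean far more than the median.
with_outlier = porosity + [60]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(mean_phi, median_phi, dominant_class)  # 15 15 medium
print(mean_out, median_out)                  # 22.5 15.5
```

Adding the single outlier shifts the mean from 15 to 22.5 but the median only from 15 to 15.5, which is the robustness property described above.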
Measures of Spread
Variance measures how far values deviate from the mean, as an average squared deviation. The sample variance divides by $n - 1$ rather than $n$ (Bessel's correction) to give an unbiased estimate of the population variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Standard deviation is the square root of the variance: $s = \sqrt{s^2}$. It is in the same units as the data, making it easier to interpret than the variance.
Range: $x_{\max} - x_{\min}$. Simple but sensitive to outliers.
Interquartile Range (IQR): $\mathrm{IQR} = Q_3 - Q_1$, where $Q_1$ is the 25th percentile and $Q_3$ is the 75th percentile. The IQR captures the spread of the central 50% of the data and is robust to outliers.
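As a rough illustration, the spread measures can be computed for the same five porosity values used in the mean example. The quartile method is an assumption: several conventions exist, and `method="inclusive"` in `statistics.quantiles` is only one of them, so other software may report slightly different quartiles.

```python
import statistics

# Porosity values (%) from the worked example in the text.
porosity = [12, 15, 18, 14, 16]

sample_var = statistics.variance(porosity)  # sample variance, divides by n - 1
sample_std = statistics.stdev(porosity)     # square root of the sample variance
data_range = max(porosity) - min(porosity)  # maximum minus minimum

# Quartile cut points; the "inclusive" interpolation rule is one convention among several.
q1, q2, q3 = statistics.quantiles(porosity, n=4, method="inclusive")
iqr = q3 - q1

print(sample_var, round(sample_std, 3), data_range, q1, q3, iqr)
```

Here the squared deviations from the mean of 15 sum to 20, so the sample variance is 20 / 4 = 5 and the standard deviation is about 2.24 porosity units.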
Percentiles, Quartiles, and Box Plots
The $p$-th percentile is the value below which $p\%$ of the data fall. Special percentiles:
- $P_{25} = Q_1$: first quartile
- $P_{50} = Q_2$: median (second quartile)
- $P_{75} = Q_3$: third quartile
A box plot (box-and-whisker diagram) displays the five-number summary: minimum, $Q_1$, median, $Q_3$, maximum. Points beyond $1.5 \times \mathrm{IQR}$ from the quartiles are plotted individually as potential outliers. Box plots are excellent for comparing porosity distributions across different geological formations.
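The numbers behind a box plot can be sketched without any plotting library. The helper name `box_plot_summary` and the porosity readings below are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; plotting packages may use a different rule.

```python
import statistics

def box_plot_summary(values):
    """Return the five-number summary plus points beyond 1.5 * IQR from the quartiles."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }

# Hypothetical porosity readings (%) containing one suspicious value.
formation_a = [12, 14, 15, 16, 18, 13, 17, 15, 35]
summary = box_plot_summary(formation_a)
print(summary)
```

For these readings the quartiles are 14 and 17, the fences sit at 9.5 and 21.5, and the 35% reading is the only point flagged as a potential outlier.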
2. Inferential Statistics
Populations vs. Samples
A population is the complete set of items we wish to study (e.g., all porosity values in a reservoir). A sample is a subset drawn from the population (e.g., the 200 core-plug measurements we actually made). We use sample statistics ($\bar{x}$, $s$) to estimate population parameters ($\mu$, $\sigma$).
Hypothesis Testing
Hypothesis testing lets us make decisions about populations from samples. The procedure is:
- State hypotheses: Null hypothesis $H_0$ (no effect / no difference) vs. alternative $H_1$.
- Choose significance level $\alpha$ (commonly 0.05).
- Compute a test statistic and its p-value.
- Decide: If $p < \alpha$, reject $H_0$ in favour of $H_1$; otherwise, do not reject $H_0$.
The p-value is the probability of observing data as extreme as (or more extreme than) our sample, assuming $H_0$ is true. A small p-value means the data are unlikely under $H_0$.
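The four steps can be walked through with a simple two-sided z-test, which is used here only because the standard library provides the normal distribution. This is a sketch under stated assumptions: the function name `two_sided_z_test`, the hypothesised mean of 12% and the "known" population standard deviation of 2.0 are all invented for illustration.

```python
import math
from statistics import NormalDist, fmean

def two_sided_z_test(values, mu0, sigma, alpha=0.05):
    """Test the null hypothesis 'population mean equals mu0' when sigma is known."""
    n = len(values)
    z = (fmean(values) - mu0) / (sigma / math.sqrt(n))  # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # probability of a result at least this extreme
    return z, p_value, p_value < alpha                  # decision: reject when p < alpha

# Porosity values (%) from the text; mu0 and sigma are assumed values.
z, p_value, reject_h0 = two_sided_z_test([12, 15, 18, 14, 16], mu0=12.0, sigma=2.0)
print(round(z, 3), round(p_value, 5), reject_h0)
```

With these assumed inputs the statistic is about 3.35 and the p-value is below 0.001, so the null hypothesis would be rejected at the 0.05 level.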
Confidence Intervals
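A $100(1-\alpha)\%$ confidence interval for the population mean $\mu$ is:
$$\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \quad \text{(known } \sigma\text{)} \qquad \text{or} \qquad \bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \quad \text{(unknown } \sigma\text{)}$$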
A 95% CI means: if we repeated sampling many times, 95% of the calculated intervals would contain the true population mean.
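For the known-standard-deviation case, the interval can be sketched with `statistics.NormalDist`, which supplies the normal critical value. The function name `z_confidence_interval` and the population standard deviation of 2.0 are assumptions made for the example; the unknown case would need a Student's t critical value, which the standard library does not provide.

```python
import math
from statistics import NormalDist, fmean

def z_confidence_interval(values, sigma, level=0.95):
    """Confidence interval for the mean when the population standard deviation is known."""
    n = len(values)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z_crit * sigma / math.sqrt(n)
    centre = fmean(values)
    return centre - half_width, centre + half_width

# Porosity values (%) from the text; sigma = 2.0 is an assumed population value.
low, high = z_confidence_interval([12, 15, 18, 14, 16], sigma=2.0)
print(round(low, 3), round(high, 3))
```

The interval is centred on the sample mean of 15 and narrows in proportion to one over the square root of the sample size, so quadrupling the number of measurements halves its width.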
The t-Test
The one-sample t-test checks whether a sample mean $\bar{x}$ differs significantly from a hypothesised value $\mu_0$:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
The two-sample t-test compares means of two independent groups. For example: "Is the mean porosity of Formation A significantly different from Formation B?"
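Both test statistics can be computed by hand. The sketch below makes several assumptions: the Formation B values and the hypothesised mean of 12% are invented, the two-sample statistic uses Welch's unequal-variance form (one common variant, not necessarily the one intended here), and the one-sample decision relies on the tabulated two-sided 5% critical value of 2.776 for 4 degrees of freedom rather than an exact p-value.

```python
import math
import statistics

def one_sample_t(values, mu0):
    """t statistic: (sample mean - mu0) divided by the standard error s / sqrt(n)."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.fmean(values) - mu0) / standard_error

def welch_t(group_a, group_b):
    """Two-sample t statistic in Welch's form, which does not assume equal variances."""
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / standard_error

formation_a = [12, 15, 18, 14, 16]  # porosity values (%) from the text
formation_b = [20, 22, 19, 21, 23]  # hypothetical second formation

t_one = one_sample_t(formation_a, mu0=12.0)  # mu0 = 12 is an assumed hypothesised mean
T_CRIT = 2.776                               # tabulated: two-sided, alpha = 0.05, 4 degrees of freedom
reject_h0 = abs(t_one) > T_CRIT

t_two = welch_t(formation_a, formation_b)
print(round(t_one, 3), reject_h0, round(t_two, 3))
```

For Formation A the standard error happens to equal 1, so the one-sample statistic is simply 15 - 12 = 3, which exceeds the critical value.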
3. Probability Distributions
Normal (Gaussian) Distribution
The most important distribution in statistics, characterised by two parameters: mean $\mu$ and standard deviation $\sigma$.
Properties: bell-shaped, symmetric about $\mu$. About 68% of data lie within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule).
The standard normal has $\mu = 0$ and $\sigma = 1$. Any normal variable can be standardised via the z-score:
$$z = \frac{x - \mu}{\sigma}$$
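The 68-95-99.7 rule and the z-score can be verified numerically with `statistics.NormalDist`. The helper names and the example reading (19% porosity in a population with mean 15% and standard deviation 2%) are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within = [prob_within(k) for k in (1, 2, 3)]

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (or below) the mean."""
    return (x - mu) / sigma

# Hypothetical reading and population parameters.
z = z_score(19.0, mu=15.0, sigma=2.0)

print([round(p, 4) for p in within], z)  # [0.6827, 0.9545, 0.9973] 2.0
```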
Uniform Distribution
Every value in the interval $[a, b]$ is equally likely: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$. Mean: $\frac{a+b}{2}$, variance: $\frac{(b-a)^2}{12}$. Useful as a prior in Bayesian analysis when we have no preference among values.
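A quick simulation can be used to sanity-check the mean and variance of a uniform distribution, which are (lower bound + upper bound) / 2 and (upper bound - lower bound) squared / 12. The interval bounds 2 and 8, the seed and the sample count are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

a, b = 2.0, 8.0  # hypothetical interval bounds
samples = [random.uniform(a, b) for _ in range(100_000)]

theory_mean = (a + b) / 2        # 5.0
theory_var = (b - a) ** 2 / 12   # 3.0

sim_mean = statistics.fmean(samples)
sim_var = sum((x - sim_mean) ** 2 for x in samples) / len(samples)

print(theory_mean, theory_var, round(sim_mean, 2), round(sim_var, 2))
```

The simulated values should land close to the theoretical ones, with the small remaining gap being ordinary sampling noise.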
Binomial Distribution
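Models the number of successes $k$ in $n$ independent trials, each with success probability $p$:
$$P(X = k) = \binom{n}{k}\, p^k (1-p)^{n-k}$$
Mean: $np$, variance: $np(1-p)$. Example: if a well has a 30% chance of being economic ($p = 0.3$), the probability that exactly 3 of 10 wells ($n = 10$) succeed is $P(X = 3) = \binom{10}{3}(0.3)^3(0.7)^7 \approx 0.267$.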
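The well example can be reproduced with `math.comb`. This is a minimal sketch; the function name `binomial_pmf` is hypothetical, and the calculation assumes the ten wells succeed or fail independently.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_wells, p_economic = 10, 0.3

prob_three = binomial_pmf(3, n_wells, p_economic)
total = sum(binomial_pmf(k, n_wells, p_economic) for k in range(n_wells + 1))

expected_successes = n_wells * p_economic              # mean: n * p
variance = n_wells * p_economic * (1 - p_economic)     # variance: n * p * (1 - p)

print(round(prob_three, 3), round(total, 6), expected_successes, round(variance, 2))
# 0.267 1.0 3.0 2.1
```

Summing the probabilities over every possible number of successes returns 1, which is a useful check that the formula has been coded correctly.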
4. Central Limit Theorem (CLT)
Statement
If we draw random samples of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$, then as $n$ increases, the distribution of the sample mean $\bar{X}$ approaches a normal distribution:
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{(approximately, for large } n\text{)}$$
In practice, $n \geq 30$ is usually large enough for the approximation to be good. This is why the normal distribution is so central to hypothesis testing: even if individual well-log readings are not normally distributed, the average of many readings will be approximately normal.
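The theorem can be illustrated with a small simulation. The choices below are assumptions for the sketch: a strongly skewed exponential population with mean 1 and standard deviation 1, a sample size of 30, 5000 repeated samples and a fixed seed. The sample means should cluster around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def sample_means(n, trials=5000):
    """Means of repeated samples of size n from an exponential population (mean 1, std 1)."""
    return [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]

n = 30
means = sample_means(n)

centre = statistics.fmean(means)   # should be near the population mean, 1.0
spread = statistics.stdev(means)   # should be near 1 / sqrt(n)
expected_spread = 1.0 / n ** 0.5

print(round(centre, 3), round(spread, 3), round(expected_spread, 3))
```

A histogram of `means` would look far more symmetric and bell-shaped than a histogram of the individual exponential draws, even though the underlying population is heavily skewed.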
5. Correlation and Covariance
Covariance
Covariance measures the joint variability of two variables:
$$\mathrm{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
Positive covariance: as $X$ increases, $Y$ tends to increase. Negative: they move in opposite directions.
Pearson Correlation Coefficient
Normalised covariance, bounded between $-1$ and $+1$:
$$r = \frac{\mathrm{cov}(X, Y)}{s_X\, s_Y}$$
$r = +1$: perfect positive linear relationship. $r = -1$: perfect negative. $r = 0$: no linear relationship (but there could still be a nonlinear one!).
Geoscience example: Porosity and permeability in sandstones often show a positive correlation ($r > 0$). Porosity and depth of burial tend to show a negative correlation.
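Sample covariance and the Pearson coefficient can be computed directly from their definitions. The porosity and log-permeability values below are hypothetical, as are the function names; the second dataset is a parabola, included to show that a coefficient of zero does not rule out a nonlinear relationship.

```python
import math
import statistics

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    return sample_covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))

# Hypothetical sandstone data: porosity (%) and log10 permeability (mD).
porosity = [10, 12, 14, 16, 18]
log_perm = [0.5, 1.0, 1.4, 2.1, 2.5]

cov_pk = sample_covariance(porosity, log_perm)
r_pk = pearson_r(porosity, log_perm)

# A perfect nonlinear (quadratic) relationship with no linear trend.
xs = [-2, -1, 0, 1, 2]
ys = [x**2 for x in xs]
r_parabola = pearson_r(xs, ys)

print(round(cov_pk, 2), round(r_pk, 3), round(abs(r_parabola), 3))
```

The hypothetical porosity and permeability data give a coefficient close to +1, while the parabola gives zero despite the second variable being fully determined by the first.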
[Refs: Wackerly et al., Mathematical Statistics; Davis, Statistics and Data Analysis in Geology]