Statistics Practice with Python
Learning objectives
- Compute descriptive statistics using NumPy and Pandas
- Create histograms and box plots with Matplotlib
- Perform hypothesis tests using SciPy
- Interpret statistical summaries of geoscience data
Computing Statistics in Python
Python's scientific ecosystem makes statistical analysis straightforward. The key libraries are NumPy (fast numerical arrays), Pandas (DataFrames for tabular data), Matplotlib (plotting), and SciPy (advanced statistics).
Basic Descriptive Statistics with NumPy
NumPy provides efficient functions for arrays of numbers:
%%%python import numpy as np
Porosity measurements from 10 core plugs (%)
porosity = np.array([12.1, 15.3, 14.8, 18.2, 11.5, 16.7, 13.9, 19.0, 15.1, 14.4])
mean_val = np.mean(porosity) # 15.1 median_val = np.median(porosity) # 15.0 std_val = np.std(porosity, ddof=1) # ddof=1 for sample std var_val = np.var(porosity, ddof=1) # sample variance min_val = np.min(porosity) max_val = np.max(porosity)
print(f"Mean: {mean_val:.2f}%") print(f"Median: {median_val:.2f}%") print(f"Std: {std_val:.2f}%") print(f"Var: {var_val:.2f}") print(f"Range: {min_val:.1f} to {max_val:.1f}") %%%
Important: Use ddof=1 in np.std() and np.var() for the sample standard deviation and variance (divides by ). The default ddof=0 gives the population version (divides by ).
Percentiles and Quartiles
%%%python
Compute quartiles
Q1 = np.percentile(porosity, 25) Q2 = np.percentile(porosity, 50) # same as median Q3 = np.percentile(porosity, 75) IQR = Q3 - Q1
print(f"Q1 = {Q1:.2f}, Q2 = {Q2:.2f}, Q3 = {Q3:.2f}") print(f"IQR = {IQR:.2f}")
Detect outliers
lower_fence = Q1 - 1.5 * IQR upper_fence = Q3 + 1.5 * IQR outliers = porosity[(porosity < lower_fence) | (porosity > upper_fence)] print(f"Outliers: {outliers}") %%%
Pandas describe() — Quick Summary
Pandas gives a one-line summary of any DataFrame column:
%%%python import pandas as pd
Create a DataFrame from well data
df = pd.DataFrame({ "well": ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"], "depth_m": [1200, 1350, 1500, 1650, 1800, 1950, 2100, 2250], "porosity": [24.1, 22.3, 19.8, 18.5, 16.2, 14.9, 13.1, 11.7], "perm_mD": [450, 320, 180, 120, 75, 45, 25, 12] })
Quick statistical summary
print(df.describe())
Shows count, mean, std, min, 25%, 50%, 75%, max for each numeric column
Summary for a single column
print(df["porosity"].describe()) %%%
Histograms with Matplotlib
Histograms show the distribution shape of your data:
%%%python import matplotlib.pyplot as plt
Histogram of porosity values
plt.figure(figsize=(8, 5)) plt.hist(df["porosity"], bins=6, color="steelblue", edgecolor="black", alpha=0.7) plt.xlabel("Porosity (%)") plt.ylabel("Frequency") plt.title("Porosity Distribution — Well Data") plt.axvline(df["porosity"].mean(), color="red", linestyle="--", label=f'Mean = {df["porosity"].mean():.1f}%') plt.legend() plt.tight_layout() plt.show() %%%
Box Plots for Comparing Groups
%%%python
Compare porosity distributions from two formations
formation_A = [22.1, 24.5, 19.8, 23.2, 21.0, 25.3, 20.1] formation_B = [14.5, 16.2, 12.8, 15.9, 13.4, 17.1, 14.0]
plt.figure(figsize=(6, 5)) plt.boxplot([formation_A, formation_B], labels=["Formation A", "Formation B"]) plt.ylabel("Porosity (%)") plt.title("Porosity Comparison by Formation") plt.grid(axis="y", alpha=0.3) plt.tight_layout() plt.show() %%%
The box shows to , the line inside is the median, and the whiskers extend to the farthest non-outlier points.
Hypothesis Testing with SciPy
%%%python from scipy import stats
One-sample t-test: is the mean porosity = 20%?
t_stat, p_value = stats.ttest_1samp(df["porosity"], 20.0) print(f"t-statistic: {t_stat:.3f}") print(f"p-value: {p_value:.4f}") if p_value < 0.05: print("Reject H0: mean porosity is significantly different from 20%") else: print("Fail to reject H0: not enough evidence to say mean differs from 20%")
Two-sample t-test: compare Formation A vs Formation B
t_stat2, p_value2 = stats.ttest_ind(formation_A, formation_B) print(f"\nTwo-sample t-test: t = {t_stat2:.3f}, p = {p_value2:.4f}") %%%
Correlation Matrix
%%%python
Compute and display correlation matrix
corr_matrix = df[["depth_m", "porosity", "perm_mD"]].corr() print(corr_matrix)
Scatter plot with correlation
plt.figure(figsize=(6, 5)) plt.scatter(df["depth_m"], df["porosity"], color="teal", s=60) plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title(f'Depth vs Porosity (r = {df["depth_m"].corr(df["porosity"]):.3f})') plt.tight_layout() plt.show() %%%
[Libraries: NumPy, Pandas, Matplotlib, SciPy]
References
- McKinney, W. (2017). Python for Data Analysis (2nd ed.), ch. 5 & 9 (pandas, plotting). O’Reilly.
- VanderPlas, J. (2016). Python Data Science Handbook, ch. 3 & 4 (pandas, visualization). O’Reilly.
- Harris, C.R., et al. (2020). Array programming with NumPy. Nature 585, 357–362.