Statistics Practice with Python

Chapter 2: Statistical Foundations for ML

Learning objectives

  • Compute descriptive statistics using NumPy and Pandas
  • Create histograms and box plots with Matplotlib
  • Perform hypothesis tests using SciPy
  • Interpret statistical summaries of geoscience data

Computing Statistics in Python

Python's scientific ecosystem makes statistical analysis straightforward. The key libraries are NumPy (fast numerical arrays), Pandas (DataFrames for tabular data), Matplotlib (plotting), and SciPy (advanced statistics).

Basic Descriptive Statistics with NumPy

NumPy provides efficient functions for arrays of numbers:

%%%python import numpy as np

Porosity measurements from 10 core plugs (%)

porosity = np.array([12.1, 15.3, 14.8, 18.2, 11.5, 16.7, 13.9, 19.0, 15.1, 14.4])

mean_val = np.mean(porosity) # 15.1 median_val = np.median(porosity) # 15.0 std_val = np.std(porosity, ddof=1) # ddof=1 for sample std var_val = np.var(porosity, ddof=1) # sample variance min_val = np.min(porosity) max_val = np.max(porosity)

print(f"Mean: {mean_val:.2f}%") print(f"Median: {median_val:.2f}%") print(f"Std: {std_val:.2f}%") print(f"Var: {var_val:.2f}") print(f"Range: {min_val:.1f} to {max_val:.1f}") %%%

Important: Use ddof=1 in np.std() and np.var() for the sample standard deviation and variance (divides by n1n-1). The default ddof=0 gives the population version (divides by nn).

Percentiles and Quartiles

%%%python

Compute quartiles

Q1 = np.percentile(porosity, 25) Q2 = np.percentile(porosity, 50) # same as median Q3 = np.percentile(porosity, 75) IQR = Q3 - Q1

print(f"Q1 = {Q1:.2f}, Q2 = {Q2:.2f}, Q3 = {Q3:.2f}") print(f"IQR = {IQR:.2f}")

Detect outliers

lower_fence = Q1 - 1.5 * IQR upper_fence = Q3 + 1.5 * IQR outliers = porosity[(porosity < lower_fence) | (porosity > upper_fence)] print(f"Outliers: {outliers}") %%%

Pandas describe() — Quick Summary

Pandas gives a one-line summary of any DataFrame column:

%%%python import pandas as pd

Create a DataFrame from well data

df = pd.DataFrame({ "well": ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"], "depth_m": [1200, 1350, 1500, 1650, 1800, 1950, 2100, 2250], "porosity": [24.1, 22.3, 19.8, 18.5, 16.2, 14.9, 13.1, 11.7], "perm_mD": [450, 320, 180, 120, 75, 45, 25, 12] })

Quick statistical summary

print(df.describe())

Shows count, mean, std, min, 25%, 50%, 75%, max for each numeric column

Summary for a single column

print(df["porosity"].describe()) %%%

Histograms with Matplotlib

Histograms show the distribution shape of your data:

%%%python import matplotlib.pyplot as plt

Histogram of porosity values

plt.figure(figsize=(8, 5)) plt.hist(df["porosity"], bins=6, color="steelblue", edgecolor="black", alpha=0.7) plt.xlabel("Porosity (%)") plt.ylabel("Frequency") plt.title("Porosity Distribution — Well Data") plt.axvline(df["porosity"].mean(), color="red", linestyle="--", label=f'Mean = {df["porosity"].mean():.1f}%') plt.legend() plt.tight_layout() plt.show() %%%

Box Plots for Comparing Groups

%%%python

Compare porosity distributions from two formations

formation_A = [22.1, 24.5, 19.8, 23.2, 21.0, 25.3, 20.1] formation_B = [14.5, 16.2, 12.8, 15.9, 13.4, 17.1, 14.0]

plt.figure(figsize=(6, 5)) plt.boxplot([formation_A, formation_B], labels=["Formation A", "Formation B"]) plt.ylabel("Porosity (%)") plt.title("Porosity Comparison by Formation") plt.grid(axis="y", alpha=0.3) plt.tight_layout() plt.show() %%%

The box shows Q1Q_1 to Q3Q_3, the line inside is the median, and the whiskers extend to the farthest non-outlier points.

Hypothesis Testing with SciPy

%%%python from scipy import stats

One-sample t-test: is the mean porosity = 20%?

t_stat, p_value = stats.ttest_1samp(df["porosity"], 20.0) print(f"t-statistic: {t_stat:.3f}") print(f"p-value: {p_value:.4f}") if p_value < 0.05: print("Reject H0: mean porosity is significantly different from 20%") else: print("Fail to reject H0: not enough evidence to say mean differs from 20%")

Two-sample t-test: compare Formation A vs Formation B

t_stat2, p_value2 = stats.ttest_ind(formation_A, formation_B) print(f"\nTwo-sample t-test: t = {t_stat2:.3f}, p = {p_value2:.4f}") %%%

Correlation Matrix

%%%python

Compute and display correlation matrix

corr_matrix = df[["depth_m", "porosity", "perm_mD"]].corr() print(corr_matrix)

Scatter plot with correlation

plt.figure(figsize=(6, 5)) plt.scatter(df["depth_m"], df["porosity"], color="teal", s=60) plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title(f'Depth vs Porosity (r = {df["depth_m"].corr(df["porosity"]):.3f})') plt.tight_layout() plt.show() %%%

[Libraries: NumPy, Pandas, Matplotlib, SciPy]

References

  • McKinney, W. (2017). Python for Data Analysis (2nd ed.), ch. 5 & 9 (pandas, plotting). O’Reilly.
  • VanderPlas, J. (2016). Python Data Science Handbook, ch. 3 & 4 (pandas, visualization). O’Reilly.
  • Harris, C.R., et al. (2020). Array programming with NumPy. Nature 585, 357–362.

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.