Regression Practice with Python

Chapter 4: Linear Models — Regression and Classification

Learning objectives

Load and explore geoscience data for regression analysis
Implement simple and multiple linear regression using scikit-learn
Implement logistic regression for binary classification
Evaluate models using R-squared, MSE, residual plots, and classification accuracy
Apply polynomial regression with PolynomialFeatures
Plot regression lines and decision boundaries

Regression with scikit-learn

scikit-learn provides consistent APIs for all regression and classification models. The workflow is always: create model $\to$ fit(X, y) $\to$ predict(X_new) $\to$ evaluate.

Step 1: Loading and Exploring Geoscience Data

Loading a Depth-Porosity Dataset

Before fitting any model, always explore the data. Check for missing values, data types, and basic statistics.

%%%python import numpy as np import pandas as pd import matplotlib.pyplot as plt

Simulated well-log data: depth vs porosity

df = pd.DataFrame({ "depth": [1000, 1100, 1200, 1350, 1400, 1550, 1600, 1750, 1800, 1950, 2000, 2150, 2200, 2350, 2400], "gamma_ray": [30, 35, 42, 50, 55, 60, 68, 72, 78, 85, 90, 95, 100, 108, 112], "density": [2.20, 2.22, 2.25, 2.28, 2.30, 2.33, 2.36, 2.40, 2.42, 2.45, 2.48, 2.50, 2.52, 2.55, 2.58], "porosity": [30, 28, 25, 23, 22, 19, 18, 15, 14, 12, 10, 9, 8, 6, 5] })

print(df.head()) print("\nShape:", df.shape) print("\nDescriptive statistics:\n", df.describe()) print("\nMissing values:\n", df.isnull().sum())

Quick scatter plot

plt.scatter(df["depth"], df["porosity"], color="teal") plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title("Porosity vs Depth — Raw Data") plt.show() %%%

Step 2: Train/Test Split

Splitting Data for Honest Evaluation

Always split data before fitting any model. This prevents data leakage and gives an honest estimate of model performance on unseen data.

%%%python from sklearn.model_selection import train_test_split

X = df[["depth"]].values # Feature matrix (must be 2D) y = df["porosity"].values # Target vector (1D)

80% train, 20% test

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 )

print(f"Training samples: {len(X_train)}") print(f"Test samples: {len(X_test)}") print(f"Train depths range: {X_train.min():.0f} - {X_train.max():.0f} m") print(f"Test depths range: {X_test.min():.0f} - {X_test.max():.0f} m") %%%

Step 3: Simple Linear Regression

Fit, Predict, Score, and Plot

Linear regression fits the model $\hat{y} = w \cdot x + b$ by minimizing the sum of squared residuals.

%%%python from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error, r2_score

Create and fit the model

model = LinearRegression() model.fit(X_train, y_train)

Model parameters

print(f"Slope (w): {model.coef_[0]:.4f} %/m") # porosity change per meter print(f"Intercept (b): {model.intercept_:.2f} %") # porosity at depth=0

Predictions on training and test sets

y_train_pred = model.predict(X_train) y_test_pred = model.predict(X_test)

Evaluation metrics

print(f"\nTrain R squared: {r2_score(y_train, y_train_pred):.4f}") print(f"Test R squared: {r2_score(y_test, y_test_pred):.4f}") print(f"Train MSE: {mean_squared_error(y_train, y_train_pred):.4f}") print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.4f}")

Plot the regression line

x_range = np.linspace(900, 2500, 200).reshape(-1, 1) y_range = model.predict(x_range)

plt.scatter(X_train, y_train, color="teal", label="Train", zorder=3) plt.scatter(X_test, y_test, color="orange", marker="s", label="Test", zorder=3) plt.plot(x_range, y_range, color="red", linewidth=2, label="Regression Line") plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title("Simple Linear Regression: Porosity vs Depth") plt.legend() plt.grid(True, alpha=0.3) plt.show() %%%

Step 4: Evaluating with Residual Plots

Residual Analysis

Residuals = $y_{\text{actual}} - y_{\text{predicted}}$ . A good model has residuals randomly scattered around zero with no pattern.

%%%python

Compute residuals on test set

residuals = y_test - y_test_pred

plt.figure(figsize=(10, 4))

Residual plot

plt.subplot(1, 2, 1) plt.scatter(y_test_pred, residuals, color="steelblue", edgecolor="k") plt.axhline(y=0, color="red", linestyle="--") plt.xlabel("Predicted Porosity (%)") plt.ylabel("Residual (%)") plt.title("Residual Plot")

Histogram of residuals

plt.subplot(1, 2, 2) plt.hist(residuals, bins=6, color="steelblue", edgecolor="k") plt.xlabel("Residual (%)") plt.ylabel("Frequency") plt.title("Residual Distribution")

plt.tight_layout() plt.show()

print(f"Mean residual: {residuals.mean():.4f} (should be near 0)") print(f"Std residual: {residuals.std():.4f}") %%%

Step 5: Multiple Linear Regression

Using Multiple Well-Log Features

When we have multiple features (depth, gamma ray, density), the model becomes $\hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$ . The scikit-learn API is identical.

%%%python

Multiple features

X_multi = df[["depth", "gamma_ray", "density"]].values y_multi = df["porosity"].values

X_train_m, X_test_m, y_train_m, y_test_m = train_test_split( X_multi, y_multi, test_size=0.2, random_state=42 )

model_multi = LinearRegression() model_multi.fit(X_train_m, y_train_m)

Show each coefficient with its feature name

feature_names = ["depth", "gamma_ray", "density"] for name, coef in zip(feature_names, model_multi.coef_): print(f" {name:12s}: {coef:+.4f}") print(f" intercept : {model_multi.intercept_:+.2f}")

print(f"\nTrain R squared: {model_multi.score(X_train_m, y_train_m):.4f}") print(f"Test R squared: {model_multi.score(X_test_m, y_test_m):.4f}")

Compare single vs multiple regression

print(f"\nSingle-feature R squared (test): {r2_score(y_test, y_test_pred):.4f}") print(f"Multi-feature R squared (test): {model_multi.score(X_test_m, y_test_m):.4f}") print("Multiple features often improve the model.") %%%

Step 6: Logistic Regression for Rock Classification

Binary Classification: Sandstone vs Shale

Logistic regression predicts probabilities using the sigmoid function: $P(y=1|x) = \frac{1}{1 + e^{-(w^T x + b)}}$

%%%python from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score, classification_report

Binary classification: sandstone (1) vs shale (0)

Features: gamma ray and density

X_cls = np.array([ [30, 2.20], [35, 2.22], [40, 2.25], [45, 2.28], # sandstone [38, 2.24], [42, 2.26], # sandstone [80, 2.50], [85, 2.52], [90, 2.55], [95, 2.58], # shale [88, 2.54], [92, 2.56], # shale [50, 2.30], [75, 2.45] # borderline ]) y_cls = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

clf = LogisticRegression() clf.fit(X_cls, y_cls)

y_cls_pred = clf.predict(X_cls) print(f"Accuracy: {accuracy_score(y_cls, y_cls_pred):.2f}") print(classification_report(y_cls, y_cls_pred, target_names=["Shale", "Sandstone"]))

Predict probability for a new sample

new_sample = np.array([[60, 2.38]]) prob = clf.predict_proba(new_sample) print(f"\nNew sample [GR=60, RHOB=2.38]:") print(f" P(shale) = {prob[0][0]:.3f}") print(f" P(sandstone) = {prob[0][1]:.3f}") print(f" Prediction: {clf.predict(new_sample)[0]}") %%%

Step 7: Decision Boundary Visualization

Plotting the Logistic Regression Decision Boundary

The decision boundary is the line where $P(y=1) = 0.5$ , i.e., where $w_1 x_1 + w_2 x_2 + b = 0$ .

%%%python

Create a mesh grid for the decision boundary

x_min, x_max = X_cls[:, 0].min() - 5, X_cls[:, 0].max() + 5 y_min, y_max = X_cls[:, 1].min() - 0.05, X_cls[:, 1].max() + 0.05 xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200)) Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure(figsize=(8, 6)) plt.contourf(xx, yy, Z, alpha=0.3, cmap="RdYlBu") plt.scatter(X_cls[y_cls==1, 0], X_cls[y_cls==1, 1], c="blue", label="Sandstone", edgecolor="k", s=60) plt.scatter(X_cls[y_cls==0, 0], X_cls[y_cls==0, 1], c="red", label="Shale", edgecolor="k", s=60) plt.xlabel("Gamma Ray (API)") plt.ylabel("Density (g/cc)") plt.title("Logistic Regression Decision Boundary") plt.legend() plt.grid(True, alpha=0.2) plt.show() %%%

Step 8: Polynomial Regression

Capturing Nonlinear Trends with PolynomialFeatures

When the relationship is nonlinear (e.g., porosity decay with depth follows an exponential-like curve), polynomial regression can help. It creates new features $x, x^2, x^3, \ldots$ and then fits a linear model to them.

%%%python from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import make_pipeline

Compare degree 1, 2, and 3 polynomial fits

plt.figure(figsize=(10, 5)) plt.scatter(X_train, y_train, color="teal", label="Train", zorder=3) plt.scatter(X_test, y_test, color="orange", marker="s", label="Test", zorder=3)

colors = ["red", "green", "purple"] for degree, color in zip([1, 2, 3], colors): poly_model = make_pipeline( PolynomialFeatures(degree=degree, include_bias=False), LinearRegression() ) poly_model.fit(X_train, y_train) y_poly = poly_model.predict(x_range) r2_test = poly_model.score(X_test, y_test) plt.plot(x_range, y_poly, color=color, linewidth=2, label=f"Degree {degree} (R2={r2_test:.3f})")

plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title("Polynomial Regression: Comparing Degrees") plt.legend() plt.grid(True, alpha=0.3) plt.show() %%%

Detailed Polynomial Regression Walkthrough

%%%python

Degree-2 polynomial regression step-by-step

poly = PolynomialFeatures(degree=2, include_bias=False) X_train_poly = poly.fit_transform(X_train) X_test_poly = poly.transform(X_test)

print("Original features shape:", X_train.shape) # (n, 1) print("Polynomial features shape:", X_train_poly.shape) # (n, 2) print("Feature names:", poly.get_feature_names_out())

Fit linear regression on polynomial features

model_poly = LinearRegression() model_poly.fit(X_train_poly, y_train)

print(f"\nCoefficients: {model_poly.coef_}") print(f"Intercept: {model_poly.intercept_:.2f}") print(f"Train R2: {model_poly.score(X_train_poly, y_train):.4f}") print(f"Test R2: {model_poly.score(X_test_poly, y_test):.4f}") %%%

Key Methods Summary

Method	Purpose
`.fit(X, y)`	Train the model on data
`.predict(X)`	Make predictions
`.score(X, y)`	R-squared for regression, accuracy for classification
`.coef_`	Learned weights
`.intercept_`	Learned bias
`.predict_proba(X)`	Class probabilities (logistic only)
`PolynomialFeatures`	Create $x^2, x^3, \ldots$ features for nonlinear regression

[Library: scikit-learn]

References

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 4 (training linear & logistic models). O’Reilly.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 3 (linear regression labs). Springer.