Regression Practice with Python
Learning objectives
- Load and explore geoscience data for regression analysis
- Implement simple and multiple linear regression using scikit-learn
- Implement logistic regression for binary classification
- Evaluate models using R-squared, MSE, residual plots, and classification accuracy
- Apply polynomial regression with PolynomialFeatures
- Plot regression lines and decision boundaries
Regression with scikit-learn
scikit-learn provides consistent APIs for all regression and classification models. The workflow is always: create model fit(X, y) predict(X_new) evaluate.
Step 1: Loading and Exploring Geoscience Data
Loading a Depth-Porosity Dataset
Before fitting any model, always explore the data. Check for missing values, data types, and basic statistics.
%%%python import numpy as np import pandas as pd import matplotlib.pyplot as plt
Simulated well-log data: depth vs porosity
df = pd.DataFrame({ "depth": [1000, 1100, 1200, 1350, 1400, 1550, 1600, 1750, 1800, 1950, 2000, 2150, 2200, 2350, 2400], "gamma_ray": [30, 35, 42, 50, 55, 60, 68, 72, 78, 85, 90, 95, 100, 108, 112], "density": [2.20, 2.22, 2.25, 2.28, 2.30, 2.33, 2.36, 2.40, 2.42, 2.45, 2.48, 2.50, 2.52, 2.55, 2.58], "porosity": [30, 28, 25, 23, 22, 19, 18, 15, 14, 12, 10, 9, 8, 6, 5] })
print(df.head()) print("\nShape:", df.shape) print("\nDescriptive statistics:\n", df.describe()) print("\nMissing values:\n", df.isnull().sum())
Quick scatter plot
plt.scatter(df["depth"], df["porosity"], color="teal") plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title("Porosity vs Depth — Raw Data") plt.show() %%%
Step 2: Train/Test Split
Splitting Data for Honest Evaluation
Always split data before fitting any model. This prevents data leakage and gives an honest estimate of model performance on unseen data.
%%%python from sklearn.model_selection import train_test_split
X = df[["depth"]].values # Feature matrix (must be 2D) y = df["porosity"].values # Target vector (1D)
80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 )
print(f"Training samples: {len(X_train)}") print(f"Test samples: {len(X_test)}") print(f"Train depths range: {X_train.min():.0f} - {X_train.max():.0f} m") print(f"Test depths range: {X_test.min():.0f} - {X_test.max():.0f} m") %%%
Step 3: Simple Linear Regression
Fit, Predict, Score, and Plot
Linear regression fits the model by minimizing the sum of squared residuals.
%%%python from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_squared_error, r2_score
Create and fit the model
model = LinearRegression() model.fit(X_train, y_train)
Model parameters
print(f"Slope (w): {model.coef_[0]:.4f} %/m") # porosity change per meter print(f"Intercept (b): {model.intercept_:.2f} %") # porosity at depth=0
Predictions on training and test sets
y_train_pred = model.predict(X_train) y_test_pred = model.predict(X_test)
Evaluation metrics
print(f"\nTrain R squared: {r2_score(y_train, y_train_pred):.4f}") print(f"Test R squared: {r2_score(y_test, y_test_pred):.4f}") print(f"Train MSE: {mean_squared_error(y_train, y_train_pred):.4f}") print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.4f}")
Plot the regression line
x_range = np.linspace(900, 2500, 200).reshape(-1, 1) y_range = model.predict(x_range)
plt.scatter(X_train, y_train, color="teal", label="Train", zorder=3) plt.scatter(X_test, y_test, color="orange", marker="s", label="Test", zorder=3) plt.plot(x_range, y_range, color="red", linewidth=2, label="Regression Line") plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title("Simple Linear Regression: Porosity vs Depth") plt.legend() plt.grid(True, alpha=0.3) plt.show() %%%
Step 4: Evaluating with Residual Plots
Residual Analysis
Residuals = . A good model has residuals randomly scattered around zero with no pattern.
%%%python
Compute residuals on test set
residuals = y_test - y_test_pred
plt.figure(figsize=(10, 4))
Residual plot
plt.subplot(1, 2, 1) plt.scatter(y_test_pred, residuals, color="steelblue", edgecolor="k") plt.axhline(y=0, color="red", linestyle="--") plt.xlabel("Predicted Porosity (%)") plt.ylabel("Residual (%)") plt.title("Residual Plot")
Histogram of residuals
plt.subplot(1, 2, 2) plt.hist(residuals, bins=6, color="steelblue", edgecolor="k") plt.xlabel("Residual (%)") plt.ylabel("Frequency") plt.title("Residual Distribution")
plt.tight_layout() plt.show()
print(f"Mean residual: {residuals.mean():.4f} (should be near 0)") print(f"Std residual: {residuals.std():.4f}") %%%
Step 5: Multiple Linear Regression
Using Multiple Well-Log Features
When we have multiple features (depth, gamma ray, density), the model becomes . The scikit-learn API is identical.
%%%python
Multiple features
X_multi = df[["depth", "gamma_ray", "density"]].values y_multi = df["porosity"].values
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split( X_multi, y_multi, test_size=0.2, random_state=42 )
model_multi = LinearRegression() model_multi.fit(X_train_m, y_train_m)
Show each coefficient with its feature name
feature_names = ["depth", "gamma_ray", "density"] for name, coef in zip(feature_names, model_multi.coef_): print(f" {name:12s}: {coef:+.4f}") print(f" intercept : {model_multi.intercept_:+.2f}")
print(f"\nTrain R squared: {model_multi.score(X_train_m, y_train_m):.4f}") print(f"Test R squared: {model_multi.score(X_test_m, y_test_m):.4f}")
Compare single vs multiple regression
print(f"\nSingle-feature R squared (test): {r2_score(y_test, y_test_pred):.4f}") print(f"Multi-feature R squared (test): {model_multi.score(X_test_m, y_test_m):.4f}") print("Multiple features often improve the model.") %%%
Step 6: Logistic Regression for Rock Classification
Binary Classification: Sandstone vs Shale
Logistic regression predicts probabilities using the sigmoid function:
%%%python from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score, classification_report
Binary classification: sandstone (1) vs shale (0)
Features: gamma ray and density
X_cls = np.array([ [30, 2.20], [35, 2.22], [40, 2.25], [45, 2.28], # sandstone [38, 2.24], [42, 2.26], # sandstone [80, 2.50], [85, 2.52], [90, 2.55], [95, 2.58], # shale [88, 2.54], [92, 2.56], # shale [50, 2.30], [75, 2.45] # borderline ]) y_cls = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0])
clf = LogisticRegression() clf.fit(X_cls, y_cls)
y_cls_pred = clf.predict(X_cls) print(f"Accuracy: {accuracy_score(y_cls, y_cls_pred):.2f}") print(classification_report(y_cls, y_cls_pred, target_names=["Shale", "Sandstone"]))
Predict probability for a new sample
new_sample = np.array([[60, 2.38]]) prob = clf.predict_proba(new_sample) print(f"\nNew sample [GR=60, RHOB=2.38]:") print(f" P(shale) = {prob[0][0]:.3f}") print(f" P(sandstone) = {prob[0][1]:.3f}") print(f" Prediction: {clf.predict(new_sample)[0]}") %%%
Step 7: Decision Boundary Visualization
Plotting the Logistic Regression Decision Boundary
The decision boundary is the line where , i.e., where .
%%%python
Create a mesh grid for the decision boundary
x_min, x_max = X_cls[:, 0].min() - 5, X_cls[:, 0].max() + 5 y_min, y_max = X_cls[:, 1].min() - 0.05, X_cls[:, 1].max() + 0.05 xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200)) Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8, 6)) plt.contourf(xx, yy, Z, alpha=0.3, cmap="RdYlBu") plt.scatter(X_cls[y_cls==1, 0], X_cls[y_cls==1, 1], c="blue", label="Sandstone", edgecolor="k", s=60) plt.scatter(X_cls[y_cls==0, 0], X_cls[y_cls==0, 1], c="red", label="Shale", edgecolor="k", s=60) plt.xlabel("Gamma Ray (API)") plt.ylabel("Density (g/cc)") plt.title("Logistic Regression Decision Boundary") plt.legend() plt.grid(True, alpha=0.2) plt.show() %%%
Step 8: Polynomial Regression
Capturing Nonlinear Trends with PolynomialFeatures
When the relationship is nonlinear (e.g., porosity decay with depth follows an exponential-like curve), polynomial regression can help. It creates new features and then fits a linear model to them.
%%%python from sklearn.preprocessing import PolynomialFeatures from sklearn.pipeline import make_pipeline
Compare degree 1, 2, and 3 polynomial fits
plt.figure(figsize=(10, 5)) plt.scatter(X_train, y_train, color="teal", label="Train", zorder=3) plt.scatter(X_test, y_test, color="orange", marker="s", label="Test", zorder=3)
colors = ["red", "green", "purple"] for degree, color in zip([1, 2, 3], colors): poly_model = make_pipeline( PolynomialFeatures(degree=degree, include_bias=False), LinearRegression() ) poly_model.fit(X_train, y_train) y_poly = poly_model.predict(x_range) r2_test = poly_model.score(X_test, y_test) plt.plot(x_range, y_poly, color=color, linewidth=2, label=f"Degree {degree} (R2={r2_test:.3f})")
plt.xlabel("Depth (m)") plt.ylabel("Porosity (%)") plt.title("Polynomial Regression: Comparing Degrees") plt.legend() plt.grid(True, alpha=0.3) plt.show() %%%
Detailed Polynomial Regression Walkthrough
%%%python
Degree-2 polynomial regression step-by-step
poly = PolynomialFeatures(degree=2, include_bias=False) X_train_poly = poly.fit_transform(X_train) X_test_poly = poly.transform(X_test)
print("Original features shape:", X_train.shape) # (n, 1) print("Polynomial features shape:", X_train_poly.shape) # (n, 2) print("Feature names:", poly.get_feature_names_out())
Fit linear regression on polynomial features
model_poly = LinearRegression() model_poly.fit(X_train_poly, y_train)
print(f"\nCoefficients: {model_poly.coef_}") print(f"Intercept: {model_poly.intercept_:.2f}") print(f"Train R2: {model_poly.score(X_train_poly, y_train):.4f}") print(f"Test R2: {model_poly.score(X_test_poly, y_test):.4f}") %%%
Key Methods Summary
| Method | Purpose |
|---|---|
.fit(X, y) | Train the model on data |
.predict(X) | Make predictions |
.score(X, y) | R-squared for regression, accuracy for classification |
.coef_ | Learned weights |
.intercept_ | Learned bias |
.predict_proba(X) | Class probabilities (logistic only) |
PolynomialFeatures | Create features for nonlinear regression |
[Library: scikit-learn]
References
- Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 4 (training linear & logistic models). O’Reilly.
- James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 3 (linear regression labs). Springer.