Linear and Logistic Regression

Chapter 4: Linear Models — Regression and Classification

Learning objectives

  • Derive simple linear regression and its cost function, and apply it
  • Extend to multiple linear regression and the normal equation
  • Evaluate models using R-squared, MSE, and MAE
  • Understand logistic regression, the sigmoid function, and cross-entropy loss
  • Apply regression to geoscience prediction problems

Regression: Predicting Continuous and Categorical Outcomes

Regression is one of the most fundamental tools in machine learning. Linear regression predicts a continuous value (e.g., porosity). Logistic regression predicts a probability of belonging to a class (e.g., sandstone vs. shale).

1. Simple Linear Regression

The Model

We seek a straight-line relationship between one input feature $x$ and one output $y$:

$$\hat{y} = wx + b$$

where $w$ is the weight (slope) and $b$ is the bias (intercept). Given $m$ training examples $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, we want to find $w$ and $b$ that minimise the prediction error.

Cost Function (Mean Squared Error)

The cost function measures how far our predictions are from the actual values:

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2$$

The factor of $\frac{1}{2}$ is a convenience that simplifies the derivative. Our goal is to minimise $J(w, b)$.
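As a minimal sketch of the cost function, the snippet below evaluates $J(w, b)$ on a small toy dataset. The data and the function name `mse_cost` are hypothetical and chosen only for illustration.

```python
import numpy as np

def mse_cost(w, b, x, y):
    """Half mean squared error J(w, b) for the model y_hat = w * x + b."""
    y_hat = w * x + b
    return np.sum((y_hat - y) ** 2) / (2 * len(x))

# Hypothetical toy data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

cost_exact = mse_cost(2.0, 1.0, x, y)  # parameters that generated the data
cost_off = mse_cost(1.0, 1.0, x, y)    # slope too small, so the cost is larger
print(cost_exact, cost_off)
```

With the slope set to 1 the errors are 0, -1, -2 and -3, so the cost is (0 + 1 + 4 + 9) / 8 = 1.75, while the generating parameters give a cost of zero.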

Gradient Descent

Gradient descent iteratively updates $w$ and $b$ in the direction of steepest descent:

$$w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$

where $\alpha$ is the learning rate. The partial derivatives are:

$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})$$
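The update rule and gradients above can be sketched in a few lines of NumPy. The learning rate, iteration count, and toy data are assumptions made for this illustration, not values prescribed by the chapter.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, n_iters=5000):
    """Fit y_hat = w * x + b by batch gradient descent on the half-MSE cost."""
    w, b = 0.0, 0.0
    for _ in range(n_iters):
        error = (w * x + b) - y      # y_hat - y for every training example
        dw = np.mean(error * x)      # dJ/dw
        db = np.mean(error)          # dJ/db
        w -= alpha * dw              # simultaneous update: both gradients
        b -= alpha * db              # were computed before either step
    return w, b

# Hypothetical toy data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w_fit, b_fit = gradient_descent(x, y)
print(w_fit, b_fit)
```

On this noise-free data the iterates approach the generating slope and intercept. A learning rate that is too large makes the updates diverge, so in practice it is worth monitoring the cost during training.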

2. Multiple Linear Regression

Extending to Multiple Features

When we have $n$ features $x_1, x_2, \ldots, x_n$, the model becomes:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T\mathbf{x} + b$$

In matrix notation, for all $m$ training examples:

$$\hat{\mathbf{y}} = X\mathbf{w} + b$$

where $X$ is the $m \times n$ feature matrix.

The Normal Equation

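For linear regression, there is a closed-form solution that gives the optimal weights directly (no iterations needed):

$$\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}$$

Here the bias $b$ is absorbed into $\mathbf{w}$ by appending a column of ones to $X$. This works when $X^T X$ is invertible, which requires the columns of $X$ to be linearly independent. Forming and solving the system costs on the order of $mn^2 + n^3$ operations, so for large datasets or many features gradient descent is usually more efficient computationally.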
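A minimal sketch of the normal equation in NumPy follows. It assumes the bias is handled by prepending a column of ones to the feature matrix, and the data are hypothetical. Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, and `np.linalg.lstsq` is shown as a more robust alternative when $X^T X$ is close to singular.

```python
import numpy as np

# Hypothetical data generated from known weights, so the fit can be checked.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
true_w = np.array([2.0, -1.0])
true_b = 0.5
y = X @ true_w + true_b

# Prepend a column of ones so the first coefficient acts as the bias b.
X_aug = np.column_stack([np.ones(len(X)), X])

# Normal equation: solve (X^T X) theta = X^T y rather than inverting X^T X.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
b_hat, w_hat = theta[0], theta[1:]

# Least-squares solver based on the SVD; gives the same answer here.
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(b_hat, w_hat)
```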

3. Evaluation Metrics for Regression

Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2$$

MSE penalises large errors heavily because each error is squared.

Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y^{(i)} - \hat{y}^{(i)}|$$

MAE is more robust to outliers than MSE because errors enter linearly rather than squared.

R-Squared (Coefficient of Determination)

$$R^2 = 1 - \frac{\sum_i (y^{(i)} - \hat{y}^{(i)})^2}{\sum_i (y^{(i)} - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}$$

$R^2 = 1$ means a perfect fit; $R^2 = 0$ means the model is no better than predicting the mean $\bar{y}$. On held-out data $R^2$ can be negative, which means the model does worse than predicting the mean. Values closer to 1 indicate that the model explains more of the variance in $y$.
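The three metrics can be computed directly from their definitions. The sketch below uses hypothetical values in which a single prediction is off by 2, to show how one large error affects MSE more than MAE.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions; only the last prediction is wrong.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 6.0])

mse_val = mse(y, y_hat)        # 4 / 4 = 1.0
mae_val = mae(y, y_hat)        # 2 / 4 = 0.5
r2_val = r_squared(y, y_hat)   # 1 - 4 / 5 = 0.2

# Predicting the mean everywhere gives R^2 = 0 by construction.
r2_mean = r_squared(y, np.full_like(y, y.mean()))
print(mse_val, mae_val, r2_val, r2_mean)
```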

4. Logistic Regression

The Sigmoid Function

Logistic regression is used for binary classification (two classes). Instead of predicting a continuous value, we predict the probability that an observation belongs to class 1:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^T\mathbf{x} + b$$

The sigmoid function maps any real number to the range $(0, 1)$. If $\hat{y} \geq 0.5$, predict class 1; otherwise, predict class 0.
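A short sketch of the sigmoid and the 0.5 threshold rule is given below. The weights and inputs are hypothetical, and `predict_class` is an illustrative helper name rather than part of any library.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real numbers into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(X, w, b, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Hypothetical weights and three observations with z = 2, -2 and 0.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

probs = sigmoid(X @ w + b)
classes = predict_class(X, w, b)
print(probs, classes)
```

The third observation has $z = 0$, so its probability is exactly 0.5 and the rule assigns it to class 1. For inputs of very large magnitude a numerically safer implementation (for example `scipy.special.expit`) may be preferable.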

Cross-Entropy Loss

The cost function for logistic regression is the binary cross-entropy:

$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right]$$

When $y = 1$, the loss is $-\log(\hat{y})$, which penalises predictions near 0 heavily. When $y = 0$, the loss is $-\log(1-\hat{y})$, which penalises predictions near 1.
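To make the asymmetry of the penalty concrete, the sketch below evaluates the binary cross-entropy for a confident correct prediction and a confident wrong one. Clipping the probabilities away from 0 and 1 is an assumption added here to avoid taking the logarithm of zero; the `eps` value is arbitrary.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy; y_hat is clipped to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0])  # true label is class 1

loss_confident_right = binary_cross_entropy(y, np.array([0.9]))  # -log(0.9)
loss_confident_wrong = binary_cross_entropy(y, np.array([0.1]))  # -log(0.1)
loss_uninformed = binary_cross_entropy(y, np.array([0.5]))       # log(2)
print(loss_confident_right, loss_uninformed, loss_confident_wrong)
```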

Decision Boundary

The decision boundary is the surface where $\hat{y} = 0.5$, i.e., where $z = \mathbf{w}^T\mathbf{x} + b = 0$. For two features, this is a straight line in the feature space.
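As a small worked case (the weights here are hypothetical, chosen only for illustration), with two features the boundary can be solved for $x_2$ whenever $w_2 \neq 0$:

$$w_1 x_1 + w_2 x_2 + b = 0 \quad\Longrightarrow\quad x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}$$

For $\mathbf{w} = (2, -1)$ and $b = 1$ this gives the line $x_2 = 2x_1 + 1$. Points with $z > 0$, that is $x_2 < 2x_1 + 1$, are assigned to class 1.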

Geoscience Applications

Linear regression: Predicting porosity from depth, estimating reservoir pressure from well-test data, estimating seismic velocity from the slope of travel time versus offset in refraction surveys.

Logistic regression: Classifying rock types (sandstone vs. shale) from well logs, predicting whether a well will be economic (yes/no), identifying fault presence from seismic attributes.

5. Polynomial Regression

Beyond Straight Lines

When the relationship between $x$ and $y$ is non-linear, we can extend linear regression by adding polynomial features:

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d$$

Here $w_0$ plays the role of the bias $b$. Although $\hat{y}$ is non-linear in $x$, the model is still linear in the parameters $w_0, \ldots, w_d$, so we can use the same least-squares machinery.

When to use: When scatter plots reveal curvature (e.g., porosity-permeability cross-plots often follow a power law). Start with $d = 2$ or $d = 3$ and increase cautiously.

Overfitting risk: High-degree polynomials can fit the training data very well but oscillate wildly between data points. A polynomial of degree $d = m - 1$ (where $m$ is the number of training points, with distinct $x$ values) passes through every point but typically generalises poorly. Always evaluate on a held-out test set.
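The sketch below illustrates both points under stated assumptions: synthetic data from a hypothetical quadratic with a little noise, polynomial features built with `np.vander`, and a least-squares fit via `np.linalg.lstsq`. With 8 training points, the degree-7 fit interpolates them, so its training error is essentially zero, which says nothing about how it behaves between the points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a quadratic trend plus small Gaussian noise.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0.0, 0.05, 8)

def fit_poly(x, y, degree):
    """Least-squares fit of w_0 + w_1 x + ... + w_d x^d."""
    V = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ...
    w, *_ = np.linalg.lstsq(V, y, rcond=None)
    return w

def predict_poly(w, x):
    return np.vander(x, len(w), increasing=True) @ w

w2 = fit_poly(x_train, y_train, 2)
w7 = fit_poly(x_train, y_train, 7)  # degree m - 1: passes through every point

train_mse_2 = np.mean((predict_poly(w2, x_train) - y_train) ** 2)
train_mse_7 = np.mean((predict_poly(w7, x_train) - y_train) ** 2)

# Held-out points between the training locations, scored against the
# noise-free trend, give a fairer comparison of the two fits.
x_test = np.linspace(0.03, 0.97, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2
test_mse_2 = np.mean((predict_poly(w2, x_test) - y_test) ** 2)
test_mse_7 = np.mean((predict_poly(w7, x_test) - y_test) ** 2)
print(train_mse_2, train_mse_7, test_mse_2, test_mse_7)
```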

6. Regularized Regression

Ridge Regression (L2 Regularization)

Ridge regression adds a penalty proportional to the squared magnitude of the weights to the cost function:

$$J_{\text{Ridge}} = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda \sum_{j=1}^{n} w_j^2$$

The hyperparameter $\lambda > 0$ controls the strength of regularization. Larger $\lambda$ shrinks weights toward zero, producing a simpler model. Ridge regression reduces overfitting and handles multicollinearity.
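Ridge regression also has a closed-form solution. The sketch below assumes the cost exactly as written above, with the bias left unpenalised; centring the data removes the bias from the problem, and setting the gradient to zero then gives $(X_c^T X_c + 2m\lambda I)\,\mathbf{w} = X_c^T \mathbf{y}_c$. The factor $2m$ follows from this particular scaling of the cost, and other texts and libraries scale $\lambda$ differently. The data and the name `ridge_fit` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimiser of (1/2m)||Xw + b - y||^2 + lam * ||w||^2."""
    m, n = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # centring handles the bias
    A = Xc.T @ Xc + 2.0 * m * lam * np.eye(n)  # invertible whenever lam > 0
    w = np.linalg.solve(A, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Hypothetical noise-free data generated from known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([2.0, -1.0]) + 0.5

w0, b0 = ridge_fit(X, y, lam=0.0)  # no penalty: ordinary least squares
w1, b1 = ridge_fit(X, y, lam=1.0)  # penalised: weights shrink toward zero
print(w0, np.linalg.norm(w0), np.linalg.norm(w1))
```

Because the added term $2m\lambda I$ makes the matrix invertible for any $\lambda > 0$, the solve succeeds even when feature columns are strongly correlated.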

Lasso Regression (L1 Regularization)

Lasso uses the sum of absolute weights as the penalty:

$$J_{\text{Lasso}} = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |w_j|$$

A key advantage of Lasso over Ridge: Lasso can drive some weights to exactly zero, effectively performing feature selection. If you have 50 well-log features but only 5 are truly relevant, Lasso with a suitably chosen $\lambda$ tends to zero out most of the irrelevant ones.

Elastic Net combines L1 and L2: $J + \lambda_1 \sum_j |w_j| + \lambda_2 \sum_j w_j^2$, where $J$ is the unregularized cost. It offers sparsity from the L1 term together with the stability of the L2 term.
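The Lasso has no closed-form solution in general. One simple solver, used here purely for illustration and not named in the chapter, is proximal gradient descent (ISTA): take a gradient step on the squared-error term, then apply soft-thresholding, which is what sets small weights to exactly zero. The design matrix below is a hypothetical one with mutually orthogonal columns, chosen so that the answer can be checked by hand: each least-squares weight is simply shrunk toward zero by $\lambda$.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries within [-t, t] become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=200):
    """Minimise (1/2m)||Xw - y||^2 + lam * ||w||_1 (no bias term) with ISTA."""
    m, n = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / m.
    alpha = 1.0 / np.linalg.eigvalsh(X.T @ X / m).max()
    w = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m                       # squared-error gradient
        w = soft_threshold(w - alpha * grad, alpha * lam)  # proximal step
    return w

# Hypothetical 8 x 4 design with orthogonal +/-1 columns, so X^T X / m = I.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)
X = H8[:, 1:5]                               # drop the all-ones column
w_true = np.array([3.0, 0.1, -2.0, -0.05])   # two strong, two weak features
y = X @ w_true

w_lasso = lasso_ista(X, y, lam=0.5)
print(w_lasso)
```

With $\lambda = 0.5$ the two strong weights shrink to 2.5 and -1.5, while the two weak ones fall inside the threshold and become exactly zero. For real data with correlated features, a tested implementation such as scikit-learn's `Lasso` is the safer choice; note that its penalty scaling should be checked against the cost used here.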

7. Multi-Class Logistic Regression (Softmax)

Extending to More Than Two Classes

Binary logistic regression handles two classes. For $K > 2$ classes, we use the softmax function:

$$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad z_k = \mathbf{w}_k^T \mathbf{x} + b_k$$

Each class $k$ has its own weight vector $\mathbf{w}_k$ and bias $b_k$. The softmax ensures all probabilities sum to 1. The predicted class is $\hat{y} = \arg\max_k P(y = k \mid \mathbf{x})$.

The loss function generalizes to categorical cross-entropy:

$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)})$$

In geoscience, softmax regression classifies well-log data into multiple lithofacies (sandstone, shale, limestone, dolomite, etc.) simultaneously.
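A minimal sketch of the softmax and the categorical cross-entropy follows. Subtracting the row maximum before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged; the scores and one-hot labels are hypothetical.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of an (m, K) array of class scores z_k."""
    Z_shift = Z - Z.max(axis=1, keepdims=True)  # stability; result unchanged
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=1, keepdims=True)

def categorical_cross_entropy(Y_onehot, P, eps=1e-12):
    """Mean over examples of -sum_k y_k * log(p_k)."""
    return -np.mean(np.sum(Y_onehot * np.log(np.clip(P, eps, 1.0)), axis=1))

# Hypothetical scores for two observations and K = 3 classes.
Z = np.array([[2.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
Y = np.array([[1.0, 0.0, 0.0],   # first observation belongs to class 0
              [0.0, 0.0, 1.0]])  # second observation belongs to class 2

P = softmax(Z)
predicted = P.argmax(axis=1)
loss = categorical_cross_entropy(Y, P)
print(P, predicted, loss)
```

Equal scores, as in the second row, give a uniform distribution of 1/3 per class, so that observation contributes $\log 3$ to the loss regardless of its true label.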

8. Classification Evaluation Metrics

Beyond Accuracy: Confusion Matrix and Derived Metrics

Accuracy alone is misleading when classes are imbalanced (e.g., 95% shale, 5% sandstone). A model predicting "shale always" gets 95% accuracy but is useless.

Confusion Matrix: A $K \times K$ table where entry $(i, j)$ counts samples with true class $i$ predicted as class $j$.

Precision (of predicted positives, how many are correct): $\text{Precision} = \frac{TP}{TP + FP}$

Recall (of actual positives, how many are found): $\text{Recall} = \frac{TP}{TP + FN}$

F1-Score (harmonic mean balancing precision and recall): $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

In geoscience, recall is often critical: missing a sandstone reservoir (false negative) is more costly than a false alarm. Choose the metric that aligns with the geological decision being made.
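The imbalanced example above (95% shale, 5% sandstone) can be worked through in code. The sketch below treats sandstone as the positive class and compares an "always shale" predictor with a hypothetical one that finds 4 of the 5 sandstones at the cost of 10 false alarms. Returning 0 when a denominator is zero is a convention assumed here, not something the definitions dictate.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry (i, j) counts samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall_f1(cm, positive=1):
    tp = cm[positive, positive]
    fp = cm[:, positive].sum() - tp   # predicted positive, actually negative
    fn = cm[positive, :].sum() - tp   # actually positive, predicted negative
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# 0 = shale (95 samples), 1 = sandstone (5 samples).
y_true = np.array([0] * 95 + [1] * 5)

# Predictor A: always predicts shale.
cm_a = confusion_matrix(y_true, np.zeros(100, dtype=int), 2)
acc_a = np.trace(cm_a) / cm_a.sum()
prec_a, rec_a, f1_a = precision_recall_f1(cm_a)

# Predictor B (hypothetical): 10 false alarms, misses one sandstone.
y_pred_b = y_true.copy()
y_pred_b[:10] = 1
y_pred_b[95] = 0
cm_b = confusion_matrix(y_true, y_pred_b, 2)
acc_b = np.trace(cm_b) / cm_b.sum()
prec_b, rec_b, f1_b = precision_recall_f1(cm_b)
print(acc_a, rec_a, acc_b, rec_b, f1_b)
```

Predictor A reaches 95% accuracy with zero recall, while predictor B has lower accuracy (89%) but recovers 80% of the sandstones, which is the trade-off the text describes.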

Geoscience Applications (Extended)

Reservoir property prediction: Multiple linear regression predicts porosity, permeability, or water saturation from suites of well-log curves (GR, RHOB, NPHI, RT). Polynomial terms capture non-linear responses.

Lithology classification: Softmax logistic regression classifies well-log data into 5-10 lithofacies. Feature engineering (e.g., GR/RHOB ratio) and regularization improve results.

Feature selection for reservoir models: Lasso regression identifies which seismic attributes are most predictive of reservoir thickness, automatically zeroing out uninformative attributes.

[Refs: Bishop, Pattern Recognition and ML; Hastie et al., Elements of Statistical Learning]

References

  • Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 3 & 4 (linear regression, linear classifiers). Springer.
  • James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 3 & 4 (linear & logistic regression). Springer.
  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 3 & 4 (linear models for regression and classification). Springer.
  • Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 11 (linear regression) & ch. 10 (logistic regression). MIT Press.
