Linear and Logistic Regression

Chapter 4: Linear Models — Regression and Classification

Learning objectives

Derive and apply simple linear regression and its cost function
Extend to multiple linear regression and the normal equation
Evaluate models using R-squared, MSE, and MAE
Understand logistic regression, the sigmoid function, and cross-entropy loss
Apply regression to geoscience prediction problems

Regression: Predicting Continuous and Categorical Outcomes

Regression is one of the most fundamental tools in machine learning. Linear regression predicts a continuous value (e.g., porosity). Logistic regression predicts a probability of belonging to a class (e.g., sandstone vs. shale).

1. Simple Linear Regression

The Model

We seek a straight-line relationship between one input feature $x$ and one output $y$ :

\hat{y} = wx + b

where $w$ is the weight (slope) and $b$ is the bias (intercept). Given $m$ training examples $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, we want to find $w$ and $b$ that minimise the prediction error.

Cost Function (Mean Squared Error)

The cost function measures how far our predictions are from the actual values:

J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(wx^{(i)} + b - y^{(i)}\right)^2

The factor of $\frac{1}{2}$ is a convenience that simplifies the derivative. Our goal is to minimise $J(w, b)$ .

Gradient Descent

Gradient descent iteratively updates $w$ and $b$ in the direction of steepest descent:

w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}

where $\alpha$ is the learning rate. The partial derivatives are:

\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})
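The update rule and partial derivatives above can be sketched in a few lines of NumPy. This is a minimal illustration, not a tuned implementation: the function name `gradient_descent`, the learning rate, the iteration count, and the noise-free synthetic data are all assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=2000):
    """Fit y_hat = w*x + b by batch gradient descent on J(w, b)."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(n_iters):
        error = w * x + b - y            # y_hat - y for every example
        grad_w = (error * x).sum() / m   # dJ/dw
        grad_b = error.sum() / m         # dJ/db
        w -= alpha * grad_w              # simultaneous update of w and b
        b -= alpha * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b = gradient_descent(x, y)
print(w, b)  # should approach the generating slope and intercept
```

If $\alpha$ is too large the iterates diverge; if it is too small, convergence is slow. Rescaling $x$ to a comparable range usually makes the choice of $\alpha$ less delicate.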

2. Multiple Linear Regression

Extending to Multiple Features

When we have $n$ features $x_1, x_2, \ldots, x_n$ , the model becomes:

\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T\mathbf{x} + b

In matrix notation, for all $m$ training examples:

\hat{\mathbf{y}} = X\mathbf{w} + b

where $X$ is the $m \times n$ feature matrix (one row per training example) and the scalar $b$ is added to every entry of $X\mathbf{w}$.

The Normal Equation

For linear regression, there is a closed-form solution that gives the optimal weights directly (no iterations needed):

\mathbf{w} = (X^T X)^{-1} X^T \mathbf{y}

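This works when $X^T X$ is invertible. To include the bias $b$, prepend a column of ones to $X$ so that $b$ becomes the first entry of $\mathbf{w}$. Solving the system costs roughly $O(n^3)$ in the number of features, so gradient descent is usually cheaper when $n$ is very large or the data do not fit in memory.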
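A minimal NumPy sketch of the normal equation follows. It assumes the bias is handled by prepending a column of ones to $X$, and it solves the linear system with `np.linalg.solve` instead of forming the inverse explicitly. The function name `fit_normal_equation` and the small noise-free dataset are hypothetical.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Return (w, b) minimising squared error via the normal equation."""
    m = X.shape[0]
    X_aug = np.hstack([np.ones((m, 1)), X])   # column of ones carries the bias
    # Solve (X^T X) theta = X^T y; cheaper and more stable than inverting
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return theta[1:], theta[0]

# Illustrative data generated from y = 3*x1 - 2*x2 + 0.5 with no noise
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0], [0.0, 1.0]])
y = X @ np.array([3.0, -2.0]) + 0.5
w, b = fit_normal_equation(X, y)
print(w, b)
```

When features are nearly collinear, $X^T X$ is close to singular and `np.linalg.lstsq` is generally a safer choice than solving the normal equation directly.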

3. Evaluation Metrics for Regression

Mean Squared Error (MSE)

\text{MSE} = \frac{1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2

Penalises large errors heavily due to squaring.

Mean Absolute Error (MAE)

\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}|y^{(i)} - \hat{y}^{(i)}|

More robust to outliers than MSE.

R-Squared (Coefficient of Determination)

R^2 = 1 - \frac{\sum(y^{(i)} - \hat{y}^{(i)})^2}{\sum(y^{(i)} - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}

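$R^2 = 1$ means a perfect fit; $R^2 = 0$ means the model is no better than always predicting the mean $\bar{y}$. On held-out data $R^2$ can be negative, which means the model does worse than the mean. Values near 1 indicate that the model explains most of the variance in $y$.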
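The three metrics can be computed directly from their definitions. The sketch below uses a hypothetical helper `regression_metrics` and four made-up porosity values, purely for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) computed from the definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot   # undefined if y_true is constant (ss_tot = 0)
    return mse, mae, r2

# Made-up porosity fractions, not real measurements
y_true = [0.10, 0.20, 0.30, 0.40]
y_pred = [0.12, 0.18, 0.33, 0.37]
mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)
```

MSE is in squared units of $y$ while MAE is in the same units as $y$, which is often easier to communicate.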

4. Logistic Regression

The Sigmoid Function

Logistic regression is used for binary classification (two classes). Instead of predicting a continuous value, we predict the probability that an observation belongs to class 1:

\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^T\mathbf{x} + b

The sigmoid function maps any real number to the range $(0, 1)$ . If $\hat{y} \geq 0.5$ , predict class 1; otherwise, predict class 0.
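As a small illustration of the sigmoid and the 0.5 threshold, the sketch below scores three samples. The weights, bias, and inputs are hypothetical values chosen so that $z$ is 2, -2, and 0; `predict_class` is an illustrative helper, not a library function.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(X, w, b, threshold=0.5):
    """Return class-1 probabilities and hard 0/1 labels."""
    prob = sigmoid(X @ w + b)
    return prob, (prob >= threshold).astype(int)

# Hypothetical weights and inputs giving z = 2, -2, 0
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
prob, label = predict_class(X, w, b)
print(prob, label)   # sigmoid(0) is exactly 0.5, so the tie goes to class 1
```

For very negative $z$, `np.exp(-z)` can overflow and emit a warning; production code typically uses a numerically stable variant.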

Cross-Entropy Loss

The cost function for logistic regression is the binary cross-entropy:

J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})\right]

When $y = 1$ , the loss is $-\log(\hat{y})$ , which penalises predictions near 0 heavily. When $y = 0$ , the loss is $-\log(1-\hat{y})$ , which penalises predictions near 1.
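To make the penalty concrete, the sketch below evaluates the binary cross-entropy for confident predictions that are right and for confident predictions that are wrong. The clipping constant `eps` is an assumption added to avoid $\log(0)$; it is not part of the formula above.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps guards against log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))

y_true = [1, 0, 1, 0]
confident_right = binary_cross_entropy(y_true, [0.9, 0.1, 0.9, 0.1])
confident_wrong = binary_cross_entropy(y_true, [0.1, 0.9, 0.1, 0.9])
print(confident_right, confident_wrong)   # -log(0.9) versus -log(0.1)
```

The loss for confidently wrong predictions is more than twenty times larger here, and it grows without bound as the predicted probability of the true class approaches 0.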

Decision Boundary

The decision boundary is the surface where $\hat{y} = 0.5$, i.e., where $z = \mathbf{w}^T\mathbf{x} + b = 0$. For two features, this is a straight line in the feature space; for $n$ features it is a hyperplane.
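As a worked step for the two-feature case, setting $z = w_1 x_1 + w_2 x_2 + b = 0$ and solving for $x_2$ (assuming $w_2 \neq 0$) gives the boundary line explicitly:

x_2 = -\frac{w_1}{w_2}x_1 - \frac{b}{w_2}

Its slope is $-w_1/w_2$ and its intercept is $-b/w_2$. The weight vector $\mathbf{w}$ is perpendicular to this line and points toward the side where class 1 is predicted.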

Geoscience Applications

Linear regression: Predicting porosity from depth, estimating reservoir pressure from well-test data, estimating layer velocity from the slope of travel time against offset in seismic refraction surveys.

Logistic regression: Classifying rock types (sandstone vs. shale) from well logs, predicting whether a well will be economic (yes/no), identifying fault presence from seismic attributes.

5. Polynomial Regression

Beyond Straight Lines

When the relationship between $x$ and $y$ is non-linear, we can extend linear regression by adding polynomial features:

\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d

Despite the non-linear relationship with $x$ , this is still "linear" in the parameters $w_0, \ldots, w_d$ , so we can use the same least-squares machinery.

When to use: When scatter plots reveal curvature (e.g., porosity-permeability cross-plots often follow a power law). Start with $d = 2$ or $d = 3$ and increase cautiously.

Overfitting risk: High-degree polynomials can fit the training data very well yet oscillate wildly between data points. A polynomial of degree $d = m - 1$ (where $m$ is the number of training points, with distinct $x$ values) passes through every point exactly, noise included, and typically generalises poorly. Always evaluate on a held-out test set.
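The effect of the degree can be explored with a short experiment. The sketch below is only illustrative: it assumes a synthetic noisy sine curve as the "true" relationship, eight training points, and NumPy's `np.polyfit` and `np.polyval` for the least-squares fit. With $m = 8$ points, degree 7 is the interpolating case $d = m - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a sine curve
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, x_train.size)
x_test = np.linspace(0.05, 0.95, 50)     # held-out points between the samples
y_test = np.sin(2 * np.pi * x_test)

def train_test_mse(degree):
    """Least-squares polynomial fit; return (training MSE, held-out MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in (1, 3, 7):
    print(d, train_test_mse(d))
```

The training error can only decrease as the degree grows and is essentially zero at degree 7, so it says nothing about generalisation; the held-out error is the number to compare across degrees.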

6. Regularized Regression

Ridge Regression (L2 Regularization)

Ridge regression adds a penalty proportional to the squared magnitude of the weights to the cost function:

J_{\text{Ridge}} = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda \sum_{j=1}^{n} w_j^2

The hyperparameter $\lambda > 0$ controls the strength of regularization. Larger $\lambda$ shrinks weights toward zero, producing a simpler model. Ridge regression reduces overfitting and handles multicollinearity.
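For the Ridge cost as written above, setting the gradient to zero gives a closed-form solution, $(X_c^T X_c + 2m\lambda I)\mathbf{w} = X_c^T \mathbf{y}_c$, where $X_c$ and $\mathbf{y}_c$ are the mean-centred features and targets. The factor $2m$ follows from the $\frac{1}{2m}$ scaling of the data term; other texts and libraries scale the penalty differently. The sketch below assumes the bias is left unpenalised, and the name `fit_ridge` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form Ridge for J = (1/2m)*sum(err^2) + lam*sum(w_j^2)."""
    m, n = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring removes the bias term
    w = np.linalg.solve(Xc.T @ Xc + 2.0 * m * lam * np.eye(n), Xc.T @ yc)
    b = y_mean - x_mean @ w                  # recover the unpenalised bias
    return w, b

# Synthetic data for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=30)

w_weak, b_weak = fit_ridge(X, y, lam=0.01)
w_strong, b_strong = fit_ridge(X, y, lam=10.0)
print(w_weak, w_strong)   # the larger lam gives the smaller weight norm
```

Because the penalty depends on the scale of each feature, features are usually standardised before fitting a regularized model.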

Lasso Regression (L1 Regularization)

Lasso uses the sum of absolute weights as the penalty:

J_{\text{Lasso}} = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |w_j|

A key advantage of Lasso over Ridge: Lasso can drive some weights to exactly zero, effectively performing feature selection. If you have 50 well-log features but only 5 are truly relevant, a suitably tuned Lasso tends to zero out most of the irrelevant ones, although among strongly correlated features it may keep one somewhat arbitrarily.
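One common way to minimise the Lasso cost is coordinate descent with soft-thresholding; the text above does not prescribe a solver, so this choice is an assumption. Minimising $J_{\text{Lasso}}$ over a single weight $w_j$ with the others fixed gives $w_j = S(\rho_j, \lambda) / z_j$, where $\rho_j$ is the mean product of feature $j$ with the residual that excludes it, $z_j$ is the mean square of feature $j$, and $S$ shrinks its argument toward zero by $\lambda$. The function names and the synthetic data (six features, two relevant) are hypothetical.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def fit_lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent for J = (1/2m)*sum(err^2) + lam*sum(|w_j|)."""
    m, n = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring handles the bias
    w = np.zeros(n)
    for _ in range(n_sweeps):
        for j in range(n):
            # Residual with feature j's current contribution removed
            partial_residual = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ partial_residual / m
            z = Xc[:, j] @ Xc[:, j] / m      # assumes feature j is not constant
            w[j] = soft_threshold(rho, lam) / z
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data: only features 0 and 3 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
w, b = fit_lasso(X, y, lam=0.2)
print(w)   # weights of the four irrelevant features are exactly zero
```

The surviving weights are shrunk toward zero relative to the generating values of 3 and -2, which is the price paid for sparsity. A fixed number of sweeps is used here for brevity; a practical solver would stop on a convergence test.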

Elastic Net combines L1 and L2: $J + \lambda_1 \sum|w_j| + \lambda_2 \sum w_j^2$, where $J$ is the unregularized squared-error cost from Section 1. It aims to combine the sparsity of L1 with the stability of L2 when features are correlated.

7. Multi-Class Logistic Regression (Softmax)

Extending to More Than Two Classes

Binary logistic regression handles two classes. For $K > 2$ classes, we use the softmax function:

P(y = k | \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad z_k = \mathbf{w}_k^T \mathbf{x} + b_k

Each class $k$ has its own weight vector $\mathbf{w}_k$ and bias $b_k$ . The softmax ensures all probabilities sum to 1. The predicted class is $\hat{y} = \arg\max_k P(y = k | \mathbf{x})$ .

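The loss function generalizes to categorical cross-entropy, where $y_k^{(i)}$ is 1 if example $i$ belongs to class $k$ and 0 otherwise (one-hot encoding) and $\hat{y}_k^{(i)} = P(y = k | \mathbf{x}^{(i)})$: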

J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)})

In geoscience, softmax regression classifies well-log data into multiple lithofacies (sandstone, shale, limestone, dolomite, etc.) simultaneously.
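The softmax and categorical cross-entropy formulas above can be sketched as follows. The scores `Z` are hypothetical values of $z_k$ for three samples and $K = 3$ classes; subtracting the row maximum before exponentiating is a standard numerical-stability step that does not change the result, and the clipping constant `eps` is an added assumption to avoid $\log(0)$.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; each row of Z holds the K scores of one sample."""
    Z = Z - Z.max(axis=1, keepdims=True)   # stability shift, result unchanged
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def categorical_cross_entropy(Y_onehot, P, eps=1e-12):
    """Mean of -sum_k y_k * log(p_k) over samples."""
    return -np.mean(np.sum(Y_onehot * np.log(np.clip(P, eps, 1.0)), axis=1))

# Hypothetical scores for 3 samples and K = 3 classes
Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3],
              [0.0, 0.0, 0.0]])
P = softmax(Z)
y = np.array([0, 1, 2])          # true class indices
Y_onehot = np.eye(3)[y]          # one-hot encoding of y
loss = categorical_cross_entropy(Y_onehot, P)
pred = P.argmax(axis=1)
print(P.sum(axis=1), loss, pred)
```

The third sample has equal scores, so its probabilities are all $1/3$ and its contribution to the loss is $\log 3$, the value for an uninformed guess among three classes.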

8. Classification Evaluation Metrics

Beyond Accuracy: Confusion Matrix and Derived Metrics

Accuracy alone is misleading when classes are imbalanced (e.g., 95% shale, 5% sandstone). A model that always predicts "shale" reaches 95% accuracy but never finds any sandstone.

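Confusion Matrix: A $K \times K$ table where entry $(i, j)$ counts samples with true class $i$ predicted as class $j$. For a chosen positive class, it gives the counts of true positives (TP), false positives (FP), and false negatives (FN) used below.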

Precision (of predicted positives, how many are correct): $\text{Precision} = \frac{TP}{TP + FP}$

Recall (of actual positives, how many are found): $\text{Recall} = \frac{TP}{TP + FN}$

F1-Score (harmonic mean balancing precision and recall): $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

In geoscience, recall is often critical: missing a sandstone reservoir (false negative) is more costly than a false alarm. Choose the metric that aligns with the geological decision being made.
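The metrics above can be read off a confusion matrix in a few lines. In this sketch the labels are made up (0 = shale, 1 = sandstone), sandstone is treated as the positive class, and returning 0.0 when a denominator is zero is a convention assumed here, not something the definitions dictate.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Entry (i, j) counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def precision_recall_f1(cm, positive=1):
    """Precision, recall and F1 for one class treated as positive."""
    tp = cm[positive, positive]
    fp = cm[:, positive].sum() - tp   # predicted positive, actually negative
    fn = cm[positive, :].sum() - tp   # actually positive, predicted negative
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Made-up labels: 0 = shale, 1 = sandstone
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
precision, recall, f1 = precision_recall_f1(cm, positive=1)
accuracy = np.trace(cm) / cm.sum()
print(cm, accuracy, precision, recall, f1)
```

Here accuracy is 0.7 while recall for sandstone is only 0.5: half of the sandstone samples were missed, which accuracy alone does not reveal.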

Geoscience Applications (Extended)

Reservoir property prediction: Multiple linear regression predicts porosity, permeability, or water saturation from suites of well-log curves (GR, RHOB, NPHI, RT). Polynomial terms capture non-linear responses.

Lithology classification: Softmax logistic regression classifies well-log data into 5-10 lithofacies. Feature engineering (e.g., GR/RHOB ratio) and regularization improve results.

Feature selection for reservoir models: Lasso regression identifies which seismic attributes are most predictive of reservoir thickness, automatically zeroing out uninformative attributes.

[Refs: Bishop, Pattern Recognition and ML; Hastie et al., Elements of Statistical Learning]

References

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 3 & 4 (linear regression, linear classifiers). Springer.
James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 3 & 4 (linear & logistic regression). Springer.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 3 & 4 (linear models for regression and classification). Springer.
Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 11 (linear regression) & ch. 10 (logistic regression). MIT Press.