Linear and Logistic Regression
Learning objectives
- Derive and apply simple linear regression with the cost function
- Extend to multiple linear regression and the normal equation
- Evaluate models using R-squared, MSE, and MAE
- Understand logistic regression, the sigmoid function, and cross-entropy loss
- Apply regression to geoscience prediction problems
Regression: Predicting Continuous and Categorical Outcomes
Regression is one of the most fundamental tools in machine learning. Linear regression predicts a continuous value (e.g., porosity). Logistic regression predicts a probability of belonging to a class (e.g., sandstone vs. shale).
1. Simple Linear Regression
The Model
We seek a straight-line relationship between one input feature $x$ and one output $y$:
$$\hat{y} = wx + b$$
where $w$ is the weight (slope) and $b$ is the bias (intercept). Given $m$ training examples $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, we want to find the $w$ and $b$ that minimise the prediction error.
Cost Function (Mean Squared Error)
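The cost function measures how far our predictions $\hat{y}^{(i)} = wx^{(i)} + b$ are from the actual values $y^{(i)}$:
$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
The factor of $\frac{1}{2}$ is a convenience that cancels the 2 produced by differentiating the square, so $J$ is half the mean squared error. Our goal is to minimise $J(w, b)$.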
Gradient Descent
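Gradient descent iteratively updates $w$ and $b$ in the direction of steepest descent, i.e., along the negative gradient of the cost $J(w, b)$:
$$w \leftarrow w - \alpha\,\frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}$$
where $\alpha$ is the learning rate. The partial derivatives are:
$$\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)$$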
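The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `fit_simple_linear`, the learning rate `alpha=0.05`, the iteration count, and the synthetic data are all assumptions chosen so that the loop converges on this small example.

```python
def fit_simple_linear(xs, ys, alpha=0.05, n_iters=5000):
    """Fit y_hat = w*x + b by batch gradient descent on
    J = 1/(2m) * sum((y_hat - y)^2). Hypothetical helper for illustration."""
    m = len(xs)
    w, b = 0.0, 0.0
    for _ in range(n_iters):
        # Residuals (y_hat - y) under the current parameters
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = sum(errors) / m
        # Both parameters are updated from the same residuals
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Synthetic, noise-free points on the line y = 2x + 1 (illustrative values only)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_simple_linear(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

If the learning rate is set too high for the scale of the feature, the updates overshoot and the cost grows instead of shrinking, which is why features are often rescaled before gradient descent.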
2. Multiple Linear Regression
Extending to Multiple Features
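When we have $n$ features $x_1, x_2, \dots, x_n$, the model becomes:
$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$
In matrix notation, for all $m$ training examples:
$$\hat{\mathbf{y}} = X\mathbf{w}$$
where $X$ is the $m \times (n+1)$ feature matrix whose first column is all ones, so that the bias $b$ is absorbed into $\mathbf{w}$ as its first entry.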
The Normal Equation
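For linear regression, there is a closed-form solution that gives the optimal weights directly (no iterations needed):
$$\mathbf{w} = \left(X^\top X\right)^{-1} X^\top \mathbf{y}$$
This works when $X^\top X$ is invertible, which requires the columns of $X$ to be linearly independent. When the number of features $n$ is large, gradient descent is usually cheaper, because forming and inverting $X^\top X$ costs on the order of $O(mn^2 + n^3)$ operations.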
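As a rough sketch of the closed-form solution, the code below builds $X^\top X$ and $X^\top \mathbf{y}$ and solves the resulting linear system instead of forming an explicit inverse. The helper names `solve_linear_system` and `normal_equation` are hypothetical, the data are synthetic, and the design matrix is assumed to already contain a leading column of ones for the bias.

```python
def solve_linear_system(A, rhs):
    """Solve A w = rhs by Gaussian elimination with partial pivoting.
    Hypothetical helper; a numerical library would normally do this."""
    n = len(A)
    # Augmented matrix [A | rhs], copied so the inputs are not modified
    M = [list(A[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def normal_equation(X, y):
    """Return w satisfying (X^T X) w = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve_linear_system(XtX, Xty)


# Synthetic data generated from y = 1 + 2*x1 - 3*x2 (illustrative values only)
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
X = [[1.0, x1, x2] for x1, x2 in features]  # leading 1.0 carries the bias
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in features]
w = normal_equation(X, y)
print(w)
```

In practice a least-squares routine from a numerical library (for example NumPy's `np.linalg.lstsq`) is usually preferred, since it also copes with rank-deficient feature matrices.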
3. Evaluation Metrics for Regression
Mean Squared Error (MSE)
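$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
Penalises large errors heavily because each error is squared. Its units are the square of the target's units.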
Mean Absolute Error (MAE)
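$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$$
More robust to outliers than MSE, because each error contributes linearly rather than quadratically. It is expressed in the same units as the target.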
R-Squared (Coefficient of Determination)
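$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_{i=1}^{m}\left(y^{(i)} - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of the observed values. $R^2 = 1$ means a perfect fit; $R^2 = 0$ means the model is no better than predicting the mean, and it can be negative when the model does worse than that baseline. Values near 1 indicate that the model explains most of the variance in the target.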
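The three metrics can be computed directly from their definitions. The sketch below uses made-up porosity fractions purely for illustration; the function names are assumptions, not part of any particular library.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured vs. predicted porosity (fractions)
y_true = [0.10, 0.20, 0.30, 0.40]
y_pred = [0.12, 0.18, 0.33, 0.37]

print(mse(y_true, y_pred), mae(y_true, y_pred), r_squared(y_true, y_pred))
```

Note that `r_squared` divides by the total sum of squares, so it is undefined when every observed value is identical.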
4. Logistic Regression
The Sigmoid Function
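Logistic regression is used for binary classification (two classes). Instead of predicting a continuous value, we predict the probability that an observation belongs to class 1:
$$\hat{y} = P(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^\top\mathbf{x} + b$$
The sigmoid function $\sigma$ maps any real number to the range $(0, 1)$. If $\hat{y} \geq 0.5$, predict class 1; otherwise, predict class 0.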
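A minimal sketch of the sigmoid and the 0.5 threshold rule follows. The weights and the input vector are hypothetical values, and the two-branch form of `sigmoid` is one common way to avoid overflow for large negative inputs, not the only one.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """Probability of class 1 for feature vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)


def predict_class(w, b, x, threshold=0.5):
    """Class label obtained by thresholding the probability."""
    return 1 if predict_proba(w, b, x) >= threshold else 0


# Hypothetical weights for two standardised well-log features
w, b = [1.5, -2.0], 0.25
x = [0.8, 0.1]
print(predict_proba(w, b, x), predict_class(w, b, x))
```

The threshold is exposed as a parameter because 0.5 is only a default; a lower value trades precision for recall.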
Cross-Entropy Loss
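The cost function for logistic regression is the binary cross-entropy, where $\hat{y}^{(i)}$ is the predicted probability of class 1 for example $i$:
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - \hat{y}^{(i)}\right)\right]$$
When $y = 1$, the loss is $-\log(\hat{y})$, which penalises predictions near 0 heavily. When $y = 0$, the loss is $-\log(1 - \hat{y})$, which penalises predictions near 1.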
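The behaviour of the loss is easy to check numerically. In this sketch the clipping constant `eps` is an assumption added to keep `log` finite when a predicted probability is exactly 0 or 1; the labels and probabilities are made up.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]

print(binary_cross_entropy(labels, confident_right))
print(binary_cross_entropy(labels, confident_wrong))
```

The second value is much larger than the first, reflecting how strongly the logarithm penalises confident mistakes.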
Decision Boundary
The decision boundary is the surface where $\hat{y} = 0.5$, i.e., where $z = \mathbf{w}^\top\mathbf{x} + b = 0$. For two features, this is a straight line in the feature space.
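As a short worked step, assume two features $x_1, x_2$ with weights $w_1, w_2$ and $w_2 \neq 0$ (notation introduced here for illustration). Setting the linear score to zero and solving for $x_2$ gives the boundary line explicitly:

```latex
w_1 x_1 + w_2 x_2 + b = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{b}{w_2}
```

The slope of the line is set by the ratio of the weights, and the bias shifts it without rotating it.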
Geoscience Applications
Linear regression: Predicting porosity from depth, estimating reservoir pressure from well-test data, computing velocity from offset in seismic refraction surveys.
Logistic regression: Classifying rock types (sandstone vs. shale) from well logs, predicting whether a well will be economic (yes/no), identifying fault presence from seismic attributes.
5. Polynomial Regression
Beyond Straight Lines
When the relationship between $x$ and $y$ is non-linear, we can extend linear regression by adding polynomial features up to degree $d$:
$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$$
Despite the non-linear relationship with $x$, this is still "linear" in the parameters $w_0, w_1, \dots, w_d$, so we can use the same least-squares machinery.
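To make the "linear in the parameters" point concrete, the sketch below expands a single input into polynomial features and then predicts with an ordinary dot product. The function names and the example weights are hypothetical.

```python
def poly_features(x, degree):
    """Return [1, x, x^2, ..., x^degree] for a single input value."""
    return [x ** k for k in range(degree + 1)]


def predict_poly(w, x):
    """Prediction is a plain dot product between weights and expanded features."""
    feats = poly_features(x, len(w) - 1)
    return sum(wk * fk for wk, fk in zip(w, feats))


# Hypothetical quadratic model y_hat = 1 + 0*x + 2*x^2
w = [1.0, 0.0, 2.0]
print(poly_features(2.0, 3))
print(predict_poly(w, 3.0))
```

Stacking `poly_features` for every training example gives a design matrix that can be passed to any linear least-squares solver unchanged.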
When to use: When scatter plots reveal curvature (e.g., porosity-permeability cross-plots often follow a power law). Start with degree $d = 2$ or $d = 3$ and increase cautiously.
Overfitting risk: High-degree polynomials fit the training data very well but can oscillate wildly between data points. A polynomial of degree $m - 1$ (where $m$ is the number of training points, with distinct $x$ values) passes through every point but typically generalises very poorly. Always evaluate on a held-out test set.
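The overfitting risk can be illustrated with Lagrange interpolation, which constructs exactly the degree $m - 1$ polynomial through $m$ points. The smooth target function, the 11 equally spaced training points, and the function names below are assumptions chosen for the demonstration, not data from a real survey.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs)-1 through (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def target(x):
    # Smooth test function used only for illustration
    return 1.0 / (1.0 + 25.0 * x * x)


m = 11
train_x = [-1.0 + 2.0 * i / (m - 1) for i in range(m)]
train_y = [target(x) for x in train_x]

# Largest error on the training points themselves
train_error = max(
    abs(lagrange_interpolate(train_x, train_y, x) - y)
    for x, y in zip(train_x, train_y)
)

# Largest error on a dense grid between the training points
test_x = [-1.0 + 2.0 * i / 200 for i in range(201)]
test_error = max(
    abs(lagrange_interpolate(train_x, train_y, x) - target(x)) for x in test_x
)

print(f"max training error: {train_error:.2e}")
print(f"max error between points: {test_error:.2f}")
```

The training error is zero by construction, while the error between points is large near the ends of the interval, which is the pattern a held-out test set is meant to expose.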
6. Regularized Regression
Ridge Regression (L2 Regularization)
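Ridge regression adds a penalty proportional to the squared magnitude of the weights to the cost function:
$$J_{\text{ridge}}(\mathbf{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n} w_j^2$$
The hyperparameter $\lambda \geq 0$ controls the strength of regularization; the bias $b$ is usually left unpenalised. Larger $\lambda$ shrinks the weights toward zero, producing a simpler model. Ridge regression reduces overfitting and stabilises the solution when features are strongly correlated (multicollinearity).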
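The shrinking effect is easiest to see with a single feature. The sketch below assumes the ridge cost has the form $\frac{1}{2m}\sum_i (wx^{(i)} - y^{(i)})^2 + \lambda w^2$ with centred data (so the bias can be dropped); setting the derivative to zero then gives $w = \sum_i x^{(i)}y^{(i)} / (\sum_i (x^{(i)})^2 + 2m\lambda)$. The function name and data are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form ridge weight for one centred feature and no bias,
    minimising 1/(2m) * sum((w*x - y)^2) + lam * w^2."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 2.0 * m * lam)


# Centred synthetic data scattered around y = 2x (illustrative values only)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.1, 2.0, 3.9]

for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_weight(xs, ys, lam))
```

The weight shrinks smoothly as `lam` grows but never reaches exactly zero for any finite `lam`.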
Lasso Regression (L1 Regularization)
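Lasso uses the sum of absolute weights as the penalty:
$$J_{\text{lasso}}(\mathbf{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\left|w_j\right|$$
A key advantage of Lasso over Ridge: Lasso can drive some weights to exactly zero, effectively performing feature selection. If you have 50 well-log features but only 5 are truly relevant, Lasso with a suitably chosen $\lambda$ tends to zero out the irrelevant ones.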
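Why Lasso produces exact zeros can be seen in the one-feature case. Assuming a cost of the form $\frac{1}{2m}\sum_i (wx^{(i)} - y^{(i)})^2 + \lambda|w|$ with centred data and no bias, the minimiser is given by the soft-thresholding operator. The function names and data below are hypothetical.

```python
def soft_threshold(c, lam):
    """Shrink c toward zero by lam, returning exactly 0.0 when |c| <= lam."""
    if c > lam:
        return c - lam
    if c < -lam:
        return c + lam
    return 0.0


def lasso_weight(xs, ys, lam):
    """Closed-form Lasso weight for one centred feature and no bias,
    minimising 1/(2m) * sum((w*x - y)^2) + lam * |w|."""
    m = len(xs)
    c = sum(x * y for x, y in zip(xs, ys)) / m  # feature-target correlation term
    a = sum(x * x for x in xs) / m              # mean squared feature value
    return soft_threshold(c, lam) / a


# Centred synthetic data scattered around y = 2x (illustrative values only)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.1, 2.0, 3.9]

for lam in (0.0, 1.0, 4.0):
    print(lam, lasso_weight(xs, ys, lam))
```

Once `lam` exceeds the magnitude of the correlation term, the weight is set to exactly zero, which is the mechanism behind Lasso's feature selection. With several features the same operator is applied one coordinate at a time.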
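Elastic Net adds both the L1 and the L2 penalty to the data-fit term, each with its own strength:
$$\lambda_1\sum_{j=1}^{n}\left|w_j\right| + \lambda_2\sum_{j=1}^{n} w_j^2$$
This combines the sparsity of L1 with the stability of L2 when features are correlated.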
7. Multi-Class Logistic Regression (Softmax)
Extending to More Than Two Classes
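Binary logistic regression handles two classes. For $K$ classes, we use the softmax function:
$$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad z_k = \mathbf{w}_k^\top\mathbf{x} + b_k$$
Each class $k$ has its own weight vector $\mathbf{w}_k$ and bias $b_k$. The softmax ensures all probabilities sum to 1. The predicted class is $\arg\max_k P(y = k \mid \mathbf{x})$.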
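The loss function generalizes to categorical cross-entropy:
$$J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log\hat{y}_k^{(i)}$$
where $y_k^{(i)}$ is 1 if example $i$ belongs to class $k$ and 0 otherwise, and $\hat{y}_k^{(i)}$ is the predicted probability of class $k$ for that example.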
In geoscience, softmax regression classifies well-log data into multiple lithofacies (sandstone, shale, limestone, dolomite, etc.) simultaneously.
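The softmax and categorical cross-entropy described in this section can be sketched as follows. The weight matrix, biases, and feature vector are hypothetical, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    """One linear score per class: z_k = w_k . x + b_k."""
    return [sum(wj * xj for wj, xj in zip(w_k, x)) + b_k for w_k, b_k in zip(W, b)]


def categorical_cross_entropy(true_classes, prob_rows, eps=1e-12):
    """Average of -log(probability assigned to the true class)."""
    total = 0.0
    for k, probs in zip(true_classes, prob_rows):
        total += -math.log(max(probs[k], eps))
    return total / len(true_classes)


# Hypothetical 3-class model (e.g. sandstone, shale, limestone) on two features
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 0.5]

probs = softmax(class_scores(W, b, x))
predicted = max(range(len(probs)), key=lambda k: probs[k])
print(probs, predicted)
```

Only the probability assigned to the true class enters the loss, because the one-hot label zeroes out every other term of the inner sum.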
8. Classification Evaluation Metrics
Beyond Accuracy: Confusion Matrix and Derived Metrics
Accuracy alone is misleading when classes are imbalanced (e.g., 95% shale, 5% sandstone). A model predicting "shale always" gets 95% accuracy but is useless.
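Confusion Matrix: A table where entry $C_{ij}$ counts the samples with true class $i$ that were predicted as class $j$. For a binary problem its four cells are the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).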
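Precision (of predicted positives, how many are correct):
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (of actual positives, how many are found):
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-Score (harmonic mean balancing precision and recall):
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$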
In geoscience, recall is often critical: missing a sandstone reservoir (false negative) is more costly than a false alarm. Choose the metric that aligns with the geological decision being made.
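The imbalanced shale/sandstone example can be reproduced with a few lines of code. The labels below are synthetic (0 = shale, 1 = sandstone), the function names are assumptions, and returning 0.0 when a denominator is zero is a convention adopted here rather than a universal rule.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """C[i][j] counts samples with true class i predicted as class j."""
    C = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        C[t][p] += 1
    return C


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 95 shale samples followed by 5 sandstone samples
y_true = [0] * 95 + [1] * 5

# Model A always predicts shale
always_shale = [0] * 100
accuracy_a = sum(t == p for t, p in zip(y_true, always_shale)) / len(y_true)

# Model B flags 2 shales wrongly and finds 4 of the 5 sandstones
detector = [1] * 2 + [0] * 93 + [1] * 4 + [0] * 1

print(accuracy_a, precision_recall_f1(y_true, always_shale))
print(confusion_matrix(y_true, detector, 2), precision_recall_f1(y_true, detector))
```

Model A reaches 95% accuracy with zero recall for sandstone, whereas Model B's recall of 0.8 shows it actually finds most of the reservoir samples.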
Geoscience Applications (Extended)
Reservoir property prediction: Multiple linear regression predicts porosity, permeability, or water saturation from suites of well-log curves (GR, RHOB, NPHI, RT). Polynomial terms capture non-linear responses.
Lithology classification: Softmax logistic regression classifies well-log data into 5-10 lithofacies. Feature engineering (e.g., GR/RHOB ratio) and regularization improve results.
Feature selection for reservoir models: Lasso regression identifies which seismic attributes are most predictive of reservoir thickness, automatically zeroing out uninformative attributes.
[Refs: Bishop, Pattern Recognition and ML; Hastie et al., Elements of Statistical Learning]
References
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 3 & 4 (linear regression, linear classifiers). Springer.
- James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), ch. 3 & 4 (linear & logistic regression). Springer.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 3 & 4 (linear models for regression and classification). Springer.
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 11 (linear regression) & ch. 10 (logistic regression). MIT Press.