Feature Engineering

Chapter 9: Engineering Features from Geoscience Data

Learning objectives

Define feature engineering and explain why it often matters more than algorithm choice
Distinguish numerical, categorical, and ordinal feature types
Apply one-hot encoding and label encoding to categorical variables
Understand and apply StandardScaler, MinMaxScaler, and RobustScaler
Create polynomial features, interaction terms, and domain-specific features
Use correlation analysis, mutual information, and RFE for feature selection
Handle missing data with imputation strategies

What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to a machine learning model, thereby improving its performance. It is often said that "applied machine learning is basically feature engineering" — choosing and crafting the right features frequently has a larger impact on model performance than choosing a fancier algorithm.

Feature engineering includes:

Cleaning and preprocessing existing features
Encoding categorical variables into numerical format
Scaling/normalizing numerical features
Creating new features from existing ones (domain knowledge)
Selecting the most informative features and discarding noise

Feature Types

Type	Description	Examples (Geoscience)
Numerical (continuous)	Real-valued measurements with meaningful arithmetic	Porosity (0.05–0.35), depth (m), temperature (°C), GR (API units)
Numerical (discrete)	Integer counts	Number of fractures, fault count per grid cell
Categorical (nominal)	Categories with no natural order	Rock type (granite, basalt, sandstone), well name, formation name
Ordinal	Categories with a meaningful order	Grain size (fine, medium, coarse), weathering grade (I–V), reservoir quality (poor, fair, good, excellent)

Encoding Categorical Variables

Machine learning algorithms work with numbers, not strings. Categorical variables must be converted to numerical representations.

One-Hot Encoding

Creates a binary (0/1) column for each category. For a feature with $k$ categories, this creates $k$ new binary features (or $k-1$ to avoid multicollinearity).

Example: Rock type = {Sandstone, Shale, Limestone}

Original	is_Sandstone	is_Shale	is_Limestone
Sandstone	1	0	0
Shale	0	1	0
Limestone	0	0	1

Use when: Categories have no natural order. Do NOT use for high-cardinality features (e.g., 500 unique well names) — the feature space explodes.
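The rock-type table above can be reproduced in a few lines. This is a minimal sketch using pandas; the column name `rock_type` and the four samples are made up for illustration, and pandas orders the generated columns alphabetically rather than in the order shown in the table.

```python
import pandas as pd

# Hypothetical toy samples; the column name "rock_type" is illustrative.
df = pd.DataFrame({"rock_type": ["Sandstone", "Shale", "Limestone", "Shale"]})

# Full encoding: one 0/1 column per category (k columns).
full = pd.get_dummies(df["rock_type"], prefix="is", dtype=int)

# Reduced encoding: drop the first category (k - 1 columns) so the
# columns are not perfectly collinear with an intercept term.
reduced = pd.get_dummies(df["rock_type"], prefix="is", drop_first=True, dtype=int)

print(full)
print(reduced.columns.tolist())  # ['is_Sandstone', 'is_Shale']
```

In the reduced encoding, a row of all zeros stands for the dropped category (here Limestone).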

Label Encoding

Assigns a unique integer to each category: Sandstone = 0, Shale = 1, Limestone = 2.

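Use when: There is a natural order (ordinal data), in which case the integers should follow that order, or the algorithm is tree-based. Trees split on thresholds, so they can still separate arbitrarily numbered categories, although this may take more splits than a meaningful encoding would. Do NOT use for nominal features in linear or distance-based models: the arbitrary integers imply a false ordering and spacing (e.g., Limestone > Shale > Sandstone).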
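For ordinal data, the integer codes are only meaningful if they follow the real ranking. The sketch below, which assumes pandas and an invented grain-size column, contrasts an explicit order with the alphabetical order that a default label encoding would produce.

```python
import pandas as pd

# Hypothetical grain-size observations (ordinal: fine < medium < coarse).
grain = pd.Series(["medium", "fine", "coarse", "fine"])

# Explicit order, so the integer codes respect the real ranking.
order = ["fine", "medium", "coarse"]
ordered_codes = pd.Categorical(grain, categories=order, ordered=True).codes

# Without an explicit order the categories are sorted alphabetically:
# coarse=0, fine=1, medium=2, which scrambles the physical order.
alphabetical_codes = pd.Categorical(grain).codes

print(ordered_codes.tolist())       # [1, 0, 2, 0]
print(alphabetical_codes.tolist())  # [2, 1, 0, 1]
```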

Feature Scaling

Many ML algorithms (KNN, SVM, neural networks, PCA) are sensitive to the scale of features. If one feature ranges from 0–1 and another from 0–10000, the latter will dominate distance calculations.

StandardScaler (Z-score normalization)

$$x' = \frac{x - \mu}{\sigma}$$

Transforms each feature to have mean 0 and standard deviation 1. Best for features that are approximately normally distributed.

MinMaxScaler

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

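Scales features to the range [0, 1]. Because the minimum is subtracted, zero entries are preserved only when the feature's minimum is already 0, so the scaler is generally not suited to sparse data. Sensitive to outliers (a single extreme value stretches the entire range).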

RobustScaler

$$x' = \frac{x - Q_{50}}{Q_{75} - Q_{25}}$$

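Uses the median and interquartile range instead of the mean and standard deviation, so the scaling parameters are barely affected by outliers (the outliers themselves remain extreme after scaling). Well suited to geoscience data, where outliers are common (e.g., a spike in a resistivity log).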
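To see how the three scalers react to an outlier, the sketch below applies scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` to a single invented resistivity column containing one spike. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical resistivity readings with one spike acting as an outlier.
x = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(x).ravel()
    print(f"{name:>14}: {np.round(results[name], 3)}")
```

With MinMaxScaler the four ordinary readings are squeezed into the bottom 1% of the [0, 1] range, whereas RobustScaler keeps them spread between -1 and 0.5 and leaves only the spike far from the rest.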

Important: Always fit the scaler on training data only, then transform both training and test data with those same parameters. Never fit on test data — that causes data leakage.
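The fit-on-train, transform-both pattern might look like the following sketch. The feature matrix is synthetic, and the 75/25 split and random seeds are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those parameters, no refit

print(X_train_scaled.mean(axis=0).round(3))  # zero by construction
print(X_test_scaled.mean(axis=0).round(3))   # close to zero, but not exactly
```

The test-set means are not exactly zero, which is expected: the test data are scaled with statistics they did not contribute to.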

Feature Creation

Creating new features from existing ones can dramatically improve model performance, especially when domain knowledge guides the process.

Polynomial Features

For features $x_1, x_2$, generating degree-2 polynomial features adds the terms $x_1^2, x_2^2, x_1 x_2$. This allows linear models to capture nonlinear relationships.

$$\text{PolynomialFeatures}(\text{degree}=2): [x_1, x_2] \to [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$

Interaction Terms

Products of two features that capture how they work together. Example: in well logs, the product $\text{GR} \times \text{NPHI}$ might be more predictive of shale content than either feature alone.

Domain-Specific Features (Geoscience)

GR/RHOB ratio: Gamma ray divided by bulk density — helps distinguish organic-rich shales.
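NPHI - DPHI: Neutron-density porosity difference. Negative values (density porosity reading higher than neutron porosity, the so-called crossover) are a classic gas indicator, while large positive values often point to shale.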
Acoustic impedance: $AI = V_p \times \rho$ — fundamental seismic attribute.
Depth-normalized logs: Dividing measurements by depth to remove depth trends.
Moving averages/gradients: Smoothed log values or vertical gradients over depth windows.
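Several of the features listed above, together with the GR × NPHI interaction term, can be derived with pandas in a few lines. This is only a sketch: the column names, units, the 3-sample smoothing window, and all log values are assumptions invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical log samples at a regular 0.5 m depth step.
logs = pd.DataFrame({
    "DEPTH": np.arange(1000.0, 1003.0, 0.5),                      # m
    "GR":    [45.0, 60.0, 110.0, 120.0, 80.0, 50.0],              # API units
    "RHOB":  [2.35, 2.40, 2.55, 2.50, 2.45, 2.30],                # g/cm3
    "NPHI":  [0.20, 0.22, 0.30, 0.32, 0.15, 0.12],                # fraction
    "DPHI":  [0.18, 0.17, 0.08, 0.10, 0.22, 0.24],                # fraction
    "VP":    [3200.0, 3100.0, 2900.0, 2950.0, 3300.0, 3400.0],    # m/s
})

logs["GR_RHOB"] = logs["GR"] / logs["RHOB"]            # ratio feature
logs["NPHI_minus_DPHI"] = logs["NPHI"] - logs["DPHI"]  # neutron-density separation
logs["AI"] = logs["VP"] * logs["RHOB"]                 # acoustic impedance
logs["GR_x_NPHI"] = logs["GR"] * logs["NPHI"]          # interaction term

# Centered 3-sample moving average and vertical gradient of GR with depth.
logs["GR_smooth"] = logs["GR"].rolling(window=3, center=True, min_periods=1).mean()
logs["GR_grad"] = np.gradient(logs["GR"].to_numpy(), logs["DEPTH"].to_numpy())

print(logs.round(3))
```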

Feature Selection

Not all features are useful. Irrelevant or redundant features add noise and increase computation. Feature selection identifies the most informative subset.

Correlation Analysis

Compute the Pearson correlation matrix $\rho_{ij} = \frac{\text{Cov}(X_i, X_j)}{\sigma_{X_i}\sigma_{X_j}}$. Features highly correlated with each other ($|\rho_{ij}| > 0.9$) are redundant: keep one, drop the rest.
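One possible way to apply the $|\rho| > 0.9$ rule is sketched below with pandas. The data are synthetic, with `A_copy` deliberately built as a near duplicate of `A`; scanning only the upper triangle of the matrix ensures that one member of each correlated pair is kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
a = rng.normal(size=n)
X = pd.DataFrame({
    "A": a,
    "A_copy": a + 0.01 * rng.normal(size=n),  # near duplicate of A
    "B": rng.normal(size=n),                  # independent feature
})

corr = X.corr(method="pearson").abs()

# Keep only the upper triangle (k=1 excludes the diagonal) so each
# pair is inspected once and the first feature of a pair survives.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

X_reduced = X.drop(columns=to_drop)
print(to_drop)  # ['A_copy']
```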

Mutual Information

Measures how much knowing one feature reduces uncertainty about the target. Unlike correlation, it captures nonlinear relationships.

$$I(X; Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$$
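For discrete variables the sum can be evaluated directly from a joint probability table. The helper below is a hypothetical illustration written with NumPy (it is not a library function) and uses the natural logarithm, so the result is in nats.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(x), shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal P(y), shape (1, n_y)
    nonzero = joint > 0                    # terms with P(x,y) = 0 contribute 0
    ratio = joint[nonzero] / (px * py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # P(x,y) = P(x)P(y)
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])        # y fully determined by x

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # log(2), about 0.693
```

For continuous features the probabilities have to be estimated; scikit-learn offers `mutual_info_regression` and `mutual_info_classif` in `sklearn.feature_selection` for that purpose.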

Recursive Feature Elimination (RFE)

Trains a model, removes the least important feature, retrains, and repeats. The process ranks features by the order in which they are eliminated.
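A short sketch of RFE with scikit-learn follows. The data are synthetic and the target is constructed to depend only on features 0 and 2, so the expected selection is known in advance; the choice of `LinearRegression` as the base estimator is an assumption for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Synthetic target: depends only on features 0 and 2, plus small noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=300)

# Remove one feature per iteration until two remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the retained features
print(selector.ranking_)  # 1 = retained; larger numbers were eliminated earlier
```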

Handling Missing Data

Real-world geoscience data is full of missing values: wells may lack certain logs, core samples may be incomplete, sensors fail.

Strategy	Method	When to Use
Drop rows	Remove samples with missing values	Few missing values, large dataset
Drop columns	Remove features with many missing values	>50% of values missing in a column
Mean/median imputation	Replace missing values with the column mean or median	Numerical features, data missing at random
Mode imputation	Replace with the most frequent value	Categorical features
Interpolation	Linear or spline interpolation between known values	Well logs with occasional gaps (spatially ordered data)
KNN imputation	Use similar samples to estimate missing values	Complex patterns in missing data

Geoscience example: A well has GR, RHOB, and RT logs, but NPHI is missing for a 50 m interval due to a tool failure. Linear interpolation from the surrounding NPHI values is reasonable when log values change smoothly with depth across the gap, but it replaces all variation inside the interval with a straight line, so it is safest for short gaps. For a long gap, or if an entire well lacks a log, you might instead use a model trained on other wells to predict the missing log from the available logs.
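The difference between interpolation and median imputation can be seen on a short gap. The sketch below assumes pandas and a made-up NPHI curve indexed by depth with two missing samples.

```python
import numpy as np
import pandas as pd

# Hypothetical NPHI log indexed by depth (m) with a two-sample gap.
nphi = pd.Series(
    [0.20, 0.22, np.nan, np.nan, 0.28, 0.30],
    index=pd.Index([1000.0, 1000.5, 1001.0, 1001.5, 1002.0, 1002.5], name="DEPTH"),
    name="NPHI",
)

# Linear interpolation that uses the depth values themselves as the x-axis.
interpolated = nphi.interpolate(method="index")

# Median imputation ignores depth and fills every gap with the same value.
median_filled = nphi.fillna(nphi.median())

print(interpolated.round(3).tolist())   # [0.2, 0.22, 0.24, 0.26, 0.28, 0.3]
print(median_filled.round(3).tolist())  # [0.2, 0.22, 0.25, 0.25, 0.28, 0.3]
```

Interpolation follows the local trend between 0.22 and 0.28, whereas the median places the same value (0.25) at both depths regardless of their position.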
