Feature Engineering

Chapter 9: Engineering Features from Geoscience Data

Learning objectives

  • Define feature engineering and explain why it often matters more than algorithm choice
  • Distinguish numerical, categorical, and ordinal feature types
  • Apply one-hot encoding and label encoding to categorical variables
  • Understand and apply StandardScaler, MinMaxScaler, and RobustScaler
  • Create polynomial features, interaction terms, and domain-specific features
  • Use correlation analysis, mutual information, and RFE for feature selection
  • Handle missing data with imputation strategies

What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that better represent the underlying problem to a machine learning model, thereby improving its performance. It is often said that "applied machine learning is basically feature engineering": choosing and crafting the right features frequently has a larger impact on model performance than switching to a more complex algorithm.

Feature engineering includes:

  • Cleaning and preprocessing existing features
  • Encoding categorical variables into numerical format
  • Scaling/normalizing numerical features
  • Creating new features from existing ones (domain knowledge)
  • Selecting the most informative features and discarding noise

Feature Types

| Type | Description | Examples (Geoscience) |
|---|---|---|
| **Numerical (continuous)** | Real-valued measurements with meaningful arithmetic | Porosity (0.05–0.35), depth (m), temperature (°C), GR (API units) |
| **Numerical (discrete)** | Integer counts | Number of fractures, fault count per grid cell |
| **Categorical (nominal)** | Categories with no natural order | Rock type (granite, basalt, sandstone), well name, formation name |
| **Ordinal** | Categories with a meaningful order | Grain size (fine, medium, coarse), weathering grade (I–V), reservoir quality (poor, fair, good, excellent) |

Encoding Categorical Variables

Machine learning algorithms work with numbers, not strings. Categorical variables must be converted to numerical representations.

One-Hot Encoding

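Creates a binary (0/1) column for each category. For a feature with $k$ categories, this creates $k$ new binary features (or $k-1$ if one category is dropped as the reference level, which avoids perfect multicollinearity in linear models with an intercept).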

Example: Rock type = {Sandstone, Shale, Limestone}

| Original | is_Sandstone | is_Shale | is_Limestone |
|---|---|---|---|
| Sandstone | 1 | 0 | 0 |
| Shale | 0 | 1 | 0 |
| Limestone | 0 | 0 | 1 |

Use when: Categories have no natural order. Do NOT use for high-cardinality features (e.g., 500 unique well names) — the feature space explodes.
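A minimal sketch of the rock-type example, assuming scikit-learn is available. The variable names are illustrative. Note that `OneHotEncoder` orders its output columns alphabetically by category, so the column order differs from the table above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rock-type column: one row per sample.
rock_type = np.array([["Sandstone"], ["Shale"], ["Limestone"], ["Shale"]])

# Full encoding: k = 3 categories -> 3 binary columns.
full_encoder = OneHotEncoder()
one_hot_full = full_encoder.fit_transform(rock_type).toarray()

# Reduced encoding: drop the first category -> k - 1 = 2 columns.
reduced_encoder = OneHotEncoder(drop="first")
one_hot_reduced = reduced_encoder.fit_transform(rock_type).toarray()

print(full_encoder.categories_[0])  # column order (alphabetical)
print(one_hot_full)
print(one_hot_reduced)
```

In the reduced encoding, the dropped category (here Limestone) is represented by a row of zeros.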

Label Encoding

Assigns a unique integer to each category: Sandstone = 0, Shale = 1, Limestone = 2.

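Use when: There is a natural order (ordinal data), in which case the integers should follow that order, OR the algorithm is tree-based (trees split on thresholds, so they can usually work with arbitrary integer labels, although an arbitrary order may require more splits to isolate a category). Do NOT use with linear or distance-based models when the categories are unordered: the arbitrary integers imply a false ordering and spacing (e.g., Limestone > Shale > Sandstone).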
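The difference between an order-preserving encoding and an arbitrary one can be sketched in plain Python. The grain-size values and the mapping below are illustrative assumptions; the alphabetical assignment mirrors what scikit-learn's `LabelEncoder` produces, since it sorts the class labels.

```python
# Hypothetical grain-size observations (ordinal: fine < medium < coarse).
grain_size = ["medium", "fine", "coarse", "fine"]

# Explicit mapping that preserves the geological order.
grain_order = {"fine": 0, "medium": 1, "coarse": 2}
encoded_ordinal = [grain_order[g] for g in grain_size]

# Alphabetical integer assignment: coarse=0, fine=1, medium=2.
alphabetical = {name: i for i, name in enumerate(sorted(set(grain_size)))}
encoded_alphabetical = [alphabetical[g] for g in grain_size]

print(encoded_ordinal)       # [1, 0, 2, 0]
print(encoded_alphabetical)  # [2, 1, 0, 1]
```

The alphabetical version places "coarse" below "fine", which is the false ordering a linear model would then try to use.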

Feature Scaling

Many ML algorithms (KNN, SVM, neural networks, PCA) are sensitive to the scale of features. If one feature ranges from 0–1 and another from 0–10000, the latter will dominate distance calculations.

StandardScaler (Z-score normalization)

$$x' = \frac{x - \mu}{\sigma}$$

Transforms each feature to have mean 0 and standard deviation 1. Best for features that are approximately normally distributed.
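As a small check of the formula, the sketch below compares scikit-learn's `StandardScaler` with a manual calculation. The porosity and depth values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: columns are porosity (fraction) and depth (m).
X = np.array([[0.10, 1500.0],
              [0.20, 2000.0],
              [0.30, 2500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same result from the formula, using the population standard deviation (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaler.mean_)  # per-column mean
print(X_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1 despite their very different original ranges.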

MinMaxScaler

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

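Scales features to the range [0, 1]. Zero entries are preserved only when the feature minimum is 0, so sparse data does not generally stay sparse. Sensitive to outliers (a single extreme value stretches the entire range and compresses the remaining values).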
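The outlier sensitivity is easy to see with a from-scratch version of the formula. The helper `min_max_scale` and the resistivity values are hypothetical, written only for this illustration.

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D array to [0, 1] with (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical resistivity readings (ohm-m), without and with one spike.
clean = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
spiked = np.array([2.0, 4.0, 6.0, 8.0, 1000.0])

print(min_max_scale(clean))   # evenly spread over [0, 1]
print(min_max_scale(spiked))  # first four values squeezed near 0
```

This sketch does not guard against a constant column, where the denominator would be zero.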

RobustScaler

$$x' = \frac{x - Q_{50}}{Q_{75} - Q_{25}}$$

Uses the median and interquartile range instead of the mean and standard deviation, which makes it robust to outliers. Well suited to geoscience data, where outliers are common (e.g., a spike in a resistivity log).
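A short sketch, assuming scikit-learn's `RobustScaler` with its default quantile range of 25 to 75, applied to illustrative resistivity values with one spike. The manual calculation is included only to show that it matches the formula.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical resistivity log (ohm-m) with a single spike.
resistivity = np.array([[2.0], [4.0], [6.0], [8.0], [1000.0]])

robust = RobustScaler()  # default quantile_range is (25.0, 75.0)
scaled = robust.fit_transform(resistivity)

# Manual version of the formula for comparison.
q25, q50, q75 = np.percentile(resistivity, [25, 50, 75])
manual = (resistivity - q50) / (q75 - q25)

print(scaled.ravel())
```

The four ordinary readings keep a sensible spread (from -1 to 0.5), while the spike remains a large value instead of compressing everything else.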

Important: Always fit the scaler on training data only, then transform both training and test data with those same parameters. Never fit on test data — that causes data leakage.
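The fit-then-transform pattern can be sketched as follows. The gamma-ray values and the train/test split are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical gamma-ray readings (API units), already split into train and test.
X_train = np.array([[40.0], [60.0], [80.0], [100.0]])
X_test = np.array([[70.0], [150.0]])

scaler = StandardScaler()
scaler.fit(X_train)                        # statistics come from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The test values are expressed relative to the training distribution, so a test sample can legitimately fall well outside the range seen in training.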

Feature Creation

Creating new features from existing ones can dramatically improve model performance, especially when domain knowledge guides the process.

Polynomial Features

For features $x_1, x_2$, generating degree-2 polynomial features adds the terms $x_1^2$, $x_2^2$, and $x_1 x_2$. This allows linear models to capture nonlinear relationships.

$$\text{PolynomialFeatures}(\text{degree}=2): [x_1, x_2] \to [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$$
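The mapping can be checked on a single sample with scikit-learn's `PolynomialFeatures`. The input values 2 and 3 are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with x1 = 2 and x2 = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Column order: [1, x1, x2, x1^2, x1*x2, x2^2]
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```

The number of generated columns grows quickly with the number of input features and the degree, so this is usually applied to a small, chosen subset of features.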

Interaction Terms

Products of two features that capture how they work together. Example: in well logs, the product $\text{GR} \times \text{NPHI}$ might be more predictive of shale content than either feature alone.
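A minimal NumPy sketch of building such an interaction column. The log values are invented for illustration.

```python
import numpy as np

# Hypothetical log samples: gamma ray (API) and neutron porosity (fraction).
gr = np.array([30.0, 90.0, 120.0])
nphi = np.array([0.10, 0.25, 0.35])

# Interaction term: element-wise product of the two logs.
gr_x_nphi = gr * nphi

# Append it as a third column of the feature matrix.
X = np.column_stack([gr, nphi, gr_x_nphi])
print(X)
```

Because GR and NPHI are on very different scales, it may be worth scaling the inputs before taking the product, depending on the model.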

Domain-Specific Features (Geoscience)

  • GR/RHOB ratio: Gamma ray divided by bulk density — helps distinguish organic-rich shales.
  • NPHI - DPHI: Neutron-density porosity difference — classic gas indicator.
  • Acoustic impedance: $AI = V_p \times \rho$, a fundamental seismic attribute.
  • Depth-normalized logs: Dividing measurements by depth to remove depth trends.
  • Moving averages/gradients: Smoothed log values or vertical gradients over depth windows.
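Several of the features listed above can be sketched with NumPy. All log values are hypothetical, and the density-porosity step assumes a quartz matrix density of 2.65 g/cm^3 and a fresh-water fluid density of 1.00 g/cm^3, which are not given in the text.

```python
import numpy as np

# Hypothetical log samples on a regular 0.5 m depth grid.
depth = np.array([1000.0, 1000.5, 1001.0, 1001.5, 1002.0])  # m
gr = np.array([45.0, 60.0, 110.0, 120.0, 70.0])             # API
rhob = np.array([2.30, 2.35, 2.50, 2.55, 2.40])             # g/cm^3
nphi = np.array([0.22, 0.20, 0.30, 0.32, 0.18])             # fraction
vp = np.array([3200.0, 3300.0, 3000.0, 2950.0, 3400.0])     # m/s

# Assumed matrix and fluid densities (quartz sandstone, fresh water).
RHO_MATRIX, RHO_FLUID = 2.65, 1.00
dphi = (RHO_MATRIX - rhob) / (RHO_MATRIX - RHO_FLUID)  # density porosity

gr_rhob_ratio = gr / rhob             # GR/RHOB ratio
nphi_minus_dphi = nphi - dphi         # neutron-density difference
acoustic_impedance = vp * rhob        # AI = Vp * rho
gr_gradient = np.gradient(gr, depth)  # vertical gradient, API per metre

# 3-sample centred moving average; edge samples are left as NaN.
gr_smooth = np.full_like(gr, np.nan)
gr_smooth[1:-1] = np.convolve(gr, np.ones(3) / 3, mode="valid")

print(gr_gradient)
print(gr_smooth)
```

The window length and the treatment of edge samples are design choices; a longer window smooths more but blurs thin beds.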

Feature Selection

Not all features are useful. Irrelevant or redundant features add noise and increase computation. Feature selection identifies the most informative subset.

Correlation Analysis

Compute the Pearson correlation matrix $\rho_{ij} = \frac{\text{Cov}(X_i, X_j)}{\sigma_{X_i}\sigma_{X_j}}$. Features highly correlated with each other ($|\rho| > 0.9$) are redundant: keep one and drop the rest.
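One simple way to apply this rule is a greedy pass over the correlation matrix. The helper `drop_correlated` below is a hypothetical sketch, not a library function, and the data are synthetic.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return indices of columns to keep, dropping later columns whose
    absolute Pearson correlation with an already-kept column exceeds threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] <= threshold for i in keep):
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
x1 = 2.0 * x0 + 0.01 * rng.normal(size=200)  # nearly a copy of x0
x2 = rng.normal(size=200)                    # independent of x0
X = np.column_stack([x0, x1, x2])

print(drop_correlated(X))  # x1 is dropped as redundant with x0
```

Which member of a correlated pair is kept depends on column order here; in practice domain knowledge (for example, which log is more reliably available) is a better tiebreaker.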

Mutual Information

Measures how much knowing one feature reduces uncertainty about the target. Unlike correlation, it captures nonlinear relationships.

$$I(X; Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}$$
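The formula can be evaluated directly for a small discrete example. The sketch below uses a hypothetical helper `mutual_information` and a toy case where $Y = X^2$, so the relationship is purely nonlinear: the Pearson correlation is zero while the mutual information is not.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) from a joint probability table P(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal P(x)
    py = joint.sum(axis=0, keepdims=True)  # marginal P(y)
    nonzero = joint > 0                    # 0 * log(0) is treated as 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))

# X uniform on {-1, 0, 1}, Y = X^2. Rows: x = -1, 0, 1. Columns: y = 0, 1.
joint_xy = np.array([[0.0, 1/3],
                     [1/3, 0.0],
                     [0.0, 1/3]])

x = np.array([-1.0, 0.0, 1.0])
pearson = np.corrcoef(x, x**2)[0, 1]

mi = mutual_information(joint_xy)
print(mi)       # about 0.6365 nats
print(pearson)  # 0: no linear relationship
```

For continuous features the probabilities have to be estimated (for example by binning or nearest-neighbour methods), which this toy example does not cover.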

Recursive Feature Elimination (RFE)

Trains a model, removes the least important feature, retrains, and repeats. The process ranks features by the order in which they are eliminated.
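A minimal sketch using scikit-learn's `RFE` with a linear regression as the underlying model. The data are synthetic and constructed so that only two of the four features matter.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: only features 0 and 2 influence the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 5.0 * X[:, 0] - 4.0 * X[:, 2] + 0.01 * rng.normal(size=200)

selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print(selector.support_)  # boolean mask of the selected features
print(selector.ranking_)  # 1 = selected; larger = eliminated earlier
```

With a linear model, importance is judged from coefficient magnitudes, so features should be on comparable scales before running RFE.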

Handling Missing Data

Real-world geoscience data is full of missing values: wells may lack certain logs, core samples may be incomplete, sensors fail.

| Strategy | Method | When to Use |
|---|---|---|
| **Drop rows** | Remove samples with missing values | Few missing values, large dataset |
| **Drop columns** | Remove features with many missing values | >50% of values missing in a column |
| **Mean/median imputation** | Replace missing values with the column mean or median | Numerical features, data missing at random |
| **Mode imputation** | Replace with the most frequent value | Categorical features |
| **Interpolation** | Linear or spline interpolation between known values | Well logs with occasional gaps (spatially ordered data) |
| **KNN imputation** | Use similar samples to estimate missing values | Complex patterns in missing data |
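As one example from the table, median imputation can be sketched with scikit-learn's `SimpleImputer`. The feature matrix below is hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with gaps: columns are GR (API) and NPHI (fraction).
X = np.array([[50.0, 0.10],
              [np.nan, 0.20],
              [70.0, np.nan],
              [90.0, 0.30]])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # per-column medians used as fill values
print(X_filled)
```

As with scalers, the imputer should be fitted on training data only and then applied to test data.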

Geoscience example: A well has GR, RHOB, and RT logs, but NPHI is missing for a 50 m interval due to a tool failure. Linear interpolation from the surrounding NPHI values is reasonable when log values change smoothly with depth across the gap, although it cannot reproduce any bed boundaries inside the interval. If an entire well lacks a log, there is nothing to interpolate between, and you might instead use a model trained on other wells to predict the missing log from the available logs.
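The interpolation case can be sketched with NumPy's `np.interp`. The depths and NPHI values are invented, and the gap is shortened to two samples to keep the example small.

```python
import numpy as np

# Hypothetical NPHI log with a gap (NaN) caused by a tool failure.
depth = np.array([2000.0, 2010.0, 2020.0, 2030.0, 2040.0, 2050.0])  # m
nphi = np.array([0.20, 0.22, np.nan, np.nan, 0.28, 0.30])

missing = np.isnan(nphi)

# Linear interpolation in depth between the nearest valid samples.
nphi_filled = nphi.copy()
nphi_filled[missing] = np.interp(depth[missing], depth[~missing], nphi[~missing])

print(nphi_filled)  # gap filled with 0.24 and 0.26
```

Note that `np.interp` holds the first or last valid value constant for gaps at the top or bottom of the log, which may or may not be acceptable.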

