Feature Engineering
Learning objectives
- Define feature engineering and explain why it often matters more than algorithm choice
- Distinguish numerical, categorical, and ordinal feature types
- Apply one-hot encoding and label encoding to categorical variables
- Understand and apply StandardScaler, MinMaxScaler, and RobustScaler
- Create polynomial features, interaction terms, and domain-specific features
- Use correlation analysis, mutual information, and RFE for feature selection
- Handle missing data with imputation strategies
What is Feature Engineering?
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to a machine learning model, thereby improving its performance. It is often said that "applied machine learning is basically feature engineering" — choosing and crafting the right features frequently has a larger impact on model performance than choosing a fancier algorithm.
Feature engineering includes:
- Cleaning and preprocessing existing features
- Encoding categorical variables into numerical format
- Scaling/normalizing numerical features
- Creating new features from existing ones (domain knowledge)
- Selecting the most informative features and discarding noise
Feature Types
| Type | Description | Examples (Geoscience) |
|---|---|---|
| **Numerical (continuous)** | Real-valued measurements with meaningful arithmetic | Porosity (0.05–0.35), depth (m), temperature (°C), GR (API units) |
| **Numerical (discrete)** | Integer counts | Number of fractures, fault count per grid cell |
| **Categorical (nominal)** | Categories with no natural order | Rock type (granite, basalt, sandstone), well name, formation name |
| **Ordinal** | Categories with a meaningful order | Grain size (fine, medium, coarse), weathering grade (I–V), reservoir quality (poor, fair, good, excellent) |
Encoding Categorical Variables
Machine learning algorithms work with numbers, not strings. Categorical variables must be converted to numerical representations.
One-Hot Encoding
Creates a binary (0/1) column for each category. For a feature with $k$ categories, this creates $k$ new binary features (or $k-1$ to avoid multicollinearity).
Example: Rock type = {Sandstone, Shale, Limestone}
| Original | is_Sandstone | is_Shale | is_Limestone |
|---|---|---|---|
| Sandstone | 1 | 0 | 0 |
| Shale | 0 | 1 | 0 |
| Limestone | 0 | 0 | 1 |
Use when: Categories have no natural order. Do NOT use for high-cardinality features (e.g., 500 unique well names) — the feature space explodes.
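As a minimal sketch of one-hot encoding, assuming pandas is available: the toy DataFrame and the `is` column prefix are illustrative choices, not part of any particular dataset. Note that `get_dummies` orders the columns alphabetically, so the layout differs from the table above.

```python
import pandas as pd

# Toy data: the three rock types from the example, with one repeat.
df = pd.DataFrame({"rock_type": ["Sandstone", "Shale", "Limestone", "Shale"]})

# Full one-hot encoding: one 0/1 column per category (k columns).
one_hot = pd.get_dummies(df["rock_type"], prefix="is", dtype=int)

# Dropping the first category leaves k - 1 columns, which avoids
# perfectly collinear columns in linear models.
one_hot_reduced = pd.get_dummies(
    df["rock_type"], prefix="is", drop_first=True, dtype=int
)

print(one_hot)
print(one_hot_reduced.columns.tolist())
```

With `drop_first=True`, a row of all zeros stands for the dropped category (here the alphabetically first one).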
Label Encoding
Assigns a unique integer to each category: Sandstone = 0, Shale = 1, Limestone = 2.
Use when: There is a natural order (ordinal data) and the integers follow that order, OR the algorithm is tree-based (trees split on thresholds, so they can usually cope with arbitrary integer codes, although an arbitrary order may cost extra splits). Do NOT use arbitrary integer codes for nominal data in linear models: the integers imply a false ordering and spacing (e.g., Limestone > Shale > Sandstone).
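For ordinal data, the integer codes should follow the real order rather than alphabetical order. A minimal sketch, assuming scikit-learn: `OrdinalEncoder` is used here because it accepts an explicit category order for input features, whereas scikit-learn's `LabelEncoder` is intended for target labels and sorts categories alphabetically. The grain-size values are taken from the feature-type table; the sample array is made up.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Grain size is ordinal, so we state the order explicitly.
grain_order = ["fine", "medium", "coarse"]
X = np.array([["medium"], ["fine"], ["coarse"], ["fine"]])

encoder = OrdinalEncoder(categories=[grain_order])
codes = encoder.fit_transform(X)

# fine -> 0, medium -> 1, coarse -> 2
print(codes.ravel())
```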
Feature Scaling
Many ML algorithms (KNN, SVM, neural networks, PCA) are sensitive to the scale of features. If one feature ranges from 0–1 and another from 0–10000, the latter will dominate distance calculations.
StandardScaler (Z-score normalization)
Transforms each feature to have mean 0 and standard deviation 1. Best for features that are approximately normally distributed.
MinMaxScaler
Scales features to the range [0, 1] by default. Zero entries stay zero only when the feature's minimum is 0, because the minimum is subtracted before scaling. Sensitive to outliers (a single extreme value stretches the entire range).
RobustScaler
Uses the median and interquartile range instead of the mean and standard deviation, which makes it robust to outliers. Well suited to geoscience data, where outliers are common (e.g., a spike in a resistivity log).
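The effect of an outlier on the three scalers can be sketched on a single made-up column. The resistivity values below are hypothetical, chosen only so that one reading is a clear spike.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical resistivity readings (ohm-m) with one spike at the end.
x = np.array([[10.0], [12.0], [11.0], [13.0], [9.0], [500.0]])

scaled = {
    "standard": StandardScaler().fit_transform(x).ravel(),
    "minmax": MinMaxScaler().fit_transform(x).ravel(),
    "robust": RobustScaler().fit_transform(x).ravel(),
}

for name, values in scaled.items():
    print(name, np.round(values, 3))
```

In this example the spike squeezes the five ordinary readings into a very narrow band under `MinMaxScaler`, while `RobustScaler` keeps them spread around zero and leaves the spike as a large value.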
Important: Always fit the scaler on training data only, then transform both training and test data with those same parameters. Never fit on test data — that causes data leakage.
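The fit-on-train, transform-both pattern might look like the sketch below. The feature matrix is synthetic, and the split size and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix: 200 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test data (no refitting).
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

The scaled test set will generally not have mean exactly 0, and that is expected: it was scaled with the training statistics.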
Feature Creation
Creating new features from existing ones can dramatically improve model performance, especially when domain knowledge guides the process.
Polynomial Features
For features $x_1, x_2$, generating degree-2 polynomial features creates: $1, x_1, x_2, x_1^2, x_1 x_2, x_2^2$. This allows linear models to capture nonlinear relationships.
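For two features $x_1$ and $x_2$, a degree-2 expansion can be sketched with scikit-learn's `PolynomialFeatures`. The single sample below ($x_1 = 2$, $x_2 = 3$) is made up so that each output column is easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with x1 = 2 and x2 = 3.
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)

# Column order: 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```

The number of generated columns grows quickly with the degree and the number of input features, so low degrees are usually a sensible starting point.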
Interaction Terms
Products of two features that capture how they work together. Example: in well logs, the product of two log readings might be more predictive of shale content than either feature alone.
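A minimal sketch of generating interaction terms without the squared terms, assuming scikit-learn. The two columns are generic placeholders (`a` and `b`), not specific logs.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples, two placeholder features a and b.
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# interaction_only=True keeps products of distinct features and
# skips pure powers such as a^2 and b^2.
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)

# Column order: a, b, a*b
print(X_inter)
```

For a single hand-picked interaction, multiplying the two columns directly is simpler and keeps the feature name meaningful.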
Domain-Specific Features (Geoscience)
- GR/RHOB ratio: Gamma ray divided by bulk density — helps distinguish organic-rich shales.
- NPHI - DPHI: Neutron-density porosity difference — classic gas indicator.
- Acoustic impedance: $Z = \rho V_p$ (bulk density times P-wave velocity), a fundamental seismic attribute.
- Depth-normalized logs: Dividing measurements by depth to remove depth trends.
- Moving averages/gradients: Smoothed log values or vertical gradients over depth windows.
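Several of the features listed above can be sketched with pandas. Everything in the table below is hypothetical: the column names (`DEPTH`, `GR`, `RHOB`, `NPHI`, `DPHI`, `VP`), the units, and the values are assumptions for illustration, and the 3-sample smoothing window is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical log samples; names, units, and values are assumptions.
logs = pd.DataFrame({
    "DEPTH": [1000.0, 1000.5, 1001.0, 1001.5, 1002.0],  # m
    "GR":    [45.0, 60.0, 110.0, 120.0, 50.0],          # API
    "RHOB":  [2.35, 2.40, 2.55, 2.50, 2.30],            # g/cm3
    "NPHI":  [0.20, 0.22, 0.30, 0.28, 0.12],            # v/v
    "DPHI":  [0.18, 0.17, 0.10, 0.12, 0.21],            # v/v
    "VP":    [3200.0, 3100.0, 2800.0, 2850.0, 3300.0],  # m/s
})

# Ratio and difference features.
logs["GR_RHOB"] = logs["GR"] / logs["RHOB"]
logs["NPHI_MINUS_DPHI"] = logs["NPHI"] - logs["DPHI"]

# Acoustic impedance: bulk density times P-wave velocity.
logs["AI"] = logs["RHOB"] * logs["VP"]

# Centered 3-sample moving average of gamma ray.
logs["GR_SMOOTH"] = logs["GR"].rolling(window=3, center=True, min_periods=1).mean()

# Vertical gradient of gamma ray with respect to depth (API per m).
logs["GR_GRAD"] = np.gradient(logs["GR"].to_numpy(), logs["DEPTH"].to_numpy())

print(logs.round(3))
```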
Feature Selection
Not all features are useful. Irrelevant or redundant features add noise and increase computation. Feature selection identifies the most informative subset.
Correlation Analysis
Compute the Pearson correlation matrix of the features. Features that are highly correlated with each other (absolute correlation above a chosen threshold) are redundant: keep one, drop the rest.
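One way to sketch a correlation filter with pandas is shown below. The data are synthetic, and the 0.9 threshold is an assumed cutoff that should be tuned for the problem at hand.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=300),  # nearly a duplicate of "a"
    "b": rng.normal(size=300),                        # independent feature
})

threshold = 0.9  # assumed cutoff, not a universal rule

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

This keeps the first feature of each correlated pair; which one to keep is a judgment call that domain knowledge can inform.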
Mutual Information
Measures how much knowing one feature reduces uncertainty about the target. Unlike correlation, it captures nonlinear relationships.
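The difference from correlation can be sketched on synthetic data where the target depends on a feature only through its square. The data, noise level, and seeds below are assumptions; scikit-learn's `mutual_info_regression` gives a nearest-neighbor estimate, so the exact numbers vary with sample size and settings.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)      # informative, but nonlinearly
noise = rng.uniform(-1.0, 1.0, size=1000)  # unrelated to the target
y = x**2 + rng.normal(scale=0.01, size=1000)

X = np.column_stack([x, noise])

# Pearson correlation misses the symmetric, U-shaped dependence.
r = np.corrcoef(x, y)[0, 1]

# Mutual information estimate for each column of X against y.
mi = mutual_info_regression(X, y, random_state=0)

print("Pearson r(x, y):", round(r, 3))
print("Mutual information [x, noise]:", mi.round(3))
```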
Recursive Feature Elimination (RFE)
Trains a model, removes the least important feature, retrains, and repeats. The process ranks features by the order in which they are eliminated.
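A minimal RFE sketch with scikit-learn follows. The data are synthetic, the linear regression estimator is an arbitrary choice, and the number of features to keep is assumed to be known in advance (in practice it is often chosen by cross-validation).

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 2 drive the target; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Remove one feature per iteration until two remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=2, step=1)
selector.fit(X, y)

print(selector.support_)  # True for the features that were kept
print(selector.ranking_)  # 1 = kept; larger numbers were eliminated earlier
```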
Handling Missing Data
Real-world geoscience data is full of missing values: wells may lack certain logs, core samples may be incomplete, sensors fail.
| Strategy | Method | When to Use |
|---|---|---|
| **Drop rows** | Remove samples with missing values | Few missing values, large dataset |
| **Drop columns** | Remove features with many missing values | >50% of values missing in a column |
| **Mean/median imputation** | Replace missing values with the column mean or median | Numerical features, data missing at random |
| **Mode imputation** | Replace with the most frequent value | Categorical features |
| **Interpolation** | Linear or spline interpolation between known values | Well logs with occasional gaps (spatially ordered data) |
| **KNN imputation** | Use similar samples to estimate missing values | Complex patterns in missing data |
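Two of the strategies in the table, median imputation and KNN imputation, can be sketched with scikit-learn's imputers. The small matrix is made up, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Two features; the second is missing in one row.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [4.0, 40.0],
])

# Median imputation: fill with the column median of the observed values.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# KNN imputation: fill with the average of the two most similar rows,
# where similarity is measured on the features that are present.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(median_filled[1, 1])  # 30.0
print(knn_filled[1, 1])     # 20.0
```

As with scalers, imputers should be fit on the training data only and then applied to the test data.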
Geoscience example: A well has GR, RHOB, and RT logs, but NPHI is missing for a 50 m interval due to a tool failure. Linear interpolation from the surrounding NPHI values is appropriate because log values change smoothly with depth. However, if an entire well lacks a log, you might use a model trained on other wells to predict the missing log from the available logs.
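Depth-aware linear interpolation over a gap can be sketched with pandas. The depths and NPHI values are hypothetical, and the gap here is much shorter than 50 m to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical NPHI log sampled every 0.5 m, with a three-sample gap.
depth = np.arange(1000.0, 1005.0, 0.5)
nphi = pd.Series(
    [0.20, 0.21, 0.22, np.nan, np.nan, np.nan, 0.26, 0.27, 0.28, 0.29],
    index=depth,
)

# method="index" interpolates linearly in depth rather than by row position;
# limit_area="inside" fills only gaps bounded by valid values on both sides.
filled = nphi.interpolate(method="index", limit_area="inside")

print(filled)
```

Interpolating against the depth index matters when the sampling interval is irregular; with evenly spaced samples it gives the same result as positional linear interpolation.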
References
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 2 (end-to-end ML project, feature pipelines). O’Reilly.
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 5 (basis expansions) & ch. 14 (unsupervised learning). Springer.
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction, ch. 4 (data preprocessing & feature engineering). MIT Press.
- Karpatne, A., Atluri, G., Faghmous, J.H., et al. (2017). Theory-guided data science. IEEE Trans. Knowl. Data Eng. 29(10), 2318–2331.