Feature Engineering Practice

Chapter 9: Engineering Features from Geoscience Data

Learning objectives

Load and explore raw geoscience data with missing values
Handle missing data using SimpleImputer strategies (mean, median, most_frequent)
Apply one-hot encoding and label encoding using scikit-learn
Compare StandardScaler and MinMaxScaler on well-log data
Create domain-specific features: GR/RHOB ratio, NPHI-DPHI crossover
Use ColumnTransformer for mixed preprocessing pipelines
Perform feature selection with correlation matrix heatmaps
Build a complete sklearn Pipeline combining preprocessing and model

Step 1: Loading Raw Geoscience Data with Missing Values

Creating a Realistic Well-Log Dataset

Real geoscience data always has missing values — tool failures, incomplete logging runs, missing core samples. We start by loading and exploring such data.

%%%python import pandas as pd import numpy as np import matplotlib.pyplot as plt

Realistic well-log dataset with missing values

data = pd.DataFrame({ "depth": [1500, 1520, 1540, 1560, 1580, 1600, 1620, 1640, 1660, 1680, 1700, 1720, 1740, 1760, 1780, 1800], "GR": [45, 50, np.nan, 130, 42, 48, 22, 18, 55, np.nan, 115, 88, 35, 40, np.nan, 95], "RHOB": [2.35, 2.38, 2.55, np.nan, 2.32, 2.34, 2.71, 2.68, 2.40, 2.45, np.nan, 2.50, 2.30, 2.33, 2.70, 2.52], "NPHI": [0.18, 0.20, 0.32, 0.35, 0.15, 0.17, 0.02, 0.03, 0.22, 0.25, 0.30, np.nan, 0.14, 0.16, 0.01, 0.28], "RT": [15, 12, 4, 3, 18, 20, 200, 180, 10, 8, 5, 6, 22, np.nan, 250, 4], "formation": ["Delaware", "Delaware", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", "Ellenburger", "Ellenburger", "Delaware", "Wolfcamp", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", None, "Wolfcamp"], "grain_size": ["medium", "medium", "fine", "fine", "coarse", "coarse", "fine", "medium", "medium", "fine", "fine", "medium", "coarse", "medium", "fine", None], "lithology": ["Sandstone", "Sandstone", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Limestone", "Sandstone", "Shale", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Shale"] })

print(data.head(10)) %%%

Step 2: Exploring the Data

Using df.info() and df.isnull().sum()

Always start by checking the shape, data types, and missing value counts.

%%%python print("Shape:", data.shape) print("\nData types:") print(data.dtypes)

print("\nMissing values per column:") print(data.isnull().sum())

print("\nMissing percentage:") print((data.isnull().sum() / len(data) * 100).round(1))

print("\nDescriptive statistics (numerical columns):") print(data.describe())

print("\nCategorical value counts:") print("formation:", data["formation"].value_counts().to_dict()) print("grain_size:", data["grain_size"].value_counts().to_dict()) print("lithology:", data["lithology"].value_counts().to_dict()) %%%

Step 3: Handling Missing Data with SimpleImputer

Imputation Strategies: Mean, Median, Most Frequent

SimpleImputer replaces missing values with a computed statistic from the non-missing values in that column.

%%%python from sklearn.impute import SimpleImputer

--- Numerical imputation ---

Strategy: median (robust to outliers, common in geoscience)

num_cols = ["GR", "RHOB", "NPHI", "RT"] num_imputer = SimpleImputer(strategy="median") data[num_cols] = num_imputer.fit_transform(data[num_cols])

print("After numerical imputation:") print(data[num_cols].isnull().sum()) print("\nMedian values used for imputation:") for col in num_cols: print(f" {col}: {num_imputer.statistics_[num_cols.index(col)]:.3f}")

--- Categorical imputation ---

Strategy: most_frequent (mode)

cat_cols = ["formation", "grain_size"] cat_imputer = SimpleImputer(strategy="most_frequent") data[cat_cols] = cat_imputer.fit_transform(data[cat_cols])

print("\nAfter categorical imputation:") print(data[cat_cols].isnull().sum()) print("Most frequent values:", cat_imputer.statistics_) %%%

Comparing Imputation Strategies

Mean imputation is sensitive to outliers; median is robust. For skewed data like resistivity (RT), median is preferred.

%%%python

Demonstrate the difference on a skewed column

RT_raw = np.array([15, 12, 4, 3, 18, 20, 200, 180, 10, 8, 5, 6, 22, np.nan, 250, 4])

mean_val = np.nanmean(RT_raw) median_val = np.nanmedian(RT_raw)

print(f"RT mean: {mean_val:.1f} ohm-m (pulled high by outliers 200, 250)") print(f"RT median: {median_val:.1f} ohm-m (robust to outliers)") print("\nFor well-log data with spikes, median imputation is safer.") %%%

Step 4: One-Hot Encoding Categorical Features

Encoding Formation and Rock Type

Categorical features like formation name have no natural order, so we use one-hot encoding.

%%%python from sklearn.preprocessing import OneHotEncoder

Method 1: pandas get_dummies (quick and easy)

data_encoded = pd.get_dummies(data, columns=["formation"], drop_first=False) print("After one-hot encoding formation:") print(data_encoded.columns.tolist()) print(data_encoded.head())

Method 2: sklearn OneHotEncoder (for use in pipelines)

ohe = OneHotEncoder(sparse_output=False, drop=None) formation_encoded = ohe.fit_transform(data[["formation"]]) print("\nsklearn OneHotEncoder:") print(" Shape:", formation_encoded.shape) print(" Categories:", ohe.categories_[0].tolist()) print(" First 3 rows:\n", formation_encoded[:3]) %%%

Label Encoding for Ordinal Variables

Grain size has a natural order (fine < medium < coarse), so we use an ordinal mapping.

%%%python from sklearn.preprocessing import LabelEncoder

Manual mapping preserves the correct order

size_map = {"fine": 0, "medium": 1, "coarse": 2} data["grain_size_encoded"] = data["grain_size"].map(size_map) print("Manual ordinal encoding:") print(data[["grain_size", "grain_size_encoded"]].drop_duplicates())

WARNING: sklearn LabelEncoder assigns alphabetical order

le = LabelEncoder() data["grain_size_le"] = le.fit_transform(data["grain_size"]) print("\nsklearn LabelEncoder mapping:") print(dict(zip(le.classes_, le.transform(le.classes_)))) print(" coarse=0, fine=1, medium=2 — WRONG ORDER!") print(" Always use manual mapping for ordinal features.") %%%

Step 5: StandardScaler vs MinMaxScaler Comparison

Side-by-Side Scaling Comparison

StandardScaler centres data to mean=0, std=1. MinMaxScaler maps data to [0, 1]. RobustScaler uses median and IQR.

%%%python from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

num_data = data[["GR", "RHOB", "NPHI", "RT"]].values

scaler_std = StandardScaler() scaler_mm = MinMaxScaler() scaler_rob = RobustScaler()

data_std = scaler_std.fit_transform(num_data) data_mm = scaler_mm.fit_transform(num_data) data_rob = scaler_rob.fit_transform(num_data)

print("StandardScaler:") print(f" GR range: [{data_std[:, 0].min():.2f}, {data_std[:, 0].max():.2f}]") print(f" RT range: [{data_std[:, 3].min():.2f}, {data_std[:, 3].max():.2f}]")

print("\nMinMaxScaler:") print(f" GR range: [{data_mm[:, 0].min():.2f}, {data_mm[:, 0].max():.2f}]") print(f" RT range: [{data_mm[:, 3].min():.2f}, {data_mm[:, 3].max():.2f}]")

print("\nRobustScaler:") print(f" GR range: [{data_rob[:, 0].min():.2f}, {data_rob[:, 0].max():.2f}]") print(f" RT range: [{data_rob[:, 3].min():.2f}, {data_rob[:, 3].max():.2f}]") print("\nRobustScaler handles RT outliers (200, 250) best.") %%%

Visualizing the Effect of Scaling

%%%python fig, axes = plt.subplots(1, 3, figsize=(14, 4)) features = ["GR", "RHOB", "NPHI", "RT"]

for ax, scaled, title in zip(axes, [data_std, data_mm, data_rob], ["StandardScaler", "MinMaxScaler", "RobustScaler"]): ax.boxplot(scaled, tick_labels=features) ax.set_title(title) ax.axhline(y=0, color="red", linestyle="--", alpha=0.3)

plt.suptitle("Scaling Comparison on Well-Log Data") plt.tight_layout() plt.show() %%%

Step 6: Creating Domain-Specific Features

GR/RHOB Ratio and NPHI-DPHI Crossover

Domain knowledge in petrophysics tells us that certain feature combinations are more informative than raw measurements.

%%%python

GR/RHOB ratio: helps distinguish organic-rich shales

data["GR_RHOB_ratio"] = data["GR"] / data["RHOB"]

Density porosity from RHOB

DPHI = (rho_matrix - RHOB) / (rho_matrix - rho_fluid)

rho_matrix = 2.65 # quartz grain density (g/cc) rho_fluid = 1.0 # water density (g/cc) data["DPHI"] = (rho_matrix - data["RHOB"]) / (rho_matrix - rho_fluid)

NPHI - DPHI crossover: classic gas indicator

data["NPHI_DPHI_diff"] = data["NPHI"] - data["DPHI"] data["gas_flag"] = (data["NPHI_DPHI_diff"] > 0.05).astype(int)

GR-NPHI interaction: captures combined shale signal

data["GR_NPHI_product"] = data["GR"] * data["NPHI"]

print("New domain features:") print(data[["GR_RHOB_ratio", "DPHI", "NPHI_DPHI_diff", "gas_flag", "GR_NPHI_product"]].describe())

Visualize gas flag

print("\nGas flag distribution:") print(data["gas_flag"].value_counts().rename({0: "No gas", 1: "Gas"})) %%%

Step 7: ColumnTransformer for Mixed Pipelines

Combining Different Transformations

ColumnTransformer applies different preprocessing to different column types in a single step. This is essential for real-world data with mixed numerical and categorical features.

%%%python from sklearn.compose import ColumnTransformer from sklearn.pipeline import Pipeline from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler, OneHotEncoder

Define feature groups

numerical_features = ["depth", "GR", "RHOB", "NPHI", "RT"] categorical_features = ["formation"]

Create sub-pipelines for each feature type

numerical_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler()) ])

categorical_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore")) ])

Combine into a single ColumnTransformer

preprocessor = ColumnTransformer(transformers=[ ("num", numerical_transformer, numerical_features), ("cat", categorical_transformer, categorical_features) ])

print("Preprocessor defined with:") print(f" Numerical ({len(numerical_features)} cols): Impute(median) -> StandardScaler") print(f" Categorical ({len(categorical_features)} cols): Impute(mode) -> OneHotEncoder") %%%

Step 8: Feature Selection with Correlation Matrix

Correlation Heatmap and Dropping Redundant Features

Features highly correlated with each other ( $|\rho| > 0.9$ ) are redundant. Keep one, drop the rest.

%%%python import seaborn as sns

Compute correlation matrix for numerical features

corr_cols = ["GR", "RHOB", "NPHI", "RT", "GR_RHOB_ratio", "DPHI", "GR_NPHI_product"] corr_matrix = data[corr_cols].corr()

plt.figure(figsize=(9, 7)) sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0, fmt=".2f", square=True, linewidths=0.5) plt.title("Feature Correlation Matrix — Well-Log Data") plt.tight_layout() plt.show()

Find highly correlated pairs

print("\nHighly correlated feature pairs (|r| > 0.85):") for i in range(len(corr_matrix.columns)): for j in range(i+1, len(corr_matrix.columns)): r = corr_matrix.iloc[i, j] if abs(r) > 0.85: print(f" {corr_matrix.columns[i]} vs {corr_matrix.columns[j]}: r = {r:.3f}")

print("\nAction: drop one feature from each highly correlated pair.") %%%

Step 9: Complete Pipeline — Preprocessing + Model

End-to-End Pipeline with sklearn

A Pipeline chains preprocessing and model training into a single object. This ensures preprocessing is always applied consistently and prevents data leakage.

%%%python from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split, cross_val_score from sklearn.metrics import accuracy_score, classification_report

Prepare data (use original data before our manual feature engineering)

raw_data = pd.DataFrame({ "depth": [1500, 1520, 1540, 1560, 1580, 1600, 1620, 1640, 1660, 1680, 1700, 1720, 1740, 1760, 1780, 1800], "GR": [45, 50, np.nan, 130, 42, 48, 22, 18, 55, np.nan, 115, 88, 35, 40, np.nan, 95], "RHOB": [2.35, 2.38, 2.55, np.nan, 2.32, 2.34, 2.71, 2.68, 2.40, 2.45, np.nan, 2.50, 2.30, 2.33, 2.70, 2.52], "NPHI": [0.18, 0.20, 0.32, 0.35, 0.15, 0.17, 0.02, 0.03, 0.22, 0.25, 0.30, np.nan, 0.14, 0.16, 0.01, 0.28], "RT": [15, 12, 4, 3, 18, 20, 200, 180, 10, 8, 5, 6, 22, np.nan, 250, 4], "formation": ["Delaware", "Delaware", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", "Ellenburger", "Ellenburger", "Delaware", "Wolfcamp", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", None, "Wolfcamp"], "lithology": ["Sandstone", "Sandstone", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Limestone", "Sandstone", "Shale", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Shale"] })

X = raw_data[numerical_features + categorical_features] y = raw_data["lithology"]

Full pipeline: preprocessing + model

pipeline = Pipeline(steps=[ ("preprocessor", preprocessor), ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)) ])

Cross-validation (more robust than single split for small data)

scores = cross_val_score(pipeline, X, y, cv=3, scoring="accuracy") print(f"3-Fold CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")

Train on all data and show feature names after transformation

pipeline.fit(X, y) transformed = pipeline.named_steps["preprocessor"].transform(X) print(f"\nOriginal features: {X.shape[1]}") print(f"After preprocessing: {transformed.shape[1]}") print(" (numerical features scaled + formation one-hot encoded)") %%%

References

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
McKinney, W. (2017). Python for Data Analysis (2nd ed.), ch. 7 (data cleaning and preparation). O’Reilly.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 2 (preparing the data for ML algorithms). O’Reilly.