Feature Engineering Practice
Learning objectives
- Load and explore raw geoscience data with missing values
- Handle missing data using SimpleImputer strategies (mean, median, most_frequent)
- Apply one-hot encoding and label encoding using scikit-learn
- Compare StandardScaler and MinMaxScaler on well-log data
- Create domain-specific features: GR/RHOB ratio, NPHI-DPHI crossover
- Use ColumnTransformer for mixed preprocessing pipelines
- Perform feature selection with correlation matrix heatmaps
- Build a complete sklearn Pipeline combining preprocessing and model
Step 1: Loading Raw Geoscience Data with Missing Values
Creating a Realistic Well-Log Dataset
Real geoscience data always has missing values — tool failures, incomplete logging runs, missing core samples. We start by loading and exploring such data.
%%%python import pandas as pd import numpy as np import matplotlib.pyplot as plt
Realistic well-log dataset with missing values
data = pd.DataFrame({ "depth": [1500, 1520, 1540, 1560, 1580, 1600, 1620, 1640, 1660, 1680, 1700, 1720, 1740, 1760, 1780, 1800], "GR": [45, 50, np.nan, 130, 42, 48, 22, 18, 55, np.nan, 115, 88, 35, 40, np.nan, 95], "RHOB": [2.35, 2.38, 2.55, np.nan, 2.32, 2.34, 2.71, 2.68, 2.40, 2.45, np.nan, 2.50, 2.30, 2.33, 2.70, 2.52], "NPHI": [0.18, 0.20, 0.32, 0.35, 0.15, 0.17, 0.02, 0.03, 0.22, 0.25, 0.30, np.nan, 0.14, 0.16, 0.01, 0.28], "RT": [15, 12, 4, 3, 18, 20, 200, 180, 10, 8, 5, 6, 22, np.nan, 250, 4], "formation": ["Delaware", "Delaware", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", "Ellenburger", "Ellenburger", "Delaware", "Wolfcamp", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", None, "Wolfcamp"], "grain_size": ["medium", "medium", "fine", "fine", "coarse", "coarse", "fine", "medium", "medium", "fine", "fine", "medium", "coarse", "medium", "fine", None], "lithology": ["Sandstone", "Sandstone", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Limestone", "Sandstone", "Shale", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Shale"] })
print(data.head(10)) %%%
Step 2: Exploring the Data
Using df.info() and df.isnull().sum()
Always start by checking the shape, data types, and missing value counts.
%%%python print("Shape:", data.shape) print("\nData types:") print(data.dtypes)
print("\nMissing values per column:") print(data.isnull().sum())
print("\nMissing percentage:") print((data.isnull().sum() / len(data) * 100).round(1))
print("\nDescriptive statistics (numerical columns):") print(data.describe())
print("\nCategorical value counts:") print("formation:", data["formation"].value_counts().to_dict()) print("grain_size:", data["grain_size"].value_counts().to_dict()) print("lithology:", data["lithology"].value_counts().to_dict()) %%%
Step 3: Handling Missing Data with SimpleImputer
Imputation Strategies: Mean, Median, Most Frequent
SimpleImputer replaces missing values with a computed statistic from the non-missing values in that column.
%%%python from sklearn.impute import SimpleImputer
--- Numerical imputation ---
Strategy: median (robust to outliers, common in geoscience)
num_cols = ["GR", "RHOB", "NPHI", "RT"] num_imputer = SimpleImputer(strategy="median") data[num_cols] = num_imputer.fit_transform(data[num_cols])
print("After numerical imputation:") print(data[num_cols].isnull().sum()) print("\nMedian values used for imputation:") for col in num_cols: print(f" {col}: {num_imputer.statistics_[num_cols.index(col)]:.3f}")
--- Categorical imputation ---
Strategy: most_frequent (mode)
cat_cols = ["formation", "grain_size"] cat_imputer = SimpleImputer(strategy="most_frequent") data[cat_cols] = cat_imputer.fit_transform(data[cat_cols])
print("\nAfter categorical imputation:") print(data[cat_cols].isnull().sum()) print("Most frequent values:", cat_imputer.statistics_) %%%
Comparing Imputation Strategies
Mean imputation is sensitive to outliers; median is robust. For skewed data like resistivity (RT), median is preferred.
%%%python
Demonstrate the difference on a skewed column
RT_raw = np.array([15, 12, 4, 3, 18, 20, 200, 180, 10, 8, 5, 6, 22, np.nan, 250, 4])
mean_val = np.nanmean(RT_raw) median_val = np.nanmedian(RT_raw)
print(f"RT mean: {mean_val:.1f} ohm-m (pulled high by outliers 200, 250)") print(f"RT median: {median_val:.1f} ohm-m (robust to outliers)") print("\nFor well-log data with spikes, median imputation is safer.") %%%
Step 4: One-Hot Encoding Categorical Features
Encoding Formation and Rock Type
Categorical features like formation name have no natural order, so we use one-hot encoding.
%%%python from sklearn.preprocessing import OneHotEncoder
Method 1: pandas get_dummies (quick and easy)
data_encoded = pd.get_dummies(data, columns=["formation"], drop_first=False) print("After one-hot encoding formation:") print(data_encoded.columns.tolist()) print(data_encoded.head())
Method 2: sklearn OneHotEncoder (for use in pipelines)
ohe = OneHotEncoder(sparse_output=False, drop=None) formation_encoded = ohe.fit_transform(data[["formation"]]) print("\nsklearn OneHotEncoder:") print(" Shape:", formation_encoded.shape) print(" Categories:", ohe.categories_[0].tolist()) print(" First 3 rows:\n", formation_encoded[:3]) %%%
Label Encoding for Ordinal Variables
Grain size has a natural order (fine < medium < coarse), so we use an ordinal mapping.
%%%python from sklearn.preprocessing import LabelEncoder
Manual mapping preserves the correct order
size_map = {"fine": 0, "medium": 1, "coarse": 2} data["grain_size_encoded"] = data["grain_size"].map(size_map) print("Manual ordinal encoding:") print(data[["grain_size", "grain_size_encoded"]].drop_duplicates())
WARNING: sklearn LabelEncoder assigns alphabetical order
le = LabelEncoder() data["grain_size_le"] = le.fit_transform(data["grain_size"]) print("\nsklearn LabelEncoder mapping:") print(dict(zip(le.classes_, le.transform(le.classes_)))) print(" coarse=0, fine=1, medium=2 — WRONG ORDER!") print(" Always use manual mapping for ordinal features.") %%%
Step 5: StandardScaler vs MinMaxScaler Comparison
Side-by-Side Scaling Comparison
StandardScaler centres data to mean=0, std=1. MinMaxScaler maps data to [0, 1]. RobustScaler uses median and IQR.
%%%python from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
num_data = data[["GR", "RHOB", "NPHI", "RT"]].values
scaler_std = StandardScaler() scaler_mm = MinMaxScaler() scaler_rob = RobustScaler()
data_std = scaler_std.fit_transform(num_data) data_mm = scaler_mm.fit_transform(num_data) data_rob = scaler_rob.fit_transform(num_data)
print("StandardScaler:") print(f" GR range: [{data_std[:, 0].min():.2f}, {data_std[:, 0].max():.2f}]") print(f" RT range: [{data_std[:, 3].min():.2f}, {data_std[:, 3].max():.2f}]")
print("\nMinMaxScaler:") print(f" GR range: [{data_mm[:, 0].min():.2f}, {data_mm[:, 0].max():.2f}]") print(f" RT range: [{data_mm[:, 3].min():.2f}, {data_mm[:, 3].max():.2f}]")
print("\nRobustScaler:") print(f" GR range: [{data_rob[:, 0].min():.2f}, {data_rob[:, 0].max():.2f}]") print(f" RT range: [{data_rob[:, 3].min():.2f}, {data_rob[:, 3].max():.2f}]") print("\nRobustScaler handles RT outliers (200, 250) best.") %%%
Visualizing the Effect of Scaling
%%%python fig, axes = plt.subplots(1, 3, figsize=(14, 4)) features = ["GR", "RHOB", "NPHI", "RT"]
for ax, scaled, title in zip(axes, [data_std, data_mm, data_rob], ["StandardScaler", "MinMaxScaler", "RobustScaler"]): ax.boxplot(scaled, tick_labels=features) ax.set_title(title) ax.axhline(y=0, color="red", linestyle="--", alpha=0.3)
plt.suptitle("Scaling Comparison on Well-Log Data") plt.tight_layout() plt.show() %%%
Step 6: Creating Domain-Specific Features
GR/RHOB Ratio and NPHI-DPHI Crossover
Domain knowledge in petrophysics tells us that certain feature combinations are more informative than raw measurements.
%%%python
GR/RHOB ratio: helps distinguish organic-rich shales
data["GR_RHOB_ratio"] = data["GR"] / data["RHOB"]
Density porosity from RHOB
DPHI = (rho_matrix - RHOB) / (rho_matrix - rho_fluid)
rho_matrix = 2.65 # quartz grain density (g/cc) rho_fluid = 1.0 # water density (g/cc) data["DPHI"] = (rho_matrix - data["RHOB"]) / (rho_matrix - rho_fluid)
NPHI - DPHI crossover: classic gas indicator
data["NPHI_DPHI_diff"] = data["NPHI"] - data["DPHI"] data["gas_flag"] = (data["NPHI_DPHI_diff"] > 0.05).astype(int)
GR-NPHI interaction: captures combined shale signal
data["GR_NPHI_product"] = data["GR"] * data["NPHI"]
print("New domain features:") print(data[["GR_RHOB_ratio", "DPHI", "NPHI_DPHI_diff", "gas_flag", "GR_NPHI_product"]].describe())
Visualize gas flag
print("\nGas flag distribution:") print(data["gas_flag"].value_counts().rename({0: "No gas", 1: "Gas"})) %%%
Step 7: ColumnTransformer for Mixed Pipelines
Combining Different Transformations
ColumnTransformer applies different preprocessing to different column types in a single step. This is essential for real-world data with mixed numerical and categorical features.
%%%python from sklearn.compose import ColumnTransformer from sklearn.pipeline import Pipeline from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler, OneHotEncoder
Define feature groups
numerical_features = ["depth", "GR", "RHOB", "NPHI", "RT"] categorical_features = ["formation"]
Create sub-pipelines for each feature type
numerical_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler()) ])
categorical_transformer = Pipeline(steps=[ ("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore")) ])
Combine into a single ColumnTransformer
preprocessor = ColumnTransformer(transformers=[ ("num", numerical_transformer, numerical_features), ("cat", categorical_transformer, categorical_features) ])
print("Preprocessor defined with:") print(f" Numerical ({len(numerical_features)} cols): Impute(median) -> StandardScaler") print(f" Categorical ({len(categorical_features)} cols): Impute(mode) -> OneHotEncoder") %%%
Step 8: Feature Selection with Correlation Matrix
Correlation Heatmap and Dropping Redundant Features
Features highly correlated with each other () are redundant. Keep one, drop the rest.
%%%python import seaborn as sns
Compute correlation matrix for numerical features
corr_cols = ["GR", "RHOB", "NPHI", "RT", "GR_RHOB_ratio", "DPHI", "GR_NPHI_product"] corr_matrix = data[corr_cols].corr()
plt.figure(figsize=(9, 7)) sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", center=0, fmt=".2f", square=True, linewidths=0.5) plt.title("Feature Correlation Matrix — Well-Log Data") plt.tight_layout() plt.show()
Find highly correlated pairs
print("\nHighly correlated feature pairs (|r| > 0.85):") for i in range(len(corr_matrix.columns)): for j in range(i+1, len(corr_matrix.columns)): r = corr_matrix.iloc[i, j] if abs(r) > 0.85: print(f" {corr_matrix.columns[i]} vs {corr_matrix.columns[j]}: r = {r:.3f}")
print("\nAction: drop one feature from each highly correlated pair.") %%%
Step 9: Complete Pipeline — Preprocessing + Model
End-to-End Pipeline with sklearn
A Pipeline chains preprocessing and model training into a single object. This ensures preprocessing is always applied consistently and prevents data leakage.
%%%python from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split, cross_val_score from sklearn.metrics import accuracy_score, classification_report
Prepare data (use original data before our manual feature engineering)
raw_data = pd.DataFrame({ "depth": [1500, 1520, 1540, 1560, 1580, 1600, 1620, 1640, 1660, 1680, 1700, 1720, 1740, 1760, 1780, 1800], "GR": [45, 50, np.nan, 130, 42, 48, 22, 18, 55, np.nan, 115, 88, 35, 40, np.nan, 95], "RHOB": [2.35, 2.38, 2.55, np.nan, 2.32, 2.34, 2.71, 2.68, 2.40, 2.45, np.nan, 2.50, 2.30, 2.33, 2.70, 2.52], "NPHI": [0.18, 0.20, 0.32, 0.35, 0.15, 0.17, 0.02, 0.03, 0.22, 0.25, 0.30, np.nan, 0.14, 0.16, 0.01, 0.28], "RT": [15, 12, 4, 3, 18, 20, 200, 180, 10, 8, 5, 6, 22, np.nan, 250, 4], "formation": ["Delaware", "Delaware", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", "Ellenburger", "Ellenburger", "Delaware", "Wolfcamp", "Wolfcamp", "Wolfcamp", "Delaware", "Delaware", None, "Wolfcamp"], "lithology": ["Sandstone", "Sandstone", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Limestone", "Sandstone", "Shale", "Shale", "Shale", "Sandstone", "Sandstone", "Limestone", "Shale"] })
X = raw_data[numerical_features + categorical_features] y = raw_data["lithology"]
Full pipeline: preprocessing + model
pipeline = Pipeline(steps=[ ("preprocessor", preprocessor), ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)) ])
Cross-validation (more robust than single split for small data)
scores = cross_val_score(pipeline, X, y, cv=3, scoring="accuracy") print(f"3-Fold CV Accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
Train on all data and show feature names after transformation
pipeline.fit(X, y) transformed = pipeline.named_steps["preprocessor"].transform(X) print(f"\nOriginal features: {X.shape[1]}") print(f"After preprocessing: {transformed.shape[1]}") print(" (numerical features scaled + formation one-hot encoded)") %%%
References
- Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
- McKinney, W. (2017). Python for Data Analysis (2nd ed.), ch. 7 (data cleaning and preparation). O’Reilly.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 2 (preparing the data for ML algorithms). O’Reilly.