Dimensionality Reduction Practice

Chapter 11: Reducing Dimensions — PCA and Beyond

Learning objectives

Apply PCA using scikit-learn and interpret the results
Plot explained variance ratio and choose the number of components
Use t-SNE for 2D visualization of multi-dimensional data
Compare PCA and t-SNE projections
Interpret PCA loadings in a geoscience context
Apply UMAP as a modern alternative to t-SNE

Loading High-Dimensional Well Log Data

Well log data is a perfect candidate for dimensionality reduction because wells are typically logged with many correlated measurements. Before applying PCA, we must load, inspect, and clean the data. Common well-log features include gamma ray (GR), spontaneous potential (SP), bulk density (RHOB), neutron porosity (NPHI), sonic transit time (DT), and resistivity (RT). Since these features span wildly different scales (GR in 0–200 API, NPHI in 0–0.45 fraction, RT in 0.1–10000 ohm-m), standardization is critical.

%%%python import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler

Simulated well-log data: 200 samples, 6 features

np.random.seed(42) n = 200

Underlying factors: lithology and porosity

lithology = np.random.randn(n) porosity = np.random.randn(n)

Observable features are combinations of these factors + noise

GR = 80 + 40 * lithology + 5 * np.random.randn(n) SP = -60 + 25 * lithology + 3 * np.random.randn(n) RHOB = 2.5 - 0.15 * porosity + 0.02 * np.random.randn(n) NPHI = 0.15 + 0.1 * lithology + 0.08 * porosity + 0.01 * np.random.randn(n) DT = 80 + 20 * porosity + 3 * np.random.randn(n) RT = np.exp(3 + 0.5 * lithology - 0.8 * porosity + 0.3 * np.random.randn(n))

X = np.column_stack([GR, SP, RHOB, NPHI, DT, np.log(RT)]) feature_names = ["GR", "SP", "RHOB", "NPHI", "DT", "log(RT)"]

Create DataFrame for inspection

df = pd.DataFrame(X, columns=feature_names) print("Shape:", df.shape) print("\nSummary statistics:") print(df.describe().round(3)) print("\nCorrelation matrix:") print(df.corr().round(3)) %%%

Why Standardize Before PCA?

PCA finds the directions of maximum variance. If features are on different scales, the feature with the largest absolute values (e.g., GR ranging 0–200) will dominate the first principal component simply because it has the most variance in raw units. Standardizing transforms every feature to have mean 0 and standard deviation 1, putting them on equal footing so PCA captures correlation structure rather than scale differences.

StandardScaler: What It Does

For each feature $x_i$ , the scaler computes: $z_i = \frac{x_i - \bar{x}_i}{s_i}$

where $\bar{x}_i$ is the training mean and $s_i$ is the training standard deviation. After scaling, each feature has $\mu = 0, \sigma = 1$ .

Critical: fit the scaler on training data only, then use the same transform on test data to avoid data leakage.

%%%python # IMPORTANT: always standardize before PCA scaler = StandardScaler() X_scaled = scaler.fit_transform(X)

Verify: each feature now has mean~0, std~1

print("Means after scaling:", np.round(X_scaled.mean(axis=0), 6)) print("Stds after scaling: ", np.round(X_scaled.std(axis=0), 6))

Fit PCA (keep all components to see the full variance spectrum)

pca = PCA() pca.fit(X_scaled)

print("\nExplained variance ratios:", np.round(pca.explained_variance_ratio_, 4)) print("Cumulative:", np.round(np.cumsum(pca.explained_variance_ratio_), 4)) %%%

Scree Plot and Component Selection

The scree plot shows the explained variance for each principal component. We look for an "elbow" — the point where adding more components provides diminishing returns. The cumulative variance plot helps us choose the number of components needed to retain a target amount of variance (commonly 90% or 95%).

%%%python fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

Scree plot

ax1.bar(range(1, 7), pca.explained_variance_ratio_, color="steelblue", alpha=0.7) ax1.set_xlabel("Principal Component") ax1.set_ylabel("Explained Variance Ratio") ax1.set_title("Scree Plot") ax1.set_xticks(range(1, 7))

Cumulative explained variance

cumulative = np.cumsum(pca.explained_variance_ratio_) ax2.plot(range(1, 7), cumulative, "o-", color="darkorange") ax2.axhline(y=0.95, color="red", linestyle="--", label="95% threshold") ax2.axhline(y=0.90, color="blue", linestyle=":", label="90% threshold") ax2.set_xlabel("Number of Components") ax2.set_ylabel("Cumulative Explained Variance") ax2.set_title("Cumulative Explained Variance") ax2.set_xticks(range(1, 7)) ax2.legend()

plt.tight_layout() plt.show()

Find components for 90% and 95% variance

for threshold in [0.90, 0.95]: for k in range(1, 7): if cumulative[k-1] >= threshold: print(f"Need {k} components for {threshold:.0%} variance (cumulative: {cumulative[k-1]:.4f})") break %%%

Interpreting PCA Loadings

The loadings (stored in pca.components_) tell us how each original feature contributes to each principal component. Large positive or negative loadings mean a feature strongly influences that PC. By examining loadings, we can attach physical meaning to each component.

%%%python

The loadings tell us which features contribute to each PC

loadings = pca.components_ # shape: (n_components, n_features)

print("\nPC Loadings (rows=PCs, cols=features):") header = " " for f in feature_names: header += f"{f:>10}" print(header) for i in range(3): row = f" PC{i+1}: " for j in range(len(feature_names)): row += f"{loadings[i][j]:>10.3f}" print(row)

Visualize loadings as a heatmap

fig, ax = plt.subplots(figsize=(10, 4)) im = ax.imshow(loadings[:3], cmap="coolwarm", aspect="auto", vmin=-0.7, vmax=0.7) ax.set_yticks(range(3)) ax.set_yticklabels(["PC1", "PC2", "PC3"]) ax.set_xticks(range(len(feature_names))) ax.set_xticklabels(feature_names) for i in range(3): for j in range(len(feature_names)): ax.text(j, i, f"{loadings[i][j]:.2f}", ha="center", va="center", color="white" if abs(loadings[i][j]) > 0.4 else "black", fontsize=9) plt.colorbar(im, label="Loading") plt.title("PCA Loadings Heatmap - First 3 Components") plt.tight_layout() plt.show() %%%

Biplot: Samples and Loadings Together

A biplot overlays the PCA projection of samples with arrows showing how original features map into the PC space. Arrows pointing in the same direction indicate correlated features; arrows at 90 degrees indicate uncorrelated features.

%%%python

PCA biplot: samples + feature loading arrows

pca_2d = PCA(n_components=2) X_pca = pca_2d.fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(10, 8)) scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=lithology, cmap="viridis", alpha=0.5, s=20) plt.colorbar(scatter, label="Lithology Factor")

Overlay loading arrows (scaled for visibility)

scale = 3.0 for i, fname in enumerate(feature_names): ax.arrow(0, 0, pca_2d.components_[0, i] * scale, pca_2d.components_[1, i] * scale, head_width=0.08, head_length=0.05, fc="red", ec="red") ax.text(pca_2d.components_[0, i] * scale * 1.15, pca_2d.components_[1, i] * scale * 1.15, fname, color="red", fontsize=11, fontweight="bold")

ax.set_xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)") ax.set_ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)") ax.set_title("PCA Biplot - Well-Log Data") ax.axhline(0, color="gray", linewidth=0.5) ax.axvline(0, color="gray", linewidth=0.5) plt.tight_layout() plt.show() %%%

2D Projection with PCA

%%%python

Project to 2 dimensions

X_pca = pca_2d.fit_transform(X_scaled)

Color by a known property (e.g., lithology factor)

plt.figure(figsize=(8, 6)) scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=lithology, cmap="viridis", alpha=0.7) plt.colorbar(scatter, label="Lithology Factor") plt.xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)") plt.ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)") plt.title("PCA Projection - Well-Log Data") plt.tight_layout() plt.show() %%%

t-SNE Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear method designed specifically for visualization. Unlike PCA, t-SNE preserves local neighborhood structure: points that are close in high-dimensional space remain close in the 2D embedding. However, t-SNE does not preserve global distances or density, and axes have no physical meaning.

%%%python from sklearn.manifold import TSNE

Reduce to 2D with t-SNE

tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_iter=1000) X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(8, 6)) scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=lithology, cmap="viridis", alpha=0.7) plt.colorbar(scatter, label="Lithology Factor") plt.xlabel("t-SNE Dimension 1") plt.ylabel("t-SNE Dimension 2") plt.title("t-SNE Projection - Well-Log Data (perplexity=30)") plt.tight_layout() plt.show() %%%

Comparing Perplexity Values

The perplexity parameter controls how many effective neighbors each point considers. Low perplexity (5) focuses on very local structure; high perplexity (50) incorporates more global structure. Always compare multiple perplexity values to ensure conclusions are robust.

%%%python fig, axes = plt.subplots(1, 4, figsize=(22, 5)) perplexities = [5, 15, 30, 50]

for ax, perp in zip(axes, perplexities): tsne_temp = TSNE(n_components=2, perplexity=perp, random_state=42, n_iter=1000) X_temp = tsne_temp.fit_transform(X_scaled) scatter = ax.scatter(X_temp[:, 0], X_temp[:, 1], c=lithology, cmap="viridis", alpha=0.7, s=20) ax.set_title(f"Perplexity = {perp}") ax.set_xlabel("Dim 1") ax.set_ylabel("Dim 2")

plt.suptitle("t-SNE: Effect of Perplexity Parameter") plt.tight_layout() plt.show() %%%

PCA vs t-SNE: Side-by-Side Comparison

Comparing PCA and t-SNE reveals their strengths. PCA preserves global variance and produces axes with clear physical interpretation but cannot capture non-linear structures. t-SNE excels at revealing cluster structure and non-linear relationships but sacrifices global distance information.

%%%python

Side by side comparison: PCA vs t-SNE

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

PCA

ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=lithology, cmap="viridis", alpha=0.7, s=25) ax1.set_xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})") ax1.set_ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})") ax1.set_title("PCA (linear, global structure)")

t-SNE

ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=lithology, cmap="viridis", alpha=0.7, s=25) ax2.set_xlabel("t-SNE Dim 1") ax2.set_ylabel("t-SNE Dim 2") ax2.set_title("t-SNE (non-linear, local structure)")

plt.suptitle("PCA vs t-SNE Projections - Well-Log Data", fontsize=14) plt.tight_layout() plt.show()

print("Key differences:") print("- PCA axes have meaning (% variance); t-SNE axes are arbitrary") print("- PCA preserves global distances; t-SNE preserves local neighborhoods") print("- PCA is deterministic; t-SNE varies with random_state") print("- PCA can transform new data; t-SNE must re-run on the whole dataset") %%%

Applying PCA to New / Test Data

A critical practical distinction: PCA can transform new data using the learned components, but t-SNE cannot. This makes PCA suitable for preprocessing pipelines, while t-SNE is visualization-only.

%%%python

Split into train/test

from sklearn.model_selection import train_test_split X_train, X_test = train_test_split(X_scaled, test_size=0.3, random_state=42)

Fit PCA on training data ONLY

pca_pipe = PCA(n_components=3) pca_pipe.fit(X_train)

Transform both train and test using the SAME fitted PCA

X_train_pca = pca_pipe.transform(X_train) X_test_pca = pca_pipe.transform(X_test)

print("Train PCA shape:", X_train_pca.shape) print("Test PCA shape: ", X_test_pca.shape) print("Explained variance:", np.round(pca_pipe.explained_variance_ratio_, 4)) print("Total variance retained:", f"{sum(pca_pipe.explained_variance_ratio_):.2%}") %%%

References

Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 8 (dimensionality reduction). O’Reilly.
van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.