Dimensionality Reduction Practice
Learning objectives
- Apply PCA using scikit-learn and interpret the results
- Plot explained variance ratio and choose the number of components
- Use t-SNE for 2D visualization of multi-dimensional data
- Compare PCA and t-SNE projections
- Interpret PCA loadings in a geoscience context
- Apply UMAP as a modern alternative to t-SNE
Loading High-Dimensional Well Log Data
Well log data is a perfect candidate for dimensionality reduction because wells are typically logged with many correlated measurements. Before applying PCA, we must load, inspect, and clean the data. Common well-log features include gamma ray (GR), spontaneous potential (SP), bulk density (RHOB), neutron porosity (NPHI), sonic transit time (DT), and resistivity (RT). Since these features span wildly different scales (GR in 0–200 API, NPHI in 0–0.45 fraction, RT in 0.1–10000 ohm-m), standardization is critical.
%%%python import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler
Simulated well-log data: 200 samples, 6 features
np.random.seed(42) n = 200
Underlying factors: lithology and porosity
lithology = np.random.randn(n) porosity = np.random.randn(n)
Observable features are combinations of these factors + noise
GR = 80 + 40 * lithology + 5 * np.random.randn(n) SP = -60 + 25 * lithology + 3 * np.random.randn(n) RHOB = 2.5 - 0.15 * porosity + 0.02 * np.random.randn(n) NPHI = 0.15 + 0.1 * lithology + 0.08 * porosity + 0.01 * np.random.randn(n) DT = 80 + 20 * porosity + 3 * np.random.randn(n) RT = np.exp(3 + 0.5 * lithology - 0.8 * porosity + 0.3 * np.random.randn(n))
X = np.column_stack([GR, SP, RHOB, NPHI, DT, np.log(RT)]) feature_names = ["GR", "SP", "RHOB", "NPHI", "DT", "log(RT)"]
Create DataFrame for inspection
df = pd.DataFrame(X, columns=feature_names) print("Shape:", df.shape) print("\nSummary statistics:") print(df.describe().round(3)) print("\nCorrelation matrix:") print(df.corr().round(3)) %%%
Why Standardize Before PCA?
PCA finds the directions of maximum variance. If features are on different scales, the feature with the largest absolute values (e.g., GR ranging 0–200) will dominate the first principal component simply because it has the most variance in raw units. Standardizing transforms every feature to have mean 0 and standard deviation 1, putting them on equal footing so PCA captures correlation structure rather than scale differences.
StandardScaler: What It Does
For each feature , the scaler computes:
where is the training mean and is the training standard deviation. After scaling, each feature has .
Critical: fit the scaler on training data only, then use the same transform on test data to avoid data leakage.
Verify: each feature now has mean~0, std~1
print("Means after scaling:", np.round(X_scaled.mean(axis=0), 6)) print("Stds after scaling: ", np.round(X_scaled.std(axis=0), 6))
Fit PCA (keep all components to see the full variance spectrum)
pca = PCA() pca.fit(X_scaled)
print("\nExplained variance ratios:", np.round(pca.explained_variance_ratio_, 4)) print("Cumulative:", np.round(np.cumsum(pca.explained_variance_ratio_), 4)) %%%
Scree Plot and Component Selection
The scree plot shows the explained variance for each principal component. We look for an "elbow" — the point where adding more components provides diminishing returns. The cumulative variance plot helps us choose the number of components needed to retain a target amount of variance (commonly 90% or 95%).
%%%python fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
Scree plot
ax1.bar(range(1, 7), pca.explained_variance_ratio_, color="steelblue", alpha=0.7) ax1.set_xlabel("Principal Component") ax1.set_ylabel("Explained Variance Ratio") ax1.set_title("Scree Plot") ax1.set_xticks(range(1, 7))
Cumulative explained variance
cumulative = np.cumsum(pca.explained_variance_ratio_) ax2.plot(range(1, 7), cumulative, "o-", color="darkorange") ax2.axhline(y=0.95, color="red", linestyle="--", label="95% threshold") ax2.axhline(y=0.90, color="blue", linestyle=":", label="90% threshold") ax2.set_xlabel("Number of Components") ax2.set_ylabel("Cumulative Explained Variance") ax2.set_title("Cumulative Explained Variance") ax2.set_xticks(range(1, 7)) ax2.legend()
plt.tight_layout() plt.show()
Find components for 90% and 95% variance
for threshold in [0.90, 0.95]: for k in range(1, 7): if cumulative[k-1] >= threshold: print(f"Need {k} components for {threshold:.0%} variance (cumulative: {cumulative[k-1]:.4f})") break %%%
Interpreting PCA Loadings
The loadings (stored in pca.components_) tell us how each original feature contributes to each principal component. Large positive or negative loadings mean a feature strongly influences that PC. By examining loadings, we can attach physical meaning to each component.
%%%python
The loadings tell us which features contribute to each PC
loadings = pca.components_ # shape: (n_components, n_features)
print("\nPC Loadings (rows=PCs, cols=features):") header = " " for f in feature_names: header += f"{f:>10}" print(header) for i in range(3): row = f" PC{i+1}: " for j in range(len(feature_names)): row += f"{loadings[i][j]:>10.3f}" print(row)
Visualize loadings as a heatmap
fig, ax = plt.subplots(figsize=(10, 4)) im = ax.imshow(loadings[:3], cmap="coolwarm", aspect="auto", vmin=-0.7, vmax=0.7) ax.set_yticks(range(3)) ax.set_yticklabels(["PC1", "PC2", "PC3"]) ax.set_xticks(range(len(feature_names))) ax.set_xticklabels(feature_names) for i in range(3): for j in range(len(feature_names)): ax.text(j, i, f"{loadings[i][j]:.2f}", ha="center", va="center", color="white" if abs(loadings[i][j]) > 0.4 else "black", fontsize=9) plt.colorbar(im, label="Loading") plt.title("PCA Loadings Heatmap - First 3 Components") plt.tight_layout() plt.show() %%%
Biplot: Samples and Loadings Together
A biplot overlays the PCA projection of samples with arrows showing how original features map into the PC space. Arrows pointing in the same direction indicate correlated features; arrows at 90 degrees indicate uncorrelated features.
%%%python
PCA biplot: samples + feature loading arrows
pca_2d = PCA(n_components=2) X_pca = pca_2d.fit_transform(X_scaled)
fig, ax = plt.subplots(figsize=(10, 8)) scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=lithology, cmap="viridis", alpha=0.5, s=20) plt.colorbar(scatter, label="Lithology Factor")
Overlay loading arrows (scaled for visibility)
scale = 3.0 for i, fname in enumerate(feature_names): ax.arrow(0, 0, pca_2d.components_[0, i] * scale, pca_2d.components_[1, i] * scale, head_width=0.08, head_length=0.05, fc="red", ec="red") ax.text(pca_2d.components_[0, i] * scale * 1.15, pca_2d.components_[1, i] * scale * 1.15, fname, color="red", fontsize=11, fontweight="bold")
ax.set_xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)") ax.set_ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)") ax.set_title("PCA Biplot - Well-Log Data") ax.axhline(0, color="gray", linewidth=0.5) ax.axvline(0, color="gray", linewidth=0.5) plt.tight_layout() plt.show() %%%
2D Projection with PCA
%%%python
Project to 2 dimensions
X_pca = pca_2d.fit_transform(X_scaled)
Color by a known property (e.g., lithology factor)
plt.figure(figsize=(8, 6)) scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=lithology, cmap="viridis", alpha=0.7) plt.colorbar(scatter, label="Lithology Factor") plt.xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)") plt.ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)") plt.title("PCA Projection - Well-Log Data") plt.tight_layout() plt.show() %%%
t-SNE Visualization
t-SNE (t-distributed Stochastic Neighbor Embedding) is a non-linear method designed specifically for visualization. Unlike PCA, t-SNE preserves local neighborhood structure: points that are close in high-dimensional space remain close in the 2D embedding. However, t-SNE does not preserve global distances or density, and axes have no physical meaning.
%%%python from sklearn.manifold import TSNE
Reduce to 2D with t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42, n_iter=1000) X_tsne = tsne.fit_transform(X_scaled)
plt.figure(figsize=(8, 6)) scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=lithology, cmap="viridis", alpha=0.7) plt.colorbar(scatter, label="Lithology Factor") plt.xlabel("t-SNE Dimension 1") plt.ylabel("t-SNE Dimension 2") plt.title("t-SNE Projection - Well-Log Data (perplexity=30)") plt.tight_layout() plt.show() %%%
Comparing Perplexity Values
The perplexity parameter controls how many effective neighbors each point considers. Low perplexity (5) focuses on very local structure; high perplexity (50) incorporates more global structure. Always compare multiple perplexity values to ensure conclusions are robust.
%%%python fig, axes = plt.subplots(1, 4, figsize=(22, 5)) perplexities = [5, 15, 30, 50]
for ax, perp in zip(axes, perplexities): tsne_temp = TSNE(n_components=2, perplexity=perp, random_state=42, n_iter=1000) X_temp = tsne_temp.fit_transform(X_scaled) scatter = ax.scatter(X_temp[:, 0], X_temp[:, 1], c=lithology, cmap="viridis", alpha=0.7, s=20) ax.set_title(f"Perplexity = {perp}") ax.set_xlabel("Dim 1") ax.set_ylabel("Dim 2")
plt.suptitle("t-SNE: Effect of Perplexity Parameter") plt.tight_layout() plt.show() %%%
PCA vs t-SNE: Side-by-Side Comparison
Comparing PCA and t-SNE reveals their strengths. PCA preserves global variance and produces axes with clear physical interpretation but cannot capture non-linear structures. t-SNE excels at revealing cluster structure and non-linear relationships but sacrifices global distance information.
%%%python
Side by side comparison: PCA vs t-SNE
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
PCA
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=lithology, cmap="viridis", alpha=0.7, s=25) ax1.set_xlabel(f"PC1 ({pca_2d.explained_variance_ratio_[0]:.1%})") ax1.set_ylabel(f"PC2 ({pca_2d.explained_variance_ratio_[1]:.1%})") ax1.set_title("PCA (linear, global structure)")
t-SNE
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=lithology, cmap="viridis", alpha=0.7, s=25) ax2.set_xlabel("t-SNE Dim 1") ax2.set_ylabel("t-SNE Dim 2") ax2.set_title("t-SNE (non-linear, local structure)")
plt.suptitle("PCA vs t-SNE Projections - Well-Log Data", fontsize=14) plt.tight_layout() plt.show()
print("Key differences:") print("- PCA axes have meaning (% variance); t-SNE axes are arbitrary") print("- PCA preserves global distances; t-SNE preserves local neighborhoods") print("- PCA is deterministic; t-SNE varies with random_state") print("- PCA can transform new data; t-SNE must re-run on the whole dataset") %%%
Applying PCA to New / Test Data
A critical practical distinction: PCA can transform new data using the learned components, but t-SNE cannot. This makes PCA suitable for preprocessing pipelines, while t-SNE is visualization-only.
%%%python
Split into train/test
from sklearn.model_selection import train_test_split X_train, X_test = train_test_split(X_scaled, test_size=0.3, random_state=42)
Fit PCA on training data ONLY
pca_pipe = PCA(n_components=3) pca_pipe.fit(X_train)
Transform both train and test using the SAME fitted PCA
X_train_pca = pca_pipe.transform(X_train) X_test_pca = pca_pipe.transform(X_test)
print("Train PCA shape:", X_train_pca.shape) print("Test PCA shape: ", X_test_pca.shape) print("Explained variance:", np.round(pca_pipe.explained_variance_ratio_, 4)) print("Total variance retained:", f"{sum(pca_pipe.explained_variance_ratio_):.2%}") %%%
References
- Pedregosa, F., et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd ed.), ch. 8 (dimensionality reduction). O’Reilly.
- van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.