Dimensionality Reduction

Chapter 11: Reducing Dimensions — PCA and Beyond

Learning objectives

  • Explain the curse of dimensionality and why it degrades ML performance
  • Derive PCA from the covariance matrix eigendecomposition
  • Choose the number of PCA components using explained variance and the scree plot
  • Describe t-SNE and its perplexity parameter for nonlinear embedding
  • Compare UMAP to t-SNE for visualization
  • Apply dimensionality reduction to multi-attribute geoscience data

The Curse of Dimensionality

As the number of features (dimensions) increases, several problems arise:

  • Data becomes sparse: In high dimensions, data points are far apart from each other. The volume of the space grows exponentially: a unit hypercube in $d$ dimensions has $2^d$ corners. To maintain the same sample density, you need exponentially more data.
  • Distance measures break down: In high dimensions, the ratio of the nearest distance to the farthest distance approaches 1. Equivalently, for many data distributions the relative contrast vanishes: $\lim_{d \to \infty} \frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} = 0$. All points appear almost equally far apart, making KNN and other distance-based methods unreliable.
  • Overfitting worsens: More features = more parameters = more opportunities to fit noise.
  • Visualization is impossible: We can only see 2D or 3D. With 50 features, we need dimensionality reduction just to visualize the data.
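
The distance-concentration effect described in the list above can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube and Euclidean distance; the helper name `relative_contrast`, the sample size, and the chosen dimensions are illustrative choices, not part of the chapter.

```python
import numpy as np

def relative_contrast(n_points, dim, rng):
    """Return (dist_max - dist_min) / dist_min from one random query point."""
    X = rng.random((n_points, dim))       # uniform samples in the unit hypercube
    query = rng.random(dim)               # one random query point
    dists = np.linalg.norm(X - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)            # fixed seed for reproducibility
contrasts = {dim: relative_contrast(500, dim, rng) for dim in (2, 10, 100, 1000)}

for dim, value in contrasts.items():
    print(f"d = {dim:4d}   relative contrast = {value:.3f}")
```

The printed contrast should shrink as the dimension grows, which is the behaviour the limit expresses: the nearest and farthest points end up at nearly the same distance.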

Rule of thumb: You need at least 5–10 samples per feature for reliable ML. With 20 features, you need 100–200 samples minimum.

Principal Component Analysis (PCA)

PCA is the most widely used dimensionality reduction technique. It finds new orthogonal axes (principal components), ordered so that each captures the maximum remaining variance in the data.

Step 1: Center the data

Subtract the mean of each feature: $\tilde{X} = X - \bar{X}$.

Step 2: Compute the covariance matrix

$$C = \frac{1}{n-1}\tilde{X}^T \tilde{X}$$

This is a $p \times p$ matrix (where $n$ is the number of samples and $p$ is the number of features). Entry $C_{ij}$ measures how features $i$ and $j$ co-vary.
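
As a quick check of Steps 1 and 2, the covariance matrix can be computed by hand and compared with NumPy's built-in estimator. This is a sketch on synthetic random data; the array shapes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # synthetic data: n = 50 samples, p = 3 features

X_centered = X - X.mean(axis=0)           # Step 1: subtract each feature's mean
n = X.shape[0]
C = X_centered.T @ X_centered / (n - 1)   # Step 2: p x p covariance matrix

# np.cov treats rows as variables by default, so rowvar=False is needed here.
matches_numpy = np.allclose(C, np.cov(X, rowvar=False))
print(C.shape, matches_numpy)
```

The `n - 1` divisor matches the default of `np.cov`, so the two results should agree to floating-point precision.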

Step 3: Eigendecomposition

$$C = V \Lambda V^T$$

where $V$ is the matrix of eigenvectors (principal components) and $\Lambda$ is the diagonal matrix of eigenvalues. The eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$ measure the variance captured by each principal component.

Step 4: Project the data

To reduce from $p$ dimensions to $k$ dimensions, keep only the top $k$ eigenvectors:

$$Z = \tilde{X} V_k$$

where $V_k$ is the $p \times k$ matrix of the first $k$ eigenvectors. $Z$ is the transformed data in $k$ dimensions.
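
The four steps can be put together in a few lines of NumPy. The function below is a minimal sketch rather than a production implementation; the name `pca_eig` and the synthetic test data are assumptions made for illustration. It uses `np.linalg.eigh`, which is intended for symmetric matrices such as $C$ and returns eigenvalues in ascending order, so they are re-sorted in descending order.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix.

    Returns the projected data Z (n x k), all eigenvalues sorted in
    descending order, and the p x k matrix V_k of leading eigenvectors.
    """
    X_centered = X - X.mean(axis=0)               # Step 1: center
    n = X.shape[0]
    C = X_centered.T @ X_centered / (n - 1)       # Step 2: covariance
    eigvals, eigvecs = np.linalg.eigh(C)          # Step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]
    Z = X_centered @ V_k                          # Step 4: project
    return Z, eigvals, V_k

rng = np.random.default_rng(1)
# Synthetic data: 200 samples, 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, eigvals, V_k = pca_eig(X, k=2)
print(Z.shape, eigvals[:2], Z.var(axis=0, ddof=1))
```

The sample variance of each column of `Z` should equal the corresponding eigenvalue, which is a useful sanity check. The sign of each eigenvector is arbitrary, so projections from different libraries may differ by a sign flip per component.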

Explained Variance Ratio

The fraction of total variance captured by the $j$-th principal component:

$$\text{EVR}_j = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}$$

The cumulative explained variance ratio tells us how much of the total variance is retained by the first $k$ components:

$$\text{Cumulative EVR}(k) = \sum_{j=1}^{k} \text{EVR}_j$$
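
Both quantities follow directly from the eigenvalues. The sketch below uses a hypothetical set of five eigenvalues, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order
eigvals = np.array([2.5, 1.5, 0.6, 0.3, 0.1])

evr = eigvals / eigvals.sum()          # EVR_j = lambda_j / sum_i lambda_i
cumulative_evr = np.cumsum(evr)        # Cumulative EVR(k) for k = 1..p

for k, (ratio, total) in enumerate(zip(evr, cumulative_evr), start=1):
    print(f"PC{k}: EVR = {ratio:.2f}, cumulative = {total:.2f}")
```

With these values the eigenvalues sum to 5, so the first component explains 2.5/5 = 50% of the variance and the first two together explain 4.0/5 = 80%.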

How to Choose the Number of Components

  • 95% variance rule: Choose the smallest $k$ such that the cumulative explained variance ratio is ≥ 0.95. This retains at least 95% of the total variance.
  • Scree plot: Plot the eigenvalues (or explained variance ratio) against component number. Look for an "elbow" — a sharp drop-off. Keep components before the elbow.
  • Kaiser criterion: Keep components with eigenvalue > 1 (for standardized data). Components with eigenvalue < 1 explain less variance than a single original feature.
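
Two of these rules are easy to automate (the scree plot needs a visual judgment). The helpers below are a sketch; the function names and the example eigenvalues are hypothetical. The eigenvalues sum to 5, as they would for five standardized features.

```python
import numpy as np

def choose_k_by_variance(eigvals, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio reaches threshold."""
    cumulative = np.cumsum(eigvals)
    cumulative = cumulative / cumulative[-1]      # last entry is exactly 1.0
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigvals))

def choose_k_by_kaiser(eigvals):
    """Number of components with eigenvalue > 1 (standardized data assumed)."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

# Hypothetical eigenvalues, sorted in descending order
eigvals = np.array([2.5, 1.5, 0.6, 0.3, 0.1])
k_variance = choose_k_by_variance(eigvals)
k_kaiser = choose_k_by_kaiser(eigvals)
print(k_variance, k_kaiser)
```

For this example the cumulative ratios are 0.50, 0.80, 0.92, 0.98, 1.00, so the 95% rule keeps four components while the Kaiser criterion keeps two. The criteria often disagree like this, which is one reason to inspect the scree plot as well.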

t-SNE: t-distributed Stochastic Neighbor Embedding

t-SNE is a nonlinear dimensionality reduction technique designed specifically for visualization (reducing to 2D or 3D). Unlike PCA, it preserves local neighborhood structure.

How t-SNE Works (Intuition)

  • For each pair of points in high-dimensional space, compute a probability proportional to their similarity (using a Gaussian kernel).
  • Initialize points randomly in 2D.
  • For each pair of points in 2D, compute a probability using a t-distribution (heavier tails than Gaussian).
  • Iteratively move the 2D points to minimize the difference (KL divergence) between the high-dimensional and low-dimensional probability distributions.

The t-distribution in the third step is the key innovation: its heavier tails allow moderately distant points to be placed farther apart in 2D, preventing the "crowding problem."
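
The quantities in the steps above can be sketched in NumPy. This is a deliberately simplified illustration, not the full algorithm: it assumes a single global Gaussian bandwidth `sigma` (real t-SNE tunes one bandwidth per point from the perplexity and symmetrizes conditional probabilities), and it only evaluates the KL objective for a random 2D layout without optimizing it. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    sq = np.sum(A ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)                 # clip tiny negative round-off

def gaussian_affinities(X, sigma=1.0):
    """High-dimensional similarities P (simplified: one global sigma)."""
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                  # a point is not its own neighbor
    return W / W.sum()

def student_t_affinities(Y):
    """Low-dimensional similarities Q using a heavy-tailed Student-t kernel."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes over the 2D positions."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                 # synthetic high-dimensional data
Y = rng.normal(size=(30, 2))                  # random initial 2D layout
P = gaussian_affinities(X)
Q = student_t_affinities(Y)
kl = kl_divergence(P, Q)
print(f"KL(P || Q) for a random layout: {kl:.3f}")
```

An optimizer would repeatedly move the rows of `Y` along the negative gradient of this KL divergence. The `1 / (1 + d^2)` kernel decays polynomially rather than exponentially, which is what gives moderately dissimilar points room to spread out in the embedding.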

Perplexity parameter: Controls the effective number of neighbors considered for each point. Typical range: 5–50. Low perplexity focuses on very local structure; high perplexity takes more global structure into account. Try several values and compare the resulting embeddings.
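
To make "effective number of neighbors" concrete: in the original t-SNE formulation, the perplexity of the conditional distribution $P_i$ around point $i$ is $2^{H(P_i)}$, where $H$ is the Shannon entropy in bits, and the Gaussian bandwidth $\sigma_i$ is found by a search so that this value matches the user's setting. The sketch below does this for a single point using a geometric bisection; the function names, search bounds, and iteration count are illustrative assumptions.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Gaussian conditional probabilities p_{j|i} over all other points j."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits = logits - logits.max()            # stabilize the exponentials
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nonzero = p[p > 0]
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 ** entropy_bits

def sigma_for_perplexity(sq_dists_i, target, lo=1e-3, hi=1e3, iters=60):
    """Bisection on a log scale; perplexity increases with sigma."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists_i, mid)) > target:
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # synthetic data
sq_dists_0 = np.sum((X[1:] - X[0]) ** 2, axis=1)       # point 0 to all others
sigma_0 = sigma_for_perplexity(sq_dists_0, target=30.0)
achieved = perplexity(conditional_probs(sq_dists_0, sigma_0))
print(f"sigma = {sigma_0:.3f}, perplexity = {achieved:.2f}")
```

A uniform distribution over $m$ neighbors has perplexity exactly $m$, which is why the parameter reads as a soft neighbor count. Points in dense regions end up with a smaller $\sigma_i$ than points in sparse regions.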

Caveats:

  • t-SNE is stochastic — different runs produce different results. Always set a random seed.
  • Cluster sizes and distances in t-SNE plots are NOT meaningful — only the grouping structure is.
  • t-SNE is for visualization only, not for preprocessing before ML (use PCA for that).
  • Slow for large datasets ($O(n^2)$, or $O(n \log n)$ with the Barnes-Hut approximation).

UMAP: Uniform Manifold Approximation and Projection

UMAP is a newer alternative to t-SNE with several advantages:

  • Faster: Scales much better to large datasets.
  • Preserves more global structure: t-SNE focuses on local neighborhoods; UMAP better preserves the relative positions of clusters.
  • Can be used for general dimensionality reduction: Unlike t-SNE, UMAP can be used as a preprocessing step before ML.
  • The key parameters are n_neighbors (similar to perplexity) and min_dist (controls how tightly points are packed).

Geoscience Applications

  • Multi-attribute seismic data: Seismic interpretation may involve 20+ attributes (amplitude, frequency, phase, coherence, curvature, etc.). PCA can reduce these to 3–5 principal components that capture most of the variation, making interpretation and classification tractable.
  • Well-log data: A well log suite might include GR, SP, RHOB, NPHI, PE, DT, RT, and derived logs. PCA identifies the independent sources of variation (often: lithology, porosity, fluid content).
  • Geochemical data: Major and trace element concentrations (20+ elements) can be reduced to key component axes that correspond to geological processes (e.g., magmatic differentiation, alteration).
  • Visualization: Use t-SNE or UMAP to project multi-dimensional well-log or geochemical data into 2D for cluster identification and facies visualization.
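
As an illustration of the well-log case, the sketch below runs PCA on a small synthetic log suite. Everything in it is an assumption made for demonstration: the data are randomly generated from two made-up latent factors (labeled `shale` and `porosity`), and the coefficients are not calibrated to any real formation. The logs are standardized first because they are measured in very different units, and without scaling the log with the largest numeric range would dominate the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Two hypothetical latent factors driving the synthetic logs
shale = rng.normal(size=n)
porosity = rng.normal(size=n)

# Synthetic, illustrative log responses (not real measurements)
logs = np.column_stack([
    75.0 + 30.0 * shale + rng.normal(scale=3.0, size=n),                    # GR-like
    2.45 - 0.15 * porosity + rng.normal(scale=0.02, size=n),                # RHOB-like
    0.20 + 0.06 * porosity + 0.03 * shale + rng.normal(scale=0.01, size=n), # NPHI-like
    80.0 + 12.0 * porosity + 5.0 * shale + rng.normal(scale=2.0, size=n),   # DT-like
])

# Standardize each log to zero mean and unit variance
Zs = (logs - logs.mean(axis=0)) / logs.std(axis=0, ddof=1)

# PCA on the standardized data (covariance of Zs is the correlation matrix)
C = np.cov(Zs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

evr = eigvals / eigvals.sum()
scores = Zs @ eigvecs[:, :2]          # first two principal component scores
print("Explained variance ratio:", np.round(evr, 3))
```

Because the synthetic logs are built from two factors plus small noise, the first two components should capture most of the variance. On real data the components are linear mixtures of the logs and need to be interpreted against core or other ground truth before being labeled as lithology or porosity.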

References

  • Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 14.5 (principal component analysis). Springer.
  • Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 12 (continuous latent variables, PCA). Springer.
  • van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
  • McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.
