Dimensionality Reduction

Chapter 11: Reducing Dimensions — PCA and Beyond

Learning objectives

Explain the curse of dimensionality and why it degrades ML performance
Derive PCA from the covariance matrix eigendecomposition
Choose the number of PCA components using explained variance and the scree plot
Describe t-SNE and its perplexity parameter for nonlinear embedding
Compare UMAP to t-SNE for visualization
Apply dimensionality reduction to multi-attribute geoscience data

The Curse of Dimensionality

As the number of features (dimensions) increases, several problems arise:

Data becomes sparse: In high dimensions, data points lie far apart from each other. The space to be covered grows exponentially with $d$: a unit hypercube in $d$ dimensions has $2^d$ corners, and a grid with 10 bins per axis has $10^d$ cells. To maintain the same sample density, you need exponentially more data.
Distance measures break down: In high dimensions, the ratio of the nearest distance to the farthest distance approaches 1: $\lim_{d \to \infty} \frac{\text{dist}_{\min}}{\text{dist}_{\max}} = 1$. All points appear equally far apart, making KNN and other distance-based methods unreliable.
Overfitting worsens: More features = more parameters = more opportunities to fit noise.
Visualization is impossible: We can only see 2D or 3D. With 50 features, we need dimensionality reduction just to visualize the data.
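The distance-concentration effect can be checked numerically. The sketch below is illustrative only: it assumes points drawn uniformly from the unit hypercube, and the function name `distance_ratio`, the sample size, and the seed are arbitrary choices rather than anything prescribed by the chapter.

```python
import math
import random

def distance_ratio(d, n_points=200, seed=0):
    """Nearest/farthest Euclidean distance from one random query point
    to n_points other points, all uniform in the unit hypercube [0, 1]^d."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)

# The ratio moves toward 1 as the dimension grows.
ratios = {d: distance_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"d={d:5d}  nearest/farthest = {r:.3f}")
```

In 2D the nearest neighbour is far closer than the farthest point; in 1000 dimensions the two distances are nearly the same, which is why "nearest" loses its meaning for KNN.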

Rule of thumb: You need at least 5–10 samples per feature for reliable ML. With 20 features, you need 100–200 samples minimum.

Principal Component Analysis (PCA)

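PCA is the most widely used dimensionality reduction technique. It finds new, mutually orthogonal axes (principal components), ordered so that the first captures the maximum variance in the data, the second captures the maximum remaining variance, and so on.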

Step 1: Center the data

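Subtract the mean of each feature from the corresponding column: $\tilde{X} = X - \bar{X}$, where $X$ is the $n \times p$ data matrix ($n$ samples, $p$ features) and $\bar{X}$ holds the column means.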

Step 2: Compute the covariance matrix

$$C = \frac{1}{n-1}\tilde{X}^T \tilde{X}$$

This is a $p \times p$ matrix (where $p$ is the number of features). Entry $C_{ij}$ measures how features $i$ and $j$ co-vary.
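Steps 1 and 2 can be sketched in a few lines of NumPy. The data here is synthetic and the variable names are illustrative assumptions; the comparison with `np.cov` is only a sanity check that the formula above matches NumPy's sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # synthetic data: n = 50 samples, p = 3 features
X[:, 1] += 0.8 * X[:, 0]              # make feature 1 co-vary with feature 0

# Step 1: center each feature (column) on zero
X_centered = X - X.mean(axis=0)

# Step 2: p x p sample covariance matrix
n = X.shape[0]
C = X_centered.T @ X_centered / (n - 1)

print(C.shape)
print(np.allclose(C, np.cov(X, rowvar=False)))   # agrees with NumPy's estimator
```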

Step 3: Eigendecomposition

$$C = V \Lambda V^T$$

where $V$ is the orthogonal matrix whose columns are the eigenvectors (the principal component directions) and $\Lambda$ is the diagonal matrix of eigenvalues. The eigenvalues, sorted so that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$, measure the variance captured by each principal component.
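One way to see why an eigenvalue equals the variance along its component is the short derivation below. It assumes the notation $v_j$ for the $j$-th unit-length eigenvector (the $j$-th column of $V$), which is introduced here for illustration. Because $\tilde{X}$ is centered, the projected scores $\tilde{X} v_j$ have zero mean, so their sample variance is

$$
\operatorname{Var}(\tilde{X} v_j) = \frac{1}{n-1} (\tilde{X} v_j)^T (\tilde{X} v_j) = v_j^T C v_j = \lambda_j \, v_j^T v_j = \lambda_j
$$

The third equality uses $C v_j = \lambda_j v_j$ and the last uses $v_j^T v_j = 1$.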

Step 4: Project the data

To reduce from $p$ dimensions to $k$ dimensions, keep only the top $k$ eigenvectors:

$$Z = \tilde{X} V_k$$

where $V_k$ is the $p \times k$ matrix of the first $k$ eigenvectors. $Z$ is the transformed data in $k$ dimensions.
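The four steps can be put together as a minimal from-scratch sketch. The function name `pca_eig` and the synthetic data are assumptions made for illustration; in practice a library routine would normally be used instead.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix (Steps 1-4)."""
    X_centered = X - X.mean(axis=0)                     # Step 1
    C = X_centered.T @ X_centered / (X.shape[0] - 1)    # Step 2
    eigvals, eigvecs = np.linalg.eigh(C)                # Step 3 (eigh returns ascending order)
    order = np.argsort(eigvals)[::-1]                   # re-sort so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]                                # p x k matrix of top-k eigenvectors
    Z = X_centered @ V_k                                # Step 4: n x k projected data
    return Z, V_k, eigvals

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated features
Z, V_k, eigvals = pca_eig(X, k=2)

print(Z.shape)
# The projected columns are uncorrelated, with variances lambda_1 and lambda_2.
print(np.allclose(np.cov(Z, rowvar=False), np.diag(eigvals[:2])))
```

Note that the sign of each eigenvector is arbitrary, so two implementations may return components that differ by a factor of $-1$ while describing the same axes.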

Explained Variance Ratio

The fraction of total variance captured by the $j$-th principal component:

$$\text{EVR}_j = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}$$

The cumulative explained variance ratio tells us what fraction of the total variance is retained by the first $k$ components:

$$\text{Cumulative EVR}(k) = \sum_{j=1}^{k} \text{EVR}_j$$
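As a small worked example of both formulas, the sketch below uses a hypothetical set of five eigenvalues; the numbers are made up for illustration and do not come from any dataset.

```python
from itertools import accumulate

# Hypothetical eigenvalues, already sorted in descending order
eigenvalues = [2.5, 1.5, 0.6, 0.3, 0.1]

total = sum(eigenvalues)
evr = [lam / total for lam in eigenvalues]     # EVR_j = lambda_j / sum of all lambdas
cumulative_evr = list(accumulate(evr))         # running sum over the first k components

for j, (e, c) in enumerate(zip(evr, cumulative_evr), start=1):
    print(f"PC{j}: EVR = {e:.2f}, cumulative = {c:.2f}")
```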

How to Choose the Number of Components

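95% variance rule: Choose the smallest $k$ such that the cumulative explained variance ratio is at least 0.95. This retains 95% of the total variance; the threshold itself is a convention, and other values such as 0.90 or 0.99 are also used.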
Scree plot: Plot the eigenvalues (or explained variance ratio) against component number. Look for an "elbow" — a sharp drop-off. Keep components before the elbow.
Kaiser criterion: Keep components with eigenvalue > 1 (for standardized data). Components with eigenvalue < 1 explain less variance than a single original feature.
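Two of these rules are easy to automate. The sketch below applies them to the same kind of hypothetical eigenvalue list (assumed to come from standardized data with $p = 5$); the helper names `k_by_variance` and `k_by_kaiser` are illustrative, not library functions.

```python
# Hypothetical eigenvalues from standardized data with p = 5 features
eigenvalues = [2.5, 1.5, 0.6, 0.3, 0.1]

def k_by_variance(eigenvalues, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio reaches the threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(eigenvalues)

def k_by_kaiser(eigenvalues):
    """Number of components with eigenvalue greater than 1."""
    return sum(1 for lam in eigenvalues if lam > 1.0)

k95 = k_by_variance(eigenvalues)
k_kaiser = k_by_kaiser(eigenvalues)
print(k95, k_kaiser)
```

For these values the 95% rule keeps 4 components while the Kaiser criterion keeps 2. The rules are heuristics and often disagree, so it is worth checking them against the scree plot and the downstream task.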

t-SNE: t-distributed Stochastic Neighbor Embedding

t-SNE is a nonlinear dimensionality reduction technique designed specifically for visualization (reducing to 2D or 3D). Unlike PCA, it preserves local neighborhood structure.

How t-SNE Works (Intuition)

1. For each pair of points in high-dimensional space, compute a probability proportional to their similarity (using a Gaussian kernel).
2. Initialize points randomly in 2D.
3. For each pair of points in 2D, compute a probability using a t-distribution (heavier tails than Gaussian).
4. Iteratively move the 2D points to minimize the difference (KL divergence) between the high-dimensional and low-dimensional probability distributions.

The t-distribution in step 3 is the key innovation: its heavier tails allow distant points to be placed farther apart in 2D, preventing the "crowding problem."
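The difference in tail weight is easy to see numerically. The sketch below compares the unnormalized Gaussian similarity $\exp(-d^2 / 2\sigma^2)$ with the Student-t similarity $(1 + d^2)^{-1}$ (one degree of freedom, as in the t-SNE paper). Setting $\sigma = 1$ and the choice of distances are assumptions made for illustration.

```python
import math

def gaussian_kernel(d, sigma=1.0):
    """Unnormalized Gaussian similarity used in the high-dimensional space."""
    return math.exp(-d**2 / (2 * sigma**2))

def student_t_kernel(d):
    """Unnormalized Student-t similarity (1 degree of freedom) used in the 2D map."""
    return 1.0 / (1.0 + d**2)

for d in (0.5, 1.0, 2.0, 4.0):
    g, t = gaussian_kernel(d), student_t_kernel(d)
    print(f"d={d:.1f}  gaussian={g:.5f}  student_t={t:.5f}")
```

At large distances the Student-t similarity is orders of magnitude larger than the Gaussian one. To reproduce a small high-dimensional similarity, a pair of map points therefore has to sit much farther apart, which frees up room in 2D for moderately distant points.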

Perplexity parameter: Controls the effective number of neighbors considered for each point. Typical range: 5–50. Low perplexity focuses on very local structure; high perplexity takes more of the global structure into account. Perplexity must be smaller than the number of samples. Try several values and compare.
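To make "effective number of neighbors" concrete: in the t-SNE paper the perplexity of a point's conditional distribution $P_i$ is defined as $2^{H(P_i)}$, where $H$ is the Shannon entropy in bits, and the Gaussian bandwidth $\sigma_i$ is tuned per point so that this value matches the user's setting. The sketch below only evaluates that definition for a few bandwidths; the neighbor distances and function names are hypothetical.

```python
import math

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point i, given squared distances to every other point."""
    weights = [math.exp(-d2 / (2 * sigma**2)) for d2 in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """2 ** (Shannon entropy in bits): the effective number of neighbors."""
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits

# Hypothetical layout: 20 neighbors at distances 1, 2, ..., 20 from point i
sq_dists = [float(r**2) for r in range(1, 21)]

for sigma in (0.5, 2.0, 8.0):
    perp = perplexity(conditional_probs(sq_dists, sigma))
    print(f"sigma={sigma:4.1f}  perplexity={perp:.2f}")
```

A narrow kernel puts almost all probability on the single closest neighbor (perplexity near 1), while a wide kernel spreads it over many neighbors. Choosing a perplexity is therefore equivalent to choosing how wide each point's neighborhood is.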

Caveats:

t-SNE is stochastic — different runs produce different results. Always set a random seed.
Cluster sizes and the distances between clusters in t-SNE plots are NOT meaningful — only the grouping structure (which points end up together) is.
t-SNE is for visualization only, not for preprocessing before ML (use PCA for that).
t-SNE is slow for large datasets: $O(n^2)$ for the exact algorithm, or $O(n \log n)$ with the Barnes-Hut approximation.
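The first caveat can be demonstrated with scikit-learn's `TSNE`, assuming scikit-learn is installed. The data is synthetic (two blobs in 10 dimensions), and the settings `perplexity=10`, `init="random"`, and `method="exact"` are choices made for this small example, not recommendations.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic data: two well-separated blobs of 30 points each in 10 dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 10)),
               rng.normal(6.0, 1.0, size=(30, 10))])

def embed(seed):
    """2D t-SNE embedding of X with a fixed random seed."""
    tsne = TSNE(n_components=2, perplexity=10, init="random",
                method="exact", random_state=seed)
    return tsne.fit_transform(X)

emb_a = embed(0)
emb_b = embed(0)   # same seed as emb_a
emb_c = embed(1)   # different seed

same_seed = np.allclose(emb_a, emb_b)
different_seed = np.allclose(emb_a, emb_c)
print(same_seed, different_seed)
```

With the same seed the coordinates repeat; with a different seed the two blobs should still separate, but their positions and orientation in the plot change, which is why absolute coordinates should not be interpreted.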

UMAP: Uniform Manifold Approximation and Projection

UMAP is a newer alternative to t-SNE with several advantages:

Faster: Scales much better to large datasets.
Preserves more global structure: t-SNE focuses on local neighborhoods; UMAP better preserves the relative positions of clusters.
Can be used for general dimensionality reduction: Unlike t-SNE, UMAP can be used as a preprocessing step before ML.
The key parameters are n_neighbors (similar to perplexity: it sets the size of the local neighborhood) and min_dist (controls how tightly points are packed in the embedding).

Geoscience Applications

Multi-attribute seismic data: Seismic interpretation may involve 20+ attributes (amplitude, frequency, phase, coherence, curvature, etc.). PCA can reduce these to 3–5 principal components that capture most of the variation, making interpretation and classification tractable.
Well-log data: A well log suite might include GR, SP, RHOB, NPHI, PE, DT, RT, and derived logs. Because many of these logs are strongly correlated, PCA can summarize them with a few uncorrelated components, which can often be interpreted in terms of lithology, porosity, and fluid content.
Geochemical data: Major and trace element concentrations (20+ elements) can be reduced to key component axes that correspond to geological processes (e.g., magmatic differentiation, alteration).
Visualization: Use t-SNE or UMAP to project multi-dimensional well-log or geochemical data into 2D for cluster identification and facies visualization.
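A minimal sketch of the well-log case is given below. Everything in it is synthetic: the two latent factors, the coefficients, and the noise levels are arbitrary assumptions chosen only to produce logs with realistic-looking units, not values from real wells. The point it illustrates is that logs with very different units should be standardized before PCA.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
lith = rng.normal(size=n)   # hypothetical latent "lithology" factor
por = rng.normal(size=n)    # hypothetical latent "porosity" factor

# Synthetic logs (arbitrary coefficients, illustrative only)
logs = np.column_stack([
    75.0 + 30.0 * lith + rng.normal(0, 3.0, n),                  # GR   (API)
    2.45 - 0.15 * por + 0.05 * lith + rng.normal(0, 0.01, n),    # RHOB (g/cm3)
    0.20 + 0.08 * por + 0.03 * lith + rng.normal(0, 0.005, n),   # NPHI (v/v)
    80.0 + 15.0 * por + 5.0 * lith + rng.normal(0, 1.0, n),      # DT   (us/ft)
])

# Standardize so that GR (tens of API units) does not swamp RHOB (hundredths of g/cm3)
Z = (logs - logs.mean(axis=0)) / logs.std(axis=0, ddof=1)

# PCA on standardized data = eigendecomposition of the correlation matrix
C = np.cov(Z, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
cum_evr = np.cumsum(eigvals) / eigvals.sum()
print(np.round(cum_evr, 3))
```

Because the four synthetic logs were built from two latent factors plus small noise, the first two components carry nearly all of the variance. Real log suites are noisier and the components rarely map this cleanly onto single physical properties.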

References

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 14.5 (principal component analysis). Springer.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 12 (continuous latent variables, PCA). Springer.
van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.