Dimensionality Reduction
Learning objectives
- Explain the curse of dimensionality and why it degrades ML performance
- Derive PCA from the covariance matrix eigendecomposition
- Choose the number of PCA components using explained variance and the scree plot
- Describe t-SNE and its perplexity parameter for nonlinear embedding
- Compare UMAP to t-SNE for visualization
- Apply dimensionality reduction to multi-attribute geoscience data
The Curse of Dimensionality
As the number of features (dimensions) increases, several problems arise:
- Data becomes sparse: In high dimensions, data points are far apart from each other. The volume of the space grows exponentially: a unit hypercube in $d$ dimensions has $2^d$ corners. To maintain the same sample density, you need exponentially more data.
- Distance measures break down: In high dimensions, the ratio of the nearest distance to the farthest distance approaches 1: $\mathrm{dist}_{\min} / \mathrm{dist}_{\max} \to 1$ as the number of dimensions $d \to \infty$. All points appear equally far apart, making KNN and other distance-based methods unreliable.
- Overfitting worsens: More features = more parameters = more opportunities to fit noise.
- Visualization is impossible: We can only see 2D or 3D. With 50 features, we need dimensionality reduction just to visualize the data.
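The distance concentration effect in the list above can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube and Euclidean distance; the function name `nearest_to_farthest_ratio` and the sample sizes are illustrative choices, not part of the original material.

```python
import numpy as np

def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of nearest to farthest distance from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)                # one extra random query point
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

# Compare low and high dimensions with the same number of points
ratios = {d: nearest_to_farthest_ratio(1000, d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"d = {d:4d}: nearest/farthest = {r:.3f}")
```

With a fixed number of points, the ratio should be close to 0 in 2D and move toward 1 as the dimension grows, which is why nearest-neighbor rankings become less informative.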
Rule of thumb: You need at least 5–10 samples per feature for reliable ML. With 20 features, you need 100–200 samples minimum.
Principal Component Analysis (PCA)
PCA is the most widely used dimensionality reduction technique. It finds new axes (principal components) that capture the maximum variance in the data.
Step 1: Center the data
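Subtract the mean of each feature: $\tilde{X} = X - \mathbf{1}\bar{x}^\top$, where $X$ is the $n \times d$ data matrix ($n$ samples, $d$ features) and $\bar{x}$ is the vector of feature means.
Step 2: Compute the covariance matrix
$$C = \frac{1}{n-1}\tilde{X}^\top \tilde{X}$$
This is a $d \times d$ matrix (where $d$ is the number of features). Entry $C_{ij}$ measures how features $i$ and $j$ co-vary.
Step 3: Eigendecomposition
$$C = V \Lambda V^\top$$
where $V$ is the matrix of eigenvectors (principal components) and $\Lambda$ is the diagonal matrix of eigenvalues. The eigenvalues $\lambda_i$ measure the variance captured by each principal component.
Step 4: Project the data
To reduce from $d$ dimensions to $k$ dimensions, keep only the top $k$ eigenvectors:
$$Z = \tilde{X} V_k$$
where $V_k$ is the $d \times k$ matrix of the first $k$ eigenvectors. $Z$ is the transformed data in $k$ dimensions.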
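The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the sample covariance (division by n - 1) and synthetic data; the function name `pca_eig` and the test matrix are hypothetical and chosen only for the example.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # Step 1: center each feature
    C = np.cov(X_centered, rowvar=False)       # Step 2: d x d covariance (divides by n - 1)
    eigvals, eigvecs = np.linalg.eigh(C)       # Step 3: eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # eigh returns ascending order; sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]                       # first k eigenvectors, shape (d, k)
    Z = X_centered @ V_k                       # Step 4: project onto k dimensions
    return Z, eigvals, V_k

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features driven by 2 hidden factors plus noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

Z, eigvals, V_k = pca_eig(X, k=2)
print(Z.shape)
print(np.round(eigvals, 3))
```

A useful sanity check is that the sample variance of each column of the projected data equals the corresponding eigenvalue, and that the projected columns are uncorrelated.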
Explained Variance Ratio
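The fraction of total variance captured by the $i$-th principal component:
$$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}$$
where $\lambda_i$ is the $i$-th largest eigenvalue of the covariance matrix and $d$ is the number of features.
The cumulative explained variance ratio tells us how much of the total variance is retained by the first $k$ components:
$$\text{CEVR}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{d} \lambda_j}$$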
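As a small worked example of the two ratios above, the sketch below uses a made-up set of five eigenvalues (hypothetical numbers, already sorted in descending order) rather than values from a real dataset.

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted from largest to smallest
eigvals = np.array([4.0, 2.0, 1.0, 0.6, 0.4])

explained_ratio = eigvals / eigvals.sum()      # share of variance per component
cumulative_ratio = np.cumsum(explained_ratio)  # share retained by the first k components

for i, (r, c) in enumerate(zip(explained_ratio, cumulative_ratio), start=1):
    print(f"PC{i}: ratio = {r:.3f}, cumulative = {c:.3f}")
```

Here the eigenvalues sum to 8, so the individual ratios are 0.5, 0.25, 0.125, 0.075 and 0.05, and the first three components together retain 0.875 of the variance.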
How to Choose the Number of Components
- 95% variance rule: Choose the smallest number of components $k$ such that the cumulative explained variance ratio ≥ 0.95. This retains 95% of the variance.
- Scree plot: Plot the eigenvalues (or explained variance ratio) against component number. Look for an "elbow" — a sharp drop-off. Keep components before the elbow.
- Kaiser criterion: Keep components with eigenvalue > 1 (for standardized data). Components with eigenvalue < 1 explain less variance than a single original feature.
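The variance threshold and the Kaiser criterion are easy to automate; the scree-plot elbow usually remains a visual judgment. The sketch below assumes the eigenvalues come from standardized data (so they sum to the number of features); the function names and the eigenvalues are hypothetical.

```python
import numpy as np

def n_components_for_variance(eigvals, threshold=0.95):
    """Smallest k whose cumulative explained variance ratio reaches the threshold."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1   # first index with cumulative >= threshold
    return min(k, len(eigvals))                           # guard against rounding at threshold = 1.0

def n_components_kaiser(eigvals):
    """Number of eigenvalues above 1; only meaningful for standardized features."""
    return int(np.sum(np.asarray(eigvals, dtype=float) > 1.0))

# Hypothetical eigenvalues of a correlation matrix for 6 standardized features (sum = 6)
eigvals = [3.0, 1.6, 0.9, 0.3, 0.15, 0.05]
k_variance = n_components_for_variance(eigvals, threshold=0.95)
k_kaiser = n_components_kaiser(eigvals)
print(k_variance, k_kaiser)
```

For these numbers the cumulative ratios are roughly 0.50, 0.77, 0.92 and 0.97 for the first four components, so the 95% rule keeps 4 components while the Kaiser criterion keeps 2. The rules can disagree, which is one reason to look at the scree plot as well.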
t-SNE: t-distributed Stochastic Neighbor Embedding
t-SNE is a nonlinear dimensionality reduction technique designed specifically for visualization (reducing to 2D or 3D). Unlike PCA, which finds linear directions of maximum global variance, it aims to preserve local neighborhood structure.
How t-SNE Works (Intuition)
1. For each pair of points in high-dimensional space, compute a probability $p_{ij}$ proportional to their similarity (using a Gaussian kernel).
2. Initialize points randomly in 2D.
3. For each pair of points in 2D, compute a probability $q_{ij}$ using a t-distribution (heavier tails than Gaussian).
4. Iteratively move the 2D points to minimize the difference (KL divergence) between the high-dimensional and low-dimensional probability distributions.
The t-distribution in step 3 is the key innovation: its heavier tails allow distant points to be placed farther apart in 2D, preventing the "crowding problem."
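To make the quantities in the steps above concrete, the sketch below computes Gaussian similarities in the original space, Student t similarities in 2D, and the KL divergence between them. It is a deliberately simplified illustration, not the full algorithm: it assumes one shared Gaussian bandwidth `sigma`, whereas t-SNE tunes a per-point bandwidth from the perplexity, and it only evaluates the cost for two fixed layouts instead of running the gradient descent. All function names and data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gaussian_affinities(X, sigma=1.0):
    """High-dimensional similarities P (simplified: one shared sigma for all points)."""
    W = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # a point is not its own neighbor
    return W / W.sum()

def student_t_affinities(Y):
    """Low-dimensional similarities Q from a heavy-tailed Student t kernel (1 d.o.f.)."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that is minimized over the 2D positions."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
# Hypothetical data: two tight clusters of 20 points each in 10 dimensions
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 10)),
               rng.normal(5.0, 0.1, size=(20, 10))])
P = gaussian_affinities(X)

Y_random = rng.normal(size=(40, 2))   # random 2D layout that mixes the two clusters
Y_grouped = X[:, :2]                  # layout that keeps the two clusters apart

kl_random = kl_divergence(P, student_t_affinities(Y_random))
kl_grouped = kl_divergence(P, student_t_affinities(Y_grouped))
print(f"KL for random layout:  {kl_random:.3f}")
print(f"KL for grouped layout: {kl_grouped:.3f}")
```

The layout that keeps neighbors together should have the lower cost, which is the direction the iterative optimization pushes the 2D points.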
Perplexity parameter: Controls the effective number of neighbors considered for each point. Typical range: 5–50. Low perplexity focuses on very local structure; high perplexity takes more of the global structure into account. Try several values and compare the resulting plots.
Caveats:
- t-SNE is stochastic — different runs produce different results. Always set a random seed.
- Cluster sizes and distances in t-SNE plots are NOT meaningful — only the grouping structure is.
- t-SNE is for visualization only, not for preprocessing before ML (use PCA for that).
- Slow for large datasets ($O(n^2)$, or $O(n \log n)$ with the Barnes-Hut approximation).
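In practice, the advice to try several perplexity values and to fix the random seed might look like the following sketch. It assumes scikit-learn is installed and uses its `TSNE` class; the synthetic blobs and the chosen perplexity values are hypothetical.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Hypothetical data: three Gaussian blobs in 20 dimensions, 50 points each
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])

embeddings = {}
for perplexity in (5, 30):
    # random_state fixes the random initialization so the plot can be reproduced
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` is an array of 2D coordinates that can be passed to a scatter plot. Comparing the plots side by side helps separate stable groupings from artifacts of a single parameter choice.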
UMAP: Uniform Manifold Approximation and Projection
UMAP is a newer alternative to t-SNE with several advantages:
- Faster: Scales much better to large datasets.
- Preserves more global structure: t-SNE focuses on local neighborhoods; UMAP better preserves the relative positions of clusters.
- Can be used for general dimensionality reduction: Unlike t-SNE, UMAP can be used as a preprocessing step before ML.
- Key parameters: `n_neighbors` (similar to perplexity) and `min_dist` (controls how tightly points are packed).
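A minimal sketch of how the two UMAP parameters are typically passed, assuming the third-party umap-learn package and its `umap.UMAP` class. The wrapper name `umap_embed` and the default values shown are illustrative choices, not recommendations from the original material.

```python
def umap_embed(X, n_neighbors=15, min_dist=0.1, n_components=2, seed=0):
    """Embed X with UMAP; requires the third-party umap-learn package."""
    import umap  # imported here so this code loads even where umap-learn is not installed

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,    # size of the local neighborhood (similar role to perplexity)
        min_dist=min_dist,          # how tightly points may be packed in the embedding
        n_components=n_components,  # 2 for plots; larger values for preprocessing before ML
        random_state=seed,          # fix the seed for reproducible layouts
    )
    return reducer.fit_transform(X)
```

For example, `umap_embed(X, n_neighbors=5)` would emphasize fine local structure, while `umap_embed(X, n_neighbors=50, min_dist=0.5)` would tend to give a smoother, more spread-out layout. As with perplexity, it is worth trying a few settings.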
Geoscience Applications
- Multi-attribute seismic data: Seismic interpretation may involve 20+ attributes (amplitude, frequency, phase, coherence, curvature, etc.). PCA can reduce these to 3–5 principal components that capture most of the variation, making interpretation and classification tractable.
- Well-log data: A well log suite might include GR, SP, RHOB, NPHI, PE, DT, RT, and derived logs. PCA identifies the main uncorrelated directions of variation, which can often be interpreted as lithology, porosity, and fluid content.
- Geochemical data: Major and trace element concentrations (20+ elements) can be reduced to key component axes that correspond to geological processes (e.g., magmatic differentiation, alteration).
- Visualization: Use t-SNE or UMAP to project multi-dimensional well-log or geochemical data into 2D for cluster identification and facies visualization.
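The well-log case above can be illustrated with synthetic data. The sketch below is built entirely on assumptions: the three hidden factors, the loadings, the unit scales, and the noise level are made up for illustration and are not real well data. It also standardizes the logs first, an added step so that curves with large numeric ranges do not dominate the covariance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
log_names = ["GR", "SP", "RHOB", "NPHI", "PE", "DT", "RT"]

# Hypothetical hidden factors (lithology, porosity, fluid content), one row per depth sample
factors = rng.normal(size=(n, 3))
# Hypothetical loadings: how strongly each factor drives each log (rows: factors, columns: logs)
loadings = np.array([
    [1.0, 0.8,  0.3, 0.2, 0.9, 0.1, 0.2],   # lithology
    [0.1, 0.2, -0.9, 1.0, 0.1, 0.9, 0.3],   # porosity
    [0.0, 0.3,  0.2, 0.1, 0.0, 0.2, 1.0],   # fluid content
])
scales = np.array([30.0, 20.0, 0.15, 0.08, 1.0, 15.0, 50.0])   # made-up unit scales per log
logs = (factors @ loadings + 0.1 * rng.normal(size=(n, 7))) * scales

# Standardize each log to zero mean and unit variance
Z = (logs - logs.mean(axis=0)) / logs.std(axis=0, ddof=1)

# PCA via eigendecomposition of the covariance of the standardized logs
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
cumulative = np.cumsum(eigvals) / eigvals.sum()

scores = Z @ eigvecs[:, :3]   # component scores per depth sample
for name, weights in zip(log_names, eigvecs[:, :3]):
    print(f"{name:5s}", np.round(weights, 2))
print(np.round(cumulative, 3))
```

Because the synthetic logs were generated from three factors, the first three components should capture nearly all of the variance. With real logs the split is rarely this clean, and the weights printed for each log are what an interpreter would inspect to relate a component to lithology, porosity, or fluid effects.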
References
- Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.), ch. 14.5 (principal component analysis). Springer.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning, ch. 12 (continuous latent variables, PCA). Springer.
- van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
- McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.