Random Forest

Chapter 8: Tree-Based Models II — Random Forests

Learning objectives

  • Understand ensemble methods and the wisdom-of-crowds principle
  • Explain bagging (bootstrap aggregating) and how it reduces variance
  • Describe how Random Forest combines bagging with random feature subsets
  • Write the Random Forest prediction formula for regression and classification
  • Explain out-of-bag (OOB) error estimation and its advantages
  • Interpret feature importance via mean decrease in impurity
  • Identify key hyperparameters and their effects on model performance

Ensemble Methods: The Wisdom of Crowds

A single decision tree, as we learned in the previous chapter, is easy to interpret but tends to overfit the training data. Small changes in the data can produce dramatically different trees. Ensemble methods address this weakness by combining multiple models to produce a single, more robust prediction.

The core insight is the wisdom of crowds: if you ask 100 people to estimate the weight of an ox, the average of their guesses is usually closer to the true weight than any single guess. Similarly, averaging the predictions of many decision trees yields a model that generalizes better than any individual tree.

There are two main families of ensemble methods:

  • Bagging (Bootstrap Aggregating): train many models independently on bootstrap resamples of the data, then average their predictions. Random Forest belongs to this family.
  • Boosting: train models sequentially, where each new model focuses on the mistakes of the previous ones (e.g., Gradient Boosting, XGBoost). We will cover boosting in a later chapter.

Bootstrap Aggregating (Bagging)

Bagging works in three steps:

  • Bootstrap sampling: From the original dataset of $n$ samples, draw $n$ samples with replacement. This means some samples appear multiple times while others are left out. On average, each bootstrap sample contains about 63.2% of the unique original samples, because the probability that a given sample is drawn at least once is $1 - (1 - 1/n)^n \approx 1 - 1/e \approx 0.632$.
  • Train independent models: Fit a decision tree to each bootstrap sample independently.
  • Aggregate predictions: For regression, average the predictions of all trees. For classification, take a majority vote.
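
The 63.2% figure from the bootstrap step can be checked numerically. The following is a minimal sketch using only the Python standard library; the dataset size `n` and the seed are arbitrary choices made for illustration.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 10_000               # arbitrary dataset size chosen for illustration

# One bootstrap sample: n row indices drawn with replacement.
bootstrap_idx = [rng.randrange(n) for _ in range(n)]

# Share of distinct original rows that appear at least once.
unique_fraction = len(set(bootstrap_idx)) / n

# Probability that a given row is drawn at least once, and its large-n limit.
expected = 1 - (1 - 1 / n) ** n
limit = 1 - math.exp(-1)

print(f"observed {unique_fraction:.3f}, expected {expected:.3f}, limit {limit:.3f}")
```

The rows that never appear in `bootstrap_idx` (roughly 36.8% of them) are the ones left out of that particular tree's training set.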

Why Bagging Reduces Variance

If each tree has variance $\sigma^2$ and the trees are uncorrelated, then the variance of the average of $B$ trees is $\sigma^2 / B$. Bagging dramatically reduces overfitting by lowering variance, while bias remains roughly unchanged.

In practice the trees are not perfectly uncorrelated (they are all trained on similar data), so the variance reduction is not as dramatic as $1/B$, but it is still substantial.
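
The effect of correlation can be made explicit. As a sketch, assume the $B$ tree predictions are identically distributed with variance $\sigma^2$ and a common pairwise correlation $\rho$ (the symbol $\rho$ is introduced here for illustration and is not used elsewhere in this chapter). Expanding the variance of the average gives:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right)
  = \frac{1}{B^2}\left[ B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2 \right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Setting $\rho = 0$ recovers $\sigma^2 / B$. For $\rho > 0$, the second term vanishes as $B$ grows but the first term $\rho\,\sigma^2$ remains, so adding trees alone cannot push the variance below that floor.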

Random Forest: Bagging + Random Feature Subsets

Random Forest improves upon plain bagging by introducing an additional source of randomness: at each split in each tree, only a random subset of features is considered as candidates for the best split. This is the key innovation of Random Forest (Breiman, 2001).

The reason is that in plain bagging, if one feature is very strong, most trees will split on that feature at or near the root, making the trees highly correlated. By restricting each split to a random subset of features, we decorrelate the trees, which improves the variance reduction from averaging.
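
To see why decorrelation matters, the sketch below simulates stand-ins for tree predictions rather than real trees. It assumes NumPy, unit-variance predictions, and a common pairwise correlation `rho`; the correlation values are illustrative only and are not measured from any forest. Under these assumptions the variance of the average of `B` predictions is `rho + (1 - rho) / B`.

```python
import numpy as np

def variance_of_average(rho, B=100, trials=20_000, seed=0):
    """Empirical variance of the mean of B unit-variance predictions
    whose pairwise correlation is rho."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(trials, 1))    # component common to all trees
    own = rng.normal(size=(trials, B))       # component unique to each tree
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return preds.mean(axis=1).var()

for rho in (0.8, 0.2, 0.0):                  # illustrative correlations only
    theory = rho + (1 - rho) / 100           # matches the default B = 100
    print(f"rho = {rho:.1f}: simulated {variance_of_average(rho):.3f}, theory {theory:.3f}")
```

With 100 averaged predictions, lowering the correlation reduces the variance far more than adding further trees would.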

Random Forest Algorithm

For $b = 1, 2, \ldots, B$:

  • Draw a bootstrap sample $D_b$ of size $n$ from the training data (with replacement).
  • Grow a decision tree $f_b$ on $D_b$. At each node, select $m$ features at random from the full set of $p$ features, and choose the best split among those $m$ features only.
  • Grow the tree fully (or to a specified max_depth) without pruning.

Regression prediction:

$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} f_b(x)$$
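
The algorithm and the regression average can be sketched in a few lines. This is an illustrative implementation, not a reference one. It assumes NumPy and scikit-learn, delegates the per-split feature sampling to `DecisionTreeRegressor(max_features=m)`, and uses synthetic data made up for the example. The helper names `fit_forest` and `predict_forest` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=50, m=None, max_depth=None, seed=0):
    """Grow n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample D_b of size n, drawn with replacement.
        idx = rng.integers(0, n, size=n)
        # max_features=m makes the tree consider only m randomly chosen
        # features at each split; no pruning is applied.
        tree = DecisionTreeRegressor(
            max_features=m,
            max_depth=max_depth,
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Regression aggregate: the mean of the per-tree predictions."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

trees = fit_forest(X, y, n_trees=50, m=2)   # m = floor(6 / 3)
y_hat = predict_forest(trees, X)
print(y_hat.shape)
```

Because each tree depends only on its own bootstrap sample, the loop in `fit_forest` could be run in parallel without changing the result.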

Classification prediction:

$$\hat{y} = \text{mode}\{f_1(x), f_2(x), \ldots, f_B(x)\}$$ (majority vote)
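
A majority vote takes only a few lines of standard-library Python. In this sketch the votes are hypothetical labels invented for the example, and ties go to whichever label was seen first, which is an arbitrary choice made here.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common class label among the per-tree votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from B = 7 trees for a single sample.
votes = ["shale", "sandstone", "shale", "limestone", "shale", "sandstone", "shale"]
print(majority_vote(votes))
```

Some libraries aggregate differently. scikit-learn's `RandomForestClassifier`, for example, averages the per-tree class probabilities instead of counting hard votes.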

Common defaults for $m$:

  • Classification: $m = \lfloor\sqrt{p}\rfloor$
  • Regression: $m = \lfloor p/3 \rfloor$
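
As a small worked example, the two rules can be written as a helper. The function name `default_m` and the lower bound of one feature are assumptions added here so the sketch stays valid for very small $p$.

```python
import math

def default_m(p, task):
    """Common default number of candidate features per split."""
    if task == "classification":
        return max(1, math.isqrt(p))   # floor(sqrt(p))
    if task == "regression":
        return max(1, p // 3)          # floor(p / 3)
    raise ValueError(f"unknown task: {task!r}")

p = 16  # hypothetical feature count
print(default_m(p, "classification"), default_m(p, "regression"))
```

Library defaults do not always follow these rules. scikit-learn's `RandomForestRegressor`, for instance, considers all features at each split unless `max_features` is set.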

Out-of-Bag (OOB) Error Estimation

Because each bootstrap sample leaves out about 36.8% of the data, those left-out samples can serve as a built-in validation set for each tree. This is called the out-of-bag (OOB) error estimate.

For each sample $x_i$ in the dataset, collect predictions only from the trees whose bootstrap sample did not include $x_i$. Average (or vote) those predictions to get the OOB prediction for $x_i$. The overall OOB error is computed by comparing OOB predictions to true labels across all samples.

With enough trees, the OOB error is approximately equivalent to leave-one-out cross-validation, but it comes almost for free: no models need to be fit beyond the forest itself.
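
In scikit-learn (assumed here; the chapter does not prescribe a library), the OOB estimate is requested with `oob_score=True` and read from the `oob_score_` attribute. The sketch below compares it with 5-fold cross-validation on synthetic data generated only for this example. The two numbers should be close but will not match exactly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data, generated only to make the example runnable.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=0)

# OOB accuracy: each sample is scored only by trees that did not train on it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
oob_accuracy = forest.oob_score_

# 5-fold cross-validation for comparison; this refits the forest five times.
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
).mean()

print(f"OOB accuracy: {oob_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_accuracy:.3f}")
```

Note that `oob_score_` is an accuracy for classifiers, so the OOB error is `1 - oob_score_`.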

Feature Importance

Random Forest provides a natural measure of feature importance. The most common method is Mean Decrease in Impurity (MDI):

  • For each feature $j$, sum the reduction in Gini impurity (or MSE for regression) over all splits on feature $j$ in all trees, weighting each split by the fraction of training samples that reach it.
  • Normalize so the importances sum to 1.

An alternative is Permutation Importance: randomly shuffle the values of feature $j$ in the OOB data and measure how much the OOB accuracy drops. Features whose shuffling causes a large drop are important.

Caution

MDI-based importance can be biased toward high-cardinality features (features with many unique values). Permutation importance is generally more reliable but slower to compute.
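
Both measures can be computed side by side. The sketch below assumes scikit-learn and uses synthetic data in which only the first three columns carry signal. One substitution to note: scikit-learn's `permutation_importance` shuffles columns of whatever dataset it is given, so a held-out test split is used here in place of the OOB samples described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the 3 informative features are
# columns 0-2 and the remaining 5 columns are pure noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importances, normalized to sum to 1.
mdi = forest.feature_importances_

# Permutation importance: mean drop in accuracy when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: MDI = {mdi[j]:.3f}, permutation = {perm.importances_mean[j]:.3f}")
```

Noise columns typically receive a small but nonzero MDI score, while their permutation importance stays near zero. This is one practical way to see the difference between the two measures.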

Key Hyperparameters

| Parameter | Description | Effect |
| --- | --- | --- |
| n_estimators | Number of trees $B$ | More trees = more stable predictions, diminishing returns past ~100-500. Never hurts accuracy, but increases computation. |
| max_depth | Maximum depth of each tree | Deeper trees = more complex models. None = fully grown. Limiting depth can reduce overfitting. |
| max_features | Number of features $m$ to consider per split | Lower = more decorrelated trees (less variance, slightly more bias). Default: sqrt(p) for classification. |
| min_samples_split | Minimum samples to split a node | Higher = simpler trees, less overfitting. |
| min_samples_leaf | Minimum samples in a leaf node | Prevents very small leaves that memorize noise. |
| bootstrap | Whether to bootstrap samples | True = standard RF. False = each tree uses the full dataset (less randomness). |
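
The parameter names in the table match those of scikit-learn's `RandomForestClassifier`, which is assumed in the sketch below. It sets each hyperparameter explicitly and records the OOB accuracy for three forest sizes. The data are synthetic and the sizes are arbitrary, so the printed numbers only illustrate the trend.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, generated only to make the example runnable.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4, random_state=0)

oob_by_size = {}
for n_trees in (25, 100, 400):
    forest = RandomForestClassifier(
        n_estimators=n_trees,     # number of trees B
        max_depth=None,           # None = grow each tree fully
        max_features="sqrt",      # m = sqrt(p) candidate features per split
        min_samples_split=2,      # smallest node that may still be split
        min_samples_leaf=1,       # smallest allowed leaf
        bootstrap=True,           # standard RF; required for oob_score
        oob_score=True,
        random_state=0,
    )
    forest.fit(X, y)
    oob_by_size[n_trees] = forest.oob_score_

for n_trees, score in oob_by_size.items():
    print(f"B = {n_trees:>3}: OOB accuracy = {score:.3f}")
```

The same loop can be pointed at `max_features` or `min_samples_leaf` to compare settings without holding out a separate validation set.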

Random Forest vs. Single Decision Tree

| Aspect | Decision Tree | Random Forest |
| --- | --- | --- |
| Overfitting | High (memorizes training data) | Much lower (averaging reduces variance) |
| Interpretability | Easy to visualize and explain | Harder (many trees), but feature importance helps |
| Stability | Unstable: small data changes cause big tree changes | Stable: averaging smooths out instability |
| Bias | Low (fully grown tree) | Similar bias, much lower variance |
| Speed | Fast to train | Slower (must train many trees), but easily parallelized |
| Missing values | Some implementations handle natively | Same, plus OOB-based proximity can impute missing data |
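
The overfitting row can be illustrated with a quick experiment. The sketch assumes scikit-learn and uses synthetic data with 10% label noise, so the exact scores carry no meaning beyond this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y), made up for illustration.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A single fully grown tree versus a forest of 200 trees.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

tree_train = tree.score(X_train, y_train)
tree_test = tree.score(X_test, y_test)
forest_test = forest.score(X_test, y_test)

print(f"tree:   train accuracy {tree_train:.3f}, test accuracy {tree_test:.3f}")
print(f"forest: test accuracy {forest_test:.3f}")
```

A large gap between the single tree's training and test accuracy is the memorization described in the table. The forest's test accuracy is usually the higher of the two test scores.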

Geoscience Applications

Random Forest is one of the most popular ML algorithms in geoscience because it handles mixed data types, is robust to outliers, and provides feature importance for scientific interpretation:

  • Geochemical classification: Classifying rock types from major-element geochemistry (SiO2, Al2O3, FeO, MgO, etc.). RF handles the correlations between oxides gracefully.
  • Mineral prospectivity mapping: Predicting the probability of ore deposits from geological, geophysical, and geochemical features across a grid.
  • Lithofacies prediction from well logs: Classifying lithofacies (sandstone, shale, limestone, etc.) from gamma ray (GR), resistivity (RT), neutron porosity (NPHI), and bulk density (RHOB) measurements.
  • Earthquake magnitude prediction: Using seismic waveform features to estimate magnitude.
