Random Forest

Chapter 8: Tree-Based Models II — Random Forests

Learning objectives

Understand ensemble methods and the wisdom-of-crowds principle
Explain bagging (bootstrap aggregating) and how it reduces variance
Describe how Random Forest combines bagging with random feature subsets
Write the Random Forest prediction formula for regression and classification
Explain out-of-bag (OOB) error estimation and its advantages
Interpret feature importance via mean decrease in impurity
Identify key hyperparameters and their effects on model performance

Ensemble Methods: The Wisdom of Crowds

A single decision tree, as we learned in the previous chapter, is easy to interpret but tends to overfit the training data. Small changes in the data can produce dramatically different trees. Ensemble methods address this weakness by combining multiple models to produce a single, more robust prediction.

The core insight is the wisdom of crowds: if you ask 100 people to estimate the weight of an ox, the average of their guesses is usually closer to the true weight than most individual guesses. Similarly, averaging the predictions of many decision trees typically yields a model that generalizes better than a single tree.

There are two main families of ensemble methods:

Bagging (Bootstrap Aggregating): train many models independently on bootstrap resamples of the data, then average their predictions. Random Forest belongs to this family.
Boosting: train models sequentially, where each new model focuses on the mistakes of the previous ones (e.g., Gradient Boosting, XGBoost). We will cover boosting in a later chapter.

Bootstrap Aggregating (Bagging)

Bagging works in three steps:

Bootstrap sampling: From the original dataset of $n$ samples, draw $n$ samples with replacement, so some samples appear multiple times while others are left out. On average, each bootstrap sample contains about 63.2% of the distinct original samples, because the probability that a given sample is drawn at least once is $1 - (1 - 1/n)^n \approx 1 - 1/e \approx 0.632$ for large $n$.
Train independent models: Fit a decision tree to each bootstrap sample independently.
Aggregate predictions: For regression, average the predictions of all trees. For classification, take a majority vote.
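
The bootstrap step and the 63.2% figure can be checked with a short simulation. This is a minimal sketch using only the Python standard library; the helper name `bootstrap_indices`, the dataset size, and the number of repeats are illustrative assumptions, not part of the chapter.

```python
import math
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000                 # illustrative dataset size
n_repeats = 200          # illustrative number of bootstrap samples

# Fraction of distinct original samples present in each bootstrap sample.
fractions = []
for _ in range(n_repeats):
    idx = bootstrap_indices(n, rng)
    fractions.append(len(set(idx)) / n)

mean_fraction = sum(fractions) / n_repeats
expected = 1 - (1 - 1 / n) ** n   # exact expectation for this n
limit = 1 - 1 / math.e            # large-n limit quoted in the text

print(f"simulated {mean_fraction:.3f}, expected {expected:.3f}, limit {limit:.3f}")
```

The indices that never appear in `idx` are the samples left out of that bootstrap sample.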

Why Bagging Reduces Variance

If each tree has variance $\sigma^2$ and the trees are uncorrelated, then the variance of the average of $B$ trees is $\sigma^2 / B$. Bagging therefore reduces overfitting by lowering variance, while bias remains roughly unchanged.

In practice the trees are positively correlated (they are all trained on resamples of the same data), so the variance of the average falls more slowly than $\sigma^2 / B$ and does not vanish as $B$ grows. The reduction is still substantial.
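
The effect of correlation can be made explicit. As a sketch, assume (beyond what the text states) that the trees are identically distributed with variance $\sigma^2$ and a common pairwise correlation $\rho$; the symbol $\rho$ is introduced here for illustration. Expanding the variance of the average gives:

```latex
\[
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} f_b(x)\right)
= \frac{1}{B^2}\Bigl[\,B\sigma^2 + B(B-1)\,\rho\,\sigma^2\Bigr]
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
\]
```

With $\rho = 0$ this reduces to $\sigma^2 / B$. With $\rho > 0$ the second term shrinks as $B$ grows but the first term $\rho\,\sigma^2$ remains, so lowering the correlation between trees is the only way to reduce that floor.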

Random Forest: Bagging + Random Feature Subsets

Random Forest improves upon plain bagging by introducing an additional source of randomness: at each split in each tree, only a random subset of features is considered as candidates for the best split. This is the key innovation of Random Forest (Breiman, 2001).

The motivation is that in plain bagging, if one feature is very strong, most trees will split on that feature at or near the root, making the trees highly correlated. By restricting each split to a random subset of features, we decorrelate the trees, which improves the variance reduction from averaging.

Random Forest Algorithm

For $b = 1, 2, \ldots, B$:

Draw a bootstrap sample $D_b$ of size $n$ from the training data (with replacement).
Grow a decision tree $f_b$ on $D_b$. At each node, select $m$ features at random from the full set of $p$ features, and choose the best split among those $m$ features only.
Grow the tree fully (or to a specified `max_depth`) without pruning.
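
The loop above can be sketched in plain Python. To keep the example short, each "tree" here is a depth-1 regression tree (a stump), so the random feature subset is drawn once, at its single node; a full implementation would redraw it at every node of a deeper tree. The function names `fit_stump`, `fit_forest`, and `predict`, and the toy data, are hypothetical and chosen only for illustration.

```python
import random

def fit_stump(X, y, feature_ids):
    """Fit a depth-1 regression tree using only the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no usable split: fall back to the sample mean
        mean = sum(y) / len(y)
        return lambda row: mean
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm

def fit_forest(X, y, n_trees, m, rng):
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample D_b
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        feature_ids = rng.sample(range(p), m)        # m of the p features for this node
        forest.append(fit_stump(Xb, yb, feature_ids))
    return forest

def predict(forest, row):
    """Regression prediction: average over all trees."""
    return sum(tree(row) for tree in forest) / len(forest)

# Toy data (assumed for illustration): only feature 0 carries signal.
rng = random.Random(0)
X = [[rng.random() for _ in range(4)] for _ in range(120)]
y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]

forest = fit_forest(X, y, n_trees=30, m=2, rng=rng)
print(predict(forest, [0.9, 0.5, 0.5, 0.5]), predict(forest, [0.1, 0.5, 0.5, 0.5]))
```

Because only about half of the stumps are allowed to see feature 0, the individual trees differ more from one another than they would under plain bagging, which is the decorrelation effect the algorithm relies on.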

Regression prediction:

$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} f_b(x)$

Classification prediction:

$\hat{y} = \text{mode}\{f_1(x), f_2(x), \ldots, f_B(x)\}$ (majority vote)
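
Both aggregation rules are one-liners once the per-tree predictions are available. The sketch below assumes the predictions $f_1(x), \ldots, f_B(x)$ have already been collected into a list; the function names and example values are illustrative.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    """Average of the B tree outputs."""
    return mean(tree_predictions)

def aggregate_classification(tree_predictions):
    """Majority vote; on a tie, Counter returns the label encountered first."""
    return Counter(tree_predictions).most_common(1)[0][0]

reg = aggregate_regression([2.0, 3.0, 4.0])
cls = aggregate_classification(["shale", "sandstone", "shale"])
print(reg, cls)
```

Many libraries instead average the per-tree class probabilities and pick the largest, which can differ from a hard majority vote when trees are uncertain.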

Common defaults for $m$:

Classification: $m = \lfloor\sqrt{p}\rfloor$
Regression: $m = \lfloor p/3 \rfloor$
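
As a small worked example, these two rules can be written as a helper. The function name `default_m` is hypothetical, and the `max(1, ...)` guard is an added assumption so that $m$ never becomes 0 when $p$ is very small (for instance $p = 2$ in regression).

```python
import math

def default_m(p, task):
    """Commonly quoted default number of candidate features per split."""
    if task == "classification":
        return max(1, math.isqrt(p))   # floor(sqrt(p))
    if task == "regression":
        return max(1, p // 3)          # floor(p / 3)
    raise ValueError("task must be 'classification' or 'regression'")

for p in (4, 10, 16, 100):
    print(p, default_m(p, "classification"), default_m(p, "regression"))
```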

Out-of-Bag (OOB) Error Estimation

Because each bootstrap sample leaves out about 36.8% of the data, those left-out samples can serve as a built-in validation set for each tree. This is called the out-of-bag (OOB) error estimate.

For each sample $x_i$ in the dataset, collect predictions only from the trees whose bootstrap sample did not include $x_i$. Average (or vote) those predictions to get the OOB prediction for $x_i$. The overall OOB error is computed by comparing OOB predictions to true labels across all samples.

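For a sufficiently large number of trees, the OOB error is approximately equivalent to the leave-one-out cross-validation error, yet it comes almost for free: it needs no separate validation set and no refitting, only a record of which samples each tree left out.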
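
The bookkeeping behind the OOB estimate can be sketched as follows. To isolate the OOB logic, each "tree" here is a deliberately trivial stand-in that predicts the mean target of its bootstrap sample; that simplification, the variable names, and the synthetic targets are assumptions made for illustration.

```python
import random

rng = random.Random(0)
n, n_trees = 60, 100
y = [rng.gauss(10.0, 2.0) for _ in range(n)]   # synthetic regression targets

# Each entry: (set of in-bag indices, prediction of the stand-in model).
models = []
for _ in range(n_trees):
    idx = [rng.randrange(n) for _ in range(n)]
    models.append((set(idx), sum(y[i] for i in idx) / n))

# OOB prediction for sample i: average only over models that never saw i.
oob_pred = {}
for i in range(n):
    votes = [pred for in_bag, pred in models if i not in in_bag]
    if votes:   # with few trees, a sample may be in-bag for every model
        oob_pred[i] = sum(votes) / len(votes)

oob_mse = sum((y[i] - oob_pred[i]) ** 2 for i in oob_pred) / len(oob_pred)
print(f"samples with an OOB prediction: {len(oob_pred)}, OOB MSE: {oob_mse:.3f}")
```

Each sample is out-of-bag for roughly 36.8% of the models, so with 100 models it receives about 37 predictions on average.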

Feature Importance

Random Forest provides a natural measure of feature importance. The most common method is Mean Decrease in Impurity (MDI):

For each feature $j$ , sum the total reduction in Gini impurity (or MSE for regression) across all splits on feature $j$ in all trees.
Normalize so the importances sum to 1.
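
The two MDI steps can be illustrated on a tiny hand-built tree. This sketch weights each split's Gini decrease by the fraction of training samples reaching the node, which is one common convention and an assumption here; the feature names `GR` and `RHOB` and the label lists are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """Gini decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)

def normalise(raw):
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}

# One hypothetical tree on 8 samples: root splits on GR, both children on RHOB.
root = ["a"] * 4 + ["b"] * 4
left, right = ["a", "a", "a", "b"], ["a", "b", "b", "b"]

raw = {"GR": 0.0, "RHOB": 0.0}
raw["GR"] += impurity_decrease(root, left, right, n_total=8)
raw["RHOB"] += impurity_decrease(left, ["a", "a", "a"], ["b"], n_total=8)
raw["RHOB"] += impurity_decrease(right, ["a"], ["b", "b", "b"], n_total=8)

importances = normalise(raw)
print(importances)
```

In a forest, the raw sums would be accumulated over every split in every tree before normalizing.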

An alternative is Permutation Importance: randomly shuffle the values of feature $j$ in the OOB data and measure how much the OOB accuracy drops. Features whose shuffling causes a large drop are important.
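
Permutation importance can be sketched independently of any particular forest implementation. In the example below, `model` is a fixed rule standing in for a fitted forest and `X`, `y` stand in for the OOB data; both are assumptions for illustration, as are the function names and the number of repeats.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == yi for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, rng, n_repeats=10):
    """Mean drop in accuracy after shuffling column j."""
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Rebuild the rows with only feature j replaced by its shuffled values.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(baseline - accuracy(model, X_perm, y))
    return sum(drops) / n_repeats

rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [int(row[0] > 0.5) for row in X]        # labels depend on feature 0 only
model = lambda row: int(row[0] > 0.5)       # stand-in for a fitted forest

imp0 = permutation_importance(model, X, y, 0, rng)
imp1 = permutation_importance(model, X, y, 1, rng)
print(f"feature 0: {imp0:.3f}, feature 1: {imp1:.3f}")
```

Shuffling the feature the model ignores leaves accuracy unchanged, while shuffling the informative one reduces it toward chance level.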

Caution

MDI-based importance can be biased toward high-cardinality features (features with many unique values). Permutation importance is generally more reliable but slower to compute.

Key Hyperparameters

Parameter	Description	Effect
`n_estimators`	Number of trees $B$	More trees = more stable predictions, with diminishing returns past roughly 100 to 500 trees. Adding trees does not cause overfitting, but it increases computation and memory use.
`max_depth`	Maximum depth of each tree	Deeper trees = more complex models. `None` = fully grown. Limiting depth can reduce overfitting.
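`max_features`	Number of features $m$ to consider per split	Lower = more decorrelated trees (less variance, slightly more bias). Default: `sqrt(p)` for classification; for regression, $p/3$ is the classical recommendation, though some libraries default to all $p$ features.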
`min_samples_split`	Minimum samples to split a node	Higher = simpler trees, less overfitting.
`min_samples_leaf`	Minimum samples in a leaf node	Prevents very small leaves that memorize noise.
`bootstrap`	Whether to bootstrap samples	`True` = standard RF. `False` = each tree uses the full dataset (less randomness).
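
The parameter names in this table match those of scikit-learn's `RandomForestClassifier`. Assuming that library is the one intended, a minimal sketch of setting them looks like the following; the synthetic dataset and the specific values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the example is self-contained.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,        # number of trees B
    max_depth=None,          # grow each tree fully
    max_features="sqrt",     # m = sqrt(p) candidate features per split
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,          # standard RF: each tree sees a bootstrap sample
    oob_score=True,          # also compute the out-of-bag accuracy
    random_state=0,
)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("MDI importances:", forest.feature_importances_)
```

Note that `oob_score=True` requires `bootstrap=True`, since without bootstrapping no samples are left out of any tree.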

Random Forest vs. Single Decision Tree

Aspect	Decision Tree	Random Forest
Overfitting	High (memorizes training data)	Much lower (averaging reduces variance)
Interpretability	Easy to visualize and explain	Harder (many trees), but feature importance helps
Stability	Unstable — small data changes cause big tree changes	Stable — averaging smooths out instability
Bias	Low (fully grown tree)	Similar bias, much lower variance
Speed	Fast to train	Slower (must train many trees), but easily parallelized
Missing values	Some implementations handle natively	Same; Breiman's original implementation can also impute missing values using tree-based proximities

Geoscience Applications

Random Forest is one of the most popular ML algorithms in geoscience because it handles mixed data types, is robust to outliers, and provides feature importance for scientific interpretation:

Geochemical classification: Classifying rock types from major-element geochemistry (SiO2, Al2O3, FeO, MgO, etc.). RF handles the correlations between oxides gracefully.
Mineral prospectivity mapping: Predicting the probability of ore deposits from geological, geophysical, and geochemical features across a grid.
Lithofacies prediction from well logs: Classifying lithofacies (sandstone, shale, limestone, etc.) from gamma ray (GR), resistivity (RT), neutron porosity (NPHI), and bulk density (RHOB) measurements.
Earthquake magnitude prediction: Using seismic waveform features to estimate magnitude.
