Trees, random forests, and gradient boosting

Part 9 — Machine learning for researchers

Learning objectives

Define a DECISION TREE as a recursive axis-aligned partition that minimises within-leaf variance (regression) or impurity (classification)
Recognise the BAGGING principle: average many bootstrap-trained trees to reduce variance
Define RANDOM FORESTS (Breiman 2001) as bagging + random feature subsetting
Define GRADIENT BOOSTING (Friedman 2001) as sequential bias-correction via tree ensemble
Compare random forests vs gradient boosting in terms of variance/bias trade-offs

Tree-based methods are the workhorse of tabular ML. They partition the feature space recursively, fit constant predictions in each leaf, and naturally handle non-linearities, interactions, and mixed data types (numeric + categorical). ENSEMBLE methods — random forests and gradient boosting — combine many trees to overcome single-tree limitations.

The decision tree

A regression tree recursively partitions the feature space:

At each node, consider all possible (feature, threshold) splits.
Choose the split that minimises $\sum_L (y_i - \bar{y}_L)^2 + \sum_R (y_i - \bar{y}_R)^2$ (regression) or impurity (classification).
Recursively split until depth, leaf-size, or min-improvement criterion is reached.

The output is a STEP FUNCTION over the feature space — axis-aligned, easy to interpret, prone to overfitting in single-tree form (high variance: small training-set changes can dramatically alter the tree).

Bagging: variance reduction via ensemble

Bagging (Breiman 1996): "bootstrap aggregating".

Draw B bootstrap samples from the data.
Fit a tree on each.
Average the predictions (for regression) or majority-vote (for classification).

Variance of the ensemble is approximately $\sigma^2/B$ for uncorrelated trees and decreases as B grows. In practice, trees are CORRELATED (same data o similar trees), so the practical gain is less than full $1/B$ reduction. Out-of-bag samples (left out of each bootstrap) give a cheap internal estimate of generalisation error.

Random forest (Breiman 2001)

Random forest = bagging + RANDOM FEATURE SUBSETTING. At each split, consider only a random subset of features (typically $\sqrt{p}$ for classification or $p/3$ for regression). This DECORRELATES the trees, allowing fuller variance reduction. Random forest is the modern default for tabular ML when interpretability is not paramount.

Hyperparameters: B (typically 500-5000), max depth, min leaf size, feature subset size m. Cross-validation tunes these.

Gradient boosting (Friedman 2001)

Bagging reduces variance; boosting reduces BIAS by SEQUENTIAL bias correction:

Fit an initial weak tree (small depth, high bias).
Compute residuals (or gradient of loss).
Fit a NEW tree to the residuals (or gradients).
Add a small fraction of the new tree to the model (learning rate η, typically 0.01-0.1).
Repeat for many iterations.

Each tree corrects the residuals of the previous. Result: an ENSEMBLE that fits very well. Modern implementations (XGBoost, LightGBM, CatBoost) include regularisation (L1/L2 on tree weights), column subsampling, early stopping via validation set. Currently dominate Kaggle competitions and tabular ML applications.

Random forest vs gradient boosting

Random forest: parallel trees, low BIAS individually but high variance, ensembling reduces variance. Less prone to overfit. Easier to tune. Good baseline.
Gradient boosting: sequential trees, each correcting residuals. Both can fit very well; sensitive to hyperparameters (learning rate, depth, regularisation). Generally HIGHER ACCURACY than random forest at cost of more careful tuning.

In practice: try both. XGBoost/LightGBM usually wins on heavily-tuned applications; random forest is more robust to default settings.

Interpretability tools

Feature importance: average decrease in impurity contributed by each feature across all trees. Identifies "predictive" features.
Permutation importance: shuffle one feature's values, observe drop in predictive accuracy. Tests how much the model depends on each feature.
Partial dependence plots: show the marginal effect of one feature on the prediction, averaging over the others.
SHAP values (Lundberg-Lee 2017): game-theoretic attribution of each prediction to each feature. Modern best practice for ML interpretability.

Critical: feature importance ≠ causal effect. These tools tell you what the MODEL uses, not what the WORLD has caused.

Try it

Defaults: depth 4, B = 30, noise = 0.3. The single tree (red) is visibly a step function; the random forest (orange) is smoother, tracking the true sine pattern more closely. RF MSE is typically 30-50% lower than the single tree.
Crank depth to 10 (very deep tree). The single tree OVERFITS — it follows individual data points. RF still smooths because averaging over bootstrap samples cancels overfitting. RF is more robust to depth choice.
Crank B from 30 to 200. The RF prediction stabilises and gets smoother. Diminishing returns — at large B, additional trees barely change predictions. Typical B for production: 100-500.
Set depth to 1 (stump). Single tree is just a horizontal line — high bias. RF averages many stumps — still under-fits. Boosting (sequential, not implemented here) handles this case better.
Crank noise to 1.0 (very noisy). RF's smoothing advantage over single tree is even more pronounced. The classical bias-variance trade-off: at high noise, variance reduction via ensemble pays off most.

A data scientist trains an XGBoost model and observes 99% accuracy on training but 70% on test. What's the likely issue and how should they adjust hyperparameters?

What you now know

Decision trees recursively partition the feature space. Random forests bag many randomly-decorrelated trees to reduce variance. Gradient boosting sequentially corrects residuals via tree ensemble for higher accuracy. SHAP / permutation importance interpret black-box predictions but don't prove causation. §9.4 next: calibration and probability outputs — making ML predictions trustworthy as probabilistic statements.

References

Breiman, L. (2001). "Random forests." Machine Learning 45(1), 5–32.
Friedman, J.H. (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics 29(5), 1189–1232.
Chen, T., Guestrin, C. (2016). "XGBoost: A scalable tree boosting system." KDD.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y. (2017). "LightGBM: A highly efficient gradient boosting decision tree." NeurIPS.
Lundberg, S.M., Lee, S.I. (2017). "A unified approach to interpreting model predictions." NeurIPS. (SHAP values.)