Trees, random forests, and gradient boosting

Part 9 — Machine learning for researchers

Learning objectives

  • Define a DECISION TREE as a recursive axis-aligned partition that minimises within-leaf variance (regression) or impurity (classification)
  • Recognise the BAGGING principle: average many bootstrap-trained trees to reduce variance
  • Define RANDOM FORESTS (Breiman 2001) as bagging + random feature subsetting
  • Define GRADIENT BOOSTING (Friedman 2001) as sequential bias-correction via tree ensemble
  • Compare random forests vs gradient boosting in terms of variance/bias trade-offs

Tree-based methods are the workhorse of tabular ML. They partition the feature space recursively, fit constant predictions in each leaf, and naturally handle non-linearities, interactions, and mixed data types (numeric + categorical). ENSEMBLE methods — random forests and gradient boosting — combine many trees to overcome single-tree limitations.

The decision tree

A regression tree recursively partitions the feature space:

  • At each node, consider all possible (feature, threshold) splits.
  • Choose the split that minimises L(yiyˉL)2+R(yiyˉR)2\sum_L (y_i - \bar{y}_L)^2 + \sum_R (y_i - \bar{y}_R)^2 (regression) or impurity (classification).
  • Recursively split until depth, leaf-size, or min-improvement criterion is reached.

The output is a STEP FUNCTION over the feature space — axis-aligned, easy to interpret, prone to overfitting in single-tree form (high variance: small training-set changes can dramatically alter the tree).

Bagging: variance reduction via ensemble

Bagging (Breiman 1996): "bootstrap aggregating".

  • Draw B bootstrap samples from the data.
  • Fit a tree on each.
  • Average the predictions (for regression) or majority-vote (for classification).

Variance of the ensemble is approximately σ2/B\sigma^2/B for uncorrelated trees and decreases as B grows. In practice, trees are CORRELATED (same data o similar trees), so the practical gain is less than full 1/B1/B reduction. Out-of-bag samples (left out of each bootstrap) give a cheap internal estimate of generalisation error.

Random forest (Breiman 2001)

Random forest = bagging + RANDOM FEATURE SUBSETTING. At each split, consider only a random subset of features (typically p\sqrt{p} for classification or p/3p/3 for regression). This DECORRELATES the trees, allowing fuller variance reduction. Random forest is the modern default for tabular ML when interpretability is not paramount.

Hyperparameters: B (typically 500-5000), max depth, min leaf size, feature subset size m. Cross-validation tunes these.

Gradient boosting (Friedman 2001)

Bagging reduces variance; boosting reduces BIAS by SEQUENTIAL bias correction:

  • Fit an initial weak tree (small depth, high bias).
  • Compute residuals (or gradient of loss).
  • Fit a NEW tree to the residuals (or gradients).
  • Add a small fraction of the new tree to the model (learning rate η, typically 0.01-0.1).
  • Repeat for many iterations.

Each tree corrects the residuals of the previous. Result: an ENSEMBLE that fits very well. Modern implementations (XGBoost, LightGBM, CatBoost) include regularisation (L1/L2 on tree weights), column subsampling, early stopping via validation set. Currently dominate Kaggle competitions and tabular ML applications.

Random forest vs gradient boosting

  • Random forest: parallel trees, low BIAS individually but high variance, ensembling reduces variance. Less prone to overfit. Easier to tune. Good baseline.
  • Gradient boosting: sequential trees, each correcting residuals. Both can fit very well; sensitive to hyperparameters (learning rate, depth, regularisation). Generally HIGHER ACCURACY than random forest at cost of more careful tuning.

In practice: try both. XGBoost/LightGBM usually wins on heavily-tuned applications; random forest is more robust to default settings.

Interpretability tools

  • Feature importance: average decrease in impurity contributed by each feature across all trees. Identifies "predictive" features.
  • Permutation importance: shuffle one feature's values, observe drop in predictive accuracy. Tests how much the model depends on each feature.
  • Partial dependence plots: show the marginal effect of one feature on the prediction, averaging over the others.
  • SHAP values (Lundberg-Lee 2017): game-theoretic attribution of each prediction to each feature. Modern best practice for ML interpretability.

Critical: feature importance ≠ causal effect. These tools tell you what the MODEL uses, not what the WORLD has caused.

Trees Forests ExplorerInteractive figure — enable JavaScript to interact.

Try it

  • Defaults: depth 4, B = 30, noise = 0.3. The single tree (red) is visibly a step function; the random forest (orange) is smoother, tracking the true sine pattern more closely. RF MSE is typically 30-50% lower than the single tree.
  • Crank depth to 10 (very deep tree). The single tree OVERFITS — it follows individual data points. RF still smooths because averaging over bootstrap samples cancels overfitting. RF is more robust to depth choice.
  • Crank B from 30 to 200. The RF prediction stabilises and gets smoother. Diminishing returns — at large B, additional trees barely change predictions. Typical B for production: 100-500.
  • Set depth to 1 (stump). Single tree is just a horizontal line — high bias. RF averages many stumps — still under-fits. Boosting (sequential, not implemented here) handles this case better.
  • Crank noise to 1.0 (very noisy). RF's smoothing advantage over single tree is even more pronounced. The classical bias-variance trade-off: at high noise, variance reduction via ensemble pays off most.

A data scientist trains an XGBoost model and observes 99% accuracy on training but 70% on test. What's the likely issue and how should they adjust hyperparameters?

What you now know

Decision trees recursively partition the feature space. Random forests bag many randomly-decorrelated trees to reduce variance. Gradient boosting sequentially corrects residuals via tree ensemble for higher accuracy. SHAP / permutation importance interpret black-box predictions but don't prove causation. §9.4 next: calibration and probability outputs — making ML predictions trustworthy as probabilistic statements.

References

  • Breiman, L. (2001). "Random forests." Machine Learning 45(1), 5–32.
  • Friedman, J.H. (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics 29(5), 1189–1232.
  • Chen, T., Guestrin, C. (2016). "XGBoost: A scalable tree boosting system." KDD.
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y. (2017). "LightGBM: A highly efficient gradient boosting decision tree." NeurIPS.
  • Lundberg, S.M., Lee, S.I. (2017). "A unified approach to interpreting model predictions." NeurIPS. (SHAP values.)

This page is prerendered for SEO and accessibility. The interactive widgets above hydrate on JavaScript load.