Trees, random forests, and gradient boosting
Learning objectives
- Define a DECISION TREE as a recursive axis-aligned partition that minimises within-leaf variance (regression) or impurity (classification)
- Recognise the BAGGING principle: average many bootstrap-trained trees to reduce variance
- Define RANDOM FORESTS (Breiman 2001) as bagging + random feature subsetting
- Define GRADIENT BOOSTING (Friedman 2001) as sequential bias-correction via tree ensemble
- Compare random forests vs gradient boosting in terms of variance/bias trade-offs
Tree-based methods are the workhorse of tabular ML. They partition the feature space recursively, fit constant predictions in each leaf, and naturally handle non-linearities, interactions, and mixed data types (numeric + categorical). ENSEMBLE methods — random forests and gradient boosting — combine many trees to overcome single-tree limitations.
The decision tree
A regression tree recursively partitions the feature space:
- At each node, consider all possible (feature, threshold) splits.
- Choose the split that minimises (regression) or impurity (classification).
- Recursively split until depth, leaf-size, or min-improvement criterion is reached.
The output is a STEP FUNCTION over the feature space — axis-aligned, easy to interpret, prone to overfitting in single-tree form (high variance: small training-set changes can dramatically alter the tree).
Bagging: variance reduction via ensemble
Bagging (Breiman 1996): "bootstrap aggregating".
- Draw B bootstrap samples from the data.
- Fit a tree on each.
- Average the predictions (for regression) or majority-vote (for classification).
Variance of the ensemble is approximately for uncorrelated trees and decreases as B grows. In practice, trees are CORRELATED (same data o similar trees), so the practical gain is less than full reduction. Out-of-bag samples (left out of each bootstrap) give a cheap internal estimate of generalisation error.
Random forest (Breiman 2001)
Random forest = bagging + RANDOM FEATURE SUBSETTING. At each split, consider only a random subset of features (typically for classification or for regression). This DECORRELATES the trees, allowing fuller variance reduction. Random forest is the modern default for tabular ML when interpretability is not paramount.
Hyperparameters: B (typically 500-5000), max depth, min leaf size, feature subset size m. Cross-validation tunes these.
Gradient boosting (Friedman 2001)
Bagging reduces variance; boosting reduces BIAS by SEQUENTIAL bias correction:
- Fit an initial weak tree (small depth, high bias).
- Compute residuals (or gradient of loss).
- Fit a NEW tree to the residuals (or gradients).
- Add a small fraction of the new tree to the model (learning rate η, typically 0.01-0.1).
- Repeat for many iterations.
Each tree corrects the residuals of the previous. Result: an ENSEMBLE that fits very well. Modern implementations (XGBoost, LightGBM, CatBoost) include regularisation (L1/L2 on tree weights), column subsampling, early stopping via validation set. Currently dominate Kaggle competitions and tabular ML applications.
Random forest vs gradient boosting
- Random forest: parallel trees, low BIAS individually but high variance, ensembling reduces variance. Less prone to overfit. Easier to tune. Good baseline.
- Gradient boosting: sequential trees, each correcting residuals. Both can fit very well; sensitive to hyperparameters (learning rate, depth, regularisation). Generally HIGHER ACCURACY than random forest at cost of more careful tuning.
In practice: try both. XGBoost/LightGBM usually wins on heavily-tuned applications; random forest is more robust to default settings.
Interpretability tools
- Feature importance: average decrease in impurity contributed by each feature across all trees. Identifies "predictive" features.
- Permutation importance: shuffle one feature's values, observe drop in predictive accuracy. Tests how much the model depends on each feature.
- Partial dependence plots: show the marginal effect of one feature on the prediction, averaging over the others.
- SHAP values (Lundberg-Lee 2017): game-theoretic attribution of each prediction to each feature. Modern best practice for ML interpretability.
Critical: feature importance ≠ causal effect. These tools tell you what the MODEL uses, not what the WORLD has caused.
Try it
- Defaults: depth 4, B = 30, noise = 0.3. The single tree (red) is visibly a step function; the random forest (orange) is smoother, tracking the true sine pattern more closely. RF MSE is typically 30-50% lower than the single tree.
- Crank depth to 10 (very deep tree). The single tree OVERFITS — it follows individual data points. RF still smooths because averaging over bootstrap samples cancels overfitting. RF is more robust to depth choice.
- Crank B from 30 to 200. The RF prediction stabilises and gets smoother. Diminishing returns — at large B, additional trees barely change predictions. Typical B for production: 100-500.
- Set depth to 1 (stump). Single tree is just a horizontal line — high bias. RF averages many stumps — still under-fits. Boosting (sequential, not implemented here) handles this case better.
- Crank noise to 1.0 (very noisy). RF's smoothing advantage over single tree is even more pronounced. The classical bias-variance trade-off: at high noise, variance reduction via ensemble pays off most.
A data scientist trains an XGBoost model and observes 99% accuracy on training but 70% on test. What's the likely issue and how should they adjust hyperparameters?
What you now know
Decision trees recursively partition the feature space. Random forests bag many randomly-decorrelated trees to reduce variance. Gradient boosting sequentially corrects residuals via tree ensemble for higher accuracy. SHAP / permutation importance interpret black-box predictions but don't prove causation. §9.4 next: calibration and probability outputs — making ML predictions trustworthy as probabilistic statements.
References
- Breiman, L. (2001). "Random forests." Machine Learning 45(1), 5–32.
- Friedman, J.H. (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics 29(5), 1189–1232.
- Chen, T., Guestrin, C. (2016). "XGBoost: A scalable tree boosting system." KDD.
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y. (2017). "LightGBM: A highly efficient gradient boosting decision tree." NeurIPS.
- Lundberg, S.M., Lee, S.I. (2017). "A unified approach to interpreting model predictions." NeurIPS. (SHAP values.)